DataSchema Class
Defines a schema for a dataset.
- Inheritance
builtins.object -> DataSchema
Constructor
DataSchema(schema, **options)
Examples
from nimbusml import DataSchema, FileDataStream
from nimbusml import Pipeline
from nimbusml.ensemble import LightGbmRegressor
from nimbusml.feature_extraction.categorical import OneHotVectorizer
import numpy as np
import pandas as pd
data = pd.DataFrame(dict(real = [0.1, 2.2],
                         text = ['word','class'],
                         y = [1,3]))
data.to_csv('data.csv', index = False, header = True)
schema = DataSchema.read_schema('data.csv', collapse = False,
                                numeric_dtype = np.float32,
                                sep = ',')
print(schema)
#col=real:R4:0 col=text:TX:1 col=y:R4:2 header=+ sep=,
exp = Pipeline([
    OneHotVectorizer(columns = ['text']),
    LightGbmRegressor(minimum_example_count_per_leaf = 1)
])
exp.fit(FileDataStream('data.csv', schema = schema), 'y')
Remarks
The DataSchema class automatically generates a description of the data
schema from various data sources. The data source may be a list, an array,
a dataframe, or a file. A schema is required for all nimbusml trainers and
transforms; when it is not provided explicitly, it must be inferred
automatically before any data processing can occur. For lists, arrays, and
dataframes, schema inference is usually straightforward. When the data
source is a file, the inferred schema may require further inspection to
ensure that it matches the data and that the types are aligned as needed
(e.g. R4 vs I4).
For more details on the schema format, refer to Schema, Types and Vector Type.
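To make the idea of inferring a schema from a file concrete, here is a minimal sketch that uses only the Python standard library. It is not nimbusml's implementation: the helper name `infer_schema`, its parameters, and the rule "a column is numeric if every sampled value parses as a float" are assumptions made for illustration. It produces a string in the same shape as the schema printed in the example above.

```python
import csv
import io


def infer_schema(csv_text, sep=",", nrows=100, numeric_code="R4"):
    """Hypothetical helper: build a schema string from the first `nrows` data rows."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=sep)
    header = next(reader)
    # Assume every column is numeric until a value fails to parse as a float.
    numeric = [True] * len(header)
    for row_index, row in enumerate(reader):
        if row_index >= nrows:
            break
        for i, value in enumerate(row):
            try:
                float(value)
            except ValueError:
                numeric[i] = False
    cols = [
        "col={}:{}:{}".format(name, numeric_code if is_num else "TX", i)
        for i, (name, is_num) in enumerate(zip(header, numeric))
    ]
    return " ".join(cols + ["header=+", "sep=" + sep])


csv_text = "real,text,y\n0.1,word,1\n2.2,class,3\n"
print(infer_schema(csv_text))
# col=real:R4:0 col=text:TX:1 col=y:R4:2 header=+ sep=,
```

Because only a sample of rows is inspected, a column that looks numeric in the first rows may contain text further down, which is one reason a file-based schema is worth checking by hand.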
Methods
clone |
extract_idv_schema_from_file |
format_options | Formats the options for the parser from the core library.
parse | Parses a schema defined as a string.
read_schema | Infers the schema of a data view.
read_schema_file | Infers the schema of a file.
rename | Renames a column.
to_string | Converts the schema into a string.
clone
clone()
extract_idv_schema_from_file
extract_idv_schema_from_file(path)
Parameters
- path
format_options
Formats the options for the parser from the core library.
format_options(add_sep=False)
Parameters
- add_sep
the core library usually requires the separator. It is included only if the user explicitly specified it, unless add_sep is True, in which case the default value is added.
Returns
formatted options as a string
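The add_sep behavior described above can be sketched as follows. This is an illustration only, not the library's code: the function name `format_options_sketch`, the use of a plain dictionary for the options, and the default separator `","` are all assumptions.

```python
def format_options_sketch(options, add_sep=False, default_sep=","):
    """Hypothetical illustration of add_sep; default_sep="," is an assumption."""
    opts = dict(options)  # copy so the caller's dictionary is left untouched
    # Add the separator only when asked to and when the user gave none.
    if add_sep and "sep" not in opts:
        opts["sep"] = default_sep
    return " ".join("{}={}".format(key, value) for key, value in opts.items())


print(format_options_sketch({"header": "+"}))
# header=+
print(format_options_sketch({"header": "+"}, add_sep=True))
# header=+ sep=,
print(format_options_sketch({"header": "+", "sep": "tab"}, add_sep=True))
# header=+ sep=tab
```

A separator supplied by the user is never overwritten; add_sep only fills in a missing one.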
parse
Parses a schema defined as a string.
parse(schema)
Parameters
- schema
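As an illustration of the string format that parse accepts, the sketch below splits a schema string such as the one printed in the Examples section into column entries and loader options. It is a simplified stand-in, not the DataSchema.parse implementation; the function name `parse_schema_string` and the returned structure are assumptions.

```python
def parse_schema_string(schema):
    """Hypothetical helper: split a schema string into columns and options."""
    columns, options = [], {}
    for token in schema.split():
        key, _, value = token.partition("=")
        if key == "col":
            # Each column entry has the form name:type:position.
            name, type_code, position = value.split(":")
            # The position is kept as a string; this sketch does not assume
            # that it is always a single integer.
            columns.append((name, type_code, position))
        else:
            options[key] = value
    return columns, options


columns, options = parse_schema_string(
    "col=real:R4:0 col=text:TX:1 col=y:R4:2 header=+ sep=,"
)
print(columns)
# [('real', 'R4', '0'), ('text', 'TX', '1'), ('y', 'R4', '2')]
print(options)
# {'header': '+', 'sep': ','}
```

The sketch splits on whitespace and colons, so it would not handle column names that contain those characters.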
read_schema
Infers the schema of a data view.
read_schema(*data, **options)
Parameters
- data
features, labels, weights, groups
- collapse
(False by default) collapses columns of the same type if it follows the
read_csv function, using the internal structure of a dataframe. If
collapse == 'all', the method collapses all columns not specified in the
names parameter.
- sep
string value of the file separator character (for example: ',')
- header
whether the data has a header row; defaults to True
- dtype
change dtype of specific columns; takes dictionary of column names mapped to desired dtype
- numeric_dtype
if not None, changes all numeric types into this type
- names
specify new names for columns; takes dictionary of column index mapped to desired name
- ind
first column index (in case DataFrames are concatenated)
- tool
'pandas' or 'nimbusml'
Returns
schema as a string
read_schema_file
Infers the schema of a file.
Additional options:
collapse: (False by default) collapses columns of the same type if it follows the read_csv function, using the internal structure of a dataframe. If collapse == 'all', the method collapses all columns not specified in the names parameter.
numeric_dtype: if not None, changes all numeric types into this type
read_schema_file(filepath_or_buffer, tool='pandas', nrows=100, **options)
Parameters
- filepath_or_buffer
stream or filename
- tool
'pandas' or 'nimbusml'
- nrows
use only the first nrows rows
- options
additional options for read_csv from pandas or internal reader
Returns
schema
rename
Renames a column.
rename(old_name, new_name)
Parameters
- old_name
old name
- new_name
new name
Returns
self
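Because rename returns self, calls can be chained. The sketch below shows that pattern on a hypothetical stand-in class; `SchemaSketch`, its tuple-based storage, and the KeyError raised for an unknown column are assumptions for illustration and do not describe how DataSchema works internally.

```python
class SchemaSketch:
    """Hypothetical stand-in holding (name, type_code, position) column entries."""

    def __init__(self, columns):
        self.columns = list(columns)

    def rename(self, old_name, new_name):
        for i, (name, type_code, position) in enumerate(self.columns):
            if name == old_name:
                self.columns[i] = (new_name, type_code, position)
                return self  # returning self allows chained calls
        # Assumption: an unknown column is treated as an error in this sketch.
        raise KeyError(old_name)

    def to_string(self):
        return " ".join("col={}:{}:{}".format(*column) for column in self.columns)


schema = SchemaSketch([("real", "R4", 0), ("text", "TX", 1), ("y", "R4", 2)])
print(schema.rename("real", "x").rename("y", "label").to_string())
# col=x:R4:0 col=text:TX:1 col=label:R4:2
```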
to_string
Converts the schema into a string.
to_string(add_sep=False)
Parameters
- add_sep
the separator is not added if the user did not specify it. Because the core library requires one, the method adds the default value when add_sep is True and no separator was specified.
Returns
formatted schema as a string