Manage Data Schemas¶
Create Schema¶
schema
settings takes any data so you can use that in a variety of ways to pass parameters to the reader. This makes it easier for the data recipient to process your data files.
settings = \
{
'schema': {
'pandas': {
'sep': ',',
'encoding': 'utf8'
},
'dask': {
'sep': ',',
'encoding': 'utf8'
},
'xls': {
'pandas': {
'sheet_name':'Sheet1'
}
}
}
}
pipe.update_settings(settings) # update settings
The flexibility is good but you might want to consider adhering to metadata specifications such as https://frictionlessdata.io/specs/.
Using schema¶
You can pass schema
information for downstream processing.
# show schema
print(pipe.schema)
# use schema
df = pd.read_csv(pipe.dirpath/'test.csv', **pipe.schema['pandas'])
df = dd.read_csv(pipe.dirpath/'test.csv', **pipe.schema['dask'])
df = pd.read_excel(pipe.dirpath/'others.xlsx', **pipe.schema['xls']['pandas'])