A. Batchwise Data Processing
The Favorita dataset is too large to fit into memory in one go, so it is processed in chunks of 300,000.
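The chunked pass described above could look roughly like the sketch below. It is a minimal illustration, not the project's actual loader: the helper names `iter_chunks` and `count_rows_in_chunks` are hypothetical, the input is assumed to be a CSV file, and the chunk size of 300000 is assumed to count rows.

```python
import csv
from itertools import islice
from typing import Iterable, Iterator, List

CHUNK_SIZE = 300_000  # chunk size stated in the text (assumed to count rows)


def iter_chunks(rows: Iterable[dict], chunk_size: int = CHUNK_SIZE) -> Iterator[List[dict]]:
    """Yield successive lists of at most `chunk_size` rows."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def count_rows_in_chunks(csv_path: str, chunk_size: int = CHUNK_SIZE) -> int:
    """Stream a CSV file so that only one chunk is held in memory at a time."""
    total = 0
    with open(csv_path, newline="") as handle:
        for chunk in iter_chunks(csv.DictReader(handle), chunk_size):
            # Placeholder: the per-chunk preprocessing would go here.
            total += len(chunk)
    return total
```

If pandas is available, `pandas.read_csv(path, chunksize=300_000)` returns an iterator of DataFrames and follows the same pattern.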
B. Data Preprocessing
In this dataset, each entity is represented by an embedding, and the data is prepared as follows:
- We treat each product number-store number pair as a separate entity.
- We include an additional 'open' flag to denote whether data is present on a given day.
- Data is resampled at regular daily intervals, imputing any missing days using the last available observation.
- We apply a log-transform to the sales data and adopt z-score normalization across all entities.
- We drop any record that has missing values.
- The training set is made up of samples taken between 2015-01-01 and 2015-12-01. The validation set consists of samples from the 30 days after the training set. The test set covers all entities over the 30-day horizon following the validation set.
- We consider log sales, transactions, and oil to be real-valued and the rest to be categorical.
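The resampling, 'open' flag, normalization, and date split listed above can be sketched for a single entity as follows. This is an illustration under stated assumptions, not the project's code: the function names are hypothetical, `log1p` is assumed in place of a plain logarithm so that zero sales stay finite, sales are assumed to be non-negative, and the training boundaries are assumed to be inclusive.

```python
import math
from datetime import date, timedelta
from statistics import mean, pstdev
from typing import Dict, List, Optional, Tuple


def resample_daily(sales_by_day: Dict[date, float]) -> List[Tuple[date, float, int]]:
    """Return (day, sales, open) for every calendar day in the observed range.

    Missing days reuse the last available observation and get open = 0.
    """
    days = sorted(sales_by_day)
    rows = []
    current, last_value = days[0], sales_by_day[days[0]]
    while current <= days[-1]:
        is_open = int(current in sales_by_day)
        if is_open:
            last_value = sales_by_day[current]
        rows.append((current, last_value, is_open))
        current += timedelta(days=1)
    return rows


def log_zscore(values: List[float]) -> List[float]:
    """Apply log(1 + x), then z-score normalize (assumes non-negative sales).

    Pass the pooled values of all entities to normalize across entities.
    """
    logged = [math.log1p(v) for v in values]
    mu, sigma = mean(logged), pstdev(logged)
    return [(v - mu) / sigma if sigma > 0 else 0.0 for v in logged]


def assign_split(
    day: date,
    train_start: date = date(2015, 1, 1),
    train_end: date = date(2015, 12, 1),
    horizon: int = 30,
) -> Optional[str]:
    """Label a day as 'train', 'valid', 'test', or None if outside all ranges."""
    if train_start <= day <= train_end:
        return "train"
    offset = (day - train_end).days
    if 1 <= offset <= horizon:
        return "valid"
    if horizon < offset <= 2 * horizon:
        return "test"
    return None
```

Computing the mean and standard deviation on pooled values keeps a single scale for all entities, which matches the "across all entities" wording; a per-entity variant would call `log_zscore` once per entity instead.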
Current Roadblock:
- Fitting the data into memory for modeling
Tasks Completed Till Now
Step 1: Batchwise Data Processing for Preprocessing (described in Section A above)
Step 2: Data Preprocessing (described in Section B above)