Where is the data pipeline for training?
I cannot find the info for training and testing on the Kaggle dataset.
There is only one line for dataset creation in data_utils.py:

```python
getKaggleCriteoAdData(datafile="<path-to-train.txt>", o_filename="kaggleAdDisplayChallenge_processed.npz")
```
However, in dlrm_s_pytorch.py there is nowhere to input a dataset:

```python
parser.add_argument("--data-set", type=str, default="kaggle")  # or terabyte
parser.add_argument("--raw-data-file", type=str, default="")
parser.add_argument("--processed-data-file", type=str, default="")
```

These three lines appear to be unused, since nothing in the code references these parameters.
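For illustration, a hypothetical sketch of how these three flags would typically be wired to the data pipeline is shown below; the dispatch is an assumption, not the repository's actual code, and getKaggleCriteoAdData with its keyword arguments is taken from the data_utils.py snippet above.

```python
import argparse
import data_utils  # repo module containing getKaggleCriteoAdData

parser = argparse.ArgumentParser()
parser.add_argument("--data-set", type=str, default="kaggle")  # or terabyte
parser.add_argument("--raw-data-file", type=str, default="")
parser.add_argument("--processed-data-file", type=str, default="")
args = parser.parse_args()

# Hypothetical dispatch: preprocess the raw file into the .npz that
# later runs could load directly instead of re-reading train.txt.
if args.data_set == "kaggle" and args.raw_data_file:
    data_utils.getKaggleCriteoAdData(
        datafile=args.raw_data_file,
        o_filename=args.processed_data_file,
    )
```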
What is the purpose of publishing source code that cannot read any dataset?
Issue Analytics
- Created: 4 years ago
- Comments: 5 (2 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
FYI, I am NOT one of the authors, but let me add a few comments in the meantime.
First of all, I'd like to say that my colleagues and I have been completely fine running DLRM (albeit the version before the data loader refactoring) by following the README.
I agree that the script could set --raw-data-file by default, but its usage is also well described at https://github.com/facebookresearch/dlrm#benchmarking.
For the former point, could you take a look at the following to get to data_utils.py?
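(A sketch of that step, reusing the getKaggleCriteoAdData signature quoted in the issue; the input path and the inspection at the end are illustrative.)

```python
# Sketch: preprocess the raw Kaggle train.txt with the data_utils.py
# entry point quoted in the issue, then inspect the resulting .npz
# (assuming it is written to the working directory).
import numpy as np
import data_utils

data_utils.getKaggleCriteoAdData(
    datafile="./input/train.txt",  # placeholder path to the raw file
    o_filename="kaggleAdDisplayChallenge_processed.npz",
)

processed = np.load("kaggleAdDisplayChallenge_processed.npz")
print(processed.files)  # names of the arrays the preprocessing stored
```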
Hope this helps in the meantime. Yongkee
Please refer to the detailed description in part 2 of the benchmarking section in the README file.
We do not distribute any datasets with this model. We do support an interface to the Kaggle Display Advertising Challenge Dataset. In order to use it:
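Roughly, one would download train.txt from the Criteo Kaggle Display Advertising Challenge and point the script at it with the flags quoted above. The invocation below is an illustrative sketch with placeholder paths; the README's benchmarking section adds the model-architecture and training flags.

```sh
# Sketch: the first run preprocesses the raw file; the resulting .npz
# can be reused on later runs via --processed-data-file.
python dlrm_s_pytorch.py \
    --data-set=kaggle \
    --raw-data-file=./input/train.txt \
    --processed-data-file=./input/kaggleAdDisplayChallenge_processed.npz
```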