
where is the data pipeline for training?

See original GitHub issue

I cannot find the information for training and testing on the Kaggle dataset.

There is only one line for dataset creation in data_utils.py:

getKaggleCriteoAdData(datafile="<path-to-train.txt>", o_filename="kaggleAdDisplayChallenge_processed.npz")

However, in dlrm_s_pytorch.py there is nowhere to pass in a dataset.

    parser.add_argument("--data-set", type=str, default="kaggle")  # or terabyte
    parser.add_argument("--raw-data-file", type=str, default="")
    parser.add_argument("--processed-data-file", type=str, default="")

These three arguments appear to be useless, since nothing else in the code references them.
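For context, here is a minimal sketch of how flags like these are typically wired to a data loader. This is a hypothetical illustration, not the actual dlrm_s_pytorch.py logic, which may handle them differently:

```python
import argparse

# Hypothetical sketch: how such flags would normally be consumed.
parser = argparse.ArgumentParser()
parser.add_argument("--data-set", type=str, default="kaggle")  # or terabyte
parser.add_argument("--raw-data-file", type=str, default="")
parser.add_argument("--processed-data-file", type=str, default="")
args = parser.parse_args([])  # no argv: fall back to defaults

# A common pattern: prefer the processed file, fall back to the raw file,
# and generate random/synthetic inputs when neither is given.
if args.processed_data_file:
    source = args.processed_data_file
elif args.raw_data_file:
    source = args.raw_data_file
else:
    source = None  # no real dataset supplied: use synthetic data

print(args.data_set, source)  # → kaggle None
```

Note that argparse converts the dashes to underscores, so `--raw-data-file` is read back as `args.raw_data_file`.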

What is the purpose of publishing source code that cannot read any dataset?

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

3 reactions
insoulity commented, Sep 20, 2019

FYI, I am NOT one of the authors, but let me add a few comments in the meantime.

First of all, I'd like to say that my colleagues and I have been running DLRM without problems (on the version before the data-loader refactoring) by following the README.

I agree that the script could set --raw-data-file by default, but it is also well described at https://github.com/facebookresearch/dlrm#benchmarking.

For the earlier question, could you take a look at the following to get to data_utils.py?

Hope this helps in the meantime. Yongkee

2 reactions
mnaumovfb commented, Sep 19, 2019

Please refer to the detailed description in part 2 of the benchmarking section in the README file.

We do not distribute any datasets with this model. We do support an interface with the Kaggle Display Advertising Challenge Dataset. In order to use it:

  1. Download the dataset yourself. You will obtain a raw data file, train.txt.
  2. Pass this file to the model using the --raw-data-file=... command-line argument.
  3. The model will generate a processed .npz file, which can be reused in subsequent runs.
  4. Refer to the ./bench/dlrm_s_criteo_kaggle.sh script for an example of how it is used.
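The steps above amount to a preprocess-once, reuse-thereafter flow. A hedged sketch of that flow, where `process_raw` merely stands in for the real `data_utils.getKaggleCriteoAdData` preprocessing (whose actual behavior and output differ):

```python
import os
import tempfile

def process_raw(raw_path, out_path):
    # Placeholder for the real preprocessing, which parses train.txt and
    # writes a compressed .npz file; here we only create a marker file.
    with open(out_path, "w") as f:
        f.write("processed from %s" % raw_path)
    return out_path

def get_dataset(raw_file, processed_file):
    # First run: preprocess the raw Criteo file (slow).
    # Subsequent runs: reuse the cached .npz (fast).
    if not os.path.exists(processed_file):
        process_raw(raw_file, processed_file)
    return processed_file

# Usage with temporary paths standing in for train.txt and the .npz cache:
tmpdir = tempfile.mkdtemp()
npz = os.path.join(tmpdir, "kaggleAdDisplayChallenge_processed.npz")
path1 = get_dataset("train.txt", npz)  # triggers preprocessing
path2 = get_dataset("train.txt", npz)  # cache hit, no reprocessing
```

This is why the first run with --raw-data-file is slow while later runs that pick up the generated .npz start quickly.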

