Where is the data pipeline for training?
I cannot find the info for training and testing on the Kaggle dataset.
There is only one line for dataset creation in data_utils.py:

```python
getKaggleCriteoAdData(datafile="<path-to-train.txt>", o_filename="kaggleAdDisplayChallenge_processed.npz")
```
However, in dlrm_s_pytorch.py there is nowhere to input a dataset:

```python
parser.add_argument("--data-set", type=str, default="kaggle")  # or terabyte
parser.add_argument("--raw-data-file", type=str, default="")
parser.add_argument("--processed-data-file", type=str, default="")
```

These three lines appear to be unused, since nothing in the code references these parameters.
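For illustration, a hypothetical sketch of how these three flags would typically be wired to the data pipeline is shown below; the dispatch is an assumption, not the repository's actual code, and getKaggleCriteoAdData with its keyword arguments is taken from the data_utils.py snippet above.

```python
import argparse
import data_utils  # repo module containing getKaggleCriteoAdData

parser = argparse.ArgumentParser()
parser.add_argument("--data-set", type=str, default="kaggle")  # or terabyte
parser.add_argument("--raw-data-file", type=str, default="")
parser.add_argument("--processed-data-file", type=str, default="")
args = parser.parse_args()

# Hypothetical dispatch: preprocess the raw file into the .npz that
# later runs could load directly instead of re-reading train.txt.
if args.data_set == "kaggle" and args.raw_data_file:
    data_utils.getKaggleCriteoAdData(
        datafile=args.raw_data_file,
        o_filename=args.processed_data_file,
    )
```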
What is the purpose of publishing source code that cannot read any dataset?
Issue Analytics
- Created: 4 years ago
- Comments: 5 (2 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
FYI, I am NOT one of the authors, but let me add a few comments in the meantime.
First of all, I'd like to say that my colleagues and I have been completely fine running DLRM (albeit the version before the data loader refactoring) by following the README.
I agree that the script could set --raw-data-file by default, but its usage is also well described at https://github.com/facebookresearch/dlrm#benchmarking.
For the former point, could you take a look at the following to get to data_utils.py?
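(A sketch of that step, reusing the getKaggleCriteoAdData signature quoted in the issue; the input path and the inspection at the end are illustrative.)

```python
# Sketch: preprocess the raw Kaggle train.txt with the data_utils.py
# entry point quoted in the issue, then inspect the resulting .npz
# (assuming it is written to the working directory).
import numpy as np
import data_utils

data_utils.getKaggleCriteoAdData(
    datafile="./input/train.txt",  # placeholder path to the raw file
    o_filename="kaggleAdDisplayChallenge_processed.npz",
)

processed = np.load("kaggleAdDisplayChallenge_processed.npz")
print(processed.files)  # names of the arrays the preprocessing stored
```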
Hope this helps in the meantime. Yongkee
Please refer to the detailed description in part 2 of the benchmarking section in the README file.
We do not distribute any datasets with this model. We do support an interface to the Kaggle Display Advertising Challenge Dataset. In order to use it:
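Roughly, one would download train.txt from the Criteo Kaggle Display Advertising Challenge and point the script at it with the flags quoted above. The invocation below is an illustrative sketch with placeholder paths; the README's benchmarking section adds the model-architecture and training flags.

```sh
# Sketch: the first run preprocesses the raw file; the resulting .npz
# can be reused on later runs via --processed-data-file.
python dlrm_s_pytorch.py \
    --data-set=kaggle \
    --raw-data-file=./input/train.txt \
    --processed-data-file=./input/kaggleAdDisplayChallenge_processed.npz
```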