How to make TabularDataset CSV loading faster
I tried torchtext with the data from https://www.kaggle.com/c/quora-question-pairs/data, which is about 200 MB:
- train.csv: 404,302 lines
- test.csv: 3,563,490 lines

Loading them with TabularDataset is too slow. train.csv takes 5 minutes, and test.csv does not finish within 40 minutes. By comparison, pandas loads train.csv in only 3 seconds, and Keras's texts_to_sequences processes it in about 20 seconds.
Here is my code — is anything wrong? (`fix_length`, `tokenizer`, `lower`, and `mypath` are defined elsewhere.)

```python
from torchtext import data

print("Preparing CSV files...")
QUESTION = data.Field(
    sequential=True,
    fix_length=fix_length,
    tokenize=tokenizer,
    init_token='SOS',
    eos_token='EOS',
    lower=lower,
)
LABEL = data.Field(sequential=False)

print("Reading train csv file...")
train = data.TabularDataset(
    path=mypath + '/train.csv', format='csv', skip_header=True,
    fields=[
        ('id', None),       # columns mapped to None are skipped
        ('qid1', None),
        ('qid2', None),
        ('question1', QUESTION),
        ('question2', QUESTION),
        ('label', LABEL),
    ])
print(vars(train[0]))
```
Issue Analytics
- State:
- Created: 5 years ago
- Reactions: 1
- Comments: 5 (1 by maintainers)
Top GitHub Comments
What tokenizer do you use? It may help to pre-tokenize the text once and join the tokens with whitespace. Then you only need `str.split` as the tokenizer, which is fast.
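A minimal sketch of that idea, with a trivial regex tokenizer standing in for an expensive one such as spaCy (the function names here are illustrative, not from torchtext):

```python
import re

def slow_tokenize(text):
    # stand-in for an expensive tokenizer (e.g. spaCy) that you run only once
    return re.findall(r"\w+|[^\w\s]", text)

def pretokenize(text):
    # join tokens with single spaces so str.split can recover them for free
    return " ".join(slow_tokenize(text))

# do this once, offline, and write the result back to your CSV;
# the Field can then be built with tokenize=str.split
row = "What's the fastest way to load a CSV?"
cached = pretokenize(row)
tokens = cached.split()  # all torchtext has to do per example now
```

The expensive pass happens a single time when you preprocess the CSV; every subsequent load only pays for `str.split`.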
You should disable spaCy's pipeline components if you are only using it for tokenization. More info here: https://spacy.io/usage/processing-pipelines#disabling (the parser is the slowest component, and if you are not using that information it is just a waste of time).
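A sketch of a tokenizer built that way (assumes spaCy is installed; component names can vary between spaCy versions):

```python
import spacy

# For tokenization alone, a blank pipeline carries only the tokenizer,
# so the parser/tagger/ner are never even constructed:
nlp = spacy.blank("en")

# If you need a pretrained model anyway, disable the slow components
# instead ("parser" is the costly one for this use case):
# nlp = spacy.load("en_core_web_sm", disable=["parser", "tagger", "ner"])

def tokenizer(text):
    # returns plain strings, which is what a torchtext Field expects
    return [tok.text for tok in nlp(text)]
```

This `tokenizer` can then be passed to the `tokenize=` argument of the `Field` in the question's code.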