
How to make TabularDataset loading csv faster

See original GitHub issue

I'm trying torchtext with the data from https://www.kaggle.com/c/quora-question-pairs/data, which is about 200 MB.

train.csv: 404302 lines; test.csv: 3563490 lines

I use TabularDataset to load them, and it's too slow. For train.csv it takes 5 minutes, and test.csv doesn't finish in 40 minutes. By comparison, pandas loads train.csv in 3 seconds, and Keras's texts_to_sequences processes it in 20 seconds.

Here is my code; is anything wrong with it?

```python
from torchtext import data

print("Preparing CSV files...")

QUESTION = data.Field(
    sequential=True,
    fix_length=fix_length,  # defined earlier (not shown)
    tokenize=tokenizer,     # defined earlier (not shown)
    init_token='SOS',
    eos_token='EOS',
    lower=lower             # defined earlier (not shown)
)

LABEL = data.Field(
    sequential=False,
)

print("Reading train csv file...")
train = data.TabularDataset(
    path=mypath + '/train.csv', format='csv', skip_header=True,
    fields=[
        ('id', None),
        ('qid1', None),
        ('qid2', None),
        ('question1', QUESTION),
        ('question2', QUESTION),
        ('label', LABEL),
    ])

print(vars(train[0]))
```

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Reactions: 1
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

3 reactions
BartekRoszak commented, Nov 14, 2018

What tokenizer do you use? It might be better to pre-tokenize the text once and join the tokens with whitespace. Then you only need str.split as the tokenizer, which is fast.
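The pre-tokenization idea above can be sketched as follows. This is a minimal illustration, not code from the issue: `slow_tokenize` is a hypothetical stand-in for an expensive tokenizer (e.g. spaCy), and the column names match the Quora dataset.

```python
import pandas as pd

def slow_tokenize(text):
    # Hypothetical stand-in for an expensive tokenizer such as spaCy.
    return text.replace("?", " ?").replace(",", " ,").split()

# A tiny in-memory stand-in for train.csv.
df = pd.DataFrame({
    "question1": ["What is the step by step guide?"],
    "question2": ["How do I invest in shares?"],
})

# Tokenize once, join tokens with whitespace, and save the result.
# Afterwards the torchtext Field can use tokenize=str.split, which is
# far cheaper than re-running the heavy tokenizer on every example.
for col in ("question1", "question2"):
    df[col] = df[col].map(lambda t: " ".join(slow_tokenize(t)))

df.to_csv("train_pretokenized.csv", index=False)
print(df["question1"][0].split())
```

The one-time cost of pre-tokenizing is paid in pandas, and every subsequent epoch or reload only needs the cheap `str.split`.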

2 reactions
prashantbudania commented, Feb 22, 2019

You should disable spaCy's pipelines if you are only using it for tokenization. More info here: https://spacy.io/usage/processing-pipelines#disabling

(The parser pipeline is the slowest, so if you aren't using its output it's just wasted time.)
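A minimal sketch of a tokenizer-only spaCy setup, assuming spaCy is installed: `spacy.blank("en")` builds a pipeline with just the tokenizer and no tagger, parser, or NER. If you instead load a full model, the same effect comes from the `disable` argument described in the linked docs, e.g. `spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner"])` (model name assumed for illustration).

```python
import spacy

# Tokenizer-only pipeline: no tagger/parser/NER components to slow things down.
nlp = spacy.blank("en")

def tokenizer(text):
    # Return plain token strings, suitable for a torchtext Field's tokenize=.
    return [tok.text for tok in nlp(text)]

print(tokenizer("Hello, world!"))
```

Passing this `tokenizer` to the `Field` keeps spaCy's tokenization quality without paying for the parser on every row.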
