How to make TabularDataset CSV loading faster
I tried torchtext with the data from https://www.kaggle.com/c/quora-question-pairs/data, which is about 200 MB:
- train.csv: 404,302 lines
- test.csv: 3,563,490 lines

Loading them with TabularDataset is too slow. train.csv takes 5 minutes, and test.csv does not finish within 40 minutes. By comparison, pandas loads train.csv in only 3 seconds, and Keras's texts_to_sequences processes it in about 20 seconds.
Here is my code — is anything wrong? (`fix_length`, `tokenizer`, `lower`, and `mypath` are defined elsewhere.)

```python
from torchtext import data

print("Preparing CSV files...")
QUESTION = data.Field(
    sequential=True,
    fix_length=fix_length,
    tokenize=tokenizer,
    init_token='SOS',
    eos_token='EOS',
    lower=lower,
)
LABEL = data.Field(sequential=False)

print("Reading train csv file...")
train = data.TabularDataset(
    path=mypath + '/train.csv', format='csv', skip_header=True,
    fields=[
        ('id', None),       # columns mapped to None are skipped
        ('qid1', None),
        ('qid2', None),
        ('question1', QUESTION),
        ('question2', QUESTION),
        ('label', LABEL),
    ])
print(vars(train[0]))
```
Issue Analytics
- State:
- Created: 5 years ago
- Reactions: 1
- Comments: 5 (1 by maintainers)
Top GitHub Comments
What tokenizer do you use? It may help to pre-tokenize the text once and join the tokens with whitespace. Then you only need `str.split` as the tokenizer, which is fast.
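A minimal sketch of that idea, with a trivial regex tokenizer standing in for an expensive one such as spaCy (the function names here are illustrative, not from torchtext):

```python
import re

def slow_tokenize(text):
    # stand-in for an expensive tokenizer (e.g. spaCy) that you run only once
    return re.findall(r"\w+|[^\w\s]", text)

def pretokenize(text):
    # join tokens with single spaces so str.split can recover them for free
    return " ".join(slow_tokenize(text))

# do this once, offline, and write the result back to your CSV;
# the Field can then be built with tokenize=str.split
row = "What's the fastest way to load a CSV?"
cached = pretokenize(row)
tokens = cached.split()  # all torchtext has to do per example now
```

The expensive pass happens a single time when you preprocess the CSV; every subsequent load only pays for `str.split`.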
You should disable spaCy's pipeline components if you are only using it for tokenization. More info here: https://spacy.io/usage/processing-pipelines#disabling (the parser is the slowest component, and if you are not using that information it is just a waste of time).
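A sketch of a tokenizer built that way (assumes spaCy is installed; component names can vary between spaCy versions):

```python
import spacy

# For tokenization alone, a blank pipeline carries only the tokenizer,
# so the parser/tagger/ner are never even constructed:
nlp = spacy.blank("en")

# If you need a pretrained model anyway, disable the slow components
# instead ("parser" is the costly one for this use case):
# nlp = spacy.load("en_core_web_sm", disable=["parser", "tagger", "ner"])

def tokenizer(text):
    # returns plain strings, which is what a torchtext Field expects
    return [tok.text for tok in nlp(text)]
```

This `tokenizer` can then be passed to the `tokenize=` argument of the `Field` in the question's code.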