Cannot tell usecols to ignore missing columns


Code Sample

import pandas as pd
# Where example.csv is:
# column1,column2
# 1, 2

pd.read_csv('example.csv', usecols=['column1', 'column2', 'column3'])

Problem description

When usecols is specified to reduce the amount of data loaded, read_csv raises a ValueError if any of the listed columns does not exist in the file. This is not always desired, especially when reading a large number of files whose columns may vary.

There should be an option to suppress this error, so that usecols can restrict the columns that are loaded without requiring every listed column to be present.

Current Output

ValueError: Usecols do not match columns in file, columns expected but not found: ['column3']

Expected Output

No error is raised when only some of the usecols columns exist in the file.
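One way to get this behavior without a new read_csv option is to read only the header first and keep the requested columns that are actually present. The following is a minimal sketch, not part of the original issue: the helper name read_csv_existing_cols is hypothetical, and the temporary example.csv is created here only so that the snippet is self-contained.

```python
import os
import tempfile

import pandas as pd


def read_csv_existing_cols(path, wanted):
    """Hypothetical helper: load only the columns in `wanted` that the file has."""
    # nrows=0 parses the header line but no data rows.
    header = pd.read_csv(path, nrows=0).columns
    # Keep the requested names that are present in the file.
    present = [name for name in wanted if name in header]
    return pd.read_csv(path, usecols=present)


# Build a small stand-in for example.csv so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.csv")
    with open(path, "w") as handle:
        handle.write("column1,column2\n1,2\n")
    df = read_csv_existing_cols(path, ["column1", "column2", "column3"])

print(df)
```

This reads the header line twice, which is usually cheap compared with loading the data, but it does require a source that can be opened more than once (a path rather than a one-shot stream).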

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: 4.0.2
pip: 18.1
setuptools: 40.6.3
Cython: 0.29.2
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: 1.8.2
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.2
openpyxl: 2.5.12
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.2
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

7 reactions
chris-b1 commented, Apr 4, 2019

See the example here http://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#column-and-index-locations-and-names

I do think a callable to usecols is the right way to handle this - read_csv already has a ton of params and this is relatively simple to customize exactly how you want.

In [1]: import pandas as pd
In [2]: from io import StringIO
In [3]: pd.read_csv(StringIO("column1,column2\n1,2"),
                    usecols=lambda c: c in {'column1', 'column2', 'column3'})
Out[3]:
   column1  column2
0        1        2
1 reaction
jreback commented, Apr 4, 2019

see the docs: http://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#csv-text-files

you can pass a callable to usecols, IIRC @gfyoung we have an example of this somewhere?
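For the many-files case described in the issue, the callable approach from the comments can be wrapped in a small helper and combined with pd.concat. This is only a sketch under assumptions: the helper name read_subset, the WANTED set, and the in-memory sample files are made up for illustration and stand in for real CSV paths.

```python
from io import StringIO

import pandas as pd

# Columns we would like to load when a file happens to contain them.
WANTED = {"column1", "column2", "column3"}


def read_subset(source, wanted=WANTED):
    """Hypothetical helper: load only the columns of `source` named in `wanted`."""
    # The callable is evaluated per column name, so names missing from the
    # file are never requested and no ValueError is raised for them.
    return pd.read_csv(source, usecols=lambda name: name in wanted)


# Two in-memory files with different column sets (stand-ins for real files).
files = [
    StringIO("column1,column2\n1,2"),
    StringIO("column1,column3,extra\n3,4,5"),
]

combined = pd.concat(
    [read_subset(f) for f in files], ignore_index=True, sort=False
)
print(combined)
```

Columns that a given file lacks come out as NaN in the combined frame, and columns outside WANTED (such as extra here) are never loaded.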
