
Keep original filenames in dask.dataframe.read_csv


For the data I am reading, the path (directory name) is an important trait, and it would be useful to have access to it, possibly as an additional column (e.g. path_as_column=True), or at the very least through the collection of delayed objects:

import dask.dataframe as dd
dd.read_csv('s3://bucket_name/*/*/*.csv', collection=True)

(screenshot of the resulting DataFrame omitted)

The delayed version of the command (collection=False, which returns a list of delayed objects rather than a DataFrame) comes closer

all_dfs = dd.read_csv('s3://bucket_name/*/*/*.csv', collection=False)
print('key:', all_dfs[0].key)
print('value:', all_dfs[0].compute())

but it returns an internal task name as the key and doesn’t seem to store the path (s3://bucket_name/actual_folder/subfolder/fancyfile.csv) anywhere

key: pandas_read_text-35c2999796309c2c92e6438c0ebcbba4
value:    Unnamed: 0   T   N   M   count
0           0  T1  N0  M0  0.4454
1           5  T2  N0  M0  0.4076
2           6  T2  N0  M1  0.0666
3           1  T1  N0  M1  0.0612
4          10  T3  N0  M0  0.0054
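
A workaround available at the time was to skip read_csv entirely and build the frame from delayed pandas reads, tagging each partition with its source path. A minimal sketch, assuming s3fs is installed for the s3:// URLs and the example path stands in for a real glob:

import pandas as pd
import dask.dataframe as dd
from dask import delayed

@delayed
def read_with_path(path):
    # Read one file and record where each row came from.
    df = pd.read_csv(path)
    df['path'] = path
    return df

# Illustrative paths; in practice these could come from s3fs.S3FileSystem().glob(...)
paths = ['s3://bucket_name/actual_folder/subfolder/fancyfile.csv']
ddf = dd.from_delayed([read_with_path(p) for p in paths])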

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 17 (13 by maintainers)

Top GitHub Comments

7 reactions
jsignell commented, Aug 28, 2018

I have been thinking more about the path_parser idea. Pandas already accepts a converters kwarg, so we could make a minor tweak: check that converters dict for the path_col and, if we find it, do the transformation before returning the df. The value is in only needing to do the operation once per path instead of lots of times. This is what it would look like:

(screenshot of the proposed code omitted)
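
Reading between the lines, the proposal would look roughly like the sketch below. Hypothetical code, not dask internals; read_one_file and path_col are illustrative names:

import pandas as pd

def read_one_file(path, path_col, converters=None, **kwargs):
    converters = dict(converters or {})
    # If the user supplied a converter for the path column, pull it out so
    # pandas never sees it, and apply it once to the path itself.
    path_converter = converters.pop(path_col, None)
    df = pd.read_csv(path, converters=converters, **kwargs)
    df[path_col] = path_converter(path) if path_converter else path
    return df
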
4 reactions
mrocklin commented, Aug 27, 2018

I’m inclined to skip the path_parser option suggested by @jlstevens. There is, I think, a clear path for users to handle this orthogonally on their own with the map or apply methods if they so choose. I don’t think that we need to make this any easier for them.

df = dd.read_csv(..., include_filenames=True)
df['filename'] = df['filename'].map(parse)

I think that this approach is simpler in the long run.
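
For reference, the option dask eventually shipped in read_csv is include_path_column: passing True adds a path column recording each row's source file (pass a string to choose a different column name), which can then be parsed per row along the lines suggested above. A brief sketch; the lambda is an illustrative parser:

import os
import dask.dataframe as dd

df = dd.read_csv('s3://bucket_name/*/*/*.csv', include_path_column=True)
# Derive e.g. the subfolder name from each row's source path.
df['subfolder'] = df['path'].map(lambda p: os.path.basename(os.path.dirname(p)))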
