
Applying aggregations to a dataframe fails when reading data from partitioned parquet

See original GitHub issue

This code is a toy example of aggregating a dataframe and producing multiple aggregated columns. It works (thanks to @martindurant).
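For context, a minimal sketch of this kind of groupby-apply aggregation (column names and dtypes are illustrative, not the exact original code):

import pandas as pd
import dask.dataframe as dd

# Illustrative in-memory frame; the real data had more columns.
pdf = pd.DataFrame({
    'customer': ['a.com', 'a.com', 'b.com'],
    'session_id': ['xxx', 'yyy', 'yyy'],
})
df = dd.from_pandas(pdf, npartitions=2)

# Produce several aggregated columns per group. `meta` declares the
# columns and dtypes of each group's result so dask skips inference.
ag = df.groupby('customer').apply(
    lambda d: pd.DataFrame({
        'page_views': [len(d)],
        'visitors': [d['session_id'].nunique()],
    }),
    meta={'page_views': 'i8', 'visitors': 'i8'},
)
print(ag.compute())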

This code is an attempt to do the same aggregation on a dataframe read from parquet files. It does not work, and I can't figure out why.
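A minimal sketch of that failing version, reconstructed from the traceback below (the path and column names are assumptions):

import pandas as pd
import dask.dataframe as dd

# Read the partitioned dataset, then run the same groupby-apply.
df = dd.read_parquet(
    './events',
    columns=['customer', 'url', 'ts', 'session_id', 'referrer'],
)
gb = df.groupby('customer')
ag = gb.apply(lambda d: pd.DataFrame({
    'page_views': [len(d)],
    'visitors': [d['session_id'].nunique()],
}))
print(ag.compute())

The exception looks like this: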

aggregate2.py:25: UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
  Before: .apply(func)
  After:  .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
  or:     .apply(func, meta=('x', 'f8'))            for series result
  ag = gb.apply(lambda d: pd.DataFrame({
Traceback (most recent call last):
  File "aggregate2.py", line 30, in <module>
    print(ag.compute())
  File "/Users/irina/.pyenv/versions/talks/src/dask/dask/base.py", line 99, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/Users/irina/.pyenv/versions/talks/src/dask/dask/base.py", line 206, in compute
    results = get(dsk, keys, **kwargs)
  File "/Users/irina/.pyenv/versions/talks/src/dask/dask/threaded.py", line 75, in get
    pack_exception=pack_exception, **kwargs)
  File "/Users/irina/.pyenv/versions/talks/src/dask/dask/local.py", line 521, in get_async
    raise_exception(exc, tb)
  File "/Users/irina/.pyenv/versions/talks/src/dask/dask/local.py", line 290, in execute_task
    result = _execute_task(task, data)
  File "/Users/irina/.pyenv/versions/talks/src/dask/dask/local.py", line 271, in _execute_task
    return func(*args2)
  File "/Users/irina/.pyenv/versions/talks/src/dask/dask/dataframe/core.py", line 3194, in apply_and_enforce
    return _rename(c, df)
  File "/Users/irina/.pyenv/versions/talks/src/dask/dask/dataframe/core.py", line 3231, in _rename
    df.columns = columns
  File "/Users/irina/.pyenv/versions/talks/lib/python2.7/site-packages/pandas/core/generic.py", line 3094, in __setattr__
    return object.__setattr__(self, name, value)
  File "pandas/_libs/src/properties.pyx", line 65, in pandas._libs.lib.AxisProperty.__set__ (pandas/_libs/lib.c:45255)
  File "/Users/irina/.pyenv/versions/talks/lib/python2.7/site-packages/pandas/core/generic.py", line 473, in _set_axis
    self._data.set_axis(axis, labels)
  File "/Users/irina/.pyenv/versions/talks/lib/python2.7/site-packages/pandas/core/internals.py", line 2836, in set_axis
    (old_len, new_len))
ValueError: Length mismatch: Expected axis has 4 elements, new values have 7 elements

I also noticed a few odd things about the loaded parquet. The dataset was originally partitioned on year, month, day, hour and customer. After reading the files, even when specifying columns=['customer', 'url', 'ts', 'session_id', 'referrer'], hour still appears in the data when I look at df.head(). In fact, passing columns=['customer', 'url', 'ts', 'session_id', 'referrer'] to dd.read_parquet does not seem to work at all; the dataframe looks like this:

ipdb> df.head()
                       url              referrer session_id                  ts customer hour
0  http://a.com/articles/1    http://google.com/        xxx 2017-09-15 00:15:00    a.com    0
1  http://a.com/articles/2      http://bing.com/        yyy 2017-09-15 00:30:00    a.com    0
2  http://a.com/articles/2  http://facebook.com/        yyy 2017-09-15 00:45:00    a.com    0

with year, month and day not present in the dataframe, while customer and hour are. I would expect all partition keys to be either read or dropped together, but this seems to happen selectively.
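As a defensive workaround, one can subset explicitly after the read rather than relying on columns= alone. A sketch (the path is an assumption):

import dask.dataframe as dd

cols = ['customer', 'url', 'ts', 'session_id', 'referrer']
# columns= is requested here, but partition keys such as hour
# may still leak through...
df = dd.read_parquet('./events', columns=cols)
# ...so keep only the requested columns explicitly.
df = df[cols]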

I attached the parquet data in question.

events.zip

This issue came here from https://stackoverflow.com/questions/46375382/aggregate-a-dask-dataframe-and-produce-a-dataframe-of-aggregates/46380632#46380632.

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 28 (28 by maintainers)

Top GitHub Comments

1 reaction
mrocklin commented, Sep 25, 2017

Sorry, I meant subsetting after this line:

df = dd.from_delayed(dfs)

We would accept that those columns will show up, but after they show up we would get rid of them. This is somewhat inefficient because we allocate memory for columns that we then release, but would, I think, be relatively fool-proof.
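A minimal sketch of that approach, assuming dfs is a list of delayed pandas frames (the column list is illustrative):

import dask.dataframe as dd

# Build the dask dataframe; partition-key columns may materialize here.
df = dd.from_delayed(dfs)

# Immediately subset down to the columns we actually want, releasing
# the partition-key columns we never asked for.
wanted = ['customer', 'url', 'ts', 'session_id', 'referrer']
df = df[wanted]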

0 reactions
j-bennet commented, Oct 24, 2018

I made several changes to my code, and this version works:

https://github.com/j-bennet/talks/blob/master/2018/daskvsspark/daskvsspark/aggregate_dask.py

I suspect that this is related to the changed aggregation part:

https://github.com/j-bennet/talks/blob/09965f678c9625f46214042e9d22ccb26187c98d/2018/daskvsspark/daskvsspark/aggregate_dask.py#L75-L90
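For reference, one pattern in that direction is dask's dd.Aggregation with groupby().agg(), which sidesteps the meta inference that groupby().apply() relies on. A sketch (illustrative, not the exact code from the linked file):

import dask.dataframe as dd

# A custom reduction: `chunk` runs on each partition's groups,
# `agg` combines the per-partition results into the final value.
custom_count = dd.Aggregation(
    name='custom_count',
    chunk=lambda grouped: grouped.count(),
    agg=lambda counts: counts.sum(),
)
result = df.groupby('customer').agg({'url': custom_count})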

But I can’t be sure, as other things changed as well. Going to close the issue though.

Thank you for all the help!


Top Results From Across the Web

  • Reading DataFrame from partitioned parquet file: Right, so the first thing you do is a filter operation. Since Spark does lazy evaluation you should have no problems with the...
  • Spark SQL, DataFrames and Datasets Guide: Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema of the original data. When writing Parquet...
  • dask.dataframe.read_parquet (Dask documentation): This reads a directory of Parquet data into a Dask.dataframe, one file per partition. It selects the index among the sorted columns if...
  • Notes about saving data with Spark 3.0, by David Vrba: Parquet files support data skipping on different levels, namely on partition level and row-group level. So a dataset can be partitioned by some...
  • Spark Groupby Example with DataFrame: Similar to SQL "GROUP BY" clause, Spark groupBy() function is used to collect the identical data into groups on DataFrame/Dataset and perform aggregate...
