KeyError when slicing by date

This looks similar to #2211, but I’m not sure. I’ve attached a zip with sample data and code that reproduces it. If you uncomment line 25 (the commented-out repartition call), it works for some reason.

import dask.bag
import pandas as pd
import re
from datetime import datetime
schema_dict = {
    'timestamp': 'datetime64[ns]',
}


time_regex = r'\[(?P<time>[^]]+)\]'
time_regex = re.compile(time_regex)


def get_log_dict(line):
    match = time_regex.match(line)
    # Use the stdlib datetime imported above; the pd.datetime alias is gone in newer pandas
    dt = datetime.strptime(match.groupdict()['time'], '%d/%b/%Y:%H:%M:%S +0000')
    return {'timestamp': dt}


files = ['2012-09-25.log', '2012-09-26.log', '2012-09-27.log']
b = dask.bag.read_text(files, blocksize=5000000).map(get_log_dict).to_dataframe(schema_dict)
b = b[~b.timestamp.isnull()]
b = b.set_index('timestamp')
b = b[sorted(b.columns)]
# b = b.repartition(freq='15m')
start = datetime(2012, 9, 26)
end = datetime(2012, 9, 27)
b = b.loc[start:end]
b.compute()

Archive.zip

Top GitHub Comments

TomAugspurger commented, May 2, 2017

Yeah, this looks very similar.

I think dask’s .loc will have to protect against the index not being monotonic / sorted, and fall back to boundary_slice if it isn’t. I can take a closer look tonight or tomorrow morning.
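To illustrate the kind of guard described above, here is a minimal sketch in plain pandas. It is not dask's implementation and does not reproduce its internal boundary_slice; the helper name safe_time_slice and the sample data are hypothetical, made up for this example. The idea is only to check whether the index is sorted before trusting a label-based slice, and to fall back to a comparison-based selection when it is not.

```python
from datetime import datetime

import pandas as pd


def safe_time_slice(df, start, end):
    """Hypothetical helper: inclusive label slice that tolerates an unsorted index."""
    if df.index.is_monotonic_increasing:
        # Sorted index: ordinary label-based slicing is safe.
        return df.loc[start:end]
    # Unsorted index: select rows by comparing each label to the bounds
    # instead of relying on slice positions.
    mask = (df.index >= start) & (df.index <= end)
    return df[mask]


# Made-up timestamps, deliberately out of order, as they might be
# inside a single partition that is only "mostly sorted".
index = pd.DatetimeIndex([
    "2012-09-26 12:00:00",
    "2012-09-25 23:00:00",
    "2012-09-26 01:00:00",
    "2012-09-27 08:00:00",
])
frame = pd.DataFrame({"value": [1, 2, 3, 4]}, index=index)

result = safe_time_slice(frame, datetime(2012, 9, 26), datetime(2012, 9, 27))
print(result)
```

The fallback keeps rows in their original order rather than sorting them, so the caller still sees an unsorted result; sorting the index first would be the alternative if ordered output matters.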

shughes-uk commented, May 3, 2017

If you’re going to raise an error the docs should probably be changed to reflect the ‘mostly sorted’ status and perhaps include the workaround for it. It doesn’t sound like you’re going to go that route though.