Monthly data and ordinal encoding create an "off by one" error


I uncovered this “bug” during some data exploration while developing an open-source library that tracks the U.S. Consumer Price Index.

The source data is monthly, with the values linked with the first date of each month.

import pandas as pd
import altair as alt
# pd.np was removed in pandas 2.0; pass the dtype as a string instead
df = pd.read_json("https://raw.githubusercontent.com/datadesk/cpi/master/notebooks/last_13.json", dtype={"date_label": "datetime64[ns]"})
df.head(13)[['date', 'pct_change_rounded']]

[Screenshot: output of the cell above]

My aim was to format the dates in the x-axis labels in the same manner as the government’s sample chart.

[Screenshot: the government’s sample chart]

I tried to do that with timeUnit and the format option to axis.

alt.Chart(df).mark_bar().encode(
    x=alt.X("date:O", timeUnit="yearmonth", axis=alt.Axis(format="%b %y")),
    y="pct_change_rounded:Q"
)

Here’s what I got:

[Figure: the resulting bar chart]

Look closely and you can see that the latest value, June 1, is rendered as May.

What’s up with that?

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Reactions: 2
  • Comments: 9 (9 by maintainers)

Top GitHub Comments

3 reactions
jakevdp commented, Jul 27, 2018

Update: thanks to some input from @domoritz I understand this a bit better. It looks like if we want to ensure dates are parsed as local times, we need to use either unix timestamps or fully-qualified ISO-8601 dates. This is not a characteristic of Vega or Vega-Lite, but of JavaScript itself. For example:

Given a date string of “March 7, 2014”, parse() assumes a local time zone, but given an ISO format such as “2014-03-07” it will assume a time zone of UTC (ES5 and ECMAScript 2015). Therefore Date objects produced using those strings may represent different moments in time depending on the version of ECMAScript supported unless the system is set with a local time zone of UTC. This means that two date strings that appear equivalent may result in two different values depending on the format of the string that is being converted.

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Date/parse

So if we replace df[col_name].astype(str) with df[col_name].dt.strftime('%Y-%m-%dT%H:%M:%SZ') in sanitize_dataframe, then these kinds of off-by-one errors will be avoided.
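To make the difference between these serializations concrete, here is a minimal sketch using only Python's standard library. The variable names are illustrative and not taken from Altair; the same format codes apply to pandas' .dt.strftime. Note that in ISO 8601 a trailing Z designates UTC, so whether to include it depends on how the consumer is meant to interpret the timestamp.

```python
from datetime import datetime

# First-of-month value, like the monthly CPI data in the question
first_of_june = datetime(2018, 6, 1)

# Date-only ISO string: the form that JavaScript's Date.parse treats as UTC
date_only = first_of_june.date().isoformat()

# Full date-time string using the format proposed in the comment above
full_with_z = first_of_june.strftime("%Y-%m-%dT%H:%M:%SZ")

# Full date-time string with no zone designator at all
full_no_zone = first_of_june.isoformat()

print(date_only)     # 2018-06-01
print(full_with_z)   # 2018-06-01T00:00:00Z
print(full_no_zone)  # 2018-06-01T00:00:00
```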

2 reactions
jakevdp commented, Jul 26, 2018

OK, so it turns out I was completely wrong on the reason for this issue when I speculated above.

The issue is that the timestamps are being parsed as if they’re UTC, and then displayed in local time (i.e. compensating for timezone). Because the West Coast is 8 hours behind London, this shifts the time to the previous month for the dates you’re using. If you were east of London when running this code, you wouldn’t see this issue 😄.
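The shift can be reproduced outside the browser. The following is a rough sketch in plain Python, not what Vega actually runs: it assumes a fixed UTC-7 offset as a stand-in for U.S. Pacific daylight time, and displayed_month is a hypothetical helper written for this illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-7 offset approximates U.S. Pacific daylight time.
PACIFIC_DAYLIGHT = timezone(timedelta(hours=-7))
# Assumption: a fixed UTC+2 offset represents a viewer east of London.
EAST_OF_LONDON = timezone(timedelta(hours=2))

def displayed_month(date_string, display_tz):
    """Parse a date-only string as UTC midnight, then label it in display_tz."""
    parsed_as_utc = datetime.strptime(date_string, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return parsed_as_utc.astimezone(display_tz).strftime("%b %y")

# June 1 at midnight UTC is still the evening of May 31 on the West Coast.
west = displayed_month("2018-06-01", PACIFIC_DAYLIGHT)
east = displayed_month("2018-06-01", EAST_OF_LONDON)
print(west)  # May 18
print(east)  # Jun 18
```

Only first-of-month values viewed from west of UTC cross the month boundary, which is why the problem shows up with monthly data.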

So the best solution is to make sure dates are both parsed and displayed as local time, so no time zone correction is required.

Looking at the Vega-Lite docs on UTC time, it looks like (and this seems entirely crazy to me) the way you make sure dates are parsed as local time is to not use ISO format. Altair serializes datetime data in ISO format by default, so timezone corrections will always be applied.

But if you change the serialization to be non-ISO compliant, you can make things work as expected:

df['date2'] = df['date'].dt.strftime('%b %d %Y')  # non-ISO serialization 

alt.Chart(df).mark_bar().encode(
    x=alt.X("date2:O", timeUnit="yearmonth", axis=alt.Axis(format="%b %y")),
    y="pct_change_rounded:Q"
)

[Figure: the resulting bar chart]

This format-dependent parsing of dates in Vega is more than a bit surprising to me, and I hope that it can be addressed upstream. But if not, I’d propose we change the way we serialize dates in Altair so that they will be parsed the same way they are displayed, without any implicit time-zone conversion.
