Monthly data and ordinal encoding create an "off by one" error
I uncovered this “bug” during some data exploration for development of an open-source library tracking the U.S. Consumer Price Index.
The source data is monthly, with the values linked with the first date of each month.
import pandas as pd
import altair as alt
# pd.np is no longer available in recent pandas releases, so the dtype is given as a string
df = pd.read_json("https://raw.githubusercontent.com/datadesk/cpi/master/notebooks/last_13.json", dtype={"date_label": "datetime64[ns]"})
df.head(13)[['date', 'pct_change_rounded']]
My aim was to format the dates in the x-axis labels in the same manner as the government’s sample chart.
I tried to do that with timeUnit and the format option to axis.
alt.Chart(df).mark_bar().encode(
    x=alt.X("date:O", timeUnit="yearmonth", axis=alt.Axis(format="%b %y")),
    y="pct_change_rounded:Q"
)
Here’s what I got:
Look closely and you can see that the latest value, June 1, is rendered as May.
What’s up with that?
Top GitHub Comments
Update: thanks to some input from @domoritz I understand this a bit better. It looks like if we want to ensure dates are parsed as local times, we need to either use unix timestamps or fully-qualified ISO-8601 dates. This is not a characteristic of Vega or Vega-Lite, but of JavaScript itself. For example:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Date/parse
So if we replace df[col_name].astype(str) with df[col_name].dt.strftime('%Y-%m-%dT%H:%M:%SZ') in sanitize_dataframe, then these kinds of off-by-one errors will be avoided.

OK, so it turns out I was completely wrong on the reason for this issue when I speculated above.
The issue is that the timestamps are being parsed as if they’re UTC, and then displayed in local time (i.e. compensating for timezone). Because the west coast is 8 hours behind London, this shifts the time to the previous month for the dates you’re using. If you were east of London when running this code, you wouldn’t see this issue 😄.
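The shift can be reproduced outside the browser. Below is a minimal Python sketch of the same arithmetic, not of what Vega itself does internally. It assumes a fixed UTC-8 offset for the west coast and a fixed UTC+1 offset for "east of London", and it assumes 2018 as the year for the June 1 value; none of these specifics come from the chart code above.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed offsets stand in for real time zones (no daylight saving).
west_coast = timezone(timedelta(hours=-8))
east_of_london = timezone(timedelta(hours=1))

# The monthly value is stamped at midnight on the first of the month,
# and that timestamp is interpreted as UTC.
parsed_as_utc = datetime(2018, 6, 1, tzinfo=timezone.utc)

# Displaying it in a zone behind UTC moves it back into the previous day,
# which for the first of the month is also the previous month.
shown_west = parsed_as_utc.astimezone(west_coast)
shown_east = parsed_as_utc.astimezone(east_of_london)

print(shown_west.isoformat())  # 2018-05-31T16:00:00-08:00
print(shown_east.isoformat())  # 2018-06-01T01:00:00+01:00
```

Any zone behind UTC produces the same result for a first-of-month midnight timestamp, which is why the error only shows up for some viewers.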
So the best solution is to make sure dates are both parsed and displayed as local time, so no time zone correction is required.
Looking at the Vega-Lite docs on UTC time, it looks like (and this seems entirely crazy to me) the way you make sure dates are parsed as local time is to not use ISO format. Altair serializes datetime data in ISO format by default, so timezone corrections will always be applied.
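To make the serialization question concrete, here is a small pandas sketch that prints the strings produced by the two expressions quoted in the update above. The three-row frame is a hypothetical stand-in for the CPI data, and the sketch only shows what pandas emits; it makes no claim about how a given browser will parse either form.

```python
import pandas as pd

# Hypothetical stand-in for the monthly data: one row per first-of-month date.
df = pd.DataFrame({"date": pd.to_datetime(["2018-04-01", "2018-05-01", "2018-06-01"])})

# Plain string conversion; the exact text may vary with the pandas version.
as_str = df["date"].astype(str)

# Explicit pattern quoted in the update above.
as_pattern = df["date"].dt.strftime("%Y-%m-%dT%H:%M:%SZ")

print(as_str.tolist())
print(as_pattern.tolist())
# ['2018-04-01T00:00:00Z', '2018-05-01T00:00:00Z', '2018-06-01T00:00:00Z']
```

Printing both side by side makes it easier to check which form ends up in the chart's JSON spec before deciding how the dates should be serialized.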
But if you change the serialization to be non-ISO compliant, you can make things work as expected:
This format-dependent parsing of dates in Vega is more than a bit surprising to me, and I hope that it can be addressed upstream. But if not, I’d propose we change the way we serialize dates in Altair so that they will be parsed the same way they are displayed, without any implicit time-zone conversion.