[ENH] Calendar Feature Extractor
Is your feature request related to a problem? Please describe.
I would like to use sktime with tree-based models for time series. In general, these models need some data preparation before they are useful for time series. One data preparation step involves creating calendar dummy features representing, for example, the current day of the year, day of the month, or week of the quarter. Another data preparation step involves generating features that represent Fourier terms of different order and periodicity.
Describe the solution you’d like
I have been thinking about the best way to implement this, and @aiwalter suggested using a SeriesToSeries transformer to generate new exogenous features based on the index in ForecastingPipeline. Would that approach work from an architecture point of view, or do we need a separate, not yet defined class “FeatureExtractor” and/or a specific IndexToSeries generator?
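To make the question more concrete, here is a minimal sketch of what such an index-based transformer could look like. It is plain pandas, not an sktime class: the name `CalendarDummies`, its `features` parameter, and the `fit`/`transform` signature are assumptions for illustration only.

```python
import pandas as pd


class CalendarDummies:
    """Hypothetical series-to-series style transformer (not an sktime class).

    It leaves the values of X untouched and appends columns that are
    derived purely from X's DatetimeIndex.
    """

    def __init__(self, features=("month", "day_of_week")):
        self.features = features

    def fit(self, X, y=None):
        # Nothing to learn: the features depend only on the index.
        return self

    def transform(self, X):
        if not isinstance(X.index, pd.DatetimeIndex):
            raise TypeError("X must be indexed by a pandas DatetimeIndex")
        out = X.copy()
        for name in self.features:
            # Each name must be an attribute of DatetimeIndex, e.g. "month".
            out[name] = getattr(X.index, name).to_numpy()
        return out


idx = pd.date_range("2021-01-01", periods=5, freq="D")
X = pd.DataFrame({"temperature": [1.0, 2.0, 3.0, 4.0, 5.0]}, index=idx)
Xt = CalendarDummies().fit(X).transform(X)
print(Xt)
```

Because nothing is learned in `fit`, the same object can be applied to any future index, which is what makes the approach attractive for exogenous data.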
We could either try to adapt solutions from GluonTS (dummies) and Prophet (Fourier features) or build our own approach. Attached is some example code (not yet adapted to sktime) to outline the general goal.
import pandas as pd
import numpy as np
# Supported seasonalities: each row links a child unit to its parent unit, the
# approximate number of child units per parent ("period"), and the name of the
# matching calendar dummy feature.
base_seasons = [
    ["parent", "child", "period", "dummy"],
    ["year", "year", None, "year"],
    ["year", "quarter", 365.25 / 4, "quarter"],
    ["year", "month", 12, "month"],
    ["year", "week", 365.25 / 7, "week_of_year"],
    ["year", "day", 365.25, "day_of_year"],
    ["quarter", "month", 12 / 4, "month_of_quarter"],
    ["quarter", "week", 365.25 / (4 * 7), "week_of_quarter"],
    ["quarter", "day", 365.25 / 4, "day_of_quarter"],
    ["month", "week", 365.25 / (12 * 7), "week_of_month"],
    ["month", "day", 30, "day"],
    ["week", "day", 7, "day_of_week"],
    ["day", "hour", 24, "hour"],
    ["hour", "minute", 60, "minute"],
    ["minute", "second", 60, "second"],
    ["second", "millisecond", 1000, "millisecond"],
]
base_seasons = pd.DataFrame(base_seasons[1:], columns=base_seasons[0])
base_seasons["fourier"] = base_seasons["child"] + "_in_" + base_seasons["parent"]
# Order the units from coarsest to finest so that "rank" grows with resolution.
base_seasons["child"] = (
    base_seasons["child"]
    .astype("category")
    .cat.reorder_categories(
        ["year", "quarter", "month", "week", "day",
         "hour", "minute", "second", "millisecond"]
    )
)
base_seasons["rank"] = base_seasons["child"].cat.codes
def get_supported_seasons(base_frequency, base_seasons=base_seasons):
    rank = base_seasons.loc[base_seasons["child"] == base_frequency, "rank"].max()
    matches = base_seasons.loc[base_seasons["rank"] <= rank]
    if matches.shape[0] == 0:
        raise ValueError("Seasonality or Frequency not supported")
    return matches
def calendar_fourier(dti_actual, fourier_period, fourier_order, base_frequency):
    """Build sin/cos Fourier features on a regular grid spanning dti_actual."""
    # NOTE: only the first letter of base_frequency is used as the pandas
    # frequency alias, so "month", "minute" and "millisecond" all map to "M".
    dti = pd.date_range(dti_actual.min(), dti_actual.max(), freq=base_frequency.upper()[0])
    if dti.min().to_numpy() != dti_actual.min().to_numpy():
        raise ValueError(
            "Actual time series does not correspond to frequencies provided by "
            "pandas DatetimeIndex. This can happen when e.g. monthly data does "
            "not correspond to month end."
        )
    funcs = [np.sin, np.cos]
    # One array of frequencies k / period (k = 1..order) per seasonal period.
    outlist = list()
    for index, item in enumerate(fourier_order):
        outlist.append((np.arange(item) + 1) * 1 / fourier_period[index])
    inlist = np.zeros(shape=(len(dti), np.concatenate(outlist).shape[0] * len(funcs)))
    colnames = list()
    k = 0
    for item in outlist:
        for index, p in enumerate(item):
            for func in funcs:
                inlist[:, k] = func(2 * np.pi * p * np.arange(len(dti)))
                # colnames.append("per"+ str(int(1/p)*int(index+1)) + "_or" + str(int(index+1))+func.__name__)
                colnames.append("per" + str(int(1 / p)) + func.__name__)
                k = k + 1
    inlist = pd.DataFrame(inlist)
    inlist.columns = colnames
    inlist.set_index(dti, inplace=True)
    # Keep only the time points that are present in the actual index.
    inlist = inlist[inlist.index.isin(dti_actual)]
    inlist = inlist.reset_index(drop=True)
    return inlist
def calendar_dummies(x, funcs):
    """Return one calendar feature, named funcs, for the DatetimeIndex x."""
    if funcs == "week_of_year":
        return pd.DataFrame({funcs: x.isocalendar()["week"].reset_index(drop=True)})
    elif funcs == "week_of_month":
        return pd.DataFrame({funcs: (x.day - 1) // 7 + 1})
    elif funcs == "month_of_quarter":
        # Months 1, 4, 7 and 10 open a quarter, so the position cycles 1, 2, 3.
        return pd.DataFrame({funcs: (x.month - 1) % 3 + 1})
    elif funcs == "week_of_quarter":
        year_week = x.isocalendar()["week"].astype(int).reset_index(drop=True)

        def week_of_quarter(week):
            # Approximation: ISO weeks 1-13, 14-26, 27-39 and 40-53 form the quarters.
            if week <= 13:
                return week
            elif week <= 26:
                return week - 13
            elif week <= 39:
                return week - 26
            else:
                return week - 39

        return pd.DataFrame({funcs: year_week.apply(week_of_quarter)})
    elif funcs == "millisecond":
        # DatetimeIndex has no millisecond attribute; derive it from microseconds.
        return pd.DataFrame({funcs: x.microsecond // 1000})
    elif funcs == "day_of_quarter":
        # First day of the calendar quarter that each time stamp falls into.
        quarter_start = x.to_period("Q").start_time
        values = (x - quarter_start).days + 1
        return pd.DataFrame({funcs: values}, dtype=np.float64)
    else:
        # Everything else is a plain DatetimeIndex attribute, e.g. "day_of_week".
        return pd.DataFrame({funcs: getattr(x, funcs)})
def calendar_other(x, funcs):
    """Return one trend or proportion feature, named funcs, for the DatetimeIndex x."""
    # Exponent applied to the linear trend for each "proportion_total*" feature.
    exponents = {
        "proportion_total": 1,
        "proportion_total_squared": 2,
        "proportion_total_cubic": 3,
        "proportion_total_squared_root": 1 / 2,
        "proportion_total_cubic_root": 1 / 3,
    }
    if funcs in exponents:
        # Position of each time stamp within the observed span, scaled to [0, 1].
        stamps = x.asi8
        proportion = (stamps - stamps.min()) / (stamps.max() - stamps.min())
        values = proportion ** exponents[funcs]
        return pd.DataFrame({funcs: values}, dtype=np.float64)
    elif funcs == "proportion_month":
        values = x.day / x.days_in_month
        return pd.DataFrame({funcs: values}, dtype=np.float64)
    elif funcs == "proportion_quarter":
        quarter_start = x.to_period("Q").start_time
        next_quarter_start = quarter_start + pd.tseries.offsets.QuarterBegin(startingMonth=1)
        quarter_length = (next_quarter_start - quarter_start).days
        doq = (x - quarter_start).days + 1
        values = doq / quarter_length
        return pd.DataFrame({funcs: values}, dtype=np.float64)
    else:
        raise ValueError(f"Unsupported feature: {funcs}")
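
For readers who only want the core idea behind the Fourier part, it boils down to sin/cos pairs of 2πkt/P for harmonics k = 1..K of a period P. Below is a small, self-contained sketch of that idea. The helper name `fourier_terms`, its parameters, and the column naming are hypothetical and chosen only for illustration; the period is assumed to be measured in time steps.

```python
import numpy as np
import pandas as pd


def fourier_terms(n_timepoints, period, order):
    """Return sin/cos pairs for harmonics 1..order of one seasonal period.

    Hypothetical helper for illustration; `period` is measured in time steps.
    """
    t = np.arange(n_timepoints)
    columns = {}
    for k in range(1, order + 1):
        angle = 2 * np.pi * k * t / period
        columns[f"sin_{period}_{k}"] = np.sin(angle)
        columns[f"cos_{period}_{k}"] = np.cos(angle)
    return pd.DataFrame(columns)


# Two harmonics of a weekly cycle on 14 daily observations.
features = fourier_terms(n_timepoints=14, period=7, order=2)
print(features.shape)
```

Each harmonic adds two columns, so a tree-based model gets a smooth encoding of the position within the cycle instead of one dummy per day.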
I also think it’s a good idea to just ignore this edge case for now. There is still a workaround for advanced users with the solution I mentioned:
Yes, exactly, why not?
If it’s a transformer applied to X (not y), in predict it would be applied to X before predict is called, and the transformed X (with calendar indicator) would be fed to predict. So it seems all is fine?
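
To illustrate that flow, here is a toy sketch in plain numpy/pandas. It does not use sktime's actual classes: `add_day_of_week`, `ToyForecaster`, and `ToyPipeline` are made-up names, and the least-squares "forecaster" is only a stand-in so that the example runs end to end.

```python
import numpy as np
import pandas as pd


def add_day_of_week(X):
    """Toy transformer: append a calendar column derived from X's index."""
    Xt = X.copy()
    Xt["day_of_week"] = X.index.dayofweek.to_numpy()
    return Xt


class ToyForecaster:
    """Stand-in forecaster that regresses y on exogenous data only."""

    def fit(self, y, X):
        A = np.column_stack([np.ones(len(X)), X.to_numpy(dtype=float)])
        self.coef_, *_ = np.linalg.lstsq(A, y.to_numpy(dtype=float), rcond=None)
        return self

    def predict(self, X):
        A = np.column_stack([np.ones(len(X)), X.to_numpy(dtype=float)])
        return pd.Series(A @ self.coef_, index=X.index)


class ToyPipeline:
    """Apply the transformer to X (never to y) in both fit and predict."""

    def __init__(self, transformer, forecaster):
        self.transformer = transformer
        self.forecaster = forecaster

    def fit(self, y, X):
        self.forecaster.fit(y, self.transformer(X))
        return self

    def predict(self, X):
        # X is transformed first, so the forecaster sees the calendar column.
        return self.forecaster.predict(self.transformer(X))


def make_X(start, periods):
    """Exogenous data with a single made-up 'promo' flag."""
    idx = pd.date_range(start, periods=periods, freq="D")
    promo = (np.arange(periods) % 5 == 0).astype(float)
    return pd.DataFrame({"promo": promo}, index=idx)


X_train = make_X("2021-01-04", 28)
y_train = pd.Series(
    3.0 + 2.0 * X_train.index.dayofweek.to_numpy() + 5.0 * X_train["promo"].to_numpy(),
    index=X_train.index,
)
X_future = make_X("2021-02-01", 7)

pipe = ToyPipeline(add_day_of_week, ToyForecaster()).fit(y_train, X_train)
y_pred = pipe.predict(X_future)
print(y_pred)
```

The user only ever passes the raw `X`; the calendar column is recreated from the future index inside `predict`, so nothing extra has to be supplied at prediction time.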