ValueError: maxlag should be < nobs
Describe the bug
I get this annoying bug when trying to fit my data with my generated ARIMA:
ValueError: maxlag should be < nobs
I am not entirely sure what it means, but upon googling I found this:
The problem is that you need more observations to estimate the model.
from here:
https://github.com/statsmodels/statsmodels/issues/4465#issuecomment-380459136
The person also mentions that for a specific model at least X observations are needed. Couldn’t this requirement be raised as an exception from your module? The current exception is rather obscure and comes from a low-level module. You could either validate the data when the initializer is called, or simply catch the error and rephrase it in language that relates more to the input I supply to your library.
I might just be a noob regarding the math, but the error isn’t that useful currently. 😕
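What I mean is something like the sketch below, written on the calling side just to illustrate the kind of message I would hope for; the helper name and wording are made up, not anything pmdarima provides:

import pmdarima as pm

def fit_with_friendly_error(model, y, **fit_args):
    # Made-up helper: catch the low-level statsmodels error and re-raise
    # it with a message phrased in terms of the data actually passed in.
    try:
        return model.fit(y, **fit_args)
    except ValueError as exc:
        if "maxlag should be < nobs" in str(exc):
            raise ValueError(
                "Only %d observations were given, which is too few for an "
                "ARIMA of order %s; supply more data or use a smaller order."
                % (len(y), model.order)
            ) from exc
        raise

# usage (illustrative): fit_with_friendly_error(pm.ARIMA(order=(4, 0, 4)), small_series)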
To Reproduce
I’ve created a snippet that can be run, which throws the exception:
https://gist.github.com/C0DK/6c21a2990b275c26779a5e157322e424
Stack trace
File "/usr/local/lib/python3.6/dist-packages/pmdarima/base.py", line 46, in fit_predict
self.fit(y, exogenous, **fit_args)
File "/usr/local/lib/python3.6/dist-packages/pmdarima/arima/arima.py", line 439, in fit
self._fit(y, exogenous, **fit_args)
File "/usr/local/lib/python3.6/dist-packages/pmdarima/arima/arima.py", line 354, in _fit
fit, self.arima_res_ = _fit_wrapper()
File "/usr/local/lib/python3.6/dist-packages/pmdarima/arima/arima.py", line 348, in _fit_wrapper
**fit_args)
File "/usr/local/lib/python3.6/dist-packages/statsmodels/tsa/statespace/mlemodel.py", line 445, in fit
start_params = self.start_params
File "/usr/local/lib/python3.6/dist-packages/statsmodels/tsa/statespace/sarimax.py", line 938, in start_params
self.polynomial_ma, self.k_trend, trend_data
File "/usr/local/lib/python3.6/dist-packages/statsmodels/tsa/statespace/sarimax.py", line 863, in _conditional_sum_squares
X = np.c_[X, lagmat(residuals, k_ma)[r-k:, cols]]
File "/usr/local/lib/python3.6/dist-packages/statsmodels/tsa/tsatools.py", line 408, in lagmat
raise ValueError("maxlag should be < nobs")
ValueError: maxlag should be < nobs
Versions
pmdarima 1.2.1
NumPy 1.17.0
SciPy 1.2.2
Scikit-Learn 0.21.3
Statsmodels 0.10.1
Expected behavior
An exception that guides me towards what values are valid.
Top GitHub Comments
tl;dr
The problem is that you’re calling fit_predict on test data when you should be calling predict on your model to get forecasted test values. fit_predict is for fitting and creating forecasts from your training samples (not test). When used as intended, the error is not raised.
Explanation
After looking at this, I don’t think this is a bug. I think this is exactly the behavior that’s expected… you have too few observations (nobs) for the number of lags (maxlag) your model has specified.
Now, the reason you’re hitting that error is this line:
You just fit a model on 500 samples, and auto_arima picked out that the appropriate order should be (4, 0, 4) (model.order), and now you’re throwing away that model fit to run fit_predict, with a model with very large lag terms, on just 10 test samples. If you just want to predict, call model.predict(n_periods=2).
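For clarity, the intended flow is roughly this (names and data are illustrative, and the horizon is arbitrary):

import numpy as np
import pmdarima as pm

y_train = np.random.RandomState(0).normal(size=500)  # stand-in for the real training series

# Select and fit the model on the training samples only
model = pm.auto_arima(y_train)

# Forecast the next n periods from the already-fitted model;
# no re-fitting, and no need to hand the model your test samples
forecast = model.predict(n_periods=5)

# fit_predict is just fit followed by predict, so it only makes sense
# on training data, e.g. model.fit_predict(y_train, n_periods=5)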
Going back to the linked issue… if you follow Chad’s equation, he estimates you need at least 14 samples to do what you’re trying.
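Here is a rough, back-of-the-envelope version of that bound, based on my reading of the linked comment rather than on what statsmodels literally checks internally:

def min_obs_estimate(p, d, q, P=0, D=0, Q=0, s=0):
    # Heuristic from the linked statsmodels comment: the start-parameter
    # computation needs strictly more observations than
    # d + D*s + max(3*q + 1, 3*Q*s + 1, p, P*s).
    return d + D * s + max(3 * q + 1, 3 * Q * s + 1, p, P * s) + 1

print(min_obs_estimate(p=4, d=0, q=4))   # 14, so 10 test samples is too few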
Now, going back to the error message… If we were to hardcode a check for every possible input constraint that a dependency module sets, the library would become unmaintainable. We can’t possibly curate a comprehensive list of every data permutation that will raise errors in lower-level libraries, so we trust their error handling for those situations.
Now, if you still disagree… PRs are always welcome
So you are saying I should generate a new auto-ARIMA model for each step in my time series data?
My point is that I have a dataset, in this case 53 points, and I want to make predictions throughout it. I want to feed the model 10 steps to predict the coming 5. I am not using the 10 steps as testing data; I am testing against the following five (which the model doesn’t have access to at any given time). This is how it is done in our GRU, LSTM and CNN implementations.
See the image: between the two vertical lines is the input data, and the red line represents the output data. The greyscaled data following the input “window” is what the prediction is tested against. The animation shows the two vertical lines moving, representing a different input set at each step, where i in data[i:i+10] increments by one each time.
However, right now I am recreating the auto-ARIMA at each timestep. I just want to make sure this is exactly what you mean by best practice, because in this case I create 43 different models from a really small dataset, and with a larger one that seems infeasible… But if that’s how ARIMA is supposed to work, then I’ll do that. It just surprises me that I cannot reuse the generated values for the next prediction.
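To be concrete, each step of my loop currently does roughly this (window and horizon hard-coded for illustration; the synthetic series stands in for my real 53-point dataset):

import numpy as np
import pmdarima as pm

data = np.random.RandomState(0).normal(size=53)  # stand-in for my real series
window, horizon = 10, 5

for i in range(len(data) - window):              # 43 windows for 53 points
    # A brand-new order search and fit on every 10-point window,
    # which is the part that feels wasteful
    model = pm.auto_arima(data[i:i + window])
    forecast = model.predict(n_periods=horizon)
    actual = data[i + window:i + window + horizon]
    # ...compare forecast against actual...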