ValueError: maxlag should be < nobs
Describe the bug
I get this annoying bug when trying to fit my data with my generated ARIMA:
ValueError: maxlag should be < nobs
I am not entirely sure what it means, but upon googling I found this:
The problem is that you need more observations to estimate the model.
from here:
https://github.com/statsmodels/statsmodels/issues/4465#issuecomment-380459136
The person also mentions that for a specific model at least X observations are needed. Couldn’t this requirement be raised as an exception from your module? The current exception is rather obscure and comes from a low-level module. You could either validate the data when the initializer is called, or simply catch the error and rephrase it in language that relates more to the input I supply to your library.
I might just be a noob regarding the math, but the error isn’t that useful currently. 😕
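What I mean is something like the sketch below, written on the calling side just to illustrate the kind of message I would hope for; the helper name and wording are made up, not anything pmdarima provides:

import pmdarima as pm

def fit_with_friendly_error(model, y, **fit_args):
    # Made-up helper: catch the low-level statsmodels error and re-raise
    # it with a message phrased in terms of the data actually passed in.
    try:
        return model.fit(y, **fit_args)
    except ValueError as exc:
        if "maxlag should be < nobs" in str(exc):
            raise ValueError(
                "Only %d observations were given, which is too few for an "
                "ARIMA of order %s; supply more data or use a smaller order."
                % (len(y), model.order)
            ) from exc
        raise

# usage (illustrative): fit_with_friendly_error(pm.ARIMA(order=(4, 0, 4)), small_series)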
To Reproduce
I’ve created a snippet that can be run, which throws the exception:
https://gist.github.com/C0DK/6c21a2990b275c26779a5e157322e424
Stack trace
File "/usr/local/lib/python3.6/dist-packages/pmdarima/base.py", line 46, in fit_predict
self.fit(y, exogenous, **fit_args)
File "/usr/local/lib/python3.6/dist-packages/pmdarima/arima/arima.py", line 439, in fit
self._fit(y, exogenous, **fit_args)
File "/usr/local/lib/python3.6/dist-packages/pmdarima/arima/arima.py", line 354, in _fit
fit, self.arima_res_ = _fit_wrapper()
File "/usr/local/lib/python3.6/dist-packages/pmdarima/arima/arima.py", line 348, in _fit_wrapper
**fit_args)
File "/usr/local/lib/python3.6/dist-packages/statsmodels/tsa/statespace/mlemodel.py", line 445, in fit
start_params = self.start_params
File "/usr/local/lib/python3.6/dist-packages/statsmodels/tsa/statespace/sarimax.py", line 938, in start_params
self.polynomial_ma, self.k_trend, trend_data
File "/usr/local/lib/python3.6/dist-packages/statsmodels/tsa/statespace/sarimax.py", line 863, in _conditional_sum_squares
X = np.c_[X, lagmat(residuals, k_ma)[r-k:, cols]]
File "/usr/local/lib/python3.6/dist-packages/statsmodels/tsa/tsatools.py", line 408, in lagmat
raise ValueError("maxlag should be < nobs")
ValueError: maxlag should be < nobs
Versions
pmdarima 1.2.1
NumPy 1.17.0
SciPy 1.2.2
Scikit-Learn 0.21.3
Statsmodels 0.10.1
Expected behavior
An exception that guides me towards what values are valid.
Top GitHub Comments
tl;dr
The problem is that you’re calling fit_predict on test data when you should be calling predict on your model to get forecasted test values. fit_predict is for fitting and creating forecasts from your training samples (not test). When used as intended, the error is not raised.
Explanation
After looking at this, I don’t think this is a bug. I think this is exactly the behavior that’s expected… you have too few observations (nobs) for the number of lags (maxlag) your model has specified.
Now, the reason you’re hitting that error is this line:
You just fit a model on 500 samples, and auto_arima picked out that the appropriate order should be (4, 0, 4) (model.order), and now you’re throwing away that model fit to run fit_predict, with a model with very large lag terms, on just 10 test samples. If you just want to predict, call model.predict(n_periods=2).
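For clarity, the intended flow is roughly this (names and data are illustrative, and the horizon is arbitrary):

import numpy as np
import pmdarima as pm

y_train = np.random.RandomState(0).normal(size=500)  # stand-in for the real training series

# Select and fit the model on the training samples only
model = pm.auto_arima(y_train)

# Forecast the next n periods from the already-fitted model;
# no re-fitting, and no need to hand the model your test samples
forecast = model.predict(n_periods=5)

# fit_predict is just fit followed by predict, so it only makes sense
# on training data, e.g. model.fit_predict(y_train, n_periods=5)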
Going back to the linked issue… if you follow Chad’s equation, he estimates you need at least 14 samples to do what you’re trying.
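Here is a rough, back-of-the-envelope version of that bound, based on my reading of the linked comment rather than on what statsmodels literally checks internally:

def min_obs_estimate(p, d, q, P=0, D=0, Q=0, s=0):
    # Heuristic from the linked statsmodels comment: the start-parameter
    # computation needs strictly more observations than
    # d + D*s + max(3*q + 1, 3*Q*s + 1, p, P*s).
    return d + D * s + max(3 * q + 1, 3 * Q * s + 1, p, P * s) + 1

print(min_obs_estimate(p=4, d=0, q=4))   # 14, so 10 test samples is too few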
Now, going back to the error message… If we were to hardcode a check for every possible input constraint that a dependency module sets, the library would become unmaintainable. We can’t possibly curate a comprehensive list of every data permutation that will raise errors in lower-level libraries, so we trust their error handling for those situations.
Now, if you still disagree… PRs are always welcome
So you are saying I should generate a new auto-ARIMA model for each step in my time series data?
My point is that I have a dataset, in this case 53 points, and I want to make predictions throughout it. I want to feed the model 10 steps to predict the coming 5. I am not using the 10 steps as testing data; I am testing against the following five (which the model doesn’t have access to at any given time). This is how it is done in our GRU, LSTM and CNN implementations.
See the image: between the two vertical lines is the input data, and the red line represents the output data. The greyscaled data following the input “window” is what the prediction is tested against. The animation shows the two vertical lines moving, representing a different input set at each step, where i in data[i:i+10] increments by one each time.
However, right now I am recreating the auto-ARIMA at each timestep. I just want to make sure this is exactly what you mean by best practice, because in this case I create 43 different models from a really small dataset, and with a larger one that seems infeasible… But if that’s how ARIMA is supposed to work, then I’ll do that. It just surprises me that I cannot reuse the generated values for the next prediction.
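To be concrete, each step of my loop currently does roughly this (window and horizon hard-coded for illustration; the synthetic series stands in for my real 53-point dataset):

import numpy as np
import pmdarima as pm

data = np.random.RandomState(0).normal(size=53)  # stand-in for my real series
window, horizon = 10, 5

for i in range(len(data) - window):              # 43 windows for 53 points
    # A brand-new order search and fit on every 10-point window,
    # which is the part that feels wasteful
    model = pm.auto_arima(data[i:i + window])
    forecast = model.predict(n_periods=horizon)
    actual = data[i + window:i + window + horizon]
    # ...compare forecast against actual...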