
Difference between R and Python implementation of auto.arima


Question: The documentation states:

pmdarima is designed to behave as similarly to R’s well-known auto.arima as possible

Could someone explain what the differences are?

  1. I tried to run a simple ARIMA (the order is not important at the moment) on my dataset (1,000 data points), something like this:
import pmdarima as pm
arima = pm.ARIMA(order=(0,1,1))
arima_fit = arima.fit(ts_data)

And it’s pretty fast.

  2. When I use the same dataset with the auto_arima function (e.g. pm.auto_arima(ts_data)), it takes noticeably longer (measured with timeit; a reproduction sketch with a public dataset follows this list):

1.07 s ± 53.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The R implementation of auto.arima is roughly 10 times faster. What’s the reason? Is there a way to improve that?

  3. When I take the same dataset and use the R and Python implementations of auto ARIMA, I get different results (depending on the data). The default parameters seem to be the same. What’s the reason for that?
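
Since the original ts_data isn’t shared, here is a rough sketch of how the two timings above could be reproduced on a public dataset (wineind ships with pmdarima); the absolute numbers will of course differ from those quoted:

import timeit
import pmdarima as pm

# Stand-in for the unavailable ts_data: the wineind dataset bundled with pmdarima
y = pm.datasets.load_wineind()

# Single fixed-order fit (fast)
fixed = timeit.timeit(lambda: pm.ARIMA(order=(0, 1, 1)).fit(y), number=1)

# Full stepwise search (slower: it fits many candidate models)
searched = timeit.timeit(lambda: pm.auto_arima(y, suppress_warnings=True), number=1)

print(f"fixed-order fit: {fixed:.2f}s, auto_arima search: {searched:.2f}s")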

Versions

  • Windows-10-10.0.17134-SP0
  • Python 3.6.8 |Anaconda, Inc.| (default, Feb 21 2019, 18:30:04) [MSC v.1916 64 bit (AMD64)]
  • pmdarima 1.2.1
  • NumPy 1.16.4
  • SciPy 1.2.1
  • Scikit-Learn 0.20.2
  • Statsmodels 0.9.0

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 7 (5 by maintainers)

Top GitHub Comments

3 reactions
tgsmith61591 commented, Jul 18, 2019

Doing a side-by-side on wineind:

Python:

As stated, this does take a while. Timings for each model are shown via trace=True:

# For exact replicability, you can run this in a docker image:
# $ docker run --rm -it tgsmith61591/pmdarima:1.2.1
import pmdarima as pm

y = pm.datasets.load_wineind()
model = pm.auto_arima(y, seasonal=True, m=12, suppress_warnings=True, trace=True)
# Fit ARIMA: order=(2, 1, 2) seasonal_order=(1, 1, 1, 12); AIC=3069.879, BIC=3094.629, Fit time=1.109 seconds
# Fit ARIMA: order=(0, 1, 0) seasonal_order=(0, 1, 0, 12); AIC=3133.376, BIC=3139.564, Fit time=0.046 seconds
# Fit ARIMA: order=(1, 1, 0) seasonal_order=(1, 1, 0, 12); AIC=3099.734, BIC=3112.109, Fit time=0.289 seconds
# Fit ARIMA: order=(0, 1, 1) seasonal_order=(0, 1, 1, 12); AIC=3066.930, BIC=3079.305, Fit time=0.262 seconds
# Fit ARIMA: order=(0, 1, 1) seasonal_order=(1, 1, 1, 12); AIC=3067.842, BIC=3083.311, Fit time=0.931 seconds
# Fit ARIMA: order=(0, 1, 1) seasonal_order=(0, 1, 0, 12); AIC=3090.512, BIC=3099.793, Fit time=0.051 seconds
# Fit ARIMA: order=(0, 1, 1) seasonal_order=(0, 1, 2, 12); AIC=3068.080, BIC=3083.549, Fit time=0.702 seconds
# Fit ARIMA: order=(0, 1, 1) seasonal_order=(1, 1, 2, 12); AIC=3067.847, BIC=3086.410, Fit time=5.332 seconds
# Fit ARIMA: order=(1, 1, 1) seasonal_order=(0, 1, 1, 12); AIC=3066.760, BIC=3082.229, Fit time=0.851 seconds
# Fit ARIMA: order=(1, 1, 0) seasonal_order=(0, 1, 1, 12); AIC=3094.571, BIC=3106.946, Fit time=0.225 seconds
# Fit ARIMA: order=(1, 1, 2) seasonal_order=(0, 1, 1, 12); AIC=3066.742, BIC=3085.305, Fit time=0.621 seconds
# Fit ARIMA: order=(2, 1, 3) seasonal_order=(0, 1, 1, 12); AIC=3070.634, BIC=3095.384, Fit time=1.760 seconds
# Fit ARIMA: order=(1, 1, 2) seasonal_order=(1, 1, 1, 12); AIC=3067.487, BIC=3089.143, Fit time=1.790 seconds
# Fit ARIMA: order=(1, 1, 2) seasonal_order=(0, 1, 0, 12); AIC=3090.957, BIC=3106.426, Fit time=0.178 seconds
# Fit ARIMA: order=(1, 1, 2) seasonal_order=(0, 1, 2, 12); AIC=3067.731, BIC=3089.387, Fit time=1.667 seconds
# Fit ARIMA: order=(1, 1, 2) seasonal_order=(1, 1, 2, 12); AIC=3068.627, BIC=3093.377, Fit time=4.410 seconds
# Fit ARIMA: order=(0, 1, 2) seasonal_order=(0, 1, 1, 12); AIC=3066.787, BIC=3082.256, Fit time=0.334 seconds
# Fit ARIMA: order=(2, 1, 2) seasonal_order=(0, 1, 1, 12); AIC=3068.673, BIC=3090.330, Fit time=0.818 seconds
# Fit ARIMA: order=(1, 1, 3) seasonal_order=(0, 1, 1, 12); AIC=3068.810, BIC=3090.467, Fit time=0.876 seconds
# Total fit time: 22.316 seconds

R:

Returns almost immediately:

> library(forecast)
> auto.arima(wineind, seasonal=TRUE)
Series: wineind
ARIMA(1,1,2)(0,1,1)[12]

Coefficients:
         ar1      ma1     ma2     sma1
      0.4299  -1.4673  0.5339  -0.6600
s.e.  0.2984   0.2658  0.2340   0.0799

sigma^2 estimated as 5399312:  log likelihood=-1497.05
AIC=3004.1   AICc=3004.48   BIC=3019.57

Explanation and root cause analysis

First of all, when you cross the language border, there is not always such a thing as a 1-to-1 rewrite, so that is the core challenge here. Under the hood, we are using statsmodels to fit our ARMA, ARIMA and SARIMAX models, and what we’re providing is the optimization code, tests of seasonality and stationarity, etc.

Just to make sure it isn’t a bottleneck on our side, I ran an experiment on the slowest of the models shown above:

from statsmodels import api as sm
import time


def timed(func):
    def wrapper(*args, **kwargs):
        start = time.time()
        res = func(*args, **kwargs)
        print(f"complete in {time.time() - start:.4f} seconds")
        return res
    return wrapper


@timed
def create_model():
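    # `y` is the wineind series loaded in the pmdarima example above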
    sm_sarimax = sm.tsa.statespace.SARIMAX(
        endog=y, exog=None, order=(0, 1, 1),
        seasonal_order=(1, 1, 2, 12), trend='c',
        enforce_stationarity=True)
    res_wrapper = sm_sarimax.fit(
        start_params=None, trend='c',
        method='lbfgs', transparams=True,
        solver='lbfgs', maxiter=50, disp=0)
    return res_wrapper

create_model()
# complete in 5.2814 seconds
# <statsmodels.tsa.statespace.sarimax.SARIMAXResultsWrapper object at 0x7f33cefd2a50>

When you dig deeper, R’s ARIMA code is almost 100% C, and statsmodels’ is almost 100% Python. While that will account for a lot of it, I’m not one to be swayed by that fact alone… my opinion is that there is surely something going on in the SARIMAX class that is causing this to drag, and I don’t have a perfect explanation for you right now.

RE: AIC, BIC, etc., my guess is that statsmodels computes those values in a slightly different fashion than R does. I wish I had a better answer for you… Now, it could totally be that the manner in which we’re using the SARIMAX interface (or one of the default options we’re passing) is causing a slow-down, and that’s something I’m not really certain about. But I do think it warrants a deep dive by someone on our side to see if it’s anything we have control over.
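
For what it’s worth, the fitted statsmodels result exposes the same quantities R prints (log-likelihood, AIC, BIC), so one way to gauge how far apart the two computations are is to read them directly off a fit with matching orders. A minimal sketch, reusing the create_model() helper above:

res = create_model()
# Compare these against the values R prints for the same (order, seasonal_order)
print(res.llf, res.aic, res.bic)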

Thank you for bringing this up; hopefully we can get a satisfactory explanation, if not solution, to you in a reasonable time.

0 reactions
tgsmith61591 commented, Nov 25, 2019

It seems to me there is no problem with the different AIC results, but the algorithm has to differ in something. Do you know more about that?

There was a time when R added more rules to its stepwise search algorithm, so the two diverged. The upcoming 1.5.0 release will reconcile these differences. However, one reason the same algorithm may not even search the same orders is the AIC of the intermediate results: the stepwise search refines itself based on the AIC, and since R and Python compute it in subtly different ways, the same progression of models may not yield the same refinement, if that makes sense.
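
One quick way to check whether the divergence comes from the stepwise path itself (rather than from the candidate models) is to disable the stepwise search and let pmdarima fit the full grid; if the exhaustive search lands on the same model R picks, the difference lies in the AIC-driven refinement. A sketch, again on wineind (the exhaustive search is much slower):

import pmdarima as pm

y = pm.datasets.load_wineind()

# Stepwise search (the default, mirroring R's auto.arima)
stepwise_model = pm.auto_arima(y, seasonal=True, m=12, stepwise=True,
                               suppress_warnings=True)

# Exhaustive grid search over the same bounds
grid_model = pm.auto_arima(y, seasonal=True, m=12, stepwise=False,
                           suppress_warnings=True)

print(stepwise_model.order, stepwise_model.seasonal_order)
print(grid_model.order, grid_model.seasonal_order)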

Another thing I’ve noticed recently is that R is faster because it approximates more liberally:

> auto.arima
function (y, d = NA, D = NA, max.p = 5, max.q = 5, max.P = 2,
    max.Q = 2, max.order = 5, max.d = 2, max.D = 1, start.p = 2,
    start.q = 2, start.P = 1, start.Q = 1, stationary = FALSE,
    seasonal = TRUE, ic = c("aicc", "aic", "bic"), stepwise = TRUE,
    trace = FALSE, approximation = (length(x) > 150 | frequency(x) >
        12), truncate = NULL, xreg = NULL, test = c("kpss", "adf",
        "pp"), seasonal.test = c("seas", "ocsb", "hegy", "ch"),
    allowdrift = TRUE, allowmean = TRUE, lambda = NULL, biasadj = FALSE,
    parallel = FALSE, num.cores = 2, x = y, ...)
{
    ...
}

Notice approximation = (length(x) > 150 | frequency(x) > 12). The Python version does not approximate unless you explicitly pass simple_differencing=True, and even then the results still differ slightly. That has to do with the number of iterations run in the optimizer: R’s BFGS optimization defaults to 100 iterations (run optim in R to see the function), while pmdarima’s ARIMA defaults to 50. If you set simple_differencing=True, you might also crank up the maxiter param to refine it further.
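
To make those two knobs concrete, here is a small sketch using statsmodels directly (the library pmdarima wraps), since how simple_differencing is forwarded through pmdarima itself depends on the version. simple_differencing is a SARIMAX constructor flag and maxiter caps the optimizer iterations; the order used below is the one R selected above and is purely illustrative:

from statsmodels import api as sm
import pmdarima as pm

y = pm.datasets.load_wineind()

# simple_differencing=True differences the data up front (an approximation,
# closer in spirit to R's approximation=TRUE) instead of modeling the
# differencing inside the state space representation
model = sm.tsa.statespace.SARIMAX(
    y, order=(1, 1, 2), seasonal_order=(0, 1, 1, 12),
    simple_differencing=True)

# Raise maxiter from the 50 pmdarima uses toward R's 100 BFGS iterations
res = model.fit(method='lbfgs', maxiter=100, disp=0)
print(res.aic)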
