Constant regressor error in simulated_historical_forecasts when an indicator variable is used as a regressor

The simulated_historical_forecasts function currently doesn’t account for the fact that, when the history is split at different cutoffs, the training slice for a cutoff may contain an indicator variable whose values are all zero or all one.

This throws an error in initialize_scales_fn because the check that each regressor has at least two unique values, given below, fails:

for (name in names(m$extra_regressors)) {
  n.vals <- length(unique(df[[name]]))
  if (n.vals < 2) {
    stop('Regressor ', name, ' is constant.')
  }
}
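
To make the failure mode concrete, here is a minimal base-R sketch on made-up data. The data frame, the `promo` column, and the `is_constant` helper are all hypothetical; only the uniqueness test itself mirrors the check above.

```r
# Toy history (made-up data): the indicator `promo` only switches on
# during the last three days.
history <- data.frame(
  ds = seq(as.Date("2017-01-01"), by = "day", length.out = 10),
  y = c(5, 6, 5, 7, 6, 5, 6, 9, 10, 9),
  promo = c(0, 0, 0, 0, 0, 0, 0, 1, 1, 1)
)

# Hypothetical helper: the same uniqueness test as the check above.
is_constant <- function(df, name) length(unique(df[[name]])) < 2

# An early cutoff leaves only zeros in the training slice.
cutoff <- as.Date("2017-01-05")
history.c <- history[history$ds <= cutoff, ]

is_constant(history, "promo")    # FALSE: both levels are present
is_constant(history.c, "promo")  # TRUE: this slice would trigger the error
```

The full history passes the check, but the slice used for the early cutoff does not, which is exactly the situation that makes the fit stop with "Regressor promo is constant."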

I handle this by making the following changes in the function:

regressor_names <- names(model$extra_regressors)

# check that the regressors we added are not entirely constant
if (!is.null(regressor_names)) {
  # number of unique values for each regressor in history.c
  num_unique_by_regressor <- sapply(regressor_names, function(x) length(unique(history.c[[x]])))

  # which regressors should we remove
  regressors_to_remove <- names(which(num_unique_by_regressor < 2))

  if (length(regressors_to_remove) > 0) {
    # remove the regressors from the model
    for (name in regressors_to_remove) {
      m$extra_regressors[[name]] <- NULL
    }

    # if no regressors remain, drop the leftover empty names attribute for
    # consistency (clearing it while regressors remain would lose their names)
    if (length(m$extra_regressors) == 0) {
      attr(m$extra_regressors, which = 'names') <- NULL
    }

    # remove the regressors from history.c
    history.c <- dplyr::select(history.c, -dplyr::one_of(regressors_to_remove))
  }
}

The entire function then becomes:

simulated_historical_forecasts <- function(model, horizon, units, k,
                                           period = NULL) {
  df <- model$history
  horizon <- as.difftime(horizon, units = units)
  if (is.null(period)) {
    period <- horizon / 2
  } else {
    period <- as.difftime(period, units = units)
  }
  # regressor names
  regressor_names <- names(model$extra_regressors)
  cutoffs <- generate_cutoffs(df, horizon, k, period)
  predicts <- data.frame()
  for (i in 1:length(cutoffs)) {
    cutoff <- cutoffs[i]
    # Copy the model
    m <- prophet_copy(model, cutoff)
    # Train model
    history.c <- dplyr::filter(df, ds <= cutoff)
    # check that regressor we added is not entirely constant
    
    if (!is.null(regressor_names)) {
      # number of unique values for regressors in history.c
      num_unique_by_regressor <- sapply(regressor_names, function(x) length(unique(history.c[[x]])))
      
      # which regressors should we remove
      regressors_to_remove <- names(which(num_unique_by_regressor < 2))
      
      if (length(regressors_to_remove) > 0) {
        
        # remove the regressors from model
        for (name in regressors_to_remove){
          m$extra_regressors[[name]] <- NULL
        }
        
        # if no regressors remain, drop the leftover empty names attribute for
        # consistency (clearing it while regressors remain would lose their names)
        if (length(m$extra_regressors) == 0) {
          attr(m$extra_regressors, which = 'names') <- NULL
        }

        # remove regressors from history.c
        history.c <- dplyr::select(history.c, -dplyr::one_of(regressors_to_remove))
      }
    }
    
    # fit model
    m <- fit.prophet(m, history.c)
    # Calculate yhat
    df.predict <- dplyr::filter(df, ds > cutoff, ds <= cutoff + horizon)
    columns <- c('ds')
    if (m$growth == 'logistic') {
      columns <- c(columns, 'cap')
      if (m$logistic.floor) {
        columns <- c(columns, 'floor')
      }
    }
    columns <- c(columns, regressor_names)
    future <- df[columns]
    yhat <- stats::predict(m, future)
    # Merge yhat, y, and cutoff.
    df.c <- dplyr::inner_join(df.predict, yhat, by = "ds")
    df.c <- dplyr::select(df.c, ds, y, yhat, yhat_lower, yhat_upper)
    df.c$cutoff <- cutoff
    predicts <- rbind(predicts, df.c)
  }
  return(predicts)
}

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 11 (6 by maintainers)

Top GitHub Comments

2 reactions
bletham commented, Nov 17, 2017

This is a challenging issue. There are certainly ways to get around this and the solution you post is one. But it isn’t clear to me what the right thing to do is for making the cross validation meaningful. Our goal is to estimate model generalization. If the external regressor is important, then removing it means we’re now fitting a different model, whose performance is probably not indicative of the generalization performance of the full model.

It seems to me the more reasonable thing to do would be to not try to do cross-validation using segments of the history that do not contain all of the data needed by the model (like both levels of an indicator variable). Since the cross-validation uses histories of increasing length, we should really just start the cross validation at a point in the history that has everything we need. This might mean fewer samples to estimate performance, but like I said above, otherwise we are getting more samples of something that isn’t really the generalization we want to estimate.
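
One way this suggestion could be sketched in base R is to find the first date by which every regressor has shown at least two distinct values, and keep only cutoffs on or after it. The helpers `first_varying_date` and `valid_cutoffs` are hypothetical names and not part of Prophet; the sketch also assumes `ds` and the cutoffs share the same date class and that the regressors contain no missing values.

```r
# Hypothetical helper: earliest date by which `name` has taken a value
# different from its first one (NA if it never varies).
first_varying_date <- function(df, name) {
  df <- df[order(df$ds), ]
  differs <- df[[name]] != df[[name]][1]
  if (!any(differs, na.rm = TRUE)) return(df$ds[NA_integer_])
  df$ds[which(differs)[1]]
}

# Hypothetical helper: keep only cutoffs whose training slice (ds <= cutoff)
# contains at least two distinct values for every regressor.
valid_cutoffs <- function(df, cutoffs, regressor_names) {
  if (length(regressor_names) == 0) return(cutoffs)
  firsts <- do.call(c, lapply(regressor_names,
                              function(n) first_varying_date(df, n)))
  if (anyNA(firsts)) return(cutoffs[0])  # some regressor never varies
  cutoffs[cutoffs >= max(firsts)]
}

# Made-up data: `promo` first becomes 1 on 2017-01-08.
history <- data.frame(
  ds = seq(as.Date("2017-01-01"), by = "day", length.out = 10),
  promo = c(0, 0, 0, 0, 0, 0, 0, 1, 1, 1)
)
cutoffs <- as.Date(c("2017-01-04", "2017-01-06", "2017-01-08"))
kept <- valid_cutoffs(history, cutoffs, "promo")
print(kept)  # only the 2017-01-08 cutoff survives
```

This trades fewer evaluation folds for folds that all fit the full model, which is the trade-off described above.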

1 reaction
bletham commented, Jun 1, 2018

@deniznoah That seems like a reasonable use case. We’ll then need to have a way to drop constant extra regressors in fitting.

As for whether or not they are normalized: non-binary extra regressors are standardized (subtract the mean, divide by the standard deviation) so that they have mean 0 and standard deviation 1. Binary extra regressors are left as-is. This behavior can be overridden when adding them, see help(Prophet.add_regressor).
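
The default rule described here can be illustrated with a small base-R sketch. The function `standardize_auto` is a hypothetical name, not Prophet's implementation, and it assumes "binary" means a regressor whose only values are 0 and 1; the override option is not modeled.

```r
# Hypothetical helper illustrating the default rule: leave 0/1 indicators
# untouched, standardize everything else to mean 0 and standard deviation 1.
standardize_auto <- function(x) {
  vals <- sort(unique(x))
  if (length(vals) == 2 && all(vals == c(0, 1))) {
    return(x)  # binary indicator: left as-is
  }
  (x - mean(x)) / stats::sd(x)
}

binary <- standardize_auto(c(0, 1, 1, 0))   # returned unchanged
scaled <- standardize_auto(c(10, 20, 30))   # mean 20, sd 10
print(binary)
print(scaled)
```

Note that a constant regressor has a standard deviation of 0, so standardizing it would divide by zero; that is one reason a constant column has to be rejected or dropped before scaling.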
