constant regressor error with simulated_historical_forecasts with an indicator variable used as regressor

See original GitHub issue

The simulated_historical_forecasts function currently doesn’t account for the fact that, when the history is split at different cutoffs, an indicator variable may end up with only zeros or only ones in the history before a given cutoff.

This throws an error in initialize_scales_fn, because the check that every regressor has at least two unique values, given below, fails:

for (name in names(m$extra_regressors)) {
    n.vals <- length(unique(df[[name]]))
    if (n.vals < 2) {
      stop('Regressor ', name, ' is constant.')
    }
}
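A minimal sketch of how this arises, using a made-up data frame. The column name promo, the dates, and the two cutoffs are assumptions for illustration and are not taken from the original issue:

```r
# Toy daily history; `promo` is a hypothetical indicator regressor
# that is 0 for the first 60 days and 1 afterwards.
df <- data.frame(
  ds = seq(as.Date("2017-01-01"), by = "day", length.out = 100),
  promo = c(rep(0, 60), rep(1, 40))
)

# Two hypothetical cutoffs: one before and one after the indicator switches.
cutoffs <- as.Date(c("2017-02-15", "2017-03-15"))

# Number of distinct regressor values in the history up to each cutoff.
n_unique <- sapply(seq_along(cutoffs), function(i) {
  length(unique(df$promo[df$ds <= cutoffs[i]]))
})
names(n_unique) <- format(cutoffs)
print(n_unique)
```

The history up to the first cutoff contains only zeros, so a check of the form n.vals < 2 would stop there even though the regressor is not constant over the full history.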

I handle this by making the following changes in the function:

regressor_names <- names(model$extra_regressors)

# check that the regressors we added are not entirely constant
if (!is.null(regressor_names)) { # start of if
  # number of unique values for each regressor in history.c
  num_unique_by_regressor <- sapply(regressor_names, function(x) length(unique(history.c[[x]])))

  # which regressors should we remove
  regressors_to_remove <- names(which(num_unique_by_regressor < 2))

  if (length(regressors_to_remove) > 0) {
    # remove the regressors from the model
    for (name in regressors_to_remove) {
      m$extra_regressors[[name]] <- NULL
    }

    # if no regressors remain, drop the leftover (empty) names attribute
    # for consistency; otherwise keep the names of the remaining regressors
    if (length(m$extra_regressors) == 0) {
      attr(m$extra_regressors, which = 'names') <- NULL
    }

    # remove the regressors from history.c
    history.c <- dplyr::select(history.c, -dplyr::one_of(regressors_to_remove))
  }
} # end of if

The entire function then becomes:

simulated_historical_forecasts <- function(model, horizon, units, k,
                                           period = NULL) {
  df <- model$history
  horizon <- as.difftime(horizon, units = units)
  if (is.null(period)) {
    period <- horizon / 2
  } else {
    period <- as.difftime(period, units = units)
  }
  # regressor names
  regressor_names <- names(model$extra_regressors)
  cutoffs <- generate_cutoffs(df, horizon, k, period)
  predicts <- data.frame()
  for (i in 1:length(cutoffs)) {
    cutoff <- cutoffs[i]
    # Copy the model
    m <- prophet_copy(model, cutoff)
    # Train model
    history.c <- dplyr::filter(df, ds <= cutoff)
    # check that the regressors we added are not entirely constant
    if (!is.null(regressor_names)) {
      # number of unique values for each regressor in history.c
      num_unique_by_regressor <- sapply(regressor_names, function(x) length(unique(history.c[[x]])))

      # which regressors should we remove
      regressors_to_remove <- names(which(num_unique_by_regressor < 2))

      if (length(regressors_to_remove) > 0) {
        # remove the regressors from the model
        for (name in regressors_to_remove) {
          m$extra_regressors[[name]] <- NULL
        }

        # if no regressors remain, drop the leftover (empty) names attribute
        # for consistency; otherwise keep the names of the remaining regressors
        if (length(m$extra_regressors) == 0) {
          attr(m$extra_regressors, which = 'names') <- NULL
        }

        # remove the regressors from history.c
        history.c <- dplyr::select(history.c, -dplyr::one_of(regressors_to_remove))
      }
    }

    
    # fit model
    m <- fit.prophet(m, history.c)
    # Calculate yhat
    df.predict <- dplyr::filter(df, ds > cutoff, ds <= cutoff + horizon)
    columns <- c('ds')
    if (m$growth == 'logistic') {
      columns <- c(columns, 'cap')
      if (m$logistic.floor) {
        columns <- c(columns, 'floor')
      }
    }
    columns <- c(columns, regressor_names)
    future <- df[columns]
    yhat <- stats::predict(m, future)
    # Merge yhat, y, and cutoff.
    df.c <- dplyr::inner_join(df.predict, yhat, by = "ds")
    df.c <- dplyr::select(df.c, ds, y, yhat, yhat_lower, yhat_upper)
    df.c$cutoff <- cutoff
    predicts <- rbind(predicts, df.c)
  }
  return(predicts)
}

Issue Analytics

Created 6 years ago
Comments: 11 (6 by maintainers)

Top GitHub Comments

2 reactions

bletham commented, Nov 17, 2017

This is a challenging issue. There are certainly ways to get around this, and the solution you posted is one. But it isn’t clear to me what the right thing to do is for making the cross-validation meaningful. Our goal is to estimate model generalization. If the external regressor is important, then removing it means we’re now fitting a different model, whose performance is probably not indicative of the generalization performance of the full model.

It seems to me the more reasonable thing to do would be to not try to do cross-validation using segments of the history that do not contain all of the data needed by the model (like both levels of an indicator variable). Since the cross-validation uses histories of increasing length, we should really just start the cross validation at a point in the history that has everything we need. This might mean fewer samples to estimate performance, but like I said above, otherwise we are getting more samples of something that isn’t really the generalization we want to estimate.
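One way to sketch that idea in base R: find the first date by which every regressor has shown at least two distinct values, and only keep cutoffs on or after it. The helper name earliest_usable_cutoff, the promo column, and the cutoff dates are hypothetical and not part of Prophet:

```r
# Hypothetical helper (not a Prophet function): earliest date at which
# every listed regressor has taken at least two distinct values.
earliest_usable_cutoff <- function(df, regressor_names) {
  df <- df[order(df$ds), ]
  first_dates <- lapply(regressor_names, function(name) {
    x <- df[[name]]
    # Rows whose value differs from the very first observed value.
    idx <- which(x != x[1])
    if (length(idx) == 0) {
      stop("Regressor ", name, " is constant over the whole history.")
    }
    df$ds[idx[1]]
  })
  # The latest of the per-regressor dates satisfies all regressors.
  do.call(max, first_dates)
}

# Toy history: the indicator switches from 0 to 1 after 60 days.
df <- data.frame(
  ds = seq(as.Date("2017-01-01"), by = "day", length.out = 100),
  promo = c(rep(0, 60), rep(1, 40))
)
cutoffs <- as.Date(c("2017-02-15", "2017-03-15", "2017-04-01"))

start <- earliest_usable_cutoff(df, "promo")
usable_cutoffs <- cutoffs[cutoffs >= start]
print(usable_cutoffs)
```

This keeps the model unchanged across folds at the cost of fewer folds, which is the trade-off described above.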

1 reaction

bletham commented, Jun 1, 2018

@deniznoah That seems like a reasonable use case. We’ll then need to have a way to drop constant extra regressors in fitting.

As for whether or not they are normalized: non-binary extra regressors are standardized (subtract the mean, divide by the standard deviation) so that they have mean 0 and standard deviation 1. Binary extra regressors are left as-is. This behavior can be overridden when adding them, see help(Prophet.add_regressor).
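The rule described here can be illustrated with a small base R sketch. The function standardize_regressor and the example vectors are made up for illustration. This is not Prophet's internal code, only the behavior as described in the comment:

```r
# Hypothetical illustration of the described rule: binary (0/1) columns
# are left untouched, everything else is centered and scaled.
standardize_regressor <- function(x) {
  vals <- sort(unique(x))
  is_binary <- length(vals) == 2 && all(vals == c(0, 1))
  if (is_binary) {
    return(x)
  }
  # Note: a constant column has standard deviation 0, so this
  # division would be undefined for it.
  (x - mean(x)) / stats::sd(x)
}

temperature <- c(10, 12, 15, 11, 17)  # made-up non-binary regressor
promo <- c(0, 0, 1, 1, 0)             # made-up binary regressor

z <- standardize_regressor(temperature)
print(round(c(mean = mean(z), sd = stats::sd(z)), 6))
print(standardize_regressor(promo))
```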