
"t_cal" in plot_cumulative_transactions.. what does this do?


I can’t find any info on this parameter in the plot_cumulative_transactions function.

According to the documentation, t_cal is “A marker used to indicate where the vertical line for plotting should be.”

OK, I can put a vertical red line anywhere on the plot, but what does it actually do? No matter which value I set for t_cal, the plot itself doesn’t change.

What should I do with this parameter?


Issue Analytics

  • State: open
  • Created: 4 years ago
  • Comments: 11

Top GitHub Comments

VinceZJ commented, Nov 14, 2019

For example, the model predicted that John Smith died in May. The training cut-off is June. Does John Smith have any purchases after June? If he does, then the model is incorrect, and so on for however many thousands of customers are in the dataset.

It seems to me you are coming at this problem from a common supervised learning angle, where we can clearly observe all types of outcomes. However, by definition of the problem we are trying to solve with the models implemented here, we have some type of non-contractual churn situation, i.e. no observed churn event.

This means that while you will be able to identify the number of true positives (predicted to be alive, and was) and false negatives (predicted not to be alive, but was) with the procedure described above, there is no way for you to obtain any reliable measurement of false positives and true negatives.

Therefore, any classification error you define based on these will rest on some arbitrary cut-off point/duration (in your case, after June until …?) and some arbitrary or weighted p(alive) threshold, and will ultimately lead to misestimates, since by the problem definition itself it is not possible to truly classify all observations.

The difficulty of observing any actual churn event in these scenarios, also described in the docs, is the reason these types of probabilistic models were developed in the first place.
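The asymmetry described above can be made concrete with a small sketch (the customers and predictions below are made up for illustration): a purchase after the cut-off proves a customer was alive, but silence proves nothing, so only two confusion-matrix cells are ever observable.

```python
# Sketch: in a non-contractual setting, a purchase after the cut-off is hard
# evidence of being alive, but the absence of a purchase is not evidence of
# churn. Only true positives and false negatives can be labelled with certainty.

def observable_outcome(predicted_alive: bool, purchased_after_cutoff: bool) -> str:
    """Classify one customer's outcome, where that is possible at all."""
    if purchased_after_cutoff:
        # A later purchase proves the customer was alive at the cut-off.
        return "true positive" if predicted_alive else "false negative"
    # No purchase yet: the customer may be churned or merely quiet, so
    # "true negative" / "false positive" cannot be verified.
    return "unverifiable"

customers = [
    ("John Smith", False, True),   # predicted dead in May, bought after June
    ("Jane Doe",   True,  True),   # predicted alive, bought again
    ("Bob Ray",    True,  False),  # predicted alive, silent: cannot verify
]
for name, alive, bought in customers:
    print(name, "->", observable_outcome(alive, bought))
```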

psygo commented, Nov 12, 2019

Ok thanks. I’m not sure why we need to put that line there, as we are not using plot_cumulative_transactions on the data that was split into calibration/holdout. It’s plotted on the full un-split transaction data.

When you study your data, you will eventually use the split data and fit the model only on the calibration/training portion, because the holdout/testing data should be reserved for control/quality purposes. Later, you can check whether your model is doing well on both the training and test data by plotting the plot_cumulative_transactions and plot_incremental_transactions graphs. The t_cal marker helps you visually distinguish where those two (sub)datasets are located on the graph.
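This also explains the original observation that changing t_cal does not change the plot: the curve is computed from the transaction log, and t_cal only positions the vertical marker. A minimal sketch of that computation in plain pandas (toy data; the column names are assumptions, not the library's requirements):

```python
# Sketch of the data behind plot_cumulative_transactions: the cumulative curve
# comes entirely from the transaction log, while t_cal only places a vertical
# marker at the calibration cut-off. Changing t_cal never alters the curve.
import pandas as pd

transactions = pd.DataFrame({
    "id":   [1, 1, 2, 2, 3, 3],
    "date": pd.to_datetime(
        ["2019-01-02", "2019-01-20", "2019-01-05",
         "2019-02-10", "2019-02-15", "2019-03-01"]),
})

# Daily cumulative number of transactions, as the plot draws it.
daily = transactions.set_index("date").resample("D").size()
cumulative = daily.cumsum()

t_cal = 40  # day index of the calibration cut-off; only moves the red line
print(cumulative.iloc[-1])  # total number of transactions plotted
```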

However, in the next step (I’m following along the documentation) I prepare calibration vs holdout data and now I fit the bgf model to that.

You do not need to fit your model first on the whole data and then on the train/test data; you can go directly to the train/test data if you wish. Fitting to the whole dataset is only meant to serve as a guideline to the final results. In the end, you will need a training dataset and a control (test) dataset in order to statistically validate your results.
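The split itself can be sketched in plain pandas (in lifetimes, the helper calibration_and_holdout_data does this plus the RFM summarisation; the toy data and cut-off date below are made up):

```python
# Minimal calibration/holdout split: everything up to an assumed cut-off date
# is used for fitting, everything after it is reserved for validation.
import pandas as pd

transactions = pd.DataFrame({
    "id":   [1, 1, 1, 2, 2],
    "date": pd.to_datetime(
        ["2019-01-05", "2019-02-01", "2019-04-03",
         "2019-01-10", "2019-03-20"]),
})

calibration_end = pd.Timestamp("2019-03-01")  # assumed cut-off
cal  = transactions[transactions["date"] <= calibration_end]  # fit here
hold = transactions[transactions["date"] >  calibration_end]  # validate here

print(len(cal), len(hold))  # 3 transactions before the cut-off, 2 after
```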

It’s recommended to do some diagnostic plots at this stage such as plot_calibration_purchases_vs_holdout_purchases

That’s one of the validation plots, not the only one. It’s important to use everything that’s available to you. Many of the plots only show that your model does not suck, and not that it is good.

The documentation uses the bgf fitted to the calibration/holdout data for predicting on the original summary data. Should I re-fit bgf back to the original data before carrying out these P-Alive and CLV predictions?

Rigorously, you shouldn’t refit your model to the whole dataset, because everything it has “learned” to get its parameters and to validate its assumptions comes from the training set, not the complete data. Using the model you obtained from your calibration/holdout study on the whole data is therefore valid; refitting it to the whole data and then using it on that same whole set is not. The latter is basically like cheating on a school test: once you know the content of the test, it becomes much easier to answer the questions.
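For the P(alive) predictions mentioned above, it can help to see what the fitted BG/NBD model actually evaluates. The standard conditional P(alive) expression is sketched below; the parameter values in the example are illustrative stand-ins, not fitted values, and in practice you would call conditional_probability_alive on the fitted model rather than hand-roll this.

```python
# Hedged sketch of the BG/NBD conditional probability-alive formula:
# P(alive | x, t_x, T) = 1 / (1 + a/(b + x - 1) * ((alpha + T)/(alpha + t_x))**(r + x))
# where x = repeat purchases, t_x = time of last purchase, T = observation length,
# and (r, alpha, a, b) are the model parameters estimated during fitting.

def p_alive(x, t_x, T, r, alpha, a, b):
    """Probability the customer is still alive given their purchase history."""
    if x == 0:
        # With no repeat purchase, the BG/NBD customer cannot have churned yet.
        return 1.0
    return 1.0 / (1.0 + (a / (b + x - 1)) * ((alpha + T) / (alpha + t_x)) ** (r + x))

# Example with illustrative (not fitted) parameters:
print(round(p_alive(x=5, t_x=30.0, T=40.0, r=0.25, alpha=4.0, a=0.8, b=2.4), 3))
```

Note how the probability decays as the gap between t_x and T grows: a long silence after frequent purchasing is the model's only signal of churn, which ties back to the unobservability argument above.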
