"t_cal" in plot_cumulative_transactions.. what does this do?
I can’t find any info on this parameter in the plot_cumulative_transactions function.
According to the documentation, t_cal is “A marker used to indicate where the vertical line for plotting should be.”
OK, I can put a vertical red line anywhere on the plot, but what does this do? No matter which value I set as t_cal, the plot doesn’t change.
What should I do with this parameter?
Thanks
Top GitHub Comments
It seems to me you are coming at this problem from a common supervised learning angle, where we can clearly observe all types of outcomes. However, by definition of the problem we are trying to solve with the models implemented here, we have some type of non-contractual churn situation, i.e. no observed churn event.
This means that, while you will be able to count the true positives (predicted to be alive, and was) and false negatives (predicted not to be alive, but was) with the procedure you describe above, there is no way to get reliable measurements of false positives and true negatives.
Therefore any classification error you define from these will rest on some arbitrary cut-off point/duration (in your case after June until …?) and some arbitrary or weighted p(alive) value, and will ultimately lead to misestimates, since by the problem definition itself it is not possible to truly classify all observations.
The difficulty of observing any actual churn event in these scenarios, which is also described in the docs, is the reason these types of probabilistic models were developed in the first place.
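To make the point above concrete, here is a minimal plain-Python sketch. It is not part of lifetimes; the function name, the customer IDs, the p(alive) values and the cut-off are all made up for illustration. It only shows that customers who bought during the holdout period can be scored, while the rest cannot be labelled as churned or alive.

```python
def tally_alive_predictions(p_alive, bought_in_holdout, cutoff=0.5):
    """Count outcomes that can and cannot be verified from holdout purchases.

    p_alive and bought_in_holdout are hypothetical dicts keyed by customer ID.
    The cutoff is arbitrary, which is exactly the problem described above.
    """
    counts = {"true_positive": 0, "false_negative": 0, "unverifiable": 0}
    for customer, p in p_alive.items():
        if bought_in_holdout[customer]:
            # A holdout purchase proves the customer was alive.
            key = "true_positive" if p >= cutoff else "false_negative"
        else:
            # No purchase: churned, or alive but inactive? We cannot tell.
            key = "unverifiable"
        counts[key] += 1
    return counts


# Made-up example data.
p_alive = {"a": 0.9, "b": 0.2, "c": 0.8, "d": 0.1}
bought_in_holdout = {"a": True, "b": True, "c": False, "d": False}

print(tally_alive_predictions(p_alive, bought_in_holdout))
print(tally_alive_predictions(p_alive, bought_in_holdout, cutoff=0.95))
```

Moving the cut-off changes the true positive and false negative counts, but the "unverifiable" bucket never shrinks, so any false positive or true negative figure would be a guess.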
When you study your data, you will eventually split it and then use only the model that has been fit on the calibration/training data, because the holdout/testing data should mostly be used for control/quality purposes. Later, you can check whether your model is doing well on both the training and test data by plotting the plot_cumulative_transactions and plot_incremental_transactions graphs. Having the t_cal marker will help you visually differentiate where those two (sub)datasets are located on the graph.
You do not need to fit your model first on the whole data and then on the train/test data. You can go directly to the train/test data if you wish. Fitting to the whole dataset is only supposed to serve as a guideline to the final results; in the end, you will need a training and a control (test) dataset in order to statistically validate your results.
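As a rough illustration of why changing t_cal does not change the curve, here is a small plain-Python sketch. It does not call lifetimes; the function name and the toy purchase data are hypothetical, and it assumes (as the documentation quote suggests) that t_cal only marks where the calibration period ends.

```python
from itertools import accumulate


def cumulative_with_marker(purchase_days, t, t_cal):
    """Return the cumulative curve plus its calibration and holdout slices.

    purchase_days: day offsets of individual purchases (made-up data).
    t: number of days on the x-axis.
    t_cal: day at which the calibration period ends (the vertical line).
    """
    daily = [purchase_days.count(day) for day in range(t)]
    cumulative = list(accumulate(daily))
    # t_cal is used only to slice the finished curve, never to compute it.
    return cumulative, cumulative[:t_cal], cumulative[t_cal:]


purchase_days = [0, 0, 1, 3, 3, 4, 6, 7, 7, 9]
curve_a, cal_a, hold_a = cumulative_with_marker(purchase_days, t=10, t_cal=6)
curve_b, cal_b, hold_b = cumulative_with_marker(purchase_days, t=10, t_cal=3)

print(curve_a == curve_b)  # the curve is identical for both markers
print(cal_a, hold_a)
print(cal_b, hold_b)
```

The marker only changes which part of the curve you read as calibration and which as holdout. It is meaningful when it matches the end of the period the model was actually fit on.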
That’s one of the validation plots, not the only one. It’s important to use everything that’s available to you. Many of the plots only show that your model does not suck, and not that it is good.
Rigorously, you shouldn’t refit your model to the whole dataset, because everything it has “learned” to get its parameters and to validate its assumptions comes from the training set, not the complete data. Applying the model you obtained from your calibration/holdout study to the whole data is thus valid; refitting it to the whole data and then using it on that same data is not. That is basically like cheating on a school test: once you know the content of the test, it becomes much easier to answer the questions.
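A toy sketch of that workflow, under loud assumptions: the "model" here is just an average daily purchase rate standing in for a real BG/NBD-style fit, and the function names and daily counts are invented. The point is only the order of operations: fit on the calibration slice, then score the untouched holdout slice without refitting.

```python
def fit_daily_rate(daily_counts):
    """Stand-in 'model': the mean number of purchases per day."""
    return sum(daily_counts) / len(daily_counts)


def predict_total(rate, n_days):
    """Expected number of purchases over n_days at the fitted rate."""
    return rate * n_days


daily = [2, 1, 0, 2, 1, 0, 1, 2, 0, 2]  # made-up daily purchase counts
t_cal = 6                               # calibration covers days 0..5

calibration, holdout = daily[:t_cal], daily[t_cal:]

# Fit once, on calibration data only.
rate = fit_daily_rate(calibration)

# Validate on data the model has never seen.
predicted = predict_total(rate, len(holdout))
actual = sum(holdout)
print(rate, predicted, actual)
```

Refitting on all of `daily` and then scoring against `daily` would hide the gap between `predicted` and `actual`, which is the "knowing the test in advance" problem.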