What is validation?

We build models using different algorithms, and the goal of a model is usually to learn from the data and predict the target variable. Here a model is nothing but a mathematical equation that fits the data, or even a tree that is built and pruned on the data.

How do we say a model is good? A model is good if it performs well on unseen data. This raises even more questions: how should I choose my train set, what should my test set be, how should I train the model, and so on.

There are many perspectives to consider when building a model, but keeping all of them aside, the one I want to highlight here is: how do I choose validation data?

How to choose validation data?

We can choose either an out-of-sample (OOS) dataset or an out-of-time (OOT) dataset.

Out Of Sample

Let us say you have a use case where you want to select some people for a campaign, and you want to build a model for that. You choose a random sample for train and test, and you use the rest of the dataset, which is not part of train or test, for validation. This is called out-of-sample validation: the train, test and validation sets all come from the same dataset, but the validation records were never seen during training. This is otherwise called an Out of Validation (OOV) dataset.
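To make the idea concrete, here is a minimal sketch of an out-of-sample split in plain Python. The function name out_of_sample_split, the 60/20 fractions, the seed and the customer ids are all my own assumptions for illustration; in practice you would pick proportions that suit your data.

```python
import random


def out_of_sample_split(records, train_frac=0.6, test_frac=0.2, seed=42):
    """Randomly split one dataset into train, test and validation parts.

    The fractions and seed are illustrative assumptions, not recommendations.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable

    n_train = int(len(shuffled) * train_frac)
    n_test = int(len(shuffled) * test_frac)

    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    # Everything not used for train or test becomes the out-of-sample set.
    validation = shuffled[n_train + n_test:]
    return train, test, validation


customers = list(range(100))  # hypothetical customer ids from one campaign
train, test, validation = out_of_sample_split(customers)
print(len(train), len(test), len(validation))
```

All three parts come from the same campaign data, and no customer appears in more than one of them.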

Out Of Time

Let us consider the previous example again. If you have n campaigns, then instead of validating on the current campaign, which we trained and tested on, we can pick a campaign from a different time period and check whether the model performs consistently. This is called an Out Of Time validation set. The result may depend on seasonality: if we train on March data, the model may or may not perform well on campaigns run in other months, and many external factors may have an effect as well.
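Here is a small sketch of how an out-of-time split could look, again in plain Python. The records, their dates and the helper out_of_time_split are made up for illustration; I am assuming each record carries the date of the campaign it belongs to.

```python
from datetime import date

# Hypothetical campaign records: (customer_id, campaign_date, responded)
records = [
    (1, date(2023, 3, 5), 1),
    (2, date(2023, 3, 18), 0),
    (3, date(2023, 3, 27), 1),
    (4, date(2023, 6, 2), 0),
    (5, date(2023, 6, 20), 1),
    (6, date(2023, 11, 11), 0),
]


def out_of_time_split(rows, start, end):
    """Split rows by campaign date.

    Rows dated within [start, end] are used for train/test;
    all remaining rows form the out-of-time validation set.
    """
    in_time = [r for r in rows if start <= r[1] <= end]
    out_of_time = [r for r in rows if not (start <= r[1] <= end)]
    return in_time, out_of_time


# Train and test on the March campaign, validate on the other months.
march, other_months = out_of_time_split(
    records, date(2023, 3, 1), date(2023, 3, 31)
)
print(len(march), len(other_months))
```

Unlike the random split, the cut here is made on the calendar, so the validation set can reveal seasonal effects that a random sample from March would hide.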

I will try to cover the various Cross Validation techniques that are used in ML.

If you like my explanation, please encourage me! 🙂