Testing and training datasets are, in fact, different. When I introduced model validation earlier, I talked about how model validation partitions data into these two subsets, so let me dive into that a bit more.
Model validation randomly divides the data into subsets so we can check that the model responds correctly to new input, which reduces the risk of overfitting. The two typical subsets of data are:
- Training set – This data is used to train and fit the model and determine its parameters. It is usually 60–70% of the data and needs to reflect the complexity and diversity of the full dataset.
- Testing set – This data is used to evaluate the performance of the trained model. It is usually 30–40% of the data, and it also needs to reflect the complexity and diversity of the full dataset.
Since both subsets need to reflect the complexity and diversity of the data, they must be divided randomly. This also decreases the risk of overfitting and lets a simpler model produce accurate results for the study.
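To make the random split concrete, here is a minimal Python sketch. The dataset, the 70/30 ratio, and the variable names are illustrative assumptions, not part of any particular library's workflow:

```python
import numpy as np

# Hypothetical dataset: 100 observations with 3 features each
rng = np.random.default_rng(seed=0)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)

# Shuffle the indices so both subsets sample the data randomly
indices = rng.permutation(len(X))
split = int(0.7 * len(X))          # 70% for training, 30% for testing
train_idx, test_idx = indices[:split], indices[split:]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(len(X_train), len(X_test))   # 70 30
```

Because the indices are shuffled before splitting, weekday and weekend observations (or any other pattern in the original ordering) end up in both subsets.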
If we train the model with a non-randomly selected dataset, it will fit that specific subset well, but the subset won't represent the rest of the data or any new data we want to apply the model to. For example, say we are analyzing the energy consumption of a town. If the training and testing data is not chosen randomly and contains only weekend readings, which are generally lower than weekday readings, the model will be inaccurate when applied to new data such as a new month, because it only represents weekends.
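The weekend bias can be simulated in a few lines of Python. The numbers below (28 days, usage levels around 100 and 60 units) are made up for illustration, and the "models" are simply means, but the effect is the same one a real model would show:

```python
import numpy as np

# Hypothetical daily energy readings for 28 days (4 weeks):
# weekdays around 100 units, weekends around 60 units
rng = np.random.default_rng(seed=1)
is_weekend = np.array([d % 7 in (5, 6) for d in range(28)])
usage = np.where(is_weekend,
                 rng.normal(60, 2, size=28),
                 rng.normal(100, 2, size=28))

# "Model" 1: trained only on weekend data (a non-random subset)
weekend_model = usage[is_weekend].mean()

# "Model" 2: trained on a random sample of all days
random_idx = rng.permutation(28)[:8]
random_model = usage[random_idx].mean()

# Evaluate both on new weekday data: the weekend-only model
# badly underestimates typical weekday consumption
new_weekdays = rng.normal(100, 2, size=5)
print(abs(new_weekdays.mean() - weekend_model))   # large error
print(abs(new_weekdays.mean() - random_model))    # smaller error
```

The weekend-only model predicts roughly the weekend average, so it misses weekday consumption badly, while the randomly trained model has seen both regimes.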
Let me illustrate this with an example of two models adjusting to a training dataset. I’m going to use a basic example found in the Machine Learning Onramp. Here, I have a simple model and a complex model: