The index to the articles in this series is found here.
I’ve been talking about two datasets so far, training and validation. There’s a third dataset, holdout. This is data that the neural network never sees during training.
To review, here’s how the three datasets are used. The training dataset is run through the network to compute the loss, and that loss is used to adjust the weights, training the network.
The validation dataset is used, typically after each epoch, to compute a loss on data that the training never experienced. This is to help protect against overfitting. You can set up an early exit from the loop to stop training if the validation loss starts getting worse.
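The training-plus-validation loop with an early exit can be sketched roughly like this. The names (`train_step`, `validate`, `patience`) are illustrative, not the article's actual code:

```python
# Sketch: training loop that stops early when validation loss stops improving.
# `train_step` does one pass over the training set; `validate` returns the
# loss on the validation set. Both are assumed helpers, not real APIs.

def train(model, train_step, validate, max_epochs=100, patience=5):
    """Stop when the validation loss hasn't improved for `patience` epochs."""
    best_val = float("inf")
    epochs_since_best = 0
    for epoch in range(max_epochs):
        train_step(model)           # adjust weights from the training loss
        val_loss = validate(model)  # loss on data training never saw
        if val_loss < best_val:
            best_val = val_loss
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                break               # validation worsening: likely overfitting
    return best_val
```

The `patience` parameter keeps a single noisy epoch from ending training prematurely; only a sustained worsening trips the exit.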
The holdout dataset is yet another dataset, one that was not seen either during training or validation. It’s there to see how well the trained network operates on new data.
Now, one thing about the holdout dataset is that it can be subsampled for different interesting behaviours. That is, you can remove entries that correspond to less interesting results. In our case, we’re going to focus on rain transitions. It’s not so impressive if the network predicts rain given that it’s raining right now, and similarly for predicting no rain if it’s not raining now. So, I filter the holdout dataset to keep only entries where it isn’t raining now but will rain in the future, or is raining now but will stop. These will form the basis for our evaluation of the network’s usability as we adjust parameters.
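That filter amounts to keeping entries whose current and future rain states differ. A minimal sketch, assuming each holdout entry carries hypothetical `raining_now` and `raining_future` flags (the article's actual schema may differ):

```python
# Sketch: subsample the holdout set down to rain transitions only.
# Field names are assumptions for illustration.

def rain_transitions(holdout):
    """Keep entries where rain starts or stops between now and the target time."""
    return [
        entry for entry in holdout
        if entry["raining_now"] != entry["raining_future"]
    ]
```

Entries where the state doesn't change (rain stays rain, dry stays dry) are dropped, since predicting them correctly tells us little.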
Another thing I’ve added to the training code is a hash of the training inputs and outputs. I’m going to be adjusting the network parameters and topology to try to find the best network I can, and I don’t want to discover later that I accidentally modified the input dataset, invalidating my comparisons. If the input set changes, the training will exit.
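A guard like that can be built on a content hash: record the digest of the inputs and outputs once, then abort any run whose data hashes differently. This is a sketch of the idea, not the article's actual code; the byte-serialization of the dataset is assumed:

```python
# Sketch: refuse to train if the dataset's hash no longer matches a
# recorded value. `inputs`/`outputs` are assumed to be raw bytes here.

import hashlib

def dataset_digest(inputs, outputs):
    """SHA-256 over the concatenated training inputs and outputs."""
    h = hashlib.sha256()
    h.update(inputs)
    h.update(outputs)
    return h.hexdigest()

EXPECTED_DIGEST = None  # record once from a known-good run

def check_dataset(inputs, outputs):
    digest = dataset_digest(inputs, outputs)
    if EXPECTED_DIGEST is not None and digest != EXPECTED_DIGEST:
        raise SystemExit("training data changed; refusing to train")
```

With the expected digest pinned, any accidental edit to the input set fails fast instead of silently skewing the parameter comparisons.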