The index to the articles in this series is found here.

UPDATE #1 (2019-09-02): **These results are invalid due to an error in the input**. We will return to this later.

So, regularizing. This is a technique used to address overfitting. We’ve discussed overfitting before. One possible approach is to reduce the dimensionality of the network, use fewer neurons. Another is to find a way to penalize undesirable weight configurations so that the network is more likely to find a general solution, rather than a specific fit to the training data.

We’re going to explore some different approaches. One is a direct weight penalty, the other is a dropout layer. The dropout layer randomly zeroes a certain fraction of the inputs to the following layer. The effect of this is to penalize network configurations that depend too much on a specific small set of correlated inputs, while almost ignoring all the other inputs. Such an undesirable configuration would produce large losses when the dropout layer removed some of its inputs, allowing the network to train to a more resilient configuration that is less dependent on a narrow subset of its inputs.

The direct weight penalty is fairly obvious, the training loss is increased by the presence of large weights, thereby training the network toward weights of a more uniform distribution. There are two typical metrics used for minimization, referred to as L1 and L2. In L1 regularization, also referred to as Lasso regression, a term proportional to the absolute values of the weights is added to the loss function. In L2, or Ridge regression, a term proportional to the square of the weights is added. Each regularization technique also includes a free parameter, the proportionality constant for adding in the penalty, and the choice of this number can have an important impact on the quality of the final model.

To begin, we’re going to re-run our optimizers, with batch sizes of 512 now, and train out 400 epochs. At the end of that time, we will generate a histogram of the weights in the different layers, to see which layers, if any, have a badly unbalanced weight distribution. These will be candidates for our regularization techniques, either through one of Lasso or Ridge regression, or with a dropout layer.

I will not be using a dropout layer on the LSTM layer, since its inputs are often dominated by zeroes, only a relatively small fraction of the input data is non-zero. It sometimes makes sense to apply dropout to the inputs of a network, but it’s usually not useful on data of the type we have here, where the interesting feature of the data is a binary state, raining or not raining in that sector.

Recurrent layers, of which LSTM is a type, are particularly susceptible to overtraining issues with unbalanced weights, so we will be looking for problems in that layer and addressing them with regularization settings in the layer construction.

Another regularization technique that is sometimes applied is a noise layer. Random perturbations of the inputs to a layer can help the network to generalize from a specific set of values by training it to recognize as equivalent inputs that are close together in phase space. I’m not currently planning to use noise injection, we’ll see how the other approaches perform first.

In order to generate through-time histograms of weights, I’ll be using TensorBoard. To that end, I’ve modified the code in rptrainer2.py to log suitable data. The output files are huge, but I hope to get some useful information from them.