The index to the articles in this series is found here.

I’ve now re-run the experiments over 200 epochs. In the first set of runs, the overfitting manifested fairly early, within 100 epochs, so I didn’t see a need to run out to 400 epochs this time. The SGD optimizer is still working at 200 (and was still working at 400 in earlier tests). The other optimizers have long since established their networks.

Experiment | Training Loss | Validation Loss | Two Hour Warning |

SGD | 0.3135 | 0.3057 | 0.00 |

RMSProp | 0.0480 | 0.0796 | 0.82 |

Adagrad | 0.0317 | 0.0823 | 0.78 |

Adadelta | 0.0733 | 0.0829 | 0.77 |

Adam | 0.0300 | 0.0872 | 0.77 |

Adamax | 0.0340 | 0.1148 | 0.81 |

Nadam | 0.0789 | 0.0929 | 0.80 |

Examining the loss numbers, it turns out that Adadelta hadn’t reached overfitting yet, and was still improving against the validation set, so I re-ran that one with 400 epochs, and it settled at a two hour warning fraction of 0.82.

One thing we see in this table is that there’s really no great difference, from a final model accuracy standpoint, between the various optimizers. Excluding the SGD, which would probably get there eventually if I let it run long enough, the other optimizers arrive at similarly accurate solutions. They differ in how quickly they arrive there, and as we saw in an earlier posting, some can diverge if the batch size is too small and the training is performed on batches that are not statistically similar to the overall population. Otherwise, I can limit my experimentation to a single optimizer, at least for a while. I might, at the end, re-run tests with all optimizers to see if any interesting differences in performance have been teased out by whatever configuration changes I make. I’ll be using the RMSprop optimizer for the next while, unless otherwise explicitly noted.

These numbers are disappointing, I liked it when they were at 0.96. Let’s analyse why the corrupted data did so much better, because it helps to emphasize why it’s critically important to have separate datasets for training and validation.

So, the problem occurred because I was manually shuffling two numpy arrays in parallel, using a Knuth shuffle. In this shuffling algorithm, we loop from the first element of the array to one before the last. At each element, we choose a random element between the element itself and the final element in the array, and swap this element with that other one. We never revisit earlier elements, and the random choice includes the same element, so there’s a chance that the element swaps with itself. It is easy to prove that in the final shuffled state, each element has an equal probability of being placed in each slot, so we have a true, fair shuffle.

I was swapping with the familiar mechanism: TMP = A; A = B; B = TMP. It appears, however, that the assignment to temporary space is a pointer, not a deep copy. So, modifying A also modifies TMP. This means that after the swap is complete, B appears twice, and A has disappeared. That assignment is a deep copy, if B is changed, A doesn’t change in sympathy.

Now, I was using Keras’ facility for splitting data into training and validation. This is done by taking a certain number of elements off the end of the input array before Keras does its own shuffling for purposes of splitting into batches.

We can analyse the effect of our broken shuffling on the distribution of elements, and, in particular, find out how many elements from the last 20% of the array wind up in the first 80% where the training will see them. Note that, because elements only move back in the array, training data never gets into the validation data, but validation data can wind up in the training data.

The probability that the first element in the training set will be copied from the validation data is

P_{1} = \frac{V}{N_{tot}}where V is the number of elements in the validation segment, and N_{tot} is the total number of elements.

The probability that the second element will be copied from the validation data is

P_{2} = \frac{V}{N_{tot} - 1}The expected number of validation elements that appear in the training data is, then

<N_{V}> = \sum_{k=0}^{N-V-1} \frac{V}{N - k}This is just a bit of arithmetic on harmonic numbers.

<N_{V}> = V [ H_{N} - H_{N - k}]This simplifies to:

<N_{V}> = V [ ln(N) - ln(N-k) + \mathcal{O}(\frac{k}{N(N-k)})]When k is 20% of N, and N is in the thousands, as in our case, about 22% of the training set actually contains validation data.

So, unless your network has too few degrees of freedom to describe the problem, or is exactly balanced, then, barring convergence pathologies, you will eventually overfit the model. If your training data contains a significant fraction of your validation data, then you will appear to be doing very well on validation, because you trained against it.

What about the probability that any specific element in the validation set has *not* accidentally been placed in the training set? This is the product of the probabilities for each element in the training set:

Therefore, we expect 80% of our validation elements to be present in the training set, some duplicated, and 22% of our training set to contain validation elements. And that completely messes up our statistics, making it look like our network was doing much better than it truly was.

Where do we go from here? First, let’s look at the nature of a failing prediction. I pull one failure out and look at the historical rainfall, then the rain starting two hours later.

Yeah, there’s no rain there. What’s going on? The network got the correct answer, it’s the training “true” values that are wrong. You recall I had to deal with what I called “phantom rain”. That’s the scattering of light rain points around the radar station. Not really rain, it seems to be related to close range scattering from humid air. I don’t see a mention of this style of false image on Environment Canada’s page detailing common interpretation errors. I decided to use a rule that said that this false rain is declared to be occurring when only the lowest intensity of rain is seen within a certain radius of the radar station, and those rain pixels make up less than 50% of the area of the disc. Well, in the image that declares that there is rain in Ottawa at that particular time, there is a single phantom rain pixel of intensity 2, South-East of the station. This is enough for the data generation system to declare that another phantom rain pixel over Ottawa is real rain, and the training data gets an incorrect Y value. I brought up another image from a failed prediction, and there was a single pixel of intensity 3, North-North-West of the station.

All right, so my data cleaning operation didn’t work as well as it should have. Neural networks are notoriously sensitive to dirty data, as they work hard to imagine some sort of correlation between events that didn’t actually take place.

Recall that all of our networks overtrained, and based on the values of the training losses, reproduced their inputs essentially exactly when subject to a 0.5 mid-point decision cutoff between rain and no-rain. That means that our overtrained networks actually managed to declare those phantom rain pixels as rain, when they should not have done so. The best matching networks, though, the ones with the lowest validation losses, correctly indicated that there was no rain in those images, and we scored them lower because they failed to match the incorrect Y values. Validation loss doesn’t feed back into the network weights during training, so the network didn’t force itself to give wrong answers on these entries.

That’s actually very encouraging. Now all we have to do is to figure out how to identify phantom rain more accurately, regenerate our Y values, and try again. Well, I have a nice list of 30 or so bad cases, taken from the holdout data set. The neural network has helpfully presented me with a good set of incorrectly-classified images that I can use to improve my data cleaning efforts.

So, let’s not say we need exactly zero pixels of higher intensity than the lowest for the rain to be declared phantom rain. Instead, we’ll say that as long as there are fewer than 5 such pixels, there’s still a possibility of phantom rain. That’s an edit to prepare-true-vals.py.