The index to the articles in this series is found here.
UPDATE #1 (2019-09-02): These results turn out to be invalid. I was doing a Knuth shuffle on the rows of the NumPy arrays holding the inputs and true values, and apparently there are subtleties in swapping rows through a temporary value that I wasn't aware of, because I wound up with duplicated entries (and a corresponding disappearance of unique entries). I have to re-run now that I've fixed the shuffling code.
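For anyone who hits the same thing, here is a minimal sketch of what I believe the pitfall is: indexing a row of a NumPy array returns a view, not a copy, so a temporary-variable swap silently clobbers its own saved row. (The variable names here are illustrative, not the actual project code.)

```python
import numpy as np

a = np.arange(12).reshape(4, 3)  # four fake data rows
i, j = 0, 2

# Buggy Knuth-shuffle swap: `tmp` is a *view* of row i, not a copy.
tmp = a[i]
a[i] = a[j]   # overwrites row i -- and `tmp` now shows row j's values
a[j] = tmp    # writes row j's data back onto itself
print(a)      # rows 0 and 2 are identical; the original row 0 is gone

# Fix 1: force a real copy before overwriting: tmp = a[i].copy()
# Fix 2 (simpler): shuffle inputs and targets with one shared permutation;
# fancy indexing returns copies, and the rows stay paired.
X, y = np.arange(12).reshape(4, 3), np.arange(4)
perm = np.random.permutation(len(X))
X, y = X[perm], y[perm]
```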
I’ve run through all of the optimizers bundled with Keras. Originally I had been tracking CPU time as well, but I stopped. First, execution time was very similar in all experiments, roughly two wallclock hours to run 400 epochs. Second, for every experiment after the first, the best network wasn’t the one at epoch 400, but an earlier, sometimes much earlier, one, from before the network began to overtrain and the validation loss worsened.
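Since the best network shows up partway through training, it pays to snapshot it automatically instead of keeping whatever exists at epoch 400. Here's a minimal sketch using Keras's ModelCheckpoint callback; the model and data are random stand-ins, not the actual rain network:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import ModelCheckpoint

# Stand-in data; the real project feeds windows of weather history.
X_train, y_train = np.random.rand(1000, 20), np.random.rand(1000, 1)
X_val, y_val = np.random.rand(200, 20), np.random.rand(200, 1)

model = Sequential([Dense(32, activation='relu', input_shape=(20,)),
                    Dense(1, activation='sigmoid')])
model.compile(optimizer='rmsprop', loss='binary_crossentropy')

# Save the weights from whichever epoch scores best on validation loss.
checkpoint = ModelCheckpoint('best_model.h5', monitor='val_loss',
                             save_best_only=True)
model.fit(X_train, y_train, epochs=400, batch_size=128,
          validation_data=(X_val, y_val), callbacks=[checkpoint])
```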
I decided that, for these experiments, I’d focus on one particular result in the holdout data: if it is currently not raining, and rain will start in more than one hour but less than two, then the network is considered to have made a successful prediction if it predicts rain at any time in the next two hours. I’ll call this the “one-hour warning” success rate.
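To pin that definition down, here's a sketch of how the metric could be scored. The data layout is an assumption for illustration only: `raining[t]` flags observed rain at 10-minute step `t`, and `predicted[t]` flags the network forecasting rain at step `t`.

```python
import numpy as np

STEPS_PER_HOUR = 6  # assuming 10-minute time steps

def one_hour_warning_rate(raining, predicted):
    """Fraction of eligible samples where the network gave a one-hour warning."""
    hits = eligible = 0
    for t in range(len(raining) - 2 * STEPS_PER_HOUR):
        future = raining[t + 1 : t + 2 * STEPS_PER_HOUR + 1]
        if raining[t] or not future.any():
            continue  # raining now, or no rain at all in the next two hours
        steps_to_rain = np.argmax(future) + 1  # steps until rain begins
        # Eligible only if rain starts more than one but less than two hours out.
        if not (STEPS_PER_HOUR < steps_to_rain < 2 * STEPS_PER_HOUR):
            continue
        eligible += 1
        # Success if the network predicts rain anywhere in the next two hours.
        hits += predicted[t + 1 : t + 2 * STEPS_PER_HOUR + 1].any()
    return hits / eligible if eligible else float('nan')
```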
Here are the results from the first run of experiments:
| Experiment | Training Loss | Validation Loss | One-Hour Warning Rate |
| --- | --- | --- | --- |
| SGD | 0.0638 | 0.0666 | 0.56 |
| RMSProp, default | 0.0247 | 0.0383 | 0.88 |
| RMSProp, large batch | 0.0074 | 0.0102 | 0.96 |
| RMSProp, small LR | 0.0180 | 0.0293 | 0.84 |
| Adagrad | 0.0018 | 0.0111 | 0.93 |
| Adadelta | 0.0102 | 0.0186 | 0.91 |
| Adam | 0.0066 | 0.0102 | 0.83 |
| Adamax | 0.0026 | 0.0069 | 0.84 |
| Nadam | 0.0090 | 0.0228 | 0.78 |
So, do we just declare RMSProp with the large batch the best result and continue? Well, no, we can’t do that, for two reasons. First, we know that the large batch size improves RMSProp, but we didn’t test a large batch size on the other optimizers. Second, we have to put error bars on our one-hour warning rate. For RMSProp with the large batch, the uncertainty in the result is 0.027 at the 90% confidence level, computed from Student’s t-distribution. This is, in fact, a lower bound on the uncertainty. There are 146 samples in the holdout data that match the requirements for a one-hour warning test, but these aren’t strictly independent: many of them might be the same rain event with the historical window starting 10 or 20 minutes later, so a significant fraction of the input time steps is shared between so-called independent measurements.
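For reference, here's a sketch of that interval calculation, treating each eligible sample as an independent Bernoulli trial (which, per the caveat above, is exactly what they aren't, hence "lower bound"):

```python
import numpy as np
from scipy import stats

n = 146   # eligible one-hour-warning samples in the holdout data
p = 0.96  # observed success rate for RMSProp, large batch

# Standard error of the mean of n Bernoulli trials, then a two-sided 90%
# interval from Student's t-distribution with n - 1 degrees of freedom.
sem = np.sqrt(p * (1.0 - p) / (n - 1))
margin = stats.t.ppf(0.95, n - 1) * sem
print(f"{p} +/- {margin:.3f}")  # -> 0.96 +/- 0.027
```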
All right, so we know, on broad arguments, that a larger batch size is really required for any of these tests in this particular project, and we’ll have to re-run with one. We’ll also want to start putting in regularizers. From now on, all our experiments will use a batch size of 512, unless otherwise noted. The larger batch size means fewer weight updates, and so less convergence, per epoch, but each epoch also runs about four times faster, so we don’t expect any great increase in total runtime.
In the next post, we’ll begin the regularizer testing.