The index to the articles in this series is found here.
I’ve finished downloading additional training data from the archive at Environment Canada. Turns out I needn’t have worried about polling their web server too heavily, it responds quite slowly to requests in any case. A single 20k .gif file might take several seconds to assemble before it is transferred, and sometimes the file that arrives is truncated.
I have been doing data repairs, loading fresh copies of damaged or missing files. To determine whether a .gif file is truncated, I tried using giftopnm on the file and checking the return code, but there are some truncated files that return a success result from that program, so I had to do a bit more, scanning the output for reports of difficulty. I believe I’ve found all the bad cases, so now I’m re-running my repair script.
Now, I can download the radar data from two servers. I’ll call server1 the one that I used in the past, the one that presents current radar data to the viewer in a one-hour loop so you can estimate rain, but doesn’t have any historical data. Then there’s server2, which has historical data.
The .gif files from the two servers are not identical. The data from server1 is as shown in the early posts in this series. The historical data from server2 is similar, but includes 4 overlays with names of cities, and tracks of rivers and concentric rings around the radar facility. The problem is that these overlays occlude rainfall data. So, if it’s raining under the label of a city name, the data from server2 won’t show that rain, but the data from server1 will.
Fine, I can just switch everything over to server2. Except, I can’t, because that archive isn’t up to date. Server1 has data collected at noon appearing on the server by about 8 minutes after twelve. Server2 has that data two or three hours later. I can’t expect to write a rain predictor if my current data is already three hours out of date, so I have to continue to use server1 for predictions, but I’m using server2 to train.
So, I’ve had to replicate the overlays. This wasn’t difficult, server1 stores the overlays, because they can be optionally added to the view in the web browser, so it’s simple to download them, OR them together, and produce a new mask. The code to do this is in the git repository, under the name sum-overlays.py.
I can now use the mask to hide those pixels from server1 that are effectively dead pixels on server2, so that my inputs to the network for predictions once again match the training data.
So, I have data now from Jan 1, 2015 to Sep 20, 2019. I’ll be using this to train, validate, and for the holdout set.
One more thing that’s been bothering me about the way I was training the data in the past was that the training and validation sets were a bit too close together. That is, because a single entry is generated from 6 consecutive .gif files, I might train on the 6 images starting at 1:00 on June 10, and have a validation element containing the 6 images starting at 1:10 on June 10. That’s always made me a bit uneasy, but with only a few months of data available, it seemed a reasonable compromise.
Now, though, I have years of data. I plan to use the data from 2015 through 2017 as training data, 2018 for validation, and 2019 for the holdout. These will be completely non-overlapping entries, so that worry will be assuaged.
Another thing I’ve seen, running the desktop widget for a few weeks now, is that the network is not very good at seeing very light rain, it seems to be getting confused by the phantom rain effect again. I decided that the two bytes of data per sector might not be enough for pixels close to the radar facility, so I added a third byte for those sectors close in. This is a number that represents the fraction of rain pixels in the sector for which all bordering pixels are also rain pixels. Phantom rain is often dotty or streaky on the images, and that would give a very low value in this third byte, while real light rain is usually uniform across large swathes of the sky, so this byte would have a high value. We’ll see how that does once the training has been done.
One more thing I’m planning to do in the updated training code is to add a value for time of year. It would go from 0 in February to 1 in August, then back down again, in steps of 0.2. This is so that the neural network can condition its response to rain and snow independently, in case that’s important.
Of course, training will take much longer, but even if we go from two hours to one day, that’s not a big deal.
So, that’s where we are now, more updates to follow once I’ve finished my data repairs. I may also have to do something about the phantom rain detection for the true values, we’ll see.