Tag Archives: keras

Building a Rain Predictor. Feature generation.

The index to the articles in this series is found here.

What has been happening since the phantom rain work? I’m happy with the training set y-values again, but I was having trouble generating useful networks. I discovered later that a bug introduced when I was working on the phantom rain problem effectively deleted half of the feature data, so there wasn’t very much for the neural network to chew on to train itself. I wrote a small script to read the features out of the preprocessed files and generate .gifs of them, so I could, in effect, see what the network sees.

My earlier networks were being trained on data that was the average rain value over all pixels in a sector (even pixels with no rain), and the maximum rain value for any pixel in the sector. This didn’t seem to be doing very well.

Recall that, early on, I decided that I wasn’t just going to feed the .gif files into the network and let it work out its own features. The data sets are too big, and the number of neurons that would be needed to handle them was too high. So, very early on I moved to synthesized features.

Now, I’m just trying different synthetic feature combinations to see what works well. Each iteration takes about two days, which is the amount of time it takes my current machine to preprocess the image files into a format suitable for the neural network training code.

The current feature combination I’m trying out consists of three bytes per sector, and still the same 400 sectors. The first byte represents the fraction of pixels in the sector that show any rain at all. The second byte is a measure of the mean rain intensity in all rain pixels. The third byte is the root-mean-squared rain intensity in all rain pixels. We shall see, once the preprocessing is complete, whether this produces something more useful for the neural network training.

Building a Rain Predictor. Revisiting the phantom rain.

The index to the articles in this series is found here.

I was thinking about the overfitting issue I mentioned in the last posting. It seemed strange, like something was supposed to work, but didn’t. That kind of immediate departure between training and validation loss suggests incompatible data. So, I wondered if maybe the true values used for training and validation were corrupted.

After I made the first working version of the rain predictor, I said that it was not very good at light rainfall over short time scales, and proposed that this might be because of poor detection of phantom rain. So, I adjusted the phantom rain algorithm before my last training attempts with the larger dataset.

Now, though, I was becoming suspicious, so I instrumented the script that generates the true values so that it would record names of intermediate binary files for which it had decided there was true rain, and phantom rain. The results were enlightening.

The phantom rain algorithm’s false positive rate was very low, out of about 1800 events labeled as true rain, about 2 or 3 were, in fact, phantom rain. However, the false negative rate was horrible. Out of over 3100 events labeled as phantom rain, more than two thirds were actually rain. My little heuristic for determining phantom rain was badly misbehaving.

This messes up the training badly. It’s no wonder that the mislabeled training and validation sets behave so differently, the neural network is hallucinating badly trying to resolve the inconsistencies in the training set.

I can look at a radar image and decide in about 2 seconds whether it’s showing phantom rain or true rain in Ottawa. Sometimes it’s a bit ambiguous, but I’m pretty sure well over 95% of the time.

If only there were some program that could look for patterns in two-dimensional arrays of pixels, and determine what sort of features distinguish true rain from phantom rain. Whatever would do that? What sort of software do I have to learn to use?

Imagine Keras just waving its arm above its head and pointing at itself.

Right, so a 2-dimensional convolutional neural network it is, then. I have to automate the rain determination, there are over 200000 input files and I’m not inspecting them one by one.

I inspected all files in 2015 and 2016 for which a sensitive pixel was illuminated, so there was the possibility of rain over Ottawa. That was 8641 images. To make this run more quickly, I put together this script:

#! /usr/bin/python3

import matplotlib.pyplot as plt
import PIL

category = []
plt.ion()

images = []

with open('phantom-rain-candidates.txt', 'r') as ifile:
    filenames = ifile.readlines()

for fn in filenames:
    fn = fn[:-5]
    
    img = PIL.Image.open(fn)
    plt.imshow(img)
    plt.pause(0.05)
    response = input('PHANTOM?: ')
    plt.cla()
    if response == 'q':
        break
    category.append('{0}  {1}\n'.format(fn, response))


with open('screened-rain-candidates.txt', 'a+') as ofile:
    for result in category:
        ofile.write(result)

It pops up a window and shows images in that window one after the other. After each photo, I type either ‘y’ or ‘n’, followed by ENTER, and that value is recorded for later output, then the next image is displayed.

This gave me a training set and a validation set. Next, it’s time to write the convolutional neural network code:

#! /usr/bin/python3

# Try to write a neural network to distinguish real from phantom rain.

# Phantom rain looks different from real rain, particularly in the
# region surrounding the radar station.  Ottawa lies within the radius
# where phantom rain can appear, so we need to know when active rain
# pixels are actually indicative of rain.

import argparse
import numpy
import rpreddtypes
import os
import keras
from keras.layers import Input, Conv2D, Dense, Flatten, MaxPooling2D
from keras.models import Model

parser = argparse.ArgumentParser(description='Train the phantom '
                                 'classification network.')
parser.add_argument('--training-set', type=str, dest='training',
                    required=True,
                    help='A file of pathnames, and then letter '
                    '\'y\' for phantom rain, \'n\' for non-phantom '
                    '(i.e. real) rain.')
parser.add_argument('--validation-set', type=str, dest='validation',
                    required=True,
                    help='A file of pathnames, and then letter '
                    '\'y\' for phantom rain, \'n\' for non-phantom '
                    '(i.e. real) rain.')
parser.add_argument('--examination-box', type=list, dest='bounds',
                    default=[240, 320, 160, 240],
                    help='Bounds of the region to pass to the network.'
                    '  They are [minCol, maxCol, minRow, maxRow].')
parser.add_argument('--epochs', type=int, dest='nEpochs',
                    default = 100,
                    help = 'Set the number of epochs to train.')
parser.add_argument('--dense-layer-nodes', type=int, dest='densenodes',
                    default = 100,
                    help = 'Set the number of nodes in the synthesis layer.')
parser.add_argument('--name', type=str, dest='name',
                    required=True,
                    help='A name to distinguish this run.  It '
                    'will be used to construct filenames for '
                    'detailed logging.')

args = parser.parse_args()



def loadData(pathname, bounds):

    if os.path.exists(pathname + '-saved.npz'):
        container = numpy.load(pathname + '-saved.npz')
        return container['rvalX'], container['rvalY']
    
    minRow = bounds[2]
    minCol = bounds[0]
    numRows = bounds[3] - bounds[2] + 1
    numCols = bounds[1] - bounds[0] + 1
    records = []
    with open(pathname, 'r') as ifile:
        records = ifile.readlines()

    rvalX = numpy.zeros((len(records), 1, numRows, numCols))
    rvalY = numpy.zeros((len(records)))
    index = 0
    for r in records:
        isPhantom = r[-2:-1]
        if isPhantom == 'y':
            rvalY[index] = 1
        elif isPhantom == 'n':
            rvalY[index] = 0
        else:
            os.exit(1)

        binfilename = r[:-4] + '.bin'
        
        reader = rpreddtypes.RpBinReader()
        reader.read(binfilename)
        mrv = reader.getMaxRainval()
        rawdat = reader.getScaledObject(1).getNumpyArrayMax()
        for row in range(numRows):
            for col in range(numCols):
                # normalizing on the range [-1,1]
                rvalX[index, 0, row, col] = (rawdat[minRow + row, minCol + col] / mrv - 0.5) * 2

        index += 1

    numpy.savez(pathname + '-saved.npz', rvalX = rvalX, rvalY = rvalY)
    return rvalX, rvalY
        

trainingX, trainingY = loadData(args.training, args.bounds)
validateX, validateY = loadData(args.validation, args.bounds)

numrows = args.bounds[3] - args.bounds[2] + 1
numcols = args.bounds[1] - args.bounds[0] + 1


inputs = Input(batch_shape = (None, 1, numrows, numcols))
convlayer1 = Conv2D(filters=64, kernel_size=3, data_format='channels_first', activation='relu')(inputs)
pool1 = MaxPooling2D()(convlayer1)
convlayer2 = Conv2D(filters=32, kernel_size=3, activation='relu')(pool1)
flat = Flatten()(convlayer2)
synthlayer = Dense(args.densenodes, activation='relu')(flat)
outlayer = Dense(1, activation='sigmoid')(synthlayer)

mymodel = Model(inputs = inputs, outputs = outlayer)
mymodel.compile(loss='binary_crossentropy', optimizer=keras.optimizers.Adam(),
                metrics=['accuracy'])




basename = 'phantomnet_{}.'.format(args.densenodes)
cb1 = keras.callbacks.ModelCheckpoint(basename,
                                      save_weights_only=False,
                                      save_best_only = True,
                                      verbose=1,
                                      mode='auto', period=1)


history = mymodel.fit(x = trainingX, y = trainingY, epochs = args.nEpochs,
                      validation_data = [validateX, validateY],
                      verbose=1, batch_size=512, shuffle = True,
                      callbacks=[cb1])

for key in history.history.keys():
    filename='history_{}_{}'.format(args.densenodes, key)
    with open(filename, 'w') as ofile:
        ofile.write('\n'.join(str(val) for val in history.history[key]))

Now, I can feed it my training and validation sets, and see what the validation loss/accuracy are.

This is running now, with different values of densenodes. For 20 nodes in that layer, I got a validation accuracy of 96.7%. So, data that it never saw for training was validating that well on the best network after 200 iterations. I’ll see, after this runs overnight, whether a larger number of nodes works better.

So, what does this network do? It produces 64 random 3×3 convolutions on a subset of the data. A square that has the radar station in its bottom left corner and which extends to cover the city of Ottawa, and a bit beyond. A 3×3 convolution is essentially an operation that replaces the value at I,J with a linear combination of the values at I,J and its 8 nearest neighbours. This detects simple features. A pooling layer downscales this by a factor of two squared, and a second convolution looks for features built from simple features. This is then fed into a dense intermediate layer, and then an output layer.

Once I’ve done the overnight run, I’ll pick out the best network and use it in the script that generates true values, replacing the defective algorithm that is there. This should greatly improve the training y-values, and we’ll see if the previous network topology has more success.

Building a Rain Predictor. Results from larger data set.

The index to the articles in this series is found here.

I got my larger data set, almost 5 years of radar data. This took about 40 hours on all cores of my home machine to convert into the intermediate binary format, but then I was able to start experimenting.

I mentioned in the last article that I had been a bit concerned about the overlapping in time of the training, validation, and holdout sets. Since I only had a handful of months of data, and rain only happens a few times a week, I decided that this was how I would break up my data set. With the new multi-year dataset, though, I could make truly independent training, validation, and holdout sets, from separate years.

And… it turns out I was right to be concerned about that overlap. With the disjoint data sets, the network can be seen to be almost immediately overtraining. That is, it is not getting any better at predicting the outcome from new data, it’s getting worse. Typically, this results in a prediction success rate of about 60% now.

Of course, the desktop widget that I have been using has shown its usefulness, even though it was trained and validated on overlapping data. This means that that network was overtrained, and probably doing more poorly in the predictions I was using it for.

So, how do we address the overtraining? I’ve changed the geometry somewhat to start with. Now, instead of feeding the data into the LSTM layer, I first feed it into a geometry layer that takes the 800+ inputs and produces a smaller number of outputs to feed into the LSTM layer. I apply batch normalization to that layer, and a 50% dropout, before going on to LSTM. After LSTM, another 50% dropout, then into another normalized dense layer, then finally the output layer. I cut way back on the sizes of my layers, instead of hundreds of nodes per layer, I’m down to tens.

Even with all that, the network overtrains immediately. I have to find a better way to design the network to capture generic features, rather than specific details.

Early on, when the scope of the problem became apparent, I said I was not going to use convolutional layers. The size of the data appeared to make that impractical. I’m now going back and thinking about that decision. I have some ideas for how I can apply convolutions without straining the memory of the machine. I was thinking of doing it in polar coordinates, the way the preprocessed data in the intermediate binary files is implicitly stored, but I think I’ll stick with cartesian features for now. The intermediate binary files do contain a raw representation of the rain in a radar image, in addition to the preprocessed vectors, and so I can do something with that before saving the numpy arrays to disc for easy loading and reuse.

Building a Rain Predictor. More data.

The index to the articles in this series is found here.

I’ve finished downloading additional training data from the archive at Environment Canada. Turns out I needn’t have worried about polling their web server too heavily, it responds quite slowly to requests in any case. A single 20k .gif file might take several seconds to assemble before it is transferred, and sometimes the file that arrives is truncated.

I have been doing data repairs, loading fresh copies of damaged or missing files. To determine whether a .gif file is truncated, I tried using giftopnm on the file and checking the return code, but there are some truncated files that return a success result from that program, so I had to do a bit more, scanning the output for reports of difficulty. I believe I’ve found all the bad cases, so now I’m re-running my repair script.

Now, I can download the radar data from two servers. I’ll call server1 the one that I used in the past, the one that presents current radar data to the viewer in a one-hour loop so you can estimate rain, but doesn’t have any historical data. Then there’s server2, which has historical data.

The .gif files from the two servers are not identical. The data from server1 is as shown in the early posts in this series. The historical data from server2 is similar, but includes 4 overlays with names of cities, and tracks of rivers and concentric rings around the radar facility. The problem is that these overlays occlude rainfall data. So, if it’s raining under the label of a city name, the data from server2 won’t show that rain, but the data from server1 will.

Fine, I can just switch everything over to server2. Except, I can’t, because that archive isn’t up to date. Server1 has data collected at noon appearing on the server by about 8 minutes after twelve. Server2 has that data two or three hours later. I can’t expect to write a rain predictor if my current data is already three hours out of date, so I have to continue to use server1 for predictions, but I’m using server2 to train.

So, I’ve had to replicate the overlays. This wasn’t difficult, server1 stores the overlays, because they can be optionally added to the view in the web browser, so it’s simple to download them, OR them together, and produce a new mask. The code to do this is in the git repository, under the name sum-overlays.py.

I can now use the mask to hide those pixels from server1 that are effectively dead pixels on server2, so that my inputs to the network for predictions once again match the training data.

So, I have data now from Jan 1, 2015 to Sep 20, 2019. I’ll be using this to train, validate, and for the holdout set.

One more thing that’s been bothering me about the way I was training the data in the past was that the training and validation sets were a bit too close together. That is, because a single entry is generated from 6 consecutive .gif files, I might train on the 6 images starting at 1:00 on June 10, and have a validation element containing the 6 images starting at 1:10 on June 10. That’s always made me a bit uneasy, but with only a few months of data available, it seemed a reasonable compromise.

Now, though, I have years of data. I plan to use the data from 2015 through 2017 as training data, 2018 for validation, and 2019 for the holdout. These will be completely non-overlapping entries, so that worry will be assuaged.

Another thing I’ve seen, running the desktop widget for a few weeks now, is that the network is not very good at seeing very light rain, it seems to be getting confused by the phantom rain effect again. I decided that the two bytes of data per sector might not be enough for pixels close to the radar facility, so I added a third byte for those sectors close in. This is a number that represents the fraction of rain pixels in the sector for which all bordering pixels are also rain pixels. Phantom rain is often dotty or streaky on the images, and that would give a very low value in this third byte, while real light rain is usually uniform across large swathes of the sky, so this byte would have a high value. We’ll see how that does once the training has been done.

One more thing I’m planning to do in the updated training code is to add a value for time of year. It would go from 0 in February to 1 in August, then back down again, in steps of 0.2. This is so that the neural network can condition its response to rain and snow independently, in case that’s important.

Of course, training will take much longer, but even if we go from two hours to one day, that’s not a big deal.

So, that’s where we are now, more updates to follow once I’ve finished my data repairs. I may also have to do something about the phantom rain detection for the true values, we’ll see.

Building a Rain Predictor. Discussion.

The index to the articles in this series is found here.

While I continue to download training data from Environment Canada, I’ll talk a bit about this project.

So, I’ve been calling this a rain predictor. What is it, and what is it not? This neural network project aims to look at a time series of radar images, specifically 6 images taken at 10 minute intervals before the present. These images are taken from the Franktown radar facility SW of Ottawa and show only active precipitation. It then attempts to predict how these rain patterns will move through the region, and whether and when to expect rain in Ottawa. The radar images have a coverage with a radius of 240 km.

Rain that forms directly overhead has no radar warning, and this network will always be surprised by that. This code is limited to making predictions based only on a short (one hour) time series of active precipitation images. It will never pretend to know the weather next week, or tomorrow.

This isn’t a forecasting application, the scientists at Environment Canada have their own sophisticated models for estimating weather patterns over different parts of the country on a time scale of hours to days.

The obvious next question is, does it work with snow? The training data I’m downloading now is a continuous record (barring station outages due to hardware maintenance or failure). It will have training data from snow events when this is done. It so happens that the data I collected in 2015 and again in 2019 were all for summer months, so the network has not been trained against winter weather patterns. I should have enough data downloaded in about a week to train a new network, and I expect it will happily incorporate winter conditions.

I started this project thinking I could just throw the pixels from the radar into a neural network. I hadn’t quite appreciated the scale of that, and the resulting network was unusably slow to train. I then came up with a simple manual data convolution that allowed me to discard two levels of the neural network, and reduce each 480×480 .gif images to a record 800 bytes long. Six of these, less than 5k in total, is the entire input to the neural network for one prediction.

I tried various regularizations, activations, and topologies, but so far haven’t found any places for obvious improvement. Examining the distribution of weights with TensorBoard shows very good behaviour, no pathologies that produce unphysical dependencies on subsets of the data. I’m downloading more training data now, and once I’ve trained a new network from that I’ll check again to see if there’s anything I can do to improve it.

Training time on my home machine is about 20-30 minutes to get a stable, but overtrained network in 200 epochs with the training set I’ve been using until now. The best validation network can appear as soon as 5 minutes in.

The resulting network is at least as good as I am at reading the radar image sequence and estimating rain in Ottawa.