Tag Archives: tensorflow

Building a Rain Predictor. Final thoughts.

The index to the articles in this series is found here.

The prediction system is complete, and working quite well. I made many slight changes to the network configuration, none really performed much better than the others. The network layout that I finally used is as follows:

  • Input images are processed into 400 sectors laid out similarly to the patches on a dartboard.
  • Each sector has three scalars associated with it. They are:
    1. The fraction of pixels in the sector that has any rain at all
    2. The mean intensity of pixels with non-zero rain values
    3. The RMS intensity of pixels with non-zero rain values
  • These preprocessed images are passed, in groups of 6, into a dense layer of 80 nodes called the geometry layer, with a ‘relu’ activation function.
  • Output from the geometry layer enters the time layer, an LSTM network of 45 nodes, with a ‘relu’ activation function.
  • These outputs are then concatenated with a vector of scalars that represent the time of the year. This is a number that starts a 0 in February, steps up by intervals of 1/6 to 1 in August, and then steps down by 1/6 in each month following, to return to zero the following February.
  • Output from the time layer passes into the synthesis layer, a dense layer of 10 nodes with ‘relu’ activation.
  • Output from the synthesis layer passes into the output layer, a dense layer of 10 nodes, 10 outputs, and sigmoid activation.

The loss function was binary_crossentropy, and I used a validation binary accuracy monitor for checkpointing, rather than validation loss. This is because I’m more interested in training against getting the distinction between 0 and 1 correct, than I am in making the zeroes smaller and the ones larger. A network that, for true values of [0, 0, 1, 1], produces results of [ 0.45, 0.45, 0.55, 0.55] is better, for my purposes, than one that produces true values of [ 0.1, 0.6, 0.9, 0.9], even if the latter might have a better validation loss.

I also used classweights to emphasize the quantities I was most interested in predicting. Recall that the network produces 10 binary outputs. 5 of them are the likelihood of any rain in the next 0-60 minutes, 61-120 minutes, 121-180 minutes, etc. The other 5 are the likelihood of heavy rain in the same intervals. I decided that I’m most interested in the near future, if it’s going to rain soon I want to make sure the neural network will tell me so. The light rain/heavy rain distinction is less interesting. The weights I gave, then, on an arbitrary scale, were:

  • 1.0 for the probability of rain in the next hour
  • reduce the weight by 0.1 for each hour interval after the first hour
  • the weight of the heavy/light output is half the weight of the any rain output for the same time interval

All of the code is in the git repository.

I’m happy with the network now, and I’m unlikely to find the need to retrain it in the near future, so this will close off my thoughts on this project.

I mentioned early on in this long series of posts that I was unexpectedly out of work. I started a new job a month ago, so life is back to its normal, moderately busy state.

Building a Rain Predictor. Feature generation.

The index to the articles in this series is found here.

What has been happening since the phantom rain work? I’m happy with the training set y-values again, but I was having trouble generating useful networks. I discovered later that a bug introduced when I was working on the phantom rain problem effectively deleted half of the feature data, so there wasn’t very much for the neural network to chew on to train itself. I wrote a small script to read the features out of the preprocessed files and generate .gifs of them, so I could, in effect, see what the network sees.

My earlier networks were being trained on data that was the average rain value over all pixels in a sector (even pixels with no rain), and the maximum rain value for any pixel in the sector. This didn’t seem to be doing very well.

Recall that, early on, I decided that I wasn’t just going to feed the .gif files into the network and let it work out its own features. The data sets are too big, and the number of neurons that would be needed to handle them was too high. So, very early on I moved to synthesized features.

Now, I’m just trying different synthetic feature combinations to see what works well. Each iteration takes about two days, which is the amount of time it takes my current machine to preprocess the image files into a format suitable for the neural network training code.

The current feature combination I’m trying out consists of three bytes per sector, and still the same 400 sectors. The first byte represents the fraction of pixels in the sector that show any rain at all. The second byte is a measure of the mean rain intensity in all rain pixels. The third byte is the root-mean-squared rain intensity in all rain pixels. We shall see, once the preprocessing is complete, whether this produces something more useful for the neural network training.

Building a Rain Predictor. Revisiting the phantom rain.

The index to the articles in this series is found here.

I was thinking about the overfitting issue I mentioned in the last posting. It seemed strange, like something was supposed to work, but didn’t. That kind of immediate departure between training and validation loss suggests incompatible data. So, I wondered if maybe the true values used for training and validation were corrupted.

After I made the first working version of the rain predictor, I said that it was not very good at light rainfall over short time scales, and proposed that this might be because of poor detection of phantom rain. So, I adjusted the phantom rain algorithm before my last training attempts with the larger dataset.

Now, though, I was becoming suspicious, so I instrumented the script that generates the true values so that it would record names of intermediate binary files for which it had decided there was true rain, and phantom rain. The results were enlightening.

The phantom rain algorithm’s false positive rate was very low, out of about 1800 events labeled as true rain, about 2 or 3 were, in fact, phantom rain. However, the false negative rate was horrible. Out of over 3100 events labeled as phantom rain, more than two thirds were actually rain. My little heuristic for determining phantom rain was badly misbehaving.

This messes up the training badly. It’s no wonder that the mislabeled training and validation sets behave so differently, the neural network is hallucinating badly trying to resolve the inconsistencies in the training set.

I can look at a radar image and decide in about 2 seconds whether it’s showing phantom rain or true rain in Ottawa. Sometimes it’s a bit ambiguous, but I’m pretty sure well over 95% of the time.

If only there were some program that could look for patterns in two-dimensional arrays of pixels, and determine what sort of features distinguish true rain from phantom rain. Whatever would do that? What sort of software do I have to learn to use?

Imagine Keras just waving its arm above its head and pointing at itself.

Right, so a 2-dimensional convolutional neural network it is, then. I have to automate the rain determination, there are over 200000 input files and I’m not inspecting them one by one.

I inspected all files in 2015 and 2016 for which a sensitive pixel was illuminated, so there was the possibility of rain over Ottawa. That was 8641 images. To make this run more quickly, I put together this script:

#! /usr/bin/python3

import matplotlib.pyplot as plt
import PIL

category = []
plt.ion()

images = []

with open('phantom-rain-candidates.txt', 'r') as ifile:
    filenames = ifile.readlines()

for fn in filenames:
    fn = fn[:-5]
    
    img = PIL.Image.open(fn)
    plt.imshow(img)
    plt.pause(0.05)
    response = input('PHANTOM?: ')
    plt.cla()
    if response == 'q':
        break
    category.append('{0}  {1}\n'.format(fn, response))


with open('screened-rain-candidates.txt', 'a+') as ofile:
    for result in category:
        ofile.write(result)

It pops up a window and shows images in that window one after the other. After each photo, I type either ‘y’ or ‘n’, followed by ENTER, and that value is recorded for later output, then the next image is displayed.

This gave me a training set and a validation set. Next, it’s time to write the convolutional neural network code:

#! /usr/bin/python3

# Try to write a neural network to distinguish real from phantom rain.

# Phantom rain looks different from real rain, particularly in the
# region surrounding the radar station.  Ottawa lies within the radius
# where phantom rain can appear, so we need to know when active rain
# pixels are actually indicative of rain.

import argparse
import numpy
import rpreddtypes
import os
import keras
from keras.layers import Input, Conv2D, Dense, Flatten, MaxPooling2D
from keras.models import Model

parser = argparse.ArgumentParser(description='Train the phantom '
                                 'classification network.')
parser.add_argument('--training-set', type=str, dest='training',
                    required=True,
                    help='A file of pathnames, and then letter '
                    '\'y\' for phantom rain, \'n\' for non-phantom '
                    '(i.e. real) rain.')
parser.add_argument('--validation-set', type=str, dest='validation',
                    required=True,
                    help='A file of pathnames, and then letter '
                    '\'y\' for phantom rain, \'n\' for non-phantom '
                    '(i.e. real) rain.')
parser.add_argument('--examination-box', type=list, dest='bounds',
                    default=[240, 320, 160, 240],
                    help='Bounds of the region to pass to the network.'
                    '  They are [minCol, maxCol, minRow, maxRow].')
parser.add_argument('--epochs', type=int, dest='nEpochs',
                    default = 100,
                    help = 'Set the number of epochs to train.')
parser.add_argument('--dense-layer-nodes', type=int, dest='densenodes',
                    default = 100,
                    help = 'Set the number of nodes in the synthesis layer.')
parser.add_argument('--name', type=str, dest='name',
                    required=True,
                    help='A name to distinguish this run.  It '
                    'will be used to construct filenames for '
                    'detailed logging.')

args = parser.parse_args()



def loadData(pathname, bounds):

    if os.path.exists(pathname + '-saved.npz'):
        container = numpy.load(pathname + '-saved.npz')
        return container['rvalX'], container['rvalY']
    
    minRow = bounds[2]
    minCol = bounds[0]
    numRows = bounds[3] - bounds[2] + 1
    numCols = bounds[1] - bounds[0] + 1
    records = []
    with open(pathname, 'r') as ifile:
        records = ifile.readlines()

    rvalX = numpy.zeros((len(records), 1, numRows, numCols))
    rvalY = numpy.zeros((len(records)))
    index = 0
    for r in records:
        isPhantom = r[-2:-1]
        if isPhantom == 'y':
            rvalY[index] = 1
        elif isPhantom == 'n':
            rvalY[index] = 0
        else:
            os.exit(1)

        binfilename = r[:-4] + '.bin'
        
        reader = rpreddtypes.RpBinReader()
        reader.read(binfilename)
        mrv = reader.getMaxRainval()
        rawdat = reader.getScaledObject(1).getNumpyArrayMax()
        for row in range(numRows):
            for col in range(numCols):
                # normalizing on the range [-1,1]
                rvalX[index, 0, row, col] = (rawdat[minRow + row, minCol + col] / mrv - 0.5) * 2

        index += 1

    numpy.savez(pathname + '-saved.npz', rvalX = rvalX, rvalY = rvalY)
    return rvalX, rvalY
        

trainingX, trainingY = loadData(args.training, args.bounds)
validateX, validateY = loadData(args.validation, args.bounds)

numrows = args.bounds[3] - args.bounds[2] + 1
numcols = args.bounds[1] - args.bounds[0] + 1


inputs = Input(batch_shape = (None, 1, numrows, numcols))
convlayer1 = Conv2D(filters=64, kernel_size=3, data_format='channels_first', activation='relu')(inputs)
pool1 = MaxPooling2D()(convlayer1)
convlayer2 = Conv2D(filters=32, kernel_size=3, activation='relu')(pool1)
flat = Flatten()(convlayer2)
synthlayer = Dense(args.densenodes, activation='relu')(flat)
outlayer = Dense(1, activation='sigmoid')(synthlayer)

mymodel = Model(inputs = inputs, outputs = outlayer)
mymodel.compile(loss='binary_crossentropy', optimizer=keras.optimizers.Adam(),
                metrics=['accuracy'])




basename = 'phantomnet_{}.'.format(args.densenodes)
cb1 = keras.callbacks.ModelCheckpoint(basename,
                                      save_weights_only=False,
                                      save_best_only = True,
                                      verbose=1,
                                      mode='auto', period=1)


history = mymodel.fit(x = trainingX, y = trainingY, epochs = args.nEpochs,
                      validation_data = [validateX, validateY],
                      verbose=1, batch_size=512, shuffle = True,
                      callbacks=[cb1])

for key in history.history.keys():
    filename='history_{}_{}'.format(args.densenodes, key)
    with open(filename, 'w') as ofile:
        ofile.write('\n'.join(str(val) for val in history.history[key]))

Now, I can feed it my training and validation sets, and see what the validation loss/accuracy are.

This is running now, with different values of densenodes. For 20 nodes in that layer, I got a validation accuracy of 96.7%. So, data that it never saw for training was validating that well on the best network after 200 iterations. I’ll see, after this runs overnight, whether a larger number of nodes works better.

So, what does this network do? It produces 64 random 3×3 convolutions on a subset of the data. A square that has the radar station in its bottom left corner and which extends to cover the city of Ottawa, and a bit beyond. A 3×3 convolution is essentially an operation that replaces the value at I,J with a linear combination of the values at I,J and its 8 nearest neighbours. This detects simple features. A pooling layer downscales this by a factor of two squared, and a second convolution looks for features built from simple features. This is then fed into a dense intermediate layer, and then an output layer.

Once I’ve done the overnight run, I’ll pick out the best network and use it in the script that generates true values, replacing the defective algorithm that is there. This should greatly improve the training y-values, and we’ll see if the previous network topology has more success.

Building a Rain Predictor. Results from larger data set.

The index to the articles in this series is found here.

I got my larger data set, almost 5 years of radar data. This took about 40 hours on all cores of my home machine to convert into the intermediate binary format, but then I was able to start experimenting.

I mentioned in the last article that I had been a bit concerned about the overlapping in time of the training, validation, and holdout sets. Since I only had a handful of months of data, and rain only happens a few times a week, I decided that this was how I would break up my data set. With the new multi-year dataset, though, I could make truly independent training, validation, and holdout sets, from separate years.

And… it turns out I was right to be concerned about that overlap. With the disjoint data sets, the network can be seen to be almost immediately overtraining. That is, it is not getting any better at predicting the outcome from new data, it’s getting worse. Typically, this results in a prediction success rate of about 60% now.

Of course, the desktop widget that I have been using has shown its usefulness, even though it was trained and validated on overlapping data. This means that that network was overtrained, and probably doing more poorly in the predictions I was using it for.

So, how do we address the overtraining? I’ve changed the geometry somewhat to start with. Now, instead of feeding the data into the LSTM layer, I first feed it into a geometry layer that takes the 800+ inputs and produces a smaller number of outputs to feed into the LSTM layer. I apply batch normalization to that layer, and a 50% dropout, before going on to LSTM. After LSTM, another 50% dropout, then into another normalized dense layer, then finally the output layer. I cut way back on the sizes of my layers, instead of hundreds of nodes per layer, I’m down to tens.

Even with all that, the network overtrains immediately. I have to find a better way to design the network to capture generic features, rather than specific details.

Early on, when the scope of the problem became apparent, I said I was not going to use convolutional layers. The size of the data appeared to make that impractical. I’m now going back and thinking about that decision. I have some ideas for how I can apply convolutions without straining the memory of the machine. I was thinking of doing it in polar coordinates, the way the preprocessed data in the intermediate binary files is implicitly stored, but I think I’ll stick with cartesian features for now. The intermediate binary files do contain a raw representation of the rain in a radar image, in addition to the preprocessed vectors, and so I can do something with that before saving the numpy arrays to disc for easy loading and reuse.

Building a Rain Predictor. More data.

The index to the articles in this series is found here.

I’ve finished downloading additional training data from the archive at Environment Canada. Turns out I needn’t have worried about polling their web server too heavily, it responds quite slowly to requests in any case. A single 20k .gif file might take several seconds to assemble before it is transferred, and sometimes the file that arrives is truncated.

I have been doing data repairs, loading fresh copies of damaged or missing files. To determine whether a .gif file is truncated, I tried using giftopnm on the file and checking the return code, but there are some truncated files that return a success result from that program, so I had to do a bit more, scanning the output for reports of difficulty. I believe I’ve found all the bad cases, so now I’m re-running my repair script.

Now, I can download the radar data from two servers. I’ll call server1 the one that I used in the past, the one that presents current radar data to the viewer in a one-hour loop so you can estimate rain, but doesn’t have any historical data. Then there’s server2, which has historical data.

The .gif files from the two servers are not identical. The data from server1 is as shown in the early posts in this series. The historical data from server2 is similar, but includes 4 overlays with names of cities, and tracks of rivers and concentric rings around the radar facility. The problem is that these overlays occlude rainfall data. So, if it’s raining under the label of a city name, the data from server2 won’t show that rain, but the data from server1 will.

Fine, I can just switch everything over to server2. Except, I can’t, because that archive isn’t up to date. Server1 has data collected at noon appearing on the server by about 8 minutes after twelve. Server2 has that data two or three hours later. I can’t expect to write a rain predictor if my current data is already three hours out of date, so I have to continue to use server1 for predictions, but I’m using server2 to train.

So, I’ve had to replicate the overlays. This wasn’t difficult, server1 stores the overlays, because they can be optionally added to the view in the web browser, so it’s simple to download them, OR them together, and produce a new mask. The code to do this is in the git repository, under the name sum-overlays.py.

I can now use the mask to hide those pixels from server1 that are effectively dead pixels on server2, so that my inputs to the network for predictions once again match the training data.

So, I have data now from Jan 1, 2015 to Sep 20, 2019. I’ll be using this to train, validate, and for the holdout set.

One more thing that’s been bothering me about the way I was training the data in the past was that the training and validation sets were a bit too close together. That is, because a single entry is generated from 6 consecutive .gif files, I might train on the 6 images starting at 1:00 on June 10, and have a validation element containing the 6 images starting at 1:10 on June 10. That’s always made me a bit uneasy, but with only a few months of data available, it seemed a reasonable compromise.

Now, though, I have years of data. I plan to use the data from 2015 through 2017 as training data, 2018 for validation, and 2019 for the holdout. These will be completely non-overlapping entries, so that worry will be assuaged.

Another thing I’ve seen, running the desktop widget for a few weeks now, is that the network is not very good at seeing very light rain, it seems to be getting confused by the phantom rain effect again. I decided that the two bytes of data per sector might not be enough for pixels close to the radar facility, so I added a third byte for those sectors close in. This is a number that represents the fraction of rain pixels in the sector for which all bordering pixels are also rain pixels. Phantom rain is often dotty or streaky on the images, and that would give a very low value in this third byte, while real light rain is usually uniform across large swathes of the sky, so this byte would have a high value. We’ll see how that does once the training has been done.

One more thing I’m planning to do in the updated training code is to add a value for time of year. It would go from 0 in February to 1 in August, then back down again, in steps of 0.2. This is so that the neural network can condition its response to rain and snow independently, in case that’s important.

Of course, training will take much longer, but even if we go from two hours to one day, that’s not a big deal.

So, that’s where we are now, more updates to follow once I’ve finished my data repairs. I may also have to do something about the phantom rain detection for the true values, we’ll see.