Monthly Archives: October 2019

Building a Rain Predictor. Feature generation.

The index to the articles in this series is found here.

What has been happening since the phantom rain work? I’m happy with the training set y-values again, but I was having trouble generating useful networks. I discovered later that a bug introduced when I was working on the phantom rain problem effectively deleted half of the feature data, so there wasn’t very much for the neural network to chew on to train itself. I wrote a small script to read the features out of the preprocessed files and generate .gifs of them, so I could, in effect, see what the network sees.

My earlier networks were being trained on data that was the average rain value over all pixels in a sector (even pixels with no rain), and the maximum rain value for any pixel in the sector. This didn’t seem to be doing very well.

Recall that, early on, I decided that I wasn’t just going to feed the .gif files into the network and let it work out its own features. The data sets are too big, and the number of neurons that would be needed to handle them was too high. So, very early on I moved to synthesized features.

Now, I’m just trying different synthetic feature combinations to see what works well. Each iteration takes about two days, which is the amount of time it takes my current machine to preprocess the image files into a format suitable for the neural network training code.

The current feature combination I’m trying out consists of three bytes per sector, and still the same 400 sectors. The first byte represents the fraction of pixels in the sector that show any rain at all. The second byte is a measure of the mean rain intensity in all rain pixels. The third byte is the root-mean-squared rain intensity in all rain pixels. We shall see, once the preprocessing is complete, whether this produces something more useful for the neural network training.
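As a rough illustration of the three per-sector values described above, here is a minimal sketch. It is not the preprocessing code from this series: the function name `sector_features`, the convention that a pixel value of 0 means no rain, and the scaling of each value onto 0..255 by dividing by a maximum rain value are all assumptions made for the example.

```python
import numpy as np

def sector_features(pixels, max_rainval):
    """Return (fraction, mean, rms) as three byte-sized integers.

    pixels      -- rain intensities for every pixel in one sector,
                   where 0 is assumed to mean "no rain"
    max_rainval -- largest possible intensity, used only for scaling
                   (an assumption; the real scaling may differ)
    """
    pixels = np.asarray(pixels, dtype=np.float64)
    rain = pixels[pixels > 0]
    if rain.size == 0:
        # No rain anywhere in the sector.
        return (0, 0, 0)

    fraction = rain.size / pixels.size                   # share of pixels with rain
    mean = rain.mean() / max_rainval                     # mean over rain pixels only
    rms = np.sqrt(np.mean(rain ** 2)) / max_rainval      # RMS over rain pixels only
    return tuple(int(round(255 * v)) for v in (fraction, mean, rms))
```

The RMS value is never smaller than the mean, and the gap between them grows when a few pixels are much more intense than the rest, so the third byte carries some information about how uneven the rain is within the sector.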

Building a Rain Predictor. Revisiting the phantom rain.

The index to the articles in this series is found here.

I was thinking about the overfitting issue I mentioned in the last posting. It seemed strange, like something was supposed to work, but didn’t. That kind of immediate departure between training and validation loss suggests incompatible data. So, I wondered if maybe the true values used for training and validation were corrupted.

After I made the first working version of the rain predictor, I said that it was not very good at light rainfall over short time scales, and proposed that this might be because of poor detection of phantom rain. So, I adjusted the phantom rain algorithm before my last training attempts with the larger dataset.

Now, though, I was becoming suspicious, so I instrumented the script that generates the true values so that it would record names of intermediate binary files for which it had decided there was true rain, and phantom rain. The results were enlightening.

The phantom rain algorithm’s false positive rate was very low: out of about 1800 events labeled as true rain, about 2 or 3 were, in fact, phantom rain. However, the false negative rate was horrible. Out of over 3100 events labeled as phantom rain, more than two thirds were actually rain. My little heuristic for determining phantom rain was badly misbehaving.

This messes up the training badly. It’s no wonder that the mislabeled training and validation sets behave so differently: the neural network is hallucinating badly trying to resolve the inconsistencies in the training set.
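To put a rough number on how bad the labels were, the approximate counts quoted above can be plugged into a short calculation. The figures are the rounded ones from the text ("about 1800", "over 3100", "more than two thirds"), so the result is only a ballpark lower bound.

```python
# Approximate counts taken from the text above.
labeled_rain = 1800            # events the heuristic called true rain
rain_actually_phantom = 3      # "about 2 or 3" of those were phantom
labeled_phantom = 3100         # events the heuristic called phantom rain
phantom_actually_rain = labeled_phantom * 2 / 3   # "more than two thirds"

mislabeled = rain_actually_phantom + phantom_actually_rain
total = labeled_rain + labeled_phantom
fraction_wrong = mislabeled / total

print('At least {:.0%} of the candidate events were mislabeled'.format(fraction_wrong))
```

On these numbers, something over 40% of the events with a lit sensitive pixel carried the wrong label.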

I can look at a radar image and decide in about 2 seconds whether it’s showing phantom rain or true rain in Ottawa. Sometimes it’s a bit ambiguous, but I’m pretty sure well over 95% of the time.

If only there were some program that could look for patterns in two-dimensional arrays of pixels, and determine what sort of features distinguish true rain from phantom rain. Whatever would do that? What sort of software do I have to learn to use?

Imagine Keras just waving its arm above its head and pointing at itself.

Right, so a 2-dimensional convolutional neural network it is, then. I have to automate the rain determination: there are over 200000 input files and I’m not inspecting them one by one.

I inspected all files in 2015 and 2016 for which a sensitive pixel was illuminated, so there was the possibility of rain over Ottawa. That was 8641 images. To make this run more quickly, I put together this script:

#! /usr/bin/python3

import matplotlib.pyplot as plt
import PIL.Image

category = []
plt.ion()

images = []

with open('phantom-rain-candidates.txt', 'r') as ifile:
    filenames = ifile.readlines()

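for fn in filenames:
    # Drop the last five characters of the line, which include the
    # newline, to get the name of the image to display.
    fn = fn[:-5]

    img = PIL.Image.open(fn)
    plt.imshow(img)
    plt.pause(0.05)
    # 'y' marks phantom rain, 'n' marks real rain, 'q' stops early.
    response = input('PHANTOM?: ')
    plt.cla()
    if response == 'q':
        break
    category.append('{0}  {1}\n'.format(fn, response))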


with open('screened-rain-candidates.txt', 'a+') as ofile:
    for result in category:
        ofile.write(result)

It pops up a window and shows images in that window one after the other. After each photo, I type either ‘y’ or ‘n’, followed by ENTER, and that value is recorded for later output, then the next image is displayed.
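The output file holds one record per line: a pathname, two spaces, and the typed response. How those records were divided between training and validation is not stated here, so the following is only a sketch of one possible approach, a reproducible random split. The helper names `parse_record` and `split_records`, the 20% validation fraction, and the seed are all hypothetical.

```python
import random

def parse_record(line):
    """Split a 'pathname  y' record into (pathname, is_phantom)."""
    pathname, label = line.rsplit(None, 1)
    if label not in ('y', 'n'):
        raise ValueError('unexpected label: {!r}'.format(label))
    return pathname, label == 'y'

def split_records(lines, validation_fraction=0.2, seed=0):
    """Shuffle the records and return (training, validation) lists.

    The lines are returned unchanged, so they can be written straight
    back out in the same format the labeling script produced.
    """
    records = [ln for ln in lines if ln.strip()]
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    rng.shuffle(records)
    n_val = int(len(records) * validation_fraction)
    return records[n_val:], records[:n_val]
```

Running every line through `parse_record` before splitting is a cheap way to catch mistyped responses, since anything other than 'y' or 'n' raises an error instead of ending up in a data set.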

This gave me a training set and a validation set. Next, it’s time to write the convolutional neural network code:

#! /usr/bin/python3

# Try to write a neural network to distinguish real from phantom rain.

# Phantom rain looks different from real rain, particularly in the
# region surrounding the radar station.  Ottawa lies within the radius
# where phantom rain can appear, so we need to know when active rain
# pixels are actually indicative of rain.

import argparse
import numpy
import rpreddtypes
import os
import keras
from keras.layers import Input, Conv2D, Dense, Flatten, MaxPooling2D
from keras.models import Model

parser = argparse.ArgumentParser(description='Train the phantom '
                                 'classification network.')
parser.add_argument('--training-set', type=str, dest='training',
                    required=True,
                    help='A file of pathnames, and then letter '
                    '\'y\' for phantom rain, \'n\' for non-phantom '
                    '(i.e. real) rain.')
parser.add_argument('--validation-set', type=str, dest='validation',
                    required=True,
                    help='A file of pathnames, and then letter '
                    '\'y\' for phantom rain, \'n\' for non-phantom '
                    '(i.e. real) rain.')
parser.add_argument('--examination-box', type=int, nargs=4, dest='bounds',
                    default=[240, 320, 160, 240],
                    help='Bounds of the region to pass to the network.'
                    '  They are minCol maxCol minRow maxRow.')
parser.add_argument('--epochs', type=int, dest='nEpochs',
                    default = 100,
                    help = 'Set the number of epochs to train.')
parser.add_argument('--dense-layer-nodes', type=int, dest='densenodes',
                    default = 100,
                    help = 'Set the number of nodes in the synthesis layer.')
parser.add_argument('--name', type=str, dest='name',
                    required=True,
                    help='A name to distinguish this run.  It '
                    'will be used to construct filenames for '
                    'detailed logging.')

args = parser.parse_args()



def loadData(pathname, bounds):

    if os.path.exists(pathname + '-saved.npz'):
        container = numpy.load(pathname + '-saved.npz')
        return container['rvalX'], container['rvalY']
    
    minRow = bounds[2]
    minCol = bounds[0]
    numRows = bounds[3] - bounds[2] + 1
    numCols = bounds[1] - bounds[0] + 1
    records = []
    with open(pathname, 'r') as ifile:
        records = ifile.readlines()

    rvalX = numpy.zeros((len(records), 1, numRows, numCols))
    rvalY = numpy.zeros((len(records)))
    index = 0
    for r in records:
        isPhantom = r[-2:-1]
        if isPhantom == 'y':
            rvalY[index] = 1
        elif isPhantom == 'n':
            rvalY[index] = 0
        else:
            raise SystemExit('Bad phantom label in record: ' + r.strip())

        binfilename = r[:-4] + '.bin'
        
        reader = rpreddtypes.RpBinReader()
        reader.read(binfilename)
        mrv = reader.getMaxRainval()
        rawdat = reader.getScaledObject(1).getNumpyArrayMax()
        for row in range(numRows):
            for col in range(numCols):
                # normalizing on the range [-1,1]
                rvalX[index, 0, row, col] = (rawdat[minRow + row, minCol + col] / mrv - 0.5) * 2

        index += 1

    numpy.savez(pathname + '-saved.npz', rvalX = rvalX, rvalY = rvalY)
    return rvalX, rvalY
        

trainingX, trainingY = loadData(args.training, args.bounds)
validateX, validateY = loadData(args.validation, args.bounds)

numrows = args.bounds[3] - args.bounds[2] + 1
numcols = args.bounds[1] - args.bounds[0] + 1


inputs = Input(batch_shape = (None, 1, numrows, numcols))
convlayer1 = Conv2D(filters=64, kernel_size=3, data_format='channels_first', activation='relu')(inputs)
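# Keep the channels-first layout explicit in the remaining layers, so
# nothing depends on the global Keras image_data_format setting.
pool1 = MaxPooling2D(data_format='channels_first')(convlayer1)
convlayer2 = Conv2D(filters=32, kernel_size=3, data_format='channels_first', activation='relu')(pool1)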
flat = Flatten()(convlayer2)
synthlayer = Dense(args.densenodes, activation='relu')(flat)
outlayer = Dense(1, activation='sigmoid')(synthlayer)

mymodel = Model(inputs = inputs, outputs = outlayer)
mymodel.compile(loss='binary_crossentropy', optimizer=keras.optimizers.Adam(),
                metrics=['accuracy'])




basename = 'phantomnet_{}.'.format(args.densenodes)
cb1 = keras.callbacks.ModelCheckpoint(basename,
                                      save_weights_only=False,
                                      save_best_only = True,
                                      verbose=1,
                                      mode='auto', period=1)


history = mymodel.fit(x = trainingX, y = trainingY, epochs = args.nEpochs,
                      validation_data = (validateX, validateY),
                      verbose=1, batch_size=512, shuffle = True,
                      callbacks=[cb1])

for key in history.history.keys():
    filename='history_{}_{}'.format(args.densenodes, key)
    with open(filename, 'w') as ofile:
        ofile.write('\n'.join(str(val) for val in history.history[key]))

Now, I can feed it my training and validation sets, and see what the validation loss/accuracy are.
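The training script writes one history file per metric, with one value per line. A small helper along these lines could pull out the best epoch from such a file. This is a sketch, not part of the original scripts: `read_history` and `best_epoch` are hypothetical names, and the exact metric names in the filenames depend on the Keras version (validation accuracy is reported as `val_acc` in some versions and `val_accuracy` in others).

```python
def read_history(filename):
    """Read a history file holding one floating-point value per line."""
    with open(filename) as ifile:
        return [float(line) for line in ifile if line.strip()]

def best_epoch(values, mode='min'):
    """Return (epoch, value) for the best entry, counting epochs from 1.

    Use mode='min' for a loss and mode='max' for an accuracy.
    """
    pick = min if mode == 'min' else max
    index = pick(range(len(values)), key=lambda i: values[i])
    return index + 1, values[index]

# Example with made-up numbers standing in for a validation loss curve.
epoch, loss = best_epoch([0.50, 0.30, 0.40])
print('best validation loss {} at epoch {}'.format(loss, epoch))
```

With the real files, something like `best_epoch(read_history('history_20_val_loss'))` would report the epoch at which the 20-node run did best on the validation set.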

This is running now, with different values of densenodes. With 20 nodes in that layer, the best network after 200 training iterations reached a validation accuracy of 96.7%, measured on data that it never saw during training. I’ll see, after this runs overnight, whether a larger number of nodes works better.

So, what does this network do? It applies 64 different 3×3 convolutions, whose weights start out random and are adjusted during training, to a subset of the data: a square that has the radar station in its bottom left corner and which extends to cover the city of Ottawa, and a bit beyond. A 3×3 convolution is essentially an operation that replaces the value at I,J with a linear combination of the values at I,J and its 8 nearest neighbours. This detects simple features. A pooling layer then halves the resolution in each direction, downscaling the pixel count by a factor of two squared, and a second convolution looks for features built from those simple features. This is then fed into a dense intermediate layer, and then an output layer.
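The two operations are easy to write out by hand for a single filter, which also shows how the image shrinks at each stage. The sketch below uses plain NumPy rather than Keras, leaves out the bias term and the ReLU activation, and uses an arbitrary averaging kernel as a stand-in for learned weights. The 81×81 input size follows from the default examination box of columns 240 to 320 and rows 160 to 240.

```python
import numpy as np

def conv3x3_valid(image, kernel):
    """Replace each interior pixel with a weighted sum of its 3x3 neighbourhood.

    No padding is used, so the output loses one pixel on every edge.
    """
    rows, cols = image.shape
    out = np.zeros((rows - 2, cols - 2))
    for i in range(rows - 2):
        for j in range(cols - 2):
            out[i, j] = np.sum(image[i:i + 3, j:j + 3] * kernel)
    return out

def max_pool_2x2(image):
    """Keep the largest value in each non-overlapping 2x2 block."""
    rows, cols = image.shape
    rows -= rows % 2               # drop a trailing odd row or column
    cols -= cols % 2
    trimmed = image[:rows, :cols]
    return trimmed.reshape(rows // 2, 2, cols // 2, 2).max(axis=(1, 3))

image = np.arange(81 * 81, dtype=float).reshape(81, 81)   # stand-in for radar data
kernel = np.full((3, 3), 1.0 / 9.0)                       # hypothetical weights

features = conv3x3_valid(image, kernel)   # 81x81 -> 79x79
pooled = max_pool_2x2(features)           # 79x79 -> 39x39
second = conv3x3_valid(pooled, kernel)    # 39x39 -> 37x37
print(features.shape, pooled.shape, second.shape)
```

In the real network the first layer produces 64 such feature maps and the second produces 32, each from its own set of learned weights, before everything is flattened and handed to the dense layer.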

Once I’ve done the overnight run, I’ll pick out the best network and use it in the script that generates true values, replacing the defective algorithm that is there. This should greatly improve the training y-values, and we’ll see if the previous network topology has more success.
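Using the trained network on a single radar image means building an input with the same crop and the same normalization that the training script applies. The following is a minimal sketch of that step under some assumptions: `make_network_input` and `is_phantom` are hypothetical helpers, the 0.5 decision threshold is a guess, and `predict` stands for any callable that maps the input array to a probability, such as the `predict` method of a model restored with `keras.models.load_model`.

```python
import numpy as np

def make_network_input(rawdat, mrv, bounds=(240, 320, 160, 240)):
    """Crop and normalize one image the same way the training loader does.

    rawdat -- 2-D array of rain values for the whole image
    mrv    -- maximum rain value, used to scale onto [-1, 1]
    bounds -- (minCol, maxCol, minRow, maxRow), inclusive
    """
    min_col, max_col, min_row, max_row = bounds
    window = rawdat[min_row:max_row + 1, min_col:max_col + 1]
    scaled = (window / mrv - 0.5) * 2
    # Add the batch and channel axes: (1, 1, rows, cols).
    return scaled[np.newaxis, np.newaxis, :, :]

def is_phantom(predict, rawdat, mrv, bounds=(240, 320, 160, 240), threshold=0.5):
    """Return True when the network's output says the rain is phantom."""
    output = predict(make_network_input(rawdat, mrv, bounds))
    probability = float(np.asarray(output).ravel()[0])
    return probability >= threshold
```

Since the network was trained with 1 meaning phantom rain, an output near 1 means the lit pixels should be ignored, and an output near 0 means they should count as real rain.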