Monthly Archives: August 2019

Writing a rain predictor. Domain knowledge.

The index to the articles in this series is found here.

So, I’ve got a baseline image I can use to extract active precipitation data from the government website. Now, can we just throw everything into a neural network?

The temptation is there to let the network do everything, but we have to understand our problem well enough to make sure we are using the network well. After all, neural networks turn into big black boxes; you can’t easily pull out weights one at a time and examine them to see whether they look right.

Well, our time series of data for rainfall is going to have clouds moving more or less in a straight line and at a speed in the tens of kilometres per hour. Most of these events will move from roughly WSW towards ENE, but we should expect that rain might blow in from any direction. The problem is that my training set almost certainly doesn’t have any data for rain blowing in from the East. If I train with the data I have, the neural network will decide that rainfall to the East of Ottawa can never cause rainfall in Ottawa, and so input weights for pixels on that side will be zero or small. Then, when rain really does come in from there, the neural network will be surprised, and I will get wet.

It might be nice, then, to synthesize data for rain coming from the East. The most obvious way to do this is to rotate the incoming data. Note, though, that I don’t want the position of Ottawa itself to change on the plot, and the radar dish is not in Ottawa, but some distance to the South-West, in Franktown. Now, by rotating the input data we are assuming that weather patterns that approach from the East are very similar to those that approach from the West: that they show a similar lack of large deflections, that they approach at similar speeds, and that the unconsidered causal agents responsible for the weather (such as fronts, temperature trends, moisture loss, etc.) can be treated as symmetrical. This assumption would probably not be true for a place like Hamilton, where lake effect snows mean that precipitation to the East is qualitatively different from that to the West, or Vancouver, with a huge ocean to the West and snow-covered mountains to the East. These places don’t display the degree of rotational symmetry that such a simple data synthesis method requires. Ottawa, however, far from lakes and sitting in a large plain with only relatively low hills in the vicinity, doesn’t have any obvious external asymmetries aside from that imposed by the prevailing winds.

If you look at the map in part 1 of this series, you’ll see that this data rotation approach on the Franktown radar station data wouldn’t work very well at all for rain in Montreal, as it is near the East edge of the imaging area, so we don’t actually have much data for rain inbound from East of Montreal. For Ottawa, though, we might be able to manage. Montreal has its own McGill radar station, so we wouldn’t be using Franktown data for Montreal in any case.

Rotations are a bit tricky on square data. We can easily rotate by steps of 90 degrees, because those operations are part of the symmetry group of the square (D_4 for all you group theory enthusiasts out there). Outside of those simple rotations, issues arise with floating-point roundoff, voids, and many-to-one mappings. If we have two adjacent pixels with value ‘3’ in the unrotated input set, and these pixels, after rotation, project partially onto four different pixels, how are we going to assign values to those pixels? The ‘3’ indicates intensity, so we don’t want to smooth it out to 1.5 on each pixel, and we’re using discrete intensities rather than continuous, so it might be awkward to try to sum up contributions on pixels in the rotated dataset. For now, we will just use the three available 90 degree rotations, as they do not distort the data, and we’ll evaluate the system’s behaviour to see if smaller rotations are necessary for good predictions.
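As a concrete illustration of the 90-degree case, here is a minimal sketch of rotating a rainfall grid about a fixed pixel, so that a chosen reference point stays put. The function name, the NumPy approach, and the zero-fill convention are my own choices for illustration, not code from this series.

import numpy as np

def rotate90_about(grid, centre, k=1):
    # Rotate a 2D rainfall grid by k*90 degrees about the fixed pixel
    # 'centre' (row, column).  Pixels that rotate out of the frame are
    # dropped, and destination pixels with no source stay at zero (no rain).
    out = np.zeros_like(grid)
    rows, cols = grid.shape
    rr, cc = np.meshgrid(np.arange(rows), np.arange(cols), indexing='ij')
    dr, dc = rr - centre[0], cc - centre[1]
    for _ in range(k % 4):
        dr, dc = -dc, dr        # rotate the offset vectors by 90 degrees
    nr, nc = centre[0] + dr, centre[1] + dc
    valid = (nr >= 0) & (nr < rows) & (nc >= 0) & (nc < cols)
    out[nr[valid], nc[valid]] = grid[rr[valid], cc[valid]]
    return out

Because a 90-degree rotation about an integer pixel maps pixels one-to-one, there are no voids or many-to-one collisions inside the frame; any finer rotation would need the resampling rule discussed above.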

We think, because our radar image is a disc, and rain travels more or less in straight lines, that radial movement of rain will be important. That is, rain moving toward the radar station will influence the likelihood of rain in Ottawa. Also, we’re probably not too worried about the fine details of rain falling far away, but close to Ottawa we need some finer resolution. So, let’s cut up the space a bit like a dartboard. We’ll cut the disc into a certain number of equal-width wedges, and each wedge will be cut by arcs of fixed radius. We’ll adjust the parameters of these regions to suit our model later.

So, do we centre these regions on Ottawa, or on Franktown? I’m going to centre them on the radar facility itself, because that allows for the possibility that, some time down the road, we decide to set up a system that monitors rain in two different places. We can’t centre our regions of interest on both; to do that would just be to create two completely independent neural networks. At least to begin with, then, we’ll centre our division of the disc on the radar facility in Franktown. A sketch of this division of the disc appears below.
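Here is a minimal sketch of that dartboard-style division, assigning every pixel to a wedge-and-ring region around the radar pixel. The number of wedges and the ring radii are placeholder values chosen for illustration; the series tunes these parameters later.

import numpy as np

def dartboard_regions(shape, centre, n_wedges=8, ring_radii=(20, 60, 120, 240)):
    # Assign every pixel a region index: region = ring * n_wedges + wedge.
    # Pixels beyond the outermost ring get -1.  'centre' is the (row, column)
    # of the radar facility on the image.
    rows, cols = shape
    rr, cc = np.meshgrid(np.arange(rows), np.arange(cols), indexing='ij')
    dy, dx = rr - centre[0], cc - centre[1]
    radius = np.hypot(dy, dx)
    angle = np.arctan2(dy, dx) % (2 * np.pi)
    wedge = np.minimum((angle / (2 * np.pi) * n_wedges).astype(int), n_wedges - 1)
    ring = np.digitize(radius, ring_radii)     # 0 .. len(ring_radii)
    region = ring * n_wedges + wedge
    region[radius >= ring_radii[-1]] = -1
    return region

Centring on Franktown, as decided above, just means passing the radar pixel’s coordinates as the centre argument.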

Why aren’t we using convolutional neural networks? Well, that’s something we’ll keep in mind for later, but they’re the sort of thing you would use to investigate the shape of rainfall patterns within a region, and, at least for now, we think rainfall prediction is not so strongly dependent on the fine-scale features of the rainfall intensities within a region of space. So, my inputs are going to be broken into these disjoint regions, and the first layer of our network will only consider values within these regions, not between them. This is called a modular neural network: each module is a group of neurons that looks only at pixels in its area of responsibility and feeds some outputs up to the next layer for more analysis. The higher layer(s) will be able to work with outputs from multiple modules. For now, I’m not going to say how many outputs each of these first-layer regions will produce. With only one output, the network would probably train itself to something like “fraction of pixels in the region which show rain”, but it’s likely that we need a bit more information than that, so we will keep in mind the possibility that these regions will each be allowed to produce multiple outputs if we decide that’s necessary. Also, the region(s) immediately around Ottawa are special. In there, we probably are interested in the fine-scale details of rainfall amounts, so we’ll either want to make those regions quite small, or we’ll want to expose those pixels separately to the higher layer of the network. It’s something to keep in mind. A sketch of this modular topology appears below.
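To make the topology concrete, here is a minimal Keras sketch of a modular network of this shape, assuming one dense module per dartboard region. The layer widths, activations, and the two-outputs-per-module choice are illustrative assumptions, not the values this series eventually settles on.

from tensorflow.keras import layers, Model

def build_modular_network(region_sizes, outputs_per_module=2, n_predictions=10):
    # region_sizes: number of pixels in each dartboard region.
    # Each module sees only its own region's pixels; the higher layers
    # work with the concatenated module outputs.
    inputs, module_outputs = [], []
    for n_pixels in region_sizes:
        inp = layers.Input(shape=(n_pixels,))
        out = layers.Dense(outputs_per_module, activation='relu')(inp)
        inputs.append(inp)
        module_outputs.append(out)
    merged = layers.Concatenate()(module_outputs)
    hidden = layers.Dense(32, activation='relu')(merged)
    predictions = layers.Dense(n_predictions, activation='sigmoid')(hidden)
    return Model(inputs=inputs, outputs=predictions)

The pixels immediately around Ottawa could either be given their own very small regions in region_sizes, or fed directly into the merged layer, as discussed above.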

OK, so that’s a basic outline of our proposed neural network topology. What about training data? We need some outcomes to validate predictions.

We’ll pick four pixels near the middle of Ottawa, and look at the rainfall in those pixels. If any of those pixels shows any rainfall, our real outcomes will have the rain bit set. If at least one pixel shows heavy rain, our real outcomes will also have the heavy-rain bit set. Since our bits are binned into one-hour predictions and we have 6 radar images per hour, we’ll say that rain in any one of those six radar observations will result in the corresponding bits being set.
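As a minimal sketch, assuming the four Ottawa pixels have already been converted to integer intensity values (0 meaning no rain) and that “heavy” corresponds to some level on the legend, the two truth bits for one hour could be computed like this; the function name and threshold value are illustrative assumptions:

def truth_bits_for_hour(frames, heavy_threshold=4):
    # frames: the six radar observations for this hour, each a sequence of
    # intensity values at the four chosen Ottawa pixels.
    any_rain = any(v > 0 for frame in frames for v in frame)
    heavy_rain = any(v >= heavy_threshold for frame in frames for v in frame)
    return int(any_rain), int(heavy_rain)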

But, as always, there are complications. Recall from the previous article that there’s a close-range phantom rainfall observed when the skies are clear and it’s humid out. Ottawa lies within this region of phantom rainfall, which means that a naive extraction would conclude that it’s almost always raining over Ottawa on muggy days. Here’s an example of such phantom rainfall:

That rain near the white cross isn’t really there

What rule are we going to use to distinguish this phantom rainfall from real rain? Well, domain knowledge time again. These regions of phantom rainfall only appear at the lightest rainfall rate, and appear as scattered pixels rather than solid regions. In real light rainfall, we always see large areas of light rain, usually with heavier rain inside. So, that’s going to be our rule. If there are only light blue active pixels within, say, 20 pixels of the radar facility, and if the density of such pixels is less than 0.5, then we’ll say that the purported rain in Ottawa is, in fact, not there. We might have to tune this condition later, but it makes a starting point. A sketch of this rule appears below.
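Here is a minimal sketch of that rule, assuming the image has already been converted to an integer intensity grid where 1 is the lightest (light blue) level. The function name, the intensity encoding, and the defaults mirror the numbers above but are otherwise my own illustration.

import numpy as np

def looks_like_phantom_rain(intensity, radar_centre, radius=20, density_cut=0.5):
    # intensity: 2D integer grid, 0 = no rain, 1 = lightest level.
    # Returns True when all active pixels near the radar are at the lightest
    # level and they are sparse, i.e. the rain is probably phantom.
    rows, cols = intensity.shape
    rr, cc = np.meshgrid(np.arange(rows), np.arange(cols), indexing='ij')
    near = np.hypot(rr - radar_centre[0], cc - radar_centre[1]) <= radius
    nearby = intensity[near]
    active = nearby > 0
    if not active.any():
        return False                      # nothing to veto
    only_lightest = bool((nearby[active] == 1).all())
    density = active.mean()
    return only_lightest and density < density_cut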

OK, so we’ve laid out a preliminary approach. We might discover after testing that we need to do things differently, but we now have a viable first attempt.

In our next posting, we’ll process the input data to make it suitable for feeding to the neural network.

UPDATE #1 (2019-08-23): Added a link to an index page of articles in this series.

Writing a rain predictor, preparing the data

The index to the articles in this series is found here.

It’s time to get a baseline reference: the radar image as it would appear with no rain anywhere. I picked out three images that were quite clean. This isn’t trivial, as the radar seems to produce false short-range returns on clear, humid days. I assume this is because, in the absence of any precipitation, there’s no strong reflected signal, and the radar analysis is interpreting some close-range backscatter from the air as slight rainfall. This means that we often have light blue pixels surrounding the radar station when there isn’t rain elsewhere. Still, I found three images whose votes produce a good consensus.

Here’s the code I used to analyse those .gif files and produce a consensus image:

#! /usr/bin/python3

# This script reads in three .gif files and produces a new file in
# which each pixel is set to the majority value from the three inputs.
# If there is no majority value (i.e. all three files have a different
# value at that point), we exit with an error so that a better set of
# inputs can be found.

# We are using this script to analyse machine-generated files in a
# single context.  While the usual programming recommendation is to be
# very permissive in what formats you accept, I'm going to restrict
# myself to verifying consistency and detecting unexpected inputs,
# rather than trying to handle all of the possible cases.

# This is a pre-processing step that will be used by another script
# that reads .gif files.  Therefore it is reasonable to make this
# script's output be a .gif itself.

# The script takes 4 arguments.  The first three are the names of the
# input files.  The fourth is the name of the output file.

# The script will return '1' on error, '0' for success.

import sys
import gif


class SearchFailed(Exception):
    def __init__(self, message):
        self.message = message


def find_index_of_tuple (list_of_tuples, needle, hint = 0):
    # Check the hinted index first; the colour tables usually differ only
    # in ordering, so the hint is often already correct.
    if list_of_tuples[hint] == needle:
        return hint
    for i in range(len(list_of_tuples)):
        if (list_of_tuples[i] == needle):
            return i
    raise SearchFailed('Tuple {0} not found in list.'.format(needle))


if len(sys.argv) != 5:
    print ("Require 3 input filenames and 1 output filename.")
    sys.exit(1)

file = [None, None, None]
reader = [None, None, None]

for i in range(3):    
    try:    
        file[i] = open(sys.argv[i+1], 'rb')
    except OSError as ex:
        print ("Failed to open input file: ", sys.argv[i+1])
        print ("Reason: ", ex.strerror)
        sys.exit(1)
    reader[i] = gif.Reader()
    reader[i].feed(file[i].read())
    if ( not reader[i].is_complete()
         or not reader[i].has_screen_descriptor() ):
        print ("Failed to parse", sys.argv[i+1], "as a .gif file")
        sys.exit(1)

# OK, if we get here it means we have successfully loaded three .gif
# files.  The user might have handed us the same one three times, but
# there's not much I can do about that, it's entirely possible that we
# want to look at three identical but distinct files, and filename
# aliases make any more careful examination of the paths platform
# dependent.

# So, we're going to want to verify that the three files have the same
# sizes.

if ( reader[0].width != reader[1].width
     or reader[1].width != reader[2].width
     or reader[0].height != reader[1].height
     or reader[1].height != reader[2].height ):
    print ("The gif logical screen sizes are not identical")
    sys.exit(1)

for i in range(3):
    if ( len(reader[i].blocks) != 2
         or not isinstance(reader[i].blocks[0], gif.Image)
         or not isinstance(reader[i].blocks[1], gif.Trailer)):
        print ("While processing file: ", sys.argv[i+1])
        print ("The code only accepts input files with a single block of "
               "type Image followed by one of type Trailer.  This "
               "constraint has not been met, the code will have to be "
               "changed to handle the more complicated case.")
        sys.exit(1)
    
    
# Time to vote

try:
    writer = gif.Writer (open (sys.argv[4], 'wb'))
except OSError as ex:
    print ("Failed to open output file: ", sys.argv[4])
    print ("Reason: ", ex.strerror)
    sys.exit(1)

output_width = reader[0].width
output_height = reader[0].height
output_colour_depth = 8
output_colour_table = reader[0].color_table
output_pixel_block = []

for ind0, ind1, ind2 in zip(reader[0].blocks[0].get_pixels(),
                            reader[1].blocks[0].get_pixels(),
                            reader[2].blocks[0].get_pixels()):
    tup0 = reader[0].color_table[ind0]
    tup1 = reader[1].color_table[ind1]
    tup2 = reader[2].color_table[ind2]

    # Voting.  The first file's colour table is the output table, so if
    # file 0 agrees with either of the other files we can reuse its index.
    if ( tup0 == tup1 or tup0 == tup2):
        output_pixel_block.append(ind0)
    elif ( tup1 == tup2 ):
        try:
            newind = find_index_of_tuple(output_colour_table,
                                         tup1, ind1)
            output_pixel_block.append(newind)
        except SearchFailed:
            print ('The colour table for file {0} does not hold the '
                   'entry {1} that won the vote.  You may be able '
                   'to fix this problem simply by reordering your '
                   'command-line arguments.'.format(sys.argv[1], tup1))
            sys.exit(1)
    else:
        # All three files disagree at this pixel.  As stated in the header
        # comment, exit so that a better set of inputs can be chosen.
        print ('No majority value at a pixel; choose different input files.')
        sys.exit(1)

writer.write_header()
writer.write_screen_descriptor(output_width, output_height,
                               True, output_colour_depth)
writer.write_color_table(output_colour_table, output_colour_depth)
writer.write_image(output_width, output_height,
                   output_colour_depth, output_pixel_block)
writer.write_trailer()

So, what does this do? After verifying that it received the correct number of arguments, that it can open the three inputs, and that the input files are all valid .gif files, it checks to make sure they all have the same image dimensions.

Now, it would be a bit more work to support multiple image blocks, though the GIF specification does allow that. So, I verified that these files from the government website do not use multiple image blocks, and coded in a check. This script will exit with an error if it is presented with such files. This way I don’t have to write the extra code unless some future change forces me to accept the more complicated format.

Now, the files I chose did not have identical colour tables, but the tables differed only in their ordering. This might not always be true, but it is at the moment. I use the colour table from the first input .gif as my output colour table. Then, I walk through the pixels in the three files and look up the tuple of colours for each pixel. If the first input file agrees with either of the other two on the value of that tuple, then we simply insert the first file’s index into the output. If the first file disagrees with both, but the second and third agree with each other, then we have to find the index of that winning tuple in the output colour table. It’s probably at the same offset, so we hint with the index from the second file, but my function will walk the entire colour table if it has to, to find an index matching that tuple. If it fails to do so, that’s an error, and we exit. And if all three files disagree at a pixel, there is no majority, so the script exits with an error as promised in the header comment.

Finally, we write out the consensus .gif file, and exit normally.

In the next article we’ll have a discussion of how to set up the neural network.

UPDATE #1 (2019-08-23): Included a link to an index of articles in this series.

A machine learning project

The index to the articles in this series is found here.

Well, four years ago I mentioned that I was going on a brief hiatus, and there hasn’t been very much here since then. Turns out that having a baby in the house does eat into the free time a bit. Now, though, I find myself with some more free time, after the parent company closed the entire Ottawa office and laid off the staff here. If anybody’s looking for an experienced mathematical programmer with a doctorate in physics, get in touch.

So, here’s a project I was about to start four years ago. I had collected some training data, but never got the project itself started.

I like to bicycle in the summer time, but I don’t like to ride in the rain. So, when I remember, I check the local weather radar and look for active precipitation moving toward the city. I can decide from that whether to go for a bicycle ride, and whether to ride to work, or find another way to get to the office.

The weather radar website, https://weather.gc.ca/radar/index_e.html?id=XFT, shows an hour of rain/snow detection at 10 minute intervals, played on a loop. You can look at the rain and guess how long it will take to get to the city. This won’t help you if rain forms directly over the city, but most of the time the rain moves into town, rather than beginning here.

The interpretation of these sequences seemed to me to be something I could automate. Maybe have a program that sends a warning or email to my cellphone if rain is imminent, in case I’m out on the bike.

I collected over 11000 .gif files by downloading individual files via a cron job. The images don’t have an embedded copyright message, and are government-collected data, but I’m not confident that this gives me the right to make this dataset available online, so I will satisfy myself with reproducing a single example for illustrative purposes. Here is a typical downloaded image:

The city of Ottawa is located roughly North-East of the white cross, just South of the Ottawa river that runs dominantly West to East. Near the right edge of the active region you can see the island of Montreal.

The very light blue represents light rainfall, something you might barely notice while riding a bicycle. Anything at the bright green level or higher would be something I would try to wait out by sheltering under a bridge or similar construction. Weather patterns in this area, as in much of the continent, are dominantly blown from the West to the East, though there are some exceptions, and we will, very occasionally, have storms blow in from the East.

So, here’s the project. I haven’t actually written code yet, so we’ll explore this together. I would like to set up a neural network that can watch the radar website, downloading a new image every 10 minutes, and use this to predict 10 binary states. The first five values will be the network’s confidence (I’m not going to call it probability) that there will be any rain at all in the next 0 to 1 hours, 1 to 2 hours, 2 to 3 hours, and so on out to 5 hours. The next five values will be the confidence of heavy rain, defined as rain at the bright green or higher level, in the same intervals.
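For concreteness, here is a minimal sketch of how those ten values would be read off the network; the ordering convention and the function name are assumptions made for illustration, not settled design:

def describe_prediction(outputs):
    # outputs: the ten network values, in the order described above:
    # outputs[0:5]  = confidence of any rain in hours (0-1) ... (4-5)
    # outputs[5:10] = confidence of heavy rain in the same intervals
    for hour in range(5):
        print('Hours {0}-{1}: any rain {2:.2f}, heavy rain {3:.2f}'
              .format(hour, hour + 1, outputs[hour], outputs[hour + 5]))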

Ideally, this network would also update itself continuously, as more data became available.

This isn’t a substitute for the weather forecasts made by the experts at Environment Canada; they use a lot more to inform their forecasts than just the weather radar in the area. My project aims to answer a different question: it will try to estimate only the confidence of rain specifically in the city of Ottawa, over a relatively short projection interval of no more than 5 hours. It’s answering a more precise question, and I hope it turns out to give me useful information.

Now, we might be tempted to just throw the raw data at a neural network along with indications of whether a particular image is showing that it is raining in Ottawa, but we don’t have an unlimited data set, and we can probably help the process along quite a bit by doing some preliminary analysis. This isn’t feature selection; our input set is really a bit too simple for any meaningful feature selection, but we can give the algorithm a bit of a head start.

The first thing we’ll want to do is to pull out the background image. The radar image shows precipitation as colours overlaid on a fixed background. If we know what that background is in the absence of any rain, we can call that ‘0’ everywhere in the inputs, and any pixels that differ will be taken as coming from rain, with a value that increases as we climb that scale on the right side of the sample image.
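A minimal sketch of that idea, assuming the .gif frames have already been decoded to RGB arrays and that we have the legend’s colour values listed in order from lightest to heaviest; the function name and the exact-match rule are illustrative assumptions:

import numpy as np

def rain_intensities(image_rgb, baseline_rgb, scale_colours):
    # Pixels identical to the rain-free baseline become 0; pixels whose
    # colour matches an entry of the legend get that entry's rank, so the
    # value climbs with the intensity scale on the right of the image.
    intensity = np.zeros(image_rgb.shape[:2], dtype=int)
    changed = np.any(image_rgb != baseline_rgb, axis=-1)
    for level, colour in enumerate(scale_colours, start=1):
        matches = np.all(image_rgb == np.array(colour), axis=-1)
        intensity[changed & matches] = level
    return intensity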

I’ll pick out three images that are rain-free to my eye. There might be tiny pockets of precipitation that escape my notice, but by choosing three that appear clean and letting them vote on pixel values, I should have a good base reference.

We’ll be writing this project in Python3, with Keras interfacing onto TensorFlow.

The next posting will cover the baseline extraction code.

UPDATE #1 (2019-08-20): I’ve made the source files I’m posting in this series available on github. You can download them from https://github.com/ChristopherNeufeld/rain-predictor. I’ll continue to post the source code in these articles, but may not post patches there, I’ll just direct you back to the github tree for history and changes.

UPDATE #2 (2019-08-23): Added a link to an index page.