Monthly Archives: August 2019

The index to the articles in this series is found here.

So, earlier we discussed thinning the uninteresting cases from the input set. Clear skies leading to no rain. We thought we’d like to cut the no-rain entries from the training set by about half, so that the resulting data set would have roughly equal numbers of training elements that showed no rain in the next five hours and elements that showed at least some rain in the next five hours.

We proposed cutting those training elements that had the least summed intensity of rainfall over the six-image historical sequence, and discussed the danger of cutting an entire swath this way. The compromise I’ve decided to use is to thin the low end by keeping only every sixth training member from the no-rain set, sorted from lowest total intensity to highest, until we reach our target amount of thinning.

Before we do that, though, it’s a good idea to try to understand what we’re throwing away. So, we compute the sum of intensities for all of those candidate training elements that show no rain in the next five hours, and order them. This allows us to produce a quick plot of summed intensity against number of training elements no higher than that. Here is that plot:

What we can see here is that we have under 7000 candidates that produce no rain at all in the five hours after the historical window. In the first 4000 such candidates, the total rain intensity comes in with a low summed intensity. In fact, the 4000’th candidate has a summed intensity of 18833. This compares to almost 800000 on the high end. So, our candidates that produce no rain are quite heavily skewed on the side of very low intensities. This is helpful, it indicates that we can preferentially drop those low-sum-intensity entries from the candidates list without adversely affecting our training data.

Good, so we’ve checked that, and this vetoing is reasonable. Here’s the code to produce the veto list (and also to produce the data for the plot above). make-vetoes.py:

#! /usr/bin/python3

# Before we go off cutting out candidates from the training set, we
# should make sure we understand the data we're eliminating.  So,
# we'll generate a data set that can be used to plot candidate rain
# intensity on the Y axis and number of training candidates with
# values at or below that sum on the X axis

import argparse
import rpreddtypes
import sys

parser = argparse.ArgumentParser(description='Generate '
                                 'intensity data for plotting.')

parser.add_argument('--candidates', type=str,
                    dest='candidates',
                    help='A file of candidates data from '
                    'get-training-set.py')

parser.add_argument('--sequences', type=str,
                    dest='sequences',
                    help='A file of sequence information from '
                    'prepare-true-vals.py')

parser.add_argument('--with-plotting-data', type=bool,
                    dest='plotdat', default=True,
                    help='Whether to generate plotting data '
                    'on stdout')
parser.add_argument('--veto-fraction', type=float,
                    dest='vetofrac', default=0.5,
                    help='Fraction of no-rain candidates to put in '
                    'the veto list for exclusion.')
parser.add_argument('--veto-keepstride', type=int,
                    dest='keepstride', default=6,
                    help='When writing vetoes, will exclude each '
                    'Nth entry from the list to ensure we don\'t '
                    'produce data that\'s too lopsided.')
parser.add_argument('--write-vetoes', type=str,
                    dest='vetofile',
                    help='If set, writes an unsorted list of '
                    'candidate sequence '
                    'numbers to skip because of common-case thinning.')

args = parser.parse_args()

if not args.candidates:
    print('A file with the list of candidates must be supplied with the '
          '--candidates argument')
    sys.exit(1)

if not args.sequences:
    print('A file with the list of sequence data be supplied with the '
          '--sequences argument')
    sys.exit(1)

seqIntensity = {}
sumIntensities = []
with open(args.sequences, 'r') as ifile:
    for record in ifile:
        fields = record.split()
        seqno = int(fields[0])
        pathname = fields[1]
        reader = rpreddtypes.RpBinReader()
        reader.readHeader(pathname)
        seqIntensity[seqno] = reader.getTotalRain()
    
with open(args.candidates, 'r') as ifile:
    for record in ifile:
        fields = record.split()
        startval = int(fields[0])

        skipEntry = False
        for i in range(4,14):
            if (fields[i] != '0'):
                skipEntry = True
                break

        if skipEntry:
            continue
        
        sum = 0
        for i in range(6):
            sum += seqIntensity[startval + i]

        sumIntensities.append([sum, startval])

sumIntensities.sort()

if args.plotdat:
    for i in range(len(sumIntensities)):
        print (sumIntensities[i][0], i)

if args.vetofile:
    recordsToDiscard = int(len(sumIntensities) * args.vetofrac)
    with open(args.vetofile, 'w') as ofile:
        for i in range(len(sumIntensities)):
            if recordsToDiscard > 0:
                if i % args.keepstride != 0:
                    ofile.write('{}\n'.format(sumIntensities[i][1]))
                    recordsToDiscard -= 1
            else:
                break

And there we are. We can now start developing our neural network, which will happen over the following posts.

It’s worth pointing out that, up to now, we have not discarded any information. Our training candidates use the full resolution of the original .gif files, and we have a veto list that can be applied to skip some if we choose, but so far we have preserved all the data we started with.

UPDATE #1 (2019-08-23): Included a link to an index of articles in this series.

Writing a Rain Predictor. Source files on github.

2019-08-20

Uncategorized

bicycling machine learning neural networks python rain predictorChristopher

The index to the articles in this series is found here.

In the last message, we talked about looking at the total weight of rain pixels in a training entry, since ones with a low total weight are less interesting to us. I decided that this total weight was something that would be useful to have in the header of the intermediate binary file. So, I’ve bumped the version number of that file, and made appropriate adjustments. At this point, I figured it might get confusing to follow along as I make edits to source code, so I’ve set up a github repository for the python files I’ve been writing here. You can find it at: https://github.com/ChristopherNeufeld/rain-predictor

I’ll update the first post to mention the github source.

UPDATE #1 (2019-08-23): Included a link to an index of articles in this series.

Writing a Rain Predictor. Our training candidates.

2019-08-19

Uncategorized

bicycling machine learning neural networks python rain predictorChristopher

The index to the articles in this series is found here.

Well, we’re ready to go now. We have just to identify consecutive runs of 36 radar images, and bin up the rain/no-rain data appropriately for the 5 prediction windows.

We’ve got a simple script for that, get-training-set.py:

#! /usr/bin/python3

# Here we generate the training set candidates.
#
# We read the list of records made by prepare-true-vals.py from a
# file, and produce a list of candidates on stdout
#
# A training set candidate is a run of 6 sequence numbers representing
# the previous hour of historical data, while the following 30
# sequence numbers inform the rain/no-rain data for the five 1-hour
# blocks of future predictions.
#
# A training set candidate is available when there exists a set of 36
# consecutive sequence numbers.  In that case, we generate a record
# like this:
#
# <FIRST_SEQ_NO> <HASH> <N_ROTS> <ROTNUM> <RAIN0_1> <HEAVY0_1> ...
#
# The first field is the starting sequence number in the run of 36.
#
# The second is the hash, as in prepare-true-vals.py, to ensure that
# we don't accidentally mix incompatible training data
#
# The third field is the number of rotations from which this is taken,
# or 0 if we're using unrotated data sets
#
# The fourth field is the rotation index.  0 for unrotated, up to 1
# less than N_ROTS
#
# The fifth field is the logical OR of the RAIN record for the 7th
# through 12th sequence numbers.  The sixth is the logical OR of the
# HEAVY_RAIN record for the 7th through 12th sequence numbers.
#
# The following 8 fields are as above, for subsequence runs of 6
# sequence numbers.

import argparse
import rpreddtypes

parser = argparse.ArgumentParser(description='Find training candidates.',
                                 formatter_class=argparse.ArgumentDefaultsHelpFormatter)

parser.add_argument('truevalfile', type=str, metavar='truevalfile',
                    help='Filename to process')

parser.add_argument('--override-centre', type=list, dest='centre',
                    default=[240,239], help='Set a new location for '
                    'the pixel coordinates of the radar station')
parser.add_argument('--override-sensitive-region', type=list,
                    dest='sensitive',
                    default=[[264,204], [264,205], [265,204], [265,205]],
                    help='Set a new list of sensitive pixels')
parser.add_argument('--rotations', type=int, dest='rotations',
                    default=0, help='Number of synthetic data points '
                    'to create (via rotation) for each input data point')
parser.add_argument('--heavy-rain-index', type=int, dest='heavy',
                    default=3, help='Lowest index in the colour table '
                    'that indicates heavy rain, where 1 is the '
                    'lightest rain.')

args = parser.parse_args()

hashval = rpreddtypes.genhash(args.centre, args.sensitive, args.heavy)

seqnoList = []
parsedData = {}

nRots = -1
skipRotations = False

with open(args.truevalfile, 'r') as ifile:
    for record in ifile:
        fields = record.split()
        if fields[2] != hashval:
            continue

        # Check for inconsistent number of rotations
        if nRots != -1 and nRots != int(fields[3]):
            skipRotations = True
            
        nRots = int(fields[3])
        seqnoList.append(int(fields[0]))
        value = [ int(fields[3]) ]
        value[1:1] = list(map(int, fields[4:]))
        parsedData[int(fields[0])] = value

seqnoList.sort()
if skipRotations:
    nRots = 0

# Now, we've loaded sequence numbers into a list, and indexed a dict
# against them to record number of rotations and rain data.  

# Find runs of 36.

idx = 0
while idx < len(seqnoList) - 36:
    candSeqNo = seqnoList[idx]
    
    offset = 1
    invalid = False
    while not invalid and offset < 36:
        if seqnoList[idx + offset] != seqnoList[idx + offset - 1] + 1:
            invalid = True
            idx = idx + offset

        offset += 1

    if invalid:
        continue

    for rot in range(nRots + 1):
        record = '{0} {1} {2} {3} '.format(candSeqNo, hashval, nRots, rot)
        binnedValues = [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
        for timeInterval in range(5):
            for snapshot in range(6):
                oneRec = parsedData[candSeqNo + 6 + timeInterval * 6 + snapshot]
                if oneRec[1 + rot * 2] == 1:
                    binnedValues[timeInterval * 2] = 1
                if oneRec[1 + rot * 2 + 1] == 1:
                    binnedValues[timeInterval * 2 + 1] = 1

        print(record, *binnedValues, sep = ' ')

    idx += 1

We consider each training record to be 36 radar images. The first six are the historical data, they’re the inputs to the neural network. The following 30 images form the ‘true’ results, the actual occurence or not of rain in the five one-hour long bins following the historical interval.

This script produces records consisting of the first sequence number in the run of 36, the hash (as before, to prevent the accidental mixture of incompatible data), two fields to identify the rotation of the true prediction values, and a set of 10 zeros or ones. The first pair of numbers are 1 if there is rain/heavy-rain in the 6 records immediately following the historical data, indicating the first hour of the prediction. The next pair of numbers is the same for the second hour of prediction, and so on.

For now, I’ve chosen not to enable rotations on the data while I work out the neural network.

Now, in my summer 2015 data, I’ve got 10518 candidate records, of which 3740 predict at least some rain. That’s kind of interesting, it suggests that, on average in 2015, at any given moment we were about 36% likely to have rain some time in the next 5 hours.

Now, 36% isn’t actually a bad training fraction. We could probably afford to cut down the no-rain cases by half, to balance the data, assuming we’re not going for an auto-encoder trained only against the no-rain case.

We won’t be using an auto-encoder. The fact that we have close to a million input pixels per training element makes that quite impractical for this project.

So, how can we thin our input set to remove half of the no-rain entries? Well, clear skies leading to clear skies seems to be fairly uninteresting, and we don’t think we need to train our network against that. How about, for each candidate with no rain predicted over the next 5 hours, we count up the total rain intensity in the historical run of 6 radar images. That is, the sum of the pixel values, so heavy rain might count 6 on a pixel, and light rain only 2. Add these up over the historical interval, and the smallest aggregated numbers are the ones that show the least precipitation over that hour on the radar images. We could then drop the bottom half, with the least precipitation, but… there’s a danger in that. We would, in effect, be failing to train the network on the clearest sky cases, forcing it to extrapolate to what is a fairly common input. The network might wind up handling that case badly, since it never saw training data in that domain. So, I’d suggest that we drop data preferentially on the bottom end, but not exclusively so. If we’re trying to delete half of the no-rain data, perhaps we start from the lowest precipitation images and keep every fifth one as we walk up the list. This way we still remove a lot of relatively uninteresting training data, but we keep enough of it in to ensure that the network isn’t unaware of this case.

In the next brief posting, I’ll write a script to do this data thinning. Then, we’ll be able to get to the actual construction of the neural network.

UPDATE #1 (2019-08-23): Included a link to an index of articles in this series.

Writing a Rain Predictor. Building the training index.

2019-08-19

Uncategorized

bicycling machine learning neural networks python rain predictorChristopher

The index to the articles in this series is found here.

We’re almost ready to feed out data into the neural network. We’ve generated intermediate binary data files for our .gif files, now we need a way to find a set of at least 36 consecutive timestamped files, and an indication of whether a particular binary data file shows rain, and whether that rain is heavy. We also will want to mark the rain/no-rain condition for our synthetic rotated data, assuming that our local geography makes rotated data plausible.

We’re going to make a set of records, one to a line, that contain everything the neural network training script will need. A script will output these to stdout, so that it can easily be appended to our list as our data grows.

My data was collected from June to September of 2015, and I restarted the collector one week ago when I decided to take up this project, so I’ve got some data from August of 2019. My filenames all have UTC date and time specifiers. Always do that. You don’t want to have to care about standard/savings time switches in your filenames.

I’ll declare the UTC time 2015-01-01T00:00:00Z to be my epoch. That’s midnight on January 1, 2015, UTC. I know that data is available every 10 minutes from the radar station, so my sequence number will simply be the number of 10 minute intervals that have passed between that epoch and the timestamp of this particular file. This sequence number is the first field in the record.

I will also need a reference to the intermediate binary file itself. That will be the second field in the record.

The third field in the record is a hash of the centre of the radar image, the sensitive pixels, and the threshold that defines ‘heavy’ rain. This is to verify that we don’t accidentally mix incompatible data in our training set.

The next field in the record is the number of rotations we are calculating. 0 indicates no rotations, and 3 indicates we’ll have 4 total orientations, the original one plus three 90 degree rotations.

Following this, we have a pair of integers. 0/1 indicates no/any rain in the first field, and indicates no/any heavy rain in the second field.

If we have rotations, then each rotation has two more fields just like the ones above, indicating presence or absence of rain, and whether it’s heavy rain.

I’ve added a new function to rpreddtypes.py:

def genhash(centre, senseList, heavyThreshold, seed = 0xabcddcba):
    """
    Returns a string, a hash derived from MD5 of the inputs
    """

    hasher = hashlib.md5()
    hasher.update('{0:0>8x}'.format(seed).encode('ascii'))
    hasher.update('{0:0>8x}{1:0>8x}'
                  .format(centre[0], centre[1])
                  .encode('ascii'))
    for tuple in senseList:
        hasher.update('{0:0>8x}{1:0>8x}'
                      .format(tuple[0], tuple[1])
                      .encode('ascii'))

    hasher.update('{0:0>8x}'.format(heavyThreshold).encode('ascii'))
        
    return hasher.digest().hex()[0:8]

This generates the hash I’ll need to verify in the training code.

Now, we’re ready for the script itself. This is prepare-true-vals.py:

#! /usr/bin/python3

# This script will parse one or more of the intermediate binary files
# for the rain predictor and will produce a record on stdout.  A
# record consists of one line:

# <SEQ_NO> <FULL_PATH> <HASH> <N_ROTS> <RAIN_0> <HEAVY_RAIN_0> ...

# The SEQ_NO is a sequence number, the number of .gif files that ought
# to have been produced between 2015-01-01T00:00:00Z and now, assuming
# constant 10 minute intervals.  That's because we need an unbroken
# sequence of 6 hours of data for a training element, so checking for
# contiguous blocks of sequence numbers will be a quick determinant

# FULL_PATH is the pathname of the intermediate binary file.

# HASH is a 32-bit integer computed from the pixel coordinates of the
# radar station and of the rain-sensing pixels.  Represented as a
# 32-bit hex value with a leading 0x, it is there so we can
# distinguish training records for different areas of interest, if we
# should ever expand to that.

# N_ROTS is the number of rotations in this data.  '0' indicates only
# the unrotated set is present.  For numbers greater than 0, the
# rotations are assumed to be evenly divided over the circle, and
# count COUNTER-CLOCKWISE from the unrotated entry.

# RAIN_0 HEAVY_RAIN_0 is 0 or 1, 1 if there is any rain/heavy-rain in
# any of the rain-sensing pixels for this intermediate binary file, 0
# otherwise.  The _0 suffix indicates the unrotated set.  If rotated
# data is present, there will be pairs of records for each such
# rotation.

# IMPORTANT NOTE:
#
# the rotations of the rain-sensing region are counter-clockwise
# rotations.  That's because we're going to consider the rotations of
# the input set to be clockwise rotations, and rotating the rain
# clockwise means we have to rotate the sensing pixels
# counter-clockwise to get correct results.

# Also, remember we're using the same pixel numbering scheme as in the
# .gif file, the first index counts across a row, the second counts
# down rows.  This is a left-handed coordinate system, adjust the
# trigonometry appropriately.

import argparse
import sys
import os
import rpreddtypes
import numpy as np
import re
import datetime
import math


phantomRainRadius = 20
phantomRainFrac = 0.5
epoch = datetime.datetime(year=2015, month = 1, day = 1,
                          hour = 0, minute = 0)


def rotate_pixel_CCW (pixel, centre, nDiv, rotNum):
    """
    Rotates the pixel counter-clockwise around the centre by rotNum *
    2 pi / nDiv.  Special case treatment for nDiv = 2 or 4, since
    D_2 and D_4 symmetry groups are contained within the square grid.
    rotNum counts up from 1, so must not exceed nRots
    """

    if nDiv == 1:
        return pixel.copy()

    if nDiv <= 0 or rotNum <= 0 or rotNum > nDiv:
        print('Invalid invocation of rotate_pixel_CCW.  '
              'nDiv={0} and rotNum={1}'.format(nDiv, rotNum))
        sys.exit(1)

    delta = pixel.copy()
    delta[0] -= centre[0]
    delta[1] -= centre[1]

    rval = pixel.copy()

    if nDiv == 2 or ( nDiv == 4 and rotNum == 2):
        rval[0] = centre[0] - delta[0]
        rval[1] = centre[1] - delta[1]
        return rval

    if nDiv == 4:
        if rotNum == 1:
            rval[0] = centre[0] + delta[1]
            rval[1] = centre[1] - delta[0]
            return rval
        if rotNum == 3:
            rval[0] = centre[0] - delta[1]
            rval[1] = centre[1] + delta[0]
            return rval

    # If we get here, it's a rotation not covered by the symmetry of
    # the grid.  Note that we lose the 1-1 mapping guarantee now.
    # It's possible for two different input pixels to map to the same
    # output pixel.

    theta = (np.pi * 2 / nDiv) * rotNum
    sinT = np.sin(theta)
    cosT = np.cos(theta)

    # left-handed rotation matrix
    rmat = np.array(((c, s), (-s, c)))
    deltavec = np.array(delta)
    deltavec = rmat.dot(deltavec)

    rval[0] = centre[0] + deltavec[0]
    rval[1] = centre[1] + deltavec[1]
    return rval


def computeSequenceNumber(filename):
    """
    Returns a sequence number, or -1 on error
    """
    # My names are in the form "<dirs>/radar_YYYY_MM_DD_HH_MM.gif
    # I will search for the numeric sub-sequence, and ignore the rest
    match = re.search(r'.*([0-9]{4})_([0-9]{2})_([0-9]{2})'
                      '_([0-9]{2})_([0-9]{2}).*', filename)
    if not match:
        return -1

    year = int(match.group(1))
    month = int(match.group(2))
    day = int(match.group(3))
    hour = int(match.group(4))
    minute = int(match.group(5))

    mytime = datetime.datetime(year = year, month = month, day = day,
                               hour = hour, minute = minute)
    delta = mytime - epoch
    return int(delta.total_seconds() / 600)


def rainPresent(binReader, centre, sensitivePixels, heavyVal):
    """
    Returns a list of 2 integer elements.  The first indicates any
    rain at all in any of the sensitive pixels.  The second indicates
    rain above the threshold intensity for heavy rain.
    """
    
    maxSeen = 0
    anyRain = 0
    heavyRain = 0
    checkPhantom = True
    xoffset = binReader.xoffset
    yoffset = binReader.yoffset
    offset = [xoffset, yoffset]
    dataWidth = binReader.width
    dataHeight = binReader.height
    data = binReader.getNumpyArray()
    for pixel in sensitivePixels:
        pval = data[pixel[0] - xoffset][pixel[1] - yoffset]
        if pval > 0:
            anyRain = 1
            if pval > 1:
                checkPhantom = False
            if pval >= maxSeen:
                maxSeen = pval
        if maxSeen >= heavyVal:
            heavyRain = 1
            return [ 1, 1 ]

    if not anyRain:
        return [ 0, 0 ]

    if not checkPhantom:
        return [ anyRain, heavyRain ]

    # Have to check for phantom rain.  If all pixels within
    # 'phantomRainRadius' of the centre are 0 or 1, and the fraction
    # of elements that are 1 is less than 'phantomRainFrac', then we
    # declare the rain to be an instrumental artefact

    nPixels = 0
    nSetPixels = 0
    for probeY in range(-phantomRainRadius, phantomRainRadius):
        deltaX = int(math.sqrt(phantomRainRadius ** 2 - probeY ** 2))
        for probeX in range(-deltaX, deltaX):
            probePt = [ centre[0] + probeX - offset[0],
                        centre[1] + probeY - offset[1]]
            if ( probePt[0] < 0 or probePt[0] >= dataWidth
                 or probePt[1] < 0 or probePt[1] >= dataHeight ):
                continue

            nPixels += 1
            pixelVal = data[probePt[0]][probePt[1]]

            # If we've got a pixel larger than 1, no phantom rain
            if pixelVal > 1:
                return [anyRain, heavyRain]

            if pixelVal == 1:
                nSetPixels += 1

    if nSetPixels >= nPixels * phantomRainFrac:
        return [ anyRain, heavyRain ]
    else:
        return [ 0, 0 ]

    


## Main execution begins here

parser = argparse.ArgumentParser(description='Build training sequence.',
                                 formatter_class=argparse.ArgumentDefaultsHelpFormatter)

parser.add_argument('ifilenames', type=str, metavar='filename',
                    nargs='+', help='Filenames to process')

parser.add_argument('--override-centre', type=list, dest='centre',
                    default=[240,239], help='Set a new location for '
                    'the pixel coordinates of the radar station')
parser.add_argument('--override-sensitive-region', type=list,
                    dest='sensitive',
                    default=[[264,204], [264,205], [265,204], [265,205]],
                    help='Set a new list of sensitive pixels')
parser.add_argument('--rotations', type=int, dest='rotations',
                    default=0, help='Number of synthetic data points '
                    'to create (via rotation) for each input data point')
parser.add_argument('--heavy-rain-index', type=int, dest='heavy',
                    default=3, help='Lowest index in the colour table '
                    'that indicates heavy rain, where 1 is the '
                    'lightest rain.')

args = parser.parse_args()


hashString = rpreddtypes.genhash(args.centre, args.sensitive, args.heavy)

for inputfile in args.ifilenames:
    rpReader = rpreddtypes.RpBinReader()
    rpReader.read(inputfile)

    record = '{0} {1} {2} {3}'.format(computeSequenceNumber(inputfile),
                                      os.path.abspath(inputfile),
                                      rpreddtypes.genhash(args.centre,
                                                          args.sensitive,
                                                          args.heavy),
                                      args.rotations)

    truevals = rainPresent(rpReader, args.centre, args.sensitive, args.heavy)
    record = '{0} {1} {2}'.format(record, truevals[0], truevals[1])

    for rot in range(args.rotations):
        sense2 = args.sensitive.copy()
        for i in range(len(sense2)):
            sense2[i] = rotate_pixel_CCW(sense2[i], args.centre,
                                         args.rotations + 1, rot + 1)
        truevals = rainPresent(rpReader, args.centre, sense2, args.heavy)
        record = '{0} {1} {2}'.format(record, truevals[0], truevals[1])

    print (record)

So, we run that script on all of the input files I have so far, and see what we find. Scanning over my files from 2015, I have 10921 radar images. The analysis process determines that of these images, 667 indicate rain in Ottawa, and of those, 156 are heavy rain. So, we have a 6% event rate for any rain, and 1.5% for heavy rain. We’re going to have to keep this imbalance in mind when we design our neural network training set, as training works better with balanced data. I could easily design a network with a 94% accurate prediction rate, by training it to say “no rain” all the time, but I would miss the events I really care about.

Now, note that the problem isn’t to decide whether it’s raining in Ottawa, we already know that. We want to train the network to predict rain over the next 5 hours. That means our training data is not individual images, but consecutive runs of 36 images. The first 6 images in each run represent the data that will be presented to the network in order for it to form its prediction, the last hour of radar data. The next 30 images will be used to form the true data, whether rain actually fell in each block of 6 images and whether it was heavy rain.

So, our next post will generate the candidates for a training set. There, we’ll check the rare event statistics again and decide how we’re going to handle the training. Possibilities include trimming the no-rain set so that rain and no-rain entries are roughly equally represented in the training set, making multiple copies of the rain set for the same goal, producing an auto-encoder trained on the no-rain set, or maybe some other approach.

Of course, you’ll also start to wonder how we’re handling the time sequence. We’ll get to that soon.

UPDATE #1 (2019-08-23): Included a link to an index page of articles in this series.

Writing a Rain Predictor. Generating the training data

2019-08-17

Uncategorized

bicycling machine learning neural networks python rain predictorChristopher

The index to the articles in this series is found here.

So, we’re about ready for our next step. Converting those .gif files from the radar station into files holding no extraneous information, just precipitation data. So, that colour bar on the right can go, as can all the background. We want to encode those pixels so that any non-rain pixel has a value of 0, and rain pixels have numeric values that increase as the intensity of rain increases.

So, in earlier postings we created a background image. Now, for each radar image, we compare the background to the new image. Where pixels differ, if the colour in the new image can be found in a reference table of colours to rain intensities, then we record the appropriate ordinal of the table entry in the file.

I’ll record the results in a binary data file of my own. Because I’ll have to read these files from multiple scripts, it makes sense to create classes to manipulate them, and to design a file format that can verify that we’re not reading some random file not of my creation.

Here, then, is the manipulation package, called rpreddtypes.py:

#! /usr/bin/python3

import numpy as np
import gzip


# First, classes to manipulate the intermediate binary file.
#
# Format:
# RAIN PREDICTOR BIN FILE\n
# VERSION 1\n
# WIDTH NNN\n
# HEIGHT NNN\n
# XOFFSET NNN\n
# YOFFSET NNN\n
# <Binary blob of byte values>



class RpBinFileReadError(Exception):
    def __init__(self, message):
        self.message = message

class RpBinCommon:
    HEADER_KEY = 'RAIN PREDICTOR BIN FILE'
    VERSION_KEY = 'VERSION'
    WIDTH_KEY = 'WIDTH'
    HEIGHT_KEY = 'HEIGHT'
    XOFFSET_KEY = 'XOFFSET'
    YOFFSET_KEY = 'YOFFSET'
    

class RpBinReader(RpBinCommon):
    """
    Reader for intermediate binary data type
    """
    def __init__ (self,):
        self.version = 0
        self.buffer = b''
        self.blen = 0
        self.width = 0
        self.height = 0
        self.xoffset = 0
        self.yoffset = 0

    def read(self, filename):
        with open(filename, 'rb') as istream:
            try:
                istream = open(filename, 'rb')
            except OSError as ex:
                raise RpBinFileReadError(ex.strerror)

            header = istream.readline().rstrip().decode('ascii')
            if header != self.HEADER_KEY:
                raise RpBinFileReadError('File {0} is not a valid '
                                         'file'.format(filename))
            
            vstr, vnum = (istream.readline().rstrip()
                          .decode('ascii').split(" "))
            if vstr != self.VERSION_KEY:
                raise RpBinFileReadError('File {0} is not a valid '
                                         'file'.format(filename))
            if vnum != '1':
                raise RpBinFileReadError('File {0} is version {1}'
                                         'which is not supported'
                                         'by this code'.format(filename,
                                                               vnum))

            width, self.width = (istream.readline().rstrip()
                                 .decode('ascii').split(" "))
            self.width = int(self.width)
            if width != self.WIDTH_KEY or self.width < 0:
                raise RpBinFileReadError('File {0} is not a valid '
                                         'file'.format(filename))

            height, self.height = (istream.readline().rstrip()
                                   .decode('ascii').split(" "))
            self.height = int(self.height)
            if width != self.WIDTH_KEY or self.height < 0:
                raise RpBinFileReadError('File {0} is not a valid '
                                         'file'.format(filename))

            xoffset, self.xoffset = (istream.readline().rstrip()
                                     .decode('ascii').split(" "))
            self.xoffset = int(self.xoffset)
            if xoffset != self.XOFFSET_KEY or self.xoffset < 0:
                raise RpBinFileReadError('File {0} is not a valid '
                                         'file'.format(filename))

            yoffset, self.yoffset = (istream.readline().rstrip()
                                     .decode('ascii').split(" "))
            self.yoffset = int(self.yoffset)
            if yoffset != self.YOFFSET_KEY or self.yoffset < 0:
                raise RpBinFileReadError('File {0} is not a valid '
                                         'file'.format(filename))

            btmp = istream.read()
            self.buffer = gzip.decompress(btmp)
            self.blen = self.width * self.height
            if self.blen != len(self.buffer):
                raise RpBinFileReadError('File {0} is '
                                         'corrupted'.format(filename))
            
            
    def getVersion():
        return self.version

    def getWidth():
        return self.width

    def getHeight():
        return self.height

    def getXOffset():
        return self.xoffset

    def getYOffset():
        return self.yoffset

    def get1Dbuffer():
        return self.buffer

    def getNumpyArray():
        array = np.arange(self.width * self.height, dtype=bool)
        for i in range(self.width * self.height):
            array[i] = self.buffer[i]

        return array.reshape(self.width, self.height)

    

class RpBinWriter(RpBinCommon):
    def __init__(self):
        pass

    def write(self, filename, width, height, xoffset, yoffset, values):
        with open(filename, 'wb') as ofile:
            ofile.write('{0}\n'.format(self.HEADER_KEY)
                        .encode('ascii'))
            ofile.write('{0} {1}\n'.format(self.VERSION_KEY, 1)
                        .encode('ascii'))
            ofile.write('{0} {1}\n'.format(self.WIDTH_KEY, width)
                        .encode('ascii'))
            ofile.write('{0} {1}\n'.format(self.HEIGHT_KEY, height)
                        .encode('ascii'))
            ofile.write('{0} {1}\n'.format(self.XOFFSET_KEY, xoffset)
                        .encode('ascii'))
            ofile.write('{0} {1}\n'.format(self.YOFFSET_KEY, yoffset)
                        .encode('ascii'))
            ofile.write(gzip.compress(bytearray(values)))

This will be used by our script that generates binary training data files from .gif files. That’s this one, make-rain-inputs.py:

#! /usr/bin/python3

# This script will take .gif files downloaded from the radar station
# and convert them to a simpler format for eventual use.  It will
# contain only information about precipitation or its absence, in a
# set of integer steps.

import argparse
import sys
import gif
import rpreddtypes

parser = argparse.ArgumentParser(description='Extract '
                                 'precipitation data.')
parser.add_argument('ifilenames', type=str,
                    metavar='filename', nargs='+',
                    help='Filenames to process')
parser.add_argument('--baseline', type=str,
                    dest='baseline',
                    help='The baseline .gif file')
parser.add_argument('--width', type=int, dest='owidth',
                    default=-1,
                    help='The width of the sub-rectangle '
                    'that is to be output')
parser.add_argument('--height', type=int, dest='oheight',
                    default=-1,
                    help='The height of the sub-rectangle '
                    'that is to be output')
parser.add_argument('--top-left-x', type=int, dest='offsetx',
                    default=0,
                    help='The x-value of the upper left '
                    'of the sub-rectangle that is to be output')
parser.add_argument('--top-left-y', type=int, dest='offsety',
                    default=0,
                    help='The y-value of the upper left '
                    'of the sub-rectangle that is to be output')
parser.add_argument('--override-intensities', type=list,
                    dest='intensities',
                    default=[0x99ccff, 0x0099ff, 0x00ff66,
                             0x00cc00, 0x009900, 0x006600,
                             0xffff33, 0xffcc00, 0xff9900,
                             0xff6600, 0xff0000, 0xff0299,
                             0x9933cc, 0x660099],
                    help='Override the colour codes for '
                    'intensities')
parser.add_argument('--verbose', type=bool, dest='verbose',
                    default = False,
                    help='Extra output during processing')

args = parser.parse_args()

if not args.baseline:
    print ('A baseline comparison file must be supplied '
           'with the --baseline argument')
    sys.exit(1)

baselineReader = gif.Reader()
bfile = open(args.baseline, 'rb')
baselineReader.feed(bfile.read())
bfile.close()
if ( not baselineReader.is_complete()
     or not baselineReader.has_screen_descriptor() ):
    print ('Failed to parse {0} as a '
           '.gif file'.format(args.baseline))
    sys.exit(1)

baselineBuffer = baselineReader.blocks[0].get_pixels()
baselineColours = baselineReader.color_table
baselineWidth = baselineReader.width
baselineHeight = baselineReader.height

newwidth = baselineWidth
if args.owidth != -1:
    newwidth = args.owidth

newheight = baselineHeight
if args.oheight != -1:
    newheight = args.oheight

xoffset = args.offsetx
yoffset = args.offsety
    
for ifile in args.ifilenames:
    convertReader = gif.Reader()
    cfile = open(ifile, 'rb')
    convertReader.feed(cfile.read())
    cfile.close()
    
    if ( not convertReader.is_complete()
         or not convertReader.has_screen_descriptor() ):
        print ('Failed to parse {0} as a '
               '.gif file'.format(ifile))
        sys.exit(1)

    if ( len(convertReader.blocks) != 2
         or not isinstance(convertReader.blocks[0], gif.Image)
         or not isinstance(convertReader.blocks[1], gif.Trailer)):
        print ('While processing file: {}'.format(ifile))
        print ('The code only accepts input files with a single block of '
               'type Image followed by one of type Trailer.  This '
               'constraint has not been met, the code will have to be '
               'changed to handle the more complicated case.')
        sys.exit(1)
        
    convertBuffer = convertReader.blocks[0].get_pixels()
    convertColours = convertReader.color_table
    convertWidth = convertReader.width
    convertHeight = convertReader.height

    if baselineWidth != convertWidth or baselineHeight != convertHeight:
        print('The baseline file ({0}) and the file to convert {1} '
              'have incompatible dimensions'.format(args.baseline,
                                                    ifile))
        sys.exit(1)

    output_block = []

    for pixel in range(len(baselineBuffer)):

        row = pixel // baselineWidth
        col = pixel % baselineWidth

        if row < yoffset:
            continue

        if row >= yoffset + newheight:
            break

        if col < xoffset or col >= xoffset + newwidth:
            continue

        if pixel >= len(convertBuffer):
            output_block.append(0)
            continue

        btuple = baselineColours[baselineBuffer[pixel]]
        ctuple = convertColours[convertBuffer[pixel]]
        
        if btuple == ctuple:
            output_block.append(0)
        else:
            code = ( ctuple[0] * 256 * 256
                     + ctuple[1] * 256
                     + ctuple[2] )
            appendval = 0
            for i in range(len(args.intensities)):
                if code == args.intensities[i]:
                    appendval = i+1
                    break
            output_block.append(appendval)

    newfilename = ifile + '.bin'

    writer = rpreddtypes.RpBinWriter()
    writer.write(newfilename, newwidth, newheight, xoffset, yoffset,
                 output_block)

    if (args.verbose):
        print('Wrote output file: {0}'.format(newfilename))

In the next posting we’ll need a script to generate the true values. That is, for each file, whether it indicates rain in Ottawa, and whether that rain is heavy. We’ll also want to attach a sequence number to each file, allowing us to know when we have continuous runs of data that we need to make a single training entry.

Update #1: Fixed a bug that resulted in rainfall in the lowest intensity bin not being recorded.

UPDATE #2 (2019-08-23): Included a link to an index page of articles in this series.

There have been visitors to this page

Wordpress on cneufeld.ca

Moved here from Taiwan Yahoo blogs

Monthly Archives: August 2019

Writing a Rain Predictor. Thinning the training set.

Writing a Rain Predictor. Source files on github.

Writing a Rain Predictor. Our training candidates.

Writing a Rain Predictor. Building the training index.

Writing a Rain Predictor. Generating the training data