The index to the articles in this series is found here.
Well, we’re ready to go now. We just have to identify consecutive runs of 36 radar images and bin up the rain/no-rain data appropriately for the five prediction windows.
We’ve got a simple script for that, get-training-set.py:
#! /usr/bin/python3
# Here we generate the training set candidates.
#
# We read the list of records made by prepare-true-vals.py from a
# file, and produce a list of candidates on stdout
#
# A training set candidate is a run of 6 sequence numbers representing
# the previous hour of historical data, while the following 30
# sequence numbers inform the rain/no-rain data for the five 1-hour
# blocks of future predictions.
#
# A training set candidate is available when there exists a set of 36
# consecutive sequence numbers. In that case, we generate a record
# like this:
#
# <FIRST_SEQ_NO> <HASH> <N_ROTS> <ROTNUM> <RAIN0_1> <HEAVY0_1> ...
#
# The first field is the starting sequence number in the run of 36.
#
# The second is the hash, as in prepare-true-vals.py, to ensure that
# we don't accidentally mix incompatible training data
#
# The third field is the number of rotations from which this is taken,
# or 0 if we're using unrotated data sets
#
# The fourth field is the rotation index. 0 for unrotated, up to 1
# less than N_ROTS
#
# The fifth field is the logical OR of the RAIN record for the 7th
# through 12th sequence numbers. The sixth is the logical OR of the
# HEAVY_RAIN record for the 7th through 12th sequence numbers.
#
# The following 8 fields are as above, for subsequent runs of 6
# sequence numbers.
import argparse
import rpreddtypes
parser = argparse.ArgumentParser(description='Find training candidates.',
                                 formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument('truevalfile', type=str, metavar='truevalfile',
                    help='Filename to process')
parser.add_argument('--override-centre', type=list, dest='centre',
                    default=[240,239], help='Set a new location for '
                    'the pixel coordinates of the radar station')
parser.add_argument('--override-sensitive-region', type=list,
                    dest='sensitive',
                    default=[[264,204], [264,205], [265,204], [265,205]],
                    help='Set a new list of sensitive pixels')
parser.add_argument('--rotations', type=int, dest='rotations',
                    default=0, help='Number of synthetic data points '
                    'to create (via rotation) for each input data point')
parser.add_argument('--heavy-rain-index', type=int, dest='heavy',
                    default=3, help='Lowest index in the colour table '
                    'that indicates heavy rain, where 1 is the '
                    'lightest rain.')
args = parser.parse_args()
hashval = rpreddtypes.genhash(args.centre, args.sensitive, args.heavy)
seqnoList = []
parsedData = {}
nRots = -1
skipRotations = False
with open(args.truevalfile, 'r') as ifile:
    for record in ifile:
        fields = record.split()
        if fields[2] != hashval:
            continue
        # Check for inconsistent number of rotations
        if nRots != -1 and nRots != int(fields[3]):
            skipRotations = True
        nRots = int(fields[3])
        seqnoList.append(int(fields[0]))
        value = [ int(fields[3]) ]
        value[1:1] = list(map(int, fields[4:]))
        parsedData[int(fields[0])] = value
seqnoList.sort()
if skipRotations:
    nRots = 0
# Now, we've loaded sequence numbers into a list, and indexed a dict
# against them to record number of rotations and rain data.
# Find runs of 36.
idx = 0
while idx <= len(seqnoList) - 36:   # a full run of 36 must fit, starting at idx
    candSeqNo = seqnoList[idx]
    offset = 1
    invalid = False
    while not invalid and offset < 36:
        if seqnoList[idx + offset] != seqnoList[idx + offset - 1] + 1:
            invalid = True
            idx = idx + offset
        offset += 1
    if invalid:
        continue
    for rot in range(nRots + 1):
        record = '{0} {1} {2} {3} '.format(candSeqNo, hashval, nRots, rot)
        binnedValues = [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
        for timeInterval in range(5):
            for snapshot in range(6):
                oneRec = parsedData[candSeqNo + 6 + timeInterval * 6 + snapshot]
                if oneRec[1 + rot * 2] == 1:
                    binnedValues[timeInterval * 2] = 1
                if oneRec[1 + rot * 2 + 1] == 1:
                    binnedValues[timeInterval * 2 + 1] = 1
        print(record, *binnedValues, sep = ' ')
    idx += 1
We consider each training record to be 36 radar images. The first six are the historical data; they’re the inputs to the neural network. The following 30 images form the ‘true’ results: the actual occurrence or not of rain in each of the five one-hour bins following the historical interval.
This script produces records consisting of the first sequence number in the run of 36, the hash (as before, to prevent the accidental mixing of incompatible data), two fields to identify the rotation of the true prediction values, and a set of ten zeros or ones. Those ten values come in pairs, one pair per hour of prediction: the first number in a pair is 1 if there is any rain, and the second is 1 if there is any heavy rain, in the six records covering that hour. The first pair covers the six records immediately following the historical data, the second pair covers the hour after that, and so on.
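To make that layout concrete, here’s a minimal sketch of how a later script might unpack one of these candidate lines. The function and variable names are mine, purely for illustration; they aren’t part of the real pipeline.
# Sketch only: unpack one candidate record produced by get-training-set.py.
# Layout: <FIRST_SEQ_NO> <HASH> <N_ROTS> <ROTNUM> followed by five pairs
# of (rain, heavy-rain) flags, one pair per 1-hour prediction bin.
def parse_candidate(line):
    fields = line.split()
    seqno = int(fields[0])
    hashval = fields[1]
    nRots = int(fields[2])
    rotNum = int(fields[3])
    flags = list(map(int, fields[4:14]))
    # bins[k] = (rain, heavy_rain) for the k'th hour after the historical window
    bins = [(flags[2 * k], flags[2 * k + 1]) for k in range(5)]
    return seqno, hashval, nRots, rotNum, bins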
For now, I’ve chosen not to enable rotations on the data while I work out the neural network.
Now, in my summer 2015 data, I’ve got 10518 candidate records, of which 3740 predict at least some rain. That’s kind of interesting: it suggests that, on average in 2015, at any given moment we were about 36% likely to have rain some time in the next 5 hours.
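That 36% figure is just a trivial count over the output of get-training-set.py; something along these lines, where ‘candidates.txt’ stands in for whatever file you redirected the script’s output into:
# Count how many candidate records predict any rain in the next five hours.
nTotal = 0
nRain = 0
with open('candidates.txt', 'r') as ifile:
    for line in ifile:
        fields = line.split()
        nTotal += 1
        # Fields 4 onward are the ten binned rain/heavy-rain flags.
        if any(int(f) == 1 for f in fields[4:14]):
            nRain += 1
print(nRain, nTotal, nRain / nTotal)
On my data that prints 3740, 10518, and roughly 0.356.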
Now, 36% isn’t actually a bad training fraction. We could probably afford to cut down the no-rain cases by half, to balance the data, assuming we’re not going for an auto-encoder trained only against the no-rain case.
We won’t be using an auto-encoder. The fact that we have close to a million input pixels per training element makes that quite impractical for this project.
So, how can we thin our input set to remove half of the no-rain entries? Well, clear skies leading to clear skies seems fairly uninteresting, and we don’t think we need to train our network against that. How about this: for each candidate with no rain predicted over the next 5 hours, we count up the total rain intensity in the historical run of 6 radar images. That is, the sum of the pixel values, so heavy rain might count 6 on a pixel, and light rain only 2. Add these up over the historical interval, and the smallest aggregated numbers are the ones that show the least precipitation over that hour on the radar images.
We could then drop the bottom half, with the least precipitation, but… there’s a danger in that. We would, in effect, be failing to train the network on the clearest-sky cases, forcing it to extrapolate to what is a fairly common input. The network might wind up handling that case badly, since it never saw training data in that domain.
So, I’d suggest that we drop data preferentially from the bottom end, but not exclusively. If we’re trying to delete half of the no-rain data, perhaps we start from the lowest-precipitation candidates and keep every fifth one as we walk up the list, as sketched below. This way we still remove a lot of relatively uninteresting training data, but we keep enough of it to ensure that the network isn’t unaware of this case.
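I haven’t written that script yet, but the selection logic I have in mind looks roughly like the sketch below. It assumes the no-rain candidates have already been sorted in ascending order of their summed pixel intensity over the historical hour; the function and argument names are placeholders, not part of the real scripts.
# Sketch: thin the no-rain candidates, walking up from the clearest skies
# and keeping every fifth record until half of them have been dropped.
def thin_no_rain(norain_sorted, keep_every=5):
    targetDrop = len(norain_sorted) // 2
    kept = []
    dropped = 0
    for i, cand in enumerate(norain_sorted):
        if dropped < targetDrop and i % keep_every != 0:
            dropped += 1
        else:
            kept.append(cand)
    return kept
Keeping indices 0, 5, 10, and so on while we’re still below the deletion target means the clearest-sky cases are thinned aggressively but never disappear from the training set entirely.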
In the next brief posting, I’ll write a script to do this data thinning. Then, we’ll be able to get to the actual construction of the neural network.
UPDATE #1 (2019-08-23): Included a link to an index of articles in this series.