Building a Rain Predictor. Preprocessing the data.

The index to the articles in this series is found here.

I wrote the new network topology and started training it. Training took a very long time to begin, and I tracked that down to the generator taking a long time to prepare data batches. It seems that retrieving and operating on random elements of a numpy array one at a time is costly, as each number has to be converted into a Python object for use. It’s a bit like the cost of cons-ing things in Lisp, so we try to minimize that.
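To illustrate the kind of overhead I mean, here is a minimal sketch rather than the actual generator code. The array `data`, the index list `picks`, and both function names are made up for the example. The first function pulls elements out one by one at the Python level, while the second asks numpy for all of them in a single call.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random(100_000)                       # hypothetical stand-in for stored samples
picks = rng.integers(0, data.size, size=1_000)   # random positions to fetch

def gather_elementwise(arr, indices):
    """Fetch one element at a time; every arr[i] builds a separate scalar object."""
    return np.array([arr[i] for i in indices])

def gather_vectorized(arr, indices):
    """Fetch all elements with one fancy-indexing call that stays inside numpy."""
    return arr[indices]

slow = gather_elementwise(data, picks)
fast = gather_vectorized(data, picks)
```

Both calls return the same values. Timing them with the standard `timeit` module should show how much of the cost comes from the per-element path, though the exact ratio will depend on the machine and the array sizes.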

I decided that it would be best to preprocess the data up front, during the construction of the binary intermediate files. That way I could retrieve the data in a format that could be rapidly converted to a numpy array suitable for passing to a Keras Input layer.

I’ve spent a lot of time on the preprocessor in this project, probably more than I’ve spent on the neural networking code itself. That’s not surprising: once the network design is laid out, that part is simple, but preparing the data for use is relatively time-consuming. It involves feature selection, feature engineering, normalization, and other activities along these lines.

So, the preprocessed binary files now have payload types. In addition to the raw data and the coarse-scaled data, there’s a new payload type that contains an input vector for the neural network. This input vector is represented as a set of unsigned 8-bit values, so when the generator loads it, it simply has to convert the bytearray to a numpy array and divide the elements by 255.
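The load step described above can be sketched as follows. This is an illustration under assumptions, not the code in the git tree: the function names are hypothetical, and the encoding side assumes the inputs were already normalized to the range 0 to 1 before being quantized to bytes. The part that mirrors the text is the decode step, which turns the bytearray into a numpy array and divides by 255.

```python
import numpy as np

def encode_input_vector(values):
    """Quantize floats in [0, 1] to unsigned 8-bit values and return raw bytes.

    Assumed preprocessor-side step; the real file format may differ.
    """
    arr = np.asarray(values, dtype=np.float32)
    quantized = np.rint(np.clip(arr, 0.0, 1.0) * 255.0).astype(np.uint8)
    return quantized.tobytes()

def decode_input_vector(payload):
    """Convert a bytes or bytearray payload to float32 values in [0, 1]."""
    raw = np.frombuffer(payload, dtype=np.uint8)   # view the bytes, no per-element work
    return raw.astype(np.float32) / 255.0

payload = bytearray(encode_input_vector([0.0, 0.5, 1.0]))
vector = decode_input_vector(payload)
```

The appeal of this layout is that the generator does no per-element Python work at all. `np.frombuffer` reinterprets the bytes in place, and the division is a single vectorized operation. The price is a quantization error of at most half a step, about 0.002, on each input.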

These changes are now in the git tree. I’m preprocessing my files overnight, and will try training again in the morning.
