The index to the articles in this series is found here.
I’ve been throwing about a lot of somewhat technical terms, and it occurs to me that I probably should have spent some time explaining them up front.
If your understanding of neural networks is along the lines of saying “a directed graph in layers, with neurons that sum their inputs in a linear combination, add a bias, and feed through a sigmoid function before going to the next layer”, that’s a start, but there’s much more subtlety than that once you get into the details.
Some types of layers:
I’ve mentioned various types of layers before. Note, however, that a neural network can be made without layers at all, though doing so poses its own difficulties of mathematics and implementation.
The simplest layer people usually think of in the context of neural networks is the dense, or fully connected layer. Such a layer might have N inputs, and consist of M neurons. Each neuron will take, as input, all of the N inputs and produce a single output. The topology is symmetric: from a connection standpoint, all of the neurons look equivalent. Of course, as the layer is trained, weights will change, and the neurons will not behave identically. From an ease of implementation standpoint, one usually treats the bias term as an additional input, with its value set to 1. This means that each neuron has N+1 weights, and there are M neurons, for a total of M*(N+1) weights. In implementation, these will generally be stored in a tensor. Think of a tensor as an array with an arbitrary number of dimensions.
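To make the weight count concrete, here's a minimal Keras sketch of a dense layer; the sizes N and M are purely illustrative, not the rain predictor's values.

```python
# Minimal dense-layer sketch; layer sizes are illustrative only.
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Sequential

N = 8   # number of inputs
M = 4   # number of neurons in the dense layer

model = Sequential([
    Input(shape=(N,)),
    Dense(M, activation='relu'),   # every neuron sees all N inputs, plus one bias term
])

model.summary()   # the Dense layer reports M * (N + 1) = 36 trainable parameters
```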
I’ve mentioned convolutional layers a few times in this series. This is a layer that acts on groups of proximate data. The simplest way to think of a convolutional layer is to imagine a system for computer vision. The inputs are separate pixels in the image laid out on a two-dimensional grid, and we can describe certain pixels as being adjacent to certain other pixels. We can construct a neighbour graph that states that the pixel at (10,9) is adjacent to the pixel at (10,8). A convolutional layer, rather than taking the values of the pixels as inputs, performs a convolution operation on the pixels. For each pixel it produces one or more values that are functions of the pixel and its neighbours. If you think of the values on the pixels as representing a function in two dimensions, then the convolution is a function of the function. There are several different convolutions one could apply, including nonlinear ones that involve min() and max(), but a simple linear one would be this: imagine that you want to use only the eight nearest neighbours in a square grid, then you have the centre pixel and the eight pixels that surround it, for a 3×3 grid of pixels with their associated values. The convolution function itself is another 3×3 grid of numbers, usually called the kernel. The convolution operation is to multiply these two grids together element by element, then add the nine resulting products to produce a single number. The convolution window then slides over to the next pixel and the operation repeats. There is a lot more behind convolutional layers, but I’m not going to do a thorough discussion of them here. You might want to imagine a one-dimensional problem, say a noisy signal on a measurement, and an associated kernel that looks like [0.2,0.2,0.2,0.2,0.2], and what this convolution would do to the signal.
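As an aside, that one-dimensional example is easy to try directly, without any neural network machinery at all. This little sketch (plain NumPy, my own illustration) applies the [0.2,0.2,0.2,0.2,0.2] kernel to a noisy signal:

```python
# A quick illustration of the one-dimensional smoothing example above,
# using plain NumPy rather than a neural network layer.
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 200)
signal = np.sin(t) + 0.3 * rng.standard_normal(t.size)   # a noisy measurement

kernel = np.array([0.2, 0.2, 0.2, 0.2, 0.2])              # five-point moving average
smoothed = np.convolve(signal, kernel, mode='same')       # slide the kernel along the signal

# 'smoothed' is a low-pass filtered copy of 'signal'; a convolutional layer
# performs the same sliding multiply-and-sum, but learns its kernel values during training.
```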
A sparse layer is the converse of a dense one. It’s a layer in which not all neurons receive all inputs. One typically generates these topologies by hand, due to some specific knowledge of the problem space, though there are automated techniques, such as training a dense network for a while and then forcefully cutting low-weight inputs to zero. The aim here is to reduce computational effort by removing connections that have little bearing on the final answer. There have been reports, however, that these layers yield networks that are prone to convergence problems, so one should bear that possibility in mind when using sparse layers.
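For illustration, here's roughly what the "train dense, then cut low-weight inputs to zero" step might look like in Keras. The model variable is assumed to be an already-trained network, and the threshold is an arbitrary value I chose for the sketch; this isn't code from the rain predictor.

```python
# Hedged sketch: zero out near-zero weights in an already-trained Keras model.
import numpy as np

threshold = 0.01   # arbitrary cutoff, chosen for illustration
for layer in model.layers:
    weights = layer.get_weights()
    if not weights:
        continue                                  # skip layers with no trainable weights
    kernel = weights[0]                           # the input-to-neuron weight matrix
    kernel[np.abs(kernel) < threshold] = 0.0      # force small weights to exactly zero
    layer.set_weights([kernel] + weights[1:])     # leave biases and other arrays untouched
```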
A recurrent layer is one that feeds back into itself. That is, its inputs at time N+1 include one or more values derived from a function of its outputs from one or more earlier timesteps. The details of this function lead to different types of recurrent networks, including simple, GRU, and LSTM. A recurrent neural network is one way in which the network can be designed to track behaviour through time, which is exactly what I’m trying to do in the rain predictor.
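A recurrent layer in Keras is declared much like any other; here's a minimal sketch with an LSTM. The shapes are illustrative, and a GRU or SimpleRNN layer could be dropped into the same spot.

```python
# Minimal recurrent-layer sketch; shapes are illustrative only.
from tensorflow.keras.layers import LSTM, Dense, Input
from tensorflow.keras.models import Sequential

timesteps = 10      # how many past observations the network sees at once
features = 6        # values available at each timestep

model = Sequential([
    Input(shape=(timesteps, features)),
    LSTM(16),                        # carries internal state from one timestep to the next
    Dense(1, activation='sigmoid'),  # a single yes/no style output
])
```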
These four cases make a good introduction to the kinds of active layers that are often used in neural network problems. I’m not including layers that Keras and TensorFlow define but which I think of more as topological transformations, such as pooling or merge layers, or simple functional layers where each neuron takes a single input, applies a mathematical operation to that one value (such as computing its square or adding noise), and produces a new value.
Activation functions
Without a nonlinear activation function, the outputs of a neural network are a simple linear combination of inputs (including bias), which means it can be represented by a single matrix, so the layers could be collapsed into a single layer. To get interesting behaviour, you need a nonlinear activation function. This is the function that is applied to the linear combination of inputs to produce the output.
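The collapse of stacked linear layers into a single matrix is easy to verify numerically. This quick NumPy check (my own illustration) shows two bias-carrying linear layers producing exactly the same output as one combined matrix and bias:

```python
# Numerical check: two purely linear layers collapse into a single equivalent layer.
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(5)                                      # an arbitrary input vector

W1, b1 = rng.standard_normal((4, 5)), rng.standard_normal(4)    # layer 1 weights and bias
W2, b2 = rng.standard_normal((3, 4)), rng.standard_normal(3)    # layer 2 weights and bias

two_layers = W2 @ (W1 @ x + b1) + b2          # no nonlinearity anywhere
collapsed = (W2 @ W1) @ x + (W2 @ b1 + b2)    # one matrix and one bias

print(np.allclose(two_layers, collapsed))     # True
```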
Keras supplies several activation functions, and the choice of function is a bit subtle. Typically one doesn’t use sigmoid or tanh activation functions on the intermediate layers, because it’s easy to find oneself in the regime of small but non-zero derivatives that cause significant slowdowns in the training process, which depends on gradient descent. Common choices for the intermediate layers are ReLU, leaky ReLU, and hard sigmoid, all of which have bounded derivatives that exclude small non-zero values.
For our output layer, we will be using sigmoid functions, since we’re looking for a 0/1 distinction, and sigmoid is the obvious candidate.
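Putting those two choices together, a hidden layer with a leaky ReLU and a sigmoid on the single output might be declared like this in Keras; the sizes are again illustrative:

```python
# Sketch of the activation choices above: leaky ReLU inside, sigmoid on the output.
from tensorflow.keras.layers import Dense, Input, LeakyReLU
from tensorflow.keras.models import Sequential

model = Sequential([
    Input(shape=(12,)),
    Dense(8),
    LeakyReLU(),                      # small default slope keeps the gradient non-zero for negative inputs
    Dense(1, activation='sigmoid'),   # squeezes the output into (0, 1) for the yes/no decision
])
```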
Regularization
There are some common problems that can arise in neural networks. Regularization can help address at least two big ones: overfitting and unrealistically high weights on certain inputs.
Overfitting is a common concern. It’s one reason I’ve spent so much effort on minimizing neurons in our neural network. With enough parameters in play, you can thread a curve through any collection of points. A neural network might become obsessively accurate at reproducing the training data, at the expense of generality. This is what is generally referred to as overfitting. We will address this concern statistically in a later posting.
The other common problem is an unrealistically high dependence on a particular input. The neural network might sniff out some coincidental correlation between some subset of the inputs and the desired outputs, and put a lot of weight on that. This is, in effect, another manifestation of overfitting, but stands a bit in its own category.
Regularization is how one typically avoids overfitting. A first obvious technique is referred to as “early stopping”. As you continue to train, your network becomes better at matching the training data. It may, however, start to drift ever further from your validation data (you will have validation data). Early stopping just asserts that once the network’s performance on the validation data starts to degrade, you stop your training. Naturally, it’s not quite that simple: the agreement with the validation data can worsen for a while before ultimately improving again, so various heuristics are used to decide when to stop training.
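In Keras, early stopping is available as a callback. Here's a sketch; the patience value and the commented-out fit() arguments are placeholders, not the actual training code from this series.

```python
# Sketch of early stopping in Keras; values and variable names are placeholders.
from tensorflow.keras.callbacks import EarlyStopping

stopper = EarlyStopping(monitor='val_loss',
                        patience=10,                # tolerate 10 epochs without improvement
                        restore_best_weights=True)  # roll back to the best validation score

# model.fit(train_x, train_y,
#           validation_data=(val_x, val_y),
#           epochs=500,
#           callbacks=[stopper])
```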
Some other regularization techniques rely on the observation that neural networks are, or should be, fairly robust in the face of neuron dropouts or noisy data. You train the network while it is faced with problems like the deletion of a significant fraction of its nodes. Between training passes, the nodes that are deleted change. This prevents any small set of nodes from dominating the behaviour of the system, because if some of those nodes are deleted, the network will suddenly perform poorly, and will train away from that configuration.
One can also introduce noise in the outputs of one layer, as inputs to the next. This helps to force the network to configure itself for a broader volume of the phase space of inputs, thereby becoming more general, and less likely to overfit.
Finally, to avoid the problem of extremely large coefficients in some places in the network, a penalty term can be applied in the error estimation simply due to the presence of large coefficients. This causes the network to push itself away from configurations that have large coefficients.
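Here's a hedged sketch of how these last three ideas (dropout, added noise, and a weight penalty) can be combined in a single Keras model. The sizes, rates, and penalty strength are illustrative only.

```python
# Sketch combining three regularization techniques; all numbers are illustrative.
from tensorflow.keras.layers import Dense, Dropout, GaussianNoise, Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.regularizers import l2

model = Sequential([
    Input(shape=(20,)),
    Dense(32, activation='relu',
          kernel_regularizer=l2(0.01)),   # penalize large weights in the loss
    Dropout(0.5),                         # randomly delete half the outputs each training pass
    GaussianNoise(0.1),                   # perturb the values fed to the next layer during training
    Dense(1, activation='sigmoid'),
])
```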
One regularization technique that is sometimes applied is to delete weights that are small but non-zero, and set them to zero. This is helpful for reducing network complexity, but, as noted above, it is suggested that convergence suffers if this is applied too liberally.
In conclusion
So, that’s a brief overview of some of the terms I’ve been throwing around in this set of articles. One can practically write an entire book around the content of each paragraph above, so there’s a lot more detail to explore.
UPDATE #1 (2019-08-23): Included a link to an index of articles in this series.