3.3.1 Training
The network was trained on a corpus of 4,000 generated sentences (containing 6,497 clauses) with a maximum length of 20 words (the average was 10.5 words). Both the number of sentences and the maximum length were dictated by the need to restrict the size of the training corpus to what the available hardware could manage (Note 16). In addition, sentence length was limited in order to prevent the generation of unrealistically long sentences with, for example, multiple embedded relative clauses.
The learning algorithm used is a slightly adapted form of backpropagation suitable for recurrent neural networks (Zell et al. 1994). The feed-forward connection weights were initialized with random values ranging between -1 and +1. The recurrent connection weights were frozen at +1 to ensure the correct copying of the hidden layers to the recurrent layers. The connections within the recurrent layers were initialized at a slightly inhibitory -0.1, and all the units in these layers were given an initial activation value of 0.5. The error correction value was initially 0.15 (for a tolerance of 0.15), but was later decreased to 0.01 (for a tolerance of 0.05) to allow for more precise changes in the weights (Note 17). These values were based on the results of previous experimentation, albeit with a much smaller network and a much simpler task. As the same values led to fairly efficient learning in the case of CLASPnet as well, no attempts were made to optimize the learning parameters any further (Note 18).
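The initialization described above can be illustrated with a minimal sketch. It is not the SNNS configuration that was actually used: the function names are hypothetical, the layer sizes are arbitrary, and the sketch assumes that each unit in a recurrent layer receives one copy link from its hidden unit and has one self-connection. The text only says "connections within the recurrent layers", so that topology is an assumption.

```python
import random

FEED_FORWARD_RANGE = (-1.0, 1.0)   # random initial feed-forward weights
COPY_WEIGHT = 1.0                  # hidden -> recurrent copy links, frozen
CONTEXT_SELF_WEIGHT = -0.1         # initial weight within the recurrent layer
CONTEXT_INIT_ACTIVATION = 0.5      # initial activation of recurrent units


def init_feed_forward(n_from, n_to, rng):
    """Return an n_from x n_to matrix of weights drawn uniformly from [-1, +1]."""
    lo, hi = FEED_FORWARD_RANGE
    return [[rng.uniform(lo, hi) for _ in range(n_to)] for _ in range(n_from)]


def init_context(n_units):
    """Return the initial activations of a recurrent (context) layer."""
    return [CONTEXT_INIT_ACTIVATION] * n_units


def context_net_input(hidden, context):
    """Net input to each recurrent unit: a copy of its hidden unit plus a
    slightly inhibitory contribution from its own previous activation.
    ASSUMPTION: one self-connection per unit; the activation function
    applied to this net input is left out of the sketch."""
    return [COPY_WEIGHT * h + CONTEXT_SELF_WEIGHT * c
            for h, c in zip(hidden, context)]


rng = random.Random(0)                    # fixed seed, for reproducibility only
weights = init_feed_forward(4, 3, rng)    # arbitrary layer sizes
context = init_context(3)
```

Because the copy links are frozen at +1, only the feed-forward weights and the weights within the recurrent layers would change during training.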
The training was done using the Stuttgart Neural Network Simulator v4.1 package (see 1.3) on a single-processor Linux machine: a Pentium 100 with 64 megabytes of RAM. (Most of the other training results reported later were obtained on a SunOS SPARCstation 20/501 with 48 megabytes of RAM. The training times on this machine were roughly similar to those on the Linux P100.)
Training took approximately 40 hours of CPU time on an otherwise idle machine. Figure 3.7 shows how training progressed: the X-axis shows the number of training epochs (i.e. the number of times that the entire corpus was presented to the network); the Y-axis shows the overall error of the network, calculated as the mean sum of squared errors (Note 19). The figure shows that training progressed very rapidly during the first ten epochs, but then more and more slowly. When no more progress seemed likely, the training was stopped.
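As an illustration of the error measure plotted on the Y-axis, the following sketch computes a mean sum of squared errors over one epoch. It assumes that the squared errors are summed over all output units of a pattern and then averaged over the patterns; the exact definition used for the figure may differ, and the function name is hypothetical.

```python
def mean_sum_squared_error(outputs, targets):
    """Sum the squared errors over the output units of each pattern,
    then average that sum over all patterns (ASSUMED definition)."""
    total = 0.0
    for out, tgt in zip(outputs, targets):
        total += sum((t - o) ** 2 for o, t in zip(out, tgt))
    return total / len(outputs)


# Two toy patterns with two output units each (made-up values).
outputs = [[0.0, 1.0], [0.5, 0.5]]
targets = [[1.0, 1.0], [0.5, 0.0]]
epoch_error = mean_sum_squared_error(outputs, targets)
```

Recording this value once per epoch yields a curve of the kind described above.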
Note 16
The corpus contains 47,338 input patterns, along with the desired output
patterns. About 40 megabytes of RAM are needed to store this corpus in
memory. Doubling the size of the training corpus would probably also
double the training time.
Note 17
The error correction term is used to calculate how much the weights
of the connections will be modified. Large values can lead to faster
learning because the network can change all the values quickly. Smaller
values allow the network to fine-tune the weights. There are no fixed rules
for knowing which values are most appropriate for a given network and a
given task.
The tolerance value is used to determine whether the actual activation
value of an output unit is close enough to the desired one: if it is not,
then all the weights leading to the unit are modified; if it is, then no
learning is done. Again, choosing the tolerance value is not
straightforward.
The idea behind decreasing both the error correction term and the tolerance
value during training is that the network should first try to find the area
in the space of possible solutions which contains the best solution; once
the network has found this general area, it should be allowed to take
smaller steps towards the optimal solution. (However, backpropagation offers
no guarantees that the network will indeed find (the region of) the best
solution: it is possible for the net to get stuck in a 'local minimum'. In
such a local minimum, small changes to the weights will only increase the
overall error, so the net will stabilize there, oblivious to the fact that
larger changes to the weights would allow it to jump out of the local
minimum. Luckily, the more dimensions (i.e. hidden units) the space of
possible solutions has, the smaller the chance that there are local minima
without exits.)
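The interplay between the error correction term and the tolerance value can be sketched as follows. This is a simplified illustration for a single output unit, not the simulator's actual code: it assumes a logistic activation function, it assumes that a difference exactly equal to the tolerance counts as tolerated, and all names are hypothetical. The two parameter pairs are the ones reported for the two training phases.

```python
# (error correction term, tolerance) for the two reported training phases
PHASES = [(0.15, 0.15), (0.01, 0.05)]


def output_error(target, actual, tolerance):
    """Return target - actual, or 0.0 when the output is within tolerance.
    ASSUMPTION: a difference equal to the tolerance is treated as tolerated."""
    diff = target - actual
    return 0.0 if abs(diff) <= tolerance else diff


def weight_change(rate, target, actual, sending_activation, tolerance):
    """Backpropagation weight change for one link into an output unit.
    ASSUMPTION: logistic activation, so the derivative is actual * (1 - actual)."""
    delta = output_error(target, actual, tolerance) * actual * (1.0 - actual)
    return rate * delta * sending_activation


# The same output (0.9 for a target of 1.0) in both phases:
changes = [weight_change(rate, 1.0, 0.9, 1.0, tol) for rate, tol in PHASES]
```

In the first phase the difference of 0.1 falls within the tolerance of 0.15, so nothing is learned from this unit; in the second phase the stricter tolerance of 0.05 triggers a weight change, but a small one because of the lower error correction term.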