3.4.2 Increasing the Size of the Training Corpus
As the memory capacity of the machines I had access to was limited, I consequently had to limit the size of the training corpus -- in the case of CLASPnet, 4,000 sentences turned out to be the most I could use without running the risk of sending the machine into perpetual swapping. But because the grammar can generate far more than 4,000 sentences, it is at least possible that a number of clausal constructions are simply never presented to the network at all during training. If similar constructions occur in the test corpora, they will likely be classified incorrectly. This experiment therefore shows how the performance of the network on a large test corpus of 5,000 sentences (maximum length of 20 words) improves as the size of the training corpus increases from 100 to 4,000 sentences. On the basis of the resulting curve, one can predict with some degree of certainty whether training corpora of 10,000 or 50,000 sentences would have been more suitable.
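The shape of such a learning curve can be illustrated with a toy simulation. The sketch below is purely illustrative and assumes nothing about CLASPnet's architecture: a trivial threshold "learner" (a hypothetical stand-in, not the network used here) is trained on samples growing from 100 to 4,000 items, and its error on a fixed 5,000-item test set is recorded at each size.

```python
import random

random.seed(0)

def make_sentence():
    """Toy datum: (feature, label), where the label depends on the feature.
    Stands in for a generated sentence and its target classification."""
    x = random.random()
    return x, x > 0.5

def train(corpus):
    """Toy 'training': estimate the decision threshold from labeled data.
    (A hypothetical stand-in for training the network.)"""
    pos = [x for x, y in corpus if y]
    neg = [x for x, y in corpus if not y]
    return (min(pos) + max(neg)) / 2 if pos and neg else 0.5

def error_rate(threshold, test_set):
    """Percentage of test items the learned threshold misclassifies."""
    wrong = sum((x > threshold) != y for x, y in test_set)
    return 100.0 * wrong / len(test_set)

test_set = [make_sentence() for _ in range(5000)]
for n in [100, 500, 1000, 2000, 4000]:
    corpus = [make_sentence() for _ in range(n)]
    print(n, round(error_rate(train(corpus), test_set), 2))
```

The printed error typically falls as the training sample grows, which is the qualitative behaviour the experiment probes; the absolute numbers are meaningless for CLASPnet itself.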
The curves show the percentages of missed errors for each of the 17 output units, at a tolerance of 0.2. As one might have expected, the percentages tend to drop as the size of the training corpus increases. (There is one strange exception: the best score for the Polarity unit is achieved by the network trained on only 1,000 sentences -- I have no ready explanation for this.) Especially with regard to the clausal type units, these results strongly suggest that the performance of CLASPnet would have improved further, had training on larger corpora been possible.
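Assuming that a "missed error" here means an output unit whose activation deviates from its target by more than the tolerance (an assumption on my part; only the tolerance value of 0.2 is given above), the per-unit percentages could be computed as in the following sketch. The function name and array shapes are hypothetical.

```python
import numpy as np

TOLERANCE = 0.2  # the tolerance used in the experiment

def missed_error_percentages(outputs, targets, tolerance=TOLERANCE):
    """Per-unit percentage of test items whose activation deviates from
    the target by more than the tolerance.

    outputs, targets: arrays of shape (n_items, n_units), e.g. 17 units.
    Returns one percentage per output unit.
    """
    outputs = np.asarray(outputs, dtype=float)
    targets = np.asarray(targets, dtype=float)
    misses = np.abs(outputs - targets) > tolerance  # boolean, per item/unit
    return 100.0 * misses.mean(axis=0)              # percentage per unit
```

For example, with targets of 0.0 and 1.0, activations of 0.1 and 0.9 fall within the tolerance while 0.5 against a target of 0.0 counts as a miss.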