2.4 Cognitive Science -- Mixing the Bits
The fourth type of neural network model differs from the previous ones in that it rarely has as definite a goal as the other types. Most of the connectionist models of language fall under this category, so I will not even attempt to short-list the more noteworthy ones. Instead, I will limit my attention to the one model which has probably been most influential in my decision to develop CLASPnet: Jeffrey Elman's (1992, 1993, 1995) seminal work on how connectionist nets can learn to process syntactically complex sentences. Part of Elman's inspiration came from a claim made by Chomsky in 1957, namely that no Finite State Automaton would ever provide much insight into how our knowledge of language might be organized in the brain. FSAs, Chomsky argued, cannot account for our ability to construct infinitely recursive center-embedded clauses; and for those recursive sentences which they can describe, the descriptions will be too complex to be of any use.
Against this claim, Elman pitted a recurrent network with multiple hidden layers -- see Figure 2.11. The task of the network during each cycle was to predict the next word in a sentence. The input sentences were presented one word at a time, and each word was a pattern of 1's and 0's on the 26 input units (no meaning was connected to these units). Example input sentences were 'boys who chase dogs see girls' and 'dogs see boys who cats who mary feeds chase'. Such sentences abstract away from real English, but still contain important aspects of English 'syntax' like subject-verb agreement, verb argument structure, and relative clauses. The network also had to find out for itself what the forms of the words were, and into which lexical categories they had to be divided.
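The prediction cycle described above can be illustrated with a minimal sketch of a simple recurrent network. It is not Elman's actual architecture: it assumes a single hidden layer with a copy-back context layer, a one-unit-per-word input code, a softmax output, and untrained random weights. The names `srn_step`, `one_hot`, and `N_HIDDEN`, as well as the word indices, are hypothetical and chosen only for illustration.

```python
import math
import random

N_WORDS = 26    # input and output units, as stated in the text
N_HIDDEN = 10   # assumption: illustrative hidden-layer size


def make_matrix(rows, cols, rng):
    """Small random weights; the initialisation range is an assumption."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Turn raw output scores into a probability distribution over words."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot(index, size=N_WORDS):
    """Assumed encoding: one unit switched on per word, carrying no meaning."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def srn_step(word, context, w_in, w_ctx, w_out):
    """One cycle: combine the current word with the context, predict the next word."""
    net = [a + b for a, b in zip(matvec(w_in, word), matvec(w_ctx, context))]
    hidden = [math.tanh(x) for x in net]
    prediction = softmax(matvec(w_out, hidden))
    return prediction, hidden  # the hidden state is copied back as the next context


rng = random.Random(0)
w_in = make_matrix(N_HIDDEN, N_WORDS, rng)
w_ctx = make_matrix(N_HIDDEN, N_HIDDEN, rng)
w_out = make_matrix(N_WORDS, N_HIDDEN, rng)

context = [0.0] * N_HIDDEN
for word_index in [3, 17, 5]:  # hypothetical indices for a three-word sentence
    prediction, context = srn_step(one_hot(word_index), context, w_in, w_ctx, w_out)
```

After each word, `prediction` holds the network's expectation for the next word, and `context` carries a compressed trace of everything seen so far. Training (omitted here) would adjust the weights so that the prediction matches the word that actually follows.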
Before I discuss the results, it is necessary to mention that Elman (1993) was also concerned with the issue of nativism (cf. 1.2.2). Generative grammar has always defended a strong nativist position on the grounds that, one, no natural language can be learned through positive language experience alone and, two, there is no (conclusive) evidence that children receive direct negative evidence. Chomskyans have consequently attributed a rich Universal Grammar to the child, while others have sought indirect negative evidence. Elman, however, was looking for a different way out: if the brains and minds of children are still developing while they learn their mother tongue, then might their changing capacity not be (partly) responsible for their ability to learn something as convoluted as a natural language? The development of the mind and brain might then correspond to what is often called the 'critical period' in language learning.
In a first experiment, Elman presented the network with all the possible sentences of his simple grammar mixed up: sentences with no embedded structures and very complex sentences alternated randomly. The results, however, were quite disappointing. Following this failure, Elman tried out two incremental strategies. He first opted for an incremental input: in a first phase, only simple sentences were presented to the network; the second phase contained a few complex sentences; and so on, up to a fourth (1992) or fifth (1993) phase in which the simple and complex sentences occurred together. Elman (1992: 154) writes: "when the network was permitted to focus on the simpler data first, it was able to learn the task quickly and then move on successfully to more complex patterns. The important aspect to this was that the earlier training constrained later learning in a useful way; it forced the network to focus on canonical versions of the problems, which apparently created a good basis for then solving the more difficult forms of the same problem." Figure 2.12 shows the expectations of the network after each word in the sentence 'boys who mary chases feed cats'. The subject-verb agreement, the arguments required by the verb, and the embedded sentence were all treated correctly.
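The incremental-input regime can be sketched as a sequence of training phases in which the share of complex sentences grows. The sketch below is illustrative only: the shares in `COMPLEX_SHARE`, the phase size, the toy corpus, and the helper names `is_complex` and `build_phase` are assumptions, not values or procedures reported above.

```python
import random

# Assumption: share of complex sentences per phase. The text only says that
# the share grows from none in the first phase to a full mix in the last.
COMPLEX_SHARE = [0.0, 0.25, 0.5, 0.75]


def is_complex(sentence):
    """Treat any sentence containing a relative clause ('who') as complex."""
    return "who" in sentence.split()


def build_phase(simple, complex_, share, size, rng):
    """Sample one training phase with the requested share of complex sentences."""
    n_complex = round(share * size)
    phase = [rng.choice(complex_) for _ in range(n_complex)]
    phase += [rng.choice(simple) for _ in range(size - n_complex)]
    rng.shuffle(phase)  # simple and complex sentences alternate randomly
    return phase


# Toy corpus for illustration; a real run would sample from the full grammar.
corpus = [
    "boy chases boy",
    "boys see girls",
    "boys who chase dogs see girls",
    "boy chases boy who chases boy",
]
simple = [s for s in corpus if not is_complex(s)]
complex_ = [s for s in corpus if is_complex(s)]

rng = random.Random(0)
phases = [build_phase(simple, complex_, share, 8, rng) for share in COMPLEX_SHARE]
```

The network would be trained on `phases[0]` first and on each later phase in turn, so that what it learns from the simple sentences is already in place when the embedded clauses start to appear.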
It is also interesting to compare how the network handled the following sentences:
(a) 'boy chases boy';
(b) 'boy chases boy who chases boy';
(c) 'boy who chases boy chases boy';
(d) 'boy chases boy who chases boy who chases boy'.
Figure 2.13 shows the trajectories in a state space which the network went through when encountering these sentences. (Note 6) What is particularly striking in the diagram for (d) is the fact that the network distinguished between the two outwardly identical instances of 'boy who chases boy'. Although the network settled in states that were close to each other, they were not identical, and thus carried information about the previous states the network had gone through. From a traditional linguistic perspective, such a distinction is at best superfluous, and at worst plainly wrong. But then, much of traditional linguistics considers syntax as a level separate from semantics. If that assumption is abandoned, as is done by Elman, the value of such a representation shines through, for it is exactly when a semantic interpretation has to be made that detailed information about the context proves valuable: at any time all the knowledge of what has been read or heard in the recent past is instantly available. Elman (1992, 1995) has remarked that this extreme context sensitivity of the network is very similar to Langacker's (1987) notion of 'accommodation' in cognitive grammar.
One final remark about this network is in order: it did not fare well on sentences with more than three levels of embedding, and sentences with center-embedded relative clauses proved more difficult for it than sentences with right-branching structures (cf. 3.3.4.2). Both problems could be alleviated to some extent by providing the network with more hidden units. But once sentences with, say, five or six levels of embedding can be treated correctly by such a network, one should ask oneself whether we really need anything more -- and, especially, whether we really need a grammar that can account for an infinite level of recursion. Connectionism offers only the former, not the latter, but as Elman (1992: 168) writes: "The finite precision and tendency to degrade over time are, in fact, consistent with the observed abilities of language users."
There is, however, one fundamental problem with Elman's first experiment: because it manipulated the input data in an unrealistic way, it was fatally flawed. Elman (1993: 78) provided a solution: "If it is not true that the child's environment changes radically [...], what is true is that the child changes during the period he or she is learning language." In his second experiment, he simulated the gradually increasing memory and attention span of children by slowly increasing the number of consecutive words the network was sensitive to: in a first phase, the recurrent feedback of the context layer was dropped after the third or fourth word; in the second phase, after the fourth or fifth word; and so on, until in the last phase the feedback was left undisturbed. Elman found that it took the network about twice as long to reach the same level of performance as when it only saw simple sentences in the first experiment. However, once this stage had been passed, the new network quickly learned to treat all sentences correctly and achieved the same level of performance as the network in the first experiment. In essence, the limited attention span of the network in its early stages protected it from getting confused by the long and complex sentences which it also saw. The fact that part of those long sentences turned into noise for the early network was actually a positive factor, because the increased variability slowed down learning and kept the network in a state of flux until it had seen sufficient data to make reasonable approximations of the true generalization. The critical period in language acquisition would then correspond to the period in which the child's limitations enable her to grasp the fundamentals of her native tongue.
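The growing memory span in this second experiment can be pictured as a schedule that decides after how many words the context layer is wiped. The following is a minimal sketch under stated assumptions: the function `reset_positions`, the list `SCHEDULE`, and the stream length of 20 words are hypothetical, and only the first two windows and the final undisturbed phase are taken from the description above; the intermediate phases are left out.

```python
import random


def reset_positions(n_words, window, rng):
    """Return the word counts after which the context layer is wiped.

    window is (low, high): the context is cleared after every `low` to `high`
    words, chosen at random. window=None leaves the feedback undisturbed.
    """
    if window is None:
        return []
    positions = []
    position = 0
    while True:
        position += rng.randint(*window)  # randint includes both endpoints
        if position >= n_words:
            return positions
        positions.append(position)


# First two windows follow the text; None stands for the final phase.
# Assumption: the intermediate phases are omitted from this sketch.
SCHEDULE = [(3, 4), (4, 5), None]

rng = random.Random(0)
for window in SCHEDULE:
    print(window, reset_positions(20, window, rng))
```

During training, the context units would be reset to their initial values at each returned position, so that in the early phases the network can only pick up dependencies that span three or four words, and the tails of longer sentences effectively reach it as noise.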
The influence of Elman's work on CLASPnet has been profound. Not only have I made use of the type of network which Elman has developed, but the more general belief that natural language 'syntax' can be studied successfully with neural networks is also due to him. In other areas I have tried to improve on Elman's model: the corpora for CLASPnet were based on a more complex grammar, and contained more words; the words, in turn, had an orthographic and semantic representation rather than a random one.