2.1 Computer Science -- Network Basics
Papers by computer scientists about language and neural networks tend to be something of a shock for linguists: the computer scientists are usually less interested in language than in the properties of their networks. Nonetheless, useful information can often be garnered from such papers: for example, Lawrence, Giles & Fong (1995) have carefully compared different network architectures and have come to the conclusion that Elman-type recurrent neural networks are best suited to the analysis of natural language (see below); Sperduti & Starita (1995), for their part, have proposed a new type of "complex recursive neuron" which can make many types of networks more suitable for the processing of structured input sequences, e.g. language.
In this section, however, I will take a closer look at the model reported on by Paul Rodriguez (1995) in a paper entitled "Representing the Structure of a Simple Context-Free Language in a Recurrent Neural Network: A Dynamical Systems Approach". Rodriguez's aims were to find out to what extent a small recurrent neural network could learn to process a very simple language, and then to analyze how the network had managed this task. The architecture of the net used was based on Elman's well known simple recurrent network (Elman 1990; Servan-Schreiber et al. 1989). Compare these two networks:
Network A is a standard 'feed-forward' network: information is presented on the input units, and flows via the connections to the hidden units, and then on to the output units. This type of network is suitable for learning to associate specific input patterns with specific output patterns: e.g. a pattern recognition task in which the input units represent a satellite image and the output units indicate whether the image contains a tank. Network A, however, will be at a loss if it is asked to associate the same input pattern with different output patterns depending on which other pattern(s) preceded the current pattern. There are no connections in Network A which could store this information: the only pieces of information which the network can take into account are the current pattern (as presented on the input units) and the probability of each input pattern being associated with each output pattern (as stored in the connections between the units). (Note 1)
Now consider Network B. It too can learn to associate specific input patterns with specific output patterns. But the extra recurrent layer enables it to do more: after each input pattern has been presented, the pattern of activation in the hidden layer is copied to the recurrent layer; when the next input pattern is presented, the hidden units receive information not only from the input units, but also from the recurrent units. In this manner, the hidden units will end up with a pattern of activation which represents both the current input pattern and the previous input pattern. This combined information is copied to the recurrent units again for when the next input pattern comes along. So, Network B can slowly build up a representation of the context in which each input pattern occurs. On the basis of this information, it can learn to react to the very same input pattern in many different ways. Natural language processing, of course, is a very good example of a task in which context information is extremely important: recognizing the /t/ in 'net' partly depends on having heard or seen the previous phonemes or sounds; similarly, disambiguating between the different meanings of 'net' in 'this fishing net is of very good quality', 'the semantics net has trained very well' and 'cyberspace and the net have grown exponentially over the past five years' depends on being able to store a number of words in memory while processing the sentences. It is for this reason that recurrent neural networks like B are said to have a 'memory' too.
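The copy-back mechanism of Network B can be made concrete with a small sketch. The code below is only an illustration under my own assumptions: the class name SimpleRecurrentNet, the layer sizes, the sigmoid activation and the random, untrained weights are not taken from Elman or Rodriguez, and the bias unit is left out. It shows nothing more than the forward pass, i.e. how the hidden activations are copied to the recurrent layer and fed back in at the next step.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class SimpleRecurrentNet:
    """Forward pass of an Elman-style net; sizes and weights are illustrative."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)

        def matrix(rows, cols):
            return [[rng.uniform(-1.0, 1.0) for _ in range(cols)]
                    for _ in range(rows)]

        self.w_in = matrix(n_hidden, n_in)            # input units -> hidden units
        self.w_context = matrix(n_hidden, n_hidden)   # recurrent units -> hidden units
        self.w_out = matrix(n_out, n_hidden)          # hidden units -> output units
        self.context = [0.0] * n_hidden               # recurrent layer starts empty

    def reset(self):
        self.context = [0.0] * len(self.context)

    def step(self, pattern):
        hidden = []
        for w_i, w_c in zip(self.w_in, self.w_context):
            net_input = sum(w * x for w, x in zip(w_i, pattern))
            net_input += sum(w * c for w, c in zip(w_c, self.context))
            hidden.append(sigmoid(net_input))
        output = [sigmoid(sum(w * h for w, h in zip(row, hidden)))
                  for row in self.w_out]
        self.context = hidden[:]   # copy the hidden pattern to the recurrent layer
        return output


def run(net, patterns):
    """Present a sequence of patterns and return the output for the last one."""
    net.reset()
    output = None
    for pattern in patterns:
        output = net.step(pattern)
    return output


A, B = [1.0, 0.0], [0.0, 1.0]      # hypothetical one-unit-per-letter coding
net = SimpleRecurrentNet(n_in=2, n_hidden=2, n_out=2)
after_a = run(net, [A, A])         # pattern A preceded by A
after_b = run(net, [B, A])         # the same pattern A preceded by B
print(after_a != after_b)
```

The final pattern is identical in both runs, yet the outputs should differ because the recurrent layer holds a different hidden pattern in each case. If the weights in w_context are all set to zero, the net behaves like Network A and the two outputs coincide.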
The network employed in Rodriguez's paper had the following architecture:
(The role of the bias unit is to send a constant stream of information to the hidden units. In this manner, it is easy to prevent the output units ever becoming active 'by accident', i.e. when none of the input units is active.)
The context-free language which the network was trained to process was based on these two rules: S → aSb and S → ab. Valid strings were therefore 'ab', 'aabb', 'aaabbb', etc. Only strings with lengths of up to 22 letters were presented to the network during training. The task of the network was to predict on the output layer which letter would occur next: as the network could never know when a series of consecutive a's would stop, the task boiled down to predicting the correct number of b's once the net had seen the first b. This may appear trivial, but Rodriguez wanted to show that a recurrent neural network could behave like a traditional symbolic Push-Down Automaton (i.e. a Finite-State Machine with a stack). The relevance of all this for linguistics is that center-embedded constructions in natural languages (e.g. 'the man which the cat which was sleeping loved petted it') can be formally represented by strings like aaabbb. So, if a neural network cannot learn to process simple strings of this nature, then it seems unlikely to be able to parse real sentences. The number of embedded clauses which a network could be trained to remember could also be compared with the number of embedded clauses humans can process correctly.
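The language and the prediction task can be spelled out symbolically. The following sketch is mine, not Rodriguez's: the function names generate, is_valid and predict_next are hypothetical, and the recognizer is simply the textbook stack procedure against which the network's behavior is being compared. It derives the strings from the two rules, checks them with a stack, and lists which letters may legitimately come next after a given prefix.

```python
def generate(n):
    """Derive a^n b^n: apply S -> aSb (n - 1) times, then S -> ab."""
    if n < 1:
        raise ValueError("n must be at least 1")
    string = "S"
    for _ in range(n - 1):
        string = string.replace("S", "aSb")
    return string.replace("S", "ab")


def is_valid(string):
    """Push-down style check: push for every a, pop for every b."""
    stack = []
    seen_b = False
    for letter in string:
        if letter == "a":
            if seen_b:              # an a after a b is not allowed
                return False
            stack.append("a")
        elif letter == "b":
            seen_b = True
            if not stack:           # more b's than a's
                return False
            stack.pop()
        else:
            return False
    return seen_b and not stack


def predict_next(prefix):
    """Letters that may follow a prefix of a single valid string."""
    if not prefix:
        return {"a"}                # every string starts with an a
    depth = prefix.count("a") - prefix.count("b")
    if "b" not in prefix:
        return {"a", "b"}           # the run of a's may stop at any time
    if depth > 0:
        return {"b"}                # unmatched a's remain
    return {"a"}                    # string complete: the next string begins


# Strings of up to 22 letters, as in the training set described above.
training_set = [generate(n) for n in range(1, 12)]
print(training_set[:3])
print(predict_next("aaab"), predict_next("aaabbb"))
```

Only the predictions after the first b are deterministic, which is why the task reduces to counting b's against the a's already seen.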
Rodriguez found that the simple recurrent network did not always learn the task equally well: in one case, the net could only be taught to react correctly to strings of up to 16 letters (i.e. fewer than were present in the training set); in the second case, however, the trained net managed to process strings of up to 32 letters correctly (i.e. more than in the training set). (Note 2) A graphical comparison of how the two cases differ is shown in Figure 2.3.
The figures show the trajectories through the 'hidden unit space' for the string 'aaabbb': the two dimensions of the figures correspond to the two hidden units, so the activation values of the hidden units can be plotted as points in the figures; the arrows link the consecutive points. In the first case, the network used the horizontal dimension to represent the number of a's it had seen, and the vertical dimension for the number of b's. When it knew it had seen the last b, it correctly predicted the first a of the next string (i.e. the large diagonal arrow shows that the network goes back to its initial position). In the second case, a different type of representation was developed by the net: the apparently chaotic movements were centered around two points in the hidden unit space. Rodriguez found that the movements around the two centers were coordinated to a very high degree, so that the net seemed able to 'deduce' from its initial position in the b side how many moves it had to make there before returning to the a side in anticipation of the next string. It was the net with this second type of representation which was able to process longer strings than it had seen during training.
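To give an intuition for how coordinated movements around two centers could amount to counting, consider the following toy calculation. It is purely hypothetical: the function b_moves_needed, the single 'distance' variable and the rates 0.5 and 2.0 are my own assumptions, not values or equations reported by Rodriguez. The sketch assumes that each a shrinks the distance to the center on the a side, that this distance is carried over to the b side at the first b, and that each b then enlarges it again until the starting distance is regained.

```python
def b_moves_needed(n_a, contraction=0.5, expansion=2.0):
    """Count the b-side moves needed to undo n_a a-side moves (toy model)."""
    distance = 1.0
    for _ in range(n_a):
        distance *= contraction     # each a moves the state closer to the a center
    # At the first b the state is assumed to jump to the b side, keeping
    # the distance it had reached; each b then pushes it outwards again.
    moves = 0
    while distance < 1.0:
        distance *= expansion
        moves += 1
    return moves


for n in (1, 3, 8):
    print(n, b_moves_needed(n))
```

With these matched rates the number of b-side moves equals the number of a's for any string length, including lengths never seen before, which is the kind of generalization described above. If the two rates are not coordinated (e.g. an expansion of 3.0), the count comes out wrong.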
The linguistic conclusion to be drawn from the Rodriguez (1995) model is that even very small neural networks are capable of storing information about the level of embedding of an input string. In 3.3.4.2 below, though, I will present results that show that CLASPnet has more difficulties with correctly analyzing multiple levels of embedded clauses.