3.1.1.1 The Orthographic Representation
The orthographic representation of the words uses a pattern of five bits to represent each letter, which gives 2^5 = 32 possible patterns. Of these 32 patterns, 26 are used to represent the letters of the Roman alphabet, and 5 are used to represent punctuation marks. The all-zeroes pattern is not used at all (see Table 3.1).
Table 3.1

00000 (unused)   00010 h   00001 p   00011 x
10000 a          10010 i   10001 q   10011 y
01000 b          01010 j   01001 r   01011 z
11000 c          11010 k   11001 s   11011 ,
00100 d          00110 l   00101 t   00111 .
10100 e          10110 m   10101 u   10111 ?
01100 f          01110 n   01101 v   01111 !
11100 g          11110 o   11101 w   11111 '
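The patterns in Table 3.1 can be read as the position of each character in the list (a = 1, ..., z = 26, then the five punctuation marks) written as a five-bit binary number with the least significant bit first. The following Python sketch illustrates that reading of the table. The names SYMBOLS and encode_symbol are hypothetical and chosen for illustration only; they are not taken from CLASPnet.

```python
# Characters in the order of Table 3.1. The all-zeroes pattern (index 0)
# is unused, so the first character, 'a', gets the number 1.
SYMBOLS = "abcdefghijklmnopqrstuvwxyz,.?!'"


def encode_symbol(ch: str) -> str:
    """Return the five-bit pattern for one character (hypothetical helper).

    Upper case is folded to lower case, since the representation makes no
    distinction between the two. Unsupported characters raise ValueError.
    """
    n = SYMBOLS.index(ch.lower()) + 1  # a -> 1, ..., ' -> 31
    # Write n in binary, least significant bit first.
    return "".join(str((n >> bit) & 1) for bit in range(5))


print(encode_symbol("a"))  # 10000
print(encode_symbol("h"))  # 00010
print(encode_symbol("z"))  # 01011
```

A character outside the table, such as a parenthesis, makes SYMBOLS.index raise a ValueError in this sketch, which mirrors the fact that such characters cannot be presented to the network.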
I have chosen binary patterns because they provide the most compact way of storing orthographic representations. This in turn keeps the total number of connections down, and since each connection has to be updated every time a word is presented to the network, the number of connections determines how much time is needed to train the network. (Note 3)
From Table 3.1 it also becomes clear that no distinction is made between lower case and upper case, and that some characters (e.g. parentheses, quotation marks) cannot currently be presented to the network. Adding a sixth bit for each letter would give an extra 32 possible combinations, however, so this extension would be easy to implement. A second disadvantage of the current representation is that it is somewhat misleading for the network: the representations for 'a' and 'i' differ in only one bit, whereas the patterns for 'a' and 'z' differ in four bits. The network has to learn to ignore these apparent (dis)similarities.
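The bit counts mentioned here can be checked with a small Python sketch that compares two patterns position by position. It assumes that the patterns of Table 3.1 are binary numbers written with the least significant bit first; the helper names encode_symbol and bit_difference are hypothetical and serve only as illustration.

```python
# Characters in the order of Table 3.1 (the all-zeroes pattern is unused).
SYMBOLS = "abcdefghijklmnopqrstuvwxyz,.?!'"


def encode_symbol(ch: str) -> str:
    """Five-bit pattern of a character, least significant bit first."""
    n = SYMBOLS.index(ch.lower()) + 1
    return "".join(str((n >> bit) & 1) for bit in range(5))


def bit_difference(x: str, y: str) -> int:
    """Number of bit positions in which the patterns of x and y differ."""
    return sum(a != b for a, b in zip(encode_symbol(x), encode_symbol(y)))


print(bit_difference("a", "i"))  # 1  (10000 vs. 10010)
print(bit_difference("a", "z"))  # 4  (10000 vs. 01011)

# A sixth bit would double the number of patterns from 32 to 64.
print(2**6 - 2**5)  # 32
```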
As there are five bits per letter, and a total of 60 units, it can readily be deduced that up to twelve letters can be represented for each word: e.g. 'c,a,t,-,-,-,-,-,-,-,-,-' or 'r,e,p,u,b,l,i,c,-,-,-,-'. (Note 4) The actual type of representation used is, however, more complex than that. In order to give the network useful information both about the beginnings of words and about their endings, I have found it useful to present both left-justified and right-justified representations: the left-justified part is seven letters long, the right-justified part five letters long. Table 3.2 shows some examples of the kind of orthographic representations that I have used:
Table 3.2

word              left-justified   right-justified
a                 a,-,-,-,-,-,-    -,-,-,-,a
be                b,e,-,-,-,-,-    -,-,-,b,e
cat               c,a,t,-,-,-,-    -,-,c,a,t
cats              c,a,t,s,-,-,-    -,c,a,t,s
baker             b,a,k,e,r,-,-    b,a,k,e,r
tigers            t,i,g,e,r,s,-    i,g,e,r,s
working           w,o,r,k,i,n,g    r,k,i,n,g
sleeping          s,l,e,e,p,i,n    e,p,i,n,g
...
interesting       i,n,t,e,r,e,s    s,t,i,n,g
...
antepenultimate   a,n,t,e,p,e,n    i,m,a,t,e
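The construction behind Table 3.2 can be sketched in a few lines of Python: the first seven letters are padded on the right, the last five letters are padded on the left. This is only an illustration of the scheme described above, not the code used for CLASPnet; the function name justify_word and its parameters are hypothetical, and '-' simply marks an empty letter position, as in the table.

```python
def justify_word(word: str, left: int = 7, right: int = 5, pad: str = "-"):
    """Split a word into a left-justified and a right-justified letter list.

    Hypothetical helper: keeps the first `left` letters (padded on the right)
    and the last `right` letters (padded on the left).
    """
    left_part = list(word[:left].ljust(left, pad))
    right_part = list(word[-right:].rjust(right, pad))
    return left_part, right_part


for w in ["a", "tigers", "sleeping", "antepenultimate"]:
    l, r = justify_word(w)
    print(w, ",".join(l), ",".join(r))

# Twelve letter positions of five bits each account for the 60 units.
print((7 + 5) * 5)  # 60
```

Note that for words of up to five letters the two parts contain the same letters, while for words longer than twelve letters the medial letters are not represented at all.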
Whether such representations are psychologically plausible is not very clear. The sound stream of spoken English certainly does not have stressed beginnings and endings for every word; as far as written English is concerned, the data are more ambiguous. People read about 8 or 9 characters at a time (with the focus on the central letter, or somewhat to the left of center in longer words) before saccading to the next batch of letters. So-called function words and spaces are often not focused on at all. Still, as soon as the spaces are removed or misplaced, reading becomes a lot more difficult (Brady 1981). Other psycholinguistic research has concluded that in recall tasks the initial letters of words are remembered best, the final letters next best, and the medial letters worst (Morrison & Inhoff 1981). And in a language like English, which has both prefixes and suffixes, both are detected very easily (Bergman 1988). Hence, while it is likely that the particular mechanism used to simulate these effects in CLASPnet is implausible, its use can still be defended in the context of this model. (In order to find out to what extent the network benefits from left-justified, right-justified, and left-and-right-justified representations, an experiment compared the three. The results are presented in 3.4.5.)