3.2.3 The Parser
Generating the input sentences is only half of what is needed to create the training and test corpora: each input pattern also needs its desired output pattern. To determine the correct value for each of the 17 output units, CLASPnet makes use of a simple parser. The parser can be simple because most of the important information about each clause is generated along with the words of the clause: the rewriting rules of the grammar which generate clauses also contain SGML-ish templates similar to the following one: <DE_VO=A_PO=P_M> ... </DE>. (This particular template describes a declarative clause (DE), with active voice (VO=A), positive polarity (PO=P), and matrix-clause status (M).) Figure 3.4 shows some of these enriched rewriting rules (Note 14):
c1 -> <DE_VO=A_PO=P_M> np1_hum_s vt_hum np1_hum_o </DE> period
c1 -> <DE_VO=A_PO=N_M> np1_hum_s negp be vting_hum np_hum_o </DE> period
b2 -> <BD_VO=A_PO=N_M> np_s vben too adj </BD> scon4 period
scon4 -> <CT_VO=A_PO=P_S> in order to vpsp </CO>
scon4 -> <CT_VO=A_PO=N_S> in order not to vpsp </CO>
s7 -> <DE_VO=A_PO=N_M> np1_hum_s negp vsay </DE> bsay1 period
s8 -> <DE_VO=A_PO=P_M> np1_hum_s aux vask </DE> sask4 period
sask5 -> <WH_VO=A_PO=N_S> w2 aux np1_s neg vt </WH>
bsay1 -> <BT_VO=A_PO=P_S> that thisp vbep adj </BT>
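The four fields of an opening template can be read off mechanically. The following Python fragment is a minimal sketch of that step, not the parser used for CLASPnet (which is part of the Perl program in Appendix A.3); the names TEMPLATE and parse_template are hypothetical, and the field layout is assumed to be exactly the one seen in the templates quoted above.

```python
import re

# Hypothetical helper: assumes every opening template has the shape
# <TYPE_VO=x_PO=y_z>, as in the rewriting rules quoted in the text.
TEMPLATE = re.compile(r"<([A-Z]{2})_VO=([AP])_PO=([PN])_([MS])>")

def parse_template(tag):
    """Split an opening template such as <DE_VO=A_PO=P_M> into its fields."""
    match = TEMPLATE.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a clause template: {tag!r}")
    clause_type, voice, polarity, status = match.groups()
    return {
        "type": clause_type,                                   # DE, BD, WH, RE, ...
        "voice": "active" if voice == "A" else "passive",      # VO=A / VO=P
        "polarity": "positive" if polarity == "P" else "negative",  # PO=P / PO=N
        "status": "matrix" if status == "M" else "subordinate",     # M / S
    }
```

For instance, parse_template("<DE_VO=A_PO=P_M>") yields a declarative, active, positive matrix clause, which matches the description of that template given in the text.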
Hence, the sentences which are fed to the parser look more like this: '<WH_VO=A_PO=P_M> what teacher shall the great sets of tigers , <RE_VO=A_PO=P_S> which drink </RE> , wash near the roof </WH> ?', '<BD_VO=A_PO=P_M> the large sets of experts should be the right scientist of scotland </BD> . and <DE_VO=A_PO=P_M> the tigers shall move behind the palace </DE> .', or '<TM_VO=A_PO=P_M> the more a child , <RE_VO=A_PO=N_S> who do not work </RE> , sleep </TM> , <TM_VO=P_PO=P_M> the more the dogs be teased </TM> .' The parser strips these templates from an input sentence, and analyzes them to find out the desired output values for the words of each clause. In addition, the parser also looks more closely at the entire sentence (to determine the mood), and at the clauses (to determine their ID number). The combination of both techniques suffices to construct a list containing each element of the sentence, followed by its desired output pattern -- Figure 3.5 shows an example sentence in this format (Note 15):
what 0 0 1 - 1 0 0 - 1 - 0 0 1 0 0 0 0 - 0 - 1 - 1 teacher 0 0 1 - 1 0 0 - 1 - 0 0 1 0 0 0 0 - 0 - 1 - 1 shall 0 0 1 - 1 0 0 - 1 - 0 0 1 0 0 0 0 - 0 - 1 - 1 the 0 0 1 - 1 0 0 - 1 - 0 0 1 0 0 0 0 - 0 - 1 - 1 great 0 0 1 - 1 0 0 - 1 - 0 0 1 0 0 0 0 - 0 - 1 - 1 sets 0 0 1 - 1 0 0 - 1 - 0 0 1 0 0 0 0 - 0 - 1 - 1 of 0 0 1 - 1 0 0 - 1 - 0 0 1 0 0 0 0 - 0 - 1 - 1 tigers 0 0 1 - 1 0 0 - 1 - 0 0 1 0 0 0 0 - 0 - 1 - 1 , 0 0 0 - 0 0 0 - 0 - 0 0 0 0 0 0 0 - 0 - 0 - 0 which 0 0 1 - 0 1 0 - 0 - 0 0 0 0 1 0 0 - 0 - 1 - 1 drink 0 0 1 - 0 1 0 - 0 - 0 0 0 0 1 0 0 - 0 - 1 - 1 , 0 0 0 - 0 0 0 - 0 - 0 0 0 0 0 0 0 - 0 - 0 - 0 wash 0 0 1 - 1 0 0 - 1 - 0 0 1 0 0 0 0 - 0 - 1 - 1 near 0 0 1 - 1 0 0 - 1 - 0 0 1 0 0 0 0 - 0 - 1 - 1 the 0 0 1 - 1 0 0 - 1 - 0 0 1 0 0 0 0 - 0 - 1 - 1 roof 0 0 1 - 1 0 0 - 1 - 0 0 1 0 0 0 0 - 0 - 1 - 1 ? 0 0 0 - 0 0 0 - 0 - 0 0 0 0 0 0 0 - 0 - 0 - 0 |
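The template-stripping step can be sketched with a simple stack: an opening template pushes its clause attributes, a closing tag pops them, and every word in between is paired with the innermost open clause. The Python fragment below is only an illustration under stated assumptions, not the actual CLASPnet parser: the names OPEN, CLOSE, PUNCTUATION and tag_words are hypothetical, tokens are assumed to be separated by whitespace as in the example sentences, and words standing outside any template (such as a coordinating 'and') are simply left untagged. It does not compute the 17 output values, the mood, or the clause ID numbers.

```python
import re

# Hypothetical patterns, modelled on the templates shown in the text.
OPEN = re.compile(r"<([A-Z]{2})_VO=([AP])_PO=([PN])_([MS])>")
CLOSE = re.compile(r"</[A-Z]{2}>")
PUNCTUATION = {",", ".", "?", "!"}

def tag_words(annotated):
    """Strip the templates and pair each word with its innermost clause."""
    stack = []    # attributes of the clauses that are currently open
    tagged = []   # (token, attributes or None) in sentence order
    for token in annotated.split():
        opened = OPEN.fullmatch(token)
        if opened:
            stack.append(opened.groups())
        elif CLOSE.fullmatch(token):
            # Pop without checking the tag name: some grammar rules close
            # a <CT_...> template with </CO>.
            stack.pop()
        elif token in PUNCTUATION or not stack:
            # Punctuation gets no clause attributes (its desired output is
            # all zeros); treating stray words the same way is an assumption.
            tagged.append((token, None))
        else:
            tagged.append((token, stack[-1]))
    return tagged
```

Applied to the first example sentence, this sketch assigns 'which' and 'drink' to the embedded RE clause and 'wash near the roof' back to the enclosing WH clause, in line with the rows of Figure 3.5.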
During the final stage of corpus generation, each input word is replaced by its binary representation of 85 units: the values for the 60 orthographic units are computed on the fly, while the values for the 25 semantic units are read from a special vocabulary file.
onyx:~/zin$generate.input
+-+-+ Starting the CLASPnet English Sentence Generator +-+-+
+-+-+ Copyright 1995-1996. Ezra Van Everbroeck +-+-+
Use which grammar file (new name allowed)? demo
++ Made new grammar file: demo.grm
Use which vocabulary file (new name allowed) [demo.voc]?
++ Made new vocabulary file: demo.voc
Maximum number of words per sentence [20]?
Number of sentences to create [100]? 1000
++ Creating: . 100 . 200 . 300 . 400 . 500 . 600 . 700 . 800 . 900 . 1000
Use which name for sentences file [demo.sen]?
++ Saving results output file as well: demo.rot
Name for SNNS patterns file (will overwrite) [demo.pat]?
++ Created pattern file.
++ Writing summary file: demo.inf
++ Done.
onyx:~/zin$
The different stages mentioned above have been automated as much as possible with a Perl program (see Appendix A.3). Creating a new corpus is therefore very easy, and is illustrated in Figure 3.6.
All the training and test corpora for CLASPnet have been generated in this manner, except for the few natural language test samples which were manually tagged (see 3.3.4.6).
Note 14
In the following templates, BD refers to a declarative 'be' clause, RE to a relative clause, CT to a connector clause with a 'to' + infinitive, WH to a clause introduced by a WH-element, BT to a complement clause with 'be', and TM to a clause introduced by 'the more/fewer/less'. A VO=P means that the clause is in the passive voice; PO=N refers to negative polarity; and a final S means that it is a subordinate clause. The meaning of the non-terminal symbols in the rewriting rules can best be garnered from looking at the grammar in Appendix 1 -- still, c1, b2, s7 and s8 are different construction types for matrix clauses; scon4 is one type of subordinate connector clause, while sask5 and bsay1 are complement clauses of verbs of asking and saying, respectively. Most of the other symbols are self-explanatory, except perhaps for vt_hum (transitive verb only used with human NPs), negp (negation phrase, 'not' or 'never'), vben (negative verb phrase with 'be'), vpsp (simple verb phrase), and thisp (NP with a determiner like 'this' or 'those').
Note 15
The desired output value for the punctuation marks is a pattern of all zeros.