3.1.1.2 The Semantic Representation
As has been mentioned above, all the words in the training and testing
corpora have also been given a semantic representation. As it has been
my aim to follow cognitive linguistic tenets, these representations are
of a somewhat unusual nature: rather than using 'standard' semantic
features like [+HUMAN] or [+ABSTRACT], I have tried to limit myself to
strictly sensory and perceptual features. I have made a personal selection
of 25 semantic features. Each feature is represented by one input unit,
whose activation can in principle take any value between 0 and 1; in
practice, values have been assigned in steps of 0.1.
There are 14 features which are directly based on sensory
perceptions. They have a value of 0 if I consider the feature to be
irrelevant for the semantic definition of a word, or a value between
0.1 and 1 if I think it is relevant. As one might expect, there
are more units for those senses which I expect to be of more relevance
for most words. It is also worth mentioning that I have paid little
attention to 'objective' comparisons, which I deem unlikely to be of
much importance in most concepts: e.g. both 'barn' and
'Berlin' have received very high values for the unit representing
visual size, because both are much larger than humans -- the latter is
much larger than the former, but this knowledge is not readily available
from visual perception alone, if at all. Similarly, both
'aircraft' and 'shout' are considered to be very loud from
an auditory point of view, despite the fact that aircraft make more
noise than shouting people when measured in decibels.
- Visual perception:
- Primary color: 0.1 (black) to 1 (white) with dark hues (e.g.
gray) closer to black and lighter hues (e.g. pink) closer to white;
- Secondary color: same scale as the primary color;
- Size: 0.1 (very small) through 0.5 (adult human) to 1 (very
large);
- Rate of movement: 0.1 (very slow) through 0.5 (walking pace) to 1
(very fast);
- Fuzziness: 0.1 (unclear edges, e.g. fog) through 0.5 (e.g.
the sun) to 1 (clear edges); large physical structures usually have a
lower value on this unit than small ones, because large structures are
difficult to see completely at a single glance;
- Has_Face: 0.1 (no face at all) through 0.5 (some animals) to 1
(e.g. humans, puppets);
- Auditory perception:
- Pitch: 0.1 (very low pitched sounds) through
0.5 (adult human voice talking) to 1 (very high pitch);
- Loudness: 0.1 (very soft sounds) through 0.5 (normal
conversation) to 1 (very loud);
- Understandability: 0.1 (white noise) through 0.5 (e.g. sobbing,
crying) to 1 (humans talking);
- Tactile perception:
- Softness: 0.1 (very hard) through 0.5 (human skin) to 1 (very
soft);
- Smoothness: 0.1 (very rough) through 0.5 (human skin) to 1 (very
smooth);
- Graspability: 0.1 (e.g. water) through 0.5 (e.g. a car) to 1
(e.g. a spoon) -- another way of looking at this unit is that it codes how
easily one can literally manipulate the object;
- Olfactory perception:
- Strength of smell: 0.1 (very weak) through 0.5 to 1 (very strong);
- Gustatory perception:
- Strength of taste: 0.1 (very weak) through
0.5 to 1 (very strong).
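The coding rule for the 14 sensory units can be sketched in Python as
follows. This is only an illustration under stated assumptions: the
identifier names, the encode_sensory helper and the example values are
hypothetical and are not taken from CLASPnet or from Appendix A.2.

```python
# Hypothetical identifiers for the 14 sensory units listed above.
SENSORY_FEATURES = [
    "primary_color", "secondary_color", "size", "rate_of_movement",
    "fuzziness", "has_face",                      # visual
    "pitch", "loudness", "understandability",     # auditory
    "softness", "smoothness", "graspability",     # tactile
    "strength_of_smell",                          # olfactory
    "strength_of_taste",                          # gustatory
]


def on_grid(value):
    """Return True if value is one of 0.0, 0.1, ..., 1.0 (up to float rounding)."""
    tenths = value * 10
    return 0 <= round(tenths) <= 10 and abs(tenths - round(tenths)) < 1e-9


def encode_sensory(relevant):
    """Map {feature: value} to a 14-element list; unlisted features get 0 (irrelevant)."""
    unknown = set(relevant) - set(SENSORY_FEATURES)
    if unknown:
        raise ValueError(f"unknown sensory features: {sorted(unknown)}")
    vector = []
    for name in SENSORY_FEATURES:
        value = relevant.get(name, 0.0)
        if not on_grid(value):
            raise ValueError(f"{name}={value} is not a multiple of 0.1 in [0, 1]")
        vector.append(value)
    return vector


# Illustrative values only, not the values actually assigned to any word.
example = encode_sensory({"size": 0.9, "loudness": 0.2})
```

Storing only the relevant features and filling the rest with 0 mirrors the
convention that 0 marks a feature as irrelevant rather than as very weak.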
The next seven units are used to represent more Gestalt-like
properties of concepts: rather than focusing on a single type of
perception, they allow for the integration of different types of input.
Frankly, they are also a mixed bag, so giving concepts values for these
units has not been straightforward. Unlike the sensory units described
above, however, all words have been given a value of at least 0.1 for
these units (with 0.5 being the default value and 1 still being the
highest possible value). There are two reasons for this change of
policy: first, I believe that all concepts can be thought of as having
default values for duration, or the extent to which they are typically
repeated -- such knowledge seems fairly easy to abstract from everyday
experience involving these concepts; second, and more practically, most
'abstract' words in the corpus (e.g. 'the', 'by',
'federal' or 'and') would have had completely empty
semantic representations (i.e. all zeros) if they had also received zeros
for the Gestalt-like property units -- which would have given the
network the false impression that so-called function words carry
no meaning at all, rather than a very schematic meaning (cf.
Langacker 1987).
- Gestalt-like properties:
- Spatio-temporal range: this unit expresses the extent to which it
is possible to physically stay in or experience the concept, either because
the concept is large (and allows movement inside) or because it lasts --
values range from 0.1 (e.g. a flash) through 0.5 (e.g. a human) to 1 (e.g.
cities);
- Proximity: this unit indicates whether the concept is typically
physically close to the conceptualizer, or whether it is typically linked to
another concept -- hence, 0.1 is appropriate for fairly autonomous concepts,
while 1 suits many adjectives and prepositions (see Note 5);
- Up-down axis: 0.1 for concepts which are much closer to the
ground than 0.5 (adult humans), and up to 1 for concepts which are typically
very high;
- Front-back axis: 0.1 for concepts which happen behind one's back,
through 0.5 (no special orientation) to 1 (typically in front of
oneself);
- Homogeneity: 0.1 for concepts which are internally not uniform at
all, through 0.5 to 1 -- all plurals, for example, had lower values on this
unit than their corresponding singulars;
- Duration: this unit expresses whether the concept is subjectively
experienced as typically lasting for a very short (0.1), normal (0.5) or
very long (1) time -- hence, the objective difference between a century and
a millennium is not relevant for this unit;
- Repetition: 0.1 for concepts which never repeat, through 0.5 for
those which do so irregularly, to 1 for concepts which always include
repetition -- plurals, for example, had higher values for this unit than
their corresponding singulars.
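The default-value policy for the seven Gestalt-like units might be sketched
as below. The identifier names and the encode_gestalt helper are
assumptions made for illustration, and the example values are hypothetical;
only the default of 0.5 and the minimum of 0.1 come from the description
above.

```python
# Hypothetical identifiers for the seven Gestalt-like units.
GESTALT_FEATURES = [
    "spatio_temporal_range", "proximity", "up_down_axis",
    "front_back_axis", "homogeneity", "duration", "repetition",
]
GESTALT_DEFAULT = 0.5  # default value stated in the text
GESTALT_MINIMUM = 0.1  # no Gestalt-like unit is ever set to 0


def encode_gestalt(overrides=None):
    """Return seven values, using the default wherever no override is given."""
    overrides = overrides or {}
    unknown = set(overrides) - set(GESTALT_FEATURES)
    if unknown:
        raise ValueError(f"unknown Gestalt-like features: {sorted(unknown)}")
    vector = []
    for name in GESTALT_FEATURES:
        value = overrides.get(name, GESTALT_DEFAULT)
        if not GESTALT_MINIMUM <= value <= 1.0:
            raise ValueError(f"{name}={value} is outside [0.1, 1]")
        vector.append(value)
    return vector


# Illustrative only: a preposition-like word with high proximity and
# defaults everywhere else (not the values actually used for any word).
preposition_like = encode_gestalt({"proximity": 1.0})
```

Because every unit falls back to 0.5, a word with no sensory content at all
still receives a non-zero representation, which is the practical motivation
given above.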
The final four semantic units represent whether the concepts are linked
weakly or strongly to instinctive feelings like danger and pleasure. All
concepts have again been given a value between 0.1 and 1 for these
units, though 0.5 indicates a fairly high degree of activation rather
than the default value.
- Instinctive properties:
- Danger: from 0.1 for concepts which are not associated with
physical (or mental) harm, to 1 for concepts which are;
- Pleasure: a similar range from 0.1 to 1 for physical (and mental)
joy;
- Excitation: from 0.1 for concepts which are dull, to 1 for those
which cause a lot of adrenaline to be released -- obviously, there is some
overlap with the previous two units;
- Food: from 0.1 for concepts which are not related to the feelings
of hunger and thirst to 1 for those which are strongly related.
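Taken together, the three groups of units yield one 25-element input vector
per word. The sketch below shows one way to assemble and check such a
vector. The ordering of the groups within the vector, the assemble helper
and all example values are assumptions for illustration; the actual unit
order in CLASPnet is not specified here.

```python
N_SENSORY, N_GESTALT, N_INSTINCTIVE = 14, 7, 4


def assemble(sensory, gestalt, instinctive):
    """Concatenate the three groups into one 25-element list (assumed order)."""
    sensory, gestalt, instinctive = list(sensory), list(gestalt), list(instinctive)
    sizes = (len(sensory), len(gestalt), len(instinctive))
    if sizes != (N_SENSORY, N_GESTALT, N_INSTINCTIVE):
        raise ValueError(f"expected group sizes (14, 7, 4), got {sizes}")
    # Sensory units may be 0 (irrelevant) or lie between 0.1 and 1.
    if any(not 0.0 <= v <= 1.0 for v in sensory):
        raise ValueError("sensory values must lie in [0, 1]")
    # Gestalt-like and instinctive units are never 0.
    if any(not 0.1 <= v <= 1.0 for v in gestalt + instinctive):
        raise ValueError("Gestalt-like and instinctive values must lie in [0.1, 1]")
    return sensory + gestalt + instinctive


# Illustrative values only, not the values assigned to any word in the corpus.
vector = assemble(
    sensory=[0.0] * 14,       # no relevant sensory features
    gestalt=[0.5] * 7,        # Gestalt-like defaults
    instinctive=[0.1] * 4,    # weak link to all four instinctive feelings
)
```

Keeping the groups as separate arguments makes it straightforward to drop
or replace one group when comparing network variants.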
It should be obvious that the 25 semantic units used in CLASPnet are
quite insufficient to provide psychologically plausible representations of
the concepts underlying the words in the corpus (see Appendix A.2 for the
complete list of words and their semantic representations). There ought to
be many more units with much more strictly
defined meanings. Developing a plausible semantic representation, however,
would have been an entire project in itself. What's more, the advantage of
the present setup is that CLASPnet can show that even an imprecise
and impoverished lexical semantics scheme is relevant for spotting clausal
properties (see 3.4.3). And that, I feel, is an
interesting finding in itself.
Note 5
Because of the ambiguous nature of this unit, and because it carries obvious
word class information, a network has also been trained in which this unit
and the other Gestalt-like units have been removed. (See 3.4.4 for the
results.)