Learning Verb Subcategorisations:
A Case Study of the Acquisition of Syntax
How do children learn their first language?
-
When exposed to utterances in that language, how do they
infer the grammatical system which produced those utterances?
Currently, parameter-setting models are the dominant paradigm
for explaining language acquisition.
-
People are innately endowed with Universal Grammar (UG).
-
Learning consists of identifying words and setting some parameters.
Reason for this theory: '[E]ven the most superficial look
reveals the chasm that separates the knowledge of the language user from
the data of experience.' (Chomsky, 1995, p5).
My research has investigated to what extent we can learn
language from experience.
-
It looks at the hardest case, when we don't have access to prosodic
or semantic cues,
-
And when we receive no feedback about incorrect sentences.
What have people already shown is learnable?
Redington, Chater & Finch (1998) showed how word classes
can be learned.
-
They took the 1000 most frequent words in a corpus.
-
For each of these words they looked only at the two preceding
and two following words, and counted how often each appeared in each position.
-
They clustered those words with the most similar contexts.
-
Words were grouped into clusters corresponding closely to
linguistic categories,
-
Although the program doesn't know at what level of dissimilarity
to form separate classes.
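The counting step described above can be sketched as follows. This is a minimal illustration, not the procedure of Redington, Chater & Finch: the toy token list, the function names and the use of cosine similarity to compare contexts are all assumptions made here.

```python
from collections import Counter, defaultdict
import math


def context_vectors(tokens, targets):
    """For each target word, count the words seen at offsets -2, -1, +1, +2."""
    vectors = defaultdict(Counter)
    for i, word in enumerate(tokens):
        if word not in targets:
            continue
        for offset in (-2, -1, 1, 2):
            j = i + offset
            if 0 <= j < len(tokens):
                # Each (position, context word) pair is a separate dimension.
                vectors[word][(offset, tokens[j])] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (an assumed measure)."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy corpus, invented for illustration only.
tokens = "the cat sat on the mat the dog sat on the rug".split()
vectors = context_vectors(tokens, {"cat", "dog", "sat"})

noun_noun = cosine(vectors["cat"], vectors["dog"])
noun_verb = cosine(vectors["cat"], vectors["sat"])
```

In this toy corpus the two nouns share most of their contexts and the noun and the verb share none, so a clustering step would group cat with dog before either joins sat.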
Neural Network Approaches
Neural networks are computational models consisting of
nodes, and links between the nodes specifying what effect one node has
on neighbouring nodes.
-
The nodes can be either on or off.
-
Words are represented as patterns of activation of particular
nodes.
-
Each link between nodes has an associated weighting, determining
whether it tends to make the nodes it is connected to more or less activated
when it is switched on.
-
Networks are trained by presenting them with data, and adjusting
the weightings on the links so that they learn to predict the data better.
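The idea of adjusting weightings to predict the data better can be sketched with the simplest possible case: a single on/off node trained with the perceptron rule. This is a stand-in chosen for brevity, not the kind of network discussed in the rest of this talk, and the function names and the toy task (logical OR) are assumptions.

```python
def step(total):
    """A node is either on (1) or off (0)."""
    return 1 if total > 0 else 0


def predict(weights, bias, inputs):
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train(examples, epochs=20, rate=1.0):
    """Nudge each weight in the direction that reduces the prediction error."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in examples:
            error = target - predict(weights, bias, inputs)
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias


# Toy task (assumed for illustration): the output node should be on
# whenever at least one input node is on.
examples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train(examples)
outputs = [predict(weights, bias, inputs) for inputs, _ in examples]
```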
Elman (1993) showed how neural nets can be used not only
to learn word classes, but also the syntactic patterns in which words can
occur.
He used a neural net to learn a simple English-like syntactic
system containing such features as number agreement and recursion in relative
clauses.
Given a sequence of words corresponding to the beginning
of a sentence, it could predict the next word, even when there were long
range dependencies.
So the network has internalised the structure implicit
in the data.
Are there aspects of syntax which can't be learned on
the basis of distributions alone?
Pinker (1989) argues that some constructions can't be
learned in this way.
Verb subcategorisations in English:
-
gave can appear in both the prepositional dative construction
(1a), and the double object dative construction (1b).
-
But some verbs, such as donated, are irregular and
can only occur in the prepositional construction (1c).
(1) a. John gave a painting to the museum.
    b. John gave the museum a painting.
    c. John donated a painting to the museum.
    d. *John donated the museum a painting.
Children learn the regular alternations between constructions
such as these.
-
When presented with novel, nonce, verbs in the prepositional
construction they will generalise and use them in the double object construction.
-
During learning they over-generalise, and use verbs such
as donated in double object constructions.
Ultimately children learn that some verbs, such as donated,
do not occur in double object constructions.
-
We need a theory which can explain why children at first
make such over-generalisations, but subsequently learn the correct subcategorisations.
Solving the Learnability Problem:
Statistical Grammars.
If we had a grammar that predicted that donated
was equally likely to occur in the prepositional and double object constructions,
but then we observed it 100 times in the prepositional construction, but
no times in the double object construction, we might suspect that the grammar
was wrong.
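The arithmetic behind this suspicion is short. The sketch below assumes, as in the example above, that the grammar gives each construction a probability of 0.5 and that the 100 observations are independent; the function name is made up for illustration.

```python
import math


def chance_of_never_seeing(p_construction, n_observations):
    """Probability that a construction with the given probability
    is absent from every one of n independent observations."""
    return (1 - p_construction) ** n_observations


# 100 uses of donated, none of them in the double object construction.
p_absent = chance_of_never_seeing(0.5, 100)   # 2 ** -100, roughly 7.9e-31
surprise_bits = -math.log2(p_absent)          # 100 bits of information
```

An absence that improbable under the grammar is strong evidence against the grammar rather than a coincidence.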
Pinker notes that children might be able to use this kind
of data to determine which constructions are not grammatical.
But:
-
Children don't rule out all sentences which they've never
heard.
-
Nor do they rule out all verb argument structure combinations
which they haven't heard.
-
How do they know at what level to form generalisations?
It is necessary to identify 'under exactly what circumstances
does a child conclude that a nonwitnessed sentence is ungrammatical?' (Pinker,
1989, p14).
The computational model which I'm going to describe here
does just that, and so can learn the distinction between gave and
donated.
Bayesian Grammar Learning
Bayes' theorem:
-
In order to find the probability of a grammar with respect
to a corpus, we need to multiply the a priori probability of the
grammar by the probability of the corpus with respect to that grammar.
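Written out, with G standing for a grammar and C for the corpus (symbols introduced here for illustration), the statement above is:

```latex
P(G \mid C) = \frac{P(G)\,P(C \mid G)}{P(C)}
\qquad\Longrightarrow\qquad
P(G \mid C) \propto P(G)\,P(C \mid G)
```

The corpus probability P(C) is the same for every grammar, so it can be ignored when grammars are compared with one another.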
Information theory allows us to relate the probability of
a grammar and of a corpus to how much information is needed to specify
it within an efficient coding scheme.
Amount of information = - log Probability
-
Simple grammars are more probable.
-
Complex grammars are less probable.
If we make the grammars statistical by assigning a probability
to each rule used in deriving sentences, then we can also assign probabilities
to each word in a sentence.
We can apply the above formula to each sentence in the
corpus, and the total amount of information gives us a measure of the complexity
of the corpus.
Summing the amount of information needed to specify the
grammar and the corpus gives us an overall evaluation for the grammar with
respect to the corpus.
-
Smaller evaluations are better than larger ones.
Evaluation measures of this kind are usually referred to
as Minimum Description Length.
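The evaluation can be sketched in a few lines. This is an illustration under assumptions: information is measured in bits (logarithms to base 2), each sentence is represented simply as the list of probabilities of the rules used to derive it, and the function names and toy numbers are invented here.

```python
import math


def bits(probability):
    """Amount of information = -log2(probability)."""
    return -math.log2(probability)


def corpus_bits(derivations):
    """Total information for a corpus, where each sentence is given as the
    list of probabilities of the rules used in its derivation."""
    return sum(bits(p) for rule_probs in derivations for p in rule_probs)


def overall_evaluation(grammar_bits, derivations):
    """Information for the grammar plus information for the corpus."""
    return grammar_bits + corpus_bits(derivations)


# Toy numbers, assumed for illustration: a 10-bit grammar and two sentences.
toy_derivations = [[0.5, 0.25], [0.5]]        # 1 + 2 bits, then 1 bit
toy_total = overall_evaluation(10.0, toy_derivations)
```

Here the corpus costs 4 bits and the grammar 10 bits, giving an overall evaluation of 14 bits.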
Applying the Evaluation Measure:
Some Example Grammars
1. A very simple grammar could be constructed which would
say nothing about what word combinations are possible in a language.
-
The grammar itself would have a good evaluation, but it doesn't
make any predictions about what to expect in the corpus, so the corpus
would have a very bad evaluation.
-
So a bad overall evaluation.
2. A very complex grammar which specifies exactly which sentences
have been observed, and which won't allow any others.
-
The corpus would receive a good evaluation as it is well
predicted by the grammar, but the grammar is very complex, so it gets a
very bad evaluation.
-
So again a bad overall evaluation.
Bayesian inference trades off fit to data and complexity
of grammars, so as to arrive at appropriate generalisations given the observed
data.
A Bayesian Computational Model of Syntactic Acquisition
Grammars are represented using phrase structure rules
(restricted to, at most, binary branching).
The grammars contain arbitrary symbols, one of which,
S, is special, as every top-down derivation must start with this symbol.
Learning starts with a simple grammar allowing every possible
combination of words:
S -> X S
S -> X
X -> John
X -> thinks
etc.
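Building this starting grammar from a corpus is mechanical. The sketch below is one way to do it; representing a rule as a (left-hand side, right-hand side) tuple is an assumption made for illustration.

```python
def initial_grammar(corpus):
    """Return the rules S -> X S, S -> X and X -> w for every word w seen."""
    vocabulary = sorted({word for sentence in corpus for word in sentence.split()})
    rules = [("S", ("X", "S")), ("S", ("X",))]
    rules += [("X", (word,)) for word in vocabulary]
    return rules


rules = initial_grammar(["John thinks Mary ran", "Mary ran"])
```

Because any word can follow any other, this grammar parses every sentence over the vocabulary, which is why it says nothing about which word combinations to expect.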
The program makes small random changes to the grammar
such as:
-
Choosing a rule at random and deleting it.
-
Adding a random new rule.
-
Changing one of the symbols or words in an existing rule.
Changes which improve the overall evaluation of the grammar
are more likely to be kept than those which make it worse.
If a change means that the grammar can no longer parse
the corpus then it is rejected.
After a fixed amount of time the program stops running.
It will usually have settled on a single grammar by this stage.
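The search procedure can be sketched as the loop below. The acceptance rule is an assumption: improvements are always kept, and a worsening of d bits is kept with probability exp(-d / temperature). A fixed number of steps stands in for the fixed amount of time. The demonstration uses a toy integer state rather than a grammar, purely to show the loop running.

```python
import math
import random


def accept(old_bits, new_bits, temperature, rng):
    """Keep improvements; keep a worse candidate only occasionally."""
    if new_bits <= old_bits:
        return True
    return rng.random() < math.exp(-(new_bits - old_bits) / temperature)


def search(initial, propose, evaluate, steps, temperature=1.0, seed=0):
    """`evaluate` returns the overall evaluation in bits, or None when the
    candidate can no longer parse the corpus (such changes are rejected)."""
    rng = random.Random(seed)
    current, current_bits = initial, evaluate(initial)
    for _ in range(steps):
        candidate = propose(current, rng)
        candidate_bits = evaluate(candidate)
        if candidate_bits is None:
            continue
        if accept(current_bits, candidate_bits, temperature, rng):
            current, current_bits = candidate, candidate_bits
    return current, current_bits


# Toy stand-ins (hypothetical): the "grammar" is an integer, the best
# evaluation is at 7, and negative states count as unable to parse.
def toy_propose(state, rng):
    return state + rng.choice([-1, 1])


def toy_evaluate(state):
    return None if state < 0 else float(abs(state - 7))


final_state, final_bits = search(0, toy_propose, toy_evaluate,
                                 steps=200, temperature=0.01)
```

With a real grammar, `propose` would delete, add or alter a rule as listed above, and `evaluate` would return the grammar cost plus the data cost.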
Learning a Simple Syntactic System
The program was given the data below from which to learn.
John hit Mary          Ethel thinks John ran
Mary hit Ethel         John thinks Ethel ran
Ethel ran              Mary ran
John ran               Ethel hit Mary
Mary ran               Mary thinks John hit Ethel
Ethel hit John         John screamed
Noam hit John          Noam hopes John screamed
Ethel screamed         Mary hopes Ethel hit John
Mary kicked Ethel      Noam kicked Mary
John hopes Ethel thinks Mary hit Ethel
This data corresponds to the grammar given below.
S -> NP VP         Vs -> thinks
VP -> ran          Vs -> hopes
VP -> screamed     NP -> John
VP -> Vt NP        NP -> Ethel
VP -> Vs S         NP -> Mary
Vt -> hit          NP -> Noam
Vt -> kicked
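That the grammar covers the data can be checked mechanically. The sketch below is a small top-down recogniser written for this particular grammar; it is not the program described in this talk, and it relies on the grammar having no left recursion.

```python
GRAMMAR = {
    "S": [["NP", "VP"]],
    "VP": [["ran"], ["screamed"], ["Vt", "NP"], ["Vs", "S"]],
    "Vt": [["hit"], ["kicked"]],
    "Vs": [["thinks"], ["hopes"]],
    "NP": [["John"], ["Ethel"], ["Mary"], ["Noam"]],
}

CORPUS = [
    "John hit Mary", "Ethel thinks John ran", "Mary hit Ethel",
    "John thinks Ethel ran", "Ethel ran", "Mary ran", "John ran",
    "Ethel hit Mary", "Mary ran", "Mary thinks John hit Ethel",
    "Ethel hit John", "John screamed", "Noam hit John",
    "Noam hopes John screamed", "Ethel screamed",
    "Mary hopes Ethel hit John", "Mary kicked Ethel", "Noam kicked Mary",
    "John hopes Ethel thinks Mary hit Ethel",
]


def ends(symbol, words, start):
    """Every position at which `symbol` can finish if it begins at `start`."""
    if symbol not in GRAMMAR:  # a word rather than a category
        matches = start < len(words) and words[start] == symbol
        return {start + 1} if matches else set()
    result = set()
    for right_hand_side in GRAMMAR[symbol]:
        positions = {start}
        for part in right_hand_side:
            positions = {e for p in positions for e in ends(part, words, p)}
        result |= positions
    return result


def parses(sentence):
    """True if the whole sentence can be derived from S."""
    words = sentence.split()
    return len(words) in ends("S", words, 0)


all_parsed = all(parses(sentence) for sentence in CORPUS)
```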
Results
The program learned a grammar exactly equivalent in structure
to the one given before. (The program doesn't know what nouns and verbs
and so on are, so it just uses a different arbitrary symbol for each category.)
This grammar was chosen because, while it is more
complex than the initial grammar, it accounts better for regularities
in the data, resulting in a better overall evaluation.
Grammar Evaluations
                      Initial state of learning    Learned Grammar
Overall Evaluation    406.5 bits                   329.5 bits
Grammar               160.3 bits                   199.3 bits
Data                  246.2 bits                   130.3 bits
Learning Verb Subcategorisations
Given the program's success at learning such syntactic
systems, it was decided to see if it could learn the kind of construction
which, it has been claimed, poses particular difficulties for theories of acquisition.
Ditransitive verbs in English:
-
gave, passed, lent and sent show
the dative alternation.
-
donated doesn't.
3 Key phenomena which I aimed to model:
-
Children eventually learn the distinction between verbs showing
the dative alternation and those which can appear in only the prepositional
construction.
-
Children generalise and use newly seen verbs in regular constructions.
-
At early stages of learning children over-generalise and
use irregular verbs in regular constructions.
The same program was used to learn grammars, but using
a different data set.
The data consisted of 150 sentences of the types:
(2) a. John gave a painting to Sam
    b. Sam donated John to the museum
    c. The museum lent Sam a painting
Each verb except sent appeared with approximately
equal frequency.
Alternating verbs were equally likely to occur in either
construction.
In addition there was only a single occurrence of the
verb sent, in the following sentence (which uses the prepositional
construction).
(3) The museum sent a painting to Sam.
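A data set with these properties could be generated along the following lines. This is a guess at the general shape only: the noun phrases are taken from the examples above, but the way arguments are chosen, the random seed and the function names are assumptions, not a description of how the original data were produced.

```python
import random

ALTERNATING = ["gave", "passed", "lent"]   # occur in both constructions
PREPOSITIONAL_ONLY = ["donated"]           # never in the double object form
NOUN_PHRASES = ["John", "Sam", "the museum", "a painting"]


def make_sentence(verb, construction, rng):
    """Fill the three argument slots with distinct noun phrases at random."""
    subject, theme, recipient = rng.sample(NOUN_PHRASES, 3)
    if construction == "prepositional":
        return f"{subject} {verb} {theme} to {recipient}"
    return f"{subject} {verb} {recipient} {theme}"


def make_corpus(size=150, seed=0):
    rng = random.Random(seed)
    # The single occurrence of sent, as in example (3).
    corpus = ["The museum sent a painting to Sam"]
    while len(corpus) < size:
        verb = rng.choice(ALTERNATING + PREPOSITIONAL_ONLY)
        if verb in ALTERNATING:
            construction = rng.choice(["prepositional", "double_object"])
        else:
            construction = "prepositional"
        corpus.append(make_sentence(verb, construction, rng))
    return corpus


corpus = make_corpus()
```

As in (2b), the arguments are chosen without regard to meaning, since the learner has no access to semantic cues.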
Results
The verbs were divided into two classes:
-
One contained gave, passed, lent and sent.
-
The other contained donated.
The grammar allows the first group of verbs to appear in
either construction.
But it allows donated to appear only in prepositional
constructions.
This accounts for the first two phenomena:
-
Learning a distinction between sub-classes of verb.
-
And newly seen verbs (sent) being used productively
in regular patterns.
To investigate what would happen at earlier stages of learning,
when children have not observed so many examples of each kind of verb,
the total amount of data from which the program learned was reduced.
-
Then the model didn't make a distinction between sub-classes
of verbs, allowing all verbs to appear in both constructions.
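A rough calculation shows why the amount of data matters. The sketch below compares a grammar with one merged verb class against one with a separate class for donated, counting only the information needed to choose a construction for each sentence. The 20-bit cost of the extra class and the sentence counts are hypothetical figures chosen for illustration, not values from the program.

```python
import math


def cost(count, probability):
    """Bits needed for `count` choices that each have this probability."""
    return 0.0 if count == 0 else count * -math.log2(probability)


def data_bits_merged(n_alternating, n_donated):
    """One class: the construction probabilities are shared by all verbs."""
    n_prep = n_alternating / 2 + n_donated
    n_double = n_alternating / 2
    p_prep = n_prep / (n_alternating + n_donated)
    return cost(n_prep, p_prep) + cost(n_double, 1 - p_prep)


def data_bits_split(n_alternating, n_donated):
    """Two classes: donated has one option, so its choice costs no bits."""
    return cost(n_alternating, 0.5)


def prefers_split(n_alternating, n_donated, extra_grammar_bits=20.0):
    """True if the more complex grammar has the better overall evaluation."""
    split_total = data_bits_split(n_alternating, n_donated) + extra_grammar_bits
    return split_total < data_bits_merged(n_alternating, n_donated)


with_full_data = prefers_split(112, 38)    # about 150 sentences
with_little_data = prefers_split(30, 10)   # about 40 sentences
```

Under these assumed figures the separate class pays for itself with about 150 sentences but not with about 40, which mirrors the over-generalisation seen when the data were reduced.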
Neural Nets Again
It seems that current neural net models wouldn't be able
to learn such verb subcategorisations correctly.
Christiansen and Chater (1994) investigated the generalisation
ability of Elman-type networks. They excluded girl from genitive
contexts, and boy from noun phrase conjunctions in the training
data.
They then trained the network on 50,000 sentences.
At the end the network had generalised to predict that
boy could appear in noun phrase conjunctions, but it didn't predict
that girl could appear in genitive contexts.
-
Christiansen and Chater concluded that the network had learned
correctly in the case of boy, but not in the case of girl.
But if a word does not occur in a particular construction
in 50,000 sentences, is this likely to be due to chance? (The language only
contained 34 words.)
-
I would say that the network learned correctly in the case
of girl, but was wrong in allowing boy to appear in noun
phrase conjunctions.
If neural net models are to learn distinctions between words
such as gave and donated, they must be able to make inferences
about which constructions are unlikely to have been absent simply due to
chance.
-
They would need to incorporate some form of Bayesian inference.
(This isn't necessarily incompatible with neural nets.)
Early Generative Grammar
Simplicity based evaluation metrics are not new to linguistic
theory.
-
Chomsky's (1965) theory of syntactic acquisition used a simplicity
metric to choose between alternative grammars.
But Chomsky considered syntax to be fundamentally non-statistical.
-
'Despite the undeniable interest and importance of semantic
and statistical studies of language, they appear to have no direct relevance
to the problem of determining or characterising the set of grammatical
utterances.' (Chomsky 1957, p.17).
Even more importantly, Chomsky's evaluation metric did not
incorporate a measure of goodness of fit to data.
-
Chomsky's measure simply preferred the shortest grammar,
in terms of how many symbols it contained.
-
Innate constraints on the forms of grammar were needed so
that 'real generalisations shorten the grammar and spurious ones do not.'
(p.42).
However, Chomsky's (1965) theory shows that simplicity measures
and Universal Grammar are not incompatible.
Implications for Syntactic Theory
Taking account of Bayesian inference allows us to treat
the degree to which language is determined by innate principles as an empirical
question once more.
But if language is learned using Bayesian inference, then
this would make very different predictions about what forms grammars will
take.
-
They don't have to be 'conceptually natural', or completely
regular. Language could contain a lot of irregularities.
-
Even the principle of lexical minimisation is not so clear
cut - we can justify a complex lexical entry for frequent words, so long
as this accounts better for their distribution in the corpus.
Conclusion
Language acquisition may involve a much greater amount
of learning than is usually assumed.
References
Chomsky (1957). Syntactic Structures. The Hague: Mouton & Co.
Chomsky (1965). Aspects of the Theory of Syntax. MIT Press.
Chomsky (1995). The Minimalist Program. MIT Press.
Christiansen and Chater (1994). Generalization and connectionist
language learning. Mind and Language, 9, 273-287.
Elman (1993). Learning and development in neural networks:
The importance of starting small. Cognition, 48, 71-99.
Pinker (1989). Learnability and Cognition. MIT Press.
Redington, Chater and Finch (1998). Distributional Information:
A Powerful Cue for Acquiring Syntactic Categories. Cognitive Science, 22,
425-469.