Learning Verb Subcategorisations:

A Case Study of the Acquisition of Syntax

How do children learn their first language?

Parameter-setting models are currently the dominant paradigm for explaining language acquisition. The motivation for this theory: "[E]ven the most superficial look reveals the chasm that separates the knowledge of the language user from the data of experience" (Chomsky, 1995, p5).

My research has investigated to what extent we can learn language from experience.

What have people already shown is learnable?

Redington, Chater & Finch (1998) showed how word classes can be learned.

Neural Network Approaches

Neural networks are computational models consisting of nodes, and links between the nodes specifying what effect one node has on neighbouring nodes.

Elman (1993) showed how neural nets can be used not only to learn word classes, but also the syntactic patterns in which words can occur.

He used a neural net to learn a simple English-like syntactic system containing such features as number agreement and recursion in relative clauses.

Given a sequence of words corresponding to the beginning of a sentence, it could predict the next word, even when there were long range dependencies.

So the network has internalised the structure implicit in the data.

Are there aspects of syntax which can't be learned on the basis of distributions alone?

Pinker (1989) argues that some constructions can't be learned in this way.

Verb subcategorisations in English:

(1) a. John gave a painting to the museum.
  b. John gave the museum a painting.
  c. John donated a painting to the museum.
  d. *John donated the museum a painting.

Children learn the regular alternations between constructions such as these.

Ultimately children learn that some verbs, such as donated, do not occur in double object constructions.

Solving the Learnability Problem:

Statistical Grammars.

If we had a grammar that predicted that donated was equally likely to occur in the prepositional and double object constructions, but we then observed it 100 times in the prepositional construction and never in the double object construction, we might suspect that the grammar was wrong.
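To put a rough number on this intuition, here is a minimal sketch. It assumes that each of the 100 observations is an independent draw and that the grammar under test assigns probability 0.5 to each construction; the independence assumption and the function name are illustrative and are not taken from the model described here.

```python
import math

def prob_never_observed(p_construction: float, n_observations: int) -> float:
    """Probability that a construction with per-use probability
    p_construction is absent from n_observations independent uses."""
    return (1.0 - p_construction) ** n_observations

# Grammar under test: donated is equally likely in either construction.
p_chance = prob_never_observed(0.5, 100)

# The same quantity expressed as information: -log2 of the probability.
surprise_bits = -math.log2(p_chance)

print(f"P(no double object uses in 100) = {p_chance:.3e}")
print(f"Surprise under this grammar     = {surprise_bits:.1f} bits")
```

Under these assumptions the absence is far too improbable to attribute to chance, which is the sense in which the grammar "might be wrong".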

Pinker notes that children might be able to use this kind of data to determine which constructions are not grammatical.

But:

It is necessary to identify "under exactly what circumstances does a child conclude that a nonwitnessed sentence is ungrammatical?" (Pinker, 1989, p14).

The computational model which I'm going to describe here does just that, and so can learn the distinction between gave and donated.

Bayesian Grammar Learning

Bayes' theorem:
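In its standard form, writing G for a candidate grammar and D for the observed corpus (these symbol names are assumptions made for this note), the theorem states:

```latex
P(G \mid D) = \frac{P(D \mid G)\, P(G)}{P(D)}
```

Since P(D) is the same for every grammar, comparing two grammars only requires the product P(D | G) P(G).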

Information theory allows us to relate the probability of a grammar and of a corpus to how much information is needed to specify it within an efficient coding scheme.

Amount of information (in bits) = -log2(Probability)

If we make the grammars statistical by assigning a probability to each rule used in deriving sentences, then we can also assign probabilities to each word in a sentence.

We can apply the above formula to each sentence in the corpus, and the total amount of information gives us a measure of the complexity of the corpus.
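As a minimal sketch of this calculation: the rule probabilities below are hypothetical values chosen for illustration, and the helper names are not from the original program. The information needed for a sentence is the sum of the information for each rule used in deriving it, and the corpus total is the sum over sentences.

```python
import math

def bits(probability: float) -> float:
    """Information content, in bits, of an event with the given probability."""
    return -math.log2(probability)

def derivation_bits(rule_probabilities) -> float:
    """Bits needed to encode one sentence: the sum over the rules used
    in its derivation (equivalently, -log2 of their product)."""
    return sum(bits(p) for p in rule_probabilities)

# Hypothetical rule probabilities for deriving "John ran".
john_ran = {"S -> NP VP": 1.0, "NP -> John": 0.25, "VP -> ran": 0.25}
# Hypothetical rule probabilities for deriving "Mary screamed".
mary_screamed = {"S -> NP VP": 1.0, "NP -> Mary": 0.25, "VP -> screamed": 0.125}

corpus = [john_ran, mary_screamed]
corpus_bits = sum(derivation_bits(d.values()) for d in corpus)
print(f"Corpus needs {corpus_bits:.1f} bits under these probabilities")
```

A rule that is certain to apply (probability 1.0) costs nothing, so a grammar that captures real regularities makes the corpus cheaper to encode.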

Summing the amount of information needed to specify the grammar and the corpus gives us an overall evaluation for the grammar with respect to the corpus.

Evaluation measures of this kind are usually referred to as Minimum Description Length.
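The link between the Bayesian view and description length can be written out as a short derivation. Here G stands for a grammar, D for the corpus, and the hat marks the preferred grammar; the notation is an assumption of this note rather than the original presentation.

```latex
\begin{aligned}
\hat{G} &= \arg\max_{G} \; P(G)\, P(D \mid G) \\
        &= \arg\min_{G} \; \bigl[ -\log_2 P(G) \;-\; \log_2 P(D \mid G) \bigr]
\end{aligned}
```

The first term is the information needed to specify the grammar and the second is the information needed to specify the corpus given that grammar, so the most probable grammar is the one with the shortest total description.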

Applying the Evaluation Measure:

Some Example Grammars

1. A very simple grammar could be constructed which would say nothing about what word combinations are possible in a language.

2. A very complex grammar could be constructed which would specify exactly which sentences have been observed, and which won't allow any others.

Bayesian inference trades off fit to data and complexity of grammars, so as to arrive at appropriate generalisations given the observed data.

A Bayesian Computational Model of Syntactic Acquisition

Grammars are represented using phrase structure rules (restricted to, at most, binary branching).

The grammars contain arbitrary symbols, one of which, S, is special, as every top-down derivation must start with this symbol.

Learning starts with a simple grammar allowing every possible combination of words:

S -> X S

S -> X

X -> John

X -> thinks

etc.
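The starting grammar can be sketched in a few lines. The representation of rules as (left-hand side, right-hand side) pairs and the function names are assumptions made for illustration; the original program's data structures are not described here.

```python
def initial_grammar(vocabulary):
    """Return the flat starting grammar as a list of (lhs, rhs) rules:
    S -> X S, S -> X, and X -> word for every word in the vocabulary."""
    rules = [("S", ("X", "S")), ("S", ("X",))]
    rules += [("X", (word,)) for word in sorted(vocabulary)]
    return rules

def flat_grammar_accepts(rules, sentence):
    """Check a sentence against the flat grammar only: it derives any
    non-empty sequence of words that have an X -> word rule."""
    lexicon = {rhs[0] for lhs, rhs in rules if lhs == "X"}
    words = sentence.split()
    return len(words) > 0 and all(word in lexicon for word in words)

grammar = initial_grammar({"John", "thinks", "Mary", "ran"})
print(flat_grammar_accepts(grammar, "John thinks Mary ran"))  # a sentence of the language
print(flat_grammar_accepts(grammar, "ran thinks thinks"))     # word salad is accepted too
```

Because this grammar accepts every word sequence, it is cheap to state but says nothing about which combinations actually occur.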

The program makes small random changes to the grammar such as:

Changes which improve the overall evaluation of the grammar are more likely to be kept than those which make it worse.

If a change means that the grammar can no longer parse the corpus then it is rejected.

After a fixed amount of time the program stops running. It will usually have settled on a single grammar by this stage.
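The search loop described above can be sketched as follows. The text only states that improving changes are more likely to be kept, that changes which break parsing are rejected, and that the search runs for a fixed time; the specific acceptance rule used here (a Metropolis-style rule with a temperature parameter) is an assumption, as are all function and parameter names. The toy usage searches over integers rather than grammars so that the block stays self-contained.

```python
import math
import random

def stochastic_search(initial, propose, evaluate, can_parse,
                      steps, temperature=1.0, seed=0):
    """Keep a current hypothesis; repeatedly propose a small random change.
    Lower evaluation (fewer bits) is better."""
    rng = random.Random(seed)
    current, current_cost = initial, evaluate(initial)
    for _ in range(steps):  # a fixed budget stands in for "a fixed amount of time"
        candidate = propose(current, rng)
        if not can_parse(candidate):
            continue  # cannot account for the corpus: always rejected
        cost = evaluate(candidate)
        delta = cost - current_cost
        # Improvements are always kept; worse candidates only occasionally
        # (assumed Metropolis-style rule, not confirmed by the source).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, cost
    return current, current_cost

# Toy stand-in: hypotheses are integers, the best "evaluation" is at 7,
# and negative values play the role of grammars that cannot parse the corpus.
best, cost = stochastic_search(
    initial=0,
    propose=lambda x, rng: x + rng.choice([-1, 1]),
    evaluate=lambda x: float((x - 7) ** 2),
    can_parse=lambda x: x >= 0,
    steps=2000,
    temperature=0.1,
)
print(best, cost)
```

Occasionally accepting a worse hypothesis lets the search escape local optima, which is one plausible reading of "more likely to be kept" rather than "always kept".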

Learning a Simple Syntactic System

The program was given the data below from which to learn.
 
John hit Mary
Ethel thinks John ran
Mary hit Ethel
John thinks Ethel ran
Ethel ran
Mary ran
John ran
Ethel hit Mary
Mary ran
Mary thinks John hit Ethel
Ethel hit John
John screamed
Noam hit John
Noam hopes John screamed
Ethel screamed
Mary hopes Ethel hit John
Mary kicked Ethel
Noam kicked Mary
John hopes Ethel thinks Mary hit Ethel

This data corresponds to the grammar given below.
 
S -> NP VP
VP -> ran
VP -> screamed
VP -> Vt NP
VP -> Vs S
Vt -> hit
Vt -> kicked
Vs -> thinks
Vs -> hopes
NP -> John
NP -> Ethel
NP -> Mary
NP -> Noam
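To check that a sentence is derivable from this grammar, a small chart recogniser is enough, since every rule is either binary or rewrites a category as a single word. The sketch below uses the standard CYK algorithm; it illustrates what "can parse the corpus" means and is not a description of the parser used in the original program.

```python
from itertools import product

# Binary rules, indexed by their right-hand side.
BINARY = {("NP", "VP"): {"S"}, ("Vt", "NP"): {"VP"}, ("Vs", "S"): {"VP"}}
# Lexical rules, indexed by word.
LEXICAL = {
    "ran": {"VP"}, "screamed": {"VP"},
    "hit": {"Vt"}, "kicked": {"Vt"},
    "thinks": {"Vs"}, "hopes": {"Vs"},
    "John": {"NP"}, "Ethel": {"NP"}, "Mary": {"NP"}, "Noam": {"NP"},
}

def parses(sentence: str, start: str = "S") -> bool:
    """Return True if the grammar derives the sentence from the start symbol."""
    words = sentence.split()
    n = len(words)
    if n == 0:
        return False
    # chart[i][j] holds every category that can span words[i:j].
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        chart[i][i + 1] = set(LEXICAL.get(word, ()))
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # every split point of the span
                for left, right in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= BINARY.get((left, right), set())
    return start in chart[0][n]

print(parses("John hopes Ethel thinks Mary hit Ethel"))
print(parses("hit John Mary"))
```

The recursion through VP -> Vs S is what allows sentences of unbounded depth, such as the last sentence of the data set.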

 

Results

The program learned a grammar exactly equivalent in structure to the one given before. (The program doesn't know what nouns and verbs and so on are, so it just uses a different arbitrary symbol for each category.)

This grammar was chosen because, while it is more complex than the initial grammar, it accounts better for regularities in the data, resulting in a better overall evaluation.

Grammar Evaluations

 

                     Initial state of learning   Learned Grammar
Overall Evaluation   406.5 bits                  329.5 bits
Grammar              160.3 bits                  199.3 bits
Data                 246.2 bits                  130.3 bits
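The trade-off in these figures can be checked with simple arithmetic. The sketch below only re-adds the reported numbers; the dictionary layout and variable names are illustrative.

```python
# Reported evaluations, in bits.
evaluations = {
    "initial": {"grammar": 160.3, "data": 246.2, "overall": 406.5},
    "learned": {"grammar": 199.3, "data": 130.3, "overall": 329.5},
}

for name, e in evaluations.items():
    total = e["grammar"] + e["data"]
    print(f"{name}: grammar + data = {total:.1f} bits (reported {e['overall']} bits)")

# How the learned grammar pays for itself.
extra_grammar = evaluations["learned"]["grammar"] - evaluations["initial"]["grammar"]
data_saving = evaluations["initial"]["data"] - evaluations["learned"]["data"]
net_saving = evaluations["initial"]["overall"] - evaluations["learned"]["overall"]
print(f"extra grammar cost: {extra_grammar:.1f} bits")
print(f"saving on data:     {data_saving:.1f} bits")
print(f"net saving:         {net_saving:.1f} bits")
```

The learned column sums to 329.6 bits, which agrees with the reported 329.5 bits to within rounding. The learned grammar costs about 39 bits more to state but saves about 116 bits in encoding the data.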

 
 
 

Learning Verb Subcategorisations

Given the program's success at learning such syntactic systems, it was decided to see whether it could learn the kind of construction which, it has been claimed, poses particular difficulties for theories of acquisition.

Ditransitive verbs in English:

3 Key phenomena which I aimed to model:

The same program was used to learn grammars, but using a different data set.

The data consisted of 150 sentences of the types:
 
(2) a. John gave a painting to Sam
  b. Sam donated John to the museum
  c. The museum lent Sam a painting

Each verb except sent appeared with approximately equal frequency.

Alternating verbs were equally likely to occur in either construction.

In addition there was only a single occurrence of the verb sent, in the following sentence (which uses the prepositional construction).

(3) The museum sent a painting to Sam.



Results

The verbs were divided into two classes:

The grammar allows the first group of verbs to appear in either construction.

But it allows donated to appear only in prepositional constructions.

This accounts for the first two phenomena:

To investigate what would happen at earlier stages of learning, when children have not observed so many examples of each kind of verb, the total amount of data from which the program learned was reduced.

Neural Nets Again

It seems that current neural net models wouldn't be able to learn such verb subcategorisations correctly.

Christiansen and Chater (1994) investigated the generalisation ability of Elman-type networks. They excluded girl from genitive contexts, and boy from noun phrase conjunctions in the training data.

They then trained the network on 50,000 sentences.

At the end the network had generalised to predict that boy could appear in noun phrase conjunctions, but it didn't predict that girl could appear in genitive contexts.

But if a word does not occur in a particular construction in 50,000 sentences, is this likely to be due to chance? (The language only contained 34 words.) If neural net models are to learn distinctions between words such as gave and donated, they must be able to make inferences about which constructions are unlikely to have been absent simply due to chance.

Early Generative Grammar

Simplicity based evaluation metrics are not new to linguistic theory.

But Chomsky considered syntax to be fundamentally non-statistical. Even more importantly, Chomsky's evaluation metric did not incorporate a measure of goodness of fit to data. However, Chomsky's (1965) theory shows that simplicity measures and Universal Grammar are not incompatible.

Implications for Syntactic Theory

Taking account of Bayesian inference allows us to treat the degree to which language is determined by innate principles as an empirical question once again.

But if language is learned using Bayesian inference, then this would make very different predictions about what forms grammars will take.

Conclusion

Language acquisition may involve a much greater amount of learning than is usually assumed.

References

Chomsky (1957). Syntactic Structures. The Hague: Mouton & Co.

Chomsky (1965). Aspects of the Theory of Syntax. MIT Press.

Chomsky (1995). The Minimalist Program. MIT Press.

Christiansen and Chater (1994). Generalization and connectionist language learning. Mind and Language, 9, 273-287.

Elman (1993). Learning and development in neural networks: The importance of starting small. Cognition, 48, 71-99.

Pinker (1989). Learnability and Cognition. MIT Press.

Redington, Chater and Finch (1998). Distributional information: A powerful cue for acquiring syntactic categories. Cognitive Science, 22, 425-469.