Syntax: Accounting for Acquisition and Change

Chomsky (1986) argues that syntactic theories should be psychological theories of a person's knowledge of language.

As such they should be able to account for how children acquire language, and how people's language changes throughout their life.

Syntactic theories should also be compatible with data on how languages vary dialectally, and gradually change historically.

Problem: Current theories are usually very rigid - they don't seem to allow for a gradual process of learning over many years, or to be very useful in describing sociolinguistic phenomena or historical change.

Why is this? Almost all work in syntactic theory is done within a theory of Universal Grammar (e.g. the Minimalist Program, LFG, HPSG).

Learnability

The key argument for this approach seems to be that we can't account for language acquisition without UG. Chomsky has argued that there is not enough information in positive examples of language to determine the form of grammars, and that negative evidence (explicit correction of incorrect forms) is generally not available to children. This is the `poverty of the stimulus' argument.

Gold (1967) and Pinker (1989) both claim to have proved that language is not learnable in these circumstances.

Pinker's Learnability Proof

Pinker argues that a child observing (1a) and (1b) will form a general rule, such that if they observe (1c) they will regard (1d) as being grammatical. He concludes from examples of this type that children need UG to prevent them from making this sort of error.
 
(1) a. John gave a dish to Sam.
b. John gave Sam a dish.
c. John donated a painting to the museum.
d. * John donated the museum a painting.

A Solution to the Learnability Problem: Statistical Grammars

Children must hear words such as donate many times during their lifetimes. If their grammars predicted that this word was as likely to appear in double object constructions (1d) as in dative constructions (1c), then, if they observed 100 occurrences of it in the dative construction and none in the double object, they might begin to suspect that their current grammar was incorrect.
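The strength of this inference can be illustrated with a small calculation. The sketch below assumes, purely for illustration, that each use of donate is an independent event and that the child's current grammar assigns a probability of 0.5 to the double object construction. The function name and the independence assumption are not part of the original analysis.

```python
from math import log2

def prob_all_dative(n_observed: int, p_double_object: float = 0.5) -> float:
    """Probability of n_observed dative uses and no double object uses,
    assuming each use is independently a double object construction with
    probability p_double_object (an illustrative assumption)."""
    return (1.0 - p_double_object) ** n_observed

p = prob_all_dative(100)
print(f"P(100 datives, 0 double objects) = {p:.3e}")  # 7.889e-31
print(f"Surprise in bits = {-log2(p):.1f}")           # 100.0
```

Under these assumptions the observed data would be so improbable that a learner has good reason to doubt the grammar which predicted it.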

This leads to a fundamentally different conception of grammars, as statistical entities, as opposed to purely formal rule systems, a possibility which was dismissed by Chomsky (1957, p17):

`Despite the undeniable interest and importance of semantic and statistical studies of language, they appear to have no direct relevance to the problem of determining or characterizing the set of grammatical utterances.'
`...probabilistic models give no particular insight into some of the basic problems of syntactic structure.'

However, if we regard grammars as fundamentally statistical, this makes it much easier to account for acquisition, as children will be able to make statistical inferences of this kind. There is also some (controversial) psycho-linguistic evidence that people use statistical information when parsing sentences.

Children can't just apply such a simple strategy, though. They need to know when to make a generalisation, and when not to. It's always possible to construct any number of ad hoc grammars to account for a set of data. My solution to this problem was to use Minimum Coding Length as an evaluation of how good a grammar is as a description of a particular corpus.

Learning a Grammar Using Minimum Coding Length

So far it's only been possible to learn grammars for simple artificial languages, or to learn fairly restricted aspects of grammar. The problem addressed in Dowman (1998) was how to learn a grammar such as (2) from a corresponding corpus of data such as (3). The only language-specific knowledge available to the learning system was that the corpus had to be described using binary branching context-free phrase structure rules.
 
(2) S -> NP VP 
VP -> ran
VP -> screamed 
VP -> Vt NP
VP -> Vs S
NP -> John
NP -> Ethel
NP -> Mary
NP -> Noam
Vt -> hit
Vt -> kicked
Vs -> thinks
Vs -> hopes

(3) John hit Mary.
Ethel thinks John ran.
Mary hit Ethel.
John thinks Ethel ran.
Ethel ran.
Mary ran.
John ran.
Ethel hit Mary.
Mary thinks John hit Ethel.
Mary ran.
Ethel hit John.
John hopes Ethel thinks Mary hit Ethel.
Noam hit John.
Noam hopes John screamed. 
Mary hopes Ethel hit John.
Mary kicked Ethel.
Noam kicked Mary.
Ethel screamed.
John screamed.

While the language is very small, it has some of the key features of real language, including recursion, which allows it to generate an infinite number of sentences. Learnability proofs claim that no language of this type is learnable without UG, regardless of how few words it contains.
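To make the relation between grammar (2) and corpus (3) concrete, here is a minimal sketch of a recogniser for grammar (2). It uses the standard CYK chart algorithm, which is a choice made for this illustration and not necessarily what the learning system in Dowman (1998) used. The names BINARY, LEXICAL and recognises are hypothetical.

```python
from itertools import product

# Grammar (2): binary rules map a pair of daughters to possible mothers,
# lexical rules map a word to its possible categories.
BINARY = {
    ("NP", "VP"): {"S"},
    ("Vt", "NP"): {"VP"},
    ("Vs", "S"): {"VP"},
}
LEXICAL = {
    "ran": {"VP"}, "screamed": {"VP"},
    "John": {"NP"}, "Ethel": {"NP"}, "Mary": {"NP"}, "Noam": {"NP"},
    "hit": {"Vt"}, "kicked": {"Vt"},
    "thinks": {"Vs"}, "hopes": {"Vs"},
}

def recognises(sentence: str) -> bool:
    """CYK recognition: True if grammar (2) derives the sentence as an S."""
    words = sentence.rstrip(".").split()
    n = len(words)
    if n == 0:
        return False
    # chart[i][j] holds the categories that span words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        chart[i][i + 1] = set(LEXICAL.get(word, ()))
    for width in range(2, n + 1):
        for start in range(n - width + 1):
            end = start + width
            for mid in range(start + 1, end):
                for left, right in product(chart[start][mid], chart[mid][end]):
                    chart[start][end] |= BINARY.get((left, right), set())
    return "S" in chart[0][n]

print(recognises("John hopes Ethel thinks Mary hit Ethel."))  # True
print(recognises("Mary hit Noam."))                           # True
print(recognises("hit John Mary."))                           # False
```

The second example is accepted even though it does not occur in corpus (3), which shows that grammar (2) generalises beyond the observed sentences.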

There are two trivial grammars which can describe any corpus. The first is the one which simply regards any combination of the words in the language as a grammatical sentence. This grammar would be very short and simple, but would not be a very accurate description of the corpus.

The second is the grammar which regards only the observed sentences as grammatical and no others. This grammar would be complex, because it would have to list the entire corpus, but it would be a very accurate description of the corpus.

The correct grammar lies somewhere between these two extremes: it makes some generalisations, but still gives a good description of the corpus. The problem is knowing when to generalise and when not to. For example, Noam above appears only in subject position, so is it correct to generalise to allow it to occur in object position?

The minimum coding length method for analysing grammars consists of specifying formally a description of the grammar, followed by a description of the corpus represented in terms of the grammar. The grammar which results in the most concise overall description is considered to be the best grammar. This will give a balance between overly restrictive and overly general grammars, as grammars which predict the corpus more accurately will allow a shorter description of it. This is based on arguments from philosophy of science which state that, other things being equal, simpler theories should be preferred (Occam's razor).

Information theory gives the formula (4) for quantifying the amount of information conveyed by a symbol, relating it to the probability of the symbol occurring in the current context. Probable events convey little information, and improbable ones much information. Using this formula, and specifying how frequently each symbol was used in the grammar, we can calculate the size of the grammar. Each grammar rule was assigned a probability, and so if the corpus was represented in terms of the grammar rules needed to derive each sentence, it was then possible to calculate the size of the corpus under the description of each grammar.

(4) Quantity of information = -log Probability
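As a worked illustration of formula (4), the sketch below computes the coding length of two sentences from the rules needed to derive them under grammar (2). It assumes a base-2 logarithm, so that the unit is bits, and it assumes hypothetical rule probabilities that are uniform over the rules sharing a left-hand side. The probabilities actually used in Dowman (1998) are not given here.

```python
from math import log2

def bits(probability: float) -> float:
    """Formula (4) with a base-2 logarithm, so the result is in bits."""
    return -log2(probability)

# Hypothetical rule probabilities: uniform over the rules of grammar (2)
# that share a left-hand side. These values are assumed for illustration.
RULE_PROB = {
    ("S", "NP VP"): 1.0,
    ("NP", "John"): 0.25, ("NP", "Ethel"): 0.25,
    ("NP", "Mary"): 0.25, ("NP", "Noam"): 0.25,
    ("VP", "ran"): 0.25, ("VP", "screamed"): 0.25,
    ("VP", "Vt NP"): 0.25, ("VP", "Vs S"): 0.25,
    ("Vt", "hit"): 0.5, ("Vt", "kicked"): 0.5,
    ("Vs", "thinks"): 0.5, ("Vs", "hopes"): 0.5,
}

def derivation_bits(rules) -> float:
    """Coding length of a sentence: the sum of the information
    conveyed by each rule used in its derivation."""
    return sum(bits(RULE_PROB[rule]) for rule in rules)

john_ran = [("S", "NP VP"), ("NP", "John"), ("VP", "ran")]
john_hit_mary = [("S", "NP VP"), ("NP", "John"), ("VP", "Vt NP"),
                 ("Vt", "hit"), ("NP", "Mary")]

print(derivation_bits(john_ran))       # 4.0
print(derivation_bits(john_hit_mary))  # 7.0
```

Summing such values over every sentence gives the size of the corpus under a grammar. The size of the grammar itself would be added to this to obtain the overall description length.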

This provides a way of choosing between alternative grammars, but it doesn't explain how the grammars were created in the first place. Even though the program was restricted to considering only grammars with no more than 20 rules, and with no more than 8 different constituent labels, there were still more than 4 x 10^52 possible grammars. Considering each in turn would take a ridiculously long time.

Instead, learning started with a simple grammar allowing all sentences, and rules were added or removed at random, though always favouring changes which resulted in a better evaluation. After considering 17 557 grammars, learning was stopped, and a grammar corresponding to (2) had been learned.
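The general shape of such a search can be sketched as follows. This is only an illustration under strong assumptions: the acceptance rule here keeps a change only if it strictly improves the evaluation, which is one simple way of favouring better grammars and may differ from the rule actually used. The cost function is a toy stand-in, not a real coding length, and RULE_POOL, TARGET, toy_cost, propose and stochastic_search are all hypothetical names.

```python
import random

RULE_POOL = [f"rule{i}" for i in range(12)]  # hypothetical candidate rules
TARGET = frozenset(RULE_POOL[:5])            # stand-in for the best grammar

def toy_cost(grammar: frozenset) -> int:
    """Toy stand-in for the total coding length: the number of rules
    that are wrongly present in, or wrongly missing from, the grammar."""
    return len(grammar ^ TARGET)

def propose(grammar: frozenset, rng: random.Random) -> frozenset:
    """Pick one rule at random: remove it if present, add it if absent."""
    return grammar ^ {rng.choice(RULE_POOL)}

def stochastic_search(initial, cost, steps, seed=0):
    """Repeatedly propose a random change, keeping it only if it lowers
    the cost. Returns the final grammar, its cost, and the number of
    candidate grammars that were considered."""
    rng = random.Random(seed)
    best, best_cost = initial, cost(initial)
    considered = 0
    for _ in range(steps):
        candidate = propose(best, rng)
        considered += 1
        candidate_cost = cost(candidate)
        if candidate_cost < best_cost:
            best, best_cost = candidate, candidate_cost
    return best, best_cost, considered

best, best_cost, considered = stochastic_search(
    frozenset(RULE_POOL), toy_cost, steps=500)
print(sorted(best), best_cost, considered)
```

In a real learner the cost would be the combined size of the grammar and of the corpus encoded with it, and a less greedy acceptance rule might be needed to avoid getting stuck at a grammar that no single change improves.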

While this might seem a lot of iterations to learn such a simple grammar, it is tiny in comparison to the number of possible grammars. This suggests that the system is likely to scale up to real languages reasonably well, although it seems that a more efficient search strategy would have to be used. However, it is important to bear in mind that the human brain is far more powerful than any computer, and even then it takes many years to learn a language, so there's room for a very big search for the right grammar.

Implications for Theories of Syntax

If we assume that grammars are learned in this way, then this gives us a very different perspective on syntactic theory.

We can now talk about how good or bad given sentences are based on their probabilities, not just a dichotomy between grammatical and ungrammatical.

We can quantify how good a grammar is at capturing regularities in a corpus.

Principles such as lexical minimisation are no longer so clear cut. We can justify storing a lot of information in the lexicon for frequent words, but much less for rare ones (hence one prediction is that the rarer a word becomes, the more regular it will be).

Universal grammarians frequently strive to find the most abstract possible description of a language, so that underlying similarities between languages can be found. However, very abstract and complex structures seem very undesirable from a psychological or neurological perspective: they are just the kind of thing which the brain is very bad at handling, especially at the high speeds required during language production and comprehension.

I would argue that we should usually assume the lowest level of abstraction required to explain language productivity. For example, consider the numbers twenty, thirty, forty etc. These are semi-regular, but in acquiring and using a language it is probably better to assume that they are simply learnt; any regularities can be explained diachronically (Hurford, 1987).

Another example is that in the Minimalist Program the fact that subjects in English are obligatory is explained by postulating that the abstract Case value of the feature bundle T is `strong', and so must be checked off by raising an overt subject to the specifier of T before the point in the derivation known as Spell-Out. However, I would argue that it is more plausible that children observe that all English sentences have subjects, and so form a rule: all sentences must have subjects. This then has the added advantage that it is easier to account for the register often used in diaries etc., where sentences don't have subjects, simply by adapting the rule to state that in some registers sentences don't have obligatory subjects.

Key Points

If grammars are statistical it is possible to make inferences as to which constructions appear less often than would be expected, and hence which parts of a grammar may be incorrect.

By evaluating grammars in terms of their complexity and how good a statistical description they give of a corpus, it is possible to learn at least simple grammars for artificial languages. Scaling up to real language is limited only by the time it takes to learn larger grammars.

The best grammar isn't necessarily the most abstract. Sometimes a less abstract grammar may be much more psychologically plausible, yet still able to account for productivity.

References

Chomsky (1957). Syntactic Structures.

Chomsky (1986). Knowledge of Language: Its Nature, Origin and Use.

Dowman (1998). A Cross-linguistic Computational Investigation of the Learnability of Syntactic, Morpho-syntactic, and Phonological Structure. Research Report, Centre for Cognitive Science, University of Edinburgh.

Gold (1967). Language Identification in the Limit. Information and Control, 10:447-474.

Hurford (1987). Language and Number.

Pinker (1989). Learnability and Cognition.