As such they should be able to account for how children acquire language, and how people's language changes throughout their life.
Syntactic theories should also be compatible with data on how languages vary dialectally, and how they gradually change historically.
Problem: Current theories are usually very rigid - they don't seem to allow for a gradual process of learning over many years, or to be very useful in describing sociolinguistic phenomena or historical change.
Why is this? Almost all work in syntactic theory is done within a theory of Universal Grammar (e.g. the Minimalist Program, LFG, HPSG).
Gold (1967) and Pinker (1989) both claim to have proved that language is not learnable in these circumstances.
(1) a. John gave a dish to Sam.
    b. John gave Sam a dish.
    c. John donated a painting to the museum.
    d. * John donated the museum a painting.
This leads to a fundamentally different conception of grammars, as statistical entities, as opposed to purely formal rule systems, a possibility which was dismissed by Chomsky (1957, p17):
`Despite the undeniable interest and importance of semantic
and statistical studies of language, they appear to have no direct relevance
to the problem of determining or characterizing the set of grammatical
utterances.'
`...probabilistic models give no particular insight into
some of the basic problems of syntactic structure.'
However, if we regard grammars as fundamentally statistical, this makes it much easier to account for acquisition, as children will be able to make statistical inferences of this kind. There is also some (controversial) psycho-linguistic evidence that people use statistical information when parsing sentences.
Children can't just apply such a simple strategy, though: they need to know when to make a generalisation and when not to. It's always possible to construct any number of ad hoc grammars to account for a set of data. My solution to this problem was to use Minimum Coding Length as an evaluation of how good a grammar is as a description of a particular corpus.
(2) S -> NP VP
    VP -> ran
    VP -> screamed
    VP -> Vt NP
    VP -> Vs S
    NP -> John
    NP -> Ethel
    NP -> Mary
    NP -> Noam
    Vt -> hit
    Vt -> kicked
    Vs -> thinks
    Vs -> hopes
(3) John hit Mary. Ethel thinks John ran. Mary hit Ethel. John thinks Ethel ran. Ethel ran. Mary ran. John ran. Ethel hit Mary. Mary thinks John hit Ethel. Mary ran. Ethel hit John. John hopes Ethel thinks Mary hit Ethel. Noam hit John. Noam hopes John screamed. Mary hopes Ethel hit John. Mary kicked Ethel. Noam kicked Mary. Ethel screamed. John screamed.
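Grammar (2) and corpus (3) are small enough to check by machine. The Python sketch below encodes the rules of (2) as a dictionary and tests whether a sentence can be derived from S. The names GRAMMAR, end_positions and is_grammatical are illustrative and are not taken from the program described here. The simple top-down strategy relies on (2) having no left-recursive rules, so it should not be read as a general parser.

```python
# Grammar (2): each non-terminal maps to its alternative right-hand sides.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "VP": [["ran"], ["screamed"], ["Vt", "NP"], ["Vs", "S"]],
    "NP": [["John"], ["Ethel"], ["Mary"], ["Noam"]],
    "Vt": [["hit"], ["kicked"]],
    "Vs": [["thinks"], ["hopes"]],
}


def end_positions(symbol, words, start):
    """Return every position at which `symbol`, begun at `start`, can end."""
    if symbol not in GRAMMAR:
        # Terminal symbol: it must match the next word exactly.
        if start < len(words) and words[start] == symbol:
            return {start + 1}
        return set()
    found = set()
    for right_hand_side in GRAMMAR[symbol]:
        positions = {start}
        for child in right_hand_side:
            # Extend every partial analysis by one more child constituent.
            positions = {
                end
                for position in positions
                for end in end_positions(child, words, position)
            }
        found |= positions
    return found


def is_grammatical(sentence):
    """True if the whole sentence can be derived from S under grammar (2)."""
    words = sentence.rstrip(".").split()
    return len(words) in end_positions("S", words, 0)
```

Under this sketch every sentence of (3) is accepted, and so is "John hit Noam", which never occurs in (3). A string such as "hit John Mary" is rejected.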
There are two trivial grammars which can describe any corpus. The first is the one which simply regards any combination of the words in the language as grammatical sentences. This grammar would be very short and simple, but would not be a very accurate description of the corpus.
The second is the grammar which regards only the observed sentences as grammatical and no others. This grammar would be complex, because it would have to list the entire corpus, but it would be a very accurate description of the corpus.
The correct grammar is one which lies somewhere in between these two extremes, one which makes some generalisations, but still gives a good description of the corpus. The problem is knowing when to make generalisations and when not to. For example, Noam above appears only in subject position, so is it correct to generalise to allow it to occur in object position?
The minimum coding length method for analysing grammars consists of specifying formally a description of the grammar, followed by a description of the corpus represented in terms of the grammar. The grammar which results in the most concise overall description is considered to be the best grammar. This will give a balance between overly restrictive and overly general grammars, as grammars which predict the corpus more accurately will allow a shorter description of it. This is based on arguments from philosophy of science which state that, other things being equal, simpler theories should be preferred (Occam's razor).
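The trade-off between the two trivial grammars can be made concrete with a rough Python sketch. It is only an illustration under stated assumptions: every symbol is costed at the negative base-2 logarithm of its relative frequency in the sequence being described, the cost of transmitting the frequency tables is ignored, and the two encodings (a bare word list for the permissive grammar, a list of whole sentences for the restrictive one) are invented here rather than taken from the program described in this paper.

```python
import math
from collections import Counter

CORPUS_TEXT = (
    "John hit Mary. Ethel thinks John ran. Mary hit Ethel. John thinks Ethel ran. "
    "Ethel ran. Mary ran. John ran. Ethel hit Mary. Mary thinks John hit Ethel. "
    "Mary ran. Ethel hit John. John hopes Ethel thinks Mary hit Ethel. "
    "Noam hit John. Noam hopes John screamed. Mary hopes Ethel hit John. "
    "Mary kicked Ethel. Noam kicked Mary. Ethel screamed. John screamed."
)
CORPUS = CORPUS_TEXT.rstrip(".").split(". ")  # the 19 sentences of corpus (3)


def coding_length(symbols):
    """Bits needed for a symbol sequence, costing each symbol at -log2(p)."""
    counts = Counter(symbols)
    total = len(symbols)
    return sum(-math.log2(counts[symbol] / total) for symbol in symbols)


def permissive_description(corpus):
    """Grammar 1: any string of known words is a sentence."""
    vocabulary = sorted({word for sentence in corpus for word in sentence.split()})
    # The corpus must then be spelled out word by word, with end markers.
    tokens = [t for sentence in corpus for t in sentence.split() + ["</s>"]]
    return coding_length(vocabulary), coding_length(tokens)


def listing_description(corpus):
    """Grammar 2: only the observed sentences are grammatical."""
    distinct = sorted(set(corpus))
    # The grammar itself has to spell out every distinct sentence.
    grammar = [t for sentence in distinct for t in sentence.split() + ["</s>"]]
    # Each corpus sentence is then a single choice among the listed ones.
    return coding_length(grammar), coding_length(list(corpus))


if __name__ == "__main__":
    for name, describe in [("permissive", permissive_description),
                           ("listing", listing_description)]:
        grammar_bits, corpus_bits = describe(CORPUS)
        print(f"{name}: grammar {grammar_bits:.1f} + corpus {corpus_bits:.1f} bits")
```

With these assumed encodings the permissive grammar is cheap to state but makes the corpus expensive, while the listing grammar shows the opposite pattern. A grammar between the two extremes is preferred only if the sum of both parts is smaller.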
Information theory gives the formula (4) for quantifying the amount of information conveyed by a symbol, relating it to the probability of the symbol occurring in the current context. Probable events convey little information, and improbable ones much information. Using this formula, and specifying how frequently each symbol was used in the grammar, we can calculate the size of the grammar. Each grammar rule was assigned a probability, and so if the corpus was represented in terms of the grammar rules needed to derive each sentence, it was then possible to calculate the size of the corpus under the description of each grammar.
(4) Quantity of information = -log Probability
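As a worked illustration of (4), the Python sketch below costs a handful of derivations in bits. Several things here are assumptions rather than details from the original program: the logarithm is taken to base 2, rule probabilities are estimated as relative frequencies among rules sharing a left-hand side, and the three derivations are written out by hand for the sentences "John ran", "Mary ran" and "John hit Mary".

```python
import math
from collections import Counter


def information(probability):
    """Formula (4): information in bits, assuming a base-2 logarithm."""
    return -math.log2(probability)


# Each derivation lists the rules (left-hand side, right-hand side) used
# to derive one sentence under grammar (2).
DERIVATIONS = [
    [("S", "NP VP"), ("NP", "John"), ("VP", "ran")],
    [("S", "NP VP"), ("NP", "Mary"), ("VP", "ran")],
    [("S", "NP VP"), ("NP", "John"), ("VP", "Vt NP"),
     ("Vt", "hit"), ("NP", "Mary")],
]


def rule_probabilities(derivations):
    """Estimate P(rule | left-hand side) from how often each rule is used."""
    rule_counts = Counter(rule for d in derivations for rule in d)
    lhs_counts = Counter()
    for (lhs, _rhs), count in rule_counts.items():
        lhs_counts[lhs] += count
    return {rule: count / lhs_counts[rule[0]]
            for rule, count in rule_counts.items()}


def corpus_bits(derivations, probabilities):
    """Size of the corpus: the summed information of every rule application."""
    return sum(information(probabilities[rule])
               for d in derivations for rule in d)
```

In this toy sample the rule S -> NP VP is the only way to expand S, so it has probability 1 and costs nothing. The less frequent rule VP -> Vt NP has probability 1/3 and costs about 1.58 bits each time it is used.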
This provides a way of choosing between alternative grammars, but it doesn't explain how the grammars were created in the first place. Even though the program was restricted to considering only grammars with no more than 20 rules, and with no more than 8 different constituent labels, there were still more than 4 × 10^52 possible grammars. Considering each in turn would take a ridiculously long time.
Instead, learning started with a simple grammar allowing all sentences, and rules were added or removed at random, though always favouring changes which resulted in a better evaluation. After considering 17 557 grammars, learning was stopped, and a grammar corresponding to (2) had been learned.
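A search of this general kind can be sketched in Python as follows. The text does not say exactly how better changes were favoured, so the acceptance rule here is an assumption: improvements are always kept, and worse grammars are occasionally kept with a probability that shrinks as the search proceeds, in the style of simulated annealing. The function names, the representation of a grammar as a set of rule strings, and the stand-in cost function toy_cost are all hypothetical. A real run would plug in the coding length of the grammar plus the corpus.

```python
import math
import random


def stochastic_search(initial_rules, candidate_rules, evaluate,
                      steps=2000, temperature=1.0, seed=0):
    """Randomly add or remove rules, favouring changes that lower the cost."""
    rng = random.Random(seed)
    current = frozenset(initial_rules)
    current_cost = evaluate(current)
    best, best_cost = current, current_cost
    for step in range(steps):
        rule = rng.choice(candidate_rules)
        proposal = current ^ {rule}  # add the rule if absent, else remove it
        cost = evaluate(proposal)
        # Assumed cooling schedule: tolerance for worse grammars falls to zero.
        tolerance = temperature * (1 - step / steps) + 1e-9
        accept_worse = rng.random() < math.exp(min(0.0, current_cost - cost) / tolerance)
        if cost <= current_cost or accept_worse:
            current, current_cost = proposal, cost
            if cost < best_cost:
                best, best_cost = proposal, cost
    return best, best_cost


def toy_cost(rules):
    """Stand-in evaluation: one unit per rule, ten per missing needed rule."""
    needed = {"a", "b", "c"}
    return len(rules) + 10 * len(needed - rules)
```

For example, `stochastic_search([], list("abcdef"), toy_cost)` explores subsets of six candidate rules and returns the cheapest one it visited. The toy problem has only 64 possible rule sets, so it says nothing about how such a search behaves on the much larger space described above.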
While this might seem a lot of iterations to learn such a simple grammar, it is tiny in comparison to the number of possible grammars. This suggests that the system is likely to scale up to real languages reasonably well, although it seems that a more efficient search strategy would have to be used. However, it is important to bear in mind that the human brain is far more powerful than any computer, and even then it takes many years to learn a language, so there's room for a very big search for the right grammar.
We can now talk about how good or bad given sentences are based on their probabilities, not just a dichotomy between grammatical and ungrammatical.
We can quantify how good a grammar is at capturing regularities in a corpus.
Principles such as lexical minimisation are no longer so clear cut: we can justify lots of information in the lexicon for frequent words, but much less for rare ones (hence one prediction is that the rarer a word gets, the more regular it will be).
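The first of these points, grading sentences by probability instead of sorting them into grammatical and ungrammatical, can be illustrated with grammar (2). The Python sketch below assumes, purely for illustration, that all rules sharing a left-hand side are equally probable. These are not the probabilities the learning program arrived at, and the names RULES, span_probabilities and sentence_probability are invented for this example.

```python
from fractions import Fraction

# Grammar (2), with alternatives for each non-terminal.
RULES = {
    "S": [("NP", "VP")],
    "VP": [("ran",), ("screamed",), ("Vt", "NP"), ("Vs", "S")],
    "NP": [("John",), ("Ethel",), ("Mary",), ("Noam",)],
    "Vt": [("hit",), ("kicked",)],
    "Vs": [("thinks",), ("hopes",)],
}


def span_probabilities(symbol, words, start):
    """Map each end position to the probability that `symbol` derives
    words[start:end]."""
    if symbol not in RULES:
        # Terminal symbol: probability 1 if it matches the next word.
        if start < len(words) and words[start] == symbol:
            return {start + 1: Fraction(1)}
        return {}
    # Assumption: uniform probability over the alternatives for this symbol.
    rule_probability = Fraction(1, len(RULES[symbol]))
    totals = {}
    for right_hand_side in RULES[symbol]:
        partial = {start: rule_probability}
        for child in right_hand_side:
            extended = {}
            for position, probability in partial.items():
                child_spans = span_probabilities(child, words, position)
                for end, child_probability in child_spans.items():
                    extended[end] = (extended.get(end, 0)
                                     + probability * child_probability)
            partial = extended
        for end, probability in partial.items():
            totals[end] = totals.get(end, 0) + probability
    return totals


def sentence_probability(sentence):
    """Probability of the whole sentence under the assumed probabilities."""
    words = sentence.rstrip(".").split()
    return span_probabilities("S", words, 0).get(len(words), Fraction(0))
```

Under these assumed probabilities "John ran" receives probability 1/16 and "Noam hit John" receives 1/128, while a string that (2) cannot derive, such as "hit John Mary", receives 0. Longer or rarer constructions are therefore scored as less likely but not as ungrammatical.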
Universal grammarians frequently strive to find the most abstract description of a language possible, so that underlying similarities between languages can be found. However, very abstract and complex structures seem very undesirable from a psychological or neurological perspective: they're just the kind of thing which the brain is very bad at doing, especially at the high speeds required during language production and comprehension.
I would argue that we should usually assume the least level of abstraction required to explain language productivity. For example, consider the numbers twenty, thirty, forty etc. These are semi-regular, but in acquiring and using a language it is probably better to assume that these are just learnt; any regularities can be explained diachronically (Hurford, 1987).
Another example is that in the Minimalist Program the fact that subjects in English are obligatory is explained by postulating that the abstract Case value of the feature bundle T is `strong', and so must be checked off by raising an overt subject to the specifier of T before the point in the derivation known as Spell-Out. However, I would argue that it is more plausible that children observe that all English sentences have subjects, and so form a rule: all sentences must have subjects. This then has the added advantage that it is easier to account for the register often used in diaries etc., where sentences don't have subjects, simply by adapting the rule to state that in some registers sentences don't have obligatory subjects.
By evaluating grammars in terms of their complexity and how good a statistical description they give of a corpus, it is possible to learn at least simple grammars for artificial languages. Scaling up to real language is limited only by the time it takes to learn larger grammars.
The best grammar isn't necessarily the most abstract: sometimes a less abstract grammar may be much more psychologically plausible, yet still able to account for productivity.
Chomsky (1986). Knowledge of Language: Its Nature, Origin, and Use.
Dowman (1998). A Cross-linguistic Computational Investigation of the Learnability of Syntactic, Morpho-syntactic, and Phonological Structure. Research Report, Centre for Cognitive Science, University of Edinburgh.
Gold (1967). Language Identification in the Limit. Information and Control, 10:447-474.
Hurford (1987). Language and Number.
Pinker (1989). Learnability and Cognition.