This updated and augmented release adds two million words to the Modern British English corpus, for a total of three million. It also includes a substantial number of corrections to the other corpora in the series. In addition, a small number of changes have been made to the annotation guidelines for the corpora. For details see the online annotation manual.


Penn Parsed Corpora of Historical English

The Penn Parsed Corpora of Historical English, including the Penn-Helsinki Parsed Corpus of Middle English, second edition (PPCME2), the Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME), and the Penn Parsed Corpus of Modern British English, second edition (PPCMBE2), are running texts and text samples of British English prose across its history, from the earliest Middle English documents up to the First World War. The texts come in three forms: simple text, part-of-speech tagged text and syntactically annotated text. The syntactic annotation (parsing) permits searching not only for words and word sequences, but also for syntactic structure. All of the annotation has been carefully reviewed by expert human annotators for accuracy and consistency. The corpora are designed for the use of students and scholars of the history of English, especially the historical syntax of the language, and they are publicly available to individuals, research groups and libraries.


The Penn Parsed Corpora of Historical English are distributed on CD-ROM, along with software to retrieve words and structures of interest to the user. This software searches all three forms of text on the CD and also provides sophisticated coding, editing and display facilities.

Various classes of corpus site license are available to individuals, to academic departments or research groups, and to libraries. See the Corpus Order Form for charges. Corpus license fees go toward improving and expanding the corpora. Upgrades, when completed, are available to corpus license holders at modest cost.

The PPCHE CD-ROM (version 4) contains a local web server which allows users to search the corpora for syntactic structures or part-of-speech tagged text from a web browser. A demonstration of this capability can be explored at the PPCHE web demo site.


The search program included with the Penn Parsed Corpora of Historical English, CorpusSearch2, was written by Beth Randall and has been released as open source software. The most recent version can always be downloaded from its SourceForge project web site.

  • The PPCME2 was created with the support of the National Science Foundation (Grants BNS 89-19701 and SBR 95-11368), with supplementary support from the University of Pennsylvania Research Foundation.
  • The PPCEME was created with the support of the National Endowment for the Humanities (Grant PA 23382-99) and the National Science Foundation (Grant BCS 99-05488).
  • The PPCMBE2 was created with the support of the National Science Foundation (Grants BCS 05-08731 and BCS 11-47499).

With respect to the above-listed grants, please note that any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or the National Endowment for the Humanities.

Byland Abbey, Yorkshire. It was at abbeys like Byland, throughout Britain, that the manuscripts on which our knowledge of Middle English is based were largely written, copied and preserved. The monastic orders that built and inhabited these monasteries were dissolved by Henry the Eighth, whereupon the buildings were dismantled for building materials by the landlords who succeeded to the monastic estates. Most of the abbeys' manuscripts were lost, but some came into private hands and so survived. Photo © A. Kroch 1998.