Exploring Words and Phrases from the British National Corpus*

  The most up-to-date version of this site with a much larger database is at
http://pie.usna.edu
or http://phrasesinenglish.org
 



The British National Corpus and this site

The British National Corpus (BNC) is a carefully selected collection of 4124 contemporary written and spoken English texts, primarily from the United Kingdom. The corpus totals over 100 million words and covers a representative range of domains, genres and registers. The entire corpus has been analyzed and marked up with part-of-speech (POS) tags. Provenance and other attributes are carefully documented for each text. "What is the BNC?" provides a succinct overview of the corpus; for an exhaustive description, consult the British National Corpus Users Reference Guide. Chapter 1 of Guy Aston and Lou Burnard's BNC Handbook includes an informative survey of possible uses of corpora in general and of the BNC in particular. Additional useful information and resources (including various frequency lists with more refined POS tagging) are found on the companion website for Word Frequencies in Written and Spoken English based on the British National Corpus by Geoffrey Leech, Paul Rayson and Andrew Wilson. The introduction to that book includes a very readable discussion of how the corpus was tokenized and tagged.

This site incorporates a database (referred to here as the w&p-db) derived from the second, or World, Edition of the BNC (2000); the site is not affiliated with the BNC Consortium. It aims to provide a simple yet powerful interface, appropriate for both experienced researchers and novice users, for studying words and phrases up to six words long. For investigating words in longer contexts, the full BNC corpus and the Sara search and analysis software are available on CD-ROM from the BNC Consortium (a single user license costs only 50). Alternatively, one can look up individual words and phrases online.

To understand and interpret the datasets produced here, and to compare them to the results of direct queries to the BNC, please read how and why the original data were normalized to build the w&p-db.

What can this site do now?

Via two basic query pages users can explore n-grams and phrase-frames. Here an n-gram is understood as a sequence of n words, where n is in the range 1-6, and a word means a token of any lexical entity assigned a BNC POS tag by the CLAWS tagger (details here). For example, the most frequent 1-gram in the BNC data is "the", and "the end of the" tops the list of 4-grams. Phrase-frames are sets of variants of an n-gram that are identical except for one word, represented here by the wildcard symbol *. The most frequent (and most productive, i.e. having the greatest number of variants) 4-frame is "the * of the", with 4058 variants such as "the end of the", "the rest of the", "the top of the", "the nature of the" etc.
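The two notions above can be illustrated with a minimal sketch. It is not the code behind the w&p-db or kfNgram: the function names are hypothetical, the sample sentence is invented, and tokens are split on whitespace rather than by the CLAWS-based tokenization of the BNC. It counts the 4-grams in a short text and then groups them into phrase-frames by replacing each position in turn with the wildcard *.

```python
from collections import Counter, defaultdict

def ngrams(tokens, n):
    """Yield every contiguous sequence of n tokens as a tuple."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

def phrase_frames(ngram_counts):
    """Group n-grams that are identical except for one position.

    Returns {frame: {variant: frequency}}, where a frame is an n-gram
    with exactly one word replaced by the wildcard "*". In this sketch
    frames with a single variant are kept as well.
    """
    frames = defaultdict(dict)
    for gram, freq in ngram_counts.items():
        for slot in range(len(gram)):
            frame = gram[:slot] + ("*",) + gram[slot + 1:]
            frames[frame][gram] = freq
    return frames

# Invented sample text; naive whitespace tokenization (an assumption).
tokens = ("the end of the day and the rest of the week "
          "and the top of the hill").split()

counts = Counter(ngrams(tokens, 4))
frames = phrase_frames(counts)

# Variants of the 4-frame "the * of the" found in the sample text.
the_of_the = frames[("the", "*", "of", "the")]
fillers = sorted(gram[1] for gram in the_of_the)
```

In this toy text the frame "the * of the" collects three variants, filled by "end", "rest" and "top". Ranking frames by the number of variants would give a rough measure of the productivity mentioned above.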

For each query, datasets are returned in "chunks" of up to 10,000 items, and queries can be repeated until all matching data have been displayed. Results can be ordered alphabetically or by frequency. For focused studies users can "filter" results for specific word-forms and/or word-classes which a query must match or exclude, with full support for wildcards. Details can be found in the tutorials. Sample uses of filters include searches for only...
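As a rough illustration of how such filtering, ordering and chunking might fit together, here is a small sketch. It makes several assumptions: the helper names are hypothetical, the frequencies are invented, shell-style wildcards from Python's fnmatch module stand in for the site's own wildcard syntax (which is described in the tutorials, not here), and word-class (POS) filters are left out.

```python
from fnmatch import fnmatchcase
from itertools import islice

def passes(gram, must_match=(), must_exclude=()):
    """True if every must_match pattern matches some word of the n-gram
    and no must_exclude pattern matches any word."""
    has_all = all(any(fnmatchcase(w, p) for w in gram) for p in must_match)
    has_none = not any(fnmatchcase(w, p) for w in gram for p in must_exclude)
    return has_all and has_none

def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

# (n-gram, frequency) rows; the frequencies are invented for illustration.
rows = [
    (("the", "end", "of", "the"), 9),
    (("at", "the", "end", "of"), 8),
    (("the", "rest", "of", "the"), 7),
    (("in", "the", "ending", "of"), 2),
]

# Keep n-grams containing a word that starts with "end" but not the word "at".
hits = [r for r in rows if passes(r[0], must_match=["end*"], must_exclude=["at"])]

by_frequency = sorted(hits, key=lambda r: r[1], reverse=True)
alphabetical = sorted(hits, key=lambda r: r[0])

# The site returns up to 10,000 items per chunk; 1 is used here to show paging.
pages = list(chunked(by_frequency, 1))
```

Each element of pages corresponds to one "chunk" that a repeated query would return.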

What will this site do later?

When a faster, more reliable server becomes available in November 2003, the database will be expanded to include items with lower frequencies (the current minimum is 5; the planned cutoff is 2), and a separate phrase-structure database will permit study of all patterns of POS tags. In a follow-on version planned for release in mid-2004, this site will also support querying with regular expressions and filtering of query results by domain, genre, target age and target level. A separate database focusing on numbers is planned as well. Users will be able to download an entire dataset matching a query in plain text or XML format as a single zip-compressed file; tools developed for KWiCFinder and kfNgram will permit browsing and analysis of the datasets via a graphical user interface on the PC. Slight modifications to data normalization conventions may result in minor discrepancies in the frequencies reported for the two versions of the database. Finally, as releases of the American National Corpus (ANC) become available, parallel databases will be created for the ANC data. The ANC's POS tagset and text-type taxonomy are substantially similar to those of the BNC, so this site will facilitate both separate and comparative studies of words and phrases in the two principal variants of English. Major changes to this site will be announced on the Corpora, Linguist and Corpus Linguistics and Language Teaching lists.

Acknowledgements

First and foremost*, this site owes its very existence to the monumental achievement of the BNC development team. After months of reading and re-reading every bit of documentation and rooting around in the nooks and crannies of the SGML-encoded data, I have profound respect and gratitude for their efforts and accomplishments. We all look forward to future updates to the corpus. [*"first and foremost" occurs 234 times in the BNC]

As site developer I also gratefully acknowledge my debt to Michael Stubbs of the University of Trier for fruitful e-mail discussions that led to the creation and refinement of this database and Web site. It was Stubbs who generously suggested that I add support for "phrase-frames" to kfNgram. This concept originated with his research assistant Isabel Barth, who also implemented the original phrase-frame generator. Their collaboration led to the insightful paper "Using recurrent phrases as text-type discriminators: a quantitative method and some findings", to appear in Functions of Language (10, 1, 2004).  kfNgram was originally developed for a comparative study of a corpus I compiled from the Web with data from the BNC. When I remarked that generating lists of all the n-grams and phrase-frames in the BNC would really test the limits of kfNgram, Stubbs encouraged me to do it and suggested breaking the lists down further by domain and genre. The goal has evolved from a collection of overwhelmingly large static lists into databases which produce manageable datasets tailored to the user's research needs. Two of Stubbs' works available online survey and illustrate core concepts and point the way to exploring words and phrases: "Words in Use: Introductory Examples", chapter 1 of Words and Phrases: Corpus Studies in Lexical Semantics (Blackwell, 2001) and "Using very large text collections to study semantic schemas" (2000).

Finally, I am very grateful to David Lee for permission to incorporate portions of his spreadsheet BNC Index for the BNC World Edition in the database for the follow-on phase of this site, to permit filtering results by domain, genre, target age and target level. While awaiting implementation of the next phase, users are encouraged to consult his thorough discussion of the issues of classification by "text type" in: Lee, David Y.W. 2001. Genres, registers, text types, domains and styles: clarifying the concepts and navigating a path through the BNC jungle. Language Learning & Technology, Vol. 5(3): 37-72.