FAQ (Fletcher-Anticipated Questions)
Exploring Words and Phrases from the
British National Corpus*
What are...?
- n-grams
- on this site n-grams means sequences of n words (see "words"
below). In this database, n
can be any number in the range 1-6, i.e. from individual words up to
six-word phrases. Only words and phrases occurring at least five times in the
BNC are included here. Relatively frequent n-grams are typically familiar
building blocks of English; such recurrent n-grams are also known as lexical
bundles, lexical chains or clusters. Shorthand forms like 1-gram, 2-gram,
3-gram etc. specify the value of n; some prefer
unigram, bigram, trigram etc. In information retrieval and
computational linguistics contexts the term n-gram more often means
"sequence of n characters".
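The definition above can be illustrated with a minimal Python sketch. It is not the code used to build this database; the function names and the toy token list are hypothetical, and the only figure taken from the text is the minimum frequency of five. For simplicity the sketch ignores sentence boundaries.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens in a list of words."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def frequent_ngrams(tokens, n, min_freq=5):
    """Keep only the n-grams that occur at least min_freq times."""
    counts = count_ngrams(tokens, n)
    return {gram: freq for gram, freq in counts.items() if freq >= min_freq}

# Toy "corpus" (hypothetical): the same six-word phrase repeated five times.
tokens = ("at the end of the day " * 5).split()
top = frequent_ngrams(tokens, 3)
for gram, freq in sorted(top.items()):
    print(" ".join(gram), freq)
```

In this toy input, 3-grams that span the join between two repetitions (such as "the day at") occur only four times, so they fall below the threshold and are dropped.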
- phrase-frames
- sets of phrases (n-grams) which are identical except for one word,
dubbed the "wildword" and represented by the wildcard sign *. For
example, at the * of is a
phrase-frame with variants like at the start of, at the end
of, at the heart of
etc. Phrase-frames are useful tools for discovering phraseological patterns.
Guidelines for choosing n-grams or phrase-frames are given in the
tutorials. Parallel to 3-gram etc. this
site uses 3-frame etc. as shorthand for "phrase-frame of three words",
and p-frame is a handy stand-in for phrase-frame.
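As a rough illustration of how phrase-frames relate to n-grams, the sketch below groups n-grams that differ in exactly one position. The function name, the `min_variants` parameter and the frequencies are invented for the example; they do not come from the database.

```python
from collections import defaultdict

def phrase_frames(ngram_counts, min_variants=2):
    """Group n-grams into frames that are identical except for one slot ('*')."""
    frames = defaultdict(dict)
    for gram, freq in ngram_counts.items():
        for slot in range(len(gram)):
            # Replace the word at this position with the wildword sign.
            frame = gram[:slot] + ("*",) + gram[slot + 1:]
            frames[frame][gram[slot]] = freq
    # A frame is only interesting if more than one variant fills the slot.
    return {f: v for f, v in frames.items() if len(v) >= min_variants}

# Invented frequencies, for illustration only.
counts = {
    ("at", "the", "start", "of"): 12,
    ("at", "the", "end", "of"): 40,
    ("at", "the", "heart", "of"): 7,
}
frames = phrase_frames(counts)
for frame, variants in frames.items():
    print(" ".join(frame), sorted(variants))
```

With these three 4-grams the only frame that has more than one variant is at the * of.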
- words
- lexical units as identified by the BNC's CLAWS parser with POS tags,
including "multiword units".
"Fused forms" are split up into morphemes, each tagged as a separate word
token. Orthographic variants of the same lexeme (database / data-base,
realise / realize) appear as different lexical units. Compound nouns
written with white-space instead of hyphens are separated into their
components, so data base is treated as two lexical units.
- multiword units
- phrases that function grammatically as single words, e.g. conjunction
so that or preposition in spite of, receive a single
POS tag, so they are treated here as single words.
To make this obvious in search results they are displayed with underscores
instead of spaces: so_that, in_spite_of. To search for multiword units
you must enter them in a single query field and use underscores, not spaces.
Since spaces are used to separate multiple words to match, the word-form
filter in spite of matches in OR spite OR of.
List of multiword units.
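The difference between spaces and underscores in a word-form field can be sketched as follows. This is only a model of the behaviour described above, not the site's own query code; the function names are hypothetical.

```python
def parse_word_filter(field):
    """Split a filter field on spaces; each piece is one alternative (OR)."""
    return set(field.split())

def word_matches(token, field):
    """True if the token equals any one of the alternatives in the field."""
    return token in parse_word_filter(field)

def to_multiword_query(phrase):
    """Rewrite a multiword unit typed with spaces into its underscore form."""
    return "_".join(phrase.split())

print(word_matches("spite", "in spite of"))        # any single word matches
print(word_matches("in_spite_of", "in spite of"))  # the multiword unit does not
print(word_matches("in_spite_of", to_multiword_query("in spite of")))
```

Under this model a field containing in spite of matches any of the three separate words but never the single token in_spite_of, which is why the underscore form is required.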
- fused forms
- multiple morphemes written without space in English such as cannot, he'd,
George's are "de-fused" by the parser into can not, he 'd, George 's.
Different POS tags clarify whether 'd stands for had or would
and whether 's comes from is or has, or else represents a possessive.
List of fused forms.
- filters
- query conditions which focus the matching dataset by "filtering out"
unwanted items. Filtering can be done by words, POS codes and / or
frequency, and multiple forms can be specified for inclusion in or
exclusion from the dataset.
- POS-tags
- Each "word" in the corpus is tagged with one of 57 three-character
"Part Of Speech" codes; this
list of POS codes explains and gives examples of how these codes are
applied. The w&pdb database permits searching for specific combinations of POS
codes specified by either choosing from a list or entering directly; wildcards
can be used to match groups of related codes. Occasionally the code UNC
(unclassified) is overused, for example for the ai of ain't,
which is ambiguous but could be assigned manually to the proper form of BE or
HAVE.
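To show how a wildcard can select a group of related codes, here is a small sketch using Python's shell-style matching. The wildcard syntax (? for one character, * for any run of characters) is an assumption made for this example and may differ from what the query form accepts; the function names are hypothetical, and the tags are a small sample of BNC codes.

```python
from fnmatch import fnmatchcase

def match_pos(tags, pattern):
    """Return the tags that fit a shell-style wildcard pattern."""
    return [tag for tag in tags if fnmatchcase(tag, pattern)]

def ngram_matches(tagged_ngram, patterns):
    """True if each (word, tag) pair fits the pattern in the same position."""
    return len(tagged_ngram) == len(patterns) and all(
        fnmatchcase(tag, pattern)
        for (_, tag), pattern in zip(tagged_ngram, patterns)
    )

# A small sample of codes, not the full list of 57.
tags = ["NN0", "NN1", "NN2", "NP0", "VVB", "VVD", "UNC"]
print(match_pos(tags, "NN?"))  # the common-noun codes
print(match_pos(tags, "V*"))   # every code starting with V
print(ngram_matches([("the", "AT0"), ("end", "NN1")], ["AT0", "NN?"]))
```

The second function applies one pattern per word position, which mirrors specifying a combination of POS codes for an n-gram.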
Why...?
- Why do you only support Internet Explorer?
- In this initial phase the
time required to develop and test for multiple browsers would detract from
building the database and user interface. Webmasters report that over
85% of Website visitors use Internet Explorer (IE), and even more have access to IE
on their machine. When this Website is stable
and fully documented I will strive for cross-browser compatibility.
Incidentally, the compact and capable
Opera 7 browser supports most of the IE features on this site,
and most functions should also work in Netscape versions 6 and higher.
- Why do I see no change in the results pane after editing the query parameters?
- When you change any of the query parameters you must click the "Query"
button to start a new query. (The "Next" button in the results
pane continues fetching subsequent chunks of the dataset from your previous query.)
- Why do I only see the page heading in the results pane, but no results
appear?
- The current server can be very slow: you may have to wait up to 90
seconds for results, and the server or your browser may "time out" while you
are waiting. Some suggestions are...
- Wait at least a minute before giving up.
- Try a different time of day.
- Choose only essential options: don't show POS tags if not needed;
don't display results as a table.
- Choose a smaller "chunk" size and / or specify fewer filters.
- Wait until the database is installed on a new server sometime in November
2003.
- Why do results show no matches for a phrase that must be in the BNC?
- This question has many possible answers:
- Is your minimum frequency set too high or your maximum too low? Some
phrases are less frequent than you think, and setting a maximum frequency may
exclude some familiar phrases. (The minimum frequency for inclusion in the
database is 5; there is no maximum.)
- Are you looking for phrases that are too long? Try a smaller value
for n: 4-, 5- and 6-grams are relatively rare.
- If you have specified POS tag filters, are they appropriate for the word
forms you want? Try again with no filters. If you checked the "exclude"
box, does it make sense?
- If you are an American, did you use the appropriate British
spelling? Orthographic variants (e.g. -ise / -ize) have not been
normalized. If you wish to query for more than one variant, enter both in the
"word form" filter field, separated by a space (normalise normalize),
or else use a wildcard (normali?e).
- Why are there no phrase-frames matching my query even though I find several
variants in the database?
- Phrase-frames are sets of variants which are identical except for one
word, e.g. all but the second word are the same. Do the variants you have
observed really differ only in the word at the same position?
- If you specify word form or POS tag filters, leave at least one word
unspecified. (You may specify -*- for this "wildword", but that
is redundant if the other words are specified.) If you need to specify
something for each word, use the "Explore N-Grams" page instead.
- If you have specified POS tag filters, are they appropriate for the word
forms you want? Try again with no filters. If you checked the "exclude"
box, does it make sense?
- Why can't I save results pages with the "Save Page" or "Save Data" buttons?
- These buttons require the ActiveX file system component and work only with
the Windows version of Internet Explorer 5.x and greater. With this browser your security settings will prevent
saving pages unless you have
enabled ActiveX components to run either automatically or after prompting (in which
case you will be nagged for permission each time). It is potentially unsafe to
allow every site to run any desired components on your
computer. The best solution is to add this site to the browser's "Trusted
Sites" list: in the Tools menu choose Internet Options..., open the Security tab,
click on the "Trusted sites" icon, then the "Sites" button and add this site
to the list. Uncheck "Require server verification...", then click "OK".
On this site ActiveX is used exclusively to save Web pages. Users with security concerns are encouraged to verify this by inspecting the
JavaScript function savepage( ) in the script file
BNCresults.js.
- Why can't I find common phrases like of course, in spite of?
- Such "multiword units" are treated by the BNC's CLAWS parser as single words.
Enter them in a single word field and replace the spaces with _
(underscore): of_course. Complete list of multiword
units.
- Why can't I find contractions like don't, they're or
possessives like children's, parents'?
- Such "fused forms" are treated by the BNC's CLAWS parser as separate words.
Enter each part in a separate word field: do n't, they 're,
children 's, parents '. Note that "altered" forms like won't, ain't
are segmented as wo n't, ai n't; the exception can't is
segmented can n't, parallel to cannot > can not. Complete list of fused forms.