LECTURE #2. The Method of Corpus Analysis. The British National Corpus





1. General notes: benefits, purpose, definitions.

2. Design of the corpus.

3. Selection features.

4. Design of the spoken component.

5. The searching procedure.

 

Quantitative methods have long been used in linguistic research and are now widely applied. The statistical data they yield make it possible to draw solid conclusions. Today, computer access to large amounts of linguistic evidence removes the need for complicated, formula-based calculations that cost time and effort. The method in question is the corpus method.

A corpus is a large collection of computer-readable texts.

Corpus linguistics is the study of language by means of machine-readable corpora of written or spoken texts; it covers all processes related to their processing, use and analysis. The term is relatively recent and refers to a methodology based on examples of ‘real life’ language use. At present, the effectiveness and usefulness of corpus linguistics are closely tied to the development of computer science. Well-known corpora include:

 

The Bank of English – 524 million words (the COBUILD dictionaries are based on it).

The Corpus of Contemporary American English (COCA) – 450 million words (1990-2012).

The Longman Written American Corpus – a dynamic corpus of 100 million words comprising running text from newspapers, journals, magazines, best-selling novels, technical and scientific writing, and coffee-table books.

The Longman Spoken American Corpus – a unique resource of 5 million words of everyday American speech.

The British National Corpus – 100 million words.

The Czech Corpus – over 100 million words, mainly written Czech.

The International Netherlands Language Corpus – 38 million words.

The International Netherlands Language Newspaper Corpus – 27 million words.

The Portuguese Corpus – 45 million words.

The Oslo Corpus of Bosnian Texts – 1.5 million words.

The British National Corpus: World Edition October 2000

General Notes

An initial report appeared in 1991, and a substantially revised and expanded version in early 1994.

Lead partner in the consortium [kən'sɔːtɪəm] (an association of companies, especially one formed for a particular purpose): Oxford University Press.

The general benefits of the corpus method:

– The material collected in large computerized corpora represents authentic rather than invented language.

– Computers can process enormous amounts of data.

– The method of retrieving the data is objective rather than intuitive, which means that studies can be replicated by other researchers using the same or different corpora.

– Specific corpora selected from particular types of texts allow for comparisons of the use and frequency of certain features in different text-types, provided that the corpora are large enough.
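Comparing frequencies across corpora of different sizes usually means normalising the raw counts first. The sketch below shows one common convention, occurrences per million words; the function name and the counts are invented for illustration and do not come from any of the corpora listed above.

```python
def per_million(raw_count: int, corpus_size: int) -> float:
    """Return the number of occurrences per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count * 1_000_000 / corpus_size


# Hypothetical raw counts of one feature in two sub-corpora of different size.
spoken = per_million(raw_count=1_200, corpus_size=10_000_000)
written = per_million(raw_count=5_400, corpus_size=90_000_000)

print(f"spoken:  {spoken:.1f} per million words")
print(f"written: {written:.1f} per million words")
```

In this invented case the feature has the higher raw count in the written sub-corpus but is relatively more frequent in the spoken one, which is why raw counts alone can mislead.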

Purpose

The uses originally envisaged for the British National Corpus were set out in a working document called Planned Uses of the British National Corpus BNCW02 (11 April 91). This document identified the following as likely application areas for the corpus:

• reference book publishing

• academic linguistic research

• language teaching

• artificial intelligence

• natural language processing

• speech processing

• information retrieval

In particular, the database provided by the Corpus may be used:

1) as a source of examples of “real life” language usage in teaching English;

2) for finding new tendencies in language development;

3) for the investigation of a speaker’s role in language production;

4) for determining peculiarities of different registers;

5) for contrastive analysis of English as a Native Language and English as a Foreign Language;

6) for the theory and practice of translation, using so-called “translation and parallel corpora”.

The same document identified the following categories of linguistic information derivable from the corpus:

• lexical

• semantic/pragmatic

• syntactic

• morphological

• graphological/written form/orthographical

An example of contrastive analysis: research on the use of infinitive and gerundial constructions has shown that students learning English as a second language overuse the infinitive construction after the word “possibility” (“possibility to do”). Speakers whose mother tongue is English, by contrast, use only the gerundial construction (“possibility of doing”).
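A comparison of this kind can be sketched as a simple pattern count over two sets of sentences. Everything here is an assumption made for illustration: the sentences are invented, and the two regular expressions are crude surface patterns (the gerund pattern would also match a noun such as "thing"), not the method used in the research mentioned above.

```python
import re

# Invented sentences standing in for learner and native-speaker sub-corpora.
learner = [
    "There is a possibility to improve the results.",
    "We have the possibility of testing it again.",
    "She saw no possibility to refuse.",
]
native = [
    "There is a possibility of improving the results.",
    "He ruled out the possibility of leaving early.",
]

# Crude surface patterns: "possibility to <word>" and "possibility of <word>ing".
INFINITIVE = re.compile(r"\bpossibility to \w+", re.IGNORECASE)
GERUND = re.compile(r"\bpossibility of \w+ing\b", re.IGNORECASE)


def count_patterns(sentences):
    """Count both constructions in a list of sentences."""
    text = " ".join(sentences)
    return {
        "infinitive": len(INFINITIVE.findall(text)),
        "gerund": len(GERUND.findall(text)),
    }


print("learner:", count_patterns(learner))
print("native: ", count_patterns(native))
```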

General definitions

The British National Corpus is:

• a sample corpus: composed of text samples generally no longer than 45,000 words.

• a synchronic corpus: it includes imaginative texts from 1960 onwards and informative texts from 1975 onwards.

• a general corpus: not specifically restricted to any particular subject field, register or genre.

• a monolingual British English corpus: it comprises text samples which are substantially the product of speakers of British English.

• a mixed corpus: it contains examples of both spoken and written language.

Design of the corpus

There is a broad consensus among the participants in the project and among corpus linguists that a general-purpose corpus of the English language would ideally contain a high proportion of spoken language in relation to written texts. However, it is significantly more expensive to record and transcribe natural speech than to acquire written text in computer-readable form. Consequently the spoken component of the BNC constitutes approximately 10 per cent (10 million words) of the total and the written component 90 per cent (90 million words). These were agreed to be realistic targets, given the constraints of time and budget, yet large enough to yield valuable empirical statistical data about spoken English.

The BNC World Edition contains 4054 texts and occupies 1,508,392 Kbytes, or about 1.5 Gb. In total, it comprises just over 100 million orthographic words (specifically, 100,467,090), but the number of w-units is slightly less: 97,619,934. The total number of s-units is just over 6 million (6,053,093).

• S-units (segment-units): number of <s> elements – more or less equivalent to sentences

• W-units: number of <w> elements – more or less equivalent to words.
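Counting the two kinds of unit comes down to counting elements in the marked-up text. The following is a minimal sketch on an invented, heavily simplified fragment written as well-formed XML for convenience; the real corpus files carry far more markup, and the sample sentences are not taken from the BNC.

```python
import xml.etree.ElementTree as ET

# Invented fragment: <s> wraps a sentence-like segment, <w> a word,
# <c> a punctuation mark (which is not counted as a w-unit).
SAMPLE = """
<text>
  <s n="1"><w>The</w> <w>corpus</w> <w>is</w> <w>large</w><c>.</c></s>
  <s n="2"><w>It</w> <w>contains</w> <w>speech</w><c>.</c></s>
</text>
"""

root = ET.fromstring(SAMPLE)

# iter(tag) walks the whole tree and yields every element with that tag.
s_units = sum(1 for _ in root.iter("s"))
w_units = sum(1 for _ in root.iter("w"))

print(f"s-units: {s_units}, w-units: {w_units}")
```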

The percentage is calculated with reference to the relevant portion of the corpus, for example, in the table for "written text domain", with reference to the total number of written texts. These reference totals are given in the first table below.

Table 1. Composition of the BNC World Edition

Text type                      Texts  W-units   %      S-units  %
Spoken demographic             153    4206058   4.30   610563   10.08
Spoken context-governed        757    6135671   6.28   428558   7.07
All Spoken                     910    10341729  10.58  1039121  17.78
Written books and periodicals  2688   78580018  80.49  4403803  72.75
Written-to-be-spoken           35     1324480   1.35   120153   1.98
Written miscellaneous          421    7373707   7.55   490016   8.09
All Written                    3144   87278205  89.39  5013972  82.82
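The percentages in Table 1 can be checked against the corpus totals quoted earlier (97,619,934 w-units and 6,053,093 s-units). The sketch below recomputes a few of them; the helper name `share` is an assumption. Recomputed values can differ from the printed ones in the last digit, since the printed figures appear to be truncated rather than rounded. The s-unit share of all spoken texts comes out near 17.17 per cent, which does not match the printed 17.78 and suggests a misprint there.

```python
TOTAL_W_UNITS = 97_619_934
TOTAL_S_UNITS = 6_053_093

# (w-units, s-units) copied from Table 1.
counts = {
    "Spoken demographic": (4_206_058, 610_563),
    "Spoken context-governed": (6_135_671, 428_558),
    "All Spoken": (10_341_729, 1_039_121),
    "All Written": (87_278_205, 5_013_972),
}


def share(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100 * part / whole


for label, (w, s) in counts.items():
    print(f"{label:<24} {share(w, TOTAL_W_UNITS):6.2f} {share(s, TOTAL_S_UNITS):6.2f}")

# The spoken and written components together make up the whole corpus.
assert counts["All Spoken"][0] + counts["All Written"][0] == TOTAL_W_UNITS
assert counts["All Spoken"][1] + counts["All Written"][1] == TOTAL_S_UNITS
```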

 


All texts are also classified according to their date of production. For spoken texts, the date was that of the recording. For written texts, the date used for classification was the date of production of the material actually transcribed, for the most part; in the case of imaginative works, however, the date of first publication was used. Informative texts were selected only from 1975 onwards, imaginative ones from 1960, reflecting their longer “shelf-life”, though most (75 per cent) of the latter were published no earlier than 1975.

Table 2. Date of production

Creation date  Texts  W-units   %      S-units  %
Unknown        162    1814051   1.85   127132   2.10
Before 1974    47     1741624   1.78   121323   2.00
1974 to 1983   156    4621950   4.73   255057   4.21
1984 to 1994   3689   89442309  91.62  5549581  91.68

 

Selection features

Texts were chosen for inclusion according to three selection features: domain (subject field), time (within certain dates) and medium (book, periodical, etc.). The purpose of these selection features was to ensure that the corpus contained a broad range of different language styles, for two reasons. The first was so that the corpus could be regarded as a microcosm of current British English in its entirety, not just of particular types. The second was so that different types of text could be compared and contrasted with each other.

3.1. Sample size and method

For books, a target sample size of 40,000 words was chosen. No extract included in the corpus exceeds 47,000 words. Text samples normally consist of a continuous stretch of discourse from within the whole. Only one sample was taken from any one text. Samples were taken randomly from the beginning, middle or end of longer texts. (In a few cases, where a publication included essays or articles by a variety of authors of different nationalities, the work of non-UK authors was omitted.) As far as possible, the individual stories in one issue of a newspaper were grouped according to domain, for example as “Business” articles, “Leisure” articles, etc.
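The sampling rule for longer texts can be sketched as follows. This is only an illustration under stated assumptions: the function name is hypothetical, a text is treated as a flat list of words, and the cut is made at an exact word count, whereas real samples would more plausibly end at a natural boundary such as the end of a chapter or paragraph.

```python
import random

TARGET_WORDS = 40_000  # target sample size for books, as stated above


def take_sample(words, target=TARGET_WORDS, rng=None):
    """Return one continuous stretch of at most `target` words.

    The stretch is taken from the beginning, middle or end of the text,
    chosen at random; short texts are returned whole.
    """
    if rng is None:
        rng = random.Random()
    if len(words) <= target:
        return list(words)
    position = rng.choice(["beginning", "middle", "end"])
    if position == "beginning":
        start = 0
    elif position == "middle":
        start = (len(words) - target) // 2
    else:
        start = len(words) - target
    return words[start:start + target]


# A made-up "book" of 100,000 numbered tokens.
book = [f"w{i}" for i in range(100_000)]
sample = take_sample(book, rng=random.Random(0))
print(len(sample), sample[0], sample[-1])
```

Passing a seeded `random.Random` makes the choice reproducible, which is useful when a selection has to be documented.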

The following subsections discuss each selection criterion, and indicate the actual numbers of words in each category included.

Domain

Classification according to subject field seems hardly appropriate to texts which are fictional or which are generally perceived to be literary or creative. Consequently, these texts are all labelled imaginative and are not assigned to particular subject areas. All other texts are treated as informative and are assigned to one of the eight domains listed in Table 3.

Table 3. Written domain

Domain                    Texts  W-units   %      S-units  %
Applied science           370    7104635   8.14   357067   7.12
Arts                      261    6520634   7.47   321442   6.41
Belief and thought        146    3007244   3.44   151418   3.01
Commerce and finance      295    7257542   8.31   382717   7.63
Imaginative               477    16377726  18.76  1356458  27.05
Leisure                   438    12187946  13.96  760722   15.17
Natural and pure science  146    3784273   4.33   183466   3.65
Social science            527    13906182  15.93  700122   13.96
World affairs             484    17132023  19.62  800560   15.96

 

The labels we have adopted represent the highest levels of a fuller taxonomy of text medium.

Table 4. Written medium

Medium                   Texts  W-units   %      S-units  %
Book                     1414   49891770  57.16  2895652  57.75
Periodical               1208   28356005  32.48  1487725  29.67
Published miscellanea    238    4197450   4.80   288004   5.74
Unpublished miscellanea  249    3508500   4.01   222438   4.43
To-be-spoken             35     1324480   1.51   120153   2.39

 

The ‘Published miscellanea’ category includes brochures, leaflets, manuals and advertisements. The ‘Unpublished miscellanea’ category includes letters, memos, reports, minutes and essays. The ‘To-be-spoken’ (written-to-be-spoken) category includes scripted television material, play scripts, etc.

3. Selection procedures employed – Books

Roughly half the titles were randomly selected from available candidates identified in Whitaker’s Books in Print (BIP), 1992, by students of Library and Information Studies at Leeds City University. Each text randomly chosen was accepted only if it fulfilled certain criteria: it had to be published by a British publisher, contain sufficient pages of text to make its incorporation worthwhile, consist mainly of written text, fall within the designated time limits, and cost less than a set price. The final selection weeded out texts by non-UK authors. Half of the books having been selected by this method, the remaining half were selected systematically.
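The random half of the procedure amounts to drawing candidates in random order and keeping only those that pass every criterion. The sketch below is hypothetical throughout: the record fields, the threshold values and the sample titles are invented, since the lecture does not give the actual page, date or price limits.

```python
import random
from dataclasses import dataclass


@dataclass
class Candidate:
    title: str
    british_publisher: bool
    pages: int
    mainly_text: bool
    year: int
    price: float


def acceptable(c, min_pages, first_year, last_year, max_price):
    """Apply the acceptance criteria listed above to one candidate."""
    return (
        c.british_publisher
        and c.pages >= min_pages
        and c.mainly_text
        and first_year <= c.year <= last_year
        and c.price < max_price
    )


def select_titles(candidates, quota, rng, **criteria):
    """Draw candidates in random order; keep those that pass, up to quota."""
    pool = list(candidates)
    rng.shuffle(pool)
    chosen = []
    for c in pool:
        if len(chosen) == quota:
            break
        if acceptable(c, **criteria):
            chosen.append(c)
    return chosen


# Invented thresholds and candidates, for illustration only.
CRITERIA = dict(min_pages=100, first_year=1975, last_year=1993, max_price=50.0)
CANDIDATES = [
    Candidate("A", True, 250, True, 1988, 12.99),
    Candidate("B", False, 300, True, 1990, 9.99),   # not a British publisher
    Candidate("C", True, 400, True, 1991, 75.00),   # too expensive
    Candidate("D", True, 180, True, 1985, 20.00),
    Candidate("E", True, 220, True, 1970, 15.00),   # outside the time limits
]

picked = select_titles(CANDIDATES, quota=2, rng=random.Random(1), **CRITERIA)
print(sorted(c.title for c in picked))
```

The further check on author nationality mentioned above would be one more filter applied to the accepted titles.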


