Design of the spoken component 





The British National Corpus project undertook to produce five to ten million words of orthographically transcribed speech, covering a wide range of speech variation. A large proportion of the spoken part of the corpus — over four million words — comprises spontaneous conversational English. The importance of conversational dialogue to linguistic study is unquestionable: it is the dominant component of general language both in terms of language reception and language production. The demographic sampling approach was adopted for approximately half of the spoken part of the corpus. The demographic component of the corpus was complemented with a separate text typology intended to cover the full range of linguistic variation found in spoken language; this is termed the context-governed part of the corpus. The sampling frame was defined in terms of the language production of the population of British English speakers in the United Kingdom. Representativeness was achieved by sampling a spread of language producers in terms of age, gender, social group, and region, and recording their language output over a set period of time.

 

A total of 757 texts (6,153,671 words) make up the context-governed part of the corpus. The following contexts are distinguished:

Table 5. Context in which spoken text was captured

Context                   Texts   W-units   % of w-units   S-units   % of s-units
Educational/Informative     169   1633303          26.61    119252          27.82
Business                    131   1285938          20.95    108101          25.22
Public/Institutional        262   1655263          26.97     96504          22.51
Leisure                     195   1561167          25.44    104701          24.43
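The percentage columns in Table 5 can be recomputed from the raw counts. The following is a minimal sketch, not part of the BNC documentation: the names TABLE_5, column_total, and shares are illustrative. One thing it makes visible is that the w-unit rows as printed add up to 6,135,671, which is not the same as the 6,153,671 words quoted in the text above; the percentages in the table agree with the row sum.

```python
# Counts copied from Table 5 (context-governed part of the corpus).
# The dictionary layout and function names are illustrative assumptions.
TABLE_5 = {
    "Educational/Informative": {"texts": 169, "w_units": 1633303, "s_units": 119252},
    "Business": {"texts": 131, "w_units": 1285938, "s_units": 108101},
    "Public/Institutional": {"texts": 262, "w_units": 1655263, "s_units": 96504},
    "Leisure": {"texts": 195, "w_units": 1561167, "s_units": 104701},
}


def column_total(table, column):
    """Sum one column (e.g. 'w_units') over all contexts."""
    return sum(row[column] for row in table.values())


def shares(table, column):
    """Return each context's share of the column total, in percent."""
    total = column_total(table, column)
    return {context: 100 * row[column] / total for context, row in table.items()}


if __name__ == "__main__":
    for column in ("texts", "w_units", "s_units"):
        print(column, column_total(TABLE_5, column))
    for context, pct in shares(TABLE_5, "w_units").items():
        print(f"{context}: {pct:.2f}% of w-units")
```

The recomputed shares can differ from the printed ones in the last decimal place, which suggests the table values were truncated rather than rounded.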

Sampling procedure

124 adults (aged 15+) were recruited from across the United Kingdom. Recruits were of both sexes and from all age groups and social classes. The intention was, as far as possible, to recruit equal numbers of men and women, equal numbers from each of the six age groups, and equal numbers from each of four social classes.

Recording procedure

All conversations were recorded as unobtrusively as possible, so that the material gathered approximated closely to natural, spontaneous speech. In many cases the only person aware that the conversation was being taped was the person carrying the recorder.

The context-governed part of the corpus

As mentioned above, the spoken texts in the demographic part of the corpus consist mainly of conversational English. A complementary approach was developed to create what is termed the context-governed part of the corpus. As in other spoken corpora, the range of text types was selected according to a priori linguistically motivated categories. At the top layer of the typology is a division into four equal-sized contextually based categories: educational, business, public/institutional, and leisure. Each category therefore accounts for 25 percent of the context-governed part and is divided into the subcategories monologue (40 percent) and dialogue (60 percent). Each monologue subcategory thus totals 10 percent of the context-governed part of the corpus, and each dialogue subcategory 15 percent.
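The target shares follow from simple arithmetic: four equal categories, each split 40/60. A small sketch of that calculation, assuming nothing beyond the figures stated above (the names CONTEXTS, MODE_SPLIT, and target_shares are illustrative):

```python
from fractions import Fraction

# Design figures taken from the text; variable names are illustrative.
CONTEXTS = ["educational", "business", "public/institutional", "leisure"]
MODE_SPLIT = {"monologue": Fraction(40, 100), "dialogue": Fraction(60, 100)}


def target_shares():
    """Return the target share of each (context, mode) cell of the typology."""
    context_share = Fraction(1, len(CONTEXTS))  # four equal-sized categories
    return {
        (context, mode): context_share * split
        for context in CONTEXTS
        for mode, split in MODE_SPLIT.items()
    }


if __name__ == "__main__":
    for (context, mode), share in target_shares().items():
        print(f"{context:22s} {mode:10s} {float(share):.0%}")
```

These are design targets; the actual word counts collected per context are the ones given in Table 5.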

Sampling procedure

For the most part, a variety of text types were sampled within three geographic regions. However, some text types, such as parliamentary proceedings and most broadcast categories, apply to the country as a whole and were not regionally sampled. Different sampling strategies were required for each text type, as follows:

Educational/informative:

Lectures, talks, educational demonstrations: Within each sampling area a university (or college of further education) and a school were selected. A range of lectures and talks was recorded, varying the topic, level, and speaker gender.

News commentaries: Regional sampling was not applied, but both national and regional broadcasting companies were sampled. The topic, level, and gender of the commentator were varied.

Classroom interaction: Schools were regionally sampled and the level (generally based on student age) and topic were varied. Home tutorials were also included.

Business:

Company talks and interviews: Sampling took into account company size, areas of activity, and gender of speakers.

Trade union talks: Talks to union members, branch meetings, and annual conferences were all sampled.

Sales demonstrations: A range of topics was included.

Business meetings: Companies were selected according to size, area of activity, and purpose of meeting.

Consultations: These included medical, legal, business, and professional consultations. All categories under this heading were regionally sampled.

Public/institutional:

Political speeches: Regional sampling of local politics, plus speeches in both the House of Commons and the House of Lords.

Sermons: Different denominations were sampled.

Public/government talks: Regional sampling of local inquiries and meetings, plus national issues at different levels.

Council meetings: Regionally sampled, covering parish, town, district, and county councils.

Religious meetings: Includes church meetings, group discussions, and so on.

Parliamentary proceedings: Sampling of main sessions and committees, House of Commons and House of Lords.

Legal proceedings: The Royal Courts of Justice and local magistrates' and similar courts were sampled.

Leisure:

Speeches: Regionally sampled, covering a variety of occasions and speakers.

Sports commentaries: Exclusively broadcast, sampling a variety of sports, commentators, and TV/radio channels.

Talks to clubs: Regionally sampled, covering a range of topics and speakers.

Broadcast chat shows and phone-ins: Only those that include a significant amount of unscripted speech were selected, from both television and radio.

Club meetings: Regionally sampled, covering a wide range of clubs.

How to search

In the BNC one can search for a lexeme by entering a word form, or part of a word form plus wildcards. In corpora accessed via the Internet it is also possible to search by lemma and by part of speech. A lemma is the dictionary form of a word, that is, the form that represents the lexeme in a dictionary: a word considered as its citation form together with all its inflected forms. For example, the lemma go consists of go together with goes, going, went, and gone. Searching on word form alone causes problems with, for example, homographs that belong to different parts of speech. A search can also allow a distance of one to five elements between the units, which is useful for phrasal verbs. The retrieval program then shows the frequency of the word form (the keyword). Subsequently, the concordances can be retrieved. A concordance here means the keywords shown in restricted textual contexts; more generally, it is an alphabetical index of the principal words in a book, with the passages in which they occur and often an account of their meaning. The largest available context is slightly longer than a paragraph.

The results are drawn from the whole corpus. If needed, one can consult the list of excerpted works, which gives the domain and year as well as the author and title, and select only the concordances of interest.
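The search options described above (wildcard word-form search, keyword-in-context concordance lines, and a distance of one to five elements between two units) can be illustrated with a short sketch. This is not the BNC retrieval program: the tokenizer, the function names, and the sample sentence are all assumptions made for illustration, and the distance is interpreted here as the number of token positions between the two matches.

```python
import re
from fnmatch import fnmatchcase

# Hypothetical sample text; not taken from the corpus.
SAMPLE = "She gave the old habit up last year, and he is giving up too."


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def find_word_forms(tokens, pattern):
    """Return positions of tokens matching a wildcard pattern such as 'giv*'."""
    return [i for i, tok in enumerate(tokens) if fnmatchcase(tok, pattern)]


def concordance(tokens, pattern, width=4):
    """Return (left context, keyword, right context) for every match."""
    lines = []
    for i in find_word_forms(tokens, pattern):
        left = " ".join(tokens[max(0, i - width):i])
        right = " ".join(tokens[i + 1:i + 1 + width])
        lines.append((left, tokens[i], right))
    return lines


def within_distance(tokens, first, second, max_gap=5):
    """Return (i, j) pairs where `second` follows `first` by 1..max_gap positions."""
    hits = []
    for i in find_word_forms(tokens, first):
        last = min(i + max_gap, len(tokens) - 1)
        for j in range(i + 1, last + 1):
            if fnmatchcase(tokens[j], second):
                hits.append((i, j))
    return hits


if __name__ == "__main__":
    tokens = tokenize(SAMPLE)
    print(len(find_word_forms(tokens, "giv*")), "match(es) for giv*")
    for left, keyword, right in concordance(tokens, "giv*"):
        print(f"{left} [{keyword}] {right}")
    print(within_distance(tokens, "g?v*", "up"))
```

In this sample the pattern giv* finds giving but not gave, which is the limitation of word-form search that lemma search is meant to address. The distance search finds both gave ... up (separated by other words) and giving up, as one would want for a phrasal verb.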

 



