So far the thesis has dealt with the problem of synonymy and with the analysis of the lexical items on the basis of five dictionaries. The previous chapter set out to show the semantic overlap between happen and its synonyms with the help of these dictionaries. In the following chapter these words and expressions are studied by means of corpus-based analysis. Before the results are presented, the terms used in connection with corpus linguistics need to be explained; that is the aim of this chapter.
3. 1. What are corpora for?
In corpus linguistics, corpus-based analysis uses a collection of texts known as a corpus. According to Wikipedia, “in linguistics, a corpus or text corpus is a large and structured set of texts, now usually electronically stored and processed. They are used to do statistical analysis, checking occurrences or validating linguistic rules or specific universe” (2001, p. 1).
From a linguistic point of view, Meyer discusses corpora in connection with corpus linguistics, which he describes as a recent approach in linguistics that appeared in the second half of the twentieth century, in the 1960s (2002, p. 1). The point of this relatively new field of linguistics is the use of a corpus, which is “an electrical lexical database for descriptive and theoretical studies of the language” (Meyer, 2002, p. 1). Hasselgard characterises corpus linguistics as “a text-based language description, that is to say that the language is described on the basis of how it is used in texts” (2001, p. 2). She adds that it can also be characterised as a quantitative language description (2001, p. 2). This means that “corpus linguistics often includes counting how many times a particular linguistic feature occurs, that is how frequent that feature is. It is more generally concerned with patterns that can be identified across many different texts than with single examples or the internal structure of single texts” (Hasselgard, 2001, p. 2). According to this approach a corpus is “a body of texts, that is for representing authentic language use, for use in linguistics, and is usually machine-readable” (Hasselgard, 2001, p. 1).
Furthermore, Hunston defines corpora in terms of form and purpose. She states that a corpus is “a collection of naturally occurring examples of the language” (2002, p. 1). She adds that corpora can consist of anything from a few sentences to a set of written texts or tape recordings, but, more significantly, that they are texts or parts of texts which have been collected for linguistic study and are stored and accessed electronically. Corpus linguistics belongs to the field of applied linguistics, since it deals with the application of theories in real life (Hunston, 2002, p. 1).
Wynne defines “linguistic corpus as a collection of texts which have been selected and brought together so that language can be studied on the computer” (2004, p. 1). Today the corpus offers new procedures for the analysis of the language. Sinclair regards the corpus as a remarkable thing, not because it is a collection of texts, but because of the properties it acquires when it is well designed and carefully constructed (2004, p. 1).
Corpus linguistics is a relatively new approach in linguistics which applies corpora, that is, electronic lexical databases, to carry out theoretical or descriptive studies of language use. It investigates language use in naturally occurring texts and across registers, which are varieties of language used in different situations (Biber, Conrad, Reppen, 2001, p. 2). This means that the corpus presents lexical items through a representative set of concordances taken from real life, which is why corpora can be regarded as authentic sources for the study of the language. In addition, a corpus handles a large amount of text and keeps track of many contextual functions at the same time (Biber, Conrad, Reppen, 2001, p. 3). The main advantage of corpus-based research is that the information the corpus provides is real and not artificial. In this sense McEnery, Xiao and Tono argue that corpus linguistics is a methodology rather than a theory, and that it affords a wide range of applications across all branches of linguistics (2006, p. 506).
The features of corpus-based research can be summarised by the following points:
It is empirical: the research is based on collecting and analysing data, and its aim is to study the actual patterns of language use in natural texts.
It depends on both quantitative and qualitative (functional) analytical techniques. “Corpus-based analysis must go beyond simple counts of linguistic features. That is, it is essential to include qualitative, functional interpretations of quantitative patterns. The goal of corpus-based investigations is not only simply to report quantitative findings, but to explore the importance of these findings for learning about the patterns of language use” (Biber, Conrad, Reppen, 2001, p. 4).
It makes extensive use of computers for analysis, using both automatic and interactive techniques.
The research is based on a corpus, which is a “large and principled collection of natural texts as the basis of the analysis” (Biber, Conrad, Reppen, 2001, p. 4).
According to Hunston, “a corpus is the store of used language, but by itself can do nothing at all. A corpus-access-software, however, can rearrange that store so that observations of various kinds can be made” (2002, p. 3). If a corpus represents a speaker’s experience of a language, the access software re-orders that experience so that it can be examined in ways that are normally impossible. A corpus does not contain new information about a language, but it offers the researcher a new perspective on the familiar. “Software packages process data from a corpus in three ways: showing frequency, phraseology and collocation of the keyword” (Hunston, 2002, p. 3). First, the words in a corpus can be arranged in order of their frequency, which yields a frequency list. Corpora can then be compared with one another by means of their frequency lists (Hunston, 2002, p. 5). Second, phraseology is studied with concordance programs, which collect “concordance lines bringing together many instances of the use of the word or phrase, allowing the user to observe regularities in use that tend to remain unobserved when the same words or phrases are met in their normal contexts” (Hunston, 2002, p. 9). The third issue is collocation, which refers to the tendency of words to co-occur (Hunston, 2002, p. 12). In this connection, keywords are words that occur with unusually high frequency in the observed text. According to another definition, keywords are “words that occur in all or most of the texts which make up the corpus one examines” (Tribble, 2000, p. 10).
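The three kinds of processing described above (frequency list, concordance lines and collocation) can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions: the sample text is invented and not taken from any real corpus, the function names are hypothetical, and the context windows ignore sentence boundaries, which real corpus software may treat differently.

```python
import re
from collections import Counter

# Toy text invented for illustration; it is NOT taken from any real corpus.
TEXT = (
    "Accidents happen every day. Nobody knows what will happen next. "
    "Strange things happen when it rains. The meeting will take place next week."
)

def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_list(tokens):
    """Return (word, count) pairs ordered from most to least frequent."""
    return Counter(tokens).most_common()

def concordance(tokens, node, span=3):
    """Return key-word-in-context lines with `span` words on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            lines.append(f"{left:>25}  [{tok}]  {right}")
    return lines

def collocates(tokens, node, span=2):
    """Count the words found within `span` tokens of the node word."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

tokens = tokenize(TEXT)
print(frequency_list(tokens)[:3])          # the most frequent words
for line in concordance(tokens, "happen"):
    print(line)                            # one line per occurrence of the node
print(collocates(tokens, "happen").most_common(3))
```

Aligning the node word in the middle of each concordance line is what makes regularities to its left and right visible at a glance; the collocate counts summarise the same contexts numerically.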
The purposes for which corpora are used can be summarised in the following points (Hunston, 2002):
Corpora are important tools for language teaching, since they provide a great deal of information about “how a language works that may not be accessible to native speaker intuition”. This information can be a great aid in syllabus and materials design.
An increasing number of teachers encourage their students to explore corpora for themselves, allowing them to observe nuances of usage and to make comparisons between languages. In this respect, corpora gain significance in foreign language acquisition, too.
Corpora are important tools for translators, who use “comparable corpora to compare the use of apparent translation equivalents in two languages, and parallel corpora to see how words and phrases have been translated in the past.”
“General corpora can be used to establish norms of frequency and usage against which individual texts can be measured. This method is often applied in stylistics and in clinical and forensic linguistics.”
“Corpora are used to investigate cultural attitudes expressed through the language and as a resource for critical discourse studies.”
Finally, special attention should be devoted to the role of corpora in pedagogy. Corpus-based lexicographic research investigates the meaning and use of words, including synonyms. This area of study is essential for dictionary making (Biber, Conrad, Reppen, 2001, p. 21). Dictionaries based on corpora such as COBUILD or the BNC are not built on invented samples of the language, which is why the information they provide can be regarded as authentic. These dictionaries are therefore useful sources of information for both students and teachers about the patterns of a language, such as the use of synonyms (Biber, Conrad, Reppen, 2001, p. 21).
3. 2. Types of corpora
“The development of corpus linguistics includes a development of types of corpora” (Hasselgard, 2001, p. 3). Corpora differ in many respects, for example size, regional variety of English, diachronic variety of English, general reference or specific purpose, degree of annotation/mark-up, text types (spoken/written, news, fiction, legal texts, etc.), and whether they are static (finite size) or dynamic (being added to) (Hasselgard, 2001, p. 3).
In keeping with the aim of this chapter, some terms used in corpus linguistics need defining. Annotation is a superordinate term for tagging and parsing. “Tagging refers to the addition of a code to each word in a corpus, indicating the part of speech” (Hunston, 2002, p. 18). “Corpus parsing relates to the analysis of a text into its constituents, such as clauses and groups” (Hunston, 2002, p. 19). Parsing makes it possible to count the different structures in a corpus very efficiently. The term annotation is also used for other kinds of information that can be added to a corpus (Hunston, 2002, p. 18).
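Tagging in the sense defined above can be pictured as pairing each word with a part-of-speech code. The following Python sketch assumes a hand-tagged toy sentence and a simplified, hypothetical set of tag labels; it does not reproduce the tagset or storage format of any real corpus, and the function names are illustrative only.

```python
from collections import Counter

# Hand-tagged toy sentence. The labels (PRON, AUX, VERB, ADV, PUNCT) are a
# simplified, hypothetical tagset, not the tagset of any real corpus.
tagged = [
    ("what", "PRON"),
    ("will", "AUX"),
    ("happen", "VERB"),
    ("next", "ADV"),
    ("?", "PUNCT"),
]

def to_annotated_text(tagged_tokens):
    """Render the sentence as word_TAG units, one possible plain-text layout."""
    return " ".join(f"{word}_{tag}" for word, tag in tagged_tokens)

def tag_counts(tagged_tokens):
    """Count how often each part-of-speech code occurs."""
    return Counter(tag for _, tag in tagged_tokens)

def words_with_tag(tagged_tokens, tag):
    """List the words that carry a given part-of-speech code."""
    return [word for word, t in tagged_tokens if t == tag]

print(to_annotated_text(tagged))
print(tag_counts(tagged))
print(words_with_tag(tagged, "VERB"))
```

Once every word carries a code, searches can be restricted by word class, for example to the verbal uses of a word only.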
The classification of corpora into types is based on the respects in which they differ. According to Hasselgard “different corpora are suitable for different kinds of linguistic investigations. If you are studying a phenomenon that is fairly frequent, a small corpus will probably give you enough examples. If you are after a very rare phenomenon, you are likely to need a large corpus” (2001, p. 3).
Hunston distinguishes the following types of corpus:
A specialized corpus is a collection of texts of a particular type, such as newspaper editorials, geography textbooks, academic articles on a particular subject, lectures, casual conversations, etc. Its aim is to be representative of a given type of text, and it is used to investigate a particular type of language. Researchers often collect their own specialized corpora to reflect the kind of language they want to investigate. There is no limit to the degree of specialization involved.
A general corpus is a collection of texts of many types. It may include written or spoken language, and texts produced in one country or in several. It may be much larger than a specialized corpus. It is used to produce reference materials for language learning and translation, and it often serves as a baseline in comparison with more specialized corpora, which is why this type is also called a reference corpus. Examples are the British National Corpus, the Bank of English and the Brown Corpus.
Comparable corpora, for example ICE, are two or more corpora in different languages or in different varieties of the same language. They are designed along the same lines; for example, they contain the same proportions of newspaper texts, novels, etc. The aim of investigations based on this type is to compare the languages or varieties concerned. Comparable corpora may be used by both learners and translators.
Parallel corpora are two or more corpora in different languages, each containing texts that have been translated from one language into another (for example, a novel in English that has been translated into another language), or texts that have been produced simultaneously in two or more languages. This type is often used by translators and learners. Their aim is to find potentially equivalent expressions in each language and to investigate the differences between them.
A learner corpus, such as ICLE, is a collection of texts, for example essays, produced by learners of a language. The aim of this type of corpus is to identify in what respects the language used by learners differs from one learner to another and from the language of native speakers; the latter comparison requires a comparable corpus of native-speaker texts, such as LOCNESS.
A pedagogic corpus is a corpus consisting of all the language a learner has been exposed to. It covers all the material the learner has used, whether course books, readers, cassettes, etc.
A historical or diachronic corpus is a corpus of texts from different periods of time. Its aim is to trace the development of aspects of a language over time. The Helsinki Corpus is one example.
A monitor corpus is a corpus designed to track current changes in a language. It is added to annually, monthly or even daily, which is why it rapidly increases in size (Hunston, 2002, pp. 14-16).
A further classification of corpus types is proposed by Aston. According to him, “three main types of corpora have been proposed as relevant” (1999, p. 1). They are the following:
Monolingual corpora consist of texts in a single language, which may be either the source or the target language of a given translation. Monolingual corpora have two subclasses: general and specialized corpora.
General corpora include texts of a wide variety of types.
Specialized corpora are restricted to a particular genre and/or topic domain. (Aston, 1999, p. 1)
However, “in either case, the corpus attempts to provide a sample of a particular textual population, which ideally reflects the variability of that population” (Aston, 1999, p. 1).
Where monolingual corpora of similar design are available for two or more languages, they may be treated as components of a single comparable corpus. “They are currently specialised with the texts belonging to genres or domains which are sociolinguistically similar in each of the cultures involved, and have similar variability” (Aston, 1999, p. 2).
Parallel corpora also have components in two or more languages, consisting of original texts and their translations. Most parallel corpora are specialized. There are two main types:
Unidirectional parallel corpora consist of texts in one language along with translations of those texts into another language. They have two components: source texts in one language and their aligned translations in another.
Bidirectional or reciprocal parallel corpora combine the characteristics of unidirectional parallel corpora with those of comparable corpora. They consist of four components: source texts in one language and their aligned translations in another language, and source texts in the second language and their aligned translations in the first (Aston, 1999, p. 2).
3. 3. The British National Corpus (BNC)
“The BNC is a one hundred million word collection of samples of written and spoken English from a wide range of sources, designed to represent a wide cross-section of British English from the later part of the twentieth century, both spoken and written” (BNC, 2005, p. 1). The written part (90%) includes, for example, extracts from regional and national newspapers, specialist periodicals and journals for all ages and interests, academic books and popular fiction, published and unpublished letters and memoranda, and school and university essays, among many other kinds of texts. The spoken part (10%) consists of a large amount of unscripted informal conversation, recorded by volunteers selected from different ages, regions and social classes in a demographically balanced way, together with spoken language collected in all kinds of different contexts, ranging from formal business or government meetings to radio shows and phone-ins (BNC, 2005, p. 1). In short, the content of the BNC is drawn from many areas of life.
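Because the written part is roughly nine times the size of the spoken part, raw counts from the two parts cannot be compared directly; frequencies are usually normalised, for example per million words. The sketch below illustrates the arithmetic only. The component sizes are rounded from the figures quoted above, the raw counts are hypothetical values invented for this example and are not BNC results, and the function name is an assumption.

```python
# Approximate sizes derived from the figures quoted above:
# about 100 million words, of which 90% is written and 10% spoken.
BNC_TOTAL_WORDS = 100_000_000
WRITTEN_PERCENT = 90
SPOKEN_PERCENT = 10

written_size = BNC_TOTAL_WORDS * WRITTEN_PERCENT // 100
spoken_size = BNC_TOTAL_WORDS * SPOKEN_PERCENT // 100

def per_million(raw_count, corpus_size):
    """Normalise a raw frequency to occurrences per million words."""
    return raw_count * 1_000_000 / corpus_size

# Hypothetical raw counts, invented for illustration only (NOT real BNC data).
raw_written = 18_000
raw_spoken = 4_000

print("written:", per_million(raw_written, written_size), "per million words")
print("spoken: ", per_million(raw_spoken, spoken_size), "per million words")
```

In this invented example the word is more frequent in speech once the counts are normalised, even though its raw count is higher in the written part.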
The BNC deals with modern British English only and does not cover other languages used in Britain. However, non-British English and foreign language words do occur in the corpus. At the same time, it can be described as a synchronic corpus, since it covers British English of the late twentieth century rather than that of earlier times (BNC, 2005, p. 1).
Hunston discusses the BNC among general corpora, which contain texts of different types. She also refers to the BNC as a reference corpus, like the Bank of English and the Brown Corpus. These corpora are often used as baselines for comparison with more specialised corpora (2002, p. 15). The fact that the BNC can be classified in two ways shows that there are no rigid boundaries between the different types of corpora.
To sum up, the BNC is a corpus, that is, a collection of texts, presented in a way that makes almost any kind of computer-based research on the nature of the language possible. Obvious application areas include lexicography, natural language understanding systems and all branches of applied and theoretical linguistics. Its main advantage is that it provides exact, authentic information about the language. That is why the BNC was used in the present corpus-based research to study the lexical item happen and its synonyms.