What is a corpus?
A corpus is essentially a collection of texts (usually in electronic form) which is used for linguistic research.
How can corpora help me as a translator?
There are many ways in which you can use corpora to assist you with your translation work. For example, if you are learning to specialise in a particular area, you may want to collect a specific type of document in your target language and analyse these documents for terminology, collocations in context and key words. You could use this to start your own glossary of target language terminology. You could also collect similar documents in your source language and use these to try to match terminology with the terminology you found in your target language corpus. You can also use corpora to help you with a specific translation. Perhaps you have been asked to translate a type of document with which you are not yet familiar. Compiling a corpus of similar target language documents can help you find the correct terminology and collocations. Or perhaps you would like to compile your own bilingual corpus by collecting texts from multilingual websites. There is also the option of downloading corpora which have already been compiled. Why not download the free bilingual corpus provided by the European Commission’s Directorate-General for Translation, which I discussed here?
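To give a rough idea of what analysing documents for key words and collocations can look like in practice, here is a minimal Python sketch. It is only an illustration under my own assumptions: the function names (`tokenize`, `term_candidates`), the stopword list and the sample sentences are all made up for this example, and simple frequency counts are just one basic way of spotting candidate terms.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters only,
    with internal apostrophes or hyphens kept, e.g. "sub-let")."""
    return re.findall(r"[^\W\d_]+(?:['’-][^\W\d_]+)*", text.lower())


def term_candidates(documents, stopwords=frozenset(), top_n=10):
    """Count single words and adjacent word pairs across all documents.

    Returns two lists of (item, count) tuples: the most frequent words
    and the most frequent two-word sequences, both ignoring stopwords.
    """
    words = Counter()
    pairs = Counter()
    for doc in documents:
        tokens = tokenize(doc)
        words.update(t for t in tokens if t not in stopwords)
        # Adjacent pairs are a crude stand-in for collocations.
        pairs.update(
            (a, b)
            for a, b in zip(tokens, tokens[1:])
            if a not in stopwords and b not in stopwords
        )
    return words.most_common(top_n), pairs.most_common(top_n)


if __name__ == "__main__":
    # Hypothetical sample texts; in practice these would be your own documents.
    sample = [
        "The lessee shall pay the rent.",
        "The lessee shall not sublet.",
    ]
    top_words, top_pairs = term_candidates(sample, stopwords={"the"})
    print(top_words)
    print(top_pairs)
```

The word pairs that keep recurring are the ones worth checking in context before adding them to a glossary.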
How can I find texts for my own corpus?
You are unlikely to find ready-made corpora which are a perfect fit for your particular research needs. You will therefore need to learn how to compile your own corpora. The first thing you need to do is to find reliable and authentic documents in the target language, the source language or both. If you are looking for contracts and other official documents, the best way to do this is to search for PDF documents on the internet. For advice on how to do this, click here. You can, of course, also use websites, a collection of newspaper articles or any other collection of texts that suits your needs.
More effective and efficient ways of leveraging terminology from monolingual corpora
In theory, the larger your corpus, the more useful it will be. In practical terms, however, this is not always the case. If you collect a very large number of texts, analysing them manually is going to be a near-impossible task. This is the stage I had arrived at myself earlier this year. I had a large number of judgments in a particular matter I was working on but just didn’t have the time to extract the very useful information I was sure was available within the corpus. I was therefore very excited to learn that there are programs available which allow you to automatically leverage terminology from corpora. One of these is AntConc, developed by Laurence Anthony. Since I’m still a novice at this myself, I can only direct you to the AntConc website for further information, but what I can say is that it is quick and easy to use and extremely effective and efficient.
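For readers who like to see the idea behind this kind of tool, here is a small Python sketch of a keyword-in-context listing, which shows each occurrence of a term together with the words around it. This is my own simplified illustration and not how AntConc itself is implemented; the function name `concordance`, the `width` parameter and the sample text are assumptions made for the example.

```python
import re


def concordance(text, term, width=30):
    """Return one line per occurrence of `term`, with up to `width`
    characters of context on either side (case-insensitive match)."""
    # Collapse line breaks and repeated spaces so the context reads cleanly.
    text = " ".join(text.split())
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the search term lines up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines


if __name__ == "__main__":
    # Hypothetical sample text; in practice this would be read from your corpus files.
    sample = (
        "The court held that the claim failed. "
        "The Court dismissed the appeal."
    )
    for line in concordance(sample, "court", width=20):
        print(line)
```

Seeing a term lined up in its surrounding context like this makes it much easier to spot the collocations it typically appears with.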
Credit and final note
The credit for the information contained in this post goes to Juliette Scott. If I had not met Juliette and learned about her PhD project, that “pile of potentially useful documents gathering dust” would still be sitting on my desk, untouched and unused. It would never have made the transition from “pile of potentially useful documents” to “specialised corpus” from which I have been able to leverage useful terminology and collocations for translation assignments.
If you happen to specialise in legal translation and you are interested in more efficient ways of leveraging terminology from corpora in the legal field, do have a look at Juliette’s post about her PhD project here.