Using the World Wide Web as a corpus

Elsewhere in this section, a corpus was defined as a collection of machine-readable authentic texts. On this broad definition, we can view the Internet as one large corpus and search it to find out how English is used in writing online. This can be very helpful but, as will become clear later on, there are also some caveats that need to be mentioned.

One straightforward way to search the Internet is to use the search engine Google. Typical questions that can be answered include 'does expression X exist?' and 'which word, A or B, is used more frequently?'. For example, let us assume that we are not sure whether to say 'a big number of' or 'a large number of' in English. A simple Google search for these two sequences gives us some indication of which one is more frequent. If we enter the following search string in the Google search box (within quotation marks, so that only verbatim hits are returned):

"a large number of"

we get as many as 248,000,000 hits (3 May, 2010). If we do the same with the following string:

"a big number of"

we get 2,760,000 hits. These numbers clearly indicate which of the two phrases is used more frequently in English. However, they do not tell us whether both phrases are considered equally correct in terms of English usage. With a simple Google search, some proportion of the retrieved hits may come from texts written by people who do not have English as their mother tongue, and whose written English may not always follow conventions for grammar, spelling and word use. We therefore have to be careful when using a simple Google search to find out about language use, because there is always a risk of getting hits for constructions that are conventionally regarded as incorrect. In the next subsection, remedies to these problems will be introduced.
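The comparison above comes down to simple arithmetic on the two hit counts. As a minimal sketch, the ratio can be computed as follows. The function name hit_ratio is our own, and the counts are the ones reported above for 3 May 2010.

```python
def hit_ratio(hits_a: int, hits_b: int) -> float:
    """Return how many times more hits phrase A has than phrase B."""
    if hits_b == 0:
        raise ValueError("phrase B has no hits, so the ratio is undefined")
    return hits_a / hits_b


# Hit counts reported in the text (general Google search, 3 May 2010).
large = 248_000_000  # "a large number of"
big = 2_760_000      # "a big number of"

ratio = hit_ratio(large, big)
print(f"'a large number of' has about {ratio:.0f} times as many hits")
```

With these figures the ratio is roughly 90 to 1. The ratio only describes relative frequency; it says nothing about whether the less frequent phrase is considered correct.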

As seen above, a simple Google search can help us find out whether a word or phrase is commonly used compared to another word or phrase, but we have to be careful when interpreting the results. One way of avoiding the problems that a simple Google search may lead to is to restrict the search. There are a number of ways in which this can be done, as indicated below.

Using the advanced search option in Google

One way of restricting the search is to use the advanced search option available in Google. By clicking on the advanced search link, we can specify what kind of content we want to search. For example, we can specify that we only want to search web pages in English, and we can also limit our search to web pages under the United Kingdom's domain suffix .uk. Having made these specifications and again searching for the phrases 'a large number of' and 'a big number of', we get the following results:

a large number of (number of Google hits: 8,470,000)

a big number of (number of Google hits: 31,900)

Clearly, the phrase 'a large number of' is far more frequent than 'a big number of'. This generally supports using the former, but we cannot be sure whether either phrase is tied to a specific type of text, for example fiction, poetry, blogs or business writing. To find this out without having to look at each hit to ascertain what kind of text it comes from, we must restrict our search even further.

Using Google Books or Google Scholar

One way of restricting our search even further is to use one of Google's other specialised search engines. Since this material focuses on academic writing, let us assume that we want to find out whether 'a large number of' is more frequent than 'a big number of' in academic texts. For this we can use Google Scholar, which according to Google themselves allows searches for "articles, theses, books, abstracts and court opinions, from academic publishers, professional societies, online repositories, universities and other web sites" (Google Scholar 2010). Let us repeat our search for the two verbatim phrases and see what this can tell us:

a large number of (number of Google Scholar hits: 4,000,000)

a big number of (number of Google Scholar hits: 9,270)

It is evident that the pattern we saw in our general and advanced Google searches is also reflected in Google Scholar. We can thus feel more assured that 'a large number of' is more frequent than 'a big number of' in academic texts available in electronic form. We can also restrict our search to specific academic disciplines or subject areas. An Advanced Scholar Search lets us, for example, search for words and phrases solely in material classified as Medicine, Pharmacology, and Veterinary Science. A renewed search with these parameters yields the following:

a large number of (number of Google Scholar hits in Medicine: 538,000)

a big number of (number of Google Scholar hits in Medicine: 242)

Again, if we are writers of academic English and medicine is our field of study, we can be fairly confident about which phrase is conventionally preferred.

Please note that the number of hits reported in this text may differ from the number you get if you make the same searches yourself, because the material on the Internet changes constantly as new material is added and old material disappears.
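Because each search context covers a different amount of text, the raw hit counts are not directly comparable from one context to the next, but the relationship between the two phrases within each context is. The sketch below puts the figures reported in this text side by side. The context labels and function names are our own shorthand, and the counts are those from 2010, so a repeated search will give different numbers.

```python
# Hit counts as reported in the text (2010): ("a large number of", "a big number of").
HITS = {
    "Google (all pages)": (248_000_000, 2_760_000),
    "Google (English, .uk)": (8_470_000, 31_900),
    "Google Scholar": (4_000_000, 9_270),
    "Google Scholar (Medicine)": (538_000, 242),
}


def share_of_second(first: int, second: int) -> float:
    """Return the second phrase's share of the combined hits, in percent."""
    return 100 * second / (first + second)


def summarise(hits: dict) -> list:
    """Return (context, rounded ratio, rounded share in percent) for each context."""
    rows = []
    for context, (large, big) in hits.items():
        ratio = round(large / big)
        share = round(share_of_second(large, big), 2)
        rows.append((context, ratio, share))
    return rows


for context, ratio, share in summarise(HITS):
    print(f"{context}: ratio {ratio}:1, 'a big number of' share {share}%")
```

On the reported figures, the ratio grows from roughly 90:1 on the open web to more than 2,000:1 in medical material on Google Scholar, which is consistent with the pattern described above.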
