Number of words in English
Many of the problems identified below are about how we wish to define the "words" we are counting. We can count based on semantics (meaning), based on orthography (spelling) or both.
Consequently, different studies use differing criteria when counting the number of "words", lexemes or vocabulary items in a language. Depending on the criteria used, estimates for English may vary between 500,000 and 2 million words - or many more. We identify below some of the many criteria which one would need to consider.
- 1 Grammatical changes
- 2 Scientific words
- 3 Status of a word
- 4 Acronyms
- 5 Various spellings
- 6 Multiple meanings
- 7 Get and phrasal verbs
- 8 Prefixes, suffixes and inflections
- 9 Binomials
- 10 Numbers
- 11 A sample case
- 12 Solutions
- 13 References
- 14 See also
Grammatical changes
Would conjugations or past participles used as adjectives be counted as separate words? In other words, the word "(to) close" is obviously a word. Should the past tense "closed" be counted as a separate word? In some verbs the past participle is formed differently from the past tense. Are past participles which differ from past tenses different words? That is to say, does the set "speak, spoke, spoken" represent three words and the set "close, closed, closed" only represent two?
Does the word "closed", when it is used as an adjective in "a closed door", count as a separate word?
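The difference between the two counting criteria can be sketched in a few lines of Python. This is only an illustration: the `LEMMAS` table and both function names are invented for the example, and a real count would need a full lemmatiser rather than a hand-written table.

```python
# Hypothetical lemma table: maps each inflected form to its base form.
# Only the two verbs discussed above are covered.
LEMMAS = {
    "speak": "speak", "spoke": "speak", "spoken": "speak",
    "close": "close", "closed": "close",
}

def count_forms(tokens):
    """Count distinct spellings (the orthographic criterion)."""
    return len(set(tokens))

def count_lemmas(tokens, lemmas=LEMMAS):
    """Count distinct base forms; an unknown token counts as its own lemma."""
    return len({lemmas.get(token, token) for token in tokens})

# "closed" appears twice: once as past tense, once as past participle.
tokens = ["speak", "spoke", "spoken", "close", "closed", "closed"]

print(count_forms(tokens))   # 5 distinct spellings
print(count_lemmas(tokens))  # 2 distinct base forms
```

The same six tokens give five "words" by spelling and two by base form, which is the gap the questions above are pointing at.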
Scientific words
Should we count species names for flowers and insects and the 500,000 different names for fungi which are common to all languages? What about names for chemicals? How about medical names for diseases? With these you can dwarf the number of "normal" words in any language.
Status of a word
Equally difficult is the question of whether a word is actually used: it may exist but be so obsolete that it isn't used any more. Do we count it or not? Do we count slang? Do we count regional words? Do we count a word if it is used in the UK but not in the US, or not in every international variety of English? Indian English, for example, has a large selection of words from native languages.
Acronyms
There are a vast number of acronyms in the language, some of which, such as UNESCO and NATO, are known internationally. Others, such as TEFL or CELTA, are only used by small communities. How would one decide whether to include them or not?
Various spellings
If a word has two spellings, does that count as one word or two? What about alternative past forms such as "lighted" and "lit" or "dived" and "dove"? Does "dove" as a bird count as a separate word?
Multiple meanings
Furthermore, given that over eighty per cent of all words in English have more than one meaning – water as a verb and noun; lock as a verb and noun related to keys, or as a construction on a canal or river to regulate the ascent or descent of boats, or as a hold in wrestling or judo, or as in a lock of hair – should one count each meaning of the same word – the same combination of letters – as a different item? Surely if a person knows five meanings of the same word, he or she has a more extensive vocabulary than another person who knows only one meaning?
Get and phrasal verbs
Phrasal verbs are verbs formed by two (or more) parts. They express a single concept such as "run away" or "wake up". So should they be counted as a single word?
Take one of the most frequently used verbs in English: get. Should we consider each of the phrasal verbs get at, get away, get back, get by, get in, get off, get on, get over, get through, get up and many more a word in its own right?
Then there are get forms where "get" means "become", such as get fat or get fatter. Should these be one lexeme (get) or an expression, a set phrase, an idiom? In a dictionary, these, and many others, might all be included under the entry get. And what about the inflections: gets, got/gotten, getting?
To further complicate the issue, phrasal verbs usually have multiple meanings. As asked above, should each different idea represented by these meanings be counted as a distinct word?
Prefixes, suffixes and inflections
How should we count words beginning with prefixes such as un-, as in unhappy, untidy, unlikely? Many of these are not included in dictionaries because their meaning is apparently obvious. The same applies to adverbs ending in -ly, and to inflections of nouns (singular and plural), adjectives (comparison) and, as we saw above, the past tenses of most verbs, unless they are so irregular as to cause possible confusion. Thus bad, worse and [the] worst would probably be included as three separate entries, whereas for a regular adjective such as cold, the comparative and superlative (colder, [the] coldest) would probably be included under the single entry cold.
Binomials
One might also ask if there is a difference between learning single words – big and small – and binomial expressions like black and white, thick and thin, boys and girls, ladies and gentlemen, eggs and bacon, fish and chips, socks and shoes?
Numbers
Should we count individual numbers as words? "One" is pretty obviously a word and so is "two". "First" and "second" are clearly words as well. "Twenty" is a word but "twenty-one" is hyphenated when spelt out. So at what point, if any, do we stop calling numbers words?
One could argue that they stop being words when the spelling convention makes them separate, but that would mean that any language which spells them as a single word would have an infinite number of words, as is the case with Italian (ottantotto), Dutch (achtentachtig) or German (achtundachtzig).
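How much the spelling convention matters can be seen even within English. The following sketch spells out the numbers 1 to 99 and counts the distinct "words" twice: once treating a hyphenated form as a single word, and once treating the hyphen as a word boundary. The `spell` function and the choice of the range 1 to 99 are assumptions made purely for illustration.

```python
ONES = ["", "one", "two", "three", "four", "five",
        "six", "seven", "eight", "nine"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
         "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def spell(n):
    """Spell an integer from 1 to 99 in conventional English orthography."""
    if not 1 <= n <= 99:
        raise ValueError("this sketch only covers 1 to 99")
    if n < 10:
        return ONES[n]
    if n < 20:
        return TEENS[n - 10]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"

spelled = [spell(n) for n in range(1, 100)]

# Convention A: a hyphenated number is one word.
hyphen_joins = set(spelled)
# Convention B: the hyphen separates two words.
hyphen_splits = {part for s in spelled for part in s.split("-")}

print(len(hyphen_joins))   # 99
print(len(hyphen_splits))  # 27
```

Under the first convention the count grows with every number added to the range; under the second it stays at 27 for this range, because every compound is built from the same few parts.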
A sample case
To illustrate some of the above let us consider the very simple word "record" and its various derivations.
The verb "to record" has an initial meaning of "make a note of" or something of that nature and a newer meaning of "make an audio or video copy". It is conjugated as record, records, recorded, recorded, recording.
There is also the verbal expression "to break a record", meaning "to do something better than anybody else". For historical reasons this is a series of words but it could easily have been one word. The verb "to break" is conjugated as usual to form the various tenses.
As a noun "a record" can be a plastic disc; an unequalled feat; or a note which has been taken. A "recording" may be an audio or video record of an event. A recorder may be a person who makes a record, a device which makes recordings or, with a totally different meaning, a wind instrument typically used by kids in primary school as an introductory instrument.
One can also use "record" as an adjective as in "a record time". "Recording" functions as an adjective in "a recording contract". "Recorded" functions as an adjective in "a recorded conversation".
In common with many such pairs, "record" as a verb and "record" as a noun are pronounced differently.
So how many words should we count?
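One way to see how far the answer depends on the criterion is to list the forms discussed in this sample case and count them three ways. The inventory below is a rough sketch: the sense labels are paraphrases invented for the example, the inflected verb forms and the expression "to break a record" are left out, and the name `RECORD_FAMILY` is hypothetical.

```python
# Each entry is (spelling, part of speech, sense).
# Sense labels are paraphrases written for this sketch only.
RECORD_FAMILY = [
    ("record", "verb", "make a note of"),
    ("record", "verb", "make an audio or video copy"),
    ("record", "noun", "plastic disc"),
    ("record", "noun", "unequalled feat"),
    ("record", "noun", "note which has been taken"),
    ("record", "adjective", "as in 'a record time'"),
    ("recording", "noun", "audio or video record of an event"),
    ("recording", "adjective", "as in 'a recording contract'"),
    ("recorded", "adjective", "as in 'a recorded conversation'"),
    ("recorder", "noun", "person who makes a record"),
    ("recorder", "noun", "device which makes recordings"),
    ("recorder", "noun", "wind instrument"),
]

# Criterion 1: distinct spellings only.
spellings = {spelling for spelling, _, _ in RECORD_FAMILY}
# Criterion 2: distinct spelling and part-of-speech pairs.
spelling_pos = {(spelling, pos) for spelling, pos, _ in RECORD_FAMILY}
# Criterion 3: every listed sense counts separately.
senses = set(RECORD_FAMILY)

print(len(spellings))     # 4
print(len(spelling_pos))  # 7
print(len(senses))        # 12
```

Even for this small family the count ranges from 4 to 12, and it would rise further if inflections or the pronunciation difference between verb and noun were taken into account.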
Solutions
Given the above, is there any way that we can usefully talk about numbers?
Counting the number of words we use
One solution might be to try to estimate the vocabulary of the average native speaker, but even this presents difficulties, partly because we all have an active and a passive vocabulary and partly because we can often "know" words we have never seen before, either because of their context or because they are made up of other parts of words we already know.
Counting the words in a dictionary
One might imagine that simply counting the words in a dictionary would provide the answer. But dictionary compilers have to consider all the issues outlined above.
And then one might ask "Which dictionary?" A medium-sized dictionary may contain some 100,000 entries. The New Oxford Dictionary of English, published in 1998, is the biggest single-volume dictionary and contains 350,000 words, of which 52,000 are scientific and technical words, although it avoids over-technical terminology. On the other hand, the 20-volume OED, the definitive dictionary of the English language, contains over half a million lexemes, many of which are obsolete.
Counting the words in other dictionaries would give other results depending on the objectives of the editorial staff. Counting the words in half a dozen dictionaries and dividing the total by the number of dictionaries would certainly give an average number, but, given the issues identified above, would this number be meaningful in any way? And would it be possible to do the same with other languages to make any meaningful comparison?
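The averaging exercise is easy to carry out on the three figures quoted in this section, which also makes the problem with it visible. In the sketch below the OED is entered as 500,000, which is an assumption: the text only says "over half a million". The variable names are likewise chosen for illustration.

```python
from statistics import mean

# Entry counts quoted in this section. The OED figure is a lower bound
# ("over half a million"), so 500,000 is an assumption for illustration.
entry_counts = {
    "medium-sized dictionary": 100_000,
    "New Oxford Dictionary of English": 350_000,
    "OED (20 volumes)": 500_000,
}

counts = list(entry_counts.values())
average = mean(counts)
spread = max(counts) - min(counts)

print(round(average))  # 316667
print(spread)          # 400000
```

The gap between the smallest and largest figure is bigger than the average itself, which suggests the result says more about editorial criteria than about the language.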