In forums for foreign language learners, certain questions recur. Some are easy to answer (Do I need to learn tones? Yes). Others (Should I study words or characters?) may have more than one answer. Two common questions I am especially interested in are:
- How many words do I know?
- How many words do I need to know (to read a newspaper, book, etc.)?
The second question is not immediately answerable except under certain conditions: if you can’t yet read your target material, then the answer is simply “more!” Of course, people ask the second question because they want to know how close they are to their target fluency level, and the answer to the first question can give a rough estimate of that.
However, some may caution against focusing too much on the number of words one knows. One argument is that the quantity is inaccurate at best. Another is that the only relevant benchmark is whether you can comprehend what you are reading. A third argument is that a larger vocabulary is only one aspect of improved skill in a foreign language; a better understanding of nuances in word meanings, improved grammar, and a grasp of discourse structure are also essential to fluency, but are harder to measure. I don’t disagree with any of those points. But I still like to give myself an informal vocabulary test occasionally (once a year or so, at most) to see if I’m making clear progress. It also gives me an idea of what level of texts I should be able to read comfortably. Because my primary goal in learning Chinese is reading fluency, it’s a metric that is useful for me. If listening and speaking were my primary areas of study, my level of written word recognition would be less relevant.
There has been active research in vocabulary acquisition, so I will mention some of the more prominent names. Paul Nation has been researching language learning, with a focus on English vocabulary, since 1970. His publications page has direct links to PDFs of all his papers, and it’s fascinating reading because he addresses so many topics that I’ve had in mind as a learner. Paul Meara also studies vocabulary acquisition, with a significant focus on estimating vocabulary size. He retired from his professorship at Swansea University in 2009, so his other site is now the place to find his research. James Milton is also at Swansea University and has done joint research with Meara. His 2009 book Measuring Second Language Vocabulary Acquisition is an excellent review of testing methods, vocabulary size assessments, and the effects of different teaching and learning styles.
In a 1988 paper by Meara and Jones, an insightful list of benefits from word counts is given, which I will quote here:
[The VOC test] is also interesting because it opens up areas of research which have not been accessible before. If the VOC test really does measure vocabulary size, then we can begin to ask questions like these:
- How fast do people learn new words?
- How much individual variation is there in the skill?
- Is it affected by other variables, such as L1, or L1 vocabulary size?
- How effective are different types of teaching programme?
- Do intensive courses produce more vocabulary learning than less intensive ones?
- How quickly do learners who don’t practice lose their vocabulary?
- Is the fallout rate such that it reaches a stable asymptote?
- Is there a residue of words that you never really forget no matter how little you practice?
What is a Word?
This seems like an odd question, but it’s an important one when considering vocabulary counts. If a person knows the words “clock” and “clocks”, does this count for two words, or just one? What about “run” and “running”? Or, “nation”, “nations”, “national”, “nationality”, “nationalism”? How should an examiner count “walk” as a noun versus “walk” as a verb, or a “bank” for money versus a “bank” of a river? Should we count words that are proper names, compound words, abbreviations, or alternate spellings? Vocabulary tests generally consider only word families in estimating vocabulary size; that is, the base form of a word also represents its inflections and possibly a few derivations.
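To make the word-family idea concrete, here is a toy Python sketch that counts word families rather than surface forms; the groupings are hand-written examples for illustration, not the output of any particular lemmatizer.

```python
# Toy illustration: counting word families versus surface forms.
# The groupings below are hand-written examples, not real lemmatizer output.
word_families = {
    "clock":  ["clock", "clocks"],
    "run":    ["run", "runs", "running", "ran"],
    "nation": ["nation", "nations", "national", "nationality", "nationalism"],
}

surface_forms = sum(len(forms) for forms in word_families.values())
print(f"{surface_forms} surface forms, but only {len(word_families)} word families")
# -> 11 surface forms, but only 3 word families
```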
In language testing, having word families represent base words with their inflected and derived forms is a method commonly used in languages besides English. Most languages have grammar that includes regular rules for forming plurals and inflecting verbs. Chinese is rather unusual in that it doesn’t have many such derived words. The plural marker “-们” is used sparingly and only collocates with certain words. Verbs do not conjugate as they do in other languages (i.e., the base word serves as infinitive, present, and past tense, as well as past and present participle); instead, auxiliary words are used where necessary to indicate tense or aspect. So, can corpus-derived frequency lists be used directly as a source for testing words? My personal experience using word frequencies derived from the Lancaster Corpus of Mandarin Chinese (LCMC) has been positive. The LCMC segments words in a way that is comparable to a base word in other languages. Verb-modifying markers such as 着 or 得 are split from the preceding verb, except in words such as 随着 or 觉得 where they are integral to the word. Similarly, the adverbial modifier 地 is split in most cases from its preceding adjective or stative verb. Thus, most words tagged in the corpus are already base words. The kinds of words that I would prefer to split further are:
- composite numbers (e.g., 一百一十四) which are not difficult to figure out if you know the individual characters
- ordinals (第 + a number) tagged as single words
- numbers + 年 (一九三三年) tagged as single words
- most adjective and verb reduplication (e.g., 歪歪扭扭, 阴沉沉, 点点头), which can be easily determined from the base word
- certain compounds, such as 双 + X (“both X/bi- X”) or 最 + X (“most X”), which can be easily guessed
The effect of having these derivations in a source list of words is that they would generally be considered “known” if seen on a word test, even though they have never been studied, or possibly even seen before! This inflates the estimate of known word count, especially in the low frequency word bands. I will address this further in future posts.
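As a rough illustration of how such derived forms could be screened out of a frequency list before building test bands, here is a minimal Python sketch; the regular expressions, the helper name, and the file name are simplified assumptions, not a complete segmentation rule set or the actual list I used.

```python
import re

# Chinese numeral characters, used to spot composite numbers,
# ordinals (第 + number), and year expressions (number + 年).
NUM = "〇零一二两三四五六七八九十百千万亿"

# Rough, illustrative patterns for the derived forms discussed above.
DERIVED_PATTERNS = [
    re.compile(f"^[{NUM}]{{2,}}$"),   # composite numbers, e.g. 一百一十四
    re.compile(f"^第[{NUM}]+$"),      # ordinals, e.g. 第三
    re.compile(f"^[{NUM}]+年$"),      # years, e.g. 一九三三年
    re.compile(r"^(.)\1(.)\2$"),      # AABB reduplication, e.g. 歪歪扭扭
    re.compile(r"^(.)(.)\2$"),        # ABB reduplication, e.g. 阴沉沉
    re.compile(r"^(.)\1.$"),          # AAB reduplication, e.g. 点点头
    re.compile(r"^[双最]."),           # 双X / 最X compounds (over-broad heuristic)
]

def is_derived(word: str) -> bool:
    """Return True if a corpus word looks like a guessable derived form."""
    return any(p.match(word) for p in DERIVED_PATTERNS)

# Example: filter a frequency list (hypothetical file, one word per line,
# most frequent first) down to base words before building test bands.
with open("lcmc_wordlist.txt", encoding="utf-8") as f:
    words = [line.split()[0] for line in f if line.strip()]

base_words = [w for w in words if not is_derived(w)]
print(f"kept {len(base_words)} of {len(words)} entries")
```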
Evaluation Methods
The most accurate way to count the number of words you know would be to go through every word in a large vocabulary list (possibly 10,000 words or more) and check off every word you know. Obviously, that’s too time-consuming to be practical. Instead, working with a random sample of words is more manageable. However, sampling comes at the expense of accuracy, and the smaller the sample, the lower the accuracy. A commonly used method is to split the entire vocabulary into frequency “bands” of 1,000 contiguous word ranks (band 1 contains words ranked 1-1,000, band 2 contains words ranked 1,001-2,000, etc.) and select random words from each band to test on. After the results are scored, the percentage of correct words in each sample is extrapolated to the size of the original band. For example, if a student scored 43 correct answers out of 50 words taken from the band of the top 1,000 words, it is estimated that the student knows around 43/50 * 1,000, or 860 words (86%), in this band. Adding up the estimated counts for all of the bands yields the estimated vocabulary size, out of the N word families in the full list.
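For concreteness, here is a minimal sketch of that extrapolation arithmetic; the band scores are made-up numbers, reusing the 43-out-of-50 example above.

```python
# Band-sampling estimate: extrapolate each band's sample score to the
# whole band and sum. The scores below are invented for illustration.
band_results = {
    1: (43, 50),   # words ranked 1-1,000: 43 of 50 sampled words known
    2: (38, 50),   # words ranked 1,001-2,000
    3: (29, 50),   # words ranked 2,001-3,000
}

BAND_SIZE = 1000

def estimate_vocabulary(results, band_size=BAND_SIZE):
    """Scale each band's proportion of correct answers to the band size."""
    return round(sum(correct / sampled * band_size
                     for correct, sampled in results.values()))

print(estimate_vocabulary(band_results))  # 860 + 760 + 580 = 2200
```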
When measuring how many words one knows, it’s important to define what it means to “know” a word. The aforementioned researchers have designed tests for English learners that have been widely distributed. Nation and Laufer designed the Vocabulary Levels Test, a multiple-choice word-matching test split into 1,000-word frequency bands and other levels, for assessing an EFL student’s knowledge of English. A sample item looks like this:
1 benefit   2 labour   3 percent   4 principle   5 source   6 survey

____ work
____ part of 100
____ general idea used to guide one’s actions
Meara et al. have designed various vocabulary tests (e.g. here), reporting success with inventory checklists. Different sections of a test contain sample words from distinct 1,000-word frequency bands, and the student checks off all the words he/she knows. Fake but plausible words are added as distractors to guard against simply checking off every word; a student’s score is penalized for checking off these distractors. The advantage of this method is that the test is quicker; a student can go through more than twice as many words in the same period of time compared to a multiple-choice test. If an examiner wants to test knowledge of multiple meanings (e.g., “bank”), a simple word inventory list can’t make this distinction, but an inventory test can be modified to use associated pairs, such as “bank–money”, “bank–river”, and “bank–pencil”, with the testee checking off the associations that he/she thinks are correct.
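To illustrate how a penalty for checking distractors can work, here is a sketch using one standard correction-for-guessing formula from the yes/no test literature; the formula choice and the numbers are illustrative assumptions, not the exact scoring used in Meara’s published tests.

```python
# Penalize checklist scores for false alarms (checked distractors) using a
# simple correction for guessing; this is one common approach, not
# necessarily the scoring rule of any particular published test.

def adjusted_hit_rate(hits, real_words, false_alarms, pseudowords):
    """Estimate the proportion of real words actually known, discounting
    the tendency to check off words indiscriminately."""
    h = hits / real_words           # proportion of real words checked
    f = false_alarms / pseudowords  # proportion of fake words checked
    if f >= 1:                      # every distractor checked: no information
        return 0.0
    return max(0.0, (h - f) / (1 - f))

# Example: 40 of 60 real words checked, 3 of 20 distractors checked.
print(f"{adjusted_hit_rate(40, 60, 3, 20):.2f}")  # ~0.61
```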
Coming soon
In a future post, I will describe my own methods and experiences in estimating my own vocabulary size.
2 Comments to 'Counting Known Chinese Words – Part I'
January 23, 2011
Amazing post and information. I just finished a linguistics degree last year and I am very much interested in SLA and specifically vocabulary acquisition. Really eager for the next post.
Also, thanks for the links to the researchers. Definitely going to spend my time going through their work. Starting a Master’s degree this year in Hypermedia for Language Learning and these topics interest me very much. Keep up the good work!
January 26, 2011
Great post! I’m really enjoying your take as a Chinese learner that’s into corpus analysis, because this is the same research we reference for a lot of the stuff we’re doing at AllSet Learning.
Looking forward to part 2!