A Mathematical Model for Chinese Word Knowledge

The Known Chinese Words Test has been running for a month now. During that time I’ve collected data from 170 trials, from learners at a wide range of levels. The results are encouraging enough that I can now give more details about what I have found. I have been working on a mathematical model for word knowledge that describes the probability of a particular person knowing any given word, using just a few variables. The results from the collected trials validate that hypothetical model, and I’m elated.

It has been known for a long time that there is a relationship between the frequency of a word in a language and the chance of a person knowing it. The most common words in a language are the most likely to be known, and the chance of knowing a word goes down as words become rarer. This has been taken as evident since at least 1917¹. The word knowledge of foreign language learners has been the subject of recent research by Paul Nation², Paul Meara³, and James Milton⁴. For example, one line of study looks at the knowledge of English words by foreign learners, testing the 5,000 most frequent word families in English. When the percentage of words known is plotted for each band of 1,000 contiguous ranks, a typical profile for a foreign language learner looks like this:

fig. 1. Typical profile for an EFL student (Meara, 1992; Milton, 2009)

The first band of the 1,000 top-ranked words has the highest percentage of words known, and this percentage goes down consistently for successively lower-frequency rank bands. For a beginning learner, the shape of the plot is a continuous downward slope. As a learner becomes more advanced, the percentages all increase, but cannot go above 100%. Thus, the plot becomes more S-shaped, with an initial plateau of around 100% before sloping downward. The more word knowledge a learner has, the longer this plateau becomes.

My Experience

A few years ago I tested my word knowledge using a method similar to those in the above research, but with a few important differences. First, I used pure self-assessment of word knowledge, with a yes/no value for each presented word. Second, I used a larger range of words, many more than just the top 5,000 word families. Including more low-frequency words in this larger range gave a broader picture of the behavior of the knowledge profile. As expected, my personal examination yielded near-100% knowledge of the highest-frequency words, quickly dropping to a low level. As my studying progressed, similar graphs of word knowledge started to show the initial plateau for high-frequency words, meaning that I knew nearly 100% of a wider cluster of words than before. Out of curiosity, I started playing around with formulas that could match the same kind of plot, with an initial plateau and a steep dropoff. A few equations yield this kind of sigmoid plot; logistic or saturation models did not fit my results well, nor did they make sense conceptually. I eventually arrived at an equation that did fit my personal data.

The Mathematical Model

My aim in crafting a formula for word knowledge was to predict the probability of knowing a word given either its rank or frequency. The initial equation I came up with was:

p_w = 1 – F^{α f_w} (eq. 1),

where

p_w: The probability of knowing word W

F: The “Forgetting Factor,” representing the probability of failing to learn a word after seeing it once

f_w: The frequency of word W, as a fraction of all word occurrences in the language (a value between 0 and 1)

α: a factor related to the number of words the person has seen in the language

Conceptually, the equation represents a model where, every time I encounter a particular word, I fail to learn it with probability F. While I’ve called F a “Forgetting Factor,” it really represents a failure to learn the word after seeing it, for whatever reason. Perhaps repeated encounters with a word offer more chances to learn the word in more contexts. Or, as often happens with me, repeatedly seeing a word I can’t figure out eventually spurs me on to look it up to find out the meaning. To illustrate, if F is 0.7, I have a 70% chance of not learning a word I see once. If I see the same word on two separate occasions, the probability of still not learning it is F², or 49%. After seeing the same word 10 times, my chance of having learned it is over 97%. Since f_w is the fraction of the language consisting of word W, α * f_w is the expected number of times word W has been seen after reading α words in the language.
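The arithmetic above can be checked with a minimal Python sketch of equation (1). The function name `p_known_eq1` and the example reader (α of one million words, a word making up 0.001% of running text) are assumptions made for illustration, not values from the test data.

```python
def p_known_eq1(f_w: float, alpha: float, F: float = 0.7) -> float:
    """Equation (1): probability of knowing a word with relative frequency f_w."""
    encounters = alpha * f_w  # expected number of times the word has been seen
    return 1.0 - F ** encounters


# Chance of having learned a word after n encounters, with F = 0.7.
for n in (1, 2, 10):
    print(n, round(1.0 - 0.7 ** n, 4))

# Hypothetical reader: alpha = 1,000,000 words read, f_w = 1e-5,
# which gives about 10 expected encounters with the word.
print(round(p_known_eq1(f_w=1e-5, alpha=1_000_000), 4))
```

The loop reproduces the figures in the text: a 30% chance of knowing the word after one encounter, 51% after two (the complement of the 49% above), and a little over 97% after ten.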

This equation roughly fits both the initial plateau and the steep dropoff; at least I can get the bend in the right place. Where it didn’t fit was in the long tail of rare words. The equation predicts that I should know nearly 0% of these words. However, I knew more than this: a relatively flat probability of between 5% and 20% for these low-frequency words, no matter how low the frequency got. With the addition of another parameter, this can be accounted for by the modified equation:

p_w = 1 – F^{α f_w + β} (eq. 2).

Parameter β effectively sets a constant probability of word knowledge, regardless of its frequency. For words that have never been seen before, f_w is effectively zero, so that the floor probability for knowing such a word is 1 – F^β, a non-zero asymptote when β > 0. Conceptually, β represents the ability to ascertain the meaning of words that have never been encountered, whether by educated guessing, etymology, or knowledge of component characters of a multi-character word.
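To make the floor concrete, here is a small Python sketch of equation (2). The helper names (`p_known_eq2`, `floor_probability`, `beta_for_floor`) are made up for this illustration, and F is fixed at 0.7 purely as an assumption.

```python
import math


def p_known_eq2(f_w: float, alpha: float, beta: float, F: float = 0.7) -> float:
    """Equation (2): probability of knowing a word with relative frequency f_w."""
    return 1.0 - F ** (alpha * f_w + beta)


def floor_probability(beta: float, F: float = 0.7) -> float:
    """Asymptote 1 - F^beta, reached as f_w goes to zero."""
    return 1.0 - F ** beta


def beta_for_floor(floor: float, F: float = 0.7) -> float:
    """Invert the floor formula: the beta that gives a chosen floor probability."""
    return math.log(1.0 - floor) / math.log(F)


# Beta values that would correspond to floors of 5% and 20% when F = 0.7.
for floor in (0.05, 0.20):
    print(floor, round(beta_for_floor(floor), 2))
```

Under that assumption, floors of 5% and 20% correspond to β values of roughly 0.14 and 0.63.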

Note that the absolute values of F, α, and β are not completely independent. If F is changed, α and β can also be changed to yield identical results. More precisely, α * log F is constant for a given plot, as is β * log F. When curve fitting, I tend to use 0.7 for F as a reasonable but arbitrary value. Adopting a consistent value for F makes it easier to compare α and β for different data sets.
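The dependence follows from writing F^{α f_w + β} as exp((α f_w + β) * log F): only the products α * log F and β * log F enter the curve. A short Python sketch, with the hypothetical helper `rescale_params` and arbitrary example values, shows how parameters fitted under one F translate to another F without changing any predicted probability.

```python
import math


def p_known(f_w: float, alpha: float, beta: float, F: float) -> float:
    """Equation (2) with F passed explicitly."""
    return 1.0 - F ** (alpha * f_w + beta)


def rescale_params(alpha: float, beta: float, F_old: float, F_new: float):
    """Return (alpha, beta) for F_new that keep alpha*log(F) and beta*log(F) fixed."""
    k = math.log(F_old) / math.log(F_new)
    return alpha * k, beta * k


# Arbitrary example values, not fitted to any real trial.
alpha, beta = 50_000.0, 0.4
alpha2, beta2 = rescale_params(alpha, beta, F_old=0.7, F_new=0.5)

for f_w in (1e-3, 1e-5, 1e-7):
    # Both parameter sets give the same probability for every frequency.
    print(f_w, p_known(f_w, alpha, beta, 0.7), p_known(f_w, alpha2, beta2, 0.5))
```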

fig. 2. My known words in 2008, with theoretical curves

Experimental Results

Using equation (2), I was able to fit all of my personal data, from both the beginning level and more current intermediate level of Chinese reading ability. But how does it fit with other learners’ word knowledge? I was relieved to see that it fits quite well for almost all trials, and for testees with wide-ranging knowledge of Chinese, from beginners to native speakers.
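One simple way to fit equation (2) is sketched below. It is not necessarily the fitting procedure used for these trials. With F fixed, taking logs gives log(1 – p_w) / log F = α f_w + β, which is a straight line in f_w, so ordinary least squares recovers α and β. The function name `fit_alpha_beta` and the data are hypothetical: the probabilities are generated from the model itself, only to check that the fit recovers the parameters it was given.

```python
import math


def fit_alpha_beta(freqs, probs, F=0.7):
    """Least-squares fit of y = alpha * f + beta, where y = log(1 - p) / log(F)."""
    ys = [math.log(1.0 - p) / math.log(F) for p in probs]
    n = len(freqs)
    mean_f = sum(freqs) / n
    mean_y = sum(ys) / n
    sxx = sum((f - mean_f) ** 2 for f in freqs)
    sxy = sum((f - mean_f) * (y - mean_y) for f, y in zip(freqs, ys))
    alpha = sxy / sxx
    beta = mean_y - alpha * mean_f
    return alpha, beta


# Synthetic check only: these are NOT measured results.
true_alpha, true_beta, F = 20_000.0, 0.5, 0.7
freqs = [1e-3, 3e-4, 1e-4, 3e-5, 1e-5, 1e-6]
probs = [1.0 - F ** (true_alpha * f + true_beta) for f in freqs]

alpha_hat, beta_hat = fit_alpha_beta(freqs, probs, F)
print(alpha_hat, beta_hat)
```

The log transform fails for bands where the observed percentage is exactly 0% or 100%, and it reweights noisy points, so a direct nonlinear least-squares fit may be a better choice for real trial data.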

fig. 3. Typical known words profile for beginning learner

fig. 4. Typical known words profile for intermediate learner

fig. 5. Typical known words profile for advanced learner with high β

fig. 6. Typical known words profile for advanced learner with low β

These graphs show profiles at representative levels of Chinese knowledge. Figure 3 represents a typical beginning learner with low word knowledge. Figure 4 is for an intermediate learner. Figure 5 is for an advanced learner, with both high α and high β. Based on the model, the high β value indicates that this person not only knows many words, but can infer further ones based on the characters. Figure 6 is also for an advanced learner, but with a low value of β. Considering that the online test relied on self-assessment of word knowledge, this result may represent a learner with a stricter definition of knowing a word, one that doesn’t allow for guessing at meanings of unknown words. Even though figure 5 estimates more known words than figure 6, the model predicts that the learner associated with figure 6 should have a greater command of Chinese words, as shown by the higher value of α. [After writing this, I looked up the data for this trial, and it’s actually a heritage learner.]

It’s important to keep in mind that the experimental trials relied completely on self-assessment. Testees use personal criteria for marking a tested word as “known”, allowing or disallowing guessed words, words with a correct meaning but incorrect (pinyin) pronunciation, or words with a minor definition known but where the major one was missed. This will all have some effect on the values of α and β.

I will continue to report on the results as the analysis comes in. Thanks to everyone who participated in the online test!

¹ Palmer, Harold E., The scientific study & teaching of languages, 1917, World Book Company, Yonkers-on-Hudson, N.Y., p. 123. (archive.org)

² Paul Nation, Victoria Univ. of Wellington home page

³ Paul Meara, _lognostics home page

⁴ James Milton, Swansea University home page
