vocabulary – zhtoolkit

Mapping the Old HSK Vocabulary onto the New HSK

Post author:Chad
Post published:September 18, 2012
Post category:vocabulary

The HSK is a well-known skill level test used by the PRC to assess language proficiency in Chinese. Even for those who have no interest in taking the HSK test, the lists of Chinese words associated with the test are a convenient source of material for learners to study vocabulary. I have used these word lists myself with great success; it was a quick and effective way to gain a huge amount of usable vocabulary.

In 2010, the HSK exam underwent a major reworking, changing the structure of its skill ranks, increasing emphasis on speaking and writing, and revising its vocabulary. Where the “old” pre-2010 word lists consisted of 8,000+ words across 4 levels, the “new” HSK has 5,000 words distributed into 6 levels. Below is a summary of the word counts in the old and new vocabulary lists, based on actual word lists obtained from various sources (see footnotes for details). Note that these include a small amount of double counting (less than 2%) due to words repeated at more than one level, because of either different pronunciation or meaning. Also note that these counts differ slightly from the official word counts reported by Hanban.

Word counts in the old and new HSK word lists
Level	old HSK	new HSK
1	1007	153
2	2001	150
3	2189	300
4	3587	600
5	–	1300
6	–	2513
Total	8784	5016
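
The totals above can be re-derived with a few lines of arithmetic. The sketch below simply re-enters the counts from the table (the variable and function names are illustrative) and also computes the running total a learner would have covered after finishing each level.

```python
# Word counts per level, copied from the table above.
old_hsk = {1: 1007, 2: 2001, 3: 2189, 4: 3587}
new_hsk = {1: 153, 2: 150, 3: 300, 4: 600, 5: 1300, 6: 2513}

old_total = sum(old_hsk.values())
new_total = sum(new_hsk.values())


def cumulative(counts):
    """Return the running total of list entries after finishing each level."""
    running, result = 0, {}
    for level in sorted(counts):
        running += counts[level]
        result[level] = running
    return result


old_cumulative = cumulative(old_hsk)
new_cumulative = cumulative(new_hsk)

print(old_total, new_total)   # 8784 5016
print(old_cumulative)         # {1: 1007, 2: 3008, 3: 5197, 4: 8784}
print(new_cumulative)         # {1: 153, 2: 303, 3: 603, 4: 1203, 5: 2503, 6: 5016}
```

By these counts, the old list through level 3 already holds 5,197 entries, slightly more than the whole new HSK list. That compares list sizes only, since it says nothing about how many of those words the two lists share.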

Since I had invested so much time in studying the old lists (up to level 3), it was natural to wonder whether I should continue studying my existing flashcards or switch to the new HSK lists. How many words have I learned that are deprecated by the HSK, and does it mean they are unimportant? If I did switch, what level should I pick to start studying?

(more…)

Hapax Legomena vs. the Brick Wall

Post author:Chad
Post published:December 20, 2011
Post category:Linguistics vocabulary

The Brick Wall

With reading as my primary skill of focus in learning Chinese, a large part of my study is acquiring new words. Some vocabulary is from general word lists such as the HSK, while much of it is tied to a specific text I am reading, in order to increase my level of comprehension. While many approach the task of reading in a foreign language by looking up unknown words as they are encountered, I prefer to learn them ahead of time, to avoid the break in concentration while reading. With my bad habit of perfectionism, my main strategy in the past for learning these words has been the “Brick Wall Method”:

The Brick Wall Method – Learn every unknown word you encounter, no matter how difficult or rare it is

My theory has been — like being a brick wall against a tennis player — to not let any unknown word get past me, so that eventually I will run out of unknown words and thus will have learned the language. If a word is used in a text, it’s clearly important to some nominal degree, and if it’s used once, then it’s more likely to be seen again at some point, versus all the words that aren’t in the text.

(more…)

New Software – Chinese Word Extractor

Post author:Chad
Post published:September 23, 2011
Post category:Software

I have had my online vocabulary extraction tool available on the web for a while now. I have gotten a lot of use out of it myself, as my primary interest has been to develop more vocabulary to increase reading ability. The application generally works ok, but it suffers from some technical issues. Because it loads the entire CC-CEDICT every time it runs, it taxes the shared hosting provider a lot, to the point where the script crashes unpredictably, especially for larger texts. It also requires manual intervention to keep the dictionary up to date, and adding more dictionaries takes a lot of additional effort.
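
For readers unfamiliar with the dictionary format: each CC-CEDICT entry is one line of text giving the traditional form, the simplified form, the pinyin in square brackets, and slash-separated definitions. The sketch below is not the tool's actual code. It is a minimal illustration of parsing such lines into an in-memory index, with hypothetical function names (`parse_cedict_line`, `load_cedict`) and sample lines that follow the format but are not copied verbatim from the dictionary.

```python
import re

# One entry per line: TRADITIONAL SIMPLIFIED [pin1 yin1] /gloss 1/gloss 2/
ENTRY_RE = re.compile(r"^(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+/(.*)/\s*$")


def parse_cedict_line(line):
    """Parse one entry line; return None for comments, blanks, or malformed lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = ENTRY_RE.match(line)
    if match is None:
        return None
    traditional, simplified, pinyin, glosses = match.groups()
    return {
        "traditional": traditional,
        "simplified": simplified,
        "pinyin": pinyin,
        "definitions": glosses.split("/"),
    }


def load_cedict(lines):
    """Index entries by simplified form; one headword may have several entries."""
    index = {}
    for line in lines:
        entry = parse_cedict_line(line)
        if entry is not None:
            index.setdefault(entry["simplified"], []).append(entry)
    return index


# Illustrative lines in the CC-CEDICT layout (assumed, not quoted from the file).
sample = [
    "# CC-CEDICT",
    "中國 中国 [Zhong1 guo2] /China/",
    "詞彙 词汇 [ci2 hui4] /vocabulary/list of words/",
]
dictionary = load_cedict(sample)
print(dictionary["词汇"][0]["definitions"])  # ['vocabulary', 'list of words']
```

Running a parse like this over the full dictionary on every request is the kind of repeated work described above. An offline program can pay that cost once per session instead.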

Meanwhile, for the past year I’ve been working on a similar program that can be used offline. It has been working well, is a little faster, and makes it easier to drop in newer versions of the CC-CEDICT dictionary. I have spent a few months adding a little more polish to it, and am now releasing it as open source software. At this point, it is available for Windows systems. The source code is also available, which would allow it to be used on nearly any system. More details are at the project page and the documentation page. Here are some screenshots to demonstrate its functionality:

(more…)

Skill Levels Quantified

Post author:Chad
Post published:June 12, 2011
Post category:vocabulary

I have started to analyze some of the results from the word test experiment. While I continue to crank through the data, here is an early result that I found…

A Mathematical Model for Chinese Word Knowledge

Post author:Chad
Post published:April 24, 2011
Post category:Linguistics

The Known Chinese Words Test has been running for a month now. During that time I’ve collected data from 170 trials, from learners with a wide range of levels. The results are encouraging enough that I can now give more details about what I have found. What I have been working on is a mathematical model for word knowledge, which can describe the probability that a particular person knows any given word, with just a few variables involved. The results from the collected trials validate that hypothetical model, and I’m elated.

(more…)

An application to estimate known Chinese words

Post author:Chad
Post published:March 17, 2011
Post category:Tools

There has been a scarcity of posts on the blog lately, as I’ve been working on a web application for the site. This is a page anyone can use to estimate their knowledge of Chinese words. The start page for the test is here.

(more…)

Counting Known Chinese Words – Part II

Post author:Chad
Post published:February 5, 2011
Post category:Linguistics

By 2008, I had been studying Chinese off and on for around 3 years. As a self-learner, my study was rather eclectic: Pimsleur, Chinesepod, and random flash card lists were my main methods. I was far from fluent, still struggling to understand all but the simplest news articles, fiction, or blog posts. But I felt like I did know a lot of words; I just didn’t know how many. How much longer before this would start to get easy? So I undertook a self-examination to estimate how many Chinese words I actually knew.

(more…)

Counting Known Chinese Words – Part I

Post author:Chad
Post published:January 23, 2011
Post category:Linguistics

In forums for foreign language learners,¹ ² certain questions recur. Some are easy to answer (Do I need to learn tones? Yes). Others (Should I study words or characters?) may have more than one answer. Two common questions I am especially interested in are:

How many words do I know?
How many words do I need to know (to read a newspaper, book, etc.)?

The second question is not immediately answerable except under certain conditions: if you can’t yet read your target material, then the answer is: more! Of course, people ask the second question because they want to know how close they are to their target fluency level, and the answer to the first question can give a rough estimate of that.

(more…)

What makes a good vocabulary list?

Post author:Chad
Post published:October 25, 2010
Post category:Reading

People study foreign languages in many different ways. Because my main goal is reading, my particular method for studying Chinese places a large emphasis on acquiring receptive vocabulary, that is, knowing the pinyin and the definition of words from the written characters. This is done through either flashcard software (I use Stackz) or spaced repetition software (like Anki). If I have an electronic text available, I use home-grown scripts to segment the text into words, and then create a word list of all the unique words. If I only have a printed book or magazine, I pick out the unknown words by hand, although this can be overwhelming with a difficult text. (more…)
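
The home-grown scripts themselves are not shown here. As a rough illustration of the segment-then-deduplicate idea, the sketch below uses greedy forward maximum matching against a word set, a common simple technique that is swapped in as an assumption and is not necessarily what those scripts do. The function names, the `max_len` parameter, and the toy lexicon are all hypothetical.

```python
def segment(text, lexicon, max_len=4):
    """Greedy forward maximum matching: at each position take the longest lexicon word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


def unique_words(words):
    """Return words in order of first appearance, without duplicates."""
    return list(dict.fromkeys(words))


# Hypothetical toy lexicon; a real one would come from a dictionary such as CC-CEDICT.
lexicon = {"我", "喜欢", "学习", "中文", "也"}
text = "我喜欢学习中文我也喜欢中文"

words = segment(text, lexicon)
print(words)                # ['我', '喜欢', '学习', '中文', '我', '也', '喜欢', '中文']
print(unique_words(words))  # ['我', '喜欢', '学习', '中文', '也']
```

Characters that are not part of any lexicon word come through as single-character tokens, so punctuation and unknown characters would need filtering before the list is turned into flashcards.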

Word Frequencies from Corpus Data

Post author:Chad
Post published:July 2, 2010
Post category:Linguistics

Knowing the frequency of the Chinese words you are studying is helpful in a few different ways. If an unknown word is relatively common, then it’s generally more important to learn that word, compared to a less common word. With that knowledge in hand, you can feel less guilty about removing the rare words from your flashcards, and persist in learning the ones that are common yet difficult. If a word has a low frequency in general, but happens to be used a lot in a particular text, that word may be of interest to study. In the early stages of learning, studying the top N (100, 200, 500, etc.) words as flashcards is an effective way to bootstrap one’s word knowledge before diving into authentic texts. But it’s not 100% effective; the long tail of infrequent words will keep you busy learning new vocabulary for years!
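
Two of these uses are easy to sketch once frequency counts are available: picking the top N words, and spotting words that are rare in general but repeated in one text. The example below assumes the text is already segmented into words. The counts are made up for illustration, and the function names and thresholds (`min_text_count`, `max_corpus_count`) are hypothetical.

```python
from collections import Counter


def top_n(frequencies, n):
    """Return the n most frequent words, most common first."""
    return [word for word, _ in Counter(frequencies).most_common(n)]


def text_specific_words(tokens, corpus_counts, min_text_count=2, max_corpus_count=10):
    """Words that are rare in the general corpus but repeated in this text."""
    text_counts = Counter(tokens)
    return sorted(
        word
        for word, count in text_counts.items()
        if count >= min_text_count and corpus_counts.get(word, 0) <= max_corpus_count
    )


# Made-up counts for illustration only; real values would come from corpus data.
corpus_counts = {"的": 50000, "是": 20000, "学习": 900, "词汇": 40, "语料库": 3}
tokens = ["语料库", "的", "词汇", "是", "语料库", "的", "学习"]

print(top_n(corpus_counts, 3))                     # ['的', '是', '学习']
print(text_specific_words(tokens, corpus_counts))  # ['语料库']
```

Sensible thresholds depend on the size of the corpus and of the text, so the defaults here are placeholders.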

So, how can we obtain word frequency data? With Chinese, it’s trickier than it sounds. (more…)