In my Chinese studies, the Lancaster Corpus of Mandarin Chinese (LCMC) has been a useful source of data: word and character frequencies, collocations, phrase usage, parts of speech, and so on. The corpus is freely available for non-commercial and research use. However, its native form is a set of XML files, which is not an easy format to work with. The XML data is also slow to read, because all those tags and the entire data structure must be parsed on every pass. A much better format for the data is a SQL database: stored in a database, many kinds of queries and reports can be executed efficiently, returning results far faster than is possible with the XML files.
I have made available a Perl script and some related tools to assist with extracting the LCMC files into a SQLite database. SQLite is a lightweight relational database management system intended for portability and ease of use. Because it functions as a standalone library rather than a client-server system, it is easy to install and use. It's more ubiquitous than you might think: it's how the Firefox and Chrome browsers store their history, cookies, and preferences. It's also used, for example, by the Anki program as the storage format for flashcard data, and by the Calibre e-book program to store information on installed e-books.
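To give a feel for why the database form is so convenient, here is a minimal sketch of querying corpus data with Python's built-in sqlite3 module. The table and column names (`corpus_words`: `word`, `pos`, `freq`) and the sample rows are my own illustrative assumptions, not the actual schema produced by the extraction script.

```python
import sqlite3

# Build a tiny in-memory stand-in for the extracted corpus database.
# Table and column names here are assumptions for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE corpus_words (word TEXT, pos TEXT, freq INTEGER)")
conn.executemany(
    "INSERT INTO corpus_words VALUES (?, ?, ?)",
    [("的", "u", 123456), ("中国", "ns", 4567), ("学习", "v", 890)],
)

# Top words by frequency: the kind of report that requires parsing
# every XML file, but is nearly instant against a database.
rows = conn.execute(
    "SELECT word, freq FROM corpus_words ORDER BY freq DESC LIMIT 2"
).fetchall()
print(rows)  # [('的', 123456), ('中国', 4567)]
```

The same database handles part-of-speech filters, collocation joins, and frequency cutoffs with ordinary SQL, no custom parsing code required.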
› Continue reading…
Tags:
corpus,
howto,
LCMC,
Software,
SQL,
tools,
word frequency,
words
The HSK is a well-known skill level test used by the PRC to assess language proficiency in Chinese. Even for those who have no interest in taking the HSK test, the lists of Chinese words associated with it are a convenient source of material for learners to study vocabulary. I have used these word lists myself with great success; they were a quick and effective way to gain a huge amount of usable vocabulary.
In 2010, the HSK exam underwent a major reworking, changing the structure of its skill ranks, increasing emphasis on speaking and writing, and revising its vocabulary. Where the “old” pre-2010 word lists consisted of 8,000+ words across 4 levels, the “new” HSK has 5,000 words distributed into 6 levels. Below is a summary of the word counts in the old and new vocabulary lists, based on actual word lists obtained from various sources (see footnotes for details). Note that these include a small amount of double counting (less than 2%) due to words repeated at more than one level, because of either different pronunciation or meaning. Also note that these counts differ slightly from the official word counts reported by Hanban.
Word counts in the old and new HSK word lists

| Level | old HSK | new HSK |
|-------|---------|---------|
| 1     | 1007    | 153     |
| 2     | 2001    | 150     |
| 3     | 2189    | 300     |
| 4     | 3587    | 600     |
| 5     | –       | 1300    |
| 6     | –       | 2513    |
| Total | 8784    | 5016    |
Since I had invested so much time in studying the old lists (up to level 3), it was natural to wonder whether I should continue studying my existing flashcards or switch to the new HSK lists. How many of the words I have learned are dropped from the new lists, and does that mean they are unimportant? If I did switch, what level should I pick to start studying?
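The comparison boils down to set operations over the two word lists. Here is a minimal sketch, with tiny made-up word sets standing in for the real files, of checking what fraction of each new-HSK level's vocabulary is already covered; the `coverage` helper is my own, not from any released tool.

```python
# Illustrative only: tiny stand-in word sets. The real sets would be
# loaded from the old and new HSK vocabulary list files.
old_words = {"你好", "朋友", "学习", "经济"}   # words studied from the old lists
new_hsk = {
    4: {"朋友", "经济"},
    5: {"学习", "改革"},
}

def coverage(known, level_words):
    """Fraction of a level's vocabulary already known."""
    return len(known & level_words) / len(level_words)

for level in sorted(new_hsk):
    print(level, coverage(old_words, new_hsk[level]))
```

A high coverage fraction at a given new level would suggest starting one level above it; words in `old_words` but in no new level are the "deprecated" ones.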
› Continue reading…
Tags:
flashcards,
studying,
vocabulary,
word frequency,
word lists,
words
When one searches on Google's search page, Chinese words and phrases can be strung together without separation, just as in normal writing. What isn't immediately obvious is that, behind the scenes, Google apparently segments the Chinese pages it crawls into individual words before storing the terms in its index. For example, in Google's web search, the term 中国 reports over 1 billion hits. The same term in quotes to indicate an exact phrase, "中国", reports 5 billion hits (a discrepancy that is hard to explain). However, when a space is inserted into the word, the exact phrase "中 国" reports only 4.3 million pages, about 0.08% of the count for the unbroken "中国". The kinds of pages returned by the space-separated query include matches for: 中國 in traditional script (for unknown reasons); the two characters separated by punctuation, especially "中(国)" and "中。国" (i.e., one sentence ends with 中 and the next starts with 国); and pages where every character is separated, as if the page were encoded or decoded incorrectly. These results suggest that Google treats Chinese searches the same as other languages, storing pages in its back-end database indexed by the individual words on the page. Storing the terms this way allows Google to quickly return results for a variety of queries, whether the user wants the terms anywhere in the page or as a connected phrase.
› Continue reading…
Tags:
words
I have had my online vocabulary extraction tool available on the web for a while now. I have gotten a lot of use out of it myself, as my primary interest has been to develop more vocabulary to increase reading ability. The application generally works, but it suffers from some technical issues. Because it loads the entire CC-CEDICT dictionary every time it runs, it puts a heavy load on the shared hosting provider, to the point where the script crashes unpredictably, especially on larger texts. It also requires manual intervention to keep the dictionary up to date, and adding more dictionaries would take considerable additional effort.
Meanwhile, for the past year I've been working on a similar program that can be used offline. It has been working well, is a little faster, and makes it easier to drop in newer versions of the CC-CEDICT dictionary. I have spent a few months adding polish to it, and am now releasing it as open source software. At this point, it is available for Windows systems. The source code is also available, which allows it to run on nearly any system. More details are on the project page and the documentation page. Here are some screenshots to demonstrate its functionality:
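For reference, here is a minimal sketch of parsing one line of the CC-CEDICT format that such a tool consumes (each entry is "Traditional Simplified [pin1 yin1] /definition 1/definition 2/"). The `parse_line` helper is my own illustration, not part of the released program.

```python
import re

# One CC-CEDICT entry per line: "Trad Simp [pinyin] /def/def/".
# Lines beginning with '#' are comments and do not match.
ENTRY = re.compile(r"^(\S+) (\S+) \[([^\]]+)\] /(.+)/$")

def parse_line(line):
    """Return (traditional, simplified, pinyin, [definitions]) or None."""
    m = ENTRY.match(line.strip())
    if not m:
        return None
    trad, simp, pinyin, defs = m.groups()
    return trad, simp, pinyin, defs.split("/")

print(parse_line("中國 中国 [Zhong1 guo2] /China/Middle Kingdom/"))
```

Parsing the file once and caching the result is what avoids the reload-on-every-request problem of the web version.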
› Continue reading…
Tags:
Software,
vocabulary,
word lists,
words
I have started to analyze some of the results from the word test experiment. While I continue to crank through the data, here is an early result that I found interesting.
What this shows is the estimated known-word count, extrapolated from the sampled words of each test, across more than 200 trials, grouped by self-reported skill level and fit to a Gaussian distribution. The bar height is the mean of the estimated known words, and the error bars are +/-1 standard deviation of the fitted Gaussian. Yes, that's quite a large deviation, and it's not surprising given the many sources of variability in the data: sampling error in individual trials, which use 165 samples out of 36,000 words; self-reported skill in one of 6 categories (beginner to advanced, plus native), with the meaning of the categories undefined; and individual choice of what it means to "know" a word. All this is in addition to the natural spread of skill within any one of the 6 categories. Also keep in mind that +/-1 standard deviation covers only about 68% of the data. So while the graph shows a nice linear progression with advancing skill level, the results are probably fuzzier than they look.
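The extrapolation step can be sketched in a few lines. This is a naive version for illustration; the test's actual estimator may differ, for example by weighting the samples by frequency band.

```python
# Hedged sketch: scale the fraction of sampled words marked "known"
# back up to the full word list. The real estimator may weight samples
# by frequency band; this version deliberately ignores that.
SAMPLE_SIZE = 165     # words shown in one trial
LIST_SIZE = 36_000    # size of the underlying word list

def estimate_known(marked_known):
    """Estimated total known words from one trial's count."""
    return round(marked_known / SAMPLE_SIZE * LIST_SIZE)

print(estimate_known(32))  # roughly 7,000 words
```

Because one trial sees only 165 of 36,000 words, small run-to-run differences in which words are drawn translate into large swings in the estimate, which is exactly the sampling error described above.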
Despite the overlap in known words between skill categories, each level still occupies a distinct range. If you score 7,000 words on the test, you would be similar in knowledge to others at either the lower intermediate or the intermediate level, but you definitely know more than most beginners, and less than most high intermediate or advanced learners. So this chart may be useful if you've gotten a word score from the test and want to find out whether you're at the skill level you thought you were. Of course, there is much more to language skill than the number of isolated words you know. Ultimately, the only person you have to satisfy is yourself.
Tags:
studying,
vocabulary,
words
The Known Chinese Words Test has been running for a month now. During that time I've collected data from 170 trials, from learners with a wide range of levels. The results are encouraging enough that I can give more details about what I have found. What I have been working on is a mathematical model of word knowledge, one that can describe the probability of a particular person knowing any given word using just a few variables. The results from the collected trials validate that hypothetical model, and I'm elated.
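The post does not reveal the model itself, so purely as an illustration of what a "few-variable" model of word knowledge could look like, here is one plausible shape: the probability of knowing a word falls off logistically with the log of its frequency rank. The function, its parameters, and the numbers below are all my own assumptions, not the actual model.

```python
import math

# ILLUSTRATIVE ONLY: one plausible few-parameter model, not the model
# from the post. `midpoint` is the rank at which a learner has a 50%
# chance of knowing a word; `steepness` controls how fast it falls off.
def p_known(rank, midpoint, steepness):
    return 1.0 / (1.0 + math.exp((math.log(rank) - math.log(midpoint)) / steepness))

# A hypothetical learner: common words nearly certain, rare words not.
print(round(p_known(100, 5000, 0.8), 2))
print(round(p_known(50000, 5000, 0.8), 2))
```

Fitting two such parameters per learner to their trial responses is the general kind of procedure a model like this would involve.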
› Continue reading…
Tags:
Linguistics,
vocabulary,
word frequency,
words
There has been a scarcity of posts on the blog lately, as I’ve been working on a web application for the site. This is a page anyone can use to estimate their knowledge of Chinese words. The start page for the test is here.
› Continue reading…
Tags:
Software,
vocabulary,
words
By 2008, I had been studying Chinese off and on for around 3 years. As a self-learner, my study was rather eclectic: Pimsleur, Chinesepod, and random flash card lists were my main methods. I was far from fluent, still struggling to understand all but the simplest news articles, fiction, or blog posts. But I felt like I did know a lot of words, I just didn’t know how many. How much longer before this would start to get easy? So I undertook a self-examination to estimate how many Chinese words I actually knew.
› Continue reading…
Tags:
Linguistics,
vocabulary,
words
In forums for foreign language learners,[1][2] certain questions recur. Some are easy to answer (Do I need to learn tones? Yes). Others (Should I study words or characters?) may have more than one answer. Two common questions I am especially interested in are:
- How many words do I know?
- How many words do I need to know (to read a newspaper, book, etc.)?
The second question is not immediately answerable except under certain conditions: if you can't yet read your target material, then the answer is simply "more!" Of course, people ask the second question because they want to know how close they are to their target fluency level, and the answer to the first question can give a rough estimate of that.
› Continue reading…
Tags:
Linguistics,
vocabulary,
words
Knowing the frequency of the Chinese words you are studying is helpful in a few different ways. If an unknown word is relatively common, it's generally more important to learn than a less common word. With that knowledge in hand, you can feel less guilty about removing rare words from your flashcards, and persist in learning the ones that are common yet difficult. If a word has a low frequency in general but happens to be used a lot in a particular text, that word may be of interest to study. In the early stages of learning, studying the top N (100, 200, 500, etc.) words as flashcards is an effective way to bootstrap one's word knowledge before diving into authentic texts. But it's not 100% effective; the long tail of infrequent words will keep you busy learning new vocabulary for years!
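Once a text has been segmented into words, tallying frequencies is the easy part. Here is a minimal sketch using Python's collections.Counter; the token list is supplied directly because segmentation itself is the hard step for Chinese.

```python
from collections import Counter

# A minimal sketch of tallying word frequencies from pre-segmented
# text. Real Chinese text must first be split into words, the tricky
# part, so the tokens here are supplied directly for illustration.
tokens = ["我", "喜欢", "学习", "中文", "我", "学习", "很", "努力", "我"]
freq = Counter(tokens)

top = freq.most_common(2)
print(top)  # [('我', 3), ('学习', 2)]

# Relative frequency helps decide which flashcards are worth keeping.
rel = freq["学习"] / sum(freq.values())
print(round(rel, 3))  # 0.222
```

Comparing a word's relative frequency in one text against its frequency in a general corpus is what flags the text-specific vocabulary worth studying.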
So, how can we obtain word frequency data? With Chinese, it’s trickier than it sounds. › Continue reading…
Tags:
corpus,
LCMC,
Linguistics,
vocabulary,
word frequency,
words