2012 – zhtoolkit

The Lancaster Corpus of Mandarin Chinese as an SQL Database

Post author:Chad
Post published:October 19, 2012
Post category:Guides Tools

In my Chinese studies, the Lancaster Corpus of Mandarin Chinese (LCMC) has been a useful source of data: word and character frequencies, collocations, phrase usage, parts of speech, etc. The corpus is freely available for non-commercial and research use. However, its data is distributed as a set of XML files, which is not an easy format to work with. The XML is also slow to read, because all of the tags and the entire data structure need to be parsed. A much better format for the data is an SQL database. Once the data is stored in a database, many kinds of queries and reports can be executed efficiently and, depending on the software, return results much faster than processing the XML files would.

I have made available a Perl script and some other related tools to assist with extracting the LCMC files into a SQLite database. SQLite is a lightweight relational database management system intended for portability and ease of use. Because it functions as a standalone program (not client-server), it is easy to install and use. It is also more widespread than you might think: the Firefox and Chrome browsers use it to store their history, cookies, and preferences. It is also used, for example, by the Anki program as the storage format for flashcard data, and by the Calibre e-reader program to store information on installed e-books.
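
To illustrate why a database is convenient for this kind of data, here is a minimal Python sketch (not the Perl script mentioned above) that loads word tokens from XML into SQLite and runs a frequency query. The sample markup, the `tokens` table, and its column names are assumptions made for illustration; the actual LCMC files and the schema produced by the real tools may differ.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for one corpus file.
# The real LCMC markup may use different tags and attributes.
SAMPLE_XML = """
<text id="A">
  <s n="1"><w POS="ns">中国</w><w POS="n">人</w><w POS="v">学习</w><w POS="n">汉语</w></s>
  <s n="2"><w POS="r">我</w><w POS="v">学习</w><w POS="n">汉语</w></s>
</text>
"""


def load_tokens(conn, xml_text):
    """Parse the XML once and store one row per word token."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tokens ("
        "text_id TEXT, sentence INTEGER, position INTEGER, word TEXT, pos TEXT)"
    )
    root = ET.fromstring(xml_text)
    rows = []
    for s in root.iter("s"):
        for i, w in enumerate(s.iter("w"), start=1):
            rows.append((root.get("id"), int(s.get("n")), i, w.text, w.get("POS")))
    conn.executemany("INSERT INTO tokens VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()


def word_frequencies(conn, limit=10):
    """Return (word, count) pairs, most frequent first."""
    return conn.execute(
        "SELECT word, COUNT(*) AS freq FROM tokens "
        "GROUP BY word ORDER BY freq DESC, word LIMIT ?",
        (limit,),
    ).fetchall()


conn = sqlite3.connect(":memory:")  # use a file path to keep the database
load_tokens(conn, SAMPLE_XML)
for word, freq in word_frequencies(conn):
    print(word, freq)
```

The XML is parsed only once, at load time. After that, each new question (frequencies by part of speech, words that follow a given word, and so on) is just another SQL query against the same table.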

(more…)

Mapping the Old HSK Vocabulary onto the New HSK

Post author:Chad
Post published:September 18, 2012
Post category:vocabulary

The HSK is a well-known skill level test used by the PRC to assess language proficiency in Chinese. Even for those who have no interest in taking the HSK test, the lists of Chinese words associated with the test are a convenient source of material for learners to study vocabulary. I have used these word lists myself with great success; it was a quick and effective way to gain a huge amount of usable vocabulary.

In 2010, the HSK exam underwent a major reworking, changing the structure of its skill ranks, increasing emphasis on speaking and writing, and revising its vocabulary. Where the “old” pre-2010 word lists consisted of 8,000+ words across 4 levels, the “new” HSK has 5,000 words distributed into 6 levels. Below is a summary of the word counts in the old and new vocabulary lists, based on actual word lists obtained from various sources (see footnotes for details). Note that these include a small amount of double counting (less than 2%) due to words repeated at more than one level because they have a different pronunciation or a different meaning. Also note that these counts differ slightly from the official word counts reported by Hanban.

Word counts in the old and new HSK word lists
Level	old HSK	new HSK
1	1007	153
2	2001	150
3	2189	300
4	3587	600
5	–	1300
6	–	2513
Total	8784	5016
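
Because studying "up to level N" means covering every list up to and including N, cumulative totals can be more informative than the per-level counts. The short Python sketch below recomputes them from the table above; the variable and function names are only illustrative, and the counts still include the small amount of double counting noted earlier.

```python
# Per-level word counts copied from the table above.
old_hsk = {1: 1007, 2: 2001, 3: 2189, 4: 3587}
new_hsk = {1: 153, 2: 150, 3: 300, 4: 600, 5: 1300, 6: 2513}


def cumulative(counts):
    """Return {level: number of entries in all lists up to that level}."""
    total = 0
    result = {}
    for level in sorted(counts):
        total += counts[level]
        result[level] = total
    return result


old_cum = cumulative(old_hsk)
new_cum = cumulative(new_hsk)
print("old HSK:", old_cum)
print("new HSK:", new_cum)
```

By this arithmetic, old HSK levels 1 to 3 together contain 5,197 entries, slightly more than the 5,016 entries in the whole new HSK. The sums alone say nothing about how many of those words the two sets of lists actually share.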

Since I had invested so much time in studying the old lists (up to level 3), it was natural to wonder whether I should continue studying my existing flashcards or switch to the new HSK lists. How many of the words I had learned were dropped from the new HSK, and does that mean they are unimportant? If I did switch, at what level should I start studying?

(more…)

Hazards of Google Search and Chinese Word Segmentation

Post author:Chad
Post published:August 24, 2012
Post category:Uncategorized

When one searches on Google’s search page, Chinese words and phrases can be strung together without separation, just as in normal writing. What isn’t immediately obvious is that, behind the scenes, Google appears to segment the text of the Chinese pages it crawls into individual words before storing the terms in its database. For example, in Google’s search of web pages, the term 中国 reports over 1 billion hits. With the same term in quotes to indicate an exact phrase, “中国” reports 5 billion hits (a discrepancy that is hard to explain). However, when a space is inserted into the word, the exact phrase “中 国” reports 4.3 million pages, which is about 0.08% of the amount for the single word “中国”. The kinds of pages returned from the space-separated query include matches for: 中國 in traditional script (for unknown reasons); words separated by punctuation, especially “中(国)” and “中。国” (i.e., one sentence ends with 中 and the next sentence starts with 国); and pages where every character is separated, as if the page were encoded or decoded incorrectly. These results suggest that Google treats Chinese searches the same as searches in other languages, by storing pages in its back-end database indexed by the individual words in the page. Storing the terms this way allows Google to quickly return results for a variety of queries, whether the user wants the terms anywhere in the page or as a connected phrase.
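
The behavior described above is what one would expect from a word-level positional index. The toy Python sketch below shows the idea; it is not Google's implementation, and the page names and pre-segmented token lists are made up for illustration. Once 中国 has been stored as a single word, a phrase query for the two separate tokens 中 and 国 no longer matches it.

```python
from collections import defaultdict


def build_index(docs):
    """Map each token to {doc_id: [positions]} (a positional inverted index)."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in docs.items():
        for pos, token in enumerate(tokens):
            index[token][doc_id].append(pos)
    return index


def phrase_search(index, terms):
    """Return ids of documents in which `terms` occur as consecutive tokens."""
    if not terms or terms[0] not in index:
        return set()
    hits = set()
    for doc_id, starts in index[terms[0]].items():
        for start in starts:
            # Every following term must sit at the next position in the same doc.
            if all(
                start + k in index.get(term, {}).get(doc_id, ())
                for k, term in enumerate(terms)
            ):
                hits.add(doc_id)
                break
    return hits


# Hypothetical pages, already segmented into words by the indexer.
docs = {
    "page1": ["中国", "的", "历史"],  # 中国 stored as one word
    "page2": ["中", "国", "人"],      # every character stored separately
    "page3": ["学习", "汉语"],
}
index = build_index(docs)

print(phrase_search(index, ["中国"]))      # query for the single word
print(phrase_search(index, ["中", "国"]))  # query with a space inserted
```

In this sketch the single-word query finds only `page1`, while the space-separated query finds only `page2`, mirroring the small and unusual set of results the space-separated search returns.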

(more…)