In my Chinese studies, the Lancaster Corpus of Mandarin Chinese (LCMC) has been a useful source of data: word and character frequencies, collocations, phrase usage, parts of speech, and so on. The corpus is freely available for non-commercial and research use. However, its native form is a set of XML files, which is not an easy format to work with. Reading the XML data is also slow, because every tag and the entire document structure has to be parsed. A much better format for the data is an SQL database. With the data stored in a database, many kinds of queries and reports can be run efficiently and, depending on the software, return results far faster than anything working directly against the XML.
I have made available a Perl script and some related tools to assist with extracting the LCMC files into a SQLite database. SQLite is a lightweight relational database management system designed for portability and ease of use. Because it is self-contained and serverless rather than client-server, it is easy to install and use. It is also more ubiquitous than you might think: it is how the Firefox and Chrome browsers store their history, cookies, and preferences. It is also used, for example, by the Anki program as the storage format for flashcard data, and by the Calibre e-book manager to store information about the books in its library.
Obtaining the Corpus
The home page for the LCMC is at http://www.lancs.ac.uk/fass/projects/corpus/LCMC/. Of the two distribution links on that page, the Oxford Text Archive is the only one that allows direct downloading. Follow the link to the Oxford Text Archive, browse the archive, and search for Lancaster Corpus of Mandarin Chinese under the “Corpora” tab. The LCMC is ID 2474, and the download link is right on its detail page. After reading the Terms and Conditions, enter your email address to receive a link to the actual download location. Clicking the link in the email message immediately downloads the file, 2474.zip (6.5 MB).
Creating the SQL Database
The scripts can be downloaded as a zip file from here, or as individual files from the archive. You will need Perl installed on your machine, along with a few required modules that may have to be installed separately. Full instructions for the conversion can be found in the README file.
If the data conversion was successful, you will now have a 297 MB file called lcmc.db3 in the current directory.
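Before moving on, it is worth a quick sanity check that the conversion produced the expected tables. Any SQLite client (or the sqlite3 command-line shell) can list them by querying SQLite's built-in sqlite_master catalog:

SELECT name
FROM sqlite_master
WHERE type = 'table'     -- the catalog also lists indexes, views, and triggers
ORDER BY name;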
As for client access to the newly created SQLite database, you will need software that can handle Unicode so that the Chinese characters display correctly. For Windows, SQLiteSpy is my personal choice: queries are fast, the interface is intuitive, and it has built-in support for Unicode and regular expressions. For Linux, both Sqliteman and SQLite Manager (a Firefox plug-in) are adequate. One common Linux package, SQLite Database Browser, is not suitable, because the LCMC database is large and the program tries to load it all into memory at once.
Some Sample Queries
In a relational database, information is organized into separate tables of data linked by unique keys for each record. This arrangement makes many kinds of queries efficient. Below are just a few examples of the kinds of reports that can be done.
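If you want to see how a given table is laid out before writing queries against it, SQLite's PRAGMA statements show the columns, primary keys, and indexes. For example, for the words table used throughout the queries below:

PRAGMA table_info(words);    -- column names, types, and primary-key flags
PRAGMA index_list(words);    -- indexes available for speeding up joins and lookups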
Raw frequency counts of all Chinese words in the corpus
SELECT characters, COUNT('x')
FROM words
WHERE token_type = 'w'
AND is_cjk = 'Y'
GROUP BY characters
ORDER BY COUNT('x') DESC, characters ASC;
characters | COUNT('x') |
---|---|
的 | 51141 |
了 | 12787 |
是 | 11601 |
在 | 10420 |
一 | 8380 |
和 | 7297 |
他 | 5898 |
不 | 5687 |
我 | 5576 |
… | … |
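A small variation on the query above converts the raw counts into a share of all CJK word tokens and trims the output with a LIMIT; it relies only on the words table and the token_type and is_cjk columns already used:

SELECT characters,
       COUNT('x') AS freq,
       ROUND(100.0 * COUNT('x') /
             (SELECT COUNT('x') FROM words WHERE token_type = 'w' AND is_cjk = 'Y'),
             2) AS pct                     -- share of all CJK word tokens, in percent
FROM words
WHERE token_type = 'w'
  AND is_cjk = 'Y'
GROUP BY characters
ORDER BY freq DESC
LIMIT 20;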
List all words tagged as prepositions
SELECT W.characters, P.id, P.description, COUNT(*)
FROM words W
JOIN pos P ON W.part_of_speech = P.id
WHERE P.description = 'preposition'
GROUP BY W.characters, P.id, P.description
ORDER BY COUNT(*) DESC;
characters | id | description | COUNT(*) |
---|---|---|---|
在 | p | preposition | 9899 |
对 | p | preposition | 3155 |
把 | p | preposition | 1989 |
从 | p | preposition | 1782 |
为 | p | preposition | 1702 |
与 | p | preposition | 1432 |
以 | p | preposition | 1355 |
被 | p | preposition | 1245 |
… | … | … | … |
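The same join can be turned around to survey the whole tagset: how many distinct words and how many tokens fall under each part-of-speech description. It uses only the words and pos tables shown above:

SELECT P.id, P.description,
       COUNT(DISTINCT W.characters) AS distinct_words,   -- unique word forms with this tag
       COUNT(*) AS tokens                                -- total occurrences in the corpus
FROM words W
JOIN pos P ON W.part_of_speech = P.id
GROUP BY P.id, P.description
ORDER BY tokens DESC;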
List the average sentence length per text category
SELECT T.id, T.type, ROUND(AVG(LENGTH(F.characters)), 1) AS avg_sentence_length
FROM full_sentences F
JOIN texts T ON F.text_id = T.id
GROUP BY T.id, T.type;
id | type | avg_sentence_length |
---|---|---|
A | Press reportage | 34.0 |
B | Press editorial | 36.6 |
C | Press review | 42.5 |
D | Religion | 41.2 |
E | Skills, trades and hobbies | 32.5 |
F | Popular lore | 33.0 |
G | Biographies and essays | 31.0 |
H | Miscellaneous (reports, official documents) | 42.3 |
J | Science (academic prose) | 40.3 |
K | General fiction | 24.3 |
L | Mystery and detective fiction | 28.4 |
M | Science fiction | 33.4 |
N | Martial art fiction | 28.6 |
P | Romantic fiction | 25.8 |
R | Humour | 20.8 |
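If the ranking is what matters, the same query can simply be ordered by the computed average rather than by category id:

SELECT T.id, T.type, ROUND(AVG(LENGTH(F.characters)), 1) AS avg_sentence_length
FROM full_sentences F
JOIN texts T ON F.text_id = T.id
GROUP BY T.id, T.type
ORDER BY avg_sentence_length DESC;   -- longest-sentence categories first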
List usage of words containing the character 给
SELECT W.characters, P.id, P.description, COUNT(*)
FROM words W
JOIN pos P ON W.part_of_speech = P.id
WHERE W.characters LIKE '%给%'
GROUP BY W.characters, P.id, P.description
ORDER BY COUNT(*) DESC;
characters | id | description | COUNT(*) |
---|---|---|---|
给 | p | preposition | 985 |
给 | v | verb | 149 |
给予 | v | verb | 92 |
交给 | v | verb | 54 |
送给 | v | verb | 37 |
递给 | v | verb | 19 |
供给 | vn | verb with nominal function | 16 |
发给 | v | verb | 15 |
… | … | … |
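The LIKE pattern is easy to adjust; anchoring it at the start, for example, restricts the report to compounds that begin with 给 rather than merely containing it:

SELECT W.characters, P.id, P.description, COUNT(*)
FROM words W
JOIN pos P ON W.part_of_speech = P.id
WHERE W.characters LIKE '给%'        -- begins with 给
GROUP BY W.characters, P.id, P.description
ORDER BY COUNT(*) DESC;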
All words and corresponding pinyin
SELECT W.file_id, W.sentence_id, W.word_num, W.characters, P.characters AS pinyin
FROM words W
LEFT JOIN pinyin_words P USING (file_id, sentence_id, word_num);
file_id | sentence_id | word_num | characters | pinyin |
---|---|---|---|---|
A01 | 0001 | 1 | 大 | da4 |
A01 | 0001 | 2 | 墙 | qiang2 |
A01 | 0001 | 3 | 内外 | nei4wai4 |
A01 | 0001 | 4 | -- | -- |
A01 | 0001 | 5 | 北京市 | bei3jing1shi4 |
A01 | 0001 | 6 | 监狱 | jian1yu4 |
A01 | 0001 | 7 | 纪实 | ji4shi2 |
A01 | 0001 | 8 | ( | ( |
A01 | 0001 | 9 | 三 | san1 |
A01 | 0001 | 10 | ) | ) |
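The same join also works as a quick pinyin lookup for a single word; for instance, filtering on one of the words from the output above:

SELECT DISTINCT W.characters, P.characters AS pinyin
FROM words W
JOIN pinyin_words P USING (file_id, sentence_id, word_num)
WHERE W.characters = '监狱';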
Pinyin usage of the character 乐
SELECT W.character, P.character AS pinyin, COUNT(*)
FROM characters W
JOIN pinyin_characters P USING (file_id, sentence_id, word_num, char_num)
WHERE W.character = '乐'
GROUP BY W.character, P.character;
character | pinyin | COUNT(*) |
---|---|---|
乐 | le4 | 270 |
乐 | yue4 | 104 |
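The same pattern generalizes to every character with more than one reading in the corpus; dropping the WHERE clause and adding a HAVING filter produces the full list (a sketch, using only the characters and pinyin_characters tables from the query above):

SELECT W.character,
       COUNT(DISTINCT P.character) AS readings,      -- number of distinct pinyin readings
       COUNT(*) AS occurrences
FROM characters W
JOIN pinyin_characters P USING (file_id, sentence_id, word_num, char_num)
GROUP BY W.character
HAVING COUNT(DISTINCT P.character) > 1
ORDER BY readings DESC, occurrences DESC;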
These are just a few examples of what you can do with the corpus data in the SQL database. Have fun!
http://www.zhtoolkit.com/apps/small_utilities/lcmc_xml_to_sqlite.zip (69Kb)