Optimizing Chinese Character Learning

Written by jordan.shapiro | Published 2016/12/29
Tech Story Tags: chinese | language | data-science | education | language-learning

An Easier and Faster Way to Learn Chinese

About a week ago, a friend asked me a question that would end up consuming me for many late nights:

“I’ve been studying Chinese for a few months now and I’ve learned all of these characters to make basic words. I know a lot of them can be rearranged and put in different orders to make other words that I would be able to read, write, and speak more easily, but I don’t know what those words would be…How can I figure out what words I should already be able to form and if it’s worthwhile to learn them?”

The question intrigued me. Having studied Chinese for 5 years, I’ve often wondered the same thing. But this time, the question resonated differently with me. I started to think not just about what characters my friend could learn, but also how she should learn if she wanted to grasp Chinese as quickly as possible.

With this question in mind, I embarked on a journey to determine the quickest order in which one could learn Chinese characters.

Background

(Skip this section if you’ve already studied Chinese or know how the language works.)

The Chinese writing system is classified as a syllable-based logography, a writing system where each syllable is represented by a graphical character. Some of these characters represent full words on their own (just like the monosyllabic “I” or “me” in English). In other cases, stringing multiple characters together creates a full word (think of the polysyllabic “iodine” or “meander” in English).

Basics of Chinese Characters. Chinese words can be made from one character or from several combined characters.

In alphabet-based languages, there is a direct link between reading and pronunciation, so students only need to memorize a word’s pronunciation or spelling to be able to fully use that word. In Chinese, however, there is no direct link between character and pronunciation (see Assumption 3 below), so students need to memorize a word’s pronunciation (romanized with the Latin alphabet in a system called pinyin) and character-based “spelling” in order to be able to fully read, write, speak, and understand that word.

For example, I can show you the Spanish word for house, casa, and you immediately have an idea of how to pronounce it. You can study this word with a two-sided flashcard, with the Spanish on one side and the English on the other. If I show you the Chinese character for house, 家, you have no clear signal of how the character should be pronounced (jiā). In order to study a Chinese word, you would need a three-sided flashcard, with the characters on one side, the pinyin pronunciation on another, and the English meaning on a third.

This complexity is one of the reasons that studying Chinese is so hard (see the US State Department’s classification). Since learning each character takes significant effort in Chinese, it is critically important to determine which characters will give students the most value for their effort.

The Question

So how can we order the characters that students memorize so that they learn the most Chinese as quickly as possible?

Our first instinct might be to teach students words based on how frequently those words are used and to require students to memorize the characters in those words. This is a relatively standard approach to language that is intuitive for non-character-based languages. It’s also the likely basis of most Chinese courses, since it makes sense to teach students the words that they’ll encounter most frequently.

But in a character-based language like Chinese, the standard approach might not be optimal. It is entirely possible that the most common words in Chinese contain uncommon characters or, rather, that students can gain more command of the language without learning as many characters so long as they make maximum use of those they already know. These “low hanging fruit” words, formed from characters a student already knows, are the premise of my friend’s question and this exploration.
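My friend’s question can be framed as a small filtering problem: given the characters you already know, which words in a frequency list are made up entirely of those characters? Here is a minimal Python sketch of that idea. The function name `formable_words` and the toy frequencies are illustrative assumptions, not values from any real lexicon.

```python
def formable_words(known_chars, lexicon):
    """Return (word, frequency) pairs whose characters are all known, most frequent first."""
    known = set(known_chars)
    hits = [(word, freq) for word, freq in lexicon.items() if set(word) <= known]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy lexicon: word -> made-up relative frequency (illustrative only).
toy_lexicon = {"好": 50, "看": 30, "好看": 4, "看好": 1, "和": 45, "我们": 20}

# A student who knows only 好 and 看 can already form four of these words.
print(formable_words("好看", toy_lexicon))
```

The subset check (`set(word) <= known`) is the whole trick: a word counts as low hanging fruit as soon as none of its characters is new.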

Learning Options

To test the idea of focusing on these “low hanging fruit” (LHF) words, and to see how prioritizing them would increase learning efficiency, we can consider three different learning methods:

**Method 1: Standard** As stated above, students should learn Chinese words in the order of their usage frequency. By extension, students should study characters in the order in which they appear in these frequency-arranged words. This method prioritizes everyday usage and ease of communication.
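As a rough illustration of Method 1 (not the code used for the simulations), the ordering can be sketched in a few lines of Python. The name `standard_order` and the toy frequencies are assumptions made up for this example.

```python
def standard_order(lexicon):
    """Method 1 sketch: words by descending frequency, characters by first appearance."""
    words = sorted(lexicon, key=lexicon.get, reverse=True)
    chars, seen = [], set()
    for word in words:
        for ch in word:
            if ch not in seen:
                seen.add(ch)
                chars.append(ch)
    return words, chars


# Toy lexicon: word -> made-up relative frequency (illustrative only).
toy_lexicon = {"的": 100, "我": 80, "我们": 40, "好": 30, "好的": 10}

words, chars = standard_order(toy_lexicon)
print(words)
print(chars)
```

Note that the character order falls out of the word order here; characters are never chosen for their own sake.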

**Method 2: LHF Words** Students should study the characters that will give them the most mastery over the Chinese language as measured by the words they can form with those characters. For each character a student is about to learn, we take into consideration all of the words that the student can make with that character and the characters they already know, as opposed to just focusing on the most common word. This method prioritizes efficiency in learning characters.
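One possible reading of Method 2 is a greedy loop: at each step, pick the unknown character that completes the largest total frequency of words, then learn every word that character unlocks. The Python sketch below follows that reading. It is an assumption about how such a simulation could work, not the actual project code, and `lhf_order`, the fallback rule, and the toy frequencies are all hypothetical.

```python
def lhf_order(lexicon):
    """Greedy sketch: repeatedly learn the character that unlocks the most word frequency."""
    known, char_order, word_order = set(), [], []
    remaining = dict(lexicon)
    while remaining:
        # Score each unknown character by the frequency of the words it would complete.
        gains = {}
        for word, freq in remaining.items():
            missing = set(word) - known
            if len(missing) == 1:
                ch = next(iter(missing))
                gains[ch] = gains.get(ch, 0) + freq
        if gains:
            best = max(gains, key=gains.get)
        else:
            # Assumed fallback: no single character completes a word,
            # so start on the most frequent remaining word.
            top = max(remaining, key=remaining.get)
            best = next(ch for ch in top if ch not in known)
        known.add(best)
        char_order.append(best)
        # Learn every word that is now formable, most frequent first.
        unlocked = sorted(
            (w for w in remaining if set(w) <= known),
            key=remaining.get,
            reverse=True,
        )
        for w in unlocked:
            word_order.append(w)
            del remaining[w]
    return char_order, word_order


# Toy lexicon: word -> made-up relative frequency (illustrative only).
toy_lexicon = {"的": 100, "好": 50, "看": 30, "和": 32, "好看": 4}

char_order, word_order = lhf_order(toy_lexicon)
print(char_order)
print(word_order)
```

In this toy run, 看 is learned before the more frequent 和 because 看 also unlocks 好看, so the less common word 好看 ends up ahead of 和 in the word order.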

**Method 3: Combined Approach of Standard with LHF Words** Students should learn Chinese in the order of the most frequent words, but when they learn characters that they can use to form other words (LHF), they should learn those LHF words before attempting to learn another character. This combined approach uses principles from both Method 1 and Method 2. It prioritizes everyday usage and ease of communication while also making efficient use of the characters that a student has already learned.
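Method 3 can be sketched as a frequency-ordered walk with an LHF sweep after each new word. The Python below is a minimal illustration under that assumption, not the project’s actual code. The name `combined_order` and the toy frequencies are made up, and the quadratic inner loop is chosen for clarity rather than speed.

```python
def combined_order(lexicon):
    """Method 3 sketch: follow word frequency, sweeping up formable words after each step."""
    by_freq = sorted(lexicon, key=lexicon.get, reverse=True)
    known, learned = set(), set()
    char_order, word_order = [], []
    for word in by_freq:
        if word in learned:
            continue  # already picked up as low hanging fruit
        for ch in word:
            if ch not in known:
                known.add(ch)
                char_order.append(ch)
        # Learn the target word and every other word now formable, most frequent first.
        for w in by_freq:
            if w not in learned and set(w) <= known:
                learned.add(w)
                word_order.append(w)
    return char_order, word_order


# Toy lexicon: word -> made-up relative frequency (illustrative only).
toy_lexicon = {"的": 100, "好看": 60, "好": 50, "和": 45, "看": 30}

char_order, word_order = combined_order(toy_lexicon)
print(char_order)
print(word_order)
```

Here the student reaches 看 before 和, even though 和 is more frequent, because learning 好看 already paid for the character 看.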

Application of Ordering Methods. Here, we see a visual representation of a set of Chinese words and their frequencies (f). Colored blocks correspond to Chinese characters, with blocks of the same color representing the same character. Method 1 optimizes based on word frequency. Method 2 optimizes based on the best path to find LHF words. Method 3 finds LHF while going through the most frequent words. For each method, I show the characters learned at every step (cl) and the overall mastery that a student would have gained (m). Note how cl and m vary for each method.

Process

I decided to put these methods to the test to determine how each would alter a student’s ability to learn Chinese. Given the time it takes to absorb a useful amount of Mandarin, rather than teaching a group of students, I instead opted to teach my computer Mandarin via a simulation. Here’s how:

  • I first downloaded a list of Chinese words and their frequencies (also known as a lexicon). There are numerous lexicons of this nature available online for most languages, and each has unique qualities (definitions of what is or isn’t a word, sources of vocabulary, ways of measuring frequency, etc.). In this case, I used Qing Cai and Mark Brysbaert’s lexicon derived from a corpus of 6,243 Chinese movies and TV series.
  • I wrote several Python programs to “teach” my computer Chinese according to each of the three methods as if it were a student with no prior experience. For each method, I tracked the order in which the student should learn Chinese words and how much Chinese (by percentage) that student would be able to speak after each word. I also tracked the order in which the student should learn characters and how much Chinese the student would be able to read/write based on how many characters they learned. Feel free to reach out to me if you’d like to access the code I wrote for this project.
  • To save computing time, I ran these programs for the first 5000 words in the Cai and Brysbaert lexicon rather than all 99,121 words. These first 5000 words cover 93.3% of the Chinese language as per the lexicon and are comparable to the 5000 words expected of Chinese speakers who pass the highest level of the Hanyu Shuiping Kaoshi Chinese fluency test (HSK Level 6).
  • Given the results of the simulations, I compared the three methods to see how each would affect a student’s language mastery.
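To make figures like “the first 5000 words cover 93.3%” concrete, here is a small Python sketch of how coverage and distinct-character counts can be computed from any word-frequency table. The function name `top_n_summary` and the toy numbers are assumptions for illustration; the real lexicon’s file format is not reproduced here.

```python
def top_n_summary(lexicon, n):
    """Return (share of total frequency, distinct character count) for the n most frequent words."""
    ranked = sorted(lexicon.items(), key=lambda pair: pair[1], reverse=True)
    top = ranked[:n]
    total = sum(lexicon.values())
    coverage = sum(freq for _, freq in top) / total
    distinct_chars = {ch for word, _ in top for ch in word}
    return coverage, len(distinct_chars)


# Toy lexicon: word -> made-up relative frequency (illustrative only).
toy_lexicon = {"的": 50, "好": 20, "和": 15, "看": 10, "好看": 4, "电影": 1}

coverage, n_chars = top_n_summary(toy_lexicon, 5)
print(f"{coverage:.1%} of usage, {n_chars} characters")
```

Coverage here is simply the summed frequency of the chosen words divided by the summed frequency of the whole lexicon, which is also how “mastery” is measured in the rest of this article.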

Assumptions

My process required me to make the following assumptions:

  1. The student is attempting to learn Chinese for oral and written communication, rather than just oral communication. In other words, the student wants to learn how to read and write in addition to speaking and listening, meaning that the student must learn Chinese characters.
  2. In this article, I refer to simplified characters for Mandarin Chinese. One could theoretically apply the same process to traditional characters or different dialects of Chinese that use characters/words with different frequencies. Downloading a different word lexicon would account for this change.
  3. A student must study a character in order to know how it is pronounced and what it means. While students comfortable with Chinese are occasionally able to guess a character’s pronunciation or meaning based on its appearance, its similarity to other learned characters, and its radicals, this effect is never exact and accounting for it in code is nontrivial. Additionally, while this technique is sometimes useful in reading, it is much more difficult when writing.
  4. There exist some small edge cases in my code that, while avoidable, have minimal impact on the overall results. For example, the way my simulation for Method 2 works means that LHF words corresponding to a recently learned character may not appear in frequency order.
  5. In this exploration, I consider all Chinese characters equally difficult to learn. Other studies (such as this 2016 work from Loach and Wang or this 2013 paper from Yan et al.) consider the complexity of specific characters and go so far as to require the student to learn simple component characters first before learning characters in which those components appear.
  6. In rare instances, certain characters appear to have multiple pronunciations. Accounting for this in code is also nontrivial and, as such, we will assume that when a student learns a character, they learn all of its pronunciations. This is even more realistic in Methods 2 and 3, since it is more likely that a character with multiple pronunciations would be studied in sequence with other words containing that character.
  7. For the sake of these simulations, we disregard the basic path dependency of a student’s first lessons while learning a language. For example, all three methods suggest learning the word 的 (de, of) first, but it would be unnatural for “of” to be the first word that students learn since it cannot be incorporated into a simple sentence. Given this path dependency, we should take the exact word/character orders proposed by the simulations with a grain of salt, particularly within the first few characters.

Results

After running all three methods, I compared how each would affect a student’s mastery of Chinese.

To start, we can consider what percentage mastery a student will have after each word learned. Even before running our explorative simulations, we would have been able to intuit that the standard approach would be the most effective on a per-word basis, since that is precisely what it optimizes for (learn the next most frequent word at any given time). What might have been less obvious is how dramatically Methods 2 and 3 would influence a student’s mastery on a per-word basis. Let’s take a look at the data:

Methods 2 and 3 have students learn all words that they can make with their current character set before learning any new characters. Some of the words that they can make will be common, high-frequency words, whereas others may be more obscure LHF words. For example, Method 2 instructs the student to learn the word 好看 (hǎokàn, attractive) for 0.38% mastery before learning the word 和 (hé, and) for 4.48% mastery simply because the student already knows the characters 好 (hǎo, good) and 看 (kàn, look). We should note that Method 3 works somewhat better than Method 2 in covering common words (about 2.3% better at around 1172 words), given that it prioritizes the most frequent words by default. Still, both Methods 2 and 3 fall short of Method 1 here, with a gap as wide as 11.6% at 59 words. Lastly, it is important to note that in both the per-word and per-character results, all 3 methods converge at the end of the simulation, since they must all end up teaching the student the same 5000 words and 2067 characters regardless of order.
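The per-word comparison boils down to a running total of word frequencies along each learning order. A minimal Python sketch, assuming mastery is the share of total lexicon frequency covered by the words learned so far (the name `mastery_per_word` and the toy numbers are hypothetical):

```python
from itertools import accumulate


def mastery_per_word(word_order, lexicon):
    """Cumulative share of total frequency after each word in the given order."""
    total = sum(lexicon.values())
    running = accumulate(lexicon[word] for word in word_order)
    return [covered / total for covered in running]


# Toy lexicon: word -> made-up relative frequency (illustrative only).
toy_lexicon = {"的": 50, "好": 20, "和": 12, "看": 10, "好看": 8}

frequency_first = ["的", "好", "和", "看", "好看"]  # a Method 1 style order
lhf_first = ["的", "好", "看", "好看", "和"]        # an LHF style order

print(mastery_per_word(frequency_first, toy_lexicon))
print(mastery_per_word(lhf_first, toy_lexicon))
```

Both curves start and end at the same values, but the frequency-first order is never behind in between, which mirrors the convergence and the per-word gap described above.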

At first, we might view this as a weakness of Methods 2 and 3, particularly if our goal is to be able to communicate with the most frequent words as quickly as possible (as is often the case in Mandarin conversational classes which don’t include reading and writing). But this is actually the entire point of our optimization, which is based on the idea that learning Chinese characters takes significantly more effort than learning only pinyin. (In another world, it might be more convenient if all of Mandarin were written in pinyin and we could revert to a two-sided flashcard way of learning, but, alas, it would be somewhat sacrilegious to roll back the 3,000+ years of history backing the character system, its beauty, and its ability to keep written Chinese consistent over millennia.) Instead, it is much more informative to look at the per-character results of each method:

Looking at the per-character results tells a much more informative and interesting story about how these learning methods perform.

As we would expect, Methods 2 and 3 behave much better than Method 1 on this basis since they both account for LHF and make the most of the characters that a learner has mastered. That being said, two important conclusions arise from the results.

First, it is readily apparent that Methods 2 and 3 offer a significant advantage over Method 1 on a per-character basis. Once a student has learned 491 characters using Method 2, they are able to access a whole 5.2% more Chinese than a student who has learned the same number of characters using Method 1. That is to say, this student can read 5.2% more Chinese than a peer just by optimizing which characters to learn.

Secondly, we note that Methods 2 and 3 are neck and neck throughout the per-character results, even more so than in the per-word results. Their largest gap in mastery on a per-character basis is 1.7% at 25 characters, as compared to a gap of 2.3% at 1172 words in the per-word data. Functionally, this is because both methods search for LHF and increase their per-character mastery, but Method 3 does so while also prioritizing per-word mastery. Essentially, Method 3 is a happy medium between Methods 1 and 2, though it behaves more similarly overall to Method 2.
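The per-character view can be derived from the same word orders by recording mastery against the number of distinct characters known instead of the number of words. A minimal Python sketch under that assumption (the name `mastery_by_characters` and the toy frequencies are hypothetical, not the project’s code or data):

```python
def mastery_by_characters(word_order, lexicon):
    """Map characters known -> share of total frequency covered by the words learned so far."""
    total = sum(lexicon.values())
    known, covered, curve = set(), 0, {}
    for word in word_order:
        known.update(word)          # add any new characters in this word
        covered += lexicon[word]
        # Later words at the same character count overwrite earlier ones,
        # so each entry holds the best mastery reached with that many characters.
        curve[len(known)] = covered / total
    return curve


# Toy lexicon: word -> made-up relative frequency (illustrative only).
toy_lexicon = {"的": 50, "好": 20, "和": 12, "看": 10, "好看": 8}

frequency_first = ["的", "好", "和", "看", "好看"]  # a Method 1 style order
lhf_first = ["的", "好", "看", "好看", "和"]        # an LHF style order

print(mastery_by_characters(frequency_first, toy_lexicon))
print(mastery_by_characters(lhf_first, toy_lexicon))
```

With three characters known, the LHF-style order covers more of the toy lexicon than the frequency-first order, which is the same kind of per-character advantage the results above describe.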

In Practice

So what does this mean to an actual Chinese language learner?

The results of this study are quite conclusive. If you learn Chinese characters and simultaneously study all of the Low Hanging Fruit words associated with them, you can more quickly gain mastery of Chinese reading and writing (since knowing characters is not as necessary for speaking). However, doing so comes at the cost of learning the most common words first, since you would end up learning some LHF words that are not as useful in everyday life. Essentially:

If you want to learn to read and write Chinese, use Method 2 or 3 to study.

If you only want to learn conversational Chinese, use Method 1.

Get Started!

Use the Quizlet Flashcards I’ve generated to start learning Chinese characters more efficiently!

Ready to get started? Here are ordered Quizlet flashcards (password: “Medium”) I’ve created for Method 2 (Deck 1, Deck 2, Deck 3) and Method 3 (Deck 1, Deck 2, and Deck 3). (Note that translations for these decks were created by Google Translate and are not my own.) For Method 1, you can find a frequency-based dictionary here or the original Cai and Brysbaert lexicon here.

Discussion

I’m excited about the potential for this work to change how learners access written and spoken Chinese, but there is always more work to be done. Here are some ideas on how the above can be improved upon in the future:

  • Different Lexicons: For this exploration, I chose to base characters and word frequencies on Cai and Brysbaert’s list. One could easily envision running the same simulations with other lexicons derived from TV shows, books, newspapers, or all of the above. Each will have a different effect on the results, and the right choice depends on the student’s learning goals.
  • Different Learning Tactics: This exploration presumes that character learning is the most difficult aspect of Chinese language learning. Other learning methods could be tested to examine alternative learning styles or mastery metrics.
  • Different Assumptions: Future renditions might change my assumptions to add in nuance to the methods (for example, examining character difficulty and/or path-dependent character components).
  • Comparison to HSK and Textbooks: An extension of this work might look at the Hanyu Shuiping Kaoshi or Chinese textbooks to see how much they vary from optimized learning paths. This information could be used to improve the vocabulary selection for those resources. (Already, we could guess that the HSK is not optimized for the most frequent words given that 2663 characters are represented in the 5000 tested, as opposed to the 2067 suggested by the Cai and Brysbaert lexicon.)
  • More Computation: I’ve cut off my computation at 5000 words for the sake of computational convenience, but it is easy to change the variables of my methods to pull in more words and characters. Doing so might change some of the ordering in how words are optimized, since LHF beyond 5000 words were not taken into account in the above simulations.
  • Academic Discussion: There certainly exists academic discourse on Chinese language learning and optimized learning orders (including the PLOS papers linked above). This exploration could contribute to those academic discussions or serve as a practical example of implementing an optimized learning style via flashcards.

Thoughts on what might be the best next step for this exploration? Find the flashcards or thought process particularly useful? Comment below and be sure to recommend/share this post with others! Find other posts from me here and follow me for future updates!


Published by HackerNoon on 2016/12/29