Tomi Mester

@datalab

Learning languages very quickly — with the help of some very basic Data Science

I moved to Sweden 6 months ago with my girlfriend. It is a great country for expats like us, because almost everyone speaks really good English here. Even so, we would like to learn some Swedish, just to understand a bit more in daily conversation and about Swedish culture.

For me this is gonna be a hobby language, so I don’t want to put too much effort into it. Let’s say, I’d like to spend max 10–20 hours and my target is to understand 80% of daily chit-chat. Is it doable? We will see! Here’s my approach — backed with some beginner-level data science.

The concept

What I’ll present here is a pretty unorthodox method, I guess. But sometimes you need to be creative if you want things to be easier, right?

What I did:

  1. I took two popular TV shows, Friends and How I Met Your Mother.
  2. I got the Swedish subtitles of these TV shows — for all seasons, all episodes.
  3. I analyzed the most common words and sorted the words by number of occurrences.
  4. I made a cut at the top 1000 words.

The question is, how far can I go, if I learn only the top 1000 words of a language. Or being more pragmatic: What percentage of Swedish conversations are covered by the top 1000 words of Swedish language? I set up a short data script to find out — and by the end of this article I’ll share it with you as well! But first, I’ll show you how I got my answer to this question. (Note: my method works with every European language.)

(UPDATE: since I published this article, I got some nice tweets on Twitter from folks who had used this method before me! So it’s not so unorthodox after all! :-))

The code

The first step is to get the subtitles. It can be any and as many movies as you want. I picked Friends and HIMYM, because I have already watched both series in English before and I liked the vocabulary. But if you want to speak more like Rocky, get the subtitles of the Rocky movies. (An important note about subtitles! As far as I know, using subtitles for non-profit research projects like this one is not illegal. But to stay on the safe side, I invite you to do the same that I did: ask a native speaker friend to listen and type in the whole script for you instead of downloading. ;-))

Either way you go — once you have the subtitles, you will need a tool to do the text analysis part. I’ve described in this article how to get, install and start with data coding tools like Python, R, SQL and bash. You could use any of those for this project.

I chose the simplest one: bash. It’s so simple that the whole analysis actually needs only one line of code. On the screenshot below I broke it down to 14 lines just to make it more readable.

So here’s my exact bash code:

Github link for copy-pasting: here.

Line by line

What does the code do? In general, the first 9 lines will be about data cleaning and the rest 5 will do the actual analysis. Here it is line by line:

  • Line 1: Read-in all the subtitle files that I put previously in my “friends” and “himym” folders on my data server. From now on you can imagine all the subtitles as one big temporary text file. Every further modification will be performed on this “one big temporary text file”. Here’s a small sample of that:
  • Line 2: Remove all the lines that start with a number. (e.g. the timestamps)
  • Line 3: Replace all the question marks, exclamation marks, dots and pipes with spaces.
  • Line 4: Replace all the multiple spaces with single spaces.
  • Line 5: Remove everything but alphabetical characters, apostrophes and spaces.
  • Line 6: Remove all the unnecessary spaces from the beginning of the lines.
  • Line 7: Turn every space into a line-break (means we will have one word per line in our file).
  • Line 8: Remove all the empty lines.
  • Line 9: Turn every uppercase character into lowercase characters.

Now we have a clean list with all the words that were used in the TV shows. Every word appears as many times as it was used. Here’s a sample:

  • Line 10: Sort the words into alphabetical order.
  • Line 11: Make a list with every unique word and print the number of occurrences next to them.
  • Line 12: Order this word-list by the number of occurrences. (The most occurrences at the top of the list.)
  • Line 13: Remove spaces from the beginning of the lines.
  • Line 14: Print the top 1000 lines (top 1000 used words) into a file called 1000.csv.

Done! Don’t worry if you don’t get the whole script immediately. Even if it’s a pretty beginner-level data script, you still need some initial knowledge about bash to fully understand it. I’ve published 4 articles about the very basics of data coding in bash on my blog. You can read them here if you want to learn more:

Learn Data Analytics in Bash — from scratch (and here is: ep#1)

The results

We already have the top 1000 unique words that we were looking for. Nice! The only question left is: how much will I understand if I learn only these 1000 words? To answer it, I need to have two numbers:

  1. I need to count the total number of words (multiple used words counted multiple times).
  2. I need to count how many times the top 1000 unique words occur in the subtitles.

Comparing these two numbers will give back the usage ratio of the TOP 1000 words.

Back to my code! To answer 1) I have to change Line 14 to this code:

awk ‘{sum += $1} END {print sum}’

It will give you the total word count: 1 131 360

Then I’ll do the same with my freshly created 1000.csv file…

awk ‘{sum += $1} END {print sum}’ 1000.csv

I’ll get the total occurrences of the top 1000 words: 964 682

This is just amazing! 964682/1131360= 85.26%

Yes, it means if I learn the top 1000 words from Friends and How I Met Your Mother, I will understand 85% of the words in the script!

Reality check! I shot this picture at the IKEA on the weekend. All the words of the headline are understandable after learning only the top 200 words.

Another interesting stat! This chart below shows how many percent (Y) you cover by learning the TOP (X) words.

The result: a beautiful logarithmic function on a line chart. (This is not a trend or a smoothed line, this is the actual result!) After all this means that even if you invest the same amount of time to learn the first 1000 words as the second 1000 words, the first 1000 words will help you to understand 85%, the second 1000 only 5%. And so on.

The next steps

Okay, I have to admit, vocabulary is not everything. So my next steps will be:

  1. Learn the top 1000 Swedish words.
  2. At the same time do the Swedish Duolingo course to get some grammar knowledge. (Note that Swedish looks pretty similar to German and English, so I guess I won’t have serious trouble learning the grammar part.)
  3. Watch some movies with Swedish dub+subtitles, that I’ve already watched in English or Hungarian (which is my native language). This is important to match the pronunciation with the written format. And of course to learn some special expressions.

And then… we will see!

Don’t forget: my goal was to understand, not to fluently speak the language.

Disclaimers

To be frank I would not even call this project a “data science” project — as there is nothing very scientific in it. It’s just common sense and some text mining with basic data coding. And before NLP (Natural Language Processing) professionals start stoning me — I have to mention two known issues:

  1. This whole little language learning experiment of mine is based on a fair, but unproven assumption. I presumed that the vocabulary of these TV shows is pretty similar to the vocabulary of the daily discussions. Maybe this is just not true. Maybe. But it looks so logical that I’ve just decided to take this minimal risk anyway.
  2. The bash script I presented above is not 100% perfect. E.g. I didn’t implement any stemming algorithm to reduce words like “cooking,” “cooked” or “cooks” to the root format: “cook.” My estimation is that these kind of issues cause a top 5% inaccuracy in my results. And then I just asked myself: What’s the ROI (return on investment) here? Is that 5% accuracy really worth the extra effort? The answer was: definitely not.
    I was looking for a quick and dirty method to learn a language, so it’s kinda self-explanatory that the script I wrote for that should be quick and dirty as well. If I had spent days (and not minutes) optimizing my data script to 99.9% (and not to 95%), the whole project would have lost sight of the main objective: be time-efficient.

And one more note. I haven’t tested this method yet and until I do, it will remain a hypothesis. But the learning-experiment starts now, and I promise that I’ll let you know how it worked! (UPDATE: I’ve learnt the top 200 words — and I already feel much more confident with Swedish! :-)) Also, I invite you to be a part of this and share your experience whether this method worked for you or not.

Here are the top 1000 words in Swedish and in English. If you want to do the same for other languages, feel free to use my code.

Free video course

Do you want to learn Data Science and run similar projects? Then check out my new (free) online video course. Start here: How to Become a Data Scientist.

REGISTER HERE (FOR FREE): https://data36.com/how-to-become-a-data-scientist/

Conclusion

You can apply data in almost every segment of your life to do things smarter. Be creative and take advantage of the fact that there is so much available data around us — and it is so easy to analyze it and turn it into meaningful findings! And always stay critical, think about the return on investment and don’t be afraid to do quick and dirty things!

If you want to learn more about data science, check my blog (data36.com) and/or subscribe to my Newsletter! And also check my new data coding tutorial series: SQL for Data Analysis.

Thanks for reading! Enjoyed the article? Please just let me know by clicking the 💚 below. It also helps other people see the story!

Tomi Mester
author of data36.com
Twitter:
@data36_com

More by Tomi Mester

Topics of interest

More Related Stories