I moved to Sweden 6 months ago with my girlfriend. It is a great country for expats like us, because almost everyone speaks really good English here. Even so, we would like to learn some Swedish, just to understand a bit more in daily conversation and about Swedish culture.
For me this is gonna be a hobby language, so I don’t want to put too much effort into it. Let’s say, I’d like to spend max 10–20 hours and my target is to understand 80% of daily chit-chat. Is it doable? We will see! Here’s my approach — backed with some beginner-level data science.
What I’ll present here is a pretty unorthodox method, I guess. But sometimes you need to be creative if you want things to be easier, right?
What I did:
The question is, how far can I go, if I learn only the top 1000 words of a language. Or being more pragmatic: What percentage of Swedish conversations are covered by the top 1000 words of Swedish language? I set up a short data script to find out — and by the end of this article I’ll share it with you as well! But first, I’ll show you how I got my answer to this question. (Note: my method works with every European language.)
(UPDATE: since I published this article, I got some nice tweets on Twitter from folks who had used this method before me! So it’s not so unorthodox after all! :-))
The first step is to get the subtitles. It can be any and as many movies as you want. I picked Friends and HIMYM, because I have already watched both series in English before and I liked the vocabulary. But if you want to speak more like Rocky, get the subtitles of the Rocky movies. (An important note about subtitles! As far as I know, using subtitles for non-profit research projects like this one is not illegal. But to stay on the safe side, I invite you to do the same that I did: ask a native speaker friend to listen and type in the whole script for you instead of downloading. ;-))
Either way you go — once you have the subtitles, you will need a tool to do the text analysis part. I’ve described in this article how to get, install and start with data coding tools like Python, R, SQL and bash. You could use any of those for this project.
I chose the simplest one: bash. It’s so simple that the whole analysis actually needs only one line of code. On the screenshot below I broke it down to 14 lines just to make it more readable.
So here’s my exact bash code:
Github link for copy-pasting: here.
What does the code do? In general, the first 9 lines will be about data cleaning and the rest 5 will do the actual analysis. Here it is line by line:
Now we have a clean list with all the words that were used in the TV shows. Every word appears as many times as it was used. Here’s a sample:
Done! Don’t worry if you don’t get the whole script immediately. Even if it’s a pretty beginner-level data script, you still need some initial knowledge about bash to fully understand it. I’ve published 4 articles about the very basics of data coding in bash on my blog. You can read them here if you want to learn more:
Learn Data Analytics in Bash — from scratch (and here is: ep#1)
We already have the top 1000 unique words that we were looking for. Nice! The only question left is: how much will I understand if I learn only these 1000 words? To answer it, I need to have two numbers:
Comparing these two numbers will give back the usage ratio of the TOP 1000 words.
Back to my code! To answer 1) I have to change Line 14 to this code:
awk ‘{sum += $1} END {print sum}’
It will give you the total word count: 1 131 360
Then I’ll do the same with my freshly created 1000.csv file…
awk ‘{sum += $1} END {print sum}’ 1000.csv
I’ll get the total occurrences of the top 1000 words: 964 682
This is just amazing! 964682/1131360= 85.26%
Yes, it means if I learn the top 1000 words from Friends and How I Met Your Mother, I will understand 85% of the words in the script!
Reality check! I shot this picture at the IKEA on the weekend. All the words of the headline are understandable after learning only the top 200 words.
Another interesting stat! This chart below shows how many percent (Y) you cover by learning the TOP (X) words.
The result: a beautiful logarithmic function on a line chart. (This is not a trend or a smoothed line, this is the actual result!) After all this means that even if you invest the same amount of time to learn the first 1000 words as the second 1000 words, the first 1000 words will help you to understand 85%, the second 1000 only 5%. And so on.
Okay, I have to admit, vocabulary is not everything. So my next steps will be:
And then… we will see!
Don’t forget: my goal was to understand, not to fluently speak the language.
To be frank I would not even call this project a “data science” project — as there is nothing very scientific in it. It’s just common sense and some text mining with basic data coding. And before NLP (Natural Language Processing) professionals start stoning me — I have to mention two known issues:
The bash script I presented above is not 100% perfect. E.g. I didn’t implement any stemming algorithm to reduce words like “cooking,” “cooked” or “cooks” to the root format: “cook.” My estimation is that these kind of issues cause a top 5% inaccuracy in my results. And then I just asked myself: What’s the ROI (return on investment) here? Is that 5% accuracy really worth the extra effort? The answer was: definitely not.I was looking for a quick and dirty method to learn a language, so it’s kinda self-explanatory that the script I wrote for that should be quick and dirty as well. If I had spent days (and not minutes) optimizing my data script to 99.9% (and not to 95%), the whole project would have lost sight of the main objective: be time-efficient.
And one more note. I haven’t tested this method yet and until I do, it will remain a hypothesis. But the learning-experiment starts now, and I promise that I’ll let you know how it worked! (UPDATE: I’ve learnt the top 200 words — and I already feel much more confident with Swedish! :-)) Also, I invite you to be a part of this and share your experience whether this method worked for you or not.
Here are the top 1000 words in Swedish and in English. If you want to do the same for other languages, feel free to use my code.
Do you want to learn Data Science and run similar projects? Then check out my new (free) online video course. Start here: How to Become a Data Scientist.
REGISTER HERE (FOR FREE): https://data36.com/how-to-become-a-data-scientist/
You can apply data in almost every segment of your life to do things smarter. Be creative and take advantage of the fact that there is so much available data around us — and it is so easy to analyze it and turn it into meaningful findings! And always stay critical, think about the return on investment and don’t be afraid to do quick and dirty things!
If you want to learn more about data science, check my blog (data36.com) and/or subscribe to my Newsletter! And also check my new data coding tutorial series: SQL for Data Analysis.
Thanks for reading! Enjoyed the article? Please just let me know by clicking the 💚 below. It also helps other people see the story!
_Tomi Mester_author of data36.comTwitter: @data36_com