There are about 15 million lines of Python code hosted on GitHub that start with a variant of the form: # -*- coding:<some encoding> -*-
To anyone learning Python, this might look like just another single-line “comment”. It certainly looked that way to me when I started programming in Python. However, I soon realized that there was something mysterious and esoteric about it: it was a special kind of comment that kept showing up at the top of Python files across codebases. In this article, I will try to break down the concepts behind this line of code.
ASCII: A character set that represents every good old unaccented English letter (along with digits and punctuation) in memory using a number between 32 and 127; the numbers below 32 are reserved for control characters. The letter “A” is represented by 65 while 32 represents a “Space”.
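The mapping between characters and numbers can be checked from Python itself. This is a small sketch using the built-in ord and chr functions; the variable names are only illustrative.

```python
# ord() gives the number behind a character, chr() goes the other way.
for ch in ["A", " ", "z"]:
    print(repr(ch), "->", ord(ch))

letter_a = chr(65)   # the character stored as 65
space = chr(32)      # the character stored as 32

# Plain English text stays entirely below 128.
is_ascii = all(ord(c) < 128 for c in "Hello World")
print(letter_a, repr(space), is_ascii)
```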
Unicode: Unicode was a brave effort to create a single character set that included every reasonable writing system on the planet and some make-believe ones like Klingon, too. In Unicode, a letter maps to something called a code point, which is still just a theoretical concept: it is an abstract number and says nothing about how the letter is stored in memory.
UTF-8: It is an encoding which deals with the problem of storing code points as bytes in memory. In UTF-8, every code point from 0–127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, or 4 bytes (the original design allowed up to 6 bytes, but the current standard, RFC 3629, caps it at 4). This has the neat side effect that English text looks exactly the same in UTF-8 as it did in ASCII.
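To see the variable-length storage in action, here is a short sketch assuming Python 3. The sample characters are my own picks (written as escapes so the snippet itself stays ASCII-only), not something from a particular codebase.

```python
# One character from each UTF-8 length class: A, e-acute, euro sign, an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

byte_lengths = []
for ch in samples:
    encoded = ch.encode("utf-8")
    byte_lengths.append(len(encoded))
    print(f"U+{ord(ch):04X} -> {encoded!r} ({len(encoded)} bytes)")

# English text produces identical bytes in ASCII and UTF-8.
same_as_ascii = "Hello".encode("utf-8") == "Hello".encode("ascii")
print(same_as_ascii)
```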
It does not make sense to have a string without knowing what encoding it uses. — Joel Spolsky
TL;DR — If you have a string, in memory or in a file, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.
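A quick way to convince yourself of this is to decode the same bytes with different encodings. The sketch below assumes Python 3 and uses the word "café" (written with an escape) purely as an illustration.

```python
# The same bytes mean different things depending on the encoding you assume.
raw = "caf\u00e9".encode("utf-8")      # the bytes as they would sit in a file

as_utf8 = raw.decode("utf-8")          # the right guess: 4 characters
as_latin1 = raw.decode("latin-1")      # the wrong guess: 5 characters, garbled

try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True                # ASCII has no meaning for bytes >= 128

print(raw, ascii(as_utf8), ascii(as_latin1), ascii_failed)
```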
Whenever you type a non-ASCII character in a literal, that is, a character which cannot be represented by an ASCII code (e.g. accented letters, Greek symbols), the Python interpreter needs to know which encoding was used to store that character in the source file. Declaring an encoding tells the interpreter how to interpret such literals and makes it possible to write Unicode literals directly, using e.g. UTF-8, in a Unicode-aware editor. This is where our special comment comes into the picture!
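The effect of the declaration can be sketched without touching any files by handing raw source bytes to the built-in compile function. This assumes Python 3; the file name "<demo>" and the variable names are placeholders I made up for the illustration.

```python
# Source code as raw bytes. The byte 0xE9 is an e-acute in Latin-1,
# but on its own it is not valid UTF-8.
declared = b"# -*- coding: latin-1 -*-\ns = '\xe9'\n"
undeclared = b"s = '\xe9'\n"

# With the declaration, the interpreter knows how to read the literal.
namespace = {}
exec(compile(declared, "<demo>", "exec"), namespace)
declared_value = namespace["s"]
print(ascii(declared_value))

# Without it, Python 3 assumes UTF-8 and cannot decode the source.
try:
    compile(undeclared, "<demo>", "exec")
    failed_without_declaration = False
except (SyntaxError, UnicodeDecodeError) as exc:
    failed_without_declaration = True
    print("could not decode source:", type(exc).__name__)
```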
If a comment in the first or second line of the Python script matches the regular expression
coding[=:]\s*([-\w.]+)
this comment is processed as an encoding declaration; the first group of this expression names the encoding of the source code file. The encoding declaration must appear on a line of its own. If it is the second line, the first line must also be a comment-only line.
Thus, all the variations of the encoding statement are valid as long as the regular expression matches. Refer to PEP 263 for examples.
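You can try the regular expression on a few candidate comments yourself. This is only a sketch with Python's re module: the comment styles follow the examples in PEP 263, while the helper name declared_encoding is hypothetical.

```python
import re

# The pattern quoted above; group 1 captures the encoding name.
CODING_RE = re.compile(r"coding[=:]\s*([-\w.]+)")

def declared_encoding(comment_line):
    """Return the encoding named in a comment line, or None if there is none."""
    match = CODING_RE.search(comment_line)
    return match.group(1) if match else None

candidates = [
    "# -*- coding: utf-8 -*-",                                # Emacs style
    "# vim: set fileencoding=latin-1 :",                      # Vim style
    "# coding=utf-8",                                         # bare form
    "# This Python file uses the following encoding: utf-8",  # plain sentence
    "# just an ordinary comment",                             # no declaration
]

results = [declared_encoding(line) for line in candidates]
for line, encoding in zip(candidates, results):
    print(f"{line!r} -> {encoding}")
```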
NOTE: From Python 3.0, if no encoding declaration is found, the default encoding is UTF-8.
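The standard library exposes this detection logic through tokenize.detect_encoding, so the default is easy to confirm. The sketch below assumes Python 3; the detect helper and the sample source bytes are my own illustration.

```python
import io
import tokenize

def detect(source_bytes):
    """Return the encoding Python would use to read the given source bytes."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding

# No declaration: Python 3 falls back to UTF-8.
no_declaration = detect(b"print('hello')\n")

# An explicit declaration on the first line wins over the default.
with_declaration = detect(b"# -*- coding: cp1252 -*-\nprint('hello')\n")

print(no_declaration, with_declaration)
```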
We will use some variations of the code in the above file to understand the importance of the encoding in files. So, I would suggest you copy and save this Python file somewhere. Perform the experiments described below and try to reason about your observations.
AttributeError
Bonus: Modify L03 to use encode instead of decode and execute the file.
If you are still having trouble understanding the observations, or if I have skipped any significant observation, let me know in the comments.
In all other cases, it is recommended to declare an encoding at the top of your Python files. This has two advantages:
1. Being explicit in your code
2. Compatibility across Python 2 and Python 3