‘String’ is a sequence of characters.
Considering you as a programmer (why else on earth would you read this article dedicated to programmers?), there is a high probability that you already know this definition. And I think, it is fair to assume that you already used ‘String’ in at least one programming language.
Despite that, I should remind you one more time, ‘String’ in programming world has nothing to do with the ‘String Theory’ which Sheldon Cooper popularizes more than any physicist ever did or could do.
In this article, I’m going to introduce you to some very important concepts/terms related to String which you might need to use in your day to day programming. Also, familiarity with these terms might help you in your job interview one day, who knows?
Though some of the terms have a little complex meaning from ‘Linguistic’ aspect, I’ll try to avoid them and explain it from the view programmers need to understand. Another thing needs to be mentioned that, I’m not going to show any code examples or how you can do string operations in your favorite languages. I’ll just talk about the terms and concepts and your job is to find out how to use them when you need them. Makes sense? Ok, let’s go.
The primary component of a String is ‘Character’, wait, you already know that. But do you know that basically characters are just code representations? I mean, the computer doesn’t know characters, letters etc. or store them as such in memory. The computer stores them as numbers (as binary, to be precise). There are some conventions that define which number will represent which character. This character sets basically maps some numbers or code points to textual characters. Then at the time of visual representation, computers read them from memory and represent them as characters based on that mapping. This mapping and character representation is what they call ‘Character Encoding’. The following article did a pretty darn good job explaining everything about it. So, I’m suggesting you read that to gain a better understanding of ‘Character Encoding’.
This article is about encodings and character sets. An article by Joel Spolsky entitled The Absolute Minimum Every…kunststube.net
Most of the novice programmers (especially who started their programming with C-like languages) fairly assumes that all of the characters in the world are single byte and obviously ASCII characters. I’m not gonna blame them, because I was one of them too once. But now I know that it’s not true and there is a vast world of Unicode characters out in the wild and as a programmer you need to deal with them pretty frequently. So, the first thing you need to remember that,
All the characters in the world aren’t created equal and they are always not single byte characters.
Most of the Unicode characters can be stored as a 16 bit or 2-byte data type. But still, there are some which can’t. There are more than 136,000 code points defined in Unicode, where 2 bytes can store as many as around 65,536 characters. So, Multi-byte data types are needed to store the remainders. So, you should keep in mind that:
- Some programming languages (C, C++ etc.) provide single byte data type (char) for specifically ASCII characters and multi-byte data type (wchar_t) for Unicode.
- Some programming languages (Java etc.) have only multi-byte data types for character types.
- The days of one byte == one character is long gone, and you need to keep that in mind vividly.
Let’s give you another tiny bit of information while we are at it.
Character encoding can be fixed length or variable length. You might’ve heard encoding schemes like UTF-8, UTF-16, UTF-32 etc. There are various other encodings though, but let’s just have a brief understanding of these three of Unicode Character Encoding.
- UTF-16 represents most of the commonly used Unicode code points in a single 16-bit character type and uses two 16-bit characters to represent the remainder. That means UTF-16 is a variable length encoding using minimum 16-bits (2 bytes) and maximum 32-bits (4 bytes).
- UTF-8 uses one to four 8-bit characters to encode all Unicode code points. It’s also another variable length encoding which represents basically the ASCII characters using a single byte (that’s handy because it means ASCII text is also valid in UTF-8) and rest of the code points by 2, 3, 4 bytes as needed.
- UTF-32 uses 4 bytes for all characters. It is a fixed length encoding.
Your knowledge about character encoding might come in handy when you’ll work with internationalization and localization. Also, you can guess from this single discussion thread that this knowledge is very important in some other aspects like how much space, memory your text will consume based on it’s encoding.
All I’m saying is, just be careful about character encoding when you are dealing with String or Text.
Most of the languages provide ‘String’ as a basic data type. The thing you need to know about the language you are working with is, whether the String data type of your language is Mutable or Immutable. For instance, Java and Python String types are Immutable. If you try to treat an immutable String as mutable it will throw an error and you might wonder:
You might be a little curious to know why Strings are Immutable in some languages, right? Let’s explore it a little. You see, in languages like C, they store Strings as an array of characters and acts like general purpose arrays. So, as you can easily mutate an array, you can mutate the String too. On the other hand, Languages like Java and Python treats String as Object and they set aside a special area of memory called the “String constant pool”. When the compiler sees a String literal, it looks for the String in the pool. If a match is found, the reference to the new literal is directed to the existing String and no new String object is created. The existing String simply has one more reference. Here comes the point of making String objects immutable:
In the String constant pool, a String object is likely to have one or many references. If several references point to same String without even knowing it, it would be bad if one of the references modified that String value. That’s why String objects are immutable.
So, the thing to learn from here is, be sure about the Immutability of Strings in your favorite languages before work with them.
Ladies and gentleman, It’s time to learn some terms.
Substring of a string S, is a string which occurs in S.
Let’s say it more clearly. Any sequential part of a string is it’s substring. So, ‘is nice’ is a substring of the string ‘This is nice’. But ‘nice is’ not, because ‘nice is’ can’t be found as it is in the string ‘This is nice’. You must find the substring as it is in the String.
The list of all substrings of the string “apple” would be:
“apple”, “appl”, “pple”, “app”, “ppl”, “ple”, “ap”, “pp”, “pl”, “le”, “a”, “p”, “l”, “e”, “”.
A prefix of a string S is a substring of S that occurs at the beginning of S.
This might be clear from an example. The list of all prefixes of the string “apple” would be:
“apple”, “appl”, “app”, “ap”, “a”
So, to be a prefix, a substring needs to be started from the beginning of that string.
A proper prefix of a string is not equal to the string itself. So, “apple” won’t be a proper prefix of “apple”.
A suffix of a string S is a substring of S that occurs at the end of S.
The list of all substrings of the string “apple” would be:
“apple”, “pple”, “ple”, “le”, “e”
So, to be a suffix, a substring needs to be ended at the end of that string.
A proper suffix of a string is not equal to the string itself. So, “apple” won’t be a proper suffix of “apple”.
The formal definition of a subsequence is:
“a subsequence is a sequence that can be derived from another sequence by deleting some or no elements without changing the order of the remaining elements.”
Let me explain it in simple words. A string is a sequence of characters, right? So, if we pick some characters maintaining the same sequence appeared in the original String then the resulting string would be a subsequence of the original string. Let’s say we have a string “Hello”. Then the “eo” is a subsequence of that string. But “ol” won’t be a subsequence, because it didn’t appear in the same sequence in “Hello”.
The list of all subsequences for the word “apple” would be:
“e, l, le, p, pe, pl, ple, p, pe, pl, ple, pp, ppe, ppl, pple, a, ae, al, ale, ap, ape, apl, aple, ap, ape, apl, aple, app, appe, appl, apple”.
Dude, did you get the difference between ‘substring’ and ‘subsequence’?
Let’s learn some common operations about Strings.
This means concatenating two or more strings together.
Let’s say there are two strings: ‘Hello ’ and ‘World’. The concatenation result of these two strings results in another string ‘Hello World’.
This means converting all the characters of a string to same Case, either ‘LowerCase’ or ‘UpperCase’.
Capitalization has various use cases. One of them is:
If you need to make a comparison between two Strings in Case insensitive manner, you always need to normalize both texts to either ‘LowerCase’ or ‘UpperCase’.
Well, to be honest, it won’t always be that simple. You’d face some weird cases if your strings contain Non-Latin characters. Let’s say your string is “résumé". To convert strings to UpperCase, you would generally convert the [a-z] characters set to [A-Z] characters set. So, what would be the UpperCase of é? In that case, you need to make some other considerations like Unicode Normalization and such. Learn more about it here.
Though it seems from the name that ‘Join’ is similar to ‘Concatenation’, but there is a subtle but important difference between them. Unlike ‘Concatenation’, ‘Join’ concatenates the strings using a given ‘separator’ or ‘glue’ between each of the string element. So, if a list of strings given like: [‘Hello’, ‘world’, ‘again’] along with a separator/glue ‘-’, then the resulting string will be: ‘Hello-world-again’. In some cases, the separator can be an empty string(‘’), where it will behave exactly like ‘Concatenation’.
Splitting or Tokenizing a string means to break down a single string to several parts based on a delimiter or set of delimiters. Let’s say you have a String: ‘Why so serious?’ Now, if you tokenize this string based on whitespace, you’ll end up with a list of strings or tokens: [‘Why’, ‘so’, ‘serious?’].
Notice that, the delimiter won’t be included in the resulting strings.
Also, in some cases, you might need to tokenize a string based on a pattern instead of some specific delimiters. This can be done easily by Regex.
Trimming/Stripping a string generally means removing leading and trailing whitespaces from a string. Whitespace in this context is all the whitespace characters (space, tab, no-break space, etc.) and all the line terminator characters (LF, CR, etc.). So, a given string: ‘ this is simple ’ would be ‘this is simple’ after trimming/stripping. Variants of Trimming/Stripping includes:
- trimming/stripping only leading(left trimming) or trailing(right trimming) whitespaces.
- any sequence of whitespace characters within the string is replaced with a single space, which is also known as ‘Space normalization’.
Sometimes you need to define an ordering for a list of strings. Most popular ordering is ‘Lexicographic Order’, which is also known as ‘Alphabetic order’ or ‘Dictionary order’. You’ve seen how words are ordered in a Dictionary, right? That’s exactly what lexicographic ordering is. So, if you order the list of words [‘ant’, ‘Zebra’, ‘apple’, ‘applecart’] in lexicographical order, it would be:
Note that, in lexicographical ordering, UpperCase letters come before Lowercase letters, that’s why ‘Zebra’ comes before ‘ant’. Also, longer strings come after the same prefixed smaller strings, that’s why ‘apple’ precedes ‘applecart’.
You’d very often do a lot of tasks which is directly or indirectly associated with some sort of string searching and/or comparison.
Let’s talk about some common types of String/Pattern matching.
The most common of them will be to check exact match between two strings. Which you would fairly assume to do by == operator. But I want to give a simple tip which might save you one day from a lot of trouble. If you store two strings in two variables and try to check their equality by == operator you need to make sure if your programming language is doing value match or just reference match? If your programming language treats String as Object then most likely when you try to check two String Object by == operator, it would check equality of the reference of the two Object, instead of their values. On the other hand, other programming languages might just check the value equality of these two string variables. So, keep that in mind.
You’ve seen ‘autocomplete’ or ‘autosuggestion’ features in google search or your phones contact search, right?
If you observe this closely you would see that they are doing basically some prefix match. Your typed string vs the list of strings they already stored. So, how would you do it by yourself if needed? If your list of strings is not large, you can take any naive approach to match with all of them. But it won’t be that much efficient for large lists. The efficient approach would be to use some sophisticated data structure like Trie. Details discussion about it is beyond the scope of this article, so you should dive deep into it on your own.
Similarly, if you need to do some suffix match, you can convert the problem to a prefix match problem by storing reverse of the strings along with the actual strings or you can take help of Suffix Tree.
Sometimes you need to do some pattern matching like, check if a string is a substring of a given string or if a pattern can be found in the given string. In that cases, you need to take help from various String searching algorithms and choose the one which fulfills your need.
In some cases, When we are running a search, we want to find relevant results not only for the exact expression we typed on the search bar, but also for the other possible forms of the words we used. For example, it’s very likely we will want to see results containing the form “ran”, “running” if we have typed “run” in the search bar. This can be achieved through two possible methods: stemming and lemmatization. The aim of both processes is the same: reducing the inflectional forms of each word into a common base or root. Though this term is most popular in NLP, it is also used in full-text search mechanism in some database system and text search engines like ElasticSearch.
The last thing I want to mention here is the ‘Phonetic String Matching’. Phonetic algorithms basically index words based on their pronunciation. So, if you want to match two words by their pronunciation you’d need to use Phonetic algorithms. Most common uses of Phonetic algorithms are ‘Spell checking’, ‘Searching’ etc. There are several phonetic algorithms available to do the job like soundex, metaphone etc. But be aware that most of the phonetic algorithms were developed for the English language. So, they might not work for other languages as you expect.
Does it seem like that I dumped a lot of information without explaining them in details? Yeah, that’s exactly what I did. Because my primary focus on this article was to make you familiar with the buzzwords of String/Text processing. If you at least know the terms you can explore it deeply when necessary.
So, that’s all for today. Leave your positive/negative feedback in the comments. Just don’t be too harsh, cause that’ll hurt my ego. Bye bye then. :)