Photo by on Mr Cup / Fabien Barral Unsplash ‘String’ is a sequence of characters. Considering you as a programmer (why else on earth would you read this article dedicated to programmers?), there is a high probability that you already know this definition. And I think, it is fair to assume that you already used ‘String’ in at least one programming language. Despite that, I should remind you one more time, ‘String’ in programming world has nothing to do with the which popularizes more than any physicist ever did or could do. ‘String Theory’ Sheldon Cooper In this article, I’m going to introduce you to some very important concepts/terms related to String which you might need to use in your day to day programming. Also, familiarity with these terms might help you in your job interview one day, who knows? Though some of the terms have a little complex meaning from ‘Linguistic’ aspect, I’ll try to avoid them and explain it from the view programmers need to understand. Another thing needs to be mentioned that, I’m not going to show any code examples or how you can do string operations in your favorite languages. I’ll just talk about the terms and concepts and your job is to find out how to use them when you need them. Makes sense? Ok, let’s go. Character Encoding: The primary component of a String is ‘Character’, wait, you already know that. But do you know that basically characters are just code representations? I mean, the computer doesn’t know characters, letters etc. or store them as such in memory. The computer stores them as numbers (as binary, to be precise). There are some conventions that define which number will represent which character. This character sets basically maps some numbers or code points to textual characters. Then at the time of visual representation, computers read them from memory and represent them as characters based on that mapping. This mapping and character representation is what they call ‘Character Encoding’. The following article did a pretty darn good job explaining everything about it. So, I’m suggesting you read that to gain a better understanding of ‘Character Encoding’. _This article is about encodings and character sets. An article by Joel Spolsky entitled The Absolute Minimum Every…_kunststube.net What Every Programmer Absolutely, Positively Needs to Know About Encodings and Character Sets to… Most of the novice programmers (especially who started their programming with C-like languages) fairly assumes that all of the characters in the world are single byte and obviously ASCII characters. I’m not gonna blame them, because I was one of them too once. But now I know that it’s not true and there is a vast world of Unicode characters out in the wild and as a programmer you need to deal with them pretty frequently. So, the first thing you need to remember that, All the characters in the world aren’t created equal and they are always not single byte characters. Most of the Unicode characters can be stored as a 16 bit or 2-byte data type. But still, there are some which can’t. There are , where 2 bytes can store as many as around 65,536 characters. So, Multi-byte data types are needed to store the remainders. So, you should keep in mind that: more than 136,000 code points defined in Unicode Some programming languages (C, C++ etc.) provide single byte data type ( ) for specifically ASCII characters and multi-byte data type ( ) for Unicode. char wchar_t Some programming languages (Java etc.) have only multi-byte data types for character types. The days of is long gone, and you need to keep that in mind vividly. one byte == one character Let’s give you another tiny bit of information while we are at it. Character encoding can be fixed length or variable length. You might’ve heard encoding schemes like UTF-8, UTF-16, UTF-32 etc. There are various other encodings though, but let’s just have a brief understanding of these three of Unicode Character Encoding. UTF-16 represents most of the commonly used Unicode code points in a single 16-bit character type and uses two 16-bit characters to represent the remainder. That means UTF-16 is a variable length encoding using minimum 16-bits (2 bytes) and maximum 32-bits (4 bytes). UTF-8 uses one to four 8-bit characters to encode all Unicode code points. It’s also another variable length encoding which represents basically the ASCII characters using a single byte (that’s handy because it means ASCII text is also valid in UTF-8) and rest of the code points by 2, 3, 4 bytes as needed. UTF-32 uses 4 bytes for all characters. It is a fixed length encoding. Your knowledge about character encoding might come in handy when you’ll work with . Also, you can guess from that this knowledge is very important in some other aspects like how much space, memory your text will consume based on it’s encoding. internationalization and localization this single discussion thread All I’m saying is, just be careful about character encoding when you are dealing with String or Text. String Immutability: Most of the languages provide ‘String’ as a basic data type. The thing you need to know about the language you are working with is, whether the String data type of your language is or . For instance, Java and Python String types are Immutable. If you try to treat an immutable String as mutable it will throw an error and you might wonder: Mutable Immutable What’s wrong with this String? You might be a little curious to know why Strings are Immutable in some languages, right? Let’s explore it a little. You see, in languages like C, they store Strings as an array of characters and acts like general purpose arrays. So, as you can easily mutate an array, you can mutate the String too. On the other hand, Languages like Java and Python treats String as Object and they When the compiler sees a String literal, it looks for the String in the pool. If a match is found, the reference to the new literal is directed to the existing String and no new String object is created. The existing String simply has one more reference. Here comes the point of making String objects immutable: set aside a special area of memory called the “String constant pool”. In the String constant pool, a String object is likely to have one or many references. If several references point to same String without even knowing it, it would be bad if one of the references modified that String value. That’s why String objects are immutable. So, the thing to learn from here is, be sure about the Immutability of Strings in your favorite languages before work with them. Ladies and gentleman, It’s time to learn some terms. Substring: Substring of a string S, is a string which occurs in S. Let’s say it more clearly. Any sequential part of a string is it’s substring. So, ‘is nice’ is a substring of the string ‘This is nice’. But ‘nice is’ not, because ‘nice is’ can’t be found as it is in the string ‘This is nice’. You must find the substring as it is in the String. The list of all substrings of the string “ ” would be: “ ”, “ ”, “ ”, “ ”, “ ”, “ ”, “ ”, “ ”, “ ”, “ ”, “ ”, “ ”, “ ”, “ ”, “”. apple apple appl pple app ppl ple ap pp pl le a p l e Prefix: A prefix of a string S is a substring of S that occurs at the of S.This might be clear from an example. The list of all prefixes of the string “ ” would be: “ ”, “ ”, “ ”, “ ”, “ beginning apple apple appl app ap a” So, to be a prefix, a substring needs to be started from the beginning of that string. A of a string is not equal to the string itself. So, “apple” won’t be a proper prefix of “apple”. proper prefix Suffix: A suffix of a string S is a substring of S that occurs at the of S.The list of all substrings of the string “ ” would be: “ ”, “ ”, “ ”, “ ”, “ ” end apple apple pple ple le e So, to be a suffix, a substring needs to be ended at the end of that string. A of a string is not equal to the string itself. So, “apple” won’t be a proper suffix of “apple”. proper suffix Subsequence: The formal definition of a subsequence is: “a subsequence is a sequence that can be derived from another sequence by deleting some or no elements without changing the order of the remaining elements.” Let me explain it in simple words. A string is a sequence of characters, right? So, if we pick some characters maintaining the same sequence appeared in the original String then the resulting string would be a subsequence of the original string. Let’s say we have a string “Hello”. Then the “eo” is a subsequence of that string. But “ol” won’t be a subsequence, because it didn’t appear in the same sequence in “Hello”. The list of all subsequences for the word “ ” would be: “ ”. apple e, l, le, p, pe, pl, ple, p, pe, pl, ple, pp, ppe, ppl, pple, a, ae, al, ale, ap, ape, apl, aple, ap, ape, apl, aple, app, appe, appl, apple Dude, did you get the difference between ‘substring’ and ‘subsequence’? Let’s learn some common operations about Strings. Concatenation: This means concatenating two or more strings together.Let’s say there are two strings: and . The concatenation result of these two strings results in another string . ‘Hello ’ ‘World’ ‘Hello World’ Capitalization/Case Folding: This means converting all the characters of a string to same Case, either ‘LowerCase’ or ‘UpperCase’. Capitalization has various use cases. One of them is: If you need to make a comparison between two Strings in manner, you always need to normalize both texts to either ‘LowerCase’ or ‘UpperCase’. Well, to be honest, it won’t always be that simple. You’d face some weird cases if your strings contain Non-Latin characters. Let’s say your string is “résumé". To convert strings to UpperCase, you would generally convert the [a-z] characters set to [A-Z] characters set. So, what would be the UpperCase of ? In that case, you need to make some other considerations like Unicode Normalization and such. Learn more about it . Case insensitive é here Join: Though it seems from the name that ‘Join’ is similar to ‘Concatenation’, but there is a subtle but important difference between them. Unlike ‘Concatenation’, ‘Join’ concatenates the strings using a given ‘separator’ or ‘glue’ between each of the string element. So, if a list of strings given like: along with a separator/glue , then the resulting string will be: . In some cases, the separator can be an empty string(‘’), where it will behave exactly like ‘Concatenation’. [‘Hello’, ‘world’, ‘again’] ‘-’ ‘Hello-world-again’ Split/Tokenize: Splitting or Tokenizing a string means to break down a single string to several parts based on a delimiter or set of delimiters. Let’s say you have a String: Now, if you tokenize this string based on whitespace, you’ll end up with a list of strings or tokens: . Notice that, the delimiter won’t be included in the resulting strings. Also, in some cases, you might need to tokenize a string based on a pattern instead of some specific delimiters. This can be done easily by . ‘Why so serious?’ [‘Why’, ‘so’, ‘serious?’] Regex Trim/Strip: Trimming/Stripping a string generally means removing leading and trailing whitespaces from a string. Whitespace in this context is all the whitespace characters (space, tab, no-break space, etc.) and all the . So, a given string: ‘ this is simple ’ would be ‘this is simple’ after trimming/stripping. Variants of Trimming/Stripping includes: line terminator characters (LF, CR, etc.) trimming/stripping only leading(left trimming) or trailing(right trimming) whitespaces. any sequence of whitespace characters within the string is replaced with a single space, which is also known as ‘Space normalization’. Lexicographic Ordering: Sometimes you need to define an ordering for a list of strings. Most popular ordering is , which is also known as ‘Alphabetic order’ or ‘Dictionary order’. You’ve seen how words are ordered in a Dictionary, right? That’s exactly what lexicographic ordering is. So, if you order the list of words in lexicographical order, it would be: ‘Lexicographic Order’ [‘ant’, ‘Zebra’, ‘apple’, ‘applecart’] Zebraantappleapplecart Note that, in lexicographical ordering, UpperCase letters come before Lowercase letters, that’s why ‘Zebra’ comes before ‘ant’. Also, longer strings come after the same prefixed smaller strings, that’s why ‘apple’ precedes ‘applecart’. String Searching/Comparison: You’d very often do a lot of tasks which is directly or indirectly associated with some sort of string searching and/or comparison. Let’s talk about some common types of String/Pattern matching. The most common of them will be to check exact match between two strings. Which you would fairly assume to do by == operator. But I want to give a simple tip which might save you one day from a lot of trouble. If you store two strings in two variables and try to check their equality by == operator you need to make sure if your programming language is doing value match or just reference match? If your programming language treats String as Object then most likely when you try to check two String Object by == operator, . On the other hand, other programming languages might just check the value equality of these two string variables. So, keep that in mind. it would check equality of the reference of the two Object, instead of their values You’ve seen ‘autocomplete’ or ‘autosuggestion’ features in google search or your phones contact search, right? If you observe this closely you would see that they are doing basically some prefix match. Your typed string vs the list of strings they already stored. So, how would you do it by yourself if needed? If your list of strings is not large, you can take any naive approach to match with all of them. But it won’t be that much efficient for large lists. The efficient approach would be to use some sophisticated data structure like . Details discussion about it is beyond the scope of this article, so you should dive deep into it on your own. Trie Similarly, if you need to do some suffix match, you can convert the problem to a prefix match problem by storing reverse of the strings along with the actual strings or you can take help of . Suffix Tree Sometimes you need to do some pattern matching like, check if a string is a substring of a given string or if a pattern can be found in the given string. In that cases, you need to take help from various and choose the one which fulfills your need. String searching algorithms In some cases, When we are running a search, we want to find relevant results not only for the exact expression we typed on the search bar, but also for the other possible forms of the words we used. For example, it’s very likely we will want to see results containing the form if we have typed in the search bar. This can be achieved through two possible methods: . The aim of both processes is the same: Though this term is most popular in , it is also used in full-text search mechanism in some database system and text search engines like . “ran”, “running” “run” stemming and lemmatization reducing the inflectional forms of each word into a common base or root. NLP ElasticSearch The last thing I want to mention here is the ‘Phonetic String Matching’. Phonetic algorithms basically index words based on their pronunciation. So, if you want to match two words by their pronunciation you’d need to use Phonetic algorithms. Most common uses of Phonetic algorithms are ‘Spell checking’, ‘Searching’ etc. There are several phonetic algorithms available to do the job like , etc. But be aware that most of the phonetic algorithms were developed for the English language. So, they might not work for other languages as you expect. soundex metaphone Phew! That’s a whole lot to digest at once. Does it seem like that I dumped a lot of information without explaining them in details? Yeah, that’s exactly what I did. Because my primary focus on this article was to make you familiar with the buzzwords of String/Text processing. If you at least know the terms you can explore it deeply when necessary. So, that’s all for today. Leave your positive/negative feedback in the comments. Just don’t be too harsh, cause that’ll hurt my ego. Bye bye then. :)

Apple

Glue

What every programmer should know about ‘String’

About Author

Comments

TOPICS

THIS ARTICLE WAS FEATURED IN

Related Stories

How do you authenticate, mate?

104 Stories To Learn About Go

105 Stories To Learn About Functional Programming

100+ Free Pluralsight Courses to learn Python, Java, and Spring Boot

10 Websites to Learn JavaScript for Beginners

104 Stories To Learn About Programming Top Story

How do you authenticate, mate?

104 Stories To Learn About Go

105 Stories To Learn About Functional Programming

100+ Free Pluralsight Courses to learn Python, Java, and Spring Boot

10 Websites to Learn JavaScript for Beginners

104 Stories To Learn About Programming Top Story

Light-Mode

Classic

Newspaper

Minty

Dark-Mode

Neon Noir

Minty

HN StartUps