This is the first in a series of blog posts we’re bringing to you directly from the trenches, going into the nitty-gritty technical detail of what we’re doing with the Protocol at the moment.
Today’s article comes from Alex Pinto, a recent addition to the Aventus blockchain engineering team who has spent the past few weeks getting up to speed with Solidity, and will take us through some of the challenges and peculiarities of the language.
Today I give you a post about programming for the Ethereum blockchain using the Solidity language. I won’t follow any plan in doing this: my objective is only to write about my obstacles in learning this language and the practical difficulties I encounter in my daily work.
I want the freedom to write about any topic without having first to introduce preliminary material, as I’d have to do if I were writing a textbook. If you notice me talking about things I have not explained before, that is by design. Leave me a comment below and I’ll come back to them in a later post.
Today, I want to talk about strings in Solidity. At first glance, Solidity is similar in syntax to JavaScript and other C-like languages. Because of that, a newcomer with a grounding in any of several widespread languages can quickly grasp what a Solidity program does. Nevertheless, the devil is in the proverbial details, and Solidity's details hide unforeseen difficulties. That is the case with the type string and the related type bytes.
Both of these are dynamic array types, which means that they can store data of an arbitrary size. Each element of a variable of type bytes is, unsurprisingly, a single byte. Each element of a variable of type string is a character of the string. So far so good, but first looks are deceiving. Someone who comes from other languages might expect the string type to provide several useful functions, like:
Bad news: Solidity’s string does none of this! If we need any of the above, we have to do it manually.
So, let’s explore some of these difficulties and see what we can do about them. I open Remix and type the following code in a new file called string.sol.
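The listing itself did not survive in this copy of the post. A minimal sketch consistent with the description that follows (a state variable store, a method to set it and a method to get it) might look like this. The contract and method names are the ones the post mentions; the 0.4.x pragma and the initial value "abcdef" are assumptions made for illustration.

```solidity
pragma solidity ^0.4.24;

contract String {
    // State variable: lives in storage and persists between calls.
    // The initial literal is a placeholder chosen for this sketch.
    string store = "abcdef";

    // Returns the current contents of the state variable.
    function getStore() public view returns (string) {
        return store;
    }

    // Overwrites the state variable with the supplied value.
    function setStore(string _value) public {
        store = _value;
    }
}
```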
The right side of the screen, in Remix, is taken by the developer’s area. In the Compile tab, I check the Auto-Compile option, so that Remix will notify me of errors and code-analysis warnings as I write my code. The static code-analysis is controlled by the options in the tab Analysis, and I usually have all options selected.
In the current case, Remix will report two warnings of the same kind: the methods I have written can potentially have a high-to-infinite gas cost. I will ignore that in this post.
The above contract is very minimal. It defines a state variable store of type string, a method to set it and a method to get it. Let’s test it.
In the Run tab, I hit Deploy and if there are no problems with the contract, a new area will appear below that button with the address where the contract is located and the functions that are available.
Below the working area, Remix shows a detailed record of the transaction’s result. Initially, it shows only a line indicating the account that deployed the contract, the contract and method that was called, i.e. String.(constructor), and how much Ether was passed to the execution (initially this is shown in Wei, which is the smallest unit of Ether, corresponding to 10^-18 Ether). We can expand it by clicking over the header, revealing logs, execution and transaction costs, available gas, final result, etc.
At this point, I just want to press the button getStore on the right, and notice how that shows beneath it the result:
Likewise, there is a new transaction log on the left and by clicking it we can see:
in the decoded output. All is well.
Now, I type “0123456789” in the textbox to the right of setStore and hit that button. Then I call getStore again and I receive that string. Thumbs up, we can do basic storage/retrieval with strings!
Let’s now go for more interesting things.
So far, I have accessed a literal string and we have seen how we can change it by assigning to it. But that is only a very coarse way of dealing with strings. Let us create a string character by character. This will introduce us to one peculiarity of Solidity programming: data location.
I create a new method that only returns a new string with three specific characters: “Abc”.
This is a well-intentioned effort, but it does not work. Remix is kind enough to immediately point out four errors and one warning:
Two of these are on the same line: string newString = new string(3);
The other three occur in the following lines, e.g. newString[0] = "A"; and are all of the same type:
To understand the first issue, I have to tell you about data location. Writing to the blockchain is very expensive. Every node that runs the transaction has to do the same writing, which makes the transaction more expensive and the blockchain bigger. When a node downloads a block containing this transaction, it will incur larger storage costs because of this writing. In Ethereum, every transaction has an associated cost, called gas, to incentivise programmers to be as economical as possible.
When writing a contract, authors have a choice of what kind of data to use: memory is cheap (i.e. it costs relatively little gas, but the data are volatile and lost after a function finishes executing); storage is the most expensive (and is absolutely needed for contract state, which must persist from function call to function call); there is also a calldata location, a read-only, non-persistent area that holds the arguments of an external function call. This is the cheapest location to use, but its arguments are still referenced through the EVM stack, of which only a limited number of slots are reachable at any time. In particular, that means that functions may be limited in their number of arguments.
Every data type has a default location. This is from the Solidity documentation:
Forced data location:
-parameters (not return) of external functions: calldata
-state variables: storage
Default data location:
-parameters (also return) of functions: memory
-all other local variables: storage
Notice the subtlety: function parameters are by default stored in memory, except if the function is external, in which case they stay in calldata. A dynamic argument in calldata takes up more stack slots than its memory counterpart, so a function that is perfectly alright when public can suddenly have too many arguments when made external.
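The three locations can be seen side by side in a small contract. This is only a sketch under the assumption of a 0.4.x compiler, where the locations follow the defaults quoted above; the contract name Locations and its functions are hypothetical and not part of the original post.

```solidity
pragma solidity ^0.4.24;

contract Locations {
    // State variable: forced to storage, persists between calls.
    bytes stored;

    // Parameter of a public function: memory by default.
    function save(bytes _data) public {
        // Copying memory into storage is the expensive step.
        stored = _data;
    }

    // Parameter of an external function: forced to calldata (read-only).
    function size(bytes _data) external pure returns (uint) {
        return _data.length;
    }

    // Local variable explicitly placed in memory: a volatile copy of storage.
    function firstByte() public view returns (bytes1) {
        bytes memory snapshot = stored;
        require(snapshot.length > 0);
        return snapshot[0];
    }
}
```

Only save writes to the blockchain; size and firstByte read data without changing state.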
Now, let’s come back to our code and examine the line
string newString = new string(3);
This is a local variable inside the function, and so by default it is in storage. The new keyword is used to specify the initial size of a memory dynamic array. Memory arrays cannot be resized. On the other hand, we can change the size of a storage dynamic array by changing its length property, but can’t use new with them.
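The difference between the two ways of sizing an array can be sketched as follows. The example uses bytes rather than string, since string exposes no length member, and the contract name Sizing and its functions are hypothetical. Assigning to length assumes a 0.4.x compiler; later compiler releases are understood to have made length read-only.

```solidity
pragma solidity ^0.4.24;

contract Sizing {
    // Storage dynamic array: its size can change after creation.
    bytes stored;

    // Resizes the storage array by assigning to length (0.4.x syntax).
    function resize(uint _newLength) public {
        stored.length = _newLength;
    }

    // A memory array receives its size once, from new, and keeps it.
    function fresh(uint _size) public pure returns (uint) {
        bytes memory buffer = new bytes(_size);
        return buffer.length;
    }
}
```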
This is the source of our error. In this case, all we want to do with this string is create it and return it to the outside. Let the outside world decide what to do with it, and whether it is temporary only or important enough to persist on the blockchain. In this example, the storage is not important, and the string will be created in memory. To do that, we add the memory keyword in the declaration, like this: string memory newString = new string(3);
Let’s see the second sort of errors now. This is simple and unavoidable: Solidity does not currently allow index access to strings. From the FAQ:
string is basically identical to bytes only that it is assumed to hold the UTF-8 encoding of a real string. Since string stores the data in UTF-8 encoding it is quite expensive to compute the number of characters in the string (the encoding of some characters takes more than a single byte). Because of that, string s; s.length; is not yet supported and not even index access s[2].
The alternative is to first transform the string into bytes, and then access it directly. This works because string is an array type, albeit with some restrictions.
But there is a trap to watch out for. bytes stores raw data; string stores UTF-8 characters. The following code does not always return the number of characters in _s:
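The listing is missing from this copy of the post. A minimal sketch of such a function, assuming a 0.4.x compiler and using the hypothetical names ByteLength and byteLength, could be:

```solidity
pragma solidity ^0.4.24;

contract ByteLength {
    // Returns the length of the UTF-8 encoding of _s, in bytes.
    // This equals the character count only when every character
    // is encoded in a single byte.
    function byteLength(string _s) public pure returns (uint) {
        bytes memory raw = bytes(_s);
        return raw.length;
    }
}
```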
The problem here occurs if _s contains any character that takes more than 1 byte to represent in UTF-8. In that case, the function returns the length of the byte representation of the input string, which will be greater than the number of characters.
This also has an impact when trying to address a particular character of the string, as we cannot predict at which location the character’s bytes will be. We have to parse the string linearly, identifying any multi-byte characters, or else make sure we restrict our input to characters of fixed length. If we work exclusively with ASCII strings, for example, we’ll be safe.
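One way to do the linear parse is to count only the bytes that start a character. In UTF-8, every byte after the first byte of a multi-byte character has the bit pattern 10xxxxxx, so skipping those leaves one count per character. The sketch below assumes a 0.4.x compiler and well-formed UTF-8 input; the names Utf8Length and countCharacters are hypothetical.

```solidity
pragma solidity ^0.4.24;

contract Utf8Length {
    // Counts characters (code points) in _s by walking its bytes once
    // and ignoring UTF-8 continuation bytes. Does not validate the input.
    function countCharacters(string _s) public pure returns (uint count) {
        bytes memory raw = bytes(_s);
        for (uint i = 0; i < raw.length; i++) {
            // A continuation byte has its two top bits set to 10.
            if ((uint8(raw[i]) & 0xC0) != 0x80) {
                count++;
            }
        }
    }
}
```

The cost grows with the length of the string, which is the price of not being able to index characters directly.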
Returning to our previous function, this works:
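The working listing is not reproduced in this copy. A sketch that combines the two fixes discussed so far (the memory keyword and the conversion to bytes) could look like this, assuming a 0.4.x compiler; the names Abc, getAbc and byteString are hypothetical.

```solidity
pragma solidity ^0.4.24;

contract Abc {
    // Builds the string "Abc" byte by byte and returns it.
    function getAbc() public pure returns (string) {
        // Explicitly placed in memory, so new can be used.
        string memory newString = new string(3);
        // Convert to bytes to gain index access.
        bytes memory byteString = bytes(newString);
        byteString[0] = "A";
        byteString[1] = "b";
        byteString[2] = "c";
        // Convert back to string for the caller.
        return string(byteString);
    }
}
```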
But, for example, the following code, which tries to set the third character of a string to X, will fail when it receives multi-byte characters.
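The original listing is missing here as well. A minimal sketch of a function with this behaviour, assuming a 0.4.x compiler and using the hypothetical names ThirdCharacter and setThirdToX, might be:

```solidity
pragma solidity ^0.4.24;

contract ThirdCharacter {
    // Overwrites the byte at index 2 with "X". This is the third
    // character only if the first two characters take one byte each.
    function setThirdToX(string _s) public pure returns (string) {
        bytes memory raw = bytes(_s);
        // Guard against inputs shorter than three bytes.
        require(raw.length >= 3);
        raw[2] = "X";
        return string(raw);
    }
}
```

When the input starts with a multi-byte character, the write lands inside that character's encoding and leaves a byte sequence that is no longer valid UTF-8.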
This returns “AbXdef” for an input of “Abcdef”, but returns “XbÁnç!” for an input of “€bÁnç!”
There are still many more things that can be said about this topic, but this is a long enough post already, so I’ll wrap up. The key concept regarding the type string is that this is an array of UTF-8 characters, and can be seamlessly converted to bytes. This is the only way of manipulating the string at all. But it is important to note that UTF-8 characters do not exactly match bytes. The conversion in either direction will be accurate, but there is not an immediate relation between each byte index and the corresponding string index. For most things, there may be an advantage in representing the string directly as the type bytes (avoiding conversions) and in being very careful when using characters that are encoded in UTF-8 by more than one byte.
That’s enough for now. See you another day, with more steps in this coding adventure.
Alex is a software engineer at Aventus, working on the blockchain engineering team. He has 20 years of experience working in technology, completing a PhD in Computer Science as well as a post-doctorate in Cryptography. As part of his research, Alex has published papers on Kolmogorov Complexity, Cryptography, Database Anonymization and Code Obfuscation.
Alex also spent seven years lecturing at the University Institute of Maia, including directing the degree programmes for BSc Computer Science and Information Systems and Software.
This article was originally posted on his blog.