Walter Bright, who designed D, wrote an article called “So You Want to Write Your Own Language”. I love D and I respect Mr. Bright, but I disagree with nearly everything he says in it, mostly because D itself and Digital Mars’s work in general are (and should be) extremely atypical of the kind of programming language design that occurs today.
Digital Mars has been writing cross-platform compilers for decades (and when I was first getting into C on Windows in the early aughts, Digital Mars’s C compiler was the one that gave me the least trouble), and D was positioned as an alternative to C++ as a general purpose language. The thing is, most languages being developed today aren’t going to be C-like performance-conscious compiled languages that need to build binaries on multiple platforms and live in environments with no development toolchains (all major platforms either ship with a GNU or GNU-like UNIX dev toolchain or have a system like Cygwin or Homebrew that makes installing one straightforward), and trying to compete with C++ is not only technically difficult but also foolish for wholly non-technical reasons.
Design
The second piece of advice Mr. Bright gives is to stick relatively close to familiar syntax, in order to maximize audience. This makes sense if you are trying to compete with C++, as D does. But D is a great illustration of the problems with this approach: D does everything C++ does better than C++ does it, and yet it has failed to displace C++. The reason is that C++ is a general purpose language, like Java, C#, and Go (in other words, a language that performs every task roughly equally poorly), and the appeal of general purpose languages is that they let you skip the up-front cost of learning a language better suited to the problem, in exchange for greater difficulty during implementation.
In other words, most things that are implemented in C++ are done that way not because C++ is the best tool for the job, but instead because C++ is marginally better suited than the three other languages that everybody on the dev team has known inside-out for twenty years. D can’t compete because the whole point of a general purpose language is to appeal to devs who failed the marshmallow test & want to minimize the initial steepness of the learning curve.
The domain of general purpose languages is crowded, and each of them casts a wide net: being technically better across the board than all of them is difficult, and without really heavy institutional support (like that given to C++ and Java by university accreditation boards & given to C# and Go by large corporations) being technically better will only give you a small community of dedicated fans and an isolated ecosystem. Writing a general purpose language is a bit like forming a 2-person garage startup positioned against General Mills.
Instead, it makes more sense to target some underserved niche: general purpose languages are necessarily bad at serving most niches, and the syntax changes necessary to serve such a niche better are generally all too obvious to anybody actually developing in it. Iteratively identifying and relieving pain points in real development work is sufficient to vastly improve a conventionally-structured language. Cutting the Gordian knot by introducing a wildly unconventional syntax that’s better suited for certain kinds of work can be even easier: such a syntax is allowed to be poorly suited for work outside its domain, and the syntax of conventional general-purpose languages has been made complicated & hard to reliably implement by decades of pressure to suit all kinds of very dissimilar problems.
Syntax is the difference between a teaspoon and an ice pick. Rather than aspiring to be vice grips or duct tape, build a language that is exactly the tool you need, and allow it to be a poor fit for everything else.
When beginning a language project, it makes sense to try to get it minimally functional as quickly as possible, so that you can dogfood it & start identifying where your model of suitability has flaws. So (in direct contradiction to Mr. Bright’s list of false & true gods) it makes sense to start off with the easiest suitable syntax to parse and the fewest keywords you can proceed with. As you write code in your language, you will identify useful syntactic sugar, useful changes to the behavior of corner cases, and keywords you would like to add — and none of these are likely to be the ones you would have predicted at the outset, unless you’re sticking quite close to some known conventional language.
To mimic Mr. Bright’s format, here are some false gods of syntax:
Here are some true gods of syntax:
Implementation
Regex is a very powerful tool, if you know how to use it. Lexing involves identifying character classes, which is exactly what regex is best at. I wouldn’t recommend writing large chunks of your language as regular expressions, or using nonstandard extensions like lookahead & brace matching (which break some of the useful performance guarantees of vanilla regex anyway), but regex is invaluable for lexing, and with care it can be invaluable for parsing as well.
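For instance, a usable lexer for a small expression language can be a single alternation of named groups. Here’s a minimal sketch in Python; the token names and patterns are invented for illustration, not from any particular language:

```python
import re

# Each token is a named group; the master pattern tries them in order.
TOKEN_SPEC = [
    ("NUMBER",  r"\d+(?:\.\d+)?"),
    ("NAME",    r"[A-Za-z_]\w*"),
    ("OP",      r"[+\-*/=]"),
    ("LPAREN",  r"\("),
    ("RPAREN",  r"\)"),
    ("NEWLINE", r"\n"),
    ("SKIP",    r"[ \t]+"),
    ("ERROR",   r"."),  # any other character is a lexing error
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def lex(source):
    line = 1
    for match in MASTER.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind == "NEWLINE":
            line += 1
        elif kind == "ERROR":
            raise SyntaxError(f"line {line}: unexpected character {text!r}")
        elif kind != "SKIP":
            yield kind, text, line  # keep the line number for error messages

print(list(lex("x = 3.14 + (y * 2)")))
```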
If you are writing your language implementation in C, using lex (or GNU flex) can be a great time saver. I wouldn’t recommend using Bison / YACC, personally — it’s possible to thread arbitrary C code into lex rules, and implementing parsing with this feature and a set of flags is much easier for simpler syntaxes than using a compiler-compiler. With regard to portability concerns, any major platform will have a development toolchain containing lex these days.
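The rule-plus-action structure that lex provides can also be sketched in plain Python, which shows what “parsing with a set of flags” might look like in practice; the comment syntax and rules below are hypothetical:

```python
import re

# Each rule pairs a pattern with an action; actions can flip flags,
# which is often all the parsing a simple syntax needs.
state = {"in_comment": False}
output = []

def on_comment_open(text):  state["in_comment"] = True
def on_comment_close(text): state["in_comment"] = False
def on_word(text):
    if not state["in_comment"]:
        output.append(text)

RULES = [
    (re.compile(r"\(\*"), on_comment_open),
    (re.compile(r"\*\)"), on_comment_close),
    (re.compile(r"\w+"), on_word),
    (re.compile(r"\s+|."), lambda text: None),  # skip everything else
]

def scan(source):
    pos = 0
    while pos < len(source):
        for pattern, action in RULES:
            match = pattern.match(source, pos)
            if match:
                action(match.group())
                pos = match.end()
                break

scan("hello (* a comment *) world")
print(output)  # ['hello', 'world']
```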
Of course, if you go the forth-lite route and have nearly completely consistent tokenization around a small set of special characters, this is much easier. Forth-lite languages can be split on whitespace and occasionally re-joined when grouping symbols like the double quote are found; lisp-lite languages with only a few paired grouping symbols can easily be parsed into their own ASTs.
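As a sketch of the lisp-lite case: pad the grouping characters with whitespace, split, then fold the token stream into a nested AST. The grammar here is a generic s-expression syntax, not any particular language:

```python
# Tokenize around the two special characters, then recurse on '('.
def tokenize(source):
    return source.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    token = tokens.pop(0)
    if token == "(":
        node = []
        while tokens and tokens[0] != ")":
            node.append(parse(tokens))
        if not tokens:
            raise SyntaxError("unmatched '('")
        tokens.pop(0)  # drop the ')'
        return node
    return token  # an atom

print(parse(tokenize("(define (square x) (* x x))")))
```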
It is possible to design a series of regex match expressions and corresponding handler functions, and iteratively split chunks of code from the outside in. I did this in Mycroft, and I recommend it only if your syntax requires very little context: the two different meanings of comma and period in Mycroft (as well as handling strings) made parsing this way complicated in some cases, but this kind of parsing is very well-suited for languages like lisp where all grouping symbols are paired and very few characters are special-cased. This is a style I find quite natural, but I understand that many people prefer thinking of parsing in terms of a left-to-right iteration over tokens; one benefit of the outside-in style is that a grouping-symbol mismatch will be reported in the middle of a block rather than at the very end (i.e., closer to where the actual mismatch probably is located, for large blocks with many levels of grouping).
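Here is a sketch of what outside-in parsing can look like, using an invented grammar of quoted strings and brace-delimited blocks; it illustrates the technique, and is not Mycroft’s parser:

```python
import re

def split_top(chunk):
    """Split on commas that are not nested inside braces or quotes."""
    parts, depth, quoted, current = [], 0, False, ""
    for ch in chunk:
        if ch == '"':
            quoted = not quoted
        elif not quoted and ch == "{":
            depth += 1
        elif not quoted and ch == "}":
            depth -= 1
        if ch == "," and depth == 0 and not quoted:
            parts.append(current)
            current = ""
        else:
            current += ch
    if current.strip():
        parts.append(current)
    return [part.strip() for part in parts]

def parse(chunk):
    # Try each outermost construct; handlers recurse on the inside.
    chunk = chunk.strip()
    for pattern, handler in RULES:
        match = pattern.match(chunk)
        if match:
            return handler(match)
    return ("atom", chunk)

RULES = [
    # a whole quoted string
    (re.compile(r'^"([^"]*)"$'), lambda m: ("string", m.group(1))),
    # a whole {...} block: split its inside at the top level and recurse
    (re.compile(r"^\{(.*)\}$", re.S),
     lambda m: ("block", [parse(part) for part in split_top(m.group(1))])),
]

print(parse('{ "hello", { 1, 2 }, x }'))
```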
In all of these cases, you should make sure that information about both the position in the file & the tokenization & parsing context is available to every function, for error-reporting purposes. If your syntax is sufficiently simple & consistent, this will be enough to ensure that you can produce useful error messages. Never expect any general purpose parsing or lexing tool to understand enough about your language to emit useful error messages; instead, expect to thread your own context through your implementation.
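Concretely, this can be as simple as a position-bearing token type and a context record passed everywhere; the field and function names here are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Token:
    kind: str
    text: str
    line: int
    column: int

@dataclass
class Context:
    filename: str
    rule: str  # what the parser was trying to match

def parse_error(token, context, message):
    # Every error message gets file, position, and parsing context.
    raise SyntaxError(
        f"{context.filename}:{token.line}:{token.column}: "
        f"while parsing {context.rule}: {message} (got {token.text!r})"
    )

tok = Token("OP", "}", line=12, column=30)
ctx = Context("example.lang", "argument list")
try:
    parse_error(tok, ctx, "unexpected closing brace")
except SyntaxError as err:
    print(err)  # example.lang:12:30: while parsing argument list: ...
```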
I agree with Mr. Bright about the proper way to handle errors: rather than trying to make the compiler or interpreter smarter than the programmer, it makes more sense to exit and print a sensible error message. However, he doesn’t touch upon the difference between compile-time and run-time errors here, and in an interpreted language or one that is compiled piecemeal at runtime (which will be typical of the kind of language somebody writes today) the distinction is a little more fuzzy. Rather than, as Mr. Bright suggests, exiting upon encountering any error, I recommend keeping information about current errors (and the full call stack information related to them) and going back up the stack until an error handler is reached; if we completely exhaust user code and get to the top level, we should probably print the whole call stack with our message (or some large section of it). This is the style used in Java, Python, and sometimes Lua. It’s straightforward to implement (Mycroft does it this way, with ~10 lines of handling in the main interpreter loop and less than 40 lines in a dedicated error module), and it can provide good error messages and the framework for flexible error handling in user code (which itself can be implemented easily — see Mycroft’s implementation for the throw and catch builtins).
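Here is a minimal sketch of that style in Python, using the host language’s exceptions for the unwinding; it illustrates the idea, and is not Mycroft’s actual code:

```python
# Keep a call stack, unwind until a handler is found, and print the
# whole stack if user code never catches the error.
class LangError(Exception):
    def __init__(self, message, call_stack):
        super().__init__(message)
        self.call_stack = list(call_stack)  # snapshot for reporting

call_stack = []  # frames: (function_name, handler or None)

def call(name, body, handler=None):
    call_stack.append((name, handler))
    try:
        return body()
    except LangError as err:
        if handler is not None:
            return handler(err)  # nearest enclosing handler wins
        raise                    # keep unwinding
    finally:
        call_stack.pop()

def throw(message):
    raise LangError(message, call_stack)

def main():
    # inner() throws; middle() has no handler; outer() catches.
    inner = lambda: throw("division by zero")
    middle = lambda: call("middle", inner)
    print(call("outer", middle, handler=lambda e: f"recovered: {e}"))

try:
    main()
except LangError as err:
    # Top level: print the full stack (not reached in this demo,
    # since outer() recovers).
    print("uncaught error:", err)
    for name, _ in err.call_stack:
        print("  in", name)
```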
With regard to performance: it’s not that performance doesn’t matter, but instead that performance matters only sometimes. Machine time is much cheaper than engineer time, and so a language that makes solving certain kinds of problems straightforward is valuable even if it does so slowly. A general purpose language has to run fast and compile fast in order to compete, but notoriously slow languages have become popular because they served particular niches, even in eras when processors were much slower and speed mattered much more (consider Perl in the text-processing space — although, because of accidents of history, it took a position better served by Icon — or Prolog in the expert-system space; Prolog was so well-suited to expert systems that they were prototyped in Prolog and then hand-translated from Prolog to C++ in order to avoid the performance overhead).
Unless you make huge dumb performance mistakes (and as the success of Prolog, PHP, Python, and Ruby makes clear, sometimes even big dumb performance mistakes aren’t fatal), if you scratch an itch the performance won’t be a deal-breaker, and optimization can come later, after both behavior and idiom are appropriately crystallized. Using a profiler is overkill in early iterations of the language, and since there’s a general trend in hardware toward many lower-power cores, you may get significantly more performance benefit from making parallelism easy to use (as Go does) than from tuning the critical path.
Lowering
Lowering is just an inverted way of looking at keyword minimization. When several mechanisms perform the same basic task with only small differences (as is the case with the three big looping constructs in imperative languages that Mr. Bright mentions), it makes sense to write a general function and then write syntactic sugar around it for the other cases; if for some reason you have failed to do it that way in the first place, then refactoring it (i.e., lowering) makes sense. However, I would recommend spending more initial effort identifying how to write well-factored builtin functions in the first place, rather than iteratively refactoring redundant code: you will save debugging time & avoid problems related to minor differences locking in separate implementations that should be merged (for instance, in Python not only are dictionaries separate from objects, but there’s also a distinction between old-style and new-style objects, even though implementing objects and classes as a special case of dictionaries, as JavaScript and Lua do, would make more sense).
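For example, the three big looping constructs can all lower onto a single loop primitive. This is a sketch; the AST shapes are invented for illustration:

```python
# Each surface construct desugars into one underlying "loop" primitive.
def lower_while(cond, body):
    # while cond: body  =>  loop { if not cond: break; body }
    return ("loop", [("if", ("not", cond), [("break",)]), *body])

def lower_for(init, cond, step, body):
    # for (init; cond; step) body  =>  init; while cond: body + step
    return ("seq", [init, lower_while(cond, [*body, step])])

def lower_do_while(body, cond):
    # do body while cond  =>  loop { body; if not cond: break }
    return ("loop", [*body, ("if", ("not", cond), [("break",)])])

print(lower_for(("assign", "i", 0),
                ("lt", "i", 10),
                ("assign", "i", ("add", "i", 1)),
                [("call", "print", ["i"])]))
```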
Runtime library
Nearly all of Mr. Bright’s points about runtime libraries only really apply to languages competing with C++. He focuses on low-level stuff necessary for cycle-heavy processing on single cores, but this is almost entirely irrelevant to most of the code that gets written these days.
Here are my suggestions:
After the prototype
If your language is useful, you will use it. Maybe some other people will as well. Writing a language is a great learning opportunity even if nobody uses it, since it improves your understanding of the internals of other languages, even unrelated ones.
Unless you are selling some proprietary compiler or interpreter (something that even Apple & Microsoft can’t get away with anymore), adoption rates don’t actually matter, except for stroking your ego — which is great, because there are a lot of languages out there and relatively few people are interested in learning obscure new ones.
If your language gets any use, it will grow and mature into something harder to predict as behaviors change to be more in line with the desires of users (even if the only user is yourself). When languages grow too fast or are built on top of a shaky foundation they can become messy, with new behaviors contradicting intuition; some languages have taken a tool-box approach and embraced this (notably Perl and to a lesser extent TCL), and other languages merely suffer from it (notably PHP and C++). It makes sense to try to make sure that new features behave in ways that align as closely as possible to idiom, and to develop a small set of strong simple idioms that guide all features; that way, pressure to become general purpose won’t make the learning curve for new users too rocky.