#Switch2Swift for Deep Learning


If you are interested in what the recent fast.ai advanced (and closed) Deep Learning class had to say about Google’s Swift for TensorFlow project, you might find this post interesting. Even if you attended the class, you should hopefully find here a good overview (with links into the class, presentations, and additional material) of what Swift for TensorFlow is and why it might be relevant.
Forget about me and philosophy, and better have a look at the official announcements from Jeremy Howard and Chris Lattner (the Swift guy, not the overrated basketball player).
You might want to start with why Google will support Swift for TensorFlow and why fast.ai will embrace Swift (in short, it's about accelerator performance, language type safety and code completion, traceability of all layers, and deployment of research prototype code to production). Swift for TensorFlow should be usable ‘all the way’: you learn (Swift is easy, and notebooks make it easier!), you research and code, and then you ship to production.
Now, what I can give you is a walk through the arguments in the class, plus some more links.
I give you one section only of my own thoughts on why I find this an interesting project. Skip to the next header if you want to dive into fast.ai's and Google's thoughts.
Studying languages and creating artificial languages has a long, long tradition in philosophy and the theory of knowledge, for good reasons: we want to push the boundaries of our knowledge, and languages, including programming languages, are a big factor. You can find a long list of cryptic remarks on the connection between language and knowledge (e.g. Ludwig Wittgenstein: ‘The limits of my language mean the limits of my world’).
Of course, they refer to the natural or mathematical languages known at the time, but why would this be different for a programming language? I think it is not, and Deep Learning use cases, e.g. recognizing cancer in CT scans or digesting text, make this more obvious than before. We try to push the boundaries of our knowledge, and programming languages give us the opportunity, via the cloud, to use these huge CPUs, GPUs, and TPUs to do just that.
Imagine we could manage to gather large amounts of economically relevant data and then try to predict the economic cycles. Maybe we can predict the next financial crisis, maybe we get a ‘theory’ that makes unemployment understandable. A completely different theory than these mathematically disguised economic ideologies, hijacked in the political arena. Imagine we have some kind of large neural network with convolutions, recurrence and what not … and that thing is able to predict things which so far we could not. Now, that theory might be hard to grasp: it doesn’t have these human concepts like work, capital, law, classes and so on. Rather, layers and weights. But it’s not impossible to grasp, and if it predicts better, why would we not use it and try to understand it? Why would we expect anyway that something as complex as the economy would be humanly understandable in some simple concepts, nice mathematical graphs and so on? Something as messy and potentially nuanced as a Neural Network and large amounts of messy data might be a much better bet for predicting that level of complexity. It is already happening: in some studies, CNNs do better than human doctors at predicting e.g. lung cancer from lung scans.
I could come up with many science-fiction-like use cases for incredibly relevant phenomena, be it climate change, diseases and so on. Eventually, coming up with new languages is really about pushing the boundaries of what humans can understand! This is a super-old topic for philosophers, and having great computer languages is really just one super important branch of this old topic. Languages matter, they are not some ephemeral, nerdy thing (they are that as well!). So keep on reading, because Swift for TensorFlow is about enhancing and inventing a new paradigm for programming languages that will allow us to use these silicon brains.


This advanced class is aimed at rebuilding the fast.ai PyTorch framework from the ground up. Hence, a lot of what you learn is also about software engineering, like callbacks. There are lots of gems in this class (do check lessons 1–5), but here we focus on lessons 6 and 7, which introduce Swift.
Take a look at the notebooks, and btw, even if you check lessons 1–5, you can check the Python notebooks in a Swift version. Focus on auto-differentiation here and here. To understand those notebooks, you might want to have a peek at the preceding ones, starting with ‘00…’ and ‘01…’, among others on implementing matmul / dot product with Swift. Before that, if you are new to Swift like me, brush up your Swift knowledge. It’s fun, it’s easy: I bet if you know Python, this code (slide from fast.ai part 2) will look so familiar you could just start.
Maybe still take an easy Swift tutorial like this tour, or rather grab a tutorial in notebook form, like the one from the Swift for TensorFlow team or from fast.ai.
Notebooks are a standard, even if you eliminate Python from the equation. Using Google Colab worked fine for me. In the beginning, there were some hiccups getting the Swift kernel to run stably, but things improved rapidly with the Google team working on it. Likely, when you read this in the future, Colab for Swift will be pretty stable. Note also that a huge advantage of using Swift notebooks is that those, afaik, will be updated with at least some s4tf versions. You should also have a look at the fast.ai forum and the harebrain category and repository. Harebrain is the open part of the fast.ai forum, where the Google and fast.ai teams collaborate. From my understanding, fast.ai wants to rewrite its framework based on Swift (without giving up Python and PyTorch). The general question they try to answer during their collaboration is:
What if high-level API design was able to influence the creation of a differentiable programming language?
There you have it. Differentiable what? This ‘differentiable’ built right into Swift is a key difference to what you get e.g. in Python. So, if this is the key theoretical concept, it may be good to know what that term even means. Turns out, the term was popularized by Yann LeCun (one of the really smart ones), so we better read this! Yeah, well, having done so: it’s a rebranding of Deep Learning. No issues, just go through this entire 1st part here and you know what it is ;-)
But before you go into the notebooks yourself, have an easy look through the videos and take my guidance.

#switch2swift: why?

So, okay. Maybe switch away from python — but why swift? Why not e.g. Julia?
Interestingly enough, fast.ai apparently also looked at Julia as an alternative to Swift for TensorFlow (I believe I heard Jeremy mention that in some class, but couldn’t find the link again). One reason to go for Swift is certainly that Google will be pushing it, even though there seem to be many other reasons, some of which we will discuss. Having said that, there seems to be no reason not to have a fast.ai version in Julia, and Jeremy even mentions that (same here, I couldn’t find the link again).
For Google, you can find a long thread here on why they chose Swift over other options. So far, I can say that as an ‘end user’ (developer) I am looking forward to it, because Swift seems really fun and easy to use. And how good is it to have types and proper compile errors when working with a language, instead of finding these things out at runtime!

Python’s days (in Deep Learning) are numbered

Jeremy sets the stage very early in the course, to prep folks that something is ‘going to happen’ to Python. He does this by programming matmul / dot product in Python and demonstrating how incredibly sloooow this is. Now, Python solves this by reverting to e.g. PyTorch, which in turn uses C libraries under the hood, which comes with all kinds of problems (essentially, what Swift would solve!).
I will not describe the different steps Jeremy walks through (element-wise operations, broadcasting, Einstein summation…) to ‘remove Python’, let alone how PyTorch or einsum does it. You can check the notebook here:
The bottom line is a little sad:
‘The way to make python faster is to remove Python’
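To give a feeling for the starting point, here is a minimal sketch of a matmul written with nothing but Python loops. This is my own reconstruction of the idea, not the code from the fast.ai notebook; the function name `matmul` and the tiny example matrices are just assumptions for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows, using only Python loops."""
    n_rows, n_inner, n_cols = len(a), len(b), len(b[0])
    assert len(a[0]) == n_inner, "inner dimensions must match"
    result = [[0.0] * n_cols for _ in range(n_rows)]
    # Three nested interpreted loops: every multiply-add goes through the interpreter.
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(n_inner):
                result[i][j] += a[i][k] * b[k][j]
    return result


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
product = matmul(a, b)
print(product)
```

Each of the steps Jeremy walks through replaces one of these loops with a call into a library, which is exactly the ‘remove Python’ point.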
Then, in lesson 12, the last before the 2 fast.ai swift lessons, Jeremy picks it up and here he says it. Python is dead (1st line, 2nd column)!
Well, he actually only says ‘Python’s days are numbered’.
But think about how much fantastic work went into the Python deep learning ecosystem, and how many students and professionals learned that language!
For me, this was like Ned Stark’s Death in ‘Game of Thrones’ — really hard to accept! And who is Joffrey Baratheon in our case?
As you would expect in a moment of despair, a savior (in the form of Swift inventor Chris Lattner) is entering the stage.
In the short Q&A, I particularly liked the thought that Swift could help to break the barrier between ‘normal’ and ‘deep learning/differentiable’ programming, so as to integrate your models into your normal applications. As you will know, one way of seeing Deep Learning is that it is a different way of programming, where we show the computer examples and let it ‘figure out’ something like a ‘theory’ or ‘hypothesis’ (e.g. a Convolutional Network) of what e.g. a cat is, such that it can categorize whether a picture represents a cat or not. ‘Normal’ programming is about control flow like if-then or loops. So, isn’t it strange to mix ‘differentiable programming’ (see more on that below) into the core of a language?
No! Not more than it is ‘weird’ to mix functional and object-oriented programming — swift apparently anyways as a design principle just takes up all patterns that are useful and mixes them up in a sensible way. There is no dogmatic ‘pure’ functional or object-oriented programming style, hence — no issues bringing differentiable programming into the mix. In fact, on the contrary, it’s a good thing mixing it up, as developers can bring all advantages of different approaches into their apps.
How does Jeremy come to the dramatic conclusion that Python is dead? Experience, I would say, with attempts similar to what PyTorch does with its JIT compiler. The context of this statement is the description of an RNN / LSTM architecture (which doesn’t matter too much here) and, as part of that, how PyTorch uses a just-in-time compiler to transform Python to C++ and then compile it. Translating languages into each other is (I think we can agree here) not elegant at all, and probably points to a less than perfect design in your language stack. Jeremy lists this as one of the issues you will encounter at some point with Python in his blog post ‘https://www.fast.ai/2019/03/06/fastai-swift/’.
In the end, anything written in Python has to deal with one or more of the following: … Being converted into some different language (such as PyTorch using TorchScript, or TensorFlow using XLA), which means you’re not actually writing in the final target language, and have to deal with the mismatch between the language you think you’re writing, and the actual language that’s really being used (with at least the same debugging and profiling challenges of using a C library).
So, the argument is not that apparently ‘the current pytorch JIT has some bugs’, rather the approach is a symptom of some deeper flaws in the language design for using it with Deep Learning.
In the 1st Swift lesson, Jeremy gets more specific about why ‘python has to go’:
Python is nice… but:
Slow: forces things into external C libraries (see lesson 8!)
Concurrency: GIL forces more into external C libraries
Accelerators: forces more into CUDA, etc.
Again, these are not issues one can overcome with the next Release of some new library. These issues are rooted in the design and evolution of the language. In the accompanying slides, python is described as
Atoms: C code that implements Python object methods
Composition: Slow interpreter that combines C calls in interesting ways
Keep these python ‘Atoms’ in mind (one might say: it’s rather large ‘molecules’), when below we come to swift, which has much smaller ‘Atoms’.

Pytorch & Tensorflow ‘load issues’

‘Now, this is python. Okay, no good but that’s why we have frameworks like pytorch and tensorflow — so we are good !’
Nooo! it’s still no good! Check this out. Summarised:
When it comes to speed (not even mentioning the issue that you cannot debug these frameworks because they have these large, inaccessible ‘C-molecules’) the problem is the overhead, when you load your matrix calculations to the GPU.
‘PyTorch is like an airplane: You have to give it plenty of work to do to justify the time to drive to the airport, go through security, take off, land…’
Google, which wrote TensorFlow, has lots of data: for them, loading their huge calculations onto such a ship (see the picture below) to start the process is fine; for ‘normal’ people it is not. In other words, if you have Google-scale workloads, this model is fine. For a lot of what we do, though, this model isn’t feasible.
‘TensorFlow was designed around the idea of creating a call graph, then feeding it values’
And the recent TensorFlow evolution towards tf eager, to give it a PyTorch feel, also doesn’t cut it.
‘tf eager is largely syntax-sugar. It still needs a lot to do, to make it truly useful. As of April 2019, a small matrix multiply on GPU using tf eager takes 0.28ms, 10x longer than PyTorch.’
Key takeaway: how do we get these calculations onto the GPU? MLIR, Google's new compiler infrastructure for machine learning (which essentially does not exist yet) that is being built out to live behind Swift for TensorFlow, is meant to address this.
This is the infrastructure part; the @differentiable attribute, which we briefly introduce further down, will leverage it. It’s not there yet, but it’s addressing an issue the current frameworks have.
Even if there was no Google Infrastructure overhaul for tensorflow, swift has some nice things to offer.

Swift — cool stuff

Why Swift, why not some other compiled language, if Python isn't doing it? Turns out, Swift's design is helping with performance … a lot. For now, LLVM (mainly) targets CPUs, while ‘the future’, that is MLIR, will take this to GPUs.
Swift's design is a prerequisite for the performance it can achieve. Swift translates into LLVM's intermediate representation, and already this translation includes many optimizations. Chris Lattner speaks of an ‘infinitely hackable language’, by which he apparently means, among other things, that as a developer you have access to the primitives of the language:
Swift is syntactic sugar for LLVM!
Primitive operations are LLVM instructions
Composition: structs and classes:
String, Dictionary, but also Array, Int, and Float, are structs!
Rationale: forces the language features to be expressive and powerful
You can build things just like these, no barriers
This is a real difference to Python, but also to other languages like C++
Atoms: int, float, C arrays, pointers, …
Composition: Structs and classes: std::complex, std::string, std::vector, …
Actually, you can build your own Float Datatype in swift.
Remember, the Python ‘Atoms’ with regard to Deep Learning are actually C code. Obviously, this stops us from changing this layer or debugging it easily. The upside is, of course, the performance. But what if we could use the ‘infinitely hackable Swift LLVM stack’ and have quasi-C performance?
According to Jeremy, in some areas of numerical programming we can already achieve C-level performance with Swift. If you want to deep dive into Swift, especially with regard to performance, check Jeremy’s blog on using Swift for numerical computing.

Python Integration

Python has been used for decades and includes sophisticated libraries like scikit-learn or matplotlib. So, it’s crazy to move away from it, unless the performance is terrible and we can just keep on using the libraries we want!
So, to recap: you can use python within your swift notebook! How is that possible?
Essentially, Swift calls the Python interpreter to execute imported Python libraries like numpy. Contrary to what some might think, Python is not typeless: it has exactly one type, object. Swift maps its types onto that; you can find the corresponding class in the Swift GitHub repo.
Find example notebook here and an explanation here.
That’s all: not much more to say, but this is big, of course! Maybe it is a little evil to ‘attack’ Python by picking from it what is useful until Swift has rebuilt all these libraries and can leave Python ‘dead’ on the road. If I were the CEO of Python, I would put some Swift block in place immediately. Luckily, we are in the open source world!

C Integration

I guess there is not much need to argue in favor of a C integration! Personally, I frankly wouldn’t know which C libraries I should use (and I guess many Data Scientists would not know either). Jeremy gives OpenCV as an example; as there is so much fast C code written, no doubt there will be smart C usages. Find Jeremy’s examples on C integration here.
Really interesting for language geeks is Chris Lattner’s explanation of how the C integration works under the hood:
The best way to handle this is to write an entire C compiler and use that as a library in the Swift compiler … Unfortunately, that is hard and takes years to do. Fortunately, we already did it
Let Clang and LLVM do the heavy lifting:
Parse the header files, store them in binary modules, build ASTs
Generate code for all the weird C family features
As a programmer, use the C header file and you will get a generated swift interface. Swift parses and remaps C concepts into Swift, e.g. a double* turns into UnsafeMutablePointer<Double>.
I do think it’s worthwhile appreciating for a moment how cool & elegant this Integration actually is. It is actually an Integration on the level of the Intermediate Representation, so for e.g. speed towards the hardware this is entirely seamless and for the developer, this also looks pretty seamless (once you get over these C header files). This is not just a hack patching 2 languages together — show some respect ;-)

Swift Protocols

The Swift book gives some nice insights into the design principles of the language. One obvious advantage compared to Python (I just take it as a given that it’s an advantage) is that Swift is compiled, so as a developer you find errors early on when writing code. Compiled languages tend to have better error messages and better IDEs, as they can use the compiler’s analytical powers.
One of the design principles I like best is that Swift tries to be ‘not weird’, borrowing ideas from other languages but mashing these up in a consistent manner. Hence, Swift might feel familiar to you (it did to me when starting, and I hadn’t used it before), because you likely know some version of the concepts it uses from other languages, like interfaces from Java, or you might have seen something like Python monkey patching (check the funny explanation of the name here!).
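In case monkey patching is new to you, here is a small Python sketch of what it looks like and why it is a bit scary. The `Count` class and the `is_odd` helper are hypothetical names I made up for this illustration, and the last part relies on CPython's behavior of refusing attribute assignment on built-in types.

```python
class Count(int):
    """Hypothetical int subclass, only here so we have a class we are allowed to patch."""


def is_odd(self):
    return self % 2 == 1


# Monkey patching: bolt a new method onto an existing class at runtime.
Count.is_odd = is_odd
three_is_odd = Count(3).is_odd()

# Nothing checks the patch: any other module can silently replace it later.
Count.is_odd = lambda self: "surprise"
after_second_patch = Count(3).is_odd()

# The built-in int itself cannot be patched this way (CPython raises TypeError).
try:
    int.is_odd = is_odd
    builtin_patchable = True
except TypeError:
    builtin_patchable = False

print(three_is_odd, after_second_patch, builtin_patchable)
```

So the patch works, but only for one class, without any compiler checking it, and it can be overwritten without anyone noticing.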
Hey, maybe you just listen to the Swift Founder wrt its philosophy rather than getting this from me!
Let's get to one specific feature: Swift has something a lot safer and more principled than Python monkey patching, namely Swift protocols. They are also a lot more powerful, because you can easily add methods to all types that have a certain set of behaviors.
Check this out:
Here we extend isOdd to work on anything that is an integer. I am not aware of having seen this before, e.g. in Java or other languages. But hey, it’s just me, and even if some language already has this: Swift doesn’t pretend to always innovate. Again, don't take it from me (hopefully I drew your attention to it), have the experts talk:
If you are interested in an example usage for fast.ai, check Jeremy's usage for the Data Block API with Protocol Oriented Programming. Here Jeremy uses Swift’s C integration capabilities and Swift protocols to re-create the fast.ai Data Block API in combination with the aforementioned OpenCV C integration. Again, no point in repeating all the details here: have a look at the explanations in the video.


Check out Chris on this ‘Differentiable’ idea. This is big: the idea is to have the base mechanism of deep learning be a core function of the language. Auto-differentiation depends on the compiler infrastructure; find some explanations in the ‘S4tf Under the hood’ section below. The s4tf project gives more theoretical explanations of this concept; find some more recent slides in ‘Swift_for_Tensorflow.pdf’.
It is interesting to see that this research dates back half a century, and Chris Lattner somewhere mentions they borrow these ideas from Fortran (so much for innovation! ‘Just’ re-pick the right existing ideas, mash them up and you are halfway to genius, see Satoshi Nakamoto!). As with Neural Networks, we stand on the shoulders of giants. I expect this part to be a ‘moving target’ while Google changes its infrastructure towards MLIR, so check out these links and the MLIR links below, and do a little googling when you read this.
We will focus here on the notebook and lesson, which is referencing this fast.ai notebook. The notebook shows how someone coming from a PyTorch world might use Swift, and then shows where this can be done better. Find the section where Chris (Swift: structs & values / functional way) ‘does it better’ than Jeremy (Python: classes & references / stateful way) here. You should probably go through this a number of times; at least I had to, and I enjoyed it.
There are actually two things the swift way to auto-differentiate does differently compared to the ‘naive python guy learning swift’ way. One is about Value vs Reference Semantics, the 2nd about a functional way of programming, which allows the compiler to optimize and generalize better.
Let’s start with Value semantics in swift:
Value semantics:
a variable stands for (or “means”) its value
this is how math works
Swift structs work with Value Semantics
Reference semantics:
a variable stands for a location
this is what we’ve been forced to learn, usually when it bites us
Swift classes work with Reference Semantics
If you use reference semantics with Swift classes, there is a chance your tensor gets changed by other pieces of code that hold that reference. A typical way to mitigate this is to clone tensors manually at some point, which is slow and costly if you do it, and if you miss it, you get hard-to-debug bugs. Swift's value types, however, come with a ‘Copy On Write’ mechanism (built into standard types like Array) that essentially makes the copy for you when needed, but only then (so it’s memory efficient). There are more details here, e.g. around keywords like ‘inout’ (for function parameters) and ‘let’ (immutable constants) vs ‘var’ (mutable variables); find more info here.
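To make the aliasing problem concrete, here is a minimal sketch in Python, which only knows reference semantics for objects. The `Weights` class is a hypothetical stand-in for a tensor, and `copy.deepcopy` plays the role of the manual clone; none of this is taken from the notebook.

```python
import copy


class Weights:
    """Hypothetical stand-in for a tensor that is passed around by reference."""

    def __init__(self, values):
        self.values = list(values)


original = Weights([1.0, 2.0, 3.0])

# Reference semantics: 'alias' is the same object, not a copy.
alias = original
alias.values[0] = 99.0
seen_through_original = original.values[0]  # the change leaks into 'original'

# The manual mitigation: clone explicitly (and pay for the copy every time).
snapshot = copy.deepcopy(original)
snapshot.values[1] = -1.0
original_after_clone_edit = original.values[1]  # 'original' is untouched this time

print(seen_through_original, original_after_clone_edit)
```

Forget the `deepcopy` once and you are back to the first case, which is the kind of bug value semantics is meant to rule out.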
For our context the bottom line is:
Do not use class:
Use structs:
To understand the backprop mechanism referenced in the video, you should know what the chain rule is. If you have enough time, go through this math paper or find something similar. What I do not recommend: watching videos about backprop with graphics where some neurons fire. I have found this metaphor and illustration more confusing than going through the math once (and I am not a math guy!), because this gradient calculus is in reality stuff you can remember from school.
The 2nd ingredient is about not calculating things twice in the forward pass. The important part is the Chainer pattern, which essentially allows you to avoid recalculating tensors, because that can be quite costly, so we really need to optimize. Zoom in on Jeremy’s comparison of different programming styles for achieving this.
Note that Swift actually does all this autodiff for you with the @differentiable attribute, and you can do this on data types like Double or Float, not just on tensors as in PyTorch. The types only need to be differentiable, that is, continuous. So, an Integer is not differentiable.
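Here is how I picture that pattern, as a small Python sketch under my own assumptions: every function returns its value together with a closure for the backward pass, and the closure captures what the forward pass already computed, so nothing is calculated twice. The names `square`, `scale` and `scaled_square` are made up for this example and are not from the notebook.

```python
def square(x):
    """Return x^2 plus a closure that applies the chain rule for this step."""
    y = x * x

    def backward(dy):
        # d(x^2)/dx = 2x; 'x' is captured from the forward pass, not recomputed.
        return dy * 2 * x

    return y, backward


def scale(x, factor):
    """Return factor * x plus its backward closure."""
    y = x * factor

    def backward(dy):
        return dy * factor

    return y, backward


def scaled_square(x):
    """f(x) = (3x)^2, built by chaining the two steps above."""
    u, back_scale = scale(x, 3.0)
    y, back_square = square(u)

    def backward(dy=1.0):
        # Chain rule: walk the closures in reverse order of the forward pass.
        return back_scale(back_square(dy))

    return y, backward


value, pullback = scaled_square(2.0)
gradient = pullback()
print(value, gradient)
```

For f(x) = (3x)^2 the derivative is 18x, so at x = 2 both the value and the gradient come out as 36, which is a quick sanity check of the chaining.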
I leave you to it! Find another concise notebook for this here.

S4tf Under the hood

Now, I am walking onto thin ice. Well, I have walked there before, because these things are also pretty new to me. But compiler infrastructure even more so, and MLIR is a moving target, which very few people on this planet understand yet. I will hence give some simplifications, many links (where hopefully things get updated), and a kind of mental frame to understand this.
If you use s4tf now (June 2019), you are actually using (disclaimer: from my understanding) TF Eager below the surface, which is fine for all your prototyping and development in Swift now. Then, throughout the next year (maybe longer), Google should change the infrastructure and (in theory) you should just benefit from the performance gains without doing anything. If you are a researcher or engineer, you can go ahead today and cross your fingers that Google will catch up! This seems to be the idea at fast.ai: start now, while the advantages are not visible yet, and benefit when Google makes the promise come true.
Here is the planning as presented in May 2019 in the course.
Again, as of April 2019, this is rough planning and you might want to check this site for news when you read this post.
Let’s get XLA out of the way: It’s in an in-between state, an existing compiler. Nothing to say here, other than this is the current tensorflow and should not be the long term goal. Remember the quote from Jeremy above on using ‘other languages’ in which you write your code:
TensorFlow using XLA … means you’re not actually writing in the final target language, and have to deal with the mismatch between the language you think you’re writing, and the actual language that’s really being used…
In other words, XLA will be an intermediate state replacing the current TF eager mode (and remember the discussion above that this TF eager mode obviously also isn’t a good long term solution).
LLVM is built around an intermediate representation (LLVM IR).
While generating intermediate code, it will also do a lot of code optimizations.
Constant folding
Dead code elimination
Inlining
Arithmetic simplification
Loop hoisting
And yes, I had to look up loop hoisting. Feel free to google, if you are lost ;-)
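To save you one search: below is what constant folding and loop hoisting amount to, done by hand in Python. A compiler like LLVM applies such rewrites automatically on its intermediate code; here I just write the ‘before’ and ‘after’ as two functions. The function names and the example values are my own assumptions.

```python
import math


def scale_naive(values, angle):
    """Before: the loop body recomputes something that never changes."""
    out = []
    for v in values:
        factor = math.cos(angle) * (2 * 3)  # same result in every iteration
        out.append(v * factor)
    return out


def scale_hoisted(values, angle):
    """After: 2 * 3 is folded to 6, and the invariant is hoisted out of the loop."""
    factor = math.cos(angle) * 6  # computed once, before the loop
    return [v * factor for v in values]


naive_result = scale_naive([1.0, 2.0, 3.0], 0.5)
hoisted_result = scale_hoisted([1.0, 2.0, 3.0], 0.5)
print(naive_result == hoisted_result)
```

Both versions return the same numbers; the second one just calls `math.cos` once instead of once per element.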
There is lots more to say about LLVM. The goal for MLIR is apparently to kind of redo LLVM with lessons learned, and to have MLIR ready for auto-differentiation and accelerators like GPUs and TPUs, which Google also builds.
Check Chris’s interview for a general explanation, and find more on MLIR here.
What’s the basic problem we have with the afore-described @differentiable? My understanding of the problem in simple terms is
(1) we need to extract a graph for the backward prop (that chain rule above), that we can load to a GPU or TPU
(2) for ‘normal’ computing (hence, the normal control flow you write in Python or Swift) we use CPUs, while for numerical computing we need to use GPUs!
The question is how we translate a Python or Swift program into compiled code, such that the parts run in the right order on the right device and get merged together correctly. This is possible because compilers can actually inspect the code, see ‘where the differentiable part kicks in’, and then transform it towards the hardware and orchestrate it with the ‘normal code’.
Check this against this article. The assumption, or the goal we have here, is that Swift should offer the developer a unified interface level. Taken from the above article:
you write normal imperative Swift code against a normal Tensor API. You can use (or build) arbitrary high level abstractions without a performance hit, so long as you stick with the static side of Swift: tuples, structs, functions, non-escaping closures, generics, and the like. If you intermix tensor operations with host code, the compiler will generate copies back and forth. Likewise, you’re welcome to use classes, existentials, and other dynamic language features but they will cause copies to/from the host. When an implicit copy of tensor data happens, the compiler will remind you about it with a compiler warning.
As said, MLIR is a moving target, and of course I don’t claim to fully grasp it. So let me just give you two more remarks from Jeremy and Chris to dwell on.
Tensor Comprehensions: around minute 13.50 here, Jeremy compares Deep Learning to a database optimizer: with SQL, you tell the database what you want, not how to get there, and we will do a similar thing with Deep Learning.
And I find this remark from Chris Lattner intriguing:
‘Tensorflow is fundamentally a Compiler. It takes Models and then makes them go fast on hardware … it has a front end, optimizer and it has many backends … if you look at it in a particular way, it is a Compiler’
So, I am still scratching my head over both ;-)
The bottom line for me here is that we are integrating a new way of writing programs right into the heart of the language — just like we also mix up object-oriented and functional programming. The only difference is: @differentiable takes a lot of compiler techniques and infrastructure to make it happen from the core of the language. MLIR to a large extent is ‘just’ to make that happen.


These classes are free and amazing. Thanks to fast.ai and Google for this.


