How do you recognise your friends, or tell them apart? You could point out features like brown hair or a square jaw, but that only narrows things down a little; such features are hardly unique to any one person. Yet somehow you just know, even if your friends have changed their hair colour or are making a funny face, because you've accumulated enough experience to recognise them anyway.
This sort of experience-based reasoning has hitherto been beyond the capability of computers, but recent advances in the field of Computer Vision are rapidly closing the gap between the performance of humans and software on a variety of visual tasks — and challenging long-held assumptions about what computers can or cannot do. Over just the last few years, researchers have created Artificial Intelligences (AIs) that are able to recognise people, objects, and situations in a picture as accurately as humans can. And perhaps even more surprising: These AIs are not “hard-wired” with these capabilities, but instead learn from examples provided to them by their human creators.
Facebook can now automatically recognise and tag you or your friends in photos. Microsoft’s OneDrive can generate tags such as #hands or #portraits based on the pictures you take with your phone. FaceApp can modify pictures of people to change facial expressions — or even the gender of the photo subject! At Object AI, we’re creating a smart photo editor for e-commerce photos and digital publishing, which auto-detects the products in your photos and performs the time-consuming parts of photo editing automatically. All of these services rely on AI software called Artificial Neural Networks (ANNs).
Artificial neural networks (frequently shortened to just “Neural Networks”) attempt to model the way that neurons work in the human brain. Research into neural networks goes back at least 75 years, but the computational power of early neural networks was severely constrained by the limitations of the available computing hardware. As recently as a decade ago, neural networks were considered by many to be a theoretical novelty without useful application to real-world problems.
Enter a specialised piece of computer hardware called a Graphics Processing Unit (GPU). GPUs were created to provide high-speed 3-D computer graphics for video games; they excel at performing hundreds or thousands of simple mathematical calculations simultaneously. Neural network researchers realised that the simple, high-speed calculations performed by GPUs are also a perfect fit for the computational needs of a neural network. Finally, neural networks had access to sufficient computing power!
This sudden increase in computing power meant that neural networks could consist of many more neurons than was previously feasible. More neurons means more capacity to learn, especially if those neurons are grouped into flat “layers” that are then stacked and connected together in sequence. Researchers discovered that these “deeper” neural networks, consisting of many stacked layers, were much more powerful than the earlier “shallow” neural networks: Earlier layers learn to detect simple concepts in the input data, while later layers learn to combine these simple concepts into more complex concepts. Deep Neural Networks (DNNs) were born.
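For readers who like to see things concretely, here is a minimal sketch of what “stacking layers” looks like in code. It uses Python with the Keras library purely as an illustration; the layer sizes and counts are arbitrary choices for this example, not a recipe taken from any of the products mentioned above.

```python
# A minimal sketch of a "deep" network: several layers stacked in sequence.
# (Illustrative only; the layer sizes and counts are arbitrary.)
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(784,)),              # e.g. a 28x28 image flattened into 784 numbers
    layers.Dense(128, activation="relu"),   # earlier layers: simple patterns in the raw input
    layers.Dense(64, activation="relu"),    # middle layers: combinations of those patterns
    layers.Dense(32, activation="relu"),    # later layers: more abstract concepts
    layers.Dense(10, activation="softmax"), # output layer: one score per possible answer
])
model.summary()  # prints the stack of layers and how many settings (weights) each has
```

Each extra layer in the stack gives the network another chance to combine what the previous layer detected, which is exactly the “deeper is more powerful” effect described above.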
Using a method called “Deep Learning”, a deep neural network is trained on a large amount of training data to generate appropriate outputs in response to the given inputs. If a DNN has been trained well, it is able to “generalise”: when given a new input that it hasn’t seen before, it can deliver an appropriate output based on its previous experience. If that sounds rather vague (What are the inputs? What are the outputs?), it’s because DNNs can be applied to almost any task that requires experience to make a judgement.
For example: let’s say we wanted to create a computer vision AI that could tell whether someone in a picture is smiling or frowning. A deep neural network could be given ten thousand pictures of people smiling and ten thousand pictures of people frowning; that would be the training data. During the training process, the DNN would run through all the images in the training data thousands of times; each time, it would give its best prediction as to whether the image shows a smile or a frown. When the DNN’s prediction is correct for an image, its neuron settings barely change; when the prediction is incorrect, the settings are adjusted slightly throughout the DNN, nudging it towards a different (hopefully more correct) prediction the next time. The ideal end result is that by the end of training, the neuron settings yield the correct answer for all the training images, and also generalise well to new, “unseen” images.
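As a rough illustration of that training loop, here is a hedged sketch in Python with Keras. The array names (`images`, `labels`), the network shape, and the number of training passes are assumptions made up for this example, and random placeholder data stands in for the twenty thousand photos described above; no real product’s training code is being shown here.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Placeholder stand-ins for the training data described above
# (the real thing would be twenty thousand labelled photos).
images = np.random.rand(2000, 64, 64, 3).astype("float32")  # 64x64 colour pictures
labels = np.random.randint(0, 2, size=(2000,))               # 1 = smiling, 0 = frowning

# A small convolutional network: early layers pick up simple visual patterns,
# later layers combine them into evidence for "smile" versus "frown".
model = keras.Sequential([
    keras.Input(shape=(64, 64, 3)),
    layers.Conv2D(16, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # output: estimated probability of a smile
])

# Training: the network repeatedly predicts, compares its prediction with the
# true label, and nudges its neuron settings (weights) to do better next time.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(images, labels, epochs=5, batch_size=64, validation_split=0.1)

# Generalisation check: a prediction on an image the network has never seen.
new_image = np.random.rand(1, 64, 64, 3).astype("float32")
print("Estimated probability of a smile:", float(model.predict(new_image)[0, 0]))
```

With real photos in place of the random placeholders, the held-back `validation_split` portion plays the role of the “unseen” images: it measures how well the learned settings generalise rather than merely memorise.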
So how does this relate to an AI recognising people? Put simply, a deep neural network for computer vision can be trained to detect relevant “features” in a photo, which in this case means actual facial features: eyes, noses, eyebrows, mouths, cheeks, chins… Given enough neurons and enough training photos, a DNN can learn to pick up on the common features and characteristics that make a person look the way they do, with an uncanny level of accuracy. Given enough pictures of you or your friends (that would be the training data), an advanced AI can then make a “guess” at who is depicted in a new picture. If you tag that picture manually, it gets added to the training data, and the AI gets even smarter for its next prediction.
If this sounds to you like little more than a social media gimmick, think again; this software could mean the difference between life and death. Because DNNs can be trained to recognise so many different inputs (and take action accordingly), they’re finding application in preventing accidents on roads and in the air. For instance, Drive.ai is integrating DNN-based image analysis into driverless cars, in order to quickly recognise pedestrians or other hazards. Iris Automation is building image-based hazard avoidance systems for drones, which will help drones avoid people and obstacles, even in the hands of a complete novice.
There is a seemingly endless array of new challenges in the field of computer vision and deep learning that researchers are only just starting to explore. Will Knight, writing in the MIT Technology Review, stated that one of the coming tasks for computer vision and artificial intelligence will be to analyse an image and understand what is happening in a scene, rather than just what it contains: Facebook can tell you if your image contains pizza, but is the pizza being made, cooked, or eaten? Is it in an oven, on a plate, or in someone’s hand? Being able to answer those questions would represent a massive leap in terms of computer “understanding”.
So if you have any interest in applications of AI that you’ve thus far seen only in science fiction, keep your eye on the field of computer vision over the next few years. After all, your computer is already keeping its eye on you!
Thanks for taking the time to read this. If you liked this article, click the ♡ below so other people will see it here on Medium.
Check out our A.I.-enabled smart photo editor, follow Object AI on Twitter, or Like Object AI on Facebook!