This post is about Machine Learning and data labeling. It takes you on a tour of some of the most interesting techniques, hiding right under our noses, that companies such as Google, LinkedIn or Facebook use to have users label their data. It is a tribute to the immense creativity of the people behind those practices, and a lesson for those who are swamped in data and struggling to make sense of it. It really is wonderful to watch how, through convenience for the user and rigorous statistical study, tech companies find ways to create value for the customer, the user and especially themselves.
When I began gathering information I wasn’t expecting so many contributions to the project, but I think each of them added great value to the document you’re reading now. This is an ongoing effort, because data engineers and human sciences experts keep making advances in brain hacking. This first version comprises a total of 10 entries (intro + 9 techniques), available in the links at the end of this article. Each contains a technique for crowd data labeling and examples of how you can build your own versions.
This first post serves as an explanation of the context of this study and the nature of the data that is being annotated as we use the services offered by tech companies. With that said, let’s get going.
In the vast majority of cases in the attention economy, the user and the customer are separate stakeholders. It can all be traced back to the price you pay for using Facebook, Google’s Search and Google’s Suite, Apple’s iOS, YouTube, Amazon shopping, LinkedIn, Musical.ly, Instagram, Pinterest, Reddit, Snapchat… which is of course ZERO.
Think about it: you get a world-class email service with all the features you could possibly imagine, almost 100% reliability and 15GB of free storage right off the bat, and Gmail doesn’t cost you a dime. You can endlessly stream any music genre, watch shows, tutorials and vlogs, and consume unlimited content on YouTube while paying nothing. You can create a professional profile, find a better-paying job, create valuable leads, network professionally and increase your income on LinkedIn, and the pattern is the same: nothing comes out of your pocket. You can stay connected with the people you care about, find out about and host events, create a page for your own business, build a brand, manage a following and so much more, for free, on Facebook.
It’s common knowledge that these companies make money with ads, but how exactly do they transform what we users do on their platforms into real money?
To grasp this, let’s decompose the essence of a business, so we can identify its key aspects:
Now, to illustrate, let’s compare the business of a traditional shoe maker with that of Facebook.
So there you have it. When it comes to free online services, we users are not the customers; we and our behavior are the product.
Typically we think of data as pictures, videos, chat conversations or tweets, but in reality that barely scratches the surface. The apps installed on our phones and the extensions in our browsers are capable of tracking our every action in real time. The number of seconds spent (presumably looking) at a certain screen on a phone, the number of times the word “awesome” or “mom” is used in a messaging app, entire search histories, comments, likes, shares, hearts, pokes (remember those?), hashtags, and the pictures viewed and for how long are all easily extracted from the user’s digital footprint and mashed together. We’re going to call this kind of data Behavioral Data, as opposed to the pictures, videos and conversations, which we’ll call Explicit Data.
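To make the idea of Behavioral Data a bit more concrete, here is a minimal Python sketch of how a raw stream of interaction events could be rolled up into a per-user profile. Everything in it is an assumption made for illustration: the `Event` fields, the action names and the `behavioral_profile` helper are hypothetical and do not describe any real platform's schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One logged interaction (hypothetical schema, for illustration only)."""
    user_id: str
    action: str            # e.g. "view", "like", "share"
    seconds: float = 0.0   # time spent; only meaningful for "view" events


def behavioral_profile(events):
    """Aggregate raw events into per-user action counts and total dwell time."""
    profiles = defaultdict(
        lambda: {"view": 0, "like": 0, "share": 0, "dwell_seconds": 0.0}
    )
    for event in events:
        profile = profiles[event.user_id]
        # Count the action (unknown action names start from zero).
        profile[event.action] = profile.get(event.action, 0) + 1
        profile["dwell_seconds"] += event.seconds
    return dict(profiles)


events = [
    Event("ana", "view", 12.5),
    Event("ana", "like"),
    Event("ben", "view", 3.0),
    Event("ana", "view", 7.5),
    Event("ben", "share"),
]
profiles = behavioral_profile(events)
```

Note that none of these events contains a picture, a video or a message. The profile is built purely from what each user did and for how long, which is the distinction between Behavioral Data and Explicit Data drawn above.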
Based on this definition we can take a new perspective on the data that tech companies collect about us: Behavioral Data describes how users use the tools available to them on the platform. Therefore, by programmatically introducing alterations to those tools, service providers can design systematic ways to capture behavioral patterns that reflect how users think and act upon certain cues. By doing this, service providers can transform terabytes’ worth of user interaction data into actionable insights and automated content curation engines that increase engagement time and, ultimately, the bottom line. For the service providers it is a no-brainer to double down on A/B testing and on analytics that use Machine Learning to cluster users together based on behavior. In other words, with every click, users willingly tell the tech companies how they think and how their brains can be hacked.
Now that we’ve discussed the basics of what type of data we’re talking about, and why it is relevant for companies to label it, let’s dive into the techniques for getting users to label it for you.
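As a rough illustration of what clustering users based on behavior can look like, here is a small, dependency-free k-means sketch in Python. It is an assumption on my part that k-means stands in for the clustering step; the post does not say which algorithm any company actually uses. The feature pairs (likes per day, minutes per day) are made-up numbers, and the `kmeans` helper is hypothetical.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means over equal-length numeric tuples.

    Centroids start at the first k points, so the result is deterministic.
    """
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return labels, centroids


# Hypothetical per-user features: (likes per day, minutes per day).
users = [(1, 5), (30, 120), (2, 8), (28, 110), (0, 4), (35, 130)]
labels, centroids = kmeans(users, k=2)
```

With this toy data the light users and the heavy users end up in separate groups. In practice the features would need to be scaled first, since here minutes per day dominates the distance, and the number of clusters would have to be chosen rather than assumed.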
Introduction | 1. P2P Connection Schemes | 2. Voting Systems | 3. Content Categorization | 4. Viewership | 5. Following on Search Results | 6. Autocomplete | 7. Straightforward Asking | 8. Correction of Human Mistakes | 9. Data Labeling Tools
I normally write here so feel free to take a look and clap like the world is ending (so you can teach Medium to show you more of… well, me.)