Here's What Google Knows About You and How to Download It

Written by David_Balaban | Published 2020/02/21
Tech Story Tags: google-takeout | google | privacy-top-story | download-your-google-data | google-mail-bot-privacy | google-chats-data-privacy | google-maps-data-privacy | google-chrome-data-privacy


As you know, Google stores a large amount of data about its users. After years of pressure and criticism, Google created a mechanism that allows you to dump all the data it collects about you. This service is called Takeout. Let’s see what Google knows about us and agrees to share.

How to download your data stored by Google

The reasons for downloading this data vary. For example, you may want to migrate to another service and take your data with you. Or maybe you want to build a life-logging or Quantified Self analytics system, and the data collected by Google will help with that. Or the country you live in may suddenly decide to fence off the Internet with a new great firewall that keeps Google out.
So, to get your data, go to takeout.google.com, check the boxes next to the services you are interested in, and wait a bit. When I requested the full archive, the link arrived within a day, about 15 hours later.
In its email, the Google mail bot cheerfully announced that the data, collected from 35 services, occupies 49 GB and is divided into three archives.
If you do not want to download that much data, you can request partial archives that cover only some of the services. For example, if you turn off Photos, YouTube, Gmail, and Google Drive, the archive shrinks to a few hundred megabytes.
By the way, the smaller the archive, the faster the download link appears.

Google search

Let's start with one of the most entertaining things - search query history. It is located in the Searches folder and is divided into files, each covering three months, for example, 2007-01-01 January 2007 to March 2007.json. If you open one of them, you will see that the information about each request consists of only two things: time (in Unix format) and the search string.
You can use an online converter to translate the time. For fun, I tried to look for occurrences of certain strings with grep. Since the data is stored in JSON, it first has to be flattened into lines that grep can match - I used the gron utility for this.
If you have gron, you can write something like this:
$ for F in *; do cat "${F}" | gron | grep "Tripwire"; done
And you will see all your search queries containing the word Tripwire. What other keywords can you try? Well, for example, the word Download. Here's a fun idea: if you search for the @ symbol, you will find all the email addresses and Twitter accounts that you have entered into the Google search field.
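If gron is not at hand, a similar search can be sketched in Python. This is only an illustration: the function names `iter_strings` and `find_queries` are made up for this example, and the code deliberately assumes nothing about the layout of the JSON files beyond them being valid JSON. Unlike plain grep, it matches case-insensitively.

```python
import json
from pathlib import Path


def iter_strings(node):
    """Yield every string value found anywhere inside parsed JSON."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)


def find_queries(folder, keyword):
    """Return (file name, string) pairs where the string contains keyword.

    Hypothetical helper: it walks every *.json file in the folder and does
    not rely on any particular key names inside the files.
    """
    needle = keyword.lower()
    hits = []
    for path in sorted(Path(folder).glob("*.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        for text in iter_strings(data):
            if needle in text.lower():
                hits.append((path.name, text))
    return hits
```

Calling `find_queries("Searches", "@")` would then list every stored string containing an @ sign, together with the file it came from.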
Note that image and video searches are not included here, but you can find them in the My Activity folder.
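As for the Unix timestamps mentioned earlier, they can also be converted locally instead of with an online converter. The sketch below is an assumption-laden helper (`unix_to_iso` is a name invented here): the article only says the time is in Unix format, so the unit (seconds, milliseconds, or microseconds) has to be checked against your own files and passed in explicitly.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Microseconds per unit. Which unit your export uses is an assumption
# you need to verify by looking at the numbers in the files.
_MICROSECONDS = {"s": 1_000_000, "ms": 1_000, "us": 1}


def unix_to_iso(timestamp, unit="s"):
    """Convert a Unix timestamp (int or numeric string) to an ISO 8601 UTC string."""
    delta = timedelta(microseconds=int(timestamp) * _MICROSECONDS[unit])
    return (EPOCH + delta).isoformat()


print(unix_to_iso(1167609600))                    # 2007-01-01T00:00:00+00:00
print(unix_to_iso("1167609600000000", unit="us")) # same moment, in microseconds
```

Integer arithmetic is used on purpose, so that long microsecond values are not rounded by floating-point division.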

Google Chats

Perhaps you already have a folder with old AIM logs somewhere and you would like to add to it everything you’ve ever written through Google Talk and Hangouts. This is quite realistic, but, unfortunately, it is almost impossible to read the correspondence in the form in which it comes from Takeout.
All text is exported as a single JSON file plus the attached pictures. All this is located in the Hangouts folder.
There are no problems with the pictures, but the JSON contains about two dozen lines of metadata for each message. Perhaps the main headache is that instead of the sender’s name, you get only a user ID.
Probably the simplest thing you can do is throw out everything but text. At least you can see some, albeit impersonal, conversations.
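Throwing out everything but the text can be sketched as follows. The key name `text` is an assumption about where the export keeps message bodies, and `collect_text` is a hypothetical helper, so check both against your own file before relying on the result.

```python
import json


def collect_text(node, key="text"):
    """Return all string values stored under `key`, in document order.

    Assumption: message bodies sit under a key called "text" somewhere in
    the nested structure. Everything else (metadata, IDs) is ignored.
    """
    found = []
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key and isinstance(value, str):
                found.append(value)
            else:
                found.extend(collect_text(value, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_text(item, key))
    return found


# Usage sketch; the file name below is a placeholder for the single JSON
# file found in the Hangouts folder:
# with open("Hangouts/Hangouts.json", encoding="utf-8") as handle:
#     for line in collect_text(json.load(handle)):
#         print(line)
```

The output is exactly the impersonal stream of messages described above: readable, but with no indication of who said what.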

Google Maps

Let's start with a simple one: the MyMaps folder. These are the routes you created on Google Maps - one KMZ file per route.
KMZ is a Google Earth format that is also supported by other mapping applications. In fact, it is a ZIP archive containing a KML file, which is valid XML. If, for some reason, this format is not suitable for your needs, you can use the GeoConverter service and convert it, for example, to GeoJSON, which is a bit easier to work with.
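Because a KMZ is just a ZIP archive with KML inside, it can be opened with the Python standard library alone. The sketch below assumes the KML uses the common KML 2.2 namespace and that the first `.kml` entry in the archive is the one you want; `read_placemarks` is a name made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# Assumption: the file declares the standard KML 2.2 namespace.
KML_NS = {"kml": "http://www.opengis.net/kml/2.2"}


def read_placemarks(kmz_source):
    """Return (name, coordinates) for every Placemark in a KMZ.

    kmz_source may be a path or a file-like object.
    """
    with zipfile.ZipFile(kmz_source) as archive:
        # Take the first KML entry found in the archive.
        kml_name = next(n for n in archive.namelist() if n.lower().endswith(".kml"))
        root = ET.fromstring(archive.read(kml_name))

    placemarks = []
    for mark in root.iterfind(".//kml:Placemark", KML_NS):
        name = mark.findtext("kml:name", default="", namespaces=KML_NS)
        coords = mark.findtext(".//kml:coordinates", default="", namespaces=KML_NS)
        placemarks.append((name.strip(), coords.strip()))
    return placemarks
```

KML stores coordinates as longitude,latitude,altitude, so the returned strings are in that order.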
The Maps (your places) folder contains one file - Saved Places. It contains all your bookmarks from Google Maps in the form of another intricate structure. Each of the bookmarks is an element of the Features array that has a title, date added, date modified, and a link to Google Maps.
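Pulling the bookmark titles and links out of that structure might look like the sketch below. The key names `features`, `properties`, `Title`, and `Google Maps URL` are assumptions about the file layout rather than something the article confirms, and `list_saved_places` is a hypothetical helper.

```python
import json


def list_saved_places(geojson_text):
    """Return (title, link) pairs from a Saved Places export.

    Assumed layout: {"features": [{"properties": {"Title": ..., "Google Maps URL": ...}}]}.
    Missing keys are tolerated and yield empty strings.
    """
    data = json.loads(geojson_text)
    places = []
    for feature in data.get("features", []):
        props = feature.get("properties", {})
        places.append((props.get("Title", ""), props.get("Google Maps URL", "")))
    return places
```

If the keys in your export differ, only the two `props.get(...)` calls need to change.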
Finally, the most entertaining folder is Location History - a file with the entire history of your movements with a mobile phone in your pocket. My data took 5 MB of storage. This file is very simple, especially in comparison with other archives.
What can you do with this file?
For example, you can practice analyzing it with Python. There is also specialized software like Location History Visualizer (costs $69). Google also has a utility called Timeline, attached to Google Maps, where the accumulated data can be viewed day by day. There, in addition to the bare data that is issued through Takeout, you can find different kinds of analytics. Google, for example, identifies the names of places you have visited and distinguishes trips by motorcycle, car, and bicycle.
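A first exercise in Python could be counting how many location points were recorded on each day. The sketch below assumes a layout with a top-level `locations` list whose entries carry a `timestampMs` field in milliseconds. Those key names are an assumption, not something the article states, so adjust them to whatever your file actually contains.

```python
import json
from collections import Counter
from datetime import datetime, timezone


def points_per_day(history):
    """Count recorded points per UTC calendar day.

    Assumed layout: {"locations": [{"timestampMs": "1167609600000", ...}, ...]}.
    """
    counts = Counter()
    for point in history.get("locations", []):
        seconds = int(point["timestampMs"]) / 1000
        day = datetime.fromtimestamp(seconds, tz=timezone.utc).date().isoformat()
        counts[day] += 1
    return counts


# Usage sketch; the path is a placeholder for the Location History file:
# with open("Location History/Location History.json", encoding="utf-8") as handle:
#     for day, count in sorted(points_per_day(json.load(handle)).items()):
#         print(day, count)
```

Days are computed in UTC here, so points recorded around midnight local time may land on a neighboring date.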

Google Chrome

This is an extremely interesting folder that contains all cloud data of Google Chrome (or maybe not all - you can never be sure). Here is what can be found inside:
- Bookmarks - the contents of bookmarks in the form of an HTML list. It is not difficult to parse: grab the data from the a href attributes and divide it into sections according to the contents of the h3 tags.
- Dictionary - presumably this is where spellchecking exceptions or additional vocabulary would be stored; I don’t have any.
- Extensions - data on installed extensions.
- SearchEngines - data about additional search providers. If you ever need the rules for compiling search queries for different search providers, this file will come in handy.
- SyncSettings - Chrome settings.
- Autofill - in theory, there should be data for automatically filling out forms, but I found an empty array. It seems that if necessary, this data is easier to pull out of Chrome itself.
- BrowserHistory - to be honest, I expected to see a huge storehouse of personal information in this folder - a complete list of all the sites I have ever opened in Chrome. However, I was disappointed. This file lists only the twelve links that I opened in mobile Chrome when I installed it on my iPhone for a test. The desktop version, meanwhile, has an impressive list of sites, and its Sync Everything checkbox is turned on. This looks like a Takeout glitch.
If you are more fortunate with the last document, then you can easily analyze the history of your Internet movements. Google stores:
1. The way you entered the site (following the LINK or directly – TYPED.)
2. The page title.
3. URL.
4. Client ID (useful to understand whether it was your desktop or the phone.)
5. Time in Unix format.
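Going back to the Bookmarks file, the parsing approach described above (take each a href and group by the preceding h3) can be sketched with the standard `html.parser` module. `BookmarkParser` is a name invented for this example, and the sketch simply attaches each link to the most recent h3, so it does not track nested folders closing.

```python
from html.parser import HTMLParser


class BookmarkParser(HTMLParser):
    """Collect (section, url, title) tuples from a bookmarks HTML export."""

    def __init__(self):
        super().__init__()
        self.section = ""    # text of the most recent <h3>
        self.bookmarks = []  # collected (section, url, title) tuples
        self._in_h3 = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "h3":
            self._in_h3 = True
            self._text = []
        elif tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_h3 or self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "h3" and self._in_h3:
            self.section = "".join(self._text).strip()
            self._in_h3 = False
        elif tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            self.bookmarks.append((self.section, self._href, title))
            self._href = None
```

To use it, create a `BookmarkParser`, pass the file contents to `feed()`, call `close()`, and read the `bookmarks` attribute.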

My Activity

Perhaps this is the most interesting folder, perhaps even more interesting than the search history. Its content provides an answer to the question of how exactly Google monitors users. Studying the folders, you can see that Google keeps data on every:
- Visit to a site affiliated with Google Adwords.
- Book opened with Google Books.
- Website you accessed through Chrome.
- Used API (Developers folder.)
- Quote opened in Google Finance.
- Request to Goggles (searching objects in the picture.)
- Visit to a page in the Google Play Store.
- Visit to Google help services (Help folder.)
- Image Search and clicks on the image links.
- Object viewed using a map (Maps.)
- Search in Google News and articles read on the source site.
- Search query and link clicks (Search folder.)
- Product search or purchase in the store (Shopping folder.)
- Video searched and clicked on from search results (Video Search.)
- Voice search (Voice and Audio folder.)
I want to note that the Shopping section wasn’t good at all. After so many years, Google had tracked barely a couple of real purchases. Nevertheless, the array of information is very impressive. The MP3 files in the Voice and Audio folder are especially astonishing. You can listen to your own voice saying phrases like “Okay, Google...”

WWW

You can view and filter the same information on Google My Activity website. There you can delete specific entries and disable tracking of specific activities.
At the same time, the format in which all this is presented in Takeout leaves much to be desired. It is again HTML with not the most digestible markup and about 150 kilobytes of Material Design boilerplate in each file.

Other Google services collecting user data

It is hard to cover the data from four dozen products in detail. Let’s briefly go over some remaining services.
- Calendar
Custom calendars from Google Calendar in iCalendar (.ics) format, which is supported by many programs (including Outlook, Apple Calendar, and Thunderbird.) So, they can be imported easily.
- Photos
If you are actively using Photos, then this folder will have a huge list of subdirectories - one for each day. The good news is that all photos are stored and can be downloaded in their original format, even if it's a huge RAW file. Each snapshot comes with JSON with metadata.
- YouTube
Here you will find all the videos that have ever been uploaded to YouTube. Just like photographs, in their original formats. And of course, JSON with metadata is attached to each video. There are also playlists, subscriptions, browsing and searching history in HTML and even comments - also in HTML.
- Classic Sites
These are sites created using the not very popular Google Sites service. I created a couple of test sites. They still exist and can be exported via Takeout. It’s better not to look inside the site pages - you will see huge amounts of optimized JS.
- Google Drive
Here you can find all documents from Google Drive. Text is exported as DOCX, tables as XLSX, comments as HTML with the same name as the document. It is convenient that the file names correspond to the headers, and the folder structure is also preserved.
- Google Pay
Data from this service is divided into two folders. Google Pay Send - a list of transactions made through Google Pay (I have an empty CSV file) and Google Pay rewards, gift cards, offers (I have an empty PDF.)
- Mail
A complete archive of Gmail messages. There is only one file in the folder - All mail Including Spam and Trash.mbox. Its title speaks for itself. The format is Mbox, that is, in fact, a huge text file, where all messages with headers go in a row. It is not difficult to import it into most mail clients, but it will be much easier to connect them directly to Gmail and download only the necessary folders. Takeout here is useful only if you want to get just everything as an archive and save for the future.
- Google My Business
In my case, there is nothing interesting here at all. The JSON consists of literally three lines: account number, first name, last name, and a note that it is a personal account.
- Contacts
Contacts are divided into folders sorted by groups; in addition, there is a folder called All Contacts, which combines (and duplicates) the contents of other folders. vCard files and user photos are inside each folder. It is funny that user photos are displayed in their original form. That is, if someone placed his face in a circle and decided that no one else would ever see the rest, then he was mistaken.
- G Suite Marketplace
Here you can find apps that can be integrated with G Suite. I have nothing here, just one file - readme.
- Tasks
The Google Tasks service is designed to maintain to-do lists. You can find it, for example, in Gmail and Calendar. Your data is presented in the form of a sophisticated structure in JSON.
- Google Play Books
I was excited when I saw subdirectories with the names of books in this folder. I thought I could get copies of books bought in Play Books. Haha, of course not! Instead, in each of the subdirectories there is an HTML file with the book title, the name of the author, the time the book was last opened, the internal ID of the Play Books, and the link. The benefit to the user is dubious, but if there are a lot of books, you can pull out the list of all titles and author names.
There are still a few products/archives that have nothing inside due to the fact that I have never used them. These are Classroom, Fit, Groups, Google+, Play Music, Hangouts on Air, Search Contributions, Keep, and Voice.
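Of the archives above, the Mail one is easy to explore without importing it into a mail client, because Python ships a `mailbox` module that reads the Mbox format. The sketch below counts messages per sender; `top_senders` is a hypothetical helper, and on a multi-gigabyte archive the first pass may take a while.

```python
import mailbox
from collections import Counter


def top_senders(mbox_path, limit=10):
    """Return the most frequent From headers in an mbox file."""
    counts = Counter()
    # create=False avoids silently creating an empty file on a wrong path.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for message in box:
            counts[str(message.get("From", "(unknown)"))] += 1
    finally:
        box.close()
    return counts.most_common(limit)


# Usage sketch, with the file name mentioned in the Mail section:
# for sender, count in top_senders("Mail/All mail Including Spam and Trash.mbox"):
#     print(count, sender)
```

The same loop can be extended to look at dates or subjects, since each message behaves like a standard `email.message.Message` object.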

Conclusion

It turns out that the most important archives, in addition to Mail, Photos, and Documents, are Searches, Location History, Browser History, and My Activity folders.
Does Takeout allow us to get all our data? Unlikely. Old discontinued services are missing: there is no Google Reader data and no Wave data. Reading the news about Google, I can hardly believe that all this data was destroyed. Most probably it was moved to cold storage.
You can find some other gaps and blank spaces in Takeout if you wish. In addition to the lack of data from the desktop version of Chrome, there are also less obvious things. For example, a page for viewing My Activity contains not only search queries, but also a link to the place where you made them.
There is nothing similar in the Takeout folder of My Activity. In addition, many people use several Google accounts, a Windows VPN, or other private networks and switch between them regularly. As a result, search results will differ, and the same websites will come up under multiple accounts.
However, despite some shortcomings, we can thank Google: it is rare for a service's developers to put so much effort into improving data portability and making data collection more transparent.
As a result, Takeout may be useful for conducting an analysis.

Published by HackerNoon on 2020/02/21