Trump Said WHAT?!?! (Parsing and Searching JSON data)

by Ethan Jarrell, December 16th, 2017

The ability to parse and search through JSON data can be incredibly powerful, regardless of what industry you’re in. In this example, I’ll go over the following concepts related to JSON data:

  1. Creating a JSON file.
  2. Saving a JSON file online in a Mongo Database.
  3. Retrieving the data through a fetch API call.
  4. Searching through the data to find items that match specific search terms.
  5. Returning matching data.
  6. Creating simple data visualizations based on returned data.

What we want to do is this: given a collection of documents, search through all of them and find out whether a certain word or phrase occurs in any of them. If it does occur in a document, we want to count the number of times it occurs there and return the count.

The reason this is such a useful tool: imagine you had a collection of speeches given by politicians. You could quickly sort through them and see how often they say certain words. Or even better, we could search for multiple words at the same time and see which words appear most often.

We could do the same thing with any type of document or collection of documents, which makes this a very useful and powerful tool. Okay, so let’s get started. In this example, just to explain the concepts, I’m going to use some JSON data I’ve organized, containing several of Donald Trump’s recent speeches. Then we can search it and find how often certain topics occur in his speeches.

I have a working prototype finished that shows my basic end goal. You can check it out here:

https://trumpspeechdata.herokuapp.com/

STEP 1 : GETTING CLEAN JSON DATA TO WORK WITH.

Below is a link to the clean JSON data. You can download it if you want, and you’re welcome to use it in your own JavaScript project. In the following steps, I’m going to show how to get it online into a Mongo database through mLab. If you’d rather work with the file locally, just skip the next step.


trumpspeeches.json (hosted on www.dropbox.com)
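For orientation, here is a minimal sketch of the shape each record in that file appears to have, judging from the property names used later in this post (`speechtitle`, `speechdate`, `speechlocation`, `text`). The values below are placeholders, not real data from the file, and `isSpeechRecord` is a hypothetical helper for checking a record before using it.

```javascript
// Placeholder record: the property names match the ones used later in this post,
// but the values are invented for illustration only.
const sampleSpeeches = [
  {
    speechtitle: "Example Speech Title",
    speechdate: "January 1, 2017",
    speechlocation: "Example City",
    text: "Full text of the speech goes here."
  }
];

// Hypothetical helper: true when a record has all four fields as strings.
function isSpeechRecord(record) {
  const fields = ["speechtitle", "speechdate", "speechlocation", "text"];
  return record !== null &&
    typeof record === "object" &&
    fields.every(function (field) {
      return typeof record[field] === "string";
    });
}

console.log(sampleSpeeches.every(isSpeechRecord)); // true
```

A check like this is optional, but it can save some head-scratching if a record was imported with a missing or misspelled field.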

STEP 2 : SET UP AN MLAB ACCOUNT.

  1. Set up your account here: https://mlab.com/login/
  2. Create a new database.
  3. Create a new collection.
  4. Add a user to the database.
  5. Use the following command to import the file into the collection: mongoimport -h ds<database-number>.mlab.com:<port> -d <database> -c <collection> -u <user> -p <password> --file <input file>

You will see more detailed instructions on this step once you set up your account, under the “tools” tab.

STEP 3 : START A NEW HTML/JAVASCRIPT PROJECT.

Our HTML will be pretty basic. All we need is an input field, a submit button, and a div to display the results. Here’s what mine looks like:


<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>The Talk Maker</title>
  <link rel="stylesheet" href="/login.css">
  <link href="https://fonts.googleapis.com/css?family=Volkhov" rel="stylesheet">
  <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.2.1/jquery.min.js"></script>
</head>
<body>
  <label>search for a word or phrase</label>
  <input id="input" />
  <button id="button">search</button>
  <div id="results"></div>
</body>
</html>

I included a link for jQuery, because we may throw some of that in there too. The most important things here are the <input>, <button> & <div>, each of which has an ID that we’ll use to select it in our JavaScript.

STEP 4: JAVASCRIPT

Best practice would be to have a separate JS file in our program. You could also put it in a <script> tag in the head if you prefer, but it’s just not always as clean that way. First, we’ll declare our variables using the IDs from the HTML.

let input = document.getElementById('input');
let button = document.getElementById('button');
button.onclick = searchAPI;

searchAPI will be the name of the function where we call our fetch. That will look something like this:

function searchAPI() {
  fetch('https://api.mlab.com/api/1/databases?apiKey=myAPIKey').then(function(response) {
    if (response.status != 200) {
      window.alert("Sorry, looks like there's been an error: " + response.status);
      return;
    }
    response.json().then(function(data) {
      let api = data;
    })
  })
}

Since we uploaded the file we’ll be using to mLab, now we need to retrieve that data. Because it’s all in one file that we imported into one collection, our fetch URL will look similar to the one above, but with the database and collection names specific to your particular database. If you have trouble getting your URL to work, you can find the mLab documentation for retrieving data here: http://docs.mlab.com/data-api/#base-url
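As a rough sketch of what that collection-specific URL might look like, the helper below assembles one from a database name, a collection name, and an API key, assuming the `/databases/{database}/collections/{collection}` pattern described in the mLab Data API documentation linked above. The names `buildCollectionUrl` and `loadSpeeches`, and the example values `mydb` and `speeches`, are assumptions for illustration; substitute your own.

```javascript
// Hypothetical helper: builds the URL for listing the documents in one collection.
function buildCollectionUrl(database, collection, apiKey) {
  const base = "https://api.mlab.com/api/1/databases";
  return base + "/" + encodeURIComponent(database) +
    "/collections/" + encodeURIComponent(collection) +
    "?apiKey=" + encodeURIComponent(apiKey);
}

// The same request as searchAPI, written with async/await.
// Resolves to the parsed JSON, or throws if the status is not 200.
async function loadSpeeches(url) {
  const response = await fetch(url);
  if (response.status !== 200) {
    throw new Error("Request failed with status " + response.status);
  }
  return response.json();
}

console.log(buildCollectionUrl("mydb", "speeches", "myAPIKey"));
// https://api.mlab.com/api/1/databases/mydb/collections/speeches?apiKey=myAPIKey
```

Keeping the URL construction in one place means the API key and collection name only need to change in a single spot.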

Now, I’ve started the function after our fetch where we’ll begin to parse our data. I’ve set the variable api equal to the data. Next, we’ll need to begin accessing the data that we want. There are a few things that I want to be able to do here.

  1. I want to search for a word, and count its occurrences in a document. For example, in these speeches by Trump, I might want to search any speeches with “economy” in the title, and then in those speeches, count the number of times he makes a reference to “the middle class”.
  2. Secondly, I’m going to make an array of common words that I can track in any of the documents. For example, my array might contain:

["middle class", "poverty", "economy", "taxes", "deficit"]

What I want to be able to do with this array (and we can make it as long and complex as we want) is this: let’s say we’re searching, again, for speeches on the economy. We want to quantify not just the word we searched for, but also test each speech against the words in the array. That way, we’d count not just the occurrences of the searched word, but also the occurrences of each word in the array, so we could get a sense of which topics are the most important in each speech.
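One thing to note about an entry like "middle class" is that it contains a space, so it can never equal a single word once a speech has been split into individual words. A minimal sketch of one way to count a whole phrase is to scan the text with `indexOf` instead. `countOccurrences` is a hypothetical helper, and the sample sentence is made up for illustration.

```javascript
// Hypothetical helper: counts non-overlapping, case-insensitive
// occurrences of a word or phrase in a piece of text.
function countOccurrences(text, phrase) {
  const haystack = text.toLowerCase();
  const needle = phrase.toLowerCase();
  if (needle.length === 0) return 0; // an empty phrase would otherwise loop forever
  let count = 0;
  let position = haystack.indexOf(needle);
  while (position !== -1) {
    count += 1;
    // Continue searching just past the match we found.
    position = haystack.indexOf(needle, position + needle.length);
  }
  return count;
}

const sampleText = "The middle class built this country. The Middle Class deserves better.";
console.log(countOccurrences(sampleText, "middle class")); // 2
```

Because this matches raw substrings, searching for "tax" would also count "taxes"; whether that is desirable depends on what you want to measure.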

Here’s another thing we could do that might be useful too. We could have two arrays. One array that looks like this:

["good", "happy", "prosperity"]

and a second array of opposite words, like:

["sad", "bad", "poverty"]

With these two arrays, we could go through each speech and see which words of each array appear the most often. This would give us a sense of whether the speeches had a positive or negative outlook.
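A stripped-down sketch of that comparison for a single piece of text might look like this. `tallyTone` and the sample sentence are made up for illustration; the two word lists are the ones shown above.

```javascript
// Hypothetical helper: counts how many words of a text fall in each list.
function tallyTone(text, goodWords, badWords) {
  const good = new Set(goodWords);
  const bad = new Set(badWords);
  const counts = { good: 0, bad: 0 };
  // Split on whitespace; words are compared exactly as they appear.
  text.split(/\s+/).forEach(function (word) {
    if (good.has(word)) counts.good += 1;
    if (bad.has(word)) counts.bad += 1;
  });
  return counts;
}

const exampleTally = tallyTone(
  "a good day and a happy crowd but a sad ending",
  ["good", "happy", "prosperity"],
  ["sad", "bad", "poverty"]
);
console.log(JSON.stringify(exampleTally)); // {"good":2,"bad":1}
```

Using a Set makes each lookup a single check rather than a loop over the whole word list, which starts to matter once the lists or the speeches get long.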

To accomplish all of this, I’m going to run a for loop after the fetch. Inside the for loop I’ll have a series of arrays to test against: one for the words whose frequency I want to track, a second array for “good” words, and a third array for “bad” words. Then, outside the for loop, I’ll have several empty arrays which I’ll push the matching data to, so that I can use it later. Now that we’re clear on what we want to do, let’s write the pseudocode, and then break it down into manageable chunks. Here’s what my pseudocode looks like:

function searchAPI() {
  fetch('https://api.mlab.com/api/1/databases?apiKey=myAPIKey').then(function(response) {
    if (response.status != 200) {
      window.alert("Sorry, looks like there's been an error: " + response.status);
      return;
    }
    response.json().then(function(data) {
      let api = data;

      array A = [empty array to hold matched words pushed from for loop];
      array B = [empty array to hold "good" words pushed from for loop];
      array C = [empty array to hold "bad" words pushed from for loop];

      for ( for loop to map through api) {
        array 1 = [array that holds the words we want to find]
        array 2 = [array to hold good words to test for]
        array 3 = [array that holds bad words to test for]

        //--We'll need to create some "if" statements here
        if (the document contains the word we searched for) {
          search the document for additional words from array 1
          push matching words to array A

          search the document for words in array 2
          push matching words to array B

          search the document for words in array 3
          push matching words to array C
        }
      }

      function A = function that compares the results of an array and gives us a count

      run array A through function A.
      run array B through function A.
      run array C through function A.

      compare results of array B and C, and see which one is bigger.
      see results from array A, and see which word appears most.
    })
  })
}

Okay, now that we have a good start on our pseudocode, let’s start turning it into some actual code. Here’s what I did, but you may come up with a better solution.



let matchedFrequentWords = [];
let totalBadWords = [];
let totalGoodWords = [];

for (var i = 0; i < api.length; i++) {
  // We'll include this in our initial if statement.
  let inputValue = input.value;

  let frequentWords = ["america", "great", "republican", "democrat", "fight", "wall", "terrorism"];
  let goodWords = ["good", "happy", "great", "happiness", "love", "win", "success"];
  let badWords = ["bad", "sad", "terrible", "sadness", "hate", "fail", "unsuccessful"];
}

In the for loop, I’m going to start with the code that compares the words in the document against the words in our bad array, and then the good array. My thought here is that I’ll need an if statement to check whether the document even contains what I’m searching for. Then, inside the if statement, I’ll run 2 for loops, one nested inside the other. One will loop through the document, and the second will loop through the “good” array. Any elements that match, I’ll push into the “totalGoodWords” array outside our initial for loop.

To do this, since the text of the speech is one long string, I want to break it up into individual words that I can loop through more easily. I can do that with the “split” method, using a space as the separator. This returns an array in which each element is one word of the speech (with any punctuation still attached). Then, instead of looping through the original text, I’ll loop through that array.

Then, if there’s a match, what do we push? Our JSON data includes the Speech Title, Speech Date, and Speech Location, as well as the Speech Text. If there’s a match, we’ll push the title, date, and location into our array, so that later we can not only compare how often a word appears, but also compare whether certain topics are more common in certain locations or on certain dates. Keep in mind, this is just a small test with 4 speeches, and our results would be much better and more accurate if we were working with more data. Here’s what I came up with:









if (api[i].text.indexOf(inputValue) > -1) {
  let stringX = api[i].text.split(" ");
  for (var j = 0; j < badWords.length; j++) {
    for (var k = 0; k < stringX.length; k++) {
      if (badWords[j] == stringX[k]) {
        totalBadWords.push([
          api[i].speechtitle,
          api[i].speechdate,
          api[i].speechlocation,
        ])
      }
    }
  }
}

So let’s recap what’s going on here. We’ll search for a term in our HTML. Maybe we’d search for “economy”, for example. Then we loop through all the documents in our API with [i]. If document [i] contains the string “economy”, we split its entire text into an array of words, loop through that array with [k] and through badWords with [j], and find any matches. For each match, we go back to our original loop, [i], and push the speech title, speech date, and speech location. Remember, what we’re trying to find is the “mood”: how positive or negative a speech is. Because we have date, title, and location information, we should be able to tell the mood of each speech, and then see on what dates the “tone” was good or bad, and at what locations there was a more positive or negative tone. We’ll do basically the same thing for our “good” array now:









if (api[i].text.indexOf(inputValue) > -1) {
  let stringX = api[i].text.split(" ");
  for (var j = 0; j < goodWords.length; j++) {
    for (var k = 0; k < stringX.length; k++) {
      if (goodWords[j] == stringX[k]) {
        totalGoodWords.push([
          api[i].speechtitle,
          api[i].speechdate,
          api[i].speechlocation,
        ])
      }
    }
  }
}

Now, since we also want to test against all the words in the first array inside our for loop, we’ll basically use this same code again, but for our frequentWords array. However, instead of pushing the title, date, and location, we’ll only push the frequent word that is matched, because all we really want to know is which of the words appears most often.












if (api[i].text.indexOf(inputValue) > -1) {
  let stringX = api[i].text.split(" ");
  for (var j = 0; j < frequentWords.length; j++) {
    for (var k = 0; k < stringX.length; k++) {
      if (frequentWords[j] == stringX[k]) {
        matchedFrequentWords.push([frequentWords[j]])
      }
    }
  }
}
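One caveat with the three blocks above: `split(" ")` leaves punctuation and capitalization attached to each word, so "Great" or "win." would not equal "great" or "win" in a `==` comparison. A minimal sketch of one way around that, assuming plain English text, is to lowercase the speech and pull out runs of letters, digits, and apostrophes before comparing. `tokenize` is a hypothetical helper, not part of the original code.

```javascript
// Hypothetical helper: lowercases the text and returns its words
// with surrounding punctuation removed.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9']+/g) || [];
}

const sampleSentence = "We will win. We will WIN!";

// Plain split keeps punctuation and capitalization.
console.log(JSON.stringify(sampleSentence.split(" ")));
// ["We","will","win.","We","will","WIN!"]

// tokenize normalizes both.
console.log(JSON.stringify(tokenize(sampleSentence)));
// ["we","will","win","we","will","win"]
```

Swapping `api[i].text.split(" ")` for a call like `tokenize(api[i].text)` would likely raise the counts reported below, so the numbers in this post reflect the simpler exact-match comparison.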

All of these blocks go inside our original for loop. Now we’ll go outside, below our for loop, and create a function. What we want is something that compares all the items in an array and returns each item along with its frequency, or how often it appears. I found this great function on StackOverflow that does just that. Here it is below:













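Array.prototype.byCount = function() {
  var itm, a = [], L = this.length, o = {};
  // Tally how many times each item appears.
  for (var i = 0; i < L; i++) {
    itm = this[i];
    if (!itm) continue;
    if (o[itm] == undefined) o[itm] = 1;
    else ++o[itm];
  }
  // Convert the tally into {item, frequency} objects, most frequent first.
  for (var p in o) a[a.length] = {item: p, frequency: o[p]};
  return a.sort(function(a, b) {
    return o[b.item] - o[a.item];
  });
}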

Now, we’ll want to pass the arrays we’ve pushed things to, through this function to get a result. We’ll store the result as a variable so that we can access it later. Here’s how we would do that:

let frequentWordCount = matchedFrequentWords.byCount();

let badWordCount = totalBadWords.byCount();

let goodWordCount = totalGoodWords.byCount();
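Adding a method to Array.prototype changes every array on the page, which can clash with other scripts. If you’d rather avoid that, here is a sketch of the same tally written as a standalone function. `countByFrequency` is a hypothetical name of my own; it is not from the StackOverflow answer.

```javascript
// Hypothetical alternative to byCount that leaves Array.prototype untouched.
function countByFrequency(items) {
  const counts = new Map();
  items.forEach(function (item) {
    if (!item) return;           // skip empty entries, as byCount does
    const key = String(item);    // an array such as ["great"] becomes "great"
    counts.set(key, (counts.get(key) || 0) + 1);
  });
  // Turn the Map into {item, frequency} objects, most frequent first.
  return Array.from(counts, function (entry) {
    return { item: entry[0], frequency: entry[1] };
  }).sort(function (a, b) {
    return b.frequency - a.frequency;
  });
}

const exampleWords = ["great", "wall", "great", "fight", "great"];
console.log(JSON.stringify(countByFrequency(exampleWords)));
// [{"item":"great","frequency":3},{"item":"wall","frequency":1},{"item":"fight","frequency":1}]
```

With this version, the three calls would read `countByFrequency(matchedFrequentWords)` and so on, and the result has the same `item` and `frequency` properties the rest of the code relies on.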

At this point, or even earlier, it’s a good idea to continually check our output in the console log, to make sure we’re getting the results that we expect.

In my search, I’m testing it out, searching for any speeches that contain the word “economy”.

If everything works out, and you have the same arrays and data that I used, your output for frequentWordCount should look like this:

0: {item: "great", frequency: 42}
1: {item: "fight", frequency: 4}
2: {item: "terrorism", frequency: 4}
3: {item: "wall", frequency: 1}

This tells us that, across the speeches that matched our search, “great” appears 42 times, “fight” appears 4 times, “terrorism” appears 4 times, and “wall” appears once. None of the other words in frequentWords appear at all.

The output for badWordCount should look something like this:

0: {item: "President Trumps Address to a Joint Session of Con…ss,February 28, 2017,At Joint Session of Congress", frequency: 3}
1: {item: "Remarks by President Trump on TaxReform,September 5, 2017,Springfield, Missouri", frequency: 2}
2: {item: "President Trump on the Paris Climate Accord,June 1, 2017,the Rose Garden", frequency: 2}

This tells us that words from our bad list appear only 3 times in the first speech, and 2 times in each of the other two. Let’s compare that with our output for goodWordCount, which should look similar to this:

0: {item: "Remarks by President Trump at Tax Reform Event,September 28, 2017,Indiana Farm Bureau Building", frequency: 26}
1: {item: "President Trumps Address to a Joint Session of Con…ss,February 28, 2017,At Joint Session of Congress", frequency: 25}
2: {item: "Remarks by President Trump on TaxReform,September 5, 2017,Springfield, Missouri", frequency: 14}
3: {item: "President Trump on the Paris Climate Accord,June 1, 2017,the Rose Garden", frequency: 12}

This shows us that the “tone” or “mood”, at least of these 4 speeches, seems to be much more positive than negative. And again, obviously, our data would be much better if we had more speeches, but this is good just for a simple test.

Now, I’m going to work out what percentage of the matches are goodWords and what percentage are badWords, so I can compare and see whether the speeches were more positive or negative. I’m going to push the frequency values from each result into an array to get its sum, and then compare the good sum and the bad sum to the total sum. Here’s how I did that:




let goodNumbers = [];
for (var i = 0; i < goodWordCount.length; i++) {
  goodNumbers.push(goodWordCount[i].frequency);
}

let badNumbers = [];
for (var i = 0; i < badWordCount.length; i++) {
  badNumbers.push(badWordCount[i].frequency);
}

function getSum(total, num) {
  return total + num;
}

// The initial value 0 keeps reduce from throwing when an array is empty.
let goodSum = goodNumbers.reduce(getSum, 0);
let badSum = badNumbers.reduce(getSum, 0);
let totalSum = goodSum + badSum;

let goodPercent = (Math.ceil((goodSum / totalSum) * 100));
let badPercent = (Math.ceil((badSum / totalSum) * 100));

We should now have the percentage of good words vs. the percentage of bad words. My output was this:

goodPercent = 92%

badPercent = 9%

(I’m using Math.ceil, so both values round up to the next whole number, because reasons. That’s also why they add up to 101% rather than 100%.)
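If you’d rather have the two percentages always total exactly 100, one option is to round the first value normally and derive the second from it. This is only a sketch: `toPercentages` is a hypothetical helper, and the sample inputs are the sums of the frequencies listed above (26 + 25 + 14 + 12 = 77 good, 3 + 2 + 2 = 7 bad).

```javascript
// Hypothetical helper: converts two counts into whole-number
// percentages that always add up to 100.
function toPercentages(goodSum, badSum) {
  const totalSum = goodSum + badSum;
  if (totalSum === 0) {
    return { good: 0, bad: 0 }; // no matches at all; avoid dividing by zero
  }
  const good = Math.round((goodSum / totalSum) * 100);
  return { good: good, bad: 100 - good };
}

console.log(JSON.stringify(toPercentages(77, 7))); // {"good":92,"bad":8}
```

The zero check also covers the case where a search term matches no speeches, which would otherwise produce NaN for both values.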

Again, more data would help, but since we originally searched for “economy”, we could say, “In speeches covering the economy, Trump is about 92% optimistic.” This could be really useful if we had a larger number of speeches. We could then search for, say, “equality”, “women’s rights”, “abortion”, etc., and see if his optimism changes, or if the mood of the speech is better or worse on certain topics. Now, one other cool thing we can do is represent that data in a bar graph. To do that, we’ll create a variable using backticks (a template literal), and then add it to the innerHTML of our “results” div.

Basically, what we’ll do is create a new div element. We’ll use inline styles, since this element is being created dynamically in JavaScript. We’ll define its color and height. Then we’ll set its width from the variable “goodPercent”, which creates a nice visual representation of the data we’ve just parsed. Here’s how that will look in JavaScript:










let goodBarGraph = `<div class="dataSubLabel" style="color:green">Good:</div><div style="padding-left:50px;background-color:green;color:white;display:flex;flex-direction: row;justify-content:flex-end;align-items:center;height:15px;width:${goodPercent}%;">${goodPercent}% </div>`;
let badBarGraph = `<div class="dataSubLabel" style="color:red">Bad:</div><div style="padding-left:50px;background-color:red;color:white;display:flex;flex-direction: row;justify-content:flex-end;align-items:center;height:15px;width:${badPercent}%;">${badPercent}% </div>`;
document.getElementById('results').innerHTML += goodBarGraph;
document.getElementById('results').innerHTML += badBarGraph;

Our entire goodBarGraph variable is enclosed in backticks. The only dynamic parts are the two places where we’ve put goodPercent inside ${}: once as the width value, and once as the text shown inside the bar. We’ve done the same thing for the badBarGraph, and then added both variables to the innerHTML of our “results” div. If we’ve done it correctly, we should see a green bar and a red bar whose widths match the two percentages.
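Since the two template literals differ only in their label, color, and percentage, they could also be generated by one small function. Here is a minimal sketch, where `renderBar` is a hypothetical helper and the percentages are the ones from my output above.

```javascript
// Hypothetical helper: returns the label and bar markup for one percentage.
function renderBar(label, color, percent) {
  return `<div class="dataSubLabel" style="color:${color}">${label}:</div>` +
    `<div style="padding-left:50px;background-color:${color};color:white;` +
    `display:flex;flex-direction:row;justify-content:flex-end;` +
    `align-items:center;height:15px;width:${percent}%;">${percent}% </div>`;
}

const barsHtml = renderBar("Good", "green", 92) + renderBar("Bad", "red", 9);

// In the browser you would then add it to the page:
// document.getElementById('results').innerHTML += barsHtml;
console.log(barsHtml.includes("width:92%;")); // true
```

Because the function only builds a string, it can be checked in the console before anything is written to the page.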

Seeing that this works, we could do the same thing but break it up by speech, by year, or by venue. Depending on how complex our JSON object is, we could roughly calculate the “good” or “bad” and break it down in lots of interesting ways. Pretty cool, right?

Okay, let’s take a step back now. We have that nice data which tells us how frequently certain words are used. Let’s do something interesting with that data as well. The principle will be more or less the same as what we’ve already done, but again, the more data we feed our functions, the more interesting the results get. For our visualization to work, we’ll need to convert that data to percentages as well, so the bar widths are proportional. Here’s how we could do that:



function getSum(total, num) {
  return total + num;
}

let frequentWordTotal = [];
for (var i = 0; i < frequentWordCount.length; i++) {
  frequentWordTotal.push(frequentWordCount[i].frequency);
}

let frequentSum = frequentWordTotal.reduce(getSum);

let word1Percent = (Math.ceil((frequentWordCount[0].frequency / frequentSum) * 100));
let word2Percent = (Math.ceil((frequentWordCount[1].frequency / frequentSum) * 100));
let word3Percent = (Math.ceil((frequentWordCount[2].frequency / frequentSum) * 100));

Now, if everything works, our word1Percent will be about 83% and our word2Percent should be about 8%, which we know is roughly accurate, given the data we retrieved earlier, since we know word1, “great” appears 42 times, and word2, “fight” appears 4 times. So we can also visualize this data with another simple bar graph. Here’s how that might look in JavaScript:



let wordGraph = `<div class="dataSubLabel" style="color:#0056e0">${frequentWordCount[0].item} - appears ${frequentWordCount[0].frequency} times.</div><div style="padding-left:50px;background-color:#0056e0;color:white;display:flex;flex-direction: row;justify-content:flex-end;align-items:center;height:15px;width:${word1Percent}%;">${word1Percent}% </div>
<div class="dataSubLabel" style="color:#3460a8">${frequentWordCount[1].item} - appears ${frequentWordCount[1].frequency} times.</div><div style="padding-left:50px;background-color:#3460a8;color:white;display:flex;flex-direction: row;justify-content:flex-end;align-items:center;height:15px;width:${word2Percent}%;">${word2Percent}% </div>
<div class="dataSubLabel" style="color:#5874a3">${frequentWordCount[2].item} - appears ${frequentWordCount[2].frequency} times.</div><div style="padding-left:50px;background-color:#5874a3;color:white;display:flex;flex-direction: row;justify-content:flex-end;align-items:center;height:15px;width:${word3Percent}%;">${word3Percent}% </div>`;

document.getElementById('results').innerHTML += wordGraph;
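The three separate percent variables assume that at least three words were matched; with fewer matches, frequentWordCount[2] would be undefined and the code would throw. A sketch of a more general version computes the percentage for however many entries exist, up to a limit. `toWordPercents` is a hypothetical helper, and the sample counts are copied from the frequentWordCount output shown earlier.

```javascript
// Hypothetical helper: adds a rounded-up percentage to the first
// `limit` entries of a frequency list.
function toWordPercents(wordCounts, limit) {
  const total = wordCounts.reduce(function (sum, entry) {
    return sum + entry.frequency;
  }, 0);
  return wordCounts.slice(0, limit).map(function (entry) {
    return {
      item: entry.item,
      frequency: entry.frequency,
      percent: total === 0 ? 0 : Math.ceil((entry.frequency / total) * 100)
    };
  });
}

const exampleCounts = [
  { item: "great", frequency: 42 },
  { item: "fight", frequency: 4 },
  { item: "terrorism", frequency: 4 },
  { item: "wall", frequency: 1 }
];
const topThree = toWordPercents(exampleCounts, 3);
console.log(JSON.stringify(topThree.map(function (w) { return w.percent; })));
// [83,8,8]
```

Each returned object carries everything one bar needs (word, count, and width), so the markup could be built by looping over the result instead of repeating the template three times.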

Here, the same as earlier, we’re defining the width of each bar by the percentage variable we defined earlier. We’re also using the other values contained in our frequentWordCount objects to show not only the percentage, but also the actual word and the word count.

Now, one piece of data we haven’t used much is the word we’re actually searching for. Using this format, we could now go back and count the number of times it occurs in our for loop, and then push the speech title into an array. Then we could see similar data on when and where the President most often addresses that particular topic. In any case, this is a pretty simple project, but it can be fun to parse the data and see patterns you might not expect. If you have any questions or comments, feel free to reach out. Thanks!

