Based on a question I recently received on my blog, I decided to write this post. Let me briefly explain some background and context for this idea. I recently created a site which holds many of Donald Trump’s speeches on the back end. On the front end, a user can search for any word to see whether it appears in any of the speeches. If there’s a match, it gives you statistics on how, when and where Trump most often talks about whatever the user searched for. Here’s a working example of the site, if you’re interested: https://trumpspeechdata.herokuapp.com/
In a recent post I mentioned how, if you have a similar project, you might try to make the database searchable for users, and how there are often problems with that, where the search results don’t match a user’s intended results. One way to help mitigate that is to use a mapped object of words, which groups similar words and searches under one keyWord umbrella. Basically, if a user’s search matches any word in the mapped object, it will be replaced with the “umbrella” word, so that the search results returned more closely match the topic, and not just the specific word the user searched for. For more information, here’s that post: https://hackernoon.com/creating-more-accurate-search-results-for-small-sites-436e64da79b6
So here’s today’s problem: If your database structure currently looks like this:
{
  "speechtitle": "Remarks by President Trump on TaxReform",
  "speechdate": "September 2017",
  "speechlocation": "Missouri",
  "text": "blah, blah, blah, text here"
}
There isn’t a field yet for tags. How would you add tags dynamically to the database? Well, first, you could do it manually, by reading each speech and manually adding the field and tags to each speech. But, let’s face it, that’s not feasible in most cases. What we want to do is add the most relevant tags to each database entry, without having to do it manually. I came up with a solution. For a peek at what I’m working towards in this article, you can click here, and refer back to it to see how it lines up with the code I’m writing:
https://trumpspeechdata.herokuapp.com/taginput
And let me admit up front, I’m sure there’s a better solution, probably using Python or PHP. My expertise is in Node.js and JavaScript, so that’s what my solution is in, if you want to follow along. I think the old adage is true, “If all you have is a hammer, everything looks like a nail.” Anyhow, let’s get started. Here’s my thought process on this:
I need to sort through the text of each speech and find the most common words. Whichever words appear the most frequently, I will extract to use as tags for that database entry. However, if I simply do that, I would end up with “and”, “the”, “it”, and “if” being the most frequently used words. So instead, I created an array of possible topics to loop through and match against the speech text. That way, I don’t get a lot of the “non-topic” text, and my results are more relevant. Here are the tags I used, and I’m storing them in an array:
let tags = [ "women's rights", "women", "deport", "border", "security", "immigration", "extremism", "terrorism", "tuition", "healthcare", "tax reform", "taxes", "citizenship", "abortion", "religion", "lgbt", "gay rights", "transgender", "military", "marriage", "gun control", "surveillance", "net neutrality", "drugs", "social security", "obamacare", "medicaid", "israel", "military spending", "north korea", "isis", "equal rights", "minimum wage", "welfare", "student loans", "education", "climate change", "transportation", "loans", "student", "school", "teacher", "jobs", "salary"];
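To see why a fixed topic list helps, here is a minimal sketch of the naive approach, where every word is counted and the most frequent ones win. The function name topWords and the sample sentence are made up for illustration; they are not part of the site’s code.

```javascript
// Hypothetical helper: count every word in a text and return the n most common.
function topWords(text, n) {
  const counts = new Map();
  // Lowercase the text, then pull out runs of letters and apostrophes as "words".
  const words = text.toLowerCase().match(/[a-z']+/g) || [];
  for (const word of words) {
    counts.set(word, (counts.get(word) || 0) + 1);
  }
  // Sort by count, highest first, and keep only the words themselves.
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, n)
    .map(([word]) => word);
}

// Made-up sample text, not taken from a real speech.
const sample = "The border and the wall and the border and the jobs";
console.log(topWords(sample, 2)); // [ 'the', 'and' ]
```

Even in this tiny sample, the filler words beat “border”, which is the problem the tags array is meant to avoid.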
Next I’ll create a function that matches these words against all the words in each speech. I’ll use a for loop for that. After using a fetch, and storing the data in a variable called “api” I’m ready to get going.
response.json().then(function(data) {
  let api = data;
  let tagsOutput = [];
  for (var i = 0; i < api.length; i++) {
    let tags = [ "women's rights", "women", "deport", "border", "security", "immigration", "extremism", "terrorism", "tuition", "healthcare", "tax reform", "taxes", "citizenship", "abortion", "religion", "lgbt", "gay rights", "transgender", "military", "marriage", "gun control", "surveillance", "net neutrality", "drugs", "social security", "obamacare", "medicaid", "israel", "military spending", "north korea", "isis", "equal rights", "minimum wage", "welfare", "student loans", "education", "climate change", "transportation", "loans", "student", "school", "teacher", "jobs", "salary"];
    // Only process speeches that actually have text
    if (api[i].text.length > 1) {
      // Position of this speech in the api array (same value as i)
      let index = api.indexOf(api[i]);
      // Split the text into single words. Note: because of this, multi-word
      // tags such as "tax reform" can never equal one of these words.
      let stringX = api[i].text.split(" ");
      for (var j = 0; j < tags.length; j++) {
        for (var k = 0; k < stringX.length; k++) {
          if (tags[j] == stringX[k]) {
            tagsOutput.push([
              index,
              tags[j],
              api[i].title,
              api[i].location,
              api[i].date,
              api[i].text,
            ])
          }
        }
      }
    }
  } // end of the loop over api; the rest of the code continues inside the .then callback
I keep my “tags” array inside my for loop. Then, I create an if statement that makes sure the text is there. If it is, I execute a nested for loop, first splitting the original text up into words and storing them in the variable stringX. Then I loop through my tags, and nest a second for loop inside that, where I loop through stringX. If there’s a match between a word from the text and a word in tags, I push several things into a new array called “tagsOutput”. Here’s what I’m pushing into that array, and why: the speech’s index in the api array, so I can match each result back to the right speech later; the tag that matched; and the speech’s title, location, date and text, which I’ll use further down to fill in the form inputs.
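One thing to watch with this approach: splitting on spaces means a two-word tag like “tax reform” never equals a single word, and “Taxes,” with a capital letter or trailing comma won’t equal “taxes” either. Here is a minimal sketch of a different technique (not the one used on the site), which searches the lowercased text with a regular expression for each tag. The name countTagMatches and the sample text are assumptions for illustration.

```javascript
// Hypothetical helper: count how often each tag occurs in a text,
// ignoring case and allowing multi-word tags.
function countTagMatches(text, tags) {
  const counts = {};
  const lowerText = text.toLowerCase();
  for (const tag of tags) {
    // Escape any regex metacharacters in the tag before building the pattern.
    const escaped = tag.toLowerCase().replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    // \b keeps "taxes" from matching inside a longer word.
    const pattern = new RegExp("\\b" + escaped + "\\b", "g");
    const matches = lowerText.match(pattern);
    if (matches) {
      counts[tag] = matches.length;
    }
  }
  return counts;
}

// Made-up sample text, not taken from a real speech.
const sampleText = "Tax reform is coming. Taxes, taxes, and more tax reform!";
console.log(countTagMatches(sampleText, ["tax reform", "taxes", "border"]));
// { 'tax reform': 2, taxes: 2 }
```

Overlapping tags still count separately here, so a text containing “women’s rights” would also add to the count for “women”. Whether that is desirable depends on how you want the tags to behave.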
Next, I need to count how often certain words appear in certain speeches. I’ll use a function that calculates the item frequency, gets rid of duplicate entries, and returns the item and the frequency. I’ll hand the function both the title and the tag. This way, if the title of the speech is “rally in Arizona”, and the tag “war” appears 6 times in the speech, and “women” appears 8 times, I should get a result that looks like this:
{women, rally in Arizona, 8}{war, rally in Arizona, 6}
This way, I’m not just getting the frequency of the tag, or the frequency of the speech…but combining them to get the frequency of that tag in that speech. Here’s what the function will look like:
Array.prototype.byCount = function() {
  var itm, a = [], L = this.length, o = {};
  for (var i = 0; i < L; i++) {
    itm = this[i];
    if (!itm) continue;
    if (o[itm] == undefined) o[itm] = 1;
    else ++o[itm];
  }
  for (var p in o) a[a.length] = {item: p, frequency: o[p]};
  return a.sort(function(a, b) {
    return o[b.item] - o[a.item];
  });
}
let tagFreq = tagsOutput.byCount();
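One detail worth knowing about this step: when an array is used as an object key, JavaScript converts it to a string by joining its elements with commas. That is why each “item” coming back from byCount is one long string rather than an array. A tiny sketch with made-up values:

```javascript
// Made-up entry in the same shape as the items pushed into tagsOutput.
const entry = [0, "taxes", "Rally in Arizona"];

const counts = {};
// Using the array as a key coerces it to the string "0,taxes,Rally in Arizona".
counts[entry] = 1;

console.log(Object.keys(counts)); // [ '0,taxes,Rally in Arizona' ]
```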
Now I’m going to end up with a lot of tags with a frequency of 1, 2 or 3. I feel like, in a 45-minute speech, if “budget” only comes up twice, it’s not relevant enough to the speech to add it as a tag. So I’m going to create a new array, loop through “tagFreq”, and only add objects with a frequency higher than 6. That should ensure that these tags occur frequently enough in the speech to be added as a relevant tag.
let arrayOfTags = [];
for (var i = 0; i < tagFreq.length; i++) {
  if (tagFreq[i].frequency > 6) {
    arrayOfTags.push([tagFreq[i].item])
  }
}
This next part is wildly unnecessary, but I’m going to include it. I realized after the fact that, doing it this way returned a giant string, followed by the frequency. I need to have every part of that string separated into an array, so that the title, date, location and text have their own array index. There are better ways to do this. But I used a series of string splits at commas to extract the specific elements.
let matchedTags = [];
for (var i = 0; i < arrayOfTags.length; i++) {
  let index = arrayOfTags[i][0].split(',', 1)[0];
  let remainingString1 = arrayOfTags[i][0].split(/,(.+)/)[1];
  let tag = remainingString1.split(',', 1)[0];
  let remainingString2 = remainingString1.split(/,(.+)/)[1];
  let title = remainingString2.split(',', 1)[0];
  let remainingString3 = remainingString2.split(/,(.+)/)[1];
  let location = remainingString3.split(',', 1)[0];
  let remainingString4 = remainingString3.split(/,(.+)/)[1];
  let date = remainingString4.split(',', 1)[0];
  let text = remainingString4.split(/,(.+)/)[1];
  matchedTags.push([index, tag, title, location, date, text])
}
I was afraid there were too many commas in the speech text, and I was afraid JavaScript would accidentally break it up incorrectly. So here, what I’m doing is basically, splitting the string at the first comma, which gives me the “index.” I’m saving everything after “index” separately, and then doing the same thing, splitting it, and getting the tag at the next comma, rinse and repeat.
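As one of those “better ways”, here is a minimal sketch that counts the entries without ever turning them into a string, so no comma splitting is needed afterwards. It is a different technique from byCount, and the name countEntries, the key format, and the sample data are all assumptions for illustration.

```javascript
// Hypothetical helper: group [index, tag, title, location, date, text] entries
// by speech index and tag, keeping the fields intact, and keep only the groups
// that occur more than minFrequency times.
function countEntries(entries, minFrequency) {
  const groups = new Map();
  for (const [index, tag, title, location, date, text] of entries) {
    // Only the index and the tag go into the key, so commas in the text don't matter.
    const key = index + "|" + tag;
    const group = groups.get(key);
    if (group) {
      group.frequency += 1;
    } else {
      groups.set(key, { index, tag, title, location, date, text, frequency: 1 });
    }
  }
  return [...groups.values()]
    .filter((group) => group.frequency > minFrequency)
    .sort((a, b) => b.frequency - a.frequency);
}

// Made-up sample data in the same shape as tagsOutput.
const sampleEntries = [
  [0, "taxes", "Rally", "Arizona", "2017", "a, b, c"],
  [0, "taxes", "Rally", "Arizona", "2017", "a, b, c"],
  [0, "taxes", "Rally", "Arizona", "2017", "a, b, c"],
  [0, "jobs", "Rally", "Arizona", "2017", "a, b, c"],
];
console.log(countEntries(sampleEntries, 2));
```

With the sample data, only the “taxes” group passes the threshold of 2, and its text field still contains its commas untouched.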
The “matchedTags” array gives me more or less the data structure that I’m hoping for. Now comes the tricky part. I’m going to shift gears here and go into the back end. With Node.js I’m using MongoDB to create a document-based database of my data. Instead of simply adding a “tags” field to my data, I’m actually going to go back and add two fields. In this entry I’m not going to go through the entire process of setting up the back end in Node, but just enough to explain what I’m doing.
Originally I had used Mongoose to create models for my data. I only had one model, “speeches”. Now, I’m going back in, and adding a second model called “speechID”. Here’s what the structure of that model looks like:
let Schema = mongoose.Schema;
const speechIDSchema = new Schema({
  speechID: {type: String,},
})
speechIDSchema.plugin(timestamps);
const SpeechID = mongoose.model('speechID', speechIDSchema);
module.exports = SpeechID;
In my speech model, I’m referencing the speechID model. This is similar to creating table associations in a relational database. Here, my speech document will be related to my speechID document. The reason I’m doing this is that later, I’m going to need to add my tags to my speeches. By having a speech ID included in my “speech” model, I can use that in my post method to make sure that the tags I add are added to the correct speeches. Here’s what my speech model looks like:
let Schema = mongoose.Schema;
const speechSchema = new Schema({
  // ref must match the name the other model was registered under: 'speechID'
  speechID: {type: String, ref: 'speechID',},
  title: {type: String,},
  date: {type: String,},
  location: {type: String,},
  text: {type: String,},
  tags: [{type: String,}],
})
speechSchema.plugin(timestamps);
const Speech = mongoose.model('speech', speechSchema);
module.exports = Speech;
As you can see here, each speech will get a speechID, but I don’t have to enter it in, it will simply get the ID from the speechID model. In my app.js, here’s how I structure my app.post to accommodate this change:
//==APP POST SPEECH=//
app.post('/speechnew/:speechID', function(req, res) {
  Speech.create({
    speechID: req.params.speechID,
    title: req.body.title,
    date: req.body.date,
    location: req.body.location,
    text: req.body.text,
    tags: req.body.tags,
  }).then(speechs => {
    res.json(speechs)
  });
});
//==========================//
The app.post route will be /speechnew/:speechID. However, I’ll give it the actual speechID when the forms are created, again making sure that the tags update the correct documents. This, you might notice, is the “create” method, but when I add the tags, I’ll use the “findOneAndUpdate” method. It’s similar in Mongoose, except that findOneAndUpdate takes the search filter and the new field values as two separate arguments. Here’s how that post request will look:
//==APP UPDATE SPEECH=//
app.post('/speechupdate/:speechID', function(req, res) {
  Speech.findOneAndUpdate(
    // First argument: which document to update
    {speechID: req.params.speechID},
    // Second argument: the fields to set on that document
    {
      title: req.body.title,
      date: req.body.date,
      location: req.body.location,
      text: req.body.text,
      tags: req.body.tags,
    },
    // Return the updated document instead of the original one
    {new: true}
  ).then(speechs => {
    res.json(speechs)
  });
});
//==========================//
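Since each form further down submits a single tag, and one speech can match several tags, it may be worth appending to the tags array rather than replacing it. Here is a minimal sketch of that idea using MongoDB’s $addToSet update operator. The helper name buildTagUpdate is hypothetical, and the commented-out route usage assumes the Speech model from above.

```javascript
// Hypothetical helper: build the filter and update objects for adding
// one tag to the speech with the given speechID.
function buildTagUpdate(speechID, tag) {
  return {
    // Which document to update
    filter: { speechID: speechID },
    // $addToSet appends the tag only if the array doesn't already contain it
    update: { $addToSet: { tags: tag } },
  };
}

console.log(JSON.stringify(buildTagUpdate("abc123", "taxes")));

// Inside a route handler this might be used like so (assumes the Speech model above):
//   const { filter, update } = buildTagUpdate(req.params.speechID, req.body.tags);
//   Speech.findOneAndUpdate(filter, update, { new: true })
//     .then(speech => res.json(speech));
```

Because the update only touches “tags”, the title, date, location and text stay exactly as they were in the database.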
Again, I’m not going into a ton of the Node.js stuff here. All I need to do is make sure my form action matches the route here, and that my form input names match the req.body names here. Okay, let’s get back to JavaScript on the front end. I left off with the “matchedTags” array, which holds the data structure I want. I’m going to break this next part down step by step. Basically, I’m going to loop over my array, and create a form and form inputs for each array item. I’m going to set unique IDs dynamically for each element that I create, by creating variables to use as the element ID. Then I’m going to use .value to set the value of these form inputs to the data I have in my array. My form action will be the update route I created on the back end, so the only thing that will change when I update should be the “tags”.
for (var i = 0; i < api.length; i++) {
  let speechID = api[i].speechID;
I’m looping over the api again, and saving each array element’s ID in a variable.
  for (var k = 0; k < matchedTags.length; k++) {
    let formID = "form" + matchedTags[k][0] + "";
    let inputID1 = "input1" + matchedTags[k][0] + "";
    let inputID2 = "input2" + matchedTags[k][0] + "";
    let inputID3 = "input3" + matchedTags[k][0] + "";
    let inputID4 = "input4" + matchedTags[k][0] + "";
    let inputID5 = "input5" + matchedTags[k][0] + "";
Inside my initial for loop, I’m starting a second for loop, where I’m looping over my matchedTags array. Each element I create needs to have a unique ID. Since I don’t know how big my array will be, I have to let my for loop decide that. The only thing I do know is that there will be a form, and 5 input fields for “title”, “date”, “location”, “text”, and “tags”. Each element gets an ID based on the speech’s index in the api array, which is contained in “matchedTags[k][0]”.
if(matchedTags[k][0] == api.indexOf(api[i])) {
Next, with an if statement, I’m making sure that the index I pulled out earlier, and have stored in matchedTags[k][0], matches the index of the current speech in the api array. If there’s a match, I’ll start a new block, creating all the form elements I need, and using those IDs I made earlier.
let newForm = document.createElement('form');
newForm.id = formID;
document.body.appendChild(newForm);
document.getElementById('data').appendChild(newForm);
newForm.action = "/speechupdate/" + speechID + "";
newForm.method = "post";
Here’s my new form, again using the formID variable. In newForm.action I’m using the route I created on the back end, /speechupdate/:speechID, only here, speechID is a variable which does contain the speech ID. This is what connects the form to the correct database entry. Next, I’ll create some inputs and append them to the form, just like I appended the form to an element in the HTML body:
let newInput1 = document.createElement('input');
newInput1.id = inputID1;
document.body.appendChild(newInput1);
document.getElementById(formID).appendChild(newInput1);
newInput1.name = "title";
newInput1.value = matchedTags[k][2];
let newInput2 = document.createElement('input');
newInput2.id = inputID2;
document.body.appendChild(newInput2);
document.getElementById(formID).appendChild(newInput2);
newInput2.name = "date";
newInput2.value = matchedTags[k][4];
let newInput3 = document.createElement('input');
newInput3.id = inputID3;
document.body.appendChild(newInput3);
document.getElementById(formID).appendChild(newInput3);
newInput3.name = "location";
newInput3.value = matchedTags[k][3];
let newInput4 = document.createElement('input');
newInput4.id = inputID4;
document.body.appendChild(newInput4);
document.getElementById(formID).appendChild(newInput4);
newInput4.name = "text";
newInput4.value = matchedTags[k][5];
let newInput5 = document.createElement('input');
newInput5.id = inputID5;
document.body.appendChild(newInput5);
document.getElementById(formID).appendChild(newInput5);
newInput5.name = "tags";
newInput5.value = matchedTags[k][1];
let submitButton = document.createElement('input');
submitButton.id = "submitbtn";
document.body.appendChild(submitButton);
document.getElementById(formID).appendChild(submitButton);
submitButton.type = "submit";
Each form input is appended to the form, and has its own ID. The input.name also matches the req.body name that we used on the back end. Here’s what the entire function looks like all together:
for (var i = 0; i < api.length; i++) {
  let speechID = api[i].speechID;
  for (var k = 0; k < matchedTags.length; k++) {
    let formID = "form" + matchedTags[k][0] + "";
    let inputID1 = "input1" + matchedTags[k][0] + "";
    let inputID2 = "input2" + matchedTags[k][0] + "";
    let inputID3 = "input3" + matchedTags[k][0] + "";
    let inputID4 = "input4" + matchedTags[k][0] + "";
    let inputID5 = "input5" + matchedTags[k][0] + "";
    if (matchedTags[k][0] == api.indexOf(api[i])) {
      let newForm = document.createElement('form');
      newForm.id = formID;
      document.body.appendChild(newForm);
      document.getElementById('data').appendChild(newForm);
      newForm.action = "/speechupdate/" + speechID + "";
      newForm.method = "post";
      let newInput1 = document.createElement('input');
      newInput1.id = inputID1;
      document.body.appendChild(newInput1);
      document.getElementById(formID).appendChild(newInput1);
      newInput1.name = "title";
      newInput1.value = matchedTags[k][2];
      let newInput2 = document.createElement('input');
      newInput2.id = inputID2;
      document.body.appendChild(newInput2);
      document.getElementById(formID).appendChild(newInput2);
      newInput2.name = "date";
      newInput2.value = matchedTags[k][4];
      let newInput3 = document.createElement('input');
      newInput3.id = inputID3;
      document.body.appendChild(newInput3);
      document.getElementById(formID).appendChild(newInput3);
      newInput3.name = "location";
      newInput3.value = matchedTags[k][3];
      let newInput4 = document.createElement('input');
      newInput4.id = inputID4;
      document.body.appendChild(newInput4);
      document.getElementById(formID).appendChild(newInput4);
      newInput4.name = "text";
      newInput4.value = matchedTags[k][5];
      let newInput5 = document.createElement('input');
      newInput5.id = inputID5;
      document.body.appendChild(newInput5);
      document.getElementById(formID).appendChild(newInput5);
      newInput5.name = "tags";
      newInput5.value = matchedTags[k][1];
      let submitButton = document.createElement('input');
      submitButton.id = "submitbtn";
      document.body.appendChild(submitButton);
      document.getElementById(formID).appendChild(submitButton);
      submitButton.type = "submit";
      console.log(newInput3.value);
    }
  }
}
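The five input blocks above are nearly identical, so here is a minimal sketch of how the same idea could be written with less repetition. It is a restructuring under my own assumptions, not the code running on the site: the helper names describeForm and renderForm are hypothetical, and the element IDs include the tag as well as the index, since a speech that matches more than one tag would otherwise produce two forms with the same ID.

```javascript
// Hypothetical helper: describe one form as plain data.
// `entry` is one matchedTags item: [index, tag, title, location, date, text].
function describeForm(entry, speechID) {
  const [index, tag, title, location, date, text] = entry;
  // Include the tag in the suffix so two tags for the same speech get different IDs.
  const suffix = index + "-" + tag.replace(/\s+/g, "-");
  // The keys here must match the req.body names used on the back end.
  const values = { title, date, location, text, tags: tag };
  return {
    id: "form" + suffix,
    action: "/speechupdate/" + speechID,
    inputs: Object.keys(values).map((name) => ({
      id: name + suffix,
      name: name,
      value: values[name],
    })),
  };
}

// Hypothetical helper: turn a description into DOM elements.
// `doc` is the browser's document; `container` is the element to append to.
function renderForm(doc, description, container) {
  const form = doc.createElement("form");
  form.id = description.id;
  form.action = description.action;
  form.method = "post";
  for (const field of description.inputs) {
    const input = doc.createElement("input");
    input.id = field.id;
    input.name = field.name;
    input.value = field.value;
    form.appendChild(input);
  }
  const submit = doc.createElement("input");
  submit.type = "submit";
  form.appendChild(submit);
  container.appendChild(form);
  return form;
}

// In the browser this might be called as:
//   renderForm(document, describeForm(matchedTags[k], speechID), document.getElementById('data'));
```

Keeping the description separate from the DOM work also means the data can be checked in Node without a browser.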
Now, this solution works, but it could be cleaner, and there’s probably a better solution somewhere else. But if you’re like me, and really like Node.js and JavaScript, this solution should work if you need to update your database dynamically with tags that are generated from your database content. If you have any questions, or feedback, do reach out. Thanks!