## A classification approach with natural language processing in R

Audience

This short blog post is intended for someone who has basic understanding of the natural language processing algorithm *word2vec, *in addition to familiarity with machine learning basics, particularly in classification techniques, and is keen to learn the actual integration of the two in R. We will go over detailed data preparation as well as model optimization. It is also recommended to review this post for a brief introduction on *word2vec.*

Objective

The objective is to learn how the Dow Jones Industrial Average would be affected by daily popular news on Reddit.

Dataset

The datasets can be found here. The simplified dataset is consisted of daily Dow Jones data and top daily headlines from Reddit WorldNews Channel for the past 8 years. The target variable is Label (0 is down and 1 is up).

Approach

We will use the *word2vec *algorithm to first create word vectors of 100 dimensions on all the headlines. We will then transform the headlines into sentence vectors by averaging the word vectors that every sentence is consisted of. Finally we will use a variety of classification algorithms to predict whether Dow Jones Industrial Average would go up or down.

Data Preparation

1…Remove undesired characters , including punctuations and upper cases

2…Remove English stop words such as “i,” “my,” “myself”

3…Input cleaned headlines into *word2vec* model to obtain word vectors

4…Apply and average word vectors to all headlines

5…Return 2-dimension vector representations on headlines using t-SNE

6…Create dataset with target variable and 2-dimension vectors

Model Evaluation

Here we will use the *caret* package in R to evaluate the prediction accuracy on a variety of algorithms, including CART (rpart), logistic regression (glm), linear discriminate analysis (LDA), k-Nearest Neighbors (kNN), Support Vector Machine.with Radial Basis Function (svmRadial), and eXtreme Gradient Boosting (xgbLinear).

It is observed that all algorithms above have unsatisfying performance, with close to 50% accuracy and poor Kappa values. However, SVM with the Radial Basis Function kernel tends to perform better than other algorithms, with a 95% confidence interval on accuracy between 51% and 55%.

As a result, we will explore a few optimization options.

Performance Optimization

*1…Stacking Multiple Models*

We will first try to stack all the models to see if performance will improve.

Unfortunately, not much improvement is observed.

*2…Averaging All Headlines*

Instead of treating top headlines as individual variables, we will combine all headlines to create one set of 2-dimension vector.

A slight increase of 0.1% in mean accuracy is observed in SVM algorithm.

*3…Shifting Target Variable*

We will try shifting the target variable back 1–7 days, as daily news may have lagged effects on stock performance. Alternatively, we can also shift the target variable forward a couple of days, however, this approach will contradict with the objective of* predicting* stock marketing performance using daily top news.

We can observe that CART (rpart) now has the best performance in terms of mean accuracy at 53.66%. There is also another slight increase of 0.1% in mean accuracy is observed in the SVM algorithm.

*4…Tuning Model Parameters*

Here we will demonstrate how to tune parameters of the CART and SVM algorithms to see if we can increase the accuracy on our predictions.

**CART**

In rpart, the tuning parameter is *cp*, or *complexity parameter*. We will examine the accuracy on different values of* cp*. The optimal cp value is found to be 0.15, as shown below.

**SVM**

In SVM with Radial Basis Function, the tuning parameters are *sigma* and *C* (*cost*). We will examine the accuracy on different values of* sigma *and* C*. The optimal values are shown below, with *sigma = 0.05* and *C = 3*.

R Code

library(devtools)

library(tm)

library(ggplot2)

library(text2vec)

library(rword2vec)

library(wordVectors)

library(tidyverse)

library(Rtsne)

library(caret)

library(mlbench)

library(caretEnsemble)

setwd("~/Desktop/stock")

#Simple Visualization on DJIA data

stock <- read.csv("DJIA_table.csv")

stock$Date <- as.Date(stock$Date)

ggplot(data=stock,aes(x=Date,y=Open))+geom_line()+labs(title="Dow Jones Industrial Average")

#Read the simplified data set

data <- read.csv("stock.csv")

#Remove unwanted columns for natural language processing

text <- data

text$Date <- NULL

text$Label <- NULL

#Remove unwanted characters/punctuations/upper-cases

for (i in 1:25){

text[,i] <- gsub('b"',"",text[,i])

text[,i] <- gsub("b'","",text[,i])

text[,i] <- gsub("[[:punct:]]", "", text[,i])

text[,i] <- tolower(text[,i])

}

write.table(text,"text.txt")

#Rename column names

new_text <- subset(text,select=paste("Top",1,sep=""))

new_text <- setNames(new_text,"x")

for (i in 2:25){

new_text2 <- subset(text,select=paste("Top",i,sep=""))

new_text2 <- setNames(new_text2,"x")

new_text <- rbind(new_text,new_text2)

}

write.table(new_text,"text.txt")

final_text <- new_text

#Remove stopwords

stopWords <- stopwords("en")

'%nin%' <- Negate('%in%')

train <- lapply(final_text$x, function(x) {

t <- unlist(strsplit(x, " "))

t[t %nin% stopWords]

})

train.df = as.data.frame(do.call(rbind, train))

train.df$x <- do.call(paste, c(train.df[,1:45], sep=" "))

finaltext <- subset(train.df,select="x")

write.table(finaltext,"text_data.txt")

#Input words to word2vec

model = train_word2vec("text_data.txt", output="vec.bin",

threads = 4, vectors = 100, window = 12, min_count = 5, iter=10, force=TRUE)

#Plot the word vectors

plot(model)

#Convert binary to text format

bin_to_txt("vec.bin","model1text.txt")

#Remove first rows manually in the .txt file

#Convert .txt to .cvs

m1 <- read.table("model1text.txt",

header = F,quote = "", row.names = NULL, stringsAsFactors = FALSE)

colnames(m1)[1] <- "word"

#Apply word vectors to headlines and return 2-dimension representations with t-SNE

label <- subset(data,select="Label")

labeltext <- cbind(label,text)

labeltext$id <- 1:nrow(labeltext)

labeltext$id <- as.factor(labeltext$id)

for (i in 2:24){

s <- strsplit(labeltext[,i], split= " ")

new <- data.frame(V1 = rep(labeltext$id, sapply(s, length)), V2 = unlist(s))

colnames(new) <- c("id","word")

new <- merge(new,m1,by="word")

new2 <- aggregate(V2 ~ id, new, mean)

for (j in 4:102){

new3 <- aggregate(new[,j] ~ id, new, mean)

new2 <- merge (new2,new3,by="id")

}

colnames(new2)[2:101] <- paste("V", 1:100, sep="")

tsne <- Rtsne(new2, dims = 2, perplexity=100, verbose=TRUE, max_iter = 500)

t = as.data.frame(tsne$Y)

colnames(t) <- c(paste("top",i,"x",sep=""),paste("top",i,"y",sep=""))

labeltext <- cbind(labeltext,t)

}

write.csv(labeltext,"labeltext.csv")

#Add back Label information

mydata <- labeltext

mydata[,2:27] <- NULL

mydata$Label[mydata$Label==1]<- "Yes"

mydata$Label[mydata$Label==0]<- "No"

#Classfication Models

control <- trainControl(method="repeatedcv", number=10, repeats=3, savePredictions=TRUE, classProbs=TRUE)

algorithmList <- c('rpart','glm','lda','knn','svmRadial','xgbLinear')

set.seed(1)

models <- caretList(Label~., data=mydata, trControl=control, methodList=algorithmList)

results <- resamples(models)

summary(results)

dotplot(results)

#-----Performance Optimization #1: Stacking Multiple Models-----

#Model Correlations

modelCor(results)

splom(results)

#Model Stacking with GLM

stackControl <- trainControl(method="repeatedcv", number=10, repeats=3, savePredictions=TRUE, classProbs=TRUE)

set.seed(2)

stack.glm <- caretStack(models, method="glm", metric="Accuracy", trControl=stackControl)

print(stack.glm)

#-----Performance Optimization #2: Averaging All Headlines-----

mydata2 <- mydata

mydata2$headline <- rowMeans(mydata2[,2:45], na.rm = TRUE)

mydata2[,2:45] <- NULL

control <- trainControl(method="repeatedcv", number=10, repeats=3, savePredictions=TRUE, classProbs=TRUE)

algorithmList <- c('rpart','glm','lda','knn','svmRadial','xgbLinear')

set.seed(1)

models <- caretList(Label~., data=mydata2, trControl=control, methodList=algorithmList)

results <- resamples(models)

summary(results)

dotplot(results)

#-----Performance Optimization #3: Shifting Target Variable-----

shift <-subset(mydata,select="Label")

shift1 <- as.data.frame(shift[-1,])

shift2 <- as.data.frame(shift[-c(1,2),])

shift3 <- as.data.frame(shift[-c(1,2,3),])

shift4 <- as.data.frame(shift[-c(1,2,3,4),])

shift5 <- as.data.frame(shift[-c(1,2,3,4,5),])

shift6 <- as.data.frame(shift[-c(1,2,3,4,5,6),])

shift7 <- as.data.frame(shift[-c(1,2,3,4,5,6,7),])

mydata1 <- mydata[-1989,]

mydata2 <- mydata[-c(1988,1989),]

mydata3 <- mydata[-c(1987,1988,1989),]

mydata4 <- mydata[-c(1986,1987,1988,1989),]

mydata5 <- mydata[-c(1985,1986,1987,1988,1989),]

mydata6 <- mydata[-c(1984,1985,1986,1987,1988,1989),]

mydata7 <- mydata[-c(1983,1984,1985,1986,1987,1988,1989),]

mydata1$Label <- NULL

mydata2$Label <- NULL

mydata3$Label <- NULL

mydata4$Label <- NULL

mydata5$Label <- NULL

mydata6$Label <- NULL

mydata7$Label <- NULL

mydata1 <- cbind(shift1,mydata1)

colnames(mydata1)[1] <- "label1day"

mydata2 <- cbind(shift2,mydata2)

colnames(mydata2)[1] <- "label2day"

mydata3 <- cbind(shift3,mydata3)

colnames(mydata3)[1] <- "label3day"

mydata4 <- cbind(shift4,mydata4)

colnames(mydata4)[1] <- "label4day"

mydata5 <- cbind(shift5,mydata5)

colnames(mydata5)[1] <- "label5day"

mydata6 <- cbind(shift6,mydata6)

colnames(mydata6)[1] <- "label6day"

mydata7 <- cbind(shift7,mydata7)

colnames(mydata7)[1] <- "label7day"

control <- trainControl(method="repeatedcv", number=10, repeats=3, savePredictions=TRUE, classProbs=TRUE)

algorithmList <- c('rpart','glm','lda','knn','svmRadial','xgbLinear')

set.seed(1)

models <- caretList(label3day~., data=mydata3, trControl=control, methodList=algorithmList)

results <- resamples(models)

summary(results)

dotplot(results)

#-----Performance Optimization #4: Tuning Model Parameters-----

set.seed(3)

control <- trainControl(method="repeatedcv", number=10, repeats=3, savePredictions=TRUE, classProbs=TRUE)

#Tuning cp in CART

rpartGrid <- expand.grid(cp = c(0.01,0.05,0.1,0.15))

rpartFit <- train(label3day ~ .,

data = mydata3,

method = "rpart",

tuneLength = 10,

tuneGrid=rpartGrid)

rpartFit

ggplot(rpartFit)

#Tuning sigma and C in SVM

svmGrid <- expand.grid(sigma = c(0.01,0.05,0.1,0.15),

C = c(1,3,5,7,9))

svmFit <- train(label3day ~ ., data = mydata3,

method = "svmRadial",

trControl = control,

verbose = FALSE,

tuneGrid = svmGrid)

svmFit

ggplot(svmFit)

Questions, comments, or concerns?

jchen6912@gmail.com