James Chen

@jameschen_78678

Predict Stock Market with Daily Top News

A classification approach with natural language processing in R

Audience

This short blog post is intended for someone who has basic understanding of the natural language processing algorithm word2vec, in addition to familiarity with machine learning basics, particularly in classification techniques, and is keen to learn the actual integration of the two in R. We will go over detailed data preparation as well as model optimization. It is also recommended to review this post for a brief introduction on word2vec.

Objective

The objective is to learn how the Dow Jones Industrial Average would be affected by daily popular news on Reddit.

Dataset

The datasets can be found here. The simplified dataset is consisted of daily Dow Jones data and top daily headlines from Reddit WorldNews Channel for the past 8 years. The target variable is Label (0 is down and 1 is up).

Screenshot of the stock news data set
Approach

We will use the word2vec algorithm to first create word vectors of 100 dimensions on all the headlines. We will then transform the headlines into sentence vectors by averaging the word vectors that every sentence is consisted of. Finally we will use a variety of classification algorithms to predict whether Dow Jones Industrial Average would go up or down.

Data Preparation

1…Remove undesired characters , including punctuations and upper cases

2…Remove English stop words such as “i,” “my,” “myself”

Screenshot of the data set with unwanted features removed

3…Input cleaned headlines into word2vec model to obtain word vectors

Screenshot of the word vectors generated from cleaned headlines

4…Apply and average word vectors to all headlines

5…Return 2-dimension vector representations on headlines using t-SNE

6…Create dataset with target variable and 2-dimension vectors

Screenshot of the finished data set
Model Evaluation

Here we will use the caret package in R to evaluate the prediction accuracy on a variety of algorithms, including CART (rpart), logistic regression (glm), linear discriminate analysis (LDA), k-Nearest Neighbors (kNN), Support Vector Machine.with Radial Basis Function (svmRadial), and eXtreme Gradient Boosting (xgbLinear).

It is observed that all algorithms above have unsatisfying performance, with close to 50% accuracy and poor Kappa values. However, SVM with the Radial Basis Function kernel tends to perform better than other algorithms, with a 95% confidence interval on accuracy between 51% and 55%.

Performance on various classification algorithms

As a result, we will explore a few optimization options.

Performance Optimization

1…Stacking Multiple Models

We will first try to stack all the models to see if performance will improve.

Unfortunately, not much improvement is observed.

2…Averaging All Headlines

Instead of treating top headlines as individual variables, we will combine all headlines to create one set of 2-dimension vector.

Performance on the algorithms with averaged headlines

A slight increase of 0.1% in mean accuracy is observed in SVM algorithm.

3…Shifting Target Variable

We will try shifting the target variable back 1–7 days, as daily news may have lagged effects on stock performance. Alternatively, we can also shift the target variable forward a couple of days, however, this approach will contradict with the objective of predicting stock marketing performance using daily top news.

Performance on the algorithms with 3 day lag on target variable

We can observe that CART (rpart) now has the best performance in terms of mean accuracy at 53.66%. There is also another slight increase of 0.1% in mean accuracy is observed in the SVM algorithm.

4…Tuning Model Parameters

Here we will demonstrate how to tune parameters of the CART and SVM algorithms to see if we can increase the accuracy on our predictions.

CART

In rpart, the tuning parameter is cp, or complexity parameter. We will examine the accuracy on different values of cp. The optimal cp value is found to be 0.15, as shown below.

SVM

In SVM with Radial Basis Function, the tuning parameters are sigma and C (cost). We will examine the accuracy on different values of sigma and C. The optimal values are shown below, with sigma = 0.05 and C = 3.

R Code
library(devtools)
library(tm)
library(ggplot2)
library(text2vec)
library(rword2vec)
library(wordVectors)
library(tidyverse)
library(Rtsne)
library(caret)
library(mlbench)
library(caretEnsemble)
setwd("~/Desktop/stock")
#Simple Visualization on DJIA data
stock <- read.csv("DJIA_table.csv")
stock$Date <- as.Date(stock$Date)
ggplot(data=stock,aes(x=Date,y=Open))+geom_line()+labs(title="Dow Jones Industrial Average")
#Read the simplified data set
data <- read.csv("stock.csv")
#Remove unwanted columns for natural language processing
text <- data
text$Date <- NULL
text$Label <- NULL
#Remove unwanted characters/punctuations/upper-cases
for (i in 1:25){
text[,i] <- gsub('b"',"",text[,i])
text[,i] <- gsub("b'","",text[,i])
text[,i] <- gsub("[[:punct:]]", "", text[,i])
text[,i] <- tolower(text[,i])
}
write.table(text,"text.txt")
#Rename column names
new_text <- subset(text,select=paste("Top",1,sep=""))
new_text <- setNames(new_text,"x")
for (i in 2:25){
new_text2 <- subset(text,select=paste("Top",i,sep=""))
new_text2 <- setNames(new_text2,"x")
new_text <- rbind(new_text,new_text2)
}
write.table(new_text,"text.txt")
final_text <- new_text
#Remove stopwords
stopWords <- stopwords("en")
'%nin%' <- Negate('%in%')
train <- lapply(final_text$x, function(x) {
t <- unlist(strsplit(x, " "))
t[t %nin% stopWords]
})
train.df = as.data.frame(do.call(rbind, train))
train.df$x <- do.call(paste, c(train.df[,1:45], sep=" "))
finaltext <- subset(train.df,select="x")
write.table(finaltext,"text_data.txt")
#Input words to word2vec
model = train_word2vec("text_data.txt", output="vec.bin",
threads = 4, vectors = 100, window = 12, min_count = 5, iter=10, force=TRUE)
#Plot the word vectors
plot(model)
#Convert binary to text format
bin_to_txt("vec.bin","model1text.txt")
#Remove first rows manually in the .txt file
#Convert .txt to .cvs
m1 <- read.table("model1text.txt",
header = F,quote = "", row.names = NULL, stringsAsFactors = FALSE)
colnames(m1)[1] <- "word"
#Apply word vectors to headlines and return 2-dimension representations with t-SNE
label <- subset(data,select="Label")
labeltext <- cbind(label,text)
labeltext$id <- 1:nrow(labeltext)
labeltext$id <- as.factor(labeltext$id)
for (i in 2:24){
s <- strsplit(labeltext[,i], split= " ")
new <- data.frame(V1 = rep(labeltext$id, sapply(s, length)), V2 = unlist(s))
colnames(new) <- c("id","word")
new <- merge(new,m1,by="word")
new2 <- aggregate(V2 ~ id, new, mean)
for (j in 4:102){
new3 <- aggregate(new[,j] ~ id, new, mean)
new2 <- merge (new2,new3,by="id")
}
colnames(new2)[2:101] <- paste("V", 1:100, sep="")
tsne <- Rtsne(new2, dims = 2, perplexity=100, verbose=TRUE, max_iter = 500)
t = as.data.frame(tsne$Y)
colnames(t) <- c(paste("top",i,"x",sep=""),paste("top",i,"y",sep=""))
labeltext <- cbind(labeltext,t)
}
write.csv(labeltext,"labeltext.csv")
#Add back Label information
mydata <- labeltext
mydata[,2:27] <- NULL
mydata$Label[mydata$Label==1]<- "Yes"
mydata$Label[mydata$Label==0]<- "No"
#Classfication Models
control <- trainControl(method="repeatedcv", number=10, repeats=3, savePredictions=TRUE, classProbs=TRUE)
algorithmList <- c('rpart','glm','lda','knn','svmRadial','xgbLinear')
set.seed(1)
models <- caretList(Label~., data=mydata, trControl=control, methodList=algorithmList)
results <- resamples(models)
summary(results)
dotplot(results)
#-----Performance Optimization #1: Stacking Multiple Models-----
#Model Correlations
modelCor(results)
splom(results)
#Model Stacking with GLM
stackControl <- trainControl(method="repeatedcv", number=10, repeats=3, savePredictions=TRUE, classProbs=TRUE)
set.seed(2)
stack.glm <- caretStack(models, method="glm", metric="Accuracy", trControl=stackControl)
print(stack.glm)
#-----Performance Optimization #2: Averaging All Headlines-----
mydata2 <- mydata
mydata2$headline <- rowMeans(mydata2[,2:45], na.rm = TRUE)
mydata2[,2:45] <- NULL
control <- trainControl(method="repeatedcv", number=10, repeats=3, savePredictions=TRUE, classProbs=TRUE)
algorithmList <- c('rpart','glm','lda','knn','svmRadial','xgbLinear')
set.seed(1)
models <- caretList(Label~., data=mydata2, trControl=control, methodList=algorithmList)
results <- resamples(models)
summary(results)
dotplot(results)
#-----Performance Optimization #3: Shifting Target Variable-----
shift <-subset(mydata,select="Label")
shift1 <- as.data.frame(shift[-1,])
shift2 <- as.data.frame(shift[-c(1,2),])
shift3 <- as.data.frame(shift[-c(1,2,3),])
shift4 <- as.data.frame(shift[-c(1,2,3,4),])
shift5 <- as.data.frame(shift[-c(1,2,3,4,5),])
shift6 <- as.data.frame(shift[-c(1,2,3,4,5,6),])
shift7 <- as.data.frame(shift[-c(1,2,3,4,5,6,7),])
mydata1 <- mydata[-1989,]
mydata2 <- mydata[-c(1988,1989),]
mydata3 <- mydata[-c(1987,1988,1989),]
mydata4 <- mydata[-c(1986,1987,1988,1989),]
mydata5 <- mydata[-c(1985,1986,1987,1988,1989),]
mydata6 <- mydata[-c(1984,1985,1986,1987,1988,1989),]
mydata7 <- mydata[-c(1983,1984,1985,1986,1987,1988,1989),]
mydata1$Label <- NULL
mydata2$Label <- NULL
mydata3$Label <- NULL
mydata4$Label <- NULL
mydata5$Label <- NULL
mydata6$Label <- NULL
mydata7$Label <- NULL
mydata1 <- cbind(shift1,mydata1)
colnames(mydata1)[1] <- "label1day"
mydata2 <- cbind(shift2,mydata2)
colnames(mydata2)[1] <- "label2day"
mydata3 <- cbind(shift3,mydata3)
colnames(mydata3)[1] <- "label3day"
mydata4 <- cbind(shift4,mydata4)
colnames(mydata4)[1] <- "label4day"
mydata5 <- cbind(shift5,mydata5)
colnames(mydata5)[1] <- "label5day"
mydata6 <- cbind(shift6,mydata6)
colnames(mydata6)[1] <- "label6day"
mydata7 <- cbind(shift7,mydata7)
colnames(mydata7)[1] <- "label7day"
control <- trainControl(method="repeatedcv", number=10, repeats=3, savePredictions=TRUE, classProbs=TRUE)
algorithmList <- c('rpart','glm','lda','knn','svmRadial','xgbLinear')
set.seed(1)
models <- caretList(label3day~., data=mydata3, trControl=control, methodList=algorithmList)
results <- resamples(models)
summary(results)
dotplot(results)
#-----Performance Optimization #4: Tuning Model Parameters-----
set.seed(3)
control <- trainControl(method="repeatedcv", number=10, repeats=3, savePredictions=TRUE, classProbs=TRUE)
#Tuning cp in CART
rpartGrid <- expand.grid(cp = c(0.01,0.05,0.1,0.15))
rpartFit <- train(label3day ~ .,
data = mydata3,
method = "rpart",
tuneLength = 10,
tuneGrid=rpartGrid)
rpartFit
ggplot(rpartFit)
#Tuning sigma and C in SVM
svmGrid <- expand.grid(sigma = c(0.01,0.05,0.1,0.15),
C = c(1,3,5,7,9))
svmFit <- train(label3day ~ ., data = mydata3, 
method = "svmRadial",
trControl = control,
verbose = FALSE,
tuneGrid = svmGrid)
svmFit
ggplot(svmFit)

Questions, comments, or concerns?
jchen6912@gmail.com

Topics of interest

More Related Stories