A classification approach with natural language processing in R

Audience
This short blog post is intended for someone who has a basic understanding of natural language processing algorithms in addition to familiarity with machine learning basics, particularly classification techniques, and is keen to learn how to actually integrate the two in R. We will go over detailed data preparation as well as model optimization. It is also recommended to review this post for a brief introduction to word2vec.

Objective
The objective is to learn how the Dow Jones Industrial Average is affected by daily popular news on Reddit.

Dataset
The datasets can be found here. The simplified dataset consists of daily Dow Jones data and the top daily headlines from the Reddit WorldNews Channel for the past 8 years. The target variable is Label (0 is down and 1 is up).

[Screenshot of the stock news data set]

Approach
We will use the word2vec algorithm to first create word vectors of 100 dimensions from all the headlines. We will then transform the headlines into sentence vectors by averaging the word vectors of the words each sentence consists of. Finally, we will use a variety of classification algorithms to predict whether the Dow Jones Industrial Average will go up or down.

Data Preparation
1. Remove undesired characters, including punctuation and upper cases.
2. Remove English stop words such as "i," "my," and "myself."

[Screenshot of the data set with unwanted features removed]

3. Input the cleaned headlines into the word2vec model to obtain word vectors.

[Screenshot of the word vectors generated from cleaned headlines]

4. Apply and average the word vectors across all headlines (a minimal sketch of steps 4 and 5 follows this list).
5. Reduce the averaged vectors to two-dimensional representations using t-SNE.
6. Create the final dataset with the target variable and the two-dimensional vectors.

[Screenshot of the finished data set]
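To make steps 4 and 5 concrete, here is a minimal sketch of averaging word vectors into sentence vectors and reducing them with t-SNE. It assumes m1 is the word-vector data frame built in the R Code section at the end of this post (a word column followed by 100 numeric columns), and headlines is a hypothetical character vector of cleaned headlines; the full script does the equivalent per headline column with aggregate().

library(Rtsne)

#Average the vectors of a headline's words into one sentence vector
sentence_vector <- function(headline, vectors) {
  tokens <- unlist(strsplit(headline, " "))
  matched <- vectors[vectors$word %in% tokens, -1]  #drop the "word" column
  if (nrow(matched) == 0) return(rep(NA_real_, ncol(vectors) - 1))
  colMeans(matched)                                 #average over the words found
}

sent_mat <- t(sapply(headlines, sentence_vector, vectors = m1))
sent_mat <- sent_mat[complete.cases(sent_mat), ]    #drop headlines with no known words

#Reduce the 100-dimensional sentence vectors to 2 dimensions
#(Rtsne requires 3 * perplexity < nrow(sent_mat) - 1)
tsne <- Rtsne(sent_mat, dims = 2, perplexity = 100, max_iter = 500,
              check_duplicates = FALSE)
head(tsne$Y)  #the two-dimensional headline representations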
Model Evaluation
Here we will use the caret package in R to evaluate prediction accuracy across a variety of algorithms, including CART (rpart), logistic regression (glm), linear discriminant analysis (lda), k-Nearest Neighbors (knn), Support Vector Machine with a Radial Basis Function kernel (svmRadial), and eXtreme Gradient Boosting (xgbLinear).

All of the algorithms above show unsatisfying performance, with accuracy close to 50% and poor Kappa values. However, SVM with the Radial Basis Function kernel tends to perform better than the other algorithms, with a 95% confidence interval on accuracy between 51% and 55%.

[Performance on various classification algorithms]

As a result, we will explore a few optimization options.

Performance Optimization
1. Stacking Multiple Models
We will first try to stack all the models to see if performance improves. Unfortunately, not much improvement is observed.

2. Averaging All Headlines
Instead of treating the top headlines as individual variables, we will combine all headlines to create one set of two-dimensional vectors.

[Performance on the algorithms with averaged headlines]

A slight increase of 0.1% in mean accuracy is observed for the SVM algorithm.

3. Shifting Target Variable
We will try shifting the target variable back 1-7 days, as daily news may have lagged effects on stock performance. Alternatively, we could shift the target variable forward a couple of days; however, that approach would contradict the objective of predicting stock market performance using daily top news.

[Performance on the algorithms with a 3-day lag on the target variable]

We can observe that CART (rpart) now has the best performance in terms of mean accuracy, at 53.66%. There is also another slight increase of 0.1% in mean accuracy for the SVM algorithm.

4. Tuning Model Parameters
Here we will demonstrate how to tune the parameters of the CART and SVM algorithms to see if we can increase prediction accuracy.

CART
In rpart, the tuning parameter is cp, the complexity parameter. We will examine accuracy across different values of cp. The optimal cp value is found to be 0.15, as shown below.

SVM
In SVM with the Radial Basis Function kernel, the tuning parameters are sigma and C (cost). We will examine accuracy across different values of sigma and C. The optimal values are found to be sigma = 0.05 and C = 3, as shown below.

R Code

library(devtools)
library(tm)
library(ggplot2)
library(text2vec)
library(rword2vec)
library(wordVectors)
library(tidyverse)
library(Rtsne)
library(caret)
library(mlbench)
library(caretEnsemble)

setwd("~/Desktop/stock")

#Simple visualization on DJIA data
stock <- read.csv("DJIA_table.csv")
stock$Date <- as.Date(stock$Date)

ggplot(data=stock, aes(x=Date, y=Open)) +
  geom_line() +
  labs(title="Dow Jones Industrial Average")

#Read the simplified data set
data <- read.csv("stock.csv")

#Remove unwanted columns for natural language processing
text <- data
text$Date <- NULL
text$Label <- NULL

#Remove unwanted characters/punctuation/upper cases
for (i in 1:25){
  text[,i] <- gsub('b"', "", text[,i])
  text[,i] <- gsub("b'", "", text[,i])
  text[,i] <- gsub("[[:punct:]]", "", text[,i])
  text[,i] <- tolower(text[,i])
}

write.table(text, "text.txt")

#Stack the 25 headline columns into a single column
new_text <- subset(text, select=paste("Top", 1, sep=""))
new_text <- setNames(new_text, "x")

for (i in 2:25){
  new_text2 <- subset(text, select=paste("Top", i, sep=""))
  new_text2 <- setNames(new_text2, "x")
  new_text <- rbind(new_text, new_text2)
}

write.table(new_text, "text.txt")

final_text <- new_text

#Remove stop words
stopWords <- stopwords("en")
'%nin%' <- Negate('%in%')

train <- lapply(final_text$x, function(x) {
  t <- unlist(strsplit(x, " "))
  t[t %nin% stopWords]
})

train.df <- as.data.frame(do.call(rbind, train))
train.df$x <- do.call(paste, c(train.df[,1:45], sep=" "))

finaltext <- subset(train.df, select="x")

write.table(finaltext, "text_data.txt")

#Input words to word2vec
model <- train_word2vec("text_data.txt", output="vec.bin", threads = 4,
                        vectors = 100, window = 12, min_count = 5,
                        iter = 10, force = TRUE)

#Plot the word vectors
plot(model)

#Convert binary to text format
bin_to_txt("vec.bin", "model1text.txt")

#Remove the first rows manually in the .txt file

#Convert .txt to .csv
m1 <- read.table("model1text.txt", header = F, quote = "",
                 row.names = NULL, stringsAsFactors = FALSE)

colnames(m1)[1] <- "word"

#Apply word vectors to headlines and return 2-dimensional representations with t-SNE
label <- subset(data, select="Label")
labeltext <- cbind(label, text)
labeltext$id <- 1:nrow(labeltext)
labeltext$id <- as.factor(labeltext$id)

#For each headline column, average the word vectors per day, then reduce to 2-D
for (i in 2:24){
  s <- strsplit(labeltext[,i], split=" ")
  new <- data.frame(V1 = rep(labeltext$id, sapply(s, length)), V2 = unlist(s))
  colnames(new) <- c("id", "word")
  new <- merge(new, m1, by="word")
  new2 <- aggregate(V2 ~ id, new, mean)
  for (j in 4:102){
    new3 <- aggregate(new[,j] ~ id, new, mean)
    new2 <- merge(new2, new3, by="id")
  }
  colnames(new2)[2:101] <- paste("V", 1:100, sep="")
  tsne <- Rtsne(as.matrix(new2[,-1]), dims = 2, perplexity=100,
                verbose=TRUE, max_iter = 500)  #exclude the id column
  t <- as.data.frame(tsne$Y)
  colnames(t) <- c(paste("top",i,"x",sep=""), paste("top",i,"y",sep=""))
  labeltext <- cbind(labeltext, t)
}

write.csv(labeltext, "labeltext.csv")

#Add back Label information; drop the raw headline and id columns
mydata <- labeltext
mydata[,2:27] <- NULL

mydata$Label[mydata$Label == 0] <- "No"
mydata$Label[mydata$Label == 1] <- "Yes"

#Classification models
control <- trainControl(method="repeatedcv", number=10, repeats=3,
                        savePredictions=TRUE, classProbs=TRUE)
algorithmList <- c('rpart','glm','lda','knn','svmRadial','xgbLinear')
set.seed(1)
models <- caretList(Label~., data=mydata, trControl=control, methodList=algorithmList)
results <- resamples(models)
summary(results)
dotplot(results)

#-----Performance Optimization #1: Stacking Multiple Models-----
#Model correlations
modelCor(results)
splom(results)

#Model stacking with GLM
stackControl <- trainControl(method="repeatedcv", number=10, repeats=3,
                             savePredictions=TRUE, classProbs=TRUE)
set.seed(2)
stack.glm <- caretStack(models, method="glm", metric="Accuracy", trControl=stackControl)
print(stack.glm)

#-----Performance Optimization #2: Averaging All Headlines-----
mydata2 <- mydata
mydata2$headline <- rowMeans(mydata2[,2:45], na.rm = TRUE)
mydata2[,2:45] <- NULL

control <- trainControl(method="repeatedcv", number=10, repeats=3,
                        savePredictions=TRUE, classProbs=TRUE)
algorithmList <- c('rpart','glm','lda','knn','svmRadial','xgbLinear')
set.seed(1)
models <- caretList(Label~., data=mydata2, trControl=control, methodList=algorithmList)
results <- resamples(models)
summary(results)
dotplot(results)

#-----Performance Optimization #3: Shifting Target Variable-----
#Pair day t's headlines with day t+k's label by dropping the first k labels
#and the last k feature rows
shift <- subset(mydata, select="Label")
shift1 <- as.data.frame(shift[-1,])
shift2 <- as.data.frame(shift[-c(1,2),])
shift3 <- as.data.frame(shift[-c(1,2,3),])
shift4 <- as.data.frame(shift[-c(1,2,3,4),])
shift5 <- as.data.frame(shift[-c(1,2,3,4,5),])
shift6 <- as.data.frame(shift[-c(1,2,3,4,5,6),])
shift7 <- as.data.frame(shift[-c(1,2,3,4,5,6,7),])

mydata1 <- mydata[-1989,]
mydata2 <- mydata[-c(1988,1989),]
mydata3 <- mydata[-c(1987,1988,1989),]
mydata4 <- mydata[-c(1986,1987,1988,1989),]
mydata5 <- mydata[-c(1985,1986,1987,1988,1989),]
mydata6 <- mydata[-c(1984,1985,1986,1987,1988,1989),]
mydata7 <- mydata[-c(1983,1984,1985,1986,1987,1988,1989),]

mydata1$Label <- NULL
mydata2$Label <- NULL
mydata3$Label <- NULL
mydata4$Label <- NULL
mydata5$Label <- NULL
mydata6$Label <- NULL
mydata7$Label <- NULL

mydata1 <- cbind(shift1, mydata1)
colnames(mydata1)[1] <- "label1day"
mydata2 <- cbind(shift2, mydata2)
colnames(mydata2)[1] <- "label2day"
mydata3 <- cbind(shift3, mydata3)
colnames(mydata3)[1] <- "label3day"
mydata4 <- cbind(shift4, mydata4)
colnames(mydata4)[1] <- "label4day"
mydata5 <- cbind(shift5, mydata5)
colnames(mydata5)[1] <- "label5day"
mydata6 <- cbind(shift6, mydata6)
colnames(mydata6)[1] <- "label6day"
mydata7 <- cbind(shift7, mydata7)
colnames(mydata7)[1] <- "label7day"

control <- trainControl(method="repeatedcv", number=10, repeats=3,
                        savePredictions=TRUE, classProbs=TRUE)
algorithmList <- c('rpart','glm','lda','knn','svmRadial','xgbLinear')
set.seed(1)
models <- caretList(label3day~., data=mydata3, trControl=control, methodList=algorithmList)
results <- resamples(models)
summary(results)
dotplot(results)

#-----Performance Optimization #4: Tuning Model Parameters-----
set.seed(3)
control <- trainControl(method="repeatedcv", number=10, repeats=3,
                        savePredictions=TRUE, classProbs=TRUE)

#Tuning cp in CART
rpartGrid <- expand.grid(cp = c(0.01, 0.05, 0.1, 0.15))

rpartFit <- train(label3day ~ ., data = mydata3, method = "rpart",
                  tuneLength = 10, tuneGrid = rpartGrid)

rpartFit
ggplot(rpartFit)

#Tuning sigma and C in SVM
svmGrid <- expand.grid(sigma = c(0.01, 0.05, 0.1, 0.15), C = c(1, 3, 5, 7, 9))

svmFit <- train(label3day ~ ., data = mydata3, method = "svmRadial",
                trControl = control, verbose = FALSE, tuneGrid = svmGrid)

svmFit
ggplot(svmFit)
Questions, comments, or concerns? jchen6912@gmail.com