By: and Niko Laskaris, customer-facing data scientist, Comet.ml

Sections
1. Introduction to NLP
2. Dataset Exploration
3. NLP Processing
4. Training
5. Hyperparameter Optimization
6. Resources for Future Learning

Introduction to NLP

Natural Language Processing (NLP) is a subfield of machine learning concerned with processing and analyzing natural language data, usually in the form of text or audio. Some common challenges within NLP include speech recognition, text generation, and sentiment analysis, while some high-profile products deploying NLP models include Apple’s Siri, Amazon’s Alexa, and many of the chatbots one might interact with online.

To get started with NLP and introduce some of the core concepts in the field, we’re going to build a model that tries to predict the sentiment (positive, neutral, or negative) of tweets relating to US Airlines, using the popular Twitter US Airline Sentiment dataset.

Code snippets are included in this post. For fully reproducible notebooks and scripts, see the project's Comet project page.

Dataset Exploration

Let’s start by importing some libraries. Make sure to install Comet for experiment management, visualizations, code tracking and hyperparameter optimization.

    # Comet
    from comet_ml import Experiment

A few standard packages: pandas, numpy, matplotlib, etc.
    # Standard packages
    import os
    import pickle
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

Nltk for natural language processing functions:

    # nltk
    import nltk
    from nltk.tokenize import sent_tokenize, word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem.snowball import SnowballStemmer

Sklearn and keras for machine learning models:

    # sklearn for preprocessing and machine learning models
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.utils import shuffle
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.feature_extraction.text import TfidfVectorizer

    # XGBoost, referenced as xgb in the training code
    import xgboost as xgb

    # Keras for neural networks
    import keras
    from keras.models import Sequential
    from keras.layers import Dense, Dropout, BatchNormalization, Flatten
    from keras.layers.embeddings import Embedding
    from keras.preprocessing import sequence
    from keras.utils import to_categorical
    from keras.callbacks import EarlyStopping

Now we’ll load the data:

    raw_df = pd.read_csv('twitter-airline-sentiment/Tweets.csv')

Let’s check the shape of the dataframe:

    raw_df.shape
    >>> (14640, 15)

So we’ve got 14,640 samples (tweets), each with 15 features. Let’s take a look at what features this dataset contains.

    raw_df.columns

    'tweet_id', 'airline_sentiment', 'airline_sentiment_confidence',
    'negativereason', 'negativereason_confidence', 'airline',
    'airline_sentiment_gold', 'name', 'negativereason_gold',
    'retweet_count', 'text', 'tweet_coord', 'tweet_created',
    'tweet_location', 'user_timezone'

Let’s also take a look at airline sentiment for each airline (code can be found on Comet):

    # Create a Comet experiment to start tracking our work
    experiment = Experiment(
        api_key='<HIDDEN>',
        project_name='nlp-airline',
        workspace='demo')
    experiment.add_tag('plotting')
    airlines = ['US Airways', 'United', 'American', 'Southwest', 'Delta', 'Virgin America']
    for i in airlines:
        new_df = raw_df[raw_df['airline'] == i]
        # value_counts() is indexed by sentiment label, so look counts up by name
        count = new_df['airline_sentiment'].value_counts()
        experiment.log_metric('{} negative'.format(i), count['negative'])
        experiment.log_metric('{} neutral'.format(i), count['neutral'])
        experiment.log_metric('{} positive'.format(i), count['positive'])
    experiment.end()

Every airline has more negative tweets than either neutral or positive tweets, with Virgin America receiving the most balanced spread of positive, neutral and negative of all the US airlines.

While we’re going to focus on NLP-specific analysis in this write-up, there are excellent sources of further feature-engineering and exploratory data analysis. Kaggle kernels here and here are particularly instructive in analyzing features such as audience and tweet length as related to sentiment.

Let’s create a new dataframe with only the tweet_id, text, and airline_sentiment features.

    df = raw_df[['tweet_id', 'text', 'airline_sentiment']]

And now let’s take a look at a few of the tweets themselves. What does the data look like?

    df['text'][1]
    > "@VirginAmerica plus you've added commercials to the experience... tacky."
    df['text'][750]
    > "@united you are offering us 8 rooms for 32 people #FAIL"
    df['text'][5800]
    > "@SouthwestAir Your #Android Wi-Fi experience is terrible! $8 is a ripoff! I can't get to @NASCAR or MRN for @DISupdates #BudweiserDuels"

Next, we’re going to apply a few standard NLP preprocessing techniques to get our dataset ready for training.

NLP Processing

To build NLP models, we first have to preprocess the text, converting it from human language into a machine-readable format. Here we will cover some of the standard practices: tokenization, stopword removal, and stemming. You can consult this post to learn about additional text preprocessing techniques.

Tokenization

Given a character sequence and a defined document unit, tokenization is the task of chopping it up into discrete pieces called tokens. In the process of chopping up text, tokenization also commonly involves throwing away certain characters, such as punctuation.

It is simple (and often useful) to think of tokens as words, but to fine-tune your understanding of the specific terminology of NLP tokenization, the Stanford NLP group’s overview is quite useful.

The NLTK library has a built-in tokenizer, word_tokenize, which we will use to tokenize the US Airline Tweets.
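To make the idea of "chopping up" text concrete, here is a minimal sketch of a tokenizer written with a regular expression. It is an illustration only: `simple_tokenize` and its pattern are hypothetical names introduced here, and NLTK's tokenizer follows its own, more careful rules (for example around contractions), so its output will differ.

```python
import re

# Hypothetical pattern for illustration: an optional @ or # prefix,
# a run of word characters, and an optional apostrophe suffix (can't, you've).
TOKEN_PATTERN = re.compile(r"[@#]?\w+(?:'\w+)?")


def simple_tokenize(text):
    """Lowercase the text and return word-like tokens, dropping punctuation."""
    return TOKEN_PATTERN.findall(text.lower())


tokens = simple_tokenize("@united you are offering us 8 rooms for 32 people #FAIL")
print(tokens)
```

Keeping the `@` and `#` prefixes attached is a design choice for tweets: it lets a later step tell a mention or hashtag apart from an ordinary word.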
With NLTK, our tokenizer is a thin wrapper around word_tokenize:

    from nltk.tokenize import word_tokenize

    def tokenize(sentence):
        tokenized_sentence = word_tokenize(sentence)
        return tokenized_sentence

Stopword Removal

Sometimes, common words that may be of little value in determining the semantic quality of a document are excluded entirely from the vocabulary. These are called stop words.

A general strategy for determining a list of stop words is to sort the terms by collection frequency (the total number of times each term appears in the document collection) and then to filter out the most frequent terms as a stop list, hand-filtered by semantic content.

NLTK has a standard stopword list (nltk.corpus.stopwords) we will adopt here.
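The collection-frequency strategy can be sketched in a few lines. This is a toy illustration rather than how NLTK's list was built: `candidate_stop_words` and the three example documents are made up here, and in practice the resulting candidates would still be hand-filtered.

```python
from collections import Counter


def candidate_stop_words(tokenized_docs, top_k):
    """Return the top_k terms by collection frequency.

    Collection frequency is the total number of times a term occurs
    across all documents in the collection.
    """
    collection_frequency = Counter()
    for tokens in tokenized_docs:
        collection_frequency.update(tokens)
    return [term for term, _ in collection_frequency.most_common(top_k)]


# Made-up documents, already tokenized
docs = [
    ["the", "flight", "was", "late"],
    ["the", "crew", "was", "great"],
    ["the", "bag", "is", "lost"],
]
print(candidate_stop_words(docs, 2))
```

On this tiny collection the two most frequent terms are "the" and "was", which is the kind of low-information vocabulary a stop list is meant to catch.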
The stopword filter is a method of our PreProcessor class:

    from nltk.corpus import stopwords

    class PreProcessor:
        def __init__(self, df, column_name):
            self.stopwords = set(stopwords.words('english'))

        def remove_stopwords(self, sentence):
            filtered_sentence = []
            for w in sentence:
                # Drop stopwords, single characters and URL fragments
                if ((w not in self.stopwords) and
                        (len(w) > 1) and
                        (w[:2] != '//') and
                        (w != 'https')):
                    filtered_sentence.append(w)
            return filtered_sentence

Stemming

For grammatical purposes, documents use different forms of a word (look, looks, looking, looked) that in many situations have very similar semantic qualities. Stemming is a rough process by which variants or related forms of a word are reduced (stemmed) to a common base form. As stemming is a removal of prefixed or suffixed letters from a word, the output may or may not be a word belonging to the language corpus. Lemmatization is a more precise process by which words are properly reduced to the base word from which they came.

Examples:

Stemming: car, cars, car’s, cars’ become car
Lemmatization: am, are, is become be
Stemmed and Lemmatized Sentence: ‘the boy’s cars are different colors’ become ‘the boy car is differ color’

The most common algorithm for stemming English text is Porter’s algorithm. Snowball, a language for stemming algorithms, was developed by Porter in 2001 and is the basis for NLTK's SnowballStemmer (in nltk.stem.snowball), which we will use here.
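To see why stemming is described as a rough process, here is a deliberately naive suffix stripper. It is a toy under stated assumptions, not Porter's or Snowball's rule set: the suffix list, `naive_stem` and the `min_stem_length` guard are all invented for this illustration.

```python
# Toy suffix list for illustration only; real stemmers use ordered rule sets.
SUFFIXES = ("ing", "ed", "s")


def naive_stem(word, min_stem_length=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word


print([naive_stem(w) for w in ["look", "looks", "looking", "looked"]])
print(naive_stem("flies"))
```

All four forms of "look" collapse to the same stem, while "flies" becomes "flie", which is not an English word. That is the trade-off described above: related forms are merged cheaply, but the output need not belong to the language.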
The stemmer is set up in the same PreProcessor class:

    from nltk.stem.snowball import SnowballStemmer

    class PreProcessor:
        def __init__(self, df, column_name):
            self.stemmer = SnowballStemmer('english')

        def stem(self, sentence):
            return [self.stemmer.stem(word) for word in sentence]

Code for these preprocessing steps can be found on Comet.

Next we’ll create a PreProcessor object, containing methods for each of these steps, and run it on the text column of our data frame to tokenize, stem and remove stopwords from the tweets.

    preprocessor = PreProcessor(df, 'text')
    df['cleaned_text'] = preprocessor.full_preprocess()

And now we’ll split our data into training, validation and test sets.

    df = shuffle(df, random_state=seed)
    # Keep 1000 samples of the data as test set
    test_set = df[:1000]
    # Get training and validation data
    X_train, X_val, y_train, y_val = train_test_split(
        df['cleaned_text'][1000:], df['airline_sentiment'][1000:],
        test_size=0.2, random_state=seed)
    # Get sentiment labels for test set
    y_test = test_set['airline_sentiment']

Now that we’ve split our data into train, validation and test sets, we’ll TF-IDF vectorize them.

TF-IDF Vectorization

TF-IDF, or term frequency-inverse document frequency, is a numerical statistic that reflects how important a word is to a document in a collection or corpus. It is often used to produce weights associated with words that can be useful in information retrieval searches or text mining.

The tf-idf value of a word increases proportionally to the number of times the word appears in a document, and is offset by the number of documents in the corpus that contain that word. This offset helps adjust for the fact that some words appear more frequently in general (think of how stopwords like ‘a’, ‘the’, ‘to’ might have incredibly high tf-idf values if not for offsetting).

Source: https://becominghuman.ai/word-vectorizing-and-statistical-meaning-of-tf-idf-d45f3142be63

We will use scikit-learn’s TfidfVectorizer, which converts a collection of raw documents (our Twitter dataset) into a matrix of TF-IDF features.
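To make the weighting concrete, here is a small sketch that computes the textbook form of tf-idf by hand: term frequency within a document multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term. The function name and example documents are made up for illustration. scikit-learn's TfidfVectorizer uses a smoothed idf and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter


def tf_idf(tokenized_docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each term appear?
    document_frequency = Counter()
    for tokens in tokenized_docs:
        document_frequency.update(set(tokens))

    weights = []
    for tokens in tokenized_docs:
        term_counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / document_frequency[term])
            for term, count in term_counts.items()
        })
    return weights


# Made-up documents, already tokenized
docs = [
    ["the", "flight", "was", "late"],
    ["the", "crew", "was", "great"],
    ["the", "bag", "is", "lost"],
]
weights = tf_idf(docs)
print(weights[0])
```

In the first document, "the" appears in every document and gets a weight of zero, "was" appears in two of three and gets a small weight, and "late" appears only here and gets the largest weight. This is the offsetting effect described above.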
We fit the vectorizer on the training split only, then apply the fitted vectorizer to the validation and test splits:

    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(X_train)
    X_val = vectorizer.transform(X_val)
    X_test = vectorizer.transform(test_set['cleaned_text'])

Training

We are ready to start training our model. The first thing we’ll do is create a Comet experiment object:

    experiment = Experiment(api_key='your-personal-key',
                            project_name='nlp-airline', workspace='demo')

Next, we’ll build a Light Gradient-Boosting classifier (LGBM, implemented below with scikit-learn's GradientBoostingClassifier), an XGBoost classifier, and a relatively straightforward neural network with keras, and compare how each of these models performs. Oftentimes it’s hard to tell which architecture will perform best without testing them out. Comet’s project-level view makes it easy to compare how different experiments are performing and lets you move from model selection to model tuning.

LGBM
    # sklearn's Gradient Boosting Classifier (GBM)
    gbm = GradientBoostingClassifier(n_estimators=200, max_depth=6, random_state=seed)
    gbm.fit(X_train, y_train)
    # Check results
    train_pred = gbm.predict(X_train)
    val_pred = gbm.predict(X_val)
    val_accuracy = round(accuracy_score(y_val, val_pred), 4)
    train_accuracy = round(accuracy_score(y_train, train_pred), 4)
    # log to comet
    experiment.log_metric('val_acc', val_accuracy)
    experiment.log_metric('Accuracy', train_accuracy)

XGBOOST

    xgb_params = {
        'objective': 'multi:softmax',
        'eval_metric': 'mlogloss',
        'eta': 0.1,
        'max_depth': 6,
        'num_class': 3,
        'lambda': 0.8,
        'estimators': 200,
        'seed': seed
    }
    # Encode the string sentiment labels as integer class codes
    target_train = y_train.astype('category').cat.codes
    target_val = y_val.astype('category').cat.codes
    # Transform data into a matrix so that we can use XGBoost
    d_train = xgb.DMatrix(X_train, label=target_train)
    d_val = xgb.DMatrix(X_val, label=target_val)
    # Fit XGBoost
    watchlist = [(d_train, 'train'), (d_val, 'validation')]
    bst = xgb.train(xgb_params, d_train, 400, watchlist, early_stopping_rounds=50, verbose_eval=0)
    # Check results for XGBoost
    train_pred = bst.predict(d_train)
    val_pred = bst.predict(d_val)
    experiment.log_metric('val_acc', round(accuracy_score(target_val, val_pred) * 100, 4))
    experiment.log_metric('Accuracy', round(accuracy_score(target_train, train_pred) * 100, 4))

Neural Net
    # Generator so we can easily feed batches of data to the neural network
    def batch_generator(X, y, batch_size, shuffle):
        # Round up so a final partial batch still counts as a batch
        number_of_batches = int(np.ceil(X.shape[0] / batch_size))
        counter = 0
        sample_index = np.arange(X.shape[0])
        if shuffle:
            np.random.shuffle(sample_index)
        while True:
            batch_index = sample_index[batch_size * counter:batch_size * (counter + 1)]
            # Convert only the current batch of the sparse matrix to a dense array
            X_batch = X[batch_index, :].toarray()
            y_batch = y[batch_index]
            counter += 1
            yield X_batch, y_batch
            # Start a new pass over the data once every batch has been served
            if counter == number_of_batches:
                if shuffle:
                    np.random.shuffle(sample_index)
                counter = 0
    # Initialize sklearn's one-hot encoder class
    onehot_encoder = OneHotEncoder(sparse=False)
    integer_encoded_train = np.array(y_train).reshape(len(y_train), 1)
    onehot_encoded_train = onehot_encoder.fit_transform(integer_encoded_train)
    integer_encoded_val = np.array(y_val).reshape(len(y_val), 1)
    # Reuse the encoder fitted on the training labels
    onehot_encoded_val = onehot_encoder.transform(integer_encoded_val)
    experiment.add_tag('NN')
    # Neural network architecture
    initializer = keras.initializers.he_normal(seed=seed)
    activation = keras.activations.elu
    optimizer = keras.optimizers.Adam(lr=0.0002, beta_1=0.9, beta_2=0.999, epsilon=1e-8)
    es = EarlyStopping(monitor='val_acc', mode='max', verbose=1, patience=4)
    # Build model architecture
    model = Sequential()
    model.add(Dense(20, activation=activation, kernel_initializer=initializer, input_dim=X_train.shape[1]))
    model.add(Dropout(0.5))
    model.add(Dense(3, activation='softmax', kernel_initializer=initializer))
    # Three one-hot classes with a softmax output call for categorical cross-entropy
    model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
    # Hyperparameters
    epochs = 15
    batch_size = 32
    # Fit the model using the batch_generator
    hist = model.fit_generator(
        generator=batch_generator(X_train, onehot_encoded_train, batch_size=batch_size, shuffle=True),
        epochs=epochs,
        validation_data=(X_val.toarray(), onehot_encoded_val),
        steps_per_epoch=int(np.ceil(X_train.shape[0] / batch_size)),
        callbacks=[es])

Comparing our models using Comet’s project view, we can see that our Neural Network models are outperforming the XGBoost and LGBM experiments by a considerable margin.

Figure: Comet Experiment List View

Let’s select the neural net architecture for now and fine-tune it. Note: since we’ve stored all of our experiments (including the XGBoost and LGBM runs we’re not going to use right now), if we decide we’d like to revisit those architectures in the future, all we’ll have to do is view those experiments in the Comet project page and we’ll be able to reproduce them instantly.

Hyperparameter Optimization

Now that we’ve selected our architecture from an initial search of XGBoost, LGBM and a simple keras implementation of a neural network, we’ll need to conduct a hyperparameter optimization to fine-tune our model. Hyperparameter optimization can be an incredibly difficult, computationally expensive, and slow process for complicated modeling tasks. Comet has built an optimization service that can conduct this search for you. Simply pass in the algorithm you’d like to sweep the hyperparameter space with, the hyperparameters and ranges to search, and a metric to minimize or maximize, and Comet can handle this part of your modeling process for you.
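To show what a sweep over a search space involves, here is a minimal sketch using plain random search. This is a swapped-in technique for illustration, not Comet's Bayesian algorithm. The ranges mirror the batch_size, dropout and lr ranges used in this section's sweep, and `toy_loss` is a hypothetical stand-in for training a model and returning its validation loss.

```python
import random

# Ranges mirror the sweep in this section: (type, min, max)
search_space = {
    "batch_size": ("integer", 16, 128),
    "dropout": ("float", 0.1, 0.5),
    "lr": ("float", 0.0001, 0.001),
}


def sample_params(space, rng):
    """Draw one hyperparameter configuration uniformly from the search space."""
    params = {}
    for name, (kind, low, high) in space.items():
        if kind == "integer":
            params[name] = rng.randint(low, high)
        else:
            params[name] = rng.uniform(low, high)
    return params


def toy_loss(params):
    """Hypothetical objective standing in for a real training run."""
    return (params["dropout"] - 0.3) ** 2 + (params["lr"] - 0.0005) ** 2


def random_search(space, objective, n_trials, seed=0):
    """Evaluate n_trials random configurations and keep the lowest loss."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        params = sample_params(space, rng)
        loss = objective(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


best_params, best_loss = random_search(search_space, toy_loss, n_trials=50)
print(best_params, best_loss)
```

A Bayesian optimizer differs in how it picks the next configuration: rather than sampling blindly, it uses the results of earlier trials to decide where to look next, which usually matters when each trial is an expensive training run.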
The optimizer configuration and training loop look like this:

    import logging
    from comet_ml import Optimizer

    config = {
        "algorithm": "bayes",
        "parameters": {
            "batch_size": {"type": "integer", "min": 16, "max": 128},
            "dropout": {"type": "float", "min": 0.1, "max": 0.5},
            "lr": {"type": "float", "min": 0.0001, "max": 0.001},
        },
        "spec": {
            "metric": "loss",
            "objective": "minimize",
        },
    }
    opt = Optimizer(config, api_key="<HIDDEN>", project_name="nlp-airline", workspace="demo")
    for experiment in opt.get_experiments():
        experiment.add_tag('LR-Optimizer')
        # Neural network architecture
        initializer = keras.initializers.he_normal(seed=seed)
        activation = keras.activations.elu
        optimizer = keras.optimizers.Adam(lr=experiment.get_parameter("lr"), beta_1=0.99, beta_2=0.999, epsilon=1e-8)
        es = EarlyStopping(monitor='val_acc', mode='max', verbose=1, patience=4)
        batch_size = experiment.get_parameter("batch_size")
        # Build model architecture
        model = Sequential()  # build and fit the model like above
        # Score on the validation split, which matches the validation labels
        score = model.evaluate(X_val.toarray(), onehot_encoded_val, verbose=0)
        logging.info("Score %s", score)

After running our optimization, it is straightforward to select the hyperparameter configuration that yielded the highest accuracy, lowest loss, or whatever performance you were seeking to optimize. Here we keep the optimization problem rather simple: we only search epoch, batch_size, and dropout. The parallel coordinates chart shown below, another native Comet feature, provides a useful visualization of the underlying hyperparameter space our optimizer has traversed:

Figure: Comet Visualizations Dashboard

Let’s run another optimization sweep, this time including a range of learning rates to test.

Figure: Comet Visualizations Dashboard

And again we get a view into the regions of the underlying hyperparameter space that are yielding higher val_acc values.

Say now we’d like to compare the performance of two of our better models to keep fine-tuning. Simply select two experiments from your list and click the Diff button, and Comet will let you visually inspect every code and hyperparameter change, as well as side-by-side visualizations of both experiments.
Figure: Comet Experiment Diff View

From here you can continue your model building. Fine-tune one of the models we’ve pulled out of the architecture comparison and parameter optimization sweeps, or go back to the start and compare new architectures against our baseline models. All of your work is saved in your Comet project space.

Resources for Future Learning

For additional learning resources in NLP, check out fastai’s new NLP course or this blog post published by Hugging Face that covers some of the best recent papers and trends in NLP.

Comet is doing for machine learning what GitHub did for software. We allow data scientists and teams to automatically track their datasets, code changes, experimentation history and production models, creating efficiency, transparency, and reproducibility. Sign up.


Getting Started with Natural Language Processing: US Airline Sentiment Analysis
