“A timestamp is a sequence of characters or encoded information identifying when a certain event occurred, usually giving date and time of day, sometimes accurate to a small fraction of a second.” (Source Wikipedia — Nov 23,2016)
Abstract time specification along with definite time-frame references are commonly used in the text used in online promotions, especially while mentioning the details related to an offer validity. Even in our daily conversation between people, time is not always described in exact terms. Natural language is full of abstract time specifications in written as well as verbal dialogue. For example we often use phrases like “time is running out”, “few hours”. However it is not straightforward to quantify the urgency here and pin point a timestamp based on colloquial usage of time representations.
When the task is to sort different promotions based on their validity, natural language processing techniques can come very handy in figuring out when a coupon (promotion) can be used based on the duration of time a promotion is valid for. Following are some example sentences extracted from promotions’ data which was collected from different sources.
“Offer expires in 7 days. Exclusions apply. Discount taken at checkout. See disclaimer for details”
“Preferred Customer Private Sale — Thursday Feb. 11 at 2 pm — To redeem use code TODAY at checkout.”
“Limited Time Sale Events Powered by HauteLook”
Given a reference date (time when the promotion was sent), the first 2 sentences exactly specify the date and time when the promotion is valid, the last sentence though does not give such a concrete time frame yet it clearly indicates that the sale event is valid for a small duration. It is not uncommon to find such sentences in the promotion data, where the exact end time (coupon expiry) or begin time (when the coupon comes into effect) is not clearly specified.
Modeling sequence of words — If one uses a simple rule based approach which looks for specific sequence of words to determine the duration, it can quickly become unmanageable given the number of possible combinations of tokens to determine the timestamp (month, year, date, hour and minute etc) . The combinations will only increase if we have to determine both start and end time for a promotion. The goal is to minimize the number of hard-coded rules and exploit possible patterns or common underlying structure used in sentences specifying time and craft rules on the underlying structure.
Consider a sentence “Valid from 10th October to 15th November”, there are two timestamps in the sentence, one is 10th October and the other is 15th November. The promotion is valid between the mentioned dates and the start and end dates can be determined by looking at the word sequence before the actual dates in the sentence. The words “Valid from” indicate that the date 10th Oct is start date and “to” indicates 15th Nov is end date . This example shows the importance of using context [surrounding words] to determine the duration of validity of a promotion.
Point in time vs duration — The three sentences mentioned above also highlight another aspect of time description in general text. Phrases like “In 7 weeks” and “limited time” indicate duration with respect to a reference date and phrases like “15th November” indicate the exact point in time when a coupon validity starts or ends. Based on this observation we can come up with a grammar (or set of labels) to explain the time description structure in a given sentence.
Grammar (labels for describing words in a sentence which indicate time) — The labeling scheme follows the following design:
Any label starting with letter ‘B’ corresponds to Begin (start) Time of validity of a promotion and ‘E’ corresponds to the End Time. The last letter in the label, if it is ‘S’ means it is a string and ’N’ implies the word is an ‘Integer’ or ‘DIGIT’. “TP” in a label indicates “Time Point” or “Point in time” and “TD” implies “Time Duration”. Other characters “D”, “H”, “M”, “Y”, “m” are assigned to “Day”, “Hour”, “Month”, “Year” and “minute” respectively. “AP” in a tag indicates AM/PM. “Z” Indicates time zone and single P and S in time duration labels (TDPS, TDSS) are used to indicate ‘prefix’ and ‘suffix’ respectively. Duration is normally specified in a sentence as a (prefix, suffix) tuple — for example ‘7 weeks’ , ‘2 days’, ‘Nine hours’ etc.
The complete list of labels is hence — TDPS, TDPN, TDSS, (B|E)TPDN, (B|E)TPDS, (B|E)TPMS, (B|E)TPMN, (B|E)TPHS, (B|E)TPHN, (B|E)TPYN, (B|E)TPZ, (B|E)TPAP
All words corresponding to a tag ending with letter “N” are numbers. Some of the time zone strings are “et”, “pt”, “pdt” etc. Time duration suffixes are always strings (for eg. ‘hours’, ‘weeks’, ‘days’ etc) and duration prefixes can be both string and digits (eg. 2 , “two”, “limited”, “few” etc). The hour strings include words like “midnight”, “noon” etc and the day strings are “sunday”, “saturday”, “tomorrow” etc. Month strings are strings which mention month names in different formats eg. “jan” , “january” , “apr” etc. Special days like “father’s day”, “memorial day” etc are labeled as (TDPS,TDSS) time duration prefix-suffix tuples. Following are the labeled version of the three sentences mentioned earlier. In some cases where the begin and end time is not mentioned separately (2nd sentence), I have marked them as “E”(nd) labels as the offer’s validity ends that day or at that point in time.
“Offer expires in 7_TDPN days._TDSS Exclusions apply. Discount taken at checkout. See disclaimer for details”
“Preferred Customer Private Sale — Thursday_ETPDS Feb._ETPMS 11_ETPDN at 2_ETPHN pm_ETPAP — To redeem use code TODAY at checkout.”
“Limited_TDPS Time_TDSS Sale Events Powered by HauteLook”
After establishing the need for a structure/grammar definition and context specification, the next step is to pick a model which can model sequence of words in a sentence and predict a label for each word in it. The rules can then be written against the labels which are fewer in number than against entire set of words in the English vocabulary (well vocabulary in training data to be precise). The simplest model that we can use to model sequence of words is a HMM model.
Supervised HMM Model —
A hidden Markov model (HMM) is a Markov model in which the system being modeled is assumed to be a Markov process with unobserved (hidden) states. Here the hidden sequence of variables or states are made of labels from the grammar which are considered unobserved (hidden) and each state emits a word (observed) with certain emission probability. Each state or label in a given sequence depends on “n” previous labels according to certain transition probability distribution. Choice of “n” here determines the order of the HMM model.
Fig.1 shows the HMM states for the sentence “2 day sale”. The list of labels I mentioned earlier did not have words like “SALE” in it, however here we see that the word “sale” is both an observed word as well as the label. This goes back to the idea of context window I described in the “Modeling sequence of words” section. The next two sections explain this idea even further and introduces an innovative way to deal with unknown words.
Context window and pseudo labels — To understand this concept better let us look at the labeled sentence “This promotion expires in 10_TDPS days_TDSS offer valid only till 20_ETPDN th November_ETPMS but for our preferred customer the offer is valid till end of November_ETPMS”. The phrases around the labeled words, like “expires in”, “valid only till”, “end of” are sort of indicators that the words follow them in a sentence might contain time related information. A context window essentially tries to identify such phrases and assign importance to them by making them labels themselves. For eg. in the mentioned sentence, if the context window is 2 — which means any word which is 2 words before or ahead of the actual labeled-word will also be a labeled-word and label will be the word itself. This step is done before learning transition and emission probabilities on the training data. The complete labeled sentence after considering the context window (size 2) looks something like this:
“This_<NA> promotion_<NA> expires_<EXPIRES> in_<IN> 10_<TDPS> days_<TDSS> offer_<OFFER> valid_<VALID> only_<ONLY> till_<TILL> 20_<ETPDN> th_<TH> November_<ETPMS> but_<BUT> for_<FOR> our_<NA> preferred_<NA> customer_<NA> the_<NA> offer_<NA> is_<NA> valid_<NA> till_<NA> end_<END> of_<OF> November_<ETPMS>”
Notice that the word “valid” is labeled “<VALID>” in one instance and in the other it is labeled “<NA>”, so a given word can have more than one label based on the context it is used in, which in this case is defined by the size of the window. If you notice the label “<NA>” suddenly appears out of no where, the next section is devoted completely to explain this mysterious label and its importance in addressing the “problem of unknowns”.
<NA> and the problem of unknowns — An important consideration to keep in mind while using models like HMMs to model a sequence, is to make sure they generalize well. By generalization, I mean scenarios where a sentence with unseen words is given to the model and we have to predict the labels for the unknown sequence. This has generally been the hard problem to solve and different methods like using Maxent model and grouping words into categories based on certain similarities etc. have been proposed.
Here I have used a new method to address this problem with predicting the labels for unknown or unseen words. The inspiration is drawn from the fact that the words outside the context window (as defined above) normally do not carry much information about the actual timestamp embedded in a sentence. For example in the sentence earlier mentioned, words like “this”, “promotion”, “customer” do not contribute to identifying the coupon expiry time. We can use such words (which are outside the context window) and mark them as “unknowns”. And the way I do that is by assigning them a label “<NA>”. The unknown emission probability is obtained by add-1 smoothing (Laplace smoothing) on (label,word) pairs where label is <NA>. The emission probability on unknown word, P_e(<UNK>|<NA>) = 1/(N+C), where N is the total number of times we encounter (<NA>, word) combination in the training data and ‘C’ is number of unique words “<NA>” label is assigned to. Hence the probability of seeing an unknown word for any label is replaced with the emission probability of (<NA>,<UNK>) label word pair.
I use Viterbi decoding to infer the label sequence based on the observed word sequence and it is quite essential to figure out the unknown word emission probabilities before performing the inference task. If the emission probability is zero as we never saw the word in our training vocabulary then it is not possible to decode the sentence as all transition paths between states will carry a “zero” value. This means it will be impossible to decode the sentence if they contain words which we see only in test data and not in the training set.
Using randomly sampled (with replacement) 80% of data as training data (around 960 sentences) and the rest 20% as test (around 240 sentences), I calculated the average metrics on the test data (using 3 fold cross validation). The following table shows the average labeling accuracy on the true labels (labels from the grammar) for different context window sizes.
Pay attention to the “ % Decoding Failed” column, this column tells us the number of sentences in the test data, for which the model could not infer a label sequence based on the available training data. Decoding of a sentence fails in the following scenarios:
Two rows in the table above have % decoding failed sentences = 0 (or close to zero). In the following situations we achieve that:
Another extreme case is when the context window is -1 , where we have no <NA> label and every word has a distinct label (which is either a label from the grammar, or the word itself) , in that case the accuracy is high for the sentences which were decoded correctly. However almost 50% of sentences were not decoded, a classic example of overfitting. Window size of 2 and order 1 seems to perform better in terms of average label accuracy and average end time predictions. Average begin time prediction accuracy is almost always high because of the way the translator is designed which iteratively starts with the reference date and tries to figure out the best begin date from the sentence (see next section — Translation layer).
Once the labels are obtained using the decoding technique mentioned before. The next step is to parse these labels and figure out the actual time stamp. This translation step is mainly rules driven and the rules are compiled against a limited set of labels and not against the entire set of words in the training vocabulary. Few things to keep in mind before designing a translation layer for promotions data are:
The following 2 graphs (for HMM Order — 1, contextwindow size — 2), show the comparison between predicted and actual start and end times, sorted by actual start time and time to expiry. Points where there is a mismatch between predicted and actual time can be seen by the triangles which depart from the solid line. The time on y axis is the unix epoch time.
It is not difficult to see that when the predicted labels are not correct then the translation layer can not make a good estimate of the start and end times of a promotion. However, looking at different time signatures in a given sentence, the hope is that even if a subset of labels are correctly predicted then we can get an approximate estimate of the timestamp embedded in it.
Extending to general use-cases — The framework that I have built for estimating the validity time of a promotion can be broken down into primarily 3 steps
These steps can in general be applied to sentences which are not necessarily from promotions’ data. For example when we look at a chat bot conversations it is crucial to figure out timestamp as accurately as possible when a user asks for something to be done at a certain time.
Consider the following 4 sentences, where the first column shows the reference time or the point in time when the sentence was spoken. The tuple output shows the begin time and end time. When we look at the predicted end times, it is clear that the model trained on the data from promotions can be used to predict the timestamp for a sentence which is outside the promotions’ domain. On line 2, the begin time is after the end time, however the end time indicates accurate time when the event took place. This is also an example of a sentence where we don’t necessarily have an accurate begin time for an event.
As mentioned earlier the translator implementation also has a bearing on the accuracy of endtime and start time predictions. The method employed or the rules designed to parse the predicted labels can determine how the start and end times are calculated.
I have provided an initial implementation of parser, HMM Model and the translator here : Abstract Time Parser
Some Assumptions I have made in the code and scope for improvement
Further Improvements — I will end this blog discussing possible improvements to the approach discussed here to build, what I call, an “abstract time parser”
Special thanks to professor Kyunghyun Cho for reviewing the post and for providing valuable feedback and mentorship on the overall project I am working on, which is to apply Natural language processing, Machine learning and Deep learning techniques to analyze online marketing data and provide on-demand, intelligent and relevant promotions to the end user while improving the overall conversion rate for the seller.