In newspaper articles, emails, and other texts, it is natural to reference dates in relation to the publication date. For example, you will commonly see phrases like “yesterday” and “two weeks ago.” However, if you read an article a year after it was published, these timeframes are not immediately clear.
When articles are analyzed with natural language processing (NLP) techniques, relative dates will not be interpreted properly, if at all, because the models have no publication date to anchor them to. To use tools such as event extraction and question-answering engines effectively, relative references therefore need to be converted to precise dates.
In this blog post, we discuss our in-house solution for giving analysts on our Novetta Mission Analytics team accurate date references when analyzing bodies of text. Our method applied named entity recognition (NER) built on Flair, a state-of-the-art NLP library. Note: StanfordNLP resolved this problem with SUTime using their in-house NER method, but it is not available for commercial use.
Extracting Date References and Ground Truth Date
The first step in tackling this problem was to extract every phrase that refers to a date within a set of documents. To do this we used AdaptNLP, an open-source framework with state-of-the-art NLP capabilities originally developed as an internal Novetta research project. AdaptNLP’s EasySequenceTagger, built on top of Flair, tagged entities using an NER model. From there, we kept only the entities with the NER tag ‘DATE.’ AdaptNLP also outputs the token span from Flair’s sentence object model, which gave us context for each extracted date. We stored the token spans of each ‘DATE’ entity in a pandas DataFrame, where each document in the corpus had its own row of extracted dates to refer to later.
To resolve the extracted dates, we also needed a ground truth date for each article: the date that its relative references are measured against. We scraped each article’s metadata to obtain its published date and stored this information in the appropriate DataFrame row, as shown below.
|0||0||Advertisement Supported by Jane Perlez BEI…||2019-09-25||[this summer, a decade ago, the past two years…|
|1||2||\n\t\tDownload\t\t\n Download\n \n Introductio…||2019-09-23||[Monday, April 15, 2019, September 2019, Tue…|
|2||3||A US Senate committee has passed a bill to sup…||2019-09-26||[Wednesday, May,last week, months, January, T…|
Figure 1. A snapshot of what the DataFrame looks like, with the last two columns representing the published date of the article based on the metadata and a list of all of the date references throughout an article.
Now that each article in our corpus had a published date and a list of extracted date phrases, we applied the date resolution algorithm to each extracted date.
Date Resolution Algorithm
We implemented a rules-based algorithm to resolve the extracted dates. We first used the span of the extracted date to pull four pieces of information about the reference: unit type, tense, temporal unit, and word of interest.
The unit type determined whether the date referred to a time period involving days, weeks, months, years, or seasons. We made a different rule set for each unit type. If a phrase said “1 week ago,” the rule subtracted 7 days from the published date of the article in order to resolve the date, but if the text said “1 month ago,” the rule instead subtracted 30 days from the published date. To determine the unit type, we developed a series of keyword mappings. For example, given words such as “Monday” or “today,” you would most likely assume that the date refers to a number of days in relation to the published date of the article, so the unit type would be set to “days.”
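As a rough illustration of the keyword mapping described above, here is a minimal Python sketch. The keyword lists and the names `UNIT_KEYWORDS`, `DAYS_PER_UNIT`, and `unit_type` are hypothetical, since the actual mappings are not published in this post; only the 7-day week and 30-day month come from the text.

```python
from typing import Optional

# Hypothetical keyword lists; the real mappings are not given in the post.
UNIT_KEYWORDS = {
    "days": ["day", "today", "yesterday", "tomorrow", "monday", "tuesday",
             "wednesday", "thursday", "friday", "saturday", "sunday"],
    "weeks": ["week"],
    "months": ["month"],
    "years": ["year"],
    "seasons": ["spring", "summer", "fall", "autumn", "winter"],
}

# Length of each unit in days. 7 and 30 come from the post;
# 1 and 365 are assumptions added for completeness.
DAYS_PER_UNIT = {"days": 1, "weeks": 7, "months": 30, "years": 365}


def unit_type(phrase: str) -> Optional[str]:
    """Return the first unit type with a keyword matching a word in the phrase."""
    words = phrase.lower().split()
    for unit, keywords in UNIT_KEYWORDS.items():
        # Prefix matching lets "weeks" match the keyword "week".
        if any(word.startswith(k) for word in words for k in keywords):
            return unit
    return None  # no keyword found; left for a human or another rule


print(unit_type("1 week ago"))   # weeks
print(unit_type("Monday"))       # days
```

Prefix matching keeps the lists short, at the cost of occasional false matches; a production rule set would likely need stricter tokenization.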
Tense could take one of three values: past, present, or future. This part of the algorithm looked for keywords such as “last” and “ago” to indicate that the date being discussed was in the past. Words such as “next” or phrases such as “in the future” set the tense to future. If none of the keywords we compiled were found, the tense was set to “present.”
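The tense check can be sketched in a few lines of Python. This is an assumption-laden sketch: the function name `tense` is hypothetical, and the keyword tuples contain only the four keywords named above, whereas the compiled lists were presumably longer.

```python
import re

# Only the keywords named in the post; the full lists are not published.
PAST_KEYWORDS = ("last", "ago")
FUTURE_KEYWORDS = ("next", "in the future")


def tense(phrase: str) -> str:
    """Classify a date phrase as 'past', 'future', or 'present'."""
    text = phrase.lower()

    def has(keyword: str) -> bool:
        # Whole-word match, so "lasting" does not trigger "last".
        return re.search(r"\b" + re.escape(keyword) + r"\b", text) is not None

    if any(has(k) for k in PAST_KEYWORDS):
        return "past"
    if any(has(k) for k in FUTURE_KEYWORDS):
        return "future"
    return "present"  # default when no keyword is found


print(tense("two weeks ago"))  # past
print(tense("next Tuesday"))   # future
print(tense("this summer"))    # present
```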
Next, we extracted the temporal unit, or the quantity associated with the unit type. For example, “nine days ago” would have a unit type of “days” and a temporal unit of “9.” This step also converted any number written out in words to digits, so “nine” became “9.” With the temporal unit stored as an integer, it could be used in the computations needed later for resolution. We also had to consider phrases that did not express the quantity as a number. For example, the phrases “earlier this year” and “in recent months” convey a certain time period without referencing a specific number. In these cases, rules were based on the semantic meaning of the words. “Earlier” referred to the beginning of the time period referenced, whether it was earlier this week or earlier this month. Words like “recent” were mapped to a default offset: “in recent months,” for example, was interpreted to mean 2 months ago, with the appropriate time frame chosen for other units.
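The number-extraction part of this step might look like the following sketch. The `NUMBER_WORDS` table and the `temporal_unit` function are hypothetical names, and the table is deliberately small; the semantic rules for words like “earlier” and “recent” are not modeled here.

```python
from typing import Optional

# Small illustrative table; a real system would cover more number words.
NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
    "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11, "twelve": 12,
}


def temporal_unit(phrase: str) -> Optional[int]:
    """Return the first quantity in the phrase as an integer, if any."""
    for token in phrase.lower().split():
        if token.isdigit():
            return int(token)            # already written in digits
        if token in NUMBER_WORDS:
            return NUMBER_WORDS[token]   # written out in words
    # No explicit number: phrases such as "earlier this year" would be
    # handled by separate semantic rules.
    return None


print(temporal_unit("nine days ago"))      # 9
print(temporal_unit("5 days ago"))         # 5
print(temporal_unit("earlier this year"))  # None
```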
Finally, we extracted the “word of interest,” which is usually the noun to which the unit refers. So in the phrase “5 days ago,” the word of interest is “days.” Although the unit type and word of interest usually coincide, in certain cases they diverge. For example, in the phrase “last Tuesday,” the unit type would be “days” but the word of interest is “Tuesday.”
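One possible way to pick out the word of interest is sketched below. The lookup tables and the function name `word_of_interest` are assumptions made for illustration; the post does not describe how this value was actually selected.

```python
from typing import Optional

# Hypothetical lookup tables for illustration only.
WEEKDAYS = ("monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday")
UNIT_NOUNS = ("day", "days", "week", "weeks", "month", "months",
              "year", "years")


def word_of_interest(phrase: str) -> Optional[str]:
    """Return the noun that the date phrase is anchored on."""
    for token in phrase.split():
        cleaned = token.strip(".,;:!?\"'").lower()
        if cleaned in WEEKDAYS:
            return cleaned.capitalize()  # e.g. "Tuesday"
        if cleaned in UNIT_NOUNS:
            return cleaned               # e.g. "days"
    return None


print(word_of_interest("5 days ago"))    # days
print(word_of_interest("last Tuesday"))  # Tuesday
```

Here “last Tuesday” yields “Tuesday” even though its unit type is “days,” which is the divergence described above.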
Once we extracted unit type, tense, temporal unit, and word of interest, we mapped the date from the metadata to the referenced date. Our algorithm has a tree of rules based on extracted values. For example, if the phrase for a referenced date was “six days ago” the extracted values would be as follows:
|Phrase||“Six days ago”|
|Unit Type||days|
|Tense||past|
|Temporal Unit||6|
|Word of Interest||days|
From these extracted values, the rule instructed the algorithm to take the publication date of the article and subtract 6 days.
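The offset rule for this example can be sketched with Python’s standard `datetime` module. The function `resolve_offset` and the `DAYS_PER_UNIT` table are hypothetical names; the 7-day week and 30-day month follow the rules described earlier, while the 365-day year is an assumption. The publication date is taken from the first row of Figure 1.

```python
from datetime import date, timedelta

# 7 and 30 follow the rules described in the post; 1 and 365 are assumed.
DAYS_PER_UNIT = {"days": 1, "weeks": 7, "months": 30, "years": 365}


def resolve_offset(published: date, unit_type: str,
                   tense: str, temporal_unit: int) -> date:
    """Shift the publication date by temporal_unit units in the tense's direction."""
    offset = timedelta(days=temporal_unit * DAYS_PER_UNIT[unit_type])
    if tense == "past":
        return published - offset
    if tense == "future":
        return published + offset
    return published  # "present" leaves the date unchanged


# "Six days ago" in an article published on 2019-09-25
print(resolve_offset(date(2019, 9, 25), "days", "past", 6))  # 2019-09-19
```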
For a referenced date of “next Tuesday,” the extracted values would be as follows:
|Phrase||“Next Tuesday”|
|Unit Type||days|
|Tense||future|
|Word of Interest||Tuesday|
The rule set in this case would first determine the day of the week on which the article was published. Let’s assume it was published on a Thursday. The algorithm would then be instructed to find the next occurrence of “Tuesday,” which means adding 5 days to the published date. Resolved dates are stored in a new column of the DataFrame to reflect the actual days to which the extracted dates refer.
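The weekday rule can be sketched as follows, again using only the standard library. The function name `next_weekday` is hypothetical, and the handling of a phrase like “next Tuesday” in an article published on a Tuesday (treated here as one week later) is an assumption that the post does not address. The example uses 2019-09-26, a Thursday that appears in Figure 1.

```python
from datetime import date, timedelta

# Ordered to match date.weekday(), where Monday is 0 and Sunday is 6.
WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def next_weekday(published: date, weekday_name: str) -> date:
    """Return the first date after `published` that falls on weekday_name."""
    target = WEEKDAYS.index(weekday_name.lower())
    days_ahead = (target - published.weekday()) % 7
    if days_ahead == 0:
        # Assumption: on a Tuesday, "next Tuesday" means a week later.
        days_ahead = 7
    return published + timedelta(days=days_ahead)


# Published on Thursday, 2019-09-26: the next Tuesday is 5 days later.
print(next_weekday(date(2019, 9, 26), "Tuesday"))  # 2019-10-01
```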
As is the case with any rules-based algorithm, especially those involving natural language, there are always exceptions to the rules. In the case of date resolution, uncommon phrases are occasionally used to refer to dates. For example, “four score and seven years ago” would not be properly captured based on the rules of the current algorithm. Thus, more work needs to be done to capture every nuance that can be referenced. But for now, this algorithm is still useful to data scientists who need to use dates for further analysis or NLP work. As with much of our work, we realize that no model will be able to completely replace human judgement. Our goal is to use machine learning to significantly reduce the amount of manual effort analysts need to expend in order to analyze data, thus enabling them to spend more time on higher-level tasks.