[codeface] [PATCH v2 12/24] Improve parsing of date headers in the ML analysis

From: Claus Hunsen <hunsen@xxxxxxxxxxxxxxxxx>
To: codeface@xxxxxxxxxxxxx
Date: Thu, 1 Dec 2016 17:11:06 +0100

The 'snatm' package sets the default date format for parsing mbox files
to "%a, %d %b %Y %H:%M:%S". In some projects' mbox files, the weekday
("%a") is missing, for example, so the resulting date is "NA" for those
e-mails. Additionally, time-zone data (%z) has not been incorporated
(see the original pattern below), although the commit analysis
incorporates this part of the date.
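
The failure mode described above can be reproduced with base R alone. This
is only an illustrative sketch, not part of the patch: the sample header
values are made up, and an English/C LC_TIME locale is assumed, because
"%a" and "%b" are locale-dependent.

```r
## Assumption: weekday and month names are parsed in the C locale.
Sys.setlocale("LC_TIME", "C")

## The original pattern, without weekday fallback and without "%z".
original.format <- "%a, %d %b %Y %H:%M:%S"

with.weekday <- strptime("Fri, 20 Feb 2009 20:24:54 +0100",
                         format = original.format, tz = "GMT")
without.weekday <- strptime("20 Feb 2009 20:24:54 +0100",
                            format = original.format, tz = "GMT")

## The header without a weekday cannot be parsed at all.
is.na(without.weekday)

## The "+0100" offset is silently ignored, since the pattern has no "%z".
format(with.weekday, "%H:%M:%S")
```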

This patch introduces a fixing routine which re-parses ALL date headers
using a list of pre-defined patterns that have been observed in
real-world .mbox files:
- "%a, %d %b %Y %H:%M:%S" (the original pattern [1]),
- "%d %b %Y %H:%M:%S" (omitted weekday), and
- "%a, %d %b %Y %H:%M" (omitted seconds, "Wed, 21 Aug 2013 15:02 +0200").

[1] https://github.com/wolfgangmauerer/snatm/blob/master/pkg/R/makeforest.r#L47

Each date header is now parsed in two passes: first with the patterns
that include the time-zone information, then with the same patterns
without it. This ensures that time-zone data is incorporated whenever
it is present, while a missing time zone does not make the parsing of
that date header fail entirely.
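
The fallback order can be sketched in isolation as follows. This is a
minimal standalone illustration rather than the code added by the patch:
parse.date.header is a hypothetical helper that works on a plain header
string instead of a corpus document, and an English/C LC_TIME locale is
assumed.

```r
## Assumption: weekday and month names are parsed in the C locale.
Sys.setlocale("LC_TIME", "C")

## Hypothetical helper: parse one "Date:" header line into a POSIXlt (GMT).
parse.date.header <- function(date.header) {
  formats.without.tz <- c(
    "%a, %d %b %Y %H:%M:%S",  # original pattern
    "%d %b %Y %H:%M:%S",      # omitted weekday
    "%a, %d %b %Y %H:%M"      # omitted seconds
  )
  ## Time-zone-aware patterns are tried first, the plain ones are fallbacks.
  formats <- c(paste(formats.without.tz, "%z", sep = " "), formats.without.tz)

  value <- sub("^Date: *", "", date.header)
  for (fmt in formats) {
    parsed <- strptime(value, format = fmt, tz = "GMT")
    if (!is.na(parsed)) {
      return(parsed)
    }
  }
  ## No pattern matched: hand back the NA result of the last attempt.
  return(parsed)
}

with.tz <- parse.date.header("Date: Fri, 20 Feb 2009 20:24:54 +0100")
no.seconds <- parse.date.header("Date: Wed, 21 Aug 2013 15:02 +0200")
no.tz <- parse.date.header("Date: 20 Feb 2009 20:24:54")
unparsable <- parse.date.header("Date: sometime last week")
```

Here, with.tz and no.seconds are shifted to GMT according to their
offsets, no.tz keeps its wall-clock time, and unparsable stays NA.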

The date may still be set to NA for some mails, but this patch catches
at least the most common patterns that have been observed.

Signed-off-by: Claus Hunsen <hunsen@xxxxxxxxxxxxxxxxx>
Reviewed-by: Wolfgang Mauerer <wolfgang.mauerer@xxxxxxxxxxxxxxxxx>
---
codeface/R/ml/analysis.r | 40 +++++++++++++++++++++++++++++++++++++++-
1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/codeface/R/ml/analysis.r b/codeface/R/ml/analysis.r
index f24d207..5a01a72 100644
--- a/codeface/R/ml/analysis.r
+++ b/codeface/R/ml/analysis.r
@@ -341,10 +341,48 @@ check.corpus.precon <- function(corp.base) {
     return(author)
   }

+  ## Condition #3: Date information should incorporate time-zone information and should be present
+  fix.date <- function(doc) {
+    ## re-parse date headers to incorporate time-zone data.
+    ## this needs to be done, because the date inside the mbox file is initially parsed with
+    ## the pattern "%a, %d %b %Y %H:%M:%S" which does not incorporate time-zone data (%z) [1], which is,
+    ## on the other side, incorporated in the commit analysis.
+    ## [1] (see https://github.com/wolfgangmauerer/snatm/blob/master/pkg/R/makeforest.r#L47)
+
+    ## get the date header as inside the mbox file
+    headers = meta(doc, tag = "header")
+    date.header = grep("^Date:", headers, value = TRUE, useBytes = TRUE)
+
+    ## patterns without time-zone pattern
+    date.formats.without.tz = c(
+      "%a, %d %b %Y %H:%M:%S",  # initially used format; e.g., "Date: Tue, 20 Feb 2009 20:24:54 +0100"
+      "%d %b %Y %H:%M:%S",  # missing weekday; e.g., "Date: 20 Feb 2009 20:24:54 +0100"
+      "%a, %d %b %Y %H:%M"  # missing seconds; e.g. "Date: Wed, 21 Aug 2013 15:02 +0200"
+    )
+    ## append time-zone part and incorporate pattern without time-zone indicator
+    date.formats = c(
+      paste(date.formats.without.tz, "%z", sep = " "),
+      date.formats.without.tz
+    )
+
+    ## try to re-parse the header using adapted patterns:
+    ## parse date until any match with a pattern is found (date.new is not NA)
+    for (date.format in date.formats) {
+      date.new = strptime(gsub("Date: ", "", date.header), format = date.format, tz = "GMT")
+      # if the date has been parsed correctly, break the loop
+      if (!is.na(date.new)) {
+        break()
+      }
+    }
+
+    return(date.new)
+  }
+
   ## Apply checks of conditions to all documents
   fix.corpus.doc <- function(doc) {
     meta(doc, tag="header") <- rmv.multi.refs(doc)
     meta(doc, tag="author") <- fix.author(doc)
+    meta(doc, tag="datetimestamp") <- fix.date(doc)
     return(doc)
   }

@@ -802,7 +840,7 @@ store.mail <- function(conf, forest, corp, ml.id ) {
   dat <- merge(dat, dates.df, by="ID")
   dat$ID <- NULL
   colnames(dat)[which(colnames(dat)=="threadID")] <- "threadId"
-
+
   ## Re-order columns to match the order as defined in the database to
   ## improve the stability
   dat = dat[c("projectId", "threadId", "mlId", "author", "subject", "creationDate")]
--
2.10.2

References:
- [codeface] [PATCH v2 00/24] Several fixes and enhancements for Codeface
  - From: Claus Hunsen
