On 24/11/16 11:34, Wolfgang Mauerer wrote:
On 13/10/2016 at 17:29, Claus Hunsen wrote:
The default date format for parsing of mbox files is set by the 'snatm'
package to "%a, %d %b %Y %H:%M:%S". In some projects' mbox files, the
weekday ("%a") is missing and the resulting date will get "NA" for those
e-mails.
This patch introduces a fixing routine which re-parses the date header
using the slightly adapted pattern "%d %b %Y %H:%M:%S" (omitted
weekday).
The date may still be set to NA for some mails,
but this patch catches, at least, the most common other pattern that has
been observed.
Signed-off-by: Claus Hunsen <hunsen@xxxxxxxxxxxxxxxxx>
---
codeface/R/ml/analysis.r | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)
diff --git a/codeface/R/ml/analysis.r b/codeface/R/ml/analysis.r
index 51c3ac9..07bb897 100644
--- a/codeface/R/ml/analysis.r
+++ b/codeface/R/ml/analysis.r
@@ -338,10 +338,37 @@ check.corpus.precon <- function(corp.base) {
return(author)
}
+ ## Condition #3: Date information should be present
+ fix.date <- function(doc) {
+ date.doc = meta(doc, tag = "datetimestamp")
+
+ ## a date is properly set
+ if (!is.na(date.doc)) {
+ return(date.doc)
+ }
+
+ ## if the date is not properly set, we need to re-parse it.
+ ## this may be the case if the date inside the mbox file does not
+ ## match the pattern "%a, %d %b %Y %H:%M:%S".
+ ## (see https://github.com/wolfgangmauerer/snatm/blob/master/pkg/R/makeforest.r#L47)
+
+ ## get the date header
+ headers = meta(doc, tag = "header")
+ date.header = grep("^Date:", headers, value = TRUE, useBytes = TRUE)
+
+ ## re-parse the header using adapted pattern
+ ## TODO: are there other potential patterns?
+ adapted.format = "%d %b %Y %H:%M:%S" # missing weekday; e.g., "Date: 20 Feb 2009 20:24:54 +0100"
+ date.new = strptime(gsub("Date: ", "", date.header), format = adapted.format, tz = "GMT")
+
+ return(date.new)
+ }
+
## Apply checks of conditions to all documents
fix.corpus.doc <- function(doc) {
meta(doc, tag="header") <- rmv.multi.refs(doc)
meta(doc, tag="author") <- fix.author(doc)
+ meta(doc, tag="datetimestamp") <- fix.date(doc)
return(doc)
}
Thanks -- do you know what the relevant RFC has to say about date
specifications? Up to now, we're parsing what we've seen in the wild,
but perhaps the spec would show some more patterns that we should take
into account.
[ day-of-week "," ] day month year time hour ":" minute [ ":" second ] zone
May be a good idea to support all those patterns described by this
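As a rough illustration of what supporting the optional parts of the date grammar quoted in this thread could look like, here is a minimal sketch. It is not part of the patch: the helper name parse.date.header and the list of candidate formats are assumptions made for this example. Like the patch, it ignores the zone offset and reports the wall-clock time as GMT, and it relies on strptime ignoring trailing input, so the formats with seconds have to be tried first. Note that "%a" and "%b" depend on the current locale, so this assumes an English or C locale.

```r
## Hypothetical helper (not in the patch): try several strptime patterns
## covering the optional weekday and the optional seconds field.
parse.date.header <- function(date.header) {
    ## no Date header at all: return an NA timestamp
    if (length(date.header) == 0) {
        return(as.POSIXlt(NA, tz = "GMT"))
    }

    ## use the first Date header; strip its name and surrounding whitespace
    date.string <- trimws(sub("^Date:", "", date.header[1]))

    ## candidate patterns, most specific first; strptime ignores trailing
    ## input, so the zone (e.g., "+0100") is dropped as in the patch
    candidate.formats <- c(
        "%a, %d %b %Y %H:%M:%S", # weekday and seconds
        "%d %b %Y %H:%M:%S",     # no weekday
        "%a, %d %b %Y %H:%M",    # no seconds
        "%d %b %Y %H:%M"         # neither weekday nor seconds
    )

    for (fmt in candidate.formats) {
        parsed <- strptime(date.string, format = fmt, tz = "GMT")
        if (!is.na(parsed)) {
            return(parsed)
        }
    }

    ## nothing matched: keep the date as NA
    return(as.POSIXlt(NA, tz = "GMT"))
}
```

Inside fix.date, such a helper could stand in for the single strptime call; honouring the zone offset would need additional handling that this sketch leaves out.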
Reviewed-by: Wolfgang Mauerer <wolfgang.mauerer@xxxxxxxxxxxxxxxxx>