[codeface] Re: Fwd: Mailing list analysis mbox input file

  • From: Wolfgang Mauerer <wm@xxxxxxxxxxxxxxxx>
  • To: codeface@xxxxxxxxxxxxx
  • Date: Wed, 26 Mar 2014 20:18:57 +0100

On 26/03/2014 19:11, Mitchell Joblin wrote:
On Tue, Mar 25, 2014 at 9:39 AM, Mitchell Joblin <joblin.m@xxxxxxxxx> wrote:

I have attached a corpus and .mbox file with a small number of emails.
I am able to see more than one email per file, but the dates that
are returned are all NAs, and the analysis then fails shortly after
gen.corpus is called.

I have found a solution to the problem. The dates were not being
parsed because my system's LC_TIME locale was set to German. The dates
in the mailing list are in English, so parsing them returned NA. The
missing dates then cause a later failure when the dates are used to
find the overlap with the VCS revision dates. The dates are parsed by
the readmail function generator in the snatm package, which calls
strptime, and strptime relies on the LC_TIME locale. I suggest that we
manually set the LC_TIME locale in the R environment using
Sys.setlocale(category = "LC_TIME", locale = "en_US.UTF-8").

Great, thanks! Looking forward to the patch ;)

Is it safe to assume that all mbox files will be in English, or will it
be necessary to make this user configurable?

That should not be necessary. As per RFC 2822, Section 3.3, the
date/time specification uses English terms, so the locale en_US.UTF-8
should be safe.
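
The locale dependence discussed above can be illustrated with a small R
sketch. It is not the snatm or codeface code: parse_mail_date is a
hypothetical helper, the format string is an assumption about what a
typical RFC 2822 Date header looks like, and it defaults to the "C"
locale (which also uses English day and month names and is available
everywhere) rather than the en_US.UTF-8 proposed in the thread, which
may not be installed on every system.

```r
# Hypothetical helper (not part of snatm or codeface): parse an
# RFC 2822 style Date header with LC_TIME temporarily switched to a
# locale that uses English day and month abbreviations.
parse_mail_date <- function(x, locale = "C") {
  old <- Sys.getlocale("LC_TIME")
  # Restore the caller's locale even if parsing fails
  on.exit(Sys.setlocale("LC_TIME", old), add = TRUE)
  Sys.setlocale("LC_TIME", locale)
  # %a and %b are matched against the names of the current LC_TIME locale
  as.POSIXct(strptime(x, format = "%a, %d %b %Y %H:%M:%S %z", tz = "UTC"))
}

d <- parse_mail_date("Wed, 26 Mar 2014 20:18:57 +0100")
print(is.na(d))
print(format(d, "%Y-%m-%d %H:%M:%S", tz = "UTC"))
```

Saving and restoring the previous locale keeps the change local to the
parsing step. Setting the locale once at start-up, as suggested in the
thread, is simpler but affects every later date operation in the session.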

However, we should test whether locales other than C and en_XX cause
unexpected problems in other places.
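
One possible way to start such a test is sketched below in R. It only
covers date parsing, not the other places mentioned above, and the list
of candidate locales and the format string are assumptions chosen for
illustration. Locales that are not installed are reported as NA rather
than as failures.

```r
# Check under which LC_TIME locales an English RFC 2822 style date
# still parses. The locale names are illustrative; availability
# depends on the system.
sample_date <- "Wed, 26 Mar 2014 20:18:57 +0100"
fmt <- "%a, %d %b %Y %H:%M:%S %z"
candidates <- c("C", "en_US.UTF-8", "de_DE.UTF-8", "fr_FR.UTF-8")

old <- Sys.getlocale("LC_TIME")
results <- vapply(candidates, function(loc) {
  # Sys.setlocale returns "" (with a warning) if the locale is unavailable
  ok <- suppressWarnings(Sys.setlocale("LC_TIME", loc))
  if (identical(ok, "")) return(NA)
  # TRUE if the date parsed, FALSE if strptime returned NA
  !is.na(strptime(sample_date, fmt, tz = "UTC"))
}, logical(1))
Sys.setlocale("LC_TIME", old)  # restore the original locale

print(results)
```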

Best regards, Wolfgang

Kind regards,

Mitchell



