[codeface] Re: Fwd: Mailing list analysis mbox input file

  • From: Mitchell Joblin <joblin.m@xxxxxxxxx>
  • To: Wolfgang Mauerer <wm@xxxxxxxxxxxxxxxx>
  • Date: Wed, 26 Mar 2014 19:11:21 +0100

On Tue, Mar 25, 2014 at 9:39 AM, Mitchell Joblin <joblin.m@xxxxxxxxx> wrote:
> On Mon, Mar 24, 2014 at 11:05 PM, Wolfgang Mauerer <wm@xxxxxxxxxxxxxxxx> 
> wrote:
>> Hi Mitchell,
>>
>> Am 24/03/2014 15:01, schrieb Mitchell Joblin:
>>
>>> Hello all,
>>>
>>> I am having problems with getting the mailing list analysis to run
>>> correctly. I have narrowed the problem down to the dispatch.all
>>> function in R/ml/analysis.r and the error message is "Mailing list
>>> does not cover any release range." The problem seems to be that not
>>> all the dates from the emails in the .mbox file are identified. For
>>> example, when I load an archive of one month of emails from qemu I
>>> only get a corpus with 1 document and 1 date. The date that is
>>> identified is the date of the first email in the mbox file.
>>>
>>> Does each individual email need to be in its own file for this to work
>>> correctly?
>>
>> there are two ways the ml analysis can deal with .mbox input
>> files:
>>
>> - split them into individual mail files and read them one after another
>> - process the complete mailbox in one go, without creating intermediate
>>   individual emails
>>
>> The latter approach is clearly more intelligent, but only works with
>> the most recent versions of tm and tm.plugin.mail. Which versions are
>> you working with?
>>
>>
>>>
>>> Should the gen.corpus function return a corpus with more than 1 document
>>> when a single archive with multiple emails is loaded?
>>
>> absolutely. It should return a corpus with as many documents as there
>> are emails.
>>
>
> I have attached a corpus and an .mbox file with a small number of
> emails. I am able to see more than one email per file, but the dates
> that are returned are all NA, and the analysis then fails shortly after
> gen.corpus is called.

I have found a solution to the problem. The dates were not being
parsed because my system's LC_TIME locale was set to German. The dates
in the mailing list use English day and month names, so parsing them
returned NA. The missing dates then cause a later failure when they
are used to find the overlap with the VCS revision dates. The dates
are parsed by the readmail function generator in the snatm package,
which calls strptime, and strptime relies on the LC_TIME locale. I
suggest that we set the LC_TIME locale explicitly in the R environment
using Sys.setlocale(category = "LC_TIME", locale = "en_US.UTF-8"). Is
it safe to assume that all mbox files will be in English, or will it
be necessary to make this user configurable?
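
To illustrate the idea, here is a minimal sketch of parsing a mail date
independently of the system locale. The helper name parse.mail.date and
the format string are only assumptions for illustration and are not part
of snatm or codeface. The sketch switches LC_TIME to the "C" locale,
which uses English day and month names and should be available even on
systems where en_US.UTF-8 is not installed, and restores the previous
setting afterwards.

```r
# Hypothetical helper (not part of snatm or codeface): parse the Date
# header of a mail with English day and month names, regardless of the
# LC_TIME locale of the system.
parse.mail.date <- function(date.str,
                            format = "%a, %d %b %Y %H:%M:%S %z") {
  # Remember the current setting and restore it when the function exits
  old.locale <- Sys.getlocale("LC_TIME")
  on.exit(Sys.setlocale("LC_TIME", old.locale), add = TRUE)

  # The "C" locale uses English abbreviations for %a and %b
  Sys.setlocale("LC_TIME", "C")

  # strptime returns NA if the string does not match the format
  as.POSIXct(strptime(date.str, format = format, tz = "UTC"))
}

# Example with a date in the format used by this mailing list
d <- parse.mail.date("Wed, 26 Mar 2014 19:11:21 +0100")
print(is.na(d))
```

Restoring the old locale keeps the change local to the date parsing, so
other output of the analysis is not affected. Whether readmail can be
wrapped this way without changes to snatm is something I have not
checked.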

Kind regards,

Mitchell



>
> Kind regards,
>
> Mitchell
>
>> Best regards, Wolfgang
>>>
>>>
>>> Kind regards,
>>>
>>> Mitchell
>>>
>>
