[codeface] Identical emails assigned to different authors within Codeface mailinglist analyses

  • From: Thomas Bock <bockthom@xxxxxxxxxxxxxxxxx>
  • To: codeface@xxxxxxxxxxxxx
  • Date: Fri, 16 Dec 2016 14:40:44 +0100

Dear all,
Hi Wolfgang,

I used Codeface to analyze mailing lists of software projects. At first, I analyzed only a subset of one project's mailing list. Later on, I analyzed the whole mailing list of the same project. When I compare the results of the two Codeface runs, some emails that occur in both inputs are assigned to different authors.

For my research, I determine bursts of emails, i.e., emails of two different persons sent within a small time window (several days), based on the results in the Codeface database. Using the whole mailing list yields a significantly higher number of emails and also a higher number of authors than using the subset, which is reasonable. However, the number of bursts found is significantly smaller for the whole mailing list than for the subset, which does not make sense. One reason is that emails contained in both inputs (subset and whole mailing list) are assigned to different authors.

Since the order in which email addresses or persons are added to the database influences how the id service adds or matches persons or emails, the order in which the emails are processed may be relevant.
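
To illustrate what I mean by order dependence, here is a small toy sketch. It is not Codeface's id service; the greedy rule (reuse a person if the email address or the name was seen before), the function name assign_ids, and the example names and addresses are all assumptions made up for illustration.

```python
# Toy illustration only: NOT Codeface's id service. A hypothetical greedy
# matcher that links a message to an existing person if either the email
# address or the name has been seen before.

def assign_ids(messages):
    """Return one person id per (name, email) message, in input order."""
    by_name, by_email = {}, {}
    next_id = 1
    ids = []
    for name, email in messages:
        if email in by_email:
            pid = by_email[email]
        elif name in by_name:
            pid = by_name[name]
        else:
            pid = next_id
            next_id += 1
        # Remember both keys for later messages (first assignment wins).
        by_name.setdefault(name, pid)
        by_email.setdefault(email, pid)
        ids.append(pid)
    return ids


# Made-up example messages.
a = ("Alice", "alice@example.org")
b = ("Alice", "shared@example.org")
c = ("Bob", "shared@example.org")

order_1 = dict(zip([a, b, c], assign_ids([a, b, c])))
order_2 = dict(zip([c, a, b], assign_ids([c, a, b])))

# Are messages a and b attributed to the same person?
same_author_1 = order_1[a] == order_1[b]  # processed in order a, b, c
same_author_2 = order_2[a] == order_2[b]  # processed in order c, a, b
```

With this toy rule, the same two messages a and b end up with the same author in one processing order and with different authors in the other, which is the kind of effect I suspect behind the differing assignments.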

I am not sure whether there is a way to obtain reproducible results for the same emails across the two runs. Hence, my question is: Would it be reasonable to sort the emails (e.g., by creation date) while creating the corpus in Codeface to get more reproducible results?
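
A minimal sketch of the kind of sorting I have in mind, written against Python's standard library rather than Codeface's actual corpus code. The function names, the (date_header, message_id) tuple layout, and the example message ids are assumptions for illustration only.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def sort_key(date_header):
    """Parse an RFC 2822 style Date header into a timezone-aware datetime."""
    dt = parsedate_to_datetime(date_header)
    if dt.tzinfo is None:
        # Assumption: treat dates without a usable timezone as UTC so that
        # all keys are comparable with each other.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt


def sort_by_date(messages):
    """Sort (date_header, message_id) pairs by date, then by message id.

    The message id is used as a tie-breaker so that emails with identical
    timestamps always come out in the same order.
    """
    return sorted(messages, key=lambda m: (sort_key(m[0]), m[1]))


# Made-up example input in arbitrary order.
messages = [
    ("Fri, 16 Dec 2016 14:40:44 +0100", "<msg-2@example.org>"),
    ("Fri, 16 Dec 2016 13:00:00 +0000", "<msg-1@example.org>"),
    ("Sat, 17 Dec 2016 09:15:00 +0100", "<msg-3@example.org>"),
]
ordered_ids = [mid for _, mid in sort_by_date(messages)]
```

This would only make the processing order deterministic. Whether it is enough for the subset and the whole mailing list to yield the same author assignments is exactly what I am unsure about.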

Thank you in advance!

Best,
Thomas


