Dear all,
Hi Wolfgang,
I used Codeface to analyze mailing lists of software projects. At first,
I analyzed only a subset of one project's mailing list. Later on, I
analyzed the whole mailing list of the same project. When I compare the
results of the two Codeface runs, some emails that occur in both data
sets are assigned to different authors.
For my research work, I determine bursts of emails, i.e., emails of two
different persons sent within a small time window (several days), based
on the results in the Codeface database. Using the whole mailing list
results in a significantly higher number of emails and also a higher
number of authors than using the subset, which is reasonable. However,
the number of detected bursts is significantly smaller when I use the
whole mailing list than when I use the subset, which does not make
sense. One reason is that emails contained in both data sets (subset
and whole mailing list) are assigned to different authors.
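
A minimal sketch of how a changed author assignment can lower the burst
count. It assumes a simplified burst definition (two consecutive emails
by different authors within a fixed window); `count_bursts`, the
three-day window, and the sample data are hypothetical and are not taken
from Codeface or from the actual analysis.

```python
from datetime import datetime, timedelta


def count_bursts(emails, window=timedelta(days=3)):
    """Count consecutive email pairs by different authors within `window`.

    `emails` is a list of (author_id, timestamp) tuples.
    """
    ordered = sorted(emails, key=lambda e: e[1])
    return sum(
        1
        for (a1, t1), (a2, t2) in zip(ordered, ordered[1:])
        if a1 != a2 and t2 - t1 <= window
    )


# The same three emails in both runs; only the author ID assigned to the
# second email differs.
t = datetime(2018, 1, 1)
run_subset = [(1, t), (2, t + timedelta(days=1)), (1, t + timedelta(days=2))]
run_whole = [(1, t), (1, t + timedelta(days=1)), (1, t + timedelta(days=2))]

print(count_bursts(run_subset))  # 2
print(count_bursts(run_whole))   # 0
```

If the second email is attributed to the same person as its neighbours,
both bursts disappear although the emails themselves are unchanged.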
I suspect that the order of the emails is relevant: the order in which
persons and email addresses are added to the database influences whether
the id service creates a new entry or matches an existing one.
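
A toy illustration of such order dependence. This is not the Codeface id
service; `assign_ids` is a hypothetical greedy matcher that reuses the
first known person sharing either the name or the address, and the
sample senders are made up.

```python
def assign_ids(messages):
    """Assign a person ID to each (name, address) pair, in input order."""
    persons = []  # list of (names, addresses) sets; the index is the person ID
    ids = []
    for name, address in messages:
        for pid, (names, addresses) in enumerate(persons):
            if name in names or address in addresses:
                # Reuse the first matching person and remember both values.
                names.add(name)
                addresses.add(address)
                ids.append(pid)
                break
        else:
            # No known person matched: create a new one.
            persons.append(({name}, {address}))
            ids.append(len(persons) - 1)
    return ids


a = ("Bob", "bob@x.example")
b = ("Rob", "bob@y.example")
c = ("Bob", "bob@y.example")

print(assign_ids([a, b, c]))  # [0, 1, 0]
print(assign_ids([c, a, b]))  # [0, 0, 0]
```

The same three senders end up as two persons in one order and as a
single person in the other.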
I am not sure whether there is a way to get reproducible results for the
same emails across the two runs. Hence, my question is: Would it be
reasonable to sort the emails (e.g., by creation date) while creating
the corpus in Codeface?
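
A sketch of the kind of sorting meant here, using only the Python
standard library. It is not Codeface's corpus-creation code; `sort_key`,
the fallback date for messages without a usable Date header, and the
sample messages are assumptions for illustration.

```python
from datetime import datetime, timezone
from email import message_from_string
from email.utils import parsedate_to_datetime

# Assumed fallback so that undated messages sort first.
FALLBACK = datetime(1970, 1, 1, tzinfo=timezone.utc)


def sort_key(msg):
    """Return (timestamp, Message-ID) so that ties are broken deterministically."""
    try:
        when = parsedate_to_datetime(msg["Date"])
    except (TypeError, ValueError):
        when = FALLBACK
    if when.tzinfo is None:
        # Treat dates without a usable zone as UTC so all keys are comparable.
        when = when.replace(tzinfo=timezone.utc)
    return (when, str(msg.get("Message-ID", "")))


raw = [
    "From: b@example.org\nDate: Tue, 02 Jan 2018 10:00:00 +0000\n"
    "Message-ID: <2>\n\nsecond",
    "From: a@example.org\nDate: Mon, 01 Jan 2018 12:00:00 +0200\n"
    "Message-ID: <1>\n\nfirst",
]
messages = sorted((message_from_string(r) for r in raw), key=sort_key)
print([m["Message-ID"] for m in messages])  # ['<1>', '<2>']
```

Sorting fixes the order within one run; a subset that lacks earlier
emails of the whole list might still lead to different matches.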
Thank you in advance!
Best,
Thomas