[codeface] LLVM email format

  • From: Mitchell Joblin <joblin.m@xxxxxxxxx>
  • To: codeface@xxxxxxxxxxxxx
  • Date: Thu, 12 Nov 2015 10:49:47 +0100

Hi all,

I see that LLVM has an odd format for specifying who an email is from.
An example of this is "Adrian Prantl via llvm-dev
<llvm-dev@xxxxxxxxxxxxxx>". Unfortunately this breaks the email
analysis because the "From" line is not parsed properly and
decomposed into a (name, email) pair, so the id service returns
NAs. I propose that we:

1) See why this completely breaks the email analysis pass in the first
place and come up with more robust handling of degenerate "From"
lines. At the very least we should be able to continue processing
the other mails and produce a warning for each degenerate case. Then
we have three options for what to do with those emails. The first is
to remove them entirely, but I don't think that is a good idea because
we can still make use of the email content for other analyses even if
we cannot attribute it to any person. The second option is to attribute
them all to a single "nobody" or "unresolved" person. The third is to
attribute each one to a unique person. My opinion is that attributing
each to a unique person is best. One reason is that I don't want to
create a single very high degree individual in the communication
network, because this anomaly could destroy the interpretation of our
network statistics. I think adding unique individuals will just add a
few degree 1 nodes, which causes less harm to the communication
network. At the moment it seems like we are doing option 2 and
resolving all unknowns to a single identity.
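
To make option 3 concrete, here is a rough sketch of the fallback I have
in mind. It is not the existing codeface code: the function name
resolve_sender, the "unresolved-" prefix and the "@invalid" placeholder
address are all made up for illustration, and I am assuming the
Message-ID is available at the point where the sender is resolved.

```python
import hashlib
from email.utils import parseaddr


def resolve_sender(from_header, message_id):
    """Return (name, email, degenerate) for a raw "From" header.

    Hypothetical helper: a header without a usable address is mapped to
    a placeholder identity derived from the Message-ID, so every such
    mail gets its own node instead of sharing one "nobody" node.
    """
    name, addr = parseaddr(from_header or "")
    if "@" in addr:
        return name, addr, False

    # Degenerate header: the caller should log a warning and carry on
    # with the remaining mails instead of aborting the whole pass.
    digest = hashlib.sha1(message_id.encode("utf-8")).hexdigest()[:12]
    placeholder = "unresolved-" + digest
    return placeholder, placeholder + "@invalid", True
```

The third return value lets the caller emit the warning and, if needed,
exclude these placeholder nodes from the network statistics later.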

2) The second major thing would be to add a specific parsing stage to
recognize these LLVM-style "From" lines, similar to what I did to add
the Apache-style tags to the VCS analysis. This approach doesn't scale
that well, but LLVM is a nice project for us to analyze, so it's worth
it in this case.
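
As a sketch of what such a stage could look like (again not existing
codeface code; split_via_list and the VIA_LIST pattern are names I made
up, and the list address in the usage line is a placeholder), something
along these lines would pull the real name out of the display name.
Note that the sender's own address is not present in this format, so
the name alone has to identify the person.

```python
import re
from email.utils import parseaddr

# "<real name> via <list name>", as in the display name of the example.
VIA_LIST = re.compile(r"^(?P<name>.+?)\s+via\s+(?P<list>\S+)$")


def split_via_list(from_header):
    """Return (name, email, list_name) for a raw "From" header.

    For "Name via list <list@host>" headers the address belongs to the
    list, so email is None and list_name is set. Any other header is
    returned as parsed, with list_name set to None.
    """
    display, addr = parseaddr(from_header)
    match = VIA_LIST.match(display)
    # Only rewrite when the list name matches the local part of the
    # address, to avoid mangling names that merely contain " via ".
    if match is None or addr.split("@")[0] != match.group("list"):
        return display, addr, None
    return match.group("name"), None, match.group("list")
```

For example, split_via_list("Adrian Prantl via llvm-dev
<llvm-dev@lists.example.org>") gives ("Adrian Prantl", None,
"llvm-dev"), while an ordinary "Name <address>" header passes through
unchanged.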

What are your thoughts?

Kind regards,

Mitchell
