[codeface] Re: [PATCH 1/3] Add query to compute dev-dev edgelist based on mailing list communication

  • From: Andreas Ringlstetter <andreas.ringlstetter@xxxxxxxxxxxxxxxxxxxx>
  • To: <codeface@xxxxxxxxxxxxx>
  • Date: Mon, 26 Oct 2015 15:14:08 +0100



On 26.10.2015 at 14:55, Mitchell Joblin wrote:

On Mon, Oct 26, 2015 at 1:53 PM, Mitchell Joblin <joblin.m@xxxxxxxxx> wrote:
On Mon, Oct 26, 2015 at 1:08 PM, Andreas Ringlstetter
<andreas.ringlstetter@xxxxxxxxxxxxxxxxxxxx> wrote:


On 26.10.2015 at 11:54, Mitchell Joblin wrote:
Signed-off-by: Mitchell Joblin <mitchell.joblin.ext@xxxxxxxxxxx>
---
codeface/R/query.r | 15 +++++++++++++++
1 file changed, 15 insertions(+)

diff --git a/codeface/R/query.r b/codeface/R/query.r
index fa295cf..f087b0d 100644
--- a/codeface/R/query.r
+++ b/codeface/R/query.r
@@ -542,6 +542,21 @@ query.top.contributors.changes <- function(con, range.id, limit=20) {
return(dat)
}

+## Compute edgelist for mailing list communication

Is this description accurate? It's only yielding edges for the OP of
each thread, it's omitting all edges occurring inside each thread.

Why should it only yield edges for the OP when it is joined with
thread_responses?


Maybe I misunderstood the model, but mail_thread is just recording the
OP of each thread, and each message is sorted into exactly one thread?
(Well, "at most" in the current model.)

If I'm not misreading the snatm source code, then an entire thread is
collapsed into a single thread id.

So the join just yields edges towards the OP. It would have yielded all
edges if mail_thread had contained all sub-threads as well, but it
doesn't look like it does.

E.g. if Wolfgang were to respond to this email, he would not only address
you, but me as well. This edge, however, will never show up in that query.
Only the edges from Wolfgang to you and from me to you will, despite
Wolfgang not even addressing you directly.


+query.mail.edgelist <- function(con, pid, start.date, end.date) {
+ query <- str_c("SELECT who AS `from`, createdBy AS `to`, COUNT(*) AS `weight`",
+ "FROM mail_thread, thread_responses",
+ "WHERE mail_thread.mailThreadId=thread_responses.mailThreadId",
+ "AND projectId=", pid,
+ "AND mailDate >=", sq(start.date),
+ "AND mailDate <", sq(end.date),

Does it make sense to partition a thread like that or should this match
on the thread creation date instead?

Each thread is assigned a unique Id, the creation date is not
necessarily unique. What if two threads are created at the exact same
moment in time?


I meant something different: which date is authoritative for the edge
date? The anchor point (to) or the source (from)?

Given that a threaded communication usually follows a single topic and
has a finite duration, the anchor has more significance IMHO. But that
depends on what statement you want to make. Activity is better measured
on the source, but


+ "GROUP BY mail_thread.mailThreadId",

I don't think this is doing what you expected it to do. You would need
to group both by mail_thread.mailThreadId and thread_responses.who,
otherwise you will get just a single edge per thread with a random value
in the "from" field.
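
A minimal sketch of the grouping fix described above: the patch's join is kept, but rows are grouped by thread and responder. It only builds the query string. `sq.local` is a hypothetical stand-in for codeface's `sq()`, and base R `paste` stands in for `str_c(..., sep=" ")`. As in the patch, `projectId` and `mailDate` stay unqualified, since the thread does not say which table holds them.

```r
## Hypothetical stand-in for codeface's sq(): wrap a value in single quotes
sq.local <- function(x) paste0("'", x, "'")

## Build the edgelist query with one row per (thread, responder) pair
build.mail.edgelist.query <- function(pid, start.date, end.date) {
  paste("SELECT who AS `from`, createdBy AS `to`, COUNT(*) AS `weight`",
        "FROM mail_thread, thread_responses",
        "WHERE mail_thread.mailThreadId=thread_responses.mailThreadId",
        "AND projectId=", pid,
        "AND mailDate >=", sq.local(start.date),
        "AND mailDate <", sq.local(end.date),
        ## Group by the responder as well, so `from` is well defined per row
        "GROUP BY mail_thread.mailThreadId, thread_responses.who")
}
```

The resulting string could be passed to `dbGetQuery(con, query)` exactly as in the patch. This still only produces edges towards the thread creator.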

I think this should be group by mail_thread.createdBy and
thread_responses.who, correct? The result should be a weight between
two developer ids that expresses the number of times they have
contributed to a common thread.

If that's the intention, you would need to join thread_responses onto
itself, with response1.thread = response2.thread AND response1.date >
response2.date AND response1.who != response2.who . That would in fact
yield all edges inside a single thread. Keep the additional join on
mail_thread to filter by project and range.
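
The self-join described above could be sketched roughly as follows. This is an illustration under assumptions, not a tested replacement for the patch: it assumes `mailDate` lives in `thread_responses` and `projectId` in `mail_thread`, and that the date range should apply to the later (source) message. `sq.local` is a hypothetical stand-in for codeface's `sq()`.

```r
## Hypothetical stand-in for codeface's sq(): wrap a value in single quotes
sq.local <- function(x) paste0("'", x, "'")

## Build a query that yields an edge from every response to every
## earlier response by a different author in the same thread
build.selfjoin.edgelist.query <- function(pid, start.date, end.date) {
  paste("SELECT r1.who AS `from`, r2.who AS `to`, COUNT(*) AS `weight`",
        "FROM thread_responses r1",
        "JOIN thread_responses r2",
        "ON r1.mailThreadId = r2.mailThreadId",
        "AND r1.mailDate > r2.mailDate",
        "AND r1.who != r2.who",
        ## Extra join on mail_thread, used only to filter by project
        "JOIN mail_thread t ON t.mailThreadId = r1.mailThreadId",
        "WHERE t.projectId =", pid,
        "AND r1.mailDate >=", sq.local(start.date),
        "AND r1.mailDate <", sq.local(end.date),
        "GROUP BY r1.who, r2.who")
}
```

Grouping by both endpoints gives one row per ordered developer pair, with the weight counting how often the first replied after the second within a thread.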

But it's not a good metric yet, as there is no differentiation between
direct responses and indirect ones happening much deeper in the tree, or
possibly even in some other sub-thread. Derailed communication will
skew this metric.

In fact, personally I would discard anything but direct responses. At
most refine it by respecting a single layer of indirection with reduced
weight.

This however would require the direct predecessor of each response to be
recorded in the first place.

Without that, the only possible hack I could think of would be to assume
that all communication is flat, so that neighboring messages can be
determined based on the timestamp. It is still possible to formulate that
as a query, and fast enough, but I don't know how far this would deviate
if a communication splits into multiple asynchronous sub-threads (like
this one...).
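
A rough sketch of the flat-thread hack, under the same assumptions as before (`mailDate` in `thread_responses`, `projectId` in `mail_thread`, and `sq.local` as a hypothetical stand-in for codeface's `sq()`). Each response is linked only to the message with the latest earlier timestamp in the same thread. Whether the opening message of a thread is itself recorded in `thread_responses` is not clear from the discussion, so edges to the OP may be missing.

```r
## Hypothetical stand-in for codeface's sq(): wrap a value in single quotes
sq.local <- function(x) paste0("'", x, "'")

## Build a query linking each response to its timestamp predecessor
build.adjacent.edgelist.query <- function(pid, start.date, end.date) {
  paste("SELECT r1.who AS `from`, r2.who AS `to`, COUNT(*) AS `weight`",
        "FROM thread_responses r1",
        "JOIN thread_responses r2",
        "ON r1.mailThreadId = r2.mailThreadId",
        ## r2 must carry the latest timestamp that precedes r1 in the thread
        "AND r2.mailDate = (SELECT MAX(r3.mailDate)",
        "FROM thread_responses r3",
        "WHERE r3.mailThreadId = r1.mailThreadId",
        "AND r3.mailDate < r1.mailDate)",
        "JOIN mail_thread t ON t.mailThreadId = r1.mailThreadId",
        "WHERE t.projectId =", pid,
        ## Drop self-replies so they do not become self-loops
        "AND r1.who != r2.who",
        "AND r1.mailDate >=", sq.local(start.date),
        "AND r1.mailDate <", sq.local(end.date),
        "GROUP BY r1.who, r2.who")
}
```

If two messages in a thread share the same timestamp, both count as predecessors of the next message, so ties produce extra edges.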

--Andreas

--Mitchell


Right, that is totally wrong. Thanks for catching it.

--Mitchell


-- Andreas

+ sep=" ")
+ dat <- dbGetQuery(con, query)
+
+ return(dat)
+}
+
## Distributions for commit statistics
query.contributions.stats.range <- function(con, range.id, include.id=FALSE) {
if (include.id) {



