[codeface] Re: data model

  • From: Mitchell Joblin <joblin.m@xxxxxxxxx>
  • To: Wolfgang Mauerer <wolfgang.mauerer@xxxxxxxxxxxxxxxxx>
  • Date: Tue, 20 Oct 2015 15:32:06 +0200

On Tue, Oct 20, 2015 at 1:08 PM, Wolfgang Mauerer
<wolfgang.mauerer@xxxxxxxxxxxxxxxxx> wrote:

Hi Mitchell,

sorry for the delayed response. I'm CC'ing the guys currently
cleaning up/documenting things, perhaps they have some opinion.

Sure, I had to get working on this anyway, and I will share what I have shortly.


On 12/10/2015 at 18:29, Mitchell Joblin wrote:

From my understanding of the data model, the relationship between a
developer and the mail threads they have contributed to is not
captured. We do have information about the creator of a thread and the
number of contributors to that thread, but there is no reference to
person IDs. I think that we should add a table to the mailing list

Yes, that's currently missing in the database. We should add the
corresponding information.

Great.


analysis that is very similar to the commit_dependency table. In this
table we should have columns that contain a person id, a thread id,
the content of their message, and a time stamp. I would like to
preserve most of what the corpus object contains in R. Perhaps there
are some additional items.

For the content, keep in mind that there are two versions: The
raw data as posted to the list, and the stripped down version
after linguistic processing. If we keep any of them, we should keep
both.

Sounds good. Since we don't need the content right now, I might not add
it at this time. We can add it later when we need it; adding the
columns should not be an issue.
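
To make the proposal concrete, here is a minimal sketch of what such a table could look like. It uses Python's built-in sqlite3 module purely for illustration; the table name mail_thread_contribution and all column names are assumptions, not the actual Codeface schema. The two content columns are added in a second step to show that deferring them should be cheap, and that the raw and processed versions can be added together.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE mail_thread_contribution (
        id            INTEGER PRIMARY KEY,
        personId      INTEGER NOT NULL,
        mailThreadId  INTEGER NOT NULL,
        creationDate  TEXT NOT NULL
    )
""")

# One row per message: who posted to which thread, and when (made-up data).
conn.executemany(
    "INSERT INTO mail_thread_contribution (personId, mailThreadId, creationDate) "
    "VALUES (?, ?, ?)",
    [
        (1, 10, "2015-10-12 18:29:00"),
        (2, 10, "2015-10-20 13:08:00"),
        (1, 11, "2015-10-20 15:32:00"),
    ],
)

# Later, once the content is needed: add the raw and the processed version together.
conn.execute("ALTER TABLE mail_thread_contribution ADD COLUMN rawContent TEXT")
conn.execute("ALTER TABLE mail_thread_contribution ADD COLUMN processedContent TEXT")


def threads_of(person_id):
    """Return the ids of all threads a given person has contributed to."""
    rows = conn.execute(
        "SELECT DISTINCT mailThreadId FROM mail_thread_contribution "
        "WHERE personId = ? ORDER BY mailThreadId",
        (person_id,),
    )
    return [row[0] for row in rows]


# Column names after the later extension (second field of PRAGMA table_info).
columns = [row[1] for row in conn.execute("PRAGMA table_info(mail_thread_contribution)")]
```

With a table of this shape, the developer-to-thread relationship becomes a simple lookup such as threads_of(1), and rows created before the content columns existed simply carry NULL in them.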


If we want to construct one-mode cooperation graphs from mailing lists,
I guess the "Clusters" section of the DB schema also requires some
clean-up work. per_cluster_statistics is currently focused on
VCS derived clusters, but there would be different covariates of
interest for mailing lists. This should be generalised somehow.

Yes, that's correct. We need a way to distinguish between different
graph types (e.g. VCS, email, bug tracker). Currently I am using the
clusterMethod for that, since we are already abusing the model by
assigning complete graphs to cluster number -1, and complete graphs
don't have a clusterMethod anyway. For the email analysis I am not
storing the clusters, only the complete graph. We could just add a
"type" column to the clusters table if we don't wish to handle the -1
indexing hack at this time.
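
As a rough sketch of the "type" column idea, again using Python's sqlite3 for illustration only: the cluster table below is a simplified, hypothetical stand-in for the real one, and the type values 'vcs' and 'email' as well as the clusterMethod string are assumptions. Existing rows default to 'vcs', so the -1 convention for complete graphs is left untouched while clusterMethod no longer has to encode the graph type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Simplified, hypothetical stand-in for the clusters table.
conn.execute("""
    CREATE TABLE cluster (
        id            INTEGER PRIMARY KEY,
        clusterNumber INTEGER NOT NULL,
        clusterMethod TEXT
    )
""")

# Existing VCS data: a complete graph (cluster number -1, no method)
# and one ordinary cluster with a placeholder method name.
conn.execute("INSERT INTO cluster (clusterNumber, clusterMethod) VALUES (-1, NULL)")
conn.execute("INSERT INTO cluster (clusterNumber, clusterMethod) VALUES (0, 'example method')")

# Add the graph type; rows that already exist are treated as VCS-derived.
conn.execute("ALTER TABLE cluster ADD COLUMN type TEXT NOT NULL DEFAULT 'vcs'")

# Email analysis stores only the complete graph, now tagged explicitly.
conn.execute(
    "INSERT INTO cluster (clusterNumber, clusterMethod, type) VALUES (-1, NULL, 'email')"
)


def complete_graphs(graph_type):
    """Return ids of complete graphs (cluster number -1) of the given type."""
    rows = conn.execute(
        "SELECT id FROM cluster WHERE clusterNumber = -1 AND type = ? ORDER BY id",
        (graph_type,),
    )
    return [row[0] for row in rows]
```

Queries for mailing list graphs can then filter on type = 'email' instead of relying on a special clusterMethod value.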

Kind regards,

Mitchell


twomode_vertices and twomode_edgelists are special collaboration
cases that currently make sense only for mailing lists, but could
maybe also be useful for bug trackers. Should we maybe also add
these to the "Clusters" group, and then equip the structures
in this group with back-links to artefact specific data?

Thanks & best regards, Wolfgang

Kind regards,

Mitchell

