[codeface] Re: data model

  • From: Wolfgang Mauerer <wolfgang.mauerer@xxxxxxxxxxxxxxxxx>
  • To: Andreas Ringlstetter <andreas.ringlstetter@xxxxxxxxxxxxxxxxxxxx>, <codeface@xxxxxxxxxxxxx>, Mitchell Joblin <joblin.m@xxxxxxxxx>, "Joblin, Mitchell (ext)" <mitchell.joblin.ext@xxxxxxxxxxx>
  • Date: Tue, 20 Oct 2015 16:14:05 +0200



On 20/10/2015 at 15:38, Andreas Ringlstetter wrote:



On 20.10.2015 at 13:08, Wolfgang Mauerer wrote:
Hi Mitchell,

Sorry for the delayed response. I'm CC'ing the guys currently
cleaning up/documenting things, perhaps they have some opinion.

On 12/10/2015 at 18:29, Mitchell Joblin wrote:

From my understanding of the data model, the relationship between a
developer and the mail threads they have contributed to is not
captured. We do have information about the creator of a thread and the
number of contributors to that thread, but there is no reference to
person IDs. I think that we should add a table to the mailing list

Yes, that's currently missing in the database. We should add the
corresponding information.

analysis that is very similar to the commit_dependency table. In this
table we should have columns that contain a person id, a thread id,
the content of their message, and a time stamp. I would like to
preserve most of what the corpus object contains in R. Perhaps there
are some additional items.

For the content, keep in mind that there are two versions: The
raw data as posted to the list, and the stripped down version
after linguistic processing. If we keep either of them, we should keep
both.
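
As a rough illustration of the table discussed above, here is a minimal sketch using Python's built-in sqlite3 module. The table name mail_contribution and all column names are hypothetical and not taken from the actual Codeface schema. The sketch only shows that both content versions and a person-to-thread link could live in one row.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only,
# not taken from the actual Codeface schema.
SCHEMA = """
CREATE TABLE mail_contribution (
    id                INTEGER PRIMARY KEY,
    person_id         INTEGER NOT NULL,
    thread_id         INTEGER NOT NULL,
    raw_content       TEXT NOT NULL,   -- message as posted to the list
    processed_content TEXT NOT NULL,   -- after linguistic processing
    sent_at           TEXT NOT NULL    -- ISO 8601 time stamp
);
"""


def contributors_of(conn, thread_id):
    """Return the distinct person ids that posted to a thread."""
    rows = conn.execute(
        "SELECT DISTINCT person_id FROM mail_contribution "
        "WHERE thread_id = ? ORDER BY person_id",
        (thread_id,),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO mail_contribution "
    "(person_id, thread_id, raw_content, processed_content, sent_at) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        # Made-up sample rows for demonstration.
        (1, 10, "Hello List!", "hello list", "2015-10-12T18:29:00"),
        (2, 10, "Hello Back!", "hello back", "2015-10-20T13:08:00"),
        (1, 10, "Thanks All!", "thanks all", "2015-10-20T15:38:00"),
    ],
)
print(contributors_of(conn, 10))
```

With such a table, the contributors of a thread become a plain query instead of something that has to be reconstructed from the corpus object.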

If we want to construct one-mode cooperation graphs from mailing lists,
I guess the "Clusters" section of the DB schema also requires some
clean-up work. per_cluster_statistics is currently focused on
VCS derived clusters, but there would be different covariates of
interest for mailing lists. This should be generalised somehow.

twomode_vertices and twomode_edgelists are special collaboration
cases that currently make sense only for mailing lists, but could
maybe also be useful for bug trackers. Should we maybe also add
these to the "Clusters" group, and then equip the structures
in this group with back-links to artefact specific data?

I still have to evaluate how easy this is to clean up, but it doesn't
make much sense to carry both the twomode and the cluster model.

I believe the generic pattern for all clusters is always directed
communication between two persons on a specific topic?

Currently, it is, but this might not be the case in the future. We
will shortly be looking into clustering email networks, and in this
case, twomode clusters will likely also be of interest.

Commit communication is currently treated as a list of contributions,
but eventually flattened to a person<->person relation as well. But
since the cluster database (contrary to the twomode system) has no
notion of the "topic" this connection relates to, that information is
currently lost later on. The topic would either be an artefact, commit,
or whatever the relation is defined on. It could be worth it to keep the
type of contribution as well, as long as the cluster algorithm isn't
making this indistinguishable.

The twomode system isn't distinguishing by contribution type, but
defaulting to a constant doesn't hurt. It does support something else
though, which the cluster scheme is currently missing as well:
"untargeted" contributions, AKA original posts. This could be modeled as a
self reference though. (Either a self reference or a null target, unless
someone can provide a legit use case where a self reference would be
required to model something different which couldn't be differentiated
by contribution type.)
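
The null-target variant could be sketched roughly as follows. The Contribution class and its field names are hypothetical and do not reflect the existing cluster or twomode tables. The point is only that a null target marks an original post while a genuine self reference stays distinguishable.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Contribution:
    """Hypothetical edge record; field names are assumptions."""
    source: int             # person id of the author
    target: Optional[int]   # person id addressed; None for an original post
    topic: str              # thread, ticket, or other artefact identifier
    kind: str = "reply"     # contribution type, defaulting to a constant

    @property
    def is_untargeted(self) -> bool:
        return self.target is None


edges = [
    Contribution(source=1, target=None, topic="thread-10", kind="post"),
    Contribution(source=2, target=1, topic="thread-10"),
    # A real self reference remains expressible and is not confused
    # with an untargeted contribution.
    Contribution(source=1, target=1, topic="func-foo", kind="commit"),
]

untargeted = [edge for edge in edges if edge.is_untargeted]
print(untargeted)
```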

IIRC, self references can occur with proximity analysis when an author
contributes multiple changes to the same function during an analysis
interval, but we filter this out. I'm not sure if the filtering is done
before plotting, or before the data hit the storage. Mitchell?

The bug tracker follows the same scheme. Multiple persons contribute to a
common topic (a ticket), but communication is (except for the initial
report) always directed between two persons, with specific contribution
types. Much like the mailing list, actually.

Yes, that's why we should unify these things.


Personally, I would recommend extending the cluster scheme to
support both the features required by the ML analysis and the
bug tracker, and simply rewriting the ML queries to run straight inside the
cluster database.

That seems reasonable.

Unrelated to that, evaluate for the VCS analysis methods whether more
detailed information can be stored when flattening. This shouldn't affect
the performance of PR and the like at all, as the data can still be
aggregated by the database right before loading it into memory. After all,
that's why you are using a relational database.
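
A minimal sketch of that idea, again with Python's sqlite3 module: the detailed rows keep topic and contribution type, and the database flattens them to weighted person-to-person edges at query time. The contribution table and its columns are assumptions made for illustration, not the existing schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical detailed storage: one row per contribution, topic and
# contribution type are kept instead of being dropped when flattening.
conn.executescript("""
CREATE TABLE contribution (
    source_id INTEGER NOT NULL,
    target_id INTEGER NOT NULL,
    topic     TEXT NOT NULL,
    kind      TEXT NOT NULL
);
""")
conn.executemany(
    "INSERT INTO contribution VALUES (?, ?, ?, ?)",
    [
        # Made-up sample rows for demonstration.
        (1, 2, "file_a.c", "commit"),
        (1, 2, "file_b.c", "commit"),
        (2, 1, "file_a.c", "commit"),
        (3, 1, "file_a.c", "commit"),
    ],
)


def flattened_edges(conn):
    """Let the database aggregate to person->person edge weights."""
    rows = conn.execute(
        "SELECT source_id, target_id, COUNT(*) AS weight "
        "FROM contribution "
        "GROUP BY source_id, target_id "
        "ORDER BY source_id, target_id"
    )
    return rows.fetchall()


edges = flattened_edges(conn)
print(edges)
```

Only the aggregated edge list is loaded into memory; the per-topic detail stays available in the table for other analyses.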

We did suffer quite substantial performance problems in the past when
doing complex composite queries; that's why there are some
persistent views in the DB.

Oh, and the reason why we use a relational database in the first
place is simply that at the time Codeface/prosoda/quantarch was initiated,
there were no proper R and Python libraries for interfacing
with other types of databases.

Best regards, Wolfgang Mauerer

Greetings,
Andreas

Thanks & best regards, Wolfgang

Kind regards,

Mitchell

