[codeface] Re: Data model

From: Wolfgang Mauerer <wm@xxxxxxxxxxxxxxxx>
To: codeface@xxxxxxxxxxxxx
Date: Thu, 02 Jul 2015 18:56:00 +0200

(sorry for the delayed response!)

Am 30/06/2015 um 17:01 schrieb Mitchell Joblin:

On Mon, Jun 8, 2015 at 11:30 AM, Claus Hunsen <hunsen@xxxxxxxxxxxxxxxxx>
wrote:

we had a short look at the data model of Codeface and now have some
questions, which somebody on this list might be able to answer. This
would help us to understand the model better and to implement our
software-metrics extensions properly.

- Does the data of the 'commit' table come solely from the blame analysis?

No. The blame analysis is used to map lines of code to developers. To
get the commit data we mostly use "git log" and "git show" commands
which are far faster than using blame. In the cluster.py file you can
see the commits getting added to the database table and from there you
can trace back where the commit object was generated. That mostly
occurs in VCS.py.
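
A minimal sketch of the "git log" route, for illustration only. This is not
the code in VCS.py: the "@@" marker in the format string, the function names
and the sample text are all assumptions made for this example. It parses
per-commit added/deleted line counts from "git log --numstat" output.

```python
import subprocess


def run_git_log(repo_path):
    """Return raw 'git log --numstat' text; '@@' marks each commit header line."""
    result = subprocess.run(
        ["git", "log", "--numstat", "--pretty=format:@@%H%x09%an%x09%ae"],
        cwd=repo_path, capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_numstat_log(text):
    """Turn the text produced by run_git_log() into a list of commit dicts."""
    commits = []
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # blank separator lines between commits
        if line.startswith("@@"):
            commit_hash, author, email = line[2:].split("\t", 2)
            current = {"hash": commit_hash, "author": author,
                       "email": email, "added": 0, "deleted": 0}
            commits.append(current)
        elif current is not None:
            added, deleted, _path = line.split("\t", 2)
            if added != "-":  # binary files are reported as "-\t-\tpath"
                current["added"] += int(added)
                current["deleted"] += int(deleted)
    return commits


# Hypothetical sample output, so the parser can be tried without a repository.
SAMPLE = (
    "@@1111111\tAlice Example\talice@example.org\n"
    "3\t1\tsrc/a.c\n"
    "-\t-\tlogo.png\n"
    "\n"
    "@@2222222\tBob Example\tbob@example.org\n"
    "10\t0\tsrc/b.c\n"
)
```

On a real checkout, parse_numstat_log(run_git_log("/path/to/repo")) would
give one dict per commit.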

- What is the column 'commit_dependency.impl'?

This column is to add the implementation for whatever entity is added
there. Basically the source code. For features I don't think this gets
added but for functions/files it does.

- What is the 'author_commit_stats' table that seems to be more a view?

There is an author_commit_stats table, and it is not a view. There is
also an author_commit_stats_view, which is a view. The table stores the
number of lines added, deleted, and changed in total by a given
developer, and the number of commits they made.
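
To make the table-versus-view distinction concrete, here is a small sketch
using an in-memory SQLite database. The table and view names (commit_demo,
author_stats_demo), the columns, and the definition of "total" as added plus
deleted are assumptions for illustration; they are not the actual Codeface
schema.

```python
import sqlite3


def build_demo():
    """Create a toy commit table plus a view that aggregates it per author."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE commit_demo (
            id      INTEGER PRIMARY KEY,
            author  TEXT    NOT NULL,
            added   INTEGER NOT NULL,
            deleted INTEGER NOT NULL
        );
        INSERT INTO commit_demo (author, added, deleted) VALUES
            ('alice', 5, 2),
            ('alice', 3, 0),
            ('bob',   1, 4);
        -- The view is recomputed on every query; a table would store a snapshot.
        CREATE VIEW author_stats_demo AS
            SELECT author,
                   SUM(added)           AS added,
                   SUM(deleted)         AS deleted,
                   SUM(added + deleted) AS total,
                   COUNT(*)             AS num_commits
            FROM commit_demo
            GROUP BY author;
    """)
    return con


def author_stats(con):
    """Return (author, added, deleted, total, num_commits) rows, sorted by author."""
    return con.execute(
        "SELECT author, added, deleted, total, num_commits "
        "FROM author_stats_demo ORDER BY author"
    ).fetchall()
```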

- The same question for 'commit_communication'?

I'm not sure if this is ever filled. Perhaps we will use it in the
future to see when two developers discuss a commit, for example on a
mailing list.

This table is currently unused. It was intended to capture implicit
communication between contributors via a single commit (for instance,
a correction by contributor B of a piece of code written by
contributor A), but we now do this type of analysis differently.
We could expand and re-use the table in the way Mitchell described,
though. Or get rid of it.

- Can someone explain the idea of the "time series and plots"
submodule? This seems quite confusing to us.

Wolfgang wrote that so he can probably do a better job of explaining
what is going on there. I think that various data are queried over
different revisions (e.g., number of lines of code added during a
revision) to produce a number of time series. Those time series are
then analyzed and plotted. Not sure if that helps at all. @Wolfgang,
could you please shed a little more light on this?

Table timeseries stores, in fact, univariate time series. Currently, we
use it for things like "how much code flows into a project over time";
these series are computed during the analysis phase and smoothed with
various methods. For each smoothing method and time series, a unique
plot id is assigned. Storing the results makes sense for time series
that are resource-intensive to compute.
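
As a rough sketch of the idea (one stored result per combination of time
series and smoothing method, each with its own plot id), consider the
following. The trailing moving average stands in for whichever smoothing
methods are actually used, and the PlotRegistry class is hypothetical; none
of this is taken from the Codeface sources.

```python
from itertools import count


def moving_average(values, window):
    """Trailing moving average; early points average over what is available."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


class PlotRegistry:
    """Hand out one stable id per (series name, smoothing method) pair."""

    def __init__(self):
        self._ids = {}
        self._next_id = count(1)

    def plot_id(self, series_name, method):
        key = (series_name, method)
        if key not in self._ids:
            self._ids[key] = next(self._next_id)
        return self._ids[key]
```

A caller would compute the expensive series once, store it under the id
returned by plot_id(), and look it up by the same id later.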

Time series are exported from the ID service for historical
reasons: We originally intended to make all derived data available
via a REST interface from the ID service and perform plotting directly
in the browser (that's also the reason why the ID service is written
in node.js). However, it turned out that JavaScript plotting libraries
at the time lacked many features we wanted, and that data transfer
between ID service and browser was quite a bottleneck for longer,
unfiltered time series.

plot_bin was intended for generic plots represented by a raster
image of some sort. Table plots provides metadata about the available
plots (project, name of plot, release range and x and y labels --
we're limited to 2D graphs that way, but that is fine).

The approach does not scale too well; for sloccount results, we had to
add a new table for multivariate (four-variable) time series.

We can't get rid of storing time series data; some of them are expensive
to compute, and some are original data. However, we should consider
changing the naming.

Furthermore, we identified some smaller issues that might hinder
extension of the Codeface model and the linking of several related
projects. Hopefully, you can share your opinion on our observations.

- One author (table 'person') cannot be part of several projects. At
least, the author occurs several times in the database, once for each
project.
~> A 'person--project' table would help, while removing the FK from
the 'person' table.

Right, that is potentially an issue. So far we have not had a need to
do analyses that cross-cut multiple projects. I can see this would be
useful for ecosystems or projects that split their work into multiple
repositories. I'm supportive of this change. If you would like to
change the data model, please alter the model using MySQL
Workbench 6 and then forward engineer the model to generate the
script. Please put both the changes to the model and the generated
script in the same commit. It's difficult to identify changes to
the model, since git sees it as just a binary file. Thanks.

The decision to associate authors with a single project was deliberate.
On the one hand, it's quite rare to have a single person contribute
substantially to multiple projects. On the other hand, the John Smiths
and Fritz Muellers of this world would be wrongly recorded as
contributing to N projects, although they are most likely different
persons who share a common name.

I agree that there are use cases where it is necessary to associate
a person with multiple projects. However, extending the person name
aliasing heuristics so that multiple identities within a project
are correctly resolved (John M. Smith and John Smith might be
both js@xxxxxxx) while simultaneously keeping different John Smiths
from different projects separate might be hard. Since nearly every
part of the core analysis depends on table person, I would suggest
leaving the table as simple as it is and adding a new link mechanism.
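
One possible shape for such a link mechanism, sketched with an in-memory
SQLite database: table person keeps one row per project, and a separate
table ties together the rows believed to be the same individual. All table
names, column names and sample rows here are hypothetical and are not part
of the Codeface schema.

```python
import sqlite3

SCHEMA = """
CREATE TABLE project (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- person stays as it is: one row per (individual, project).
CREATE TABLE person (
    id         INTEGER PRIMARY KEY,
    project_id INTEGER NOT NULL REFERENCES project(id),
    name       TEXT    NOT NULL
);
-- New: an identity groups person rows from different projects.
CREATE TABLE global_identity (id INTEGER PRIMARY KEY);
CREATE TABLE person_identity (
    person_id   INTEGER PRIMARY KEY REFERENCES person(id),
    identity_id INTEGER NOT NULL REFERENCES global_identity(id)
);
"""


def demo():
    con = sqlite3.connect(":memory:")
    con.executescript(SCHEMA)
    con.executemany("INSERT INTO project VALUES (?, ?)",
                    [(1, "base"), (2, "satellite"), (3, "unrelated")])
    con.executemany("INSERT INTO person VALUES (?, ?, ?)",
                    [(10, 1, "John Smith"),
                     (11, 2, "John M. Smith"),
                     (12, 3, "John Smith")])  # a different John Smith
    con.execute("INSERT INTO global_identity VALUES (100)")
    # Only the two rows known to be the same individual are linked.
    con.executemany("INSERT INTO person_identity VALUES (?, ?)",
                    [(10, 100), (11, 100)])
    return con


def projects_of_identity(con, identity_id):
    """List the project names that a linked identity contributes to."""
    rows = con.execute(
        "SELECT DISTINCT project.name FROM person_identity "
        "JOIN person ON person.id = person_identity.person_id "
        "JOIN project ON project.id = person.project_id "
        "WHERE person_identity.identity_id = ? ORDER BY project.name",
        (identity_id,),
    )
    return [row[0] for row in rows]
```

The unlinked John Smith in the third project stays separate, which is the
property the per-project person table was meant to protect.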

I've CC'ed Andreas Ringlstetter and Benjamin Hiefner -- they will be
working on analysing projects that are composed of one base project
and several satellites, for instance jQuery. The mechanism will also
be useful in this case.

- Additionally, the email addresses should also be moved to a mapping
table 'person--email'.

I don't quite get the rationale here, perhaps I am missing something.
Why not keep the emails in the person table?

I suppose that's because you dislike the limitation to 5 email addresses
per person? It would surely be nicer to restructure table person to
link to a list of email addresses; however, we never had trouble
with this limit even for large projects that span decades.
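
For comparison, a 'person--email' mapping table without a fixed upper limit
could look roughly like the sketch below (in-memory SQLite; the table names,
columns and addresses are made up for illustration and do not reflect the
current Codeface schema).

```python
import sqlite3


def demo():
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE person_demo (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        -- One row per address, so the number of addresses is unbounded.
        CREATE TABLE person_email_demo (
            person_id INTEGER NOT NULL REFERENCES person_demo(id),
            email     TEXT    NOT NULL,
            PRIMARY KEY (person_id, email)
        );
        INSERT INTO person_demo VALUES (1, 'John Smith');
    """)
    addresses = [(1, "js%d@example.org" % i) for i in range(6)]
    con.executemany("INSERT INTO person_email_demo VALUES (?, ?)", addresses)
    return con


def emails_of(con, person_id):
    """Return all addresses recorded for one person, sorted alphabetically."""
    rows = con.execute(
        "SELECT email FROM person_email_demo WHERE person_id = ? ORDER BY email",
        (person_id,),
    )
    return [row[0] for row in rows]
```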

- The 'release_timeline' table is lacking the commit hashes the
releases refer to. This way, we cannot identify the right commit in
the 'commit' table that corresponds to a 'release_timeline' object.
~> Adding the hash to the 'release_timeline' table and also adding a
mapping table 'commit--tag' would enable us to get the right commit
for a release tag.

Yes, I see the issue here. So you want the single commit that the tag
is referencing but not all the commits from a particular range. I
guess we would have to add the hash, no problem. Are you ok with
making those changes? I can have a look once you submit a patch. If
you need help with anything just ask.

Please don't add the hash value directly; add a reference to the
appropriate entry in table commit instead, which already contains the hash.
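
A small sketch of the difference, using in-memory SQLite: the release row
stores a foreign key to the commit row, and the hash is obtained through a
join. The "_demo" tables and their columns are assumptions for illustration,
not the real Codeface tables.

```python
import sqlite3


def demo():
    con = sqlite3.connect(":memory:")
    con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only on request
    con.executescript("""
        CREATE TABLE commit_demo (
            id          INTEGER PRIMARY KEY,
            commit_hash TEXT NOT NULL UNIQUE
        );
        CREATE TABLE release_timeline_demo (
            id        INTEGER PRIMARY KEY,
            tag       TEXT NOT NULL,
            commit_id INTEGER NOT NULL REFERENCES commit_demo(id)
        );
    """)
    con.execute("INSERT INTO commit_demo VALUES (1, ?)", ("a" * 40,))
    con.execute(
        "INSERT INTO release_timeline_demo (tag, commit_id) VALUES ('v1.0', 1)")
    return con


def hash_for_tag(con, tag):
    """Resolve a release tag to its commit hash via the foreign key."""
    row = con.execute(
        "SELECT c.commit_hash FROM release_timeline_demo AS r "
        "JOIN commit_demo AS c ON c.id = r.commit_id WHERE r.tag = ?",
        (tag,),
    ).fetchone()
    return row[0] if row else None
```

With the reference in place, the database rejects a release row that points
at a commit that does not exist, which a bare hash column could not
guarantee.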

Thanks & best regards, Wolfgang

Kind regards,

Mitchell

Best regards,
Claus
