[codeface] Re: Multi datasource analysis

  • From: Wolfgang Mauerer <wolfgang.mauerer@xxxxxxxxxxxxxxxxx>
  • To: <codeface@xxxxxxxxxxxxx>
  • Date: Thu, 5 Nov 2015 14:00:13 +0100



On 05/11/15 at 13:23, Mitchell Joblin wrote:

On Thu, Nov 5, 2015 at 12:39 PM, Wolfgang Mauerer
<wolfgang.mauerer@xxxxxxxxxxxxxxxxx> wrote:
On 27/10/15 at 12:04, Mitchell Joblin wrote:

On Tue, Oct 27, 2015, 10:44 Andreas Ringlstetter
<andreas.ringlstetter@xxxxxxxxxxxxxxxxxxxx> wrote:

Good Morning,

I want to add time series and cluster analysis spanning multiple
projects and datasources (e.g. VCS activity split across multiple
repositories) to codeface, but I'm not entirely sure which approach
to take.

Naturally, the user database, release ranges and the release timeline
need to be shared, and all data sources logically belong to a single
project. The aggregation into a single project may either happen when
initially filling the database, or by merging multiple existing projects
into one "manually" and only then triggering the analysis pass.

I am aware of the restructuring necessary to achieve that (mostly
breaking up the monolithic cluster.py so that the import and analysis
phases are strictly separated), but I can't decide how to partition VCS
activity in individual projects.

So far I have worked out three different approaches, each with
obvious flaws:
- Partitioning of all projects based on natural timestamps defined by
releases in the master project. This is most likely to break when
projects make heavy use of overlapping feature branches, and the
correlation of release cycles in master and slave repositories can't
be taken as given for all projects. Essentially, one repository is
declared authoritative, and every other repository is expected to
follow the same release cycles. If this premise doesn't hold, activity
will be misattributed.

My feeling is that this won't work that well. Release ranges are
already hard to interpret without considering how they are related
between repositories in an ecosystem project. I think the necessary
assumptions are too strong to be realistic.

This solution will surely not work in all scenarios, but I think that
the approach will be useful for practical purposes because it allows
us to see how well satellite repositories are aligned with the
master repo schedule.
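
To illustrate (a minimal sketch, not existing codeface code; repo
paths, tag lists and helper names are made up): satellite commits
could be attributed to master release windows purely by timestamp
like this:

    import subprocess
    from bisect import bisect_right

    def tag_time(repo, tag):
        # Commit timestamp (epoch seconds) of the commit the tag points at.
        out = subprocess.check_output(
            ["git", "-C", repo, "log", "-1", "--format=%ct", tag])
        return int(out.decode().strip())

    def attribute_to_master_ranges(master_repo, master_tags, satellite_repo):
        # master_tags is the linear release chain of the authoritative
        # repository, e.g. ["v1", "v2", "v3"], assumed chronological.
        boundaries = [tag_time(master_repo, t) for t in master_tags]
        out = subprocess.check_output(
            ["git", "-C", satellite_repo, "log", "--all", "--format=%H %ct"])
        assignment = {}
        for line in out.decode().splitlines():
            sha, ts = line.split()
            i = bisect_right(boundaries, int(ts))
            if 0 < i < len(boundaries):
                # Commit falls into master_tags[i-1]..master_tags[i].
                assignment[sha] = (master_tags[i - 1], master_tags[i])
        return assignment

Exactly as described above, this silently misattributes commits
whenever a satellite repository does not follow the master release
schedule.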




- Grouping tags from multiple repositories by (API) compatibility.
Commits are not partitioned by timestamp, but exclusively assigned
to a tag. This approach lacks any natural temporal correlation between
corresponding commit sets from different repositories.

what's a natural correlation?

There is also the issue of being unable to correctly assign
contributions to a specific version if a component was made upwards
compatible ahead of time.

not sure what you mean by "made upwards compatible" here.

In return, this should yield the most coherent data regarding actual
development cycles, even when releases on subcomponents do not happen
in a timely fashion. This approach is only applicable to data sources
where activity can be mapped directly to a specific version.

I'm not sure I fully understand your consideration. So you are relying
on repository-external tags (say, A, B, C), and then you analyse
all dependent repos based on the times associated with these
tags?


Same feeling for this approach.


- Forcefully apply fixed time frames to all projects. Best suited for
measuring community activity and long-term health, but offers no
correlation with release cycles. The only approach not requiring manual
tag evaluation.

If we can only choose one, then I would take this one. We already do a
lot of time series analysis to look at project evolution, so in that
regard this approach would produce the most intuitive results. Some
projects seem to release a new revision with every commit, while others
rarely ever increment the revision number. So far, Linux seems to be
the rare project that has very consistent and meaningful release cycles.
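
A rough sketch of what the fixed-time-frame bucketing could look like
(window length and all names are illustrative, not existing codeface
interfaces):

    import subprocess
    from collections import defaultdict

    WINDOW = 90 * 24 * 3600  # e.g. 90-day windows; an arbitrary choice

    def bucket_commits(repos, start_epoch):
        # Maps window index -> list of (repo, sha) pairs, across all repos.
        buckets = defaultdict(list)
        for repo in repos:
            out = subprocess.check_output(
                ["git", "-C", repo, "log", "--all", "--format=%H %ct"])
            for line in out.decode().splitlines():
                sha, ts = line.split()
                idx = (int(ts) - start_epoch) // WINDOW
                if idx >= 0:
                    buckets[idx].append((repo, sha))
        return buckets

Since every repository is cut at the same wall-clock boundaries, the
resulting time series are directly comparable across repositories.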

Support all 3? Choose one?

And are overlapping release ranges even supported yet during analysis?
If not, then the second approach can be ruled out right away.

I think overlapping release ranges are supported.

not really: We currently work on the assumption that there's a given
path (in the config file) through the list of commits, like
rev1..rev2..rev3..rev4. Overlapping ranges can occur
with feature branches or similar things. Assume you have three major
releases V1, V2, and V3, with V1 < V2 < V3 time-wise, and additionally
V1.1, V1.2, V1.3, with V1.3 > V2. Then you can specify a
chain like V1..V1.1..V1.2..V1.3..V2..V3 in the config file, but the
results for V1.3..V2 are only of limited usefulness. Since the overhaul
of the data model will also include branch support, this problem will
be solved in general.
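
For reference, such a chain would be specified roughly like this in a
codeface-style YAML project configuration (a sketch; field names other
than revisions may differ from the actual format):

    project: example
    repo: example
    # One linear path through the history -- ranges are formed from
    # contiguous pairs: V1..V1.1, V1.1..V1.2, ..., V2..V3.
    revisions: ["V1", "V1.1", "V1.2", "V1.3", "V2", "V3"]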

I agree with everything you said, except that you can still get
overlapping ranges if you specify a configuration file such as
v1.1..v1.3..v1.2..v1.4. The interval v1.3..v1.2 will be of limited
usefulness for the same reason shown above, but v1.2..v1.4 will be
useful and overlaps with v1.1..v1.3. I have never tried this, but I
don't see why it wouldn't work. We essentially call git log for all
contiguous pairs of revisions stored in the configuration file.

right -- but having an edge v1.3..v1.2 means deliberately going
backwards in time, which only anti-particles are supposed to do ;)
So effectively, overlapping ranges are sort-of-supported, but only
in pathological cases that are not really useful.

Actually, we should add a check some day so that configuration files
don't contain such constructions, unless someone comes up with a
legitimate use case for them.
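
A check along those lines could be as simple as verifying that the tag
timestamps are monotonically non-decreasing (a sketch only; the
function names are made up, not existing codeface code):

    import subprocess

    def tag_time(repo, tag):
        # Commit timestamp (epoch seconds) of the commit the tag points at.
        out = subprocess.check_output(
            ["git", "-C", repo, "log", "-1", "--format=%ct", tag])
        return int(out.decode().strip())

    def check_revision_chain(repo, revisions):
        # Reject configurations whose revision chain goes backwards in time.
        times = [tag_time(repo, rev) for rev in revisions]
        for (a, ta), (b, tb) in zip(zip(revisions, times),
                                    zip(revisions[1:], times[1:])):
            if tb < ta:
                raise ValueError("revision chain goes backwards: "
                                 "%s -> %s" % (a, b))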



All in all, I don't see that one single fixed approach caters to
all needs. IMHO options 1 and 3 are the most useful ones, and play
along most naturally with the two modes we support right now (tag-
and time-based). To me, the best approach in terms of restructuring
seems to be:

* First analyse all projects of interest
* Then construct the time series for the different approaches from the
data

This would likely require storing the topology of the commit graph
in the database, or at least lists of commits for the possible
traversal variants. We have some code for this purpose (Mitchell, do
you know if Egon's stuff is accessible somewhere in public?), but be
warned that we already tried different non-trivial ways of traversing
the commit tree. This substantially increased complexity, but did not
lead to any important changes in the results compared to the simple
traversal we're using right now.
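
As a sketch of what would need to end up in the database (again not
existing codeface code): git rev-list --parents already yields the full
topology as an edge list, one (child, parent) row per edge:

    import subprocess

    def commit_graph_edges(repo):
        # Each output line is "<commit> <parent1> [<parent2> ...]";
        # merge commits contribute several edges, roots contribute none.
        out = subprocess.check_output(
            ["git", "-C", repo, "rev-list", "--parents", "--all"])
        edges = []
        for line in out.decode().splitlines():
            shas = line.split()
            child, parents = shas[0], shas[1:]
            edges.extend((child, parent) for parent in parents)
        return edges  # ready to be bulk-inserted into an edge table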

I think I was never included in those discussions with Egon, and I am
not aware of any code for doing this.

we cooperated on GPLv2 code to parse git repos from Python; this
included a tree walker with various fancy options. I'll ask him
if this is available somewhere.

Thanks, Wolfgang

--Mitchell


Best regards, Wolfgang Mauerer

Thanks,

Mitchell

The second approach also requires a different method for grouping into
release ranges, as commits are not grouped based on timestamp, but
based on reachability in the dependency graph instead (i.e., the
earliest tag they are reachable from). This method should be
implemented anyway, regardless of whether it is used for the second
approach or not.
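
Such a grouping could be prototyped on top of git tag --contains, which
lists all tags whose history includes a given commit (a sketch; the
helper names are made up):

    import subprocess

    def tag_time(repo, tag):
        out = subprocess.check_output(
            ["git", "-C", repo, "log", "-1", "--format=%ct", tag])
        return int(out.decode().strip())

    def earliest_containing_tag(repo, sha, tag_times):
        # tag_times: precomputed {tag: timestamp} for all release tags.
        out = subprocess.check_output(
            ["git", "-C", repo, "tag", "--contains", sha])
        tags = [t for t in out.decode().split() if t in tag_times]
        if not tags:
            return None  # commit is not contained in any release yet
        return min(tags, key=lambda t: tag_times[t])

Note that "earliest" is decided by tag timestamp here; ordering by
reachability between the tags themselves would be a possible
refinement.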

These overlapping release ranges also become relevant when the
bugtracker data source is merged in, as that data source yields
activity events for a certain version past its release, which overlaps
with, but does not necessarily contribute to, the next major release.

Greetings,
Andreas



