[codeface] Re: Branch support

From: Andreas Ringlstetter <andreas.ringlstetter@xxxxxxxxxxxxxxxxxxxx>
To: <codeface@xxxxxxxxxxxxx>
Date: Wed, 11 Nov 2015 11:42:11 +0100

On 10.11.2015 at 22:25, Wolfgang Mauerer wrote:

On 10/11/2015 at 16:01, Andreas Ringlstetter wrote:

what is actually required to provide branch support in Codeface?

For starters, it's changing a few assumptions:

- Branches can overlap, meaning the corresponding ranges can overlap.
They no longer form a single series. This can be worked around by using
multiple projects for multiple series.

The least invasive way to model this is to define new "meta-projects"
which simply plot multiple regular projects against each other.

I'm not much in favour of this approach: All release ranges of a
project (and the associated inferred data) are currently dispatched
from a project-specific view. A branch is nothing else than a
(generalised) release range, so it should be accessible like any other
release range.

Almost: a branch is a set of ranges, that is, a series.

The optimal solution of actually allowing multiple series per project
would break too much of the existing code base.

Why too much? I see these main modifications:

* Global time series (composed of multiple release range sub-series)
would be augmented with branch-specific time series (a branch
can, but need not be part of the global series).

How should overlapping, independent ranges be treated in the global time
series analysis? Currently, the whole time series analysis assumes an
absolute order over the ranges.

The Python part is fine: it only operates inside a single range at a time,
so it works correctly (and with the existing database model) as long as
I can guarantee that every commit is attributed to at most a single
range, no matter how the ranges are partitioned. I only need to fix up
the range query when accessing Git.

But I don't know what analyse_ts.r does in detail.

* There needs to be a strategy for how to present and order such series in
the web front-end. Widgets that compare ranges (like release distance)
need to be modified to compare meaningful ranges (the widget should
also be taught which ranges are pointless to compare, for instance
those generated by a sliding window approach).
* Clusters etc. need to be computed for every generalised range.

So essentially the same as before.

But this no longer yields an aggregated cluster representing the
whole community at a given time, only one for the specific branch the
range covers.

I'm not sure if it is possible to combine clusters in hindsight, since
the data is already normalized. I need to strip the premature
normalization to allow correct aggregation of doubled edges by the database.
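
To illustrate why the order of aggregation and normalisation matters, here is a minimal Python sketch. It assumes that raw, unnormalised edge weights are available per range and that normalisation is a simple division by the total weight; neither assumption is taken from the Codeface code base, and `aggregate_edges` is a hypothetical helper.

```python
from collections import Counter


def aggregate_edges(ranges):
    """Sum raw edge weights over several ranges, then normalise once.

    `ranges` is an iterable of dicts mapping (author, author) edges to raw
    weights. The normalisation used here (divide by the total weight) is an
    assumption made for illustration only.
    """
    total = Counter()
    for edges in ranges:
        total.update(edges)  # adds the weights of edges seen in several ranges
    norm = sum(total.values())
    if norm == 0:
        return {}
    return {edge: weight / norm for edge, weight in total.items()}


# Two overlapping branches sharing the edge alice-bob.
branch_a = {("alice", "bob"): 3, ("bob", "carol"): 1}
branch_b = {("alice", "bob"): 1, ("alice", "carol"): 3}
combined = aggregate_edges([branch_a, branch_b])
print(combined[("alice", "bob")])  # 0.5
```

Had each branch been normalised on its own first, the shared edge would carry 0.75 in one branch and 0.25 in the other, and the raw counts needed for a correct sum would be lost.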

- A branch can't be isolated using the "start..end" syntax, since it may
have multiple anchor points belonging to different branches.
This requires using git's explicit multi-point notation,
specifying the start commits with "--not start" or "^start". No
additional end commits are required when using tags. It is safe to add
start commits to every range query.
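
A minimal Python sketch of how such a query could be assembled. The helper name `range_query_args` is hypothetical and not part of Codeface; it only builds an argument list for `git rev-list`, which accepts "^rev" to exclude everything reachable from "rev". The revision names are placeholders.

```python
def range_query_args(end, starts):
    """Build arguments for `git rev-list` that select the commits reachable
    from `end` but from none of the `starts` (hypothetical helper)."""
    if not end:
        raise ValueError("an end revision is required")
    # "^rev" excludes rev and all of its ancestors; the order is irrelevant.
    return ["git", "rev-list", end] + ["^" + start for start in starts]


# With a single start commit this is equivalent to "v1.0..v2.0".
print(range_query_args("v2.0", ["v1.0"]))
# A range with two anchor points, which "start..end" cannot express.
print(range_query_args("v2.0", ["v1.0", "feature-base"]))
```

The resulting list could be handed to `subprocess.run` inside a repository; that step is left out here so the sketch runs without one.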

The start and end commits defining the range also need to be specified
when using the date based range partition method. It's not possible to
omit them.

I think this can be modeled by adding new branch boundary values to the
project configuration: a single value for the branch end, and a list for
the branch base.

This is mainly an issue of coming up with a good DSL for describing
generalised ranges in the configuration file. While the problem is
surely complex in its full generality, I don't think going fully
general is necessary: When the analysed branch structure becomes
too complicated, it's usually not of interest to be examined. What
is important from my point of view is

* Tracking feature branches from the branch point to the merge point
* Slicing a history from A to B into N intervals (as already supported
for full histories, but could be generalised to more restricted
ranges)

Agreed, this makes sense for every overly large individual range. If a
range exceeds a specified time interval, it is forcibly partitioned. That
would require specifying a maximum range length and a target range length.

Smaller ranges can be specified explicitly by providing revisions closer
together than the target length.
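
A rough Python sketch of that rule, assuming ranges are given by their start and end dates and are split into equally long pieces. The function name and both parameters are made up for illustration; they are not existing Codeface configuration options.

```python
from datetime import datetime, timedelta


def partition_range(start, end, max_length, target_length):
    """Split [start, end] into equal sub-ranges close to target_length if the
    range is longer than max_length; otherwise return it unchanged."""
    total = end - start
    if total <= max_length:
        return [(start, end)]
    # Number of pieces that brings each piece closest to the target length.
    pieces = max(2, round(total / target_length))
    step = total / pieces
    bounds = [start + i * step for i in range(pieces)] + [end]
    return list(zip(bounds[:-1], bounds[1:]))


slices = partition_range(
    datetime(2015, 1, 1), datetime(2015, 12, 31),
    max_length=timedelta(days=180), target_length=timedelta(days=90),
)
for lower, upper in slices:
    print(lower.date(), "->", upper.date())
```

For the 364 days in this example, the sketch yields four slices of 91 days each.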

* Combining sub-ranges into larger ranges (for instance, the way all
release ranges are currently spliced in some order to generate
the global time series)

A "larger range" is actually a new series.

I don't know yet how to do that. And I don't know if the results are
stable when comparing merged series/clusters vs. running the analysis
passes directly on the full range.

The splice process used for the global time series isn't applicable as-is
to a series without an absolute order. So that needs to be adapted
as well, properly interleaving the ranges instead of simply splicing them.
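
What interleaving could mean here, as a small Python sketch. It assumes each range contributes a list of (timestamp, range id, value) samples that is already sorted by timestamp; this data layout is an assumption for illustration and not what analyse_ts.r actually uses.

```python
import heapq


def interleave_series(*series):
    """Merge per-range time series, each sorted by timestamp, into a single
    stream ordered by timestamp (instead of concatenating them)."""
    return list(heapq.merge(*series, key=lambda sample: sample[0]))


# Two overlapping ranges: simple splicing would put timestamp 4 before 2.
main = [(1, "main", 10), (4, "main", 12)]
feature = [(2, "feature", 3), (3, "feature", 5)]

merged = interleave_series(main, feature)
print([sample[0] for sample in merged])  # [1, 2, 3, 4]
```

Keeping the range id in every sample means a later pass can still tell which branch a data point came from.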

Anyway, the global time series analysis (or rather ANY global pass)
needs a new parameter to specify which series to operate on.

A new sanity check is required to verify that all specified revisions are
within the branch boundary.

That would be much appreciated.

Btw: There's definitely a bug in the current system. If two specified
revisions happen to be in parallel branches, common code will be wrongly
attributed individually to two different ranges. This is caused by only
using the "start..end" notation, while it would have been necessary to
explicitly exclude ALL commits reachable from earlier revisions in the
range.

Generating commit lists is deliberately kept as simple as possible.
There is no "correct" way of ordering contributions with respect to the
real world anyway -- just think of a developer who has experimented for
a couple of days, and then at some point in time squashes several
commits together to create a new one. The date attributed to this commit
will not show the real creation date of the code, but just the date
of the squash. Unless there's a realistic counterexample, we work on the
assumption that the currently used approach does not introduce
substantial perturbations. A few mis-attributed commits should usually
cause only insubstantial noise.

E.g. for the series "A B C D", the correct query for the commits
contributing to "D" is not "C..D", but actually "D ^A ^B ^C".
The semantic difference shows in the following graph:

D
|\
C|
|B
|/
A

This is already breaking the assumption about non-overlapping ranges
(not much of a surprise...), but Codeface should actually be capable of
handling this the moment the query is corrected.
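
The difference between the two queries can be reproduced on the graph above without Git. A minimal Python sketch follows; the helper names `ancestors` and `select` are made up for illustration and only mimic the reachability semantics of `git rev-list`.

```python
def ancestors(parents, rev):
    """Return rev plus everything reachable from it via parent links."""
    seen, stack = set(), [rev]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return seen


def select(parents, include, exclude):
    """Commits reachable from `include` but from none of `exclude`,
    mirroring `git rev-list include ^e1 ^e2 ...`."""
    result = ancestors(parents, include)
    for rev in exclude:
        result -= ancestors(parents, rev)
    return result


# The graph from above: D merges C and B, both of which branch off A.
parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["C", "B"]}

two_dot = select(parents, "D", ["C"])              # "C..D"
explicit = select(parents, "D", ["A", "B", "C"])   # "D ^A ^B ^C"
print(sorted(two_dot))   # ['B', 'D']
print(sorted(explicit))  # ['D']
```

With "C..D", commit B ends up in the range for D although "A..B" already covers it, which is exactly the double attribution described above.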

Can you provide an example of a real project where such a scenario
occurs?

Sorry, don't have public access to one, but I'm confident it does occur
in commercial projects running highly agile development models, where
you potentially have multiple feature branches in production (for A-B
consumer acceptance testing) which are only getting merged or discarded
later on. This results in having release tags on branches other than the
main line.

This also applies to every project that was forked, where the fork had
independent releases and was later merged back into the main project
- at least if the release tags of the fork have not been discarded.

I don't think it applies to any of the project configurations shipped
with Codeface right now, at least not with the given selection of tags,
as none of these point into feature branches.

Greetings,
Andreas Ringlstetter

Best regards, Wolfgang Mauerer

This might have caused double attributions in past analyses done with
Codeface. I haven't checked whether this pattern occurred in any of the
selected revision sets for the project configs shipped with Codeface.
