[codeface] Re: Branch support

  • From: Andreas Ringlstetter <andreas.ringlstetter@xxxxxxxxxxxxxxxxxxxx>
  • To: <codeface@xxxxxxxxxxxxx>
  • Date: Wed, 11 Nov 2015 16:20:13 +0100



On 11.11.2015 at 15:40, Wolfgang Mauerer wrote:

On 11/11/2015 at 11:42, Andreas Ringlstetter wrote:
On 10.11.2015 at 22:25, Wolfgang Mauerer wrote:
On 10/11/2015 at 16:01, Andreas Ringlstetter wrote:
what is actually required to provide branch support in Codeface?

For starters, it's changing a few assumptions:

- Branches can overlap, meaning the corresponding ranges can overlap.
They no longer form a single series. This can be cheated around using
multiple projects for multiple series.

The least invasive way to model this is to define new "meta-projects"
that simply plot multiple regular projects against each other.

I'm not much in favour of this approach: All release ranges of a
project (and the associated inferred data) are currently dispatched
from a project-specific view. A branch is nothing else than a
(generalised) release range, so it should be accessible like any other
release range.

Almost: a branch is a set of ranges, say a series.
I don't see why a generalised release range cannot be a
set of ranges, but you're free to call the concept as
you wish. As long as you use the term consistently ;)

The optimal solution of actually allowing multiple series per project
would break too much of the existing code base.

Why too much? I see these main modifications:

* Global time series (composed of multiple release range sub-series)
would be augmented with branch-specific time series (a branch
can, but need not be part of the global series).

How should overlapping, independent ranges be treated in the global time
series analysis? Currently, the whole time series analysis assumes an
absolute order over the ranges.
To clarify my point: Right now, we have sub-series that compose a
single global time series. With the proposed change, we could also
have multiple global (i.e., composed) time series that are not
identical to the current notion.

The Python part is fine: it only operates inside a single range at a
time, so it works correctly (and with the existing database model) as long as
I can guarantee that every commit is attributed to at most a single
range, no matter how the ranges are partitioned. I only need to fix up
the range query when accessing Git.

But I don't know what analyse_ts.r does in detail.
This we can fix once we have the DB structure to handle branches etc.

* There needs to be a strategy for presenting and ordering such series in
the web front-end. Widgets that compare ranges (like release distance)
need to be modified to compare meaningful ranges (the widget should
also be taught which ranges are pointless to compare, for instance
those generated by a sliding window approach).
* Clusters etc. need to be computed for every generalised range.

So essentially the same as before.

But this no longer yields an aggregated cluster representing the
whole community at a given time, only one for the specific branch the
range covers.
agreed. We will need to see how we handle this in the fusion of
different data sources (for instance, multiple branches will be
discussed on the same mailing list), but this is what makes research
research.

I'm not sure if it is possible to combine clusters in hindsight, since
the data is already normalized. I need to strip the premature
normalization to allow correct aggregation of doubled edges by the database.

why combine clusters?

Ah, not the clusters, but the interaction graphs from which the clusters
are computed.



- A branch can't be isolated using the "start..end" syntax, since it may
have multiple anchor points belonging to different branches.
This requires using Git's explicit multi-point notation,
specifying the start commits with "--not start" or "^start". No
additional end commits are required when using tags. It is safe to add
start commits to every range query.
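A minimal sketch of how such a query could be assembled, assuming the range is given as a list of head revisions and a list of start (base) revisions. The helper name `rev_list_args` is hypothetical and not part of Codeface; it only builds the argument list and does not invoke Git.

```python
def rev_list_args(heads, bases):
    """Build a `git rev-list` argument list for an explicit multi-point range.

    The resulting command lists every commit reachable from any head
    revision but not reachable from any base revision.
    """
    if not heads:
        raise ValueError("at least one head revision is required")
    args = ["git", "rev-list"] + list(heads)
    # "^rev" excludes everything reachable from rev (same as "--not rev").
    args += ["^" + base for base in bases]
    return args


# Range ending at tag D, excluding everything reachable from A, B and C.
print(" ".join(rev_list_args(["D"], ["A", "B", "C"])))
```

With an empty base list the query degenerates to the full history of the head, which matches the trivial case of a single head and no start commits.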

The start and end commits defining the range also need to be specified
when using the date based range partition method. It's not possible to
omit them.

I think this can be modeled by adding new branch boundary values to the
project configuration: a single value for the branch end, and a list for
the branch base.

This is mainly an issue of coming up with a good DSL for describing
generalised ranges in the configuration file. While the problem is
surely complex in its full generality, I don't think going fully
general is necessary: When the analysed branch structure becomes
too complicated, it's usually not of interest to be examined. What
is important from my point of view is

* Tracking feature branches from the branch point to the merge point
* Slicing a history from A to B into N intervals (as already supported
for full histories, but could be generalised to more restricted
ranges)

Agreed, this makes sense for every overly large individual range. Range
exceeds a specified time interval -> forceful partitioning. That would
require specifying a maximum range length and a target range length.

Smaller ranges can be specified explicitly by providing revisions closer
together than the target length.

* Combining sub-ranges into larger ranges (for instance, the way all
release ranges are currently spliced in some order to generate
the global time series)


A "larger range" is actually a new series.

I don't know yet how to do that. And I don't know if the results are
stable when comparing merged series/clusters vs. running the analysis
passes directly on the full range.

the two will very likely not agree, but I'm sure there's something to
be learned from the differences.

The splice process used for the global time series isn't applicable as
is to a series without an absolute order. So that needs to be adapted
as well, properly interleaving the ranges instead of simply splicing them.

Anyway, the global time series analysis (or rather ANY global pass)
needs a new parameter to specify which series to operate on.
agreed.


A new sanity check is required to check if all specified revisions are
within the branch boundary.
that would be much appreciated.

Btw: There's definitely a bug in the current system. If two specified
revisions happen to be in parallel branches, common code will be wrongly
attributed individually to two different ranges. This is caused by only
using the "start..end" notation, while it would have been necessary to
explicitly exclude ALL commits reachable from earlier revisions in the
range.

Generating commit lists is deliberately kept as simple as possible.
There is no "correct" way of ordering contributions with respect to the
real world anyway -- just think of a developer who has experimented for
a couple of days, and then at some point in time squashes several
commits together to create a new one. The date attributed to this commit
will not show the real creation date of the code, but just the date
of the squash. Unless there's a realistic counterexample, we work on the
assumption that the currently used approach does not introduce
substantial perturbations. Some mis-attributed commits should usually
just cause insubstantial noise.


E.g. for the series "A B C D", the correct query for the commits
contributing to "D" is not "C..D", but actually "D ^A ^B ^C".
The semantic difference shows in the following graph:

D
|\
C|
|B
|/
A

This is already breaking the assumption about non-overlapping ranges
(not much of a surprise...), but Codeface should actually be capable of
handling this the moment the query is corrected.
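The difference between the two queries can be reproduced on the toy graph above. This is only an in-memory sketch: the `PARENTS` mapping and the helper names are made up for illustration, and plain set arithmetic stands in for what Git computes.

```python
# Toy commit graph from the example: commit -> list of parent commits.
PARENTS = {"A": [], "B": ["A"], "C": ["A"], "D": ["C", "B"]}


def reachable(rev, parents=PARENTS):
    """Return all commits reachable from rev, including rev itself."""
    seen, stack = set(), [rev]
    while stack:
        cur = stack.pop()
        if cur not in seen:
            seen.add(cur)
            stack.extend(parents[cur])
    return seen


def commits(heads, bases, parents=PARENTS):
    """Commits reachable from any head but from none of the bases."""
    included = set().union(*(reachable(h, parents) for h in heads))
    excluded = set().union(*(reachable(b, parents) for b in bases))
    return included - excluded


a_to_b = commits(["B"], ["A"])              # "A..B"
c_to_d = commits(["D"], ["C"])              # "C..D"
explicit = commits(["D"], ["A", "B", "C"])  # "D ^A ^B ^C"

print(sorted(a_to_b), sorted(c_to_d), sorted(explicit))
```

In this sketch, commit B ends up in both "A..B" and "C..D", which is the double attribution described above, while the explicit form leaves only D in the last range.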
can you provide an example of a real project where such a scenario
occurs?

Sorry, don't have public access to one, but I'm confident it does occur
in commercial projects running highly agile development models, where
you potentially have multiple feature branches in production (for A-B
consumer acceptance testing) which are only getting merged or discarded
later on. This results in having release tags on branches other than the
main line.

This also applies to every project which got forked, where the fork had
independent releases and the fork got merged back into the main project
- at least if the release tags of the fork have not been discarded.

I don't think it applies to any of the project configurations shipped
with Codeface right now, not with the given selection of tags, as none of
these point into feature branches.

since the problem seems to affect only a tiny fraction of all projects,
the easiest approach is to not overcomplicate the range specification,
but rather to come up with means of bringing repos into the desired
form if required. In the worst case, we cannot analyse some commercial
projects, but that's the status quo anyway. Just think of projects
managed with ClearCase and other horrors along these lines.

It's not a problem. The partition logic required to properly isolate
branches requires me to make this change anyway.

I think I will go with the following notation in the config file:
- Unordered(!) set of tags used to partition the entire commit graph.
- List of series, each series being defined by upper and lower boundaries.

The boundaries specified in the series must only use listed tags.
Boundaries can be omitted.

An additional series with the name "global" is implicitly added if not
present, and has no boundaries per default.
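A rough sketch of how such a configuration could be validated once it has been parsed. Everything concrete here is an assumption for illustration: the key names "lower" and "upper", the tag names, and the function `normalise_series` are not Codeface's actual configuration schema, and each boundary is taken to be a single optional tag.

```python
def normalise_series(tags, series):
    """Check series boundaries against the tag set and add "global".

    tags   -- unordered collection of tag names
    series -- mapping: series name -> {"lower": tag or None, "upper": tag or None}
    """
    tags = set(tags)
    result = {}
    for name, bounds in series.items():
        lower = bounds.get("lower")
        upper = bounds.get("upper")
        for boundary in (lower, upper):
            # Omitted boundaries are allowed; named ones must be listed tags.
            if boundary is not None and boundary not in tags:
                raise ValueError(
                    "series %r uses unlisted tag %r" % (name, boundary))
        result[name] = {"lower": lower, "upper": upper}
    # Implicit unbounded "global" series, unless one was given explicitly.
    result.setdefault("global", {"lower": None, "upper": None})
    return result


config = normalise_series(
    ["v1", "v2", "v3"],
    {"stable": {"lower": "v1", "upper": "v2"}},
)
print(sorted(config))
```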

Ranges are defined by a set of head revisions and a set of base revisions.
In the most trivial case, only a single head and no base revisions are
specified.

"Base" might be misleading; it's actually a stop signal that ends the
traversal of the graph when this revision is encountered. So multiple
bases can shadow each other.

The revision graph is partitioned into disjoint ranges by the following
process:
- For each tag A construct a range with A as the only head revision.
Check for any other tag B if it is in the history of A.
If it is, add B as a base revision of the range.
- For each range, check if the time difference between the oldest and
newest included revision is within the maximum allowed range size. If
not, compute a number of intermediate timestamps and split the range
accordingly. (This requires multiple base and head revisions to get a
fully disjoint and complete partition across all possible edge cases.
The split occurs on the latest revision preceding(!) the timestamp)
- For each range, reduce the set of head and base revisions until each
is minimal. (Optional, but should improve human comprehension. I don't
expect to be able to remove head revisions, but base revisions can be
potentially eliminated if one is reachable by another.)
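The first and third steps can be sketched on an in-memory graph as follows. This is an illustration under stated assumptions, not the planned implementation: the toy history, the dictionary layout of a range, and all function names are hypothetical, and the time-based split of the second step is left out.

```python
# Toy history (commit -> parents): D merges two parallel lines C and B.
PARENTS = {"A": [], "B": ["A"], "C": ["A"], "D": ["C", "B"]}


def reachable(revs, parents):
    """All commits reachable from any revision in revs, revs included."""
    seen, stack = set(), list(revs)
    while stack:
        cur = stack.pop()
        if cur not in seen:
            seen.add(cur)
            stack.extend(parents[cur])
    return seen


def partition_by_tags(tags, parents):
    """One range per tag: the tag is the head, tags in its history are bases."""
    ranges = {}
    for head in tags:
        history = reachable([head], parents) - {head}
        ranges[head] = {"heads": {head},
                        "bases": {t for t in tags if t in history}}
    return ranges


def minimise_bases(bases, parents):
    """Drop every base that is already reachable from another base."""
    return {b for b in bases
            if not any(b in reachable([o], parents) for o in bases - {b})}


def range_commits(rng, parents):
    """Commits reachable from the heads but from none of the bases."""
    return reachable(rng["heads"], parents) - reachable(rng["bases"], parents)


ranges = partition_by_tags({"A", "B", "C", "D"}, PARENTS)
for rng in ranges.values():
    rng["bases"] = minimise_bases(rng["bases"], PARENTS)

for tag in sorted(ranges):
    print(tag, sorted(ranges[tag]["bases"]),
          sorted(range_commits(ranges[tag], PARENTS)))
```

On this toy history, base A is dropped from the range headed by D because it is reachable from B (and from C), and every commit lands in exactly one range.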

That gives me a complete and disjoint set of ranges which are common to
all series.

Each range can then be persisted in the database with the list of head
and base revisions.

For each series, I will then have to determine all ranges fully within
the boundary of the corresponding series.

Each series is then persisted in the database with the corresponding
list of ranges.


Beyond this point, no process should attempt to read the tag or series
list from the configuration file or even to re-construct it!

Two ranges can trivially be identified as consecutive by matching base
vs head commits in the database. (Careful: Two ranges can be matching on
more than one edge!)
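As a sketch of that matching, assuming each persisted range can be loaded as a pair of head and base revision sets (plain dictionaries stand in for database rows here; the actual schema is not specified, and the revision names are invented):

```python
def shared_edges(earlier, later):
    """Revisions that are a head of `earlier` and a base of `later`."""
    return set(earlier["heads"]) & set(later["bases"])


def is_consecutive(earlier, later):
    """True if `later` starts where `earlier` ends, on at least one edge."""
    return bool(shared_edges(earlier, later))


# Hypothetical ranges, e.g. the two halves of a range split by timestamp,
# where the cut crosses two parallel lines of development.
first = {"heads": {"x", "y"}, "bases": {"v1"}}
second = {"heads": {"v2"}, "bases": {"x", "y"}}

print(sorted(shared_edges(first, second)))
```

Returning the full set of shared revisions instead of a single match keeps the multi-edge case visible to the caller.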

So much for the setup and partitioning. Details on the modifications
necessary for later stages will follow as I get them fleshed out.

Greetings,
Andreas

Thanks & best regards, Wolfgang Mauerer


Greetings,
Andreas Ringlstetter

Best regards, Wolfgang Mauerer


This might have caused double attributions in past analyses done with
Codeface. I haven't checked if this pattern occurred in any of the
selected revision sets for the project configs shipped with Codeface.
