[codeface] Re: Multi datasource analysis

  • From: Mitchell Joblin <joblin.m@xxxxxxxxx>
  • To: codeface@xxxxxxxxxxxxx
  • Date: Tue, 27 Oct 2015 11:04:14 +0000

Hi Andreas,

Firstly, thanks for working out the options!

On Tue, Oct 27, 2015, 10:44 Andreas Ringlstetter <
andreas.ringlstetter@xxxxxxxxxxxxxxxxxxxx> wrote:

Good Morning,

I want to add time series and cluster analysis spanning multiple
projects and datasources (e.g. VCS activity split across multiple
repositories) to codeface, but I'm not entirely sure which approach to take.

Naturally, the user database, release ranges and the release timeline
need to be shared, and all data sources logically belong to a single
project. The aggregation into a single project may happen either when
initially filling the database, or by merging multiple existing projects
into one "manually" and only triggering the analysis pass.

I am aware of what restructuring is necessary to achieve that (mostly
breaking up the monolithic cluster.py, so that import and analysis
phases are strictly separated), but I can't decide how to partition VCS
activity in individual projects.

So far I have worked out 3 different approaches, each with
obvious flaws:
- Partitioning of all projects based on natural timestamps defined by
releases in the master project. Most likely to break when projects are
making heavy use of overlapping feature branches, and the correlation
of release cycles in master and slave repositories can't be taken as
given for all projects. Essentially, one repository is declared as
authoritative, and every other repository is expected to follow the same
release cycles. If this premise doesn't hold, activity will be
misattributed.

My feeling is that this won't work that well. Release ranges are already
hard to interpret without considering how the release ranges are related
between repositories in an ecosystem project. I think the necessary
assumptions are too strong to be realistic.
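
To make the first approach concrete, here is a minimal sketch of
timestamp-based partitioning. It assumes commits are available as
(repository, timestamp) pairs and that the release dates come from the
master repository; the function name and data layout are hypothetical
and not taken from codeface.

```python
from bisect import bisect_right
from datetime import datetime


def partition_by_master_releases(release_dates, commits):
    """Assign commits from any repository to master release ranges.

    Range i covers [release_dates[i], release_dates[i + 1]).
    Commits outside every range are returned separately.
    """
    boundaries = sorted(release_dates)
    ranges = {i: [] for i in range(len(boundaries) - 1)}
    unassigned = []
    for repo, timestamp in commits:
        # Index of the last release at or before this commit.
        idx = bisect_right(boundaries, timestamp) - 1
        if 0 <= idx < len(boundaries) - 1:
            ranges[idx].append((repo, timestamp))
        else:
            unassigned.append((repo, timestamp))
    return ranges, unassigned


releases = [datetime(2015, 1, 1), datetime(2015, 4, 1), datetime(2015, 7, 1)]
commits = [
    ("master", datetime(2015, 2, 1)),
    ("plugin", datetime(2015, 4, 1)),
    ("plugin", datetime(2015, 8, 1)),
]
ranges, unassigned = partition_by_master_releases(releases, commits)
```

Note that the "plugin" commits are binned purely by the master timeline,
which is exactly where activity gets misattributed if that repository
follows its own release cycle.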


- Grouping tags from multiple repositories by (API) compatibility.
Commits are not partitioned by timestamp, but assigned exclusively
to a tag. This approach lacks any natural temporal correlation
between corresponding commit sets from different repositories. There is
also the issue of being unable to correctly assign contributions towards
a specific version if a component was made upwards compatible ahead of
time. In return this should yield the most coherent data regarding
actual development cycles, even when subcomponent releases do not happen
in a timely manner. This approach is only applicable to data sources
where activity can be mapped directly to a specific version.

Same feeling for this approach.


- Forcefully apply fixed time frames to all projects. Best suited for
measuring community activity and long-term health, but no correlation
with release cycles. Only approach not requiring manual tag evaluation.

If we can only choose one, then I would take this one. We do a lot of time
series analysis anyway to look at project evolution, so in that regard this
approach would produce the most intuitive results. Some projects seem to
release a new revision with every commit, while others rarely ever increment
the revision number. So far Linux seems to be the rare project that has
very consistent and meaningful release cycles.
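
For illustration, the fixed time frame approach could look roughly like
the sketch below. The 90-day window, the function name and the
(repository, timestamp) layout are assumptions made for the example, not
codeface defaults.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def bucket_by_fixed_windows(commits, start, window=timedelta(days=90)):
    """Group commits from all repositories into fixed-size time windows.

    Window k covers [start + k * window, start + (k + 1) * window).
    Commits before `start` are ignored.
    """
    buckets = defaultdict(list)
    for repo, timestamp in commits:
        if timestamp < start:
            continue
        # Floor division of two timedeltas yields an integer window index.
        index = (timestamp - start) // window
        buckets[index].append((repo, timestamp))
    return dict(buckets)


start = datetime(2015, 1, 1)
commits = [
    ("core", datetime(2015, 1, 15)),
    ("plugin", datetime(2015, 3, 31)),
    ("plugin", datetime(2015, 4, 1)),
]
buckets = bucket_by_fixed_windows(commits, start)
```

No tag information is consulted at all, which is why this variant needs
no manual tag evaluation but also carries no relation to release cycles.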

Support all 3? Choose one?

And are overlapping release ranges even supported yet during analysis?
If not, then the second approach can be ruled out right away.

I think overlapping release ranges are supported.

Thanks,

Mitchell

The second approach also requires a different method for grouping into
release ranges, as commits are not grouped based on timestamp, but based
on reachability in the dependency graph instead. (Earliest tag they are
reachable from.) This method should be implemented anyway, regardless
of whether it is used for the second approach or not.
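
A rough sketch of the reachability-based grouping, assuming the graph in
question is the commit ancestry graph, given as a mapping from each
commit to its parents, and that tags are supplied oldest first. All
names here are hypothetical.

```python
def assign_to_earliest_tag(parents, tags):
    """Map each commit to the earliest tag it is reachable from.

    parents: dict mapping a commit id to a list of its parent ids.
    tags: list of (tag_name, tagged_commit) pairs, oldest tag first.
    Commits not reachable from any tag do not appear in the result.
    """
    owner = {}
    for tag, tip in tags:
        stack = [tip]
        while stack:
            commit = stack.pop()
            if commit in owner:
                # Already claimed by an earlier tag, and so are all of
                # its ancestors; no need to walk further down.
                continue
            owner[commit] = tag
            stack.extend(parents.get(commit, []))
    return owner


# "f" sits on a feature branch forked from "a" and merged in "d".
parents = {"a": [], "b": ["a"], "c": ["b"], "f": ["a"], "d": ["c", "f"]}
tags = [("v1", "b"), ("v2", "d")]
owner = assign_to_earliest_tag(parents, tags)
```

Here the feature branch commit "f" ends up in v2 no matter when it was
authored, because v2 is the first tag it is reachable from.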

These overlapping release ranges also become relevant when the
bugtracker data source is merged in, as that datasource yields activity
events for a certain version past release, which do overlap with, but
do not necessarily contribute to, the next major release.

Greetings,
Andreas
