[codeface] Re: [PATCH 3/5] Allow mailing list analysis to run even if no vcs analysis

  • From: Andreas Ringlstetter <andreas.ringlstetter@xxxxxxxxxxxxxxxxxxxx>
  • To: Wolfgang Mauerer <wolfgang.mauerer@xxxxxxxxxxxxxxxxx>, <codeface@xxxxxxxxxxxxx>
  • Date: Fri, 6 Nov 2015 16:00:45 +0100



On 06.11.2015 at 13:32, Wolfgang Mauerer wrote:



On 06/11/15 at 11:25, Andreas Ringlstetter wrote:
On 05.11.2015 at 18:00, Wolfgang Mauerer wrote:
On 05/11/15 at 17:53, Mitchell Joblin wrote:
On Thu, Nov 5, 2015 at 5:36 PM, Wolfgang Mauerer
<wolfgang.mauerer@xxxxxxxxxxxxxxxxx> wrote:
On 05/11/15 at 17:11, Mitchell Joblin wrote:

your remark about the check taking place in the db layer did indeed
change my mind -- since the warning is not emitted in the ML-specific
part, we should not weaken existing checks. Hence my new
suggestion. --Wolfgang

If we have preconditions for the state of the database tables after the
VCS analysis that are imposed by the web frontend or whatever, then all
these types of checks should be collected in a central place rather
than having preconditions scattered throughout the query
implementations. We should run those checks immediately after
populating the database because the query for cycles could occur ages
later in a totally different part of the code. I guess the default
parameter is just a quick band-aid solution?

I agree that we should have more consistent checks across the code base.
How we do that remains to be discussed -- for this case, the band-aid
is sufficient, I guess.

Having a DB sanity check pass is an interesting idea -- CC'ing Andreas
Ringlstetter, who is AFAIK working on related cleanups. We may need
both: an "fsck" for populated databases (no fixing, just checking), and
more checks in the code.
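
A rough sketch of what such an "fsck" pass could look like (the
function, table, and column names below are only assumptions for the
example, not codeface's actual schema):

    ## Hypothetical sanity pass, run once after the VCS analysis has
    ## populated the database; it only reports problems, it fixes nothing.
    library(DBI)

    check.db.sanity <- function(con, pid) {
      problems <- c()

      ## Assumption: a release_range table keyed by projectId exists
      n.ranges <- dbGetQuery(con, sprintf(
        "SELECT COUNT(*) AS n FROM release_range WHERE projectId=%d", pid))$n
      if (n.ranges == 0) {
        problems <- c(problems, "Project has no release ranges")
      }

      ## Further checks (ranges without commits, commits without authors,
      ## ...) would be collected here in the same fashion.
      return(problems)
    }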

Somewhat agreed: data dependencies should be managed and checked in a
central location.

Even though that is different from extended plausibility checks on the
database: those should in fact reside in the database abstraction
layer, or (due to the lack of such a thing in codeface) in the

we don't provide a DB abstraction layer on purpose. The probability that
we will ever support different database implementations for the same
type of data is about \eps/2 because a) mysql works fine, b) we have no
manpower to maintain the layer (or rather, there are more important
things to research), and c) I'm not aware of a problem that the layer
would solve.

Checks could go into a specific layer, but that would be for the
purpose of checking correctness, not abstracting DB operations.

sparingly used helper functions. The rationale for this is that these
types of checks actually represent an extension of the schema.

However, that's only for catching inconsistencies or schema violations,
such as modules trying to write out invalid data, or data fetched from
the DB containing invalid values -- e.g. someone trying to insert an ML
response into the database with the date of the response predating the
original message.
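
As an illustration, a minimal sketch of such a helper-level check (the
helper name and the message fields are made up for the example):

    ## Hypothetical insert helper that rejects a reply whose date
    ## predates the message it responds to.
    add.mail.response <- function(con, response, original) {
      if (response$date < original$date) {
        stop("Response predates the original message")
      }
      ## ... the actual INSERT via dbGetQuery() would follow here ...
    }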

Not for cases where entire modules have just been executed out of order,
so data models haven't even been filled yet.

What I was suggesting was to loosen the coupling between these
"modules" we have right now, introduce a central router/scheduler and
have these modules announce their dependencies (and outputs) to the
scheduler. Well, these "dependencies" would be more or less just
free-form event names.
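
A very rough sketch of the idea (nothing like this exists in codeface;
the module names and the scheduler interface are made up):

    ## Modules announce which data sets they consume and produce; a
    ## trivial scheduler runs each module once its inputs are available.
    modules <- list(
      list(name="vcs.analysis", needs=c(),         provides=c("ranges")),
      list(name="ml.analysis",  needs=c("ranges"), provides=c("ml.data"))
    )

    run.scheduled <- function(modules, run.module) {
      available <- character(0)
      pending <- modules
      while (length(pending) > 0) {
        ready <- Filter(function(m) all(m$needs %in% available), pending)
        if (length(ready) == 0) {
          stop("Unsatisfiable module dependencies")
        }
        for (m in ready) {
          run.module(m$name)  ## execute the module itself
          available <- c(available, m$provides)
        }
        done <- sapply(ready, function(m) m$name)
        pending <- Filter(function(m) !(m$name %in% done), pending)
      }
    }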

The control flow is usually not something we have had problems with.
What caused issues in the past is that some part of the DB has not been
filled in as expected because of unforeseen problems with the
real-world data (for instance, a release range with no commits).
Pre/postcondition checkers would help in these cases.

Isn't the focus of this patch the issue that the control flow changed (a
stop condition was removed)?

OK, what I suggested would be overkill to patch in; it's just what I
originally had in mind.

Simple pre/post-conditions would already suffice. At least as long as
failures in these tests are actually treated as fatal.

But then again: that requires adding defensive checks in a lot of
locations (at the call sites, that is). So it still means a lot of
refactoring.

The precondition should not be embedded into the function which fetches
or writes the data, but be outsourced to a function which only tests the
condition and reliably errors out when it isn't fulfilled.

Only simple validation rules can (and should) be enforced inside the
database abstraction helpers. Otherwise, this reduces the reusability
of the helper functions.

Outsourced, because the same precondition is most likely required in
more than a single location. So keep these functions rather generic as
well; it is better to test against multiple small conditions than a
single monolithic one. That also means less refactoring when a
condition can be lifted later on.

The only question left is:
a) predicate + wrapper code / helper function (the predicate itself only
returns a boolean),
or b) conditions with the hard fail built in?

Only a minor difference, depending on whether the same predicates should
also be used to control conditional execution, or whether conditional
execution based on predicates should be discouraged since it increases
complexity.

Obviously all predicates/conditions must be free of side effects. And a
stack trace should be provided in case of a failure, since the scope
from which the condition was checked is relevant.
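
A rough sketch of the two variants in R (the predicate and helper names
are hypothetical; conf$boundaries is only used as an example condition,
assuming it holds a data frame of release ranges):

    ## (a) side-effect free predicate plus a thin wrapper that fails hard
    has.release.ranges <- function(conf) {
      !is.null(conf$boundaries) && nrow(conf$boundaries) > 0
    }

    ensure <- function(condition, msg) {
      if (!condition) {
        stop(msg)  ## a stack trace is available via traceback() or
                   ## options(error=traceback) after the failure
      }
    }

    ## (b) the hard fail built into the condition itself
    ensure.release.ranges <- function(conf) {
      if (is.null(conf$boundaries) || nrow(conf$boundaries) == 0) {
        stop("No release ranges available for this project")
      }
    }

Variant (a) keeps the predicate reusable for conditional execution,
variant (b) is harder to misuse as a silent check.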



Applied to the patch in the subject:

Removing the stop was somewhat correct, but a lot of call sites are now
lacking defensive checks. Would all the sites calling this function
still work properly? It's not just the widgets, but also every
location accessing conf$boundaries.

And is it really a good idea to try to fix this with a default-parameter
band-aid?

As no one wants to add the defensive checks, I would actually suggest
not changing get.cycles at all for now.

Instead, just add a new predicate checking whether there are any ranges
at all, and make the call to get.cycles depend on it, so get.cycles can
keep assuming a non-empty result set. The predicate is quite likely
going to be useful in other locations as well.

-> No band-aid required, and the assumption (that the function only
works with a non-empty set) becomes explicit. The same predicate is also
going to be required if a final validation phase should ever be added.
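
A minimal sketch of that suggestion (the predicate name and the call
site are hypothetical):

    ## Hypothetical call site: only query the cycles when the project
    ## actually has release ranges; get.cycles itself stays unchanged.
    if (has.release.ranges(conf)) {
      cycles <- get.cycles(conf)
      ## ... analysis that needs non-empty cycles ...
    } else {
      message("No release ranges found, skipping range-based analysis")
    }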


That does not protect against modules having unintended side effects due
to bugs or "clever" hidden state, but as long as modules only access
data they have declared a dependency on and announce every model or
data set they provide, correct control flow can be ensured.

It's also no solution for "optional" data, such as (missing) release
ranges. If there are modules which do something different when
additional data is present, namely dispatch additional tasks, they
should be logically separated from the simpler ones. They may wrap them,
but they should really not probe for the availability of additional
data; that's just bound to break again if the execution order is
modified.

But to be honest:
I have no clue how the control flow of the mailing list analysis
currently looks, or which code sections are dead and which are not (and
why there are "let it RIP" comments), or which data is being persisted
in the database, and where someone tried to juggle data in memory or in
the file system for far too long.

(Btw.: Data in the file system, relative to the R scripts. Whyyyyy???)

What data are you referring to?

Sorry, false alarm. I got misled by one of my notes (regarding the
default log path) and the (leftover?) .gitignore file in the ML
directory.

In general, I was referring to relics such as the file-based wordnet
analysis, or the set of operations running directly on the forest
generated by snatm.

-- Andreas


And introducing such a system after the fact means some massive
refactoring. Probably worth it though, given how much time was spent on
this discussion alone, and assuming that this was neither the first nor
the last time someone tries to modify the data flow.

For now, Benjamin Hiefner and I will only focus on getting the Python
part of the VCS analysis untangled, and we had no intention of even
touching the R parts so far, let alone the ML analysis.

As Mitchell said, the checks should run immediately after the
database is populated with raw data. The checks should thus not be
language-specific, but should be usable in various circumstances. No
need for you to touch the ML analysis or R, though.

Best regards, Wolfgang Mauerer

