[codeface] Re: [PATCH 3/5] Allow mailing list analysis to run even if no vcs analysis

  • From: Wolfgang Mauerer <wolfgang.mauerer@xxxxxxxxxxxxxxxxx>
  • To: Andreas Ringlstetter <andreas.ringlstetter@xxxxxxxxxxxxxxxxxxxx>, <codeface@xxxxxxxxxxxxx>
  • Date: Fri, 6 Nov 2015 17:17:25 +0100



On 06/11/15 at 16:00, Andreas Ringlstetter wrote:



On 06.11.2015 at 13:32, Wolfgang Mauerer wrote:


On 06/11/15 at 11:25, Andreas Ringlstetter wrote:
On 05.11.2015 at 18:00, Wolfgang Mauerer wrote:
On 05/11/15 at 17:53, Mitchell Joblin wrote:
On Thu, Nov 5, 2015 at 5:36 PM, Wolfgang Mauerer
<wolfgang.mauerer@xxxxxxxxxxxxxxxxx> wrote:
On 05/11/15 at 17:11, Mitchell Joblin wrote:

your remark about the check taking place in the db layer caused me to
indeed change my mind -- since the warning is not emitted in the ML
specific part, we should not weaken existing checks. Hence my
new suggestion. --Wolfgang

If we have preconditions for the state of the database tables after the
vcs analysis that are imposed by the web frontend or whatever, then all
these types of checks should be collected in a central place rather
than having preconditions scattered throughout the query
implementations. We should run those checks immediately after
populating the database because the query for cycles could occur ages
later in a totally different part of the code. I guess the default
parameter is just a quick band aid solution?

I agree that we should have more consistent checks across the code base.
How we do that remains to be discussed -- for this case, the band aid
is sufficient, I guess.

Having a DB sanity check pass is an interesting idea -- CC'ing Andreas
Ringlstetter, who is AFAIK working on related cleanups. We may need
both, an "fsck" for populated data bases (no fixing, just checking), and
more checks in the code.
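
To sketch what I mean by "fsck" (the table and column names below are
purely illustrative, not necessarily our actual schema):

library(DBI)

## Sketch of an "fsck" pass: run a list of named consistency queries
## against a populated database and only report violations, fix nothing.
db.fsck <- function(con) {
  checks <- list(
    "release ranges without any commit"=
      "SELECT rr.id FROM release_range rr
       LEFT JOIN commit c ON c.releaseRangeId=rr.id WHERE c.id IS NULL",
    "mails that predate the message they reply to"=
      "SELECT m.id FROM mail m JOIN mail orig ON m.inReplyTo=orig.id
       WHERE m.creationDate < orig.creationDate")

  for (name in names(checks)) {
    offenders <- dbGetQuery(con, checks[[name]])
    if (nrow(offenders) > 0) {
      warning("DB check failed: ", name, " (", nrow(offenders), " rows)")
    }
  }
}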

Somewhat agreed, data dependencies should be managed and checked in a
central location.

Even though that is different from extended plausibility checks on the
database, these should in fact reside in the database abstraction layer,
or rather (due to the lack of such a thing in codeface) in the

we don't provide a DB abstraction layer on purpose. The probability that
we will ever support different database implementations for the same
type of data is about \eps/2 because a) mysql works fine, b) we have no
manpower to maintain the layer (or rather: there are more important things
to research), and c) I'm not aware of a problem that the layer would
solve.

Checks could go into a specific layer, but that would be for the
purpose of checking correctness, not abstracting DB operations.

sparingly used helper functions. The rationale for this is that these
types of checks actually represent an extension of the schema.

However, that's only for catching inconsistencies or schema violations,
such as modules trying to write out invalid data, or data fetched from
the DB containing invalid values -- e.g., someone trying to insert an ML
response into the database with the date of the response predating the
original message.

Not for cases where entire modules have just been executed out of order,
so data models haven't even been filled yet.

What I was suggesting was to loosen the coupling between these
"modules" we have right now, introduce a central router/scheduler, and
have these modules announce their dependencies (and outputs) to the
scheduler. Well, these "dependencies" would be more or less just
free-form event names.

The control flow is usually not something we have had problems with. What
caused issues in the past is that some part of the DB has not been
filled in as expected because of unforeseen problems with the
real-world data (for instance, a release range with no commit).
Pre/Postcondition checkers would help in these cases.
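
For the concrete example, a postcondition check run right after the vcs
pass could be as simple as this (column names are only assumed):

library(DBI)

## Postcondition for the example above: fail hard if a release range
## ended up without a single commit (table/column names are illustrative).
check.range.nonempty <- function(con, range.id) {
  n <- dbGetQuery(con, sprintf(
    "SELECT COUNT(*) AS n FROM commit WHERE releaseRangeId=%d",
    range.id))$n[1]
  if (n == 0) {
    stop("Postcondition violated: release range ", range.id,
         " contains no commits")
  }
}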

Isn't the focus of this patch the issue that the control flow changed (a
stop condition was removed)?

with control flow, I was referring to the order in which various
passes are executed. The order as such is not changed by the patch.
What we could run into without stop() is that one pass will be run
with incomplete input data, which could not happen before.

OK, what I suggested would be overkill to patch in; it's just what I
originally had in mind.

Simple pre/post-conditions would already suffice. At least as long as
failures in these tests are actually treated as fatal.

But then again: that requires adding defensive checks in a lot of
locations -- at the call sites, that is. So it still means a lot of
refactoring.

The precondition should not be embedded into the function which fetches
or writes the data, but be outsourced to a function which only tests the
condition and reliably errors out when it isn't fulfilled.

Only simple validation rules can (and should) be enforced inside the
database abstraction helpers. Otherwise, this reduces the
reusability of the helper functions.

Outsourced, because the same precondition is most likely required in
more than a single location. So keep these functions rather generic as
well; it is better to test against multiple small conditions than a
single monolithic one. That also means less refactoring when a
condition can be lifted later on.

The only question left is:
a) predicate + wrapper code / helper function (the predicate itself only
returns a boolean), or
b) conditions with the hard fail built in?


Only a minor difference, depending on whether the same predicates should
also be used to control conditional execution, or whether conditional
execution based on predicates should be discouraged since it increases
complexity.
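
Roughly, the two variants would look like this (all names are made up
for illustration):

library(DBI)

## a) Side-effect free predicate; enforcement lives in a thin wrapper.
has.ranges <- function(con, pid) {
  nrow(dbGetQuery(con, sprintf(
    "SELECT id FROM release_range WHERE projectId=%d", pid))) > 0
}

ensure <- function(ok, msg) {
  if (!ok) stop("Precondition failed: ", msg)
}

## b) The condition fails hard all by itself.
ensure.has.ranges <- function(con, pid) {
  if (!has.ranges(con, pid)) {
    stop("No release ranges found for project ", pid)
  }
}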

Obviously all predicates/conditions must be free of side effects. And a
stack trace should be provided in case of a failure, since the scope
from which the condition was checked is relevant.

we do that for the R passes, plus a VM dump is created on error
that can be used to debug the problem a posteriori.
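
(For reference, base R provides the building blocks for such dumps along
these lines; the dump file name is arbitrary, and our actual hook may
differ:)

## Dump all frames to disk on error ...
options(error=quote({
  dump.frames(dumpto="codeface.dump", to.file=TRUE)
  quit(save="no", status=1)
}))

## ... and inspect them later in an interactive session:
##   load("codeface.dump.rda"); debugger(codeface.dump)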




Applied to the patch in question:

Removing the stop was somewhat correct, but a lot of call sites are now
lacking defensive checks. Would all the sites calling this function
still work properly? It's not just the widgets, but also every
location accessing conf$boundaries.

And is it really a good idea to try fixing this with a default
parameter band aid?

For now (as no one wants to add the defensive checks), I would actually
suggest not changing get.cycles at all.

Instead, just add a new predicate that checks whether there are any
ranges at all, and make the call to get.cycles dependent on it, with
get.cycles itself still assuming a non-empty result set. The predicate
is quite likely to be useful in other locations as well.

hm, are you suggesting something like the following?

if (there_are_cycles()) {
  return(get.cycles())
} else {
  what.to.do.here()?
}

But what do we do when there are no cycles? Returning NULL or something
similar would require updating all callers to stop, because the case
that there are no cycles is handled nowhere, and is not something that
we can recover from.

-> No band aid required, and the assumption (that the function only has
to work on a non-empty set) becomes explicit. The same predicate is also
going to be required if a final validation phase should ever be added.
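
In other words, roughly this at the call site (has.ranges() being the
hypothetical new predicate, and assuming the conf object carries the DB
connection and project id):

## Guard the call site instead of changing get.cycles() itself.
if (!has.ranges(conf$con, conf$pid)) {
  stop("No release ranges in the DB -- did the vcs analysis run?")
}
cycles <- get.cycles(conf)  ## keeps assuming a non-empty result set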


That's not protecting against modules having unintended side effects due
to bugs or "clever" hidden state, but as long as modules only access
data they have declared a dependency on and announce every model or
data set they provide, correct control flow can be ensured.

It's also not a solution for "optional" data, such as (missing) release
ranges. If there are modules which do something different when
additional data is present, namely dispatching additional tasks, they
should be logically separated from the simpler ones. They may wrap them,
but they should really not try to probe for the availability of
additional data. That's just bound to break again if the execution order
is modified.

But to be honest:
I have no clue what the control flow of the mailing list analysis
currently looks like, or which code sections are dead and
which are not (and why there are "let it RIP" comments). Or which data
is being persisted in the database, and where someone tried to juggle
data in memory or in the file system for far too long.

(Btw.: Data in the file system, relative to the R scripts. Whyyyyy???)

What data are you referring to?

Sorry, false alarm. Got misled by one of my notes (regarding the
default log path), and the (leftover?) .gitignore file in the ML directory.

In general, I was referring to relics such as the file-based wordnet
analysis, or the set of operations running directly on the forest
generated by snatm.

the data we store in the FS are costly to compute, and it helps with
development if we can re-use the results during debugging sessions
without having to recompute after each small change.
With the exception of a few saved bytes in the file system,
what are the advantages of removing/not creating these files?
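
The pattern behind these files is just the usual on-disk cache (names
below are placeholders):

## Recompute an expensive intermediate result only when no cached
## copy exists; file name and computation are placeholders.
compute.or.load <- function(cache.file, compute) {
  if (file.exists(cache.file)) {
    return(readRDS(cache.file))
  }
  res <- compute()
  saveRDS(res, cache.file)
  res
}

## e.g.: forest <- compute.or.load("forest.rds", function() build.forest(mails))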

Best regards, Wolfgang Mauerer
