[codeface] Re: [PATCH 3/5] Allow mailing list analysis to run even if no vcs analysis

  • From: Andreas Ringlstetter <andreas.ringlstetter@xxxxxxxxxxxxxxxxxxxx>
  • To: Wolfgang Mauerer <wolfgang.mauerer@xxxxxxxxxxxxxxxxx>, <codeface@xxxxxxxxxxxxx>
  • Date: Fri, 6 Nov 2015 11:25:18 +0100



On 05.11.2015 at 18:00, Wolfgang Mauerer wrote:



On 05/11/15 at 17:53, Mitchell Joblin wrote:
On Thu, Nov 5, 2015 at 5:36 PM, Wolfgang Mauerer
<wolfgang.mauerer@xxxxxxxxxxxxxxxxx> wrote:


On 05/11/15 at 17:11, Mitchell Joblin wrote:

your remark about the check taking place in the DB layer did cause me to
change my mind -- since the warning is not emitted in the ML-specific
part, we should not weaken existing checks. Hence my new
suggestion. --Wolfgang

If we have preconditions for the state of the database tables after the
VCS analysis that are imposed by the web frontend or whatever,
then all these types of checks should be collected in a central place
rather than having preconditions scattered throughout the query
implementations. We should run those checks immediately after
populating the database, because the query for cycles could occur ages
later in a totally different part of the code. I guess the default
parameter is just a quick band-aid solution?

I agree that we should have more consistent checks across the code base.
How we do that remains to be discussed -- for this case, the band-aid
is sufficient, I guess.

Having a DB sanity check pass is an interesting idea -- CC'ing Andreas
Ringlstetter, who is AFAIK working on related cleanups. We may need
both: an "fsck" for populated databases (no fixing, just checking), and
more checks in the code.
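Such an "fsck" pass could look roughly like this -- a minimal sketch
against an in-memory SQLite database; the table and column names are
illustrative, not the actual codeface schema:

```python
import sqlite3

def fsck(conn):
    """Run read-only consistency checks on a populated database.

    Returns a list of human-readable problem descriptions; an empty
    list means all checks passed. No fixing, just checking.
    """
    problems = []
    cur = conn.cursor()

    # A project that went through VCS analysis should have at least
    # one release range (hypothetical table name).
    cur.execute("SELECT COUNT(*) FROM release_range")
    if cur.fetchone()[0] == 0:
        problems.append("no release ranges: VCS analysis did not run?")

    # Every commit must reference an existing author.
    cur.execute("""SELECT COUNT(*) FROM commits c
                   LEFT JOIN person p ON c.author_id = p.id
                   WHERE p.id IS NULL""")
    dangling = cur.fetchone()[0]
    if dangling:
        problems.append("%d commits with dangling author references"
                        % dangling)

    return problems
```

The point is that the pass only reads and reports, so it can be run at
any time after population without risk of making things worse.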

Somewhat agreed: data dependencies should be managed and checked in a
central location.

That is different from extended plausibility checks on the database,
though; those should in fact reside in the database abstraction layer,
or (due to the lack of such a thing in codeface) in the sparingly used
helper functions. The rationale for this is that these types of checks
actually represent an extension of the schema.

However, that is only for catching inconsistencies or schema violations,
such as modules trying to write out invalid data, or data fetched from
the DB containing invalid values -- e.g. someone trying to insert an ML
response into the database with the date of the response predating the
original message.
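A check of that kind, living in a DB helper function, could be sketched
as follows; the function name, exception type, and signature are
hypothetical, not existing codeface API:

```python
import datetime

class InvalidMailData(ValueError):
    """Raised when a write would violate the (implicit) schema."""

def check_ml_response(parent_date, response_date):
    """Plausibility check run before inserting an ML response.

    Treats "a response cannot predate the message it replies to" as
    part of the schema: the helper rejects the write instead of
    letting invalid data reach the database.
    """
    if response_date < parent_date:
        raise InvalidMailData(
            "response dated %s predates original message dated %s"
            % (response_date, parent_date))
```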

Not for cases where entire modules have just been executed out of order,
so data models haven't even been filled yet.

What I was suggesting was to loosen the coupling between these
"modules" we have right now, introduce a central router/scheduler, and
have the modules announce their dependencies (and outputs) to the
scheduler. These "dependencies" would be more or less just free-form
event names.

That does not protect against modules having unintended side effects due
to bugs or "clever" hidden state, but as long as modules only access
data they have declared a dependency on and announce every model or
data set they provide, correct control flow can be ensured.
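The scheduling idea can be sketched as follows -- `run_order`, the
module names, and the event names are all illustrative, not existing
codeface code:

```python
def run_order(modules):
    """Compute an execution order from declared dependencies.

    `modules` maps a module name to a (provides, requires) pair, where
    both entries are sets of free-form event names. Returns the module
    names in an order where every required event has been provided by
    an earlier module; raises RuntimeError if the dependencies cannot
    be satisfied (missing providers or a cycle).
    """
    available, order = set(), []
    pending = dict(modules)
    while pending:
        # A module is ready once all events it requires are available.
        ready = [name for name, (_, req) in pending.items()
                 if req <= available]
        if not ready:
            raise RuntimeError("unsatisfiable dependencies: %s"
                               % sorted(pending))
        for name in sorted(ready):  # sort for a deterministic order
            provides, _ = pending.pop(name)
            available |= provides
            order.append(name)
    return order
```

Running a module out of order then becomes impossible by construction
rather than something each module has to guard against.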

It is also not a solution for "optional" data, such as (missing) release
ranges. If there are modules that do something different when
additional data is present, namely dispatching additional tasks, they
should be logically separated from the simpler ones. They may wrap
them, but they should really not try to probe for the availability of
additional data. That is just bound to break again if the execution
order is modified.

But to be honest:
I have no clue how the control flow of the mailing list analysis
currently looks. Or which code sections are dead and which are not (and
why there are "let it RIP" comments). Or which data is being persisted
in the database, and where someone tried to juggle data in memory or in
the file system for far too long.

(Btw.: Data in the file system, relative to the R scripts. Whyyyyy???)

And introducing such a system after the fact means some massive
refactoring. Probably worth it, though, given how much time was spent on
this discussion alone, and assuming that this was neither the first nor
the last time someone tries to modify the data flow.

For now, Benjamin Hiefner and I will focus only on getting the Python
part of the VCS analysis untangled; we had no intention of even
touching the R parts so far, let alone the ML analysis.

--Andreas

Thanks, Wolfgang




--Mitchell



