[codeface] Re: [PATCH 5/5] Add test for global mail analysis

  • From: Mitchell Joblin <joblin.m@xxxxxxxxx>
  • To: Wolfgang Mauerer <wolfgang.mauerer@xxxxxxxxxxxxxxxxx>
  • Date: Thu, 5 Nov 2015 13:44:45 +0100

On Thu, Nov 5, 2015 at 1:43 PM, Wolfgang Mauerer
<wolfgang.mauerer@xxxxxxxxxxxxxxxxx> wrote:



Am 05/11/15 um 13:13 schrieb Mitchell Joblin:
On Thu, Nov 5, 2015 at 11:54 AM, Wolfgang Mauerer
<wolfgang.mauerer@xxxxxxxxxxxxxxxxx> wrote:


Am 05/11/15 um 11:41 schrieb Mitchell Joblin:
On Thu, Nov 5, 2015 at 12:30 AM, Wolfgang Mauerer
<wolfgang.mauerer@xxxxxxxxxxxxxxxxx> wrote:
Am 29/10/15 um 11:57 schrieb Mitchell Joblin:

+ conf$pid <- gen.clear.project.id.con(conf$con, project.name,
analysis.method)
+
+ ## Run analysis
+ dispatch.all(conf, path, res.dir)
+
+ ## Query for edgelist
+ start.date <- "2000-01-01"
+ end.date <- "2020-01-01"
+ edgelist <- query.mail.edgelist(conf$con, conf$pid, start.date,
end.date)
+
+ ## Get author id to name mapping
+ id.to.name <- sapply(unique(unlist(edgelist[,c(1,2)])),
is unique(unlist(edgelist[,c(1,2)])) guaranteed to return a continuous
interval without gaps for real-world data sets (i.e., is it guaranteed
that no number is missing for some reason)? If yes, this guarantee
should be documented; something like 1:max(edgelist[,c(1,2)]) might
be a bit easier to understand (I had to contemplate for a few moments
to see what you are trying to do here). Alternatively, a comment would
also help.

Why do we need a continuous interval for the id? By continuous, do you
mean contiguous? The ids are definitely not contiguous in most case.

right, contiguous would be the appropriate term.
1:max(..) will destroy the correspondence between the vertices in the
edge list ids and the author names. unique(unlist(edgelist[,c(1,2)]))
will just flatten the edge list into a vector with unique person ids.

if the IDs are not contiguous, I don't see why the code should work
correctly. Assume we have edges 1-2, 1-4, 2,4, and 5-9. Then
unique(unlist(...)) would deliver 1,2,4,5,9, right?

In this case, we will get an array with 5 string entries
for id.to.name. Later on, I might want to know the name of id 9 -- but
entry 9 is not populated, because there are only 5 entries, with
the entry for index 9 on position 5.

I may miss some point so that the above consideration is wrong, but
even if so, some documentation why the mechanism works as expected
would be helpful. I'm heading for maximal clarity here because we
used to have lots of painful effort with converting between in-graph,
in-vcs and other IDs that I don't want to repeat.

Yes, I remember that horror :) I circumvented that problem by avoiding
igraphs contiguous ids by assigning an attribute to the nodes that
contains the global ids. Then we don't have to maintain the
local-to-global id maps. In this case the edge list elements are of
type string and contain the global id. When using sapply with a string
vector, the result is a named vector. That vector can then be indexed
using the string coerced global ids to retrieve the name.

Ah, representing the edges as strings is a nice trick in this case.
But please spend a line in the source code to share it with world, this
is not easily visible as is ;)

Sure, I will add that.




+ function(id) query.person.name(conf$con, id))
+ edgelist[,c(1,2)] <- sapply(edgelist[,c(1,2)], function(id)
id.to.name[id])
+
+ ## Generate graph from database data
+ g.db <- graph.data.frame(edgelist)
+
+ ## Generate target graph
+ edgelist.target <- data.frame(from=c("chatty kathy", "nasty nate"),
+ to=c("bossy bill", "sneaky sam"),

haha -- glad you follow Johannes Ebke's test case naming conventions ;)
+ weight=1)
+ g.target <- graph.data.frame(edgelist.target)
+
+ ## Test for edge agreement
+ res <- all(E(g.target) == E(g.db))

what's not clear to me: g.db is constructed from a subset of the
real-world qemu mailing list. Why should the edge list agree with the
artificial one constructed above?

The test data is a trimmed down and manipulated version of the qemu
mailing list. I contrived a set of threads, that if processed
correctly, will yield the ground truth (the so called artificial one).
There is a many opportunities to mess up between going from the raw
mail data to the dev-dev communication network. For example, that
problem with tm.plugin.mail failing to join mails based on common
threads properly would have been easily found with this test.

but these mainpulated entries are still somewhere in your tree, correct?
I don't see them in the current master revision of the mbox file:

I didn't bother to send the patch with the test data addition because
I didn't see the point. It will be pull request though.

Sure, that's nothing that needs review. But please also add a line of
documentation to clarify that our qemu mbox files contains entries
for bossy bill and friends, these guys would no be expected on
unmodified data.

Sounds good. Thanks!

--Mitchell


Thanks, Wolfgang

--Mitchell


$ grep chatty gmane.comp.emulators.qemu.mbox
$

Thanks, Wolfgang

+}
+
test_that("Forest generation functions correctly", {
expect_true(test.genforest())
})

test_that("Corpus precondiction checks works", {
expect_true(test.check.corpus.precon())
- })
\ No newline at end of file
+ })
+
+test_that("Global analysis returns correct network", {

Global _mailing list_ analysis (so that it can be distinguished from
VCS analysis tests)

Sure, I will change that.

Thanks,

Mitchell


Reviewed-by: Wolfgang Mauerer <wolfgang.mauerer@xxxxxxxxxxx>

Thanks for the series, in particular for introducing new test cases!

Best regards, Wolfgang

+ expect_true(test.global.analysis())
+ })




Other related posts: