[codeface] Re: igraph vertex indexing irregularities

  • From: Wolfgang Mauerer <wolfgang.mauerer@xxxxxxxxxxx>
  • To: Mitchell Joblin <joblin.m@xxxxxxxxx>
  • Date: Wed, 20 Nov 2013 14:21:58 +0100

On 20/11/13 11:10, Mitchell Joblin wrote:
> On Tue, Nov 19, 2013 at 6:27 PM, Wolfgang Mauerer
> <wolfgang.mauerer@xxxxxxxxxxx> wrote:
>> On 19/11/13 16:24, Mitchell Joblin wrote:
>>
>>> It has recently become apparent that we rely on igraph to automatically
>>> index the vertices sequentially, although there are exceptions where
>>> this sequential index assumption is simply incorrect. In general, we
>>> need to be more careful of these indexing issues and incorporate more
>>> sanity checks after graph manipulations. In this direction, the
>> absolutely agreed, we need more sanity checks to avoid code silently
>> producing bogus results without explicit errors. In-DB, local and
>> global graph indexing is slightly messy right now. It would be great
>> if you could work at that some time.
>>
>>> following patch can be used to introduce explicit ids for the vertices
>>> by setting row and column names for the adjacency matrix. This is not an
>>> appropriate solution because now igraph use vertex ids that are strings
>>> by casting integer row and column names from the adjacency matrix. We
>>> need the vertex ids to be integer type so that we can use it for
>>> indexing other data structures. Until this moment I see no opportunity
>>> to convert the vertex ids to integer type and using the V(g)$name
>>> attribute is restricted to character type.
>>
>> I see the implicit conversion for get.data.frame() where the numeric
>> IDs seem to be converted to character based ones. Are there any
>> other places where this happens?
> 
> I see actually don't see there is any implicit conversion for
> get.data.frame(). I see that get.data.frame() eventually calls
> get.edges() which then uses the V(g)$name attribute to build up the
> edgelist. If we don't explicitly set vertex ids then get.data.frame()
> will return integers.
we do set indices manually in the patch.
>>>
>>> While searching for a solution I also came across this posting
>>> (http://lists.gnu.org/archive/html/igraph-help/2007-10/msg00014.html)
>>> where the creator of igraph mentions that sometimes the vertex ids can
>>> be silently renumbered.
>>>
>>> I propose that we should handle indexing entirely on our own by
>>> assigning a vertex attribute that is not ever manipulated by igraph. We
>>> should implement sanity checks after all graph manipulations to check
>>> the vertex attribute has not been dropped or changed.
>> If we proceed like this, we will need to pay attention to igraph
>> operations that implicitly use the internal numbering system
>> and not our proposed explicit stable numbering system, for instance
>> get.data.frame(). When we assign consecutive indices to the adjacency
>> matrix the graph is created from, get.data.frame() is the only opportunity
>> when a conversion from character back to numeric is required as far as a
>> saw. So how about first setting the ids sequentially, and then making
>> sure that the edge list contains only numeric entries?
> 
> That would be a better solution, but the exact problem is that I see
> no way to set the ids sequentially and making sure the edge list
> contains only integer entries. Essentially that is the technical
> problem. The edge list uses the V(g)$name attribute which is a
> character. V(g)$name is the attribute igraph offers to set your own
> vertex sequence. When we don't want to explicitly set the ids and rely
> on igraph's implicit indexing we can have integer type. I need to look
> into the igraph code a little more to see what happens when no
> V(g)$name attribute is set, I think thats in the C code though.
>>
>> --- a/codeface/R/cluster/persons.r
>> +++ b/codeface/R/cluster/persons.r
>> @@ -648,7 +648,21 @@ plot.group <- function(N, .tags, .iddb, .comm) {
>>  ## compute some attributes for proper visualisation, and export the
>>  ## result as a graphviz dot format if a filename is provided.
>>  save.group <- function(conf, .tags, .iddb, idx, .prank, .filename=NULL, 
>> label) {
>> -  g <- graph.adjacency(.tags[idx,idx], mode="directed")
>> +  ## Select the subset of the global graph that forms the community,
>> +  ## and ensure that the per-cluster indices range from 1..|V(g.cluster)|
>> +  subset <- .tags[idx, idx]
>> +
>> +  ## A 1x1 matrix is not of class matrix, but of class numeric,
>> +  ## so ncol() won't work any more in this case. Ensure that we are actually
>> +  ## working with a matrix before explicitely setting the row and column
>> +  ## names to consecutive numbers.
>> +  if (class(subset) == "matrix") {
>> +    rownames(subset) <- 1:ncol(subset)
>> +    colnames(subset) <- 1:ncol(subset)
>> +  }
>> +
>> +  g <- graph.adjacency(subset, mode="directed")
>> +
>>    ## as.character is important. The igraph C export routines bark
>>    ## otherwise (not sure what the actual issue is)
>>    ## NOTE: V(g)$name as label index does NOT work because the name attribute
>> @@ -698,6 +712,8 @@ store.graph.db <- function(conf, baselabel, idx, .iddb, 
>> g.reg, g.tr, j) {
>>    ## Construct a systematic representation of the graph for the data base
>>    edges <- get.data.frame(g.reg, what="edges")
>>    colnames(edges) <- c("fromId", "toId")
>> +  edges$fromId <- as.numeric(edges$fromId)
>> +  edges$toId <- as.numeric(edges$toId)
>>
>> This way, we make the decision for sequential indices explicit,
>> but avoid having to use special accessor functions for them.
>> Do you see any issues with the approach? It survives basic testing,
>> but I have not looked into the implications in detail.
>> Code is in branch understand on github to make testing easier
>> (see b4c809f44c).
> 
> Yeah we could definitely do that. I originally wanted to avoid having
> to cast since I imagine we will again have bugs where we forgot to
> cast it and just assumed it was integer type. That integer assumption
> is what kind of lead us to this problem in the first place. The
> biggest problem I see above is we still are relying on igraphs
> indexing which can be silently changed, so I would rather avoid using
> it altogether. If we do use the above solution I also think as.integer
but we're setting the indices manually for the subgraphs with the
patch above, so indices will be sequential -- are there any other
places I have missed where we compute subgraphs?

We could factor out subgraph generation into a subroutine
to have the code in one central place -- this should help to avoid
improper manual subset calculations.

> would be better than as.numeric since R's numeric data type is
> basically double precision float I think.
good catch. I've changed the patch accordingly (btw.: there
are some more places where we convert to numeric although integer
would be sufficient, should be cleaned up some time).

Best regards, Wolfgang

Other related posts: