On 20/11/13 11:10, Mitchell Joblin wrote: > On Tue, Nov 19, 2013 at 6:27 PM, Wolfgang Mauerer > <wolfgang.mauerer@xxxxxxxxxxx> wrote: >> On 19/11/13 16:24, Mitchell Joblin wrote: >> >>> It has recently become apparent that we rely on igraph to automatically >>> index the vertices sequentially, although there are exceptions where >>> this sequential index assumption is simply incorrect. In general, we >>> need to be more careful of these indexing issues and incorporate more >>> sanity checks after graph manipulations. In this direction, the >> absolutely agreed, we need more sanity checks to avoid code silently >> producing bogus results without explicit errors. In-DB, local and >> global graph indexing is slightly messy right now. It would be great >> if you could work at that some time. >> >>> following patch can be used to introduce explicit ids for the vertices >>> by setting row and column names for the adjacency matrix. This is not an >>> appropriate solution because now igraph use vertex ids that are strings >>> by casting integer row and column names from the adjacency matrix. We >>> need the vertex ids to be integer type so that we can use it for >>> indexing other data structures. Until this moment I see no opportunity >>> to convert the vertex ids to integer type and using the V(g)$name >>> attribute is restricted to character type. >> >> I see the implicit conversion for get.data.frame() where the numeric >> IDs seem to be converted to character based ones. Are there any >> other places where this happens? > > I see actually don't see there is any implicit conversion for > get.data.frame(). I see that get.data.frame() eventually calls > get.edges() which then uses the V(g)$name attribute to build up the > edgelist. If we don't explicitly set vertex ids then get.data.frame() > will return integers. we do set indices manually in the patch. >>> >>> While searching for a solution I also came across this posting >>> (http://lists.gnu.org/archive/html/igraph-help/2007-10/msg00014.html) >>> where the creator of igraph mentions that sometimes the vertex ids can >>> be silently renumbered. >>> >>> I propose that we should handle indexing entirely on our own by >>> assigning a vertex attribute that is not ever manipulated by igraph. We >>> should implement sanity checks after all graph manipulations to check >>> the vertex attribute has not been dropped or changed. >> If we proceed like this, we will need to pay attention to igraph >> operations that implicitly use the internal numbering system >> and not our proposed explicit stable numbering system, for instance >> get.data.frame(). When we assign consecutive indices to the adjacency >> matrix the graph is created from, get.data.frame() is the only opportunity >> when a conversion from character back to numeric is required as far as a >> saw. So how about first setting the ids sequentially, and then making >> sure that the edge list contains only numeric entries? > > That would be a better solution, but the exact problem is that I see > no way to set the ids sequentially and making sure the edge list > contains only integer entries. Essentially that is the technical > problem. The edge list uses the V(g)$name attribute which is a > character. V(g)$name is the attribute igraph offers to set your own > vertex sequence. When we don't want to explicitly set the ids and rely > on igraph's implicit indexing we can have integer type. I need to look > into the igraph code a little more to see what happens when no > V(g)$name attribute is set, I think thats in the C code though. >> >> --- a/codeface/R/cluster/persons.r >> +++ b/codeface/R/cluster/persons.r >> @@ -648,7 +648,21 @@ plot.group <- function(N, .tags, .iddb, .comm) { >> ## compute some attributes for proper visualisation, and export the >> ## result as a graphviz dot format if a filename is provided. >> save.group <- function(conf, .tags, .iddb, idx, .prank, .filename=NULL, >> label) { >> - g <- graph.adjacency(.tags[idx,idx], mode="directed") >> + ## Select the subset of the global graph that forms the community, >> + ## and ensure that the per-cluster indices range from 1..|V(g.cluster)| >> + subset <- .tags[idx, idx] >> + >> + ## A 1x1 matrix is not of class matrix, but of class numeric, >> + ## so ncol() won't work any more in this case. Ensure that we are actually >> + ## working with a matrix before explicitely setting the row and column >> + ## names to consecutive numbers. >> + if (class(subset) == "matrix") { >> + rownames(subset) <- 1:ncol(subset) >> + colnames(subset) <- 1:ncol(subset) >> + } >> + >> + g <- graph.adjacency(subset, mode="directed") >> + >> ## as.character is important. The igraph C export routines bark >> ## otherwise (not sure what the actual issue is) >> ## NOTE: V(g)$name as label index does NOT work because the name attribute >> @@ -698,6 +712,8 @@ store.graph.db <- function(conf, baselabel, idx, .iddb, >> g.reg, g.tr, j) { >> ## Construct a systematic representation of the graph for the data base >> edges <- get.data.frame(g.reg, what="edges") >> colnames(edges) <- c("fromId", "toId") >> + edges$fromId <- as.numeric(edges$fromId) >> + edges$toId <- as.numeric(edges$toId) >> >> This way, we make the decision for sequential indices explicit, >> but avoid having to use special accessor functions for them. >> Do you see any issues with the approach? It survives basic testing, >> but I have not looked into the implications in detail. >> Code is in branch understand on github to make testing easier >> (see b4c809f44c). > > Yeah we could definitely do that. I originally wanted to avoid having > to cast since I imagine we will again have bugs where we forgot to > cast it and just assumed it was integer type. That integer assumption > is what kind of lead us to this problem in the first place. The > biggest problem I see above is we still are relying on igraphs > indexing which can be silently changed, so I would rather avoid using > it altogether. If we do use the above solution I also think as.integer but we're setting the indices manually for the subgraphs with the patch above, so indices will be sequential -- are there any other places I have missed where we compute subgraphs? We could factor out subgraph generation into a subroutine to have the code in one central place -- this should help to avoid improper manual subset calculations. > would be better than as.numeric since R's numeric data type is > basically double precision float I think. good catch. I've changed the patch accordingly (btw.: there are some more places where we convert to numeric although integer would be sufficient, should be cleaned up some time). Best regards, Wolfgang