[codeface] Re: AW: Re: AW: Re: AW: Re: AW: Re: AW: Re: AW: Re: AW: Re: Preparing time series data - sloccount analysis

  • From: Wolfgang Mauerer <wm@xxxxxxxxxxxxxxxx>
  • To: codeface@xxxxxxxxxxxxx
  • Date: Thu, 05 Mar 2015 14:50:22 +0100


On 05/03/2015 at 14:19, Matthias Gemmer wrote:
>> From: codeface-bounce@xxxxxxxxxxxxx <codeface-bounce@xxxxxxxxxxxxx> on
>> behalf of Mitchell Joblin <joblin.m@xxxxxxxxx>
>> Sent: Thursday, 5 March 2015 13:14
>> To: Wolfgang Mauerer
>> Cc: codeface@xxxxxxxxxxxxx
>> Subject: [codeface] Re: AW: Re: AW: Re: AW: Re: AW: Re: AW: Re: AW: Re:
>> Preparing time series data - sloccount analysis
>>
>> On Thu, Mar 5, 2015 at 11:12 AM, Wolfgang Mauerer
>> <wolfgang.mauerer@xxxxxxxxxxx> wrote:
>>> On 05.03.2015 12:04, Matthias Gemmer wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Browse[1]> print(plot.id)
>>>>>>>>>>>> numeric(0)
>>>>>>>>>>>
>>>>>>>>>>> so that's the culprit... There is no valid plot ID for the time
>>>>>>>>>>> series in the database. Can you please check that an appropriate
>>>>>>>>>>> table is available in the database?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> There is a table called timeseries with the column plotId.
>>>>>>>>>> mysql> DESCRIBE timeseries;
>>>>>>>>>> +--------------+------------+------+-----+---------+-------+
>>>>>>>>>> | Field        | Type       | Null | Key | Default | Extra |
>>>>>>>>>> +--------------+------------+------+-----+---------+-------+
>>>>>>>>>> | plotId       | bigint(20) | NO   | MUL | NULL    |       |
>>>>>>>>>> | time         | datetime   | NO   |     | NULL    |       |
>>>>>>>>>> | value        | double     | NO   |     | NULL    |       |
>>>>>>>>>> | value_scaled | double     | YES  |     | NULL    |       |
>>>>>>>>>> +--------------+------------+------+-----+---------+-------+
>>>>>>>>>> 4 rows in set (0.02 sec)
>>>>>>>>>>
>>>>>>>>>> The table is also filled with data. The table contains datasets for
>>>>>>>>>> plotId=5, plotId=6, plotId=7 and plotId=8.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Which values do sloccount.plot.id (and understand.plot.id) have
>>>>>>>>>>> in do.complexity.analysis (Frame 3/4)?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The values for sloccount.plot.id and understand.plot.id are
>>>>>>>>>> obviously invalid.
>>>>>>>>>>
>>>>>>>>>> Browse[1]> print(sloccount.plot.id)
>>>>>>>>>> numeric(0)
>>>>>>>>>> Browse[1]> print(understand.plot.id)
>>>>>>>>>> numeric(0)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> It was not so obvious to me; I was trying to ensure that
>>>>>>>>> parallelisation did not introduce any issues here. But your
>>>>>>>>> observation clarified that this is not the case.
>>>>>>>>>
>>>>>>>>> Since the error seems to be deterministically reproducible at your
>>>>>>>>> site, can you debug around the creation of the index (for instance by
>>>>>>>>> printing out what's going on; alternatively, you could also use the
>>>>>>>>> built-in debugger)?
>>>>>>>>>
>>>>>>>>
>>>>>>>> In the file codeface/R/complexity.r:
>>>>>>>>
>>>>>>>> Assignment of sloccount.plot.id and understand.plot.id:
>>>>>>>>    ## Obtain plot IDs for the sloccount and understand raw time series
>>>>>>>>    ## before parallel processing commences to avoid race conditions
>>>>>>>>    sloccount.plot.id <- get.or.create.plot.id(conf, "sloccount")
>>>>>>>>    understand.plot.id <- get.or.create.plot.id(conf, "understand_raw")
>>>>>>>>        -> sloccount.plot.id and understand.plot.id have the value "x".
>>>>>>>>               Are these values feasible? Or shall I have a closer look
>>>>>>>>               at the function 'get.or.create.plot.id'?
>>>>>>>
>>>>>>>
>>>>>>> since the SQL specification for the plot ID is
>>>>>>>
>>>>>>> `id` BIGINT NOT NULL AUTO_INCREMENT
>>>>>>>
>>>>>>> the value "x" seems quite impossible. Can you please query your
>>>>>>> database to see what value is stored there?
>>>>>>>
>>>>>>
>>>>>> The table is empty.
>>>>>> mysql> select * from plots;
>>>>>> Empty set (0.01 sec)
>>>>>
>>>>>
>>>>> please try to run the other SQL statements produced by the code to see
>>>>> why no entry is created. get.or.create.plot.id() inserts a new entry
>>>>> into the table if no ID for a desired plot is available.
>>>>
>>>>
>>>> The branch which creates a plot ID is not entered. The condition
>>>> 'length(res) < 1' is not satisfied in either case (sloccount.plot.id
>>>> and understand.plot.id).
>>>>
>>>> For sloccount.plot.id <- get.or.create.plot.id(conf, "sloccount"):
>>>>    res <- dbGetQuery(con, str_c(query, ";"))
>>>>    # str_c(query, ";"):
>>>>    #   SELECT id FROM plots WHERE name='sloccount' AND projectId=2;
>>>>    # res: "id"
>>>>    # length(res): 1
>>>>    if (length(res) < 1) {
>>>>      ## Plot ID is not assigned yet, create one
>>>>      res <- get.clear.plot.id.con(con, pid, plot.name, range.id)
>>>>    } else {
>>>>      res <- res$id
>>>>    }
>>>>    # res: "x"
>>>
>>>
>>> @Mitchell, could you try to reproduce this? I don't see why a result
>>> with non-zero length should be returned from the SQL query if the
>>> database is empty.
>>
>> The SQL query probably returns a data frame and length(..) called on a
>> data frame does not return the number of rows. To get the number of
>> rows of a data frame you should be using nrow(..) instead of
>> length(..).
>>
>> --Mitchell
>>
> 
> That worked for me.
> After replacing 'length' with 'nrow' a new plot ID is created!
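
For reference, the difference Mitchell describes can be reproduced without a database. This is only a minimal sketch: it assumes that dbGetQuery() handed back a data frame with the single column id and no rows, which matches the debugger output quoted above (length(res) was 1, the plot ID was numeric(0)). The data frame is built by hand here.

```r
## Hand-built stand-in for the query result on an empty plots table:
## one column (id), zero rows. This is an assumption, not a DB call.
res <- data.frame(id = numeric(0))

n.cols <- length(res)  # length() of a data frame counts its columns
n.rows <- nrow(res)    # nrow() counts the rows that actually matched

## The old check looks at the column count, so the "create" branch
## is skipped although no row was found ...
if (length(res) < 1) {
  plot.id <- NA        # placeholder for creating a new plot ID
} else {
  plot.id <- res$id    # ... and an empty vector is handed back
}

print(n.cols)
print(n.rows)
print(plot.id)
```

With nrow(res) < 1 as the condition, the first branch is taken for such a result, which is consistent with what Matthias observed.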

The following patch should fix this for good then:

> diff --git a/codeface/R/db.r b/codeface/R/db.r
> index db53811..32da240 100644
> --- a/codeface/R/db.r
> +++ b/codeface/R/db.r
> @@ -59,10 +59,10 @@ get.clear.plot.id.con <- function(con, pid, plot.name, range.id=NULL,
>  
>    res <- dbGetQuery(con, str_c("SELECT id", query))
>  
> -  if (length(res) != 1) {
> +  if (nrow(res) != 1) {
>      stop("Internal error: Plot ", plot.name, " appears multiple times in DB",
>           "for project ID ", pid)
> -  }
> +}
>  
>    return(res$id)
>  }
> @@ -81,7 +81,7 @@ get.plot.id.con <- function(con, pid, plot.name, range.id=NULL) {
>      query <- str_c(query, " AND releaseRangeId=", range.id)
>    }
>    res <- dbGetQuery(con, str_c("SELECT id", query))
> -  if (length(res) < 1) {
> +  if (nrow(res) < 1) {
>      stop("Internal error: Plot ", plot.name, " not found in DB",
>           " for project ID ", pid)
>    }
> @@ -104,7 +104,7 @@ get.or.create.plot.id.con <- function(con, pid, plot.name, range.id=NULL) {
>    }
>    res <- dbGetQuery(con, str_c(query, ";"))
>  
> -  if (length(res) < 1) {
> +  if (nrow(res) < 1) {
>      ## Plot ID is not assigned yet, create one
>      res <- get.clear.plot.id.con(con, pid, plot.name, range.id)
>    } else {
> @@ -125,7 +125,7 @@ get.revision.id <- function(conf, tag) {
>                      str_c("SELECT id FROM release_timeline WHERE projectId=",
>                            conf$pid, " AND tag=", sq(tag), " AND type='release'"))
>  
> -  if (length(res) > 1) {
> +  if (nrow(res) > 1) {
>      stop("Internal error: Revision if for tag ", tag, " (project ", conf$project,
>           ") appears multiple times in DB!")
>    }

However, I don't really understand why it worked before in this case...
@Mitchell: I'll push the patch to master unless you object, but can you
please try to understand why we did not run into problems earlier?

Thanks, Wolfgang
