[codeface] Re: AW: Re: AW: Re: AW: Re: AW: Re: AW: Re: AW: Re: AW: Re: Preparing time series data - sloccount analysis

  • From: Mitchell Joblin <joblin.m@xxxxxxxxx>
  • To: codeface@xxxxxxxxxxxxx
  • Date: Thu, 5 Mar 2015 14:05:19 +0000

On Thu, Mar 5, 2015 at 1:50 PM, Wolfgang Mauerer <wm@xxxxxxxxxxxxxxxx> wrote:
>
>
> On 05/03/2015 at 14:19, Matthias Gemmer wrote:
>>> From: codeface-bounce@xxxxxxxxxxxxx <codeface-bounce@xxxxxxxxxxxxx> on
>>> behalf of Mitchell Joblin <joblin.m@xxxxxxxxx>
>>> Sent: Thursday, 5 March 2015 13:14
>>> To: Wolfgang Mauerer
>>> Cc: codeface@xxxxxxxxxxxxx
>>> Subject: [codeface] Re: AW: Re: AW: Re: AW: Re: AW: Re: AW: Re: AW: Re:
>>> Preparing time series data - sloccount analysis
>>>
>>> On Thu, Mar 5, 2015 at 11:12 AM, Wolfgang Mauerer
>>> <wolfgang.mauerer@xxxxxxxxxxx> wrote:
>>>> On 05.03.2015 12:04, Matthias Gemmer wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Browse[1]> print(plot.id)
>>>>>>>>>>>>> numeric(0)
>>>>>>>>>>>>
>>>>>>>>>>>> so that's the culprit... There is no valid plot ID for the time
>>>>>>>>>>>> series in the database. Can you please check that an appropriate
>>>>>>>>>>>> table is available in the database?
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> There is a table called timeseries with the column plotId.
>>>>>>>>>>> mysql> DESCRIBE timeseries;
>>>>>>>>>>> +--------------+------------+------+-----+---------+-------+
>>>>>>>>>>> | Field        | Type       | Null | Key | Default | Extra |
>>>>>>>>>>> +--------------+------------+------+-----+---------+-------+
>>>>>>>>>>> | plotId       | bigint(20) | NO   | MUL | NULL    |       |
>>>>>>>>>>> | time         | datetime   | NO   |     | NULL    |       |
>>>>>>>>>>> | value        | double     | NO   |     | NULL    |       |
>>>>>>>>>>> | value_scaled | double     | YES  |     | NULL    |       |
>>>>>>>>>>> +--------------+------------+------+-----+---------+-------+
>>>>>>>>>>> 4 rows in set (0.02 sec)
>>>>>>>>>>>
>>>>>>>>>>> The table is also filled with data. The table contains datasets for
>>>>>>>>>>> plotId=5, plotId=6, plotId=7 and plotId=8.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Which values do sloccount.plot.id (and understand.plot.id) have
>>>>>>>>>>>> in do.complexity.analysis (Frame 3/4)?
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> The values for sloccount.plot.id and understand.plot.id are
>>>>>>>>>>> obviously invalid.
>>>>>>>>>>>
>>>>>>>>>>> Browse[1]> print(sloccount.plot.id)
>>>>>>>>>>> numeric(0)
>>>>>>>>>>> Browse[1]> print(understand.plot.id)
>>>>>>>>>>> numeric(0)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> it was not so obvious to me; I was trying to ensure that
>>>>>>>>>> parallelisation did not introduce any issues here. But your
>>>>>>>>>> observation clarified that this is not the case.
>>>>>>>>>>
>>>>>>>>>> Since the error seems to be deterministically reproducible at your
>>>>>>>>>> site, can you debug around the creation of the index (for instance by
>>>>>>>>>> printing out what's going on; alternatively, you could also use the
>>>>>>>>>> built-in debugger)?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> In the file codeface/R/complexity.r:
>>>>>>>>>
>>>>>>>>> Assignment of sloccount.plot.id and understand.plot.id:
>>>>>>>>>    ## Obtain a plot IDs for the sloccount and understand raw time series before
>>>>>>>>>    ## parallel processing commences to avoid race conditions
>>>>>>>>>    sloccount.plot.id <- get.or.create.plot.id(conf, "sloccount")
>>>>>>>>>    understand.plot.id <- get.or.create.plot.id(conf, "understand_raw")
>>>>>>>>>        -> sloccount.plot.id and understand.plot.id have the value "x".
>>>>>>>>>               Are these values feasible? Or shall I have a closer look
>>>>>>>>>               at the function 'get.or.create.plot.id'?
>>>>>>>>
>>>>>>>>
>>>>>>>> since the SQL specification for the plot ID is
>>>>>>>>
>>>>>>>> `id` BIGINT NOT NULL AUTO_INCREMENT
>>>>>>>>
>>>>>>>> the value "x" seems quite impossible. Can you please query your
>>>>>>>> database to see what value is stored there?
>>>>>>>>
>>>>>>>
>>>>>>> The table is empty.
>>>>>>> mysql> select * from plots;
>>>>>>> Empty set (0.01 sec)
>>>>>>
>>>>>>
>>>>>> please try to run the other SQL statements produced by the code to see
>>>>>> why no entry is created. get.or.create.plot.id() inserts a new entry
>>>>>> into the table if no ID for a desired plot is available.
>>>>>
>>>>>
>>>>> The branch which creates a plot ID is not entered. The condition
>>>>> 'length(res) < 1' is not satisfied in either case (sloccount.plot.id
>>>>> or understand.plot.id).
>>>>>
>>>>> For sloccount.plot.id <- get.or.create.plot.id(conf, "sloccount"):
>>>>>    res <- dbGetQuery(con, str_c(query, ";"))
>>>>>    # str_c(query, ";"): SELECT id FROM plots WHERE name='sloccount' AND projectId=2;
>>>>>    # res: "id"
>>>>>    # length(res): 1
>>>>>    if (length(res) < 1) {
>>>>>      ## Plot ID is not assigned yet, create one
>>>>>      res <- get.clear.plot.id.con(con, pid, plot.name, range.id)
>>>>>    } else {
>>>>>      res <- res$id
>>>>>    }
>>>>>    # res: "x"
>>>>
>>>>
>>>> @Mitchell, could you try to reproduce this? I don't see why a result
>>>> with non-zero length should be returned from the SQL query if the
>>>> database is empty.
>>>
>>> The SQL query probably returns a data frame, and length(..) called on a
>>> data frame returns the number of columns, not the number of rows. To get
>>> the number of rows of a data frame you should use nrow(..) instead of
>>> length(..).
>>>
>>> --Mitchell
>>>
>>
>> That worked for me.
>> After replacing 'length' with 'nrow' a new plot ID is created!
>
> The following patch should fix this for good then:
>
>> diff --git a/codeface/R/db.r b/codeface/R/db.r
>> index db53811..32da240 100644
>> --- a/codeface/R/db.r
>> +++ b/codeface/R/db.r
>> @@ -59,10 +59,10 @@ get.clear.plot.id.con <- function(con, pid, plot.name, range.id=NULL,
>>
>>    res <- dbGetQuery(con, str_c("SELECT id", query))
>>
>> -  if (length(res) != 1) {
>> +  if (nrow(res) != 1) {
>>      stop("Internal error: Plot ", plot.name, " appears multiple times in DB",
>>           "for project ID ", pid)
>> -  }
>> +}
>>
>>    return(res$id)
>>  }
>> @@ -81,7 +81,7 @@ get.plot.id.con <- function(con, pid, plot.name, range.id=NULL) {
>>      query <- str_c(query, " AND releaseRangeId=", range.id)
>>    }
>>    res <- dbGetQuery(con, str_c("SELECT id", query))
>> -  if (length(res) < 1) {
>> +  if (nrow(res) < 1) {
>>      stop("Internal error: Plot ", plot.name, " not found in DB",
>>           " for project ID ", pid)
>>    }
>> @@ -104,7 +104,7 @@ get.or.create.plot.id.con <- function(con, pid, plot.name, range.id=NULL) {
>>    }
>>    res <- dbGetQuery(con, str_c(query, ";"))
>>
>> -  if (length(res) < 1) {
>> +  if (nrow(res) < 1) {
>>      ## Plot ID is not assigned yet, create one
>>      res <- get.clear.plot.id.con(con, pid, plot.name, range.id)
>>    } else {
>> @@ -125,7 +125,7 @@ get.revision.id <- function(conf, tag) {
>>                      str_c("SELECT id FROM release_timeline WHERE projectId=",
>>                            conf$pid, " AND tag=", sq(tag), " AND type='release'"))
>>
>> -  if (length(res) > 1) {
>> +  if (nrow(res) > 1) {
>>      stop("Internal error: Revision if for tag ", tag, " (project ", conf$project,
>>           ") appears multiple times in DB!")
>>    }
>
> However, I don't really understand why it worked before in this case...
> @Mitchell: I'll push the patch to master unless you object, but can you
> please try to understand why we did not run into problems earlier?

Please push to master after running the test suite; otherwise, please
open a pull request and I will test it before merging.
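
For anyone hitting the same symptom, the length()/nrow() difference discussed
above can be reproduced without a database. This is only a minimal sketch: the
data frame below is a made-up stand-in for a dbGetQuery() result, and its
shape (one "id" column, zero rows) is an assumption about what the driver
returns for an empty result set.

```r
## Hypothetical stand-in for the result of a query that matches no rows:
## one column named "id", zero rows. No database connection is involved.
empty.res <- data.frame(id=numeric(0))

## length() on a data frame counts its columns, nrow() counts its rows
n.cols <- length(empty.res)
n.rows <- nrow(empty.res)

## The old condition is FALSE for an empty result, so the "create" branch
## is skipped; the nrow()-based condition is TRUE and the branch is entered
old.check <- length(empty.res) < 1
new.check <- nrow(empty.res) < 1

## Falling through to res$id on the empty result yields an empty vector,
## which matches the numeric(0) seen in the debugger output in this thread
id <- empty.res$id
```

With the old check the code goes straight to res$id and hands an empty
vector to the caller, which would explain why no row was ever inserted into
the plots table.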

Thanks,

Mitchell

>
> Thanks, Wolfgang
>
