[codeface] Re: [PATCH 2/2] Change developer classification to use new db query

  • From: Wolfgang Mauerer <wolfgang.mauerer@xxxxxxxxxxxxxxxxx>
  • To: <codeface@xxxxxxxxxxxxx>, "Joblin, Mitchell (ext)" <mitchell.joblin.ext@xxxxxxxxxxx>
  • Date: Tue, 20 Oct 2015 12:57:22 +0200

Hi Mitchell,

On 15/10/2015 at 10:48, Mitchell Joblin wrote:

- For the project evolution analysis we need to make many
large queries regarding the commits in parallel and it
far more efficient to do the aggregation in database

"far" -> "is far". Tiny details seem to jump out at me today.

Signed-off-by: Mitchell Joblin <mitchell.joblin.ext@xxxxxxxxxxx>
---
codeface/R/developer_classification.r | 10 +++++-----
codeface/R/test_developer_classification.r | 3 ++-
2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/codeface/R/developer_classification.r b/codeface/R/developer_classification.r
index 67cf035..84a24b5 100644
--- a/codeface/R/developer_classification.r
+++ b/codeface/R/developer_classification.r
@@ -10,17 +10,17 @@ source("query.r")
## the structural complexity introduced by core and peripheral
## developers in free software projects.
get.developer.class.con <- function(con, project.id, start.date, end.date) {
- commit.df <- get.commits.by.date.con(con, project.id, start.date, end.date)
- developer.class <- get.developer.class(commit.df)
+ commit.count.df <- get.commits.by.date.con(con, project.id, start.date, end.date,
+ commit.count=TRUE)
+ developer.class <- get.developer.class(commit.count.df)

return(developer.class)
}

## Low-level function to compute classification
-get.developer.class <- function(commit.df, threshold=0.8) {
- author.commit.count <- count(commit.df, "author")
+get.developer.class <- function(author.commit.count, threshold=0.8) {
author.commit.count <- author.commit.count[order(-author.commit.count$freq),]
- num.commits <- nrow(commit.df)
+ num.commits <- sum(author.commit.count$freq)

I overlooked one thing in the last commit: when you perform
"SELECT author, COUNT(*) as freq FROM commit (...) GROUP BY author",
what you get is not a frequency, but rather an absolute count.
I was briefly confused by "freq" here; maybe another name would be
more appropriate? For instance, the column could be renamed
freq -> num.commits, and the local variable would then change
accordingly, num.commits -> num.commits.total or something similar.
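
To make the naming suggestion concrete, here is a rough sketch of how the
low-level function might read after such a rename. It is not tested against
the codeface tree: the column name num.commits, the variable
num.commits.total, the "peripheral" list element and the sample data are
assumptions for illustration. Only the sorting and threshold logic is taken
from the hunk shown here.

```r
## Sketch only: num.commits, num.commits.total and the peripheral element
## are hypothetical names, not part of the posted patch.
get.developer.class <- function(author.commit.count, threshold=0.8) {
  ## Sort authors by absolute commit count, largest first
  author.commit.count <-
    author.commit.count[order(-author.commit.count$num.commits),]
  num.commits.total <- sum(author.commit.count$num.commits)

  ## Authors whose cumulative count stays strictly below the
  ## threshold are classified as core
  commit.threshold <- round(threshold * num.commits.total)
  core.test <- cumsum(author.commit.count$num.commits) < commit.threshold

  return(list(core=author.commit.count[core.test,],
              peripheral=author.commit.count[!core.test,]))
}

## Hypothetical input, shaped like the result of the GROUP BY query
counts <- data.frame(author=c("a", "b", "c", "d"),
                     num.commits=c(50, 30, 15, 5))
developer.class <- get.developer.class(counts)
```

With these made-up counts only author "a" lands in the core group: the
comparison is strict, and the cumulative count for "b" falls exactly on
the threshold of 80.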

commit.threshold <- round(threshold * num.commits)
core.test <- cumsum(author.commit.count$freq) < commit.threshold
core.developers <- author.commit.count[core.test,]
diff --git a/codeface/R/test_developer_classification.r b/codeface/R/test_developer_classification.r
index a4b9071..a3903bb 100644
--- a/codeface/R/test_developer_classification.r
+++ b/codeface/R/test_developer_classification.r
@@ -6,7 +6,8 @@ get.developer.class.test <- function() {
sample.size <- 1000

commit.df <- data.frame(author=sample(1:50, size=sample.size, replace=T))
- developer.class <- get.developer.class(commit.df, threshold)
+ author.commit.count <- count(commit.df, "author")
+ developer.class <- get.developer.class(author.commit.count, threshold)
res <- sum(developer.class$core$freq) < threshold*sample.size
return(res)
}


Looks good to me.

Reviewed-by: Wolfgang Mauerer <wolfgang.mauerer@xxxxxxxxxxxxxxxxx>
