[codeface] Re: [PATCH 6/6] Try to improve email detection heuristics

  • From: Wolfgang Mauerer <wolfgang.mauerer@xxxxxxxxxxxxxxxxx>
  • To: <codeface@xxxxxxxxxxxxx>
  • Date: Mon, 23 Nov 2015 11:23:33 +0100



On 23/11/2015 at 11:20, Claus Hunsen wrote:

Can you incorporate a reference to issue #34 into the commit
message? E.g., "cf. issue #34".

This suggestion is mainly for documentation purposes: GitHub will
add a cross-reference to the issue and will link that part of the
commit message directly to the issue page. You can even close
tickets via the commit message, if appropriate [1].

Oh, I did not know that. Looks cool -- I will add a reference. Does
anyone know of standard formats established for this purpose? I think
Apache projects have some.

Thanks, Wolfgang

In the end, this would likely encourage everybody to use the issue
tracker on GitHub, in addition to the mailing list, to document or
report any issues they find.

[1] https://help.github.com/articles/closing-issues-via-commit-messages/

Best, Claus

On 21.11.2015 at 21:46, Wolfgang Mauerer wrote:
Make two attempts at fixing incorrect email addresses:

* Replace ' at ' and ' AT ' earlier. This allows us to parse
  things like "Hans Huber <huber at hubercorp.com>"
* When names are given in parentheses, strip the parens. This
  allows us to parse things like hans@xxxxxxxxxxxxx (Hans Huber)

Also fix some minor spacing issues while at it, and improve the
source code documentation a bit.

Based on suggestions by Claus Hunsen.

Signed-off-by: Wolfgang Mauerer <wolfgang.mauerer@xxxxxxxxxxxxxxxxx>
Tested-by: Claus Hunsen <claus.hunsen@xxxxxxxxxxxxxxxxx>
---
 codeface/R/ml/analysis.r | 18 ++++++++++--------
 codeface/R/ml/ml_utils.r | 12 ++++++++++++
 2 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/codeface/R/ml/analysis.r b/codeface/R/ml/analysis.r
index 53a7335..98895c7 100644
--- a/codeface/R/ml/analysis.r
+++ b/codeface/R/ml/analysis.r
@@ -226,6 +226,11 @@ check.corpus.precon <- function(corp.base) {
     ## Trim trailing and leading whitespace
     author <- str_trim(author)
 
+    ## Replace textual ' at ' with @, sometimes
+    ## we can recover an email
+    author <- sub(' at ', '@', author)
+    author <- sub(' AT ', '@', author)
+
     ## Check if email exists
     email.exists <- grepl("<.+>", author, TRUE)
 
@@ -234,11 +239,6 @@ check.corpus.precon <- function(corp.base) {
                    "<xxxyyy@xxxxxxx>); attempting to recover from: ", author)
       logdevinfo(msg, logger="ml.analysis")
 
-      ## Replace textual ' at ' with @, sometimes
-      ## we can recover an email
-      author <- sub(' at ', '@', author)
-      author <- sub(' AT ', '@', author)
-
       ## Check for @ symbol
       r <- regexpr("\\S+@\\S+", author, TRUE)
       email <- substr(author, r, r + attr(r,"match.length")-1)
@@ -258,7 +258,7 @@ check.corpus.precon <- function(corp.base) {
         ## string minus the new email part as name, and construct
         ## a valid name/email combination
         name <- sub(email, "", author, fixed=TRUE)
-        name <- str_trim(name)
+        name <- fix.name(name)
       }
 
       ## Name and author are now given in both cases, construct
@@ -266,13 +266,15 @@ check.corpus.precon <- function(corp.base) {
       author <- paste(name, ' <', email, '>', sep="")
     } else {
-      ## Verify that the order is correct
+      ## There is a correct email address. Ensure that the order is correct
+      ## and fix cases like "<hans.huber@xxxxxxxxxxxxx> Hans Huber"
+      ## Get email and name parts
       r <- regexpr("<.+>", author, TRUE)
       if(r[[1]] == 1) {
         email <- substr(author, r, r + attr(r,"match.length")-1)
         name <- sub(email, "", author, fixed=TRUE)
-        name <- str_trim(name)
+        name <- fix.name(name)
         email <- str_trim(email)
         author <- paste(name,email)
       }
diff --git a/codeface/R/ml/ml_utils.r b/codeface/R/ml/ml_utils.r
index a733972..2cb9d80 100644
--- a/codeface/R/ml/ml_utils.r
+++ b/codeface/R/ml/ml_utils.r
@@ -437,3 +437,15 @@ ml.thread.loc.to.glob <- function(ml.id.map, loc.id) {
 
   return(global.id)
 }
+
+## Given a name with leading and pending whitespace that is possibly
+## surrounded by braces, return the name proper.
+fix.name <- function(name) {
+  name <- str_trim(name)
+  if (substr(name, 1, 1) == "(" && substr(name, str_length(name),
+                                          str_length(name)) == ")") {
+    name <- substr(name, 2, str_length(name)-1)
+  }
+
+  return (name)
+}
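
For anyone who wants to try the heuristics from the patch outside of
codeface, the following is a minimal standalone sketch, not the code
from the repository. It assumes base R only: trimws() and nchar() stand
in for stringr's str_trim() and str_length(), normalise.author() is a
hypothetical wrapper introduced here for illustration, and the
example.com addresses are placeholders. Unlike the patch, it rebuilds
the "Name <email>" form unconditionally, which gives the same result
when the name already comes first.

```r
## Illustrative sketch only. fix.name mirrors the helper added in the
## patch, but uses base R instead of stringr (an assumption made so the
## example runs without extra packages).
fix.name <- function(name) {
  name <- trimws(name)
  n <- nchar(name)
  ## Strip one pair of enclosing parentheses, e.g. "(Hans Huber)"
  if (n >= 2 && substr(name, 1, 1) == "(" && substr(name, n, n) == ")") {
    name <- substr(name, 2, n - 1)
  }
  return(name)
}

## Hypothetical wrapper (not part of codeface) that applies the
## heuristics to a single author string.
normalise.author <- function(author) {
  author <- trimws(author)

  ## Replace the first textual ' at ' or ' AT ' with '@' before any
  ## other check, as the patch does
  author <- sub(" at ", "@", author, fixed=TRUE)
  author <- sub(" AT ", "@", author, fixed=TRUE)

  if (grepl("<.+>", author)) {
    ## An address in angle brackets exists: extract it, treat the rest
    ## as the name, and emit the name first
    r <- regexpr("<.+>", author)
    start <- r[[1]]
    email <- substr(author, start, start + attr(r, "match.length") - 1)
    name <- fix.name(sub(email, "", author, fixed=TRUE))
    return(paste(name, email))
  }

  ## No angle brackets: look for a bare token containing '@'
  r <- regexpr("\\S+@\\S+", author)
  start <- r[[1]]
  if (start == -1) {
    return(author)  # nothing recoverable in this sketch
  }
  email <- substr(author, start, start + attr(r, "match.length") - 1)
  name <- fix.name(sub(email, "", author, fixed=TRUE))
  return(paste(name, " <", email, ">", sep=""))
}

normalise.author("Hans Huber <huber at hubercorp.com>")
normalise.author("hans@example.com (Hans Huber)")
normalise.author("<hans.huber@example.com> Hans Huber")
```

All three calls should yield a string of the form "Hans Huber <...>".
Note that sub() only replaces the first ' at ', so a name that itself
contains the word "at" surrounded by spaces would be mangled; the
sketch does not try to guard against that.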


