[codeface] Re: [PATCH 10/26] Fix problem with NULL authors in mailing list analysis

  • From: Thomas Bock <bockthom@xxxxxxxxxxxxxxxxx>
  • To: codeface@xxxxxxxxxxxxx
  • Date: Wed, 30 Nov 2016 10:34:09 +0100

On 24.11.2016 at 11:06, Wolfgang Mauerer wrote:

On 13/10/2016 at 17:29, Claus Hunsen wrote:
From: Thomas Bock <bockthom@xxxxxxxxxxxxxxxxx>

Within mailing list analysis some persons with name 'NULL' could occur in the 
database due to some parsing problems. The fix considers two cases:
- only email address in angle brackets is provided. Then just use the first 
part of the email as name.
- name looks like an email address. In that case also use only the first part 
of that as name in order to avoid parsing problems.
can you please limit the length of these lines to 75-80 chars?
Depending on the commit history browser, such long lines can lead
to an inconvenient display of the message.

I'm not sure about the statement "in order to avoid parsing
problems." -- the commit makes sure that in this case no
NULL value is written into the DB, but would the parser otherwise
cause issues, or are you referring to an error that's later
on caused by the NULL value?
If an author's name looks like an email address, then without the fix
the name is treated as the email address and the actual email address
is not parsed. That is the "parsing problem" I meant. As a result, the
DB ends up with a NULL value for the name, and the name itself is
stored as the email address.

The fix avoids that by treating only the local part of the email-like
name as the name. The actual email address is then parsed as the email
address, and the name is no longer NULL.
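
To illustrate the two cases, here is a minimal base-R sketch. It is not
the code from the patch: the helper names name_from_local_part() and
split_author_sketch() are hypothetical, the example.org addresses are
made up, and unlike the patch the sketch keeps the angle brackets
around the email.

```r
## Hypothetical helper (not part of codeface): derive a display name from
## the local part of an email-like string.
name_from_local_part <- function(s) {
  s <- gsub("^<|>$", "", s)           # drop surrounding angle brackets, if any
  local.part <- sub("@.*", "", s)     # keep everything before the first "@"
  gsub("\\.", " ", local.part)        # turn dots into spaces
}

## Hypothetical sketch of the two cases described above.
split_author_sketch <- function(author) {
  r <- regexpr("<.+>", author)
  start <- r[[1]]
  if (start < 1) {
    return(NULL)  # no address in angle brackets; not covered by this sketch
  }
  len <- attr(r, "match.length")
  email <- substr(author, start, start + len - 1)
  name <- trimws(sub(email, "", author, fixed = TRUE))

  if (name == "") {
    ## Case 1: only "<local@domain>" was provided
    name <- name_from_local_part(email)
  } else if (grepl("^\\S+@\\S+$", name)) {
    ## Case 2: the name itself looks like an email address
    name <- name_from_local_part(name)
  }
  list(name = name, email = email)
}

split_author_sketch("<hans.huber@example.org>")$name
## "hans huber"
split_author_sketch("hans.huber@example.org <hans.huber@example.org>")$name
## "hans huber"
split_author_sketch("Hans Huber <hans.huber@example.org>")$name
## "Hans Huber"
```

In all three cases the name is a non-empty string, so nothing that
would end up as NULL in the DB is produced.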


Signed-off-by: Thomas Bock <bockthom@xxxxxxxxxxxxxxxxx>
  codeface/R/ml/analysis.r | 34 +++++++++++++++++++++++++++++++---
  1 file changed, 31 insertions(+), 3 deletions(-)

diff --git a/codeface/R/ml/analysis.r b/codeface/R/ml/analysis.r
index 3218864..51c3ac9 100644
--- a/codeface/R/ml/analysis.r
+++ b/codeface/R/ml/analysis.r
@@ -299,14 +299,42 @@ check.corpus.precon <- function(corp.base) {
        ## Get email and name parts
        r <- regexpr("<.+>", author, TRUE)
        if(r[[1]] == 1) {
see below...
-        email <- substr(author, r, r + attr(r,"match.length")-1)
-        name <- sub(email, "", author, fixed=TRUE)
-        name <- fix.name(name)
+        ## Check if only an email is provided
+        if(attr(r, "match.length") == nchar(author)) {
see below...
+          ## Only an email like "<hans.huber@xxxxxxxxxxxxx>" is provided
+          email <- substr(author, r+1, r + nchar(author)-2)
+          name <- gsub("\\.", " ",gsub("@.*", "", email))
+        } else {
+          ## email and name both are provided
+          email <- substr(author, r, r + attr(r,"match.length")-1)
+          name <- sub(email, "", author, fixed=TRUE)
+          name <- fix.name(name)
+        }
          email <- str_trim(email)
          author <- paste(name,email)
+ ## Check if name looks like an email address.
+    ## Since that causes parsing problems, use only the local part of an
+    ## email address as name.
+    ## Get email and name parts first
+    r <- regexpr("<.+>", author, TRUE)
+    if(r[[1]] >= 1) {
... here: Can you please use only one of attr(r, "match.length")
or r[[1]] (the former preferred)? Otherwise, it is not easy to see
at first glance that the statements compute identical things.

+      email <- substr(author, r, r + attr(r,"match.length")-1)
nitpick: ,<space> (admittedly, there are tons of this in the source
base, but let's try to keep new code clean)

+      name <- sub(email, "", author, fixed=TRUE)
+      name <- fix.name(name)
+      if(regexpr("\\S+@\\S+", author, TRUE)[1]==1) {
+        ## Name looks like an email address. Use only local part as name.
+        name <- gsub("\\.", " ",gsub("@.*", "", name))
+      }
+      author <- paste(name,email)
+    }

Reviewed-by: Wolfgang Mauerer <wolfgang.mauerer@xxxxxxxxxxxxxxxxx>
