On 13/10/2016 at 17:29, Claus Hunsen wrote:
From: Thomas Bock <bockthom@xxxxxxxxxxxxxxxxx>
Can you please limit the length of these lines to 75-80 chars?
Within mailing list analysis some persons with name 'NULL' could occur in the
database due to some parsing problems. The fix considers two cases:
- only email address in angle brackets is provided. Then just use the first
part of the email as name.
- name looks like an email address. In that case also use only the first part
of that as name in order to avoid parsing problems.
see below...
Signed-off-by: Thomas Bock <bockthom@xxxxxxxxxxxxxxxxx>
---
codeface/R/ml/analysis.r | 34 +++++++++++++++++++++++++++++++---
1 file changed, 31 insertions(+), 3 deletions(-)
diff --git a/codeface/R/ml/analysis.r b/codeface/R/ml/analysis.r
index 3218864..51c3ac9 100644
--- a/codeface/R/ml/analysis.r
+++ b/codeface/R/ml/analysis.r
@@ -299,14 +299,42 @@ check.corpus.precon <- function(corp.base) {
## Get email and name parts
r <- regexpr("<.+>", author, TRUE)
if(r[[1]] == 1) {
- email <- substr(author, r, r + attr(r,"match.length")-1)
see below...
- name <- sub(email, "", author, fixed=TRUE)
- name <- fix.name(name)
+
+ ## Check if only an email is provided
+ if(attr(r, "match.length") == nchar(author)) {
+ ## Only an email like "<hans.huber@xxxxxxxxxxxxx>" is provided
... here: Can you please use only one of attr(r, "match.length")
+ email <- substr(author, r+1, r + nchar(author)-2)
+ name <- gsub("\\.", " ",gsub("@.*", "", email))
+ } else {
+ ## email and name both are provided
+ email <- substr(author, r, r + attr(r,"match.length")-1)
+ name <- sub(email, "", author, fixed=TRUE)
+ name <- fix.name(name)
+ }
+
email <- str_trim(email)
author <- paste(name,email)
}
}
+ ## Check if name looks like an email address.
+ ## Since that causes parsing problems, use only the local part of an
+ ## email address as name.
+
+ ## Get email and name parts first
+ r <- regexpr("<.+>", author, TRUE)
+ if(r[[1]] >= 1) {
+ email <- substr(author, r, r + attr(r,"match.length")-1)
nitpick: ,<space> (admittedly, there are tons of this in the source
+ name <- sub(email, "", author, fixed=TRUE)
+ name <- fix.name(name)
+
+ if(regexpr("\\S+@\\S+", author, TRUE)[1]==1) {
+ ## Name looks like an email address. Use only local part as name.
+ name <- gsub("\\.", " ",gsub("@.*", "", name))
+ }
+ author <- paste(name,email)
+ }
+
return(author)
}
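
For reference, the two cases from the commit message can be tried out in isolation with a minimal base-R sketch like the one below. It is not the codeface implementation: local.part.as.name and normalise.author are hypothetical names, trimws stands in for str_trim and fix.name, and the example.org addresses are placeholders.

```r
## Hypothetical helper (not part of codeface): turn the local part of an
## email address into a display name, e.g. "hans.huber@..." -> "hans huber".
local.part.as.name <- function(email) {
    local <- sub("@.*", "", email)
    gsub(".", " ", local, fixed=TRUE)
}

## Hypothetical, simplified stand-in for the author handling in the patch.
normalise.author <- function(author) {
    author <- trimws(author)
    r <- regexpr("<.+>", author)
    if (r[[1]] < 1) {
        ## No "<...>" part found; leave the string untouched
        return(author)
    }

    len <- attr(r, "match.length")
    email <- substr(author, r, r + len - 1)

    if (r[[1]] == 1 && len == nchar(author)) {
        ## Case 1: only "<local@domain>" is given. Strip the angle
        ## brackets and use the local part as name.
        name <- local.part.as.name(substr(email, 2, nchar(email) - 1))
    } else {
        ## Name and email are both given
        name <- trimws(sub(email, "", author, fixed=TRUE))
        if (grepl("^\\S+@\\S+$", name, perl=TRUE)) {
            ## Case 2: the name itself looks like an email address
            name <- local.part.as.name(name)
        }
    }

    paste(name, email)
}

print(normalise.author("<hans.huber@example.org>"))
print(normalise.author("hans.huber@example.org <hans.huber@example.org>"))
print(normalise.author("Hans Huber <hans.huber@example.org>"))
```

Note that the sketch tests the extracted name against the email pattern, whereas the patch tests the start of the whole author string; for the inputs above both approaches should agree.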