[codeface] Re: [PATCH 1/2] Remove move problematic characters from the email authors

  • From: Wolfgang Mauerer <wolfgang.mauerer@xxxxxxxxxxxxxxxxx>
  • To: <codeface@xxxxxxxxxxxxx>
  • Date: Tue, 15 Dec 2015 16:48:52 +0100



On 11/12/2015 at 18:32, Mitchell Joblin wrote:

- Parentheses in the author name cause the id service to return
NA for the person ids

Looks good to me -- there might be names like
Josef "Joe" Kaeser, but stripping the quotes would still
leave them unique.

Reviewed-by: Wolfgang Mauerer <wolfgang.mauerer@xxxxxxxxxxxxxxxxx>

Signed-off-by: Mitchell Joblin <mitchell.joblin.ext@xxxxxxxxxxx>
---
codeface/R/ml/analysis.r | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/codeface/R/ml/analysis.r b/codeface/R/ml/analysis.r
index 9f13d29..9c10403 100644
--- a/codeface/R/ml/analysis.r
+++ b/codeface/R/ml/analysis.r
@@ -227,8 +227,10 @@ check.corpus.precon <- function(corp.base) {
}

## Remove problematic punctuation characters
- author <- gsub("\"", " ", author)
- author <- gsub(",", " ", author)
+ problem.characters <- c("\"", ",", "\\(", "\\)")
+ for (p.char in problem.characters) {
+ author <- gsub(p.char, " ", author)
+ }

## Trim trailing and leading whitespace
author <- str_trim(author)
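
As a rough illustration of what the loop in the hunk above does, here is a minimal standalone sketch. The function names clean.author and clean.author.class are hypothetical and not part of codeface, and base R's trimws() stands in for str_trim() so that the snippet runs without stringr. The second function shows an assumed equivalent using a single character class; it is not what the patch applies.

```r
## Hypothetical helper mirroring the patched loop: each entry of
## problem.characters is a regular expression, so the parentheses
## need escaping ("\\(" and "\\)").
clean.author <- function(author) {
  problem.characters <- c("\"", ",", "\\(", "\\)")
  for (p.char in problem.characters) {
    author <- gsub(p.char, " ", author)
  }
  ## trimws() is base R; the original code uses stringr's str_trim()
  trimws(author)
}

## Assumed alternative: one pass with a character class. Inside
## brackets the parentheses are literal and need no escaping.
clean.author.class <- function(author) {
  trimws(gsub("[\",()]", " ", author))
}

## Made-up sample names, for illustration only
samples <- c("Josef \"Joe\" Kaeser", "Doe, Jane (ACME)")
print(clean.author(samples))
print(identical(clean.author(samples), clean.author.class(samples)))
```

Each problematic character becomes a space rather than being deleted, so runs of spaces can remain inside the name; only leading and trailing whitespace is trimmed.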

