[codeface] Re: [PATCH] Remove quotation and comma characters from email authors

  • From: Wolfgang Mauerer <wolfgang.mauerer@xxxxxxxxxxxxxxxxx>
  • To: <codeface@xxxxxxxxxxxxx>, Mitchell Joblin <joblin.m@xxxxxxxxx>
  • Date: Thu, 12 Nov 2015 20:10:45 +0100



On 12/11/2015 at 16:39, Mitchell Joblin wrote:

- Some author lines are "bob, smith" <bob.smith@xxxxxxxx>

Are they "Huber, Jacqueline <jacqueline.huber@xxxxxxxx>",
or really "Jacqueline, Huber <jacqueline.huber@xxxxxxxx>"?
In southern Germany and Austria, it's common to write
"surname, given name" instead of "given name surname" (I don't know
whether this convention is in use anywhere else in the world; for
sorting names, it certainly is used in many places). If most of the
occurrences on mailing lists are of the first form, we should also
consider swapping the two names as part of the fixup.
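
To make the swapping idea concrete, here is a minimal sketch in base R. It assumes the author string holds only the name part (no address) and contains exactly one comma; swap.surname.first is a hypothetical helper for illustration, not an existing codeface function.

```r
## Hypothetical helper (not part of codeface): turn "Surname, Given" into
## "Given Surname". Strings without exactly one comma are returned unchanged
## apart from having their double quotes removed.
swap.surname.first <- function(author) {
  ## Drop double quotes so that "\"Huber, Jacqueline\"" is handled as well
  author <- gsub("\"", "", author)
  ## Two comma-free groups separated by a single comma; swap them
  swapped <- sub("^\\s*([^,]+?)\\s*,\\s*([^,]+?)\\s*$", "\\2 \\1",
                 author, perl=TRUE)
  return(swapped)
}

swap.surname.first("Huber, Jacqueline")      ## "Jacqueline Huber"
swap.surname.first("\"Huber, Jacqueline\"")  ## "Jacqueline Huber"
swap.surname.first("Jacqueline Huber")       ## "Jacqueline Huber"
```

Whether swapping is the right default still depends on which form dominates in the actual mailing list data, so this would only be worth adding after checking a few corpora.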

Signed-off-by: Mitchell Joblin <mitchell.joblin.ext@xxxxxxxxxxx>
---
codeface/R/ml/analysis.r | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/codeface/R/ml/analysis.r b/codeface/R/ml/analysis.r
index 68577f5..556511d 100644
--- a/codeface/R/ml/analysis.r
+++ b/codeface/R/ml/analysis.r
@@ -225,6 +225,10 @@ check.corpus.precon <- function(corp.base) {
author <- "unknown"
}

+ ## Remove problematic punctuation characters
+ author <- gsub("\"", " ", author)
Is this supposed to catch things like 'William "Bill" Gates'?
If yes, I suppose this causes problems in a later stage, but
could you document where? Additionally, please document the rationale
for both regexps in the commit description because of the -5 mojo
thing.

Reviewed-by: Wolfgang Mauerer <wolfgang.mauerer@xxxxxxxxxxxxxxxxx>

Thanks, Wolfgang
+ author <- gsub(",", " ", author)
+
## Trim trailing and leading whitespace
author <- str_trim(author)
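
For reference, the effect of the two substitutions on the examples discussed above can be checked in isolation. This is only a sketch: clean.author is a hypothetical wrapper, and base R's trimws() stands in for str_trim() so that the snippet runs without stringr.

```r
## Hypothetical wrapper around the two gsub() calls from the patch.
## trimws() (base R) is used instead of stringr's str_trim().
clean.author <- function(author) {
  ## Replace double quotes and commas with a space each
  author <- gsub("\"", " ", author)
  author <- gsub(",", " ", author)
  ## Trim trailing and leading whitespace
  return(trimws(author))
}

clean.author("\"bob, smith\"")            ## "bob  smith"
clean.author("William \"Bill\" Gates")    ## "William  Bill  Gates"
```

Note that both results keep two consecutive spaces inside the name, because each removed character becomes a space next to an existing one. If later stages compare author strings exactly, collapsing whitespace runs (for instance with gsub("\\s+", " ", author)) might be worth considering; that step is an assumption on my part and not something the patch does.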


