[codeface] Re: [PATCH] Remove quotation and comma characters from email authors

  • From: Mitchell Joblin <joblin.m@xxxxxxxxx>
  • To: Wolfgang Mauerer <wolfgang.mauerer@xxxxxxxxxxxxxxxxx>, codeface@xxxxxxxxxxxxx
  • Date: Thu, 12 Nov 2015 20:08:34 +0000

Wolfgang Mauerer <wolfgang.mauerer@xxxxxxxxxxxxxxxxx> wrote on Thu, 12
Nov 2015 20:15:

On 12/11/2015 at 16:39, Mitchell Joblin wrote:

- Some author lines are "bob, smith" <bob.smith@xxxxxxxx>

Are they "Huber, Jacqueline <jacqueline.huber@xxxxxxxx>",
or really "Jacqueline, Huber <jacqueline.huber@xxxxxxxx>"?
In southern Germany and Austria, it's common to specify
"surname, name" instead of "name surname" (don't know if this
convention is in use anywhere else in the world; for sorting
names, it certainly is used in many places). If the majority of
findings on mailing lists is of the first form, we should consider also
swapping the two names as part of the fixup.

Not sure about that. I just saw that the analysis breaks if commas are
maintained in the name because the id service doesn't assign them ids. I
will have a closer look tomorrow.
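
For reference, the swap discussed above could look roughly like the sketch
below. It assumes author strings of the form "surname, name <address>";
swap.surname.first is a hypothetical helper, not an existing Codeface
function, and the address in the usage line is a placeholder. Whether the
swap is wanted at all depends on which of the two forms dominates on the
mailing lists.

```r
## Hypothetical helper (not part of Codeface): rewrite
## "surname, name <address>" as "name surname <address>".
## Strings without a comma in the name part are returned unchanged
## (apart from quote removal and trimming).
swap.surname.first <- function(author) {
  ## Drop double quotes that may surround the name part
  author <- gsub("\"", "", author, fixed=TRUE)

  ## Group 1: surname, group 2: given name, group 3: optional "<address>"
  pattern <- "^\\s*([^,<]+?)\\s*,\\s*([^,<]+?)\\s*((?:<.*)?)$"
  author <- sub(pattern, "\\2 \\1 \\3", author, perl=TRUE)

  ## Remove the trailing blank left behind when no address is present
  trimws(author)
}

swap.surname.first("Huber, Jacqueline <jacqueline.huber@example.org>")
```

Names that are already in "name surname" order contain no comma, so the
pattern does not match and they pass through untouched.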



Signed-off-by: Mitchell Joblin <mitchell.joblin.ext@xxxxxxxxxxx>
---
codeface/R/ml/analysis.r | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/codeface/R/ml/analysis.r b/codeface/R/ml/analysis.r
index 68577f5..556511d 100644
--- a/codeface/R/ml/analysis.r
+++ b/codeface/R/ml/analysis.r
@@ -225,6 +225,10 @@ check.corpus.precon <- function(corp.base) {
author <- "unknown"
}

+ ## Remove problematic punctuation characters
+ author <- gsub("\"", " ", author)
Is this supposed to catch things like 'William "Bill" Gates'?

No, it wasn't meant for a case like the one above. In some cases there were
quotes around the whole name (first and last). I don't think that breaks
anything, but I think the name entered in the database then also contains
the quotes.


If yes, I suppose this causes problems in a later stage, but
could you document where? Additionally, please document the rationale
for both regexps in the commit description because of the -5 mojo
thing.

Right! I'll add a better description.

Thanks,

Mitchell

Reviewed-by: Wolfgang Mauerer <wolfgang.mauerer@xxxxxxxxxxxxxxxxx>

Thanks, Wolfgang
+ author <- gsub(",", " ", author)
+
## Trim trailing and leading whitespace
author <- str_trim(author)
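
To see the effect of the two substitutions in isolation, here is a minimal
standalone sketch. sanitize.author is a hypothetical wrapper, not a function
in analysis.r; it uses base R trimws() in place of stringr's str_trim() so
that it runs without extra packages, and it additionally collapses repeated
blanks, which the patch itself does not do.

```r
## Hypothetical wrapper (not part of Codeface) around the two gsub() calls
## from the patch, followed by whitespace cleanup.
sanitize.author <- function(author) {
  ## Remove problematic punctuation characters (as in the patch)
  author <- gsub("\"", " ", author, fixed=TRUE)
  author <- gsub(",", " ", author, fixed=TRUE)

  ## Assumption beyond the patch: collapse the runs of blanks that the
  ## substitutions leave behind
  author <- gsub("[[:space:]]+", " ", author)

  ## Trim trailing and leading whitespace
  trimws(author)
}

sanitize.author("\"bob, smith\" <bob.smith@example.org>")
```

Note that a nickname in quotes, as in 'William "Bill" Gates', loses its
quotes as well; the words themselves are kept.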

