Hi Wolfgang,
thank you very much for the insight and help.
I opened two tickets on GitHub [1][2] to support this thread: one for
the failing e-mail-address processing, the other for the documentation.
[1] https://github.com/siemens/codeface/issues/34
[2] https://github.com/siemens/codeface/issues/35
Further comments on your last e-mail are inline.
Consider the following From lines from the SQLite mbox files:

From: elau1004 at aim.com (Edward Lau)
From: ambrus at math.bme.hu (=?UTF-8?Q?Zsb=C3=A1n_Ambrus?=)

There are three things to note: 1) The author's name is written in
braces behind the e-mail address,

The order will currently be changed, but the name remains in braces.

2) the @-symbol is replaced by " at ", and

This is supported when the email is in a bogus format (for
instance "Hans Huber huber at hubercorp.com"). Interestingly,
we currently don't catch "Hans Huber <huber at hubercorp.com>".

3) the second author name contains some UTF8 encoding.

I'm pretty sure I've dealt with this issue earlier when generating
two-mode communication graphs, but I could not find anything in
the source code. Mitchell, do you know if we take care of this
anywhere? Maybe it is implicitly hidden somewhere. Maybe I just failed to
commit the code.
We convert between encoding <X> and UTF-8 in snatm during the
content_transformer pass, but this will not handle the case
you describe AFAIK.
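
For illustration, the name in the second From line above is an RFC 2047
"encoded word" in Q encoding. A minimal sketch of decoding one such word
in base R follows. decode.q.word is a hypothetical helper and not part of
Codeface or snatm, and it assumes the input is a single encoded word
without the surrounding braces.

```r
## Decode a single RFC 2047 Q-encoded word such as
## "=?UTF-8?Q?Zsb=C3=A1n_Ambrus?=" (hypothetical helper, base R only).
## Input that does not look like a Q-encoded word is returned unchanged.
decode.q.word <- function(x) {
  m <- regmatches(x, regexec("^=\\?([^?]+)\\?[Qq]\\?([^?]*)\\?=$", x))[[1]]
  if (length(m) == 0) {
    return(x)
  }
  charset <- m[2]
  ## In Q encoding, an underscore stands for a space
  payload <- gsub("_", " ", m[3], fixed=TRUE)
  chars <- strsplit(payload, "")[[1]]

  bytes <- raw(0)
  i <- 1
  while (i <= length(chars)) {
    if (chars[i] == "=" && i + 2 <= length(chars)) {
      ## "=XX" encodes one byte given as two hexadecimal digits
      hex <- paste0(chars[i + 1], chars[i + 2])
      bytes <- c(bytes, as.raw(strtoi(hex, 16L)))
      i <- i + 3
    } else {
      bytes <- c(bytes, charToRaw(chars[i]))
      i <- i + 1
    }
  }

  ## Convert from the declared charset to UTF-8
  return(iconv(rawToChar(bytes), from=charset, to="UTF-8"))
}
```

Called on the encoded name above, this should yield the name with the
a with acute; a plain name such as "Edward Lau" is passed through as is.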

As I am not that familiar with the corresponding source-code, my
questions are: 1) Is that pattern already supported by Codeface and
its name service?

No, for the reasons described above. To correct the two deficiencies
I've pointed out above (not considering the encoding question), the
following patch should help. Could you please review it?

diff --git a/codeface/R/ml/analysis.r b/codeface/R/ml/analysis.r
index 53a7335..98895c7 100644
--- a/codeface/R/ml/analysis.r
+++ b/codeface/R/ml/analysis.r
@@ -226,6 +226,11 @@ check.corpus.precon <- function(corp.base) {
## Trim trailing and leading whitespace
author <- str_trim(author)
+ ## Replace textual ' at ' with @, sometimes
+ ## we can recover an email
+ author <- sub(' at ', '@', author)
+ author <- sub(' AT ', '@', author)
+
## Check if email exists
email.exists <- grepl("<.+>", author, TRUE)
@@ -234,11 +239,6 @@ check.corpus.precon <- function(corp.base) {
"<xxxyyy@xxxxxxx>); attempting to recover from: ",
author)
logdevinfo(msg, logger="ml.analysis")
- ## Replace textual ' at ' with @, sometimes
- ## we can recover an email
- author <- sub(' at ', '@', author)
- author <- sub(' AT ', '@', author)
-
## Check for @ symbol
r <- regexpr("\\S+@\\S+", author, TRUE)
email <- substr(author, r, r + attr(r,"match.length")-1)
@@ -258,7 +258,7 @@ check.corpus.precon <- function(corp.base) {
## string minus the new email part as name, and construct
## a valid name/email combination
name <- sub(email, "", author, fixed=TRUE)
- name <- str_trim(name)
+ name <- fix.name(name)
}
## Name and author are now given in both cases, construct
@@ -266,13 +266,15 @@ check.corpus.precon <- function(corp.base) {
author <- paste(name, ' <', email, '>', sep="")
}
else {
- ## Verify that the order is correct
+ ## There is a correct email address. Ensure that the order is correct
+ ## and fix cases like "<hans.huber@xxxxxxxxxxxxx> Hans Huber"
+
## Get email and name parts
r <- regexpr("<.+>", author, TRUE)
if(r[[1]] == 1) {
email <- substr(author, r, r + attr(r,"match.length")-1)
name <- sub(email, "", author, fixed=TRUE)
- name <- str_trim(name)
+ name <- fix.name(name)
email <- str_trim(email)
author <- paste(name,email)
}
diff --git a/codeface/R/ml/ml_utils.r b/codeface/R/ml/ml_utils.r
index 963cd2d..f596829 100644
--- a/codeface/R/ml/ml_utils.r
+++ b/codeface/R/ml/ml_utils.r
@@ -445,3 +445,15 @@ ml.thread.loc.to.glob <- function(ml.id.map, loc.id) {
return(global.id)
}
+
+## Given a name with leading and trailing whitespace that is possibly
+## surrounded by braces, return the name proper.
+fix.name <- function(name) {
+ name <- str_trim(name)
+ if (substr(name, 1, 1) == "(" && substr(name, str_length(name),
+ str_length(name)) == ")") {
+ name <- substr(name, 2, str_length(name)-1)
+ }
+
+ return (name)
+}
2) Can Codeface handle the replaced @-symbol automatically? 3) Is
the author name properly converted to UTF8 internally? (It's "Zsbán
Ambrus", actually. Note the a with acute!)
BTW, can someone please document the supported patterns of the
From lines in Wiki, Readme, or somewhere else, when everything is
implemented (also from the other current threads)? E.g., "The 'via
...' pattern gets treated as follows: [...]" or "Codeface handles '
at ' (NOT) as '@' automatically". That way, nobody would get
confused when they get empty/weird outputs from a mailing-list
analysis on a given mbox file.
That would definitely be helpful. Could you please file a bug so we
don't forget to do this some time? I don't have time to take care
of this right now, but since a student will be working on adding
support for Google Groups soon, this seems a nice topic for him. You
might even assign the ticket to Georg Berner (CCed) ;)
With the patch above, we support abominations of the form
Hans Huber huber@xxxxxxxxxxxxx
Hans Huber huber at hubercorp.com
Hans Huber <huber at hubercorp.com> ("AT" instead of "at" also works)
<huber@xxxxxxxxxxxxx> Hans Huber
hans huber @ hubercorp.com Hans Huber
hans huber @ hubercorp.com (Hans Huber)
and return the valid email address
"Hans Huber <hans.huber@xxxxxxxxxxxxxx>" in each case. Having a unit
test for these cases would also be helpful. Can you please add that to
the ticket?
For the unit test, we would need to factor out the whole conversion
black magic, and make it independent of document processing. But that
would be a good idea anyway.
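
A minimal sketch of what such a factored-out conversion could look like,
using base R only so that it runs without the rest of the pipeline.
normalise.author is a hypothetical name and not part of Codeface; fix.name
mirrors the helper from the patch above but uses trimws instead of
str_trim. Unlike the patched code, the sketch always puts the name first
and returns NA when no address can be found.

```r
## Strip surrounding whitespace and one pair of enclosing braces
## (same idea as fix.name in the patch, base R only).
fix.name <- function(name) {
  name <- trimws(name)
  n <- nchar(name)
  if (n >= 2 && substr(name, 1, 1) == "(" && substr(name, n, n) == ")") {
    name <- substr(name, 2, n - 1)
  }
  return(trimws(name))
}

## Hypothetical stand-alone normalisation of a From entry into
## "Name <email>", independent of any corpus or document object.
normalise.author <- function(author) {
  author <- trimws(author)

  ## Replace textual ' at ' with @ before any other check
  author <- sub(" at ", "@", author, fixed=TRUE)
  author <- sub(" AT ", "@", author, fixed=TRUE)

  r <- regexpr("<.+>", author)
  if (r[[1]] == -1) {
    ## No <...> part: try to recover a bare address
    r <- regexpr("\\S+@\\S+", author)
    if (r[[1]] == -1) {
      return(NA_character_)
    }
    email <- substr(author, r, r + attr(r, "match.length") - 1)
    name <- fix.name(sub(email, "", author, fixed=TRUE))
    return(paste0(name, " <", email, ">"))
  }

  ## A <...> part exists: put the name first, the address second
  email <- substr(author, r, r + attr(r, "match.length") - 1)
  name <- fix.name(sub(email, "", author, fixed=TRUE))
  return(paste(name, email))
}

## Sketch of the unit test: every variant should give the same result
expected <- "Hans Huber <huber@hubercorp.com>"
variants <- c("Hans Huber huber@hubercorp.com",
              "Hans Huber huber at hubercorp.com",
              "Hans Huber <huber at hubercorp.com>",
              "Hans Huber <huber AT hubercorp.com>",
              "<huber@hubercorp.com> Hans Huber",
              "huber at hubercorp.com (Hans Huber)")
for (v in variants) {
  stopifnot(normalise.author(v) == expected)
}
```

One caveat of replacing ' at ' up front, as the sketch and the patch both
do: a name that itself contains " at " would be altered as well, so a test
case for that might be worth adding to the ticket.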