[codeface] Re: [PATCH] Fix case where email "From" field has atypical form

From: Claus Hunsen <hunsen@xxxxxxxxxxxxxxxxx>
To: codeface@xxxxxxxxxxxxx
Date: Tue, 17 Nov 2015 16:43:24 +0100

Hi Wolfgang,

thank you very much for the insight and help.
I opened two tickets on GitHub [1][2] to support this thread: one for
the failing e-mail-address processing, the other for the documentation.

[1] https://github.com/siemens/codeface/issues/34
[2] https://github.com/siemens/codeface/issues/35

Further comments on your last e-mail are inline.

Consider the following From lines from the SQLite mbox files:

From: elau1004 at aim.com (Edward Lau)
From: ambrus at math.bme.hu (=?UTF-8?Q?Zsb=C3=A1n_Ambrus?=)

There are three things to note: 1) The author's name is written in
braces behind the e-mail address,

The order will currently be changed, but the name remains in braces.

I tested this via the mailing-list test, and I found that the braces are
removed also *without* the patch being applied. (see issue #34 [1])

2) the @-symbol is replaced by " at ", and

This is supported when the email is in a bogus format (for
instance "Hans Huber huber at hubercorp.com"). Interestingly,
we currently don't catch "Hans Huber <huber at hubercorp.com>".

The failing pattern you described is properly handled after applying the
patch!

3) the second author name contains some UTF8 encoding.

I'm pretty sure I've dealt with this issue earlier when generating
two-mode communication graphs, but I could not find anything in
the source code. Mitchell, do you know if we take care of this
anywhere? Maybe it is implicitly hidden somewhere. Maybe I just failed to
commit the code.

We convert between encoding <X> and UTF-8 in snatm during the
content_transformer pass, but this will not handle the case
you describe AFAIK.

A look into the DB confirms that the string is written to the DB as is.
I.e., the name of the person is "=?UTF-8?Q?Zsb=C3=A1n_Ambrus?=". (see
also issue #34 [1])
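
For illustration: the stored string has the form of an RFC 2047 encoded
word (charset UTF-8, "Q" encoding). A minimal base-R sketch of decoding
such a word could look like the following. Note that decode.q.word is a
hypothetical helper, not an existing Codeface or snatm function, and it
only handles a single UTF-8 Q-encoded word; other charsets, "B" (base64)
encoding, and multi-word headers are left out.

```r
## Hypothetical helper (not part of Codeface): decode one RFC 2047
## encoded word with charset UTF-8 and "Q" encoding.
decode.q.word <- function(x) {
  pattern <- "^=\\?[Uu][Tt][Ff]-8\\?[Qq]\\?(.*)\\?=$"
  if (!grepl(pattern, x)) {
    ## Not a UTF-8 Q-encoded word: return the input unchanged
    return(x)
  }

  payload <- sub(pattern, "\\1", x)
  ## In Q encoding, "_" represents a space
  payload <- gsub("_", " ", payload, fixed=TRUE)

  ## Collect the bytes: "=XX" becomes the byte 0xXX, everything else
  ## is copied literally
  bytes <- raw(0)
  n <- nchar(payload)
  i <- 1
  while (i <= n) {
    ch <- substr(payload, i, i)
    hex <- substr(payload, i + 1, i + 2)
    if (ch == "=" && grepl("^[0-9A-Fa-f]{2}$", hex)) {
      bytes <- c(bytes, as.raw(strtoi(hex, base=16L)))
      i <- i + 3
    } else {
      bytes <- c(bytes, charToRaw(ch))
      i <- i + 1
    }
  }

  res <- rawToChar(bytes)
  Encoding(res) <- "UTF-8"
  return(res)
}

decoded <- decode.q.word("=?UTF-8?Q?Zsb=C3=A1n_Ambrus?=")
```

Where such a step would belong in the pipeline (before or after the
name/e-mail splitting) is an open question that this sketch does not
answer.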

As I am not that familiar with the corresponding source-code, my
questions are: 1) Is that pattern already supported by Codeface and
its name service?

No, for the reasons described above. To correct the two deficiencies
I've pointed out above (not considering the encoding question), the
following patch should help. Could you please review it?

diff --git a/codeface/R/ml/analysis.r b/codeface/R/ml/analysis.r
index 53a7335..98895c7 100644
--- a/codeface/R/ml/analysis.r
+++ b/codeface/R/ml/analysis.r
@@ -226,6 +226,11 @@ check.corpus.precon <- function(corp.base) {
## Trim trailing and leading whitespace
author <- str_trim(author)

+ ## Replace textual ' at ' with @, sometimes
+ ## we can recover an email
+ author <- sub(' at ', '@', author)
+ author <- sub(' AT ', '@', author)
+
## Check if email exists
email.exists <- grepl("<.+>", author, TRUE)

@@ -234,11 +239,6 @@ check.corpus.precon <- function(corp.base) {
"<xxxyyy@xxxxxxx>); attempting to recover from: ",
author)
logdevinfo(msg, logger="ml.analysis")

- ## Replace textual ' at ' with @, sometimes
- ## we can recover an email
- author <- sub(' at ', '@', author)
- author <- sub(' AT ', '@', author)
-
## Check for @ symbol
r <- regexpr("\\S+@\\S+", author, TRUE)
email <- substr(author, r, r + attr(r,"match.length")-1)
@@ -258,7 +258,7 @@ check.corpus.precon <- function(corp.base) {
## string minus the new email part as name, and construct
## a valid name/email combination
name <- sub(email, "", author, fixed=TRUE)
- name <- str_trim(name)
+ name <- fix.name(name)
}

## Name and author are now given in both cases, construct
@@ -266,13 +266,15 @@ check.corpus.precon <- function(corp.base) {
author <- paste(name, ' <', email, '>', sep="")
}
else {
- ## Verify that the order is correct
+ ## There is a correct email address. Ensure that the order is correct
+ ## and fix cases like "<hans.huber@xxxxxxxxxxxxx> Hans Huber"
+
## Get email and name parts
r <- regexpr("<.+>", author, TRUE)
if(r[[1]] == 1) {
email <- substr(author, r, r + attr(r,"match.length")-1)
name <- sub(email, "", author, fixed=TRUE)
- name <- str_trim(name)
+ name <- fix.name(name)
email <- str_trim(email)
author <- paste(name,email)
}
diff --git a/codeface/R/ml/ml_utils.r b/codeface/R/ml/ml_utils.r
index 963cd2d..f596829 100644
--- a/codeface/R/ml/ml_utils.r
+++ b/codeface/R/ml/ml_utils.r
@@ -445,3 +445,15 @@ ml.thread.loc.to.glob <- function(ml.id.map, loc.id) {

return(global.id)
}
+
+## Given a name with leading and trailing whitespace that is possibly
+## surrounded by braces, return the name proper.
+fix.name <- function(name) {
+ name <- str_trim(name)
+ if (substr(name, 1, 1) == "(" && substr(name, str_length(name),
+ str_length(name)) == ")") {
+ name <- substr(name, 2, str_length(name)-1)
+ }
+
+ return (name)
+}

The patch is fine and enables Codeface to treat "Hans Huber <huber at
hubercorp.com>" properly. (see #34 [1])
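
As far as the hunks show, the reason the reordering matters is that the
"<.+>" check already succeeds on the obfuscated form, so the recovery
code that contained the ' at ' substitution was presumably never reached
for it. A small base-R sketch of that effect, using the same calls as
the patch (the variable had.brackets is only introduced here for
illustration):

```r
author <- "Hans Huber <huber at hubercorp.com>"

## The existence check looks only for "<...>", so it passes even though
## the bracketed part is not yet a usable address.
had.brackets <- grepl("<.+>", author, TRUE)

## With the substitution moved in front of the check (as in the patch),
## the address is repaired regardless of the outcome of the check.
author <- sub(' at ', '@', author)
author <- sub(' AT ', '@', author)
```

After these lines, author holds "Hans Huber <huber@hubercorp.com>", and
the order check leaves it alone because the "<...>" part does not start
at position 1.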

2) Can Codeface handle the replaced @-symbol automatically? 3) Is
the author name properly converted to UTF8 internally? (It's "Zsbán
Ambrus", actually. Note the a with acute!)

BTW, can someone please document the supported patterns of the
From lines in Wiki, Readme, or somewhere else, when everything is
implemented (also from the other current threads)? E.g., "The 'via
...' pattern gets treated as follows: [...]" or "Codeface handles '
at ' (NOT) as '@' automatically". That way, nobody would get
confused when they get empty/weird outputs from a mailing-list
analysis on a given mbox file.

That would definitely be helpful. Could you please file a bug so we
don't forget to do this some time? I don't have time to take care
of this right now, but since a student will be working on adding
support for google groups soon, this seems a nice topic for him. You
might even assign the ticket to Georg Berner (CCed) ;)

With the patch above, we support abominations of the form

Hans Huber huber@xxxxxxxxxxxxx
Hans Huber huber at hubercorp.com
Hans Huber <huber at hubercorp.com> ("AT" instead of "at" also works)
<huber@xxxxxxxxxxxxx> Hans Huber
hans huber @ hubercorp.com Hans Huber
hans huber @ hubercorp.com (Hans Huber)

and return the valid email address
"Hans Huber <hans.huber@xxxxxxxxxxxxxx>" in each case. Having a unit
test for these cases would also be helpful. Can you please add that to
the ticket?

For the unit test, we would need to factor out the whole conversion
black magic, and make it independent of document processing. But that
would be a good idea anyway.
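
A rough sketch of what such a factored-out conversion plus test might
look like, in base R only (trimws and nchar stand in for stringr's
str_trim and str_length). The function name normalize.author, the
NA return value for unrecoverable input, and the concrete test address
are assumptions made for this example; the real code in analysis.r has
additional handling that is not visible in the hunks above.

```r
## Same idea as fix.name in the patch, written with base R
fix.name <- function(name) {
  name <- trimws(name)
  n <- nchar(name)
  if (n >= 2 && substr(name, 1, 1) == "(" && substr(name, n, n) == ")") {
    name <- substr(name, 2, n - 1)
  }
  return(name)
}

## Hypothetical stand-alone conversion, independent of corpus processing
normalize.author <- function(author) {
  author <- trimws(author)
  author <- sub(" at ", "@", author, fixed=TRUE)
  author <- sub(" AT ", "@", author, fixed=TRUE)

  if (!grepl("<.+>", author)) {
    ## No "<...>" part: use the first token containing "@" as address
    r <- regexpr("\\S+@\\S+", author)
    if (r[[1]] == -1) {
      return(NA_character_)  # assumption: give up in this sketch
    }
    email <- substr(author, r[[1]], r[[1]] + attr(r, "match.length") - 1)
    name <- fix.name(sub(email, "", author, fixed=TRUE))
    return(paste(name, " <", email, ">", sep=""))
  }

  ## Address is bracketed; turn "<address> Name" into "Name <address>"
  r <- regexpr("<.+>", author)
  if (r[[1]] == 1) {
    email <- substr(author, r[[1]], r[[1]] + attr(r, "match.length") - 1)
    name <- fix.name(sub(email, "", author, fixed=TRUE))
    author <- paste(name, email)
  }
  return(author)
}

## Test inputs modelled on the forms listed above
inputs <- c("Hans Huber huber@hubercorp.com",
            "Hans Huber huber at hubercorp.com",
            "Hans Huber <huber AT hubercorp.com>",
            "<huber@hubercorp.com> Hans Huber",
            "huber at hubercorp.com (Hans Huber)")
results <- vapply(inputs, normalize.author, character(1), USE.NAMES=FALSE)
stopifnot(all(results == "Hans Huber <huber@hubercorp.com>"))
```

Keeping the conversion free of logging and corpus access is what makes
a plain stopifnot-style check like this possible.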

I opened the ticket [2] and added all information that we gathered in
this thread. (I hope I haven't forgotten anything.)
Unfortunately, I could not find Georg on GitHub, so the ticket is
unassigned right now. I am not even sure if I can assign somebody anyway
as I am not the owner of the repo.

Best,
Claus
