[codeface] Re: [PATCH] Remove emails from corpus that have a duplicate id

  • From: Mitchell Joblin <joblin.m@xxxxxxxxx>
  • To: codeface@xxxxxxxxxxxxx
  • Date: Wed, 16 Dec 2015 20:35:24 +0100

On Wed, Dec 16, 2015 at 8:33 PM, Mitchell Joblin
<mitchell.joblin.ext@xxxxxxxxxxx> wrote:

- Duplicate ids cause problems when we convert the corpus
into a data frame before storing it to the database, because ids
are used as row names, which must be unique

Signed-off-by: Mitchell Joblin <mitchell.joblin.ext@xxxxxxxxxxx>
---
codeface/R/ml/analysis.r | 2 ++
1 file changed, 2 insertions(+)

diff --git a/codeface/R/ml/analysis.r b/codeface/R/ml/analysis.r
index c1c9c5e..8203002 100644
--- a/codeface/R/ml/analysis.r
+++ b/codeface/R/ml/analysis.r
@@ -417,6 +417,8 @@ dispatch.all <- function(conf, repo.path, resdir) {
## NOTE: We only compute the forest for the complete interval to allow for creating
## descriptive statistics.
corp <- corp.base$corp
+ ## Remove duplicate mails
+ corp <- corp[!duplicated(meta(corp, "id"))]

This is the first time I have experienced issues with duplicate mail ids.
We should check further to confirm that the duplicated ids correspond
to duplicated messages; one would assume that's the case. If there are
duplicates, we should remove them much earlier in the analysis.
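
A rough sketch of the check described above, using base R only. The
data frame `mails` and its columns `id` and `body` are hypothetical
stand-ins for the corpus metadata and message content, not the actual
codeface structures. It flags duplicated ids whose copies differ in
content, then keeps the first occurrence of each id, in the same way
the patch filters the corpus with duplicated().

```r
## Hypothetical stand-in for the mail corpus: one row per message.
mails <- data.frame(
  id   = c("<a@example.org>", "<b@example.org>", "<a@example.org>",
           "<c@example.org>", "<c@example.org>"),
  body = c("first", "second", "first", "third", "third, but edited"),
  stringsAsFactors = FALSE
)

## Ids that occur more than once.
dup.ids <- unique(mails$id[duplicated(mails$id)])

## For each duplicated id, check whether all copies have the same body.
same.content <- vapply(dup.ids, function(i) {
  length(unique(mails$body[mails$id == i])) == 1
}, logical(1))

## Ids whose copies differ deserve a manual look before being dropped.
suspicious.ids <- dup.ids[!same.content]

## Keep the first occurrence of every id; the ids are then safe to use
## as row names, which must be unique.
mails.unique <- mails[!duplicated(mails$id), ]
rownames(mails.unique) <- mails.unique$id
```

If suspicious.ids is non-empty, dropping all but the first copy would
discard messages that are not true duplicates, so those ids would need
a closer look first.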

--Mitchell



## NOTE: conf must be present in the defining scope
do.normalise.bound <- function(authors) {
--
2.1.4


