[codeface] Re: [PATCH] Remove emails from corpus that have a duplicate id

  • From: Wolfgang Mauerer <wolfgang.mauerer@xxxxxxxxxxxxxxxxx>
  • To: <codeface@xxxxxxxxxxxxx>
  • Date: Thu, 17 Dec 2015 10:40:18 +0100



Am 16/12/2015 um 20:33 schrieb Mitchell Joblin:

- Duplicate ids cause problems when we convert the corpus
into a data frame before storing it to the database, because ids
are used as row names, which must be unique

Signed-off-by: Mitchell Joblin <mitchell.joblin.ext@xxxxxxxxxxx>
---
codeface/R/ml/analysis.r | 2 ++
1 file changed, 2 insertions(+)

diff --git a/codeface/R/ml/analysis.r b/codeface/R/ml/analysis.r
index c1c9c5e..8203002 100644
--- a/codeface/R/ml/analysis.r
+++ b/codeface/R/ml/analysis.r
@@ -417,6 +417,8 @@ dispatch.all <- function(conf, repo.path, resdir) {
## NOTE: We only compute the forest for the complete interval to allow for creating
## descriptive statistics.
corp <- corp.base$corp
+ ## Remove duplicate mails
+ corp <- corp[!duplicated(meta(corp, "id"))]

Looks good to me, thanks.

Reviewed-by: Wolfgang Mauerer <wolfgang.mauerer@xxxxxxxxxxxxxxxxx>

## NOTE: conf must be present in the defining scope
do.normalise.bound <- function(authors) {
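
For readers unfamiliar with the failure described in the commit message, the following is a minimal sketch in base R. It does not use the tm corpus from the patch. The data frame, the ids and the subjects are made up for illustration, and the sketch only assumes that the ids end up as row names of a data frame, as the commit message states.

```r
## Hypothetical message ids; "m2" occurs twice, e.g. a mail archived twice.
ids <- c("m1", "m2", "m2", "m3")
mails <- data.frame(id = ids,
                    subject = c("a", "b", "b (copy)", "c"),
                    stringsAsFactors = FALSE)

## Using the ids as row names fails, because data frame row names
## must be unique. The error is caught so that the script continues.
res <- tryCatch(suppressWarnings({
    rownames(mails) <- mails$id
    "ok"
}), error = function(e) "error")

## duplicated() marks every repeat after the first occurrence, so
## negating it keeps exactly one entry (the first) per id.
keep <- !duplicated(mails$id)
mails.unique <- mails[keep, ]

## With unique ids, the row name assignment succeeds.
rownames(mails.unique) <- mails.unique$id
```

The patch applies the same `!duplicated(...)` idiom to the corpus itself, so the first mail with a given id is kept and later copies are dropped.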

