[codeface] [PATCH 2/3] Updates for changes to the tm packages

  • From: Mitchell Joblin <mitchell.joblin.ext@xxxxxxxxxxx>
  • To: codeface@xxxxxxxxxxxxx
  • Date: Fri, 2 Oct 2015 10:46:07 +0200

Signed-off-by: Mitchell Joblin <mitchell.joblin.ext@xxxxxxxxxxx>
---
pkg/R/makeforest.r | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/pkg/R/makeforest.r b/pkg/R/makeforest.r
index 0f66812..2e1c6ff 100644
--- a/pkg/R/makeforest.r
+++ b/pkg/R/makeforest.r
@@ -51,15 +51,14 @@ gen.corpus <- function (ml, repo.path="./", suffix=".txt", outdir=NULL,
    corp <- preprocess(corp)
  }
  corp.orig <- corp
-
  corp <- tm_map(corp, content_transformer(function(x) iconv(enc2utf8(x), sub="byte")))
  corp <- tm_map(corp, tm.plugin.mail::removeCitation, removeQuoteHeader=T)
  corp <- tm_map(corp, tm.plugin.mail::removeSignature, marks=marks)
  corp <- tm_map(corp, tm.plugin.mail::removeMultipart)
  ## NOTE: It's important to apply tolower before stopword removal;
  ## otherwise, phrases like "I'm" won't be removed properly
- corp <- tm_map(corp, tolower)
- corp <- tm_map(corp, removeWords.useBytes, stopwords("english"))
+ corp <- tm_map(corp, content_transformer(tolower))
+ corp <- tm_map(corp, removeWords, stopwords("english"))
  corp <- tm_map(corp, tm::removeNumbers)
  corp <- tm_map(corp, tm::removePunctuation)
  corp <- tm_map(corp, tm::stripWhitespace)
--
2.1.4
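
For readers who want to try the changed calls in isolation, here is a minimal sketch, assuming tm 0.6 or later is installed. The one-document corpus and its text are made up for illustration and are not taken from Codeface. As far as I understand the newer tm API, tm_map() expects each transformation to return a text document, which is why a plain character function such as tolower is wrapped in content_transformer(), while removeWords from tm can be passed directly.

```r
library(tm)

# Hypothetical one-document corpus; the text is made up for illustration.
corp <- VCorpus(VectorSource("I'm reading The Patch"))

# Wrap tolower so that each element stays a text document
# instead of collapsing to a bare character vector.
lowered <- tm_map(corp, content_transformer(tolower))

# Stopword removal runs after lowercasing, in the same order as the patch,
# so that capitalised stopwords such as "I'm" or "The" are matched.
cleaned <- tm_map(lowered, removeWords, stopwords("english"))
cleaned <- tm_map(cleaned, stripWhitespace)

print(content(lowered[[1]]))
print(content(cleaned[[1]]))
```

Swapping the two steps would leave the capitalised stopwords in place, because the English stopword list shipped with tm is all lowercase.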

