I'm using R for data analytics. I connected it to Elasticsearch and retrieved a dataset of Shakespeare's Complete Works.
library("elastic")
connect()
maxi <- count(index = 'shakespeare')
s <- Search(index = 'shakespeare', size = maxi)
dat <- s$hits$hits[[1]]$`_source`$text_entry
for (i in 2:maxi) {
  dat <- c(dat, s$hits$hits[[i]]$`_source`$text_entry)
}
rm(s)
After that I want to build a tf-idf matrix, but apparently I can't, since it uses too much memory (I have 4 GB of RAM). Here is my code:
library("tm")
myCorpus <- Corpus(VectorSource(dat))
myCorpus <- tm_map(myCorpus, content_transformer(tolower), lazy = TRUE)
myCorpus <- tm_map(myCorpus, content_transformer(removeNumbers), lazy = TRUE)
myCorpus <- tm_map(myCorpus, content_transformer(removePunctuation), lazy = TRUE)
myCorpus <- tm_map(myCorpus, content_transformer(removeWords), stopwords("en"), lazy = TRUE)
myTdm <- TermDocumentMatrix(myCorpus,
                            control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE)))
myCorpus is around 400 MB.
But then I do:
> m <- as.matrix(myTdm)
Error in vector(typeof(x$v), nr * nc) : vector size cannot be NA
In addition: Warning message:
In nr * nc : NAs produced by integer overflow
You can use the removeSparseTerms() function.
Removes sparse terms from a document-term or term-document matrix.
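The error itself comes from as.matrix() trying to allocate a dense terms-by-documents matrix: the product of the two dimensions overflows R's 32-bit integers once it exceeds .Machine$integer.max, which is why removing sparse terms first helps. A rough sanity check on your own object (myTdm is the matrix from the question; the arithmetic here is just an illustration):

```r
library(tm)

# How big would the dense matrix be? (myTdm from the question)
nr <- nTerms(myTdm)   # number of distinct terms
nc <- nDocs(myTdm)    # number of documents (text lines)

as.numeric(nr) * nc              # if this exceeds .Machine$integer.max, nr * nc gives NA
as.numeric(nr) * nc * 8 / 2^30   # approximate dense size in GiB (8 bytes per double)
```

On a corpus of this size the dense representation easily runs into tens of GiB, far beyond 4 GB of RAM, so the fix is to shrink the sparse matrix before (or instead of) densifying it.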
Something like this:

data("crude")
tdm <- TermDocumentMatrix(crude)
removeSparseTerms(tdm, 0.2)
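Applied to the matrix from the question it would look like the sketch below. The 0.999 threshold is an assumption, not a recommendation: it keeps only terms appearing in at least roughly 0.1% of documents, and you will need to tune it for your corpus.

```r
library(tm)

# Drop terms that are absent from more than 99.9% of documents
# (threshold 0.999 is an assumed starting point; tune it for your corpus).
smallTdm <- removeSparseTerms(myTdm, 0.999)

# With far fewer terms, the dense conversion may now fit in memory.
m <- as.matrix(smallTdm)
```

If even the reduced matrix is too large, consider working directly with the sparse representation (the slam package's simple_triplet_matrix that tm uses internally) rather than calling as.matrix() at all.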