
Error says "vector size cannot be NA" when using R with data mining


(@ganesh)
Noble Member
Joined: 2 years ago
Posts: 1362
31/03/2021 11:06 am

I'm using R for data analytics. I connected it to Elasticsearch and retrieved a dataset of Shakespeare's complete works.

library("elastic")
connect()
maxi <- count(index = 'shakespeare')
s <- Search(index = 'shakespeare', size = maxi)
dat <- s$hits$hits[[1]]$`_source`$text_entry
for (i in 2:maxi) {
  dat <- c(dat, s$hits$hits[[i]]$`_source`$text_entry)
}
rm(s)

After that I want to build a tf-idf matrix, but apparently I can't because it uses too much memory (I have 4 GB of RAM). Here is my code:

library("tm")
myCorpus <- Corpus(VectorSource(dat))
myCorpus <- tm_map(myCorpus, content_transformer(tolower), lazy = TRUE)
myCorpus <- tm_map(myCorpus, content_transformer(removeNumbers), lazy = TRUE)
myCorpus <- tm_map(myCorpus, content_transformer(removePunctuation), lazy = TRUE)
myCorpus <- tm_map(myCorpus, content_transformer(removeWords), stopwords("en"), lazy = TRUE)
myTdm <- TermDocumentMatrix(myCorpus, control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE)))

myCorpus is around 400 MB.

But then I do:

> m <- as.matrix(myTdm)
Error in vector(typeof(x$v), nr * nc) : vector size cannot be NA
In addition: Warning message:
In nr * nc : NAs produced by integer overflow

(@anamika)
Noble Member
Joined: 2 years ago
Posts: 1381
31/03/2021 11:07 am

You can use the removeSparseTerms function from the tm package.

Removes sparse terms from a document-term or term-document matrix.

Something like this:

library("tm")
data("crude")
tdm <- TermDocumentMatrix(crude)
# Drop terms that are absent from 20% or more of the documents
removeSparseTerms(tdm, 0.2)
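
Some context on why the conversion fails, as far as the error message indicates: as.matrix() has to allocate a dense vector of nr * nc cells, and the warning says that this product overflowed R's 32-bit integer range. A matrix that large would usually not fit in 4 GB of RAM even without the overflow, so trimming terms first and checking the dimensions before converting seems like the safer route. Below is a minimal sketch of such a check. Note that dense_size_gb is a hypothetical helper name, the dimensions are made-up numbers rather than the real size of the Shakespeare matrix, and the 0.99 threshold in the commented part is only an assumption to tune.

```r
# Hypothetical helper: estimate the size of a dense numeric matrix in GiB.
# Converting to double first avoids the integer overflow seen in the error.
dense_size_gb <- function(nr, nc) {
  cells <- as.numeric(nr) * as.numeric(nc)
  cells * 8 / 1024^3  # 8 bytes per double
}

# Illustrative dimensions only (not measured from the real data).
nr <- 30000L   # number of terms
nc <- 100000L  # number of documents

# The integer product exceeds .Machine$integer.max, so R returns NA
# (with the same warning as in the question, suppressed here).
overflowed <- is.na(suppressWarnings(nr * nc))
size_gb <- dense_size_gb(nr, nc)

# With the objects from the question one might then do (assumes myTdm exists):
# smallTdm <- removeSparseTerms(myTdm, 0.99)
# dim(smallTdm)  # check terms x documents before converting
# if (dense_size_gb(nrow(smallTdm), ncol(smallTdm)) < 1) {
#   m <- as.matrix(smallTdm)
# }
```

If the trimmed matrix is still too large, it may be better to keep working with the sparse term-document matrix instead of calling as.matrix() at all.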
