Error with hclust
Abby ▴ 10

Hi all,

I'm trying to create a dendrogram with qualitative data in a txt file. This is what I have so far:

RQ1_cloud <- readLines("RQ1.txt")
RQ1_corpus <- Corpus(VectorSource(RQ1_cloud))
RQ1_clean <- tm_map(RQ1_corpus, tolower)
RQ1_clean <- tm_map(RQ1_clean, removeNumbers)
RQ1_clean <- tm_map(RQ1_clean, removePunctuation)
RQ1_clean <- tm_map(RQ1_clean, stripWhitespace)
RQ1_clean <- tm_map(RQ1_clean, removeWords, stopwords())
wordcloud(RQ1_clean, min.freq = 10, scale = c(2, 0.2), colors = brewer.pal(9, "RdPu"))
RQ1_tdm <- TermDocumentMatrix(RQ1_clean)
freq <- colSums(as.matrix(RQ1_tdm))
length(freq)
ord <- order(freq)
dtms <- removeSparseTerms(RQ1_tdm, 0.2)
RQ1_matrix <- as.matrix(RQ1_tdm)
RQ1_sorted <- sort(colSums(RQ1_matrix), decreasing = T)
head(RQ1_sorted)
   1    2    5    3    4
1484 1430  104    0    0

RQ1_df <- data.frame(word = names(RQ1_sorted), freq = RQ1_sorted)
head(RQ1_df)
  word freq
1    1 1484
2    2 1430
5    5  104
3    3    0
4    4    0

findFreqTerms(RQ1_tdm, lowfreq=50)
character(0)
findAssocs(RQ1_tdm, c("digital"), corlimit=0.85)
$digital
    abstract  abstractly” abstractness
        1.00         1.00         1.00
  acceptance       access    acclimate
        1.00         1.00         1.00
etc.

dtmss <- removeSparseTerms(RQ1_tdm, 0.15)
library(cluster)
d <- dist(t(dtmss), method="euclidian")
fit <- hclust(d, method="complete", members= NULL)
Error in hclust(d, method = "complete", members = NULL) : NA/NaN/Inf in foreign function call (arg 10)

However, I get an error that says: Error in hclust(d, method = "complete", members = NULL) : NA/NaN/Inf in foreign function call (arg 10)

I've seen online that I need to remove zero variance columns/rows but I'm unsure how to do that. Thank you all for your help!

-Abby

dendrogram hclust textanalysis R RStudio • 3.7k views
ADD COMMENT
Kevin Blighe ★ 4.0k

Hi,

It can also mean that there are NA, -Inf, NULL, or NaN values in your data, dtmss.

To search for rows of 0 variance, it would be:

apply(dtmss, 1, var) == 0

..or,

matrixStats::rowVars(as.matrix(dtmss)) == 0

These return logical vectors (TRUE or FALSE), with TRUE marking any row that has zero variance. Note that they also return NA for a row if it contains even one NA value; you can use this to filter out rows with NA values as well.
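To actually drop such rows, one option is to index the matrix with the negated result. The following is a minimal sketch in base R on a made-up numeric matrix (the term names and counts are hypothetical, not taken from the post); for a TermDocumentMatrix, the assumption is that you would first convert it with as.matrix(), as the original post already does for RQ1_tdm.

```r
# Toy numeric matrix standing in for a term-by-document count matrix
# (hypothetical data; rows are terms, columns are documents).
m <- matrix(
  c(1, 1, 1,    # "alpha": constant across documents, so variance is 0
    0, 2, 5,    # "beta": varies across documents
    3, NA, 1),  # "gamma": contains an NA, so var() returns NA
  nrow = 3, byrow = TRUE,
  dimnames = list(Terms = c("alpha", "beta", "gamma"),
                  Docs = c("1", "2", "3"))
)

# Per-row variance; NA for any row that contains an NA
row_var <- apply(m, 1, var)

# Keep only rows whose variance is defined and non-zero
keep <- !is.na(row_var) & row_var != 0
m_filtered <- m[keep, , drop = FALSE]

rownames(m_filtered)
```

Using drop = FALSE keeps the result a matrix even when only one row survives, which matters because dist() expects a matrix-like input.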


Another option to consider would be to impute the missing values and then filter out zero-variance rows prior to running dist() / hclust().

For example, impute missing values as 0:

dtmss[is.na(dtmss)] <- 0

Impute with half the lowest non-zero value:

dtmss[is.na(dtmss)] <- (min(dtmss[dtmss > 0], na.rm = TRUE) / 2)

The best strategy will depend on the distribution of the input data and how it was processed.

Kevin

ADD COMMENT

I ran the function: apply(dtmss, 1, var) == 0

It gave me: logical(0)

Does that mean there are rows with zero variance? And if there are, is there a way I can take them out of the data? I'm trying to make a dendrogram of word associations from an article.

Also this is the output

dput(dtmss)
structure(list(i = integer(0), j = integer(0), v = numeric(0), 
    nrow = 0L, ncol = 5L, dimnames = list(Terms = NULL, Docs = c("1", 
    "2", "3", "4", "5"))), class = c("TermDocumentMatrix", "simple_triplet_matrix"
), weighting = c("term frequency", "tf"))

Let me know if you can help. Thank you!

ADD REPLY

Hey, so, dtmss is not numerical data. Are you following some tutorial to do this work?

ADD REPLY

No, it's not numerical data. I'm trying to make a dendrogram of word associations from the article "6 Ways Digital Media Impacts the Brain" by Saga Briggs.

First I tried to follow this YouTube video, but I got: Error in train_word2vec("C:/Users/Abby/Documents/R/RQ1", output_file = "C:/Users/Abby/Documents/R/RQ1", : could not find function "train_word2vec". I think that's because train_word2vec was a function from the package wordVectors, which I believe is not available anymore.

So then I used this video to make a word cloud, which was successful. Then I was playing around with this video https://www.youtube.com/watch?v=ys6y18Piqfc and this website https://rstudio-pubs-static.s3.amazonaws.com/265713_cbef910aee7642dc8b62996e38d2825d.html to make my dendrogram. With the video, I was able to get the word association:

RQ1_tdm <- TermDocumentMatrix(RQ1_clean)
freq <- colSums(as.matrix(RQ1_tdm))
length(freq)
ord <- order(freq)
dtms <- removeSparseTerms(RQ1_tdm, 0.2)
RQ1_matrix <- as.matrix(RQ1_tdm)
RQ1_sorted <- sort(colSums(RQ1_matrix), decreasing = T)
head(RQ1_sorted)
   1    2    5    3    4
1484 1430  104    0    0

RQ1_df <- data.frame(word = names(RQ1_sorted), freq = RQ1_sorted)
head(RQ1_df)
  word freq
1    1 1484
2    2 1430
5    5  104
3    3    0
4    4    0

findFreqTerms(RQ1_tdm, lowfreq=50)
character(0)
findAssocs(RQ1_tdm, c("digital"), corlimit=0.85)
$digital
    abstract  abstractly” abstractness
        1.00         1.00         1.00
  acceptance       access    acclimate
        1.00         1.00         1.00

However, when I go on to:

dtmss <- removeSparseTerms(RQ1_tdm, 0.15)
library(cluster)
d <- dist(t(dtmss), method="euclidian")
fit <- hclust(d, method="complete", members= NULL)
plot(fit, hang=-1)

That's when I'm getting the message: "Error in hclust(d, method = "complete", members = NULL) : NA/NaN/Inf in foreign function call (arg 10)"

When I do apply(dtmss, 1, var) == 0, I get "logical(0)" which I've been told indicates that I have a vector that's supposed to contain boolean values, but the vector has zero length.

ADD REPLY

Hi, I may not have time to look through all of those videos, sorry. The apply() function will not work if the data is non-numerical. Basically, dist(t(dtmss), method="euclidian") will only work correctly if dtmss is numerical and has no NA, NaN, NULL, or -Inf values. What is the output of str(dtmss)?
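A quick way to check these conditions, sketched in base R. The zero-row matrix below is a hypothetical stand-in for as.matrix(dtmss), chosen to match the dput() output earlier in the thread (nrow = 0, ncol = 5); the variable names are illustrative only.

```r
# Hypothetical stand-in for as.matrix(dtmss): 0 terms (rows) by 5 documents (columns)
m <- matrix(numeric(0), nrow = 0, ncol = 5,
            dimnames = list(Terms = NULL, Docs = as.character(1:5)))

dim(m)  # zero rows means there is nothing to compare between documents

# dist() still returns an object, but with no columns to compare
# every pairwise distance comes back as NA
d <- dist(t(m), method = "euclidean")
all_na <- all(is.na(d))

# Guard that reports the problem before hclust() is ever called
can_cluster <- nrow(m) > 0 && !anyNA(d)
if (!can_cluster) {
  message("No usable distances: the matrix has ", nrow(m), " terms left.")
}
```

If the real dtmss behaves like this stand-in, a likely cause is that removeSparseTerms(RQ1_tdm, 0.15) dropped every term, since documents 3 and 4 have a total count of 0 in the original post. A larger sparse value, or removing the empty documents first, may leave rows to cluster.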

Unfortunately, this is also somewhat out of the scope of this forum, which is dedicated to Bioconductor packages. I can continue to respond, though, in order to help.

ADD REPLY

No problem! Thank you so much for all your help! The output of str(dtmss) is

List of 6
 $ i       : int(0)
 $ j       : int(0)
 $ v       : num(0)
 $ nrow    : int 0
 $ ncol    : int 5
 $ dimnames:List of 2
  ..$ Terms: NULL
  ..$ Docs : chr [1:5] "1" "2" "3" "4" ...
 - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
 - attr(*, "weighting")= chr [1:2] "term frequency" "tf"

ADD REPLY

# Install wordVectors from GitHub
install.packages("devtools")
devtools::install_github("bmschmidt/wordVectors")
library(wordVectors)
library(magrittr)

# Both guards are skipped when the cleaned text already exists as RQ1.txt
if (!file.exists("RQ1.txt")) unzip("RQ1.txt", exdir = "RQ1")
if (!file.exists("RQ1.txt")) prep_word2vec(origin = "RQ1", destination = "RQ1.txt", lowercase = TRUE, bundle_ngrams = 2)

# Train the model once, then reload it from disk on later runs
if (!file.exists("RQ1_vectors.bin")) {
  model <- train_word2vec("RQ1.txt", "RQ1_vectors.bin", vectors = 200, threads = 4, window = 12, iter = 5, negative_samples = 0)
} else {
  model <- read.vectors("RQ1_vectors.bin")
}

# Words closest to "digital"
model %>% closest_to("digital")

# Collect the 20 nearest words to each seed term
RQ1 <- c("digital")
term_set <- lapply(RQ1, function(rq1) {
  nearest_words <- model %>% closest_to(model[[rq1]], 20)
  nearest_words$word
}) %>% unlist

# Cluster those words by cosine distance and plot the dendrogram
subset <- model[[term_set, average = FALSE]]
subset %>%
  cosineDist(subset) %>%
  as.dist %>%
  hclust %>%
  plot

This is what I ended up going with, if anyone is interested! I suggest cleaning the text first and saving it as a txt file.
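For readers without wordVectors installed, the last step (cosine distance, then as.dist, then hclust) can be illustrated in base R. This is only a sketch of the idea, not the package's implementation: the four word vectors below are made-up numbers, and the word labels are hypothetical.

```r
# Hypothetical word vectors: 4 words, 3 dimensions each (made-up numbers)
vecs <- matrix(
  c(1.0, 0.0, 0.0,
    0.9, 0.1, 0.0,
    0.0, 1.0, 0.0,
    0.0, 0.9, 0.1),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("digital", "media", "brain", "memory"), NULL)
)

# Cosine distance = 1 - cosine similarity between row vectors
unit <- vecs / sqrt(rowSums(vecs^2))  # scale each row to unit length
cos_dist <- 1 - unit %*% t(unit)

# Hierarchical clustering on the distance matrix
fit <- hclust(as.dist(cos_dist), method = "complete")

# Cut the tree into two groups; plot(fit) would draw the dendrogram
groups <- cutree(fit, k = 2)
```

Words whose vectors point in nearly the same direction end up in the same group, which is the property the dendrogram visualizes.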

ADD REPLY
