Inconsistent geo_char for rse_tx
@jacquesvan-helden-12519

There is a bug with the pheno tables of the rse_tx objects. It occurs with several recount IDs, but not all.

For several experiments, the "characteristics" column of the DataFrame returned by colData(rse_tx) contains oddly placed quotes that break the parsing. Below is minimal code that reproduces the bug.

Has anyone encountered this bug before? Is there a trick to work around it?

#### Gene-wise counts (this first part works fine) ####

## Load recount, which provides download_study() and geo_characteristics()
library(recount)

## Download data in rse-gene format
recountID <- "SRP056295"
gene_url <- download_study(project = recountID, type = "rse-gene", download = TRUE)
print(gene_url)

## Load the rse_gene object in memory
load(file.path(recountID, 'rse_gene.Rdata'))

## Extract GEO characteristics from the rse_gene object
gene_geochar <- recount::geo_characteristics(colData(rse_gene))
head(gene_geochar)
table(gene_geochar)

#### Transcript-wise counts #####

## Download the rse-tx object
tx_url <- download_study(project = recountID, type = "rse-tx", download = TRUE)
print(tx_url)

## Inconsistency: the following line fails on Linux because the file extension
## is .RData for transcripts, whereas it is .Rdata for genes.
## It works on Mac OS X, where file names are matched case-insensitively.
## try() lets the rest of the script run when the file is not found.
try(load(file.path(recountID, 'rse_tx.Rdata')))

## This works on Linux as well as Mac OS X
load(file.path(recountID, 'rse_tx.RData'))

## Extract GEO characteristics from the rse_tx object
tx_geochar <- recount::geo_characteristics(colData(rse_tx))
head(tx_geochar)
table(tx_geochar)

## The bug apparently comes from the pheno table
head(colData(rse_tx)$characteristics)
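
The extension mismatch noted in the comments above (.Rdata for genes, .RData for transcripts) can be sidestepped by looking the file up case-insensitively. The following is only a sketch: find_rse_file() is a hypothetical helper, not part of recount, and it assumes the downloaded file sits directly in the study directory.

```r
## Hypothetical helper (not part of recount): locate the saved RSE file
## whether its extension is spelled .Rdata or .RData.
find_rse_file <- function(dir, prefix) {
    pattern <- paste0("^", prefix, "\\.rdata$")
    hits <- list.files(dir, pattern = pattern, ignore.case = TRUE,
                       full.names = TRUE)
    if (length(hits) == 0) {
        stop("No ", prefix, ".Rdata or ", prefix, ".RData file in ", dir)
    }
    ## Return the first match if both spellings happen to exist
    hits[1]
}

## Usage, assuming the study was downloaded as in the code above:
## load(find_rse_file(recountID, "rse_gene"))
## load(find_rse_file(recountID, "rse_tx"))
```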


It would help if you tagged the package that this object comes from, not just 'bug', so that the maintainers are notified.


It's not a bug in the software per se; rather, the colData slots seem to be malformed in some of the RangedSummarizedExperiment objects that you can download:

> head(colData(rse_gene)$characteristics)
CharacterList of length 6
[[1]] tissue: Bone marrow cell type: acute myeloid leukemia (AML) cells
[[2]] tissue: Bone marrow cell type: acute myeloid leukemia (AML) cells
[[3]] tissue: Bone marrow cell type: acute myeloid leukemia (AML) cells
[[4]] tissue: Bone marrow cell type: acute myeloid leukemia (AML) cells
[[5]] tissue: Bone marrow cell type: acute myeloid leukemia (AML) cells
[[6]] tissue: Bone marrow cell type: acute myeloid leukemia (AML) cells

> head(colData(rse_tx)$characteristics)
[1] "c(\"tissue: Bone marrow\", \"cell type: acute myeloid leukemia (AML) cells\")"
[2] "c(\"tissue: Bone marrow\", \"cell type: acute myeloid leukemia (AML) cells\")"
[3] "c(\"tissue: Bone marrow\", \"cell type: acute myeloid leukemia (AML) cells\")"
[4] "c(\"tissue: Bone marrow\", \"cell type: acute myeloid leukemia (AML) cells\")"
[5] "c(\"tissue: Bone marrow\", \"cell type: acute myeloid leukemia (AML) cells\")"
[6] "c(\"tissue: Bone marrow\", \"cell type: acute myeloid leukemia (AML) cells\")"

So those might need to be regenerated. In the interim you could convert the characteristics column in the rse_tx colData to a CharacterList, in which case it would work just like the rse_gene.
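
As a rough sketch of that interim conversion: in the rse_tx output above, each entry is a single string holding the text of an R call such as c("...", "..."), so it can be parsed back into a character vector. The helper below, parse_characteristics(), is a hypothetical name and not part of recount. It uses only base R (str2lang() needs R >= 3.6.0) and assumes every entry is either such a c(...) call of string literals or a single quoted string; anything else is rejected rather than evaluated.

```r
## Hypothetical helper (not part of recount): turn strings such as
## 'c("tissue: Bone marrow", "cell type: ...")' back into character vectors.
parse_characteristics <- function(x) {
    lapply(x, function(s) {
        expr <- str2lang(s)
        ## A lone quoted string parses to a character vector, not a call
        if (is.character(expr)) return(expr)
        args <- as.list(expr)[-1]
        ## Accept only c(...) calls whose arguments are all string literals,
        ## so nothing from the downloaded data is ever evaluated
        ok <- is.call(expr) && identical(expr[[1]], quote(c)) &&
            all(vapply(args, is.character, logical(1)))
        if (!ok) stop("Unexpected characteristics entry: ", s)
        unname(unlist(args))
    })
}

tx_chars <- "c(\"tissue: Bone marrow\", \"cell type: acute myeloid leukemia (AML) cells\")"
parsed <- parse_characteristics(tx_chars)
print(parsed[[1]])

## With IRanges installed, the list can be wrapped as suggested above:
## colData(rse_tx)$characteristics <- IRanges::CharacterList(
##     parse_characteristics(colData(rse_tx)$characteristics)
## )
```

Checking the shape of the parsed call instead of calling eval() on it is a design choice of this sketch; it trades generality for not executing text that came from a downloaded file.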

@lcolladotor

Hi,

I never got an email for this thread since the recount tag was not used initially.

In any case, I updated recount::geo_characteristics() in version 1.10.12 (BioC 3.9 -- current release) and 1.11.12 (BioC 3.10 -- current devel) so that, with your code, the following now runs without error:

stopifnot(identical(
    geo_characteristics(colData(rse_gene)),
    geo_characteristics(colData(rse_tx))
))

Updating the R package was easier than updating the data itself for now.

Thanks @shepherl and @James W. MacDonald for your replies!

Best, Leo

PS The change is recorded at https://github.com/leekgroup/recount/commit/4a9e36f8b65461a829000040e1f422b51a778fd0 which was an implementation of James' answer:

if(is.character(pheno$characteristics)) {
    ## Solves https://support.bioconductor.org/p/116480/
    pheno$characteristics <- IRanges::CharacterList(
        lapply(lapply(pheno$characteristics, str2lang), eval)
    )
}
@jacquesvan-helden-12519

Hi Leonardo,

Thanks for the fix, the test now runs fine.

And thanks for recount, a great package providing researchers with instant access to thousands of RNA-seq datasets.

Best regards,

Jacques
