Question

Create TPM gene tables/dataframe from ReCount3

1

Entering edit mode

David ▴ 20

@93249c3e

Last seen 22 months ago

Hong Kong

Hi everyone.

First of all, thank you so much for the ReCount3 project as it helps a lot on analysing the public datasets. As shown from the title, being a "newbie" on bioinformatics/coding, i have a hard time to get TPM values from the data download via ReCount3.

Basically i tried to follow the manual (https://bioconductor.org/packages/release/bioc/manuals/recount3/man/recount3.pdf) but still i have no idea on the follow-up.

My question is: How do i able to get the TPM value for genes after the step of: recount::getTPM("RangedSummarizedExperiment_of_datasetx), because i do not want to do DEseq2 but just to get a dataframe with TPM value for gene (row) in different samples (column)


## Once you have your RSE object, you can transform the raw coverage
## base-pair coverage counts using transform_counts().
## For RPKM, TPM or read outputs, check the details in transform_counts().
assay(rse_gene_SRP009615, "counts") <- transform_counts(rse_gene_SRP009615)

## Compute TPMs
assays(rse_gene_SRP009615)$TPM <- recount::getTPM(rse_gene_SRP009615)
colSums(assay(rse_gene_SRP009615, "TPM")) / 1e6 ## Should all be equal to 1

Because since they are "all equal to 1 now" from the manual, i really do not know what to do. I tried to do: assay(rse_gene_SRP009615,"TPM") but it does not work.

Thank you so much for the help.

(Sidenote: i hope the GTEx dataset for the brain tissue in ReCount3 can be further broken down into sub-section in the brain like in GTEx portal so that i can get specific part of the brain like frontal cortex.)

Regards, David

recount3 TCGA GTEx DESeq2 • 2.7k views

ADD COMMENT • link 2.8 years ago David ▴ 20

score 0 · Answer 1 · 2022-06-07

Hi David,

assays(rse_gene_SRP009615)$TPM is one way you can access the TPM matrix. I think that you will benefit from reading the SummarizedExperiment vignette number 1, that is, http://bioconductor.org/packages/release/bioc/vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html. In particular, section 2 which talks about the anatomy of a SummarizedExperiment object. recount3 returns RangedSummarizedExperiment objects and assumes familiarity with the SummarizedExperiment as noted at http://bioconductor.org/packages/release/bioc/vignettes/recount3/inst/doc/recount3-quickstart.html#22_Required_knowledge.

As for your sidenote question, one of the colData() columns from your RSE object will be the column with the GTEx detailed tissue region. You'll be able to use that to subset your object. That will contain SMTSD in lowercase from looking at https://storage.googleapis.com/gtex_analysis_v8/annotations/GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt. Actually, here's the code:

library("recount3")
#> Loading required package: SummarizedExperiment
#> Loading required package: MatrixGenerics
#> Loading required package: matrixStats
#> 
#> Attaching package: 'MatrixGenerics'
#> The following objects are masked from 'package:matrixStats':
#> 
#>     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
#>     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
#>     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
#>     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
#>     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
#>     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
#>     colWeightedMeans, colWeightedMedians, colWeightedSds,
#>     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
#>     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
#>     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
#>     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
#>     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
#>     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
#>     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
#>     rowWeightedSds, rowWeightedVars
#> Loading required package: GenomicRanges
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
#>     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
#>     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
#>     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
#>     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
#>     union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> 
#> Attaching package: 'S4Vectors'
#> The following objects are masked from 'package:base':
#> 
#>     expand.grid, I, unname
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb
#> Loading required package: Biobase
#> Welcome to Bioconductor
#> 
#>     Vignettes contain introductory material; view with
#>     'browseVignettes()'. To cite Bioconductor, see
#>     'citation("Biobase")', and for packages 'citation("pkgname")'.
#> 
#> Attaching package: 'Biobase'
#> The following object is masked from 'package:MatrixGenerics':
#> 
#>     rowMedians
#> The following objects are masked from 'package:matrixStats':
#> 
#>     anyMissing, rowMedians
rse <- recount3::create_rse_manual(
    project = "BRAIN",
    project_home = "data_sources/gtex",
    organism = "human",
    annotation = "gencode_v26",
    type = "gene"
)
#> 2022-06-07 10:06:57 downloading and reading the metadata.
#> 2022-06-07 10:06:58 caching file gtex.gtex.BRAIN.MD.gz.
#> 2022-06-07 10:06:58 caching file gtex.recount_project.BRAIN.MD.gz.
#> 2022-06-07 10:06:58 caching file gtex.recount_qc.BRAIN.MD.gz.
#> 2022-06-07 10:06:59 caching file gtex.recount_seq_qc.BRAIN.MD.gz.
#> 2022-06-07 10:06:59 downloading and reading the feature information.
#> 2022-06-07 10:06:59 caching file human.gene_sums.G026.gtf.gz.
#> 2022-06-07 10:07:00 downloading and reading the counts: 2931 samples across 63856 features.
#> 2022-06-07 10:07:00 caching file gtex.gene_sums.BRAIN.G026.gz.
#> 2022-06-07 10:07:22 construcing the RangedSummarizedExperiment (rse) object.
grep("smtsd", colnames(colData(rse)))
#> [1] 15
colnames(colData(rse))[15]
#> [1] "gtex.smtsd"
table(rse$gtex.smtsd)
#> 
#>                          Brain - Amygdala 
#>                                       163 
#>  Brain - Anterior cingulate cortex (BA24) 
#>                                       201 
#>           Brain - Caudate (basal ganglia) 
#>                                       273 
#>             Brain - Cerebellar Hemisphere 
#>                                       250 
#>                        Brain - Cerebellum 
#>                                       285 
#>                            Brain - Cortex 
#>                                       286 
#>              Brain - Frontal Cortex (BA9) 
#>                                       224 
#>                       Brain - Hippocampus 
#>                                       220 
#>                      Brain - Hypothalamus 
#>                                       221 
#> Brain - Nucleus accumbens (basal ganglia) 
#>                                       262 
#>           Brain - Putamen (basal ganglia) 
#>                                       221 
#>        Brain - Spinal cord (cervical c-1) 
#>                                       171 
#>                  Brain - Substantia nigra 
#>                                       154
dim(rse)
#> [1] 63856  2931
dim(rse[, rse$gtex.smtsd == "Brain - Amygdala"])
#> [1] 63856   163

^{Created on 2022-06-07 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)}

Best, Leo