Read in my reference genome as a txdb object....

Question

using qCount from the QuasR package

0

Entering edit mode

jabbar_campbell • 0

@jabbar_campbell-23909

Last seen 4.1 years ago

United States

HELP!!! I'm trying to generate an qcount object that includes geneID, Because my txdb object contains NA's it wont run. Is there a way to filter out all the NA's from my Granges object? The QuasR documentation doesn't mention having to do this...

Read in my reference genome as a txdb object....

txdb<- makeTxDbFromGFF(file = "D:/Reference Genomes/gencode-hr19.gtf", format = "gtf")

Find where the NA's start in the ID column of my txdb object and perform qCount on the preceding subset.....

mycount<- QuasR::qCount(proj=myproj_full, query=txdb2[1:254340], clObj = cl, collapseBySample = TRUE)

qCount QuasR NA • 1.2k views

ADD COMMENT • link updated 5.3 years ago by Michael Stadler ▴ 350 • written 5.3 years ago by jabbar_campbell • 0

score 0 · Answer 1 · 2020-07-29

Hi @jabbar_campbell

The identifiers returned by qCount (rownames of the matrix) are taken from your query, so the solution will be to make sure your query contains the correct gene identifiers or names.

It would be helpful if you provide a reproducible example, in this case e.g. I don't know what the source of the gtf file was, and how txdb2 was constructed. My guess is that you are indeed not using a TxDb object, but rather a GRanges object or similar (TxDb objects cannot be subset as you do with txdb2 in your example).

The following code for example uses a gtf file from Human Gencode Release 34 and will give you Ensembl gene identifiers (make sure your gene annotation is consistent with your reference genome):

library(QuasR)

gtffile <-
"ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_34/gencode.v34.basic.annotation.gtf.gz"
txdb <- makeTxDbFromGFF(gtffile, format = "gtf")
genes(txdb)

The txdb object can then be used directly with qCount, or you can extract a GRanges object with exon regions to quantify genes. In the latter case, the returned identifiers are the unique names of the GRanges object. These two should give identical results (see example below), but the second gives you more flexibility (e.g. to filter out genes without known identifier).

Here is a complete working example based on the example genome, reads and annotation that is part of the QuasR package:

library(QuasR)

# copy example data to temporary directory
td <- tempdir()
file.copy(system.file(package="QuasR", "extdata"), td, recursive=TRUE)

# get gene annotation (Ensembl identifiers)
gtffile <- file.path(td, "extdata", "hg19sub_annotation.gtf")
txdb <- makeTxDbFromGFF(gtffile, format = "gtf")
genes(txdb)

# align reads
proj <- qAlign(file.path(td, "extdata", "samples_rna_single.txt"),
file.path(td, "extdata", "hg19sub.fa"))

# count reads per gene
# ... using TxDb
cnt1 <- qCount(proj, txdb)
cnt1

# ... using exons GRanges extracted from TxDb
query <- unlist(exonsBy(txdb, by = "gene"))
cnt2 <- qCount(proj, query)
cnt2

# ... which both give you identical results (Ensemble identifiers)
identical(cnt1, cnt2)

I hope that clarifies it. The above example and much more is also available from the vignette and function documentations.

Michael