Search
Question: Correct version of EnsDb.Hsapiens for ensembl transcriptome
0
gravatar for emmak
9 months ago by
emmak20
emmak20 wrote:

I downloaded the transcriptome:

 ftp://ftp.ensembl.org/pub/release-86/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz , to quantify read counts using Salmon.

I thought it is equivalent to EnsDb.Hsapiens.v86, and created a data frame to map transcript id to gene id. 

txdf <- transcripts(EnsDb.Hsapiens.v86, return.type="DataFrame")

tx2gene <- as.data.frame(txdf[,c("tx_id", "gene_id")])

When I imported the data: txi <- tximport(files, type="salmon", tx2gene=tx2gene) #without using ignoreTxVersion = TRUE

I got the error: 

  Error in summarizeToGene(txi, tx2gene, ignoreTxVersion, countsFromAbundance) :   
  None of the transcripts in the quantification files are present
  in the first column of tx2gene. Check to see that you are using
  the same annotation for both.

My question is how/where I can obtain the correct version of  EnsDb.Hsapiens for the transcriptome (ftp://ftp.ensembl.org/pub/release-86/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz) ? 

 

Below are the first few lines of a file and tx2gene

> head(tx2gene)
            tx_id         gene_id
1 ENST00000387314     ENSG00000210049
2 ENST00000389680     ENSG00000211459
3 ENST00000387342     ENSG00000210077
4 ENST00000387347     ENSG00000210082
5 ENST00000612848     ENSG00000276345
6 ENST00000386347     ENSG00000209082

> readLines(files[1], 10)

Name    Length    EffectiveLength    TPM    NumReads
ENST00000448914.1    13    5.000    0.000000    0.000000
ENST00000631435.1    12    5.000    0.000000    0.000000
ENST00000434970.2    9    4.000    0.000000    0.000000
ENST00000415118.1    8    4.000    0.000000    0.000000
ENST00000633010.1    16    6.000    0.000000    0.000000
ENST00000632968.1    17    6.000    0.000000    0.000000
ENST00000603693.1    19    7.000    0.000000    0.000000
ENST00000452198.1    18    7.000    0.000000    0.000000
ENST00000632609.1    31    10.000    0.000000    0.000000

 

 

Any help would be greatly appreciated,

Thank you,

 

 

ADD COMMENTlink modified 9 months ago • written 9 months ago by emmak20
0
gravatar for Michael Love
9 months ago by
Michael Love20k
United States
Michael Love20k wrote:

Can you show head(tx2gene) and also show us the first few lines of one of the quant files:

readLines(files[1], 10)
ADD COMMENTlink written 9 months ago by Michael Love20k

Thank you, Mike. I just added additional information. It ran if I used the ignoreTxVersion. But I worry that I may miss some genes because of the mismatch.

ADD REPLYlink written 9 months ago by emmak20
2

You don't have a choice because your tx2gene doesn't contain any version numbers. I don't think there should be duplicate transcripts in the cDNA FASTA with the same ID but different version number.

ADD REPLYlink written 9 months ago by Michael Love20k
1

I agree with Mike, you're using the same Ensembl release, so you can expect that, even if the version number on the tx ID is not present in the EnsDb, they refer to the same transcripts. Ensembl is quite strict and consistent with their IDs.
 

ADD REPLYlink written 9 months ago by Johannes Rainer1.3k

Thank you, Johannes!

ADD REPLYlink written 9 months ago by emmak20

Thank Mike, I will go with ignoreTxVersion. 

I tried to download hg38 transcriptome from ucsc (for salmon) but I am not sure which is correct one (please advise ). So, I tried mrna and refMrna (from http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/) but I got the above error for both. It looks like they have different IDs from tx2gene. 

Then, I tried the piece of code in your doc (http://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html) and got the same error. I don't know if I did something wrong. Do you have any suggestion ?

dir <- system.file("extdata", package = "tximportData")
list.files(dir)
samples <- read.table(file.path(dir, "samples.txt"), header = TRUE)
samples
files <- file.path(dir, "salmon", samples$run, "quant.sf")
names(files) <- paste0("sample", 1:6)
all(file.exists(files))


txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
k <- keys(txdb, keytype = "GENEID")
df <- select(txdb, keys = k, keytype = "GENEID", columns = "TXNAME")
tx2gene <- df[, 2:1]  # tx ID, then gene ID
head(tx2gene)

txi <- tximport(files, type = "salmon", tx2gene = tx2gene)

 

> readLines(files[1], 10)

Name    Length    EffectiveLength    TPM    NumReads
NR_103451    865    706.217    0.048255    1.000000
NM_001243523    577    182.000    0.000000    0.000000
NR_038931    2432    2416.429    0.072372    5.131725
NR_110396    2909    2445.053    1.225229    87.906687
NR_038401    2146    1696.583    0.140607    7.000000
NR_027354    1104    886.567    0.079356    2.064459
NR_027311    2290    2150.480    0.000000    0.000000
NM_001257177    2344    1695.910    0.120568    6.000000

 

> head(tx2gene)
      TXNAME GENEID
1 uc002qsd.4      1
2 uc002qsf.2      1
3 uc003wyw.1     10
4 uc002xmj.3    100
5 uc010xbn.1   1000
6 uc002kwg.2   1000

ADD REPLYlink written 9 months ago by emmak20
1

I don't have recommendations for which transcriptome to use. The important thing is to use the FASTA and GTF from the same source and version, and keep track of this information so you can reproduce the analysis later. You can easily make a tx2gene table directly from a GTF file.

ADD REPLYlink written 9 months ago by Michael Love20k

Thank Mike,

how about the error ? Do you see anything wrong ?

 

ADD REPLYlink written 9 months ago by emmak20
1

Those transcript names are clearly not the same, so how can the software match on them. I also just recommended that you need to pair the FASTA and GTF from the same source, not try to match post hoc and see if any errors pop up.

You need to do a little more work on your end in between posting questions for the developers. I try to answer questions as soon as possible, but i'm limited if I'm getting pinged too frequently.

ADD REPLYlink written 9 months ago by Michael Love20k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 405 users visited in the last hour