error in aSwitchlist (isoformswitchanalyzeR)
1
0
Entering edit mode
Last seen 2.5 years ago

Hi, I am trying to import salmon quant data to isotypeswitchanalyzeR for transcript variant use analysis. But I am seeing this error-

aSwitchList <- importRdata(

• isoformCountMatrix = txi.tx$counts, • isoformRepExpression = txi.tx$abundance,

• designMatrix = mycols,

• isoformExonAnnoation = "hg19_knowngene.gtf",

• showProgress = FALSE

• )

Step 1 of 6: Checking data... Using row.names as 'isoformid' for 'isoformCountMatrix'. If not suitable you must add them manually. Using row.names as 'isoformid' for 'isoformRepExpression'. If not suitable you must add them manually. Step 2 of 6: Obtaining annotation... importing GTF (this may take a while) converting annotated CDSs

Error in importRdata(isoformCountMatrix = txi.tx$counts, isoformRepExpression = txi.tx$abundance, :

The annotation and quantification (count/abundance matrix and isoform annotation) seems to be different (jacard similarity < 0.95).

Either isforoms found in the annotation are not quantifed or vise versa.

Specifically:

78631 isoforms were quantified.

82960 isoforms are annotated.

Only 78631 overlap.

This combination cannot be analyzed since it will cause discrepencies between quantification and annotation thereby skewing all analysis.

Please make sure they belong together and try again. For more info see the FAQ in the vignette.

The GTF file I used during salmon quantification contains 82860 transcripts whereas my quant files contain 78631 rows.

Any help would be appreciated.

isoformswitchanalyzeR tximport rnaseq salmon • 877 views
0
Entering edit mode

Which version of IsoformSwitchAnalyzeR are you using?

Cheers Kristoffer

0
Entering edit mode

I have loaded 1.4.0. on rstudio.

0
Entering edit mode

Could you try updating to 1.5.6 (from the devel branch) and see if the problem persists? You can find the installation instructions here. Remember to restart your R session after the update.

Cheers Kristoffer

0
Entering edit mode

yeah, sure. I will try it.

0
Entering edit mode

I tried with 1.5.6 and got the same error.

1
Entering edit mode
@kvittingseerup-7956
Last seen 10 weeks ago
European Union

Thanks for clarifying versions above. The problem is that I have hardcoded the min Jaccard similarity of the annotation and the quantification to prevent people from using the wrong annotation with their quantifications. And as you can see the JC similarity is 0.948 (so close). When I recalculate it with the number of features form the files I downloaded I get 0.958 - did you use an older version?

It is the first time I have seen the ratio fall below my cutoff - but I will lower it in the next release (which however will not be until hopefully next week somewhere). So you either have to wait a bit or to switch to another annotation (such as Gencode genes (usually recommend)).

Out of curiosity why do you choose UCSC Knowngenes?

Cheers Kristoffer

0
Entering edit mode

Thank you Kristoffer for the clarification. Well, I uploaded knowngene file from ucsc table browser to the galaxy and then downloaded it some 4-5 months ago, so I don't know whether they have updated it recently. Please forward me the link from where you downloaded the file, I am having hard time finding it. Thats why I took convoluted route.

Coming to why I chose it, In one of the paper different gtf annotation were compared for rna seq data analysis and they found that different number of DEGs with different gtf files were produced.I also read somewhere that knowngene gtf contains gene annotation common to ncbi refseq and ensemble. Further I suppose its also updated time to time.

Anyways, if you have any suggestion regarding the use of use of gtf, please advice me.

P.S. I also found something odd on ucsc knowngene gtf file -

  GENEID     TXNAME ntx


1 uc001aaa.3 uc001aaa.3 1

2 uc001aac.4 uc001aac.4 1

3 uc001aae.4 uc001aae.4 1

4 uc001aah.4 uc001aah.4 1

5 uc001aai.1 uc001aai.1 1

6 uc001aak.3 uc001aak.3 1

Why there is no difference between GENEID and TXNAME?

0
Entering edit mode

That is a known problem. UCSC does not have geneids which means you cannot directly use the with IsoformSwitchAnalyzeR (since it relies on geneids). So you would need to manually "concatenate" transcripts into genes. There are some lists hidden somewhere on UCSC where you can find txid conversion to gene names - but that is not good enough since quite a few genes are located in multiple spots in the human genome so it requires some work - was one of the major headaches I had when reanalyzing TCGA data in this paper.

Based on that (and the fact that the annotation you quantify is 10 years old) I would suggest re-quantifying with Ensemble/Gencode. For Gencode the “Comprehensive gene annotation” GTF and the “Transcript sequences” fasta file is a perfect pair. And I would also suggest doing it in Hg38 (instead of Hg19) (all Gencode version > 19 are in Hg38) unless you have a good reason not to.

Cheers Kristoffer

0
Entering edit mode

Thanks again kristoffer for all the inputs and the support. I will try with genecode annotation. The issue I find with the annotation files is that no two annotation system replicate the results and I have some worrisome data generated using gencode, ensemble and ucsc gtf files and I found significant differences in the result even after using same analysis pipeline. Anyway, this is a different topic need to be addressed. Thanks for your time.