PCA results different from th same sample but different counting method
1
0
Entering edit mode
@jarod_v6liberoit-6654
Last seen 5.9 years ago
Italy

I have 10 sample with two condition (case/controll).  I aligned using STAR 2.25a . On the  left image after aligned I count using htseq-count and then I select only counts are from protein-coding and I use for generate pca plot. In the second case (right image) I use rsem with STAR and I use to import the counts using tximport. In the first case I use ensemble 74 in the second case hg38/84. Both pca are different  What could be wrong? In the first case seem the date of the run do not affect pca (Using htseqcount) in the second case PC1 are able to divide one batch .

Here the script for prepare the counts that I use (only protein coding)

ret<-data.frame(rownames(dt))
ret$ensembl <- sapply(strsplit(rownames(dt),split="nn+"),"[",1)
#use biomart

#ensembl = useMart( "ensembl", dataset = "hsapiens_gene_ensembl" )
ensembl = useMart( host="dec2013.archive.ensembl.org",biomart="ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl" )

genemap <- getBM( attributes = c("ensembl_gene_id", "hgnc_symbol","family","gene_biotype","family_description"),
                  filters = "ensembl_gene_id",
                  values = list(ret$ensembl,"protein_coding"),
                  mart = ensembl )
idx <- match( ret$ensembl, genemap$ensembl_gene_id )

ret$hgnc_symbol <- genemap$hgnc_symbol[ idx ]
ret$entrez <- genemap$entrezgene[ idx ]
ret$hgnc_symbol <- genemap$hgnc_symbol[ idx ]
ret$family <- genemap$family[ idx ]
ret$biotype <- genemap$gene_biotype[ idx ]
ret$family_description <- genemap$family_description[ idx ]

#prepare  coding

matrice1<-ret[ret$biotype=="protein_coding",]
head(matrice1$ensembl)
countRaw<-as.data.frame(counts(dds,normalized=FALSE))
countRawCoding <-  countRaw[matrice1$ensembl,]

and for the plot pca

http://imgur.com/6CKCUPq

PCA from proten coding counts obtained using htseq count or txtimport from rsem-star

 

 

deseq2 pca tximportdata htseqcounts • 2.0k views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 1 day ago
United States
The htseq count and RSEM commands might help one find some differences.
ADD COMMENT
0
Entering edit mode

Thanks so much:

rsem-calculate-expression -p 10 --paired-end  --star --star-path  /illumina/software/PROG/STAR-2.5.2a/bin/Linux_x86_64_static/  --forward-prob 0    --star-gzipped-read-file     trim/367/367.trim.pair1.fastq.gz trim/367/367.trim.pair2.fastq.gz  \ /illumina/software/database/Trasc_ensemble84/rsemGenome/Homo_sapiens.GRCh38.84  ALIGNrsem/367

 

 

samtools sort -no ALIGN/366/Aligned.sortedByCoord.out.bam tmp/366 |samtools view
  - |htseq-count  --mode=intersection-nonempty --stranded=reverse --type=exon --idattr=gene_id - /illumina/software/database/Trasc_GH37_74/Homo_sapiens.GRCh37.74.gtf  > Count/367_counts.txt

 

ADD REPLY
0
Entering edit mode

Well I don't see anything obvious. Those look correct.

I don't know then why gene level counts would be very different across htseq and RSEM. It's expected to have some differences, because RSEM is also probabilistically assigning multimapping reads, but the correlation across methods is usually very high.

Try some basic EDA. Compare column sums across software, plot htseq vs RSEM for the same sample, etc.

It could also be the Ensembl version that makes a difference.

All the little differences in a pipeline can add up to different quantifications...

ADD REPLY
0
Entering edit mode

I try to do some deep analysis . I found something strange. If I use tximport and rsem I found 2 sample are really different . Please see the Image

http://imgur.com/AntGAu9

 

In the first row you can see data obtained from htseq-count and the rlog trasfromation seem work. (From Left to right comparison 1vs2,3vs4,4vs5,boxplot). In the second row the same but using data obtained from tximport and rsem. Sample 3 and Sample 4 seem are not normalize.

 

 

 

ADD REPLY
0
Entering edit mode

If you want to make a clean comparison you will need to eliminate any unnecessary differences. For example, you are using different genomes: GRCh38.84 vs GRCh37.74

But in the end, you can feel free to use whichever pipeline you prefer or think makes the most sense for your data.

ADD REPLY
0
Entering edit mode

I have try a new comparison with both methods. I found some really relevant difference from both experiment.  Here I report pca from both methods. In this case I see the same distortion but with TPM I see more separation. Any suggestion??  http://imgur.com/a/ojx7b

 

ADD REPLY
0
Entering edit mode

I don't have any specific recommendation here.

My preferred pipeline these days is to use transcript quantifiers and tximport.

ADD REPLY

Login before adding your answer.

Traffic: 944 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6