Question: PCA results different from th same sample but different counting method
0
gravatar for jarod_v6@libero.it
3.0 years ago by
Italy
jarod_v6@libero.it40 wrote:

I have 10 sample with two condition (case/controll).  I aligned using STAR 2.25a . On the  left image after aligned I count using htseq-count and then I select only counts are from protein-coding and I use for generate pca plot. In the second case (right image) I use rsem with STAR and I use to import the counts using tximport. In the first case I use ensemble 74 in the second case hg38/84. Both pca are different  What could be wrong? In the first case seem the date of the run do not affect pca (Using htseqcount) in the second case PC1 are able to divide one batch .

Here the script for prepare the counts that I use (only protein coding)

ret<-data.frame(rownames(dt))
ret$ensembl <- sapply(strsplit(rownames(dt),split="nn+"),"[",1)
#use biomart

#ensembl = useMart( "ensembl", dataset = "hsapiens_gene_ensembl" )
ensembl = useMart( host="dec2013.archive.ensembl.org",biomart="ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl" )

genemap <- getBM( attributes = c("ensembl_gene_id", "hgnc_symbol","family","gene_biotype","family_description"),
                  filters = "ensembl_gene_id",
                  values = list(ret$ensembl,"protein_coding"),
                  mart = ensembl )
idx <- match( ret$ensembl, genemap$ensembl_gene_id )

ret$hgnc_symbol <- genemap$hgnc_symbol[ idx ]
ret$entrez <- genemap$entrezgene[ idx ]
ret$hgnc_symbol <- genemap$hgnc_symbol[ idx ]
ret$family <- genemap$family[ idx ]
ret$biotype <- genemap$gene_biotype[ idx ]
ret$family_description <- genemap$family_description[ idx ]

#prepare  coding

matrice1<-ret[ret$biotype=="protein_coding",]
head(matrice1$ensembl)
countRaw<-as.data.frame(counts(dds,normalized=FALSE))
countRawCoding <-  countRaw[matrice1$ensembl,]

and for the plot pca

http://imgur.com/6CKCUPq

PCA from proten coding counts obtained using htseq count or txtimport from rsem-star

 

 

ADD COMMENTlink modified 3.0 years ago by Michael Love24k • written 3.0 years ago by jarod_v6@libero.it40
Answer: PCA results different from th same sample but different counting method
0
gravatar for Michael Love
3.0 years ago by
Michael Love24k
United States
Michael Love24k wrote:
The htseq count and RSEM commands might help one find some differences.
ADD COMMENTlink modified 3.0 years ago • written 3.0 years ago by Michael Love24k

Thanks so much:

rsem-calculate-expression -p 10 --paired-end  --star --star-path  /illumina/software/PROG/STAR-2.5.2a/bin/Linux_x86_64_static/  --forward-prob 0    --star-gzipped-read-file     trim/367/367.trim.pair1.fastq.gz trim/367/367.trim.pair2.fastq.gz  \ /illumina/software/database/Trasc_ensemble84/rsemGenome/Homo_sapiens.GRCh38.84  ALIGNrsem/367

 

 

samtools sort -no ALIGN/366/Aligned.sortedByCoord.out.bam tmp/366 |samtools view
  - |htseq-count  --mode=intersection-nonempty --stranded=reverse --type=exon --idattr=gene_id - /illumina/software/database/Trasc_GH37_74/Homo_sapiens.GRCh37.74.gtf  > Count/367_counts.txt

 

ADD REPLYlink written 3.0 years ago by jarod_v6@libero.it40

Well I don't see anything obvious. Those look correct.

I don't know then why gene level counts would be very different across htseq and RSEM. It's expected to have some differences, because RSEM is also probabilistically assigning multimapping reads, but the correlation across methods is usually very high.

Try some basic EDA. Compare column sums across software, plot htseq vs RSEM for the same sample, etc.

It could also be the Ensembl version that makes a difference.

All the little differences in a pipeline can add up to different quantifications...

ADD REPLYlink written 3.0 years ago by Michael Love24k

I try to do some deep analysis . I found something strange. If I use tximport and rsem I found 2 sample are really different . Please see the Image

http://imgur.com/AntGAu9

 

In the first row you can see data obtained from htseq-count and the rlog trasfromation seem work. (From Left to right comparison 1vs2,3vs4,4vs5,boxplot). In the second row the same but using data obtained from tximport and rsem. Sample 3 and Sample 4 seem are not normalize.

 

 

 

ADD REPLYlink written 2.9 years ago by jarod_v6@libero.it40

If you want to make a clean comparison you will need to eliminate any unnecessary differences. For example, you are using different genomes: GRCh38.84 vs GRCh37.74

But in the end, you can feel free to use whichever pipeline you prefer or think makes the most sense for your data.

ADD REPLYlink written 2.9 years ago by Michael Love24k

I have try a new comparison with both methods. I found some really relevant difference from both experiment.  Here I report pca from both methods. In this case I see the same distortion but with TPM I see more separation. Any suggestion??  http://imgur.com/a/ojx7b

 

ADD REPLYlink written 2.6 years ago by jarod_v6@libero.it40

I don't have any specific recommendation here.

My preferred pipeline these days is to use transcript quantifiers and tximport.

ADD REPLYlink written 2.6 years ago by Michael Love24k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 145 users visited in the last hour