Question

PCA results different from th same sample but different counting method

0

Entering edit mode

jarod_v6@libero.it ▴ 40

@jarod_v6liberoit-6654

Last seen 7.0 years ago

Italy

I have 10 sample with two condition (case/controll). I aligned using STAR 2.25a . On the left image after aligned I count using htseq-count and then I select only counts are from protein-coding and I use for generate pca plot. In the second case (right image) I use rsem with STAR and I use to import the counts using tximport. In the first case I use ensemble 74 in the second case hg38/84. Both pca are different What could be wrong? In the first case seem the date of the run do not affect pca (Using htseqcount) in the second case PC1 are able to divide one batch .

Here the script for prepare the counts that I use (only protein coding)

ret<-data.frame(rownames(dt))
ret$ensembl <- sapply(strsplit(rownames(dt),split="nn+"),"[",1)
#use biomart

#ensembl = useMart( "ensembl", dataset = "hsapiens_gene_ensembl" )
ensembl = useMart( host="dec2013.archive.ensembl.org",biomart="ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl" )

genemap <- getBM( attributes = c("ensembl_gene_id", "hgnc_symbol","family","gene_biotype","family_description"),
                  filters = "ensembl_gene_id",
                  values = list(ret$ensembl,"protein_coding"),
                  mart = ensembl )
idx <- match( ret$ensembl, genemap$ensembl_gene_id )

ret$hgnc_symbol <- genemap$hgnc_symbol[ idx ]
ret$entrez <- genemap$entrezgene[ idx ]
ret$hgnc_symbol <- genemap$hgnc_symbol[ idx ]
ret$family <- genemap$family[ idx ]
ret$biotype <- genemap$gene_biotype[ idx ]
ret$family_description <- genemap$family_description[ idx ]

#prepare  coding

matrice1<-ret[ret$biotype=="protein_coding",]
head(matrice1$ensembl)
countRaw<-as.data.frame(counts(dds,normalized=FALSE))
countRawCoding <-  countRaw[matrice1$ensembl,]

and for the plot pca

http://imgur.com/6CKCUPq

deseq2 pca tximportdata htseqcounts • 2.3k views

ADD COMMENT • link updated 9.6 years ago by Michael Love 43k • written 9.6 years ago by jarod_v6@libero.it ▴ 40

score 0 · Answer 1 · 2016-07-07

0

Entering edit mode

Michael Love 43k

@mikelove

Last seen 1 day ago

United States

The htseq count and RSEM commands might help one find some differences.

ADD COMMENT • link 9.6 years ago Michael Love 43k

0

Entering edit mode

Thanks so much:

rsem-calculate-expression -p 10 --paired-end  --star --star-path  /illumina/software/PROG/STAR-2.5.2a/bin/Linux_x86_64_static/  --forward-prob 0    --star-gzipped-read-file     trim/367/367.trim.pair1.fastq.gz trim/367/367.trim.pair2.fastq.gz  \ /illumina/software/database/Trasc_ensemble84/rsemGenome/Homo_sapiens.GRCh38.84  ALIGNrsem/367

samtools sort -no ALIGN/366/Aligned.sortedByCoord.out.bam tmp/366 |samtools view
  - |htseq-count  --mode=intersection-nonempty --stranded=reverse --type=exon --idattr=gene_id - /illumina/software/database/Trasc_GH37_74/Homo_sapiens.GRCh37.74.gtf  > Count/367_counts.txt

ADD REPLY • link 9.6 years ago jarod_v6@libero.it ▴ 40

0

Entering edit mode

Well I don't see anything obvious. Those look correct.

I don't know then why gene level counts would be very different across htseq and RSEM. It's expected to have some differences, because RSEM is also probabilistically assigning multimapping reads, but the correlation across methods is usually very high.

Try some basic EDA. Compare column sums across software, plot htseq vs RSEM for the same sample, etc.

It could also be the Ensembl version that makes a difference.

All the little differences in a pipeline can add up to different quantifications...

ADD REPLY • link 9.6 years ago Michael Love 43k

0

Entering edit mode

I try to do some deep analysis . I found something strange. If I use tximport and rsem I found 2 sample are really different . Please see the Image

http://imgur.com/AntGAu9

In the first row you can see data obtained from htseq-count and the rlog trasfromation seem work. (From Left to right comparison 1vs2,3vs4,4vs5,boxplot). In the second row the same but using data obtained from tximport and rsem. Sample 3 and Sample 4 seem are not normalize.

ADD REPLY • link 9.6 years ago jarod_v6@libero.it ▴ 40

0

Entering edit mode

If you want to make a clean comparison you will need to eliminate any unnecessary differences. For example, you are using different genomes: GRCh38.84 vs GRCh37.74

But in the end, you can feel free to use whichever pipeline you prefer or think makes the most sense for your data.

ADD REPLY • link 9.6 years ago Michael Love 43k

0

Entering edit mode

I have try a new comparison with both methods. I found some really relevant difference from both experiment. Here I report pca from both methods. In this case I see the same distortion but with TPM I see more separation. Any suggestion?? http://imgur.com/a/ojx7b

ADD REPLY • link 9.2 years ago jarod_v6@libero.it ▴ 40

0

Entering edit mode

I don't have any specific recommendation here.

My preferred pipeline these days is to use transcript quantifiers and tximport.

ADD REPLY • link 9.2 years ago Michael Love 43k