Dear Mailing list,
I'm using DEXSeq to analyze RNA-seq data from Illumina HiSeq2000 sequencer. I followed all the steps according to the vignette, getting a count of reads at exon level / per gene / per sample. Next steps into R work perfectly.
Anyway, I have some problem with the upstream passages with python scripts. We are interested in a specific gene, but we have 0 reads for all exons / per gene / per sample for it when using the dexseq_count.py script. This gene is strongly expressed in the samples we sequenced (it's a diagnostic marker) and, indeed, I have several reads mapped to its locus in my bam files.
These are the steps i did:
I downloaded the gtf file from Ensembl (same version of the alignment), modified chromosomes name and processed the file with dexseq_prepare_annotation.py
1) The gene I'm interested in exists in the processed gtf file:
> less FILE.GTF | grep ENSG00000002586
chrX ... aggregate_gene 2609220 2659350 ...
chrX ... exonic_part 2609220 2609316 exonic_part_number "001" ....
chrX ... exonic_part 2609317 2609320 exonic_part_number "002" ....
... ... ... ... ... ....
2) If I look at the same region in my .bam files by samtools, the gene is expressed:
for i in FILE.BAM do; samtools view $i chrX:2609220-2659350 | wc; done
> 17431, 10137, 29503, 20486, 43234, .... (number of reads of the aggregate_gene)
for i in FILE.BAM; do samtools view $i chrX:2609220-2609316 | wc; done
> 165, 137, 143, 473, 304, ... (number of reads for the 1st exon for each sample)
And so on for all exons..
3) when I do the count by dexseq_count.py I have 0 for all exons of my gene of interest in all samples:
dexseq_count.py -p yes -s no -r pos -f bam $FILE.GTF $FILE.BAM output.txt
ENSG00000002586:001 0
ENSG00000002586:002 0
ENSG00000002586:003 0
.... ...
Other genes (also on the same chrX) have counting reads: this is the tail of my counting file for the 1st sample. Rest of samples have numbers of the same order of magnitude:
ENSG00000257111:001 21
ENSG00000257111:002 4
ENSG00000257112:001 66
_ambiguous 1057133
_ambiguous_readpair_position 0
_empty 8974313
_lowaqual 0
_notaligned 0
Accordingly to "samtools flagstat" this is the number of reads of the 1st sample:
51700345 + 0 in total (QC-passed reads + QC-failed reads)
4772889 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
51700345 + 0 mapped (100.00%:-nan%)
46927456 + 0 paired in sequencing
23974422 + 0 read1
22953034 + 0 read2
35066086 + 0 properly paired (74.72%:-nan%)
41167519 + 0 with itself and mate mapped
5759937 + 0 singletons (12.27%:-nan%)
1553678 + 0 with mate mapped to a different chr
1098550 + 0 with mate mapped to a different chr (mapQ>=5)
Data is from paired-end sequencing run. About strendedness option, I'm not sure about the library preparation protocol, so I tried with either -s "no" or -s "yes" options. Bam files are position sorted, anyway I also tried to sort bam files by name and use -n name option, but with the same result.
Let me know if any further information could be useful.
Thanks,
Andrea
PS I attached session.info(), even if related passages are upstream.
> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.2 LTS
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=it_IT.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=it_IT.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=it_IT.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=it_IT.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] DEXSeq_1.14.0 DESeq2_1.8.1
[3] RcppArmadillo_0.5.100.1.0 Rcpp_0.11.6
[5] GenomicRanges_1.20.3 GenomeInfoDb_1.4.0
[7] IRanges_2.2.1 S4Vectors_0.6.0
[9] Biobase_2.28.0 BiocGenerics_0.14.0
[11] BiocParallel_1.2.1
loaded via a namespace (and not attached):
[1] genefilter_1.50.0 statmod_1.4.21 locfit_1.5-9.1
[4] reshape2_1.4.1 splines_3.2.0 lattice_0.20-31
[7] colorspace_1.2-6 survival_2.38-1 XML_3.98-1.1
[10] foreign_0.8-63 DBI_0.3.1 RColorBrewer_1.1-2
[13] lambda.r_1.1.7 plyr_1.8.2 stringr_1.0.0
[16] zlibbioc_1.14.0 Biostrings_2.36.1 munsell_0.4.2
[19] gtable_0.1.2 futile.logger_1.4.1 hwriter_1.3.2
[22] latticeExtra_0.6-26 geneplotter_1.46.0 biomaRt_2.24.0
[25] AnnotationDbi_1.30.1 proto_0.3-10 acepack_1.3-3.3
[28] xtable_1.7-4 scales_0.2.4 Hmisc_3.16-0
[31] annotate_1.46.0 XVector_0.8.0 Rsamtools_1.20.1
[34] gridExtra_0.9.1 ggplot2_1.0.1 digest_0.6.8
[37] stringi_0.4-1 grid_3.2.0 tools_3.2.0
[40] bitops_1.0-6 magrittr_1.5 RCurl_1.95-4.6
[43] RSQLite_1.0.0 Formula_1.2-1 cluster_2.0.1
[46] futile.options_1.0.0 MASS_7.3-40 rpart_4.1-9
[49] nnet_7.3-9
Hi Andrea,
Thanks for your detailed report! Could you add some of the sam files, ordered by read name, of the reads mapping to your gene of interest?
Alejandro
Hi Alejandro,
thank you for your quick reply. I'm sorry but I haven't sam files of my samples but bam only, so I can have index only for position-sorted files.
I'm not sure if following step could be a proper method, but I tried sorting by bash command the output of the bam files indexed by position. If you know a better way let me know, because I'm new to this type of data and any possible correction is appreciated:
samtools view SAMPLE1.bam chrX:2609220-2659350 | sort | head -n 5
samtools view SAMPLE2.bam chrX:2609220-2659350 | sort | head -n 5
samtools view SAMPLE3.bam chrX:2609220-2659350 | sort | head -n 5
Thank you,
Andrea
SAMPLE 1
HWI-H217:48:C20WDACXX:1:1101:10092:74144 163 chrX 2609296 3 78M = 2609362 144 CGAGTCCCCGGGCTTCGCCCCACCCGGCCCGTGGGGGAGTATCTGTCCTGCCGCCTTCGCCCACGCCCTGCACTCCGG B@@DDFDFDHDCAHCFGGGGIGHGHEGGGG<E<=EFC6;<?C;;A@5@@C@A85<DB@25@B<@@;@B7<BDACC??5 XA:i:0 MD:Z:78 NM:i:0 NH:i:2 CC:Z:chrY CP:i:2559296 HI:i:0
HWI-H217:48:C20WDACXX:1:1101:10092:74144 83 chrX 2609362 3 78M = 2609296 -144 CCTGCACTCCGGGACCGTCCCTGCGCGCTCTGGGCGCACCATGGCCCGCGGGGCTGCGCTGGCGCTGCTGCTCTTCGG DDDC<8>;;;BB@;BBDB<;DDDDBBCADDDBDDDCC>CDDBB?DIIIIGHD<HADABGHHEIIGFFDDHDDFDD@@@ XA:i:0 MD:Z:78 NM:i:0 NH:i:2 CC:Z:chrY CP:i:2559362 HI:i:0
HWI-H217:48:C20WDACXX:1:1101:10107:39090 345 chrX 2658885 3 78M * 0 0 ATTAGAACAGCTGCCTGAGGCTCCTCCCTGAAGGACACCTGCCTGAGAGCAGAGATGGAGGCCTTCTGTTCACGGCAG ###@;CCC@;1;@=777FGC7C886(8=B4H@CF<DDEFD;91C3GCHCEGIHACC=F>FE?F>G>D>D=DDD;A@@@ AS:i:-7 XM:i:2 XO:i:0 XG:i:0 MD:Z:0G75G1 NM:i:2 NH:i:2 CC:Z:chrY CP:i:2608885 HI:i:0
HWI-H217:48:C20WDACXX:1:1101:10371:143793 161 chrX 2637730 3 17M650N61M = 2640681 3668 TGTTGATGGAGAAAATGACGACCCACGACCACCGAACCCACCCAAACCGATGCCAAATCCAAACCCCAACCACCCTAG BC?DFFFFHHHHHGHGFHGGFGGF?F>@F@@GHGIE@FIEIIFHFCBEFA9?BDDCCCCDDCDDDDBB8?;?BDD8AC AS:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:78 NM:i:0 XS:A:+ NH:i:2 CC:Z:chrY CP:i:2587730 HI:i:0
HWI-H217:48:C20WDACXX:1:1101:10371:143793 81 chrX 2640681 3 35M639N43M = 2637730 -3668 TGCTGACCTTGCGGATGGCGTTTCAGGTGGAGAAGGAAAAGGAGGCAGTGATGGTGGAGGCAGCCACAGGAAAGAAGG @CCCCACCDBFHFIIIHIIGHFC@IIIIGIIGHGFBGIIEGGIHG@IIIIIGFF;HEGF7IIIHGDFDFFDDDDD@@@ AS:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:78 NM:i:0 XS:A:+ NH:i:2 CC:Z:chrY CP:i:2590681 HI:i:0
SAMPLE 2
HWI-H217:48:C20WDACXX:2:1101:10055:5571 163 chrX 2609293 3 78M = 2609386 171 CTTCGAGTCCCCGGGCTTCGCCCCACCCGGCCCGTGGGGGAGTATCTGTCCTGCCGCCTTCGCCCACGCCCTGCACTC @<<DD;?AFHFFHGIIGIDH?D<6B@FHAGB@GE5:=>AB;B55@@@>;>@9@985@DDDCD@DB7<;B?B;<8?>@3XA:i:0 MD:Z:78 NM:i:0 NH:i:2 CC:Z:chrY CP:i:2559293 HI:i:0
HWI-H217:48:C20WDACXX:2:1101:10055:5571 83 chrX 2609386 3 78M = 2609293 -171 GCGCTCTGGGCGCACCATGGCCCGCGGGGCTGCGCTGGCGCTGCTGCTCTTCGGCCTGCTGGGTGTTCTGGTCGCCGC >BA?8BB;97>@@4AC>?BB@@<@@5???BBBBB@>96>ACA9IGDIEGHIFCGDGB)=IGFBGED>HFFDDABB@@@XA:i:0 MD:Z:78 NM:i:0 NH:i:2 CC:Z:chrY CP:i:2559386 HI:i:0
HWI-H217:48:C20WDACXX:2:1101:10059:108872 147 chrX 2659249 3 78M = 2659134 -193 TTAAAATAAGGTAAGCCCTTCAATTTGTTTCTTCAATATTTCTTTCATTTGTAGGGATATTTGTTTTTCATATCAGAC HHHHJJJJGFJJJIJJJJJJIJJJJIJIIJIJIIJIIEJIIJJJIJJJJJJJJJJJJJIJJIJJJHHHHHFFFFFCCC XA:i:0 MD:Z:78 NM:i:0 NH:i:2 CC:Z:chrY CP:i:2609249 HI:i:0
HWI-H217:48:C20WDACXX:2:1101:10059:108872 99 chrX 2659134 3 26M1I51M = 2659249 193 ACTCAAATGTCAACCCCACCGAGGCACCCCCCCCGTCCCCCAGAATCTTGGCTGTTTACAAATCACGTGTCCATCGAG CCCFFFFFHHHHHHIJJJJJJJJIJEGIJJJJJGD9=ADDDDB<C@AAC:@?>?8@C:>CCD@>@@?<ACBCDDD?CD AS:i:-8 XM:i:0 XO:i:1 XG:i:1 MD:Z:77 NM:i:1 NH:i:2 CC:Z:chrY CP:i:2609134 HI:i:0
HWI-H217:48:C20WDACXX:2:1101:10104:186675 145 chrX 2637722 3 25M650N53M = 2609460 -28990 GATGCTGTTGTTGATGGAGAAAATGACGACCCACGACCACCGAACCCACCCAAACCGATGCCAAATCCAAACCCCAAG @C?A<@BDDDDCCA:@CA;DFA?A?;BCADEFHECAADGJHGCAIHF?AD?HGEJIGJHG>BJIIFH?FDDDDDFC@B AS:i:-5 XM:i:1 XO:i:0 XG:i:0 MD:Z:77C0 NM:i:1 XS:A:+ NH:i:2 CC:Z:chrY CP:i:2587722 HI:i:0
SAMPLE 3
HWI-H217:48:C20WDACXX:4:1101:10014:118563 355 chrX 2644345 3 70M11826N8M = 2644393 11952 GTCGCCGTGGCTGGAGCCATCTCTAGCTTCATTGCTTACCAGAAAAAGAAGCTATGCTTCAAAGAAAATGCAGAACAA @@@FFFFDHHGFHJIJIJJJIGIJDIIIGGHGGIEIIGHIJJIJJIIGGHECHIIJIJJGGJGFHHHFFFFFFEEEEE AS:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:78 NM:i:0 XS:A:+ NH:i:2 CC:Z:chrY CP:i:2594345 HI:i:0
HWI-H217:48:C20WDACXX:4:1101:10014:118563 403 chrX 2644393 3 22M11826N56M = 2644345 -11952 AAGCTATGCTTCAAAGAAAATGCAGAACAAGGGGAGGTGGACATGGAGAGCCACCGGAATGCCAACGCAGAGCCAGCG EEDEEFFFFFFGHHHHHJIIGJJJIIIGGJJJJIIJIJJJIIJJJJIGIGGD?JJIJJJJIJJIIHGHGHFFFFFCCC AS:i:-5 XM:i:1 XO:i:0 XG:i:0 MD:Z:77T0 NM:i:1 XS:A:+ NH:i:2 CC:Z:chrY CP:i:2594393 HI:i:0
HWI-H217:48:C20WDACXX:4:1101:10042:126483 329 chrX 2658971 3 78M * 0 0 TTTTAATCTTGCGATGTGCTTTGCTTGTTGCTGGGCGGATGATGTTTACTAACGATGAATTTTACATCCAAAGGGGGA @@CFDDEDHHHHHGIGFGCHIIGH<FHHIJJJJJFGGGBHB>GIHGIFIIAHA??D;BDEEEECCCDDCCC<5512?@ XA:i:0 MD:Z:78 NM:i:0 NH:i:2 CC:Z:chrY CP:i:2608971 HI:i:0
HWI-H217:48:C20WDACXX:4:1101:10070:113641 339 chrX 2659105 3 78M = 2659078 -105 GAAAGGCTGGCCATTATTAAGTCCCTGTAACTCAAATGTCAACCCCACCGAGGCACCCCCCCGTCCCCCAGAATCTTG CDA?7B@C??@:>>DDAC@DB@?CC@ACDDBAADDDDCD@B<B@@BBDDC@?20-DDIIIIGGIGFFFGHDFFFF@@@ XA:i:0 MD:Z:78 NM:i:0 NH:i:2 CC:Z:chrY CP:i:2609105 HI:i:0
HWI-H217:48:C20WDACXX:4:1101:10070:113641 419 chrX 2659078 3 78M = 2659105 105 CCGGGGGGGCGGTTTCCCATGGGATGTGAAAGGCTGGCCATTATTAAGTCCCTGTAACTCAAATGTCAACCCCACCGA B@@FDFFDDDBD7B@CD@CDDCDA>@(8>CACB<8A@B<A?CCDD8>C>CACCDACDDEDDCDCCDDCD@A>>BDDDD XA:i:0 MD:Z:78 NM:i:0 NH:i:2 CC:Z:chrY CP:i:2609078 HI:i:0
The problem seems to be that the tag NH ("Number of reported alignments that contains the query in the current record" according to the smalls specifications), indicates that these read fragments are mapping 2 times to the reference genome. This is the reason why dexseq_count.py is ignoring them.
What aligner are you using?
Could you check if the reads are really mapping multiple times to the reference genome or if the aligner is somehow adding a wrong value to the tag?
Hi Alejandro,
sorry for the late reply but I was waiting the answer from the facility who did the alignment.
- The aligner is TopHat/Bowtie with default settings
- Indeed, if I correctly interpret the output, reads are mapping to two different regions. Checking in SAMPLE 1 the first read from previous post:
> samtools view SAMPLE1 chrX | grep HWI-H217:48:C20WDACXX:1:1101:10092:74144
HWI-H217:48:C20WDACXX:1:1101:10092:74144 163 chrX 2609296 3 78M = 2609362 144 CGAGTCCCCGGGCTTCGCCCCACCCGGCCCGTGGGGGAGTATCTGTCCTGCCGCCTTCGCCCACGCCCTGCACTCCGG B@@DDFDFDHDCAHCFGGGGIGHGHEGGGG<E<=EFC6;<?C;;A@5@@C@A85<DB@25@B<@@;@B7<BDACC??5 XA:i:0MD:Z:78 NM:i:0 NH:i:2 CC:Z:chrY CP:i:2559296 HI:i:0
HWI-H217:48:C20WDACXX:1:1101:10092:74144 83 chrX 2609362 3 78M = 2609296 -144 CCTGCACTCCGGGACCGTCCCTGCGCGCTCTGGGCGCACCATGGCCCGCGGGGCTGCGCTGGCGCTGCTGCTCTTCGG DDDC<8>;;;BB@;BBDB<;DDDDBBCADDDBDDDCC>CDDBB?DIIIIGHD<HADABGHHEIIGFFDDHDDFDD@@@ XA:i:0MD:Z:78 NM:i:0 NH:i:2 CC:Z:chrY CP:i:2559362 HI:i:0
> samtools view SAMPLE1 chrY | grep HWI-H217:48:C20WDACXX:1:1101:10092:74144
HWI-H217:48:C20WDACXX:1:1101:10092:74144 419 chrY 2559296 3 78M = 2559362 144 CGAGTCCCCGGGCTTCGCCCCACCCGGCCCGTGGGGGAGTATCTGTCCTGCCGCCTTCGCCCACGCCCTGCACTCCGG B@@DDFDFDHDCAHCFGGGGIGHGHEGGGG<E<=EFC6;<?C;;A@5@@C@A85<DB@25@B<@@;@B7<BDACC??5 XA:i:0MD:Z:78 NM:i:0 NH:i:2 HI:i:1
HWI-H217:48:C20WDACXX:1:1101:10092:74144 339 chrY 2559362 3 78M = 2559296 -144 CCTGCACTCCGGGACCGTCCCTGCGCGCTCTGGGCGCACCATGGCCCGCGGGGCTGCGCTGGCGCTGCTGCTCTTCGG DDDC<8>;;;BB@;BBDB<;DDDDBBCADDDBDDDCC>CDDBB?DIIIIGHD<HADABGHHEIIGFFDDHDDFDD@@@ XA:i:0MD:Z:78 NM:i:0 NH:i:2 HI:i:1
As a further info, the gene is on the pseudoautosomal region. Do you know if there is any option in dexseq_count.py to bypass this problem?
Let me also know if any further analysis can be helpful,
thanks,
Andrea
I see, it seems like your gene of interest is contain in the region of chromosome Y that shares homology with chromosome X. Therefore the aligner maps the reads to both the chromosome X and the chromosome Y. You could either modify the dexseq_count.py script and change the line
a.optional_field("NH") > 1
But I would not recommend so, since you have the risk of considering multiple mapping reads for genes in the other chromosomes as well. What I would suggest is to re-align the reads to the reference genome but this time excluding the Y-chromosome from the index of the reference genome that you input to the aligner.
Alejandro