DEXSeq: 0 reads for all exons for an expressed gene
0
0
Entering edit mode
Andrea Grilli ▴ 240
@andrea-grilli-4664
Last seen 8.9 years ago
Italy, Bologna, Rizzoli Orthopaedic Ins…

Dear Mailing list,

I'm using DEXSeq to analyze RNA-seq data from Illumina HiSeq2000 sequencer. I followed all the steps according to the vignette, getting a count of reads at exon level / per gene / per sample. Next steps into R work perfectly.

Anyway, I have some problem with the upstream passages with python scripts. We are interested in a specific gene, but we have 0 reads for all exons / per gene / per sample for it when using the dexseq_count.py script. This gene is strongly expressed in the samples we sequenced (it's a diagnostic marker) and, indeed, I have several reads mapped to its locus in my bam files.

 

These are the steps i did:

I downloaded the gtf file from Ensembl (same version of the alignment), modified chromosomes name and processed the file with dexseq_prepare_annotation.py

 

1) The gene I'm interested in exists in the processed gtf file:

> less FILE.GTF | grep ENSG00000002586
chrX    ...    aggregate_gene    2609220    2659350    ...
chrX    ...    exonic_part    2609220    2609316    exonic_part_number "001"   ....
chrX    ...    exonic_part    2609317    2609320    exonic_part_number "002"   ....
...         ...    ...                    ...              ...                ....

 

2) If I look at the same region in my .bam files by samtools, the gene is expressed:

for i in FILE.BAM do; samtools view $i chrX:2609220-2659350 | wc; done
> 17431, 10137, 29503, 20486, 43234, ....    (number of reads of the aggregate_gene)

for i in FILE.BAM; do samtools view $i chrX:2609220-2609316 | wc; done
> 165, 137, 143, 473, 304, ... (number of reads for the 1st exon for each sample)

And so on for all exons..

 

3) when I do the count by dexseq_count.py I have 0 for all exons of my gene of interest in all samples:

dexseq_count.py -p yes -s no  -r pos -f bam $FILE.GTF $FILE.BAM output.txt

ENSG00000002586:001    0
ENSG00000002586:002    0
ENSG00000002586:003    0

....                                        ...

Other genes (also on the same chrX) have counting reads: this is the tail of my counting file for the 1st sample. Rest of samples have numbers of the same order of magnitude:

ENSG00000257111:001    21
ENSG00000257111:002    4
ENSG00000257112:001    66
_ambiguous    1057133
_ambiguous_readpair_position    0
_empty    8974313
_lowaqual    0
_notaligned    0

 

Accordingly to "samtools flagstat" this is the number of reads of the 1st sample:

51700345 + 0 in total (QC-passed reads + QC-failed reads)
4772889 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
51700345 + 0 mapped (100.00%:-nan%)
46927456 + 0 paired in sequencing
23974422 + 0 read1
22953034 + 0 read2
35066086 + 0 properly paired (74.72%:-nan%)

41167519 + 0 with itself and mate mapped
5759937 + 0 singletons (12.27%:-nan%)
1553678 + 0 with mate mapped to a different chr
1098550 + 0 with mate mapped to a different chr (mapQ>=5)

 

Data is from paired-end sequencing run. About strendedness option, I'm not sure about the library preparation protocol, so I tried with either -s "no" or -s "yes" options. Bam files are position sorted, anyway I also tried to sort bam files by name and use -n name option, but with the same result.

Let me know if any further information could be useful.

Thanks,

Andrea

 

PS I attached session.info(), even if related passages are upstream.

> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.2 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=it_IT.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=it_IT.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=it_IT.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=it_IT.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
 [1] DEXSeq_1.14.0             DESeq2_1.8.1             
 [3] RcppArmadillo_0.5.100.1.0 Rcpp_0.11.6              
 [5] GenomicRanges_1.20.3      GenomeInfoDb_1.4.0       
 [7] IRanges_2.2.1             S4Vectors_0.6.0          
 [9] Biobase_2.28.0            BiocGenerics_0.14.0      
[11] BiocParallel_1.2.1       

loaded via a namespace (and not attached):
 [1] genefilter_1.50.0    statmod_1.4.21       locfit_1.5-9.1      
 [4] reshape2_1.4.1       splines_3.2.0        lattice_0.20-31     
 [7] colorspace_1.2-6     survival_2.38-1      XML_3.98-1.1        
[10] foreign_0.8-63       DBI_0.3.1            RColorBrewer_1.1-2  
[13] lambda.r_1.1.7       plyr_1.8.2           stringr_1.0.0       
[16] zlibbioc_1.14.0      Biostrings_2.36.1    munsell_0.4.2       
[19] gtable_0.1.2         futile.logger_1.4.1  hwriter_1.3.2       
[22] latticeExtra_0.6-26  geneplotter_1.46.0   biomaRt_2.24.0      
[25] AnnotationDbi_1.30.1 proto_0.3-10         acepack_1.3-3.3     
[28] xtable_1.7-4         scales_0.2.4         Hmisc_3.16-0        
[31] annotate_1.46.0      XVector_0.8.0        Rsamtools_1.20.1    
[34] gridExtra_0.9.1      ggplot2_1.0.1        digest_0.6.8        
[37] stringi_0.4-1        grid_3.2.0           tools_3.2.0         
[40] bitops_1.0-6         magrittr_1.5         RCurl_1.95-4.6      
[43] RSQLite_1.0.0        Formula_1.2-1        cluster_2.0.1       
[46] futile.options_1.0.0 MASS_7.3-40          rpart_4.1-9         
[49] nnet_7.3-9

dexseq • 2.1k views
ADD COMMENT
0
Entering edit mode

Hi Andrea, 

Thanks for your detailed report! Could you add some of the sam files, ordered by read name, of the reads mapping to your gene of interest?

Alejandro

 

ADD REPLY
0
Entering edit mode

Hi Alejandro,

thank you for your quick reply. I'm sorry but I haven't sam files of my samples but bam only, so I can have index only for position-sorted files.

 

I'm not sure if following step could be a proper method, but I tried sorting by bash command the output of the bam files indexed by position. If you know a better way let me know, because I'm new to this type of data and any possible correction is appreciated:

samtools view SAMPLE1.bam chrX:2609220-2659350 | sort | head -n 5
samtools view SAMPLE2.bam chrX:2609220-2659350 | sort | head -n 5

samtools view SAMPLE3.bam chrX:2609220-2659350 | sort | head -n 5

Thank you,

Andrea

 

 

SAMPLE 1

HWI-H217:48:C20WDACXX:1:1101:10092:74144    163    chrX    2609296    3    78M    =    2609362    144    CGAGTCCCCGGGCTTCGCCCCACCCGGCCCGTGGGGGAGTATCTGTCCTGCCGCCTTCGCCCACGCCCTGCACTCCGG    B@@DDFDFDHDCAHCFGGGGIGHGHEGGGG<E<=EFC6;<?C;;A@5@@C@A85<DB@25@B<@@;@B7<BDACC??5    XA:i:0    MD:Z:78    NM:i:0    NH:i:2    CC:Z:chrY    CP:i:2559296    HI:i:0
HWI-H217:48:C20WDACXX:1:1101:10092:74144    83    chrX    2609362    3    78M    =    2609296    -144    CCTGCACTCCGGGACCGTCCCTGCGCGCTCTGGGCGCACCATGGCCCGCGGGGCTGCGCTGGCGCTGCTGCTCTTCGG    DDDC<8>;;;BB@;BBDB<;DDDDBBCADDDBDDDCC>CDDBB?DIIIIGHD<HADABGHHEIIGFFDDHDDFDD@@@    XA:i:0    MD:Z:78    NM:i:0    NH:i:2    CC:Z:chrY    CP:i:2559362    HI:i:0
HWI-H217:48:C20WDACXX:1:1101:10107:39090    345    chrX    2658885    3    78M    *    0    0    ATTAGAACAGCTGCCTGAGGCTCCTCCCTGAAGGACACCTGCCTGAGAGCAGAGATGGAGGCCTTCTGTTCACGGCAG    ###@;CCC@;1;@=777FGC7C886(8=B4H@CF<DDEFD;91C3GCHCEGIHACC=F>FE?F>G>D>D=DDD;A@@@    AS:i:-7    XM:i:2    XO:i:0    XG:i:0    MD:Z:0G75G1    NM:i:2    NH:i:2    CC:Z:chrY    CP:i:2608885    HI:i:0
HWI-H217:48:C20WDACXX:1:1101:10371:143793    161    chrX    2637730    3    17M650N61M    =    2640681    3668    TGTTGATGGAGAAAATGACGACCCACGACCACCGAACCCACCCAAACCGATGCCAAATCCAAACCCCAACCACCCTAG    BC?DFFFFHHHHHGHGFHGGFGGF?F>@F@@GHGIE@FIEIIFHFCBEFA9?BDDCCCCDDCDDDDBB8?;?BDD8AC    AS:i:0    XM:i:0    XO:i:0    XG:i:0    MD:Z:78    NM:i:0    XS:A:+    NH:i:2    CC:Z:chrY    CP:i:2587730    HI:i:0
HWI-H217:48:C20WDACXX:1:1101:10371:143793    81    chrX    2640681    3    35M639N43M    =    2637730    -3668    TGCTGACCTTGCGGATGGCGTTTCAGGTGGAGAAGGAAAAGGAGGCAGTGATGGTGGAGGCAGCCACAGGAAAGAAGG    @CCCCACCDBFHFIIIHIIGHFC@IIIIGIIGHGFBGIIEGGIHG@IIIIIGFF;HEGF7IIIHGDFDFFDDDDD@@@    AS:i:0    XM:i:0    XO:i:0    XG:i:0    MD:Z:78    NM:i:0    XS:A:+    NH:i:2    CC:Z:chrY    CP:i:2590681    HI:i:0

 

 

SAMPLE 2

HWI-H217:48:C20WDACXX:2:1101:10055:5571    163    chrX    2609293    3    78M    =    2609386    171    CTTCGAGTCCCCGGGCTTCGCCCCACCCGGCCCGTGGGGGAGTATCTGTCCTGCCGCCTTCGCCCACGCCCTGCACTC    @<<DD;?AFHFFHGIIGIDH?D<6B@FHAGB@GE5:=>AB;B55@@@>;>@9@985@DDDCD@DB7<;B?B;<8?>@3XA:i:0    MD:Z:78    NM:i:0    NH:i:2    CC:Z:chrY    CP:i:2559293    HI:i:0
HWI-H217:48:C20WDACXX:2:1101:10055:5571    83    chrX    2609386    3    78M    =    2609293    -171    GCGCTCTGGGCGCACCATGGCCCGCGGGGCTGCGCTGGCGCTGCTGCTCTTCGGCCTGCTGGGTGTTCTGGTCGCCGC    >BA?8BB;97>@@4AC>?BB@@<@@5???BBBBB@>96>ACA9IGDIEGHIFCGDGB)=IGFBGED>HFFDDABB@@@XA:i:0    MD:Z:78    NM:i:0    NH:i:2    CC:Z:chrY    CP:i:2559386    HI:i:0
HWI-H217:48:C20WDACXX:2:1101:10059:108872    147    chrX    2659249    3    78M    =    2659134    -193    TTAAAATAAGGTAAGCCCTTCAATTTGTTTCTTCAATATTTCTTTCATTTGTAGGGATATTTGTTTTTCATATCAGAC    HHHHJJJJGFJJJIJJJJJJIJJJJIJIIJIJIIJIIEJIIJJJIJJJJJJJJJJJJJIJJIJJJHHHHHFFFFFCCC    XA:i:0    MD:Z:78    NM:i:0    NH:i:2    CC:Z:chrY    CP:i:2609249    HI:i:0
HWI-H217:48:C20WDACXX:2:1101:10059:108872    99    chrX    2659134    3    26M1I51M    =    2659249    193    ACTCAAATGTCAACCCCACCGAGGCACCCCCCCCGTCCCCCAGAATCTTGGCTGTTTACAAATCACGTGTCCATCGAG    CCCFFFFFHHHHHHIJJJJJJJJIJEGIJJJJJGD9=ADDDDB<C@AAC:@?>?8@C:>CCD@>@@?<ACBCDDD?CD    AS:i:-8    XM:i:0    XO:i:1    XG:i:1    MD:Z:77    NM:i:1    NH:i:2    CC:Z:chrY    CP:i:2609134    HI:i:0
HWI-H217:48:C20WDACXX:2:1101:10104:186675    145    chrX    2637722    3    25M650N53M    =    2609460    -28990    GATGCTGTTGTTGATGGAGAAAATGACGACCCACGACCACCGAACCCACCCAAACCGATGCCAAATCCAAACCCCAAG    @C?A<@BDDDDCCA:@CA;DFA?A?;BCADEFHECAADGJHGCAIHF?AD?HGEJIGJHG>BJIIFH?FDDDDDFC@B    AS:i:-5    XM:i:1    XO:i:0    XG:i:0    MD:Z:77C0    NM:i:1    XS:A:+    NH:i:2    CC:Z:chrY    CP:i:2587722    HI:i:0
 

SAMPLE 3

HWI-H217:48:C20WDACXX:4:1101:10014:118563    355    chrX    2644345    3    70M11826N8M    =    2644393    11952    GTCGCCGTGGCTGGAGCCATCTCTAGCTTCATTGCTTACCAGAAAAAGAAGCTATGCTTCAAAGAAAATGCAGAACAA    @@@FFFFDHHGFHJIJIJJJIGIJDIIIGGHGGIEIIGHIJJIJJIIGGHECHIIJIJJGGJGFHHHFFFFFFEEEEE    AS:i:0    XM:i:0    XO:i:0    XG:i:0    MD:Z:78    NM:i:0    XS:A:+    NH:i:2    CC:Z:chrY    CP:i:2594345    HI:i:0
HWI-H217:48:C20WDACXX:4:1101:10014:118563    403    chrX    2644393    3    22M11826N56M    =    2644345    -11952    AAGCTATGCTTCAAAGAAAATGCAGAACAAGGGGAGGTGGACATGGAGAGCCACCGGAATGCCAACGCAGAGCCAGCG    EEDEEFFFFFFGHHHHHJIIGJJJIIIGGJJJJIIJIJJJIIJJJJIGIGGD?JJIJJJJIJJIIHGHGHFFFFFCCC    AS:i:-5    XM:i:1    XO:i:0    XG:i:0    MD:Z:77T0    NM:i:1    XS:A:+    NH:i:2    CC:Z:chrY    CP:i:2594393    HI:i:0
HWI-H217:48:C20WDACXX:4:1101:10042:126483    329    chrX    2658971    3    78M    *    0    0    TTTTAATCTTGCGATGTGCTTTGCTTGTTGCTGGGCGGATGATGTTTACTAACGATGAATTTTACATCCAAAGGGGGA    @@CFDDEDHHHHHGIGFGCHIIGH<FHHIJJJJJFGGGBHB>GIHGIFIIAHA??D;BDEEEECCCDDCCC<5512?@    XA:i:0    MD:Z:78    NM:i:0    NH:i:2    CC:Z:chrY    CP:i:2608971    HI:i:0
HWI-H217:48:C20WDACXX:4:1101:10070:113641    339    chrX    2659105    3    78M    =    2659078    -105    GAAAGGCTGGCCATTATTAAGTCCCTGTAACTCAAATGTCAACCCCACCGAGGCACCCCCCCGTCCCCCAGAATCTTG    CDA?7B@C??@:>>DDAC@DB@?CC@ACDDBAADDDDCD@B<B@@BBDDC@?20-DDIIIIGGIGFFFGHDFFFF@@@    XA:i:0    MD:Z:78    NM:i:0    NH:i:2    CC:Z:chrY    CP:i:2609105    HI:i:0
HWI-H217:48:C20WDACXX:4:1101:10070:113641    419    chrX    2659078    3    78M    =    2659105    105    CCGGGGGGGCGGTTTCCCATGGGATGTGAAAGGCTGGCCATTATTAAGTCCCTGTAACTCAAATGTCAACCCCACCGA    B@@FDFFDDDBD7B@CD@CDDCDA>@(8>CACB<8A@B<A?CCDD8>C>CACCDACDDEDDCDCCDDCD@A>>BDDDD    XA:i:0    MD:Z:78    NM:i:0    NH:i:2    CC:Z:chrY    CP:i:2609078    HI:i:0

ADD REPLY
0
Entering edit mode

The problem seems to be that the tag NH ("Number of reported alignments that contains the query in the current record" according to the smalls specifications), indicates that these read fragments are mapping 2 times to the reference genome. This is the reason why dexseq_count.py is ignoring them. 

What aligner are you using?

Could you check if the reads are really mapping multiple times to the reference genome or if the aligner is somehow adding a wrong value to the tag?

 

ADD REPLY
0
Entering edit mode

Hi Alejandro,
sorry for the late reply but I was waiting the answer from the facility who did the alignment.
- The aligner is TopHat/Bowtie with default settings
- Indeed, if I correctly interpret the output, reads are mapping to two different regions. Checking in SAMPLE 1 the first read from previous post:

> samtools view SAMPLE1 chrX | grep HWI-H217:48:C20WDACXX:1:1101:10092:74144

HWI-H217:48:C20WDACXX:1:1101:10092:74144    163    chrX    2609296    3    78M    =    2609362    144    CGAGTCCCCGGGCTTCGCCCCACCCGGCCCGTGGGGGAGTATCTGTCCTGCCGCCTTCGCCCACGCCCTGCACTCCGG    B@@DDFDFDHDCAHCFGGGGIGHGHEGGGG<E<=EFC6;<?C;;A@5@@C@A85<DB@25@B<@@;@B7<BDACC??5    XA:i:0MD:Z:78    NM:i:0    NH:i:2    CC:Z:chrY    CP:i:2559296    HI:i:0
HWI-H217:48:C20WDACXX:1:1101:10092:74144    83    chrX    2609362    3    78M    =    2609296    -144    CCTGCACTCCGGGACCGTCCCTGCGCGCTCTGGGCGCACCATGGCCCGCGGGGCTGCGCTGGCGCTGCTGCTCTTCGG    DDDC<8>;;;BB@;BBDB<;DDDDBBCADDDBDDDCC>CDDBB?DIIIIGHD<HADABGHHEIIGFFDDHDDFDD@@@    XA:i:0MD:Z:78    NM:i:0    NH:i:2    CC:Z:chrY    CP:i:2559362    HI:i:0

> samtools view SAMPLE1 chrY | grep HWI-H217:48:C20WDACXX:1:1101:10092:74144

HWI-H217:48:C20WDACXX:1:1101:10092:74144    419    chrY    2559296    3    78M    =    2559362    144    CGAGTCCCCGGGCTTCGCCCCACCCGGCCCGTGGGGGAGTATCTGTCCTGCCGCCTTCGCCCACGCCCTGCACTCCGG    B@@DDFDFDHDCAHCFGGGGIGHGHEGGGG<E<=EFC6;<?C;;A@5@@C@A85<DB@25@B<@@;@B7<BDACC??5    XA:i:0MD:Z:78    NM:i:0    NH:i:2    HI:i:1
HWI-H217:48:C20WDACXX:1:1101:10092:74144    339    chrY    2559362    3    78M    =    2559296    -144    CCTGCACTCCGGGACCGTCCCTGCGCGCTCTGGGCGCACCATGGCCCGCGGGGCTGCGCTGGCGCTGCTGCTCTTCGG    DDDC<8>;;;BB@;BBDB<;DDDDBBCADDDBDDDCC>CDDBB?DIIIIGHD<HADABGHHEIIGFFDDHDDFDD@@@    XA:i:0MD:Z:78    NM:i:0    NH:i:2    HI:i:1

As a further info, the gene is on the pseudoautosomal region. Do you know if there is any option in dexseq_count.py to bypass this problem?
Let me also know if any further analysis can be helpful,
thanks,
Andrea

 

ADD REPLY
0
Entering edit mode

I see, it seems like your gene of interest is contain in the region of chromosome Y that shares homology with chromosome X. Therefore the aligner maps the reads to both the chromosome X and the chromosome Y. You could either modify the dexseq_count.py script and change the line 

a.optional_field("NH") > 1

But I would not recommend so, since you have the risk of considering multiple mapping reads for genes in the other chromosomes as well. What I would suggest is to re-align the reads to the reference genome but this time excluding the Y-chromosome from the index of the reference genome that you input to the aligner.

Alejandro

ADD REPLY

Login before adding your answer.

Traffic: 691 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6