Question

DEXSeq: 0 reads for all exons for an expressed gene

0

Entering edit mode

Andrea Grilli ▴ 240

@andrea-grilli-4664

Last seen 8.9 years ago

Italy, Bologna, Rizzoli Orthopaedic Ins…

Dear Mailing list,

I'm using DEXSeq to analyze RNA-seq data from Illumina HiSeq2000 sequencer. I followed all the steps according to the vignette, getting a count of reads at exon level / per gene / per sample. Next steps into R work perfectly.

Anyway, I have some problem with the upstream passages with python scripts. We are interested in a specific gene, but we have 0 reads for all exons / per gene / per sample for it when using the dexseq_count.py script. This gene is strongly expressed in the samples we sequenced (it's a diagnostic marker) and, indeed, I have several reads mapped to its locus in my bam files.

These are the steps i did:

I downloaded the gtf file from Ensembl (same version of the alignment), modified chromosomes name and processed the file with dexseq_prepare_annotation.py

1) The gene I'm interested in exists in the processed gtf file:

> less FILE.GTF | grep ENSG00000002586
chrX    ...    aggregate_gene   2609220   2659350    ...
chrX    ...    exonic_part   2609220   2609316    exonic_part_number "001"   ....
chrX    ...    exonic_part   2609317   2609320    exonic_part_number "002"   ....
...         ...    ...                  ...              ...                ....

2) If I look at the same region in my .bam files by samtools, the gene is expressed:

for i in FILE.BAM do; samtools view $i chrX:2609220-2659350 | wc; done
> 17431, 10137, 29503, 20486, 43234, .... (number of reads of the aggregate_gene)

for i in FILE.BAM; do samtools view $i chrX:2609220-2609316 | wc; done
> 165, 137, 143, 473, 304, ... (number of reads for the 1st exon for each sample)

And so on for all exons..

3) when I do the count by dexseq_count.py I have 0 for all exons of my gene of interest in all samples:

dexseq_count.py -p yes -s no -r pos -f bam $FILE.GTF $FILE.BAM output.txt

ENSG00000002586:001   0
ENSG00000002586:002   0
ENSG00000002586:003   0

.... ...

Other genes (also on the same chrX) have counting reads: this is the tail of my counting file for the 1st sample. Rest of samples have numbers of the same order of magnitude:

ENSG00000257111:001   21
ENSG00000257111:002   4
ENSG00000257112:001   66
_ambiguous   1057133
_ambiguous_readpair_position   0
_empty   8974313
_lowaqual   0
_notaligned   0

Accordingly to "samtools flagstat" this is the number of reads of the 1st sample:

51700345 + 0 in total (QC-passed reads + QC-failed reads)
4772889 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
51700345 + 0 mapped (100.00%:-nan%)
46927456 + 0 paired in sequencing
23974422 + 0 read1
22953034 + 0 read2
35066086 + 0 properly paired (74.72%:-nan%)

41167519 + 0 with itself and mate mapped
5759937 + 0 singletons (12.27%:-nan%)
1553678 + 0 with mate mapped to a different chr
1098550 + 0 with mate mapped to a different chr (mapQ>=5)

Data is from paired-end sequencing run. About strendedness option, I'm not sure about the library preparation protocol, so I tried with either -s "no" or -s "yes" options. Bam files are position sorted, anyway I also tried to sort bam files by name and use -n name option, but with the same result.

Let me know if any further information could be useful.

Thanks,

Andrea

PS I attached session.info(), even if related passages are upstream.

> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.2 LTS

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=it_IT.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=it_IT.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=it_IT.UTF-8       LC_NAME=C
[9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=it_IT.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel stats     graphics grDevices utils     datasets
[8] methods   base

other attached packages:
[1] DEXSeq_1.14.0             DESeq2_1.8.1
[3] RcppArmadillo_0.5.100.1.0 Rcpp_0.11.6
[5] GenomicRanges_1.20.3      GenomeInfoDb_1.4.0
[7] IRanges_2.2.1             S4Vectors_0.6.0
[9] Biobase_2.28.0            BiocGenerics_0.14.0
[11] BiocParallel_1.2.1

loaded via a namespace (and not attached):
[1] genefilter_1.50.0    statmod_1.4.21       locfit_1.5-9.1
[4] reshape2_1.4.1       splines_3.2.0        lattice_0.20-31
[7] colorspace_1.2-6     survival_2.38-1      XML_3.98-1.1
[10] foreign_0.8-63       DBI_0.3.1            RColorBrewer_1.1-2
[13] lambda.r_1.1.7       plyr_1.8.2           stringr_1.0.0
[16] zlibbioc_1.14.0      Biostrings_2.36.1    munsell_0.4.2
[19] gtable_0.1.2         futile.logger_1.4.1 hwriter_1.3.2
[22] latticeExtra_0.6-26 geneplotter_1.46.0   biomaRt_2.24.0
[25] AnnotationDbi_1.30.1 proto_0.3-10         acepack_1.3-3.3
[28] xtable_1.7-4         scales_0.2.4         Hmisc_3.16-0
[31] annotate_1.46.0      XVector_0.8.0        Rsamtools_1.20.1
[34] gridExtra_0.9.1      ggplot2_1.0.1        digest_0.6.8
[37] stringi_0.4-1        grid_3.2.0           tools_3.2.0
[40] bitops_1.0-6         magrittr_1.5         RCurl_1.95-4.6
[43] RSQLite_1.0.0        Formula_1.2-1        cluster_2.0.1
[46] futile.options_1.0.0 MASS_7.3-40          rpart_4.1-9
[49] nnet_7.3-9

dexseq • 2.1k views

ADD COMMENT • link 8.9 years ago Andrea Grilli ▴ 240

0

Entering edit mode

Hi Andrea,

Thanks for your detailed report! Could you add some of the sam files, ordered by read name, of the reads mapping to your gene of interest?

Alejandro

ADD REPLY • link 8.9 years ago Alejandro Reyes ★ 1.9k

0

Entering edit mode

Hi Alejandro,

thank you for your quick reply. I'm sorry but I haven't sam files of my samples but bam only, so I can have index only for position-sorted files.

I'm not sure if following step could be a proper method, but I tried sorting by bash command the output of the bam files indexed by position. If you know a better way let me know, because I'm new to this type of data and any possible correction is appreciated:

samtools view SAMPLE1.bam chrX:2609220-2659350 | sort | head -n 5
samtools view SAMPLE2.bam chrX:2609220-2659350 | sort | head -n 5

samtools view SAMPLE3.bam chrX:2609220-2659350 | sort | head -n 5

Thank you,

Andrea

SAMPLE 1

HWI-H217:48:C20WDACXX:1:1101:10092:74144   163   chrX   2609296   3   78M   =   2609362   144   CGAGTCCCCGGGCTTCGCCCCACCCGGCCCGTGGGGGAGTATCTGTCCTGCCGCCTTCGCCCACGCCCTGCACTCCGG   B@@DDFDFDHDCAHCFGGGGIGHGHEGGGG<E<=EFC6;<?C;;A@5@@C@A85<DB@25@B<@@;@B7<BDACC??5   XA:i:0   MD:Z:78   NM:i:0   NH:i:2   CC:Z:chrY   CP:i:2559296   HI:i:0
HWI-H217:48:C20WDACXX:1:1101:10092:74144   83   chrX   2609362   3   78M   =   2609296   -144   CCTGCACTCCGGGACCGTCCCTGCGCGCTCTGGGCGCACCATGGCCCGCGGGGCTGCGCTGGCGCTGCTGCTCTTCGG   DDDC<8>;;;BB@;BBDB<;DDDDBBCADDDBDDDCC>CDDBB?DIIIIGHD<HADABGHHEIIGFFDDHDDFDD@@@   XA:i:0   MD:Z:78   NM:i:0   NH:i:2   CC:Z:chrY   CP:i:2559362   HI:i:0
HWI-H217:48:C20WDACXX:1:1101:10107:39090   345   chrX   2658885   3   78M   *   0   0   ATTAGAACAGCTGCCTGAGGCTCCTCCCTGAAGGACACCTGCCTGAGAGCAGAGATGGAGGCCTTCTGTTCACGGCAG   ###@;CCC@;1;@=777FGC7C886(8=B4H@CF<DDEFD;91C3GCHCEGIHACC=F>FE?F>G>D>D=DDD;A@@@   AS:i:-7   XM:i:2   XO:i:0   XG:i:0   MD:Z:0G75G1   NM:i:2   NH:i:2   CC:Z:chrY   CP:i:2608885   HI:i:0
HWI-H217:48:C20WDACXX:1:1101:10371:143793   161   chrX   2637730   3   17M650N61M   =   2640681   3668   TGTTGATGGAGAAAATGACGACCCACGACCACCGAACCCACCCAAACCGATGCCAAATCCAAACCCCAACCACCCTAG   BC?DFFFFHHHHHGHGFHGGFGGF?F>@F@@GHGIE@FIEIIFHFCBEFA9?BDDCCCCDDCDDDDBB8?;?BDD8AC   AS:i:0   XM:i:0   XO:i:0   XG:i:0   MD:Z:78   NM:i:0   XS:A:+   NH:i:2   CC:Z:chrY   CP:i:2587730   HI:i:0
HWI-H217:48:C20WDACXX:1:1101:10371:143793   81   chrX   2640681   3   35M639N43M   =   2637730   -3668   TGCTGACCTTGCGGATGGCGTTTCAGGTGGAGAAGGAAAAGGAGGCAGTGATGGTGGAGGCAGCCACAGGAAAGAAGG   @CCCCACCDBFHFIIIHIIGHFC@IIIIGIIGHGFBGIIEGGIHG@IIIIIGFF;HEGF7IIIHGDFDFFDDDDD@@@   AS:i:0   XM:i:0   XO:i:0   XG:i:0   MD:Z:78   NM:i:0   XS:A:+   NH:i:2   CC:Z:chrY   CP:i:2590681   HI:i:0

SAMPLE 2

HWI-H217:48:C20WDACXX:2:1101:10055:5571   163   chrX   2609293   3   78M   =   2609386   171   CTTCGAGTCCCCGGGCTTCGCCCCACCCGGCCCGTGGGGGAGTATCTGTCCTGCCGCCTTCGCCCACGCCCTGCACTC   @<<DD;?AFHFFHGIIGIDH?D<6B@FHAGB@GE5:=>AB;B55@@@>;>@9@985@DDDCD@DB7<;B?B;<8?>@3XA:i:0   MD:Z:78   NM:i:0   NH:i:2   CC:Z:chrY   CP:i:2559293   HI:i:0
HWI-H217:48:C20WDACXX:2:1101:10055:5571   83   chrX   2609386   3   78M   =   2609293   -171   GCGCTCTGGGCGCACCATGGCCCGCGGGGCTGCGCTGGCGCTGCTGCTCTTCGGCCTGCTGGGTGTTCTGGTCGCCGC   >BA?8BB;97>@@4AC>?BB@@<@@5???BBBBB@>96>ACA9IGDIEGHIFCGDGB)=IGFBGED>HFFDDABB@@@XA:i:0   MD:Z:78   NM:i:0   NH:i:2   CC:Z:chrY   CP:i:2559386   HI:i:0
HWI-H217:48:C20WDACXX:2:1101:10059:108872   147   chrX   2659249   3   78M   =   2659134   -193   TTAAAATAAGGTAAGCCCTTCAATTTGTTTCTTCAATATTTCTTTCATTTGTAGGGATATTTGTTTTTCATATCAGAC   HHHHJJJJGFJJJIJJJJJJIJJJJIJIIJIJIIJIIEJIIJJJIJJJJJJJJJJJJJIJJIJJJHHHHHFFFFFCCC   XA:i:0   MD:Z:78   NM:i:0   NH:i:2   CC:Z:chrY   CP:i:2609249   HI:i:0
HWI-H217:48:C20WDACXX:2:1101:10059:108872   99   chrX   2659134   3   26M1I51M   =   2659249   193   ACTCAAATGTCAACCCCACCGAGGCACCCCCCCCGTCCCCCAGAATCTTGGCTGTTTACAAATCACGTGTCCATCGAG   CCCFFFFFHHHHHHIJJJJJJJJIJEGIJJJJJGD9=ADDDDB<C@AAC:@?>?8@C:>CCD@>@@?<ACBCDDD?CD   AS:i:-8   XM:i:0   XO:i:1   XG:i:1   MD:Z:77   NM:i:1   NH:i:2   CC:Z:chrY   CP:i:2609134   HI:i:0
HWI-H217:48:C20WDACXX:2:1101:10104:186675   145   chrX   2637722   3   25M650N53M   =   2609460   -28990   GATGCTGTTGTTGATGGAGAAAATGACGACCCACGACCACCGAACCCACCCAAACCGATGCCAAATCCAAACCCCAAG   @C?A<@BDDDDCCA:@CA;DFA?A?;BCADEFHECAADGJHGCAIHF?AD?HGEJIGJHG>BJIIFH?FDDDDDFC@B   AS:i:-5   XM:i:1   XO:i:0   XG:i:0   MD:Z:77C0   NM:i:1   XS:A:+   NH:i:2   CC:Z:chrY   CP:i:2587722   HI:i:0

SAMPLE 3

HWI-H217:48:C20WDACXX:4:1101:10014:118563   355   chrX   2644345   3   70M11826N8M   =   2644393   11952   GTCGCCGTGGCTGGAGCCATCTCTAGCTTCATTGCTTACCAGAAAAAGAAGCTATGCTTCAAAGAAAATGCAGAACAA   @@@FFFFDHHGFHJIJIJJJIGIJDIIIGGHGGIEIIGHIJJIJJIIGGHECHIIJIJJGGJGFHHHFFFFFFEEEEE   AS:i:0   XM:i:0   XO:i:0   XG:i:0   MD:Z:78   NM:i:0   XS:A:+   NH:i:2   CC:Z:chrY   CP:i:2594345   HI:i:0
HWI-H217:48:C20WDACXX:4:1101:10014:118563   403   chrX   2644393   3   22M11826N56M   =   2644345   -11952   AAGCTATGCTTCAAAGAAAATGCAGAACAAGGGGAGGTGGACATGGAGAGCCACCGGAATGCCAACGCAGAGCCAGCG   EEDEEFFFFFFGHHHHHJIIGJJJIIIGGJJJJIIJIJJJIIJJJJIGIGGD?JJIJJJJIJJIIHGHGHFFFFFCCC   AS:i:-5   XM:i:1   XO:i:0   XG:i:0   MD:Z:77T0   NM:i:1   XS:A:+   NH:i:2   CC:Z:chrY   CP:i:2594393   HI:i:0
HWI-H217:48:C20WDACXX:4:1101:10042:126483   329   chrX   2658971   3   78M   *   0   0   TTTTAATCTTGCGATGTGCTTTGCTTGTTGCTGGGCGGATGATGTTTACTAACGATGAATTTTACATCCAAAGGGGGA   @@CFDDEDHHHHHGIGFGCHIIGH<FHHIJJJJJFGGGBHB>GIHGIFIIAHA??D;BDEEEECCCDDCCC<5512?@   XA:i:0   MD:Z:78   NM:i:0   NH:i:2   CC:Z:chrY   CP:i:2608971   HI:i:0
HWI-H217:48:C20WDACXX:4:1101:10070:113641   339   chrX   2659105   3   78M   =   2659078   -105   GAAAGGCTGGCCATTATTAAGTCCCTGTAACTCAAATGTCAACCCCACCGAGGCACCCCCCCGTCCCCCAGAATCTTG   CDA?7B@C??@:>>DDAC@DB@?CC@ACDDBAADDDDCD@B<B@@BBDDC@?20-DDIIIIGGIGFFFGHDFFFF@@@   XA:i:0   MD:Z:78   NM:i:0   NH:i:2   CC:Z:chrY   CP:i:2609105   HI:i:0
HWI-H217:48:C20WDACXX:4:1101:10070:113641   419   chrX   2659078   3   78M   =   2659105   105   CCGGGGGGGCGGTTTCCCATGGGATGTGAAAGGCTGGCCATTATTAAGTCCCTGTAACTCAAATGTCAACCCCACCGA   B@@FDFFDDDBD7B@CD@CDDCDA>@(8>CACB<8A@B<A?CCDD8>C>CACCDACDDEDDCDCCDDCD@A>>BDDDD   XA:i:0   MD:Z:78   NM:i:0   NH:i:2   CC:Z:chrY   CP:i:2609078   HI:i:0

ADD REPLY • link 8.9 years ago Andrea Grilli ▴ 240

0

Entering edit mode

The problem seems to be that the tag NH ("Number of reported alignments that contains the query in the current record" according to the smalls specifications), indicates that these read fragments are mapping 2 times to the reference genome. This is the reason why dexseq_count.py is ignoring them.

What aligner are you using?

Could you check if the reads are really mapping multiple times to the reference genome or if the aligner is somehow adding a wrong value to the tag?

ADD REPLY • link 8.9 years ago Alejandro Reyes ★ 1.9k

0

Entering edit mode

Hi Alejandro,
sorry for the late reply but I was waiting the answer from the facility who did the alignment.
- The aligner is TopHat/Bowtie with default settings
- Indeed, if I correctly interpret the output, reads are mapping to two different regions. Checking in SAMPLE 1 the first read from previous post:

> samtools view SAMPLE1 chrX | grep HWI-H217:48:C20WDACXX:1:1101:10092:74144

HWI-H217:48:C20WDACXX:1:1101:10092:74144 163 chrX 2609296 3 78M = 2609362 144 CGAGTCCCCGGGCTTCGCCCCACCCGGCCCGTGGGGGAGTATCTGTCCTGCCGCCTTCGCCCACGCCCTGCACTCCGG B@@DDFDFDHDCAHCFGGGGIGHGHEGGGG<E<=EFC6;<?C;;A@5@@C@A85<DB@25@B<@@;@B7<BDACC??5 XA:i:0MD:Z:78 NM:i:0 NH:i:2 CC:Z:chrY CP:i:2559296 HI:i:0
HWI-H217:48:C20WDACXX:1:1101:10092:74144 83 chrX 2609362 3 78M = 2609296 -144 CCTGCACTCCGGGACCGTCCCTGCGCGCTCTGGGCGCACCATGGCCCGCGGGGCTGCGCTGGCGCTGCTGCTCTTCGG DDDC<8>;;;BB@;BBDB<;DDDDBBCADDDBDDDCC>CDDBB?DIIIIGHD<HADABGHHEIIGFFDDHDDFDD@@@ XA:i:0MD:Z:78 NM:i:0 NH:i:2 CC:Z:chrY CP:i:2559362 HI:i:0

> samtools view SAMPLE1 chrY | grep HWI-H217:48:C20WDACXX:1:1101:10092:74144

HWI-H217:48:C20WDACXX:1:1101:10092:74144 419 chrY 2559296 3 78M = 2559362 144 CGAGTCCCCGGGCTTCGCCCCACCCGGCCCGTGGGGGAGTATCTGTCCTGCCGCCTTCGCCCACGCCCTGCACTCCGG B@@DDFDFDHDCAHCFGGGGIGHGHEGGGG<E<=EFC6;<?C;;A@5@@C@A85<DB@25@B<@@;@B7<BDACC??5 XA:i:0MD:Z:78 NM:i:0 NH:i:2 HI:i:1
HWI-H217:48:C20WDACXX:1:1101:10092:74144 339 chrY 2559362 3 78M = 2559296 -144 CCTGCACTCCGGGACCGTCCCTGCGCGCTCTGGGCGCACCATGGCCCGCGGGGCTGCGCTGGCGCTGCTGCTCTTCGG DDDC<8>;;;BB@;BBDB<;DDDDBBCADDDBDDDCC>CDDBB?DIIIIGHD<HADABGHHEIIGFFDDHDDFDD@@@ XA:i:0MD:Z:78 NM:i:0 NH:i:2 HI:i:1

As a further info, the gene is on the pseudoautosomal region. Do you know if there is any option in dexseq_count.py to bypass this problem?
Let me also know if any further analysis can be helpful,
thanks,
Andrea

ADD REPLY • link 8.9 years ago Andrea Grilli ▴ 240

0

Entering edit mode

I see, it seems like your gene of interest is contain in the region of chromosome Y that shares homology with chromosome X. Therefore the aligner maps the reads to both the chromosome X and the chromosome Y. You could either modify the dexseq_count.py script and change the line

a.optional_field("NH") > 1

But I would not recommend so, since you have the risk of considering multiple mapping reads for genes in the other chromosomes as well. What I would suggest is to re-align the reads to the reference genome but this time excluding the Y-chromosome from the index of the reference genome that you input to the aligner.

Alejandro

ADD REPLY • link 8.9 years ago Alejandro Reyes ★ 1.9k