Question

No counts generated when running python counting script.

1

Entering edit mode

azim.munivar ▴ 10

@azimmunivar-11408

Last seen 7.6 years ago

Hi there,

I'm trying to use your python script to count overlaps with the Transcript feature but I'm getting no counts at the end. Also, I'm not sure exactly what are all of the options I can use for the feature type. As far as I can tell, the python script constrains based on what exists in the GFF file, but when I try 'transcript' or 'exon' it fails, even though these appear to be in lowercase in my GFF file.

Here is an example of a call that I used (though R)

pyCount <- function(bamfile){
gff <- normalizePath("./gencode.v19.annotation.gff3")
chrsize <- normalizePath("./chrmsize_ready.txt")
output <- paste(bamfile,"_count.table.txt",sep="")
system2("python",paste(pycounterfile,
" -i ",bamfile,
" -a ",gff,
" -g ",chrsize,
" -o ",output,
" -v ",
" -q ",0,
" -l ",0,
" -L ",5000,
" --duplicate_filter ",1000,
" -f","Transcript",sep="")
)
}

I used these parameters as I tried to make it as least stringent as possible. I split my bamfile into just chr19, and then converted to SAM and ran the script with the above parameters.

The output I get is:

Genome file loaded

500000 reads processed

1000000 reads processed

1500000 reads processed

Alignment file processed

1645041 pairs of reads processed

1545944 pairs with multiple alignments

0 pairs with low quality alignments

34902 pairs with aberrations

0 pairs with an insert size out of the specifiec boundaries

0 estimated PCR duplicates

64195 pairs used to generate the final count table

Counts written to file chr19/FND1_004_TGACCA_L005Aligned.sortedByCoord.out.bam_subset_chr19.sam_count.table.txt

2615566 annotated features processed in the annotation file provided

0 annotated features retained to write the count table

Any additional information would be very helpful.

Here is some output from samtools:

>samtools flagstat FND1_004_TGACCA_L005Aligned.sortedByCoord.out.bam_subset_chr19.sam

1768065 + 0 in total (QC-passed reads + QC-failed reads)

169511 + 0 secondary

0 + 0 supplementary

0 + 0 duplicates

1768065 + 0 mapped (100.00% : N/A)

1598554 + 0 paired in sequencing

799330 + 0 read1

799224 + 0 read2

1598310 + 0 properly paired (99.98% : N/A)

1598310 + 0 with itself and mate mapped

244 + 0 singletons (0.02% : N/A)

0 + 0 with mate mapped to a different chr

0 + 0 with mate mapped to a different chr (mapQ>=5)

# This file was produced by samtools stats (1.3+htslib-1.3) and can be plotted using plot-bamstats

# This file contains statistics for all reads.

# The command line was: stats FND1_004_TGACCA_L005Aligned.sortedByCoord.out.bam_subset_chr19.sam

# CHK, Checksum [2]Read Names [3]Sequences [4]Qualities

# CHK, CRC32 of reads which passed filtering followed by addition (32bit overflow)

CHK 4170e87a d7d4eb0a c708789a

# Summary Numbers. Use `grep ^SN | cut -f 2-` to extract this part.

SN raw total sequences: 1598554

SN filtered sequences: 0

SN sequences: 1598554

SN is sorted: 1

SN 1st fragments: 799330

SN last fragments: 799224

SN reads mapped: 1598554

SN reads mapped and paired: 1598310 # paired-end technology bit set + both mates mapped

SN reads unmapped: 0

SN reads properly paired: 1598310 # proper-pair bit set

SN reads paired: 1598554 # paired-end technology bit set

SN reads duplicated: 0 # PCR or optical duplicate bit set

SN reads MQ0: 11117 # mapped and MQ=0

SN reads QC failed: 0

SN non-primary alignments: 169511

SN total length: 119199769 # ignores clipping

SN bases mapped: 119199769 # ignores clipping

SN bases mapped (cigar): 119199769 # more accurate

SN bases trimmed: 0

SN bases duplicated: 0

SN mismatches: 0 # from NM fields

SN error rate: 0.000000e+00 # mismatches / bases mapped (cigar)

SN average length: 74

SN maximum length: 76

SN average quality: 34.8

SN insert size average: 301.9

SN insert size standard deviation: 402.3

SN inward oriented pairs: 798517

SN outward oriented pairs: 0

SN pairs with other orientation: 0

SN pairs on different chromosomes: 0

DChIPRep • 1.2k views

ADD COMMENT • link updated 7.4 years ago by Bernd Klaus ▴ 610 • written 7.6 years ago by azim.munivar ▴ 10

score 0 · Answer 1 · 2016-11-30

Hi Azim,

first of all, I am very very sorry for overlooking your question initially!

So, according to your output, 64195 read pairs are retained for processing,
however, none seems to be matching the given feature type.

The script simply uses

HTSeq.GFF_Reader(gff,end_included = True)

(in set_up_IO)

from HTSeq to import the gff and then checks whether the feature type of every feature
is the feature type given by the command line argument,

see the code of the function write_count_table.

So you could try to run

HTSeq.GFF_Reader(gff,end_included = True)

directly to import your gff file into HTSeq and see whether you find features
that correspond to your specified feature type.

Note that only 64195 pairs are retained and 1.5 Million have multiple alignments,
so they are filtered out by the script. So it might be that the reads without multiple alignments
simply do not cover annotated transcripts.

So you either might look at those very small fraction of reads with no multiple
alignments or filter your sam file in such a way that you only keep only one alignment
for every aligned pair.

Alternatively, you can use the function importData_soGGi in DChIPRep to try
to import your bam files directly.

Bernd