No counts generated when running python counting script.
1
1
Entering edit mode
azim.munivar ▴ 10
@azimmunivar-11408
Last seen 4.8 years ago

Hi there,

I'm trying to use your python script to count overlaps with the Transcript feature but I'm getting no counts at the end. Also, I'm not sure exactly what are all of the options I can use for the feature type. As far as I can tell, the python script constrains based on what exists in the GFF file, but when I try 'transcript' or 'exon' it fails, even though these appear to be in lowercase in my GFF file.

Here is an example of a call that I used (though R)

pyCount <- function(bamfile){
        gff <- normalizePath("./gencode.v19.annotation.gff3")
        chrsize <- normalizePath("./chrmsize_ready.txt")
        output <- paste(bamfile,"_count.table.txt",sep="")
        system2("python",paste(pycounterfile,
                " -i ",bamfile,
                " -a ",gff,
                " -g ",chrsize,
                " -o ",output,
                " -v ",
                " -q ",0,
                " -l ",0,
                " -L ",5000,
                " --duplicate_filter ",1000,
                " -f","Transcript",sep="")
        )
}

I used these parameters as I tried to make it as least stringent as possible. I split my bamfile into just chr19, and then converted to SAM and ran the script with the above parameters.

The output I get is:

Genome file loaded

500000 reads processed

1000000 reads processed

1500000 reads processed

Alignment file processed

1645041 pairs of reads processed

1545944 pairs with multiple alignments

0 pairs with low quality alignments

34902 pairs with aberrations

0 pairs with an insert size out of the specifiec boundaries

0 estimated PCR duplicates

64195 pairs used to generate the final count table

Counts written to file chr19/FND1_004_TGACCA_L005Aligned.sortedByCoord.out.bam_subset_chr19.sam_count.table.txt

2615566 annotated features processed in the annotation file provided

0 annotated features retained to write the count table

 

Any additional information would be very helpful.

 

Here is some output from samtools:

>samtools flagstat FND1_004_TGACCA_L005Aligned.sortedByCoord.out.bam_subset_chr19.sam

1768065 + 0 in total (QC-passed reads + QC-failed reads)

169511 + 0 secondary

0 + 0 supplementary

0 + 0 duplicates

1768065 + 0 mapped (100.00% : N/A)

1598554 + 0 paired in sequencing

799330 + 0 read1

799224 + 0 read2

1598310 + 0 properly paired (99.98% : N/A)

1598310 + 0 with itself and mate mapped

244 + 0 singletons (0.02% : N/A)

0 + 0 with mate mapped to a different chr

0 + 0 with mate mapped to a different chr (mapQ>=5)

 

# This file was produced by samtools stats (1.3+htslib-1.3) and can be plotted using plot-bamstats

# This file contains statistics for all reads.

# The command line was:  stats FND1_004_TGACCA_L005Aligned.sortedByCoord.out.bam_subset_chr19.sam

# CHK, Checksum    [2]Read Names    [3]Sequences    [4]Qualities

# CHK, CRC32 of reads which passed filtering followed by addition (32bit overflow)

CHK    4170e87a    d7d4eb0a    c708789a

# Summary Numbers. Use `grep ^SN | cut -f 2-` to extract this part.

SN    raw total sequences:    1598554

SN    filtered sequences:    0

SN    sequences:    1598554

SN    is sorted:    1

SN    1st fragments:    799330

SN    last fragments:    799224

SN    reads mapped:    1598554

SN    reads mapped and paired:    1598310    # paired-end technology bit set + both mates mapped

SN    reads unmapped:    0

SN    reads properly paired:    1598310    # proper-pair bit set

SN    reads paired:    1598554    # paired-end technology bit set

SN    reads duplicated:    0    # PCR or optical duplicate bit set

SN    reads MQ0:    11117    # mapped and MQ=0

SN    reads QC failed:    0

SN    non-primary alignments:    169511

SN    total length:    119199769    # ignores clipping

SN    bases mapped:    119199769    # ignores clipping

SN    bases mapped (cigar):    119199769    # more accurate

SN    bases trimmed:    0

SN    bases duplicated:    0

SN    mismatches:    0    # from NM fields

SN    error rate:    0.000000e+00    # mismatches / bases mapped (cigar)

SN    average length:    74

SN    maximum length:    76

SN    average quality:    34.8

SN    insert size average:    301.9

SN    insert size standard deviation:    402.3

SN    inward oriented pairs:    798517

SN    outward oriented pairs:    0

SN    pairs with other orientation:    0

SN    pairs on different chromosomes:    0

 

DChIPRep • 610 views
ADD COMMENT
0
Entering edit mode
Bernd Klaus ▴ 590
@bernd-klaus-6281
Last seen 2.7 years ago
Germany

Hi Azim,

first of all, I am very very sorry for overlooking your question initially!

So, according to your output, 64195 read pairs are retained for processing,
however, none seems to be matching the given feature type.

The script simply uses

HTSeq.GFF_Reader(gff,end_included = True)

(in set_up_IO)

from HTSeq to import the gff and then checks whether the feature type of every feature
is the feature type  given by the command line argument,

see the code of the function write_count_table.

So you could try to run

HTSeq.GFF_Reader(gff,end_included = True)

directly to import your gff file into HTSeq and see whether you find features
that correspond to your specified feature type.

Note that only 64195 pairs are retained and 1.5 Million have multiple alignments,
so they are filtered out by the script. So it might be that the reads without multiple alignments
simply do not cover annotated transcripts.

So you either might look at those very small fraction of reads with no multiple
alignments or filter your sam file in such a way that you only keep only one alignment
for every aligned pair.

Alternatively, you can use the function importData_soGGi in DChIPRep to try
to import your bam files directly.

Bernd

 

ADD COMMENT

Login before adding your answer.

Traffic: 366 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6