featureCounts with NCBI T2T not capturing all genes
1
0
Entering edit mode
Hilary ▴ 40
@hilary-8596
Last seen 20 months ago
United States

Hello,

My team would greatly appreciate assistance with running featureCounts using the human NCBI T2T assembly (assembly (T2T-CHM13v2.0) as a reference; when we run it we end up with nearly 14,000 fewer genes than what the annotation supposedly contains.What (if any) modifications can be made to run Subread or RSubread featureCounts on the new NCBI release and obtain counts data for all ~57 thousand genes supposedly in the assembly?

Typically my lab has used the Ensembl/Gencode GRCh38 for mapping total RNA-Seq projects with STAR, and then counted reads with the featureCounts command from Subread or Rsubread. We are attempting to set up new pipelines with the latest release 43 (GRCh38.p13) at gencode, https://www.gencodegenes.org/human/. In parallel we have attempted to implement a pipeline to run in with the new NCBI T2T assembly from https://www.ncbi.nlm.nih.gov/assembly/GCA_009914755.4), in case that newer assembly may improve gene-level biomarker discovery. This assembly supposedly has 57,771 genes, per awk code that a staff scientist at NCBI provided my colleague:

Number of unique genes:

zcat GCF_009914755.1_T2T-CHM13v2.0_genomic.gtf.gz | grep -v "^#" | awk 'BEGIN {FS="\t"; OFS= "\t"} $3=="gene" {print $9}' | cut -f 3 -d ";" | sort | uniq | wc -l)

57771

Number of gene placements in the GTF: zcat GCF_009914755.1_T2T-CHM13v2.0_genomic.gtf.gz | grep -v "^#" | cut -f 3 | sort | uniq -c  

699164 CDS

1017051 exon 

57816 gene 

65594 start_codon 

65411 stop_codon

112905 transcript

However, with code such as the text below using, we end up with a much smaller number of genes, closer to 42-43,000 thousand. We have even tried implementing this specifying gene id to export Entrez IDs by adding the flag <-g db_xref> in case the default gene_id identifier was causing an issue. The NCBI staff scientist suggested that the difference of our tabulated genes vs the GTF assembly annotation of over 57 thousand, reflects unprocessed pseudogenes. We suspect this may have to do with how the NCBI GTF is set up, and are wondering if there is some way to modify the call of how the GTF is used by Subread/Rsubread to obtain an output that is more similar to the output of Gencode and reflects all 57,771 genes. We tried searching for other posts on the topic and found on one biostars at https://www.biostars.org/p/9547097/#google_vignette , leading us to attempt to use the T2T SAF at https://bioinf.wehi.edu.au/Rsubread/annot/ but that led to 0% alignments because the SAF format was chr1, chr2 etc and the NCBI designations differ (NC_060930.1).

What (if any) modifications can be made to run Subread or RSubread featureCounts on the new NCBI release and obtain counts data for all ~57 thousand genes? Many thanks in advance for your consideration.

Run of featureCounts

Creates a directory for the output of Subread and then changes directories to /run/media/userT/Padlock_DT/paired/results/T2T/STAR

mkdir -p /run/media/userT/Padlock_DT/paired/results/T2T/counts/

cd /run/media/userT/Padlock_DT/paired/results/T2T/STAR/

Runs featureCounts in paired-end mode, reverse strand, with 32 threads.

featureCounts -T 32 -s 2 -p -C FALSE -d 50 --verbose --countReadPairs \

-a $gtf_T2T \

-o /run/media/userT/Padlock_DT/paired/results/T2T/counts/T2T_featurecounts.txt \

The above was from Subread and NOT an R session so we do not have that, but the version of Subread used was 2.0.3; we plan to upgrade to 2.04.

Yet if it helps, I believe the Rsubread translation may be akin to code I ran in R a while ago, updating that to reflect the newer GTF would be roughly equivalent to: MyOutput=featureCounts(files=RBam, isPairedEnd=TRUE, annot.ext="GCF_009914755.1_T2T-CHM13v2.0_genomic.gtf", countChimericFragments=FALSE, minFragLength=50, isGTFAnnotationFile=TRUE, strandSpecific=2, nthreads=32, countReadPairs=TRUE, verbose=TRUE)

sessionInfo( ) R version 4.0.5 (2021-03-31) Platform: x86_64-redhat-linux-gnu (64-bit) Running under: Red Hat Enterprise Linux 8.4 (Ootpa)

Matrix products: default BLAS/LAPACK: /usr/lib64/libopenblas-r0.3.12.so

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] Rsubread_2.4.2

loaded via a namespace (and not attached): [1] compiler_4.0.5 Matrix_1.3-2 grid_4.0.5 lattice_0.20-41

```

CHM13v2.0 featureCounts T2T Rsubread • 2.2k views
ADD COMMENT
0
Entering edit mode

featureCounts only counts genes that have an exon feature by default. I suspect this is the reason why not all the genes were counted.

ADD REPLY
0
Entering edit mode

Where in the documentation can I find this? In the user guide I see exon as default feature.

ADD REPLY
0
Entering edit mode

Sorry I meant to say exon. I have corrected this.

ADD REPLY
1
Entering edit mode
@gordon-smyth
Last seen 37 minutes ago
WEHI, Melbourne, Australia

By default, featureCounts counts reads that overlap exons. It is easy to very that GCF_009914755.1_T2T-CHM13v2.0_genomic.gtf.gz contains 57816 genes but only about 42600 of these contain any exons.

You can tell featureCounts to count all reads overlapping gene bodies, in which case you will get results for all 57816 genes, but I suggest that counting reads for genes with no exons may not be very meaningful.

The SAF file chm13v2.0_RefSeq_27Sep2022.saf available from https://bioinf.wehi.edu.au/Rsubread/annot/ contains 42604 genes, which includes all exon-containing genes from GCF_009914755.1_T2T-CHM13v2.0_genomic.gtf.gz plus mitochondrial genes not in the NCBI annotation and minus pseudoautosomal region exons on chromosome Y (to eliminate mapping ambiguity for these regions between the X and Y chromosomes). The SAF file should be used in conjunction with the T2T genome build chm13v2.0_maskedY.fa.gz available from https://github.com/marbl/CHM13. All chromosome names match up and are in UCSC Browser style.

The pseudoautosomal region exons are masked for reasons explained by Heng Li: https://lh3.github.io/2017/11/13/which-human-reference-genome-to-use

If I might add a comment, both the Gencode and NCBI annotation contain very large numbers of pseudo-genes that overlap with protein-coding genes and hence cause multi-overlapping reads. I prefer to count only the RefSeq curated genes (chm13v2.0_RefSeqStrict_27Sep2022.saf.gz) so to reduce ambiguity and to get more reliable counts.

ADD COMMENT
0
Entering edit mode

Thank you Dr. Wei Shi and Dr. Gordon Smyth; this is very helpful. We ran stranded total RNASeq and are interested in both coding and non-coding elements for a biomarker discovery project, so we would like to try exporting results for all 57,816 genes. We are planning to attempt this on a pilot sample set, and based on your advise of using the RefSeq curated genes, also run the pilot set with the RefSeq (exon) annotation as you suggest and see if that yields better results.

Would the featureCounts command below be the way to 'tell featureCounts to count all reads overlapping gene bodies'? We tried this and it yielded 57,816 genes, but we want to ensure this is the appropriate way to code it. We think all that is required is the command -t gene (instead of the default -t exon command in Subread, or GTF.featureType = "gene" in Rsubread). But we were not sure if we should be modifying something else in addition or in place of -t (like the Subread -f or Rsubread useMetaFeatures parameter).

featureCounts -T 32 -s 2 -p -C -d 50 --verbose --countReadPairs -g gene_id -t gene -a GCF_009914755.1_T2T-CHM13v2.0_genomic.gtf -o T2T_gene_gene_id_featurecounts.txt CA-00-50-RN-01-220826_Aligned.sortedByCoord.out.bam

If you have time, a less key question than verifying the code above but of interest for using NCBI GTFs, is whether you have any advise for obtaining the Entrez IDs. We noticed the default with some GTF files is to export gene symbols such as IL1B, and tried to set the -g flag (or in Rsubread, GTF.attrType) from the default -g gene_id to be instead -g db_xref, but this sometimes yields Entrez IDs such as 3553 and sometimes yields MIM hits such as MIM 147720, and we are not sure how to force it to just report an Entrez ID -- if that is even advisable?

ADD REPLY
1
Entering edit mode

Your command sounds good to me. Note that you can provide multiple features to the -t parameter (they should be separated by comma). So an alternative approach is to do something like -t exon,CDS,UTR to try to catch all the genes. This approach can avoid counting those reads mapping to introns.

ADD REPLY
0
Entering edit mode

Thank you very much Dr. Wei Shi; we will investigate the GTF and are testing variations of the -t parameter for Subread (or for RSubread, GTF.featureType parameter). Thank you for reviewing our code and suggesting how to modify the -t parameter to catch all genes.

ADD REPLY
0
Entering edit mode

We ran stranded total RNASeq and are interested in both coding and non-coding elements for a biomarker discovery project, so we would like to try exporting results for all 57,816 genes.

I don't follow your logic here. All genes, whether coding or not, should contain exons. Strandedness and coding vs non-coding are irrelevant to the annotation issues with the NCBI GTF files that we have been discussing. I provided you with curated SAF files that contain stranded annotation for both coding and non-coding genes.

Would the featureCounts command below be the way to 'tell featureCounts to count all reads overlapping gene bodies'?

Looks ok but I am more familiar with the Rsubread package syntax.

of interest for using NCBI GTFs, is whether you have any advise for obtaining the Entrez IDs.

I agree that the NCBI GTF files have formatting inconsistencies but we don't have any automatic fix. A software tool like featureCounts can only read what is in the file. My solution was to provide you with curated SAF files that provide you with consistent Entrez IDs. If you don't want to use what we provide, then you will need to find your own way to deal with the inconsistencies.

ADD REPLY
0
Entering edit mode

Thank you; based on your feedback Dr. Smyth and that of Dr. Wei Shi, we are now more confident in our code. We are testing variations of the parameter to call gene bodies with or without introns (the -t Subread or GTF.attrType parameter). Knowing I have benefited from reviewing others' threads and postings, in case anyone else has a similar question and may benefit from this thread, I will attempt to put versions of code for both Subread and RSubread below (at least for the -t gene option). We will tailor for our dataset based on some tests we are running now; again many thanks to both of you.

Subread:

featureCounts -T 32 -s 2 -p -C -d 50 --verbose --countReadPairs -g gene_id -t gene -a GCF_009914755.1_T2T-CHM13v2.0_genomic.gtf -o MyOutput.txt MyInput.bam

Rsubread:

MyOutput=featureCounts(files=RBam, isPairedEnd=TRUE, annot.ext="GCF_009914755.1_T2T-CHM13v2.0_genomic.gtf", countChimericFragments=FALSE, minFragLength=50, isGTFAnnotationFile=TRUE, strandSpecific=2, nthreads=32, countReadPairs=TRUE, GTF.attrType="gene_id", GTF.featureType = "gene", verbose=TRUE)

ADD REPLY

Login before adding your answer.

Traffic: 604 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6