Hello,
DISCLAIMER I'm new to BioConductor, so please feel free to comment on the structure of the post as well as the content.
DESCRIPTION I am using single nuclei RNA sequencing to study protein variants. I am trying to use summarizeOverlaps( mode = IntersectionNotEmpty) to generate count matrices for my snRNA sequencing data generated by Smart-Seq2. In addition to "standard" sx analysis, I want to identify proteins variants, that is a given protein that is present in my data in two or more forms.
I use the same GTF file I used to generate BAM files to create my ensembl object. I subset this ensemble object using ensembldb::transcriptsBy("gene") (all TXID of a gene are collapsed to give gene counts) and ensembldb::transcripts() (maintains TXID level counts). Using these I generate a count matrices with summarizeOverlaps(), where I add metadata like GENEID, symbol and TXID (for transcripts()), and remove any row for which the rowSums is 0. I check if there are GENEID duplicates in either of the of the count matrices and only find duplicates in transcript_count_mat as expected.
The problem is that my matrix containing gene level information contain 6350 geneIDs that I cannot find in my matrix with TXID information. The reason for this is most likely that all the reads for the 6350 genes map to the shorter transcripts for that gene. Due to IntersectionNotEmpty's behaviour any read that cannot be assigned without doubt is not counted (i.e. the read maps to a region where two or more reference transcripts overlap perfectly).
ISSUE My issue goes as follows; how can I figure out which reference transcripts my reads map to? As my aim is to identify protein variants, ignoring what is potentially 6350 of them, does not sit well with me.
I have tried giving a minimal example of my code below for illustration;
#Libraries for generating and working with GAlignment or GRange objects
library(GenomicAlignments)
library(GenomicFeatures)
library(ensembldb) # Use to generate a ensDB object to convert to GRange object
library(AnnotationFilter)
# Retrieve BAM files and GTF file
bam_files <- list.files(pattern = "fq\.gz_Aligned\.out\.bam$", full.names = TRUE)
# Generate GAlignment Object, reads
read_list <- lapply(bam_list, readGAlignmentsList)
# Generate GRanges Object, reference
ensDB_gtf <- ensDbFromGtf(
gtf = "Mus_musculus.GRCm39.110.gtf", #file containing gtf information
level = c("genes", "transcripts"), #information to obtain
ignoreTxVersion = TRUE,
.checkAgainstTxDb = FALSE
)
# Convert the SQlite database file to a ensDB object
ensDB <- EnsDb(ensDB_gtf)
# Generate 'genes' and transcript objects seperately for use in summarizeOverlaps()
TGenes <- ensembldb::transcriptsBy(ensDB, by = "gene") # Used to obtain gene level information
transcripts <- ensembldb::transcripts(ensDB) # Used to obtain TXID information
# SummarizeOverlaps()
"
summariseOverlaps() is applied for each file seperately and the results then
combined into a single count_matrix.
"
# Transcript level counting
transcript_counts <- lapply(read_list, function(reads){
summarizeOverlaps(features = transcripts,
reads = reads,
mode = IntersectionNotEmpty)
})
# Gene level counting
TGenes_counts <- lapply(read_list, function(reads){
summarizeOverlaps(features = TGenes,
reads = reads,
mode = IntersectionNotEmpty)
})
# Extract counts and combine into a matrix
transcript_count_mat <- do.call(cbind, lapply(transcript_counts, function(se) {
assays(se)$counts
}))
TGenes_count_mat <- do.call(cbind, lapply(TGenes_counts, function(se) {
assays(se)$counts
}))
Any help is appreciated.
Thank you, I will try this and see.