Issue with assigning transcript IDs to reads when using summarizeOverlaps(mode = IntersectionNotEmpty)
1
0
Entering edit mode
SofieK • 0
@9b195990
Last seen 26 days ago
Denmark

Hello,

DISCLAIMER I'm new to BioConductor, so please feel free to comment on the structure of the post as well as the content.

DESCRIPTION I am using single nuclei RNA sequencing to study protein variants. I am trying to use summarizeOverlaps( mode = IntersectionNotEmpty) to generate count matrices for my snRNA sequencing data generated by Smart-Seq2. In addition to "standard" sx analysis, I want to identify proteins variants, that is a given protein that is present in my data in two or more forms.

I use the same GTF file I used to generate BAM files to create my ensembl object. I subset this ensemble object using ensembldb::transcriptsBy("gene") (all TXID of a gene are collapsed to give gene counts) and ensembldb::transcripts() (maintains TXID level counts). Using these I generate a count matrices with summarizeOverlaps(), where I add metadata like GENEID, symbol and TXID (for transcripts()), and remove any row for which the rowSums is 0. I check if there are GENEID duplicates in either of the of the count matrices and only find duplicates in transcript_count_mat as expected.

The problem is that my matrix containing gene level information contain 6350 geneIDs that I cannot find in my matrix with TXID information. The reason for this is most likely that all the reads for the 6350 genes map to the shorter transcripts for that gene. Due to IntersectionNotEmpty's behaviour any read that cannot be assigned without doubt is not counted (i.e. the read maps to a region where two or more reference transcripts overlap perfectly).

ISSUE My issue goes as follows; how can I figure out which reference transcripts my reads map to? As my aim is to identify protein variants, ignoring what is potentially 6350 of them, does not sit well with me.

I have tried giving a minimal example of my code below for illustration;


#Libraries for generating and working with GAlignment or GRange objects
library(GenomicAlignments)
library(GenomicFeatures)

library(ensembldb) # Use to generate a ensDB object to convert to GRange object
library(AnnotationFilter)

# Retrieve BAM files and GTF file

bam_files <- list.files(pattern = "fq\.gz_Aligned\.out\.bam$", full.names = TRUE)

# Generate GAlignment Object, reads

read_list <- lapply(bam_list, readGAlignmentsList)

# Generate GRanges Object, reference

ensDB_gtf <- ensDbFromGtf(
gtf = "Mus_musculus.GRCm39.110.gtf", #file containing gtf information
level = c("genes", "transcripts"), #information to obtain
ignoreTxVersion = TRUE,
.checkAgainstTxDb = FALSE
)

# Convert the SQlite database file to a ensDB object

ensDB <- EnsDb(ensDB_gtf)

# Generate 'genes' and transcript objects seperately for use in summarizeOverlaps()

TGenes <- ensembldb::transcriptsBy(ensDB, by = "gene") # Used to obtain gene level information

transcripts <- ensembldb::transcripts(ensDB) # Used to obtain TXID information

# SummarizeOverlaps()

"
summariseOverlaps() is applied for each file seperately and the results then
combined into a single count_matrix.
"

# Transcript level counting

transcript_counts <- lapply(read_list, function(reads){
summarizeOverlaps(features = transcripts,
reads = reads,
mode = IntersectionNotEmpty)
})

# Gene level counting

TGenes_counts <- lapply(read_list, function(reads){
summarizeOverlaps(features = TGenes,
reads = reads,
mode = IntersectionNotEmpty)
})

# Extract counts and combine into a matrix

transcript_count_mat <- do.call(cbind, lapply(transcript_counts, function(se) {
assays(se)$counts
}))

TGenes_count_mat <- do.call(cbind, lapply(TGenes_counts, function(se) {
assays(se)$counts
}))

Any help is appreciated.

GenomicAlignments summarizeOverlaps( Smart-seq2 proteinvariantcounting • 790 views
ADD COMMENT
1
Entering edit mode
@james-w-macdonald-5106
Last seen 11 hours ago
United States

If I understand correctly, you want to assess differential transcript abundances. If so, there are a couple of ways you can do that. One way is to generate both gene level and exon level counts and then identify exons that have different logFC between groups than the genes do (see diffspliceDGE in the edgeR package). This is more of a bulk RNA-Seq method, so you would probably have to identify cells and then combine the counts at the cell level first. The other alternative is to use the alevin aligner to generate inferential replicates and then use swish from the fishpond package.

0
Entering edit mode

Thank you, I will try this and see.

ADD REPLY

Login before adding your answer.

Traffic: 710 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6