My question is about reads that do not align to the genome even though they are long and have very good Phred scores. My current workflow is FastQC > Cutadapt > Trimmomatic > RNA-STAR > HTSeq-count > edgeR (RUVSeq). I use GENCODE genomes with Ensembl IDs, and even with the cleanest isolation of cells and excellent library production I still get only about 80% alignment to the genome.

I use the entire GENCODE genome and GTF files for the alignment, and I collect the unaligned reads. Sometimes there are a large number of long reads with good Phred scores, and I suspect that they would align to a perfect reference genome. The reference genome is by no means perfect, and there are certainly some cell type and strain differences between the reference genome and the source of the total RNA.

Is there a way to construct and extract contigs from the unaligned reads and then BLAST them to see what they have homology to, or to see whether they are genetic rearrangements or simply unannotated ORFs? Does anyone have experience with this? What software would you recommend?

Any response is greatly appreciated. TIA.
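To make "long reads with good Phred scores" concrete, here is a minimal sketch of how such reads could be pulled out of the unaligned FASTQ before any assembly or BLAST step. It assumes a plain four-line-per-record FASTQ with Phred+33 quality encoding; the function names and the thresholds (`min_len=75`, `min_mean_q=30.0`) are hypothetical choices for illustration, not part of any of the tools in the workflow.

```python
import io
from typing import Iterable, Iterator, TextIO, Tuple

# (header, sequence, quality string) for one FASTQ record
FastqRecord = Tuple[str, str, str]


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ with exactly four lines per record."""
    while True:
        header = handle.readline().rstrip("\r\n")
        if not header:  # end of file (or a trailing blank line)
            return
        seq = handle.readline().rstrip("\r\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\r\n")
        yield header, seq, qual


def mean_phred(qual: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (assumes Phred+33 by default)."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)


def keep_long_high_quality(
    records: Iterable[FastqRecord],
    min_len: int = 75,         # hypothetical length cutoff
    min_mean_q: float = 30.0,  # hypothetical mean-quality cutoff
) -> Iterator[FastqRecord]:
    """Keep only reads that are both long enough and of high mean quality."""
    for header, seq, qual in records:
        if len(seq) >= min_len and mean_phred(qual) >= min_mean_q:
            yield header, seq, qual


if __name__ == "__main__":
    # Tiny in-memory example: one long high-quality read, one short read.
    demo = io.StringIO(
        "@read1\n" + "A" * 80 + "\n+\n" + "I" * 80 + "\n"
        "@read2\nACGT\n+\nIIII\n"
    )
    for header, seq, _ in keep_long_high_quality(read_fastq(demo)):
        print(header, len(seq))
```

Filtering on mean quality is a simplification; a per-base minimum or a sliding window would be stricter. The surviving reads are the ones I would want to assemble into contigs and compare against known sequences.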

