UCSF-Simons Seminar Series
Meisam Razaviyayn (Stanford)
Calvin Lab Room 116
CONVEX: De Novo Transcriptome Error Correction by Convexification
The complexity of higher eukaryotic genomes imposes significant limitations on the assembly of transcripts and the discrimination of splice products. In particular, it is known that in the presence of certain repeat structures, de novo RNA assembly and splice product discrimination from short reads are impossible even when all constituent elements are identified. These limitations motivate the use of long-read isoform sequencing (Iso-Seq) technology to discover novel splicing. In this talk, we consider the de novo Iso-Seq problem using full-length long reads. Despite the combinatorial nature of the problem, we propose an iterative convex reformulation using a greedy procedure with order-optimal sample complexity. Based on this greedy procedure, we develop an algorithm, dubbed CONVEX, whose computational complexity is linear in the number of reads and which can be implemented on parallel multi-core machines. We compare the performance of the algorithm with the state-of-the-art tool, ICE, on various datasets such as ERCC2.0 and heart/liver tissue PacBio datasets. Our numerical experiments show that CONVEX yields up to a 20% improvement in the number of denoised reads, in addition to being multiple times faster than ICE on moderate- and large-size datasets.