UCSF-Simons Seminar Series
Anil Raj (Stanford University)
Calvin Lab Room 116
What Determines Protein Diversity and Abundance in Human Cells?
High-throughput RNA sequencing measurements have revealed that a substantial fraction of the genome is transcribed and processed into RNA molecules. While this has led to an appreciation of transcriptional diversity in humans, it is currently unclear what fraction of these transcripts are translated into proteins and what factors regulate the diversity and abundance of these various protein forms. In this talk, I will describe a statistical model that integrates ribosome footprint profiling data with gene expression and RNA sequence information to accurately infer the precise sequences within RNA molecules that are translated in a given cell type. We applied our model to measurements in human lymphoblastoid cell lines and identified 7,273 previously unannotated coding sequences, including 448 translated pseudogenes and 2,442 translated upstream open reading frames. We observed an enrichment of harringtonine-treated ribosome footprints at the inferred initiation sites, validating many of the novel coding sequences. The novel sequences exhibit significant signatures of selective constraint in the reading frames of the inferred proteins, suggesting that many of these are functional. Among transcripts in which two distinct reading frames are translated, nearly 40% showed significant negative correlation in the levels of translation of their two coding sequences, suggesting a key regulatory role for these novel translated sequences. This work significantly expands the set of known coding regions in humans and highlights that translation is far more diverse and widespread than previously recognized.