Published on Wed Oct 31 2018

DEEPGONET: Multi-label Prediction of GO Annotation for Protein from Sequence Using Cascaded Convolutional and Recurrent Network

Sheikh Muhammad Saiful Islam, Md Mahedi Hasan


Abstract

The gap between the volume of protein sequence data made available by next-generation sequencing (NGS) technology and the slow, expensive experimental extraction of useful information, such as annotation of protein sequences in different functional aspects, is ever widening. Automatic function prediction (AFP) approaches can reduce this gap. Gene Ontology (GO), comprising more than 40,000 classes, defines three aspects of protein function, namely Biological Process (BP), Cellular Component (CC), and Molecular Function (MF). Because a single protein can have multiple functions, automatic function prediction is a large-scale, multi-class, multi-label task. In this paper, we present DEEPGONET, a novel cascaded convolutional and recurrent neural network, to predict the top-level hierarchy of GO. The network takes only the primary sequence of a protein as input. This makes it more widely applicable than prevailing state-of-the-art deep learning methods that require multi-modal input and are therefore less applicable to proteins for which only the primary sequence is available. The same architecture performs all predictions across the different protein functions, which indicates better generalization: the model shows promising performance on a variety of organisms while trained on Homo sapiens only. This is made possible by efficient exploration of the vast output space through the hierarchical relationships among GO classes. The promising performance of our model makes it a potential means of directing experimental exploration of protein function efficiently, since experimenters can explore only the routes suggested by the model and eliminate the vast majority of other possible routes. Our proposed model is also simple and efficient in terms of computational time and space compared to other architectures in the literature.
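The abstract does not specify how the hierarchical relationship among GO classes is used. As an illustration only, the sketch below shows one common way to make multi-label scores consistent with an ontology: each ancestor's score is raised to at least the highest score among its descendants (often called the true-path rule). This is not claimed to be the DEEPGONET procedure. The toy hierarchy, the term names, the scores, and the function names are all hypothetical.

```python
# Hypothetical toy hierarchy mapping each term to its direct parents.
# These are illustrative names, not real GO identifiers.
PARENTS = {
    "molecular_function": [],
    "binding": ["molecular_function"],
    "catalytic_activity": ["molecular_function"],
    "protein_binding": ["binding"],
    "kinase_activity": ["catalytic_activity"],
}


def ancestors(term, parents):
    """Return every ancestor of `term`, excluding the term itself."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen


def propagate_scores(scores, parents):
    """Raise each ancestor's score to at least the max of its descendants."""
    out = dict(scores)
    for term, score in scores.items():
        for anc in ancestors(term, parents):
            out[anc] = max(out.get(anc, 0.0), score)
    return out


def predict_labels(scores, parents, threshold=0.5):
    """Multi-label decision: keep every term whose consistent score passes."""
    consistent = propagate_scores(scores, parents)
    return {term for term, s in consistent.items() if s >= threshold}


# Assumed per-class scores, e.g. independent sigmoid outputs of a network.
raw_scores = {
    "molecular_function": 0.4,
    "binding": 0.3,
    "protein_binding": 0.8,
    "catalytic_activity": 0.2,
    "kinase_activity": 0.1,
}
labels = predict_labels(raw_scores, PARENTS)
print(sorted(labels))
```

In this toy case the confident "protein_binding" score lifts its ancestors above the threshold, so the predicted label set never contains a term without its parents.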