NLP Based Protein Sequence Classification Through Convolutional Neural Network
Abstract
Redesigning and modifying proteins is a leading objective in the pharmaceutical industry today. Modern technology has made it possible to redesign proteins efficiently by simulating mutation, natural selection, and amplification in the lab. The number of possible mutated sequences for each protein is astronomically large, so it is impossible to synthesise every sequence, or even to examine every variant that could be beneficial. Machine learning is therefore increasingly used to aid protein redesign, because prediction models can evaluate a large number of different sequences virtually. However, modern machine learning models, notably deep learning models, remain poorly understood, and few descriptors of protein sequences have been considered. This paper presents a novel machine learning based classification method for protein sequences. Two distinct single-amino-acid descriptors and one structure-based, three-dimensional descriptor are used to build prediction models, and their effectiveness is compared. Several evaluation metrics were applied across a variety of public and private data sets to assess the accuracy of the predictions. The study's findings indicate that the convolutional neural network models built on amino acid property descriptors are the most pertinent to the protein redesign problems encountered in the pharmaceutical industry.
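To make the idea of a single-amino-acid descriptor feeding a convolutional model concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not name its descriptors, so one-hot encoding and the Kyte-Doolittle hydropathy scale are assumptions chosen purely for illustration. The function names (`one_hot`, `hydropathy`, `conv1d`, `relu_global_max`) are hypothetical, and the kernel here is fixed by hand rather than learned.

```python
# Illustrative sketch only. The descriptors below (one-hot, Kyte-Doolittle
# hydropathy) are assumptions; the abstract does not specify which were used.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

# Kyte-Doolittle hydropathy index, one value per residue.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def one_hot(sequence):
    """Encode each residue as a 20-dimensional indicator vector."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for residue in sequence:
        row = [0.0] * len(AMINO_ACIDS)
        row[index[residue]] = 1.0
        rows.append(row)
    return rows


def hydropathy(sequence):
    """Encode each residue by a single physicochemical property value."""
    return [[KYTE_DOOLITTLE[residue]] for residue in sequence]


def conv1d(features, kernel):
    """Slide a (width x channels) kernel along the sequence axis, no padding."""
    width = len(kernel)
    outputs = []
    for start in range(len(features) - width + 1):
        total = 0.0
        for k in range(width):
            for c in range(len(kernel[k])):
                total += features[start + k][c] * kernel[k][c]
        outputs.append(total)
    return outputs


def relu_global_max(values):
    """Apply ReLU, then global max pooling, giving one feature per kernel."""
    return max(max(v, 0.0) for v in values)


if __name__ == "__main__":
    seq = "MKTAYIAKQR"  # hypothetical example sequence
    x = hydropathy(seq)
    kernel = [[1.0 / 3.0]] * 3  # hand-set kernel: 3-residue moving average
    print(relu_global_max(conv1d(x, kernel)))
```

In a trained network the kernel weights would be learned from labelled sequences and many kernels would run in parallel. The sketch only shows how the choice of descriptor fixes the number of input channels (20 for one-hot, 1 for a single property) that the convolution sees.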