Bio-pharma organizations can now leverage the groundbreaking protein folding system, AlphaFold, with E2E Cloud
In this article we look at the basics of AlphaFold, which over a million researchers use to predict protein structures in their work. Pretty much every pharma company in the world has also been using AlphaFold in its drug discovery programs.
The post covers:
- What is a Protein?
- Importance and Structure Prediction of a Protein
- Overview of AlphaFold
- Preprocessing, Evoformer, Structure Module and Training
- Running AlphaFold on E2E Cloud
Let's dive into these concepts.
Proteins are long chains of amino acids that are essential to life, and understanding their structure helps explain, mechanistically, how they function.
“Proteins digest your food, contract your muscles, fire your neurons and power your immune system. Everything that happens in biology, almost, happens because of proteins”, David Baker (biochemist). Through an enormous experimental effort, the structures of around 100,000 unique proteins have been determined, but this represents a small fraction of the billions of known protein sequences. Deep learning methods have revolutionized protein research, and AlphaFold is an integral part of that shift.
For some motivation, you can watch David Baker's TED Talk "5 challenges we could solve by designing new proteins".
Overview of AlphaFold:-
AlphaFold greatly improves the accuracy of structure prediction by incorporating novel neural network architectures and training procedures based on the evolutionary, physical and geometric constraints of protein structures.
The input sequence is first used to search genetic databases for evolutionarily related sequences, which are aligned into an MSA (Multiple Sequence Alignment). Alongside it, the model keeps a Pair representation: a 2-dimensional view of the 3-dimensional protein structure, which can be thought of as a 2-dimensional array holding the distance between each pair of residues.
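To make the "2-dimensional array of distances" idea concrete, here is a minimal sketch using NumPy. It is not AlphaFold code: the function name, the toy coordinates and the choice of one point per residue are assumptions made purely for illustration.

```python
import numpy as np

def pairwise_distances(coords: np.ndarray) -> np.ndarray:
    """Return the (N, N) matrix of Euclidean distances between N residues.

    coords has shape (N, 3): one 3-D position per residue
    (for example the C-alpha atom, an assumption of this sketch).
    """
    diff = coords[:, None, :] - coords[None, :, :]   # shape (N, N, 3)
    return np.linalg.norm(diff, axis=-1)             # shape (N, N)

# Toy example: three residues placed along the x-axis, 3.8 units apart
toy_coords = np.array([[0.0, 0.0, 0.0],
                       [3.8, 0.0, 0.0],
                       [7.6, 0.0, 0.0]])
dist = pairwise_distances(toy_coords)
```

The matrix is symmetric with zeros on the diagonal, so a chain of N residues is summarized by an N x N grid in which entry (i, j) describes residues i and j together.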
The MSA and Pair representation are fed to the Evoformer (48 blocks, each with its own weights, so it is not an RNN), which applies a series of update operations. Its output is passed to the Structure Module (8 blocks that all share the same weights, so it behaves like an RNN), which predicts the 3-dimensional structure of the protein. This cycle is repeated three times in a clever way, as shown in the image above.
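The difference between "48 blocks, each different" and "8 blocks with the same weights" is easiest to see in code. The toy sketch below only mirrors the control flow described above; the block function, the feature width and the way one cycle feeds the next are stand-ins invented for illustration, not AlphaFold's actual layers.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 4                        # toy feature width, not AlphaFold's real size
NUM_EVOFORMER_BLOCKS = 48
NUM_STRUCTURE_LAYERS = 8
NUM_CYCLES = 3                 # the cycle is repeated three times

# Evoformer: one independent weight matrix per block (48 parameter sets).
evoformer_weights = [rng.normal(scale=0.1, size=(DIM, DIM))
                     for _ in range(NUM_EVOFORMER_BLOCKS)]
# Structure Module: a single weight matrix reused by all 8 layers.
structure_weight = rng.normal(scale=0.1, size=(DIM, DIM))

def toy_block(x, w):
    """Stand-in for a real block: a residual update x + tanh(x @ w)."""
    return x + np.tanh(x @ w)

def predict(features):
    x = features
    for _ in range(NUM_CYCLES):                  # repeat the whole cycle
        for w in evoformer_weights:              # different weights per block
            x = toy_block(x, w)
        for _ in range(NUM_STRUCTURE_LAYERS):    # same weights every layer
            x = toy_block(x, structure_weight)
    return x

out = predict(rng.normal(size=(5, DIM)))         # 5 toy residues
```

Reusing `structure_weight` in a loop is what makes the Structure Module recurrent, while the Evoformer loop walks through 48 separate parameter sets.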
Preprocessing:-
The input sequence is used to query the genetic databases, and the hits are assembled into an MSA (Multiple Sequence Alignment). Templates from the structure database that have similar sequences are used to initialize the Pair representation. The preprocessing pipeline is repeated several times, and the values are averaged for both the MSA and the Pair representation.
Evoformer:-
The Evoformer consists of 48 blocks, each with its own weights, built on attention mechanisms. Every block takes the MSA and Pair representation as input and produces updated versions of both as output. Information is exchanged between the MSA and the Pair representation so that the model can reason directly about the evolutionary and spatial structure of a protein. The MSA stack uses axial attention, in which alternating layers apply row-wise and column-wise attention. Row-wise attention lets the model reason about each aligned sequence separately, while column-wise attention lets it reason about the relationship between the aligned residues at the same position across sequences.
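The row-wise versus column-wise distinction can be sketched with plain NumPy. This is a deliberately stripped-down illustration: real attention layers use learned query, key and value projections and multiple heads, and all of that is omitted here. The function names and array sizes are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Unparameterized single-head self-attention over axis -2 of x.

    x has shape (..., L, C); each of the L items attends to all L items.
    """
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])   # (..., L, L)
    return softmax(scores, axis=-1) @ x                          # (..., L, C)

def row_wise_attention(msa):
    # msa: (num_sequences, num_residues, channels)
    # Attend across residue positions, separately within each sequence.
    return self_attention(msa)

def column_wise_attention(msa):
    # Swap axes so the sequences become the attended axis, then swap back.
    # Each alignment column is processed separately.
    return np.swapaxes(self_attention(np.swapaxes(msa, 0, 1)), 0, 1)

rng = np.random.default_rng(0)
msa = rng.normal(size=(4, 6, 8))   # 4 aligned sequences, 6 residues, 8 channels
rows = row_wise_attention(msa)
cols = column_wise_attention(msa)
```

Both outputs keep the shape of the input. Changing one sequence leaves the row-wise output of the other sequences untouched, and changing one column leaves the column-wise output of the other columns untouched, which is why the two directions are alternated.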
Structure Module:-
It takes the abstract information from the MSA and Pair representation and turns it into a 3-dimensional structure. The first row of the MSA representation (the target sequence) and the Pair representation pass through 8 neural network layers that share the same weights. The Structure Module works on a 3-dimensional backbone, represented as an independent rotation and translation for each residue with respect to a global frame. Initially every rotation is the identity (no rotation) and every translation is zero, so all residues start at the origin. The layers then refine these values step by step into the final 3-dimensional structure.
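Here is a minimal sketch of the per-residue frames, assuming the starting rotation is the identity matrix (the "no rotation" value) and the starting translation is the zero vector. In the real model the updates are predicted by the network; the fixed update below is hypothetical and only shows how a rotation and translation are composed.

```python
import numpy as np

def init_frames(num_residues):
    """Every residue starts with an identity rotation and a zero translation."""
    rotations = np.tile(np.eye(3), (num_residues, 1, 1))    # (N, 3, 3)
    translations = np.zeros((num_residues, 3))              # (N, 3)
    return rotations, translations

def compose(rotations, translations, delta_rot, delta_trans):
    """Apply a per-residue update expressed in each residue's local frame.

    R_new = R @ dR,   t_new = R @ dt + t
    """
    new_rot = rotations @ delta_rot
    new_trans = np.einsum("nij,nj->ni", rotations, delta_trans) + translations
    return new_rot, new_trans

def rot_z(angle):
    """Rotation matrix for a rotation of `angle` radians about the z-axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

R, t = init_frames(3)
# Hypothetical update: turn 90 degrees about z and move 1 unit along local x
dR = np.tile(rot_z(np.pi / 2), (3, 1, 1))
dt = np.tile(np.array([1.0, 0.0, 0.0]), (3, 1))
for _ in range(2):             # two refinement steps using the same update
    R, t = compose(R, t, dR, dt)
```

Because each residue carries its own rotation and translation, every residue can be moved independently of its neighbours, which is the "independent set of rotation and translation" mentioned above.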
Running AlphaFold on E2E Cloud:-
1. Launch an NVIDIA A100-40GB node from https://myaccount.e2enetworks.com. Follow the steps in this manual for a smooth process: Introduction — E2E Networks documentation.
2. The databases need a lot of storage, around 3 TB, so create a 4 TB volume, attach it to the node and mount it into the file system. Follow these documents to do that: Introduction — E2E Networks documentation and Make your Block Storage volume available for use on Linux — E2E Networks documentation.
3. Requirements:-
- Docker and the NVIDIA Container Toolkit (preinstalled on a new A100 machine launched from E2E Cloud).
- Docker set up to run as a non-root user.
4. Run the following commands:-
git clone https://github.com/deepmind/alphafold.git
cd ./alphafold
sudo apt install aria2
scripts/download_all_data.sh <DOWNLOAD_DIR> > download.log 2> download_all.log &
The download directory <DOWNLOAD_DIR> should not be a subdirectory in the AlphaFold repository directory.
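Before starting a multi-terabyte download, it can help to check both constraints programmatically: that the download directory is outside the repository, and that the volume has enough free space. The helper below is a hypothetical pre-flight check, not part of AlphaFold, and the 3 TB default simply reuses the rough figure quoted earlier.

```python
import shutil
from pathlib import Path

def is_inside(child: Path, parent: Path) -> bool:
    """True if child is parent itself or lies anywhere below it."""
    child, parent = child.resolve(), parent.resolve()
    return child == parent or parent in child.parents

def check_download_dir(download_dir, repo_dir, min_free_tb=3.0):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    d, r = Path(download_dir), Path(repo_dir)
    if is_inside(d, r):
        problems.append(f"{d} is inside the repository {r}")
    # disk_usage needs an existing path, so walk up to the nearest one
    probe = d.resolve()
    while not probe.exists():
        probe = probe.parent
    free_tb = shutil.disk_usage(probe).free / 1e12
    if free_tb < min_free_tb:
        problems.append(
            f"only {free_tb:.2f} TB free under {probe}, "
            f"need about {min_free_tb} TB"
        )
    return problems
```

For example, `check_download_dir("/mnt/alphafold_data", "./alphafold")` (with a made-up mount point) returns an empty list when both checks pass, and otherwise a list of messages describing what to fix.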
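Check that Docker can see the GPU:
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
Build the Docker image that docker/run_docker.py launches (by default it looks for an image named alphafold):
docker build -f docker/Dockerfile -t alphafold .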
You may optionally wish to create a Python virtual environment to prevent conflicts with your system's Python environment. Activate it before installing the requirements.
python3 -m venv directory_where_you_want_to_run_it
source directory_where_you_want_to_run_it/bin/activate
pip3 install -r docker/requirements.txt
python3 docker/run_docker.py \
--fasta_paths=your_protein.fasta \
--max_template_date=2022-01-01 \
--data_dir=$DOWNLOAD_DIR \
--output_dir=/home/user/absolute_path_to_the_output_dir
Once the run is over, the output directory will contain the predicted structures of the target protein.
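To pick up the results from a script, the sketch below lists the predicted structures in rank order. It assumes the run writes files named ranked_0.pdb, ranked_1.pdb and so on inside a per-target subdirectory, as the AlphaFold README describes; verify the layout against your own output directory. The function name is made up for this example.

```python
from pathlib import Path

def ranked_structures(output_dir):
    """Return ranked_*.pdb files found under output_dir, best rank first."""
    def rank_of(path: Path) -> int:
        return int(path.stem.split("_")[1])       # "ranked_3" -> 3

    files = [p for p in Path(output_dir).rglob("ranked_*.pdb")
             if p.stem.split("_")[1].isdigit()]
    # Group by target directory, then order by numeric rank (so 2 sorts before 10)
    return sorted(files, key=lambda p: (str(p.parent), rank_of(p)))
```

Calling `ranked_structures("/home/user/absolute_path_to_the_output_dir")` returns a list of paths, with ranked_0.pdb first for each target.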