Biophysics - Breadcrumbs¶
TODO:
What is Breadcrumbs?¶
Education stands as a pivotal force in socioeconomic mobility. Yet generational barriers outside an individual's control hinder access to educational opportunities. Breadcrumbs represents OASCI's initiative to fundamentally transform access to educational materials, embodying the spirit of discovery and the sharing of knowledge fragments that have traditionally been accessible only through the goodwill of others. Breadcrumb websites are freely accessible resources, and we encourage contributions, improvements, and adaptations so long as they abide by the terms outlined in the CC BY-NC-SA 4.0 license.
Deploying¶
We use bump-my-version to release a new version. This creates a git tag that poetry-dynamic-versioning uses to generate version strings.
However, we use Calendar Versioning, meaning new versions must be specified manually. For example, to bump the version to November 8, 2024, you would run a command like `bump-my-version bump --new-version 2024.11.8` after activating the relevant conda environment.
After releasing a new version, you must push the commit along with all tags (e.g., `git push --follow-tags`).
License¶
Code contained in this project is released under the MIT License as specified in LICENSE_CODE.
This license grants you the freedom to use, modify, and distribute it as long as you include the original copyright notice contained in LICENSE_CODE and the following notice.
Portions of this code were incorporated with permission from CADD Breadcrumbs by OASCI licensed under the MIT license.
All other data, information, documentation, and associated content provided within this project are released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0) as specified in LICENSE_INFO.
Some content was incorporated with permission from CADD Breadcrumbs by OASCI licensed under the CC BY-NC-SA 4.0 license.
Web analytics¶
Why would we want to track website traffic?
An instructor can gain insights into how students engage with online teaching materials by analyzing web analytics. This information is instrumental in assessing the effectiveness of the materials. Web analytics reveal the popularity of specific topics or sections among students, empowering instructors to tailor future lectures or discussions. Analytics also provides valuable data for curriculum development, helping instructors identify trends, strengths, and weaknesses in course materials. Additionally, instructors may leverage web analytics as evidence of their commitment to continuous improvement in teaching methods, which is helpful in discussions related to professional development, promotions, or tenure.
We track website traffic using Plausible, which is privacy-friendly, uses no cookies, and is compliant with GDPR, CCPA, and PECR. We also share this website's analytics with you for additional transparency.
Structural¶
TODO:
Principles¶
TODO:
Protein¶
TODO:
RNA¶
TODO:
Determination¶
TODO:
Cryo-electron microscopy¶
TODO:
Small-angle x-ray scattering¶
TODO:
X-ray crystallography¶
TODO:
Prediction¶
TODO:
Protein¶
TODO:
Ab initio¶
TODO:
Critical Assessment of Structure Prediction¶
TODO:
- https://predictioncenter.org/index.cgi
Coevolution¶
TODO:
Background on Protein Structure and Evolution¶
A. Overview of protein folding and the sequence-structure relationship
1. Protein structure hierarchy: primary, secondary, tertiary, and quaternary structure
2. Thermodynamic and kinetic aspects of protein folding
3. Experimental methods to study protein folding: spectroscopy, calorimetry, and single-molecule techniques
B. Sequence determinants of protein structure
1. Role of amino acid properties: size, charge, hydrophobicity, and polarity
2. Secondary structure propensities of amino acids
3. Local and global interactions stabilizing protein structure
C. Evolutionary constraints on protein sequences
1. Selective pressures for maintaining structure and function
2. Mutation, selection, and drift in protein evolution
3. Sequence conservation and variability across homologs
D. Structural and functional diversity of protein families
1. Evolutionary divergence and convergence of protein structures
2. Relationship between sequence, structure, and function within protein families
3. Examples of protein families with conserved structures and diverse functions
Coevolution and Correlated Mutations¶
A. Concept of coevolution in proteins
1. Definition and theoretical basis
2. Types of coevolution: pairwise, higher-order, and intramolecular
B. Mechanisms of coevolution
1. Structural constraints: maintaining contacts and interactions
2. Functional constraints: preserving active sites and specificity
3. Thermodynamic and kinetic constraints: ensuring stability and folding
C. Correlated mutations as a signature of coevolution
1. Definition and early observations
2. Relationship to structural and functional constraints
3. Examples of correlated mutations in protein families
D. Challenges in detecting coevolution
1. Distinguishing direct and indirect couplings
2. Accounting for phylogenetic relationships and background conservation
3. Dealing with statistical noise and limited data
Inference of Coevolutionary Couplings¶
A. Inputs for coevolutionary inference
1. Multiple sequence alignments: construction and quality control
2. Sequence weighting and resampling strategies
3. Incorporating structural and functional information
B. Covariation measures and scoring functions
1. Local methods: mutual information, statistical coupling, and perturbation-based scores (see the mutual-information sketch at the end of this section)
2. Global methods: maximum entropy, Bayesian networks, and machine learning
3. Advantages and limitations of different approaches
C. Direct Coupling Analysis (DCA) and related methods
1. Theoretical foundations and mathematical formulation
2. Mean-field approximation and pseudolikelihood maximization
3. Gauge fixing and regularization techniques
4. Sparse variants and computational optimizations
D. Other key coevolution prediction methods
1. Protein Sparse Inverse COVariance (PSICOV)
2. Generative REgularized ModeLs of proteINs (GREMLIN)
3. Coevolution-based Contact Prediction (CCMPred)
4. Sparse inverse covariance estimation (SLICE) and graphical LASSO
E. Performance evaluation and benchmarking
1. Gold standard datasets: manually curated and structurally derived contacts
2. Metrics: precision, recall, AUC, and rank-based scores
3. Cross-validation and statistical significance assessment
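To make the local covariation measures above concrete, here is a minimal Python sketch that computes the mutual information between two columns of a toy alignment. It is illustrative only: the alignment is fabricated, gaps would simply be treated as an extra symbol, and no sequence weighting or phylogenetic correction is applied.

```python
import numpy as np
from collections import Counter

def column_mi(msa, i, j):
    """Mutual information between columns i and j of an alignment.

    msa: list of equal-length aligned sequences (strings).
    """
    pairs = [(seq[i], seq[j]) for seq in msa]
    n = len(pairs)
    p_ab = {k: v / n for k, v in Counter(pairs).items()}
    p_a = {k: v / n for k, v in Counter(a for a, _ in pairs).items()}
    p_b = {k: v / n for k, v in Counter(b for _, b in pairs).items()}
    return sum(p * np.log(p / (p_a[a] * p_b[b])) for (a, b), p in p_ab.items())

# Toy alignment: columns 0 and 2 covary perfectly, column 1 is conserved.
msa = ["AKLV", "AKLV", "GKEV", "GKEV", "AKLV", "GKEV"]
print(column_mi(msa, 0, 2))  # high MI for the covarying pair (ln 2 here)
print(column_mi(msa, 0, 1))  # zero MI for a conserved column
```

Real pipelines additionally reweight similar sequences and apply corrections such as the average product correction to suppress phylogenetic and conservation bias.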
From Coevolution to Contacts¶
A. Interpreting coevolutionary scores
1. Relationship between coevolution and spatial proximity
2. Distinguishing structural and functional couplings
3. Identifying long-range and interchain contacts
B. Contact definition and distance thresholds
1. Atom-based and residue-based definitions
2. Distance cutoffs for defining contacts: 8 Å, 10 Å, and others (see the sketch after this outline)
3. Considerations for different protein classes and folds
C. Sequence separation and contact probability
1. Local contacts: within 6-12 residues in sequence
2. Medium-range contacts: between secondary structure elements
3. Long-range contacts: between distant regions in sequence
D. Typical performance of contact prediction methods
1. Precision and coverage of top-ranked contacts
2. Comparison across different protein families and alignment depths
3. Challenges and opportunities for improvement
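To illustrate the contact definitions above, the following Python sketch builds a binary residue-residue contact map from one representative atom per residue (e.g., Cβ, or Cα for glycine) with an 8 Å cutoff and a minimum sequence separation; the coordinates here are random stand-ins for a real structure.

```python
import numpy as np

def contact_map(coords, cutoff=8.0, min_seq_sep=6):
    """Binary contact map from (N, 3) representative-atom coordinates."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))        # all pairwise distances
    idx = np.arange(len(coords))
    seq_sep = np.abs(idx[:, None] - idx[None, :])   # |i - j| in sequence
    return (dist < cutoff) & (seq_sep >= min_seq_sep)

rng = np.random.default_rng(0)
coords = rng.uniform(0.0, 30.0, size=(50, 3))       # fake structure
contacts = contact_map(coords)
print(contacts.sum() // 2, "contacts at 8 A with |i - j| >= 6")
```

Predicted couplings are then typically scored by the precision of the top-L (or top-L/2) ranked pairs against such a map.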
Limitations and Frontiers¶
A. Dependence on multiple sequence alignment quality
1. Impact of alignment depth, diversity, and errors
2. Strategies for improving alignment construction and robustness
B. Challenges with specific protein classes
1. Sparse alignments and orphan proteins
2. Multidomain proteins and domain boundaries
3. Protein-protein and protein-ligand complexes
C. Disentangling evolutionary signals
1. Phylogenetic corrections and background models
2. Distinguishing structural and functional constraints
3. Higher-order and epistatic interactions
D. Integration with complementary data and methods
1. Combining coevolution with experimental contacts and restraints
2. Integrating with physicochemical knowledge and energy terms
3. Hybrid approaches with machine learning and deep learning
E. Improving interpretability and explainability
1. Identifying functional sites and specificity determinants
2. Inferring evolutionary histories and selective pressures
3. Generating testable hypotheses and guiding experiments
F. Future directions and opportunities
1. Expanding to new protein families and evolutionary distances
2. Modeling conformational dynamics and allostery
3. Designing novel proteins with optimized properties
- Shindyalov, I. N., Kolchanov, N. A., & Sander, C. (1994). Can three-dimensional contacts in protein structures be predicted by analysis of correlated mutations? Protein Engineering, Design and Selection, 7(3), 349-358. DOI: 10.1093/protein/7.3.349
- Kamisetty, H., Ovchinnikov, S., & Baker, D. (2013). Assessing the utility of coevolution-based residue–residue contact predictions in a sequence- and structure-rich era. Proceedings of the National Academy of Sciences, 110(39), 15674-15679. DOI: 10.1073/pnas.1314045110
- Sułkowska, J. I., Morcos, F., Weigt, M., Hwa, T., & Onuchic, J. N. (2012). Genomics-aided structure prediction. Proceedings of the National Academy of Sciences, 109(26), 10340-10345. DOI: 10.1073/pnas.1207864109
- Morcos, F., Pagnani, A., Lunt, B., Bertolino, A., Marks, D. S., Sander, C., ... & Weigt, M. (2011). Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proceedings of the National Academy of Sciences, 108(49), E1293-E1301. DOI: 10.1073/pnas.1111471108
- Burger, L., & van Nimwegen, E. (2008). Accurate prediction of protein–protein interactions from sequence alignments using a Bayesian method. Molecular Systems Biology, 4(1), 165. DOI: 10.1038/msb4100203
- Burger, L., & van Nimwegen, E. (2010). Disentangling direct from indirect co-evolution of residues in protein alignments. PLoS Computational Biology, 6(1), e1000633. DOI: 10.1371/journal.pcbi.1000633
- Hockenberry, A. J., & Wilke, C. O. (2019). Evolutionary couplings detect side-chain interactions. PeerJ, 7, e7280. DOI: 10.7717/peerj.7280
Deep learning¶
In recent years, deep learning has emerged as a powerful tool for protein structure prediction. Deep learning algorithms, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), can learn complex patterns and relationships from large datasets of protein sequences and structures. These algorithms have shown promising results in predicting protein secondary structures, contact maps, and even 3D coordinates.
AlphaFold¶
TODO:
AlphaFold3¶
TODO:
- Abramson, J., Adler, J., Dunger, J., Evans, R., Green, T., Pritzel, A., ... & Jumper, J. M. (2024). Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature, 1-3. DOI: 10.1038/s41586-024-07487-w
AlphaFold2¶
TODO:
- Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., ... & Hassabis, D. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873), 583-589. DOI: 10.1038/s41586-021-03819-2
- Varadi, M., Anyango, S., Deshpande, M., Nair, S., Natassia, C., Yordanova, G., ... & Velankar, S. (2022). AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Research, 50(D1), D439-D444. DOI: 10.1093/nar/gkab1061
- Skolnick, J., Gao, M., Zhou, H., & Singh, S. (2021). AlphaFold 2: why it works and its implications for understanding the relationships of protein sequence, structure, and function. Journal of Chemical Information and Modeling, 61(10), 4827-4831. DOI: 10.1021/acs.jcim.1c01114
- Rubiera, C. (2021). AlphaFold 2 is here: What's behind the structure prediction miracle. Oxford Protein Informatics Group. www.blopig.com/blog/2021/07/alphafold-2-is-here-whats-behind-the-structure-prediction-miracle/
AlphaFold-Multimer¶
TODO:
- Evans, R., O'Neill, M., Pritzel, A., Antropova, N., Senior, A., Green, T., ... & Hassabis, D. (2021). Protein complex prediction with AlphaFold-Multimer. bioRxiv, 2021-10. DOI: 10.1101/2021.10.04.463034
AlphaFold1¶
AlphaFold1 is a deep learning-based protein structure prediction system that leverages information from multiple sequence alignments (MSAs) and a novel neural network architecture to accurately predict protein structures. This version was published in 2020, and its pipeline consists of four main stages: feature extraction, neural network inference, potential construction, and structure realization.
Feature extraction
- AlphaFold extracts features from the target protein sequence and its MSA, which captures covariation information from homologous sequences.
- The input features include one-hot encoded amino acid sequences, position-specific substitution matrices (PSSMs), and covariation features derived from the MSA.
Neural network architecture
- The core of AlphaFold is a deep residual convolutional neural network that predicts a probability distribution over the distances between pairs of residues (distogram).
- The network also makes auxiliary predictions, such as secondary structure, solvent accessibility, and torsion angles, which help guide the structure prediction process.
Potential construction
- AlphaFold constructs a smooth potential by combining the predicted distance distributions with a reference state and a steric repulsion term.
- The potential also incorporates information from the predicted torsion angles, ensuring that the generated structures are physically plausible.
Structure realization
- The final stage involves generating protein structures that minimize the constructed potential.
- AlphaFold initializes the backbone torsion angles by sampling from the predicted distributions and then optimizes the structure using gradient descent.
- The optimization process is repeated with "noisy" restarts to explore the conformational space and identify low-energy structures.
The following sections will delve deeper into each stage of the AlphaFold pipeline, providing a detailed understanding of how the system achieves state-of-the-art performance in protein structure prediction.
Feature Extraction¶
Multiple Sequence Alignments (MSAs)¶
- Importance of MSAs in capturing covariation information
    a. Evolutionary relationships between homologous sequences
    b. Covariation: Correlated mutations across multiple sequences
    c. Inferring structural information from covariation patterns
    [Figure: Schematic illustration of an MSA, highlighting covarying positions]
- Generating MSAs using HHblits and the UniClust30 database
    a. HHblits: Iterative homology detection using hidden Markov models
    b. UniClust30: Clustered protein sequence database for efficient search
    c. Iterative search process to gather diverse homologous sequences
    [Figure: Flowchart of the MSA generation process using HHblits and UniClust30]
- Data augmentation techniques
    a. MSA subsampling: Randomly selecting subsets of sequences
    b. Importance of subsampling for robustness and generalization
    c. Generating multiple MSA samples for each target protein
    [Figure: Illustration of MSA subsampling, showing different subsets of sequences]
Input features¶
- Sequence features
    a. One-hot encoding of amino acid sequences (see the sketch after this list)
        i. Representing each amino acid as a binary vector
        ii. Capturing residue identity information
        [Figure: Example of one-hot encoding for a short amino acid sequence]
    b. Position-specific substitution matrices (PSSMs) from PSI-BLAST
        i. Capturing evolutionary conservation and substitution patterns
        ii. Incorporating information from distant homologs
        [Figure: Heatmap visualization of a PSSM, showing amino acid substitution scores]
    c. Hidden Markov Model (HMM) profiles from HHblits
        i. Representing sequence profiles based on homologous sequences
        ii. Capturing position-specific amino acid frequencies and insertion/deletion probabilities
        [Figure: Schematic representation of an HMM profile, showing match, insert, and delete states]
- Covariation features
    a. Potts model parameters
        i. Pairwise statistical coupling between residue positions
        ii. Inferring direct couplings from observed covariation patterns
        [Figure: Heatmap visualization of Potts model parameters, showing pairwise residue couplings]
    b. Frobenius norm of the Potts model parameters
        i. Summarizing the overall strength of covariation signal
        ii. Providing a single feature value for each residue pair
        [Figure: Equation for calculating the Frobenius norm of Potts model parameters]
- Structural templates (not used in the fully de novo AlphaFold model)
    a. Identifying structural templates from the Protein Data Bank (PDB)
    b. Extracting template-based distance restraints and torsion angles
    c. Incorporating template information in the input features (when available)
    [Figure: Schematic illustration of template-based distance restraints and torsion angles]
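As a small illustration of the first feature type listed above, here is a Python sketch of one-hot encoding for a protein sequence. The 20-letter alphabet and the example sequence are arbitrary choices for illustration, not AlphaFold's exact feature layout.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"          # 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(sequence):
    """Encode a protein sequence as an (L, 20) binary matrix."""
    encoding = np.zeros((len(sequence), len(AMINO_ACIDS)), dtype=np.float32)
    for pos, aa in enumerate(sequence):
        encoding[pos, AA_INDEX[aa]] = 1.0
    return encoding

features = one_hot("MKTAYIAK")
print(features.shape)         # (8, 20)
print(features[0].argmax())   # column index for 'M'
```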
Neural network architecture¶
Deep residual convolutional network¶
- Advantages of using a deep network
    a. Capturing complex sequence-structure relationships
    b. Learning hierarchical features at different scales
    c. Enabling end-to-end learning from sequence to structure
    [Figure: Schematic overview of the deep residual convolutional network architecture]
- Residual connections and their importance
    a. Allowing gradients to flow directly through the network
    b. Mitigating the vanishing gradient problem in deep networks
    c. Facilitating the training of very deep architectures
    [Figure: Illustration of a residual block, highlighting the skip connection]
- Dilated convolutions for capturing long-range interactions
    a. Expanding the receptive field without losing resolution
    b. Capturing dependencies between distant residues
    c. Efficiently processing large input feature maps
    [Figure: Visualization of dilated convolutions with different dilation rates]
Distogram prediction¶
- Predicting discrete probability distributions over distances between residue pairs
    a. Representing distances as a probability distribution
    b. Capturing uncertainty and multimodality in distance predictions
    c. Enabling the modeling of complex spatial relationships
    [Figure: Example of a predicted distogram, showing probability distributions for each residue pair]
- Bin resolution and distance range (see the sketch after this list)
    a. Using a fine-grained 64-bin output to represent distance distributions
    b. Covering distances from 2 to 22 Angstroms
    c. Balancing resolution and computational efficiency
    [Figure: Illustration of the 64-bin distance range and corresponding Angstrom values]
- Handling sparse distance maps using a cropping strategy
    a. Cropping the distance map into fixed-size regions
    b. Enabling efficient processing of long sequences
    c. Preserving local context while reducing memory requirements
    [Figure: Schematic representation of the cropping strategy applied to a distance map]
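Here is a minimal sketch of the binning scheme described above, assuming 64 uniform bins spanning 2-22 Å. The network predicts a probability distribution over these bins; this snippet only maps known distances to bin labels, as one would when building training targets.

```python
import numpy as np

N_BINS, D_MIN, D_MAX = 64, 2.0, 22.0
edges = np.linspace(D_MIN, D_MAX, N_BINS + 1)   # 65 bin edges

def distance_to_bins(dist_matrix):
    """Map an (N, N) distance matrix to integer bin labels in [0, 63].

    Distances outside [2, 22] A are clipped into the first/last bin.
    """
    clipped = np.clip(dist_matrix, D_MIN, D_MAX - 1e-6)
    return np.digitize(clipped, edges) - 1

rng = np.random.default_rng(0)
coords = rng.uniform(0.0, 30.0, size=(10, 3))   # fake structure
dists = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
print(distance_to_bins(dists))
```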
Auxiliary predictions¶
- Secondary structure prediction
    a. Predicting 8-class secondary structure labels (3-state + 8-state)
    b. Incorporating secondary structure information in the distance prediction
    c. Improving the overall accuracy of the structure prediction
    [Figure: Example of predicted secondary structure probabilities for a protein sequence]
- Solvent accessibility prediction
    a. Predicting relative solvent accessibility for each residue
    b. Providing information about the exposure of residues to the solvent
    c. Aiding in the modeling of surface residues and interactions
    [Figure: Visualization of predicted solvent accessibility scores mapped onto a protein structure]
- Torsion angle prediction
    a. Predicting φ (phi) and ψ (psi) torsion angles for each residue
    b. Using a von Mises distribution to model angular distributions
    c. Guiding the backbone conformation during structure realization
    [Figure: Ramachandran plot showing predicted torsion angle distributions for a protein]
Potential Construction¶
Distance-based potential¶
- Deriving a smooth potential using spline interpolation (see the sketch after this list)
    a. Converting discrete distance probabilities to a continuous potential
    b. Using cubic spline interpolation for smoothness
    c. Enabling gradient-based optimization during structure realization
    [Figure: Illustration of the spline interpolation process, showing discrete probabilities and the resulting smooth potential]
- Incorporating a reference state to account for background distributions
    a. Modeling the background distribution of distances based on sequence length and amino acid composition
    b. Subtracting the reference state potential from the predicted potential
    c. Enhancing the specificity of the distance-based potential
    [Figure: Comparison of the predicted and reference state distance potentials]
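The following Python sketch illustrates the idea of turning binned distance probabilities into a smooth, differentiable potential with a cubic spline. The distogram for the single residue pair and the flat reference state are fabricated for brevity; AlphaFold's actual reference state is fitted to background distance statistics.

```python
import numpy as np
from scipy.interpolate import CubicSpline

bin_centers = np.linspace(2.0, 22.0, 64)

# Fake network output for one residue pair, peaked near 9 A.
logits = -0.5 * ((bin_centers - 9.0) / 1.5) ** 2
probs = np.exp(logits) / np.exp(logits).sum()

reference = np.full(64, 1.0 / 64)   # flat background, for brevity

# Potential = negative log-likelihood ratio, interpolated to a smooth curve.
potential = CubicSpline(bin_centers, -np.log(probs / reference))
print(potential(9.0))    # low energy near the predicted distance
print(potential(20.0))   # high energy far from it
```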
Torsion angle potential¶
- Using a von Mises distribution to model torsion angle preferences
    a. Representing torsion angles as a continuous probability distribution
    b. Capturing the preferred φ (phi) and ψ (psi) angles for each residue
    c. Incorporating backbone conformational preferences in the potential
    [Figure: Example of a von Mises distribution fitted to predicted torsion angle probabilities]
- Combining φ and ψ angle potentials for each residue (see the sketch after this list)
    a. Treating φ and ψ angles as independent variables
    b. Summing the negative log-likelihoods of the von Mises distributions
    c. Creating a joint torsion angle potential for each residue
    [Figure: Equation for combining φ and ψ angle potentials]
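A small sketch of a von Mises-based torsion potential, assuming hypothetical mean angles and concentration parameters for a single residue. It simply sums the two negative log-likelihoods, treating φ and ψ as independent, as in the list above.

```python
import numpy as np
from scipy.stats import vonmises

# Hypothetical predicted preferences (roughly alpha-helical).
phi_mean, phi_kappa = np.deg2rad(-60.0), 4.0
psi_mean, psi_kappa = np.deg2rad(-45.0), 4.0

def torsion_potential(phi, psi):
    """Joint torsion potential: sum of per-angle negative log-likelihoods."""
    return (-vonmises.logpdf(phi, phi_kappa, loc=phi_mean)
            - vonmises.logpdf(psi, psi_kappa, loc=psi_mean))

print(torsion_potential(np.deg2rad(-60), np.deg2rad(-45)))   # near the minimum
print(torsion_potential(np.deg2rad(120), np.deg2rad(120)))   # much higher energy
```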
Steric repulsion potential¶
- Incorporating Rosetta's VdW term (V_score2_smooth) to prevent clashes
    a. Using a simplified Lennard-Jones potential to model atomic repulsion
    b. Preventing unrealistic overlaps between non-bonded atoms
    c. Ensuring physically plausible structures during optimization
    [Figure: Illustration of the steric repulsion potential, showing the energy as a function of atomic distance]
- Efficient computation of the steric repulsion potential
    a. Using a coarse-grained representation of the protein (e.g., Cβ atoms)
    b. Employing distance-dependent cutoffs for computational efficiency
    c. Balancing accuracy and speed in the potential evaluation
    [Figure: Schematic representation of the coarse-grained protein model used for steric repulsion calculations]
Combining potentials and weighting scheme¶
- Weighted sum of distance, torsion, and steric potentials
    a. Assigning weights to each potential term
    b. Balancing the contributions of different energy components
    c. Creating a unified objective function for structure optimization
    [Figure: Equation showing the weighted sum of potential terms]
- Determining optimal weights through cross-validation
    a. Testing different weight combinations on a validation set
    b. Selecting the weights that yield the best structure prediction performance
    c. Ensuring robustness and generalization of the potential
    [Figure: Heatmap showing the cross-validation performance for different weight combinations]
Structure Realization¶
Initialization of backbone torsion angles¶
- Sampling from predicted torsion angle distributions
    a. Generating diverse initial conformations
    b. Ensuring compatibility with the predicted torsion angle preferences
    c. Exploring the conformational space effectively
    [Figure: Illustration of the sampling process, showing multiple initial conformations]
- Assigning initial torsion angles to the protein backbone
    a. Using the predicted φ (phi) and ψ (psi) angles for each residue
    b. Constructing a coarse-grained backbone representation
    c. Preparing the structure for optimization
    [Figure: Schematic representation of the initial backbone conformation with assigned torsion angles]
Optimization using gradient descent¶
- L-BFGS algorithm for minimizing the total potential
    a. Employing a quasi-Newton optimization method
    b. Approximating the inverse Hessian matrix for efficient gradient descent
    c. Handling large-scale optimization problems with limited memory requirements
    [Figure: Flowchart of the L-BFGS optimization process]
- Efficient computation of gradients using the differentiable geometric unit
    a. Representing the protein geometry using a differentiable framework
    b. Enabling gradient flow from the potential to the torsion angles
    c. Allowing for end-to-end optimization of the structure
    [Figure: Schematic representation of the differentiable geometric unit, showing the relationship between torsion angles and Cartesian coordinates]
- Iterative refinement using "noisy" restarts (see the sketch after this list)
    a. Perturbing the optimized structure with random torsion angle noise
    b. Restarting the optimization process from the perturbed conformation
    c. Escaping local minima and exploring alternative structures
    [Figure: Illustration of the noisy restart process, showing multiple optimization trajectories]
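To illustrate gradient-based minimization with noisy restarts, here is a toy Python sketch using SciPy's L-BFGS implementation. The rugged analytic function stands in for the real combined potential, which acts on torsion angles through a differentiable geometric unit.

```python
import numpy as np
from scipy.optimize import minimize

def potential(angles):
    """Toy multi-minimum energy surface over a vector of angles."""
    return np.sum(np.cos(3.0 * angles) + 0.1 * angles ** 2)

rng = np.random.default_rng(0)
angles = rng.uniform(-np.pi, np.pi, size=20)   # sampled initial torsions
best = None
for restart in range(10):
    result = minimize(potential, angles, method="L-BFGS-B")
    if best is None or result.fun < best.fun:
        best = result
    # "Noisy" restart: perturb the best structure found so far.
    angles = best.x + rng.normal(0.0, 0.5, size=best.x.shape)

print(f"best energy after noisy restarts: {best.fun:.3f}")
```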
Model selection and ranking¶
- Generating a pool of low-energy conformations
    a. Collecting structures from multiple optimization runs
    b. Filtering out high-energy and physically implausible conformations
    c. Retaining a diverse set of candidate structures
    [Figure: Energy landscape visualization, showing multiple low-energy conformations]
- Ranking models based on the total potential energy
    a. Evaluating the final potential energy for each candidate structure
    b. Sorting the structures based on their energy values
    c. Selecting the lowest-energy conformation as the predicted structure
    [Figure: Bar plot showing the ranked candidate structures based on their potential energy]
Full-atom refinement using Rosetta (optional)¶
- Adding side-chain atoms to the predicted backbone
    a. Using Rosetta's side-chain packing algorithms
    b. Optimizing the side-chain conformations based on Rosetta's energy function
    c. Incorporating physical and statistical potentials for accurate side-chain placement
    [Figure: Illustration of the side-chain addition process, showing the transition from a coarse-grained to a full-atom model]
- Refining the full-atom model using Rosetta's energy minimization
    a. Employing Rosetta's all-atom energy function (e.g., Talaris2014)
    b. Performing gradient-based minimization to optimize the structure
    c. Fine-tuning the predicted structure to improve its physical realism
    [Figure: Comparison of the structure before and after the full-atom refinement step]
Conclusion and Future Directions¶
Recap of AlphaFold's key innovations¶
- Deep learning architecture for accurate prediction of protein structures
    a. Highlighting the success of the deep residual convolutional network in capturing complex sequence-structure relationships
    b. Emphasizing the importance of the distogram prediction and auxiliary tasks in guiding the structure prediction process
    [Figure: Schematic overview of AlphaFold's key architectural components and their contributions to the overall performance]
- Incorporation of coevolutionary information from multiple sequence alignments
    a. Reiterating the significance of MSAs in providing valuable evolutionary constraints
    b. Discussing the effectiveness of AlphaFold's approach in leveraging coevolutionary information for structure prediction
    [Figure: Illustration of how coevolutionary information from MSAs is integrated into AlphaFold's prediction pipeline]
- Construction of a differentiable potential for gradient-based optimization
    a. Summarizing the novelty of AlphaFold's potential construction approach, combining distance, torsion, and steric terms
    b. Highlighting the advantages of a differentiable potential for efficient gradient-based optimization
    [Figure: Schematic representation of AlphaFold's differentiable potential and its role in the structure realization process]
Potential applications of improved protein structure prediction¶
- Accelerating protein structure determination and understanding protein function
    a. Discussing how AlphaFold's accurate predictions can complement and guide experimental structure determination efforts
    b. Highlighting the potential of predicted structures in providing insights into protein function and mechanisms
    [Figure: Examples of how AlphaFold's predictions can be used to understand protein function and guide experimental studies]
- Facilitating rational drug design and targeted therapeutics
    a. Explaining the importance of protein structures in the drug discovery process
    b. Illustrating how AlphaFold's predictions can aid in the identification of drug targets and the design of targeted therapeutics
    [Figure: Case studies showcasing the application of AlphaFold's predictions in rational drug design and the development of targeted therapies]
- Enabling large-scale structural characterization of proteomes
    a. Discussing the potential of AlphaFold in predicting structures for entire proteomes, including proteins that are difficult to study experimentally
    b. Highlighting the implications of large-scale structural characterization for understanding cellular processes and disease mechanisms
    [Figure: Visualization of a predicted structural proteome, emphasizing the scale and diversity of structures that can be characterized using AlphaFold]
Limitations and areas for further improvement¶
- Addressing the challenges of predicting structures for large multi-domain proteins and protein complexes
    a. Discussing the current limitations of AlphaFold in handling large multi-domain proteins and protein-protein interactions
    b. Identifying potential strategies for extending AlphaFold's capabilities to accurately predict structures of these challenging targets
    [Figure: Examples of large multi-domain proteins and protein complexes that pose challenges for current structure prediction methods]
- Improving the interpretability and biological plausibility of predicted structures
    a. Highlighting the need for further validation and refinement of AlphaFold's predictions to ensure their biological relevance
    b. Discussing potential approaches for incorporating additional experimental constraints and biological knowledge into the prediction process
    [Figure: Schematic representation of the integration of experimental data and biological knowledge to improve the interpretability and plausibility of predicted structures]
- Incorporating information about protein dynamics and conformational flexibility
    a. Acknowledging the limitations of current structure prediction methods in capturing protein dynamics and conformational flexibility
    b. Discussing potential avenues for extending AlphaFold to predict multiple conformational states and model protein dynamics
    [Figure: Illustration of protein conformational flexibility and the potential for AlphaFold to predict multiple conformational states]
- Senior, A. W., Evans, R., Jumper, J., Kirkpatrick, J., Sifre, L., Green, T., ... & Hassabis, D. (2020). Improved protein structure prediction using potentials from deep learning. Nature, 577(7792), 706-710. DOI: 10.1038/s41586-019-1923-7
RoseTTAFold¶
TODO:
- Baek, M., DiMaio, F., Anishchenko, I., Dauparas, J., Ovchinnikov, S., Lee, G. R., ... & Baker, D. (2021). Accurate prediction of protein structures and interactions using a three-track neural network. Science, 373(6557), 871-876.
- Krishna, R., Wang, J., Ahern, W., Sturmfels, P., Venkatesh, P., Kalvet, I., ... & Baker, D. (2024). Generalized biomolecular modeling and design with RoseTTAFold All-Atom. Science, eadl2528.
ESMfold¶
TODO:
- Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., ... & Rives, A. (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637), 1123-1130. DOI: 10.1126/science.ade2574
Chroma¶
TODO:
- Ingraham, J. B., Baranov, M., Costello, Z., Barber, K. W., Wang, W., Ismail, A., ... & Grigoryan, G. (2023). Illuminating protein space with a programmable generative model. Nature, 623(7989), 1070-1078. DOI: 10.1038/s41586-023-06728-8
EMBER2¶
TODO:
- Weissenow, K., Heinzinger, M., & Rost, B. (2022). Protein language-model embeddings for fast, accurate, and alignment-free protein structure prediction. Structure, 30(8), 1169-1177. DOI: 10.1016/j.str.2022.05.001
FrameDiff¶
TODO:
- Yim, J., Trippe, B. L., De Bortoli, V., Mathieu, E., Doucet, A., Barzilay, R., & Jaakkola, T. (2023). SE(3) diffusion model with application to protein backbone generation. arXiv preprint arXiv:2302.02277. DOI: 10.48550/arXiv.2302.02277
ProteinBERT¶
TODO:
- Brandes, N., Ofer, D., Peleg, Y., Rappoport, N., & Linial, M. (2022). ProteinBERT: a universal deep-learning model of protein sequence and function. Bioinformatics, 38(8), 2102-2110. DOI: 10.1093/bioinformatics/btac020
RGN2¶
TODO:
- Chowdhury, R., Bouatta, N., Biswas, S., Floristean, C., Kharkar, A., Roy, K., ... & AlQuraishi, M. (2022). Single-sequence protein structure prediction using a language model and deep learning. Nature Biotechnology, 40(11), 1617-1623.
OpenFold¶
TODO:
- Ahdritz, G., Bouatta, N., Floristean, C., Kadyan, S., Xia, Q., Gerecke, W., ... & AlQuraishi, M. (2024). OpenFold: Retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization. Nature Methods, 1-11. DOI: 10.1038/s41592-024-02272-z
Homology¶
TODO:
Threading¶
TODO:
RNA¶
TODO:
Bioenergetics¶
TODO:
Imaging¶
TODO:
Molecular¶
TODO:
Dynamic light scattering¶
Dynamic Light Scattering (DLS) is a powerful analytical technique used to determine the size distribution of small particles in suspension or polymers in solution. It measures the time-dependent fluctuations in the intensity of scattered light due to the Brownian motion of particles. These fluctuations provide information about the particle size through the calculation of the diffusion coefficient.
- Kaszuba, M., McKnight, D., Connah, M. T., McNeil-Watson, F. K., & Nobbmann, U. (2008). Measuring sub nanometre sizes using dynamic light scattering. Journal of Nanoparticle Research, 10, 823-829. DOI: 10.1007/s11051-007-9317-4
- Pecora, R. (2000). Dynamic light scattering measurement of nanometer particles in liquids. Journal of Nanoparticle Research, 2, 123-131. DOI: 10.1023/A:1010067107182
- Stetefeld, J., McKenna, S. A., & Patel, T. R. (2016). Dynamic light scattering: a practical guide and applications in biomedical sciences. Biophysical Reviews, 8, 409-427. DOI: 10.1007/s12551-016-0218-6
Theoretical Background¶
TODO:
Brownian motion¶
Brownian Motion is the random movement of particles suspended in a fluid (liquid or gas), resulting from collisions with the molecules of the fluid. This phenomenon was first observed by botanist Robert Brown in 1827 while studying pollen grains in water. The motion of these particles, though seemingly random, can provide valuable insights into the properties of the particles and the fluid in which they are suspended.
Brownian motion has two key aspects:
- its random nature and
- its microscopic origin.
The motion is erratic and unpredictable because it results from countless collisions with molecules that are themselves moving randomly. This randomness is a fundamental characteristic of thermal motion at the molecular level. On a microscopic scale, the constant, thermally driven impacts of fluid molecules on suspended particles are what give rise to Brownian motion.
Zero net movement¶
Diffusion Coefficient¶
Diffusion Coefficient (D) quantifies the rate at which particles move through a fluid. It is a crucial parameter in understanding how particles disperse over time.
The diffusion coefficient measures how fast particles spread out in a fluid. It's like observing how quickly a drop of ink disperses in a glass of water.
Key Points:
- Rate of Spread: The diffusion coefficient indicates how quickly particles move through the fluid. Higher diffusion coefficients mean faster spreading.
- Dependence on Temperature and Viscosity: The diffusion coefficient is directly proportional to temperature and inversely proportional to the viscosity of the fluid.
Mathematical Explanation¶
The Stokes-Einstein equation relates the diffusion coefficient to the hydrodynamic radius (\(R_H\)) of the particles:
\[ D = \frac{k_B T}{6 \pi \eta R_H} \]
Where:
- \(k_B\) is the Boltzmann constant (\(1.380 \times 10^{-23} \, \text{J/K}\)),
- \(T\) is the absolute temperature (in Kelvin),
- \(\eta\) is the viscosity of the fluid (in Pa·s),
- \(R_H\) is the hydrodynamic radius of the particle (in meters).
Worked Example¶
A suspension of nanoparticles in water is maintained at \(298 \, \text{K}\) (25°C). The viscosity of water is \(0.89 \times 10^{-3} \, \text{Pa} \cdot \text{s}\).
The diffusion coefficient \(D\) is measured to be \(1.43 \times 10^{-12} \, \text{m}^2/\text{s}\).
Using the Stokes-Einstein equation rearranged for \(R_H\):
\[ R_H = \frac{k_B T}{6 \pi \eta D} = \frac{1.380 \times 10^{-23} \, \text{J/K} \times 298 \, \text{K}}{6 \pi \times 0.89 \times 10^{-3} \, \text{Pa} \cdot \text{s} \times 1.43 \times 10^{-12} \, \text{m}^2/\text{s}} \approx 1.7 \times 10^{-7} \, \text{m} = 170 \, \text{nm} \]
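A quick numerical check of this worked example in Python, using the values given above:

```python
import numpy as np

KB = 1.380649e-23  # Boltzmann constant, J/K

def hydrodynamic_radius(d_coeff, temperature, viscosity):
    """Invert the Stokes-Einstein equation to obtain R_H in meters."""
    return KB * temperature / (6.0 * np.pi * viscosity * d_coeff)

r_h = hydrodynamic_radius(d_coeff=1.43e-12, temperature=298.0, viscosity=0.89e-3)
print(f"R_H = {r_h:.2e} m = {r_h * 1e9:.0f} nm")  # ~1.7e-07 m, ~171 nm
```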
Self-Assessment Questions¶
- How does the viscosity of a fluid affect the diffusion coefficient of particles suspended in it?
Answer
The viscosity of the fluid affects how easily the particles can move; higher viscosity means more resistance to motion, resulting in a lower diffusion coefficient.
- Why is the diffusion coefficient temperature-dependent?
Answer
The diffusion coefficient is temperature-dependent because higher temperatures increase the kinetic energy of the fluid molecules, leading to more vigorous collisions and faster particle movement.
- Given a fluid with viscosity \(\eta = 1.0 \times 10^{-3} \, \text{Pa} \cdot \text{s}\) and temperature \(T = 300 \, \text{K}\), calculate the diffusion coefficient for particles with a hydrodynamic radius of \(50 \, \text{nm}\).
Answer
Using the Stokes-Einstein equation:
\[ D = \frac{k_B T}{6 \pi \eta R_H} = \frac{1.38 \times 10^{-23} \, \text{J/K} \times 300 \, \text{K}}{6 \pi \times 1.0 \times 10^{-3} \, \text{Pa} \cdot \text{s} \times 50 \times 10^{-9} \, \text{m}} \approx 4.4 \times 10^{-12} \, \text{m}^2/\text{s} \]
- If the diffusion coefficient is \(1.33 \times 10^{-12} \, \text{m}^2/\text{s}\) in the same fluid and at the same temperature, what is the hydrodynamic radius of the particles?
Answer
Rearranging the Stokes-Einstein equation to solve for \(R_H\):
\[ R_H = \frac{k_B T}{6 \pi \eta D} = \frac{1.38 \times 10^{-23} \, \text{J/K} \times 300 \, \text{K}}{6 \pi \times 1.0 \times 10^{-3} \, \text{Pa} \cdot \text{s} \times 1.33 \times 10^{-12} \, \text{m}^2/\text{s}} \approx 1.65 \times 10^{-7} \, \text{m} = 165 \, \text{nm} \]
Light Scattering Theory¶
Dynamic light scattering (DLS) is a powerful technique used to determine the size distribution of small particles in suspension or polymers in solution. At the heart of DLS lies the phenomenon of Rayleigh scattering, which occurs when light interacts with particles much smaller than its wavelength.
Rayleigh scattering basics¶
Rayleigh scattering is an elastic scattering process, meaning that the scattered light has the same frequency (and thus, the same energy) as the incident light. In DLS, Rayleigh scattering occurs when the laser light interacts with particles that are typically less than one-tenth of the light's wavelength, such as nanoparticles or macromolecules in a liquid. When the laser light hits these small particles, the oscillating electric field of the light induces an oscillating dipole moment in the particles. The oscillating dipole moment causes the particles to radiate scattered light in all directions, acting as point-like scatterers.
The intensity of the scattered light depends on several factors, including the size of the particles, the wavelength of the light, and the scattering angle. In DLS, the intensity of the scattered light (\(I\)) due to Rayleigh scattering is proportional to the sixth power of the particle diameter (\(d^6\)) and inversely proportional to the fourth power of the wavelength (\(\lambda^4\)):
\[ I \propto \frac{d^6}{\lambda^4} \]
This relationship has important implications for DLS measurements:
- Smaller particles scatter light less intensely than larger ones. As the particle size decreases, the scattering intensity decreases dramatically, making it more challenging to detect very small particles.
- The wavelength of the laser light affects the scattering intensity. Using shorter wavelengths (e.g., blue light) can enhance the scattering intensity, improving the sensitivity of DLS measurements.
Intensity fluctuations and brownian motion¶
In DLS, the scattered light intensity fluctuates over time due to the Brownian motion of the particles in the liquid. The rate of these intensity fluctuations depends on the particle size and the viscosity of the solvent. Smaller particles, with their higher diffusion coefficients, move more rapidly and cause faster fluctuations in the scattered light intensity. Larger particles, with their lower diffusion coefficients, move more slowly and produce slower intensity fluctuations. By analyzing the rate of these intensity fluctuations using an autocorrelation function, DLS can determine the diffusion coefficients and, consequently, the particle sizes.
The intensity and angle of the scattered light depend on the particle size and shape. In DLS, the intensity fluctuations of the scattered light are measured over time, providing information about the diffusion coefficient and hence the size of the particles.
The wave vector \(q\) in light scattering is given by:
\[ q = \frac{4 \pi n}{\lambda} \sin\left(\frac{\theta}{2}\right) \]
where:
- \(\lambda\) is the wavelength of the incident light in vacuum,
- \(n\) is the refractive index of the medium,
- \(\theta\) is the scattering angle.
Autocorrelation Function¶
The autocorrelation function describes the correlation between the intensity of scattered light at different times. It decays as the time difference increases, with the rate of decay providing information about the particle's diffusion coefficient.
For monodisperse particles, the autocorrelation function \(g_2(\tau)\) can be represented as:
\[ g_2(\tau) = 1 + \beta \, e^{-2 \Gamma \tau} \]
where:
- \(\tau\) is the delay time,
- \(\beta\) is an instrument-dependent coherence factor,
- \(\Gamma\) is the decay constant related to the diffusion coefficient \(D\) by \(\Gamma = D q^2\).
For polydisperse particles, the autocorrelation function is a weighted sum of exponential decays:
\[ g_2(\tau) = 1 + \beta \left( \int_0^\infty G(\Gamma) \, e^{-\Gamma \tau} \, d\Gamma \right)^2 \]
where \(G(\Gamma)\) is the distribution of decay rates.
Stokes-Einstein Equation¶
The Stokes-Einstein equation relates the diffusion coefficient (\(D\)) obtained from DLS measurements to the hydrodynamic radius (\(R_h\)) of the particles:
\[ D = \frac{k_B T}{6 \pi \eta R_h} \]
Where:
- \(k_B\) is the Boltzmann constant
- \(T\) is the absolute temperature
- \(\eta\) is the viscosity of the solvent
By measuring the diffusion coefficient using DLS and applying the Stokes-Einstein equation, researchers can determine the size distribution of the particles in the sample.
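As an end-to-end illustration of this section, the following Python sketch simulates a monodisperse correlogram with the Siegert relation above and recovers the hydrodynamic radius via the Stokes-Einstein equation. The instrument parameters (633 nm He-Ne laser, water at 298 K, 173° backscatter detection) are assumed for illustration only.

```python
import numpy as np
from scipy.optimize import curve_fit

KB = 1.380649e-23          # Boltzmann constant, J/K
T = 298.0                  # temperature, K
ETA = 0.89e-3              # viscosity of water, Pa*s
WAVELENGTH = 633e-9        # He-Ne laser wavelength, m
N_MEDIUM = 1.33            # refractive index of water
THETA = np.deg2rad(173.0)  # backscatter detection angle

q = 4.0 * np.pi * N_MEDIUM / WAVELENGTH * np.sin(THETA / 2.0)

def g2(tau, beta, gamma):
    """Siegert relation for a monodisperse sample."""
    return 1.0 + beta * np.exp(-2.0 * gamma * tau)

# Synthetic correlogram for 50 nm radius particles, plus detector noise.
r_true = 50e-9
d_true = KB * T / (6.0 * np.pi * ETA * r_true)
tau = np.logspace(-7, -1, 200)
rng = np.random.default_rng(1)
data = g2(tau, 0.9, d_true * q**2) + rng.normal(0.0, 1e-3, tau.size)

(beta_fit, gamma_fit), _ = curve_fit(g2, tau, data, p0=(1.0, 1e4))
d_fit = gamma_fit / q**2
r_fit = KB * T / (6.0 * np.pi * ETA * d_fit)
print(f"fitted hydrodynamic radius: {r_fit * 1e9:.1f} nm")  # ~50 nm
```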
Instrumentation¶
Basic Components:
- Laser Source: A coherent light source, typically a He-Ne laser or a diode laser, with a specific wavelength (e.g., 633 nm).
- Sample Holder: Usually a cuvette made of quartz or glass, which holds the sample solution.
- Detector: A photomultiplier tube (PMT) or an avalanche photodiode (APD) detects the scattered light.
- Autocorrelator: A digital device that computes the time correlation of the scattered light intensity.
Photon Correlation Spectroscopy (PCS):
- PCS measures the time fluctuations of scattered light intensity. The autocorrelation function is analyzed to obtain the diffusion coefficient and particle size distribution.
- PCS is sensitive to particles ranging from a few nanometers to several micrometers.
Depolarized Fabry-Perot Interferometry (FPI):
- FPI measures the frequency broadening of scattered light due to rotational diffusion of anisotropic particles. It provides information about particle shape and dimensions.
Measurement Configurations:
- Backscatter Detection (173°): Enhances sensitivity for small particles by increasing the detected count rate. It is particularly useful for measuring particles at low concentrations.
[Figure: A detailed diagram showing the laser source, sample holder, detector, and autocorrelator setup, with labels and brief descriptions of each component.]
Experimental procedures¶
Sample Preparation:
- Use high-purity solvents and buffers to minimize contamination.
- Filter samples using 0.22 µm or smaller filters to remove dust and aggregates.
- Degas samples to remove air bubbles that can interfere with light scattering measurements.
Instrument Calibration:
- Calibrate the DLS instrument using standard particles of known size (e.g., polystyrene latex beads) to ensure accurate measurements.
- Perform calibration regularly to account for any drifts or changes in instrument performance.
Measurement Protocols:
- Maintain a consistent temperature using a temperature-controlled sample holder or water bath.
- Measure samples at several concentrations to ensure reliable data, keeping concentrations low enough to avoid multiple scattering effects.
- Perform multiple repetitions to obtain average values and assess measurement repeatability.
Troubleshooting Common Issues:
- Multiple scattering: reduce the sample concentration.
- Dust contamination: filter the samples.
- Sample instability: ensure proper dispersion and the absence of aggregation.
Data acquisition and analysis¶
Collecting Scattered Light Intensity Data:
- The intensity of scattered light is recorded over time, capturing fluctuations due to particle Brownian motion. The data is typically collected for several seconds to minutes, depending on the sample.
Calculating the Autocorrelation Function:
- The time correlation function of the scattered light intensity is computed, which decays over time. The rate of decay is related to the particle diffusion coefficient (a computational sketch follows below).
[Figure: Graph illustrating the autocorrelation function decay for monodisperse and polydisperse samples, showing the differences in decay rates and how they relate to particle size distribution.]
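As a computational sketch of this step, the following Python snippet estimates a normalized intensity autocorrelation function directly from a synthetic photon-count trace. Real correlators use multi-tau schemes for efficiency, but the underlying definition is the same.

```python
import numpy as np

def intensity_autocorrelation(intensity, max_lag):
    """Normalized intensity autocorrelation g2(tau) from a count trace."""
    i_mean_sq = intensity.mean() ** 2
    return np.array([
        (intensity[: intensity.size - lag] * intensity[lag:]).mean() / i_mean_sq
        for lag in range(1, max_lag + 1)
    ])

# Synthetic trace: exponentially correlated fluctuations around a mean rate.
rng = np.random.default_rng(0)
n, rho = 20_000, 0.99          # rho sets the correlation time of the signal
x = np.zeros(n)
for t in range(1, n):          # AR(1) process standing in for the DLS signal
    x[t] = rho * x[t - 1] + rng.normal()
trace = 1000.0 + 50.0 * x      # mean intensity of 1000 counts

g2 = intensity_autocorrelation(trace, max_lag=300)
print(g2[0], g2[-1])           # starts above 1, decays toward 1 at long lags
```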
Determining the Diffusion Coefficient:
- The autocorrelation function is analyzed to extract the diffusion coefficient (D). For monodisperse samples, this involves fitting a single exponential decay; for polydisperse samples, multiple exponential decays are fitted.
Converting Diffusion Coefficient to Particle Size:
- Using the Stokes-Einstein equation, the diffusion coefficient is converted to the hydrodynamic diameter (D_H). The viscosity of the solvent and the temperature must be known accurately.
Data Interpretation:
- Monodisperse samples show a single peak in the size distribution, while polydisperse samples show multiple peaks.
- The polydispersity index (PDI) provides a measure of the sample's size distribution width. A low PDI (<0.1) indicates a monodisperse sample, while a high PDI (>0.1) indicates polydispersity (see the cumulant-analysis sketch below).
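The PDI mentioned above is commonly obtained by cumulant analysis. Here is a minimal Python sketch of that procedure on a synthetic, noise-free correlogram; real analyses fit noisy data and weight the points appropriately.

```python
import numpy as np

def cumulant_pdi(tau, g2, beta):
    """Cumulant analysis of a DLS correlogram.

    Fits ln g1(tau) with a quadratic: the linear term gives the mean decay
    rate gamma and the quadratic term gives PDI = mu2 / gamma**2.
    """
    y = 0.5 * np.log((g2 - 1.0) / beta)   # ln g1 via the Siegert relation
    c2, c1, _ = np.polyfit(tau, y, 2)     # y ~ c0 - gamma*tau + (mu2/2)*tau^2
    gamma, mu2 = -c1, 2.0 * c2
    return gamma, mu2 / gamma**2

# Monodisperse synthetic data: the PDI should come out near zero.
tau = np.linspace(1e-5, 2e-3, 100)
beta, gamma_true = 0.9, 2000.0
g2 = 1.0 + beta * np.exp(-2.0 * gamma_true * tau)

gamma_fit, pdi = cumulant_pdi(tau, g2, beta)
print(f"gamma = {gamma_fit:.0f} 1/s, PDI = {pdi:.4f}")  # ~2000 1/s, ~0
```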
Software Tools:
- Software tools such as the Malvern Zetasizer, Brookhaven Instruments, and Horiba DLS software are used for data analysis. These tools provide graphical representations of size distributions and various statistical parameters.
Case Studies and Applications¶
Measuring Protein Sizes: DLS is used to determine the size distribution and aggregation state of proteins in solution, important for understanding their function and stability.
Characterizing Nanoparticles: DLS provides information on the size and distribution of nanoparticles, crucial for applications in nanotechnology and materials science.
Quality Control in Pharmaceuticals: Ensuring consistent particle size distribution in drug formulations.
Research Applications: DLS is widely used in research to study colloids, polymers, and other macromolecules in various solvents.
Advanced Topics¶
Measuring Anisotropic Particles: Techniques for analyzing particles that are not spherical, providing information about their shape and orientation.
Combining DLS with Other Techniques: Using DLS in conjunction with X-ray scattering, electron microscopy, and other methods to obtain comprehensive data about particle size and structure.
Developments in DLS Technology: Recent advancements in instrumentation and software that improve the sensitivity, accuracy, and range of DLS measurements.
Electron paramagnetic resonance¶
TODO:
Förster resonance energy transfer¶
TODO:
Hydrogen-deuterium exchange¶
TODO: