Lecture 7 Video 14

Protein structure

🧬 Protein Structure Refinement & Validation: Full Summary

This lecture explains the final stages of X-ray crystallographic structure determination: how we improve, validate, and judge the quality of a protein model before publication.

Think of this stage as:

🧩 You already built a rough protein model → now you polish it, test it, and verify that it truly matches the experimental data.


🔧 Structure Refinement: Improving the Model

After building an initial model, refinement aims to:

✅ Minimize the difference between:

  • Observed structure factors (experimental diffraction data)
  • Calculated structure factors (from the model)

This is the central refinement goal.
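
As a sketch of what "minimizing the difference" can mean, one classical choice is a weighted least-squares residual. The weights w and the scale factor k are symbols introduced here for illustration, and modern programs often use a maximum-likelihood target instead:

```latex
Q = \sum_{hkl} w_{hkl} \left( F_{obs}(hkl) - k \, F_{calc}(hkl) \right)^2
```

Here F_obs and F_calc are the observed and calculated structure-factor amplitudes, and refinement adjusts the model so that Q becomes as small as possible.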


📐 Using Chemical Knowledge in Refinement

Refinement is not blind fitting. It is guided by known stereochemistry, usually applied as restraints:

  • Bond lengths (C–C, C–O etc.)
  • Bond angles
  • Torsion angles
  • Planarity of peptide bonds
  • Amino-acid chirality
  • van der Waals radii

These help guide the model toward physically realistic conformations.


⛰️ Local vs Global Minimum Problem

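Refinement can get stuck in a local minimum close to the initial model.

Two optimization approaches are used:

1️⃣ Least-squares optimization

  • Adjust parameters gradually
  • Move toward lower residual error
  • Only moves downhill, so it can stay trapped in the nearest local minimum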

2️⃣ Simulated annealing 🔥❄️

(Molecular-dynamics style refinement)

  • "Heat" atoms → increase mobility
  • "Cool" system → settle into better minimum
  • Helps escape incorrect conformations

Goal → reach global minimum = best model.
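
To make the local vs global minimum idea concrete, here is a minimal toy sketch. It assumes a made-up one-dimensional "misfit" function, a hypothetical geometric cooling schedule, and a Metropolis acceptance rule. Real refinement programs anneal thousands of atomic coordinates, so this only illustrates the principle:

```python
import math
import random


def energy(x):
    """Toy 1-D misfit: shallow minimum near x = 1.13, deeper one near x = -1.30."""
    return x**4 - 3 * x**2 + x


def downhill(x, step=0.01, n_steps=2000):
    """Plain gradient descent: follows the negative gradient only."""
    for _ in range(n_steps):
        x -= step * (4 * x**3 - 6 * x + 1)
    return x


def simulated_annealing(x, t_start=2.0, t_end=0.01, n_steps=5000, seed=0):
    """Random moves accepted by the Metropolis rule while the temperature drops."""
    rng = random.Random(seed)
    best = x
    for i in range(n_steps):
        # Temperature decays geometrically from t_start to t_end ("cooling").
        t = t_start * (t_end / t_start) ** (i / n_steps)
        trial = x + rng.gauss(0.0, 0.5)
        delta = energy(trial) - energy(x)
        # Always accept downhill moves, sometimes accept uphill moves.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            x = trial
        if energy(x) < energy(best):
            best = x
    return best


start = 1.5
x_local = downhill(start)               # ends in the nearby shallow minimum
x_annealed = simulated_annealing(start)  # lowest-energy point visited
```

Starting from the same point, the downhill search stays in the nearest minimum, while the annealing run can cross the barrier while it is "hot".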


⚙️ Constraints vs Restraints (VERY exam-important)

These balance the number of model parameters against the amount of data.


🔒 Constraints

Reduce number of parameters.

Example:

  • Instead of one B-factor per atom
  • Use one B-factor per residue (group B-factor)

Why?

👉 Low-resolution data → fewer reflections
👉 Too many parameters → overfitting

Constraints prevent over-parameterization.
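
A rough sketch of the bookkeeping behind this, assuming each atom contributes x, y, z plus either its own B-factor or a shared per-residue B-factor. The function name and the example numbers are hypothetical:

```python
def n_parameters(n_atoms, n_residues, b_model="individual"):
    """Count refined parameters: 3 coordinates per atom plus B-factors."""
    xyz = 3 * n_atoms
    if b_model == "individual":
        return xyz + n_atoms      # one B-factor per atom
    if b_model == "group":
        return xyz + n_residues   # one B-factor per residue
    raise ValueError(f"unknown B-factor model: {b_model}")


# Hypothetical low-resolution case: 2000 atoms, 250 residues, 6000 reflections.
n_reflections = 6000
ratio_individual = n_reflections / n_parameters(2000, 250, "individual")
ratio_group = n_reflections / n_parameters(2000, 250, "group")
```

The data-to-parameter ratio rises when group B-factors are used, which is the point of the constraint.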


🧷 Restraints

Allow flexibility but within allowed ranges:

  • Bond length intervals
  • Angle intervals

Model can move, but not unrealistically.
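
One common way to express a restraint is a harmonic penalty that grows as a value drifts away from its ideal. The sketch below assumes that form. The ideal length and sigma are illustrative numbers, not taken from the lecture:

```python
def restraint_penalty(observed, ideal, sigma):
    """Harmonic penalty: zero at the ideal value, growing with the squared deviation."""
    return ((observed - ideal) / sigma) ** 2


# Illustrative values only: a bond with ideal length 1.33 A and tolerance 0.02 A.
ideal_length, tolerance = 1.33, 0.02
penalty_ok = restraint_penalty(1.33, ideal_length, tolerance)
penalty_strained = restraint_penalty(1.37, ideal_length, tolerance)
```

Unlike a constraint, nothing is fixed here: the bond may deviate, but larger deviations cost more in the refinement target.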


🌫️ B-factor (Atomic Displacement)

Describes atomic mobility / disorder.

  • Low B → rigid atoms → sharp diffraction → high resolution
  • High B → flexible atoms → blurred diffraction → low resolution

High B-factors cause:

➡ Faster fall-off of scattering with resolution
➡ Poor high-resolution density visibility

Especially important for:

  • Flexible proteins
  • Loop regions
  • Ligands with partial occupancy
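
The fall-off can be sketched with the standard isotropic Debye-Waller factor, exp(-B sin²θ/λ²), which equals exp(-B / 4d²) at resolution d, together with the relation B = 8π²⟨u²⟩. The function names are made up for this example:

```python
import math


def rms_displacement(b_factor):
    """Root-mean-square displacement (A) implied by an isotropic B-factor (A^2)."""
    return math.sqrt(b_factor / (8 * math.pi**2))


def attenuation(b_factor, d_spacing):
    """Factor by which an atom's scattering is damped at resolution d (A)."""
    return math.exp(-b_factor / (4 * d_spacing**2))


# Compare a rigid atom (B = 20) with a mobile one (B = 80) at 2.0 A resolution.
rigid = attenuation(20.0, 2.0)
mobile = attenuation(80.0, 2.0)
```

The mobile atom contributes far less to the high-resolution reflections, which is why its density looks weak or smeared.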

📊 R-factor: Core Refinement Statistic

Measures mismatch between data and model.

R = \frac{\sum \left| F_{obs} - F_{calc} \right|}{\sum F_{obs}}

  • Perfect model → R = 0 (never achieved)
  • Good protein model → R < ~20%

Refinement aims to reduce R continuously.
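
A minimal sketch of the R-factor calculation, assuming the calculated amplitudes have already been put on the same scale as the observed ones (real programs also fit a scale factor). The amplitudes are invented:

```python
def r_factor(f_obs, f_calc):
    """R = sum|Fobs - Fcalc| / sum(Fobs) over matching reflections."""
    numerator = sum(abs(fo - fc) for fo, fc in zip(f_obs, f_calc))
    return numerator / sum(f_obs)


# Invented amplitudes for three reflections.
observed = [100.0, 50.0, 25.0]
calculated = [90.0, 55.0, 25.0]
r = r_factor(observed, calculated)
```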


🧪 R-free: Validation Against Overfitting

Super important concept ⭐

Procedure:

  • Randomly set aside ~5% of the reflections (the test set)
  • Do NOT use them in refinement
  • Calculate R-free using only these reflections

Interpretation:

| Situation | Meaning |
|---|---|
| Rwork ↓ and Rfree ↓ | Model improving |
| Rwork ↓ but Rfree ↑ | ❗ Overfitting noise |
| Rfree ≈ 59% | Random model (expected value for acentric reflections) |

A difference of ≈ 5 percentage points between Rwork and Rfree is typical.
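
The procedure can be sketched as a random split of reflection indices into a working set and a free set. The function name, the fixed seed, and the default 5% fraction are choices made for this example:

```python
import random


def split_work_free(n_reflections, free_fraction=0.05, seed=0):
    """Randomly flag a fraction of reflections as the free (test) set."""
    rng = random.Random(seed)
    indices = list(range(n_reflections))
    rng.shuffle(indices)
    n_free = round(n_reflections * free_fraction)
    free_set = set(indices[:n_free])
    work_set = set(indices[n_free:])
    return work_set, free_set


work, free = split_work_free(1000)
```

Rwork is then computed over `work` and Rfree over `free`. The free set must be chosen once and kept fixed for the whole refinement, otherwise it stops being independent.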


📐 Ramachandran Plot: Geometry Validation

Plots the backbone torsion angles φ (phi) vs ψ (psi) of each residue.

Regions:

🔴 Most favoured (core)
🟡 Additional allowed
🟨 Generously allowed
⚪ Disallowed

Good model:

  • Most residues in the most favoured regions
  • Very few in disallowed regions

Exception:

👉 Catalytic residues may appear strained but be real, so always check the electron density.
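
Each point on the plot comes from two torsion angles: φ is defined by the atoms C(i-1), N, CA, C and ψ by N, CA, C, N(i+1). Below is a sketch of a generic torsion calculation from four positions, with helper names invented for this example and a sign convention intended to match the usual one (positive for a clockwise rotation when looking down the central bond):

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def torsion_deg(p0, p1, p2, p3):
    """Torsion angle (degrees) about the p1-p2 bond for four atom positions."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))


# Simple geometric checks with made-up points (not real protein coordinates).
right_angle = torsion_deg((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
trans = torsion_deg((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
cis = torsion_deg((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
```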


🔍 Real Space Correlation Coefficient (RSCC)

Measures how well the electron density calculated from the model matches the observed density.

Good value:

RSCC > 0.9

Low RSCC + high B-factor → poorly defined region.
Typical examples: flexible loops or incorrectly modeled ligands.
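
In essence RSCC is a linear correlation coefficient between two sets of density values sampled at the same grid points around a residue or ligand. This sketch assumes the densities are already given as plain lists. The numbers are invented:

```python
import math


def rscc(rho_obs, rho_calc):
    """Pearson correlation between observed and model-calculated density values."""
    n = len(rho_obs)
    mean_o = sum(rho_obs) / n
    mean_c = sum(rho_calc) / n
    cov = sum((o - mean_o) * (c - mean_c) for o, c in zip(rho_obs, rho_calc))
    var_o = sum((o - mean_o) ** 2 for o in rho_obs)
    var_c = sum((c - mean_c) ** 2 for c in rho_calc)
    return cov / math.sqrt(var_o * var_c)


# Invented density samples: a perfect match in shape, and a complete mismatch.
good_fit = rscc([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
bad_fit = rscc([1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0])
```

Because it is a correlation, RSCC compares the shape of the density rather than its absolute scale.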


💊 Ligand Modeling Issues

Ligands often:

  • Have higher B-factors
  • Have lower occupancy
  • Show weak density

Reasons:

  • Not all binding sites occupied
  • Conformational disorder
  • Incorrect placement by crystallographer

Contour level matters:

  • ~1σ = standard map interpretation
  • <0.8σ = risky → may see noise instead of real density

📈 Data Collection Statistics (Tables in Papers)

Typical parameters:


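🔒 Measured vs Unique Reflections

  • Measured = all recorded observations; unique = what remains after merging symmetry-equivalent and repeated measurements
  • Higher resolution → more unique reflections → more model parameters can be refined

🔁 Redundancy (Multiplicity)

\text{Redundancy} = \frac{\text{Measured reflections}}{\text{Unique reflections}}

Higher redundancy → better precision.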


🧩 Completeness

Fraction of the theoretically possible unique reflections that was actually measured.

  • Closer to 100% → better dataset
  • Must also be high in the highest-resolution shell

Otherwise the claimed resolution is unreliable.
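
Redundancy and completeness can both be sketched from a list of measured Miller indices. This assumes the indices have already been mapped to the asymmetric unit, so that symmetry-equivalent observations share the same (h, k, l). The count of possible reflections is a made-up input:

```python
def redundancy(measured_hkl):
    """Measured observations divided by unique reflections."""
    return len(measured_hkl) / len(set(measured_hkl))


def completeness(measured_hkl, n_possible_unique):
    """Fraction of the theoretically possible unique reflections that was observed."""
    return len(set(measured_hkl)) / n_possible_unique


# Invented toy dataset: three unique reflections, each observed twice,
# out of four that are assumed to be possible at this resolution.
observations = [(1, 0, 0), (1, 0, 0), (0, 1, 0), (0, 1, 0), (0, 0, 1), (0, 0, 1)]
mult = redundancy(observations)
compl = completeness(observations, n_possible_unique=4)
```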


📉 Rsym

Measures how well symmetry-related reflections, which should have equal intensities, agree with each other.

  • Lower = better
  • Higher values are tolerated in the highest-resolution shell (weak data)
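
For reference, the usual textbook definition, written here with I_i(hkl) for the i-th observation of a reflection and ⟨I(hkl)⟩ for the mean over its symmetry-related observations:

```latex
R_{sym} = \frac{\sum_{hkl} \sum_{i} \left| I_i(hkl) - \langle I(hkl) \rangle \right|}
               {\sum_{hkl} \sum_{i} I_i(hkl)}
```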

🔊 Signal-to-Noise (I/σI)

Rule of thumb:

  • Traditional cutoff: I/σI ≈ 2 in the highest-resolution shell
  • Modern practice accepts values near 1
  • CC½ is increasingly used instead
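
CC½ is the correlation between the mean intensities of two random halves of the data. The sketch below assumes the observations are grouped per unique reflection in a dictionary. The layout, function names, and seed are choices made for this example:

```python
import math
import random


def pearson(xs, ys):
    """Linear correlation coefficient between two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def cc_half(observations, seed=0):
    """Split each reflection's observations into two random halves and correlate the means."""
    rng = random.Random(seed)
    half_a, half_b = [], []
    for intensities in observations.values():
        if len(intensities) < 2:
            continue  # a single observation cannot be split
        shuffled = list(intensities)
        rng.shuffle(shuffled)
        mid = len(shuffled) // 2
        half_a.append(sum(shuffled[:mid]) / mid)
        half_b.append(sum(shuffled[mid:]) / (len(shuffled) - mid))
    return pearson(half_a, half_b)


# Invented, noise-free data: both halves agree perfectly.
noise_free = {
    (1, 0, 0): [100.0, 100.0, 100.0, 100.0],
    (0, 1, 0): [50.0, 50.0, 50.0, 50.0],
    (0, 0, 1): [10.0, 10.0],
}
cc = cc_half(noise_free)
```

With real data the halves differ because of measurement noise, and CC½ drops toward zero in the shells where signal is lost.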

🌍 Wilson B-factor (Overall Dataset Disorder)

Overall B-factor of the crystal, estimated from how quickly the measured intensities fall off with resolution.

  • High Wilson B → low resolution
  • Membrane proteins often high (~100 Å²)

Again shows that disorder limits resolution.
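
The Wilson B-factor is usually read off a Wilson plot: ln(⟨I⟩ / Σf²) plotted against (sin θ/λ)² is roughly a straight line with slope -2B. The sketch below fits that line to synthetic shell averages. The function name and input layout are assumptions, and real data deviate from linearity at low resolution:

```python
import math


def wilson_b(d_spacings, normalized_intensities):
    """Fit ln(<I>/sum f^2) against (sin(theta)/lambda)^2 and return B = -slope / 2."""
    xs = [1.0 / (4.0 * d * d) for d in d_spacings]  # (sin(theta)/lambda)^2 = 1/(4 d^2)
    ys = [math.log(v) for v in normalized_intensities]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return -(sxy / sxx) / 2.0


# Synthetic resolution shells generated with B = 30 A^2 and an arbitrary scale.
shells = [4.0, 3.5, 3.0, 2.5, 2.0]
ratios = [math.exp(1.0 - 2.0 * 30.0 / (4.0 * d * d)) for d in shells]
b_estimate = wilson_b(shells, ratios)
```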


📏 RMSD Bond Length & Angle

Quality indicator of geometry: the root-mean-square deviation of bond lengths and angles from their ideal values.

Typical targets:

  • Bond length RMSD < 0.02 Å
  • Angle RMSD < 4°

At low resolution → strong restraints → artificially small RMSD
At high resolution → restraints can be loosened.
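
A minimal sketch of the RMSD calculation, assuming paired lists of model values and their ideal targets. The bond lengths are invented:

```python
import math


def rmsd_from_ideal(model_values, ideal_values):
    """Root-mean-square deviation of model geometry from ideal target values."""
    squared = [(m - i) ** 2 for m, i in zip(model_values, ideal_values)]
    return math.sqrt(sum(squared) / len(squared))


# Invented bond lengths (A): each deviates by 0.01 A from its target.
bond_rmsd = rmsd_from_ideal([1.54, 1.32], [1.53, 1.33])
```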


💧 Modeling Water Molecules

  • Most waters are visible only at high resolution
  • Often absent at low resolution

Tightly bound structural waters may still appear even at lower resolution.


🧠 Big Conceptual Takeaway

Protein crystallography workflow ends with:

1️⃣ Build model
2️⃣ Refine model (fit data + chemistry)
3️⃣ Validate model (statistics + geometry + density)

Only after passing all checks → structure is considered reliable.

This lecture essentially teaches:

🧬 A protein structure is not just "solved": it must be statistically and chemically proven correct.
