Lecture 7 Video 14

Protein structure

🧬 Protein Structure Refinement & Validation — Full Summary

This lecture explains the final stages of X-ray crystallographic structure determination — how we improve, validate, and judge the quality of a protein model before publication.

Think of this stage as:

🧩 You already built a rough protein model → now you polish, test, and verify if it truly matches the experimental data.

🔧 Structure Refinement — Improving the Model

After building an initial model, refinement aims to:

✅ Minimize the difference between:

Observed structure factors (experimental diffraction data)
Calculated structure factors (from the model)

This is the central refinement goal.

📏 Using Chemical Knowledge in Refinement

Refinement is not blind fitting. It also draws on known stereochemistry:

Bond lengths (C–C, C–O etc.)
Bond angles
Torsion angles
Planarity of peptide bonds
Amino-acid chirality
van der Waals radii

These help guide the model toward physically realistic conformations.

⛰️ Local vs Global Minimum Problem

Initial models often get stuck in local minima.

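Two optimization methods are used. Least-squares only moves downhill, so simulated annealing is the one that can climb out of a local minimum: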

1️⃣ Least-squares optimization

Adjust parameters gradually
Move toward lower residual error

2️⃣ Simulated annealing 🔥❄️

(Molecular-dynamics style refinement)

“Heat” atoms → increase mobility
“Cool” system → settle into better minimum
Helps escape incorrect conformations

Goal → reach global minimum = best model.
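The heat-then-cool idea can be illustrated with a toy example. This is a minimal sketch, not a crystallographic refinement program: the one-dimensional `energy` function, the cooling schedule, and all parameter values are assumptions chosen only to show a search starting in a local minimum (near x = +1) and being allowed to cross the barrier toward the lower minimum (near x = -1).

```python
import math
import random

def energy(x):
    """Toy double-well target: local minimum near x=+1, lower minimum near x=-1."""
    return (x * x - 1.0) ** 2 + 0.3 * x

def anneal(x_start, t_start=2.0, t_end=0.01, n_steps=5000, step=0.5, seed=1):
    """Metropolis search with a temperature that decreases geometrically."""
    rng = random.Random(seed)
    x = best_x = x_start
    for i in range(n_steps):
        # "Heat" first (high t), then "cool" gradually toward t_end.
        t = t_start * (t_end / t_start) ** (i / (n_steps - 1))
        candidate = x + rng.uniform(-step, step)
        delta = energy(candidate) - energy(x)
        # Always accept downhill moves; accept uphill moves with probability exp(-delta/t).
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            x = candidate
            if energy(x) < energy(best_x):
                best_x = x
    return best_x

x_final = anneal(x_start=1.0)
print(f"start E = {energy(1.0):.3f}, best E found = {energy(x_final):.3f} at x = {x_final:.3f}")
```

At high temperature uphill moves are accepted often, which is what lets the search leave the starting well. A purely downhill method started at x = +1 would stay there.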

⚙️ Constraints vs Restraints (VERY exam-important)

These control model complexity vs data amount.

🔒 Constraints

Reduce number of parameters.

Example:

Instead of one B-factor per atom
Use one B-factor per residue (group B-factor)

Why?

👉 Low-resolution data → fewer reflections
👉 Too many parameters → overfitting

Constraints prevent over-parameterization.
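The effect of grouping B-factors can be made concrete by counting parameters. The sketch below assumes a simple model with x, y, z per atom plus either one B per atom or one B per residue; the atom, residue, and reflection counts are hypothetical numbers chosen for illustration.

```python
def parameter_count(n_atoms, n_residues, grouped_b):
    """Count refined parameters: 3 coordinates per atom plus B-factors."""
    positional = 3 * n_atoms
    b_params = n_residues if grouped_b else n_atoms
    return positional + b_params

# Hypothetical protein and low-resolution dataset.
n_atoms, n_residues = 2400, 300
n_reflections = 8000

for grouped in (False, True):
    n_params = parameter_count(n_atoms, n_residues, grouped)
    label = "group B (per residue)" if grouped else "individual B (per atom)"
    print(f"{label}: {n_params} parameters, "
          f"{n_reflections / n_params:.2f} reflections per parameter")
```

With these assumed numbers, per-atom B-factors give fewer than one reflection per parameter, while per-residue B-factors bring the ratio back above one.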

🧷 Restraints

Allow flexibility but within allowed ranges:

Bond length intervals
Angle intervals

Model can move — but not unrealistically.
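A restraint is commonly expressed as a penalty that grows as a value drifts from its ideal. The sketch below assumes a harmonic (squared-deviation) penalty weighted by a standard deviation `sigma`; the bond lengths, ideal values, and sigmas are illustrative numbers, not taken from the lecture.

```python
def restraint_penalty(observed, ideal, sigma):
    """Sum of squared deviations from ideal values, each scaled by its sigma."""
    return sum(((o - i) / s) ** 2 for o, i, s in zip(observed, ideal, sigma))

# Illustrative bond lengths in Å with assumed ideal values and tolerances.
observed = [1.34, 1.50, 1.25]
ideal = [1.33, 1.52, 1.23]
sigma = [0.02, 0.02, 0.02]

print(f"restraint penalty = {restraint_penalty(observed, ideal, sigma):.2f}")
```

A small sigma makes the restraint tight (the model is pulled strongly toward the ideal value); a large sigma lets the data dominate.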

🌫️ B-factor (Atomic Displacement)

Describes atomic mobility / disorder.

Low B → rigid atoms → sharp diffraction → high resolution
High B → flexible atoms → blurred diffraction → low resolution

High B-factors cause:

➡ Faster fall-off of scattering
➡ Poor high-resolution density visibility

Especially important for:

Flexible proteins
Loop regions
Ligands with partial occupancy
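The fall-off with B can be quantified with the standard isotropic Debye-Waller factor, exp(-B sin²θ/λ²), which equals exp(-B / (4d²)) at resolution d. The function name and the sample B values below are assumptions made for illustration.

```python
import math

def debye_waller(b_factor, d_spacing):
    """Attenuation of scattering amplitude, exp(-B / (4 d^2)); B in Å^2, d in Å."""
    return math.exp(-b_factor / (4.0 * d_spacing ** 2))

for b in (20.0, 60.0, 100.0):
    row = ", ".join(f"d={d:.1f} Å: {debye_waller(b, d):.3f}" for d in (4.0, 3.0, 2.0))
    print(f"B = {b:5.1f} Å^2 -> {row}")
```

For a given B the factor shrinks as d gets smaller, and for a given d it shrinks as B grows, which is why high-B regions contribute little to the high-resolution reflections.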

📊 R-factor — Core Refinement Statistic

Measures mismatch between data and model.

R = \frac{\sum |F_{\text{obs}} - F_{\text{calc}}|}{\sum F_{\text{obs}}}

Perfect model → R = 0 (never achieved)
Good protein model → R < ~20%

Refinement aims to reduce R continuously.
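As a worked example, the R-factor can be computed directly from paired amplitudes. This is a minimal sketch that assumes the calculated amplitudes are already on the same scale as the observed ones; the numbers are made up.

```python
def r_factor(f_obs, f_calc):
    """R = sum(|Fobs - Fcalc|) / sum(Fobs) over paired structure-factor amplitudes."""
    if len(f_obs) != len(f_calc):
        raise ValueError("f_obs and f_calc must have the same length")
    numerator = sum(abs(o - c) for o, c in zip(f_obs, f_calc))
    return numerator / sum(f_obs)

# Made-up amplitudes for illustration only.
f_obs = [100.0, 80.0, 60.0, 40.0]
f_calc = [90.0, 85.0, 50.0, 45.0]

print(f"R = {r_factor(f_obs, f_calc):.3f}")  # 30 / 280, about 0.107
```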

🧪 R-free — Validation Against Overfitting

Super important concept ⭐

Procedure:

Randomly set aside ~5% of the reflections
Do NOT use them in refinement
Calculate R-free using them

Interpretation:

Situation	Meaning
Rwork ↓ and Rfree ↓	Model improving
Rwork ↓ but Rfree ↑	❗ Overfitting noise
Rfree ≈ 68%	Random model

A gap of about 5 percentage points between Rwork and Rfree is typical.
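The bookkeeping behind R-free can be sketched as follows. The example only shows the random split and the two statistics; no refinement is performed, and the amplitudes are synthetic. The helper names and the fixed seeds are assumptions made so the example is reproducible.

```python
import random

def r_factor(f_obs, f_calc):
    """R = sum(|Fobs - Fcalc|) / sum(Fobs)."""
    return sum(abs(o - c) for o, c in zip(f_obs, f_calc)) / sum(f_obs)

def split_work_free(n_reflections, free_fraction=0.05, seed=0):
    """Randomly flag a fraction of reflection indices as the free (test) set."""
    rng = random.Random(seed)
    indices = list(range(n_reflections))
    rng.shuffle(indices)
    n_free = max(1, round(n_reflections * free_fraction))
    return sorted(indices[n_free:]), sorted(indices[:n_free])

# Synthetic amplitudes: Fcalc is Fobs perturbed by up to 20%.
rng = random.Random(42)
f_obs = [rng.uniform(20.0, 200.0) for _ in range(200)]
f_calc = [f * rng.uniform(0.8, 1.2) for f in f_obs]

work, free = split_work_free(len(f_obs))
r_work = r_factor([f_obs[i] for i in work], [f_calc[i] for i in work])
r_free = r_factor([f_obs[i] for i in free], [f_calc[i] for i in free])
print(f"Rwork = {r_work:.3f}, Rfree = {r_free:.3f}, free set = {len(free)} reflections")
```

In a real refinement only the work set would drive the parameter changes, and the free set would be used purely to check whether improvements in Rwork carry over to data the model has never seen.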

📐 Ramachandran Plot — Geometry Validation

Plots φ (phi) vs ψ (psi) torsion angles.

Regions:

🔴 Allowed
🟡 Additional allowed
🟨 Generously allowed
⚪ Disallowed

Good model:

Majority of residues in allowed regions
Very few in disallowed

Exception:

👉 Catalytic residues may appear strained but real — always check electron density.
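The φ and ψ values on the plot are ordinary torsion angles computed from four backbone atoms each (φ: C of the previous residue, N, CA, C; ψ: N, CA, C, N of the next residue). The sketch below shows one way to compute a torsion angle from coordinates; the function name and the test coordinates are illustrative, not real protein atoms.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def torsion_angle(p0, p1, p2, p3):
    """Signed torsion angle in degrees about the p1-p2 bond (cis = 0, trans = 180)."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

# Illustrative coordinates: the last point is rotated a quarter turn about the z axis.
print(torsion_angle((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

Applying this to every residue's backbone atoms and plotting the (φ, ψ) pairs gives the Ramachandran plot.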

🔍 Real Space Correlation Coefficient (RSCC)

Measures how well model density matches observed density.

Good value:

RSCC > 0.9

Low RSCC + High B-factor → poorly defined region
Typical example: flexible loops or incorrectly modeled ligands.
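RSCC is a correlation coefficient between two sets of density values sampled at the same grid points around a residue or ligand. The sketch below assumes the densities have already been extracted into plain lists; the values are invented for illustration.

```python
import math

def rscc(rho_obs, rho_calc):
    """Pearson correlation between observed and model-calculated density samples."""
    n = len(rho_obs)
    mean_o = sum(rho_obs) / n
    mean_c = sum(rho_calc) / n
    cov = sum((o - mean_o) * (c - mean_c) for o, c in zip(rho_obs, rho_calc))
    var_o = sum((o - mean_o) ** 2 for o in rho_obs)
    var_c = sum((c - mean_c) ** 2 for c in rho_calc)
    return cov / math.sqrt(var_o * var_c)

# Invented density samples at the same grid points.
well_fit = rscc([0.2, 0.9, 1.5, 1.1, 0.3], [0.25, 0.85, 1.4, 1.2, 0.35])
poor_fit = rscc([0.2, 0.9, 1.5, 1.1, 0.3], [0.9, 0.3, 0.5, 0.2, 1.0])
print(f"well-fit region: {well_fit:.2f}, poorly fit region: {poor_fit:.2f}")
```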

💊 Ligand Modeling Issues

Ligands often:

Have higher B-factors
Lower occupancy
Weak density

Reasons:

Not all binding sites occupied
Conformational disorder
Incorrect placement by crystallographer

Contour level matters:

~1σ = standard map interpretation
<0.8σ = risky → may see noise instead of real density

📈 Data Collection Statistics (Tables in Papers)

Typical parameters:

🔢 Measured vs Unique Reflections

Higher resolution → more unique reflections → more model parameters can be refined

🔁 Redundancy (Multiplicity)

\text{Redundancy} = \frac{\text{Measured reflections}}{\text{Unique reflections}}

Higher redundancy → better precision.

🧩 Completeness

How much of reciprocal space was measured.

Closer to 100% → better dataset
Must also be high in highest resolution shell

Otherwise resolution claim is unreliable.
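Redundancy and completeness are both simple ratios. In the sketch below, completeness is taken as unique reflections observed divided by unique reflections theoretically possible at that resolution; the counts are hypothetical.

```python
def redundancy(n_measured, n_unique):
    """Average number of times each unique reflection was measured."""
    return n_measured / n_unique

def completeness(n_unique_observed, n_unique_possible):
    """Fraction of the theoretically possible unique reflections that were observed."""
    return n_unique_observed / n_unique_possible

# Hypothetical dataset counts.
n_measured, n_unique, n_possible = 120_000, 30_000, 31_000

print(f"redundancy   = {redundancy(n_measured, n_unique):.1f}")
print(f"completeness = {100 * completeness(n_unique, n_possible):.1f}%")
```

The same two functions can be applied to the highest-resolution shell alone, which is where completeness most often drops.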

📉 Rsym

Agreement between symmetry-related reflections.

Lower = better
Higher tolerated in highest shell (weak data)
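Rsym is usually computed as the summed absolute deviation of each measurement from the mean of its symmetry-equivalent group, divided by the summed intensities. The sketch below assumes measurements are already grouped by unique reflection; the dictionary layout and the intensities are illustrative.

```python
def r_sym(groups):
    """Rsym = sum |I_i - <I>| / sum I_i over all groups of symmetry-related measurements."""
    numerator = 0.0
    denominator = 0.0
    for intensities in groups.values():
        mean_i = sum(intensities) / len(intensities)
        numerator += sum(abs(i - mean_i) for i in intensities)
        denominator += sum(intensities)
    return numerator / denominator

# Illustrative intensities keyed by unique reflection index (h, k, l).
measurements = {
    (1, 0, 0): [100.0, 110.0, 90.0],
    (0, 2, 0): [50.0, 50.0],
}
print(f"Rsym = {r_sym(measurements):.3f}")  # 20 / 400 = 0.050
```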

🔊 Signal-to-Noise (I/σI)

Rule of thumb:

Good cutoff ≈ 2
Modern practice accepts values near 1
CC½ increasingly used instead.

🌍 Wilson B-factor (Overall Dataset Disorder)

Average B-factor for crystal.

High Wilson B → low resolution
Membrane proteins often high (~100 Å²)

Again shows disorder limits resolution.

📐 RMSD Bond Length & Angle

Quality indicator of geometry.

Typical targets:

Bond length RMSD < 0.02 Å
Angle RMSD < 4°

At low resolution → strong restraints → artificially small RMSD
At high resolution → restraints can be loosened.
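The RMSD reported in such tables is the root-mean-square deviation of model values from their ideal (library) values. A minimal sketch, using made-up bond lengths and assumed ideal values:

```python
import math

def rmsd(model_values, ideal_values):
    """Root-mean-square deviation between model geometry and ideal values."""
    squared = [(m - i) ** 2 for m, i in zip(model_values, ideal_values)]
    return math.sqrt(sum(squared) / len(squared))

# Made-up bond lengths in Å compared with assumed ideal values.
model_bonds = [1.54, 1.50, 1.33, 1.24]
ideal_bonds = [1.52, 1.52, 1.33, 1.23]
print(f"bond-length RMSD = {rmsd(model_bonds, ideal_bonds):.3f} Å")
```

The same function applies to bond angles if the inputs are given in degrees.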

💧 Modeling Water Molecules

Visible only at high resolution
Often absent at low resolution

Structural waters may still appear even at lower resolution.

🧠 Big Conceptual Takeaway

Protein crystallography workflow ends with:

1️⃣ Build model
2️⃣ Refine model (fit data + chemistry)
3️⃣ Validate model (statistics + geometry + density)

Only after passing all checks → structure is considered reliable.

This lecture essentially teaches:

🧬 A protein structure is not just “solved” — it must be statistically and chemically proven correct.

Quiz

Q0. What is the main goal of structure refinement in protein crystallography?

Maximize diffraction intensity

Minimize difference between observed and calculated structure factors

Increase crystal size

Reduce solvent content

Q1. Which method can help escape local minima during refinement?

Fourier transform

Simulated annealing

Gel filtration

Mass spectrometry

Q2. What is the main reason to apply constraints in refinement?

Increase resolution

Reduce number of reflections

Avoid over-parameterization

Improve crystal growth

Q3. What does a high B-factor typically indicate?

High atomic rigidity

Low diffraction quality

High atomic mobility or disorder

Perfect model fit

Q4. Which parameter is often grouped per residue instead of per atom to reduce model complexity?

Occupancy

Charge

B-factor

Resolution

Q5. What approximate R-factor value indicates a good refined protein structure?

~70%

~50%

Below ~20%

Exactly 0%

Q6. Why are reflections excluded when calculating R-free?

To improve crystal packing

To validate the model against overfitting

To increase completeness

To determine symmetry

Q7. What does a large gap between R-work and R-free suggest?

High resolution data

Model overfitting

Low redundancy

Improved stereochemistry

Q8. Which plot is used to validate backbone torsion angles?

Wilson plot

Ramachandran plot

Patterson map

Kratky plot

Q9. What RSCC value typically indicates a well-modeled region?

Below 0.2

Around 0.5

Above 0.9

Exactly 1.5

Q10. Why might ligands have higher B-factors than proteins?

They contain heavier atoms

They are always fully occupied

They may be partially occupied or disordered

They diffract more strongly

Q11. What does completeness measure in crystallographic data?

Accuracy of phase determination

Fraction of reciprocal space sampled

Number of modeled residues

Strength of hydrogen bonds

Q12. How is redundancy calculated?

Unique reflections divided by measured reflections

Measured reflections divided by unique reflections

Resolution divided by completeness

B-factor divided by occupancy

Q13. What is the typical rule-of-thumb cutoff for signal-to-noise ratio (I/σI)?

0.1

Q14. Why are water molecules easier to model at high resolution?

They move faster

They have stronger scattering

Their density becomes more clearly defined

They increase redundancy

Q15. Refinement seeks to reduce the difference between observed and calculated structure factors.

True

False

Q16. Simulated annealing cools the system first and then heats it.

True

False

Q17. Constraints fix parameters completely, whereas restraints allow limited variation.

True

False

Q18. Low B-factors are associated with sharper diffraction and higher resolution.

True

False

Q19. An R-free value around 68% indicates a highly accurate model.

True

False

Q20. The difference between R-work and R-free is typically about 5%.

True

False

Q21. Most amino acids should fall in disallowed regions of the Ramachandran plot.

True

False

Q22. Loop regions often show higher B-factors and lower RSCC values.

True

False

Q23. Lower contour levels in electron density maps can reveal weaker features but increase risk of noise interpretation.

True

False

Q24. Completeness close to 100% generally indicates a more reliable dataset.

True

False

Q25. Higher redundancy means each reflection was measured fewer times.

True

False

Q26. Wilson B-factor represents an average disorder measure for the dataset.

True

False

Q27. Membrane proteins often exhibit higher overall B-factors.

True

False

Q28. At low resolution, restraints are often loosened to allow more geometric variability.

True

False

Q29. Water molecules are never modeled in low-resolution structures.

True

False