Lecture 6 Video 2

Protein structure

🧬 Lecture 6 – Protein Structure Bioinformatics & Classification

(Fun, detailed, and beginner-friendly walkthrough)

This lecture moves from sequence bioinformatics into the world of structure bioinformatics — and that’s where things get very interesting. Instead of comparing strings of amino acids, we compare 3D shapes of proteins.

I’ll go through everything step-by-step and explain the logic behind it clearly.

🧩 1. Structure Alignment vs Structure Superposition

You may hear these terms used interchangeably — but they are not the same.

🔹 Structure Alignment

Used for different proteins
They may:
- Have different sequences
- Be homologs from different organisms
- Be mutants
- Have different residue numbering
Goal: 👉 Find the best overlap in 3D space

Importantly:

Not all parts of proteins necessarily overlap
You must identify which regions should be compared
Alignment is based on atomic coordinates, not sequence similarity

🔹 Structure Superposition

Used for identical proteins
Example: same protein in two conformations
Goal: Compare structural differences

So:

Alignment = comparing different proteins
Superposition = comparing the same protein in different states

🧪 2. How Alignment Works in Practice (PyMOL Example)

In PyMOL, two commands are commonly used:

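align (starts from a sequence alignment to pair up residues, then superposes the paired atoms and refines the fit by rejecting outliers)
super (sequence-independent; pairs residues by structure, which suits proteins with low sequence similarity)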

Example from the lecture: Two proteins (CPZ and CC8) both had:

4-stranded β-sheet
2 α-helices

After alignment:

RMSD = 2.2 Å
That is considered quite good

📏 What is RMSD?

RMSD = Root Mean Square Deviation. It measures the typical distance between aligned atoms: the square root of the mean squared distance over all atom pairs, reported in Å.

Low RMSD → structures are very similar
High RMSD → poor alignment
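
As a rough illustration of the definition, RMSD can be computed directly from paired coordinates. This is a minimal sketch, assuming the two structures are already superposed and the atoms are already paired one-to-one; the function name and the coordinates are made up for the example.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of (x, y, z) points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = [
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    ]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical coordinates (in Å) for three atoms that are already paired up.
atoms_a = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
atoms_b = [(0.0, 0.0, 1.0), (1.5, 0.0, 1.0), (3.0, 0.0, 1.0)]

print(rmsd(atoms_a, atoms_b))  # every atom is displaced by 1 Å, so RMSD = 1.0
```

Because the distances are squared before averaging, a few badly fitting atoms raise the RMSD more than they would raise a plain average.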

Important:

The fit and its RMSD are computed from atomic coordinates, NOT from sequence similarity

This is structural comparison, not sequence comparison.

🧬 3. Sequence → Structure → Function (Direction Matters!)

The lecture emphasizes something crucial:

🧭 Direction of Determination

DNA sequence → determines protein sequence
Protein sequence → determines protein structure

BUT NOT THE OTHER WAY AROUND.

Why?

Genetic code is redundant
Many DNA sequences → same protein sequence
Many protein sequences → same structure
Fewer possible structures than sequences
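
The redundancy argument can be made concrete by counting how many DNA sequences encode one short peptide. This is a small sketch, assuming the standard genetic code and ignoring stop codons; the function name and the example peptide are illustrative only.

```python
# Number of codons per amino acid in the standard genetic code (61 sense codons in total).
CODONS_PER_RESIDUE = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4, "H": 2, "I": 3,
    "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6, "T": 4, "W": 1, "Y": 2, "V": 4,
}

def count_coding_sequences(peptide):
    """Count the distinct DNA sequences that translate to the given peptide."""
    total = 1
    for residue in peptide:
        total *= CODONS_PER_RESIDUE[residue]
    return total

print(count_coding_sequences("MKL"))  # 1 * 2 * 6 = 12
```

Even a three-residue peptide has a dozen possible coding sequences, which is why a protein sequence cannot tell you the exact DNA sequence it came from.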

🔽 Variability decreases downward:

DNA variability > Protein sequence variability > Protein structure variability

This is extremely important for understanding why structure classification works.

🧠 4. Structure vs Function — Not Always Simple

Two important evolutionary insights:

🔹 Similar structure, different function

Rare but possible

🔹 Same function, different structures

Example: serine proteases
Different folds, same catalytic activity
Likely convergent evolution

This tells us: Structure and function are often linked — but not guaranteed.

🗂 5. Why Classify Protein Structures?

Because:

There are fewer possible folds than sequences
We want to organize structural space
Helps predict function
Helps understand evolution

Two major databases attempt this:

🏛 6. CATH Database

CATH stands for:

Class
Architecture
Topology
Homologous superfamily

🔹 Level 1: Class

Four main classes:

Mainly α
Mainly β
α/β
Few secondary structures (often unstructured proteins)

🔹 Level 2: Architecture

Describes overall arrangement of secondary structures.

Examples in α/β class:

Super roll
β barrel
Two-layer sandwich
Three-layer sandwich
αβα three-layer
ββα three-layer

Architecture = overall 3D arrangement (Not sequence order yet)

🔹 Level 3: Topology

Topology = 👉 The path the polypeptide chain takes through the structure

It depends on:

Order of secondary structures
Number of elements

Two proteins can:

Have same architecture
But different topology

Because the order in sequence differs

Topology ≠ spatial arrangement only
Topology = connectivity pattern
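
The difference between architecture and topology can be sketched with a toy data structure. The example assumes a heavily simplified model in which each secondary structure element occupies a numbered spatial "slot"; the two proteins and the helper functions are hypothetical, and this is not how CATH itself assigns either level.

```python
# Each protein is listed in chain order as (element type, spatial slot it occupies).
# Both proteins are hypothetical four-stranded sheets.
protein_1 = [("strand", 1), ("strand", 2), ("strand", 3), ("strand", 4)]  # simple meander
protein_2 = [("strand", 1), ("strand", 4), ("strand", 2), ("strand", 3)]  # same sheet, different wiring

def architecture(elements):
    """Spatial arrangement only: which element type sits in which slot."""
    return sorted((slot, kind) for kind, slot in elements)

def topology(elements):
    """Connectivity: the order in which the chain visits the slots."""
    return [slot for _, slot in elements]

print(architecture(protein_1) == architecture(protein_2))  # True
print(topology(protein_1) == topology(protein_2))          # False
```

Both proteins fill the same slots with the same element types, so their architecture matches, but the chain threads through the slots in a different order, so their topology does not.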

🔹 Level 4: Homologous Superfamily

Groups proteins with:

Structural similarity
Evolutionary relationship

🏛 7. SCOP Database

SCOP = Structural Classification of Proteins

More complex hierarchy than CATH.

Levels include:

Class
Fold
Superfamily
Family

Example: In class α and β:

147 different folds

Difference between α/β and α+β is not always obvious

🧩 8. Domain Classification — The Complicated Part

Important: Both CATH and SCOP classify domains, not entire proteins

What is a domain?

A structural and functional unit within a protein.

Example: Pyruvate phosphate dikinase

SCOP identified 3 domains
CATH identified 6 domains

Even more confusing:

Some domains overlap
Some domains consist of non-contiguous sequence regions

This makes domain definition:

Difficult
Sometimes manual
Not fully reproducible
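
One way to picture why this is awkward: a domain cannot always be stored as a single start-end range. A minimal sketch follows, assuming domains are stored as lists of inclusive residue ranges; the two domain definitions and the helper names are hypothetical and do not come from CATH or SCOP.

```python
# A domain is a list of (start, end) residue ranges, inclusive.
# Both definitions are hypothetical.
domain_a = [(1, 80), (150, 200)]  # two separate stretches: non-contiguous
domain_b = [(75, 160)]            # one stretch that overlaps domain_a

def is_contiguous(domain):
    """A domain is contiguous if it is a single unbroken residue range."""
    return len(domain) == 1

def residues(domain):
    """Expand the ranges into the set of residue numbers they cover."""
    covered = set()
    for start, end in domain:
        covered.update(range(start, end + 1))
    return covered

def overlap(domain_1, domain_2):
    """Residue numbers assigned to both domains."""
    return sorted(residues(domain_1) & residues(domain_2))

print(is_contiguous(domain_a))           # False
print(len(overlap(domain_a, domain_b)))  # residues 75-80 and 150-160: 6 + 11 = 17
```

Two classification schemes that draw these boundaries slightly differently will count a different number of domains in the same protein.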

CATH:

Automated + manual inspection

SCOP:

Manual classification

That introduces operator bias.

🔎 9. Finding Similar Structures – The DALI Server

Suppose you:

Solved a new structure
Or built a homology model

How do you know if similar structures exist?

👉 Use the DALI server

Process:

Upload PDB file
DALI compares your structure to all known structures
Returns similar hits

Example: Uploaded small copper-binding protein (COPSET)

Returned hits:

Copper transporting proteins
Mercury transporting proteins
Heavy metal binding proteins

Conclusion: Proteins with similar structure often share similar function

🧠 Big Conceptual Takeaways

🧬 1. Structure is more conserved than sequence

You can lose sequence similarity and still retain fold.

🧱 2. Alignment is geometric, not sequence-based

Atomic coordinates are compared.

🏛 3. CATH and SCOP organize structural space differently

CATH: hierarchical & semi-automated
SCOP: more manual & detailed

🧩 4. Domain definition is not trivial

It is partly subjective and complex.

🔍 5. DALI is your structural BLAST

It finds structural neighbors.

📌 Conceptual Flow of the Lecture

Structural alignment basics
RMSD interpretation
Sequence → structure → function direction
Evolutionary implications
Structural classification systems
Domain complications
Structural similarity search

Quiz

Q0. What is the main difference between structure alignment and structure superposition?

Alignment compares identical proteins, superposition compares different proteins

Alignment compares different proteins, superposition compares identical proteins in different conformations

Alignment is sequence-based, superposition is structure-based

There is no difference

Q1. When performing structural alignment in PyMOL, what is being aligned?

Amino acid sequences

Secondary structure labels

Atomic coordinates

Protein names

Q2. An RMSD value of 2.2 Å after alignment generally indicates:

Poor structural similarity

Moderate structural similarity

Good structural similarity

Identical structures

Q3. Which relationship is correct regarding biological information flow?

Protein structure determines DNA sequence

Protein sequence determines DNA sequence

DNA sequence determines protein sequence

Protein structure determines protein sequence

Q4. Which statement about sequence and structure variability is correct?

Protein structures are more variable than DNA sequences

Protein sequences are less variable than protein structures

DNA sequences are more variable than protein sequences

All levels have equal variability

Q5. Proteins with more than ~20% sequence identity often:

Have completely different structures

Have the same structure

Have no evolutionary relationship

Cannot be aligned structurally

Q6. Proteins with similar structures but different functions represent:

Divergent evolution

Convergent evolution

Rare structural coincidence

Sequence duplication

Q7. Serine proteases with very different folds but same catalytic function are an example of:

Structural conservation

Convergent evolution

Sequence homology

Domain overlap

Q8. What does CATH stand for?

Class, Architecture, Topology, Homology

Class, Architecture, Topology, Homologous superfamily

Classification, Arrangement, Topology, Hierarchy

Catalysis, Architecture, Topology, Homology

Q9. In CATH classification, architecture refers to:

The amino acid sequence

The evolutionary origin

The overall 3D arrangement of secondary structures

The exact residue numbering

Q10. Topology in CATH refers to:

The chemical composition of residues

The spatial size of the protein

The order and connectivity of secondary structures along the sequence

The protein's biological function

Q11. Which of the following is NOT a main CATH class?

Mainly alpha

Mainly beta

Alpha/beta

Membrane proteins only

Q12. SCOP classification differs from CATH primarily because:

SCOP is fully automated

SCOP does not classify domains

SCOP uses a more manual and elaborate hierarchy

SCOP ignores topology

Q13. Both CATH and SCOP classify:

Whole organisms

Entire genomes

Domains rather than full proteins

Only enzyme active sites

Q14. The DALI server is primarily used to:

Predict DNA mutations

Compare protein sequences

Find structurally similar proteins

Assign protein domains manually

Q15. True or False: Structural alignment in PyMOL is based on amino acid sequence identity.

True

False

Q16. True or False: A single protein sequence can adopt many completely unrelated stable structures under the same conditions.

True

False

Q17. True or False: Multiple different DNA sequences can encode the same protein sequence.

True

False

Q18. True or False: Protein structure variability is greater than protein sequence variability.

True

False

Q19. True or False: Two proteins with identical architecture must have identical topology.

True

False

Q20. True or False: RMSD measures average distance between aligned atoms.

True

False

Q21. True or False: SCOP classification is entirely automated.

True

False

Q22. True or False: CATH uses automated domain definitions with manual inspection.

True

False

Q23. True or False: Domains can consist of non-contiguous amino acid stretches.

True

False

Q24. True or False: Domain definitions are always fully reproducible and objective.

True

False

Q25. True or False: Similar structure often implies similar function.

True

False

Q26. True or False: Similar function always requires similar structure.

True

False

Q27. True or False: Topology depends on the sequence order of secondary structure elements.

True

False

Q28. True or False: The DALI server compares protein sequences using BLAST.

True

False

Q29. True or False: Proteins classified under 'few secondary structures' in CATH are typically highly ordered globular proteins.

True

False