A structure-aware, machine learning–driven bioinformatics pipeline for predicting antibody-accessible epitopes on the Porcine Circovirus Type 2 (PCV2) capsid protein (ORF2).
🔗 https://pcv2.epitope.aiconceptlimited.com.ng/
This project implements a research-grade computational framework integrating:
- Evolutionary sequence analysis
- Structural biology (PDB-based features)
- Physicochemical characterization
- Machine learning (XGBoost)
to identify potential B-cell epitopes on the PCV2 capsid protein.
- Epitope discovery
- Vaccine target identification
- Viral antigen characterization
- Immunoinformatics research
Epitope prediction is treated as a multi-modal biological inference problem:
Sequence → Evolution → Structure → Features → ML → Prediction → Validation
| Component | Source |
|---|---|
| Protein sequences | NCBI (Entrez API) |
| Reference sequence | UniProt |
| Protein structures | PDB (3R0R, 6EZG) |
| Epitope validation | IEDB |
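
As a rough illustration of the NCBI (Entrez API) source in the table, the sketch below searches the NCBI protein database with Biopython and keeps only capsid-like records. It is not the project's actual retrieval script. The search term, the helper names (`build_query`, `fetch_fasta`, `parse_fasta`, `keep_capsid`), the header keywords, and the length bounds are all assumptions made for this example.

```python
"""Sketch: retrieve PCV2 capsid protein records from NCBI and filter them."""


def build_query(organism="Porcine circovirus 2", gene_terms=("capsid", "ORF2")):
    """Build an Entrez search term (the exact term is an illustrative assumption)."""
    genes = " OR ".join(f"{g}[Title]" for g in gene_terms)
    return f'"{organism}"[Organism] AND ({genes})'


def fetch_fasta(email, retmax=200):
    """Search the NCBI protein database and return FASTA text.

    Needs network access and Biopython, so it is imported lazily here.
    """
    from Bio import Entrez

    Entrez.email = email  # NCBI asks callers to identify themselves
    handle = Entrez.esearch(db="protein", term=build_query(), retmax=retmax)
    ids = Entrez.read(handle)["IdList"]
    handle.close()
    if not ids:
        return ""
    handle = Entrez.efetch(db="protein", id=",".join(ids),
                           rettype="fasta", retmode="text")
    text = handle.read()
    handle.close()
    return text


def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def keep_capsid(records, min_len=225, max_len=240):
    """Keep records whose header looks capsid-related and whose length is plausible.

    Keywords and length bounds are assumptions for this sketch, not project settings.
    """
    keywords = ("capsid", "orf2", "cap protein")
    return [
        (h, s) for h, s in records
        if min_len <= len(s) <= max_len and any(k in h.lower() for k in keywords)
    ]
```

A typical call would be `keep_capsid(parse_fasta(fetch_fasta("you@example.org")))`, with the email address replaced by your own.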
NCBI Retrieval
↓
Sequence Cleaning (capsid-only filtering)
↓
Multiple Sequence Alignment (MAFFT)
↓
Feature Engineering
- Conservation
- Entropy
- SASA
- Residue Depth
- Electrostatics
↓
Feature Matrix Construction
↓
Epitope Labeling (IEDB)
↓
Machine Learning (XGBoost)
↓
Prediction
↓
3D + Sequence Visualization (Streamlit)
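
The conservation and entropy features in the pipeline above are both per-column statistics of the multiple sequence alignment. The sketch below shows one common way to compute them, assuming conservation is the frequency of the most common residue and gaps (`-`) are ignored. The function names are hypothetical and the project's scripts may define these features differently.

```python
import math
from collections import Counter


def column_stats(column, gap="-"):
    """Return (conservation, entropy_bits) for one alignment column.

    conservation: frequency of the most common residue (gaps ignored).
    entropy_bits: Shannon entropy of the residue distribution, in bits.
    """
    residues = [c for c in column.upper() if c != gap]
    if not residues:
        return 0.0, 0.0  # all-gap column: no information
    counts = Counter(residues)
    total = len(residues)
    conservation = max(counts.values()) / total
    entropy = sum((n / total) * math.log2(total / n) for n in counts.values())
    return conservation, entropy


def alignment_features(aligned_seqs):
    """Compute (conservation, entropy) for every column of an alignment."""
    if len({len(s) for s in aligned_seqs}) != 1:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    columns = ("".join(col) for col in zip(*aligned_seqs))
    return [column_stats(col) for col in columns]


if __name__ == "__main__":
    toy_alignment = ["ACD", "ACE", "AGE", "A-E"]  # toy data, not PCV2 sequences
    for position, (cons, ent) in enumerate(alignment_features(toy_alignment), start=1):
        print(f"col {position}: conservation={cons:.2f} entropy={ent:.2f} bits")
```

A fully conserved column has conservation 1.0 and entropy 0, while variable columns score lower conservation and higher entropy.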
- Conservation score (frequency-based)
- Shannon entropy (sequence variability)
- Solvent Accessible Surface Area (SASA)
- Residue depth
- Secondary structure (loop/helix/sheet)
- Electrostatics
- Hydrophobicity
- Charge distribution
- Sliding window (±2 residues)
- Spatial neighborhood aggregation
- Model: XGBoost Classifier
- Input: Residue-level feature matrix
- Output: Probability of epitope per residue
- Imbalanced dataset handling
- Threshold tuning (default: 0.25)
- Feature importance extraction
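
The sliding-window context (±2 residues), imbalance handling, and threshold tuning listed above can be sketched in plain Python. This is an illustration under assumptions, not the project's code: the helper names are hypothetical, the window is aggregated with a simple mean, and imbalance is handled with the negatives-to-positives ratio that is commonly passed to XGBoost's `scale_pos_weight` parameter. Only the 0.25 default threshold comes from this README.

```python
def window_mean(values, half_width=2):
    """Average each residue's feature value with neighbours within ±half_width.

    Windows are truncated at the sequence ends rather than padded.
    """
    n = len(values)
    smoothed = []
    for i in range(n):
        lo, hi = max(0, i - half_width), min(n, i + half_width + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def positive_class_weight(labels):
    """Return negatives / positives, a common choice for scale_pos_weight."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("at least one positive (epitope) label is required")
    return negatives / positives


def call_epitopes(probabilities, threshold=0.25):
    """Return 1-based residue positions whose probability meets the threshold."""
    return [i for i, p in enumerate(probabilities, start=1) if p >= threshold]


if __name__ == "__main__":
    sasa = [1.0, 2.0, 3.0, 4.0, 5.0]       # toy per-residue feature values
    labels = [0, 0, 0, 1]                   # toy epitope labels
    probs = [0.10, 0.25, 0.90, 0.20]        # toy model outputs
    print(window_mean(sasa))
    print(positive_class_weight(labels))
    print(call_epitopes(probs))
```

Lowering the threshold below 0.5 trades precision for recall, which is a common choice when positive labels are scarce.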
| Metric | Value |
|---|---|
| Total residues | ~162–245 |
| Predicted epitopes | ~24 |
| Validated (IEDB overlap) | ~4 |
| ROC-AUC | ~0.70–0.75 |
- Predictions compared with IEDB experimental epitopes
- Overlap analysis performed at residue level
- ✅ Overlapping residues → validated epitopes
- 🔬 Non-overlapping → novel candidate epitopes
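
The residue-level overlap analysis above amounts to set operations on residue positions. A minimal sketch, assuming IEDB epitopes are available as inclusive `(start, end)` ranges in the same numbering as the predictions; the function names and the positions in the example are made up for illustration.

```python
def expand_ranges(ranges):
    """Turn inclusive (start, end) ranges into a set of residue positions."""
    positions = set()
    for start, end in ranges:
        positions.update(range(start, end + 1))
    return positions


def overlap_report(predicted_positions, iedb_ranges):
    """Split predictions into validated and novel residues.

    validated: predicted residues inside any IEDB epitope range
    novel:     predicted residues outside every IEDB range
    missed:    IEDB epitope residues that were not predicted
    """
    known = expand_ranges(iedb_ranges)
    predicted = set(predicted_positions)
    return {
        "validated": sorted(predicted & known),
        "novel": sorted(predicted - known),
        "missed": sorted(known - predicted),
    }


if __name__ == "__main__":
    # Toy numbers only; these are not real PCV2 epitope positions.
    report = overlap_report([5, 6, 20, 21], iedb_ranges=[(4, 6), (30, 31)])
    for category, residues in report.items():
        print(category, residues)
```

Both inputs must share one numbering scheme (for example the UniProt reference sequence), otherwise the overlap counts are not meaningful.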
Predicted epitopes are enriched in:
- Surface-exposed regions (high SASA)
- Loop/coil structures
- High-entropy (variable) regions
👉 This is consistent with the general observation that antibodies tend to bind solvent-exposed, flexible surface regions of an antigen.
pcv2_epitope_project/
│
├── data/ # Metadata, mappings, IEDB data
├── sequences/ # FASTA + alignments
├── structures/ # PDB files (3R0R, 6EZG)
├── features/ # Engineered features
├── results/ # Predictions + evaluation
├── models/ # Trained ML model
├── scripts/ # Feature + analysis scripts
├── pipeline/ # Automation scripts
│
├── dashboard.py # Streamlit interface
└── run_smart_pipeline.py # Full pipeline runner
git clone https://github.com/YOUR_USERNAME/pcv2-epitope-platform.git
cd pcv2-epitope-platform
python -m venv pcv2_env
source pcv2_env/bin/activate
pip install -r requirements.txt
python run_smart_pipeline.py
streamlit run dashboard.py
- 📈 Epitope probability plots
- 🧬 Sequence visualization (UniProt-aligned)
- 🧊 3D structure mapping (Py3Dmol)
- 🧪 IEDB validation overlay
- 📦 Epitope clustering
- Limited experimentally validated epitopes (class imbalance)
- Predictions are computational (require lab validation)
- Sequence–structure mapping introduces approximation
- Graph Neural Networks (GNN)
- Transformer-based protein models
- Improved structural alignment
- REST API deployment
- Continuous data updates (automated pipeline)
Open to collaborations in:
- Bioinformatics
- Immunoinformatics
- Vaccine design
- Structural biology
This system provides computational predictions and should not replace experimental validation.
Abubakar, Bioinformatics & Computational Biology
- NCBI (sequence data)
- RCSB PDB (structural data)
- IEDB (epitope data)
- Biopython, XGBoost, Streamlit communities