From fa32134c7f054a0729cbad7873da8f4ed244622a Mon Sep 17 00:00:00 2001
From: mckayadd
Date: Sun, 4 May 2025 10:43:38 +0200
Subject: [PATCH 1/5] Add .gitignore

---
 .gitignore | 104 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 104 insertions(+)
 create mode 100644 .gitignore

diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..a99281c
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,104 @@
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+*.egg-info/
+.installed.cfg
+*.egg
+
+# PyInstaller
+# Usually contains the main executable and other build artifacts
+dist/
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+.hypothesis/
+
+# Pytest
+.pytest_cache/
+
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+
+# Pylint
+pylint-report.txt
+pylint-global.txt
+
+# Pyre
+.pyre/
+
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Jupyter Notebook checkpoints
+.ipynb_checkpoints
+
+# pyenv
+.python-version
+
+# pipenv
+Pipfile.lock
+
+# poetry
+poetry.lock
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+target/
+
+# VS Code
+.vscode/
+
+# macOS Finder metadata
+.DS_Store
+

From f713f0ac368bbab3c0ca4d7a633e221b2fce6594 Mon Sep 17 00:00:00 2001
From: mckayadd
Date: Sun, 4 May 2025 10:46:10 +0200
Subject: [PATCH 2/5] Add bio nomenclature

---
 BIO_NOMENCLEATURE.md | 179 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 179 insertions(+)
 create mode 100644 BIO_NOMENCLEATURE.md

diff --git a/BIO_NOMENCLEATURE.md b/BIO_NOMENCLEATURE.md
new file mode 100644
index 0000000..f553b16
--- /dev/null
+++ b/BIO_NOMENCLEATURE.md
@@ -0,0 +1,179 @@
+# BIO_NOMENCLATURE.md
+
+A detailed guide to the core bioinformatics and computational biology concepts behind the ProtFeat project. This file is intended to help new users—especially those coming from outside biology—understand the terminology and logic behind the protein feature extraction techniques used in ProtFeat.
+
+---
+
+## 1. Central Dogma Refresher
+
+The central dogma of molecular biology describes the flow of genetic information within biological systems:
+
+**DNA → RNA → Protein**
+
+### Key Steps:
+- **Transcription**: DNA is transcribed into messenger RNA (mRNA).
+- **Translation**: mRNA is read in triplets (codons) to assemble a chain of amino acids, forming a polypeptide.
+- **Folding**: The polypeptide folds into a 3D structure to become a functional protein.
+
+### Why it matters for ProtFeat:
+ProtFeat begins with the protein—the final product of this process—and uses its amino acid sequence as the raw input for numerical feature extraction.
+
+---
+
+## 2. Protein Primary Structure & FASTA Format
+
+### Protein Primary Structure
+
+The **primary structure** of a protein is its linear sequence of amino acids, typically represented using one-letter codes:
+
+**Example:**
+
+``` MSEQNNTEMTFQIQRIYTKDISFEAPNAPHVFQ... ```
+
+
+Each letter corresponds to one of the 20 standard amino acids (e.g., M = Methionine, A = Alanine, etc.).
+
+### FASTA Format
+
+FASTA is a plain-text format commonly used for storing protein and nucleotide sequences.
+
+**Example:**
+
+``` sp|Q9H9K5|SYT1_HUMAN Synaptotagmin-1 OS=Homo sapiens
+MSEQNNTEMTFQIQRIYTKDISFEAPNAPHVFQ... ```
+
+
+- The line starting with `>` is the **header**, containing metadata such as UniProt ID and description.
+- The second line contains the actual sequence.
+
+### Why it matters for ProtFeat:
+- ProtFeat reads protein sequences from FASTA files.
+- The `--ppid` parameter allows you to select which word from the header line is used as the unique protein ID.
+
+---
+
+## 3. Amino Acid Physicochemical Properties
+
+Amino acids differ by their chemical and physical characteristics, which influence protein structure and function.
+
+| Property | Example Residues | Description |
+|-----------------|------------------|----------------------------------------|
+| Hydrophobicity | L, I, V, F | Avoid water |
+| Charge (+) | K, R | Positively charged |
+| Charge (–) | D, E | Negatively charged |
+| Polarity | S, T, N, Q | Ability to form hydrogen bonds |
+| Size | G, A vs. W, Y | Small vs. bulky side chains |
+
+### Why it matters for ProtFeat:
+Descriptors like CTD or PseAAC use these groupings to extract biologically meaningful statistics about the sequence.
+
+---
+
+## 4. Sequence Alignment Basics
+
+Sequence alignment compares protein sequences to find regions of similarity.
+
+| Concept | Description |
+|---------------------|------------------------------------------------------------|
+| Global Alignment | Aligns full-length sequences (Needleman–Wunsch) |
+| Local Alignment | Aligns best-matching subsequences (Smith–Waterman) |
+| Substitution Matrix | Scores residue substitutions (e.g., BLOSUM62, PAM) |
+
+### Why it matters for ProtFeat:
+PSSM-based descriptors depend on multiple sequence alignments, typically generated using **PSI-BLAST**, which produces conservation scores used in feature extraction.
+
+---
+
+## 5. Position-Specific Scoring Matrix (PSSM)
+
+A **PSSM** is a matrix of size *L × 20*, where *L* is the sequence length and 20 columns represent the 20 standard amino acids.
+
+Each cell contains a **log-odds score** indicating the likelihood of a particular amino acid occurring at a specific position based on aligned homologous sequences.
+
+**Simplified example:**
+
+| Position | A | R | N | D | ... | Y |
+|----------|----|----|----|----|-----|----|
+| 1 | 4 | -1 | 0 | 0 | ... | -2 |
+| 2 | -2 | 6 | 0 | -1 | ... | 0 |
+
+### Why it matters for ProtFeat:
+ProtFeat includes multiple **PSSM-based descriptors** (via the POSSUM toolkit) that derive numeric features from evolutionary conservation profiles.
+
+---
+
+## 6. Protein Feature Descriptors
+
+Descriptors are numerical summaries of protein sequences. ProtFeat supports both sequence-based and PSSM-based descriptors.
+
+| Code | Full Name | Description | Typical Vector Length |
+|-----------|----------------------------------|--------------------------------------------------|------------------------|
+| AAC | Amino Acid Composition | Frequency of each amino acid | 20 |
+| DPC | Dipeptide Composition | Frequency of each 2-letter amino acid pair | 400 |
+| CTD | Composition/Transition/Dist. | Group-wise counts based on properties | 147 |
+| PseAAC | Pseudo Amino Acid Composition | Sequence + physicochemical + order info | Varies |
+| PSSM-AC | PSSM Autocorrelation | Correlation between positions in the PSSM | Typically 60 |
+
+### Why it matters for ProtFeat:
+These descriptors convert variable-length protein sequences into fixed-length feature vectors for input to machine learning models.
+
+---
+
+## 7. Databases & Identifiers
+
+ProtFeat relies on standard biological identifiers used in well-known databases.
+
+| Database | Example ID | Description |
+|------------|----------------|----------------------------------------------|
+| UniProtKB | Q9H9K5 | Curated protein knowledgebase |
+| NCBI nr | NP_000537.3 | Non-redundant protein sequence database |
+| Pfam | PF00096 | Protein domain/family annotations |
+
+### Why it matters for ProtFeat:
+- FASTA files often come from UniProt or NCBI.
+- These IDs are used to track protein features and link sequences to known biological metadata.
+
+---
+
+## 8. Feature Engineering in Proteomics
+
+**Feature engineering** transforms protein sequences into numerical representations suitable for predictive modeling.
+
+### Common ML applications:
+- Predicting enzyme/non-enzyme
+- Identifying subcellular localization
+- Predicting protein–protein interactions
+- Classifying disease-causing variants
+
+### Typical workflow:
+
+Protein FASTA file
+↓
+(Optional) PSI-BLAST → PSSM
+↓
+ProtFeat feature extraction
+↓
+Tabular feature matrix (.tsv)
+↓
+Machine learning model
+
+### Why it matters for ProtFeat:
+ProtFeat provides the core feature extraction step in this pipeline—converting biological sequences into structured data.
+
+---
+
+## Further Reading
+
+**Books:**
+- Baxevanis & Ouellette, *Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins*
+- Jones & Pevzner, *An Introduction to Bioinformatics Algorithms*
+
+**Online Resources:**
+- [UniProt](https://www.uniprot.org)
+- [NCBI BLAST](https://blast.ncbi.nlm.nih.gov)
+- [Pfam](http://pfam.xfam.org)
+
+---
+
+*Save this file as `BIO_NOMENCLATURE.md` in your repository to help readers understand the bioinformatics background of ProtFeat.*
+

From 1688200aabe507d07345313a219c8ef398e9c677 Mon Sep 17 00:00:00 2001
From: mckayadd
Date: Sun, 4 May 2025 10:47:05 +0200
Subject: [PATCH 3/5] Update BIO_NOMENCLEATURE.md

---
 BIO_NOMENCLEATURE.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/BIO_NOMENCLEATURE.md b/BIO_NOMENCLEATURE.md
index f553b16..0ae194a 100644
--- a/BIO_NOMENCLEATURE.md
+++ b/BIO_NOMENCLEATURE.md
@@ -1,4 +1,3 @@
-# BIO_NOMENCLATURE.md
 
 A detailed guide to the core bioinformatics and computational biology concepts behind the ProtFeat project. This file is intended to help new users—especially those coming from outside biology—understand the terminology and logic behind the protein feature extraction techniques used in ProtFeat.
 

From 604697f7f080193713af0e2b00acf448656419f0 Mon Sep 17 00:00:00 2001
From: mckayadd
Date: Sun, 4 May 2025 10:55:52 +0200
Subject: [PATCH 4/5] Update BIO_NOMENCLEATURE.md

---
 BIO_NOMENCLEATURE.md | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/BIO_NOMENCLEATURE.md b/BIO_NOMENCLEATURE.md
index 0ae194a..1c0149e 100644
--- a/BIO_NOMENCLEATURE.md
+++ b/BIO_NOMENCLEATURE.md
@@ -38,8 +38,8 @@ FASTA is a plain-text format commonly used for storing protein and nucleotide se
 
 **Example:**
 
-``` sp|Q9H9K5|SYT1_HUMAN Synaptotagmin-1 OS=Homo sapiens
-MSEQNNTEMTFQIQRIYTKDISFEAPNAPHVFQ... ```
+``` >sp|Q9H9K5|SYT1_HUMAN Synaptotagmin-1 OS=Homo sapiens... ```
+```MSEQNNTEMTFQIQRIY... ```
 
 
 - The line starting with `>` is the **header**, containing metadata such as UniProt ID and description.
@@ -174,5 +174,3 @@ ProtFeat provides the core feature extraction step in this pipeline—converting
 
 ---
 
-*Save this file as `BIO_NOMENCLATURE.md` in your repository to help readers understand the bioinformatics background of ProtFeat.*
-

From c2b4e82a74993590a4eb068842084da8b7ec338a Mon Sep 17 00:00:00 2001
From: mckayadd
Date: Sun, 4 May 2025 11:49:38 +0200
Subject: [PATCH 5/5] Update BIO_NOMENCLEATURE.md

---
 BIO_NOMENCLEATURE.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/BIO_NOMENCLEATURE.md b/BIO_NOMENCLEATURE.md
index 1c0149e..be5fcf3 100644
--- a/BIO_NOMENCLEATURE.md
+++ b/BIO_NOMENCLEATURE.md
@@ -146,6 +146,7 @@ ProtFeat relies on standard biological identifiers used in well-known databases.
 
 ### Typical workflow:
 
+```
 Protein FASTA file
 ↓
 (Optional) PSI-BLAST → PSSM
@@ -155,7 +156,7 @@ ProtFeat feature extraction
 Tabular feature matrix (.tsv)
 ↓
 Machine learning model
-
+```
 ### Why it matters for ProtFeat:
 ProtFeat provides the core feature extraction step in this pipeline—converting biological sequences into structured data.
 