FUNC-E is a Python package for functional enrichment analysis of gene lists. Its approach is similar to that of DAVID: it performs enrichment analysis using Fisher's exact test and then clusters the enriched annotations using kappa statistics. FUNC-E provides the following benefits:
- FUNC-E provides a command-line tool for inclusion in workflows.
- FUNC-E provides an application programming interface (API) that can be used to incorporate functional enrichment into any Python script or application.
- You can provide any list of vocabularies for which you have annotations (e.g., GO, KEGG, InterPro, Pfam, etc.).
- FUNC-E is species agnostic. You provide the gene names and the functional annotations.
- FUNC-E provides both a command-line tool and API functions for creating term lists for the Gene Ontology, KEGG and InterPro.
You can install the most recent version of FUNC-E using pip:
pip install func-e
Before using FUNC-E you must prepare your files. You will need four types of files:
- A genomic background file containing the full list of genes.
- A query list of genes that will be analyzed for enrichment of function.
- One or more files containing a list of functional terms used for enrichment.
- One or more files that associates functional terms with genes.
FUNC-E requires the complete list of genes. This serves as the genomic "background". This file should have a single column with each gene listed on a separate line. For example, the first 10 lines of a TAIR10 background file would be:
AT1G01010
AT1G01020
AT1G01030
AT1G01040
AT1G01046
AT1G01050
AT1G01060
AT1G01070
AT1G01073
AT1G01080
The query list contains the genes that will undergo functional enrichment. This file allows you to specify multiple groups (i.e., modules) of genes for enrichment. The file should be tab-delimited: the first column contains the gene names and the second column contains the group (or module) name. The group name allows multiple groups of genes to be listed in the same file while enrichment is performed separately for each. The second column may be left blank, in which case the file is a single column of gene names.
AT1G01010 Module1
AT1G01020 Module1
AT1G01030 Module1
AT1G01040 Module1
AT1G01046 Module2
AT1G01050 Module2
AT1G01060 Module2
AT1G01070 Module2
AT1G01073 Module2
AT1G01080 Module2
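The query list layout described above can be read with a few lines of Python. This is only a minimal sketch for checking your own file before running FUNC-E, not FUNC-E's own parser. The function name `read_query_list` and the placeholder module name `unassigned` (used when the second column is blank) are assumptions made for this illustration.

```python
import csv
import io
from collections import defaultdict

# Inline stand-in for a query list file; the last gene has no module name.
QUERY_TEXT = "AT1G01010\tModule1\nAT1G01020\tModule1\nAT1G01046\n"


def read_query_list(handle, default_module="unassigned"):
    """Group gene names by module from a tab-delimited query list."""
    modules = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if not row or not row[0].strip():
            continue  # skip blank lines
        gene = row[0].strip()
        # The second column is optional; "unassigned" is a placeholder
        # chosen for this sketch, not a FUNC-E convention.
        has_module = len(row) > 1 and row[1].strip()
        module = row[1].strip() if has_module else default_module
        modules[module].append(gene)
    return dict(modules)


modules = read_query_list(io.StringIO(QUERY_TEXT))
print(modules)
```

To check a real file, pass an open file handle instead of the `io.StringIO` object.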
A term list is a file that contains the terms used for enrichment. This file should be a tab delimited file with five columns: ID_Space, Vocabulary, Term, Name and Definition. The Term must be unique (e.g. the unique accession). All columns must have values except the Definition column. The term list can be contained in one large file with all terms from multiple vocabularies combined, or each vocabulary can be in separate files. The following example combines terms from multiple vocabularies into a single list. In the example the Definition (last column) has an empty string.
ID_Space Vocabulary Term Name Definition
GO biological_process GO:0000001 mitochondrion inheritance
GO biological_process GO:0000002 mitochondrial genome maintenance
IPR IPR IPR000002 Cdc20/Fizzy
IPR IPR IPR000003 Retinoid X receptor
IPR IPR IPR000005 Helix-turn-helix, AraC type
A term mapping list maps the genes in the genomic background to terms in the term list. This file should be tab-delimited and consist of two columns: gene name and term ID. The gene name must be present in the genomic background file and the term ID must be present in the term list file(s). The following are example lines from a mapping list for rice genes:
LOC_Os01g01010 GO:0005097
LOC_Os01g01010 GO:0005622
LOC_Os01g01010 GO:0032313
LOC_Os01g01030 GO:0005507
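Because every gene in the mapping must appear in the background and every term must appear in a term list, it can help to check the files against each other first. The sketch below shows one way to do that on in-memory data. The helper name `check_mapping` is hypothetical, and the gene `LOC_Os01g01040` is included only to demonstrate a failing row.

```python
def check_mapping(background, term_ids, mapping):
    """Return the genes and terms used in the mapping but not declared elsewhere."""
    unknown_genes = sorted({gene for gene, _ in mapping if gene not in background})
    unknown_terms = sorted({term for _, term in mapping if term not in term_ids})
    return unknown_genes, unknown_terms


# Small stand-ins for the background file, term list and mapping list.
background = {"LOC_Os01g01010", "LOC_Os01g01030"}
term_ids = {"GO:0005097", "GO:0005622", "GO:0005507"}
mapping = [
    ("LOC_Os01g01010", "GO:0005097"),
    ("LOC_Os01g01010", "GO:0032313"),  # term missing from term_ids
    ("LOC_Os01g01040", "GO:0005507"),  # gene missing from background (illustrative)
]

unknown_genes, unknown_terms = check_mapping(background, term_ids, mapping)
print("Genes not in background:", unknown_genes)
print("Terms not in term list:", unknown_terms)
```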
Bioinformatics tools such as InterProScan, Blast2GO and EnTAP (to name a few) provide the mapping of genes to controlled vocabulary terms, but creating the list of all terms in a vocabulary is still needed prior to enrichment. FUNC-E makes it easy to generate these for common vocabularies such as the Gene Ontology (GO), KEGG (KEGG) and InterPro (IPR).
To generate a file of vocabulary terms from GO, KEGG and IPR use the following command:
FUNC-E-terms --vocab GO KEGG IPR
This will create a file named terms.tsv in the format required by the --terms argument of FUNC-E.
Alternatively, you can create separate files for each vocabulary:
FUNC-E-terms --outprefix KEGG --vocab KEGG
FUNC-E-terms --outprefix GO --vocab GO
FUNC-E-terms --outprefix IPR --vocab IPR
FUNC-E provides the following usage instructions:
FUNC-E [-h] --background BACKGROUND
--query_list QUERY_LIST
--ecut ECUT
--terms TERMS [TERMS ...]
--terms2features TERMS2FEATURES [TERMS2FEATURES ...]
[--outprefix OUTPREFIX]
[--module MODULE]
[--vocab VOCAB [VOCAB ...]]
[--similarity_threshold SIMILARITY_THRESHOLD]
[--similarity_term_overlap SIMILARITY_TERM_OVERLAP]
[--percent_similarity PERCENT_SIMILARITY]
[--initial_group_membership INITIAL_GROUP_MEMBERSHIP]
[--multiple_linkage_threshold MULTIPLE_LINKAGE_THRESHOLD]
[--final_group_membership FINAL_GROUP_MEMBERSHIP]
[--verbose VERBOSE]
For more detailed information about each argument please run the FUNC-E -h command.
The following example performs functional enrichment of a list of Arabidopsis genes from the TAIR10 genome assembly and annotation. It requires four types of files, provided using the arguments --background, --query_list, --terms, and --terms2features. Note that the --terms and --terms2features arguments each accept as many files as you have. In this example, lists of terms (provided with the --terms option) from AraCyc, GO (Gene Ontology), IPR (InterPro), Pfam and PO (Plant Ontology) have been prepared, as well as the mappings of these term lists to the genes. Gene-to-term mappings are provided using the --terms2features option.
FUNC-E \
--background arabidopsis_thaliana.TAIR10.genes.txt \
--query_list modules.txt \
--outprefix modules-enrichment \
--terms AraCyc.terms.txt GO.terms.txt IPR.terms.txt \
Pfam.terms.txt PO.terms.txt \
--terms2features arabidopsis_thaliana.TAIR10.genes2AraCyc.txt \
arabidopsis_thaliana.TAIR10.genes2GO.txt \
arabidopsis_thaliana.TAIR10.genes2IPR.txt \
arabidopsis_thaliana.TAIR10.genes2Pfam.txt \
arabidopsis_thaliana.TAIR10.genes2PO.txt \
--ecut 0.01
Additionally, the --ecut option provides a p-value cutoff for enrichment, and the --outprefix option provides a prefix that is added to every output file created by this script.
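If you call the command-line tool from a Python-based workflow, the argument list can be assembled programmatically. The following is a sketch only: the helper name `build_func_e_command` is hypothetical, it uses just the flags shown in the usage text above, and it prints the command rather than running it.

```python
import shlex


def build_func_e_command(background, query_list, terms, terms2features,
                         ecut=0.01, outprefix=None):
    """Assemble a FUNC-E command line as a list of arguments."""
    cmd = ["FUNC-E",
           "--background", background,
           "--query_list", query_list,
           "--ecut", str(ecut)]
    cmd += ["--terms", *terms]                    # one flag, many files
    cmd += ["--terms2features", *terms2features]  # one flag, many files
    if outprefix:
        cmd += ["--outprefix", outprefix]
    return cmd


cmd = build_func_e_command(
    background="arabidopsis_thaliana.TAIR10.genes.txt",
    query_list="modules.txt",
    terms=["GO.terms.txt", "IPR.terms.txt"],
    terms2features=["arabidopsis_thaliana.TAIR10.genes2GO.txt",
                    "arabidopsis_thaliana.TAIR10.genes2IPR.txt"],
    outprefix="modules-enrichment",
)
print(shlex.join(cmd))
# To execute it, pass the list to subprocess.run(cmd, check=True).
```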
When you run FUNC-E from the command-line, it generates five tab-delimited output files. If you provide an --outprefix argument (e.g., --outprefix modules-enrichment), that prefix will be prepended to each filename. Otherwise, the files will be named with just the FUNC-E. prefix.
The *.enriched_terms.tsv file contains all terms that passed the enrichment p-value cutoff (specified by --ecut). Each row represents one enriched term in one module.
Columns:
- Module: The module name from the query list
- ID_Space: The namespace of the term (e.g., GO, IPR, KEGG)
- Vocabulary: The specific vocabulary within the ID space (e.g., biological_process, molecular_function)
- Term: The unique term identifier (e.g., GO:0008150)
- Name: Human-readable term name
- Module Size: Total number of genes in the module
- Count In Module: Number of genes in the module annotated with this term
- Count In Background: Total number of genes in the background annotated with this term
- Fishers p-value: Raw p-value from Fisher's exact test
- Bonferroni: Bonferroni-corrected p-value for multiple testing
- Benjamini: Benjamini-Hochberg (FDR) corrected p-value for multiple testing
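To make the two corrected p-value columns concrete, here is a minimal sketch of the standard Bonferroni and Benjamini-Hochberg adjustments in plain Python. It illustrates the textbook procedures and is not taken from the FUNC-E source; in particular, which set of tests FUNC-E counts together when correcting is not stated here, so the single flat list of p-values is an assumption.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that each adjusted
    # value never exceeds the adjusted value of a larger p-value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


pvalues = [0.01, 0.04, 0.03, 0.005]
bonf = bonferroni(pvalues)
bh = benjamini_hochberg(pvalues)
print(bonf)
print(bh)
```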
The *.clusters.tsv file summarizes the clusters of enriched terms identified through kappa statistics. Each row represents one cluster.
Columns:
- Module: The module name
- Cluster Index: Numeric identifier for the cluster (1, 2, 3, ...)
- Geometric Mean: Geometric mean of p-values for all terms in the cluster
- EASE Score: -log10 of the geometric mean (higher scores indicate stronger enrichment)
- Features: List of genes/features that share the enriched terms in this cluster
- Enriched Terms: List of term IDs included in this cluster
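The relationship between the Geometric Mean and EASE Score columns can be shown with a short calculation. This is a sketch of the arithmetic described above, not FUNC-E's code. The function name `cluster_scores` is hypothetical, and it assumes all p-values are greater than zero.

```python
import math


def cluster_scores(pvalues):
    """Return (geometric mean, -log10 of the geometric mean) for a cluster."""
    # Averaging in log space avoids underflow when multiplying tiny p-values.
    mean_log10 = sum(math.log10(p) for p in pvalues) / len(pvalues)
    geometric_mean = 10 ** mean_log10
    ease_score = -math.log10(geometric_mean)
    return geometric_mean, ease_score


geometric_mean, ease_score = cluster_scores([1e-2, 1e-4])
print(geometric_mean, ease_score)
```

For the two p-values 0.01 and 0.0001, the geometric mean is 0.001, which gives a score of about 3.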
The *.cluster_terms.tsv file is a filtered version of the enriched terms file, containing only terms that were assigned to clusters. It combines information from both the enrichment analysis and clustering steps.
Columns:
- All columns of the enriched terms file
- Cluster Index: The cluster to which this term belongs
This file is useful for focusing on the most biologically coherent groups of enriched terms, as clustering removes isolated or weakly-related terms.
The fourth output file contains pairwise kappa similarity scores between genes that share enriched terms. These scores are used to perform the clustering.
Columns:
- Feature1: First gene/feature in the pair
- Feature2: Second gene/feature in the pair
- Module: The module containing both features
- Score: Cohen's kappa score (ranges from -1 to 1; values closer to 1 indicate higher similarity)
- Overlap: Number of enriched terms shared between the two features
Only gene pairs with kappa scores above the --similarity_threshold (default 0.35) are included.
The fifth output file lists all genes/features that have at least one enriched term, along with the terms enriched for that gene.
Columns:
- Feature: The gene/feature name
- Module: The module containing this feature
- Term: A list of all enriched term IDs associated with this feature
This file is useful for quickly identifying which genes contributed to the enrichment signal and what terms they are associated with.
FUNC-E performs two distinct analyses: enrichment and clustering. Understanding the difference between these steps is important for interpreting your results.
Enrichment analysis identifies individual functional terms that are statistically over-represented in your gene list compared to the genomic background. This is done using Fisher's exact test for each term independently.
What it tells you:
- Which specific functional annotations appear more frequently in your gene list than expected by chance
- The statistical significance of each term (p-value)
- How many genes in your list are annotated with each term
Output: The *.enriched_terms.tsv file contains all terms that pass your p-value cutoff (--ecut).
Limitations: Enrichment analysis often produces long lists of related or redundant terms. For example, if your genes are involved in "DNA replication," you might see dozens of enriched terms like "DNA replication," "DNA-dependent DNA replication," "nuclear DNA replication," "DNA replication initiation," etc. While all are statistically significant, they represent overlapping biological concepts.
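The per-term test described above can be sketched with the standard library alone. The example computes a one-sided Fisher's exact test p-value as the upper tail of the hypergeometric distribution, using counts like those reported in the enriched terms file. Treat it as an illustration under stated assumptions: whether FUNC-E uses a one-sided or two-sided test, and exactly how it builds the contingency table, is not specified here, and the function name is hypothetical.

```python
from math import comb


def fisher_enrichment_pvalue(count_in_module, module_size,
                             count_in_background, background_size):
    """One-sided Fisher's exact test p-value for over-representation.

    Probability of seeing at least `count_in_module` annotated genes when
    `module_size` genes are drawn without replacement from a background of
    `background_size` genes, `count_in_background` of which carry the term.
    Assumes the module genes are a subset of the background.
    """
    upper = min(module_size, count_in_background)
    total = comb(background_size, module_size)
    tail = sum(
        comb(count_in_background, i)
        * comb(background_size - count_in_background, module_size - i)
        for i in range(count_in_module, upper + 1)
    )
    return tail / total


# 3 of 5 module genes carry a term that annotates 5 of 20 background genes.
pvalue = fisher_enrichment_pvalue(3, 5, 5, 20)
print(round(pvalue, 4))
```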
Clustering groups related enriched terms together based on the similarity of the genes annotated with those terms. FUNC-E uses Cohen's kappa statistics to measure how similarly two genes are annotated, then clusters genes (and their associated terms) that share similar annotation patterns. This clustering approach is adapted from the methodology developed by the DAVID (Database for Annotation, Visualization and Integrated Discovery) tool.
What it tells you:
- Which groups of enriched terms represent coherent biological themes
- Which genes share similar functional annotation patterns
- The overall strength of each functional theme (EASE score)
Output: The *.clusters.tsv and *.cluster_terms.tsv files contain the clustered results.
Benefits: Clustering reduces redundancy and helps identify the major biological themes in your data. Instead of reviewing hundreds of individual terms, you can focus on a smaller number of term clusters, each representing a distinct biological process or function.
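To show what a kappa score between two genes measures, the sketch below computes Cohen's kappa from the sets of terms annotated to each gene. It is a generic implementation of the statistic, not FUNC-E's code. The choice of term universe (`all_terms`), the term labels and the function name are assumptions made for the example.

```python
def cohen_kappa(terms_a, terms_b, all_terms):
    """Cohen's kappa for two genes treated as present/absent calls over all_terms.

    Assumes terms_a and terms_b are subsets of all_terms.
    """
    n = len(all_terms)
    both = len(terms_a & terms_b)
    only_a = len(terms_a - terms_b)
    only_b = len(terms_b - terms_a)
    neither = n - both - only_a - only_b
    observed = (both + neither) / n
    # Chance agreement from each gene's marginal annotation frequency.
    expected = ((both + only_a) * (both + only_b)
                + (neither + only_b) * (neither + only_a)) / (n * n)
    if expected == 1.0:
        return 0.0  # kappa is undefined here; 0.0 is a convention for this sketch
    return (observed - expected) / (1.0 - expected)


all_terms = {f"T{i}" for i in range(1, 11)}  # ten illustrative term labels
gene_a = {"T1", "T2", "T3", "T4"}
gene_b = {"T1", "T2", "T3", "T5"}
kappa = cohen_kappa(gene_a, gene_b, all_terms)
print(round(kappa, 3))
```

In this example the two genes share three of their four terms, giving a kappa of about 0.583, which is above the default similarity threshold of 0.35.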
- Enrichment first: FUNC-E identifies all significantly enriched terms (those with p-value ≤ --ecut).
- Kappa calculation: For genes with enriched terms, FUNC-E calculates pairwise similarity scores based on shared term annotations.
- Clustering: Genes with similar annotation patterns are grouped, and their associated enriched terms form clusters.
- Filtering: Only clusters meeting size requirements (controlled by --final_group_membership) are retained.
Use enriched terms (*.enriched_terms.tsv) when:
- You want a complete list of all significant functional annotations
- You need to identify every specific term that is over-represented
- You plan to perform your own downstream filtering or analysis
Use clustered terms (*.cluster_terms.tsv and *.clusters.tsv) when:
- You want a high-level summary of biological themes
- Your enrichment results contain many redundant or related terms
- You need to prioritize the most important functional categories
- You're preparing results for publication or presentation
Note: Not all enriched terms will appear in clusters. Terms that don't cluster well with others (isolated terms with unique annotation patterns) are excluded from the clustering results but remain in the enrichment results. This is by design: clustering focuses on identifying coherent functional themes rather than individual signals.
To use the FUNC-E API to build a list of vocabularies, you must first import the package into your code:
import func_e.vocabs.all as vocabs
To generate a Pandas DataFrame of vocabulary terms from GO, KEGG and IPR, use the following function call:
terms = vocabs.getTerms(['GO', 'KEGG', 'IPR'])
If you are using terms from vocabularies other than GO, KEGG or InterPro, then you must provide the list of terms in a Pandas DataFrame. The columns of this DataFrame must match the terms file described above. The column headers must include ID_Space, Vocabulary, Term, Name, Definition.
Next, you must create a Pandas DataFrame that assigns the feature (e.g., gene, transcript or protein) to terms. The terms must be present in the terms DataFrame created in the previous step. The column headers must be Feature and Term.
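As an illustration of the two DataFrames described above, the following sketch builds them by hand with the required column headers and runs two basic consistency checks. The term rows reuse the term list example from earlier in this document. The gene-to-term pairings are invented purely for demonstration and are not real annotations.

```python
import pandas as pd

# Terms DataFrame with the five required columns; Definition may be empty.
terms = pd.DataFrame(
    [
        ("GO", "biological_process", "GO:0000001", "mitochondrion inheritance", ""),
        ("IPR", "IPR", "IPR000002", "Cdc20/Fizzy", ""),
    ],
    columns=["ID_Space", "Vocabulary", "Term", "Name", "Definition"],
)

# Feature-to-term DataFrame; these pairings are illustrative only.
terms2features = pd.DataFrame(
    [("AT1G01010", "GO:0000001"), ("AT1G01020", "IPR000002")],
    columns=["Feature", "Term"],
)

# The Term column must be unique, and every mapped term must be declared.
terms_are_unique = terms["Term"].is_unique
undeclared_terms = set(terms2features["Term"]) - set(terms["Term"])
print(terms_are_unique, undeclared_terms)
```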
To perform functional enrichment using the FUNC-E API, start by importing the FUNC_E class:
from func_e.FUNC_E import FUNC_E
Next, instantiate a new FUNC_E object:
fe = FUNC_E()
Next, you need to set the p-value cutoff for enrichment testing:
fe.setEnrichmentSettings({
    'ecut': 0.01
})
If you desire, you can change the clustering default settings as well:
fe.setClusteringSettings({
    'similarity_term_overlap': 3,
    'percent_similarity': 0.50,
    'initial_group_membership': 3,
    'multiple_linkage_threshold': 0.50,
    'final_group_membership': 3,
    'similarity_threshold': 0.5
})
The settings have the following meaning:
- similarity_threshold: This value is used to threshold the kappa scores. Pairwise kappa scores are calculated for all genes. Kappa scores range from -1 to 1 and measure the similarity of annotations between two genes. Only gene pairs with scores greater than this threshold are considered meaningful and are clustered. The default value is 0.35.
- similarity_term_overlap: Before kappa statistics are calculated, two genes must share a specified number of terms. This parameter sets that minimum value. The default is 4.
- percent_similarity: Before clustering, seed groups are created, and these should be high-quality groups. Therefore, the members of a seed group must themselves share similarity with all other genes in the group greater than or equal to the value specified by this parameter. The default is 0.50 (50 percent).
- initial_group_membership: When clustering, initial seed groups are created by grouping a gene with all other genes with which it has a significant (> similarity_threshold) kappa score. This parameter sets the minimum number of genes that must exist for a group to be considered a seed group. The default value is 4.
- multiple_linkage_threshold: After initial seed groups are formed, an iterative process attempts to merge seed groups that have a specified percentage of genes in common. This parameter sets this percentage. The default is 0.50 (seed groups must share 50 percent of genes to be merged).
- final_group_membership: This parameter sets the minimum number of terms in a cluster after all clustering. If the cluster has fewer terms, it is thrown out. The default value is 4.
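To give a feel for how similarity_threshold, initial_group_membership and percent_similarity interact, here is a rough sketch of seed-group formation. It is one possible reading of the descriptions above and not FUNC-E's implementation. In particular, percent_similarity is interpreted here as the fraction of gene pairs within a group whose kappa exceeds the threshold, which is an assumption. The function name and gene labels are hypothetical.

```python
from itertools import combinations


def seed_groups(kappa, similarity_threshold=0.35,
                initial_group_membership=4, percent_similarity=0.50):
    """Form seed groups from a dict mapping frozenset({gene_a, gene_b}) to kappa."""
    # Link each gene to the partners whose kappa exceeds the threshold.
    neighbors = {}
    for pair, score in kappa.items():
        if score > similarity_threshold:
            a, b = tuple(pair)
            neighbors.setdefault(a, set()).add(b)
            neighbors.setdefault(b, set()).add(a)

    seeds = []
    for gene, partners in sorted(neighbors.items()):
        group = {gene} | partners
        if len(group) < initial_group_membership:
            continue  # too few genes to act as a seed
        pairs = list(combinations(sorted(group), 2))
        linked = sum(
            1 for a, b in pairs
            if kappa.get(frozenset((a, b)), 0.0) > similarity_threshold
        )
        # Keep the group only if enough of its members are similar to each other.
        if linked / len(pairs) >= percent_similarity and group not in seeds:
            seeds.append(group)
    return seeds


genes = ["g1", "g2", "g3", "g4"]
kappa = {frozenset(pair): 0.8 for pair in combinations(genes, 2)}
seeds = seed_groups(kappa)
print(seeds)
```

With four genes that are all similar to one another, every gene proposes the same group, so a single seed group is kept.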
Next, FUNC_E can import the files needed for enrichment analysis. These are the same as the example files used in the command-line example above.
fe.importFiles({
    'background': 'arabidopsis_thaliana.TAIR10.genes.txt',
    'query': 'modules.txt',
    'terms2features': ['arabidopsis_thaliana.TAIR10.genes2AraCyc.txt',
                       'arabidopsis_thaliana.TAIR10.genes2GO.txt',
                       'arabidopsis_thaliana.TAIR10.genes2IPR.txt',
                       'arabidopsis_thaliana.TAIR10.genes2Pfam.txt',
                       'arabidopsis_thaliana.TAIR10.genes2PO.txt'],
    'terms': ['IPR.terms.tsv', 'GO.terms.tsv', 'KEGG.terms.tsv']
})
Alternatively, you may have created the terms DataFrame using the vocabs.getTerms() function described above. If so, you can leave out the terms argument in the importFiles() function call above and set the terms manually:
fe.setTerms(terms)
Now that FUNC-E has all of the necessary files and settings, you can perform functional enrichment:
fe.run()
If you only wish to perform enrichment analysis and not clustering, you can provide the cluster=False argument:
fe.run(cluster=False)
If you want to limit enrichment to only a subset of modules and/or vocabularies, you can provide the modules and vocabs arguments:
fe.run(modules=['module1', 'module2'], vocabs=['GO'])
Once completed, you can access results using the following attributes of the FUNC_E object:
- fe.enrichment: a Pandas DataFrame containing the results of the enrichment test, including the p-value and the Bonferroni- and Benjamini-corrected p-values.
- fe.clusters: a Pandas DataFrame listing the clusters that were identified, containing the EASE score and geometric mean of p-values.
- fe.cluster_terms: a copy of the enrichment report, but with only clustered terms.
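Because the results are ordinary Pandas DataFrames, they can be filtered with standard Pandas operations. The sketch below uses a small stand-in DataFrame with a few of the column names listed for the enriched terms file. The values are made up for illustration, and with real results you would start from fe.enrichment instead.

```python
import pandas as pd

# Stand-in for fe.enrichment; values are illustrative only.
enrichment = pd.DataFrame({
    "Module": ["Module1", "Module1", "Module2"],
    "Term": ["GO:0000001", "IPR000002", "GO:0000002"],
    "Fishers p-value": [0.001, 0.008, 0.0005],
    "Benjamini": [0.004, 0.06, 0.002],
})

# Keep terms that remain significant after FDR correction, best first.
significant = (
    enrichment[enrichment["Benjamini"] <= 0.05]
    .sort_values(["Module", "Fishers p-value"])
)

# Collect the surviving term IDs for each module.
per_module = significant.groupby("Module")["Term"].apply(list).to_dict()
print(per_module)
```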
Finally, below are example commands to save results to a file:
fe.enrichment.sort_values(['Module', 'Fishers p-value']).to_csv('FUNC-E.enriched_terms.tsv', sep="\t", index=False)
fe.clusters.sort_values(['Module', 'Cluster Index', 'EASE Score']).to_csv('FUNC-E.clusters.tsv', sep="\t", index=False)
fe.cluster_terms.sort_values(['Module', 'Cluster Index', 'Fishers p-value']).to_csv('FUNC-E.cluster_terms.tsv', sep="\t", index=False)