FUNC-E

FUNC-E is a Python package for functional enrichment analysis of gene lists. It follows an approach similar to that of DAVID: it performs enrichment analysis using Fisher's exact test and then clusters the enriched annotations using kappa statistics. FUNC-E provides the following benefits:

FUNC-E provides a command-line tool for inclusion in workflows.
FUNC-E provides an Application Programmers Interface (API) that can be used to incorporate functional enrichment into any Python script or application.
You can provide any list of vocabularies for which you have annotations (e.g., GO, KEGG, InterPro, Pfam, etc.)
FUNC-E is species agnostic. You provide the gene names and the functional annotations.
FUNC-E provides both a command-line tool and API functions for creating term lists for the Gene Ontology, KEGG and InterPro.

Installation

You can install the most recent version of FUNC-E using pip:

pip install func-e

Preparing Input Files

Before using FUNC-E you must prepare your input files. You will need four types of files:

A genomic background file containing the full list of genes.
A query list of genes that will be analyzed for enrichment of function.
One or more files containing a list of functional terms used for enrichment.
One or more files that associates functional terms with genes.

Genomic Background (--background option)

FUNC-E requires the complete list of genes. This serves as the genomic "background". This file should have a single column with each gene listed on a separate line. For example, the first 10 lines of a TAIR10 background file would be:

AT1G01010
AT1G01020
AT1G01030
AT1G01040
AT1G01046
AT1G01050
AT1G01060
AT1G01070
AT1G01073
AT1G01080

Query List (--query_list option)

The query list contains the genes that will undergo functional enrichment. This file allows you to specify multiple groups (i.e., modules) of genes for enrichment. The file should be tab-delimited. The first column should contain the gene name and the second column the group (or module) name. The group name allows multiple groups of genes to be listed in the same file while enrichment is performed separately for each. The second column may be omitted, in which case the file contains only a single column of gene names.

AT1G01010 Module1
AT1G01020 Module1
AT1G01030 Module1
AT1G01040 Module1
AT1G01046 Module2
AT1G01050 Module2
AT1G01060 Module2
AT1G01070 Module2
AT1G01073 Module2
AT1G01080 Module2
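
If your gene groups already live in a Python structure, a file in this layout can be written with the standard library. This is a minimal sketch: the `modules` dictionary, the `write_query_list` helper and the output location are made up for illustration and are not part of FUNC-E.

```python
import csv
import os
import tempfile

# Hypothetical module assignments; replace with your own groups.
modules = {
    "Module1": ["AT1G01010", "AT1G01020"],
    "Module2": ["AT1G01046", "AT1G01050"],
}

def write_query_list(modules, path):
    """Write one 'gene<TAB>module' row per gene."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for module, genes in modules.items():
            for gene in genes:
                writer.writerow([gene, module])

# Write to a temporary directory so the sketch does not touch real files.
query_path = os.path.join(tempfile.mkdtemp(), "modules.txt")
write_query_list(modules, query_path)

with open(query_path) as handle:
    query_rows = handle.read().splitlines()
```

The resulting file can then be passed to the --query_list option.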

Term List (--terms option)

A term list is a file that contains the terms used for enrichment. This file should be a tab delimited file with five columns: ID_Space, Vocabulary, Term, Name and Definition. The Term must be unique (e.g. the unique accession). All columns must have values except the Definition column. The term list can be contained in one large file with all terms from multiple vocabularies combined, or each vocabulary can be in separate files. The following example combines terms from multiple vocabularies into a single list. In the example the Definition (last column) has an empty string.

ID_Space        Vocabulary      Term    Name    Definition
GO      biological_process      GO:0000001      mitochondrion inheritance       
GO      biological_process      GO:0000002      mitochondrial genome maintenance      
IPR      IPR     IPR000002       Cdc20/Fizzy      
IPR      IPR     IPR000003       Retinoid X receptor      
IPR      IPR     IPR000005       Helix-turn-helix, AraC type
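
The rules above (five columns, unique Term, only Definition may be empty) are easy to check before running an analysis. The following is a rough sketch using only the standard library; `check_term_list` is a hypothetical helper, and it assumes the file starts with the header row shown in the example.

```python
import csv
import io

REQUIRED = ["ID_Space", "Vocabulary", "Term", "Name", "Definition"]

def check_term_list(handle):
    """Return a list of problems found in a tab-delimited term list."""
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames != REQUIRED:
        return [f"unexpected header: {reader.fieldnames}"]
    problems = []
    seen = set()
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        for column in REQUIRED[:-1]:  # Definition is allowed to be empty
            if not (row[column] or "").strip():
                problems.append(f"line {line_no}: empty {column}")
        term = row["Term"]
        if term in seen:
            problems.append(f"line {line_no}: duplicate term {term}")
        seen.add(term)
    return problems

# A small in-memory example with one deliberately duplicated term.
sample = (
    "ID_Space\tVocabulary\tTerm\tName\tDefinition\n"
    "GO\tbiological_process\tGO:0000001\tmitochondrion inheritance\t\n"
    "IPR\tIPR\tIPR000002\tCdc20/Fizzy\t\n"
    "IPR\tIPR\tIPR000002\tCdc20/Fizzy\t\n"
)
problems = check_term_list(io.StringIO(sample))
```

For a real file, pass an open file handle instead of the `io.StringIO` object.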

Term Mapping List (--terms2features option)

A term mapping list maps the genes in the genomic background to terms in the term list. This file should be tab-delimited and consist of two columns: gene name and term ID. The gene name must be present in the genomic background file and the term ID must be present in the term list file(s). The following are example lines from a mapping list for rice genes:

LOC_Os01g01010  GO:0005097  
LOC_Os01g01010  GO:0005622  
LOC_Os01g01010  GO:0032313   
LOC_Os01g01030  GO:0005507
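
Because every gene must appear in the background and every term in the term list, a quick consistency check can save a failed run. The sketch below works on in-memory sets and pairs; `find_unmapped` is a hypothetical helper, and the two "missing" entries are invented to show what a mismatch looks like.

```python
# Genes from the background file and terms from the term list(s).
background = {"LOC_Os01g01010", "LOC_Os01g01030"}
terms = {"GO:0005097", "GO:0005622", "GO:0005507"}

# (gene, term) pairs as read from a mapping file.
mapping = [
    ("LOC_Os01g01010", "GO:0005097"),
    ("LOC_Os01g01010", "GO:0032313"),  # term absent from the term list
    ("LOC_Os01g01040", "GO:0005507"),  # gene absent from the background
]

def find_unmapped(mapping, background, terms):
    """Return sorted genes and terms that the mapping uses but the inputs lack."""
    unknown_genes = sorted({gene for gene, _ in mapping if gene not in background})
    unknown_terms = sorted({term for _, term in mapping if term not in terms})
    return unknown_genes, unknown_terms

unknown_genes, unknown_terms = find_unmapped(mapping, background, terms)
```

Both returned lists should be empty for a consistent set of input files.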

Usage

Command-line

Generate Terms files

Bioinformatics tools such as InterProScan, Blast2GO and EnTAP (to name a few) provide the mapping of genes to controlled vocabulary terms, but creating the list of all terms in a vocabulary is still needed prior to enrichment. FUNC-E makes it easy to generate these for common vocabularies such as the Gene Ontology (GO), KEGG (KEGG) and InterPro (IPR).

To generate a file of vocabulary terms from GO, KEGG and IPR use the following command:

FUNC-E-terms  --vocab GO KEGG IPR

This will create a file named terms.tsv in the format required by the --terms argument of FUNC-E.

Alternatively, you can create separate files for each vocabulary:

FUNC-E-terms --outprefix KEGG --vocab KEGG
FUNC-E-terms --outprefix GO --vocab GO
FUNC-E-terms --outprefix IPR --vocab IPR

Perform Functional Enrichment Analysis

FUNC-E provides the following usage instructions:

 FUNC-E [-h] --background BACKGROUND
     --query_list QUERY_LIST
     --ecut ECUT
     --terms TERMS [TERMS ...]
     --terms2features TERMS2FEATURES [TERMS2FEATURES ...]
     [--outprefix OUTPREFIX]
     [--module MODULE]
     [--vocab VOCAB [VOCAB ...]]
     [--similarity_threshold SIMILARITY_THRESHOLD]
     [--similarity_term_overlap SIMILARITY_TERM_OVERLAP]
     [--percent_similarity PERCENT_SIMILARITY]
     [--initial_group_membership INITIAL_GROUP_MEMBERSHIP]
     [--multiple_linkage_threshold MULTIPLE_LINKAGE_THRESHOLD]
     [--final_group_membership FINAL_GROUP_MEMBERSHIP]
     [--verbose VERBOSE]

For more detailed information about each argument please run the FUNC-E -h command.

Example

The following example performs functional enrichment of a list of Arabidopsis genes from the TAIR10 genome assembly and annotation. It requires four types of files, supplied with the arguments --background, --query_list, --terms, and --terms2features. Note that the --terms and --terms2features arguments each accept one or more files. In this example, lists of terms (provided with the --terms option) from AraCyc, GO (Gene Ontology), IPR (InterPro), Pfam and PO (Plant Ontology) have been prepared, as well as the mappings of these term lists to the genes. Gene-to-term mappings are provided using the --terms2features option.

FUNC-E \
  --background arabidopsis_thaliana.TAIR10.genes.txt \
  --query_list modules.txt \
  --outprefix modules-enrichment \
  --terms AraCyc.terms.txt GO.terms.txt IPR.terms.txt \
          Pfam.terms.txt PO.terms.txt \
  --terms2features arabidopsis_thaliana.TAIR10.genes2AraCyc.txt \
                   arabidopsis_thaliana.TAIR10.genes2GO.txt \
                   arabidopsis_thaliana.TAIR10.genes2IPR.txt \
                   arabidopsis_thaliana.TAIR10.genes2Pfam.txt \
                   arabidopsis_thaliana.TAIR10.genes2PO.txt  \
  --ecut 0.01

Additionally, the --ecut option sets the p-value cutoff for enrichment, and the --outprefix option sets a prefix that is added to every output file created by this script.
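
For workflows driven from Python, the same invocation can be assembled as an argument list and handed to `subprocess`. This sketch uses only the flags shown in the usage text above; the `build_command` helper is hypothetical, and the actual call is left commented out because it needs FUNC-E installed and the input files present.

```python
import shlex

def build_command(background, query_list, terms, terms2features,
                  ecut=0.01, outprefix=None):
    """Assemble a FUNC-E command line as a list of arguments."""
    cmd = ["FUNC-E",
           "--background", background,
           "--query_list", query_list,
           "--ecut", str(ecut)]
    cmd += ["--terms", *terms]
    cmd += ["--terms2features", *terms2features]
    if outprefix:
        cmd += ["--outprefix", outprefix]
    return cmd

cmd = build_command(
    background="arabidopsis_thaliana.TAIR10.genes.txt",
    query_list="modules.txt",
    terms=["GO.terms.txt", "IPR.terms.txt"],
    terms2features=["arabidopsis_thaliana.TAIR10.genes2GO.txt",
                    "arabidopsis_thaliana.TAIR10.genes2IPR.txt"],
    outprefix="modules-enrichment",
)
printable = shlex.join(cmd)  # a copy-pasteable shell string

# To actually run it:
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing a list rather than a single string avoids shell quoting problems with unusual file names.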

Output Files

When you run FUNC-E from the command-line, it generates five tab-delimited output files. If you provide an --outprefix argument (e.g., --outprefix modules-enrichment), that prefix will be prepended to each filename. Otherwise, the files will be named with just the FUNC-E. prefix.

1. Enriched Terms File (`*.FUNC-E.enriched_terms.tsv`)

Contains all terms that passed the enrichment p-value cutoff (specified by --ecut). Each row represents one enriched term in one module.

Columns:

Module: The module name from the query list
ID_Space: The namespace of the term (e.g., GO, IPR, KEGG)
Vocabulary: The specific vocabulary within the ID space (e.g., biological_process, molecular_function)
Term: The unique term identifier (e.g., GO:0008150)
Name: Human-readable term name
Module Size: Total number of genes in the module
Count In Module: Number of genes in the module annotated with this term
Count In Background: Total number of genes in the background annotated with this term
Fishers p-value: Raw p-value from Fisher's exact test
Bonferroni: Bonferroni-corrected p-value for multiple testing
Benjamini: Benjamini-Hochberg (FDR) corrected p-value for multiple testing
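
To make the last two columns concrete, here is a textbook sketch of both corrections in plain Python. It is not taken from the FUNC-E source: in particular, which set of tests FUNC-E corrects over (for example per module or per vocabulary) is not stated here, so the sketch simply corrects over whatever list it is given.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that each adjusted value is
    # never larger than the adjusted value of the next-ranked p-value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

raw = [0.001, 0.01, 0.03, 0.04]
bonf = bonferroni(raw)
bh = benjamini_hochberg(raw)
```

Bonferroni is the more conservative of the two, so its values are never smaller than the Benjamini-Hochberg ones for the same input.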

2. Clusters File (`*.FUNC-E.clusters.tsv`)

Summarizes the clusters of enriched terms identified through kappa statistics. Each row represents one cluster.

Columns:

Module: The module name
Cluster Index: Numeric identifier for the cluster (1, 2, 3, ...)
Geometric Mean: Geometric mean of p-values for all terms in the cluster
EASE Score: -log10 of the geometric mean (higher scores indicate stronger enrichment)
Features: List of genes/features that share the enriched terms in this cluster
Enriched Terms: List of term IDs included in this cluster
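
The two score columns follow directly from the p-values of the terms in a cluster. A minimal sketch of that arithmetic, assuming all p-values are strictly greater than zero (the function names and the example p-values are illustrative only):

```python
import math

def geometric_mean(pvalues):
    """Geometric mean, computed in log space for numerical stability."""
    return math.exp(sum(math.log(p) for p in pvalues) / len(pvalues))

def ease_score(pvalues):
    """-log10 of the geometric mean of the p-values."""
    return -math.log10(geometric_mean(pvalues))

# Three made-up term p-values belonging to one cluster.
cluster_pvalues = [1e-4, 1e-6, 1e-5]
gmean = geometric_mean(cluster_pvalues)
score = ease_score(cluster_pvalues)
```

Because of the negative logarithm, a cluster whose terms have smaller p-values receives a larger score.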

3. Cluster Terms File (`*.FUNC-E.cluster_terms.tsv`)

A filtered version of the enriched terms file, containing only terms that were assigned to clusters. This combines information from both the enrichment analysis and clustering steps.

Columns:

Same as the enriched terms file, plus:
Cluster Index: The cluster to which this term belongs

This file is useful for focusing on the most biologically coherent groups of enriched terms, as clustering removes isolated or weakly-related terms.

4. Kappa Scores File (`*.FUNC-E.kappa.tsv`)

Contains pairwise kappa similarity scores between genes that share enriched terms. These scores are used to perform the clustering.

Columns:

Feature1: First gene/feature in the pair
Feature2: Second gene/feature in the pair
Module: The module containing both features
Score: Cohen's kappa score (ranges from -1 to 1; values closer to 1 indicate higher similarity)
Overlap: Number of enriched terms shared between the two features

Only gene pairs with kappa scores above the --similarity_threshold (default 0.35) are included.
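
As an illustration of the Score and Overlap columns, the sketch below computes Cohen's kappa for two genes from their term sets. It is the standard two-rater formula, not code from FUNC-E; the choice of `all_terms` (the universe of terms the two genes are compared over) is an assumption, as is treating the undefined case as 0.

```python
def cohen_kappa(terms_a, terms_b, all_terms):
    """Cohen's kappa for two genes annotated over a shared set of terms."""
    n = len(all_terms)
    both = sum(1 for t in all_terms if t in terms_a and t in terms_b)
    only_a = sum(1 for t in all_terms if t in terms_a and t not in terms_b)
    only_b = sum(1 for t in all_terms if t not in terms_a and t in terms_b)
    neither = n - both - only_a - only_b
    observed = (both + neither) / n
    # Chance agreement: both say "yes" plus both say "no".
    expected = ((both + only_a) * (both + only_b)
                + (neither + only_b) * (neither + only_a)) / (n * n)
    if expected == 1.0:
        return 0.0  # kappa is undefined here; 0 is an arbitrary choice
    return (observed - expected) / (1.0 - expected)

all_terms = {"T1", "T2", "T3", "T4", "T5", "T6"}
gene_a = {"T1", "T2", "T3", "T4"}
gene_b = {"T1", "T2", "T3", "T5"}

kappa = cohen_kappa(gene_a, gene_b, all_terms)
overlap = len(gene_a & gene_b)
```

Two genes with identical annotations score 1, while genes that agree no more often than chance score around 0.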

5. Enriched Features File (`*.FUNC-E.efeatures.tsv`)

Lists all genes/features that have at least one enriched term, along with the terms enriched for that gene.

Columns:

Feature: The gene/feature name
Module: The module containing this feature
Term: A list of all enriched term IDs associated with this feature

This file is useful for quickly identifying which genes contributed to the enrichment signal and what terms they are associated with.

Understanding Enrichment vs. Clustering

FUNC-E performs two distinct analyses: enrichment and clustering. Understanding the difference between these steps is important for interpreting your results.

Enrichment Analysis

Enrichment analysis identifies individual functional terms that are statistically over-represented in your gene list compared to the genomic background. This is done using Fisher's exact test for each term independently.

What it tells you:

Which specific functional annotations appear more frequently in your gene list than expected by chance
The statistical significance of each term (p-value)
How many genes in your list are annotated with each term

Output: The *.enriched_terms.tsv file contains all terms that pass your p-value cutoff (--ecut).

Limitations: Enrichment analysis often produces long lists of related or redundant terms. For example, if your genes are involved in "DNA replication," you might see dozens of enriched terms like "DNA replication," "DNA-dependent DNA replication," "nuclear DNA replication," "DNA replication initiation," etc. While all are statistically significant, they represent overlapping biological concepts.
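
For a single term, the test described above reduces to a 2x2 table built from four counts: background size, genes in the background with the term, module size, and genes in the module with the term. The sketch below computes the one-sided (over-representation) Fisher's exact p-value as a hypergeometric tail using only the standard library. Whether FUNC-E uses a one-sided or two-sided test, or any adjustment to the counts, is not stated here, so treat this as a textbook illustration; the counts are invented.

```python
from math import comb

def fisher_enrichment_pvalue(background_size, count_in_background,
                             module_size, count_in_module):
    """P(X >= count_in_module) when drawing module_size genes at random."""
    total = comb(background_size, module_size)
    upper = min(module_size, count_in_background)
    tail = sum(
        comb(count_in_background, i)
        * comb(background_size - count_in_background, module_size - i)
        for i in range(count_in_module, upper + 1)
    )
    return tail / total

# 20 background genes, 5 carry the term; a module of 4 genes has 3 of them.
pvalue = fisher_enrichment_pvalue(20, 5, 4, 3)
```

The integer arithmetic keeps the sum exact until the final division, which matters for the very small p-values that large backgrounds produce.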

Clustering Analysis

Clustering groups related enriched terms together based on the similarity of the genes annotated with those terms. FUNC-E uses Cohen's kappa statistics to measure how similarly two genes are annotated, then clusters genes (and their associated terms) that share similar annotation patterns. This clustering approach is adapted from the methodology developed by the DAVID (Database for Annotation, Visualization and Integrated Discovery) tool.

What it tells you:

Which groups of enriched terms represent coherent biological themes
Which genes share similar functional annotation patterns
The overall strength of each functional theme (EASE score)

Output: The *.clusters.tsv and *.cluster_terms.tsv files contain the clustered results.

Benefits: Clustering reduces redundancy and helps identify the major biological themes in your data. Instead of reviewing hundreds of individual terms, you can focus on a smaller number of term clusters, each representing a distinct biological process or function.

How They Work Together

Enrichment first: FUNC-E identifies all significantly enriched terms (those with p-value ≤ --ecut)
Kappa calculation: For genes with enriched terms, FUNC-E calculates pairwise similarity scores based on shared term annotations
Clustering: Genes with similar annotation patterns are grouped, and their associated enriched terms form clusters
Filtering: Only clusters meeting size requirements (controlled by --final_group_membership) are retained

Which Results Should You Use?

Use enriched terms (*.enriched_terms.tsv) when:
- You want a complete list of all significant functional annotations
- You need to identify every specific term that is over-represented
- You plan to perform your own downstream filtering or analysis

Use clustered terms (*.cluster_terms.tsv and *.clusters.tsv) when:
- You want a high-level summary of biological themes
- Your enrichment results contain many redundant or related terms
- You need to prioritize the most important functional categories
- You're preparing results for publication or presentation

Note: Not all enriched terms will appear in clusters. Terms that don't cluster well with others (isolated terms with unique annotation patterns) are excluded from the clustering results but remain in the enrichment results. This is by design—clustering focuses on identifying coherent functional themes rather than individual signals.
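
To see which enriched terms were left out of the clusters, the two reports can be compared with pandas. The sketch below builds two tiny tables in memory using column names from the output files described above; the rows are invented, and in practice you would load the real files with `pd.read_csv(path, sep="\t")`.

```python
import pandas as pd

# Stand-ins for *.enriched_terms.tsv and *.cluster_terms.tsv.
enriched = pd.DataFrame({
    "Module": ["Module1", "Module1", "Module1"],
    "Term": ["GO:0006260", "GO:0006261", "IPR000002"],
    "Fishers p-value": [1e-6, 1e-5, 1e-3],
})
cluster_terms = pd.DataFrame({
    "Module": ["Module1", "Module1"],
    "Term": ["GO:0006260", "GO:0006261"],
    "Cluster Index": [1, 1],
})

# Left join with an indicator column, then keep rows found only on the left.
merged = enriched.merge(
    cluster_terms[["Module", "Term"]].drop_duplicates(),
    on=["Module", "Term"],
    how="left",
    indicator=True,
)
unclustered = merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")
```

The `unclustered` table then holds the significant terms that appear only in the enrichment results.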

Using the API

Generate Terms files

Automate it

Bioinformatics tools such as InterProScan, Blast2GO and EnTAP (to name a few) provide the mapping of genes to controlled vocabulary terms, but creating the list of all terms in a vocabulary is still needed prior to enrichment. FUNC-E makes it easy to generate these for common vocabularies such as the Gene Ontology (GO), KEGG (KEGG) and InterPro (IPR).

To use the FUNC-E API to build a list of vocabularies, you must first import the package into your code:

import func_e.vocabs.all as vocabs

To generate a Pandas DataFrame of vocabulary terms from GO, KEGG and IPR use the following function call:

terms = vocabs.getTerms(['GO', 'KEGG', 'IPR'])

Manually Create It

If you are using terms from vocabularies other than GO, KEGG or InterPro, then you must provide the list of terms in a Pandas DataFrame. The columns of this DataFrame must match the terms file described above. The column headers must include ID_Space, Vocabulary, Term, Name, Definition.

Provide Features to Term mapping

Next, you must create a Pandas DataFrame that assigns the feature (e.g., gene, transcript or protein) to terms. The terms must be present in the terms DataFrame created in the previous step. The column headers must be Feature and Term.
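
As a rough sketch, both DataFrames can be built by hand with the column headers named above. The term rows are taken from the term list example earlier in this document; the gene-to-term pairs are placeholders, not real annotations.

```python
import pandas as pd

# Terms table: the five required columns; Definition may be empty.
terms = pd.DataFrame(
    [
        ("GO", "biological_process", "GO:0000001", "mitochondrion inheritance", ""),
        ("IPR", "IPR", "IPR000002", "Cdc20/Fizzy", ""),
    ],
    columns=["ID_Space", "Vocabulary", "Term", "Name", "Definition"],
)

# Feature-to-term table: placeholder assignments for illustration only.
terms2features = pd.DataFrame(
    [
        ("AT1G01010", "GO:0000001"),
        ("AT1G01020", "IPR000002"),
    ],
    columns=["Feature", "Term"],
)

# Every mapped term should exist in the terms table.
missing_terms = set(terms2features["Term"]) - set(terms["Term"])
```

An empty `missing_terms` set means the mapping only refers to terms that are defined in the terms DataFrame.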

Perform Functional Enrichment Analysis

To perform functional enrichment using the FUNC-E API start by importing the FUNC_E class module:

from func_e.FUNC_E import FUNC_E

Next, instantiate a new FUNC_E object:

fe = FUNC_E()

Next, you need to set the p-value cutoff for enrichment testing:

fe.setEnrichmentSettings({
    'ecut': 0.01
})

If you desire, you can change the clustering default settings as well:

fe.setClusteringSettings({
    'similarity_term_overlap': 3,
    'percent_similarity': 0.50,
    'initial_group_membership': 3,
    'multiple_linkage_threshold': 0.50,
    'final_group_membership':  3,
    'similarity_threshold': 0.5
})

The settings have the following meaning:

similarity_threshold: This value is used to threshold the kappa scores. Pair-wise kappa scores are calculated for all genes. Kappa scores range from -1 to 1 and measure the similarity of annotations between two genes. Kappa scores greater than this value are considered meaningful, and only gene pairs with scores greater than this threshold are clustered. The default value if not specified is 0.35.
similarity_term_overlap: Before kappa statistics are calculated two genes must share a specified number of terms. This parameter sets that minimum value. The default is 4.
percent_similarity: Before clustering, seed groups are created, and when creating seed groups we want high quality groups. Therefore, the members of the seed groups must themselves share similarity with all other genes in the group greater than or equal to the value specified by this parameter. The default is 0.50 (50 percent).
initial_group_membership: When clustering, initial seed groups are created by grouping a gene with all other genes with which it has a significant (> similarity_threshold) kappa score. This parameter sets the minimum number of genes that must exist for a group to be considered a seed group. The default value is 4.
multiple_linkage_threshold: After initial seed groups are formed an iterative process attempts to merge seed groups that have a specified percentage of genes in common. This parameter sets this percentage. The default is 0.50 (or seed groups must share 50 percent of genes to be merged).
final_group_membership: This parameter sets the minimum number of terms in a cluster after all clustering. If the cluster has fewer terms it is thrown out. The default value is 4.
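
To give a feel for how similarity_threshold and initial_group_membership interact, the sketch below forms initial seed groups from a list of pairwise kappa scores. It is only an illustration of the description above, not FUNC-E's implementation: it assumes the seed gene itself counts toward the group size, and it does not apply percent_similarity, the merging step, or the final size filter. The gene names and scores are invented.

```python
from collections import defaultdict

def seed_groups(kappa_pairs, similarity_threshold=0.35,
                initial_group_membership=4):
    """Group each gene with its partners whose kappa exceeds the threshold."""
    neighbours = defaultdict(set)
    for gene_a, gene_b, score in kappa_pairs:
        if score > similarity_threshold:
            neighbours[gene_a].add(gene_b)
            neighbours[gene_b].add(gene_a)
    groups = {}
    for gene, partners in neighbours.items():
        members = partners | {gene}  # assumption: the seed gene is counted
        if len(members) >= initial_group_membership:
            groups[gene] = members
    return groups

# (gene, gene, kappa) triples.
kappa_pairs = [
    ("A", "B", 0.8),
    ("A", "C", 0.6),
    ("A", "D", 0.4),
    ("B", "C", 0.2),  # below the threshold, so ignored
    ("E", "F", 0.9),  # too few partners to form a seed group
]
groups = seed_groups(kappa_pairs)
```

Lowering initial_group_membership in this sketch would let smaller groups through, such as the E and F pair.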

Next, FUNC_E can import the files needed for enrichment analysis. These are the same as the example files used in the command-line example above.

fe.importFiles({
    'background': 'arabidopsis_thaliana.TAIR10.genes.txt',
    'query': 'modules.txt',
    'terms2features': ['arabidopsis_thaliana.TAIR10.genes2AraCyc.txt',
                       'arabidopsis_thaliana.TAIR10.genes2GO.txt',
                       'arabidopsis_thaliana.TAIR10.genes2IPR.txt',
                       'arabidopsis_thaliana.TAIR10.genes2Pfam.txt',
                       'arabidopsis_thaliana.TAIR10.genes2PO.txt'],
    'terms': ['IPR.terms.tsv', 'GO.terms.tsv', 'KEGG.terms.tsv']
})

Alternatively, you may have created the terms DataFrame using the vocabs.getTerms() function described above. If so, you can leave out the terms argument in the importFiles() function call above and set the terms manually:

fe.setTerms(terms)

Now that FUNC-E has all of the necessary files and settings, you can perform functional enrichment:

fe.run()

If you only wish to perform enrichment analysis and not clustering you can provide the cluster=False argument:

fe.run(cluster=False)

If you want to limit enrichment to only a subset of modules and/or vocabularies you can provide the modules and vocabs arguments:

fe.run(modules=['module1', 'module2'], vocabs=['GO'])

Once completed you can access results using the following attributes of the FUNC_E object:

fe.enrichment: a Pandas DataFrame containing the results of the enrichment test, including the p-value and the Bonferroni and Benjamini corrected p-values.
fe.clusters: a Pandas DataFrame listing the clusters that were identified, containing the EASE score and geometric mean of p-values.
fe.cluster_terms: a copy of the enrichment report, but with only clustered terms.

Finally, below are example commands to save results to a file:

fe.enrichment.sort_values(['Module', 'Fishers p-value']).to_csv('FUNC-E.enriched_terms.tsv', sep="\t", index=False)

fe.clusters.sort_values(['Module', 'Cluster Index', 'EASE Score']).to_csv('FUNC-E.clusters.tsv', sep="\t", index=False)

fe.cluster_terms.sort_values(['Module', 'Cluster Index', 'Fishers p-value']).to_csv('FUNC-E.cluster_terms.tsv', sep="\t", index=False)

SystemsGenetics/FUNC-E
