From d06b55e690a112930c5324ab4f7745b8e7ee14f6 Mon Sep 17 00:00:00 2001 From: Yong Jung Date: Mon, 1 Jul 2019 10:06:57 -0400 Subject: [PATCH] Add files via upload --- docs/doc.rst | 4 ++-- docs/graph.rst | 12 ++++++------ docs/install.rst | 13 +++++++------ docs/intro.rst | 7 ++++--- docs/pssm.rst | 10 +++++----- docs/viz.rst | 10 +++++----- docs/workflow.rst | 19 +++++++++++-------- 7 files changed, 40 insertions(+), 35 deletions(-) diff --git a/docs/doc.rst b/docs/doc.rst index f199aa7..291773d 100644 --- a/docs/doc.rst +++ b/docs/doc.rst @@ -1,7 +1,7 @@ Documentation ************** -All the prototypes of the class/methods are here specified +All prototypes of the class/methods used here are specified. .. toctree:: :maxdepth: 4 @@ -30,4 +30,4 @@ SVM .. automodule:: iScore.rank :members: - :undoc-members: \ No newline at end of file + :undoc-members: diff --git a/docs/graph.rst b/docs/graph.rst index 14d5844..221bce1 100644 --- a/docs/graph.rst +++ b/docs/graph.rst @@ -6,9 +6,9 @@ Generating the Graphs : ---------------------------- -The first step in iSCore is to generate the connections graph of the itnerface. In this graph each node is represented by the PSSM of a residue. The nodes are connected if they form a contact pair between the two proteins. +The first step in iScore is to generate the bipartite graph of the interface of a decoy model. In the generated graph, each node is encoded with the PSSM profile of a residue. The nodes are connected if they form a contact pair between two proteins of the decoy model. -To create the graph one needs the PDB file of the interface and the two PSSM files (one for each chain) created by the PSSMGen tool. To generate the graph simply use : +To create the graph, a PDB file for the interface and two separate PSSM files (one for each protein chain) created by the PSSMGen tool are required. 
To generate the graph, simply use : >>> from iScore.graph import GenGraph, Graph >>> @@ -19,14 +19,14 @@ To create the graph one needs the PDB file of the interface and the two PSSM fil >>> g.construct_graph() >>> g.export_graph('name.pkl') -This simple example will construct the connection graph and export it in a pickle file. A working example can be found in ``example/graph/create_graph.py`` +This simple example constructs the bipartite graph and exports it into a pickle file. A working example can be found in ``example/graph/create_graph.py`` -The function ``iscore_graph()`` facilitate the generation of a large number of conformations. By default this function will create the graphs of all the conformations stored in the subfolder ``./pdb/`` using the pssm files stored in the subfolder ``./pssm/``. The resulting graphs will be stored in the subfolder ``./graph/``. +The function ``iscore_graph()`` facilitates the generation of graphs for a large number of conformations. By default, this function creates the graphs of all conformations stored in the subfolder ``./pdb/`` using the PSSM files stored in the subfolder ``./pssm/``. The resulting graphs will be stored in the subfolder ``./graph/``. Generating the Graph Kernels : ------------------------------------- -Once we have calculated the graphs of multiple conformation we can simply compute the kernel of the different pairs using iScore. An example can be found at ``example/kernel/create_kernel.py`` +Once we obtain the graphs of conformations, we can simply compute the kernel of the different pairs using iScore. An example can be found at ``example/kernel/create_kernel.py`` >>> from iScore.graph import Graph, iscore_graph >>> from iScore.kernel import Kernel @@ -41,4 +41,4 @@ Once we have calculated the graphs of multiple conformation we can simply comput >>> # run the calculations >>> ker.run(lamb=1.0,walk=4,check=checkfile) -The kernel between the two graphs computed above is calculated with the class `Kernel()`. 
By default the method `Kernel.import_from_mat()` will read all the graphs stored in the subfolder `graph/`. To compute all the pairwise kernels of the graphs loaded above we can simply use the method `Kernel.run()`. We can here specify the value of lambda and the length of the walk. \ No newline at end of file +The kernel between the two graphs is computed by the ``Kernel()`` class. By default, the method ``Kernel.import_from_mat()`` imports all the graphs stored in the subfolder ``graph/``. To compute all pairwise kernels of the loaded graphs, we can simply use the method ``Kernel.run()`` (Yong: which one is correct, ker.run or Kernel.run?). Users can set a lambda value and a walk length as parameters. diff --git a/docs/install.rst b/docs/install.rst index e45cd77..5c60081 100644 --- a/docs/install.rst +++ b/docs/install.rst @@ -17,19 +17,20 @@ Test the installation To test the module go to the test folder ``cd ./test`` and execute the following test : ``pytest`` These tests are automatically run on Travis CI at each new push. -So if the build button display passing they should work ! +So if the build button displays passing they should work ! (Yong: I am not sure what this sentence means) -Requiried Dependencies + +Required packages ------------------------ -The code is written in Python3. Several packages are required to run the code but most are pretty standard. Here is an non-exhaustive list of dependencies +The code is written in Python3. Several packages are required to run the code. Here is a list of the required packages. 
* Numpy * Biopython - * libsvm - * mpi4py - * pdb2sql + * libsvm (https://github.com/cjlin1/libsvm/tree/master/python) + + * pdb2sql (https://github.com/DeepRank/pdb2sql) diff --git a/docs/intro.rst b/docs/intro.rst index 89f6136..ff5b7e0 100644 --- a/docs/intro.rst +++ b/docs/intro.rst @@ -3,16 +3,17 @@ Introduction ============================= -**Support Vector Machine on Graph Kernels for Protein-Protein Docking Scoring** +**iScore: an MPI-supported software for ranking protein-protein docking models based on a random walk graph kernel and +support vector machines** The software supports the publication of the following articles: C. Geng *et al.*, *iScore: A novel graph kernel-based function for scoring protein-protein docking models*, bioRxiv 2018, https://doi.org/10.1101/498584 -iScore uses a support vector machine (SVM) approach to rank protein-protein interfaces. Each interface is represented by a connection graph in which each node represents a contact residue and each edge the connection between two contact residues of different proterin chain. As feature, the node contains the Position Specific Similarity Matrix (PSSM) of the corresponding residue. +iScore uses a support vector machine (SVM) approach to rank protein-protein docking models using their interface information. Each interface is represented as a bipartite graph, in which each node represents a contact residue and each edge denotes that its two nodes are close to each other in 3D space (the current cutoff is 6 Å). Currently, edges are not labeled, and each node is labeled with a 20 by 1 vector from the Position Specific Scoring Matrix (PSSM) of the corresponding residue. -To measure the similarity between two graphs, iScore use a random walk graph kernel (RWGK) approach. These RWGKs are then used as input of the SVM model to either train the model on a training set or use a pretrained model to rank new protein-protein interface. 
+To measure the similarity between two graphs, iScore uses a random walk graph kernel (RWGK) approach. The graph kernel matrix for all graph pairs is then used as input to the SVM model to either train the model on a training set or use a pretrained model to rank new protein-protein docking models. .. image :: comp.png diff --git a/docs/pssm.rst b/docs/pssm.rst index 7ba52cb..ac49774 100644 --- a/docs/pssm.rst +++ b/docs/pssm.rst @@ -1,15 +1,15 @@ Computing PSSM files ============================= -As a prepocessign step one must compute the PSSM files corespondng to the PDB files in the training/testing dataset. Thiscan be acheived with the PisBLast library (https://ncbiinsights.ncbi.nlm.nih.gov/2017/10/27/blast-2-7-1-now-available/). The library BioPython allows ane asy use of these libraries. +As a preprocessing step, users must compute the PSSM files corresponding to the PDB files in the training/testing dataset. This can be achieved with the PSI-BLAST library (https://ncbiinsights.ncbi.nlm.nih.gov/2017/10/27/blast-2-7-1-now-available/). The Biopython package makes the library easy to use. -iScore contains wrapper that allows to compute the PSSM data, map them to the PDB files and format them for further processing. The only input needed is the PDB file of the decoy. To compute the PSSM file one can simply use : +iScore contains a wrapper that computes the PSSM data, maps them to the PDB files, and formats them for further processing. The only input needed is the PDB file of the decoy. 
To compute the PSSM file one can simply use : ->>> from iscore.pssm.pssm import PSSM +>>> from iScore.pssm.pssm import PSSM >>> ->>> gen = PSSM('1AK4') +>>> gen = PSSM(caseID='1AK4', pdb_dir='1AK4/pdb') >>> >>> # generates the FASTA query >>> gen.get_fasta() @@ -21,4 +21,4 @@ iScore contains wrapper that allows to compute the PSSM data, map them to the PD >>> gen.get_pssm() >>> >>> # map the pssm to the pdb ->>> gen.map_pssm() \ No newline at end of file +>>> gen.map_pssm() diff --git a/docs/viz.rst b/docs/viz.rst index 1642c98..64411a1 100644 --- a/docs/viz.rst +++ b/docs/viz.rst @@ -1,7 +1,7 @@ Visualizing the connection graphs ====================================== -iSore allows to easily visualize the connection graphs using the HDF5 browser provided with the software and pymol. First the connections graphs must be stored in a HDF5 file. To do that simply generate the graphs as following: +iScore makes it easy to visualize the bipartite graphs using the HDF5 browser provided with the software and PyMOL. First, the bipartite graphs must be stored in an HDF5 file. To do so, generate the graphs as follows: >>> from iScore.graphrank.graph import iscore_graph @@ -9,12 +9,12 @@ iSore allows to easily visualize the connection graphs using the HDF5 browser pr >>> pssm_path=, >>> export_hdf5=True) -where you have to specify the folder containing the PDB files abd PSSM files in pdb_path and pssm_path. By default this are simply ``./pdb/`` and ``./pssm/``. The script above will create a HDF5 file containing the graph. +where you have to specify the folders containing the PDB files and PSSM files in ``pdb_path`` and ``pssm_path``. By default, these are set to ``./pdb/`` and ``./pssm/``. The script above creates an HDF5 file containing the graphs. -This HDF5 cile can be explored using the the dedicated HDF5 browser. Go to the ``./h5x/`` folder and type: +The generated HDF5 file can be opened using the HDF5 browser. 
To do so, go to the ``./h5x/`` folder and type: ``./h5x.py`` -This will open the hdf5 browser. You can open a hdf5 file by clicking on the file icon in the bottom left of the browser. Once opened, you will see the content of the file in the browser. Right-click on the name of a conformation and choose ``3D Plot``. This will open PyMol and allow you to visualize the connecton graph +You can open an HDF5 file by clicking on the file icon in the bottom left of the browser. Once it is opened, you can see the content of the file in the browser. Right-click on the name of a conformation and choose ``3D Plot``. This will open PyMOL and allow you to visualize the bipartite graph. -.. image :: h5x_iscore.png \ No newline at end of file +.. image :: h5x_iscore.png diff --git a/docs/workflow.rst b/docs/workflow.rst index 2416e4e..4c4f646 100644 --- a/docs/workflow.rst +++ b/docs/workflow.rst @@ -1,23 +1,26 @@ iScore Workflow ======================== -One of the mainfeature of the software are the serial and MPI binaries that fully automatize the workflow and that can be used directly from the command line. To illustrate the use of these binaries go to the folder ``iScore/example/training_set/``. This folder contains the subfolders ``pdb/`` and ``pssm/`` that contain the PDB and PSSM files of our training set. The binary class corresponding to these PDBs are specified in the file 'caseID.lst'. +One of the main features of the iScore software is the set of serial and MPI binaries that fully automate the workflow and can be used directly from the command line. To illustrate the use of these binaries, go to the folder ``iScore/example/training_set/``. This folder contains the subfolders ``pdb/`` and ``pssm/`` that hold the PDB and PSSM files for our training set (xue: this folder contains also a folder of `test` and caseID.lst). The binary classes corresponding to these PDBs are specified in the file ``caseID.lst``. 
-Training a model using iScore can be done in a single line using MPI binaries with the command : +=== train === +Training a model using iScore can be done in a single line using MPI binaries with the command : +``$ cd iScore/example/training_set/train`` (xue: I added this line.) ``$ mpiexec -n 2 iScore.train.mpi`` +This command will first generate the graphs of the conformations stored in ``pdb/`` using the corresponding PSSMs contained in ``pssm/`` as features. These graphs will be stored as a pickle file in ``graph/``. The command will then compute the pairwise kernels of these graphs and store the kernel files in ``kernel/``. Finally, an SVM model will be trained using the kernel files and the ``caseID.lst`` file that contains the binary class of each conformation. -This command will first generate the graphs of the conformations stored in ``pdb/`` using the PSSM contained in ``pssm/`` as features. These graphs will be stored as pickle file in ``graph/``. The command will then compute the pairwise kernels of these graphs and store the kernel files in ``kernel/``. Finally it will train a SVM model using the kernel files and the ``caseID.lst`` file that contains the binary class of the model. +The calculated graphs and the SVM model are stored in a single tar file, here called ``training_set.tar.gz``. This file contains all the information needed to predict binary classes of decoy models in a test set using the trained model. -The calculated graphs and the svm model are stored in a single tar file called here ``training_set.tar.gz``. This file contains all the information needed to predict binary classes of a test set using the trained model. +=== test === -To predict binary classes (and decision values) of new conformations go to the subfoler ``test/``. Here 5 conformations are specified by the PDB and PSSM files stored in ``pdb/`` and ``pssm/`` that we want to use as a test set. 
Ranking these conformations can be done in a single command using : +To predict binary classes (and decision values) of new conformations, go to the subfolder ``test/``. Here 5 conformations are specified by the PDB and PSSM files stored in ``pdb/`` and ``pssm/`` that we want to use as a test set. Ranking these conformations can be done in a single command using : -``$ mpiexec -n 2 iScore.predict.mpi --archive ../training_set.tar.gz`` +``$ mpiexec -n 2 iScore.predict.mpi --archive ../train/training_set.tar.gz`` -This command will use first compute the graph of the comformation in the test set and store them in `graph/`. The binary will then compute the pair wise kernels of each graph in the test set with all the graph contained in the training set that are stored in the tar file. These kernels will be stored in ``kernel/``. Finally the binary will use the trained SVM model contained in the tar file to predict the binary class and decision value of the conformations in the test set. The results are then stored in a text file and a pickle file ``iScorePredict.pkl`` and ``iScorePredict.txt``. Opening the text file you will see : +This command will first compute the graphs of the conformations in the test set and store them in ``graph/``. The binary will then compute the pairwise kernels between each graph in the test set and all the graphs of the training set that are stored in the tar file. These kernels will be stored in ``kernel/``. Finally, the binary will use the trained SVM model contained in the tar file to predict the binary classes and decision values of the conformations in the test set. The results are then stored in a pickle file ``iScorePredict.pkl`` and a text file ``iScorePredict.txt``. 
Opening the text file you will see : +--------+--------+---------+-------------------+ |Name | label| pred| decision_value| @@ -40,7 +43,7 @@ The ground truth label are here all None because they were not provided in the t Serial Binaries ------------------------ -Serial binaries are also provided and can be used in a similar way than the MPI binaries : ``iscore.train`` and ``iscore.predict`` +Serial binaries are also provided and can be used in the same way as the MPI binaries (Yong: it needs to be rewrite. A bit unclear what it means) : ``iscore.train`` and ``iscore.predict``
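
The table layout shown for ``iScorePredict.txt`` in the workflow hunk can be read back into Python for further processing. The following is a minimal sketch, assuming the file uses the same ``+---+`` borders and ``|``-separated cells as the header shown above. The function name ``parse_prediction_table`` and the sample rows are hypothetical and are not part of iScore; only the column names come from the docs.

```python
def parse_prediction_table(text):
    """Parse an ASCII table with '+---+' borders and '|'-separated cells.

    Returns a list of dicts keyed by the header cells. All values stay strings.
    """
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Only lines starting with '|' carry cells; borders and blanks are skipped.
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first cell line is the header row
        else:
            rows.append(dict(zip(header, cells)))
    return rows


# Hypothetical sample in the same layout; names and values are made up.
SAMPLE = """
+--------+--------+---------+-------------------+
|Name    |   label|     pred|     decision_value|
+--------+--------+---------+-------------------+
|decoy_a |    None|        0|             -0.500|
|decoy_b |    None|        1|              0.250|
+--------+--------+---------+-------------------+
"""

records = parse_prediction_table(SAMPLE)

# Order by decision value, largest first. Which direction means "better"
# is an assumption here; flip `reverse` if the convention is the opposite.
ranked = sorted(records, key=lambda r: float(r["decision_value"]), reverse=True)

for r in ranked:
    print(r["Name"], r["pred"], r["decision_value"])
```

To use it on a real run, pass the contents of ``iScorePredict.txt`` (for example ``open('iScorePredict.txt').read()``) instead of ``SAMPLE``. The pickle file is left aside here because its internal structure is not described in these docs.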