
Zincidentifier

 

This webpage provides datasets and scripts used to apply the trained models to predict zinc-binding sites in proteins in support of the following paper:

Zheng C, Wang M, Takemoto K, Akutsu T, Zhang Z, Song J. An integrative computational framework based on a two-step random forest algorithm improves prediction of zinc-binding sites in proteins, submitted for publication

 


Introduction

Zinc ions usually have catalytic, regulatory or structural roles that are critical for the function of zinc-binding proteins. Because zinc-binding proteins are both abundant and important, accurate prediction of zinc-binding sites is useful not only for inferring protein function but also for predicting 3D structure. In this study, we have developed a new approach that combines multiple sequence and structural properties with graph-theoretic network features, followed by an efficient feature selection procedure, to improve the prediction of zinc-binding sites in proteins.

 

 


Datasets

In this study, we focused only on functional zinc ions (i.e. Zn3, Zn4 and co-catalytic Zn). Cys, His, Glu and Asp (CHED) residues that bind these zinc ions are the positive samples. We randomly divided the positive samples into six parts, one of which was used as the independent test set, while the remaining five parts were used as the positive training set. The negative samples were randomly selected at a negative-to-positive ratio of 6:1. The curated benchmark dataset used for the 5-fold cross-validation tests, the independent test dataset and the apo dataset can be downloaded below:

benchmark_dataset

independent_test_dataset

apo_dataset
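
The sampling scheme described in this section can be illustrated with a minimal Python sketch. It is not the authors' script: the function names, the fixed random seed and the toy sample identifiers are hypothetical, and it assumes the 6:1 ratio is applied to whichever positive set is passed in, which the description above does not specify.

```python
import random

def split_positives(positives, n_parts=6, seed=0):
    """Shuffle the positives, hold out one of n_parts as the test set."""
    rng = random.Random(seed)          # fixed seed: an assumption, for repeatability
    shuffled = list(positives)
    rng.shuffle(shuffled)
    parts = [shuffled[i::n_parts] for i in range(n_parts)]
    test_pos = parts[0]                # one part: independent test set
    train_pos = [s for part in parts[1:] for s in part]  # remaining five parts
    return train_pos, test_pos

def sample_negatives(negatives, n_pos, ratio=6, seed=0):
    """Randomly pick ratio * n_pos negatives (fewer if not enough are available)."""
    rng = random.Random(seed)
    n_neg = min(len(negatives), ratio * n_pos)
    return rng.sample(list(negatives), n_neg)

# Toy identifiers, purely illustrative.
positives = [f"pos_{i}" for i in range(60)]
negatives = [f"neg_{i}" for i in range(1000)]

train_pos, test_pos = split_positives(positives)
train_neg = sample_negatives(negatives, len(train_pos))
print(len(train_pos), len(test_pos), len(train_neg))
```

With 60 toy positives, five parts (50 residues) go to training, one part (10 residues) is held out, and 300 negatives are drawn for the training set.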

 


Zincidentifier source code

This software is an improved tool based on the random forest (RF) algorithm for identifying zinc-binding sites in proteins. It focuses on four types of residues in the target proteins: Cys, His, Glu and Asp (CHED). The source code of zincidentifier can be downloaded at this link.

Interested users should follow the procedure below to use the source code:
1. Install a Linux operating system, Perl and R. You can download Perl and R here.
2. Install the R package "randomForest" by running the command install.packages("randomForest") in R. Note that the package name must be quoted.
3. Download the file benchmark_dataset from the Datasets section and the files zincidentifier.pl, zinc_classifier.R and input from this section. Put them in the same folder.
4. "benchmark_dataset.txt" is the training file and "input.txt" is the test file. The data in the test file are first normalized by "zincidentifier.pl", which generates the file "input.txt.norm". The random forest-based classifier is then used to make the prediction and outputs a prediction score for each residue. The RF classifier is run 100 times to generate 100 output scores. The zincidentifier.pl script calculates the average of these 100 predictions and generates a final score between "-1" and "1", where "-1" denotes a non-zinc-binding residue and "1" denotes a zinc-binding residue.
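
The averaging in step 4 can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the code in zincidentifier.pl: the function name and the toy scores are hypothetical, and three runs stand in for the 100 runs used by the tool.

```python
from statistics import mean

def average_scores(runs):
    """Average per-residue scores over repeated classifier runs.

    runs: list of runs, each a list with one score per residue in [-1, 1].
    Returns one averaged score per residue.
    """
    n_residues = len(runs[0])
    if any(len(run) != n_residues for run in runs):
        raise ValueError("every run must score the same number of residues")
    return [mean(run[i] for run in runs) for i in range(n_residues)]

# Toy data: 3 runs (instead of 100) over 2 residues.
runs = [
    [1, -1],
    [1, -1],
    [-1, -1],
]
final_scores = average_scores(runs)
print(final_scores)
```

Because each run contributes a value between -1 and 1, the average stays in the same range; a residue scored consistently across runs ends up close to -1 or 1, while disagreement between runs pulls the final score toward 0.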

For example, run zincidentifier using the following command:

perl zincidentifier.pl input.txt

The result file will be generated in the current directory. If you have any problems getting it to work, refer to the readme file for help or send us an email (the email address is in the readme file).
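
If the command fails, a common cause is a file missing from the working folder. The following optional Python check is a hypothetical helper, not part of the zincidentifier distribution; the file names are taken from steps 3 and 4 above, and it assumes the downloaded files carry exactly those names.

```python
from pathlib import Path

# File names as given in the procedure above (assumed to match the downloads).
REQUIRED = [
    "zincidentifier.pl",
    "zinc_classifier.R",
    "benchmark_dataset.txt",
    "input.txt",
]

def missing_files(folder="."):
    """Return the required files that are not present in the folder."""
    return [name for name in REQUIRED if not (Path(folder) / name).is_file()]

missing = missing_files(".")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All required files found; run: perl zincidentifier.pl input.txt")
```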


zincidentifier.pl

zinc_classifier.R

input

 


Copyright @ 2012 College of Biological Sciences, China Agricultural University and Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences