Neural network structure
We used fully connected, three-layered feed-forward networks. A hyperbolic activation function was used in the hidden layer and a logistic activation function in the output layer. All networks were trained using the standard back-propagation algorithm. The final networks used three hidden neurons, as this structure combined good performance, low performance variance and good generalization ability owing to the small number of neurons.
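The forward pass of such a network can be sketched as follows. This is a minimal illustration, not the implementation used in the study: it assumes that the "hyperbolic" activation is the hyperbolic tangent, it uses 20 inputs to match the composition vectors described in this section, and the weights are random placeholders rather than trained values.

```python
import math
import random


def logistic(x):
    """Logistic (sigmoid) activation used in the output layer."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass: inputs -> tanh hidden layer -> single logistic output."""
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    return logistic(sum(w * h for w, h in zip(w_out, hidden)) + b_out)


# Placeholder (untrained) weights: 20 inputs, 3 hidden neurons, 1 output.
rng = random.Random(0)
n_inputs, n_hidden = 20, 3
w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
            for _ in range(n_hidden)]
b_hidden = [0.0] * n_hidden
w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
b_out = 0.0

# A uniform composition vector (each residue at 5%) as a dummy input.
score = forward([0.05] * n_inputs, w_hidden, b_hidden, w_out, b_out)
```

The output lies strictly between 0 and 1 and can be read as a score for the presence of a transit peptide.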
In total, 175 amino acid sequences were collected, the positive data set (sequences with an N-terminal mitochondrial transit peptide) consisting of 40 sequences and the negative data set comprising 135 sequences.
Sequences were included in the set of positive examples when they fulfilled at least one of the following criteria:
- They are homologous to proteins exclusively or usually found in mitochondria of other organisms (e.g. proteins of the citric acid cycle, the electron transport chain, ubiquinone biosynthesis).
- In a phylogenetic tree, they branch with proteins known to be mitochondrial.
- In alignments to bacterial proteins, they appear to have N-terminal extensions.
- They have been experimentally shown to localize to the mitochondrion (only dihydroorotate dehydrogenase falls into this category).
A total of 40 positive examples were collected this way.
Negative examples were compiled analogously, by homology modelling using proteins usually found in compartments of eukaryotic organisms other than the mitochondria. A total of 135 sequences were incorporated into the negative data set this way.
Mitochondrial transit peptides vary greatly in length; therefore, in all cases a fixed number of N-terminal amino acids was used for further analysis. In a sample of several hundred mitochondrial transit peptides of other eukaryotes taken from SwissProt, the median length was 31 amino acids, with the first quartile at 24 and the third quartile at 42 amino acids. All data sets were cut to these three lengths. Apart from one cytosolic peptide, none of the sequences had more than 50% sequence identity (determined using JalView). Therefore, no further redundancy reduction seemed necessary, and all sequences were used for further analysis.
For each sequence, a 20-dimensional composition vector was computed containing the relative residue frequencies among the first 24 N-terminal amino acids. These vectors were used for neural network training. In addition, 19-dimensional physicochemical descriptors were calculated, again from the amino acid frequencies of the 24 N-terminal residues. These vectors were used to train another set of three-layered feed-forward networks with three neurons in the hidden layer.
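The composition vector described above can be sketched as follows. The alphabetical ordering of the 20 amino acids, the function name and the handling of non-standard residues (they are ignored, and frequencies are normalised over the standard residues in the window) are assumptions made for illustration; the example sequence is made up.

```python
# One-letter codes of the 20 standard amino acids (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_vector(sequence, n_terminal=24):
    """Relative residue frequencies among the first `n_terminal` residues."""
    window = sequence.upper()[:n_terminal]
    counts = {aa: 0 for aa in AMINO_ACIDS}
    for residue in window:
        if residue in counts:  # skip non-standard symbols such as 'X'
            counts[residue] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Hypothetical sequence, used only to show the call.
example = "MKLLRRALSSAARRLLPKKYFSTTAAAAVVVGGG"
vector = composition_vector(example)  # 20 values that sum to 1
```

Calling the function with `n_terminal=31` or `n_terminal=42` gives the vectors for the other two window lengths.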
An ANN-based prediction system was developed to classify protein sequences from P. falciparum, based on relative amino acid frequencies of the first 24, 31 and 42 N-terminal residues. Three-layered ANNs containing 1 to 50 hidden units were trained in a 20-fold cross-validation with the complete P. falciparum data set (40 positive, 135 negative examples; 89 sequences in the training set and 43 each in the select and test sets).
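The set sizes given above (89 + 43 + 43 = 175) suggest repeated random partitioning into training, select and test sets. The sketch below shows one possible way to draw 20 such partitions. It is an assumption about the procedure, not a description of what the software did, and the function name and seeds are illustrative.

```python
import random


def split_indices(n_total=175, n_train=89, n_select=43, seed=0):
    """Randomly partition sequence indices into training, select and test sets."""
    rng = random.Random(seed)
    indices = list(range(n_total))
    rng.shuffle(indices)
    train = indices[:n_train]
    select = indices[n_train:n_train + n_select]
    test = indices[n_train + n_select:]
    return train, select, test


# Twenty independent random partitions, one per cross-validation run.
splits = [split_indices(seed=s) for s in range(20)]
```

This sketch does not stratify by class; with 40 positive and 135 negative examples, a stratified split may be preferable in practice.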
A network using three hidden neurons was chosen as the best network, because it achieved a high Matthews coefficient (cc = 0.74), together with low variance (σ² = 0.10) and a low number of hidden neurons.
Genetic selection of variables was performed to reduce the set of input descriptors. The parameters chosen in Statistica were 100 iterations, 100 children per iteration, a mutation rate of 1 and a crossover probability of 0.1. Seven amino acids (cysteine, histidine, glutamine, serine, threonine, tryptophan and tyrosine) were found not to improve classification results, and their frequencies were omitted from this point on. Thus, after selection a 13-dimensional input vector of relative amino acid frequencies was used.
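The effect of the selection on the input vector can be illustrated as follows. The sketch only shows the reduction from 20 to 13 dimensions, not the genetic algorithm itself, and it assumes that the 20-dimensional vector is ordered alphabetically by one-letter code.

```python
# One-letter codes of the 20 standard amino acids (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Cys, His, Gln, Ser, Thr, Trp and Tyr were dropped by the variable selection.
OMITTED = set("CHQSTWY")

KEPT_INDICES = [i for i, aa in enumerate(AMINO_ACIDS) if aa not in OMITTED]


def reduce_vector(vec20):
    """Keep only the 13 frequencies retained after variable selection."""
    if len(vec20) != len(AMINO_ACIDS):
        raise ValueError("expected a 20-dimensional composition vector")
    return [vec20[i] for i in KEPT_INDICES]


kept_residues = "".join(AMINO_ACIDS[i] for i in KEPT_INDICES)
reduced = reduce_vector([0.05] * 20)
```

Note that the retained frequencies are not renormalised here, so they no longer need to sum to one; whether the original analysis renormalised them is not stated.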
A 20-fold cross-validation employing the 13-dimensional input vectors was performed, with the classification performance on the select set used as the criterion to end learning. On average, training ended after 436 epochs. A 20-fold cross-validation with randomly chosen data sets and a fixed end of training after 436 epochs gave a Matthews coefficient of cc = 0.74, with 90% correct predictions on average. Sensitivity was 0.94 with a selectivity of 0.68.
Using all 175 sequences for training, a Matthews coefficient of cc = 0.92 was achieved, with a sensitivity of 0.98 and a selectivity of 0.91. Of the 175 sequences, only five were not correctly classified. The four proteins M1 Family Aminopeptidase, Clathrin Coat Assembly Protein, Vacuolar Proton Pumping Pyrophosphatase-2 and the Knob-associated histidine-rich protein (KAHRP) were overpredicted, whereas Fumarase Class 1 was underpredicted. Of the overpredicted sequences, Vacuolar Proton Pumping Pyrophosphatase-2 contains a large N-terminal extension compared to Pyrophosphatase-1. This extension was classified as an mTP, and the protein was therefore misclassified. KAHRP, another false-positive sequence, contains an internal transit peptide. Because we focus on the N-terminal part of the amino acid sequence, we cannot handle internal signal peptides, so sequences like KAHRP cannot be classified correctly. The reason for the misclassification of the remaining two false-positive sequences and the one false-negative sequence is unknown.
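These figures can be checked from the confusion counts implied by the text: 40 positive sequences with one underpredicted (39 true positives, 1 false negative) and 135 negative sequences with four overpredicted (131 true negatives, 4 false positives). The sketch below assumes that selectivity is defined as TP / (TP + FP); this definition is not given in the text but is consistent with the reported value.

```python
import math


def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def sensitivity(tp, fn):
    """Fraction of positive examples that are predicted positive."""
    return tp / (tp + fn)


def selectivity(tp, fp):
    """Fraction of positive predictions that are correct (assumed definition)."""
    return tp / (tp + fp)


# Counts implied by the text for the network trained on all 175 sequences.
tp, fn = 39, 1
tn, fp = 131, 4

cc = matthews_cc(tp, tn, fp, fn)
sens = sensitivity(tp, fn)
sel = selectivity(tp, fp)
print(f"cc={cc:.2f} sensitivity={sens:.3f} selectivity={sel:.2f}")
```

With these counts the coefficient rounds to 0.92, the sensitivity is 39/40 = 0.975 and the selectivity 39/43 rounds to 0.91, in line with the reported values.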
To reduce the number of false-positive results, the relative penalty for false positives to false negatives was set to 3 and the net was retrained in a 10-fold cross-validation study. A Matthews correlation coefficient of cc = 0.51 was obtained. Only one sequence was still overpredicted, while 26 sequences were underpredicted. The overpredicted sequence was again Vacuolar Proton Pumping Pyrophosphatase-2.
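How the penalty enters the training in Statistica is not described here. One common way to realise an asymmetric penalty, shown below purely as an illustration, is to shift the decision threshold on the network output. If the output p is interpreted as the probability of the positive class, predicting "positive" minimises the expected cost only when p exceeds cost_fp / (cost_fp + cost_fn), which is 0.75 for a penalty ratio of 3. The function and parameter names are hypothetical.

```python
def decision_threshold(cost_fp, cost_fn):
    """Output value above which predicting 'positive' has the lower expected cost."""
    return cost_fp / (cost_fp + cost_fn)


def classify(p, cost_fp=3.0, cost_fn=1.0):
    """Predict a transit peptide only if the output p exceeds the cost-based threshold."""
    return p > decision_threshold(cost_fp, cost_fn)


threshold = decision_threshold(3.0, 1.0)  # 0.75 for a 3:1 penalty ratio
```

A higher threshold trades sensitivity for fewer false positives, which matches the direction of the shift reported above (one overpredicted against 26 underpredicted sequences).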