Structural genomics programs have developed and applied structure-determination pipelines to a wide range of protein targets, facilitating the visualization of macromolecular interactions and the understanding of their molecular and biochemical functions. [...] taxonomic superkingdoms are distinct. The use of knowledge-based target selection is shown to substantially increase the ability to produce X-ray structures. It is demonstrated that the human proteome has one of the highest attainable coverage values among eukaryotes, and GPCR membrane proteins suitable for X-ray structure determination were identified.

[...] (Smialowski [...]; Overton & Barton, 2006; Chen [...]; Slabinski, Jaroszewski, Rodrigues [...]; Overton [...]; Mizianty & Kurgan, 2009; Price [...]; Kandaswamy [...]; Mizianty & Kurgan, 2011; Overton [...]; Mizianty & Kurgan, 2012; Charoen-kwan [...]; Jahandideh [...]) [...] noncrystallizable) in the training and test sets is below 25%; this is in line with the evaluation protocols used in prior related studies (Mizianty & Kurgan, 2011; Overton [...]).

[...] algorithm (Edgar, 2010). The algorithm could not process sequences longer than 10 000 amino acids, and thus they were removed from the analysis. We utilized several thresholds of protein identity, including 50%, which we used to group functionally similar chains (Addou [...]). We note that the algorithm generates clusters of proteins that are similar above the predefined threshold to a reference (seed) protein in a given cluster and that it defines similarity as the number of identical residues in the alignment divided by the length of the shorter sequence. This means that proteins within a given cluster are likely to have a pairwise sequence similarity above the threshold, although this is not guaranteed. We use this clustering method since it provides a good trade-off between the quality of clustering and low computational cost (Edgar, 2010), which is necessary given the large size of our UniProt data set.
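The similarity definition and the seed-based grouping described above can be illustrated with a minimal sketch. It is not the clustering program used in the study: the function names `identity_shorter` and `greedy_seed_clusters` are hypothetical, the inputs are assumed to be already aligned (equal-length strings with `-` for gaps), and a sequence is assumed to join a cluster when its identity to the seed is at or above the threshold.

```python
from typing import Callable, List


def identity_shorter(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Identical aligned residues divided by the length of the shorter sequence."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Count alignment columns in which both sequences carry the same residue.
    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap
    )
    # Ungapped lengths of the two sequences.
    len_a = len(aligned_a) - aligned_a.count(gap)
    len_b = len(aligned_b) - aligned_b.count(gap)
    shorter = min(len_a, len_b)
    if shorter == 0:
        raise ValueError("sequences must contain at least one residue")
    return identical / shorter


def greedy_seed_clusters(
    seqs: List[str],
    threshold: float,
    identity: Callable[[str, str], float] = identity_shorter,
) -> List[List[str]]:
    """Greedy grouping: each sequence is compared with cluster seeds only.

    The first element of every cluster is its seed. Members are never
    compared with each other, which is why pairwise similarity between
    two members of a cluster is not guaranteed.
    """
    clusters: List[List[str]] = []
    for seq in seqs:
        for cluster in clusters:
            if identity(cluster[0], seq) >= threshold:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])  # no seed matched: seq becomes a new seed
    return clusters


# Toy ungapped sequences of equal length, so the alignment is trivial.
seed, m1, m2 = "AAAAAAAA", "AAAAABBB", "CCCAAAAA"
clusters = greedy_seed_clusters([seed, m1, m2], threshold=0.5)
```

In this toy case both `m1` and `m2` match the seed at 5/8 identity and fall into one cluster, while their identity to each other is only 2/8, which mirrors the caveat that within-cluster pairwise similarity is likely but not guaranteed.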
Using the clustering at 30% sequence identity, each protein sequence in a given cluster is considered structurally covered at a given cutoff of the crystallization propensity score if there is at least one sequence in this cluster with a score higher than or equal to the cutoff. The structures of the remaining sequences in that cluster could be obtained through homology modeling. These clusters are referred to as modeling families. Moreover, the percentage values of coverage are computed with respect to the total number of modeling families in a given analysis, i.e. as the number of structurally solved modeling families divided by the total number of modeling families.

To estimate the current coverage by X-ray structures, we used the algorithm to map proteins from the UniProt data set to the PDB data set. More specifically, we found all proteins from the UniProt data set which have at least one target in the PDB data set that covers no less than 90% of their sequence with no less than 90% sequence identity. As above, we assume that a given cluster (modeling family) can be solved by homology modeling if at least one of its members has such a PDB target, i.e. if a template structure for homology modeling is already available in the PDB. Supplementary Table S2 summarizes the scope of our study, including the number of considered complete proteomes, protein sequences and modeling families across three superkingdoms and viruses.

2.3. Measures to evaluate predictive quality

To estimate the correlation of the inputs (features) used by our predictors with the binary prediction outcome (crystallizable versus noncrystallizable), we use the point biserial correlation coefficient,

$$ r_{pb} = \frac{M_1 - M_0}{s_n} \sqrt{\frac{n_1 n_0}{n^2}}, $$

where $M_1$ and $M_0$ are the mean values of a given feature for the crystallizable and noncrystallizable proteins, respectively, $n_1$ and $n_0$ are the numbers of crystallizable and noncrystallizable proteins, $s_n$ is the standard deviation of the values of a given feature on the entire data set of proteins (both crystallizable and noncrystallizable) and $n$ is the total number of proteins.
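As an illustration of the point biserial correlation described above, the following minimal sketch computes it for one feature in its standard textbook form: the difference between the class means, divided by the standard deviation over the whole data set and scaled by the square root of n1·n0/n². The function name `point_biserial`, the 1/0 label encoding (1 for crystallizable, 0 for noncrystallizable) and the use of the population standard deviation are assumptions of this sketch rather than details confirmed by the text.

```python
import math
from typing import Sequence


def point_biserial(values: Sequence[float], labels: Sequence[int]) -> float:
    """Point biserial correlation between one feature and a binary outcome."""
    n = len(values)
    if n != len(labels):
        raise ValueError("values and labels must have the same length")
    pos = [v for v, y in zip(values, labels) if y == 1]  # crystallizable
    neg = [v for v, y in zip(values, labels) if y == 0]  # noncrystallizable
    n1, n0 = len(pos), len(neg)
    if n1 == 0 or n0 == 0:
        raise ValueError("both classes must be present")
    # Population standard deviation over the entire data set (assumption).
    mean_all = sum(values) / n
    s_n = math.sqrt(sum((v - mean_all) ** 2 for v in values) / n)
    if s_n == 0:
        raise ValueError("feature is constant; correlation is undefined")
    m1 = sum(pos) / n1
    m0 = sum(neg) / n0
    return (m1 - m0) / s_n * math.sqrt(n1 * n0 / n**2)


# Toy feature values for four proteins, two in each class.
r = point_biserial([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1])
```

With this pairing of the population standard deviation and the n1·n0/n² factor, the result equals the Pearson correlation between the feature and the 0/1 labels, so values near +1 or -1 indicate features that separate the two classes well.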
The predictive quality of the crystallization propensity predictors was evaluated using several commonly used measures, including accuracy, sensitivity, specificity and the Matthews correlation coefficient (MCC) (Overton & Barton, 2006; Overton [...]).

[...] We used a machine-learning approach and data from the training data set, annotated with the prediction outcomes, to build the prediction model. Our model predicts whether a given input protein chain would.