BMC Bioinformatics, 11, 522

Structuring Heterogeneous Biological Information Using Fuzzy Clustering of K-Partite Graphs

Mara L Hartsperger et al. BMC Bioinformatics.

Abstract

Background: Extensive and automated data integration in bioinformatics facilitates the construction of large, complex biological networks. However, the challenge lies in the interpretation of these networks. While most research focuses on the unipartite or bipartite case, we address the more general but common situation of k-partite graphs. These graphs contain k different node types and links are only allowed between nodes of different types. In order to reveal their structural organization and describe the contained information in a more coarse-grained fashion, we ask how to detect clusters within each node type.

Results: Since entities in biological networks regularly have more than one function and hence participate in more than one cluster, we developed a k-partite graph partitioning algorithm that allows for overlapping (fuzzy) clusters. It determines for each node a degree of membership to each cluster. Moreover, the algorithm estimates a weighted k-partite graph that connects the extracted clusters. Our method is fast and efficient, mimicking the multiplicative update rules commonly employed in algorithms for non-negative matrix factorization. It facilitates the decomposition of networks on a chosen scale and therefore allows for analysis and interpretation of structures on various resolution levels. Applying our algorithm to a tripartite disease-gene-protein complex network, we were able to structure this graph on a large scale into clusters that are functionally correlated and biologically meaningful. Locally, smaller clusters enabled reclassification or annotation of the clusters' elements. We exemplified this for the transcription factor MECP2.

Conclusions: To cope with the overwhelming amount of information available from biomedical literature, we need to tackle the challenge of finding structures in large networks with nodes of multiple types. To this end, we presented a novel fuzzy k-partite graph partitioning algorithm that allows the decomposition of these objects in a comprehensive fashion. We validated our approach on both artificial and real-world data. It is readily applicable to further problems that can be represented as k-partite graphs.

Figures

Figure 1
Illustration of the fuzzy clustering approach. We want to approximate the tripartite example graph G in (a) by a smaller tripartite cluster network H, the so-called backbone graph (b). The decomposition into fuzzy clusters connected by this backbone must explain the original connectivity as well as possible. The edges of G are collected in adjacency matrices A(ij) connecting the elements of partitions i and j. The approximation of G by the backbone graph is encoded in the adjacency matrices B(ij) connecting the fuzzy node clusters C(i). Each matrix C(i) collects the degrees of membership of each node of partition Vi to each cluster of this type: its (k, l)-th element Ckl(i) specifies how much node k belongs to backbone node l.
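The approximation described in this caption can be sketched numerically. The block below is a minimal illustration, not the authors' implementation: it assumes that the quality of the fit is measured by the summed squared difference between each A(ij) and the product C(i) B(ij) C(j)^T. The cost function f(H, C) of equation (1) is not reproduced in this excerpt, so this choice of error measure, the dictionary layout, and the name `backbone_cost` are all assumptions made for illustration.

```python
import numpy as np

def backbone_cost(A, B, C):
    """Summed squared residual between each A[(i, j)] and C[i] @ B[(i, j)] @ C[j].T.

    A and B are dicts keyed by partition pairs (i, j); C is a dict keyed by
    partition index. The squared-error measure is an assumption, not taken
    from the source.
    """
    total = 0.0
    for (i, j), A_ij in A.items():
        approx = C[i] @ B[(i, j)] @ C[j].T
        total += float(np.sum((A_ij - approx) ** 2))
    return total

# Hypothetical toy data: 4 nodes / 2 clusters in partition 1,
# 6 nodes / 3 clusters in partition 2. Rows of C are membership degrees.
C = {
    1: np.array([[1.0, 0.0], [1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]),
    2: np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
                 [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.5, 0.5]]),
}
B = {(1, 2): np.array([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]])}

# A graph generated exactly by the backbone has zero cost.
A = {(1, 2): C[1] @ B[(1, 2)] @ C[2].T}
cost_exact = backbone_cost(A, B, C)

# Perturbing a single edge weight by 1 raises the cost by 1.
A_noisy = {(1, 2): A[(1, 2)].copy()}
A_noisy[(1, 2)][0, 0] += 1.0
cost_noisy = backbone_cost(A_noisy, B, C)
```

For a k-partite graph, the dictionaries simply hold one entry per connected pair of partitions, so the same function covers the tripartite example in the figure.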
Figure 2
Fuzzy clustering algorithm. Summary of the final fuzzy k-partite clustering algorithm.
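The algorithm itself is shown only in the figure. As a rough orientation, the following sketch applies the standard Lee-Seung-style multiplicative updates for a non-negative three-factor decomposition A ≈ C1 B C2^T to the bipartite case. It is an assumption-laden stand-in: the authors' actual update rules, any normalization of the membership matrices, and their stopping criterion are not given in this excerpt, and the function name and parameters are hypothetical.

```python
import numpy as np

def fuzzy_bipartite_sketch(A, m1, m2, n_iter=200, eps=1e-9, seed=0):
    """Factor a non-negative n1 x n2 matrix A as C1 @ B @ C2.T.

    Uses generic multiplicative updates for squared error (assumed here,
    not taken from the source). Returns the factors and the cost history.
    """
    rng = np.random.default_rng(seed)
    n1, n2 = A.shape
    # Random positive initialization; results depend on this choice.
    C1 = rng.random((n1, m1)) + eps
    C2 = rng.random((n2, m2)) + eps
    B = rng.random((m1, m2)) + eps
    costs = []
    for _ in range(n_iter):
        # Each factor is rescaled elementwise, which keeps it non-negative.
        C1 *= (A @ C2 @ B.T) / (C1 @ B @ C2.T @ C2 @ B.T + eps)
        C2 *= (A.T @ C1 @ B) / (C2 @ B.T @ C1.T @ C1 @ B + eps)
        B *= (C1.T @ A @ C2) / (C1.T @ C1 @ B @ C2.T @ C2 + eps)
        costs.append(float(np.sum((A - C1 @ B @ C2.T) ** 2)))
    return C1, B, C2, costs

# Hypothetical 4 x 6 bipartite adjacency matrix with overlapping blocks.
A = np.array([[1, 1, 1, 0, 0, 0],
              [1, 1, 1, 1, 0, 0],
              [0, 0, 1, 1, 1, 1],
              [0, 0, 0, 1, 1, 1]], dtype=float)
C1, B, C2, costs = fuzzy_bipartite_sketch(A, m1=2, m2=3)
```

Because the outcome depends on the random initialization, one would typically run such a routine with several seeds and keep the factorization with the lowest final cost.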
Figure 3
Illustration of the cluster decomposition of a bipartite toy example. (a) We demonstrate the graph decomposition with our algorithm on a small bipartite graph with overlapping cluster structure. The original graph consists of partitions V1 = {1 ... 4} (red filled nodes) and V2 = {5 ... 10} (blue filled nodes) connected by edges A(12) colored in black. We decomposed it into two clusters for partition V1 and three clusters for partition V2. The resulting fuzzy clustering is illustrated as a weighted graph connecting original nodes to cluster nodes (framed red and blue). The cluster assignments C(1) and C(2) are indicated by dashed lines, where the coloring corresponds to the degree of cluster membership. The interconnections of the clusters form the backbone graph, encoded in the adjacency matrix B(12), which we denote by continuous lines whose color indicates the edge weight. Another way of illustrating the graph decomposition is shown in (b). It is clearer, especially for larger graphs. First, we plot hierarchical clusterings of the nodes' degrees of membership in partitions V1 and V2 (encoded by C(1) and C(2)). This facilitates the identification of overlapping clusters (e.g. nodes 1 and 10 are assigned to more than one cluster) or hard cluster assignments (e.g. node 5). The backbone graph B(12) is shown at the bottom right. This backbone graph is densely connected in our example.
Figure 4
Performance on toy models. We validated our algorithm on graphs with predefined cluster structure. To this end, we compared it with the hard clustering method by Long et al. on four different random toy models, see Table 1. The plot shows the mean relative deviation between the two algorithms relative to the results of the hard clustering. Error bars denote standard deviations over 1000 runs. We see that the fuzzy cluster assignments of our method require much more runtime, but both cost function and data estimation error (see Methods) are significantly smaller. The large standard deviations show the dependency of the decomposition on the random initial conditions. Therefore, by default we perform multiple restarts with different initializations.
Figure 5
Decomposition of a gene-disease-protein complex network. We integrated the gene-disease network from [3] with human protein complexes from the CORUM database [16]. This resulted in a layered tripartite graph, which is schematically drawn in (a). We performed a 10-fold approximation of this graph to estimate appropriate numbers of clusters. The boxplot curve (b) shows how the cost function f(H, C) from equation (1) depends on the number of gene clusters mg. The true minima of the cost function decrease with mg, and this is also visible in the approximated minima obtained with our proposed algorithm. Therefore, we are able to identify structures on various resolution levels. The detail views show the course of the cost function for large-scale clustering (i) and for a small-scale decomposition (ii), respectively. For our detailed analyses, we used the decompositions showing steep drops in the cost function, marked by the red and green boxes.
Figure 6
Illustration of large-scale cluster structures in the gene-disease-protein complex network. The large-scale decomposition of the gene-disease-protein complex network is illustrated as described in Figure 3b. The hierarchical clusterings of the nodes' degrees of membership for the (a) complex, (c) gene and (d) disease partitions show that the majority of elements were assigned to single clusters. However, a considerable number of cluster overlaps exist, e.g. for disease clusters 3 and 4. The backbones for gene-complex (b) and for gene-disease (e) are sparsely connected, but show that locally overlapping clusters tend to interconnect with the same clusters of the other partition; e.g. disease clusters 3 and 4 are both connected to gene cluster 9 with large weights.
Figure 7
Evaluation of the backbone of the gene-disease-protein complex network. To evaluate the large-scale clustering we additionally included functional annotations. (a) and (b) compare the gene-complex backbone graph with the functional correlations of the extracted clusters according to FunCat annotation. Similarly, (d) and (e) show the gene-disease backbone and the clusters' disorder class correlations (see Methods). We see that interconnected clusters also seem to correlate in their annotations. To test this hypothesis rigorously, we calculated difference scores as defined in Methods in order to quantify the correlation of the backbones and their annotations, respectively. Vertical lines in (c) and (f) correspond to these difference scores for the fuzzy (black) and the hard (red) clustering. Comparing these values to the difference scores for 10^5 randomized cluster assignments, we obtain significant p-values, both < 10^-5. The correlation between annotations of connected clusters of the backbone is higher when applying the fuzzy approach.
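The comparison against randomized cluster assignments amounts to an empirical p-value. The sketch below shows one common way to compute it. The difference score itself is defined in Methods and is not reproduced here, so the null scores are simulated stand-ins, and the direction of the test (larger score counts as more extreme) and the add-one correction are assumptions rather than details confirmed by the source.

```python
import numpy as np

def empirical_p_value(observed, null_scores):
    """Fraction of null scores at least as large as the observed score.

    The +1 in numerator and denominator is a common convention that keeps
    the estimate above zero; it is an assumption, not taken from the source.
    """
    null_scores = np.asarray(null_scores, dtype=float)
    n_extreme = int(np.sum(null_scores >= observed))
    return (n_extreme + 1) / (null_scores.size + 1)

# Stand-in null distribution: in the setting of the caption these would be
# difference scores of randomized cluster assignments.
rng = np.random.default_rng(0)
null = rng.normal(size=1000)

# An observed score far above every null score gives the smallest
# attainable value, 1 / (N + 1).
p = empirical_p_value(10.0, null)
```

Under this convention, the smallest value attainable with 10^5 randomizations is 1/(10^5 + 1), which is just below 10^-5.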
Figure 8
The small-scale clustering in the neighborhood of MECP2. We draw the results - the backbone network and the nodes' degrees of membership to clusters, thresholded by μ > 0.2 - of the small-scale clustering in the neighborhood of MECP2 using the fuzzy (a) and the hard clustering (b). Nodes are colored according to their disorder class annotations. Blue edges indicate backbone interconnectivity, grey edges cluster assignment. Edge thickness indicates the degree of membership. MECP2 is connected to three gene clusters mainly covering neurological (red) and psychiatric (purple) genes. The seven interconnected disease clusters also represent mainly psychiatric and neurological disorders. Unclassified disorders are also present, such as alcohol dependence (white), which is classified as a mental and behavioral disorder; in a broader sense, however, it can be considered a psychiatric disorder. Applying the hard clustering (b), MECP2 is assigned to gene cluster 209, which is connected to only two disease clusters. Although all associated disorders are identified correctly, in contrast to the fuzzy clustering no further information can be obtained from the decomposition.
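The thresholding step mentioned in this caption can be sketched as follows: only node-to-cluster memberships with degree μ above 0.2 are kept as edges for drawing. The membership matrix and the function name are hypothetical; only the threshold value comes from the caption.

```python
import numpy as np

def membership_edges(C, threshold=0.2):
    """List (node, cluster, degree) for every membership degree above threshold."""
    rows, cols = np.nonzero(C > threshold)
    return [(int(r), int(c), float(C[r, c])) for r, c in zip(rows, cols)]

# Hypothetical membership matrix: 2 nodes, 3 clusters.
C = np.array([[0.70, 0.25, 0.05],
              [0.10, 0.90, 0.00]])
edges = membership_edges(C)
```

Here node 0 keeps edges to clusters 0 and 1, while node 1 keeps only its edge to cluster 1, which mirrors how a node such as MECP2 can remain attached to several clusters in the fuzzy drawing.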

References

    1. Klamt S, Haus UU, Theis F. Hypergraphs and cellular networks. PLoS Comput Biol. 2009;5(5):e1000385. doi: 10.1371/journal.pcbi.1000385.
    2. Montanez R, Medina MA, Solé RV, Rodríguez-Caso C. When metabolism meets topology: Reconciling metabolite and reaction networks. Bioessays. 2010;32(3):246–256. doi: 10.1002/bies.200900145.
    3. Goh KI, Cusick ME, Valle D, Childs B, Vidal M, Barabási AL. The human disease network. Proc Natl Acad Sci USA. 2007;104(21):8685–8690. doi: 10.1073/pnas.0701361104.
    4. Barber M. Modularity and community detection in bipartite networks. Phys Rev E Stat Nonlin Soft Matter Phys. 2007;76(6 Pt 2):066102.
    5. Karypis G, Aggarwal R, Kumar V, Shekhar S. Multilevel hypergraph partitioning: application in VLSI domain. Proc. DAC '97. ACM Press; 1997. pp. 526–529.
