Alejo, L., Atkinson, J., & Lackner, S. (2020). Looking deeper – exploring hidden patterns in reactor data of N-removal systems through clustering analysis. Water Sci. Technol., 81(8), 1569–1577.
Abstract: In this work, clustering analysis of two partial nitritation-anammox (PN-A) moving bed biofilm reactors (MBBR) containing different types of carrier material was explored for the identification of patterns and operational conditions that may benefit process performance. The systems ran for two years under fluctuations of temperature and organic matter. Ex situ batch activity tests were performed every other week during the operation of these reactors. These datasets and the parameters, which were monitored online and in the laboratory, were combined and analyzed applying clustering analysis to identify non-obvious information regarding the performance of the systems. The initial results were consistent with the literature and from an operational perspective, which allowed the parameters to be explored further. The new information revealed that the oxidation reduction potential (ORP) and the anaerobic ammonium oxidizing bacteria (AnAOB) activity correlated well. ORP also dropped when the reactors were exposed to real wastewater (presence of organic matter). Moreover, operating conditions during nitrite accumulation were identified through clustering, and also revealed inhibition of anammox bacteria already at low nitrite concentrations.
Keywords: clustering; feature selection; k-means; partial nitritation-anammox
|
Allende, C., Sohn, E., & Little, C. (2015). Treelink: data integration, clustering and visualization of phylogenetic trees. BMC Bioinformatics, 16, 6 pp.
Abstract: Background: Phylogenetic trees are central to a wide range of biological studies. In many of these studies, tree nodes need to be associated with a variety of attributes. For example, in studies concerned with viral relationships, tree nodes are associated with epidemiological information, such as location, age and subtype. Gene trees used in comparative genomics are usually linked with taxonomic information, such as functional annotations and events. A wide variety of tree visualization and annotation tools have been developed in the past; however, none of them are intended for an integrative and comparative analysis. Results: Treelink is a platform-independent software for linking datasets and sequence files to phylogenetic trees. The application allows an automated integration of datasets to trees for operations such as classifying a tree based on a field or showing the distribution of selected data attributes in branches and leaves. Genomic and proteomic sequences can also be linked to the tree and extracted from internal and external nodes. A novel clustering algorithm to simplify trees and display the most divergent clades was also developed, where validation can be achieved using the data integration and classification function. Integrated geographical information allows ancestral character reconstruction for phylogeographic plotting based on parsimony and likelihood algorithms. Conclusion: Our software can successfully integrate phylogenetic trees with different data sources, and perform operations to differentiate and visualize those differences within a tree. File support includes the most popular formats such as newick and csv. Exporting visualizations as images, cluster outputs and genomic sequences is supported. Treelink is available as a web and desktop application at http://www.treelinkapp.com.
Keywords: Phylogenetic tree; Data integration; Clustering; Visualization
|
Araya-Diaz, P., Ruz, G. A., & Palomino, H. M. (2013). Discovering Craniofacial Patterns Using Multivariate Cephalometric Data for Treatment Decision Making in Orthodontics. Int. J. Morphol., 31(3), 1109–1115.
Abstract: The aim was to find craniofacial morphology patterns in a multivariate cephalometric database using a clustering technique. Cephalometric analysis was performed in a sample of 100 teleradiographs collected from Chilean orthodontic patients. Thirty cephalometric measurements were taken from commonly used analyses. The computed variables were used to perform a clustering analysis with the k-means algorithm to identify patterns of craniofacial morphology. The J48 decision tree was used to analyze each cluster, and the ANOVA test to determine the statistical differences between the clusters. Four clusters were found that had significant differences (P<0.001) in 24 of the 30 variables studied, suggesting that they represent different patterns of craniofacial form. Using the decision tree, 8 of the 30 variables appeared to be relevant for describing the clusters. The clustering analysis is effective in identifying different craniofacial patterns based on a multivariate database. The distinct clusters appear to be caused by differences in the compensation process of the facial structure responding to a genetically determined cranial and mandible form. The proposed method can be applied to several databases, creating specific classifications for each one of them.
|
Baler, R. V., Wijnhoven, I. B., del Valle, V. I., Giovanetti, C. M., & Vivanco, J. F. (2019). Microporosity Clustering Assessment in Calcium Phosphate Bioceramic Particles. Front. Bioeng. Biotechnol., 7(281), 7 pp.
Abstract: There has been an increase in the application of different biomaterials to repair hard tissues. Within these biomaterials, calcium phosphate (CaP) bioceramics are suitable candidates, since they can be biocompatible, biodegradable, osteoinductive, and osteoconductive. Moreover, during sintering, bioceramic materials are prone to form micropores and undergo changes in their surface topographical features, which influence cellular physiology and bone ingrowth. In this study, five geometrical properties from the surface of CaP bioceramic particles and their micropores were analyzed by data mining techniques, driven by the research question: what are the geometrical properties of individual micropores in a CaP bioceramic, and how do they relate to each other? The analysis not only shows that it is feasible to determine the existence of micropore clusters, but also to quantify their geometrical properties. As a result, these CaP bioceramic particles present three groups of micropore clusters distinctive by their geometrical properties. Consequently, this new methodological clustering assessment can be applied to advance the knowledge about CaP bioceramics and their role in bone tissue engineering.
|
Canessa, E., Chaigneau, S. E., Moreno, S., & Lagos, R. (2020). Informational content of cosine and other similarities calculated from high-dimensional Conceptual Property Norm data. Cogn. Process., 21, 601–614.
Abstract: To study concepts that are coded in language, researchers often collect lists of conceptual properties produced by human subjects. From these data, different measures can be computed. In particular, inter-concept similarity is an important variable used in experimental studies. Among possible similarity measures, the cosine of conceptual property frequency vectors seems to be a de facto standard. However, there is a lack of comparative studies that test the merit of different similarity measures when computed from property frequency data. The current work compares four different similarity measures (cosine, correlation, Euclidean and Chebyshev) and five different types of data structures. To that end, we compared the informational content (i.e., entropy) delivered by each of those 4 x 5 = 20 combinations, and used a clustering procedure as a concrete example of how informational content affects statistical analyses. Our results lead us to conclude that similarity measures computed from lower-dimensional data fare better than those calculated from higher-dimensional data, and suggest that researchers should be more aware of data sparseness and dimensionality, and their consequences for statistical analyses.
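As a minimal illustration of the four measures named in this abstract (cosine, correlation, Euclidean and Chebyshev), the sketch below computes each one on small property-frequency vectors. The vectors and the concept names `dog`, `cat` and `car` are hypothetical values made up for the example, not data from the paper, and the code is not the authors' procedure.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def correlation(u, v):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return cosine([a - mean_u for a in u], [b - mean_v for b in v])

def euclidean(u, v):
    """Euclidean distance (smaller means more similar)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def chebyshev(u, v):
    """Chebyshev distance: largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(u, v))

# Hypothetical property-frequency vectors (one count per listed property).
dog = [10, 8, 0, 0, 3]
cat = [9, 7, 0, 1, 4]
car = [0, 0, 12, 9, 0]

measures = {
    "cosine": cosine,
    "correlation": correlation,
    "euclidean": euclidean,
    "chebyshev": chebyshev,
}

for name, fn in measures.items():
    print(f"{name:12s} dog-cat={fn(dog, cat):.3f}  dog-car={fn(dog, car):.3f}")
```

Cosine and correlation are similarities (higher means closer), whereas Euclidean and Chebyshev are distances (lower means closer), so the two pairs of columns have to be read in opposite directions.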
|
Fierro, R., Leiva, V., & Moller, J. (2015). The Hawkes Process With Different Exciting Functions And Its Asymptotic Behavior. J. Appl. Probab., 52(1), 37–54.
Abstract: The standard Hawkes process is constructed from a homogeneous Poisson process and uses the same exciting function for different generations of offspring. We propose an extension of this process by considering different exciting functions. This consideration may be important in a number of fields; e.g. in seismology, where main shocks produce aftershocks with possibly different intensities. The main results are devoted to the asymptotic behavior of this extension of the Hawkes process. Indeed, a law of large numbers and a central limit theorem are stated. These results allow us to analyze the asymptotic behavior of the process when unpredictable marks are considered.
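For orientation, the standard linear Hawkes process referred to in this abstract is commonly described through its conditional intensity. The textbook form is given below; the symbols $\mu$ (baseline rate of the homogeneous Poisson process), $\phi$ (exciting function) and $t_i$ (past event times) are notation assumed here, and the expression is not the extension proposed in the paper.

```latex
\lambda(t) = \mu + \sum_{t_i < t} \phi(t - t_i)
```

In this standard form, the same $\phi$ applies to every generation of offspring; the abstract describes relaxing exactly that assumption.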
|
Garreton, M., & Sanchez, R. (2016). Identifying an optimal analysis level in multiscalar regionalization: A study case of social distress in Greater Santiago. Comput. Environ. Urban Syst., 56, 14–24.
Abstract: Assembling spatial units into meaningful clusters is a challenging task, as it must cope with a consequential computational complexity while controlling for the modifiable areal unit problem (MAUP), spatial autocorrelation and attribute multicolinearity. Nevertheless, these effects can reveal significant interactions among diverse spatial phenomena, such as segregation and economic specialization. Various regionalization methods have been developed in order to address these questions, but key fundamental properties of the aggregation of spatial entities are still poorly understood. In particular, due to the lack of an objective stopping rule, the question of determining an optimal number of clusters is yet unresolved. Therefore, we develop a clustering algorithm which is sensitive to scalar variations of multivariate spatial correlations, recalculating PCA scores at several aggregation steps in order to account for differences in the span of autocorrelation effects for diverse variables. With these settings, the scalar evolution of correlation, compactness and isolation measures is compared between empirical and 120 random datasets, using two dissimilarity measures. Remarkably, adjusting several indicators with real and simulated data allows for a clear definition of a stopping rule for spatial hierarchical clustering. Indeed, increasing correlations with scale in random datasets are spurious MAUP effects, so they can be discounted from real data results in order to identify an optimal clustering level, as defined by the maximum of authentic spatial self-organization. This allows singling out the most socially distressed areas in Greater Santiago, thus providing relevant socio-spatial insights from their cartographic and statistical analysis. In sum, we develop a useful methodology to improve the fundamental comprehension of spatial interdependence and multiscalar self-organizing phenomena, while linking these questions to relevant real world issues. 
(c) 2015 Elsevier Ltd. All rights reserved.
Keywords: Spatial clustering; MAUP; Autocorrelation; Multicolinearity; Stopping rule; Data mining
|
Heredia, C., Moreno, S., & Yushimito, W. F. (2022). Characterization of Mobility Patterns with a Hierarchical Clustering of Origin-Destination GPS Taxi Data. IEEE Trans. Intell. Transp. Syst., 23(8), 12700–12710.
Abstract: Clustering taxi data is commonly used to understand spatial patterns of urban mobility. In this paper, we propose a new clustering model called Origin-Destination-means (OD-means). OD-means is a hierarchical adaptive k-means algorithm based on origin-destination pairs. In the first layer of the hierarchy, the clusters are separated automatically based on the variation of the within-cluster distance of each cluster until convergence. The second layer of the hierarchy corresponds to the sub clustering process of small clusters based on the distance between the origin and destination of each cluster. The algorithm is tested on a large data set of taxi GPS data from Santiago, Chile, and compared to other clustering algorithms. In contrast to them, our proposed model is capable of detecting general and local travel patterns in the city thanks to its hierarchical structure.
|
Moreno, S., Pereira, J., & Yushimito, W. (2020). A hybrid K-means and integer programming method for commercial territory design: a case study in meat distribution. Ann. Oper. Res., 286(1-2), 87–117.
Abstract: The objective of territorial design for a distribution company is the definition of geographic areas that group customers. These geographic areas, usually called districts or territories, should comply with operational rules while maximizing potential sales and minimizing incurred costs. Consequently, territorial design can be seen as a clustering problem in which clients are geographically grouped according to certain criteria which usually vary according to specific objectives and requirements (e.g. costs, delivery times, workload, number of clients, etc.). In this work, we provide a novel hybrid approach for territorial design by means of combining a K-means-based approach for clustering construction with an optimization framework. The K-means approach incorporates the novelty of using tour length approximation techniques to satisfy the conditions of a pork and poultry distributor based in the region of Valparaiso in Chile. The resulting method proves to be robust in the experiments performed, and the Valparaiso case study shows significant savings when compared to the original solution used by the company.
Keywords: Territorial design; Clustering; K-means; Integer programming; Case study
|
Pham, D. T., & Ruz, G. A. (2009). Unsupervised training of Bayesian networks for data clustering. Proc. R. Soc. A-Math. Phys. Eng. Sci., 465(2109), 2927–2948.
Abstract: This paper presents a new approach to the unsupervised training of Bayesian network classifiers. Three models have been analysed: the Chow and Liu (CL) multinets; the tree-augmented naive Bayes; and a new model called the simple Bayesian network classifier, which is more robust in its structure learning. To perform the unsupervised training of these models, the classification maximum likelihood criterion is used. The maximization of this criterion is derived for each model under the classification expectation-maximization (EM) algorithm framework. To test the proposed unsupervised training approach, 10 well-known benchmark datasets have been used to measure their clustering performance. Also, for comparison, the results for the k-means and the EM algorithm, as well as those obtained when the three Bayesian network classifiers are trained in a supervised way, are analysed. A real-world image processing application is also presented, dealing with clustering of wood board images described by 165 attributes. Results show that the proposed learning method, in general, outperforms traditional clustering algorithms and, in the wood board image application, the CL multinets obtained a 12 per cent increase, on average, in clustering accuracy when compared with the k-means method and a 7 per cent increase, on average, when compared with the EM algorithm.
|
Ruz, G. A. (2016). Improving the performance of inductive learning classifiers through the presentation order of the training patterns. Expert Syst. Appl., 58, 1–9.
Abstract: Although the development of new supervised learning algorithms for machine learning techniques is mostly oriented to improve the predictive power or classification accuracy, the capacity to understand how the classification process is carried out is of great interest for many applications in business and industry. Inductive learning algorithms, like the Rules family, induce semantically interpretable classification rules in the form of if-then rules. Although the effectiveness of the Rules family has been studied thoroughly and new and improved versions are constantly being developed, one important drawback is the effect of the presentation order of the training patterns, which has not been studied in depth previously. In this paper this issue is addressed, first by studying empirically the effect of random presentation orders on the number of rules and the generalization power of the resulting classifier. Then a presentation order method for the training examples is proposed which combines a clustering stage with a new density measure developed specifically for this problem. The results using benchmark datasets and a real application of wood defect classification show the effectiveness of the proposed method. Also, since the presentation order method is employed as a preprocessing stage, the simplicity of the Rules family is not affected but instead it enables the generation of fewer and more accurate rules, which can have a direct impact on the performance and usefulness of the Rules family in an expert system context. (C) 2016 Elsevier Ltd. All rights reserved.
Keywords: Inductive learning; Rules family; Clustering; Classification
|
Ruz, G. A., & Pham, D. T. (2012). NBSOM: The naive Bayes self-organizing map. Neural Comput. Appl., 21(6), 1319–1330.
Abstract: The naive Bayes model has proven to be a simple yet effective model, which is very popular for pattern recognition applications such as data classification and clustering. This paper explores the possibility of using this model for multidimensional data visualization. To achieve this, a new learning algorithm called naive Bayes self-organizing map (NBSOM) is proposed to enable the naive Bayes model to perform topographic mappings. The training is carried out by means of an online expectation maximization algorithm with a self-organizing principle. The proposed method is compared with principal component analysis, self-organizing maps, and generative topographic mapping on two benchmark data sets and a real-world image processing application. Overall, the results show the effectiveness of NBSOM for multidimensional data visualization.
|
Ruz, G. A., Varas, S., & Villena, M. (2013). Policy making for broadband adoption and usage in Chile through machine learning. Expert Syst. Appl., 40(17), 6728–6734.
Abstract: For developing countries, such as Chile, we study the influential factors for adoption and usage of broadband services. In particular, subsidies on the broadband price are analyzed to see if this initiative has a significant effect on broadband penetration. To carry out this study, machine learning techniques are used to identify different household profiles using the data obtained from a survey on access, use, and users of broadband Internet from Chile. Different policies are proposed for each group found, which were then evaluated empirically through Bayesian networks. Results show that an unconditional subsidy for the Internet price does not seem to be very appropriate for everyone since it is only significant for some household groups. The evaluation using Bayesian networks showed that other policies should be considered as well, such as the incorporation of computers, Internet applications development, and digital literacy training. (C) 2013 Elsevier Ltd. All rights reserved.
|
Valle, M. A., & Ruz, G. A. (2021). Finding Hierarchical Structures of Disordered Systems: An Application for Market Basket Analysis. IEEE Access, 9, 1626–1641.
Abstract: Complex systems can be characterized by their level of order or disorder. An ordered system is related to the presence of system properties that are correlated with each other. For example, it has been found in crisis periods that the financial systems tend to be synchronized, and symmetry appears in financial assets' behavior. In retail, the collective purchasing behavior tends to be highly disorderly, with a diversity of correlation patterns appearing between the available market supply. In those cases, it is essential to understand the hierarchical structures underlying these systems. For the latter, community detection techniques have been developed to find similar behavior clusters according to some similarity measure. However, these techniques do not consider the inherent interactions between the multitude of system elements. This paper proposes and tests an approach that incorporates a hierarchical grouping process capable of dealing with complete weighted networks. Experiments show that the proposal is superior in terms of the ability to find minimal energy clusters. These minimum energy clusters are equivalent to system states (market baskets) with a higher probability of occurrence; therefore, they are interesting for marketing and promotion activities in retail environments.
Keywords: Boltzmann machine; clustering; disordered systems; greedy; hierarchical; market basket
|