|
Alejo, L., Atkinson, J., Guzman-Fierro, V., & Roeckel, M. (2018). Effluent composition prediction of a two-stage anaerobic digestion process: machine learning and stoichiometry techniques. Environ. Sci. Pollut. Res., 25(21), 21149–21163.
Abstract: Computational self-adapting methods (Support Vector Machines, SVM) are compared with an analytical method in effluent composition prediction of a two-stage anaerobic digestion (AD) process. Experimental data for the AD of poultry manure were used. The analytical method considers the protein as the only source of ammonia production in AD after degradation. Total ammonia nitrogen (TAN), total solids (TS), chemical oxygen demand (COD), and total volatile solids (TVS) were measured in the influent and effluent of the process. The TAN concentration in the effluent was predicted, this being the most inhibiting and polluting compound in AD. Despite the limited data available, the SVM-based model outperformed the analytical method for the TAN prediction, achieving a relative average error of 15.2% against 43% for the analytical method. Moreover, SVM showed higher prediction accuracy in comparison with Artificial Neural Networks. This result reveals the future promise of SVM for prediction in non-linear and dynamic AD processes.
|
|
|
Arevalo-Ramirez, T., Villacres, J., Fuentes, A., Reszka, P., & Cheein, F. A. A. (2020). Moisture content estimation of Pinus radiata and Eucalyptus globulus from reconstructed leaf reflectance in the SWIR region. Biosyst. Eng., 193, 187–205.
Abstract: Valparaiso, a central-southern region in Chile, has one of the highest rates of wildfire occurrence in the country. The constant threat of fires is mainly due to its highly flammable forest plantation, composed of 97.5% Pinus radiata and Eucalyptus globulus. Fuel moisture content is one of the most relevant parameters for studying fire spreading and risk, and can be estimated from the reflectance of leaves in the short wave infra-red (SWIR) range, not easily available in most vision-based sensors. Therefore, this work addresses the problem of estimating the water content of leaves from the two previously mentioned species, without any knowledge of their spectrum in the SWIR band. To this end, and for validation purposes, the reflectance of 90 leaves per species, at five dehydration stages, were taken between 350 nm and 2500 nm (full spectrum). Then, two machine-learning regressors were trained with 70% of the data set to determine the unknown reflectance, in the range 1000 nm-2500 nm. Results were validated with the remaining 30% of the data, achieving a root mean square error less than 9% in the spectrum estimation, and an error of 10% in spectral indices related to water content estimation. (C) 2020 IAgrE. Published by Elsevier Ltd. All rights reserved.
|
|
|
Arevalo-Ramirez, T. A., Castillo, A. H. F., Cabello, P. S. R., & Cheein, F. A. A. (2021). Single bands leaf reflectance prediction based on fuel moisture content for forestry applications. Biosyst. Eng., 202, 79–95.
Abstract: Vegetation indices can be used to perform quantitative and qualitative assessment of vegetation cover. These indices exploit the reflectance features of leaves to predict their biophysical properties. In general, there are different vegetation indices capable of describing the same biophysical parameter. For instance, vegetation water content can be inferred from at least sixteen vegetation indices, where each one uses the reflectance of leaves in different spectral bands. Therefore, if the leaf moisture content, a vegetation index and the reflectance at the wavelengths to compute the vegetation index are known, then the reflectance in other spectral bands can be computed with a bounded error. The current work proposes a method to predict, by a machine learning regressor, the leaf reflectance (spectral signature) at specific spectral bands using the information of leaf moisture content and a single vegetation index of two tree species (Pinus radiata, and Eucalyptus globulus), which constitute 97.5% of the Valparaíso forests in Chile. Results suggest that the most suitable vegetation index to predict the spectral signature is the Leaf Water Index, which using a Kernel Ridge Regressor achieved the best prediction results, with an RMSE lower than 0.022, and an average R2 greater than 0.95 for Pinus radiata and 0.81 for Eucalyptus globulus, respectively. (c) 2020 IAgrE. Published by Elsevier Ltd. All rights reserved.
|
|
|
Bertossi, L., & Geerts, F. (2020). Data Quality and Explainable AI. ACM J. Data Inf. Qual., 12(2), 11.
Abstract: In this work, we provide some insights and develop some ideas, with few technical details, about the role of explanations in Data Quality in the context of data-based machine learning models (ML). In this direction, there are, as expected, roles for causality, and explainable artificial intelligence. The latter area not only sheds light on the models, but also on the data that support model construction. There is also room for defining, identifying, and explaining errors in data, in particular, in ML, and also for suggesting repair actions. More generally, explanations can be used as a basis for defining dirty data in the context of ML, and measuring or quantifying them. We think dirtiness as relative to the ML task at hand, e.g., classification.
|
|
|
Blanco, K., Salcidua, S., Orellana, P., Sauma, T., Leon, T., Lopez-Steinmetz, L. C., et al. (2023). Systematic review: fluid biomarkers and machine learning methods to improve the diagnosis from mild cognitive impairment to Alzheimer's disease. Alzheimer's Res. Ther., Early Access.
Abstract: Mild cognitive impairment (MCI) is often considered an early stage of dementia, with estimated rates of progression to dementia up to 80–90% after approximately 6 years from the initial diagnosis. Diagnosis of cognitive impairment in dementia is typically based on clinical evaluation, neuropsychological assessments, cerebrospinal fluid (CSF) biomarkers, and neuroimaging. The main goal of diagnosing MCI is to determine its cause, particularly whether it is due to Alzheimer's disease (AD). However, only a limited percentage of the population has access to etiological confirmation, which has led to the emergence of peripheral fluid biomarkers as a diagnostic tool for dementias, including MCI due to AD. Recent advances in biofluid assays have enabled the use of sophisticated statistical models and multimodal machine learning (ML) algorithms for the diagnosis of MCI based on fluid biomarkers from CSF, peripheral blood, and saliva, among others. This approach has shown promise for identifying specific causes of MCI, including AD. After a PRISMA analysis, 29 articles revealed a trend towards using multimodal algorithms that incorporate additional biomarkers such as neuroimaging, neuropsychological tests, and genetic information. Particularly, neuroimaging is commonly used in conjunction with fluid biomarkers for both cross-sectional and longitudinal studies. Our systematic review suggests that cost-effective longitudinal multimodal monitoring data, representative of diverse cultural populations and utilizing white-box ML algorithms, could be a valuable contribution to the development of diagnostic models for AD due to MCI. Clinical assessment and biomarkers, together with ML techniques, could prove pivotal in improving diagnostic tools for MCI due to AD.
|
|
|
Decker, L., Leite, D., Minarini, F., Tisbeni, S. R., & Bonacorsi, D. (2022). Unsupervised Learning and Online Anomaly Detection: An On-Condition Log-Based Maintenance System. Int. J. Embed. Real-Time Commun. Syst., 13(1).
Abstract: The large hadron collider (LHC) demands a huge amount of computing resources to deal with petabytes of data generated from high energy physics (HEP) experiments and user logs, which report user activity within the supporting worldwide LHC computing grid (WLCG). An outburst of data and information is expected due to the scheduled LHC upgrade, that is, the workload of the WLCG should increase by 10 times in the near future. Autonomous system maintenance by means of log mining and machine learning algorithms is of utmost importance to keep the computing grid functional. The aim is to detect software faults, bugs, threats, and infrastructural problems. This paper describes a general-purpose solution to anomaly detection in computer grids using unstructured, textual, and unsupervised data. The solution consists in recognizing periods of anomalous activity based on content and information extracted from user log events. This study has particularly compared one-class SVM, isolation forest (IF), and local outlier factor (LOF). IF provides the best fault detection accuracy, 69.5%.
|
|
|
Guevara, E., Babonneau, F., Homem-de-Mello, T., & Moret, S. (2020). A machine learning and distributionally robust optimization framework for strategic energy planning under uncertainty. Appl. Energy, 271, 18 pp.
Abstract: This paper investigates how the choice of stochastic approaches and distribution assumptions impacts strategic investment decisions in energy planning problems. We formulate a two-stage stochastic programming model assuming different distributions for the input parameters and show that there is significant discrepancy among the associated stochastic solutions and other robust solutions published in the literature. To remedy this sensitivity issue, we propose a combined machine learning and distributionally robust optimization (DRO) approach which produces more robust and stable strategic investment decisions with respect to uncertainty assumptions. DRO is applied to deal with ambiguous probability distributions and Machine Learning is used to restrict the DRO model to a subset of important uncertain parameters ensuring computational tractability. Finally, we perform an out-of-sample simulation process to evaluate solutions performances. The Swiss energy system is used as a case study all along the paper to validate the approach.
|
|
|
Gutierrez-Portela, F., Arteaga-Arteaga, B. H., Almenares-Mendoza, F., Calderon-Benavente, L., Acosta-Mesa, H. G., & Tabares-Soto, R. (2023). Enhancing Intrusion Detection in IoT Communications Through ML Model Generalization With a New Dataset (IDSAI). IEEE Access, 11, 70542–70559.
Abstract: One of the fields where Artificial Intelligence (AI) must continue to innovate is computer security. The integration of Wireless Sensor Networks (WSN) with the Internet of Things (IoT) creates ecosystems of attractive surfaces for security intrusions, being vulnerable to multiple and simultaneous attacks. This research evaluates the performance of supervised ML techniques for detecting intrusions based on network traffic captures. This work presents a new balanced dataset (IDSAI) with intrusions generated in attack environments in a real scenario. This new dataset has been provided in order to contrast model generalization from different datasets. The results show that for the detection of intruders, the best supervised algorithms are XGBoost, Gradient Boosting, Decision Tree, Random Forest, and Extra Trees, which can generate predictions when trained and predicted with ten specific intrusions (such as ARP spoofing, ICMP echo request Flood, TCP Null, and others), both of binary form (intrusion and non-intrusion) with up to 94% of accuracy, as multiclass form (ten different intrusions and non-intrusion) with up to 92% of accuracy. In contrast, up to 90% of accuracy is achieved for prediction on the Bot-IoT dataset using models trained with the IDSAI dataset.
|
|
|
Heredia, C., Moreno, S., & Yushimito, W. F. (2022). Characterization of Mobility Patterns with a Hierarchical Clustering of Origin-Destination GPS Taxi Data. IEEE Trans. Intell. Transp. Syst., 23(8), 12700–12710.
Abstract: Clustering taxi data is commonly used to understand spatial patterns of urban mobility. In this paper, we propose a new clustering model called Origin-Destination-means (OD-means). OD-means is a hierarchical adaptive k-means algorithm based on origin-destination pairs. In the first layer of the hierarchy, the clusters are separated automatically based on the variation of the within-cluster distance of each cluster until convergence. The second layer of the hierarchy corresponds to the sub clustering process of small clusters based on the distance between the origin and destination of each cluster. The algorithm is tested on a large data set of taxi GPS data from Santiago, Chile, and compared to other clustering algorithms. In contrast to them, our proposed model is capable of detecting general and local travel patterns in the city thanks to its hierarchical structure.
|
|
|
Hughes, S., Moreno, S., Yushimito, W. F., & Huerta-Canepa, G. (2019). Evaluation of machine learning methodologies to predict stop delivery times from GPS data. Transp. Res. Pt. C-Emerg. Technol., 109, 289–304.
Abstract: In last mile distribution, logistics companies typically arrange and plan their routes based on broad estimates of stop delivery times (i.e., the time spent at each stop to deliver goods to final receivers). If these estimates are not accurate, the level of service is degraded, as the promised time window may not be satisfied. The purpose of this work is to assess the feasibility of machine learning techniques to predict stop delivery times. This is done by testing a wide range of machine learning techniques (including different types of ensembles) to (1) predict the stop delivery time and (2) to determine whether the total stop delivery time will exceed a predefined time threshold (classification approach). For the assessment, all models are trained using information generated from GPS data collected in Medellin, Colombia and compared to hazard duration models. The results are threefold. First, the assessment shows that regression-based machine learning approaches are not better than conventional hazard duration models concerning absolute errors of the prediction of the stop delivery times. Second, when the problem is addressed by a classification scheme in which the prediction is aimed to guide whether a stop time will exceed a predefined time, a basic K-nearest-neighbor model outperforms hazard duration models and other machine learning techniques both in accuracy and F-1 score (harmonic mean between precision and recall). Third, the prediction of the exact duration can be improved by combining the classifiers and prediction models or hazard duration models in a two level scheme (first classification then prediction). However, the improvement depends largely on the correct classification (first level).
|
|
|
Lagos, F., & Pereira, J. (2023). Multi-armed bandit-based hyper-heuristics for combinatorial optimization problems. Eur. J. Oper. Res., 312(1), 70–91.
Abstract: There are significant research opportunities in the integration of Machine Learning (ML) methods and Combinatorial Optimization Problems (COPs). In this work, we focus on metaheuristics to solve COPs that have an important learning component. These algorithms must explore a solution space and learn from the information they obtain in order to find high-quality solutions. Among the metaheuristics, we study Hyper-Heuristics (HHs), algorithms that, given a number of low-level heuristics, iteratively select and apply heuristics to a solution. The HH we consider has a Markov model to produce sequences of low-level heuristics, which we combine with a Multi-Armed Bandit Problem (MAB)-based method to learn its parameters. This work proposes several improvements to the HH metaheuristic that yields a better learning for solving problem instances. Specifically, this is the first work in HHs to present Exponential Weights for Exploration and Exploitation (EXP3) as a learning method, an algorithm that is able to deal with adversarial settings. We also present a case study for the Vehicle Routing Problem with Time Windows (VRPTW), for which we include a list of low-level heuristics that have been proposed in the literature. We show that our algorithms can handle a large and diverse list of heuristics, illustrating that they can be easily configured to solve COPs of different nature. The computational results indicate that our algorithms are competitive methods for the VRPTW (2.16% gap on average with respect to the best known solutions), demonstrating the potential of these algorithms to solve COPs. Finally, we show how algorithms can even detect low-level heuristics that do not contribute to finding better solutions to the problem.
|
|
|
Leite, D., Skrjanc, I., Blazic, S., Zdesar, A., & Gomide, F. (2023). Interval incremental learning of interval data streams and application to vehicle tracking. Inf. Sci., 630, 1–22.
Abstract: This paper presents a method called Interval Incremental Learning (IIL) to capture spatial and temporal patterns in uncertain data streams. The patterns are represented by information granules and a granular rule base with the purpose of developing explainable human-centered computational models of virtual and physical systems. Fundamentally, interval data are either included into wider and more meaningful information granules recursively, or used for structural adaptation of the rule base. An Uncertainty-Weighted Recursive-Least-Squares (UW-RLS) method is proposed to update affine local functions associated with the rules. Online recursive procedures that build interval-based models from scratch and guarantee balanced information granularity are described. The procedures assure stable and understandable rule-based modeling. In general, the model can play the role of a predictor, a controller, or a classifier, with online sample-per-sample structural adaptation and parameter estimation done concurrently. The IIL method is aligned with issues and needs of the Internet of Things, Big Data processing, and eXplainable Artificial Intelligence. An application example concerning real-time land-vehicle localization and tracking in an uncertain environment illustrates the usefulness of the method. We also provide the Driving Through Manhattan interval dataset to foster future investigation.
|
|
|
Opazo, D., Moreno, S., Alvarez-Miranda, E., & Pereira, J. (2021). Analysis of First-Year University Student Dropout through Machine Learning Models: A Comparison between Universities. Mathematics, 9(20), 2599.
Abstract: Student dropout, defined as the abandonment of a high education program before obtaining the degree without reincorporation, is a problem that affects every higher education institution in the world. This study uses machine learning models over two Chilean universities to predict first-year engineering student dropout over enrolled students, and to analyze the variables that affect the probability of dropout. The results show that instead of combining the datasets into a single dataset, it is better to apply a model per university. Moreover, among the eight machine learning models tested over the datasets, gradient-boosting decision trees reports the best model. Further analyses of the interpretative models show that a higher score in almost any entrance university test decreases the probability of dropout, the most important variable being the mathematical test. One exception is the language test, where a higher score increases the probability of dropout.
|
|
|
Otsuki, A., & Jang, H. (2022). Prediction of Particle Size Distribution of Mill Products Using Artificial Neural Networks. ChemEngineering, 6(6), 92.
Abstract: High energy consumption in size reduction operations is one of the most significant issues concerning the sustainability of raw material beneficiation. Thus, process optimization should be done to reduce energy consumption. This study aimed to investigate the applicability of artificial neural networks (ANNs) to predict the particle size distributions (PSDs) of mill products. PSD is one of the key sources of information after milling since it significantly affects the subsequent beneficiation processes. Thus, precise PSD prediction can contribute to process optimization and energy consumption reduction by avoiding over-grinding. In this study, coal particles (-2 mm) were ground with a rod mill under different conditions, and their PSDs were measured. The variables studied included volume% (vol.%) of feed (coal particle), vol.% rod load, and grinding time. Our supervised ANN models were developed to predict PSDs and trained by experimental data sets. The trained models were verified with the other experimental data sets. The results showed that the PSDs predicted by ANN fitted very well with the experimental data after the training. Root mean squared error (RMSE) was calculated for each milling condition, with results between 0.165 and 0.965. Also, the developed ANN models can predict the PSDs of ground products under different milling conditions (i.e., vol.% feed, vol.% rod load, and grinding time). The results confirmed the applicability of ANNs to predict PSD and, thus the potential contribution to reducing energy consumption by optimizing the grinding conditions.
|
|
|
Pham, D. T., & Ruz, G. A. (2009). Unsupervised training of Bayesian networks for data clustering. Proc. R. Soc. A-Math. Phys. Eng. Sci., 465(2109), 2927–2948.
Abstract: This paper presents a new approach to the unsupervised training of Bayesian network classifiers. Three models have been analysed: the Chow and Liu (CL) multinets; the tree-augmented naive Bayes; and a new model called the simple Bayesian network classifier, which is more robust in its structure learning. To perform the unsupervised training of these models, the classification maximum likelihood criterion is used. The maximization of this criterion is derived for each model under the classification expectation-maximization (EM) algorithm framework. To test the proposed unsupervised training approach, 10 well-known benchmark datasets have been used to measure their clustering performance. Also, for comparison, the results for the k-means and the EM algorithm, as well as those obtained when the three Bayesian network classifiers are trained in a supervised way, are analysed. A real-world image processing application is also presented, dealing with clustering of wood board images described by 165 attributes. Results show that the proposed learning method, in general, outperforms traditional clustering algorithms and, in the wood board image application, the CL multinets obtained a 12 per cent increase, on average, in clustering accuracy when compared with the k-means method and a 7 per cent increase, on average, when compared with the EM algorithm.
|
|
|
Ramos, D., Moreno, S., Canessa, E., Chaigneau, S. E., & Marchant, N. (2023). AC-PLT: An algorithm for computer-assisted coding of semantic property listing data. Behav. Res. Methods, Early Access.
Abstract: In this paper, we present a novel algorithm that uses machine learning and natural language processing techniques to facilitate the coding of feature listing data. Feature listing is a method in which participants are asked to provide a list of features that are typically true of a given concept or word. This method is commonly used in research studies to gain insights into people's understanding of various concepts. The standard procedure for extracting meaning from feature listings is to manually code the data, which can be time-consuming and prone to errors, leading to reliability concerns. Our algorithm aims at addressing these challenges by automatically assigning human-created codes to feature listing data that achieve a quantitatively good agreement with human coders. Our preliminary results suggest that our algorithm has the potential to improve the efficiency and accuracy of content analysis of feature listing data. Additionally, this tool is an important step toward developing a fully automated coding algorithm, which we are currently preliminarily devising.
|
|
|
Rozas Andaur, J. M., Ruz, G. A., & Goycoolea, M. (2021). Predicting Out-of-Stock Using Machine Learning: An Application in a Retail Packaged Foods Manufacturing Company. Electronics, 10(22), 2787.
Abstract: For decades, Out-of-Stock (OOS) events have been a problem for retailers and manufacturers. In grocery retailing, an OOS event is used to characterize the condition in which customers do not find a certain commodity while attempting to buy it. This paper focuses on addressing this problem from a manufacturer’s perspective, conducting a case study in a retail packaged foods manufacturing company located in Latin America. We developed two machine learning based systems to detect OOS events automatically. The first is based on a single Random Forest classifier with balanced data, and the second is an ensemble of six different classification algorithms. We used transactional data from the manufacturer information system and physical audits. The novelty of this work is our use of new predictor variables of OOS events. The system was successfully implemented and tested in a retail packaged foods manufacturer company. By incorporating the new predictive variables in our Random Forest and Ensemble classifier, we were able to improve their system’s predictive power. In particular, the Random Forest classifier presented the best performance in a real-world setting, achieving a detection precision of 72% and identifying 68% of the total OOS events. Finally, the incorporation of our new predictor variables allowed us to improve the performance of the Random Forest by 0.24 points in the F-measure.
|
|
|
Wolff, P., Rios, S., Clavijo, D., Grana, M., & Carrasco, M. (2020). Methodologically grounded semantic analysis of large volume of Chilean medical literature data applied to the analysis of medical research funding efficiency in Chile. J. Biomed. Semant., 11(1), 10 pp.
Abstract: Background: Medical knowledge is accumulated in scientific research papers along time. In order to exploit this knowledge by automated systems, there is a growing interest in developing text mining methodologies to extract, structure, and analyze in the shortest time possible the knowledge encoded in the large volume of medical literature. In this paper, we use the Latent Dirichlet Allocation approach to analyze the correlation between funding efforts and actually published research results in order to provide the policy makers with a systematic and rigorous tool to assess the efficiency of funding programs in the medical area. Results: We have tested our methodology in the Revista Medica de Chile, years 2012-2015. 50 relevant semantic topics were identified within 643 medical scientific research papers. Relationships between the identified semantic topics were uncovered using visualization methods. We have also been able to analyze the funding patterns of scientific research underlying these publications. We found that only 29% of the publications declare funding sources, and we identified five topic clusters that concentrate 86% of the declared funds. Conclusions: Our methodology allows analyzing and interpreting the current state of medical research at a national level. The funding source analysis may be useful at the policy making level in order to assess the impact of actual funding policies, and to design new policies.
|
|