Alejo, L., Atkinson, J., Guzman-Fierro, V., & Roeckel, M. (2018). Effluent composition prediction of a two-stage anaerobic digestion process: machine learning and stoichiometry techniques. Environ. Sci. Pollut. Res., 25(21), 21149–21163.
Abstract: Computational self-adapting methods (Support Vector Machines, SVM) are compared with an analytical method in effluent composition prediction of a two-stage anaerobic digestion (AD) process. Experimental data for the AD of poultry manure were used. The analytical method considers the protein as the only source of ammonia production in AD after degradation. Total ammonia nitrogen (TAN), total solids (TS), chemical oxygen demand (COD), and total volatile solids (TVS) were measured in the influent and effluent of the process. The TAN concentration in the effluent was predicted, this being the most inhibiting and polluting compound in AD. Despite the limited data available, the SVM-based model outperformed the analytical method for the TAN prediction, achieving a relative average error of 15.2% against 43% for the analytical method. Moreover, SVM showed higher prediction accuracy in comparison with Artificial Neural Networks. This result reveals the future promise of SVM for prediction in non-linear and dynamic AD processes.
|
Arevalo-Ramirez, T., Villacres, J., Fuentes, A., Reszka, P., & Cheein, F. A. A. (2020). Moisture content estimation of Pinus radiata and Eucalyptus globulus from reconstructed leaf reflectance in the SWIR region. Biosyst. Eng., 193, 187–205.
Abstract: Valparaiso, a central-southern region in Chile, has one of the highest rates of wildfire occurrence in the country. The constant threat of fires is mainly due to its highly flammable forest plantation, composed of 97.5% Pinus radiata and Eucalyptus globulus. Fuel moisture content is one of the most relevant parameters for studying fire spreading and risk, and can be estimated from the reflectance of leaves in the short wave infra-red (SWIR) range, not easily available in most vision-based sensors. Therefore, this work addresses the problem of estimating the water content of leaves from the two previously mentioned species, without any knowledge of their spectrum in the SWIR band. To this end, and for validation purposes, the reflectance of 90 leaves per species, at five dehydration stages, was measured between 350 nm and 2500 nm (full spectrum). Then, two machine-learning regressors were trained with 70% of the data set to determine the unknown reflectance, in the range 1000 nm-2500 nm. Results were validated with the remaining 30% of the data, achieving a root mean square error less than 9% in the spectrum estimation, and an error of 10% in spectral indices related to water content estimation. © 2020 IAgrE. Published by Elsevier Ltd. All rights reserved.
|
Arevalo-Ramirez, T. A., Castillo, A. H. F., Cabello, P. S. R., & Cheein, F. A. A. (2021). Single bands leaf reflectance prediction based on fuel moisture content for forestry applications. Biosyst. Eng., 202, 79–95.
Abstract: Vegetation indices can be used to perform quantitative and qualitative assessment of vegetation cover. These indices exploit the reflectance features of leaves to predict their biophysical properties. In general, there are different vegetation indices capable of describing the same biophysical parameter. For instance, vegetation water content can be inferred from at least sixteen vegetation indices, where each one uses the reflectance of leaves in different spectral bands. Therefore, if the leaf moisture content, a vegetation index and the reflectance at the wavelengths to compute the vegetation index are known, then the reflectance in other spectral bands can be computed with a bounded error. The current work proposes a method to predict, by a machine learning regressor, the leaf reflectance (spectral signature) at specific spectral bands using the information of leaf moisture content and a single vegetation index of two tree species (Pinus radiata, and Eucalyptus globulus), which constitute 97.5% of the Valparaíso forests in Chile. Results suggest that the most suitable vegetation index to predict the spectral signature is the Leaf Water Index, which using a Kernel Ridge Regressor achieved the best prediction results, with an RMSE lower than 0.022, and an average R² greater than 0.95 for Pinus radiata and 0.81 for Eucalyptus globulus, respectively. © 2020 IAgrE. Published by Elsevier Ltd. All rights reserved.
Keywords: Leaf water index; Machine learning; Remote sensing; Wildfire; Wildland fuels
|
Bertossi, L., & Geerts, F. (2020). Data Quality and Explainable AI. ACM J. Data Inf. Qual., 12(2), 11.
Abstract: In this work, we provide some insights and develop some ideas, with few technical details, about the role of explanations in Data Quality in the context of data-based machine learning models (ML). In this direction, there are, as expected, roles for causality, and explainable artificial intelligence. The latter area not only sheds light on the models, but also on the data that support model construction. There is also room for defining, identifying, and explaining errors in data, in particular, in ML, and also for suggesting repair actions. More generally, explanations can be used as a basis for defining dirty data in the context of ML, and measuring or quantifying them. We think of dirtiness as relative to the ML task at hand, e.g., classification.
Keywords: Machine learning; causes; fairness; bias
|
Blanco, K., Salcidua, S., Orellana, P., Sauma, T., Leon, T., Lopez-Steinmetz, L. C., et al. (2023). Systematic review: fluid biomarkers and machine learning methods to improve the diagnosis from mild cognitive impairment to Alzheimer's disease. Alzheimer's Res. Ther., Early Access.
Abstract: Mild cognitive impairment (MCI) is often considered an early stage of dementia, with estimated rates of progression to dementia up to 80–90% after approximately 6 years from the initial diagnosis. Diagnosis of cognitive impairment in dementia is typically based on clinical evaluation, neuropsychological assessments, cerebrospinal fluid (CSF) biomarkers, and neuroimaging. The main goal of diagnosing MCI is to determine its cause, particularly whether it is due to Alzheimer's disease (AD). However, only a limited percentage of the population has access to etiological confirmation, which has led to the emergence of peripheral fluid biomarkers as a diagnostic tool for dementias, including MCI due to AD. Recent advances in biofluid assays have enabled the use of sophisticated statistical models and multimodal machine learning (ML) algorithms for the diagnosis of MCI based on fluid biomarkers from CSF, peripheral blood, and saliva, among others. This approach has shown promise for identifying specific causes of MCI, including AD. After a PRISMA analysis, 29 articles revealed a trend towards using multimodal algorithms that incorporate additional biomarkers such as neuroimaging, neuropsychological tests, and genetic information. Particularly, neuroimaging is commonly used in conjunction with fluid biomarkers for both cross-sectional and longitudinal studies. Our systematic review suggests that cost-effective longitudinal multimodal monitoring data, representative of diverse cultural populations and utilizing white-box ML algorithms, could be a valuable contribution to the development of diagnostic models for AD due to MCI. Clinical assessment and biomarkers, together with ML techniques, could prove pivotal in improving diagnostic tools for MCI due to AD.
|
Decker, L., Leite, D., Minarini, F., Tisbeni, S. R., & Bonacorsi, D. (2022). Unsupervised Learning and Online Anomaly Detection: An On-Condition Log-Based Maintenance System. Int. J. Embed. Real-Time Commun. Syst., 13(1).
Abstract: The large hadron collider (LHC) demands a huge amount of computing resources to deal with petabytes of data generated from high energy physics (HEP) experiments and user logs, which report user activity within the supporting worldwide LHC computing grid (WLCG). An outburst of data and information is expected due to the scheduled LHC upgrade; that is, the workload of the WLCG should increase by 10 times in the near future. Autonomous system maintenance by means of log mining and machine learning algorithms is of utmost importance to keep the computing grid functional. The aim is to detect software faults, bugs, threats, and infrastructural problems. This paper describes a general-purpose solution to anomaly detection in computer grids using unstructured, textual, and unsupervised data. The solution consists in recognizing periods of anomalous activity based on content and information extracted from user log events. This study has particularly compared one-class SVM, isolation forest (IF), and local outlier factor (LOF). IF provides the best fault detection accuracy, 69.5%.
|
Guevara, E., Babonneau, F., Homem-de-Mello, T., & Moret, S. (2020). A machine learning and distributionally robust optimization framework for strategic energy planning under uncertainty. Appl. Energy, 271, 18 pp.
Abstract: This paper investigates how the choice of stochastic approaches and distribution assumptions impacts strategic investment decisions in energy planning problems. We formulate a two-stage stochastic programming model assuming different distributions for the input parameters and show that there is significant discrepancy among the associated stochastic solutions and other robust solutions published in the literature. To remedy this sensitivity issue, we propose a combined machine learning and distributionally robust optimization (DRO) approach which produces more robust and stable strategic investment decisions with respect to uncertainty assumptions. DRO is applied to deal with ambiguous probability distributions and Machine Learning is used to restrict the DRO model to a subset of important uncertain parameters ensuring computational tractability. Finally, we perform an out-of-sample simulation process to evaluate solutions performances. The Swiss energy system is used as a case study all along the paper to validate the approach.
|
Gutierrez-Portela, F., Arteaga-Arteaga, B. H., Almenares-Mendoza, F., Calderon-Benavente, L., Acosta-Mesa, H. G., & Tabares-Soto, R. (2023). Enhancing Intrusion Detection in IoT Communications Through ML Model Generalization With a New Dataset (IDSAI). IEEE Access, 11, 70542–70559.
Abstract: One of the fields where Artificial Intelligence (AI) must continue to innovate is computer security. The integration of Wireless Sensor Networks (WSN) with the Internet of Things (IoT) creates ecosystems of attractive surfaces for security intrusions, being vulnerable to multiple and simultaneous attacks. This research evaluates the performance of supervised ML techniques for detecting intrusions based on network traffic captures. This work presents a new balanced dataset (IDSAI) with intrusions generated in attack environments in a real scenario. This new dataset has been provided in order to contrast model generalization from different datasets. The results show that for the detection of intruders, the best supervised algorithms are XGBoost, Gradient Boosting, Decision Tree, Random Forest, and Extra Trees, which can generate predictions when trained and evaluated on ten specific intrusions (such as ARP spoofing, ICMP echo request Flood, TCP Null, and others), both in binary form (intrusion and non-intrusion), with up to 94% accuracy, and in multiclass form (ten different intrusions and non-intrusion), with up to 92% accuracy. In contrast, up to 90% accuracy is achieved for prediction on the Bot-IoT dataset using models trained with the IDSAI dataset.
|
Heredia, C., Moreno, S., & Yushimito, W. (2024). ODMeans: An R package for global and local cluster detection for Origin–Destination GPS data. SoftwareX, 26, 101732.
Abstract: The ODMeans R package implements the OD-Means model, a two-layer hierarchical clustering algorithm designed for extracting both global and local travel patterns from Origin–Destination Pairs (OD-Pairs). In contrast to existing models, OD-Means automates cluster determination and offers advantages such as smaller Within-Cluster Distance (WCD) and dual hierarchies. The package includes functions for applying the model and visualizing the results on maps. Using real taxi data from Santiago, Chile, we demonstrate the package’s capabilities, showcasing its flexibility and impact on understanding urban mobility patterns.
Keywords: Machine learning; k-means; Odmeans; R
|
Heredia, C., Moreno, S., & Yushimito, W. F. (2022). Characterization of Mobility Patterns with a Hierarchical Clustering of Origin-Destination GPS Taxi Data. IEEE Trans. Intell. Transp. Syst., 23(8), 12700–12710.
Abstract: Clustering taxi data is commonly used to understand spatial patterns of urban mobility. In this paper, we propose a new clustering model called Origin-Destination-means (OD-means). OD-means is a hierarchical adaptive k-means algorithm based on origin-destination pairs. In the first layer of the hierarchy, the clusters are separated automatically based on the variation of the within-cluster distance of each cluster until convergence. The second layer of the hierarchy corresponds to the sub-clustering process of small clusters based on the distance between the origin and destination of each cluster. The algorithm is tested on a large data set of taxi GPS data from Santiago, Chile, and compared to other clustering algorithms. In contrast to them, our proposed model is capable of detecting general and local travel patterns in the city thanks to its hierarchical structure.
|
Holguin-Garcia, S. A., Guevara-Navarro, E., Daza-Chica, A. E., Patiño-Claro, M. A., Arteaga-Arteaga, H. B., Ruz, G. A., et al. (2024). A comparative study of CNN-capsule-net, CNN-transformer encoder, and Traditional machine learning algorithms to classify epileptic seizure. BMC Med. Inform. Decis. Mak., 24(1), 60.
Abstract: Introduction: Epilepsy is a disease characterized by an excessive discharge in neurons generally provoked without any external stimulus, known as convulsions. About 2 million people are diagnosed each year in the world. This process is carried out by a neurological doctor using an electroencephalogram (EEG), which is lengthy. Method: To optimize these processes and make them more efficient, we have resorted to innovative artificial intelligence methods essential in classifying EEG signals. For this, comparing traditional models, such as machine learning or deep learning, with cutting-edge models, in this case, using Capsule-Net architectures and Transformer Encoder, has a crucial role in finding the most accurate model and helping the doctor to have a faster diagnosis. Result: In this paper, a comparison was made between different models for binary and multiclass classification of the epileptic seizure detection database, achieving a binary accuracy of 99.92% with the Capsule-Net model and a multiclass accuracy with the Transformer Encoder model of 87.30%. Conclusion: Artificial intelligence is essential in diagnosing pathology. The comparison between models is helpful as it helps to discard those that are not efficient. State-of-the-art models overshadow conventional models, but data processing also plays an essential role in evaluating the higher accuracy of the models.
|
Hughes, S., Moreno, S., Yushimito, W. F., & Huerta-Canepa, G. (2019). Evaluation of machine learning methodologies to predict stop delivery times from GPS data. Transp. Res. Pt. C-Emerg. Technol., 109, 289–304.
Abstract: In last mile distribution, logistics companies typically arrange and plan their routes based on broad estimates of stop delivery times (i.e., the time spent at each stop to deliver goods to final receivers). If these estimates are not accurate, the level of service is degraded, as the promised time window may not be satisfied. The purpose of this work is to assess the feasibility of machine learning techniques to predict stop delivery times. This is done by testing a wide range of machine learning techniques (including different types of ensembles) to (1) predict the stop delivery time and (2) to determine whether the total stop delivery time will exceed a predefined time threshold (classification approach). For the assessment, all models are trained using information generated from GPS data collected in Medellin, Colombia and compared to hazard duration models. The results are threefold. First, the assessment shows that regression-based machine learning approaches are not better than conventional hazard duration models concerning absolute errors of the prediction of the stop delivery times. Second, when the problem is addressed by a classification scheme in which the prediction is aimed to guide whether a stop time will exceed a predefined time, a basic K-nearest-neighbor model outperforms hazard duration models and other machine learning techniques both in accuracy and F-1 score (harmonic mean between precision and recall). Third, the prediction of the exact duration can be improved by combining the classifiers and prediction models or hazard duration models in a two level scheme (first classification then prediction). However, the improvement depends largely on the correct classification (first level).
Keywords: Machine learning; Stop delivery time; Classification; Regression; Hazard duration; GPS
|
Lagos, F., & Pereira, J. (2023). Multi-armed bandit-based hyper-heuristics for combinatorial optimization problems. Eur. J. Oper. Res., 312(1), 70–91.
Abstract: There are significant research opportunities in the integration of Machine Learning (ML) methods and Combinatorial Optimization Problems (COPs). In this work, we focus on metaheuristics to solve COPs that have an important learning component. These algorithms must explore a solution space and learn from the information they obtain in order to find high-quality solutions. Among the metaheuristics, we study Hyper-Heuristics (HHs), algorithms that, given a number of low-level heuristics, iteratively select and apply heuristics to a solution. The HH we consider has a Markov model to produce sequences of low-level heuristics, which we combine with a Multi-Armed Bandit Problem (MAB)-based method to learn its parameters. This work proposes several improvements to the HH metaheuristic that yield better learning for solving problem instances. Specifically, this is the first work in HHs to present Exponential Weights for Exploration and Exploitation (EXP3) as a learning method, an algorithm that is able to deal with adversarial settings. We also present a case study for the Vehicle Routing Problem with Time Windows (VRPTW), for which we include a list of low-level heuristics that have been proposed in the literature. We show that our algorithms can handle a large and diverse list of heuristics, illustrating that they can be easily configured to solve COPs of different nature. The computational results indicate that our algorithms are competitive methods for the VRPTW (2.16% gap on average with respect to the best known solutions), demonstrating the potential of these algorithms to solve COPs. Finally, we show how algorithms can even detect low-level heuristics that do not contribute to finding better solutions to the problem.
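Note: As a rough illustration of the EXP3 learning rule named in this abstract, the sketch below shows the textbook update in Python. It is not the authors' implementation: the class name `Exp3`, the parameter `gamma`, and the idea of treating each low-level heuristic as one arm with a reward scaled to [0, 1] are assumptions made here, and the Markov model that the paper combines with the bandit is not shown.

```python
import math
import random


class Exp3:
    """Textbook EXP3 bandit over a fixed number of arms (illustrative only)."""

    def __init__(self, n_arms, gamma=0.1):
        self.n_arms = n_arms
        self.gamma = gamma              # exploration rate in (0, 1], assumed value
        self.weights = [1.0] * n_arms   # one weight per arm (e.g. per heuristic)

    def probabilities(self):
        # Mix the weight-proportional distribution with uniform exploration.
        total = sum(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / self.n_arms
                for w in self.weights]

    def select(self, rng=random):
        # Sample an arm index from the current distribution.
        return rng.choices(range(self.n_arms), weights=self.probabilities())[0]

    def update(self, arm, reward):
        # reward is assumed to lie in [0, 1]; dividing by the selection
        # probability gives an importance-weighted estimate for the chosen arm.
        p = self.probabilities()[arm]
        estimate = reward / p
        self.weights[arm] *= math.exp(self.gamma * estimate / self.n_arms)
```

In a hyper-heuristic loop one would call `select()` to pick a heuristic, apply it, and pass a normalized measure of solution improvement to `update()`; how that reward is defined is left open here.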
|
Lagos, F., Moreno, S., Yushimito, W. F., & Brstilo, T. (2024). Urban Origin–Destination Travel Time Estimation Using K-Nearest-Neighbor-Based Methods. Mathematics, 12(8), 1255.
Abstract: Improving the estimation of origin–destination (O-D) travel times poses a formidable challenge due to the intricate nature of transportation dynamics. Current deep learning models often require an overwhelming amount of data, both in terms of data points and variables, thereby limiting their applicability. Furthermore, there is a scarcity of models capable of predicting travel times with basic trip information such as origin, destination, and starting time. This paper introduces novel models rooted in the k-nearest neighbor (KNN) algorithm to tackle O-D travel time estimation with limited data. These models represent innovative adaptations of weighted KNN techniques, integrating the haversine distance of neighboring trips and incorporating correction factors to mitigate prediction biases, thereby enhancing the accuracy of travel time estimations for a given trip. Moreover, our models incorporate an adaptive heuristic to partition the time of day, identifying time blocks characterized by similar travel-time observations. These time blocks facilitate a more nuanced understanding of traffic patterns, enabling more precise predictions. To validate the effectiveness of our proposed models, extensive testing was conducted utilizing a comprehensive taxi trip dataset sourced from Santiago, Chile. The results demonstrate substantial improvements over existing state-of-the-art models (e.g., MAPE between 35 to 37% compared to 49 to 60% in other methods), underscoring the efficacy of our approach. Additionally, our models unveil previously unrecognized patterns in city traffic across various time blocks, shedding light on the underlying dynamics of urban mobility.
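Note: The two ingredients mentioned in this abstract, the haversine distance and a weighted KNN estimate, can be sketched as follows. This is only a generic baseline under stated assumptions, not the models from the paper: the function names, the use of the summed origin and destination distances as the trip dissimilarity, and the inverse-distance weights are all hypothetical choices, and the paper's correction factors and time-of-day blocks are omitted.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumed constant


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to guard against rounding slightly above 1 for antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def knn_travel_time(query, trips, k=3, eps=1e-6):
    """Inverse-distance-weighted mean travel time of the k most similar trips.

    query: (o_lat, o_lon, d_lat, d_lon)
    trips: list of ((o_lat, o_lon, d_lat, d_lon), travel_time)
    Dissimilarity = origin-to-origin plus destination-to-destination
    haversine distance (an assumption made for this sketch).
    """
    def dissimilarity(od):
        return (haversine_km(query[0], query[1], od[0], od[1])
                + haversine_km(query[2], query[3], od[2], od[3]))

    nearest = sorted(((dissimilarity(od), t) for od, t in trips),
                     key=lambda pair: pair[0])[:k]
    weights = [1.0 / (d + eps) for d, _ in nearest]
    return sum(w * t for w, (_, t) in zip(weights, nearest)) / sum(weights)
```

The small `eps` keeps the weights finite when a historical trip coincides exactly with the query.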
|
Leite, D., Skrjanc, I., Blazic, S., Zdesar, A., & Gomide, F. (2023). Interval incremental learning of interval data streams and application to vehicle tracking. Inf. Sci., 630, 1–22.
Abstract: This paper presents a method called Interval Incremental Learning (IIL) to capture spatial and temporal patterns in uncertain data streams. The patterns are represented by information granules and a granular rule base with the purpose of developing explainable human-centered computational models of virtual and physical systems. Fundamentally, interval data are either included into wider and more meaningful information granules recursively, or used for structural adaptation of the rule base. An Uncertainty-Weighted Recursive-Least-Squares (UW-RLS) method is proposed to update affine local functions associated with the rules. Online recursive procedures that build interval-based models from scratch and guarantee balanced information granularity are described. The procedures assure stable and understandable rule-based modeling. In general, the model can play the role of a predictor, a controller, or a classifier, with online sample-per-sample structural adaptation and parameter estimation done concurrently. The IIL method is aligned with issues and needs of the Internet of Things, Big Data processing, and eXplainable Artificial Intelligence. An application example concerning real-time land-vehicle localization and tracking in an uncertain environment illustrates the usefulness of the method. We also provide the Driving Through Manhattan interval dataset to foster future investigation.
|
Opazo, D., Moreno, S., Alvarez-Miranda, E., & Pereira, J. (2021). Analysis of First-Year University Student Dropout through Machine Learning Models: A Comparison between Universities. Mathematics, 9(20), 2599.
Abstract: Student dropout, defined as the abandonment of a higher education program before obtaining the degree without reincorporation, is a problem that affects every higher education institution in the world. This study uses machine learning models over two Chilean universities to predict first-year engineering student dropout over enrolled students, and to analyze the variables that affect the probability of dropout. The results show that instead of combining the datasets into a single dataset, it is better to apply a model per university. Moreover, among the eight machine learning models tested over the datasets, gradient-boosting decision trees perform best. Further analyses of the interpretative models show that a higher score in almost any entrance university test decreases the probability of dropout, the most important variable being the mathematical test. One exception is the language test, where a higher score increases the probability of dropout.
Keywords: machine learning; first-year student dropout; universities
|
Otsuki, A., & Jang, H. (2022). Prediction of Particle Size Distribution of Mill Products Using Artificial Neural Networks. ChemEngineering, 6(6), 92.
Abstract: High energy consumption in size reduction operations is one of the most significant issues concerning the sustainability of raw material beneficiation. Thus, process optimization should be done to reduce energy consumption. This study aimed to investigate the applicability of artificial neural networks (ANNs) to predict the particle size distributions (PSDs) of mill products. PSD is one of the key sources of information after milling since it significantly affects the subsequent beneficiation processes. Thus, precise PSD prediction can contribute to process optimization and energy consumption reduction by avoiding over-grinding. In this study, coal particles (-2 mm) were ground with a rod mill under different conditions, and their PSDs were measured. The variables studied included volume% (vol.%) of feed (coal particle), vol.% rod load, and grinding time. Our supervised ANN models were developed to predict PSDs and trained by experimental data sets. The trained models were verified with the other experimental data sets. The results showed that the PSDs predicted by ANN fitted very well with the experimental data after the training. Root mean squared error (RMSE) was calculated for each milling condition, with results between 0.165 and 0.965. Also, the developed ANN models can predict the PSDs of ground products under different milling conditions (i.e., vol.% feed, vol.% rod load, and grinding time). The results confirmed the applicability of ANNs to predict PSD and, thus the potential contribution to reducing energy consumption by optimizing the grinding conditions.
|
Pham, D. T., & Ruz, G. A. (2009). Unsupervised training of Bayesian networks for data clustering. Proc. R. Soc. A-Math. Phys. Eng. Sci., 465(2109), 2927–2948.
Abstract: This paper presents a new approach to the unsupervised training of Bayesian network classifiers. Three models have been analysed: the Chow and Liu (CL) multinets; the tree-augmented naive Bayes; and a new model called the simple Bayesian network classifier, which is more robust in its structure learning. To perform the unsupervised training of these models, the classification maximum likelihood criterion is used. The maximization of this criterion is derived for each model under the classification expectation-maximization (EM) algorithm framework. To test the proposed unsupervised training approach, 10 well-known benchmark datasets have been used to measure their clustering performance. Also, for comparison, the results for the k-means and the EM algorithm, as well as those obtained when the three Bayesian network classifiers are trained in a supervised way, are analysed. A real-world image processing application is also presented, dealing with clustering of wood board images described by 165 attributes. Results show that the proposed learning method, in general, outperforms traditional clustering algorithms and, in the wood board image application, the CL multinets obtained a 12 per cent increase, on average, in clustering accuracy when compared with the k-means method and a 7 per cent increase, on average, when compared with the EM algorithm.
|
Ramos, D., Moreno, S., Canessa, E., Chaigneau, S. E., & Marchant, N. (2023). AC-PLT: An algorithm for computer-assisted coding of semantic property listing data. Behav. Res. Methods, Early Access.
Abstract: In this paper, we present a novel algorithm that uses machine learning and natural language processing techniques to facilitate the coding of feature listing data. Feature listing is a method in which participants are asked to provide a list of features that are typically true of a given concept or word. This method is commonly used in research studies to gain insights into people's understanding of various concepts. The standard procedure for extracting meaning from feature listings is to manually code the data, which can be time-consuming and prone to errors, leading to reliability concerns. Our algorithm aims at addressing these challenges by automatically assigning human-created codes to feature listing data that achieve a quantitatively good agreement with human coders. Our preliminary results suggest that our algorithm has the potential to improve the efficiency and accuracy of content analysis of feature listing data. Additionally, this tool is an important step toward developing a fully automated coding algorithm, which we are currently preliminarily devising.
|
Rozas Andaur, J. M., Ruz, G. A., & Goycoolea, M. (2021). Predicting Out-of-Stock Using Machine Learning: An Application in a Retail Packaged Foods Manufacturing Company. Electronics, 10(22), 2787.
Abstract: For decades, Out-of-Stock (OOS) events have been a problem for retailers and manufacturers. In grocery retailing, an OOS event is used to characterize the condition in which customers do not find a certain commodity while attempting to buy it. This paper focuses on addressing this problem from a manufacturer’s perspective, conducting a case study in a retail packaged foods manufacturing company located in Latin America. We developed two machine learning based systems to detect OOS events automatically. The first is based on a single Random Forest classifier with balanced data, and the second is an ensemble of six different classification algorithms. We used transactional data from the manufacturer information system and physical audits. The novelty of this work is our use of new predictor variables of OOS events. The system was successfully implemented and tested in a retail packaged foods manufacturer company. By incorporating the new predictive variables in our Random Forest and Ensemble classifier, we were able to improve their system’s predictive power. In particular, the Random Forest classifier presented the best performance in a real-world setting, achieving a detection precision of 72% and identifying 68% of the total OOS events. Finally, the incorporation of our new predictor variables allowed us to improve the performance of the Random Forest by 0.24 points in the F-measure.
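Note: For readers unfamiliar with the F-measure cited in this abstract, the short sketch below computes it from a precision and recall pair. It assumes that "identifying 68% of the total OOS events" corresponds to a recall of 0.68; the function name `f_measure` and the `beta` parameter are illustrative, and the value obtained here is a back-of-the-envelope figure that may differ from what the paper reports.

```python
def f_measure(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision 0.72 and (assumed) recall 0.68, as quoted in the abstract.
f1 = f_measure(0.72, 0.68)
print(round(f1, 3))  # prints 0.699
```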
|