|
Canessa, E., Chaigneau, S. E., & Moreno, S. (2021). Language processing differences between blind and sighted individuals and the abstract versus concrete concept difference. Cogn. Sci., 45(10), e13044.
Abstract: In the Property Listing Task (PLT), participants are asked to list properties for a concept (e.g., for the concept dog, “barks” and “is a pet” may be produced). In Conceptual Property Norming studies (CPNs), participants are asked to list properties for large sets of concepts. Here, we use a mathematical model of the property listing process to explore two longstanding issues: characterizing the difference between concrete and abstract concepts, and characterizing semantic knowledge in the blind versus sighted population. When we apply our mathematical model to a large CPN reporting properties listed by sighted and blind participants, the model uncovers significant differences between concrete and abstract concepts. Though we also find that blind individuals show many of the same processing differences between abstract and concrete concepts found in sighted individuals, our model shows that those differences are noticeably less pronounced than in sighted individuals. We discuss our results vis a vis theories attempting to characterize abstract concepts.
|
|
|
Canessa, E., Chaigneau, S. E., & Moreno, S. (2022). Using agreement probability to study differences in types of concepts and conceptualizers. Behav. Res. Methods, Early Access.
Abstract: Agreement probability p(a) is a homogeneity measure of lists of properties produced by participants in a Property Listing Task (PLT) for a concept. Agreement probability's mathematical properties allow a rich analysis of property-based descriptions. To illustrate, we use p(a) to delve into the differences between concrete and abstract concepts in sighted and blind populations. Results show that concrete concepts are more homogeneous within sighted and blind groups than abstract ones (i.e., exhibit a higher p(a) than abstract ones) and that concrete concepts in the blind group are less homogeneous than in the sighted sample. This supports the idea that listed properties for concrete concepts should be more similar across subjects due to the influence of visual/perceptual information on the learning process. In contrast, abstract concepts are learned based mainly on social and linguistic information, which exhibit more variability among people, thus making the listed properties more dissimilar across subjects. Relative to abstract concepts, the difference in p(a) between sighted and blind is not statistically significant. Though this is a null result, and should be considered with care, it is expected because abstract concepts should be learned by paying attention to the same social and linguistic input in both blind and sighted individuals, and thus, there is no reason to expect that the respective lists of properties should differ. Finally, we used p(a) to classify concrete and abstract concepts with a good level of certainty. All these analyses suggest that p(a) can be fruitfully used to study data obtained in a PLT.
|
|
|
Canessa, E., Chaigneau, S. E., Moreno, S., & Lagos, R. (2020). Informational content of cosine and other similarities calculated from high-dimensional Conceptual Property Norm data. Cogn. Process., 21, 601–614.
Abstract: To study concepts that are coded in language, researchers often collect lists of conceptual properties produced by human subjects. From these data, different measures can be computed. In particular, inter-concept similarity is an important variable used in experimental studies. Among possible similarity measures, the cosine of conceptual property frequency vectors seems to be a de facto standard. However, there is a lack of comparative studies that test the merit of different similarity measures when computed from property frequency data. The current work compares four different similarity measures (cosine, correlation, Euclidean and Chebyshev) and five different types of data structures. To that end, we compared the informational content (i.e., entropy) delivered by each of those 4 x 5 = 20 combinations, and used a clustering procedure as a concrete example of how informational content affects statistical analyses. Our results lead us to conclude that similarity measures computed from lower-dimensional data fare better than those calculated from higher-dimensional data, and suggest that researchers should be more aware of data sparseness and dimensionality, and their consequences for statistical analyses.
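Example (illustrative, not taken from the paper): a minimal Python sketch of the four measures named in the abstract, computed on two property frequency vectors. The function names and toy vectors are assumptions for illustration. Euclidean and Chebyshev are shown as distances, since the abstract does not say how they are turned into similarities.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two property frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

def correlation(u, v):
    """Pearson correlation, i.e. the cosine of the mean-centered vectors."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return cosine([a - mean_u for a in u], [b - mean_v for b in v])

def euclidean(u, v):
    """Euclidean distance (smaller means more similar)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def chebyshev(u, v):
    """Chebyshev distance: the largest single-coordinate difference."""
    return max(abs(a - b) for a, b in zip(u, v))

# Hypothetical counts of how many participants listed each of four properties.
concept_a = [12, 0, 5, 1]
concept_b = [10, 2, 0, 1]
for measure in (cosine, correlation, euclidean, chebyshev):
    print(measure.__name__, round(measure(concept_a, concept_b), 3))
```

Sparse, high-dimensional vectors share many zero entries, which is one reason the choice of measure and of data structure can matter.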
|
|
|
Canessa, E., Chaigneau, S. E., Moreno, S., & Lagos, R. (2023). CPNCoverageAnalysis: An R package for parameter estimation in conceptual properties norming studies. Behav. Res. Methods, 55, 554–569.
Abstract: In conceptual properties norming studies (CPNs), participants list properties that describe a set of concepts. From CPNs, many different parameters are calculated, such as semantic richness. A generally overlooked issue is that those values are only point estimates of the true unknown population parameters. In the present work, we present an R package that allows us to treat those values as population parameter estimates. Relatedly, a general practice in CPNs is using an equal number of participants who list properties for each concept (i.e., standardizing sample size). As we illustrate through examples, this procedure has negative effects on data’s statistical analyses. Here, we argue that a better method is to standardize coverage (i.e., the proportion of sampled properties to the total number of properties that describe a concept), such that a similar coverage is achieved across concepts. When standardizing coverage rather than sample size, it is more likely that the set of concepts in a CPN all exhibit a similar representativeness. Moreover, by computing coverage the researcher can decide whether the CPN reached a sufficiently high coverage, so that its results might be generalizable to other studies. The R package we make available in the current work allows one to compute coverage and to estimate the necessary number of participants to reach a target coverage. We show this sampling procedure by using the R package on real and simulated CPN data.
|
|
|
Canessa, E., Chaigneau, S. E., & Moreno, S. (2023). Describing and understanding the time course of the Property Listing Task. Cogn. Process., Early Access.
Abstract: To study linguistically coded concepts, researchers often resort to the Property Listing Task (PLT). In a PLT, participants are asked to list properties that describe a concept (e.g., for DOG, subjects may list “is a pet”, “has four legs”, etc.). When PLT data is collected for many concepts, researchers obtain Conceptual Properties Norms (CPNs), which are used to study semantic content and as a source of control variables. Though the PLT and CPNs are widely used across psychology, only recently a model that describes the listing course of a PLT has been developed and validated. That original model describes the listing course using order of production of properties. Here we go a step beyond and validate the model using response times (RT), i.e., the time from cue onset to property listing. Our results show that RT data exhibits the same regularities observed in the previous model, but now we can also analyze the time course, i.e., dynamics of the PLT. As such, the RT validated model may be applied to study several similar memory retrieval tasks, such as the Free Listing Task, Verbal Fluidity Task, and to examine related cognitive processes. To illustrate those kinds of analyses, we present a brief example of the difference in PLT’s dynamics between listing properties for abstract versus concrete concepts, which shows that the model may be fruitfully applied to study concepts.
|
|
|
Heredia, C., Moreno, S., & Yushimito, W. (2024). ODMeans: An R package for global and local cluster detection for Origin–Destination GPS data. SoftwareX, 26, 101732.
Abstract: The ODMeans R package implements the OD-Means model, a two-layer hierarchical clustering algorithm designed for extracting both global and local travel patterns from Origin–Destination Pairs (OD-Pairs). In contrast to existing models, OD-Means automates cluster determination and offers advantages such as smaller Within-Cluster Distance (WCD) and dual hierarchies. The package includes functions for applying the model and visualizing the results on maps. Using real taxi data from Santiago, Chile, we demonstrate the package’s capabilities, showcasing its flexibility and impact on understanding urban mobility patterns.
|
|
|
Heredia, C., Moreno, S., & Yushimito, W. F. (2022). Characterization of Mobility Patterns with a Hierarchical Clustering of Origin-Destination GPS Taxi Data. IEEE Trans. Intell. Transp. Syst., 23(8), 12700–12710.
Abstract: Clustering taxi data is commonly used to understand spatial patterns of urban mobility. In this paper, we propose a new clustering model called Origin-Destination-means (OD-means). OD-means is a hierarchical adaptive k-means algorithm based on origin-destination pairs. In the first layer of the hierarchy, the clusters are separated automatically based on the variation of the within-cluster distance of each cluster until convergence. The second layer of the hierarchy corresponds to the sub clustering process of small clusters based on the distance between the origin and destination of each cluster. The algorithm is tested on a large data set of taxi GPS data from Santiago, Chile, and compared to other clustering algorithms. In contrast to them, our proposed model is capable of detecting general and local travel patterns in the city thanks to its hierarchical structure.
|
|
|
Hughes, S., Moreno, S., Yushimito, W. F., & Huerta-Canepa, G. (2019). Evaluation of machine learning methodologies to predict stop delivery times from GPS data. Transp. Res. Pt. C-Emerg. Technol., 109, 289–304.
Abstract: In last mile distribution, logistics companies typically arrange and plan their routes based on broad estimates of stop delivery times (i.e., the time spent at each stop to deliver goods to final receivers). If these estimates are not accurate, the level of service is degraded, as the promised time window may not be satisfied. The purpose of this work is to assess the feasibility of machine learning techniques to predict stop delivery times. This is done by testing a wide range of machine learning techniques (including different types of ensembles) to (1) predict the stop delivery time and (2) to determine whether the total stop delivery time will exceed a predefined time threshold (classification approach). For the assessment, all models are trained using information generated from GPS data collected in Medellin, Colombia and compared to hazard duration models. The results are threefold. First, the assessment shows that regression-based machine learning approaches are not better than conventional hazard duration models concerning absolute errors of the prediction of the stop delivery times. Second, when the problem is addressed by a classification scheme in which the prediction is aimed to guide whether a stop time will exceed a predefined time, a basic K-nearest-neighbor model outperforms hazard duration models and other machine learning techniques both in accuracy and F-1 score (harmonic mean between precision and recall). Third, the prediction of the exact duration can be improved by combining the classifiers and prediction models or hazard duration models in a two level scheme (first classification then prediction). However, the improvement depends largely on the correct classification (first level).
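Example (illustrative, not taken from the paper): the F-1 score mentioned in the abstract is the harmonic mean of precision and recall. A minimal Python sketch, assuming hypothetical counts of true positives, false positives, and false negatives for the "stop time exceeds the threshold" class:

```python
def f1_score(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall, computed from raw counts."""
    predicted_pos = true_pos + false_pos
    actual_pos = true_pos + false_neg
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    if precision + recall == 0:
        return 0.0  # convention chosen here when both are zero
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 8 stops correctly flagged, 2 false alarms, 2 missed.
print(f1_score(8, 2, 2))
```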
|
|
|
Lagos, F., Moreno, S., Yushimito, W. F., & Brstilo, T. (2024). Urban Origin–Destination Travel Time Estimation Using K-Nearest-Neighbor-Based Methods. Mathematics, 12(8), 1255.
Abstract: Improving the estimation of origin–destination (O-D) travel times poses a formidable challenge due to the intricate nature of transportation dynamics. Current deep learning models often require an overwhelming amount of data, both in terms of data points and variables, thereby limiting their applicability. Furthermore, there is a scarcity of models capable of predicting travel times with basic trip information such as origin, destination, and starting time. This paper introduces novel models rooted in the k-nearest neighbor (KNN) algorithm to tackle O-D travel time estimation with limited data. These models represent innovative adaptations of weighted KNN techniques, integrating the haversine distance of neighboring trips and incorporating correction factors to mitigate prediction biases, thereby enhancing the accuracy of travel time estimations for a given trip. Moreover, our models incorporate an adaptive heuristic to partition the time of day, identifying time blocks characterized by similar travel-time observations. These time blocks facilitate a more nuanced understanding of traffic patterns, enabling more precise predictions. To validate the effectiveness of our proposed models, extensive testing was conducted utilizing a comprehensive taxi trip dataset sourced from Santiago, Chile. The results demonstrate substantial improvements over existing state-of-the-art models (e.g., MAPE between 35 to 37% compared to 49 to 60% in other methods), underscoring the efficacy of our approach. Additionally, our models unveil previously unrecognized patterns in city traffic across various time blocks, shedding light on the underlying dynamics of urban mobility.
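Example (illustrative, not the authors' models): a minimal Python sketch of a distance-weighted KNN travel-time estimate built on the haversine distance, plus the MAPE metric quoted in the abstract. The inverse-distance weighting, the way origin and destination distances are summed, and all names are assumptions. The paper's correction factors and time-block heuristic are not reproduced.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def knn_travel_time(query, trips, k=3, eps=1e-6):
    """Inverse-distance-weighted mean travel time of the k nearest past trips.

    query: (o_lat, o_lon, d_lat, d_lon)
    trips: list of (o_lat, o_lon, d_lat, d_lon, minutes)
    A trip's distance to the query is assumed here to be the sum of the
    origin-to-origin and destination-to-destination haversine distances.
    """
    scored = []
    for o_lat, o_lon, d_lat, d_lon, minutes in trips:
        dist = (haversine_km(query[0], query[1], o_lat, o_lon)
                + haversine_km(query[2], query[3], d_lat, d_lon))
        scored.append((dist, minutes))
    nearest = sorted(scored, key=lambda pair: pair[0])[:k]
    weights = [1.0 / (dist + eps) for dist, _ in nearest]
    weighted_sum = sum(w * minutes for w, (_, minutes) in zip(weights, nearest))
    return weighted_sum / sum(weights)

def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    errors = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)
```

This estimator needs only origin and destination coordinates for each trip, in line with the limited-data setting the abstract describes.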
|
|
|
Moreno, S., Bórquez-Paredes, D., & Martínez, V. (2023). Analysis of the Characteristics and Speed of Spread of the 'FUNA' on Twitter. Mathematics, 11(7), 1749.
Abstract: The funa is a prevalent concept in Chile that aims to expose a person's bad behavior, punish the aggressor publicly, and warn the community about it. Despite its massive use on the social networks of Chilean society, the real dissemination of funas among communities is unknown. In this paper, we extract, generate, analyze, and compare the Twitter social networks spread of three tweets related to “funas” against three other trending topics, through the analysis of global network characteristics over time (degree distribution, clustering coefficient, hop plot, and betweenness centrality). As observed, funas have a specific behavior, and they disseminate as quickly as a common tweet or more quickly; however, they spread thanks to several network users, generating a cohesive group.
|
|
|
Moreno, S., Neville, J., & Kirshner, S. (2018). Tied Kronecker Product Graph Models to Capture Variance in Network Populations. ACM Trans. Knowl. Discov. Data, 12(3), 40 pp.
Abstract: Much of the past work on mining and modeling networks has focused on understanding the observed properties of single example graphs. However, in many real-life applications it is important to characterize the structure of populations of graphs. In this work, we analyze the distributional properties of probabilistic generative graph models (PGGMs) for network populations. PGGMs are statistical methods that model the network distribution and match common characteristics of real-world networks. Specifically, we show that most PGGMs cannot reflect the natural variability in graph properties observed across multiple networks because their edge generation process assumes independence among edges. Then, we propose the mixed Kronecker Product Graph Model (mKPGM), a scalable generalization of KPGMs that uses tied parameters to increase the variability of the sampled networks, while preserving the edge probabilities in expectation. We compare mKPGM to several other graph models. The results show that learned mKPGMs accurately represent the characteristics of real-world networks, while also effectively capturing the natural variability in network structure.
|
|
|
Moreno, S., Pereira, J., & Yushimito, W. (2020). A hybrid K-means and integer programming method for commercial territory design: a case study in meat distribution. Ann. Oper. Res., 286(1-2), 87–117.
Abstract: The objective of territorial design for a distribution company is the definition of geographic areas that group customers. These geographic areas, usually called districts or territories, should comply with operational rules while maximizing potential sales and minimizing incurred costs. Consequently, territorial design can be seen as a clustering problem in which clients are geographically grouped according to certain criteria which usually vary according to specific objectives and requirements (e.g. costs, delivery times, workload, number of clients, etc.). In this work, we provide a novel hybrid approach for territorial design by means of combining a K-means-based approach for clustering construction with an optimization framework. The K-means approach incorporates the novelty of using tour length approximation techniques to satisfy the conditions of a pork and poultry distributor based in the region of Valparaiso in Chile. The resulting method proves to be robust in the experiments performed, and the Valparaiso case study shows significant savings when compared to the original solution used by the company.
|
|
|
Moreno, S., Pfeiffer, J. J., & Neville, J. (2018). Scalable and exact sampling method for probabilistic generative graph models. Data Min. Knowl. Discov., 32(6), 1561–1596.
Abstract: Interest in modeling complex networks has fueled the development of multiple probabilistic generative graph models (PGGMs). PGGMs are statistical methods that model the network distribution and match common characteristics of real world networks. Recently, scalable sampling algorithms for well known PGGMs made the analysis of large-scale, sparse networks feasible for the first time. However, it has been demonstrated that these scalable sampling algorithms do not sample from the original underlying distribution, and sometimes produce very unlikely graphs. To address this, we extend the algorithm proposed in Moreno et al. (in: IEEE 14th international conference on data mining, pp 440-449, 2014) for a single model and develop a general solution for a broad class of PGGMs. Our approach exploits the fact that PGGMs are typically parameterized by a small set of unique probability values; this enables fast generation via independent sampling of groups of edges with the same probability value. By sampling within groups, we remove bias due to conditional sampling and probability reallocation. We show that our grouped sampling methods are both provably correct and efficient. Our new algorithm reduces time complexity by avoiding the expensive rejection sampling step previously necessary, and we demonstrate its generality, by outlining implementations for six different PGGMs. We conduct theoretical analysis and empirical evaluation to demonstrate the strengths of our algorithms. We conclude by sampling a network with over a billion edges in 95s on a single processor.
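Example (illustrative, not the authors' algorithm): when a group of candidate edges shares one probability value, the group can be sampled without a coin flip per edge. The Python sketch below uses the standard geometric-skip technique, which is an assumption made here for illustration. The function names and the `groups` structure are hypothetical.

```python
import math
import random

def sample_group(n, p, rng):
    """Pick indices from range(n), each kept independently with probability p.

    Rather than testing every index, jump ahead by a geometrically
    distributed gap, so the work is proportional to the number of kept edges.
    """
    if p <= 0.0:
        return []
    if p >= 1.0:
        return list(range(n))
    log_q = math.log(1.0 - p)
    chosen = []
    i = -1
    while True:
        u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is defined
        i += 1 + int(math.log(u) / log_q)  # number of skipped (unselected) indices
        if i >= n:
            return chosen
        chosen.append(i)

def sample_edges(groups, rng):
    """groups: list of (probability, candidate_edges) pairs, one per unique value."""
    edges = []
    for p, candidates in groups:
        edges.extend(candidates[i] for i in sample_group(len(candidates), p, rng))
    return edges

rng = random.Random(0)
groups = [(0.9, [(0, 1), (1, 2)]), (0.1, [(0, 2), (0, 3), (1, 3), (2, 3)])]
print(sample_edges(groups, rng))
```

Each candidate edge still appears independently with its own probability, so no rejection step is needed within a group.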
|
|
|
Opazo, D., Moreno, S., Alvarez-Miranda, E., & Pereira, J. (2021). Analysis of First-Year University Student Dropout through Machine Learning Models: A Comparison between Universities. Mathematics, 9(20), 2599.
Abstract: Student dropout, defined as the abandonment of a higher education program before obtaining the degree without reincorporation, is a problem that affects every higher education institution in the world. This study uses machine learning models over two Chilean universities to predict first-year engineering student dropout over enrolled students, and to analyze the variables that affect the probability of dropout. The results show that instead of combining the datasets into a single dataset, it is better to apply a model per university. Moreover, among the eight machine learning models tested over the datasets, gradient-boosting decision trees yield the best model. Further analyses of the interpretative models show that a higher score in almost any entrance university test decreases the probability of dropout, the most important variable being the mathematical test. One exception is the language test, where a higher score increases the probability of dropout.
|
|
|
Poirrier, M., Moreno, S., & Huerta-Canepa, G. (2021). Robust h-index. Scientometrics, 126, 1969–1981.
Abstract: The h-index is the most used measurement of impact for researchers. Sites such as Web of Science, Google Scholar, Microsoft Academic, and Scopus leverage it to show and compare the impact of authors. The h-index can be described in simple terms: it is the highest h for which an author has h papers, each cited at least h times. Unfortunately, some researchers, in order to increase their productivity artificially, manipulate their h-index using different techniques such as self-citation. Even though it is relatively simple to discard self-citations, more sophisticated methods to artificially increase this index appear every day. One of these methods is collaborative citations, in which a researcher A indiscriminately cites another researcher B, with whom A has previously collaborated, increasing her/his h-index. This work presents a new robust generalization of the h-index called rh-index that minimizes the impact of new collaborative citations, maintaining the importance of their citations previous to their collaborative work. To demonstrate the usefulness of the proposed index, we analyze its effect over 600 Chilean researchers. Our results show that, while some of the most cited researchers were barely affected, demonstrating their robustness, another group of authors show a substantial reduction in comparison to their original h-index.
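Example (illustrative): a minimal Python sketch of the standard h-index as defined in the abstract, the highest h such that h papers each have at least h citations. The function name and sample citation counts are assumptions. The rh-index itself is not implemented because the abstract does not give its formula.

```python
def h_index(citations):
    """Highest h such that at least h papers have h or more citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break
    return h

# Hypothetical citation counts for one author's five papers.
print(h_index([3, 0, 6, 1, 5]))
```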
|
|
|
Ramos, D., Moreno, S., Canessa, E., Chaigneau, S. E., & Marchant, N. (2023). AC-PLT: An algorithm for computer-assisted coding of semantic property listing data. Behav. Res. Methods, Early Access.
Abstract: In this paper, we present a novel algorithm that uses machine learning and natural language processing techniques to facilitate the coding of feature listing data. Feature listing is a method in which participants are asked to provide a list of features that are typically true of a given concept or word. This method is commonly used in research studies to gain insights into people's understanding of various concepts. The standard procedure for extracting meaning from feature listings is to manually code the data, which can be time-consuming and prone to errors, leading to reliability concerns. Our algorithm aims at addressing these challenges by automatically assigning human-created codes to feature listing data that achieve a quantitatively good agreement with human coders. Our preliminary results suggest that our algorithm has the potential to improve the efficiency and accuracy of content analysis of feature listing data. Additionally, this tool is an important step toward developing a fully automated coding algorithm, which we are currently preliminarily devising.
|
|
|
Salas, R., Allende, H., Moreno, S., & Saavedra, C. (2005). Flexible Architecture of Self Organizing Maps for changing environments. Lect. Notes Comput. Sc., 3773, 642–653.
Abstract: Catastrophic Interference is a well known problem of Artificial Neural Networks (ANN) learning algorithms where the ANN forgets useful knowledge while learning from new data. Furthermore, the structure of most neural models must be chosen in advance. In this paper we introduce a hybrid algorithm called Flexible Architecture of Self Organizing Maps (FASOM) that overcomes the Catastrophic Interference and preserves the topology of clustered data in changing environments. The model consists of K receptive fields of self organizing maps. Each Receptive Field projects high-dimensional data of the input space onto a neuron position in a low-dimensional output space grid by dynamically adapting its structure to a specific region of the input space. Furthermore, the FASOM model automatically finds the number of maps and prototypes needed to successfully adapt to the data. The model has the capability of both growing its structure when novel clusters appear and gradually forgetting when the data volume is reduced in its receptive fields. Finally, we show the capabilities of our model with experimental results using synthetic sequential data sets and real world data.
|
|
|
Wiener, M., Moreno, S., Jafvert, C., & Nies, L. (2020). Time Series Analysis of Water Use and Indirect Reuse within a HUC-4 Basin (Wabash) over a Nine Year Period. Sci. Total Environ., 738, 140221.
|
|
|
Yushimito, W. F., Moreno, S., & Miranda, D. (2023). The Potential of Battery Electric Taxis in Santiago de Chile. Sustainability, 15(11), 8689.
Abstract: Given the semi-private nature of the mode, the conversion of taxi vehicles to electric requires a feasibility analysis, as it can impact their operations and revenues. In this research, we assess the feasibility of taxi companies in Santiago de Chile operating with battery electric vehicles (BEVs), considering the current electric mobility infrastructure of the city. We used a large database of GPS pulses provided by a taxi app to obtain a complete picture of typical taxi trips and operations in the city. Then, we performed an assessment of the feasibility of the fleet conversion by considering battery capacity, driving range, proximity to recharging stations, and charging power. The results are promising, as the number of completed trips ranges from 87.35% to 94.34%, depending on the BEV driving range. The analysis shows the importance of installing fast charging points in the locations or BEV driving ranges.
|
|