
Canessa, E., Chaigneau, S. E., & Moreno, S. (2021). Language processing differences between blind and sighted individuals and the abstract versus concrete concept difference. Cogn. Sci., 45(10), e13044.
Abstract: In the Property Listing Task (PLT), participants are asked to list properties for a concept (e.g., for the concept dog, “barks” and “is a pet” may be produced). In Conceptual Property Norming studies (CPNs), participants are asked to list properties for large sets of concepts. Here, we use a mathematical model of the property listing process to explore two longstanding issues: characterizing the difference between concrete and abstract concepts, and characterizing semantic knowledge in the blind versus sighted population. When we apply our mathematical model to a large CPN reporting properties listed by sighted and blind participants, the model uncovers significant differences between concrete and abstract concepts. Though we also find that blind individuals show many of the same processing differences between abstract and concrete concepts found in sighted individuals, our model shows that those differences are noticeably less pronounced than in sighted individuals. We discuss our results vis-à-vis theories attempting to characterize abstract concepts.



Canessa, E., Chaigneau, S. E., Moreno, S., & Lagos, R. (2020). Informational content of cosine and other similarities calculated from high-dimensional Conceptual Property Norm data. Cogn. Process., 21, 601–614.
Abstract: To study concepts that are coded in language, researchers often collect lists of conceptual properties produced by human subjects. From these data, different measures can be computed. In particular, inter-concept similarity is an important variable used in experimental studies. Among possible similarity measures, the cosine of conceptual property frequency vectors seems to be a de facto standard. However, there is a lack of comparative studies that test the merit of different similarity measures when computed from property frequency data. The current work compares four different similarity measures (cosine, correlation, Euclidean and Chebyshev) and five different types of data structures. To that end, we compared the informational content (i.e., entropy) delivered by each of those 4 x 5 = 20 combinations, and used a clustering procedure as a concrete example of how informational content affects statistical analyses. Our results lead us to conclude that similarity measures computed from lower-dimensional data fare better than those calculated from higher-dimensional data, and suggest that researchers should be more aware of data sparseness and dimensionality, and their consequences for statistical analyses.
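As a toy illustration of the four measures compared in this paper, the sketch below computes each of them on hypothetical property-frequency vectors (made-up numbers, not data from the study's norming corpus):

```python
import math

# Hypothetical property-frequency vectors for two concepts: each entry
# counts how many participants listed a given property (toy numbers).
dog = [10, 8, 0, 3]
cat = [0, 9, 0, 5]

def cosine(u, v):
    """Cosine similarity of two frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: math.sqrt(sum(a * a for a in x))
    return dot / (norm(u) * norm(v))

def correlation(u, v):
    """Pearson correlation: cosine of the mean-centered vectors."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return cosine([a - mu for a in u], [b - mv for b in v])

def euclidean(u, v):
    """Euclidean distance (a dissimilarity, unlike the two above)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def chebyshev(u, v):
    """Chebyshev distance: largest single-property frequency gap."""
    return max(abs(a - b) for a, b in zip(u, v))
```

Note that cosine and correlation are bounded similarities, while Euclidean and Chebyshev are unbounded distances, which is one reason their informational content can differ on sparse, high-dimensional property data.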



Canessa, E., Chaigneau, S. E., Moreno, S., & Lagos, R. (2022). CPNCoverageAnalysis: An R package for parameter estimation in conceptual properties norming studies. Behav. Res. Methods, Early Access.
Abstract: In conceptual properties norming studies (CPNs), participants list properties that describe a set of concepts. From CPNs, many different parameters are calculated, such as semantic richness. A generally overlooked issue is that those values are only point estimates of the true unknown population parameters. In the present work, we present an R package that allows us to treat those values as population parameter estimates. Relatedly, a general practice in CPNs is using an equal number of participants who list properties for each concept (i.e., standardizing sample size). As we illustrate through examples, this procedure has negative effects on the data's statistical analyses. Here, we argue that a better method is to standardize coverage (i.e., the proportion of sampled properties to the total number of properties that describe a concept), such that a similar coverage is achieved across concepts. When standardizing coverage rather than sample size, it is more likely that the set of concepts in a CPN all exhibit a similar representativeness. Moreover, by computing coverage the researcher can decide whether the CPN reached a sufficiently high coverage, so that its results might be generalizable to other studies. The R package we make available in the current work allows one to compute coverage and to estimate the necessary number of participants to reach a target coverage. We show this sampling procedure by using the R package on real and simulated CPN data.
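To make the notion of coverage concrete, here is a minimal Python sketch (the package itself is in R) using a Good-Turing style coverage estimate; this is an illustrative assumption, and the actual estimator implemented in CPNCoverageAnalysis may differ:

```python
from collections import Counter

def sample_coverage(listed_properties):
    """Estimate coverage as C = 1 - f1/n, where f1 is the number of
    properties listed exactly once and n is the total number of
    property listings. Many singleton properties suggest that many
    unseen properties remain, i.e., low coverage."""
    counts = Counter(listed_properties)
    n = sum(counts.values())
    f1 = sum(1 for c in counts.values() if c == 1)
    return 1 - f1 / n

# Toy listings for one concept: 10 listings, 2 singletons -> coverage 0.8
listings = ["barks"] * 5 + ["is a pet"] * 3 + ["has fur", "wags tail"]
```

A target coverage (rather than a target sample size) can then be reached by adding participants until the estimate crosses the desired threshold.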



Heredia, C., Moreno, S., & Yushimito, W. F. (2022). Characterization of Mobility Patterns with a Hierarchical Clustering of Origin-Destination GPS Taxi Data. IEEE Trans. Intell. Transp. Syst., Early Access.
Abstract: Clustering taxi data is commonly used to understand spatial patterns of urban mobility. In this paper, we propose a new clustering model called Origin-Destination-means (OD-means). OD-means is a hierarchical adaptive k-means algorithm based on origin-destination pairs. In the first layer of the hierarchy, the clusters are separated automatically based on the variation of the within-cluster distance of each cluster until convergence. The second layer of the hierarchy corresponds to the sub-clustering of small clusters based on the distance between the origin and destination of each cluster. The algorithm is tested on a large data set of taxi GPS data from Santiago, Chile, and compared to other clustering algorithms. In contrast to them, our proposed model is capable of detecting both general and local travel patterns in the city thanks to its hierarchical structure.



Hughes, S., Moreno, S., Yushimito, W. F., & Huerta-Canepa, G. (2019). Evaluation of machine learning methodologies to predict stop delivery times from GPS data. Transp. Res. Pt. C-Emerg. Technol., 109, 289–304.
Abstract: In last mile distribution, logistics companies typically arrange and plan their routes based on broad estimates of stop delivery times (i.e., the time spent at each stop to deliver goods to final receivers). If these estimates are not accurate, the level of service is degraded, as the promised time window may not be satisfied. The purpose of this work is to assess the feasibility of machine learning techniques to predict stop delivery times. This is done by testing a wide range of machine learning techniques (including different types of ensembles) to (1) predict the stop delivery time and (2) determine whether the total stop delivery time will exceed a predefined time threshold (classification approach). For the assessment, all models are trained using information generated from GPS data collected in Medellin, Colombia, and compared to hazard duration models. The results are threefold. First, the assessment shows that regression-based machine learning approaches are not better than conventional hazard duration models concerning absolute errors in the prediction of stop delivery times. Second, when the problem is addressed by a classification scheme, in which the prediction indicates whether a stop time will exceed a predefined threshold, a basic K-nearest-neighbor model outperforms hazard duration models and other machine learning techniques in both accuracy and F1 score (the harmonic mean of precision and recall). Third, the prediction of the exact duration can be improved by combining the classifiers and prediction models or hazard duration models in a two-level scheme (first classification, then prediction). However, the improvement depends largely on correct classification at the first level.



Moreno, S., Neville, J., & Kirshner, S. (2018). Tied Kronecker Product Graph Models to Capture Variance in Network Populations. ACM Trans. Knowl. Discov. Data, 12(3), 40 pp.
Abstract: Much of the past work on mining and modeling networks has focused on understanding the observed properties of single example graphs. However, in many real-life applications it is important to characterize the structure of populations of graphs. In this work, we analyze the distributional properties of probabilistic generative graph models (PGGMs) for network populations. PGGMs are statistical methods that model the network distribution and match common characteristics of real-world networks. Specifically, we show that most PGGMs cannot reflect the natural variability in graph properties observed across multiple networks because their edge generation process assumes independence among edges. Then, we propose the mixed Kronecker Product Graph Model (mKPGM), a scalable generalization of KPGMs that uses tied parameters to increase the variability of the sampled networks, while preserving the edge probabilities in expectation. We compare mKPGM to several other graph models. The results show that learned mKPGMs accurately represent the characteristics of real-world networks, while also effectively capturing the natural variability in network structure.



Moreno, S., Pereira, J., & Yushimito, W. (2020). A hybrid K-means and integer programming method for commercial territory design: a case study in meat distribution. Ann. Oper. Res., 286(1–2), 87–117.
Abstract: The objective of territorial design for a distribution company is the definition of geographic areas that group customers. These geographic areas, usually called districts or territories, should comply with operational rules while maximizing potential sales and minimizing incurred costs. Consequently, territorial design can be seen as a clustering problem in which clients are geographically grouped according to certain criteria which usually vary according to specific objectives and requirements (e.g. costs, delivery times, workload, number of clients, etc.). In this work, we provide a novel hybrid approach for territorial design by means of combining a K-means-based approach for clustering construction with an optimization framework. The K-means approach incorporates the novelty of using tour length approximation techniques to satisfy the conditions of a pork and poultry distributor based in the region of Valparaiso in Chile. The resulting method proves to be robust in the experiments performed, and the Valparaiso case study shows significant savings when compared to the original solution used by the company.



Moreno, S., Pfeiffer, J. J., & Neville, J. (2018). Scalable and exact sampling method for probabilistic generative graph models. Data Min. Knowl. Discov., 32(6), 1561–1596.
Abstract: Interest in modeling complex networks has fueled the development of multiple probabilistic generative graph models (PGGMs). PGGMs are statistical methods that model the network distribution and match common characteristics of real-world networks. Recently, scalable sampling algorithms for well-known PGGMs made the analysis of large-scale, sparse networks feasible for the first time. However, it has been demonstrated that these scalable sampling algorithms do not sample from the original underlying distribution, and sometimes produce very unlikely graphs. To address this, we extend the algorithm proposed in Moreno et al. (in: IEEE 14th international conference on data mining, pp 440–449, 2014) for a single model and develop a general solution for a broad class of PGGMs. Our approach exploits the fact that PGGMs are typically parameterized by a small set of unique probability values; this enables fast generation via independent sampling of groups of edges with the same probability value. By sampling within groups, we remove bias due to conditional sampling and probability reallocation. We show that our grouped sampling methods are both provably correct and efficient. Our new algorithm reduces time complexity by avoiding the expensive rejection sampling step previously necessary, and we demonstrate its generality by outlining implementations for six different PGGMs. We conduct theoretical analysis and empirical evaluation to demonstrate the strengths of our algorithms. We conclude by sampling a network with over a billion edges in 95 s on a single processor.
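The grouped-sampling idea can be sketched as follows. This is our reading of the general technique, not the authors' implementation; the binomial draw is done naively here, whereas a closed-form binomial sampler would make the cost proportional to the number of realized edges:

```python
import random

def sample_edges_grouped(groups, rng=random):
    """Grouped edge sampling sketch: candidate edges are grouped by their
    shared probability p. For each group, draw how many edges are realized
    (a Binomial(|group|, p) draw), then choose that many edges uniformly
    without replacement. No rejection step is needed, and each group's
    edges keep their original probability p in expectation."""
    sampled = []
    for p, edges in groups:
        k = sum(rng.random() < p for _ in edges)  # naive binomial draw
        sampled.extend(rng.sample(edges, k))
    return sampled
```

Because sparse models have few realized edges relative to candidates, sampling per group rather than per candidate edge is what makes billion-edge graphs tractable.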



Opazo, D., Moreno, S., Alvarez-Miranda, E., & Pereira, J. (2021). Analysis of First-Year University Student Dropout through Machine Learning Models: A Comparison between Universities. Mathematics, 9(20), 2599.
Abstract: Student dropout, defined as the abandonment of a higher education program before obtaining the degree, without reincorporation, is a problem that affects every higher education institution in the world. This study uses machine learning models on data from two Chilean universities to predict first-year engineering student dropout among enrolled students, and to analyze the variables that affect the probability of dropout. The results show that, instead of combining the datasets into a single dataset, it is better to apply a model per university. Moreover, among the eight machine learning models tested on the datasets, gradient-boosting decision trees performs best. Further analysis of the interpretative models shows that a higher score in almost any university entrance test decreases the probability of dropout, the most important variable being the mathematics test. One exception is the language test, where a higher score increases the probability of dropout.



Poirrier, M., Moreno, S., & Huerta-Canepa, G. (2021). Robust h-index. Scientometrics, 126, 1969–1981.
Abstract: The h-index is the most used measurement of impact for researchers. Sites such as Web of Science, Google Scholar, Microsoft Academic, and Scopus leverage it to show and compare the impact of authors. The h-index can be described in simple terms: it is the highest h for which an author has h papers each cited at least h times. Unfortunately, some researchers, in order to artificially increase their productivity, manipulate their h-index using different techniques such as self-citation. Even though it is relatively simple to discard self-citations, more sophisticated methods to artificially increase this index appear every day. One of these methods is collaborative citation, in which a researcher A indiscriminately cites another researcher B, with whom A has a previous collaboration, thereby increasing B's h-index. This work presents a new robust generalization of the h-index, called the rh-index, that minimizes the impact of new collaborative citations while maintaining the importance of citations that precede the collaborative work. To demonstrate the usefulness of the proposed index, we analyze its effect over 600 Chilean researchers. Our results show that, while some of the most cited researchers were barely affected, demonstrating the robustness of their impact, another group of authors shows a substantial reduction in comparison to their original h-index.
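The simple definition of the h-index translates directly into code; a minimal sketch of the standard index (not the rh-index, whose definition is given in the paper):

```python
def h_index(citations):
    """Highest h such that the author has h papers with at least
    h citations each. Sort citation counts in descending order and
    find the last rank whose count still meets or exceeds the rank."""
    h = 0
    for rank, cites in enumerate(sorted(citations, reverse=True), start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h
```

For example, an author with papers cited [10, 8, 5, 4, 3] times has h = 4: four papers with at least four citations, but not five papers with at least five.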



Salas, R., Allende, H., Moreno, S., & Saavedra, C. (2005). Flexible Architecture of Self Organizing Maps for changing environments. Lect. Notes Comput. Sci., 3773, 642–653.
Abstract: Catastrophic interference is a well-known problem of Artificial Neural Network (ANN) learning algorithms, in which the ANN forgets useful knowledge while learning from new data. Furthermore, the structure of most neural models must be chosen in advance. In this paper we introduce a hybrid algorithm called Flexible Architecture of Self Organizing Maps (FASOM) that overcomes catastrophic interference and preserves the topology of clustered data in changing environments. The model consists of K receptive fields of self-organizing maps. Each receptive field projects high-dimensional data of the input space onto a neuron position in a low-dimensional output space grid by dynamically adapting its structure to a specific region of the input space. Furthermore, the FASOM model automatically finds the number of maps and prototypes needed to successfully adapt to the data. The model has the capability of both growing its structure when novel clusters appear and gradually forgetting when the data volume in its receptive fields is reduced. Finally, we show the capabilities of our model with experimental results using synthetic sequential data sets and real-world data.



Wiener, M., Moreno, S., Jafvert, C., & Nies, L. (2020). Time Series Analysis of Water Use and Indirect Reuse within a HUC-4 Basin (Wabash) over a Nine Year Period. Sci. Total Environ., 738, 140221.

