Dr. Ayman Taha
Senior Researcher, Technological University Dublin, Ireland

Title of the speech: Feature Engineering for Big Data Applications: An Insurance case study


Dr. Ayman Taha is an Associate Professor at the Faculty of Computers and Artificial Intelligence, Cairo University, Egypt. He served as a Senior Research Fellow at the Technological University Dublin (TUD), Ireland. He obtained his Ph.D. degree with honors in Computer Systems in 2014 through a joint scientific program between Cairo University, Egypt, and the University of Minnesota (UMN), USA. He received the Best Ph.D. Thesis Award at Cairo University in 2014.

Ayman’s current research interests are Data Engineering, Machine Learning, Spatial Data Computing, Fraud Detection, Categorical Data Analysis, Anomalous Event Identification, and Deep Learning.


Companies have increasing access to very large datasets within their domain. Analyzing these datasets often requires feature selection techniques to reduce the dimensionality of the data and to prioritize features for downstream knowledge generation tasks. Effective feature selection is a key part of clustering, regression, and classification. It offers many opportunities to improve the machine learning pipeline: eliminating redundant and irrelevant features, reducing model over-fitting, shortening model training times, and producing more explainable models. However, despite the widespread availability and use of categorical data in practice, feature selection for categorical and/or mixed data has received relatively little attention in comparison to numerical data. Furthermore, existing feature selection methods for mixed data scale poorly with the number of objects, because their time complexity is nonlinear in the number of objects. In this talk, we discuss a generic multiple association measure for categorical and/or mixed datasets and a novel feature selection approach that uses multiple association across features. Our algorithms are based on the premise that the most representative set of features should be as diverse and as minimally dependent on each other as possible. They formulate feature selection as an optimization problem, searching for the set of features that have minimum association amongst them. The two algorithms are the Naive and Greedy Feature Selection Algorithms, called NFSA and GFSA, respectively.
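
To make the idea of selecting minimally associated features concrete, the following is a minimal Python sketch, not the NFSA or GFSA algorithms from the talk. It assumes purely categorical columns and substitutes pairwise Cramér's V for the multiple association measure, which is not specified here. The function names `cramers_v` and `greedy_select` and the toy columns are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cramers_v(x, y):
    """Cramér's V between two categorical columns (0 = independent, 1 = fully associated)."""
    n = len(x)
    joint = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    smaller_dim = min(len(px), len(py))
    if smaller_dim < 2:
        return 0.0  # a constant column shows no association
    chi2 = 0.0
    for a in px:
        for b in py:
            expected = px[a] * py[b] / n
            observed = joint.get((a, b), 0)
            chi2 += (observed - expected) ** 2 / expected
    return sqrt(chi2 / (n * (smaller_dim - 1)))


def greedy_select(columns, k):
    """Greedily pick k column names whose pairwise associations are small.

    Assumption: this is an illustrative stand-in, seeded with the least
    associated pair and extended one feature at a time.
    """
    names = list(columns)
    if not 2 <= k <= len(names):
        raise ValueError("k must be between 2 and the number of columns")
    assoc = {
        frozenset(pair): cramers_v(columns[pair[0]], columns[pair[1]])
        for pair in combinations(names, 2)
    }
    # Seed with the pair of features that are least associated with each other.
    seed = min(combinations(names, 2), key=lambda pair: assoc[frozenset(pair)])
    selected = list(seed)
    while len(selected) < k:
        remaining = [name for name in names if name not in selected]
        # Add the feature with the lowest total association to those already chosen.
        best = min(
            remaining,
            key=lambda f: sum(assoc[frozenset((f, s))] for s in selected),
        )
        selected.append(best)
    return selected


# Toy data: "b" is a relabelled copy of "a"; "c" is independent of both.
data = {
    "a": ["x", "x", "y", "y", "x", "x", "y", "y"],
    "b": ["p", "p", "q", "q", "p", "p", "q", "q"],
    "c": ["u", "v", "u", "v", "u", "v", "u", "v"],
}
selected = greedy_select(data, 2)
print(selected)
```

On this toy data the redundant pair "a" and "b" is never selected together, which is the behaviour the abstract describes: the chosen features should depend on each other as little as possible.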

Insurance is a data-rich sector, hosting large volumes of customer data that are analysed to evaluate risk. Machine learning techniques are increasingly used in the effective management of insurance risk. By their nature, however, insurance datasets are often of poor quality, with noisy subsets of data (or features). Choosing the right features is a significant pre-processing step in the creation of machine learning models, and the inclusion of irrelevant and redundant features has been shown to degrade the performance of learning models. In our experiments on a set of real insurance datasets, machine learning techniques trained on features chosen by feature selection algorithms outperformed those trained on the full feature set. Specifically, subsets containing 20% and 50% of the features in real insurance datasets improved downstream clustering and classification performance compared with the whole datasets. This indicates the potential for feature selection in the insurance sector both to improve model performance and to highlight influential features for business insights.