
2014 - 2018


Research areas: Android programming, data mining, machine learning, data caching.

Ph.D. Thesis

The research program was supported by the ICT department of the European Organization for Nuclear Research (CERN), Geneva, and by the Compact Muon Solenoid (CMS) collaboration at the National Institute for Nuclear Physics (INFN), Pisa.

The thesis investigates how to apply machine learning and Big Data techniques to a large set of computing logs provided by the CMS experiment, in order to optimize the CPU utilization and job wall time of the distributed computing infrastructure. It proposes a scalable pipeline of Spark components that collects dataset access logs from different sites, organizes them into weekly snapshots, and trains on these snapshots predictive models able to forecast which datasets will become popular over time. The popularity of experiment-related datasets is in fact essential, mission-critical information for the deployment of the CMS computing and storage resources. The accuracy of the trained models is refined in several steps, ranging from feature engineering to refreshing techniques. The performance of global and site-level models is also compared in order to evaluate to what extent locality exists in dataset accesses. The goal of this first step of the thesis is to obtain the best-performing classifier, that is, one with a high F1 measure, which indicates an ability to accurately separate popular datasets from unpopular ones.
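As a reminder of what the F1 measure captures, the following minimal sketch computes it for a binary popular/unpopular labelling. The function name `f1_score` and the 1/0 label encoding are illustrative assumptions, not taken from the thesis code.

```python
def f1_score(y_true, y_pred):
    """F1 of the positive class (assumed encoding: 1 = popular, 0 = unpopular)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # popular, predicted popular
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # unpopular, predicted popular
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # popular, predicted unpopular
    if tp == 0:
        # No true positives: precision and/or recall is zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)
```

Because F1 is the harmonic mean of precision and recall, a classifier scores well only if it both finds most popular datasets and rarely flags unpopular ones.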

Next, the thesis leverages the dataset popularity predictions in an innovative data caching policy, named Popularity Prediction Caching (PPC). The main idea is to avoid evicting cache elements that are predicted to become popular in the next time period. The effectiveness of PPC is demonstrated by evaluating its performance against well-known baseline caching policies. Experiments conducted on large traces of real dataset accesses show that PPC outperforms current strategies and significantly reduces the number of cache misses at some sites. The improvement is most pronounced with small cache sizes, which makes PPC very efficient for production sites with limited storage or bandwidth.
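The eviction idea can be illustrated with a minimal sketch. It assumes an LRU cache as the underlying policy and a precomputed set of datasets predicted to be popular. The class `PPCCache`, its parameters, and the fallback to plain LRU when every cached entry is predicted popular are hypothetical choices made for illustration; the actual PPC policy in the thesis may differ.

```python
from collections import OrderedDict


class PPCCache:
    """LRU cache that skips eviction of entries predicted to be popular.

    Illustrative sketch only: not the thesis implementation.
    """

    def __init__(self, capacity, predicted_popular=()):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.predicted_popular = set(predicted_popular)
        self.entries = OrderedDict()  # least recently used entry first
        self.hits = 0
        self.misses = 0

    def access(self, dataset):
        """Record one access; return True on a cache hit."""
        if dataset in self.entries:
            self.entries.move_to_end(dataset)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(self.entries) >= self.capacity:
            self._evict()
        self.entries[dataset] = True
        return False

    def _evict(self):
        # Walk from least to most recently used and evict the first
        # entry that is NOT predicted to become popular.
        for candidate in self.entries:
            if candidate not in self.predicted_popular:
                del self.entries[candidate]
                return
        # Assumed fallback: everything is predicted popular, use plain LRU.
        self.entries.popitem(last=False)


def count_misses(trace, capacity, predicted_popular=()):
    """Replay an access trace and return the number of cache misses."""
    cache = PPCCache(capacity, predicted_popular)
    for dataset in trace:
        cache.access(dataset)
    return cache.misses


# Toy trace with made-up dataset names; "A" is assumed to be predicted popular.
trace = ["A", "B", "C", "A", "D", "A"]
lru_misses = count_misses(trace, capacity=2)                           # plain LRU
ppc_misses = count_misses(trace, capacity=2, predicted_popular={"A"})  # PPC-style
```

With an empty prediction set the class behaves as ordinary LRU, so the same replay function serves as the baseline. On this toy trace, protecting "A" from eviction saves one miss with a cache of only two entries.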

Exams:

Talks: