Infusing knowledge into data-driven modelling of complex systems for improved quality and interpretability


SIROCCHI, CHRISTEL

2025

Abstract

Specialised applications are deployed across various domains to collect and generate data that measure and model the behaviour of complex systems. While numerous data analytics tools have been developed to uncover trends, associations, and structural patterns within data, ranging from traditional pattern mining techniques to advanced ML architectures, challenges remain regarding model interpretability, data efficiency, robustness, and alignment with domain-specific knowledge. Beyond collected data, nearly all fields of expertise contain a reservoir of domain knowledge, either specific to the system studied or general to the domain, which often remains underutilised in data-driven approaches. This thesis addresses this gap by exploring methods for integrating available knowledge into these approaches. Specifically, two types of knowledge are examined: relational knowledge, which captures relationships between entities within a system and is typically represented using graphs, and declarative knowledge, which expresses facts about the domain and is formalised through logical formulae.

Relational knowledge integration targets domains where structural relationships between entities are available but frequently underutilised, such as distributed computing networks and biological systems. In distributed systems, the topology of a communication network is shown to predict the convergence rate of distributed averaging algorithms, informing the design of new methods that enable nodes to predict or improve algorithmic performance within specific network configurations. In biological systems, specifically metabolomics, the chemical structure of metabolites is found to be predictive of their relative abundance under perturbed states, providing insights into affected metabolic processes while addressing the limitations of traditional pathway-based methods.

Declarative knowledge integration focuses on the clinical domain, where rule-based protocols are widely established and ML approaches often struggle to meet clinical standards for both accuracy and explainability. A taxonomy of existing integration strategies is presented, evaluating these approaches for their ability to enhance model accuracy, interpretability, robustness, and coherence with established knowledge. These structured guidelines also inform the development of novel integration strategies that further improve model performance.

Additionally, this thesis contributes to two further areas. First, it explores the potential of combining graph-based and rule-based approaches in unsupervised learning, particularly for disease subtyping. Second, it investigates how positional information can augment traditional data analysis, demonstrated through two crowdsourcing applications.


Date of defence

20 Feb 2025

Appears in types:

8.1 Doctoral thesis

Files in this item:

File	Size	Format
Christel_Sirocchi_PhD_Thesis_Infusing_knowledge.pdf (open access; description: Infusing knowledge into data-driven modelling of complex systems for improved quality and interpretability; type: DT; licence: Creative Commons)	18.35 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11576/2752311
