Infusing knowledge into data-driven modelling of complex systems for improved quality and interpretability
SIROCCHI, CHRISTEL
2025
Abstract
Specialised applications are deployed across various domains to collect and generate data that measure and model the behaviour of complex systems. While numerous data analytics tools have been developed to uncover trends, associations, and structural patterns within data, ranging from traditional pattern mining techniques to advanced ML architectures, challenges remain regarding model interpretability, data efficiency, robustness, and alignment with domain-specific knowledge. Beyond collected data, nearly all fields of expertise contain a reservoir of domain knowledge, either specific to the system studied or general to the domain, which often remains underutilised in data-driven approaches. This thesis addresses this gap by exploring methods for integrating available knowledge into data-driven approaches. Specifically, two types of knowledge are examined: relational knowledge, which captures relationships between entities within a system and is typically represented using graphs, and declarative knowledge, which expresses facts about the domain and is formalised through logical formulae. Relational knowledge integration targets domains where structural relationships between entities are available but frequently underutilised, such as distributed computing networks and biological systems. In distributed systems, the topology of a communication network is shown to predict the convergence rate of distributed averaging algorithms, informing the design of new methods that enable nodes to predict or improve algorithmic performance within specific network configurations. In biological systems, specifically metabolomics, the chemical structure of metabolites is found to be predictive of their relative abundance under perturbed states, providing insights into affected metabolic processes while addressing the limitations of traditional pathway-based methods.
Declarative knowledge integration focuses on the clinical domain, where rule-based protocols are widely established, and ML approaches often struggle to meet clinical standards for both accuracy and explainability. A taxonomy of existing integration strategies is presented, evaluating these approaches for their ability to enhance model accuracy, interpretability, robustness, and coherence with established knowledge. These structured guidelines also inform the development of novel integration strategies that further improve model performance. Additionally, this thesis contributes to two further areas. First, it explores the potential of combining graph-based and rule-based approaches in unsupervised learning, particularly for disease subtyping. Second, it investigates how positional information can augment traditional data analysis, demonstrated through two crowdsourcing applications.

File | Size | Format |
---|---|---|
Christel_Sirocchi_PhD_Thesis_Infusing_knowledge.pdf (open access). Description: Infusing knowledge into data-driven modelling of complex systems for improved quality and interpretability. Type: DT. License: Non-public | 18.35 MB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.