Artificial intelligence application for robust data cleaning in astronomy: A case study on distance estimation

Kania, N.; Busonero, D.; Contoli, C.; Freschi, V.; Lattanzi, E.

doi:10.1016/j.ascom.2026.101135

Estimating fundamental stellar properties, such as distance, is of paramount importance for understanding stellar evolution and galactic dynamics. Modern astronomical surveys gather massive amounts of data that can be used to train machine learning systems to automatically estimate specific target variables. Standard deep learning methods usually assign the same importance to every training sample, thereby disregarding potentially useful information available from astronomical surveys, such as the measurement uncertainties of various geometric, photometric, and spectroscopic parameters. Moreover, experimental data samples are often affected by missing or invalid values, which hinders the training of complex machine learning models. In this study, we introduce a novel approach based on a twofold strategy to address these issues. First, we differentiate the contribution of samples according to their measurement error by designing a set of loss functions that convey this information; these can be used to train deep learning models effectively on noisy data, with improved accuracy and generalization capabilities. Second, we complement the proposed methodology with a novel approach to handling missing input data by introducing a mask-based preprocessing layer in the neural network model, which further widens the amount of data that can be used for training and inference. These methods are implemented in a consistent pipeline that makes the most of the available data. The experimental results obtained on the Gaia mission benchmark data sets show improved precision and robustness of the proposed approach compared to the baseline methods.
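
The two ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the form of the loss functions or of the mask layer. The sketch assumes inverse-variance weighting as one plausible error-aware loss, and assumes the mask step fills missing entries with a constant and appends a binary validity mask. The names `weighted_mse`, `mask_inputs`, `sigma`, and `fill` are hypothetical.

```python
import math


def weighted_mse(y_true, y_pred, sigma, eps=1e-8):
    """Inverse-variance weighted mean squared error (an assumed loss form).

    Samples with a larger measurement error `sigma` get a smaller weight,
    so noisy labels contribute less to the loss.
    """
    weights = [1.0 / (s * s + eps) for s in sigma]
    total = sum(weights)
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(w * e for w, e in zip(weights, squared_errors)) / total


def mask_inputs(row, fill=0.0):
    """Fill missing or non-finite features and append a validity mask.

    Returns the filled feature values followed by one mask entry per
    feature (1.0 = observed, 0.0 = missing), so a model could learn to
    ignore the filled positions. The concatenation is an assumption.
    """
    valid = [x is not None and math.isfinite(x) for x in row]
    values = [x if ok else fill for x, ok in zip(row, valid)]
    mask = [1.0 if ok else 0.0 for ok in valid]
    return values + mask


if __name__ == "__main__":
    # The second sample has a tenfold larger error, so its residual counts less.
    print(weighted_mse([0.0, 0.0], [1.0, 3.0], [1.0, 10.0]))
    # A row with one observed feature, one missing, and one NaN.
    print(mask_inputs([1.5, None, float("nan")]))
```

With this kind of masking, rows that contain invalid values need not be dropped before training, which is how such a step could widen the usable data set.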