Lung cancer is a leading cause of death worldwide, with non-small cell lung cancer (NSCLC) accounting for up to 85% of all cases and an overall 5-year survival rate of 17.8%. Both early detection and targeted, patient-specific therapies for NSCLC are crucial to improving patient survival. As the amount of healthcare data continues to grow, the diagnosis and treatment of lung cancer can be improved by identifying biomarkers that indicate which patients are at increased risk of developing the disease or require a different type of therapy. Recently, an increasing number of machine learning approaches have been developed to facilitate the identification of such risk factors.
However, data such as proteomics and electronic health records are very challenging to work with because they are high-dimensional, noisy, and contain various sources of bias.
Additionally, interpreting results correctly requires a deep understanding of the clinical and biological domain.
In this project, we aim to develop a machine learning (ML) pipeline to identify trustworthy medical risk factors and biomarkers in NSCLC. Given the nature of the data and the complexity of the domain, we propose a robust, explainable, and human-centred approach. On the one hand, we want to reduce the impact of noise and known data bias by making the ML analysis more robust. On the other hand, we want to simplify the evaluation of results for clinical experts by increasing the explainability of any given analysis and by providing them with tools to incorporate their domain knowledge.
We believe that the techniques developed within this project can be readily applied to other disease areas and will accelerate the development of personalised medicine.