Classification of Spanish medical patient discharge summaries

Detecting early signs of critical diseases from freetext notes
team size
2-3 people

Data Science

Deep Learning

Natural Language Processing

Maturity Level


As part of the “Artificial Intelligence for Europe” initiative we developed a system for the automatic detection and classification of medical diseases from patient discharge summaries in Spanish and English Language.

AI4EU is the European Union’s landmark Artificial Intelligence project, which seeks to develop a European AI ecosystem, bringing together knowledge, algorithms and tools to transform them into compelling solutions for users. AI4EU involves more than 70 partners, covering 21 countries and was funded with €20m by the Horizon 2020 framework program.

Our contribution to the AI4EU platform was published and can be accessed via the following link:


With 1.8 M new cases diagnosed in 2018 and 48% mortality rate, Colorectal cancer (CRC) is the 2nd leading cause of cancer-related deaths worldwide. CRC has a 5-year survival rate of 92% if diagnosed in Stage I and only 11% if diagnosed in Stage IV. Screening programs are the most powerful tool to diagnose CRC early.

Our partner has developed the product “ColoFast”, an innovative blood-based test for the early detection of colorectal cancer based on a combination of biomarkers for patients between 50 to 85 years old.

The long-term objective of our partner relies on the development of a diagnostic algorithm. Based on risk factors identified from patients’ discharge summaries the objective is to identify patients with high risk of developing cancer or premalignant lesions.

Our Natural Language Processing (NLP) algorithm automatically predicts ICD-10 codes from Spanish and English free-text medical notes. It thereby automates the coding task that usually involves manual labour from a professional medical coder.

The predicted ICD-10 codes provide useful categorical features for further classification tasks for recognition of different types of cancer. These categorical features combined with other features extracted from electronic health records and blood-based features are expected to increase the accuracy of medical screening.


By utilizing the rich data annotations that were offered in the challenge dataset, the constructed model is able to precisely predict ICD-10 codes, obtaining performances better than most alternative scores. Our approach allows us to offer predictions that can be regarded as SOTA in the field of Spanish medical texts.


Our solution is able to distinguish between 2.552 different ICD-10 classes. We processed documents from over 1.000 patients.

contact us
Have we sparked your interest?

Other projects

Back to Our Work