Evaluating the Alzheimer’s disease data landscape: issues with heterogeneity, bias and interoperability

On 16 December, Colin Birkenbihl, Prof. Martin Hofmann-Apitius and colleagues published an article in Alzheimer’s & Dementia: Translational Research & Clinical Interventions, in which they evaluate the data landscape across several major Alzheimer’s disease (AD) cohort studies.

Longitudinal cohort studies are clinical research studies that follow a group of individuals over a defined period of time, obtaining clinical samples and data from the research participants at regular intervals. Studies such as these allow researchers to understand how disease develops and what factors might affect disease progression, among other facets. Longitudinal cohort studies are of particular value for research on progressive, neurodegenerative diseases like AD. However, it can be hard to compare or pool data across different cohort studies, something that is essential for understanding whether research results are reproducible. One reason these comparisons are challenging is that studies are often designed differently, with varying inclusion/exclusion criteria, clinical protocols and data access conditions. In particular, there is substantial variability between studies in the accessibility of patient-level data.

To assess the data landscape in AD, Colin Birkenbihl and colleagues analysed datasets from nine cohort studies (A4, ADNI, ANMerge, AIBL, EMIF-1000, EPADv1500, JADNI, NACC and ROSMAP), summarising the data parameters and describing how they overlap between studies. The largest of these, NACC, includes over 40,000 research participants, although on average each cohort recruited between 1200 and 3600 participants. In general, the authors observed fairly large biases towards high levels of education and, in particular, a strong bias towards white/Caucasian ethnicity. Scoring different data parameters based on accessibility, they observed that while some straightforward modalities were accessible across all studies (e.g. sex, age, education), other modalities were much more heterogeneous, such as imaging data, lifestyle parameters and fluid biomarker samples, indicating a lack of interoperability across datasets and cohorts. In addition, the extent of longitudinal follow-up and sampling varied extensively between studies and data modalities, with a particular paucity in the MRI and CSF biomarker categories. All the analyses reported in the article have been made available through ADataViewer, an interactive web application developed by Fraunhofer SCAI that displays the findings in data availability maps.
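To make the idea of a data availability map more concrete, the sketch below shows one simple way that variable overlap between cohorts could be tabulated. It is only an illustration and is not the method used in the article or in ADataViewer. The cohort names (`cohort_a`, `cohort_b`, `cohort_c`), the variable names and the helper functions are all hypothetical placeholders and do not describe the contents of any real study.

```python
from itertools import combinations

# Hypothetical toy data: cohort names and variable lists are placeholders,
# not the contents of any real cohort study.
cohort_variables = {
    "cohort_a": {"age", "sex", "education", "mri_volume", "csf_abeta"},
    "cohort_b": {"age", "sex", "education", "mri_volume"},
    "cohort_c": {"age", "sex", "education", "smoking_status"},
}


def availability_map(cohorts):
    """Return {variable: {cohort: bool}} marking which cohorts report each variable."""
    all_variables = sorted(set().union(*cohorts.values()))
    return {
        variable: {name: variable in reported for name, reported in cohorts.items()}
        for variable in all_variables
    }


def shared_by_all(cohorts):
    """Return the variables that every cohort reports."""
    return set.intersection(*cohorts.values())


def pairwise_jaccard(cohorts):
    """Return the Jaccard overlap (shared / combined variables) for each cohort pair."""
    return {
        (a, b): len(cohorts[a] & cohorts[b]) / len(cohorts[a] | cohorts[b])
        for a, b in combinations(sorted(cohorts), 2)
    }


if __name__ == "__main__":
    # Print one row per variable, with a yes/no column for each cohort.
    for variable, row in availability_map(cohort_variables).items():
        flags = "  ".join(
            f"{name}={'yes' if present else 'no'}" for name, present in row.items()
        )
        print(f"{variable:15s} {flags}")
    print("Shared by all cohorts:", sorted(shared_by_all(cohort_variables)))
    print("Pairwise overlap:", pairwise_jaccard(cohort_variables))
```

In this toy example only the demographic variables are shared by all three cohorts, while the imaging, biomarker and lifestyle variables each appear in a subset. That is the kind of pattern the article describes, although the real analysis also scored accessibility and longitudinal follow-up, which this sketch does not attempt.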

Overall, these results highlight the challenges facing investigators wishing to validate their analyses across cohorts, or researchers aiming to model disease mechanisms using AI approaches. Many AD datasets are not interoperable, using different data models and providing variable access to raw data. Of particular concern, the lack of ethnoracial diversity in most cohorts also means that AI-based models developed or trained using these data could suffer from bias.

https://alz-journals.onlinelibrary.wiley.com/doi/full/10.1002/trc2.12102

ADataViewer can be accessed here: https://adata.scai.fraunhofer.de/
