Artificial intelligence is increasingly used in medical diagnosis, particularly for analyzing medical images such as X-rays. However, studies have shown that these AI models don’t perform equally well across all demographic groups, often demonstrating lower accuracy for women and people of color.
Adding to these concerns, MIT researchers in 2022 discovered that AI models could predict a patient’s race from chest X-rays with surprising accuracy, a feat impossible even for experienced radiologists. Now, the same team has uncovered a link between this ability and the observed bias. Their findings, published in Nature Medicine, suggest that models most accurate at predicting demographics also exhibit the largest “fairness gaps” – discrepancies in diagnostic accuracy across different races and genders.
“It’s well-established that high-capacity machine-learning models are good predictors of human demographics such as self-reported race or sex or age. This paper redemonstrates that capacity, and then links that capacity to the lack of performance across different groups, which has never been done,” explains Marzyeh Ghassemi, senior author of the study and an MIT associate professor of electrical engineering and computer science.
The team’s research indicates that these models may be using “demographic shortcuts” during diagnosis, leading to incorrect results for certain groups. While the researchers were able to retrain the models for improved fairness, this “debiasing” proved effective primarily when tested on patients similar to the training data. When applied to patients from different hospitals, the fairness gaps reemerged.
Haoran Zhang, an MIT graduate student and lead author of the paper, emphasizes two key takeaways: “First, you should thoroughly evaluate any external models on your own data because any fairness guarantees that model developers provide on their training data may not transfer to your population. Second, whenever sufficient data is available, you should train models on your own data.”
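To make the first recommendation concrete, a hospital could run a small fairness audit of an externally trained model on its own validation data before deployment. The Python sketch below is a minimal illustration of that idea, not code from the study; the dataframe and column names ("label", "score", "sex", "race") are assumptions.

```python
# Minimal sketch of a local fairness audit for an externally trained model.
# The dataframe and column names are illustrative assumptions, not the study's data or code.
import pandas as pd
from sklearn.metrics import roc_auc_score

def subgroup_auroc(df: pd.DataFrame, label_col: str, score_col: str, group_col: str):
    """AUROC per demographic subgroup; the best-minus-worst spread is the 'fairness gap'."""
    per_group = {}
    for group, rows in df.groupby(group_col):
        per_group[group] = roc_auc_score(rows[label_col], rows[score_col])
    gap = max(per_group.values()) - min(per_group.values())
    return per_group, gap

# Hypothetical usage on a local validation set with model scores already attached:
# per_sex, sex_gap = subgroup_auroc(local_df, "label", "score", "sex")
# per_race, race_gap = subgroup_auroc(local_df, "label", "score", "race")
```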
As of May 2024, the FDA has approved 882 AI-enabled medical devices, with a significant portion designed for radiology. The MIT study highlights the urgent need to address the bias inherent in these models.
Ghassemi points out, “Many popular machine learning models have superhuman demographic prediction capacity — radiologists cannot detect self-reported race from a chest X-ray. These are models that are good at predicting disease, but during training are learning to predict other things that may not be desirable.”
The researchers investigated this phenomenon using publicly available chest X-ray datasets, training models to diagnose conditions like fluid buildup in the lungs, collapsed lung, and heart enlargement. While the models performed well overall, most displayed fairness gaps, with accuracy discrepancies between men and women, and between white and Black patients.
The study revealed a strong correlation between a model’s accuracy in predicting demographics and the size of its fairness gap, suggesting that these models rely on demographic categorization as a shortcut for disease prediction.
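In practical terms, the reported relationship can be pictured as a rank correlation across a collection of models: for each model, measure how well it predicts a demographic attribute and how large its diagnostic fairness gap is, then correlate the two. The sketch below illustrates that kind of analysis under stated assumptions; it is not the authors’ analysis code.

```python
# Illustrative sketch of the reported relationship: models that better encode
# demographics tend to show larger fairness gaps. Not the authors' analysis code.
import numpy as np
from scipy.stats import spearmanr

def demographic_encoding_vs_gap(demographic_auroc, fairness_gap):
    """Rank-correlate, across a set of models, demographic-prediction AUROC with fairness gap.

    demographic_auroc: one value per model (e.g., AUROC of predicting self-reported race)
    fairness_gap: one value per model (best-minus-worst subgroup diagnostic performance)
    """
    rho, p_value = spearmanr(np.asarray(demographic_auroc), np.asarray(fairness_gap))
    return rho, p_value
```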
The team experimented with “subgroup robustness” training, rewarding models for better performance on their weakest subgroup, and “group adversarial” approaches, forcing models to disregard demographic information. Both methods showed promise, but only when tested on data similar to the training set.
“For in-distribution data, you can use existing state-of-the-art methods to reduce fairness gaps without making significant trade-offs in overall performance,” says Ghassemi. “Subgroup robustness methods force models to be sensitive to mispredicting a specific group, and group adversarial methods try to remove group information completely.”
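For readers who want a concrete picture of those two ideas, the PyTorch-style sketch below shows a worst-group loss in the spirit of subgroup-robustness methods and a gradient-reversal head in the spirit of group-adversarial training. It is a minimal sketch under assumed tensor shapes and names, not the training code used in the paper.

```python
# Minimal sketch of the two debiasing ideas described above, under assumed tensor shapes.
# Names are illustrative; this is not the paper's training code.
import torch
import torch.nn.functional as F

def worst_group_loss(logits: torch.Tensor, labels: torch.Tensor, group_ids: torch.Tensor) -> torch.Tensor:
    """Subgroup-robustness idea: optimize the loss of the worst-performing demographic subgroup."""
    per_sample = F.cross_entropy(logits, labels, reduction="none")
    group_losses = [per_sample[group_ids == g].mean() for g in group_ids.unique()]
    return torch.stack(group_losses).max()

class GradientReversal(torch.autograd.Function):
    """Identity on the forward pass; flips gradients on the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

def group_adversarial_loss(features: torch.Tensor, group_labels: torch.Tensor,
                           group_head: torch.nn.Module) -> torch.Tensor:
    """Group-adversarial idea: an auxiliary head predicts the demographic group from
    gradient-reversed features, pushing the encoder to discard group information."""
    group_logits = group_head(GradientReversal.apply(features))
    return F.cross_entropy(group_logits, group_labels)
```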
Worryingly, when tested on datasets from different hospitals, the “debiased” models exhibited significant fairness gaps, highlighting the limitations of current debiasing techniques.
“If you debias the model in one set of patients, that fairness does not necessarily hold as you move to a new set of patients from a different hospital in a different location,” warns Zhang.
This finding raises concerns as hospitals often utilize models trained on data from other institutions. Ghassemi cautions, “We found that even state-of-the-art models which are optimally performant in data similar to their training sets are not optimal — that is, they do not make the best trade-off between overall and subgroup performance — in novel settings. Unfortunately, this is actually how a model is likely to be deployed. Most models are trained and validated with data from one hospital, or one source, and then deployed widely.”
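One way to see the deployment problem is to measure the same model’s fairness gap twice: once on held-out patients from the training hospital and once on patients from a new site. The snippet below is a hedged sketch under assumed dataframe and column names; the pattern of the comparison, not the specific values, is the point.

```python
# Hedged sketch: compare a debiased model's subgroup accuracy gap in-distribution
# vs. at a new hospital. Dataframe and column names ("pred", "label") are assumptions.
import pandas as pd

def accuracy_gap(df: pd.DataFrame, group_col: str) -> float:
    """Difference in accuracy between the best- and worst-performing subgroup."""
    per_group = {g: (rows["pred"] == rows["label"]).mean() for g, rows in df.groupby(group_col)}
    return max(per_group.values()) - min(per_group.values())

# internal_df: held-out patients from the training hospital; external_df: patients from another site.
# A model debiased on internal data may still show a much larger gap on external data:
# print(accuracy_gap(internal_df, "race"), accuracy_gap(external_df, "race"))
```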
The study emphasizes the need for hospitals to rigorously evaluate AI models on their own patient populations before implementation to ensure fair and accurate diagnoses for all patients. The researchers are continuing to explore and develop new methods for building fairer AI models that generalize better across diverse datasets.
This research was supported by various organizations, including a Google Research Scholar Award, the Robert Wood Johnson Foundation Harold Amos Medical Faculty Development Program, RSNA Health Disparities, the Lacuna Fund, the Gordon and Betty Moore Foundation, the National Institute of Biomedical Imaging and Bioengineering, and the National Heart, Lung, and Blood Institute.