Are Open Source Pathology Foundation Models ready for the clinic?
Imagine that a pathologist reads a slide and diagnoses a patient’s cancer on a Monday. They return a few weeks later, read the same slide, and change their diagnosis without any new information. This is not ideal! But it’s a very human behavior. None of us are perfectly consistent over time.
But AI models don’t exhibit this behavior, right? If an AI product predicts one thing on Monday, it will predict the same thing on Tuesday, right? Surely, with all of the progress that pathology foundation models have made over the last few years, such models aren’t susceptible to the same kind of “mind changing” inconsistency that humans exhibit? Well… unfortunately, it’s not so simple. Don’t believe us? Read on!
Foundation Models have made incredible progress
In the “old days” of computational pathology, machine learning engineers would train separate models for each task (classification, segmentation, detection, etc.) or each cancer type (lung, breast, prostate, etc.). Pathology foundation models offer the exciting possibility of training a single, general feature extractor that can be fine-tuned or applied to many different applications and cancer areas. In just the last two years, we’ve seen an absolute flurry of activity from organizations working on pathology foundation models:
- Hibou-B (HistAI, 2024)
- UNI (Mahmood Lab, 2024)
- TITAN (Mahmood Lab, 2024)
- Virchow2 (Paige.AI, 2024)
- Virchow (Paige.AI, 2024)
- Gigapath (Microsoft, UW, Providence Health, 2024)
- H-Optimus-0 (Bioptimus, 2024)
- Phikon v2 (Owkin, 2024)
- Phikon v1 (Owkin, 2023)
The good news about Foundation Model Progress
The good news is that many of these models do seem to generalize to a number of tasks that were traditionally tackled with entirely separate models. For example, UNI [1] is a vision transformer-based, self-supervised model trained on 100,000+ pathology slides spanning 20 major organ types. The authors train their patch embedder using DINOv2 [2], embed each patch from a slide, and aggregate the patch embeddings from each slide using a standard multiple instance learning (MIL) approach. As illustrated in Figure 1, the aggregated slide-level embeddings demonstrated excellent performance on a variety of downstream tasks across many different cancer areas.
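To make that general recipe concrete, here is a minimal sketch of the two-stage pattern described above, assuming a toy stand-in for a DINOv2-trained patch encoder and gated attention-based MIL pooling in the style of ABMIL. It illustrates the shape of the computation, not UNI’s actual implementation.

```python
import torch
import torch.nn as nn

class AttentionMILPooling(nn.Module):
    """Gated attention pooling (ABMIL-style) over a bag of patch embeddings."""
    def __init__(self, dim: int = 768, hidden: int = 256):
        super().__init__()
        self.attn_v = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh())
        self.attn_u = nn.Sequential(nn.Linear(dim, hidden), nn.Sigmoid())
        self.attn_w = nn.Linear(hidden, 1)

    def forward(self, patch_embeddings: torch.Tensor) -> torch.Tensor:
        # patch_embeddings: (num_patches, dim) for a single slide
        scores = self.attn_w(self.attn_v(patch_embeddings) * self.attn_u(patch_embeddings))
        weights = torch.softmax(scores, dim=0)          # (num_patches, 1), sums to 1
        return (weights * patch_embeddings).sum(dim=0)  # (dim,) slide-level embedding

# Toy stand-in for a self-supervised (e.g. DINOv2-trained) ViT patch encoder.
patch_encoder = nn.Sequential(nn.AdaptiveAvgPool2d(16), nn.Flatten(), nn.Linear(3 * 16 * 16, 768))
patch_encoder.eval()

patches = torch.rand(100, 3, 224, 224)              # 100 tiles cropped from one slide
with torch.no_grad():
    embeddings = patch_encoder(patches)             # (100, 768) patch embeddings
slide_embedding = AttentionMILPooling()(embeddings) # (768,) representation for downstream tasks
```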
For another example, consider the Gigapath model [3]. In this work, researchers from Microsoft also used a transformer-based architecture, but rather than use MIL to aggregate patch-level embeddings into a single slide-level representation, they use a LongNet-based architecture to directly produce a slide-level representation. With this approach, they likewise evaluate the quality of their slide embeddings on a variety of downstream tasks spanning multiple cancer areas and task types. As Figure 2 illustrates, these results represent a big step forward in terms of potential clinical utility.
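For contrast with the MIL pooling sketched above, here is a similarly hedged sketch of the “patch tokens in, one slide embedding out” idea. A standard transformer encoder with a learned class token stands in for Gigapath’s LongNet-style dilated attention (which is what actually makes whole-slide sequence lengths tractable); only the overall shape of the computation is shown.

```python
import torch
import torch.nn as nn

class SlideEncoder(nn.Module):
    """Treats patch embeddings as a token sequence and returns a [CLS]-style slide embedding."""
    def __init__(self, dim: int = 768, heads: int = 8, layers: int = 2):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, patch_embeddings: torch.Tensor) -> torch.Tensor:
        # patch_embeddings: (batch, num_patches, dim), e.g. from a frozen patch encoder
        cls = self.cls_token.expand(patch_embeddings.shape[0], -1, -1)
        tokens = torch.cat([cls, patch_embeddings], dim=1)
        return self.encoder(tokens)[:, 0]   # (batch, dim) slide-level embedding

slide_embedding = SlideEncoder()(torch.rand(1, 500, 768))  # one slide, 500 patch tokens
```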
The bad news about foundation model progress
Unfortunately, despite the fantastic results, these open-source models have not actually been evaluated in real clinical settings, and they struggle with common real-world variations. In effect, they have been optimized for accuracy, but not for robustness to the variation encountered in the real world.
Returning to the example given at the start of the article: if an AI model’s representation is sensitive to subtle variations generated as a routine part of the digital slide scanning process, then it is effectively certain that the model’s behavior will change from one scan to another.
In order to substantiate this claim, we’ll first have to explain how clinical labs operate and ensure consistent quality over time.
How Artera’s foundation models are solving clinical problems right now
Artera’s Prostate test is part of the standard of care for prostate cancer in the US (NCCN guidelines). Artera’s products represent one of the few examples of pathology foundation models actually being used in the field to improve clinical care for doctors and patients. Artera’s test workflow can be broken down into 5 steps:
- A tissue sample is extracted and pathology slides are prepared (the standard biopsy process) in order to perform the initial diagnosis.
- The pathology slides are sent to the Artera lab.
- On receipt, the physical slides are loaded into a digital slide scanner and converted to digital images.
- The digital slides are input to Artera’s AI model workflows and interpreted.
- The model’s predictions are consumed by our report generator and the digital report is made available to the clinician.
Figure 3: Artera’s lab-based workflow used to aid decision making for cancer patients.
Quality control and sources of variation “in the wild”
Like all certified labs, Artera’s lab runs a quality control process to ensure that a consistent set of steps is followed when scanning slides and running our AI models. As part of these checks, a number of fixed reference slides are scanned each day to confirm that we observe the same result from scanning session to scanning session. Because our AI models are deterministic, if we observe a change in the models’ results, we know something has unintentionally changed in our process or in the reference slides, and we can take the necessary steps to remedy the situation. What makes this challenging is the natural variability that occurs when deploying an AI model in the field.
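As a rough illustration of this kind of check (a simplified sketch, not Artera’s actual QC code), one can compare each day’s embedding of a fixed reference slide against a stored baseline and flag the run if the similarity drops below an agreed threshold. The 0.99 threshold and the synthetic embeddings below are hypothetical placeholders.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def qc_check(todays_embedding: np.ndarray,
             baseline_embedding: np.ndarray,
             threshold: float = 0.99) -> bool:
    """Return True if today's scan of the reference slide still matches the baseline."""
    return cosine_similarity(todays_embedding, baseline_embedding) >= threshold

# Hypothetical usage: in practice these would be embeddings of the same physical
# reference slide, produced by the deterministic model pipeline on different days.
baseline = np.random.rand(768)
today = baseline + 0.01 * np.random.randn(768)   # small day-to-day drift
if not qc_check(today, baseline):
    print("QC failure: investigate scanner, staining, or reference slide condition.")
```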
Real World Source of variation #1: Slide Fading
If you’ve ever compared a slide from 30 years ago to a freshly stained pathology slide, you’ll notice a clear difference. The freshly stained slide looks fresh! Its colors are richer and more saturated. This phenomenon is reasonably well known in computational pathology. What is less commonly known is how quickly slide fading can occur. Consider the following real example. On day 1, we scanned a newly stained pathology slide as part of our quality control process. A crop of this slide is shown in Figure 4 (left). The same slide was scanned every day for the next few weeks, and the same crop on day 19 is shown in Figure 4 (right).
As you can see, the slide has undergone a dramatic amount of fading in just two and a half weeks! If your AI pipeline is overly sensitive to stain intensity, your AI product will “change its mind” over the course of several weeks!
Real World Source of variation #2: Dust and Scanner Settings
In a clinical setting, pathology slides are kept in temperature-controlled storage. That said, a technician still needs to remove a slide from this setting, walk over to a digital slide scanner, load it in, and scan the slide. While much effort goes into keeping this process consistent and the rooms clean, these steps are rarely performed in anything resembling a Class 10 (ISO 4) clean room. This means it’s not unusual for dust or a human hair to make its way onto the slide or the sleeve in which the slide is typically stored.
As shown in Figure 5, a small piece of dust has made its way onto the slide (middle right). While this speck is easy to see, closer examination reveals the degree to which many smaller dust particles can be found on the slide on the right, producing a grainier image overall. Complicating things further, many slide scanners have built-in auto white balancing. This means that large dust particles or a human hair may alter the scanner’s white balancing and ultimately affect the appearance of the digital whole slide image.
Given these examples, how might we evaluate the robustness of a model to sources of real-world variation?
Evaluating Foundation Model Sensitivity
At Artera, we’ve made a lot of progress training our own foundation models as well as evaluating publicly available ones. To do this, we sought to model the sources of variation in our clinical environment as faithfully as possible. To that end, we developed an in-house approach that simulates the following (a simplified sketch of such perturbations appears after the list):
- Slide Fading
- Slide Dust
- Blurry Slides
- H&E stain variation
- Brightness, contrast, saturation changes that result from scanner pre-processing variations
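The snippet below gives a rough sense of how perturbations like these can be simulated on an RGB patch tensor. It is an illustrative approximation only: simple saturation decay stands in for fading, random dark specks for dust, Gaussian blur for defocus, and colour jitter for scanner pre-processing variation; faithful H&E stain variation would more typically be modeled with stain deconvolution-based augmentation.

```python
import torch
import torchvision.transforms.functional as TF

def simulate_fading(patch: torch.Tensor, severity: float = 0.5) -> torch.Tensor:
    """Crudely mimic stain fading by washing out saturation and raising brightness."""
    faded = TF.adjust_saturation(patch, 1.0 - severity)
    return TF.adjust_brightness(faded, 1.0 + 0.2 * severity)

def simulate_dust(patch: torch.Tensor, num_specks: int = 20) -> torch.Tensor:
    """Drop small dark specks onto the patch at random locations."""
    dusty = patch.clone()
    _, h, w = dusty.shape
    ys = torch.randint(0, h - 3, (num_specks,))
    xs = torch.randint(0, w - 3, (num_specks,))
    for y, x in zip(ys, xs):
        dusty[:, y:y + 3, x:x + 3] = 0.2   # 3x3 dark speck
    return dusty

def simulate_blur(patch: torch.Tensor, sigma: float = 2.0) -> torch.Tensor:
    """Approximate an out-of-focus scan with a Gaussian blur."""
    return TF.gaussian_blur(patch, kernel_size=9, sigma=sigma)

def simulate_scanner_jitter(patch: torch.Tensor) -> torch.Tensor:
    """Brightness / contrast / saturation shifts of the kind scanner pre-processing introduces."""
    patch = TF.adjust_brightness(patch, 1.1)
    patch = TF.adjust_contrast(patch, 0.9)
    return TF.adjust_saturation(patch, 1.1)

patch = torch.rand(3, 224, 224)                     # placeholder H&E patch in [0, 1]
perturbed = simulate_fading(simulate_dust(patch))   # perturbations can be composed
```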
To evaluate a model’s sensitivity, we present two versions of the same image patch to each foundation model: the original patch and a perturbed patch. We then use cosine similarity between the two resulting embeddings to quantify how much the representation changes. Cosine similarities close to 1 indicate that the foundation model is insensitive to that source of variation (good!), while cosine similarities close to 0 indicate that the foundation model is highly sensitive to it (bad!).
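Written out, the sensitivity measurement is simply a cosine similarity between the embeddings of the clean and perturbed versions of a patch. The sketch below uses a toy stand-in encoder and a saturation-based “fading” perturbation purely for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.transforms.functional as TF

def embedding_similarity(encoder: nn.Module, patch: torch.Tensor, perturb) -> float:
    """Cosine similarity between embeddings of an original and a perturbed patch."""
    encoder.eval()
    with torch.no_grad():
        original = encoder(patch.unsqueeze(0))            # (1, dim)
        perturbed = encoder(perturb(patch).unsqueeze(0))  # (1, dim)
    return F.cosine_similarity(original, perturbed, dim=1).item()

# Toy stand-in encoder and a simple "fading" perturbation, for illustration only.
encoder = nn.Sequential(nn.AdaptiveAvgPool2d(16), nn.Flatten(), nn.Linear(3 * 16 * 16, 768))
fade = lambda p: TF.adjust_saturation(p, 0.5)

patch = torch.rand(3, 224, 224)                     # placeholder H&E patch in [0, 1]
score = embedding_similarity(encoder, patch, fade)
print(f"cosine similarity under simulated fading: {score:.3f}")  # near 1.0 = insensitive
```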
Results
We hope to publish a more comprehensive review of all of our results, but we’ve shared a subset here that is representative of what we consistently see across the various evaluation criteria we use.
As shown in Figure 6, the public foundation model representations are highly sensitive to slide fading, which would unnecessarily break a quality control process like the one described above. In contrast, Artera’s foundation models are quite stable, even in the presence of severe fading.
As Figure 7 illustrates, all of the models are sensitive to some level of noise, but the public foundation models degrade far faster than Artera’s model, which we have trained to be insensitive to the types of real-world noise we typically see in our clinical lab. Noise of the kind used in our evaluation is extremely common, and any would-be distributor of clinical-grade pathology models needs to take it into account.
Most open-source models performed reasonably well on the contrast evaluation, but there remains room for improvement.
What should Foundation Model developers do to ensure real-world impact?
The most critical thing a machine learning engineer or researcher can do is familiarize themselves with the real-world setting in which they hope their model will actually be used. Most academic evaluations focus heavily on model performance on a clinical task. This leaves out an important aspect of real-world utility: accounting for the types of variation that a clinical lab deals with every day. In other words, machine learning engineers should start focusing on robustness to real-world variation.
Indeed, once deployed, what is being tested isn’t just what the AI model produces; it’s what the AI model produces within the broader process of slide preparation and scanning. With that in mind, pathology models that take this into account are far more likely to realize the kind of real-world impact we’re all hoping for.
For a more in-depth overview of how Artera validates its models in the field, Gerrard et al. [4] published the analytical validation protocol used to validate the repeatability and reproducibility of two of our models in the field. Given that Artera has been among the first companies to commercialize an AI test in pathology, there was effectively no prior published work on the subject, and [4] represents the first attempt to share the specifics of how such AI pathology tests can be safely validated in the field.
Conclusions
The progress that our field has made towards pathology foundation models has been incredibly exciting and represents a concrete step forward towards deployed models that assist with or solve real clinical problems. However, much of this progress has been made in a manner that is not sensitive to how these models are actually deployed and used in the field, which will sharply limit their clinical utility. To avoid this, practitioners in both academia and industry need to start by:
- Identifying the real use cases where they hope to have a clinical impact
- Articulating a set of quantitative and comprehensive evaluation routines that ensure the models are used in a safe and efficacious manner
- Working backwards to train models that perform well on those evaluations.
Artera’s published analytical validation routine [4] is still only a subset of all of the validation that Artera does to ensure safety and efficacy in the field. A concerted effort by foundation model developers to take real-world variation into account will help ensure that open-source models are as ready as possible for realizing clinical impact.
References
[1] Chen, Richard J., et al. “Towards a general-purpose foundation model for computational pathology.” Nature Medicine 30.3 (2024): 850–862.
[2] Oquab, Maxime, et al. “Dinov2: Learning robust visual features without supervision.” arXiv preprint arXiv:2304.07193 (2023).
[3] Xu, Hanwen, et al. “A whole-slide foundation model for digital pathology from real-world data.” Nature (2024): 1–8.
[4] Gerrard, Paul, et al. “Analytical Validation of a Clinical Grade Prognostic and Classification Artificial Intelligence Laboratory Test for Men with Prostate Cancer.” AI in Precision Oncology 1.2 (2024): 119–126.