Over the past decade, this research team has pioneered numerous advances in computational pathology, spanning tumor–immune microenvironment analysis, whole-slide image classification, generative modeling, attention-based representation learning, interpretable AI, and studies of pathologist visual attention. Below, we summarize key contributions in each area, highlighting methodological innovations and applications.
One major accomplishment was mapping tumor-infiltrating lymphocytes (TILs) in H&E-stained whole-slide images from The Cancer Genome Atlas (TCGA) to study the tumor–immune microenvironment. In a Cell Reports 2018 study, Saltz et al. processed 5,202 digitized slides across 13 cancer types to create TIL maps using a deep convolutional neural network that “computationally stained” regions containing lymphocytes. The network classified image patches to detect lymphocyte-rich areas, effectively highlighting immune cell infiltrates within tumors. These TIL maps were validated by showing strong agreement with pathologist annotations and with established molecular assays for T-cell density.
Crucially, the spatial patterns of TILs extracted from the slides were correlated with molecular and clinical data. Affinity propagation clustering revealed distinct local TIL spatial structures associated with patient overall survival. The study found that TIL density and spatial arrangement varied significantly across tumor types, immune subtypes, and tumor molecular subtypes. Certain TIL spatial patterns were enriched for specific T-cell subpopulations (from gene expression data) and linked to particular tumor genomic aberrations. In summary, this work demonstrated that routine pathology slides contain rich immunological information: by applying deep learning to map lymphocyte infiltration, one can connect tissue morphology with immune genomics and patient outcomes. This computational pathology approach to immune profiling provides a foundation for objective TIL quantification in diagnostics and for guiding immunotherapy decisions.
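To make this pipeline concrete, below is a minimal sketch (our illustration, not the study’s code) of the two steps just described: summarizing each slide’s per-patch lymphocyte probability map into simple spatial statistics, then grouping slides by TIL spatial structure with affinity propagation. The `til_probs` array and the four summary statistics are hypothetical stand-ins.

```python
# Sketch: per-patch TIL probabilities -> spatial features -> clustering.
import numpy as np
from scipy import ndimage
from sklearn.cluster import AffinityPropagation

def spatial_features(prob_map, threshold=0.5):
    """Summarize the spatial structure of one slide's TIL probability map."""
    til_mask = prob_map >= threshold                 # lymphocyte-rich patches
    labeled, n_clusters = ndimage.label(til_mask)    # connected TIL regions
    if n_clusters == 0:
        return np.zeros(4)
    sizes = ndimage.sum(til_mask, labeled, range(1, n_clusters + 1))
    return np.array([til_mask.mean(),                # overall TIL fraction
                     n_clusters,                     # number of TIL clusters
                     sizes.max(),                    # largest cluster size
                     sizes.mean()])                  # mean cluster size

til_probs = np.random.rand(50, 100, 100)             # stand-in for 50 slides
X = np.stack([spatial_features(m) for m in til_probs])
labels = AffinityPropagation(random_state=0).fit_predict(X)
# `labels` groups slides by TIL spatial structure; such groups can then be
# tested for association with survival and molecular subtypes.
```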
Another significant contribution is a patch-based CNN framework for classifying gigapixel whole-slide images (WSIs) into cancer subtypes. Hou et al. (CVPR 2016) recognized the impracticality of training a single CNN on entire WSIs due to their enormous size (often exceeding 10^9 pixels). Instead, they proposed breaking the WSI into smaller patches and training a patch-level classifier, then intelligently aggregating patch predictions to classify the whole slide. The core challenge was to combine patch outputs while accounting for the fact that not all patches are informative (e.g. many patches may be mostly background or benign tissue).
Their solution was a two-stage model: first a CNN was trained on image patches (e.g. 256×256 regions) to predict patch-level tumor subtype labels; second, a decision fusion model (e.g. logistic regression on the distribution of patch predictions) was trained to produce the slide-level diagnosis. Moreover, they introduced an Expectation-Maximization (EM) algorithm to iteratively identify and up-weight discriminative patches while down-weighting non-informative ones. This EM-based attention mechanism exploited spatial relationships between patches to refine which regions truly characterize the cancer subtype. They applied this approach to classify subtypes of glioma and non–small cell lung carcinoma, achieving accuracy on par with inter-pathologist agreement. Notably, in controlled experiments with smaller images, the patch-based CNN outperformed a conventional image-level CNN, confirming the benefit of preserving high-resolution detail.
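The following sketch illustrates the two-stage idea under simplifying assumptions (it is not the original implementation): an EM-style selection step keeps the patches most consistent with the slide label, and a logistic-regression fusion model classifies the histogram of patch-level predictions. The synthetic `slides` data and the `keep_frac` parameter are illustrative only.

```python
# Sketch: EM-style patch selection + decision fusion for slide classification.
import numpy as np
from sklearn.linear_model import LogisticRegression

def em_select(patch_probs, slide_label, keep_frac=0.5):
    """E-step: keep the patches most consistent with the slide-level label."""
    scores = patch_probs[:, slide_label]
    cutoff = np.quantile(scores, 1.0 - keep_frac)
    return scores >= cutoff                          # mask of kept patches

def slide_descriptor(patch_probs, n_classes):
    """Decision-fusion input: class histogram of patch-level predictions."""
    hist = np.bincount(patch_probs.argmax(axis=1), minlength=n_classes)
    return hist / max(hist.sum(), 1)

rng = np.random.default_rng(0)
n_classes = 3
slides = [(rng.dirichlet(np.ones(n_classes), size=200),  # fake patch outputs
           int(rng.integers(n_classes)))                  # fake slide label
          for _ in range(40)]

# In the full pipeline, the selection mask would drive CNN retraining at each
# EM iteration; here we show only the selection and fusion steps.
X = np.stack([slide_descriptor(p[em_select(p, y)], n_classes) for p, y in slides])
y = np.array([y for _, y in slides])
fusion = LogisticRegression(max_iter=1000).fit(X, y)
```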
In recent work (CVPR 2025), the group has pushed the frontier of generative modeling for pathology images, addressing the challenge of synthesizing realistic large-scale tissue images. Yellapragada et al. introduced ZoomLDM, a latent diffusion model capable of multi-scale histology image generation. Diffusion models had shown great success in image generation, but previously could not be directly applied to whole-slide images due to memory and data limitations – prior efforts could only generate small patches and missed global context. ZoomLDM overcomes this by introducing a magnification-aware conditioning mechanism that allows generation at different zoom levels (resolutions) of a WSI. In practice, the model is trained on image patches along with their magnification information and self-supervised context embeddings, so that it can conditionally generate a patch at a specified scale or “zoom”.
By modeling multiple scales, ZoomLDM acts like a digital “zoom lens”: it can generate a low-magnification thumbnail capturing global tissue architecture, and also high-magnification patches with fine cellular details, in a coherent way. This multi-scale approach achieved state-of-the-art image generation quality across various resolutions on pathology datasets. Notably, it excelled in the data-scarce task of generating entire slide thumbnails (e.g. 1.25× magnification images), where prior models struggled. The authors demonstrated globally consistent synthesis of large images up to 4096×4096 pixels (covering a substantial tissue region) by leveraging coarse-to-fine generation with MultiDiffusion sampling. Additionally, they showed that features extracted from the multi-scale diffusion model are highly effective for downstream WSI classification tasks in a multiple-instance learning setting – indicating that ZoomLDM not only produces images, but also learns useful representations of tissue structure. This work represents a substantial advance in generative AI for pathology, enabling realistic simulation of tissue images for data augmentation, education, and other applications that leverage the multi-scale nature of histology.
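A minimal sketch of the conditioning idea follows, with a toy stand-in for the denoising network (this is not the ZoomLDM architecture): the denoiser receives an embedding of the requested magnification alongside the timestep, so one model can generate latents at any zoom level.

```python
# Sketch: magnification-aware conditioning of a latent-diffusion denoiser.
import torch
import torch.nn as nn

class MagConditionedDenoiser(nn.Module):
    def __init__(self, latent_ch=4, cond_dim=128, mag_levels=(1.25, 5, 10, 20)):
        super().__init__()
        self.mag_levels = list(mag_levels)
        self.mag_embed = nn.Embedding(len(mag_levels), cond_dim)  # one per zoom
        self.time_embed = nn.Linear(1, cond_dim)
        self.net = nn.Sequential(                    # toy stand-in for a U-Net
            nn.Conv2d(latent_ch + cond_dim, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, latent_ch, 3, padding=1))

    def forward(self, z_t, t, magnification):
        idx = torch.tensor([self.mag_levels.index(m) for m in magnification])
        cond = self.mag_embed(idx) + self.time_embed(t[:, None].float())
        cond_map = cond[:, :, None, None].expand(-1, -1, *z_t.shape[2:])
        return self.net(torch.cat([z_t, cond_map], dim=1))  # predicted noise

model = MagConditionedDenoiser()
z_t = torch.randn(2, 4, 32, 32)                      # noisy latents
eps = model(z_t, t=torch.tensor([10, 500]), magnification=[1.25, 20])
```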
The team has also pioneered methods to improve representation learning for digital pathology by addressing attention biases in self-supervised models. In Medical Image Analysis 2024, Kapse et al. presented DiRL (Diversity-inducing Representation Learning), which combats the problem of attention sparsity in vision transformers trained on histopathology images. They observed that vanilla self-supervised models (e.g. those using contrastive learning) tend to focus their attention on a few dominant regions of an image: the network “locks onto” the most obvious structure (e.g. a large tumor gland) and ignores other context. While such focused attention might be acceptable in natural images (where it often corresponds to the main object), it is suboptimal in pathology because tissue images are not object-centric; a single slide may contain a mix of cell types and patterns, all of which are relevant. Insufficiently diverse attention can thus discard important information about the tissue microenvironment.
To address this, they developed a novel pre-training strategy to de-sparsify the attention maps of a vision transformer, forcing the model to distribute attention more evenly across different tissue components. Specifically, they leverage prior knowledge in the form of cell segmentation: by identifying cell locations, they extract multiple patch representations per image (each centered on different cells or regions). They then introduce a dense matching pretext task in self-supervised learning that encourages the model to align corresponding regions between different augmented views of an image. In essence, instead of contrasting whole-image embeddings, the model must match multiple localized embeddings (features from different cells/regions), which prevents it from collapsing its focus onto one area. This prior-guided dense SSL approach led the network to “attend to various components more closely and evenly, thus inducing adequate diversification of attention”. Experiments on histopathology datasets showed that DiRL’s representations capture more globally distributed features and improve performance on multiple WSI classification and retrieval tasks. By introducing attention de-sparsification, this work provides a general strategy to boost representation diversity in computational pathology models, ultimately making them more robust to the complex heterogeneity of real-world tissue images.
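The loss below is an assumption-laden simplification of the dense matching pretext task: instead of contrasting one global embedding per view, the model must match several cell-centered region embeddings between two augmented views, with each region serving as its own positive pair.

```python
# Sketch: dense matching of cell-centered region embeddings across two views.
import torch
import torch.nn.functional as F

def dense_matching_loss(regions_a, regions_b, temperature=0.1):
    """regions_a, regions_b: (n_regions, dim) embeddings of the same
    cell-centered regions under two different augmentations."""
    a = F.normalize(regions_a, dim=1)
    b = F.normalize(regions_b, dim=1)
    logits = a @ b.T / temperature          # region-to-region similarities
    targets = torch.arange(a.size(0))       # the i-th region matches itself
    return F.cross_entropy(logits, targets)

# Toy usage: 8 cell-centered regions with 256-dim features per view.
view_a, view_b = torch.randn(8, 256), torch.randn(8, 256)
loss = dense_matching_loss(view_a, view_b)
```

Because every region must be matched correctly, the encoder cannot collapse its attention onto a single dominant structure.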
Interpretability is crucial in pathology AI, and the group has made notable strides in developing self-interpreting deep models for whole-slide analysis. A highlight in this area is SI-MIL (Self-Interpretable Multiple Instance Learning) by Kapse et al. (CVPR 2024), which introduces interpretability into MIL-based WSI classifiers from the ground up. Traditional MIL methods for WSI (including the patch-based models above) often use an attention mechanism to identify important patches, which yields some interpretability – e.g. the top-attended patches can be visually highlighted for a pathologist. However, this only indicates where the model looked; it does not explain why those regions were important or what features led to the slide’s classification. Thus, pathologists get limited insight into the decision rationale beyond a heatmap of “salient” regions.
SI-MIL addresses this by incorporating domain-specific knowledge (human-understandable features) into the model’s structure. It employs a standard deep MIL neural network for WSI classification, but couples it with an interpretable branch that is grounded in handcrafted pathological features. In practice, this means the model is constrained to make linear predictions based on a set of predefined features (such as cell morphology, density, nuclear atypia, etc.) which are computed for the identified salient regions. The MIL network selects the most relevant regions, and for those regions the interpretable branch evaluates feature values that mimic criteria a pathologist might consider. By design, the final slide-level prediction is decomposed into contributions of these intuitive features, providing a feature-level explanation for each decision. For example, instead of simply flagging a patch as “tumor,” the model might report that a region had a high mitotic count or an irregular glandular architecture which influenced the cancer grade prediction.
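A minimal sketch of this design (ours, not the SI-MIL code) appears below: an attention branch over deep features ranks patches, while a linear head over handcrafted features of the top-ranked patches produces the slide score, so each feature’s signed contribution can be read off directly. The dimensions and `top_k` value are illustrative.

```python
# Sketch: deep MIL patch ranking + linear head over handcrafted features.
import torch
import torch.nn as nn

class SelfInterpretableMIL(nn.Module):
    def __init__(self, deep_dim=512, n_handcrafted=16, top_k=20):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(deep_dim, 128), nn.Tanh(),
                                  nn.Linear(128, 1))       # patch relevance
        self.linear = nn.Linear(n_handcrafted, 1)          # interpretable head
        self.top_k = top_k

    def forward(self, deep_feats, handcrafted_feats):
        # deep_feats: (n_patches, deep_dim); handcrafted_feats: (n_patches, n_hc)
        scores = self.attn(deep_feats).squeeze(-1)
        top = scores.topk(min(self.top_k, scores.numel())).indices
        slide_feats = handcrafted_feats[top].mean(dim=0)   # aggregate top patches
        logit = self.linear(slide_feats)
        contributions = self.linear.weight.squeeze(0) * slide_feats
        return logit, contributions        # prediction + per-feature explanation

model = SelfInterpretableMIL()
logit, contribs = model(torch.randn(1000, 512), torch.rand(1000, 16))
```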
Despite the added constraints for interpretability, SI-MIL does not sacrifice accuracy. In fact, the study showed that with its linear prediction layer, SI-MIL achieved performance competitive with state-of-the-art black-box models across three cancer types, effectively debunking the notion of an inherent trade-off between interpretability and accuracy. The authors also rigorously evaluated interpretability through both quantitative metrics and a user study with pathologists, confirming that SI-MIL’s explanations were user-friendly and faithful to the model’s behavior. In addition to SI-MIL, the team has explored attention-based interpretability in other contexts (e.g. analyzing transformer attention maps for cell classification). Collectively, these efforts advance transparent AI in pathology, enabling models that not only predict well but also provide diagnostic reasoning that clinicians can trust and understand.
The team also studies how pathologists visually examine slides. By recording experts’ zoom and pan interactions with digital slides, they train models to predict a pathologist’s attention trajectory, along with heatmaps of what was examined and at what magnification. These models began with CNNs, moved to transformers, and now incorporate vision-language models.
This line of work produced the Pathologist Attention Transformer (PAT), which predicts where an expert would look next on a slide. Such a model could power intelligent tutoring systems: for instance, guiding a trainee to examine the regions an expert would, thereby teaching expert-like search patterns. This research, at the intersection of AI and human cognition, not only reveals how expert pathologists visually interpret slides but also informs the design of assistive tools to improve medical training and decision support.
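As a sketch of what such a predictor can look like (hypothetical, not the PAT architecture), a causal transformer can read the sequence of (x, y, magnification) viewports recorded so far and regress the next one:

```python
# Sketch: next-viewport prediction from a pathologist's navigation history.
import torch
import torch.nn as nn

class NextFixationModel(nn.Module):
    def __init__(self, d_model=64, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(3, d_model)               # (x, y, zoom) -> token
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 3)                # next (x, y, zoom)

    def forward(self, trajectory):                       # (B, T, 3)
        T = trajectory.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(T)
        h = self.encoder(self.embed(trajectory), mask=causal)
        return self.head(h[:, -1])                       # predict the next step

model = NextFixationModel()
past = torch.rand(1, 12, 3)                              # 12 recorded viewports
next_viewport = model(past)                              # where to look next
```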
In addition to the above themes, the team and collaborators have produced a wide array of other contributions in pathology AI:
- Unsupervised & Weakly-Supervised Learning for Pathology: Hou et al. (CVPR 2019) developed a framework for robust nuclei segmentation across diverse tissue types without manual labels. They synthesized training data by generating random polygonal nuclear masks and blending them with real tissue textures, then applied an importance sampling loss to preferentially train on realistic synthetic examples. This approach eliminated the need for per-tissue annotations and achieved segmentation performance on new cancer types that matched fully-supervised models. It demonstrated that carefully crafted simulation (both “to label” and “to synthesize”) can generalize a model far beyond its original training distribution (a minimal sketch of this synthesize-to-train idea appears after this list).
- Topological Data Analysis for Cell Patterns: Abousamra et al. (CVPR 2023) introduced a novel use of topological descriptors to model the spatial context of cells in tissue. They integrated tools from spatial statistics and topology (e.g. persistent homology to quantify cell clusters and voids) into a generative model that produces synthetic multi-cell layouts. By conditioning on these topology features and enforcing them via a differentiable loss, they managed to generate realistic arrangements of different cell types (lymphocytes, cancer cells, etc.) that reflect true tissue architecture. These topology-guided cell simulations were used as data augmentation, boosting downstream cell classification accuracy. This work bridged mathematical topology with deep learning, opening a new path to incorporate higher-order spatial constraints in pathology models (a sketch of such topological descriptors follows the list).
- Generative Models and Diffusion for Histology: Apart from ZoomLDM, the team explored other generative frameworks. Xu et al. (MICCAI 2023, Best Paper in the DGM4MICCAI workshop) proposed ViT-DAE, a Vision Transformer-driven diffusion autoencoder for histopathology image synthesis. This model combined the stability and diversity of diffusion models with a ViT backbone to better capture global tissue structure. ViT-DAE produced high-quality, diverse synthetic tissue patches and outperformed contemporary GANs in realism, showcasing the promise of diffusion+Transformer architectures in medical image generation.
- Multi-Scale and Semi-Supervised Learning: The group has also tackled multi-scale feature learning beyond generation. For instance, CD-Net (Context-Detail Network) by Kapse, Das, Prasanna et al. learns joint representations from both low-resolution context and high-resolution detail patches in a WSI pyramid (using a dual-branch Transformer and a DINO self-supervised objective). In a similar vein, Howlader et al. (ECCV 2024) developed a semi-supervised segmentation method that uses a multi-scale patch-wise classifier to propagate labels, effectively leveraging unlabeled data to improve pixel-wise tumor segmentation. These efforts address the inherent multi-scale nature of pathology images, ensuring that AI models consider both the forest and the trees.
- Uncertainty and Domain Adaptation: To make AI predictions more reliable, the team has investigated uncertainty estimation and domain generalization in pathology models. In a CVPR 2024 workshop paper, Yun et al. combined labeled and unlabeled data to estimate model uncertainty for tumor detection. Such techniques can flag low-confidence regions on a slide for human review, increasing trust in AI-assisted diagnoses. Additionally, many of the group’s works emphasize domain generalization – e.g. training models that perform robustly on unseen cancer types or scanner settings – which is vital for real-world deployment of pathology AI.
- Scalable Data Infrastructure for Pathology: Aji et al. (PVLDB 2013) developed Hadoop-GIS, a high-performance spatial data warehousing system for managing massive pathology image datasets on a Hadoop cluster. This system enabled efficient querying of extremely large whole-slide images and other spatial data, and was honored with the 2024 VLDB Test of Time Award in recognition of its enduring impact.
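As referenced in the first bullet above, here is a minimal sketch of the synthesize-to-train idea (an illustration, not the CVPR 2019 pipeline): random nucleus-shaped masks are blended into a tissue texture, yielding paired images and labels for segmentation training without manual annotation. The circular blobs and the `darken` factor are simplifications of the paper’s polygonal masks and texture blending.

```python
# Sketch: synthesizing (image, mask) pairs for label-free nuclei segmentation.
import numpy as np

def random_nuclei_mask(size=256, n_nuclei=30, rng=None):
    rng = rng or np.random.default_rng()
    yy, xx = np.mgrid[:size, :size]
    mask = np.zeros((size, size), dtype=bool)
    for _ in range(n_nuclei):
        cy, cx = rng.integers(size, size=2)              # nucleus center
        r = rng.integers(4, 12)                          # nucleus radius
        mask |= (yy - cy) ** 2 + (xx - cx) ** 2 <= r ** 2
    return mask

def synthesize(tissue_patch, mask, darken=0.45):
    """Blend nucleus-shaped dark regions into a real tissue texture."""
    img = tissue_patch.astype(float).copy()
    img[mask] *= darken                                  # nuclei stain darker
    return img.clip(0, 255).astype(np.uint8), mask

tissue = np.full((256, 256, 3), 220, dtype=np.uint8)     # stand-in for texture
mask = random_nuclei_mask()
image, labels = synthesize(tissue, mask)                 # one training pair
```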
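And, as referenced in the topology bullet, below is a sketch of computing topological descriptors of a cell layout with the GUDHI library (assumed available; this is our illustration, not the authors’ code): persistent homology of the cell point cloud summarizes clusters (H0) and loops/voids (H1), the kind of statistics a generative model can be conditioned on.

```python
# Sketch: persistent-homology descriptors of a multi-cell spatial layout.
import numpy as np
import gudhi

cells = np.random.rand(200, 2) * 1000      # stand-in (x, y) cell centroids
rips = gudhi.RipsComplex(points=cells, max_edge_length=150.0)
st = rips.create_simplex_tree(max_dimension=2)
diagram = st.persistence()                 # list of (dim, (birth, death))

h0 = [d - b for dim, (b, d) in diagram if dim == 0 and d != float('inf')]
h1 = [d - b for dim, (b, d) in diagram if dim == 1]
descriptor = np.array([len(h0), np.mean(h0) if h0 else 0.0,
                       len(h1), np.mean(h1) if h1 else 0.0])
# `descriptor` captures how strongly cells cluster and how large the voids
# between them are; matching such statistics between real and synthetic
# layouts is the essence of topology-guided generation.
```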
Conclusion
Through these diverse but interconnected contributions, Joel Saltz, Dimitris Samaras, Prateek Prasanna, Chao Chen, Ken Shroyer, Tahsin Kurc, Raj Gupta, Gregory Zelinsky, Fusheng Wang, and collaborators have substantially advanced the state of Pathology AI. Their work ranges from foundational methods (like patch-based CNNs and MIL interpretability) to cutting-edge innovations (like multi-scale diffusion generation and topology-guided synthesis), all aimed at improving cancer diagnosis, prognostication, and the understanding of disease through computational analysis of pathology images.
- J. Saltz et al., “Spatial Organization and Molecular Correlation of Tumor-Infiltrating Lymphocytes Using Deep Learning on Pathology Images,” Cell Reports, vol. 23, no. 1, pp. 181–193.e7, Apr. 2018. DOI: 10.1016/j.celrep.2018.03.086
- L. Hou et al., “Patch-based Convolutional Neural Network for Whole Slide Tissue Image Classification,” in Proc. CVPR, 2016, pp. 2424–2433. DOI: 10.1109/CVPR.2016.267
- S. Yellapragada et al., “ZoomLDM: Latent Diffusion Model for Multi-Scale Image Generation,” in Proc. CVPR, 2025 (to appear). arXiv:2411.16969 [cs.CV] (Nov. 2024)
- S. Kapse et al., “Attention De-sparsification Matters: Inducing Diversity in Digital Pathology Representation Learning,” Med. Image Anal., vol. 93, 103070, 2024. DOI: 10.1016/j.media.2023.103070
- S. Kapse et al., “SI-MIL: Taming Deep MIL for Self-Interpretability in Gigapixel Histopathology,” in Proc. CVPR, 2024, pp. 11226–11237. DOI: 10.1109/CVPR.2024.01145
- S. Chakraborty et al., “Decoding the Visual Attention of Pathologists to Reveal Their Level of Expertise,” in Proc. MICCAI, LNCS 13433, 2022, pp. 90–100. DOI: 10.1007/978-3-031-16440-8_9
- L. Hou et al., “Robust Histopathology Image Analysis: To Label or to Synthesize?,” in Proc. CVPR, 2019, pp. 8525–8534. DOI: 10.1109/CVPR.2019.00873
- S. Abousamra et al., “Topology-Guided Multi-Class Cell Context Generation for Digital Pathology,” in Proc. CVPR, 2023, pp. 3323–3333. DOI: 10.1109/CVPR46375.2023.00330
- X. Xu et al., “ViT-DAE: Transformer-Driven Diffusion Autoencoder for Histopathology Image Analysis,” in Proc. MICCAI Workshops (DGM4MICCAI), LNCS 14533, 2023, pp. 66–76. DOI: 10.1007/978-3-031-53767-7_7
- P. Howlader et al., “Beyond Pixels: Semi-Supervised Semantic Segmentation with a Multi-Scale Patch-Based Multi-Label Classifier,” in Proc. ECCV, 2024 (to appear).
- J. Yun et al., “Uncertainty Estimation for Tumor Prediction with Unlabeled Data,” in Proc. CVPR Workshops, 2024, pp. 6946–6954. DOI: 10.1109/CVPRW.2024.00711
- G. Zelinsky et al., “Predicting pathologist attention during cancer-image readings,” Journal of Vision, vol. 25, no. 9, p. 2736, 2025. DOI: 10.1167/jov.25.9.2736
- A. Aji et al., “Hadoop-GIS: A High Performance Spatial Data Warehousing System over MapReduce,” Proc. VLDB Endow., vol. 6, no. 11, pp. 1009–1020, 2013. DOI: 10.14778/2536222.2536227
