Dr. Joel Saltz's Digital Pathology Research
My research in Digital Pathology spans twenty-five years and consists of closely coordinated efforts in data science, machine learning, software design, database design and high-end computing. My group has developed tools and methods through years of funded projects supported by a wide range of institutes and agencies including NCI, NLM, NIBIB, NSF, DARPA, AFOSR, NASA, DOD and DOE. This work laid much of the foundation for digital pathology as it is today.
I led the team at Johns Hopkins and the University of Maryland College Park that was the first to develop the “Virtual Microscope,” and pioneered developments in digital pathology whole slide image navigation, data management, caching strategies and computer-aided classification.
Visualization, query, caching, data management, pipeline execution: My team’s initial efforts included development of the first whole slide image viewer, followed by several years of effort to develop efficient methods to support whole slide image visualization, query, caching and data management, along with efficient systems software support for whole slide data analytic pipelines. This work targeted clusters and HPC systems. We went on to develop methods capable of analysis and visualization of 3-D Pathology images generated from serial sections and management of data obtained through multi-resolution whole slide image capture. In the seven years since I came to Stony Brook, much of the focus has shifted to management, visualization and quality control of derived data products such as instance semantic segmentation datasets generated by AI algorithms.
- A Containerized Software System for Generation, Management, and Exploration of Features from Whole Slide Tissue Images. Cancer Research, 2017
- ImageMiner: a software system for comparative analysis of tissue microarrays using content-based image retrieval, high-performance computing, and grid technology, JAMIA 2011.
- A data model and database for high-resolution pathology analytical image informatics. J Pathol Inform 2011
- The Virtual Microscope. IEEE Transactions on Information Technology in Biomedicine. 2003
- Visualization of Large Datasets with the Active Data Repository. IEEE Computer Graphics and Applications. 2001.
- Digital Dynamic Telepathology-The Virtual Microscope. Proceedings of AMIA Fall Symposium, Orlando, FL. 1998
- The Virtual Microscope. Proceedings of AMIA Annual Fall Symposium. 1997: 449-53.
Machine learning, Computer Vision and AI Based Digital Pathology Methods: My group has developed a variety of machine learning methods to target analysis of whole slide images. Early work in this area included development of machine learning methods for Neuroblastoma and Lymphoma classification. More recently, my group has developed innovative multi-instance learning methods to classify whole slide H&E images using coarse-grained, case-level training data. The 2016 CVPR paper describing this work has been highly influential and has been cited over 340 times (Google Scholar, January 2021). My group has also developed innovative methods that leverage generative adversarial networks to generalize nuclear segmentation across tissue types. This work was described in a 2019 CVPR paper. We then went on to use the method to create a dataset consisting of roughly 5 billion segmented nuclei; this dataset is public and documented in our 2020 Nature Scientific Data publication.
We have developed a rich set of pipelines to employ a variety of convolutional network algorithms to compute biologically significant Pathology features from H&E and multiplex IHC images, including spatial maps of tumor infiltrating lymphocytes (TILs). Development of one of our early CNN methods, designed to classify TILs in multiple tumor types, was motivated by our participation in the Pan Cancer Atlas project. The effort encompassed a comprehensive analysis of the relationship between spatial TIL patterns and molecular tumor characteristics. This work was published in a 2018 Cell Reports article which has been cited over 230 times (Google Scholar, January 2021). We have gone on to leverage and refine our TIL classification methods in many contexts; see, for instance, our 2020 American Journal of Pathology publication on TILs and breast cancer.
We have also developed AI and traditional machine learning based methods to support analysis of tissue microarray and multiplex IHC studies, as well as methods to support 3D digital tissue reconstruction.
- Dataset of segmented nuclei in hematoxylin and eosin stained histopathology images of ten cancer types. Nature Sci Data 2020
- Robust Histopathology Image Analysis: To Label or to Synthesize? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2019
- Patch-based convolutional neural network for whole slide tissue image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2016
- Deep learning-based image analysis methods for brightfield-acquired multiplex immunohistochemistry images. Diagnostic pathology, 2020.
- Utilizing automated breast cancer detection to identify spatial distributions of tumor infiltrating lymphocytes in invasive breast cancer. The American Journal of Pathology 2020
- Spatial organization and molecular correlation of tumor-infiltrating lymphocytes using deep learning on pathology images. Cell Reports. 2018
- Development of a framework for large scale three-dimensional pathology and biomarker imaging and spatial analytics. AMIA Summits on Translational Science Proceedings, 2017
- A methodology for texture feature-based quality assessment in nucleus segmentation of histopathology images. Journal of pathology informatics 2017
- Integrated Morphologic Analysis for the Identification and Characterization of Disease Subtypes. Journal of the American Medical Informatics Association (JAMIA). 2012
- Computer-aided Prognosis of Neuroblastoma on Whole-slide Images: Classification of Stromal Development. Pattern Recognit 2009
- Histopathological Image Analysis Using Model-Based Intermediate Representations and Color Texture: Follicular Lymphoma Grading. J Signal Process Sys 2009
- A caGrid-Enabled, Learning Based Image Segmentation Method for Histopathology Specimens. Proc IEEE Int Symp Biomed Imaging 2009
- An Imaging Workflow for Characterizing Phenotypical Change in Terabyte Sized Mouse Model Datasets. Journal of Bioinformatics, 2008.
Computer Science Research
Runtime Compilation: From the late 1980s through the mid-1990s I developed and refined the inspector/executor paradigm. Irregular array accesses arise in many scientific applications including sparse matrix solvers, unstructured mesh partial differential equation (PDE) solvers and particle methods. A common characteristic of irregular applications is the use of indirect indexing to represent relationships among array elements: data arrays are indexed through the values of other arrays, called indirection arrays. Traditional compilation techniques required that indices to data arrays be symbolically analyzable at compile time, so the inability to characterize array access patterns symbolically can prevent compilers from generating efficient code for irregular applications. The inspector/executor strategy involves using compilers to generate code that examines and analyzes data references during program execution. The results of this execution-time analysis may be used 1) to determine which off-processor data needs to be fetched and where the data will be stored once it is received and 2) to reorder and coordinate execution of loop iterations in problems with irregular loop-carried dependencies. The initial examination and analysis phase is called the inspector, while the phase that uses the results of the analysis to optimize program execution is called the executor. My later work in this area encompassed compiler transformations that used slicing and partial redundancy elimination to generate optimized inspector/executor code, along with integration of inspector/executor methods into PGAS languages. This approach is frequently used to this day in a variety of contexts including GPU optimizations, and my publications in this area are still frequently cited.
- An Interprocedural Framework for Placement of Asynchronous I/O Operations.
- Interprocedural Compilation of Irregular Applications for Distributed Memory Machines
- Runtime Support and Compilation Methods for User-specified Irregular Data Distributions
- Runtime and Language Support for Compiling Adaptive Irregular Programs on Distributed Memory Machines
- Communication Optimizations for Irregular Scientific Computations on Distributed Memory Architectures. Journal of Parallel and Distributed Computing
- Slicing Analysis and Indirect Access to Distributed Arrays
- Software Support for Irregular and Loosely Synchronous Problems
- Run-Time Parallelization and Scheduling of Loops
- Run-Time Scheduling and Execution of Loops on Message Passing Machines
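The inspector/executor strategy described under Runtime Compilation can be illustrated with a small sketch. This is not code from any of the publications listed above; it is a minimal, single-process simulation that assumes a block distribution of the data array and uses hypothetical names (`Schedule`, `inspector`, `executor`, `x_parts`). The inspector scans the indirection array once, records which off-processor elements must be fetched and where they will be stored, and the executor then performs the gather and runs the loop using only local buffer indices.

```python
from dataclasses import dataclass


@dataclass
class Schedule:
    """Output of the inspector (hypothetical structure for this sketch)."""
    fetch_from: dict   # owner rank -> list of global indices to fetch from it
    ghost_slot: dict   # global index -> slot in the local buffer's ghost region
    local_index: list  # per reference: index into the local + ghost buffer


def owner(g, block):
    """Rank that owns global index g under an assumed block distribution."""
    return g // block


def inspector(rank, block, indirection):
    """Analyze the indirection array at run time and build a schedule."""
    lo = rank * block
    fetch_from, ghost_slot, local_index = {}, {}, []
    for g in indirection:
        if owner(g, block) == rank:
            local_index.append(g - lo)           # already local
        else:
            if g not in ghost_slot:              # fetch each remote element once
                ghost_slot[g] = block + len(ghost_slot)
                fetch_from.setdefault(owner(g, block), []).append(g)
            local_index.append(ghost_slot[g])
    return Schedule(fetch_from, ghost_slot, local_index)


def executor(rank, block, schedule, x_parts):
    """Gather off-processor data per the schedule, then run the loop locally."""
    buf = list(x_parts[rank]) + [None] * len(schedule.ghost_slot)
    for src, wanted in schedule.fetch_from.items():
        for g in wanted:                         # stands in for a message receive
            buf[schedule.ghost_slot[g]] = x_parts[src][g - src * block]
    # The irregular loop y[i] = x[ia[i]] now touches only the local buffer.
    return [buf[j] for j in schedule.local_index]


# Two simulated ranks, four elements each; rank 0 references indices 1, 5, 7, 5.
x_parts = [[0, 10, 20, 30], [40, 50, 60, 70]]
sched = inspector(rank=0, block=4, indirection=[1, 5, 7, 5])
y = executor(rank=0, block=4, schedule=sched, x_parts=x_parts)
```

In this sketch the schedule is built once and can be reused for as long as the indirection array is unchanged, which is what lets the cost of the inspector be amortized over many executor passes.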
Data Science Middleware: From the mid-1990s through roughly 2010, I developed a variety of methods to support what is now called edge computing. These methods include Active Disks; one of my several papers in this area has over 500 citations as of January 2021 and continues to be cited. Another research project along these lines was DataCutter, which consisted of lightweight portable processes designed to form distributed pipelines, with processes that could be instantiated or moved close to data or to computational resources.
The Active Data Repository project was a spatial database project that ran from the mid-1990s through the early 2000s and in many respects presaged Hadoop and Spark. It consisted of a system optimized to support computations that made use of generalized reduction operations. This effort incorporated spatial indexing and optimized clustering/declustering methods to optimize I/O in clusters and parallel machines. After the development of Hadoop and Spark, in the 2013-2017 timeframe, my team went on to develop analogous software systems that employed these tools. The VLDB article on this topic from 2013 has been cited over 600 times (as of January 2021) and the Hilbert space-filling curve article in IEEE TKDE from 2001 has been cited over 850 times (as of January 2021).
- SparkGIS: Resource Aware Efficient In-Memory Spatial Query Processing. In Proceedings of the 25th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems 2017
- Efficient irregular wavefront propagation algorithms on hybrid CPU–GPU machines. Parallel Computing 2013
- Parameterized Specification, Configuration and Execution of Data-Intensive Scientific Workflows, Cluster Computing: the Journal of Networks, Software Tools and Applications, Special Issue on High Performance Distributed Computing, 2010
- Analysis and Semantic Querying in Large Biomedical Image Datasets. IEEE Computer. 2008
- I/O Conscious Algorithm Design and Systems Support for Data Analysis on Emerging Architectures. IPDPS 2006
- Processing Large-scale Multidimensional Data In Parallel and Distributed Environments. Parallel Computing, 2002
- Analysis of the Clustering Properties of the Hilbert Space-Filling Curve. IEEE Transactions on Knowledge and Data Engineering. 2001
- Evaluation of Active Disks for Decision Support Databases. Proceedings of the 6th International Symposium on High-Performance Computer Architecture. 2000
- Querying very large multi-dimensional datasets in ADR. Supercomputing ’99, 1999
- Titan: a high performance remote-sensing database. Proceedings of the 1997 International Conference on Data Engineering. 1997
- Sumatra: a language for resource-aware mobile programs. Mobile Object Systems-Towards the Programmable Internet. 1997
Artificial Intelligence, Machine Learning: My current AI projects focus on a variety of efforts that target development of deep learning algorithms for semantic instance segmentation in large spatial datasets. Methods include super-resolution approaches able to make use of multi-scale label data, novel generative adversarial networks to generalize training data, and multi-instance learning methods able to make use of very high-level, image-level training data to classify gigapixel images. This work has been published at the CVPR and ICLR conferences. My group’s 2016 CVPR multi-instance learning paper has over 350 citations (as of January 2021).
- Robust Histopathology Image Analysis: To Label or to Synthesize? CVPR 2019
- Label super-resolution networks. In International Conference on Learning Representations, 2018
- Patch-based convolutional neural network for whole slide tissue image classification. CVPR 2016
- Computer-aided Prognosis of Neuroblastoma on Whole-slide Images: Classification of Stromal Development, Pattern Recognition 2009
- A New Deformable Model for Boundary Tracking in Cardiac MRI and Its Application to the Detection of Intra-Ventricular Dysynchrony, CVPR 2006
- Using Distributed Query Result Caching to Evaluate Queries for Parallel Data Mining Algorithms. Proceedings of PDPTA’98 1998.
Driving Applications: My Computer Science related research efforts have targeted development of generalizable methods, algorithms and tools. When possible, my approach has been to identify multiple complementary applications to drive my research. In addition to digital Pathology and biomedical informatics, my computer science and data science research has been driven by problems involving: 1) earth science related porous media analyses including oil reservoir management, pollution remediation and carbon sequestration, 2) computational aerospace, 3) changes in land cover analyzed using satellite images, 4) digital sky research, 5) computational chemistry, 6) analysis of neuroscience data, 7) 3D tissue reconstruction from microscopic sections to characterize organ phenotype changes associated with genomic alterations and 8) computational linear algebra.
- Dynamic Decision and Data-Driven Strategies for the Optimal Management of Subsurface Geo-Systems.
- Architectural Implications for Spatial Object Association Algorithms
- Rb is critical in a mammalian tissue stem cell population
- Towards Dynamic Data-driven Management of the Ruby Gulch Waste Repository
- The Design and Evaluation of a High-Performance Earth-Science Database
- Parallelizing Molecular Dynamics Programs for Distributed Memory Machines
- Implementation of a Parallel Unstructured Euler Solver on Shared-and Distributed-Memory Architectures.
- Aggregation Methods for Solving Sparse Triangular Systems on Multiprocessors. SIAM Journal on Scientific and Statistical Computing. 1990
- Supercomputers and Biological Sequence Comparison Algorithms
Clinical and Translational Bioinformatics: These projects are motivated by pragmatic, informatics-based healthcare and patient-related research requirements. In some cases the requirements arise at the national level, an example being the N3C infrastructure; in other cases the requirements are institutional in nature. This is by nature a heterogeneous set of projects, given that the motivation arises primarily from clinical and research needs.
- The National COVID Cohort Collaborative (N3C): rationale, design, infrastructure, and deployment
- A national strategy to develop pragmatic clinical trials infrastructure
- The Analytic Information Warehouse (AIW): a platform for analytics using electronic health record data
- Caveats for the Use of Operational Electronic Health Record Data in Comparative Effectiveness Research
- A framework for workflow-based clinical research billing disambiguation
- A Knowledge-Anchored Integrative Image Search and Retrieval System
- Information Warehouse as a Tool to Analyze Computerized Physician Order Entry Set Utilization: opportunities for improvement
- Design of an Integrated Clinical Data Warehouse
- Semantic Indexing for Complex Patient Grouping