Dr. Joel Saltz Research Page

Dr. Joel Saltz's Digital Pathology Research


My research in digital pathology spans twenty-five years and consists of closely coordinated efforts in data science, machine learning, software design, database design, and high-end computing. My group has developed tools and methods through years of funded projects supported by a wide range of institutes and agencies, including NCI, NLM, NIBIB, NSF, DARPA, AFOSR, NASA, DOD, and DOE. This work laid much of the foundation for digital pathology as it is today.


I led the team at Johns Hopkins and the University of Maryland, College Park that was the first to develop the “Virtual Microscope,” and pioneered developments in digital pathology whole slide image navigation, data management, caching strategies, and computer-aided classification.


Visualization, query, caching, data management, pipeline execution: My team’s initial efforts included development of the first whole slide image viewer, followed by several years of work on efficient methods for whole slide image visualization, query, caching, and data management, along with systems software support for whole slide data analytic pipelines. This work targeted clusters and HPC systems. We went on to develop methods for analysis and visualization of 3-D pathology images generated from serial sections, and for management of data obtained through multi-resolution whole slide image capture. In the seven years since I came to Stony Brook, much of the focus has shifted to management, visualization, and quality control of derived data products, such as instance and semantic segmentation datasets generated by AI algorithms.
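
As a concrete illustration of the caching problem these viewers face: a whole slide image is served as a pyramid of tiles, and the viewer keeps recently used tiles in memory so panning and zooming stay responsive. The sketch below is a minimal LRU tile cache keyed by pyramid level and tile coordinates; the class and parameter names are hypothetical and not taken from the Virtual Microscope codebase.

```python
from collections import OrderedDict

class TileCache:
    """Minimal LRU cache for whole slide image tiles, keyed by
    (resolution_level, tile_x, tile_y). Illustrative sketch only."""

    def __init__(self, capacity=256):
        self.capacity = capacity
        self._tiles = OrderedDict()  # insertion order doubles as LRU order

    def get(self, level, x, y, fetch_fn):
        key = (level, x, y)
        if key in self._tiles:
            self._tiles.move_to_end(key)      # mark as most recently used
            return self._tiles[key]
        tile = fetch_fn(level, x, y)          # cache miss: e.g. decode from a tiled TIFF
        self._tiles[key] = tile
        if len(self._tiles) > self.capacity:
            self._tiles.popitem(last=False)   # evict least recently used tile
        return tile
```

In a real viewer, `fetch_fn` would decode a tile region from the slide file; prefetching the tiles adjacent to the current viewport is a natural extension of the same structure.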


Machine Learning, Computer Vision and AI-Based Digital Pathology Methods: My group has developed a variety of machine learning methods for analysis of whole slide images. Early work in this area included machine learning methods for neuroblastoma and lymphoma classification. More recently, my group has developed innovative multi-instance learning methods that classify whole slide H&E images using coarse-grained, case-level training data. The 2016 CVPR paper describing this work has been highly influential and has been cited over 340 times (Google Scholar, January 2021). My group has also developed innovative methods that leverage generative adversarial networks to generalize nuclear segmentation across tissue types. This work was described in a 2019 CVPR paper; we then used the method to create a dataset of roughly 5 billion segmented nuclei, which is public and documented in our 2020 Nature Scientific Data publication.
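
The multi-instance idea can be sketched in a few lines: every patch in a slide is scored by a shared classifier, and the slide-level prediction is driven by the strongest patch evidence. Max pooling, shown below, is only the simplest possible aggregator; the published method uses a more elaborate, learned scheme. The logits here are assumed to be precomputed by some patch classifier.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mil_slide_prediction(patch_logits):
    """Max-pooling multi-instance learning: each patch in the slide (the
    "bag") gets a score from a shared classifier; the slide-level tumor
    probability is taken from the most suspicious patch. This lets a model
    be trained with only coarse, slide-level labels: a single strongly
    positive patch is enough to flag the whole gigapixel image."""
    return max(sigmoid(z) for z in patch_logits)
```

For example, a slide whose patches score [-5.0, 5.0, -2.0] is flagged positive on the strength of the single high-scoring patch, even though most of its tissue is benign.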


We have developed a rich set of pipelines that employ a variety of convolutional network algorithms to compute biologically significant pathology features from H&E and multiplex IHC images, including spatial maps of tumor infiltrating lymphocytes (TILs). Development of one of our early CNN methods, designed to classify TILs in multiple tumor types, was motivated by our participation in the Pan-Cancer Atlas project. That effort encompassed a comprehensive analysis of the relationship between spatial TIL patterns and molecular tumor characteristics, published in a 2018 Cell Reports article which has been cited over 230 times (Google Scholar, January 2021). We have gone on to leverage and refine our TIL classification methods in many contexts; see, for instance, our 2020 American Journal of Pathology publication on TILs and breast cancer.
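
A TIL map of the kind described above is, at its simplest, a grid of per-patch classifier outputs thresholded into a binary spatial pattern. The sketch below assumes the per-patch TIL probabilities have already been computed by a CNN; the threshold value is illustrative, not the one used in the published pipelines.

```python
def til_map(patch_probs, threshold=0.5):
    """Turn a 2-D grid of per-patch TIL-classifier probabilities (one value
    per image patch, list of rows) into a binary spatial TIL map, plus the
    fraction of TIL-positive patches as a simple summary statistic."""
    binary = [[1 if p >= threshold else 0 for p in row] for row in patch_probs]
    total = sum(len(row) for row in binary)
    fraction = sum(map(sum, binary)) / total
    return binary, fraction
```

Spatial statistics on such maps (cluster sizes, tumor-boundary adjacency, and so on) are what connect the pixel-level classifier to the biology.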


We have also developed AI and traditional machine learning based methods to support analysis of tissue microarray and multiplex IHC studies, and have developed methods to support 3-D digital tissue reconstruction.


Computer Science Research  


Runtime Compilation: From the late 1980s through the mid-1990s, I developed and refined the inspector/executor paradigm. Irregular array accesses arise in many scientific applications, including sparse matrix solvers, unstructured mesh partial differential equation (PDE) solvers, and particle methods. Traditional compilation techniques required that indices to data arrays be symbolically analyzable at compile time, but a common characteristic of irregular applications is the use of indirect indexing to represent relationships among array elements: data arrays are indexed through the values of other arrays, called indirection arrays. The inability to characterize array access patterns symbolically can prevent compilers from generating efficient code for irregular applications. The inspector/executor strategy uses the compiler to generate code that examines and analyzes data references during program execution. The results of this execution-time analysis may be used 1) to determine which off-processor data needs to be fetched and where that data will be stored once it is received, and 2) to reorder and coordinate execution of loop iterations in problems with irregular loop-carried dependencies. The initial examination and analysis phase is called the inspector, while the phase that uses the results of the analysis to optimize program execution is called the executor. My later work in this area encompassed compiler transformations that used slicing and partial redundancy elimination to generate optimized inspector/executor code, along with integration of inspector/executor methods into PGAS languages. This approach is frequently used to this day in a variety of contexts, including GPU optimizations, and my publications in this area are still frequently cited.
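
A minimal sketch of the two phases, with the inter-processor communication reduced to a plain dictionary of ghost values (real implementations generate this code and exchange messages between processors; the function names are illustrative):

```python
def inspector(index_array, local_range):
    """Inspector phase: scan the indirection array once at run time to find
    which references fall outside this processor's locally owned range,
    producing a gather schedule of remote indices to fetch before the loop.
    local_range = (lo, hi), the half-open range of locally owned indices."""
    lo, hi = local_range
    return sorted({i for i in index_array if not (lo <= i < hi)})

def executor(local_data, index_array, local_range, ghost):
    """Executor phase: run the irregular loop itself, reading remote values
    from the ghost buffer that was filled according to the inspector's
    schedule. Here the 'computation' is just a gather of x[index_array]."""
    lo, hi = local_range
    return [local_data[i - lo] if lo <= i < hi else ghost[i]
            for i in index_array]
```

Because the indirection array is typically reused across many time steps, the inspector's cost is amortized: the schedule is built once and the executor runs many times.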


Data Science Middleware: From the mid-1990s through roughly 2010, I developed a variety of methods to support what is now called edge computing. These methods include Active Disks; one of my several papers in this area has over 500 citations (as of January 2021) and continues to be cited. Another research project along these lines was DataCutter, which consisted of lightweight portable processes designed to form distributed pipelines, with processes that could be instantiated or moved close to data or to computational resources.
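
The DataCutter idea of composable filters that can be placed near the data can be sketched with generator stages; the filter names below are hypothetical, and in the real system each stage was a separately placeable process rather than a local generator.

```python
def read_chunks(dataset):
    """Source filter: stream data chunks (would run near storage)."""
    for chunk in dataset:
        yield chunk

def select(chunks, predicate):
    """Filtering stage: drop irrelevant chunks early, close to the data,
    so only the chunks that matter ever cross the network."""
    for c in chunks:
        if predicate(c):
            yield c

def aggregate(chunks):
    """Sink filter: combine the surviving chunks (would run near the client)."""
    return sum(chunks)

# Compose the pipeline; each stage could be placed on a different host.
result = aggregate(select(read_chunks([1, 2, 3, 4, 5, 6]),
                          lambda c: c % 2 == 0))
```

The design point is that filtering happens where the data lives, which is exactly the Active Disks intuition as well: push computation toward storage instead of pulling raw data toward computation.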


The Active Data Repository was a spatial database project from the mid-1990s through the early 2000s that in many respects presaged Hadoop and Spark; it consisted of a system optimized to support computations that made use of generalized reduction operations. The effort incorporated spatial indexing and clustering/declustering methods to optimize I/O on clusters and parallel machines. After the development of Hadoop and Spark, in the 2013-2017 timeframe, my team went on to develop analogous software systems that employed those tools. The VLDB article on this topic from 2013 has been cited over 600 times, and the Hilbert space-filling curve article in IEEE TKDE from 2001 has been cited over 850 times (both as of January 2021).
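
The role of the Hilbert curve in such systems is to linearize 2-D space so that cells that are close in the plane tend to land close together on disk, which improves I/O locality for spatial range queries. The following is the standard iterative computation of a cell's Hilbert index (a textbook formulation, not the exact one from the TKDE paper):

```python
def hilbert_index(n, x, y):
    """Map cell (x, y) on an n-by-n grid (n a power of two) to its position
    along the Hilbert space-filling curve. Nearby cells usually receive
    nearby indices, so sorting records by this index clusters spatial data."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0   # which half of the grid, horizontally
        ry = 1 if (y & s) > 0 else 0   # which half, vertically
        d += s * s * ((3 * rx) ^ ry)   # quadrant's offset along the curve
        if ry == 0:                    # rotate/reflect so sub-curves connect
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d
```

Declustering then works in the opposite direction: consecutive runs of the Hilbert order are dealt round-robin across disks so a range query keeps every disk busy.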


Artificial Intelligence, Machine Learning: My current AI projects focus on a variety of efforts to develop deep learning algorithms for semantic and instance segmentation in large spatial datasets. Methods include super-resolution approaches able to make use of multi-scale label data, novel generative adversarial networks to generalize training data, and multi-instance learning methods able to make use of very high-level, image-level training data to classify gigapixel images. This work has been published in the CVPR and ICLR conferences. My group’s 2016 CVPR multi-instance learning paper has over 350 citations (as of January 2021).


Driving Applications: My computer science research efforts have targeted development of generalizable methods, algorithms, and tools. When possible, my approach has been to identify multiple complementary applications to drive my research. In addition to digital pathology and biomedical informatics, my computer science and data science research has been driven by problems involving: 1) earth science related porous media analyses, including oil reservoir management, pollution remediation, and carbon sequestration; 2) computational aerospace; 3) changes in land cover analyzed using satellite images; 4) digital sky research; 5) computational chemistry; 6) analysis of neuroscience data; 7) 3-D tissue reconstruction from microscopic sections to characterize organ phenotype changes associated with genomic alterations; and 8) computational linear algebra.


Clinical and Translational Bioinformatics: These projects are motivated by pragmatic, informatics-based healthcare and patient-related research requirements. In some cases the requirements arise at the national level, an example being the N3C infrastructure; at other times they are institutional in nature. This is by nature a heterogeneous set of projects, given that the motivation arises primarily from clinical and research needs.