Dr. Joel Saltz Research Page | Stony Brook Dept of Biomedical Informatics

Dr. Joel Saltz's Digital Pathology Research

My research in Digital Pathology spans twenty-five years and consists of closely coordinated efforts in data science, machine learning, software design, database design and high-end computing. My group has developed tools and methods through years of funded projects supported by a wide range of institutes and agencies including NCI, NLM, NIBIB, NSF, DARPA, AFOSR, NASA, DOD, and DOE. This work laid much of the foundation for digital pathology as it is today.

I led the team at Johns Hopkins and the University of Maryland College Park that was the first to develop the “Virtual Microscope,” and pioneered developments in digital pathology whole slide image navigation, data management, caching strategies and computer aided classification.

Visualization, query, caching, data management, pipeline execution: My team’s initial efforts included development of the first whole slide image viewer, followed by several years of effort to develop efficient methods to support whole slide image visualization, query, caching, and data management, along with efficient systems software support for whole slide data analytic pipelines. This work targeted clusters and HPC systems. We went on to develop methods capable of analysis and visualization of 3-D Pathology images generated from serial sections and management of data obtained through multi-resolution whole slide image capture. In the seven years since I came to Stony Brook, much of the focus has shifted to management, visualization and quality control of derived data products such as instance semantic segmentation datasets generated by AI algorithms.

A Containerized Software System for Generation, Management, and Exploration of Features from Whole Slide Tissue Images. Cancer research, 2017

ImageMiner: a software system for comparative analysis of tissue microarrays using content-based image retrieval, high-performance computing, and grid technology, JAMIA 2011.
A data model and database for high-resolution pathology analytical image informatics. J Pathol Inform 2011
The Virtual Microscope. IEEE Transactions on Information Technology in Biomedicine. 2003
Visualization of Large Datasets with the Active Data Repository. IEEE Computer Graphics and Applications. 2001.

Digital Dynamic Telepathology-The Virtual Microscope. Proceedings of AMIA Fall Symposium, Orlando, FL. 1998

The Virtual Microscope. Proceedings of AMIA Annual Fall Symposium. 1997: 449-53.

Machine learning, Computer Vision and AI Based Digital Pathology Methods: My group has developed a variety of machine learning methods to target analysis of whole slide images. Early work in this area included development of machine learning methods for Neuroblastoma and Lymphoma classification. More recently, my group has developed innovative multi-instance learning methods to classify whole slide H&E images using coarse grained case level training data. The 2016 CVPR paper describing this work has been highly influential and has been cited over 340 times (Google Scholar, January 2021). My group has also developed innovative methods that leverage generative adversarial networks to generalize nuclear segmentation across tissue types. This work was described in a 2019 CVPR paper. We then went on to use the method to create a dataset consisting of roughly 5 billion segmented nuclei; this dataset is public and documented in our 2020 Nature Scientific Data publication.

We have developed a rich set of pipelines that employ a variety of convolutional network algorithms to compute biologically significant Pathology features from H&E and multiplex IHC images, including spatial maps of tumor infiltrating lymphocytes (TILs). Development of one of our early CNN methods, designed to classify TILs in multiple tumor types, was motivated by our participation in the Pan Cancer Atlas project. The effort encompassed a comprehensive analysis of the relationship between spatial TIL patterns and molecular tumor characteristics. This work was published in a 2018 Cell Reports article which has been cited over 230 times (Google Scholar, January 2021). We have gone on to leverage and refine our TIL classification methods in many contexts; see, for instance, our 2020 American Journal of Pathology publication on TILs and breast cancer.

We have also developed AI and traditional machine learning based methods to support analysis of tissue microarray and multiplex IHC studies, and developed methods to support 3D digital tissue reconstruction.

Dataset of segmented nuclei in hematoxylin and eosin stained histopathology images of ten cancer types. Nature Sci Data 2020

Robust Histopathology Image Analysis: To Label or to Synthesize? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2019
Patch-based convolutional neural network for whole slide tissue image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2016

Deep learning-based image analysis methods for brightfield-acquired multiplex immunohistochemistry images. Diagnostic pathology, 2020.

Utilizing automated breast cancer detection to identify spatial distributions of tumor infiltrating lymphocytes in invasive breast cancer. The American Journal of Pathology 2020

Spatial organization and molecular correlation of tumor-infiltrating lymphocytes using deep learning on pathology images. Cell Reports. 2018

Development of a framework for large scale three-dimensional pathology and biomarker imaging and spatial analytics. AMIA Summits on Translational Science Proceedings, 2017
A methodology for texture feature-based quality assessment in nucleus segmentation of histopathology images. Journal of pathology informatics 2017

Integrated Morphologic Analysis for the Identification and Characterization of Disease Subtypes. Journal of the American Medical Informatics Association (JAMIA). 2012

Computer-aided Prognosis of Neuroblastoma on Whole-slide Images: Classification of Stromal Development. Pattern Recognit 2009

Histopathological Image Analysis Using Model-Based Intermediate Representations and Color Texture: Follicular Lymphoma Grading. J Signal Process Sys 2009
A caGrid-Enabled, Learning Based Image Segmentation Method for Histopathology Specimens. Proc IEEE Int Symp Biomed Imaging 2009

An Imaging Workflow for Characterizing Phenotypical Change in Terabyte Sized Mouse Model Datasets. Journal of Bioinformatics, 2008.

Computer Science Research

Runtime Compilation: From the late 1980s through the mid-1990s I developed and refined the inspector/executor paradigm. Irregular array accesses arise in many scientific applications including sparse matrix solvers, unstructured mesh partial differential equation (PDE) solvers and particle methods. A common characteristic of irregular applications is the use of indirect indexing to represent relationships among array elements: data arrays are indexed through values of other arrays, called indirection arrays. Traditional compilation techniques required that indices to data arrays be symbolically analyzable at compile time, so the inability to characterize array access patterns symbolically can prevent compilers from generating efficient code for irregular applications. The inspector/executor strategy involves using compilers to generate code that examines and analyzes data references during program execution. The results of this execution-time analysis may be used 1) to determine which off-processor data needs to be fetched and where the data will be stored once it is received and 2) to reorder and coordinate execution of loop iterations in problems with irregular loop-carried dependencies. The initial examination and analysis phase is called the inspector, while the phase that uses the results of the analysis to optimize program execution is called the executor. My later work in this area encompassed compiler transformations that used slicing and partial redundancy elimination to generate optimized inspector/executor code, along with integration of inspector/executor methods into PGAS languages. This approach is frequently used to this day in a variety of contexts, including GPU optimizations, and my publications in this area are still frequently cited.
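The inspector/executor split described above can be illustrated with a minimal Python sketch. It is not taken from any of the publications listed here: it assumes a block-distributed array, simulates the processors with plain lists, and all names (`owner`, `inspector`, `executor`, the schedule layout) are hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple

# Assumption: the global array x is block-distributed, so processor p owns
# global indices [p * block, (p + 1) * block).
def owner(g: int, block: int) -> int:
    return g // block

def inspector(ia: List[int], me: int, block: int):
    """Scan the indirection array once, at run time.

    Returns a fetch schedule (owner -> list of (global index, ghost slot))
    and the indirection array translated to local buffer indices.
    """
    schedule: Dict[int, List[Tuple[int, int]]] = {}
    ghost_slot: Dict[int, int] = {}   # global index -> slot after the local block
    local_ia: List[int] = []
    for g in ia:
        p = owner(g, block)
        if p == me:
            local_ia.append(g - me * block)
        else:
            if g not in ghost_slot:   # fetch each off-processor element only once
                ghost_slot[g] = block + len(ghost_slot)
                schedule.setdefault(p, []).append((g, ghost_slot[g]))
            local_ia.append(ghost_slot[g])
    return schedule, local_ia

def executor(x_parts: List[List[int]], me: int, block: int,
             schedule: Dict[int, List[Tuple[int, int]]],
             local_ia: List[int]) -> List[int]:
    """Gather off-processor data using the schedule, then run the loop locally."""
    n_ghost = sum(len(items) for items in schedule.values())
    buf = list(x_parts[me]) + [0] * n_ghost
    for p, items in schedule.items():
        for g, slot in items:
            # Stands in for a message received from processor p.
            buf[slot] = x_parts[p][g - p * block]
    # The original loop body y[i] = x[ia[i]] now touches only local memory.
    return [buf[j] for j in local_ia]

# Two simulated processors, four elements each.
x_parts = [[0, 10, 20, 30], [40, 50, 60, 70]]
schedule, local_ia = inspector([1, 5, 5, 7, 0], me=0, block=4)
print(executor(x_parts, 0, 4, schedule, local_ia))  # [10, 50, 50, 70, 0]
```

Because the schedule depends only on the indirection array, it can be built once and reused across many executor invocations as long as the access pattern does not change, which is where the run-time analysis pays for itself.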

An Interprocedural Framework for Placement of Asynchronous I/O Operations. Supercomputing ’96, 1996
Interprocedural Compilation of Irregular Applications for Distributed Memory Machines. Supercomputing’95 1995
Runtime Support and Compilation Methods for User-specified Irregular Data Distributions. IEEE Transactions on Parallel and Distributed Systems. 1995
Runtime and Language Support for Compiling Adaptive Irregular Programs on Distributed Memory Machines. Software - Practice and Experience (SPE). 1995
Communication Optimizations for Irregular Scientific Computations on Distributed Memory Architectures. Journal of Parallel and Distributed Computing. 1994
Slicing Analysis and Indirect Access to Distributed Arrays. Proceedings of the 6th Workshop on Languages and Compilers for Parallel Computing. 1993
Software Support for Irregular and Loosely Synchronous Problems, Computing Systems in Engineering. 1992
Run-Time Parallelization and Scheduling of Loops. IEEE Transactions on Computers. 1991
Run-Time Scheduling and Execution of Loops on Message Passing Machines. Journal of Parallel and Distributed Computing. 1990

Data Science Middleware: From the mid-1990s through roughly 2010, I developed a variety of methods to support what is now called edge computing. These methods include Active Disks; one of my several papers in this area has over 500 citations as of January 2021 and continues to be cited. Another research project along these lines was DataCutter, which consisted of lightweight portable processes designed to form distributed pipelines, with processes that could be instantiated or moved close to data or to computational resources.

The Active Data Repository project was a spatial database project that ran from the mid-1990s through the early 2000s and in many respects presaged Hadoop and Spark; it consisted of a system optimized to support computations that made use of generalized reduction operations. This effort incorporated spatial indexing and optimized clustering/declustering methods to optimize I/O in clusters and parallel machines. After the development of Hadoop and Spark, in the 2013-2017 timeframe, my team went on to develop analogous software systems that employed these tools. The VLDB article on this topic from 2013 has been cited over 600 times (as of January 2021) and the Hilbert space-filling curve article in IEEE TKDE from 2001 has been cited over 850 times (as of January 2021).
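To give a flavor of how a Hilbert space-filling curve can drive spatial indexing and declustering, here is a small Python sketch. It uses the standard textbook mapping from 2-D grid coordinates to a Hilbert rank; the round-robin assignment of chunks to disks is an assumption made for illustration and is not claimed to be the scheme used in the Active Data Repository. The names `hilbert_index` and `decluster` are hypothetical.

```python
from typing import Dict, Tuple

def hilbert_index(n: int, x: int, y: int) -> int:
    """Rank of cell (x, y) along the Hilbert curve on an n x n grid (n a power of 2)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the canonical orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d

def decluster(n: int, num_disks: int) -> Dict[Tuple[int, int], int]:
    """Assumed policy: deal chunks to disks round-robin in Hilbert order.

    Cells that are close on the curve are close in space, so spatially
    neighboring chunks tend to land on different disks and a range query
    can read them in parallel.
    """
    return {(x, y): hilbert_index(n, x, y) % num_disks
            for x in range(n) for y in range(n)}

# Chunk-to-disk map for a 4 x 4 grid of chunks spread over 3 disks.
placement = decluster(4, 3)
for y in range(3, -1, -1):
    print(" ".join(str(placement[(x, y)]) for x in range(4)))
```

The same ranks can also serve as a one-dimensional sort key for clustering: storing chunks in Hilbert order keeps spatially adjacent chunks near each other on a single device.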

SparkGIS: Resource Aware Efficient In-Memory Spatial Query Processing. In Proceedings of the 25th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems 2017
Efficient irregular wavefront propagation algorithms on hybrid CPU–GPU machines. Parallel Computing 2013

Hadoop GIS: a High Performance Spatial Data Warehousing System over MapReduce. VLDB 2013

Parameterized Specification, Configuration and Execution of Data-Intensive Scientific Workflows. Cluster Computing: the Journal of Networks, Software Tools and Applications, Special Issue on High Performance Distributed Computing, 2010

Analysis and Semantic Querying in Large Biomedical Image Datasets. IEEE Computer. 2008
I/O Conscious Algorithm Design and Systems Support for Data Analysis on Emerging Architectures. IPDPS 2006
Processing Large-scale Multidimensional Data In Parallel and Distributed Environments. Parallel Computing, 2002

Analysis of the Clustering Properties of the Hilbert Space-Filling Curve. IEEE Transactions on Knowledge and Data Engineering. 2001

Evaluation of Active Disks for Decision Support Databases. Proceedings of the 6th International Symposium on High-Performance Computer Architecture. 2000

Querying very large multi-dimensional datasets in ADR. Supercomputing ’99, 1999

Titan: a high performance remote-sensing database. Proceedings of the 1997 International Conference on Data Engineering. 1997

Sumatra: a language for resource-aware mobile programs. Mobile Object Systems-Towards the Programmable Internet. 1997

Artificial Intelligence, Machine Learning: My current AI projects focus on a variety of efforts that target development of deep learning algorithms for semantic instance segmentation in large spatial datasets. Methods include super-resolution approaches able to make use of multi-scale label data, novel generative adversarial networks to generalize training data, and multi-instance learning methods able to make use of very high-level, image-level training data to classify gigapixel images. This work has been published at the CVPR and ICLR conferences. My group’s 2016 CVPR multi-instance learning paper has over 350 citations (as of January 2021).

Robust Histopathology Image Analysis: To Label or to Synthesize? CVPR 2019
Label super-resolution networks. In International Conference on Learning Representations, 2018
Patch-based convolutional neural network for whole slide tissue image classification. CVPR 2016
Computer-aided Prognosis of Neuroblastoma on Whole-slide Images: Classification of Stromal Development, Pattern Recognition 2009
A New Deformable Model for Boundary Tracking in Cardiac MRI and Its Application to the Detection of Intra-Ventricular Dysynchrony, CVPR 2006
Using Distributed Query Result Caching to Evaluate Queries for Parallel Data Mining Algorithms. Proceedings of PDPTA’98 1998.

Driving Applications: My Computer Science related research efforts have targeted development of generalizable methods, algorithms and tools. When possible, my approach has been to identify multiple complementary applications to drive my research. In addition to digital Pathology and biomedical informatics, my computer science and data science research has been driven by problems involving: 1) earth science related porous media analyses including oil reservoir management, pollution remediation and carbon sequestration, 2) computational aerospace, 3) changes in land cover analyzed using satellite images, 4) digital sky research, 5) computational chemistry, 6) analysis of neuroscience data, 7) 3D tissue reconstruction from microscopic sections to characterize organ phenotype changes associated with genomic alterations, and 8) computational linear algebra.

Dynamic Decision and Data-Driven Strategies for the Optimal Management of Subsurface Geo-Systems. Journal of Algorithms & Computational Technology, 2011
Architectural Implications for Spatial Object Association Algorithms. IPDPS 2009
Rb is critical in a mammalian tissue stem cell population. Genes Dev. 2007
Towards Dynamic Data-driven Management of the Ruby Gulch Waste Repository. Proceedings of the ICCS 2006 6th International Conference 2006
The Design and Evaluation of a High-Performance Earth-Science Database. Parallel Computing on Parallel Data Servers and Applications. 1998
Parallelizing Molecular Dynamics Programs for Distributed Memory Machines. IEEE Computational Science & Engineering. 1995
Implementation of a Parallel Unstructured Euler Solver on Shared-and Distributed-Memory Architectures. AIAA Journal 1994
Aggregation Methods for Solving Sparse Triangular Systems on Multiprocessors. SIAM Journal on Scientific and Statistical Computing. 1990
Supercomputers and Biological Sequence Comparison Algorithms. Computers and Biomedical Research. 1989

Clinical and Translational Bioinformatics: These projects are motivated by pragmatic, informatics-based healthcare and patient-related research requirements. In some cases the requirements arise at the national level, an example being the N3C infrastructure; in other cases they are institutional in nature. This is by nature a heterogeneous set of projects, given that the motivation arises primarily from clinical and research needs.

The National COVID Cohort Collaborative (N3C): rationale, design, infrastructure, and deployment. JAMIA 2020
A national strategy to develop pragmatic clinical trials infrastructure. Clin Transl Sci. 2014
The Analytic Information Warehouse (AIW): a platform for analytics using electronic health record data, J Biomed Inform 2013
Caveats for the Use of Operational Electronic Health Record Data in Comparative Effectiveness Research. Medical Care 2013
A framework for workflow-based clinical research billing disambiguation. AMIA Annu Symp Proc. 2007
A Knowledge-Anchored Integrative Image Search and Retrieval System. J Digit Imaging. 2007
Information Warehouse as a Tool to Analyze Computerized Physician Order Entry Set Utilization: opportunities for improvement. AMIA Annual Symposium Proceedings. 2003
Design of an Integrated Clinical Data Warehouse. Journal of the Association for Laboratory Automation. 2000
Semantic Indexing for Complex Patient Grouping. Proceedings of the 1997 AMIA Annual Fall Symposium. 1997.

Date

Mon, 05/17/2021 - 12:00