IDC Annotation Conversion

Python project for converting various pathology annotations into DICOM format for ingestion into the Imaging Data Commons.

The code in this repository is currently under development.

Installation

This repository is structured to be directly installable via pip as a Python distribution named idc-annotation-conversion. Run the following command from the root of the cloned repository to install the package, along with all of its dependencies (defined in pyproject.toml), into your current Python environment:

pip install .

Alternatively, you can install the package directly from the remote repository with:

pip install git+https://github.com/ImagingDataCommons/idc-sm-annotations-conversion.git

Cloud Authentication

You need to authenticate to the relevant Google Cloud projects and buckets to run the code in this package. Specifically, access to the following resources is required:

Project idc-etl-processing
Bucket public-datasets-idc, the public bucket containing DICOM-format whole slide images.
Bucket idc-annotation-conversion-outputs, or any other bucket specified as the output bucket, if any.

Depending on the conversion process that you are running, you may also need access to:

Bucket tcia-nuclei-seg, which contains the original (CSV format) segmentations for the pan_cancer_nuclei_seg conversion process.
Project idc-external-031 and bucket rms_annotation_test_oct_2023, which contains the original (XML format) annotations for the rms conversion process.

If you are using an IDC cloud VM, this should be handled automatically for you. Otherwise, you should run:

gcloud auth application-default login --billing-project idc-etl-processing

and then, once you are finished:

gcloud auth application-default revoke
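If you drive a conversion from a Python script, the login and revoke steps can be paired so that the revoke is not forgotten. The following is only a minimal sketch: the helper name application_default_credentials and its run parameter are hypothetical and not part of this package, and the two gcloud commands are exactly the ones shown above.

```python
import subprocess
from contextlib import contextmanager

# The two commands from this section, split into argument lists.
LOGIN_CMD = [
    "gcloud", "auth", "application-default", "login",
    "--billing-project", "idc-etl-processing",
]
REVOKE_CMD = ["gcloud", "auth", "application-default", "revoke"]


@contextmanager
def application_default_credentials(run=subprocess.run):
    """Log in before the body runs and revoke afterwards (hypothetical helper).

    `run` is injectable so the helper can be exercised without gcloud installed.
    """
    run(LOGIN_CMD, check=True)
    try:
        yield
    finally:
        # Revoke even if the body raised an exception.
        run(REVOKE_CMD, check=True)
```

A conversion would then run inside a `with application_default_credentials():` block. On an IDC cloud VM this wrapper should not be needed.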

Use

Each conversion process is implemented as a submodule of the idc_annotation_conversion module, which is installed when you install this package. Each submodule has an entrypoint (a __main__.py file), meaning that once this package is installed you can run a process with:

python -m idc_annotation_conversion.<module> <args>

For example, to run the pan_cancer_nuclei_seg conversion process:

python -m idc_annotation_conversion.pan_cancer_nuclei_seg <args>

In each case, the default parameters should be sufficient to run a conversion process on the entire collection, but there are a number of optional arguments to control the process. You can see the options by passing --help when calling the submodule. For example:

python -m idc_annotation_conversion.pan_cancer_nuclei_seg --help
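The same invocation pattern can be scripted, for instance to launch several conversions in sequence. This is a rough sketch only: build_command and run_conversion are hypothetical helper names, not functions provided by this package, and the sketch simply assembles the python -m command line shown above.

```python
import subprocess
import sys

PACKAGE = "idc_annotation_conversion"


def build_command(module, *args):
    """Return the argument list for `python -m idc_annotation_conversion.<module> <args>`."""
    # sys.executable keeps the call inside the current Python environment.
    return [sys.executable, "-m", f"{PACKAGE}.{module}", *args]


def run_conversion(module, *args):
    """Run one conversion process and raise if it exits with an error."""
    return subprocess.run(build_command(module, *args), check=True)


# Show the command that would be executed for the --help example.
print(build_command("pan_cancer_nuclei_seg", "--help"))
```

Calling run_conversion("pan_cancer_nuclei_seg", "--help") requires the package to be installed in the current environment.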

Collection Details

The following modules are currently available:

pan_cancer_nuclei_seg: This module implements conversion of Pan Cancer Nuclei Segmentations for several collections within TCGA. The original data are supplied in a non-standard CSV format giving the image coordinates of points on the contours of nuclei, as segmented by a deep-learning-based segmentation model. These data were previously released here as part of The Cancer Imaging Archive. The coordinates are converted to DICOM Microscopy Bulk Simple Annotation objects. In addition, the contours are converted to masks and stored as a pyramid of binary DICOM Segmentation objects. Since this "raster conversion" takes place at the highest resolution, this process is very slow and memory intensive.
rms: Conversion of annotations related to the rhabdomyosarcoma mutation prediction project from the Frederick National Laboratory. Both hand-annotated regions for tissue type (necrosis, stroma, ARMS, ERMS), used as training data in the project, and model-generated prediction results (for the same tissue classes) are available. Hand-annotated regions are provided as ImageScope format XML annotations and are converted to DICOM Structured Report objects with the convert-xml-annotations sub-command. Model-generated probabilistic segmentation maps are provided as serialized NumPy arrays (.npy files) and are converted to both binary and fractional DICOM Segmentation objects with the convert-segmentations sub-command.
tcga_til_maps: There are two versions of this collection, both containing patch-level maps of tumor-infiltrating lymphocytes (TILs) predicted by a neural network for several collections within TCGA. The two versions correspond to two different versions of the model, published in 2018 and 2022 by the same lab at Stony Brook University. Conversion routines for these two versions are implemented as two separate sub-commands within this module.

The 2018 version covers a smaller subset of the TCGA collections. The algorithm is published in this paper and the source files are available here. The collection was also described by TCIA on this page. The files are supplied as low-resolution PNG images, where each pixel in the PNG corresponds to a 50 micron patch in the original slide and the pixel value indicates the presence of TILs within the patch. The convert-2018 sub-command converts these to binary DICOM Segmentation objects.

The 2022 version covers a wider range of TCGA images and provides probabilistic segmentations (before thresholding) in addition to binarized versions. This algorithm is described in this paper and the source data are available here. The data are supplied as a non-standard text file containing a list of patch coordinates and associated binary or probabilistic pixel values. The convert-2022 sub-command converts these to pixel arrays and stores them as DICOM Segmentation objects, giving one binary and one fractional (probabilistic) segmentation object for each slide.
gbm_transcriptional_subtypes: This module relates to a collection of results from this paper from Stanford University on transcriptional subtypes within glioblastoma. There are two data types of interest here: transcriptional subtype maps classifying an image patch into a set of transcriptional subtypes, and aggressiveness maps giving the aggressiveness of each image patch. While the conversion process for both is implemented in this repository, only the aggressiveness maps have been released at this time. The source data are not publicly available elsewhere. The aggressiveness maps are supplied as arrays of image coordinates and corresponding aggressiveness scores (between 0 and 1) within an h5 format file, with one aggressiveness score for an entire image patch. These are converted to DICOM Parametric Map objects using the convert-aggressiveness-maps sub-command of this module.
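To illustrate the idea behind the patch-level maps (one score per image patch, supplied as coordinates plus values), here is a minimal sketch that scatters such a sparse list into a dense grid with one cell per patch. It is not the code used in this repository. The function name scores_to_grid is hypothetical, and the sketch assumes that each coordinate is the top-left corner of a square patch, in pixels, aligned to a multiple of patch_size. Reading the h5 file and writing the DICOM Parametric Map are not shown.

```python
import math


def scores_to_grid(coords, scores, patch_size, fill=math.nan):
    """Scatter per-patch scores into a dense 2D grid (list of rows).

    coords: list of (x, y) top-left pixel coordinates (assumed layout).
    scores: one score in [0, 1] per coordinate.
    Cells with no score keep the `fill` value.
    """
    if not coords or len(coords) != len(scores):
        raise ValueError("coords and scores must be non-empty and equal in length")
    # Convert pixel coordinates to patch indices.
    cols = [x // patch_size for x, _ in coords]
    rows = [y // patch_size for _, y in coords]
    grid = [[fill] * (max(cols) + 1) for _ in range(max(rows) + 1)]
    for row, col, score in zip(rows, cols, scores):
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range: {score}")
        grid[row][col] = score
    return grid


# Three 512-pixel patches; the bottom-right cell has no score and stays NaN.
grid = scores_to_grid([(0, 0), (512, 0), (0, 512)], [0.1, 0.9, 0.5], 512)
```

Using NaN as the fill value keeps "no patch here" distinct from a genuine score of 0.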

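The "raster conversion" mentioned for pan_cancer_nuclei_seg turns a contour (a list of points) into a binary mask. The sketch below shows one common way to do this, a scanline fill using the even-odd rule evaluated at pixel centers. It is an illustration under assumptions, not the routine used in this repository: rasterize_polygon is a hypothetical name, and the contour is assumed to be a closed polygon given as (x, y) vertices in pixel units.

```python
def rasterize_polygon(contour, height, width):
    """Return a height x width mask (list of rows) with 1 inside the contour."""
    mask = [[0] * width for _ in range(height)]
    n = len(contour)
    for row in range(height):
        yc = row + 0.5  # y coordinate of the pixel centers in this row
        # Collect x positions where polygon edges cross the line y = yc.
        crossings = []
        for i in range(n):
            (x0, y0), (x1, y1) = contour[i], contour[(i + 1) % n]
            if (y0 <= yc) != (y1 <= yc):  # edge straddles the scanline
                crossings.append(x0 + (yc - y0) * (x1 - x0) / (y1 - y0))
        crossings.sort()
        # Even-odd rule: pixels between each pair of crossings are inside.
        for left, right in zip(crossings[0::2], crossings[1::2]):
            for col in range(width):
                if left <= col + 0.5 < right:
                    mask[row][col] = 1
    return mask


# A 3 x 3 pixel square inside a 5 x 5 image.
mask = rasterize_polygon([(1, 1), (4, 1), (4, 4), (1, 4)], 5, 5)
```

Because every pixel of a whole slide image at full resolution must be visited, even an efficient version of this step is expensive, which is consistent with the slowness and memory use noted for that module.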