Introduction
In a previous article, we explored a zero-code approach to storing DICOM metadata by persisting vendor-specific and modality-specific tags as JSONB in an extradicomtags column. This approach embraces the reality of clinical imaging data: the schema is heterogeneous, nested, and constantly evolving. By storing the raw metadata as a JSON document and applying GIN indexing, we were able to ingest and retain DICOM metadata with complete fidelity, without writing brittle, vendor-locked parsers or ETL pipelines.
That approach works extremely well for ingestion, provenance, and long-term storage. However, once the data moves beyond engineering teams and into the hands of clinicians, physicists, and researchers, a different set of problems emerges. This article focuses on those problems, and on a practical solution: normalizing JSONB-stored DICOM metadata into conventional relational tables directly inside PostgreSQL, without external ETL systems.
This article is written for platform engineers, data engineers, and clinical research teams who need to bridge flexible DICOM ingestion with queryable, analyst-friendly data models.
The problem with JSONB at query time
A schemaless JSONB column is an excellent persistence mechanism, but it is not a natural query interface for most researchers. Even with GIN indexing, querying nested JSON structures requires knowledge of JSON path syntax, casting rules, and DICOM tag hierarchies. For example, extracting a prescription dose from a deeply nested sequence involves operators and functions such as ->, ->>, and jsonb_array_elements, together with explicit type conversions. This is powerful, but it is not intuitive for users who are accustomed to environments like Databricks, BigQuery, or traditional SQL data marts.
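To make the problem concrete, consider what a researcher must write today just to surface a prescription dose, assuming instances live in a dicom_raw table with the metadata in an extradicomtags column (as in the setup described below). The tag path is illustrative and varies by vendor:

SELECT
    r.sop_instance_uid,
    (dr->>'TargetPrescriptionDose')::double precision AS prescription_dose
FROM dicom_raw r
CROSS JOIN LATERAL jsonb_array_elements(
    r.extradicomtags->'DoseReferenceSequence'
) AS dr
WHERE r.modality = 'RTPLAN'
  AND dr->>'TargetPrescriptionDose' <> '';

The query works, but it requires the user to know that the dose is buried inside DoseReferenceSequence, that the value arrives as text, and that empty strings must be filtered out before casting.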
The core problem is not performance or data completeness, but usability. JSONB-based schemas optimize for ingestion flexibility, while most clinical and research workflows optimize for discoverability, predictability, and relational querying. This mismatch becomes increasingly painful as datasets grow and query complexity increases.
Another challenge is discoverability. In a normalized system, users can browse tables, inspect column names, and quickly understand what data is available. In a JSONB document, the “schema” is implicit. While GIN indexes make queries fast, they do not make the data self-describing. Nested structures exacerbate this issue, especially in radiation oncology objects such as RTPLAN, RTSTRUCT, and RTDOSE, where clinically meaningful values are buried several levels deep inside sequences.
There are also subtle correctness and usability issues. Numeric values arrive as text, as empty strings, or in vendor-specific encodings. Casting errors can cause queries to fail unexpectedly. Aggregations across studies or patients become verbose. Even simple questions such as “which patients received a prescription dose above a certain threshold” require non-trivial JSON logic that is easy to get wrong and difficult to review.
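For example, PostgreSQL refuses to cast an empty string to a numeric type:

SELECT ''::double precision;
-- ERROR:  invalid input syntax for type double precision: ""

so every numeric extraction needs a NULLIF guard or an exception-safe cast, and that boilerplate multiplies across every tag in every query.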
In short, while JSONB excels at ingestion and completeness, it becomes a liability when the primary users are clinicians and researchers who want human-readable tables, stable column names, and predictable SQL semantics.
Entry conditions and assumptions
The solution described here assumes a few entry conditions that are common in modern clinical research platforms. DICOM objects are already ingested into PostgreSQL, with core identifiers such as patient, study, series, and SOP instance persisted in relational tables. All non-canonical or vendor-specific metadata is stored verbatim in a JSONB column, typically named extradicomtags, with a GIN index applied for efficient filtering.
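Schematically, the starting point looks like the following. The table and column names here are illustrative, but the essential elements are the relational identifiers, the JSONB column, and its GIN index:

CREATE TABLE dicom_raw (
    sop_instance_uid  text PRIMARY KEY,
    study_uid         text NOT NULL,
    series_uid        text NOT NULL,
    patient_id        text NOT NULL,
    modality          text NOT NULL,
    extradicomtags    jsonb NOT NULL
);

-- GIN index for efficient containment and existence filtering on the document
CREATE INDEX idx_dicom_raw_tags
    ON dicom_raw USING gin (extradicomtags);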
No assumptions are made about modality-specific parsing at ingestion time. RTPLAN, RTSTRUCT, RTDOSE, and RTREG objects are all treated uniformly at first: the database stores what it receives, without attempting to normalize or interpret clinical semantics during ingestion.
This separation is intentional. It allows ingestion to remain fast, robust, and vendor-agnostic, while deferring interpretation and normalization to a later stage where correctness, performance, and usability can be optimized. In practice, this means the system already supports high-volume ingestion and indexing, but has not yet committed to a fixed analytical schema.
The proposed solution: normalize inside PostgreSQL
The core idea is simple: instead of pushing JSONB documents out to an external ETL engine such as Airflow or Spark for normalization, we perform the normalization directly inside PostgreSQL using SQL and PL/pgSQL. The database already has the data, understands its transactional boundaries, and can efficiently construct derived tables using set-based operations.
For each DICOM modality of interest, we define a set of normalized tables that mirror the logical hierarchy of the DICOM object. For example, an RTPLAN can be decomposed into a plan table, a fraction group table, a beam table, and associated control point tables. Each table contains typed, human-readable columns extracted from the JSONB document.
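As a sketch, the plan-level table used in the example below might be declared like this; the repository schema carries additional columns and constraints:

CREATE SCHEMA IF NOT EXISTS flathr;

CREATE TABLE flathr.rtplans (
    sop_instance_uid          text PRIMARY KEY
        REFERENCES dicom_raw (sop_instance_uid),
    plan_label                text,
    plan_name                 text,
    approval_status           text,
    target_prescription_dose  double precision
);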
A simplified example illustrates the approach. Consider extracting high-level RTPLAN attributes into a normalized table:
INSERT INTO flathr.rtplans (
    sop_instance_uid,
    plan_label,
    plan_name,
    approval_status,
    target_prescription_dose
)
SELECT
    r.sop_instance_uid,
    j->>'RTPlanLabel',
    j->>'RTPlanName',
    j->>'ApprovalStatus',
    safe_float(j->>'TargetPrescriptionDose')
FROM dicom_raw r
-- Expose the JSONB document under a short alias via a one-row lateral join
CROSS JOIN LATERAL (VALUES (r.extradicomtags)) AS t(j)
WHERE r.modality = 'RTPLAN';

The safe_float utility function shown above is a small but important detail. In real-world DICOM exports, numeric fields are frequently encoded as empty strings, vendor-specific placeholders, or inconsistently typed values. A direct cast in SQL (::float) would fail at runtime and abort the entire transformation. Instead, safe_float performs a defensive conversion, returning NULL when a value is missing or malformed, and a properly typed numeric value when it is valid.
CREATE OR REPLACE FUNCTION safe_float(val text)
RETURNS double precision AS $$
BEGIN
    -- Treat empty strings as missing values before casting
    RETURN NULLIF(val, '')::double precision;
EXCEPTION WHEN others THEN
    -- Absorb malformed vendor encodings instead of aborting the transaction
    RETURN NULL;
END;
$$ LANGUAGE plpgsql IMMUTABLE;

This pattern is applied consistently across the normalization layer: arrays are expanded only when present, optional sequences are handled gracefully, and vendor inconsistencies are absorbed without breaking downstream queries. The result is a set of relational tables that behave predictably, even when the source DICOM metadata does not.
For nested sequences, lateral joins allow us to explode arrays into relational rows while preserving referential integrity:
INSERT INTO flathr.beam_sequence (
    sop_instance_uid,
    beam_number,
    beam_name,
    radiation_type
)
SELECT
    r.sop_instance_uid,
    safe_int(b->>'BeamNumber'),
    b->>'BeamName',
    b->>'RadiationType'
FROM dicom_raw r
-- Explode the BeamSequence array into one relational row per beam
CROSS JOIN LATERAL jsonb_array_elements(
    r.extradicomtags->'BeamSequence'
) AS b
WHERE r.modality = 'RTPLAN';

This pattern scales naturally to RTSTRUCT contours, RTDOSE grids, and registration objects, each producing a set of normalized tables with stable schemas.
Why normalization inside PostgreSQL matters
Performing this work inside PostgreSQL has significant advantages. First, it is fast. There is no serialization overhead, no network transfer to external workers, and no need to rehydrate JSON documents into application memory. PostgreSQL executes these transformations using optimized query plans and can parallelize where appropriate.
Second, it is operationally simpler. The normalization logic lives alongside the schema, versioned and deployed as SQL files. There are no long-running ETL jobs to monitor, no retry semantics to tune, and no risk of partial external state. If a normalization step fails, it fails transactionally.
Third, it produces first-class relational data. Once normalized, the resulting tables behave like any other clinical data mart. Column names are searchable, types are enforced, and standard SQL tools work without special knowledge of JSON. Clinicians can explore the data using familiar interfaces, and downstream analytics systems can consume it without custom adapters.
Finally, this approach preserves the original JSONB documents as a source of truth. Normalization is additive and repeatable. If a new clinical question arises or a vendor introduces new tags, the normalization logic can be extended without re-ingesting raw data.
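Concretely, a re-run can be wrapped in a single transaction so the normalized layer is rebuilt atomically from the preserved JSONB documents. A schematic outline:

BEGIN;
TRUNCATE flathr.rtplans, flathr.beam_sequence;
-- Repopulate with the INSERT ... SELECT statements shown earlier;
-- any failure rolls back the whole rebuild, leaving the prior state intact.
COMMIT;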
Making the solution approachable
A key goal of this work is approachability. Researchers should not need to understand DICOM encoding rules or JSON path expressions to ask meaningful questions. By exposing normalized tables with names that reflect clinical concepts, we reduce cognitive load and improve adoption.
The normalization framework also includes utility functions, shared casting logic, and a unified dispatcher that orchestrates modality-specific transformations. This keeps the system maintainable as the schema evolves and new modalities are added.
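A dispatcher of this kind can be a small PL/pgSQL procedure that routes each modality to its transformation. The sketch below illustrates the shape; the procedure names are hypothetical, not the repository's exact API:

CREATE OR REPLACE PROCEDURE flathr.normalize_modality(p_modality text)
LANGUAGE plpgsql AS $$
BEGIN
    CASE p_modality
        -- Hypothetical per-modality procedures; names are illustrative
        WHEN 'RTPLAN'   THEN CALL flathr.normalize_rtplan();
        WHEN 'RTSTRUCT' THEN CALL flathr.normalize_rtstruct();
        WHEN 'RTDOSE'   THEN CALL flathr.normalize_rtdose();
        ELSE RAISE NOTICE 'No normalizer registered for modality %', p_modality;
    END CASE;
END;
$$;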
All of the schema definitions and normalization logic are available as open source in the following repository:
https://github.com/vinithbraj/dicom_jsonb_sql_normalizer
Conclusion
Schemaless JSONB storage is an excellent foundation for large-scale DICOM ingestion, but it is not the end state for analytical use. Clinicians and researchers benefit from normalized, typed, and discoverable data structures that align with how they think and work.
By performing normalization directly inside PostgreSQL, we bridge the gap between flexible ingestion and usable analytics. The result is a system that is fast, transparent, and accessible, without sacrificing the completeness or fidelity of the original DICOM metadata. This approach allows engineering teams to move quickly while giving clinical users the relational interfaces they expect, ultimately enabling more effective clinical research and cross-disciplinary collaboration.
