dag: Data Warehouse Design
version: 1.0.0
image: https://images.unsplash.com/photo-1551288049-bebda4e38f71?w=1200&q=80
description: As a data warehouse product manager responsible for integrating new data sources into a four-tier medallion architecture, use this decision tree to identify the most effective integration strategy. The tool assumes three guiding principles: all data consumption occurs from the final data products layer; the further right the integration point (towards Platinum), the more efficient the processing in terms of storage and compute; and we prefer materialised views over additional storage layers wherever the source system supports it.
tags: data, data warehouse, analytics, medallion architecture, etl
entry: Q1
mode: elimination

Q1: How is the data expected to be used?
  A: REAL-TIME: The data is expected to be available as a hyper near-time service, prioritising latency over standardisation and consolidation.  [PLATINUM-M]
  B: NEAR-TIME/BATCH: The data is expected to be available in a timely manner for analytics or onward integrations, but standardisation, accuracy, and completeness are prioritised over latency.

Q2: What is the content of the data to be consumed in the warehouse?
  A: ENTITY: The data contains information about a specific entity or object that follows a specific lifecycle over time and will ultimately need to be included as part of a target logical data model. [COPPER, BRONZE-M, BRONZE-S, SILVER-M, SILVER-S, GOLD-M, GOLD-S, PLATINUM-M]
  B: TRANSACTIONAL: The data contains point-in-time information about a specific event that needs to be included in a target logical data model, either as a list of events or as summarised data. [COPPER, BRONZE-M, BRONZE-S, SILVER-M, SILVER-S, GOLD-M, GOLD-S, PLATINUM-M]
  C: UNDERLYING: The data contains low-level transactional information about a specific event that will ultimately be aggregated or analysed in some way before derived data is included in the target data model. [COPPER, BRONZE-M, BRONZE-S]

Q3: What is the technical integration type for obtaining the data from the source?
  A: FILES: Data is pulled or pushed from the source system as files and needs to land somewhere before being consumed into structured data records. [COPPER]
  B: DATABASE: Data is pulled directly from a database or warehouse, where data can either be queried or triggers established to represent data changes via SQL or another structured query language. [BRONZE-M, BRONZE-S, SILVER-M, SILVER-S]
  C: API: Data is provided via a dedicated API endpoint that can be polled or pushed via a webhook to get one or a few records at a time. [BRONZE-S, SILVER-S]
  D: MESSAGES: Data is received via a real-time message queue as whole or partial records. [BRONZE-S, SILVER-S, GOLD-S]

Q4: Are the files readable as they arrive, or do they require decrypting or unpackaging before you can understand the data?
  when: Q3=A
  A: READABLE: No additional processing is required and the files can be processed as-is into the warehouse. [COPPER]
  B: ENCODED: A specific decoding tool must be applied before the data can be processed into the warehouse. [COPPER]

Q5: What data model, structure, and values does the data arrive in?
  A: SOURCE MODEL: The data arrives in a proprietary model with source-specific values that need to be converted into a target model before being used. [COPPER, BRONZE-M, BRONZE-S]
  B: TARGET MODEL: The data is already formatted to conform to the target logical model and does not need any transformation before it can be used. [COPPER, BRONZE-M, BRONZE-S, SILVER-M, SILVER-S, PLATINUM-M]

Q6: Is the source the sole source of all information for a specific data subject/object, or will it need to be combined with other sources to create a complete logical model?
  when: Q5=B
  A: SOLE-SOURCE: The data source already conforms to the target logical model AND is the only source of data for this data subject. No other data consolidation is needed to establish the complete data records. [COPPER, BRONZE-S, BRONZE-M, SILVER-S, SILVER-M, GOLD-M, GOLD-S, PLATINUM-M]
  B: PARTIAL-SOURCE: While the data source already conforms to the target logical data model, it does not contain everything needed about these records or the full population, and it will need to be consolidated with other sources before being used. [COPPER, BRONZE-S, BRONZE-M, SILVER-S, SILVER-M]

Q7: What additional special situations do you need to take into account when considering the integration strategy?
  A: NONE: No additional capabilities are required.
  B: SOURCE AUDITING: We are required to evidence the exact data content in its original form as received from the source system, separate from our internal warehouse control and processing requirements. [COPPER, BRONZE-S, SILVER-S, GOLD-S]
  C: HIGH PERFORMANCE: We are required to make the data available immediately after it becomes available from the source to downstream consumers. [GOLD-M, PLATINUM-M]

[PLATINUM-M]: Integrate to PLATINUM (as a Materialised View)
  color: #E5E4E2
  description: The source data and its consumers demand the fastest possible access — latency is the primary constraint and standardisation is secondary. The data is available via a standard SQL-like query interface, so bypass the intermediate medallion layers entirely and materialise the data directly into Platinum as its own self-contained Data Product. This is the most efficient integration path in terms of compute and storage: no intermediate copies, no transformation pipelines, no bronze or silver staging. The trade-off is that Platinum-direct integrations carry source-model artefacts into the consumption layer, which can complicate cross-source reporting. Use this pattern only when real-time latency is genuinely non-negotiable and consumers understand they are working with a source-native model.

[GOLD-M]: Integrate to Gold (as a Materialised View)
  color: #D4AF37
  description: The source data already conforms to the target logical schema, this source is the sole provider of all information for the relevant data subject, and the data is accessible via an SQL-style interface that can be queried remotely. All three conditions mean there is no need for additional normalisation, consolidation, or transformation. Materialise the data directly into the Gold layer as a view over the source system — no physical copy is required in the warehouse. This is the most cost-efficient pattern when the source can be queried with acceptable latency: storage overhead is minimal, the data is always current, and there is no ETL pipeline to maintain. Ensure the source system's query interface can handle the read load imposed by downstream consumers before committing to this pattern.

[GOLD-S]: Integrate to Gold (as stored data)
  color: #D4AF37
  description: The source data already conforms to the target logical schema and this source is the sole provider of all information for the relevant data subject — but the data cannot be accessed via a queryable SQL interface and must be physically loaded into the warehouse. There is no need for additional normalisation, consolidation, or transformation beyond ingestion. Load the data directly into Gold, bypassing Bronze and Silver. This pattern is appropriate when the source provides clean, fully conformant data via files, API, or message queue, and the warehouse does not need to retain a raw history of the source payload for audit purposes. Establish an idempotent load process with a clear primary key to enable safe re-runs.

[SILVER-M]: Integrate to Silver (as a Materialised View)
  color: #C0C0C0
  description: The source data already conforms to the target logical schema and is accessible via an SQL-like query engine — but it cannot be taken directly to Gold because it is a partial source that must be consolidated with other data sources to build a complete data product. Establish a materialised view of the data in the Silver layer, joining or unioning it with other Silver-layer sources as needed before promoting the consolidated result to Gold. The materialised view approach keeps storage overhead low while maintaining a live, queryable representation of the source. This pattern works well when the source system can handle the read load and when consolidation logic is stable rather than rapidly evolving.

[SILVER-S]: Integrate to Silver (as stored data)
  color: #C0C0C0
  description: The source data already conforms to the target logical schema, but it is only accessible through an integration mechanism (file drop, API, message queue) that requires physically loading and storing the records in the warehouse. It is also a partial source that needs to be consolidated with other data sources before it is ready for Gold. Load the data into Silver as a stored table, preserving enough metadata (source timestamp, load batch ID) to enable incremental loads and audit tracing. Apply the consolidation logic at the Silver-to-Gold promotion step. This pattern introduces a physical copy but ensures the data is available for consolidation even when the source system is temporarily unavailable.

[BRONZE-M]: Integrate to BRONZE Layer (as a Materialised View)
  color: #CD7F32
  description: The source data is available via a queryable SQL interface, which means you can establish a live materialised view in the Bronze layer without copying the raw data. Bronze becomes the point at which the warehouse first has a structured representation of the source records, but because it is a view, there is no storage duplication. New or changed records are visible in Bronze as soon as they appear in the source. This pattern is appropriate when the source data arrives in a proprietary model that needs transformation before it can be used — Bronze stores the source-native representation, and the Silver layer handles the model conversion. Ensure the view definition is idempotent and that upstream schema changes in the source system are detected and handled.

[BRONZE-S]: Integrate to BRONZE Layer (as stored data)
  color: #CD7F32
  description: The source data must be physically loaded into the Bronze layer because the integration mechanism (file, API, message queue) does not expose a queryable SQL interface. Each incoming record is inserted into Bronze as received, creating an append-only, auditable history of every payload the warehouse has ever received from this source. Bronze is the warehouse's system of record for raw inbound data — downstream transformation layers (Silver, Gold) read from Bronze rather than from the source directly, which decouples the pipeline from source system availability. Establish a clear primary key and ingestion timestamp on each Bronze record to support incremental loading, deduplication, and replay. If source auditing is required, Bronze is the layer where original payloads must be preserved in their exact received form.

[COPPER]: Integrate to COPPER File Landing Zone
  color: #B87333
  description: The source delivers data as files — whether pushed by the source system or pulled by an integration job — and those files must land in a staging area before any structured processing can begin. The Copper layer is a file-based landing zone: raw files are received and stored here, optionally for an extended retention period to satisfy audit or recovery requirements, before being processed onward into Bronze. Loaders pick up each arriving file, apply any necessary decryption, format conversion (CSV to Parquet, XML to JSON, proprietary binary to structured records), and validation before inserting rows into the Bronze layer. Files that fail validation are quarantined rather than discarded, enabling replay once the issue is resolved. Copper is the entry point for all file-based integrations regardless of the eventual target layer — the destination (Bronze, Silver, Gold) is determined by subsequent questions about data model conformance and consolidation requirements.