dag: Practical Data: Which Layer should I Integrate to? version: 1.0.0 image: https://www.maoperatingsystem.com/hubfs/medalion-layers.png description: As a data warehouse product manager responsible for figuring out the most appropriate mechanism for integrating new data sources to a four-tier data warehouse, use this decision tree to identify the most effective integration strategy possible. The tool presumes three architecture principles - i) all data consumption takes place from the final data products layer and ii) The further to the "right" the integration takes place towards platinum, the more efficient the technology data processing (ie less storage and less compute) iii) Finally we prefer to materialise data as views over adding addition storage layers. entry: Q1 mode: elimination Q1: How is the data expected to be used? A: REAL-TIME: The data expected to be available as a hyper near-time service, prioritizing latency over standadrisation and consolidation. [PLATINUM-M] B: NEAR-TIME/BATCH: The data is expected to be available in a timeline manor for analytics or onward integrations but standardisation, accuracy and completness is prioritized over latency. Q2: What is the content of the Data to be consumed in the warehouse? A: ENTITY: The data contains information about a specific entity or object that follows a specific lifecycle over time and will ultimately need to be included as part of a target logical data model [COPPER, BRONZE-M, BRONZE-S, SILVER-M, SILVER-S, GOLD-M, GOLD-S, PLATINUM-M] B: TRANSACTIONAL: The data contains point-in-time information about a specific event that needs to be included in a target logical data model, either as a list of events or as summarized data.[ COPPER, BRONZE-M, BRONZE-S, SILVER-M, SILVER-S, GOLD-M, GOLD-S, PLATINUM-M] C: UNDERLYING: The data contains low-level transactional information about a specific event that will ultimately be aggregated or analyzed in some way before derived data is included in the target data model. [COPPER, BRONZE-M, BRONZE-S] Q3: What is the technical integration type for obtaining the data from the source? A: FILES: Data is pulled or pushed from the source system as files and needs to "land" somewhere before being consumed into structured data records [COPPER] B: DATABASE: Data is pulled directly to a database warehouse, where data can either be queried or triggers established to represent data changes via SQL or other structured query language. [BRONZE-M, BRONZE-S, SILVER-M, SILVER-S] C: API: Data is provided via a dedicated API endpoint that can be polled or pushed via a webhook to get one or a few records at a time [BRONZE-S, SILVER-S] D: MESSAGES: Data is received via a real-time message queue as whole or partial records. [BRONZE-S, SILVER-S, GOLD-S] Q4: Are the files readable as they arrive, or do they require unencrypting or unpackaging before you can understand the data? when: Q3=A A: READABLE: No additional processing is required and the files can be processed as is into the warehouse [COPPER] B: ENCODED: We need to apply a specific decoding tool before we can process the data into the warehouse [COPPER] Q5: What data model, structure, and values does the data arrive in? A: SOURCE MODEL: The data arrives in a proprietary model with source-specific values that need to be converted into a target model before being used. [COPPER, BRONZE-M, BRONZE-S] B: TARGET MODEL: The data is already formatted to conform to the target logical model and does not need any transformation before it can be used.[COPPER, BRONZE-M, BRONZE-S, SILVER-M, SILVER-S,PLATINUM-M] Q6: Is the source the SOLE source of all information for a specific data subject/object, or will it need to be combined with other sources to create a complete logical model? when: Q5=B A: SOLE-SOURCE: The data source already conforms to the target logical model AND is the ONLY source of data for this data subject. No other data consolidation is needed to establish the complete data records. [COPPER, BRONZE-S, BRONZE-M, SILVER-S, SILVER-M, GOLD-M, GOLD-S, PLATINUM-M] B: PARTIAL-SOURCE: While the data source already conforms to the target logical data model, it does not contain everything we need about these records, nor does it contain the full population, and it will need to be consolidated with other sources before being used. [COPPER, BRONZE-S, BRONZE-M, SILVER-S, SILVER-M] Q7: What additional special situations do you need to take into account when considering the integration strategy? A: NONE: No Additional capabilities are required. B: SOURCE AUDITING: We are required to be able to evidence the exact data content in its original form that arrived from the source system, separate from our internal warehouse control and processing requirements. [COPPER, BRONZE-S, SILVER-S, GOLD-S] C: HIGH PERFORMANCE: We are required to make the data available immediately after it becomes available from the source to downstream consumers. [GOLD-M, PLATINUM-M] [PLATINUM-M]: Integrate to PLATINUM (as a materialized View) color: #E5E4E2 description: The available source data and consumers demand speed of access for this specific dataset. The data is available via a standard SQL-like query interface, so materialize it straight into Platinum as its own Data Product. [GOLD-M]: Integrate to Gold (as a materialized View) color: #D4AF37 description: The available source data already conforms to the target schema AND is the sole source of all information associated with a specific target object model AND the data is available via an SQL-style interface that can be queried remotely. As such there is no need for additional normalization, consolidation or other value-add capabilities. As such we can integrate this source straight to the gold layer as a materialized view. [GOLD-S]: Integrate to Gold (as stored data) color: #D4AF37 description: The available source data already conforms to the target schema AND is the sole source of all information associated with a specific target object model. As such, there is no need for additional normalization, consolidation, or other value-add capabilities. As such we can integrate this source straight to the gold layer. [SILVER-M]: Integrate to Silver (as Materialized View) color: #C0C0C0 description: The available source data already conforms to the target schema and is available via an SQL-like query engine. As such, establish a materialized view of the data directly into the Silver layer [SILVER-S]: Integrate to Silver (as stored Data) color: #C0C0C0 description: The available source data already conforms to the target schema; however, it is only accessible through integrations that require loading and storing it directly in the warehouse. [BRONZE-M]: Integrate to BRONZE Layer (As materialized View) color: #CD7F32 description: Insert new records into Bronze as they can be read from the source system. No need to initiate file copies of the in-flight data. Bronze becomes the auditable history of all records received by the data warehouse. [BRONZE-S]: Integrate to BRONZE Layer (As stored data) color: #CD7F32 description: Insert new records into Bronze as they can be read from the source system. No need to initiate file copies of the in-flight data. Bronze becomes the auditable history of all records received by the data warehouse. [COPPER]: Integrate to COPPER File Landing Zone color: #B87333 description: Leverage a Copper file-based landing zone where all incoming files can be stored before processing the data into the bronze stage of the data warehouse. Inbound files can be stored for an extended period to meet audit requirements or simply purged after a minimum retention period. Establish loaders that process each incoming file, decrypt or convert the resulting data as needed, and insert new rows into the Bronze layer.