dag: How should I respond to a data quality issue?
version: 1.0.0
image: https://images.unsplash.com/photo-1504868584819-f8e8b4b6d7e3?w=1200&q=80
description: Determine the appropriate response when a data quality issue is discovered in a pipeline, dataset, or report. This tree helps data engineers and analysts triage severity, decide whether to halt or continue processing, and escalate correctly based on impact and regulatory exposure.
tags: data, data quality, data engineering, governance, operations
entry: Q1

Q1: Are downstream reports, dashboards, or ML models already consuming or affected by the bad data?
  hint: Check whether any scheduled pipelines have completed successfully since the bad data entered the system, and whether any dashboards or reports have been refreshed with the corrupted values. If stakeholders have already viewed or acted on incorrect figures, the incident has moved from a pipeline concern to a business impact event and needs a communications response, not just a technical fix. Pull your pipeline lineage graph to map exactly which datasets and consumers are downstream of the affected table or source.
  yes -> Q2
  no -> Q3

Q2: Is the discrepancy material (e.g. more than 5% deviation, missing records, or a wrong sign)?
  hint: A 0.1% rounding difference in a non-critical report is qualitatively different from a -50% revenue figure caused by a sign inversion in a financial dashboard. Consider both the magnitude of the error and the audience: a small error in a board-level KPI deck may be more urgent than a large error in an internal experimental dataset. If you are unsure whether the discrepancy is material, assume it is and escalate: it is far better to over-communicate a data issue than to let a stakeholder discover it independently.
  yes -> Q4
  no -> [LOG_MONITOR]

Q3: Do you know the root cause of the issue?
  hint: A known root cause (e.g. a schema change in the upstream source, a failed dbt test on a specific column, or a broken API connector) is much easier to remediate than an unexplained anomaly. If the cause is unknown, quarantining the suspect data prevents further propagation while you investigate. Check your data observability tool (Monte Carlo, Soda, Great Expectations) for recent test failures, look at pipeline run logs for errors or schema drift alerts, and query the raw source to compare it against the processed output.
  yes -> [PATCH_SOURCE]
  no -> [QUARANTINE]

Q4: Does the affected data involve regulated information, such as financial reporting, PII, healthcare records, or audit logs?
  hint: Regulated data quality issues carry legal and compliance obligations beyond operational remediation. A material error in a financial KPI used for external reporting may require disclosure; a PII breach caused by incorrect data masking may trigger GDPR notification timelines. If you have any doubt about whether the affected dataset falls under a regulatory regime, treat it as regulated and escalate to your Data Governance or Legal team immediately. Do not attempt to quietly patch regulated data without a documented change control and sign-off process.
  yes -> [ESCALATE_GOVERNANCE]
  no -> [HALT_ALERT]

Q5: Is the root cause located in an upstream source system outside the data team's direct control?
  hint: If the bad data originates in a CRM, ERP, or third-party feed, the data team can patch a downstream representation of the data but cannot fix the true root cause without engaging the owning team. In this case, patching the data warehouse is a temporary mitigation; the permanent fix requires a ticket to the source system owner and a formal data quality SLA agreement. Document the patch clearly in your data catalogue so that future consumers are aware the data was corrected and can trace back to the source event.
  yes -> [PATCH_SOURCE]
  no -> [QUARANTINE]

[LOG_MONITOR]: Log and Monitor (Low Impact)
  color: #48BB78
  description: When downstream impact is minimal and the discrepancy is within acceptable tolerance, log the issue in your data quality tracking system and set an alert threshold so that any worsening of the issue triggers automatic notification. Create a ticket in your backlog to investigate and fix the root cause in the next sprint rather than interrupting current work. Add or tighten a dbt test, Great Expectations expectation, or Soda check on the affected column so that the same issue is automatically surfaced if it recurs. Document the observed behaviour and resolution decision in your data catalogue for future reference.
  code: DQ_LOG_MONITOR

[QUARANTINE]: Quarantine and Investigate
  color: #ED8936
  description: When downstream impact is limited but the root cause is unknown, quarantine the suspect dataset by flagging it in the catalogue, pausing downstream pipeline jobs that depend on it, and preventing dashboards from refreshing with unvalidated data. Open an investigation using your data observability tooling: compare row counts, null rates, and value distributions against baseline statistics from the previous successful run. Assign a named owner to the investigation and set a resolution SLA (typically 24–48 hours for a non-critical dataset). Communicate proactively to affected stakeholders that data is under review and provide an ETA for resolution.
  code: DQ_QUARANTINE

[PATCH_SOURCE]: Patch Upstream Source
  color: #4F86C6
  description: When the root cause is known and traceable to a specific upstream system or transformation, implement a targeted patch: a SQL correction, a pipeline fix, or a source system data correction via the owning team. Document every patch in version control with the ticket reference, affected rows, transformation logic applied, and the before/after comparison. Re-run affected downstream pipelines in the correct dependency order and validate outputs against expected metrics before republishing dashboards. Add a regression test (dbt test, Soda check, or CI assertion) that would have caught this issue earlier, and schedule a post-incident review to prevent recurrence.
  code: DQ_PATCH_SOURCE

[HALT_ALERT]: Halt Pipeline and Alert
  color: #E53E3E
  description: When a material quality issue is already affecting downstream consumers and the data involves non-regulated but business-critical information, halt all downstream pipeline jobs immediately to prevent further propagation of bad data. Send an incident alert to pipeline owners, affected dashboard consumers, and your data platform on-call channel, including the affected datasets, the nature of the issue, and the estimated time to resolution. Roll back any dashboards or reports to their last known good state by restoring from your data platform's table snapshots or time-travel feature (BigQuery, Snowflake, and Delta Lake all support this). Conduct a blameless post-incident review within 48 hours and document findings in your runbook.
  code: DQ_HALT_ALERT

[ESCALATE_GOVERNANCE]: Escalate to Data Governance
  color: #9F7AEA
  description: When a material data quality issue affects regulated, financial, or personally identifiable data, escalate immediately to your Data Governance team, Legal counsel, and senior data leadership; do not attempt to remediate silently. Preserve a forensic snapshot of the bad data and all affected pipeline outputs before making any changes, as this evidence may be required for regulatory or audit purposes. Work with Legal to assess whether any external disclosure obligations are triggered (e.g. GDPR Article 33 for PII breaches, SOX for financial reporting errors) and document the timeline of discovery, impact assessment, and remediation steps in your incident log. All patches to regulated data must go through a formal change control process with named approvers before being applied.
  code: DQ_ESCALATE_GOVERNANCE
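
The tree above can also be read as a small function, which may help when wiring it into an alerting script. The following is a minimal sketch only, not part of the tree definition: the `Discrepancy` class, the `is_material` and `triage` functions, and their parameter names are hypothetical, and the 5% default simply mirrors the example threshold in Q2. It follows the edges as written (Q1 to Q4), so Q5, which no edge currently points to, is not represented.

```python
from dataclasses import dataclass


@dataclass
class Discrepancy:
    """Hypothetical summary of one observed data quality issue."""
    expected: float            # baseline value, e.g. from the last good run
    observed: float            # value currently in the pipeline output
    missing_records: int = 0   # number of records known to be missing


def is_material(d: Discrepancy, threshold: float = 0.05) -> bool:
    """Q2 check: missing records, a flipped sign, or deviation above threshold."""
    if d.missing_records > 0:
        return True
    if d.expected * d.observed < 0:   # signs differ, e.g. +100 vs -50
        return True
    if d.expected == 0:               # avoid dividing by zero
        return d.observed != 0
    return abs(d.observed - d.expected) / abs(d.expected) > threshold


def triage(downstream_affected: bool, material: bool,
           root_cause_known: bool, regulated: bool) -> str:
    """Walk the tree as written and return the outcome code."""
    if downstream_affected:                       # Q1: yes -> Q2
        if not material:                          # Q2: no
            return "DQ_LOG_MONITOR"
        if regulated:                             # Q4: yes
            return "DQ_ESCALATE_GOVERNANCE"
        return "DQ_HALT_ALERT"                    # Q4: no
    if root_cause_known:                          # Q1: no -> Q3: yes
        return "DQ_PATCH_SOURCE"
    return "DQ_QUARANTINE"                        # Q3: no


if __name__ == "__main__":
    issue = Discrepancy(expected=100.0, observed=-50.0)
    print(triage(downstream_affected=True,
                 material=is_material(issue),
                 root_cause_known=False,
                 regulated=False))
```

In line with the Q2 hint, a caller who is unsure whether a discrepancy is material would pass `material=True` rather than rely on the numeric threshold alone.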