Which data storage tier should I use for this dataset?

By Andrew

Decision tree: data, storage, data engineering, cloud, cost optimisation

Determine the appropriate storage tier for a dataset based on access patterns, latency requirements, age, regulatory obligations, and cost sensitivity. The right tier balances data availability against infrastructure spend and compliance risk.

Open interactive version →

Overview

Type: Decision tree
Tags: data, storage, data engineering, cloud, cost optimisation
Entry: Q1
Questions: 5
Outcomes: 4
Author: Andrew
Last updated: 2026-05-12

Decision Tree

Start: How frequently is this dataset accessed in normal operations?

yes (accessed frequently, daily or more often)

Continues to question: Does your use case require query response times under a few seconds?

no (accessed rarely, monthly or less)

Continues to question: Is the data older than 90 days and no longer part of any active pipeline?

Machine-Readable JSON (Canonical Model)


{
  "_meta": {
    "schema": "https://www.drawdecisiontree.com/decision-dag.schema.json",
    "source": "https://www.drawdecisiontree.com",
    "description": "DrawDecisionTree.com is a free tool for building, sharing, and embedding interactive decision trees. This file is the machine-readable export of a published decision tree. The `dsl` field contains the original source in the Decision DAG DSL; the `dag` schema is documented at the URL in `schema` above.",
    "links": {
      "interactive": "https://www.drawdecisiontree.com/t/drawdecisiontree/data-storage-tier.html",
      "embed": "https://www.drawdecisiontree.com/embed/path/drawdecisiontree/data-storage-tier",
      "dsl_reference": "https://www.drawdecisiontree.com/decision-tree-dsl-reference.html",
      "guides": "https://www.drawdecisiontree.com/guides",
      "schema_docs": "https://www.drawdecisiontree.com/decision-dag.schema.json",
      "author_trees": "https://www.drawdecisiontree.com/trees/drawdecisiontree"
    },
    "generated_at": "2026-05-29T12:05:39.271Z"
  },
  "author": {
    "handle": "drawdecisiontree",
    "first_name": "Andrew",
    "last_name": null,
    "avatar_url": "1d32d828-b6ca-40ec-bdd7-771fe7b9c36a/avatar-1778531481027.svg",
    "display_name": "Andrew"
  },
  "file": {
    "id": "9d0ee811-db05-48bb-a46b-e3e3c9d7348b",
    "name": "Which data storage tier should I use for this dataset?",
    "public_slug": "data-storage-tier",
    "updated_at": "2026-05-12T16:53:43.587978+00:00",
    "url": "https://www.drawdecisiontree.com/t/drawdecisiontree/data-storage-tier.html",
    "json_url": "https://www.drawdecisiontree.com/t/drawdecisiontree/data-storage-tier/tree.json",
    "dsl_url": "https://www.drawdecisiontree.com/t/drawdecisiontree/data-storage-tier/tree.dag"
  },
  "meta": {
    "description": "Determine the appropriate storage tier for a dataset based on access patterns, latency requirements, age, regulatory obligations, and cost sensitivity. The right tier balances data availability against infrastructure spend and compliance risk.",
    "mode": "decision",
    "entry": "Q1",
    "tags": [
      "data",
      "storage",
      "data engineering",
      "cloud",
      "cost optimisation"
    ],
    "image": "https://images.unsplash.com/photo-1558494949-ef010cbdcc31?w=1200&q=80"
  },
  "questions": [
    {
      "id": "Q1",
      "text": "How frequently is this dataset accessed in normal operations?"
    },
    {
      "id": "Q2",
      "text": "Does your use case require query response times under a few seconds?"
    },
    {
      "id": "Q3",
      "text": "Is the data older than 90 days and no longer part of any active pipeline?"
    },
    {
      "id": "Q4",
      "text": "Is there a legal, regulatory, or audit requirement to retain this data?"
    },
    {
      "id": "Q5",
      "text": "Does retaining this data provide measurable future analytical or business value?"
    }
  ],
  "outcomes": [
    {
      "id": "HOT",
      "label": "Hot Storage (SSD / In-Memory)"
    },
    {
      "id": "WARM",
      "label": "Warm Storage (HDD / Object Storage)"
    },
    {
      "id": "COLD",
      "label": "Cold Storage (Archive)"
    },
    {
      "id": "DELETE",
      "label": "Delete / Do Not Retain"
    }
  ],
  "dsl": "dag: Which data storage tier should I use for this dataset?\nversion: 1.0.0\nimage: https://images.unsplash.com/photo-1558494949-ef010cbdcc31?w=1200&q=80\ndescription: Determine the appropriate storage tier for a dataset based on access patterns, latency requirements, age, regulatory obligations, and cost sensitivity. The right tier balances data availability against infrastructure spend and compliance risk.\ntags: data, storage, data engineering, cloud, cost optimisation\nentry: Q1\n\nQ1: How frequently is this dataset accessed in normal operations?\n  hint: Consider both scheduled pipeline reads and ad-hoc analyst queries. Answer yes if the dataset is accessed frequently, meaning daily or more often: think operational dashboards, real-time scoring, or ETL source tables. Answer no if it is accessed rarely, meaning monthly or less: think historical snapshots, audit exports, or raw event archives. Access frequency is the single most important input into tiering decisions because storage costs for hot and cold tiers can differ by an order of magnitude or more.\n  yes -> Q2\n  no  -> Q3\n\nQ2: Does your use case require query response times under a few seconds?\n  hint: Sub-second to low-single-digit second latency is typically needed for customer-facing features, live dashboards, and operational ML scoring. If your stakeholders are fine waiting 10–60 seconds for a query to return — as is common for analytical batch reports — you may not need the expense of SSD or in-memory storage. Be honest about whether latency requirements come from actual user needs or inherited assumptions that could be relaxed.\n  yes -> [HOT]\n  no  -> [WARM]\n\nQ3: Is the data older than 90 days and no longer part of any active pipeline?\n  hint: Data older than 90 days that has been superseded by more recent snapshots rarely justifies warm-tier pricing. Check your data lineage tool or pipeline schedules to confirm the dataset is truly retired from active processing. If the data feeds even one scheduled job per month, it may be worth retaining in warm storage for faster retrieval rather than paying retrieval fees from cold archive on every access.\n  yes -> Q4\n  no  -> [WARM]\n\nQ4: Is there a legal, regulatory, or audit requirement to retain this data?\n  hint: Regulations such as HIPAA, SOX, and PCI-DSS can set minimum retention periods (often 5–7 years) during which data must be retrievable, even if access is rare. GDPR works the other way: it limits how long personal data may be kept, so retain personal data only for as long as another obligation or purpose justifies it. Check with your legal or compliance team before deleting any dataset that touches financial transactions, personal data, healthcare records, or system audit logs. Cold archive tiers (AWS S3 Glacier, Azure Archive, GCS Coldline) are designed for this scenario: low-cost long-term retention of rarely read data, with retrieval that can take hours on some services.\n  yes -> [COLD]\n  no  -> Q5\n\nQ5: Does retaining this data provide measurable future analytical or business value?\n  hint: Ask whether a data scientist or analyst has requested this data in the past year, or whether it might be needed for future model training, root-cause analysis, or customer dispute resolution. If the answer is uncertain, lean towards cold archival rather than deletion, because storage costs at the archive tier are low. Only recommend deletion when the data is definitively superseded, carries no PII risk if leaked, and has no conceivable future use case.\n  yes -> [COLD]\n  no  -> [DELETE]\n\n[HOT]: Hot Storage (SSD / In-Memory)\n  color: #E53E3E\n  description: Hot storage (NVMe SSD, Redis, Bigtable, or an in-memory OLAP engine such as Apache Druid) is appropriate when data must be available with millisecond to low-second latency for frequent reads. Provision sufficient IOPS and memory headroom to handle peak traffic without throttling, and implement a TTL or eviction policy to prevent unbounded growth. Use a caching layer (Redis or Memcached) in front of your primary store to absorb repetitive read patterns cheaply. Review hot-tier datasets quarterly and demote anything whose access frequency has dropped, since hot storage can cost 10–50x more than cold per GB.\n  code: STORAGE_HOT\n\n[WARM]: Warm Storage (HDD / Object Storage)\n  color: #ED8936\n  description: Warm storage — cloud object storage such as S3 Standard, GCS Standard, or Azure Blob Hot/Cool — balances reasonable retrieval speed (seconds to low minutes) against meaningfully lower cost than SSD. It is ideal for datasets accessed weekly or monthly by batch pipelines, BI tools, or on-demand analyst queries. Partition data by date or entity key so that downstream jobs scan only the relevant subset, reducing both cost and latency. Configure lifecycle policies to automatically promote to hot or demote to cold storage based on last-accessed timestamps.\n  code: STORAGE_WARM\n\n[COLD]: Cold Storage (Archive)\n  color: #4F86C6\n  description: Cold archive tiers (AWS S3 Glacier, Azure Archive Storage, or GCS Coldline) offer the lowest per-GB cost in exchange for retrieval fees and, on some services, retrieval times measured in minutes to hours. This tier is ideal for regulatory retention, historical snapshots, and raw event logs that are unlikely to be queried but must be preserved. Always store a manifest or data catalogue entry pointing to archived datasets so future users can discover and request retrieval without hunting through storage buckets. Test the restore process annually to confirm SLAs are met and that archived data remains readable.\n  code: STORAGE_COLD\n\n[DELETE]: Delete / Do Not Retain\n  color: #718096\n  description: When data has no regulatory obligation, no future analytical value, and no active pipeline dependency, deletion is the responsible choice — not merely a cost-saving measure but a data minimisation practice required by GDPR and similar regulations. Before deleting, run a final lineage check to confirm no downstream system references the dataset, and capture a deletion record in your data catalogue with the date, dataset name, and the rationale. If the data contains personal information, ensure deletion is cryptographically verifiable and document it for your DPA records. Schedule a process to flag similar datasets for review on a regular cycle to prevent data hoarding.\n  code: STORAGE_DELETE\n"
}
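
The `dsl` field in the export above carries the branching rules as text. As a rough sketch, the yes/no edges could be pulled out of that string with two regular expressions. The patterns below are inferred only from this one export and are an assumption, not the official Decision DAG DSL grammar; `parse_transitions` and `SAMPLE_DSL` are hypothetical names used for illustration.

```python
import re

# Patterns inferred from this export only; the official DSL grammar may differ.
NODE_RE = re.compile(r"^(Q\d+):\s*(.+)$")
EDGE_RE = re.compile(r"^\s+(yes|no)\s*->\s*\[?(\w+)\]?\s*$")


def parse_transitions(dsl: str) -> dict:
    """Map each question id to its {'yes': target, 'no': target} edges."""
    transitions = {}
    current = None
    for line in dsl.splitlines():
        node = NODE_RE.match(line)
        if node:
            # A new question block starts here.
            current = node.group(1)
            transitions[current] = {}
            continue
        if line.startswith("["):
            # Outcome blocks such as "[HOT]: ..." carry no edges.
            current = None
            continue
        edge = EDGE_RE.match(line)
        if edge and current is not None:
            transitions[current][edge.group(1)] = edge.group(2)
    return transitions


# Abbreviated stand-in for the `dsl` string (hints and most blocks omitted).
SAMPLE_DSL = (
    "entry: Q1\n\n"
    "Q1: How frequently is this dataset accessed in normal operations?\n"
    "  yes -> Q2\n"
    "  no  -> Q3\n\n"
    "Q2: Does your use case require query response times under a few seconds?\n"
    "  yes -> [HOT]\n"
    "  no  -> [WARM]\n\n"
    "[HOT]: Hot Storage (SSD / In-Memory)\n"
    "  code: STORAGE_HOT\n"
)

print(parse_transitions(SAMPLE_DSL))
```

Square brackets around outcome ids are stripped, so question targets and outcome targets end up in the same flat namespace.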

DSL Representation

dag: Which data storage tier should I use for this dataset?
version: 1.0.0
image: https://images.unsplash.com/photo-1558494949-ef010cbdcc31?w=1200&q=80
description: Determine the appropriate storage tier for a dataset based on access patterns, latency requirements, age, regulatory obligations, and cost sensitivity. The right tier balances data availability against infrastructure spend and compliance risk.
tags: data, storage, data engineering, cloud, cost optimisation
entry: Q1

Q1: How frequently is this dataset accessed in normal operations?
  hint: Consider both scheduled pipeline reads and ad-hoc analyst queries. Answer yes if the dataset is accessed frequently, meaning daily or more often: think operational dashboards, real-time scoring, or ETL source tables. Answer no if it is accessed rarely, meaning monthly or less: think historical snapshots, audit exports, or raw event archives. Access frequency is the single most important input into tiering decisions because storage costs for hot and cold tiers can differ by an order of magnitude or more.
  yes -> Q2
  no  -> Q3

Q2: Does your use case require query response times under a few seconds?
  hint: Sub-second to low-single-digit second latency is typically needed for customer-facing features, live dashboards, and operational ML scoring. If your stakeholders are fine waiting 10–60 seconds for a query to return — as is common for analytical batch reports — you may not need the expense of SSD or in-memory storage. Be honest about whether latency requirements come from actual user needs or inherited assumptions that could be relaxed.
  yes -> [HOT]
  no  -> [WARM]

Q3: Is the data older than 90 days and no longer part of any active pipeline?
  hint: Data older than 90 days that has been superseded by more recent snapshots rarely justifies warm-tier pricing. Check your data lineage tool or pipeline schedules to confirm the dataset is truly retired from active processing. If the data feeds even one scheduled job per month, it may be worth retaining in warm storage for faster retrieval rather than paying retrieval fees from cold archive on every access.
  yes -> Q4
  no  -> [WARM]

Q4: Is there a legal, regulatory, or audit requirement to retain this data?
  hint: Regulations such as HIPAA, SOX, and PCI-DSS can set minimum retention periods (often 5–7 years) during which data must be retrievable, even if access is rare. GDPR works the other way: it limits how long personal data may be kept, so retain personal data only for as long as another obligation or purpose justifies it. Check with your legal or compliance team before deleting any dataset that touches financial transactions, personal data, healthcare records, or system audit logs. Cold archive tiers (AWS S3 Glacier, Azure Archive, GCS Coldline) are designed for this scenario: low-cost long-term retention of rarely read data, with retrieval that can take hours on some services.
  yes -> [COLD]
  no  -> Q5

Q5: Does retaining this data provide measurable future analytical or business value?
  hint: Ask whether a data scientist or analyst has requested this data in the past year, or whether it might be needed for future model training, root-cause analysis, or customer dispute resolution. If the answer is uncertain, lean towards cold archival rather than deletion, because storage costs at the archive tier are low. Only recommend deletion when the data is definitively superseded, carries no PII risk if leaked, and has no conceivable future use case.
  yes -> [COLD]
  no  -> [DELETE]

[HOT]: Hot Storage (SSD / In-Memory)
  color: #E53E3E
  description: Hot storage (NVMe SSD, Redis, Bigtable, or an in-memory OLAP engine such as Apache Druid) is appropriate when data must be available with millisecond to low-second latency for frequent reads. Provision sufficient IOPS and memory headroom to handle peak traffic without throttling, and implement a TTL or eviction policy to prevent unbounded growth. Use a caching layer (Redis or Memcached) in front of your primary store to absorb repetitive read patterns cheaply. Review hot-tier datasets quarterly and demote anything whose access frequency has dropped, since hot storage can cost 10–50x more than cold per GB.
  code: STORAGE_HOT

[WARM]: Warm Storage (HDD / Object Storage)
  color: #ED8936
  description: Warm storage — cloud object storage such as S3 Standard, GCS Standard, or Azure Blob Hot/Cool — balances reasonable retrieval speed (seconds to low minutes) against meaningfully lower cost than SSD. It is ideal for datasets accessed weekly or monthly by batch pipelines, BI tools, or on-demand analyst queries. Partition data by date or entity key so that downstream jobs scan only the relevant subset, reducing both cost and latency. Configure lifecycle policies to automatically promote to hot or demote to cold storage based on last-accessed timestamps.
  code: STORAGE_WARM

[COLD]: Cold Storage (Archive)
  color: #4F86C6
  description: Cold archive tiers (AWS S3 Glacier, Azure Archive Storage, or GCS Coldline) offer the lowest per-GB cost in exchange for retrieval fees and, on some services, retrieval times measured in minutes to hours. This tier is ideal for regulatory retention, historical snapshots, and raw event logs that are unlikely to be queried but must be preserved. Always store a manifest or data catalogue entry pointing to archived datasets so future users can discover and request retrieval without hunting through storage buckets. Test the restore process annually to confirm SLAs are met and that archived data remains readable.
  code: STORAGE_COLD

[DELETE]: Delete / Do Not Retain
  color: #718096
  description: When data has no regulatory obligation, no future analytical value, and no active pipeline dependency, deletion is the responsible choice — not merely a cost-saving measure but a data minimisation practice required by GDPR and similar regulations. Before deleting, run a final lineage check to confirm no downstream system references the dataset, and capture a deletion record in your data catalogue with the date, dataset name, and the rationale. If the data contains personal information, ensure deletion is cryptographically verifiable and document it for your DPA records. Schedule a process to flag similar datasets for review on a regular cycle to prevent data hoarding.
  code: STORAGE_DELETE
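
The branching in the DSL above can also be followed in code. The sketch below transcribes the yes/no edges and outcome codes by hand into two dictionaries and walks them from the entry question. `TREE`, `OUTCOME_CODES`, and `recommend_tier` are illustrative names, not part of the DrawDecisionTree tooling.

```python
# Edges transcribed by hand from the DSL above.
TREE = {
    "Q1": {"yes": "Q2", "no": "Q3"},
    "Q2": {"yes": "HOT", "no": "WARM"},
    "Q3": {"yes": "Q4", "no": "WARM"},
    "Q4": {"yes": "COLD", "no": "Q5"},
    "Q5": {"yes": "COLD", "no": "DELETE"},
}

# Outcome ids mapped to the `code` value declared on each outcome block.
OUTCOME_CODES = {
    "HOT": "STORAGE_HOT",
    "WARM": "STORAGE_WARM",
    "COLD": "STORAGE_COLD",
    "DELETE": "STORAGE_DELETE",
}


def recommend_tier(answers: dict, entry: str = "Q1") -> str:
    """Follow yes/no answers from the entry question to an outcome code.

    `answers` maps question ids to booleans (True means yes). Only the
    questions on the path actually taken need an answer.
    """
    node = entry
    while node in TREE:
        if node not in answers:
            raise KeyError(f"missing answer for {node}")
        node = TREE[node]["yes" if answers[node] else "no"]
    return OUTCOME_CODES[node]


# Frequently accessed, needs fast queries.
print(recommend_tier({"Q1": True, "Q2": True}))
# Rarely accessed, retired, no retention duty, no future value.
print(recommend_tier({"Q1": False, "Q3": True, "Q4": False, "Q5": False}))
```

Because the walk stops as soon as it reaches an outcome, a caller can ask the questions lazily and supply only the answers the path requires.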

Machine Access

Static JSON: /t/drawdecisiontree/data-storage-tier/tree.json
Live JSON (SPA): /json/drawdecisiontree/data-storage-tier
Raw DSL: /t/drawdecisiontree/data-storage-tier/tree.dag
Canonical HTML: /t/drawdecisiontree/data-storage-tier.html
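
A minimal sketch of consuming the static JSON path listed above, using only the Python standard library. It assumes the path is served from `https://www.drawdecisiontree.com` (the `source` value in the export) and that the response has the same shape as the JSON shown earlier on this page. `fetch_export` and `summarise` are hypothetical helper names, and `SAMPLE_EXPORT` is a trimmed stand-in so the summary step can be tried offline.

```python
import json
from urllib.request import urlopen

BASE_URL = "https://www.drawdecisiontree.com"
STATIC_JSON_PATH = "/t/drawdecisiontree/data-storage-tier/tree.json"


def fetch_export(url: str = BASE_URL + STATIC_JSON_PATH, timeout: float = 10.0) -> dict:
    """Download and decode the export (assumes the server returns JSON)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def summarise(export: dict) -> dict:
    """Pick out the entry point plus question and outcome ids."""
    return {
        "entry": export["meta"]["entry"],
        "questions": [q["id"] for q in export["questions"]],
        "outcomes": [o["id"] for o in export["outcomes"]],
    }


# Trimmed stand-in with the same keys as the export shown on this page.
SAMPLE_EXPORT = {
    "meta": {"mode": "decision", "entry": "Q1"},
    "questions": [{"id": "Q1", "text": "..."}, {"id": "Q2", "text": "..."}],
    "outcomes": [{"id": "HOT", "label": "Hot Storage (SSD / In-Memory)"}],
}

print(summarise(SAMPLE_EXPORT))
# With network access: summarise(fetch_export())
```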

Questions in this decision tree

How frequently is this dataset accessed in normal operations?
Does your use case require query response times under a few seconds?
Is the data older than 90 days and no longer part of any active pipeline?
Is there a legal, regulatory, or audit requirement to retain this data?
Does retaining this data provide measurable future analytical or business value?

Possible outcomes

Hot Storage (SSD / In-Memory)
Warm Storage (HDD / Object Storage)
Cold Storage (Archive)
Delete / Do Not Retain
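
The warm-storage outcome suggests lifecycle policies for moving data between tiers, and the tree uses 90 days as its age threshold. As one possible illustration, the sketch below builds a lifecycle configuration in the dictionary shape that Amazon S3 accepts. Note that S3 lifecycle rules act on object age rather than last-accessed time, so this only approximates the advice. The prefix `raw/events/` and the helper name `build_lifecycle_config` are hypothetical.

```python
from typing import Optional


def build_lifecycle_config(
    prefix: str,
    archive_after_days: int = 90,
    expire_after_days: Optional[int] = None,
) -> dict:
    """Build an S3-style lifecycle configuration for one key prefix.

    Objects under `prefix` move to the GLACIER storage class once they are
    `archive_after_days` old and, optionally, are deleted later.
    """
    rule = {
        "ID": "archive-" + (prefix.strip("/").replace("/", "-") or "all"),
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [{"Days": archive_after_days, "StorageClass": "GLACIER"}],
    }
    if expire_after_days is not None:
        if expire_after_days <= archive_after_days:
            raise ValueError("expiration must come after the archive transition")
        rule["Expiration"] = {"Days": expire_after_days}
    return {"Rules": [rule]}


# Hypothetical prefix; 90 days mirrors the threshold used in the tree.
config = build_lifecycle_config("raw/events/")
print(config)

# With boto3 installed and credentials configured, the dictionary could be
# applied with:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="your-bucket", LifecycleConfiguration=config)
```

Only set `expire_after_days` once the retention and future-value questions in the tree have been answered, since expiration deletes the objects.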

How to use this decision tree

Click "Open interactive version" to step through the questions. Your answers narrow the tree until a recommended outcome is reached. You can also embed this tree on your own site.