Which data storage tier should I use for this dataset?

Determine the appropriate storage tier for a dataset based on access patterns, latency requirements, age, regulatory obligations, and cost sensitivity. The right tier balances data availability against infrastructure spend and compliance risk.

Overview

Type: Decision tree
Tags: data, storage, data engineering, cloud, cost optimisation
Entry: Q1
Questions: 5
Outcomes: 4
Author: Andrew
Last updated: 2026-05-12

Decision Tree

Start: How frequently is this dataset accessed in normal operations?

yes (accessed frequently, meaning daily or more often)

  • Continues to question: Does your use case require query response times under a few seconds?

no (accessed rarely, for example monthly or less often)

  • Continues to question: Is the data older than 90 days and no longer part of any active pipeline?
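
The rendered tree above shows only the first question. As a minimal sketch, the full routing given in the DSL on this page (Q1 to Q5, with the outcome codes `STORAGE_HOT`, `STORAGE_WARM`, `STORAGE_COLD`, and `STORAGE_DELETE`) could be written as a plain function. The `DatasetProfile` class and its field names are hypothetical, introduced only for illustration; they are not part of the export.

```python
from dataclasses import dataclass


@dataclass
class DatasetProfile:
    """Yes/no answers to Q1-Q5 (field names are illustrative assumptions)."""
    accessed_frequently: bool          # Q1
    needs_fast_queries: bool = False   # Q2
    old_and_retired: bool = False      # Q3
    retention_required: bool = False   # Q4
    future_value: bool = False         # Q5


def choose_tier(p: DatasetProfile) -> str:
    """Follow the tree from Q1 and return the outcome code that is reached."""
    if p.accessed_frequently:          # Q1 yes -> Q2
        return "STORAGE_HOT" if p.needs_fast_queries else "STORAGE_WARM"
    if not p.old_and_retired:          # Q3 no -> WARM
        return "STORAGE_WARM"
    if p.retention_required:           # Q4 yes -> COLD
        return "STORAGE_COLD"
    # Q5: keep in cold archive if there is future value, otherwise delete
    return "STORAGE_COLD" if p.future_value else "STORAGE_DELETE"
```

For example, `choose_tier(DatasetProfile(accessed_frequently=False, old_and_retired=True, retention_required=True))` follows Q1 no, Q3 yes, Q4 yes and returns `"STORAGE_COLD"`.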

Machine-Readable JSON (Canonical Model)

{
  "_meta": {
    "schema": "https://www.drawdecisiontree.com/decision-dag.schema.json",
    "source": "https://www.drawdecisiontree.com",
    "description": "DrawDecisionTree.com is a free tool for building, sharing, and embedding interactive decision trees. This file is the machine-readable export of a published decision tree. The `dsl` field contains the original source in the Decision DAG DSL; the `dag` schema is documented at the URL in `schema` above.",
    "links": {
      "interactive": "https://www.drawdecisiontree.com/t/drawdecisiontree/data-storage-tier.html",
      "embed": "https://www.drawdecisiontree.com/embed/path/drawdecisiontree/data-storage-tier",
      "dsl_reference": "https://www.drawdecisiontree.com/decision-tree-dsl-reference.html",
      "guides": "https://www.drawdecisiontree.com/guides",
      "schema_docs": "https://www.drawdecisiontree.com/decision-dag.schema.json",
      "author_trees": "https://www.drawdecisiontree.com/trees/drawdecisiontree"
    },
    "generated_at": "2026-05-29T12:05:39.271Z"
  },
  "author": {
    "handle": "drawdecisiontree",
    "first_name": "Andrew",
    "last_name": null,
    "avatar_url": "1d32d828-b6ca-40ec-bdd7-771fe7b9c36a/avatar-1778531481027.svg",
    "display_name": "Andrew"
  },
  "file": {
    "id": "9d0ee811-db05-48bb-a46b-e3e3c9d7348b",
    "name": "Which data storage tier should I use for this dataset?",
    "public_slug": "data-storage-tier",
    "updated_at": "2026-05-12T16:53:43.587978+00:00",
    "url": "https://www.drawdecisiontree.com/t/drawdecisiontree/data-storage-tier.html",
    "json_url": "https://www.drawdecisiontree.com/t/drawdecisiontree/data-storage-tier/tree.json",
    "dsl_url": "https://www.drawdecisiontree.com/t/drawdecisiontree/data-storage-tier/tree.dag"
  },
  "meta": {
    "description": "Determine the appropriate storage tier for a dataset based on access patterns, latency requirements, age, regulatory obligations, and cost sensitivity. The right tier balances data availability against infrastructure spend and compliance risk.",
    "mode": "decision",
    "entry": "Q1",
    "tags": [
      "data",
      "storage",
      "data engineering",
      "cloud",
      "cost optimisation"
    ],
    "image": "https://images.unsplash.com/photo-1558494949-ef010cbdcc31?w=1200&q=80"
  },
  "questions": [
    {
      "id": "Q1",
      "text": "How frequently is this dataset accessed in normal operations?"
    },
    {
      "id": "Q2",
      "text": "Does your use case require query response times under a few seconds?"
    },
    {
      "id": "Q3",
      "text": "Is the data older than 90 days and no longer part of any active pipeline?"
    },
    {
      "id": "Q4",
      "text": "Is there a legal, regulatory, or audit requirement to retain this data?"
    },
    {
      "id": "Q5",
      "text": "Does retaining this data provide measurable future analytical or business value?"
    }
  ],
  "outcomes": [
    {
      "id": "HOT",
      "label": "Hot Storage (SSD / In-Memory)"
    },
    {
      "id": "WARM",
      "label": "Warm Storage (HDD / Object Storage)"
    },
    {
      "id": "COLD",
      "label": "Cold Storage (Archive)"
    },
    {
      "id": "DELETE",
      "label": "Delete / Do Not Retain"
    }
  ],
  "dsl": "dag: Which data storage tier should I use for this dataset?\nversion: 1.0.0\nimage: https://images.unsplash.com/photo-1558494949-ef010cbdcc31?w=1200&q=80\ndescription: Determine the appropriate storage tier for a dataset based on access patterns, latency requirements, age, regulatory obligations, and cost sensitivity. The right tier balances data availability against infrastructure spend and compliance risk.\ntags: data, storage, data engineering, cloud, cost optimisation\nentry: Q1\n\nQ1: How frequently is this dataset accessed in normal operations?\n  hint: Consider both scheduled pipeline reads and ad-hoc analyst queries. \"Frequently\" means daily or more often — think operational dashboards, real-time scoring, or ETL source tables. \"Rarely\" means monthly or less — think historical snapshots, audit exports, or raw event archives. Access frequency is the single most important input into tiering decisions because storage cost differences between hot and cold tiers can be an order of magnitude.\n  yes -> Q2\n  no  -> Q3\n\nQ2: Does your use case require query response times under a few seconds?\n  hint: Sub-second to low-single-digit second latency is typically needed for customer-facing features, live dashboards, and operational ML scoring. If your stakeholders are fine waiting 10–60 seconds for a query to return — as is common for analytical batch reports — you may not need the expense of SSD or in-memory storage. Be honest about whether latency requirements come from actual user needs or inherited assumptions that could be relaxed.\n  yes -> [HOT]\n  no  -> [WARM]\n\nQ3: Is the data older than 90 days and no longer part of any active pipeline?\n  hint: Data older than 90 days that has been superseded by more recent snapshots rarely justifies warm-tier pricing. Check your data lineage tool or pipeline schedules to confirm the dataset is truly retired from active processing. 
If the data feeds even one scheduled job per month, it may be worth retaining in warm storage for faster retrieval rather than paying retrieval fees from cold archive on every access.\n  yes -> Q4\n  no  -> [WARM]\n\nQ4: Is there a legal, regulatory, or audit requirement to retain this data?\n  hint: Regulations such as GDPR, HIPAA, SOX, and PCI-DSS specify minimum retention periods — often 5–7 years — during which data must be retrievable, even if access is rare. Check with your legal or compliance team before deleting any dataset that touches financial transactions, personal data, healthcare records, or system audit logs. Cold archive tiers (AWS S3 Glacier, Azure Archive, GCS Coldline) are designed exactly for this scenario: low-cost long-term retention with multi-hour retrieval SLAs.\n  yes -> [COLD]\n  no  -> Q5\n\nQ5: Does retaining this data provide measurable future analytical or business value?\n  hint: Ask whether a data scientist or analyst has requested this data in the past year, or whether it might be needed for future model training, root-cause analysis, or customer dispute resolution. If the answer is uncertain, lean towards cold archival rather than deletion — storage costs at archive tier are negligible. Only recommend deletion when the data is definitively superseded, contains no PII risk if leaked, and has no conceivable future use case.\n  yes -> [COLD]\n  no  -> [DELETE]\n\n[HOT]: Hot Storage (SSD / In-Memory)\n  color: #E53E3E\n  description: Hot storage — NVMe SSD, Redis, BigTable, or an in-memory OLAP engine such as Apache Druid — is appropriate when data must be available with millisecond to low-second latency for frequent reads. Provision sufficient IOPS and memory headroom to handle peak traffic without throttling, and implement a TTL or eviction policy to prevent unbounded growth. Use a caching layer (Redis or Memcached) in front of your primary store to absorb repetitive read patterns cheaply. 
Review hot-tier datasets quarterly and demote anything whose access frequency has dropped — hot storage can cost 10–50x more than cold per GB.\n  code: STORAGE_HOT\n\n[WARM]: Warm Storage (HDD / Object Storage)\n  color: #ED8936\n  description: Warm storage — cloud object storage such as S3 Standard, GCS Standard, or Azure Blob Hot/Cool — balances reasonable retrieval speed (seconds to low minutes) against meaningfully lower cost than SSD. It is ideal for datasets accessed weekly or monthly by batch pipelines, BI tools, or on-demand analyst queries. Partition data by date or entity key so that downstream jobs scan only the relevant subset, reducing both cost and latency. Configure lifecycle policies to automatically promote to hot or demote to cold storage based on last-accessed timestamps.\n  code: STORAGE_WARM\n\n[COLD]: Cold Storage (Archive)\n  color: #4F86C6\n  description: Cold archive tiers — AWS S3 Glacier, Azure Archive Storage, or GCS Coldline — offer the lowest per-GB cost in exchange for retrieval times measured in minutes to hours. This tier is ideal for regulatory retention, historical snapshots, and raw event logs that are unlikely to be queried but must be preserved. Always store a manifest or data catalogue entry pointing to archived datasets so future users can discover and request retrieval without hunting through storage buckets. Test the restore process annually to confirm SLAs are met and that archived data remains readable.\n  code: STORAGE_COLD\n\n[DELETE]: Delete / Do Not Retain\n  color: #718096\n  description: When data has no regulatory obligation, no future analytical value, and no active pipeline dependency, deletion is the responsible choice — not merely a cost-saving measure but a data minimisation practice required by GDPR and similar regulations. 
Before deleting, run a final lineage check to confirm no downstream system references the dataset, and capture a deletion record in your data catalogue with the date, dataset name, and the rationale. If the data contains personal information, ensure deletion is cryptographically verifiable and document it for your DPA records. Schedule a process to flag similar datasets for review on a regular cycle to prevent data hoarding.\n  code: STORAGE_DELETE\n"
}
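
In the export above, the yes/no edges appear only inside the `dsl` string, not in the `questions` or `outcomes` arrays. The sketch below shows one way a consumer might recover those edges and walk the tree. It assumes the DSL always looks like the sample on this page (question headers such as `Q1:` at the start of a line, followed by indented `yes -> ...` and `no -> ...` lines); the authoritative grammar is in the DSL reference linked from `_meta.links`. The function names `parse_edges`, `walk`, and `load_export` are hypothetical.

```python
import json
import re

# A question header such as "Q1: How frequently ..." at the start of a line.
QUESTION_RE = re.compile(r"^(Q\d+):")
# An indented edge such as "  yes -> Q2" or "  no  -> [WARM]".
EDGE_RE = re.compile(r"^\s+(yes|no)\s*->\s*\[?(\w+)\]?\s*$")


def parse_edges(dsl):
    """Return {question_id: {"yes": target, "no": target}} from DSL text."""
    edges = {}
    current = None
    for line in dsl.splitlines():
        question = QUESTION_RE.match(line)
        if question:
            current = question.group(1)
            edges[current] = {}
            continue
        edge = EDGE_RE.match(line)
        if edge and current is not None:
            edges[current][edge.group(1)] = edge.group(2)
        elif line and not line[0].isspace():
            current = None  # any other top-level line ends the question block
    return edges


def walk(edges, entry, answers):
    """Follow yes/no answers from the entry node until an outcome is reached."""
    node = entry
    while node in edges:
        node = edges[node][answers[node]]
    return node


def load_export(path):
    """Read a saved copy of the JSON export; return (entry id, edges)."""
    with open(path, encoding="utf-8") as fh:
        export = json.load(fh)
    return export["meta"]["entry"], parse_edges(export["dsl"])
```

With the edges parsed from this tree, `walk(edges, "Q1", {"Q1": "no", "Q3": "yes", "Q4": "yes"})` would end at `COLD`.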

DSL Representation

dag: Which data storage tier should I use for this dataset?
version: 1.0.0
image: https://images.unsplash.com/photo-1558494949-ef010cbdcc31?w=1200&q=80
description: Determine the appropriate storage tier for a dataset based on access patterns, latency requirements, age, regulatory obligations, and cost sensitivity. The right tier balances data availability against infrastructure spend and compliance risk.
tags: data, storage, data engineering, cloud, cost optimisation
entry: Q1

Q1: How frequently is this dataset accessed in normal operations?
  hint: Consider both scheduled pipeline reads and ad-hoc analyst queries. "Frequently" means daily or more often — think operational dashboards, real-time scoring, or ETL source tables. "Rarely" means monthly or less — think historical snapshots, audit exports, or raw event archives. Access frequency is the single most important input into tiering decisions because storage cost differences between hot and cold tiers can be an order of magnitude.
  yes -> Q2
  no  -> Q3

Q2: Does your use case require query response times under a few seconds?
  hint: Sub-second to low-single-digit second latency is typically needed for customer-facing features, live dashboards, and operational ML scoring. If your stakeholders are fine waiting 10–60 seconds for a query to return — as is common for analytical batch reports — you may not need the expense of SSD or in-memory storage. Be honest about whether latency requirements come from actual user needs or inherited assumptions that could be relaxed.
  yes -> [HOT]
  no  -> [WARM]

Q3: Is the data older than 90 days and no longer part of any active pipeline?
  hint: Data older than 90 days that has been superseded by more recent snapshots rarely justifies warm-tier pricing. Check your data lineage tool or pipeline schedules to confirm the dataset is truly retired from active processing. If the data feeds even one scheduled job per month, it may be worth retaining in warm storage for faster retrieval rather than paying retrieval fees from cold archive on every access.
  yes -> Q4
  no  -> [WARM]

Q4: Is there a legal, regulatory, or audit requirement to retain this data?
  hint: Regulations such as HIPAA, SOX, and PCI-DSS set minimum retention periods for certain records (roughly one to seven years, depending on the regulation and record type) during which data must be retrievable, even if access is rare. GDPR works the other way: it limits how long personal data may be kept, so any retention of personal data needs a specific legal basis. Check with your legal or compliance team before deleting any dataset that touches financial transactions, personal data, healthcare records, or system audit logs. Cold archive tiers (AWS S3 Glacier, Azure Archive, GCS Coldline) are designed for this scenario: low-cost long-term retention, with retrieval that can take hours on some services.
  yes -> [COLD]
  no  -> Q5

Q5: Does retaining this data provide measurable future analytical or business value?
  hint: Ask whether a data scientist or analyst has requested this data in the past year, or whether it might be needed for future model training, root-cause analysis, or customer dispute resolution. If the answer is uncertain, lean towards cold archival rather than deletion — storage costs at archive tier are negligible. Only recommend deletion when the data is definitively superseded, contains no PII risk if leaked, and has no conceivable future use case.
  yes -> [COLD]
  no  -> [DELETE]

[HOT]: Hot Storage (SSD / In-Memory)
  color: #E53E3E
  description: Hot storage — NVMe SSD, Redis, BigTable, or an in-memory OLAP engine such as Apache Druid — is appropriate when data must be available with millisecond to low-second latency for frequent reads. Provision sufficient IOPS and memory headroom to handle peak traffic without throttling, and implement a TTL or eviction policy to prevent unbounded growth. Use a caching layer (Redis or Memcached) in front of your primary store to absorb repetitive read patterns cheaply. Review hot-tier datasets quarterly and demote anything whose access frequency has dropped — hot storage can cost 10–50x more than cold per GB.
  code: STORAGE_HOT

[WARM]: Warm Storage (HDD / Object Storage)
  color: #ED8936
  description: Warm storage — cloud object storage such as S3 Standard, GCS Standard, or Azure Blob Hot/Cool — balances reasonable retrieval speed (seconds to low minutes) against meaningfully lower cost than SSD. It is ideal for datasets accessed weekly or monthly by batch pipelines, BI tools, or on-demand analyst queries. Partition data by date or entity key so that downstream jobs scan only the relevant subset, reducing both cost and latency. Configure lifecycle policies to automatically promote to hot or demote to cold storage based on last-accessed timestamps.
  code: STORAGE_WARM

[COLD]: Cold Storage (Archive)
  color: #4F86C6
  description: Cold archive tiers (AWS S3 Glacier, Azure Archive Storage, or GCS Coldline) offer very low per-GB cost in exchange for slower or more expensive retrieval. Retrieval can take minutes to hours on services such as S3 Glacier Flexible Retrieval and Azure Archive, while GCS Coldline keeps millisecond access but adds retrieval fees and a minimum storage duration. This tier is ideal for regulatory retention, historical snapshots, and raw event logs that are unlikely to be queried but must be preserved. Always store a manifest or data catalogue entry pointing to archived datasets so future users can discover and request retrieval without hunting through storage buckets. Test the restore process annually to confirm SLAs are met and that archived data remains readable.
  code: STORAGE_COLD

[DELETE]: Delete / Do Not Retain
  color: #718096
  description: When data has no regulatory obligation, no future analytical value, and no active pipeline dependency, deletion is the responsible choice — not merely a cost-saving measure but a data minimisation practice required by GDPR and similar regulations. Before deleting, run a final lineage check to confirm no downstream system references the dataset, and capture a deletion record in your data catalogue with the date, dataset name, and the rationale. If the data contains personal information, ensure deletion is cryptographically verifiable and document it for your DPA records. Schedule a process to flag similar datasets for review on a regular cycle to prevent data hoarding.
  code: STORAGE_DELETE
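
The [WARM] outcome above recommends partitioning data by date or entity key so that downstream jobs scan only the relevant subset. As a rough sketch, an object-store prefix could be built as shown below. The Hive-style `key=value` layout and the names `partition_prefix`, `dataset`, and `entity` are assumptions for illustration; the tree itself does not prescribe a layout.

```python
from datetime import date
from typing import Optional


def partition_prefix(dataset: str, day: date, entity: Optional[str] = None) -> str:
    """Build a date-partitioned (and optionally entity-partitioned) key prefix."""
    parts = [
        dataset,
        f"year={day.year:04d}",
        f"month={day.month:02d}",
        f"day={day.day:02d}",
    ]
    if entity is not None:
        parts.append(f"entity={entity}")
    return "/".join(parts) + "/"


print(partition_prefix("events", date(2026, 5, 12)))
# prints: events/year=2026/month=05/day=12/
```

A job that needs a single day can then list or scan only that prefix rather than the whole dataset.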

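Machine Access

  • JSON: https://www.drawdecisiontree.com/t/drawdecisiontree/data-storage-tier/tree.json
  • DSL: https://www.drawdecisiontree.com/t/drawdecisiontree/data-storage-tier/tree.dag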

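Questions in this decision tree

  • Q1: How frequently is this dataset accessed in normal operations?
  • Q2: Does your use case require query response times under a few seconds?
  • Q3: Is the data older than 90 days and no longer part of any active pipeline?
  • Q4: Is there a legal, regulatory, or audit requirement to retain this data?
  • Q5: Does retaining this data provide measurable future analytical or business value?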

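Possible outcomes

  • Hot Storage (SSD / In-Memory)
  • Warm Storage (HDD / Object Storage)
  • Cold Storage (Archive)
  • Delete / Do Not Retain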

How to use this decision tree

Click "Open interactive version" to step through the questions. Your answers narrow the tree until a recommended outcome is reached. You can also embed this tree on your own site.

More decision trees by Andrew

Which API design pattern is right for my project?
Determine the right API design style for your integration scenario.
Authentication Method Selection
Authentication is a security-critical, high-friction decision to reverse — migrating users from one auth method to another requires coordinated password resets or credential migration campaigns. This tree eliminates methods that don't match your user type, enterprise requirements, and security posture, giving you a clear shortlist before you write a line of code.
Caching Strategy Selection
Premature or misapplied caching adds complexity — stale data bugs, invalidation logic, and distributed consistency problems — without solving the actual bottleneck. This tree routes you to the caching pattern that matches your data access profile, so you apply the right tool to the right problem rather than defaulting to Redis for everything.
CI/CD Pipeline Tool Selection
Choosing a CI/CD platform is a long-term infrastructure commitment — pipelines accumulate config, custom scripts, and team muscle memory that make switching painful. This tree eliminates tools that don't fit your source control host, infrastructure model, or team scale, leaving only the options genuinely viable for your situation.
Which cloud provider should I use — AWS, Azure, or Google Cloud?
Answer a few questions to identify the most suitable cloud platform for your workload.
Container Orchestration Platform Selection
Container orchestration is foundational infrastructure — the platform you choose shapes how you deploy, scale, network, and operate every service you run. This tree eliminates options that don't match your operational maturity, cloud provider commitment, and workload complexity, so you land on the platform that fits your team today without over-engineering for a scale you haven't reached.