File-Based Storage Is Fine, Actually

Not every project needs PostgreSQL. Sometimes JSON files are the right choice.

This is heresy in some circles. Mention file-based storage and people will immediately tell you about all the problems. And they're often right! But the problems don't apply to every situation.

The knee-jerk reaction

When developers hear 'file-based storage' they immediately list the problems:

  • No transactions! What about atomicity?
  • No concurrent writes! Race conditions everywhere!
  • Doesn't scale! What about millions of users?
  • No indexes! How will you query efficiently?

And they're right about all of this. For a lot of use cases, files are a terrible idea.

But for a portfolio site? A personal project? A prototype? A tool that only you use? Files are perfect.

Why I use JSON files for this site

This portfolio site stores all its data in JSON files. Blog posts, projects, resume data, everything. Here's why:

1. No database to manage. No migrations to run. No connection pools to configure. No credentials to rotate. No database server to keep running. The 'database' is just files in a directory.

2. Easy to deploy. Push to git, pull on server, restart the app. That's it. No database provisioning, no managed database costs, no RDS or Cloud SQL to configure.

3. Easy to backup. It's just files. Git handles it. Every deploy is a backup. I can see the complete history of my data in git log.

4. Easy to debug. Open the JSON file in a text editor. See exactly what's stored. No SELECT * FROM posts needed. The data is right there.

5. Easy to edit. Want to fix a typo in a blog post? Edit the JSON file. No admin panel needed. No SQL queries.

The pattern

Here's roughly how I do it:

import json
from pathlib import Path
from dataclasses import dataclass, asdict
from typing import Optional

DATA_DIR = Path('data')

@dataclass
class BlogPost:
    slug: str
    title: str
    content: str
    published_at: str

def load_posts() -> list[BlogPost]:
    path = DATA_DIR / 'posts.json'
    if not path.exists():
        return []
    data = json.loads(path.read_text())
    return [BlogPost(**post) for post in data]

def save_posts(posts: list[BlogPost]) -> None:
    path = DATA_DIR / 'posts.json'
    data = [asdict(post) for post in posts]
    path.write_text(json.dumps(data, indent=2))

def get_post_by_slug(slug: str) -> Optional[BlogPost]:
    posts = load_posts()
    return next((p for p in posts if p.slug == slug), None)

That's it. That's the whole 'database' layer. Maybe 30 lines of code.
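
Using it is just ordinary function calls. A quick sketch (the post values are made up, and it assumes the data/ directory already exists):

new_post = BlogPost(
    slug='hello-world',
    title='Hello, World',
    content='First post!',
    published_at='2025-01-01',
)

posts = load_posts()
posts.append(new_post)
save_posts(posts)

print(get_post_by_slug('hello-world').title)  # 'Hello, World'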

Compare that to setting up an ORM, writing migrations, configuring connection pooling, setting up a database server, managing credentials... the file-based approach is refreshingly simple.

Real-world patterns that actually work

Over the years, I've developed a few patterns that make file-based storage more robust. Here are the ones I use most often.

One file per entity (the small-scale approach)

Instead of cramming everything into one giant JSON file, split it up:

DATA_DIR = Path('data/posts')

def save_post(post: BlogPost) -> None:
    path = DATA_DIR / f'{post.slug}.json'
    path.write_text(json.dumps(asdict(post), indent=2))

def load_post(slug: str) -> Optional[BlogPost]:
    path = DATA_DIR / f'{slug}.json'
    if not path.exists():
        return None
    return BlogPost(**json.loads(path.read_text()))

def load_all_posts() -> list[BlogPost]:
    return [
        BlogPost(**json.loads(path.read_text()))
        for path in DATA_DIR.glob('*.json')
    ]

This approach has some nice properties:

  • Easier to edit: Each post is isolated. No risk of breaking other posts while editing.
  • Better git diffs: Changing one post doesn't show up as a single giant diff in your monolithic JSON file.
  • Simpler mental model: Want to delete a post? Delete the file. Want to see all posts? List the directory.
  • Easier to version: You can see when each individual post was created or modified in git history.

I use this pattern for blog posts where each post is independent and I might want to edit them individually.
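
And "delete a post, delete the file" really is the whole delete operation. Roughly:

def delete_post(slug: str) -> None:
    # missing_ok=True (Python 3.8+) makes deletes idempotent
    (DATA_DIR / f'{slug}.json').unlink(missing_ok=True)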

Directory structure as data organization

Your directory structure can encode meaningful relationships:

data/
  posts/
    2024/
      01-introduction-to-python.json
      02-advanced-patterns.json
    2025/
      01-new-year-thoughts.json
  projects/
    active/
      beniverse.json
      cool-tool.json
    archived/
      old-experiment.json

This makes it trivial to query by year, status, or category. Want all 2024 posts? Just read data/posts/2024/*.json. Want active projects? Read data/projects/active/*.json.

The filesystem becomes your index. No SQL required.
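
In code, those "queries" are just glob calls. A sketch against the layout above (it assumes posts use the same BlogPost shape as before and projects are plain dicts):

def posts_for_year(year: int) -> list[BlogPost]:
    return [
        BlogPost(**json.loads(path.read_text()))
        for path in Path(f'data/posts/{year}').glob('*.json')
    ]

def active_projects() -> list[dict]:
    return [
        json.loads(path.read_text())
        for path in Path('data/projects/active').glob('*.json')
    ]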

Hybrid approach: Metadata in files, content in markdown

Here's a pattern I've seen work well for blogs and documentation sites:

content/
  posts/
    hello-world/
      meta.json
      content.md
    python-tips/
      meta.json
      content.md
      cover-image.png

The meta.json contains structured data (title, date, tags), while content.md contains the actual post content. This gives you the best of both worlds:

  • JSON for structured data that your code needs to query/sort/filter
  • Markdown for long-form content that's easier to write and edit
  • Each post gets its own directory, so you can include images and other assets
Loading a post stitches the two together:

@dataclass
class BlogPost:
    slug: str
    title: str
    published_at: str
    tags: list[str]
    content: str  # Loaded from content.md

def load_post(slug: str) -> Optional[BlogPost]:
    post_dir = Path('content/posts') / slug
    if not post_dir.exists():
        return None

    meta = json.loads((post_dir / 'meta.json').read_text())
    content = (post_dir / 'content.md').read_text()

    return BlogPost(
        slug=slug,
        title=meta['title'],
        published_at=meta['published_at'],
        tags=meta['tags'],
        content=content
    )

This is essentially what static site generators like Hugo and Jekyll do. It works because the pattern matches how humans think about content.

JSON, TOML, YAML - pick your poison

JSON isn't your only option. TOML and YAML are both great for configuration-heavy data:

TOML is cleaner for structured config. Great for settings, environment configs, or data that humans edit frequently. It's what pyproject.toml uses.

YAML is more flexible and supports references. Good for complex nested structures. But it has edge cases with indentation and type coercion that can bite you.

JSON is the sweet spot for me. It's simple, well-supported, and has no surprising behavior. Python's json module is in the stdlib. Every language can read it. No weird indentation rules.

For blog posts and projects? JSON is perfect. For app configuration? TOML is nicer. Pick based on who edits the files and how often.
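
If you do go the TOML route for config, Python 3.11+ ships tomllib in the standard library (read-only, and it wants a binary file handle). The file name and keys here are made up:

import tomllib

with open('config.toml', 'rb') as f:
    config = tomllib.load(f)

print(config['site']['title'])  # hypothetical keys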

What about performance?

For a portfolio site with maybe 50 blog posts, performance is a non-issue. Reading a 100KB JSON file takes milliseconds. We're not querying terabytes.

If you're worried, add caching:

from functools import cache

@cache
def load_posts():
    # Now it only reads from disk once
    ...

Clear the cache when data changes (or just restart the server).
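
functools.cache exposes cache_clear() on the wrapped function, so invalidation can live right next to the write. A sketch, reusing save_posts from earlier:

def save_posts(posts: list[BlogPost]) -> None:
    path = DATA_DIR / 'posts.json'
    path.write_text(json.dumps([asdict(p) for p in posts], indent=2))
    load_posts.cache_clear()  # next call to load_posts() re-reads from disk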

When file-based storage is actually fast

People assume databases are always faster than files. But for read-heavy workloads with small datasets, files can be surprisingly quick:

  • No network overhead: Your 'database' is just local files read in-process. No TCP connection, no wire protocol.
  • No query parsing: You're not sending SQL that needs to be parsed, planned, and executed.
  • OS-level caching: Your OS caches frequently-read files in memory. Reading the same file twice is basically instant.
  • Simple data structures: No B-trees to traverse, no index lookups. Just deserialize and go.

I ran some quick benchmarks on my laptop. Loading 100 blog posts from individual JSON files: ~5ms. Loading the same data from SQLite: ~8ms. From PostgreSQL over localhost: ~15ms.

Now, these numbers change dramatically as your dataset grows or you need complex queries. But for small datasets? Files are competitive.
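
Your numbers will vary. A crude way to check on your own machine (just a sketch using the per-file loader from earlier, not the exact harness behind the numbers above):

import time

def mean_load_ms(n: int = 50) -> float:
    start = time.perf_counter()
    for _ in range(n):
        load_all_posts()  # the per-file loader from earlier
    return (time.perf_counter() - start) / n * 1000

print(f"{mean_load_ms():.1f} ms per full load")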

Lazy loading for larger datasets

If you have hundreds or thousands of items, you don't want to load everything upfront. Here's a simple lazy-loading pattern:

class PostStore:
    def __init__(self, data_dir: Path):
        self.data_dir = data_dir
        self._cache = {}

    def get(self, slug: str) -> Optional[BlogPost]:
        if slug in self._cache:
            return self._cache[slug]

        path = self.data_dir / f'{slug}.json'
        if not path.exists():
            return None

        post = BlogPost(**json.loads(path.read_text()))
        self._cache[slug] = post
        return post

    def list_all(self) -> list[dict]:
        # For the index page, return lightweight dicts with just the summary fields
        posts = []
        for path in self.data_dir.glob('*.json'):
            data = json.loads(path.read_text())
            # Only extract what we need for the list view
            posts.append({
                'slug': data['slug'],
                'title': data['title'],
                'published_at': data['published_at']
            })
        return sorted(posts, key=lambda p: p['published_at'], reverse=True)

This loads individual posts on-demand and keeps them cached. For list views, it only extracts the fields you need. Simple and effective.
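
Usage looks like this (the slug is made up):

store = PostStore(Path('data/posts'))

post = store.get('hello-world')   # hits disk once, cached afterwards
summaries = store.list_all()      # lightweight dicts for an index page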

What about writes?

Writes are the tricky part. If multiple processes write to the same file at the same time, you get corrupted data. But... who's writing to my portfolio site? Just me. And I'm not going to race condition myself.

If you do need concurrent writes, you can use file locking:

import fcntl

def save_posts_safely(posts):
    path = DATA_DIR / 'posts.json'
    # Open in append mode so the file isn't truncated before we hold the lock
    with open(path, 'a+') as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        f.seek(0)
        f.truncate()  # safe to clear the old contents now that we hold the lock
        json.dump([asdict(p) for p in posts], f, indent=2)
        fcntl.flock(f, fcntl.LOCK_UN)

Or use the atomic write pattern with tempfile and rename:

import tempfile
import os

def save_posts_atomic(posts):
    path = DATA_DIR / 'posts.json'
    data = json.dumps([asdict(p) for p in posts], indent=2)

    # Write to a temp file in the same directory (renames are only atomic within one filesystem)
    fd, temp_path = tempfile.mkstemp(dir=DATA_DIR)
    with os.fdopen(fd, 'w') as f:
        f.write(data)

    # Atomic replace (os.rename would fail on Windows if the target already exists)
    os.replace(temp_path, path)

This guarantees readers never see a half-written file. The rename operation is atomic on most filesystems, so you either get the old data or the new data, never corrupted garbage in between.

But honestly, if you need this level of care, maybe it's time for a real database.

Common gotchas (and how to avoid them)

I've made every mistake with file-based storage. Here are the lessons learned:

1. Handle missing files gracefully

# Bad - crashes when file doesn't exist
def load_posts():
    return json.loads(Path('data/posts.json').read_text())

# Good - handles missing file
def load_posts():
    path = Path('data/posts.json')
    if not path.exists():
        return []
    try:
        return json.loads(path.read_text())
    except json.JSONDecodeError:
        # File exists but is corrupted
        print(f"Warning: Could not parse {path}")
        return []

Files go missing. They get corrupted. Your code should handle it gracefully.

2. Create directories before writing

# Bad - crashes if directory doesn't exist
def save_post(post):
    path = Path(f'data/posts/{post.year}/{post.slug}.json')
    path.write_text(json.dumps(asdict(post)))

# Good - ensures directory exists
def save_post(post):
    path = Path(f'data/posts/{post.year}/{post.slug}.json')
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(asdict(post)))

mkdir(parents=True, exist_ok=True) is your friend. It creates nested directories and doesn't error if they already exist.

3. Watch out for encoding issues

Always specify UTF-8 explicitly:

# Can break on Windows or with special characters
path.write_text(json.dumps(data))

# Better
path.write_text(json.dumps(data, ensure_ascii=False), encoding='utf-8')

Or use the file handle directly:

with open(path, 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=2)

This prevents mojibake with emoji and non-ASCII characters.

4. Don't trust user input in file paths

# DANGEROUS - vulnerable to path traversal
def get_post(slug: str):
    path = Path(f'data/posts/{slug}.json')
    return json.loads(path.read_text())

# What if slug is "../../../etc/passwd"?

# Better - validate the slug
def get_post(slug: str):
    if not slug.replace('-', '').replace('_', '').isalnum():
        raise ValueError("Invalid slug")
    path = Path(f'data/posts/{slug}.json')
    # Also check the resolved path is still in our data directory
    if not path.resolve().is_relative_to(Path('data/posts').resolve()):
        raise ValueError("Invalid path")
    return json.loads(path.read_text())

Path traversal attacks are real. Validate your inputs.

SQLite as a middle ground

If you want some database features but not a full PostgreSQL setup, SQLite is a great middle ground. It's still just a file, but you get SQL, indexes, and transactions.

import sqlite3

conn = sqlite3.connect('data.db')
conn.row_factory = sqlite3.Row  # rows come back dict-like
cursor = conn.cursor()
cursor.execute('SELECT * FROM posts WHERE slug = ?', (slug,))
row = cursor.fetchone()

SQLite handles concurrent reads well and can handle moderate concurrent writes. It's battle-tested and included in Python's standard library.

Here's when SQLite makes sense:

  • You need to query data in complex ways (joins, aggregations, filtering)
  • You have relationships between entities that you need to maintain
  • Multiple processes might write to the data occasionally
  • Your dataset is getting large (thousands of records) but still fits on one machine
  • You want ACID guarantees but don't want to run a database server

SQLite is legitimately good. It powers mobile apps with millions of users, browsers like Chrome and Firefox, and even some production web apps. The key constraint is: one machine, one file, mostly-reads workload.

For my portfolio site, I chose JSON over SQLite because:

  1. I literally never query my data. I just load all posts and sort/filter in Python.
  2. I like being able to edit posts in a text editor without SQL.
  3. JSON diffs in git are easier to read than SQLite binary diffs.

But if I were building a personal task manager or a note-taking app with tags and search? SQLite all the way.

When to graduate to a real database

You'll know when you need a real database:

  • Multiple users writing data concurrently
  • Complex queries with joins and aggregations
  • Data too big to fit in memory
  • Need for transactions across multiple operations
  • Team of developers who need a shared data model

Until then? Files are fine. Don't over-engineer your side project.

Migration strategy (when the time comes)

Eventually, you might outgrow file-based storage. Here's how to migrate without rewriting everything.

Step 1: Abstract your data layer

Instead of calling json.loads() everywhere, create a simple interface:

# data_store.py
from abc import ABC, abstractmethod

class DataStore(ABC):
    @abstractmethod
    def get_post(self, slug: str) -> Optional[BlogPost]:
        pass

    @abstractmethod
    def list_posts(self) -> list[BlogPost]:
        pass

    @abstractmethod
    def save_post(self, post: BlogPost) -> None:
        pass

class JSONDataStore(DataStore):
    def get_post(self, slug: str) -> Optional[BlogPost]:
        path = Path(f'data/posts/{slug}.json')
        if not path.exists():
            return None
        return BlogPost(**json.loads(path.read_text()))

    # ... implement other methods

Now your application code depends on the interface, not the implementation.
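
Concretely, the rest of the app holds a DataStore and never cares which one it got. Something like this (the factory function is just an illustration):

def create_store() -> DataStore:
    return JSONDataStore()  # swap this single line when you migrate

store = create_store()
post = store.get_post('hello-world')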

Step 2: Implement the new storage

When you're ready to migrate to PostgreSQL or SQLite:

from sqlalchemy import create_engine, text  # assuming SQLAlchemy here

class PostgresDataStore(DataStore):
    def __init__(self, db_url: str):
        self.db = create_engine(db_url)

    def get_post(self, slug: str) -> Optional[BlogPost]:
        with self.db.connect() as conn:
            result = conn.execute(
                text("SELECT * FROM posts WHERE slug = :slug"),
                {"slug": slug}
            )
            row = result.fetchone()
            # Row objects aren't plain dicts; ._mapping gives a column -> value mapping
            return BlogPost(**row._mapping) if row else None

    # ... implement other methods

Step 3: Dual-write for safety

During migration, write to both stores:

class DualWriteDataStore(DataStore):
    def __init__(self, old_store: DataStore, new_store: DataStore):
        self.old_store = old_store
        self.new_store = new_store

    def save_post(self, post: BlogPost) -> None:
        # Write to both
        self.old_store.save_post(post)
        self.new_store.save_post(post)

    def get_post(self, slug: str) -> Optional[BlogPost]:
        # Read from new store, fall back to old
        result = self.new_store.get_post(slug)
        if result is None:
            result = self.old_store.get_post(slug)
        return result

    # ... implement list_posts the same way (read from the new store, fall back to the old)

This lets you migrate gradually and roll back if something breaks.
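
The remaining piece is backfilling existing data into the new store, which behind this interface is a one-off loop (a sketch; run it once before switching reads over):

def backfill(old_store: DataStore, new_store: DataStore) -> None:
    for post in old_store.list_posts():
        new_store.save_post(post)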

The real lesson

The real lesson isn't "use files instead of databases." It's "choose the right tool for the job."

For my portfolio site with 50 blog posts that only I edit? JSON files in git are perfect. Zero operational overhead. Simple to understand. Easy to debug.

For a SaaS app with thousands of users writing data concurrently? PostgreSQL all the way. You need transactions, you need concurrent access, you need a query language.

The problem is when people apply the SaaS solution to the portfolio site problem. You end up with:

  • Database credentials to manage
  • Migration scripts for schema changes
  • Connection pooling configuration
  • Backup strategies and restore procedures
  • Database server to monitor and maintain

All for 50 blog posts that change once a month.

It's like using a semi truck to buy groceries. Yeah, it can carry more groceries than your car. But you don't need that, and now you need a CDL.

Who else does this?

I'm not the only one. Lots of successful projects use file-based storage:

  • Static site generators (Hugo, Jekyll, Eleventy): Markdown files for content. They build sites with thousands of pages from flat files.
  • Obsidian: A note-taking app used by millions. All notes are markdown files in a folder.
  • Git: The most successful version control system ever. It's all files. No database server.
  • Many CLI tools: Store config in JSON/YAML files in ~/.config/.

These aren't toys. They're production software used by millions of people. File-based storage scales further than you think.

Closing thoughts

Start simple. Use files. When files become a problem, you'll know. And by then, you'll have a real product with real users and real revenue to justify the complexity of a database.

Don't optimize for scale you don't have. Don't build infrastructure for problems you don't face.

Build the simplest thing that works. Ship it. Learn from it. Iterate.

Files are fine.
