File-Based Storage Is Fine, Actually
Not every project needs PostgreSQL. Sometimes JSON files are the right choice.
This is heresy in some circles. Mention file-based storage and people will immediately tell you about all the problems. And they're often right! But the problems don't apply to every situation.
The knee-jerk reaction
When developers hear 'file-based storage' they immediately list the problems:
- No transactions! What about atomicity?
- No concurrent writes! Race conditions everywhere!
- Doesn't scale! What about millions of users?
- No indexes! How will you query efficiently?
And they're right about all of this. For a lot of use cases, files are a terrible idea.
But for a portfolio site? A personal project? A prototype? A tool that only you use? Files are perfect.
Why I use JSON files for this site
This portfolio site stores all its data in JSON files. Blog posts, projects, resume data, everything. Here's why:
1. No database to manage. No migrations to run. No connection pools to configure. No credentials to rotate. No database server to keep running. The 'database' is just files in a directory.
2. Easy to deploy. Push to git, pull on server, restart the app. That's it. No database provisioning, no managed database costs, no RDS or Cloud SQL to configure.
3. Easy to backup. It's just files. Git handles it. Every deploy is a backup. I can see the complete history of my data in git log.
4. Easy to debug. Open the JSON file in a text editor. See exactly what's stored. No SELECT * FROM posts needed. The data is right there.
5. Easy to edit. Want to fix a typo in a blog post? Edit the JSON file. No admin panel needed. No SQL queries.
The pattern
Here's roughly how I do it:
import json
from pathlib import Path
from dataclasses import dataclass, asdict
from typing import Optional

DATA_DIR = Path('data')

@dataclass
class BlogPost:
    slug: str
    title: str
    content: str
    published_at: str

def load_posts() -> list[BlogPost]:
    path = DATA_DIR / 'posts.json'
    if not path.exists():
        return []
    data = json.loads(path.read_text())
    return [BlogPost(**post) for post in data]

def save_posts(posts: list[BlogPost]) -> None:
    path = DATA_DIR / 'posts.json'
    data = [asdict(post) for post in posts]
    path.write_text(json.dumps(data, indent=2))

def get_post_by_slug(slug: str) -> Optional[BlogPost]:
    posts = load_posts()
    return next((p for p in posts if p.slug == slug), None)
That's it. That's the whole 'database' layer. Maybe 30 lines of code.
Compare that to setting up an ORM, writing migrations, configuring connection pooling, setting up a database server, managing credentials... the file-based approach is refreshingly simple.
Real-world patterns that actually work
Over the years, I've developed a few patterns that make file-based storage more robust. Here are the ones I use most often.
One file per entity (the small-scale approach)
Instead of cramming everything into one giant JSON file, split it up:
DATA_DIR = Path('data/posts')

def save_post(post: BlogPost) -> None:
    path = DATA_DIR / f'{post.slug}.json'
    path.write_text(json.dumps(asdict(post), indent=2))

def load_post(slug: str) -> Optional[BlogPost]:
    path = DATA_DIR / f'{slug}.json'
    if not path.exists():
        return None
    return BlogPost(**json.loads(path.read_text()))

def load_all_posts() -> list[BlogPost]:
    return [
        BlogPost(**json.loads(path.read_text()))
        for path in DATA_DIR.glob('*.json')
    ]
This approach has some nice properties:
- Easier to edit: Each post is isolated. No risk of breaking other posts while editing.
- Better git diffs: A change to one post shows up as a small diff to one small file, not as a change buried inside a single monolithic JSON file.
- Simpler mental model: Want to delete a post? Delete the file. Want to see all posts? List the directory.
- Easier to version: You can see when each individual post was created or modified in git history.
I use this pattern for blog posts where each post is independent and I might want to edit them individually.
Directory structure as data organization
Your directory structure can encode meaningful relationships:
data/
  posts/
    2024/
      01-introduction-to-python.json
      02-advanced-patterns.json
    2025/
      01-new-year-thoughts.json
  projects/
    active/
      beniverse.json
      cool-tool.json
    archived/
      old-experiment.json
This makes it trivial to query by year, status, or category. Want all 2024 posts? Just read data/posts/2024/*.json. Want active projects? Read data/projects/active/*.json.
The filesystem becomes your index. No SQL required.
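To make that concrete, here is a rough sketch of what "filesystem as index" looks like in code. The directory names match the layout above; the plain-dict return values are just for illustration.

def load_posts_for_year(year: int) -> list[dict]:
    # "Query by year" is just globbing one subdirectory
    year_dir = Path('data/posts') / str(year)
    return [json.loads(p.read_text()) for p in sorted(year_dir.glob('*.json'))]

def load_active_projects() -> list[dict]:
    # "Query by status" is just reading a different directory
    active_dir = Path('data/projects/active')
    return [json.loads(p.read_text()) for p in active_dir.glob('*.json')]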
Hybrid approach: Metadata in files, content in markdown
Here's a pattern I've seen work well for blogs and documentation sites:
content/
  posts/
    hello-world/
      meta.json
      content.md
    python-tips/
      meta.json
      content.md
      cover-image.png
The meta.json contains structured data (title, date, tags), while content.md contains the actual post content. This gives you the best of both worlds:
- JSON for structured data that your code needs to query/sort/filter
- Markdown for long-form content that's easier to write and edit
- Each post gets its own directory, so you can include images and other assets
@dataclass
class BlogPost:
    slug: str
    title: str
    published_at: str
    tags: list[str]
    content: str  # Loaded from content.md

def load_post(slug: str) -> Optional[BlogPost]:
    post_dir = Path('content/posts') / slug
    if not post_dir.exists():
        return None
    meta = json.loads((post_dir / 'meta.json').read_text())
    content = (post_dir / 'content.md').read_text()
    return BlogPost(
        slug=slug,
        title=meta['title'],
        published_at=meta['published_at'],
        tags=meta['tags'],
        content=content
    )
This is essentially what static site generators like Hugo and Jekyll do. It works because the pattern matches how humans think about content.
JSON, TOML, YAML - pick your poison
JSON isn't your only option. TOML and YAML are both great for configuration-heavy data:
TOML is cleaner for structured config. Great for settings, environment configs, or data that humans edit frequently. It's what pyproject.toml uses.
YAML is more flexible and supports references. Good for complex nested structures. But it has edge cases with indentation and type coercion that can bite you.
JSON is the sweet spot for me. It's simple, well-supported, and has no surprising behavior. Python's json module is in the stdlib. Every language can read it. No weird indentation rules.
For blog posts and projects? JSON is perfect. For app configuration? TOML is nicer. Pick based on who edits the files and how often.
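If you do go the TOML route for configuration, note that Python 3.11+ ships a read-only parser in the stdlib. A minimal sketch, assuming a hypothetical settings.toml with a [site] table:

import tomllib  # stdlib since Python 3.11 (read-only)

def load_settings(path: Path = Path('settings.toml')) -> dict:
    # tomllib requires a binary file handle
    with open(path, 'rb') as f:
        return tomllib.load(f)

settings = load_settings()
site_title = settings['site']['title']  # assumes a [site] table with a title key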
What about performance?
For a portfolio site with maybe 50 blog posts, performance is a non-issue. Reading a 100KB JSON file takes milliseconds. We're not querying terabytes.
If you're worried, add caching:
from functools import cache

@cache
def load_posts():
    # Now it only reads from disk once
    ...
Clear the cache when data changes (or just restart the server).
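Invalidation is one call, since functools.cache exposes cache_clear(). A sketch, reusing the save_posts() from earlier:

def save_posts_and_refresh(posts: list[BlogPost]) -> None:
    save_posts(posts)          # write the JSON file
    load_posts.cache_clear()   # next read comes from disk again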
When file-based storage is actually fast
People assume databases are always faster than files. But for read-heavy workloads with small datasets, files can be surprisingly quick:
- No network overhead: The database is in the same process. No TCP connection, no serialization protocol.
- No query parsing: You're not sending SQL that needs to be parsed, planned, and executed.
- OS-level caching: Your OS caches frequently-read files in memory. Reading the same file twice is basically instant.
- Simple data structures: No B-trees to traverse, no index lookups. Just deserialize and go.
I ran some quick benchmarks on my laptop. Loading 100 blog posts from individual JSON files: ~5ms. Loading the same data from SQLite: ~8ms. From PostgreSQL over localhost: ~15ms.
Now, these numbers change dramatically as your dataset grows or you need complex queries. But for small datasets? Files are competitive.
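Don't take my laptop's numbers on faith; timing your own data takes a few lines. A rough sketch using the load_all_posts() from earlier:

import time

def average_load_ms(runs: int = 100) -> float:
    load_all_posts()  # warm the OS file cache once
    start = time.perf_counter()
    for _ in range(runs):
        load_all_posts()
    return (time.perf_counter() - start) / runs * 1000

print(f"average load: {average_load_ms():.2f} ms")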
Lazy loading for larger datasets
If you have hundreds or thousands of items, you don't want to load everything upfront. Here's a simple lazy-loading pattern:
class PostStore:
    def __init__(self, data_dir: Path):
        self.data_dir = data_dir
        self._cache: dict[str, BlogPost] = {}

    def get(self, slug: str) -> Optional[BlogPost]:
        if slug in self._cache:
            return self._cache[slug]
        path = self.data_dir / f'{slug}.json'
        if not path.exists():
            return None
        post = BlogPost(**json.loads(path.read_text()))
        self._cache[slug] = post
        return post

    def list_all(self) -> list[dict]:
        # Only load metadata for listing
        posts = []
        for path in self.data_dir.glob('*.json'):
            data = json.loads(path.read_text())
            # Only extract what we need for the list view
            posts.append({
                'slug': data['slug'],
                'title': data['title'],
                'published_at': data['published_at']
            })
        return sorted(posts, key=lambda p: p['published_at'], reverse=True)
This loads individual posts on-demand and keeps them cached. For list views, it only extracts the fields you need. Simple and effective.
What about writes?
Writes are the tricky part. If multiple processes write to the same file at the same time, you get corrupted data. But... who's writing to my portfolio site? Just me. And I'm not going to race condition myself.
If you do need concurrent writes, you can use file locking:
import fcntl

def save_posts_safely(posts):
    path = DATA_DIR / 'posts.json'
    path.touch(exist_ok=True)
    # Open without truncating ('r+'), lock, then overwrite the contents.
    # Opening with 'w' would truncate the file before the lock is held.
    with open(path, 'r+') as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        f.truncate(0)
        json.dump([asdict(p) for p in posts], f)
        fcntl.flock(f, fcntl.LOCK_UN)
Or use the atomic write pattern with tempfile and rename:
import tempfile
import os

def save_posts_atomic(posts):
    path = DATA_DIR / 'posts.json'
    data = json.dumps([asdict(p) for p in posts], indent=2)
    # Write to a temp file first (same directory, so same filesystem)
    fd, temp_path = tempfile.mkstemp(dir=DATA_DIR)
    with os.fdopen(fd, 'w') as f:
        f.write(data)
    # Atomic rename; os.replace also overwrites an existing target on Windows
    os.replace(temp_path, path)
This guarantees readers never see a half-written file. The rename operation is atomic on most filesystems, so you either get the old data or the new data, never corrupted garbage in between.
But honestly, if you need this level of care, maybe it's time for a real database.
Common gotchas (and how to avoid them)
I've made every mistake with file-based storage. Here are the lessons learned:
1. Handle missing files gracefully
# Bad - crashes when file doesn't exist
def load_posts():
    return json.loads(Path('data/posts.json').read_text())

# Good - handles missing file
def load_posts():
    path = Path('data/posts.json')
    if not path.exists():
        return []
    try:
        return json.loads(path.read_text())
    except json.JSONDecodeError:
        # File exists but is corrupted
        print(f"Warning: Could not parse {path}")
        return []
Files go missing. They get corrupted. Your code should handle it gracefully.
2. Create directories before writing
# Bad - crashes if directory doesn't exist
def save_post(post):
    path = Path(f'data/posts/{post.year}/{post.slug}.json')
    path.write_text(json.dumps(asdict(post)))

# Good - ensures directory exists
def save_post(post):
    path = Path(f'data/posts/{post.year}/{post.slug}.json')
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(asdict(post)))
mkdir(parents=True, exist_ok=True) is your friend. It creates nested directories and doesn't error if they already exist.
3. Watch out for encoding issues
Always specify UTF-8 explicitly:
# Can break on Windows or with special characters
path.write_text(json.dumps(data))
# Better
path.write_text(json.dumps(data, ensure_ascii=False), encoding='utf-8')
Or use the file handle directly:
with open(path, 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=2)
This prevents mojibake with emoji and non-ASCII characters.
4. Don't trust user input in file paths
# DANGEROUS - vulnerable to path traversal
def get_post(slug: str):
    # What if slug is "../../../etc/passwd"?
    path = Path(f'data/posts/{slug}.json')
    return json.loads(path.read_text())

# Better - validate the slug
def get_post(slug: str):
    if not slug.replace('-', '').replace('_', '').isalnum():
        raise ValueError("Invalid slug")
    path = Path(f'data/posts/{slug}.json')
    # Also check the resolved path is still in our data directory
    if not path.resolve().is_relative_to(Path('data/posts').resolve()):
        raise ValueError("Invalid path")
    return json.loads(path.read_text())
Path traversal attacks are real. Validate your inputs.
SQLite as a middle ground
If you want some database features but not a full PostgreSQL setup, SQLite is a great middle ground. It's still just a file, but you get SQL, indexes, and transactions.
import sqlite3
conn = sqlite3.connect('data.db')
cursor = conn.cursor()
cursor.execute('SELECT * FROM posts WHERE slug = ?', (slug,))
SQLite handles concurrent reads well and can handle moderate concurrent writes. It's battle-tested and included in Python's standard library.
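A small sketch of what that buys you; the table and column names here are illustrative, not a schema this site actually uses:

import sqlite3

conn = sqlite3.connect('data.db')
conn.execute('''
    CREATE TABLE IF NOT EXISTS posts (
        slug TEXT PRIMARY KEY,
        title TEXT NOT NULL,
        published_at TEXT NOT NULL,
        content TEXT NOT NULL
    )
''')
conn.execute('CREATE INDEX IF NOT EXISTS idx_posts_published ON posts(published_at)')

# Transactions: either both statements land, or neither does
with conn:
    conn.execute(
        'INSERT OR REPLACE INTO posts VALUES (?, ?, ?, ?)',
        ('hello-world', 'Hello World', '2024-01-01', '# Hello\n...')
    )
    conn.execute('DELETE FROM posts WHERE slug = ?', ('old-draft',))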
Here's when SQLite makes sense:
- You need to query data in complex ways (joins, aggregations, filtering)
- You have relationships between entities that you need to maintain
- Multiple processes might write to the data occasionally
- Your dataset is getting large (thousands of records) but still fits on one machine
- You want ACID guarantees but don't want to run a database server
SQLite is legitimately good. It powers mobile apps with millions of users, browsers like Chrome and Firefox, and even some production web apps. The key constraint is: one machine, one file, mostly-reads workload.
For my portfolio site, I chose JSON over SQLite because:
- I literally never query my data. I just load all posts and sort/filter in Python.
- I like being able to edit posts in a text editor without SQL.
- JSON diffs in git are easier to read than SQLite binary diffs.
But if I were building a personal task manager or a note-taking app with tags and search? SQLite all the way.
When to graduate to a real database
You'll know when you need a real database:
- Multiple users writing data concurrently
- Complex queries with joins and aggregations
- Data too big to fit in memory
- Need for transactions across multiple operations
- Team of developers who need a shared data model
Until then? Files are fine. Don't over-engineer your side project.
Migration strategy (when the time comes)
Eventually, you might outgrow file-based storage. Here's how to migrate without rewriting everything.
Step 1: Abstract your data layer
Instead of calling json.loads() everywhere, create a simple interface:
# data_store.py
from abc import ABC, abstractmethod

class DataStore(ABC):
    @abstractmethod
    def get_post(self, slug: str) -> Optional[BlogPost]:
        pass

    @abstractmethod
    def list_posts(self) -> list[BlogPost]:
        pass

    @abstractmethod
    def save_post(self, post: BlogPost) -> None:
        pass

class JSONDataStore(DataStore):
    def get_post(self, slug: str) -> Optional[BlogPost]:
        path = Path(f'data/posts/{slug}.json')
        if not path.exists():
            return None
        return BlogPost(**json.loads(path.read_text()))

    # ... implement other methods
Now your application code depends on the interface, not the implementation.
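In practice that means the call sites take a DataStore and never mention JSON or SQL. A sketch (the render function and the store wiring are hypothetical, and assume JSONDataStore implements all three methods):

def render_post_page(store: DataStore, slug: str) -> str:
    # Depends only on the DataStore interface, not on how posts are stored
    post = store.get_post(slug)
    if post is None:
        return '404: not found'
    return f'<h1>{post.title}</h1>\n{post.content}'

store: DataStore = JSONDataStore()  # later: PostgresDataStore(db_url)
html = render_post_page(store, 'hello-world')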
Step 2: Implement the new storage
When you're ready to migrate to PostgreSQL or SQLite:
from sqlalchemy import create_engine, text

class PostgresDataStore(DataStore):
    def __init__(self, db_url: str):
        self.db = create_engine(db_url)

    def get_post(self, slug: str) -> Optional[BlogPost]:
        with self.db.connect() as conn:
            result = conn.execute(
                text("SELECT * FROM posts WHERE slug = :slug"),
                {"slug": slug}
            )
            row = result.mappings().fetchone()
            return BlogPost(**row) if row else None

    # ... implement other methods
Step 3: Dual-write for safety
During migration, write to both stores:
class DualWriteDataStore(DataStore):
    def __init__(self, old_store: DataStore, new_store: DataStore):
        self.old_store = old_store
        self.new_store = new_store

    def save_post(self, post: BlogPost) -> None:
        # Write to both
        self.old_store.save_post(post)
        self.new_store.save_post(post)

    def get_post(self, slug: str) -> Optional[BlogPost]:
        # Read from new store, fall back to old
        result = self.new_store.get_post(slug)
        if result is None:
            result = self.old_store.get_post(slug)
        return result

    # ... list_posts follows the same read-from-new, fall-back-to-old pattern
This lets you migrate gradually and roll back if something breaks.
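Dual-writing only covers new writes; existing data still needs a one-time backfill. A sketch using the stores above (assuming they implement all three methods; the database URL is just a placeholder):

def backfill(old_store: DataStore, new_store: DataStore) -> None:
    # One-time copy of everything the old store knows about
    for post in old_store.list_posts():
        new_store.save_post(post)

backfill(JSONDataStore(), PostgresDataStore('postgresql://localhost/portfolio'))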
The real lesson
The real lesson isn't "use files instead of databases." It's "choose the right tool for the job."
For my portfolio site with 50 blog posts that only I edit? JSON files in git are perfect. Zero operational overhead. Simple to understand. Easy to debug.
For a SaaS app with thousands of users writing data concurrently? PostgreSQL all the way. You need transactions, you need concurrent access, you need a query language.
The problem is when people apply the SaaS solution to the portfolio site problem. You end up with:
- Database credentials to manage
- Migration scripts for schema changes
- Connection pooling configuration
- Backup strategies and restore procedures
- Database server to monitor and maintain
All for 50 blog posts that change once a month.
It's like using a semi truck to buy groceries. Yeah, it can carry more groceries than your car. But you don't need that, and now you need a CDL.
Who else does this?
I'm not the only one. Lots of successful projects use file-based storage:
- Static site generators (Hugo, Jekyll, Eleventy): Markdown files for content. They build sites with thousands of pages from flat files.
- Obsidian: A note-taking app used by millions. All notes are markdown files in a folder.
- Git: The most successful version control system ever. It's all files. No database server.
- Many CLI tools: Store config in JSON/YAML files in ~/.config/.
These aren't toys. They're production software used by millions of people. File-based storage scales further than you think.
Closing thoughts
Start simple. Use files. When files become a problem, you'll know. And by then, you'll have a real product with real users and real revenue to justify the complexity of a database.
Don't optimize for scale you don't have. Don't build infrastructure for problems you don't face.
Build the simplest thing that works. Ship it. Learn from it. Iterate.
Files are fine.