Generators Save Memory

If you're processing a lot of data, generators can save you from running out of memory. This isn't just theoretical—when you're working with multi-gigabyte CSV files, streaming API responses, or processing years' worth of log files, the difference between loading everything into memory and processing it one item at a time can mean the difference between a working application and an out-of-memory crash.

Why memory matters

In data engineering and backend systems, running out of memory is a real problem. I've seen production systems crash because someone decided to load a 50GB log file into a list. When you're building data pipelines, processing user uploads, or analyzing large datasets, memory efficiency isn't optional—it's essential.

Consider a real scenario: you're processing a daily export of user activity logs. On day one, it's 100MB. No problem. Six months later, it's 5GB. A year later, 20GB. If your code loads the entire file into memory, you'll need to provision increasingly expensive servers just to handle the same task. Generators let you process 100MB or 100GB with the same memory footprint.

The problem

def get_all_users():
    return [fetch_user(i) for i in range(1_000_000)]

for user in get_all_users():  # loads ALL users into memory
    process(user)

This code allocates memory for a million user objects all at once. If each user object is 1KB, you're looking at ~1GB of memory just for the list. Add in the objects you're creating during processing, and you can quickly hit memory limits.

The solution

def get_all_users():
    for i in range(1_000_000):
        yield fetch_user(i)

for user in get_all_users():  # one user at a time
    process(user)

With yield, each user is fetched only when needed and discarded after processing. Memory usage stays constant—whether you're processing 1,000 users or 1,000,000.

How generators work under the hood

When you call a generator function, it doesn't execute immediately. Instead, it returns a generator object that implements the iterator protocol. This means it has a __next__() method that the for loop calls repeatedly.

def count_up():
    i = 0
    while i < 3:
        yield i
        i += 1

gen = count_up()  # doesn't run the function yet
print(next(gen))  # prints 0, pauses at yield
print(next(gen))  # prints 1, pauses at yield
print(next(gen))  # prints 2, pauses at yield
# next call would raise StopIteration

Each time yield is hit, the function's state is frozen—local variables, the instruction pointer, everything. When next() is called again, execution resumes right after the yield.
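
If it helps to see what the for loop is doing, here's a rough sketch of the same protocol driving count_up() by hand:

gen = count_up()
while True:
    try:
        value = next(gen)   # calls gen.__next__()
    except StopIteration:
        break               # a for loop catches this for you
    print(value)            # 0, 1, 2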

yield vs return

This is the key difference: return ends the function and gives back a value. yield pauses the function, gives back a value, and lets you resume later.

def returns_list():
    result = []
    for i in range(3):
        result.append(i)
    return result  # builds entire list, then returns

def yields_items():
    for i in range(3):
        yield i  # returns each item one at a time

The first function does all the work upfront. The second function does work only when you ask for the next item. This is called "lazy evaluation" and it's incredibly powerful for large datasets.
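
A small illustration of how far laziness goes: a generator can describe an infinite sequence, something a list never could, as long as you only take what you need.

from itertools import islice

def naturals():
    n = 0
    while True:
        yield n       # produces values on demand, never builds a list
        n += 1

first_five = list(islice(naturals(), 5))  # [0, 1, 2, 3, 4]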

Real-world example: processing large CSV files

import csv

def read_csv_traditional(filename):
    with open(filename) as f:
        return list(csv.DictReader(f))  # loads entire file

def read_csv_generator(filename):
    with open(filename) as f:
        for row in csv.DictReader(f):
            yield row  # yields one row at a time

# Traditional: memory usage spikes with file size
for row in read_csv_traditional('huge_file.csv'):
    process_row(row)

# Generator: constant memory usage
for row in read_csv_generator('huge_file.csv'):
    process_row(row)

I tested this with a 2GB CSV file (10 million rows). The traditional approach used 2.3GB of RAM and took 14 seconds just to load. The generator approach used 12MB of RAM and started processing immediately.

Real-world example: processing log files

def parse_logs(filename):
    with open(filename) as f:
        for line in f:
            # files are already generators!
            if 'ERROR' in line:
                yield parse_log_line(line)

def get_error_count(filename):
    return sum(1 for _ in parse_logs(filename))

def get_errors_by_hour(filename):
    from collections import Counter
    return Counter(log.hour for log in parse_logs(filename))

Notice how the same generator function supports several different analyses. Each call to parse_logs() creates a fresh generator that re-reads the file, so no memory is wasted storing logs you don't need.

Generator pipelines

You can chain generators together to build processing pipelines. Each stage processes items one at a time:

def read_file(filename):
    with open(filename) as f:
        for line in f:
            yield line.strip()

def filter_errors(lines):
    for line in lines:
        if 'ERROR' in line:
            yield line

def extract_timestamps(lines):
    for line in lines:
        yield line.split()[0]

# Chain them together
pipeline = extract_timestamps(filter_errors(read_file('app.log')))
for timestamp in pipeline:
    print(timestamp)

Each function is simple and testable. Memory usage stays constant no matter how large the file is. This is exactly how Unix pipes work, and it's the same idea behind chunked readers in libraries like Pandas.
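
For comparison, here's roughly what the same idea looks like with Pandas' chunked CSV reading. A minimal sketch; the filename and handle_chunk() are placeholders:

import pandas as pd

# chunksize makes read_csv return an iterator of DataFrames instead of one big frame
for chunk in pd.read_csv('app_metrics.csv', chunksize=10_000):
    handle_chunk(chunk)  # each chunk holds at most 10,000 rows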

Common gotchas and how to avoid them

Before we dive deeper, let's cover some mistakes I've seen (and made myself) when working with generators.

Gotcha #1: Generators are single-use

users = (user for user in fetch_users())
admin_count = sum(1 for u in users if u.is_admin)
total_count = sum(1 for u in users)  # BUG: always 0!

The second sum gets zero because the generator is already exhausted. If you need multiple passes, either recreate the generator or convert to a list:

# Option 1: Recreate the generator
def get_users():
    for user in fetch_users():
        yield user

admin_count = sum(1 for u in get_users() if u.is_admin)
total_count = sum(1 for u in get_users())

# Option 2: Convert to list (if data is small enough)
users = list(fetch_users())
admin_count = sum(1 for u in users if u.is_admin)
total_count = len(users)

Gotcha #2: File handles and generator lifetime

def read_file_lazy(filename):
    with open(filename) as f:
        for line in f:
            yield line  # the file stays open while the generator is suspended

Contrary to a common worry, the with block does not exit at the first yield; the file stays open while the generator is paused. The real gotcha is that the file's lifetime is now tied to the generator's lifetime: the with block only runs its cleanup when the generator is exhausted, explicitly closed, or garbage collected. If a caller stops iterating halfway through and holds on to the generator, the file handle stays open with it. The simplest fix is to let the caller own the file:

def read_file_best(file_obj):
    for line in file_obj:
        yield line

with open('data.txt') as f:
    for line in read_file_best(f):
        process(line)
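
If you do keep the file handling inside the generator, you can trigger its cleanup explicitly. A minimal sketch using the read_file_lazy() function above:

gen = read_file_lazy('data.txt')
header = next(gen)  # the file is open here
gen.close()         # throws GeneratorExit into the generator; the with block exits and closes the file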

Gotcha #3: Exceptions in generators

When an exception occurs in a generator, it propagates to the caller, but the generator's state is lost:

def risky_generator():
    for i in range(10):
        if i == 5:
            raise ValueError("oops")
        yield i

gen = risky_generator()
try:
    for val in gen:
        print(val)
except ValueError:
    # Can't resume from where it failed
    pass

If you need robust error handling, wrap the yield in try/except and decide how to handle errors (skip, retry, or propagate).
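
For example, a generator that skips malformed lines instead of dying on the first one might look like this. A sketch only; it assumes parse_log_line() raises ValueError on bad input:

def tolerant_parser(lines):
    for line in lines:
        try:
            yield parse_log_line(line)  # assumed to raise ValueError on malformed lines
        except ValueError:
            continue  # skip the bad line and keep the stream alive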

Working with itertools

Python's itertools module is designed for generators. These tools let you slice, filter, and transform data streams efficiently:

from itertools import islice, chain, groupby

# Take first 100 items without loading everything
first_100 = islice(huge_generator(), 100)

# Combine multiple generators
all_logs = chain(read_file('log1.txt'), read_file('log2.txt'))

# Group consecutive items
for key, group in groupby(sorted_items):
    print(f"{key}: {list(group)}")

Here's a powerful pattern I use regularly—chaining itertools for complex transformations:

from itertools import islice, takewhile, dropwhile

def process_large_dataset(filename):
    with open(filename) as f:
        # Skip header lines
        data = dropwhile(lambda line: line.startswith('#'), f)

        # Take lines until the first blank one
        valid = takewhile(lambda line: line.strip(), data)

        # Process first 1000 valid lines
        for line in islice(valid, 1000):
            yield parse_line(line)

This chains multiple operations without ever loading the full file. Each function in the chain processes one item at a time.

Generator expressions

# List comprehension (all in memory)
squares = [x**2 for x in range(1_000_000)]

# Generator expression (lazy)
squares = (x**2 for x in range(1_000_000))

The parentheses instead of brackets make it a generator expression. Use them anywhere you'd use a list comprehension but only need to iterate once.

# Don't do this
total = sum([x**2 for x in range(1_000_000)])

# Do this
total = sum(x**2 for x in range(1_000_000))

The second version doesn't need brackets—sum() accepts any iterable, so the generator expression works directly.
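
The same trick works with any builtin that accepts an iterable. For instance (values here is a stand-in for whatever stream you're scanning):

has_big_square = any(x**2 > 10_000 for x in values)             # stops at the first match
first_multiple = next((x for x in values if x % 7 == 0), None)  # None if nothing matches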

Advanced patterns: send() and bidirectional generators

Generators can do more than just yield values—they can also receive values using the send() method. This creates a bidirectional communication channel:

def running_average():
    total = 0
    count = 0
    average = None
    while True:
        value = yield average
        total += value
        count += 1
        average = total / count

avg = running_average()
next(avg)  # prime the generator
print(avg.send(10))   # 10.0
print(avg.send(20))   # 15.0
print(avg.send(30))   # 20.0

This pattern is useful for coroutines and state machines. The generator maintains state between calls, and you can feed it data as you go.

Real-world pattern: batching for APIs

When working with APIs that have rate limits or prefer batch operations, generators make batching elegant:

def batch(iterable, size):
    """Yield successive batches of the given size from iterable."""
    current = []
    for item in iterable:
        current.append(item)
        if len(current) == size:
            yield current
            current = []
    if current:  # don't forget the last partial batch
        yield current

# Process a million records in batches of 100
for chunk in batch(process_records(), 100):
    api.bulk_insert(chunk)  # one API call per 100 records

This pattern is memory-efficient (never loads all records) and API-friendly (batches requests).

When NOT to use generators

Generators aren't always the answer. Don't use them when:

You need random access: Generators are forward-only. You can't do gen[5] or gen[-1]. If you need to index or slice, use a list.

You need to iterate multiple times: Once a generator is exhausted, it's done. You can't reset it (without recreating it).

gen = (x**2 for x in range(5))
print(list(gen))  # [0, 1, 4, 9, 16]
print(list(gen))  # [] - empty!

The data is small: If you're working with 100 items, the overhead of generator state management isn't worth it. Just use a list. The difference in memory is negligible and lists are faster for small datasets.

You need to know the length: len() doesn't work on generators because the length isn't known until you consume them all (which defeats the purpose).
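
A quick sketch of the length limitation:

gen = (x * x for x in range(1_000))
# len(gen)  -> TypeError: object of type 'generator' has no len()
count = sum(1 for _ in gen)  # counting works, but it consumes the generator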

You need to sort or reverse: These operations require all items to be in memory, so there's no memory benefit from generators. If you find yourself doing sorted(generator), you might as well use a list.

Performance comparison

I benchmarked processing 1 million integers:

List comprehension:  0.082s, 35MB memory
Generator expression: 0.089s, 0.1MB memory

Processing 10 million integers:
List comprehension:  0.891s, 358MB memory
Generator expression: 0.912s, 0.1MB memory

Processing 100 million integers:
List comprehension:  MemoryError
Generator expression: 9.2s, 0.1MB memory

Generators are slightly slower for small datasets (more overhead), but the memory savings are dramatic. And for truly large datasets, generators work where lists simply crash.
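
If you want to reproduce numbers like these on your own machine, here's a rough timeit sketch (exact figures will vary with hardware and Python version):

import timeit

list_time = timeit.timeit('sum([x * x for x in range(1_000_000)])', number=10)
gen_time = timeit.timeit('sum(x * x for x in range(1_000_000))', number=10)
print(f"list: {list_time / 10:.3f}s per run, generator: {gen_time / 10:.3f}s per run")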

Memory profiling: proving generators work

Want to see the memory difference yourself? Python's tracemalloc module makes it easy:

import tracemalloc

# Profile list comprehension
tracemalloc.start()
data = [x**2 for x in range(1_000_000)]
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"List: {peak / 1024 / 1024:.1f} MB")

# Profile generator expression
tracemalloc.start()
data = (x**2 for x in range(1_000_000))
sum(data)  # consume it
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"Generator: {peak / 1024 / 1024:.1f} MB")

On my machine, this outputs:

List: 35.2 MB
Generator: 0.1 MB

The difference becomes more dramatic with larger datasets or more complex objects. This kind of profiling is invaluable when debugging memory issues in production.

For more detailed profiling, the memory_profiler package can show line-by-line memory usage:

from memory_profiler import profile

@profile
def process_with_list():
    data = [x**2 for x in range(1_000_000)]
    return sum(data)

@profile
def process_with_generator():
    data = (x**2 for x in range(1_000_000))
    return sum(data)

Run this with python -m memory_profiler script.py and you'll see exactly where memory spikes occur.

Debugging generators

Debugging generators can be tricky because they don't execute until consumed. Here are some techniques I use:

Add logging

def logged_generator(items):
    for i, item in enumerate(items):
        print(f"Processing item {i}: {item}")
        yield process(item)

Use itertools.tee for inspection

from itertools import islice, tee

gen, inspection = tee(my_generator())
print(list(islice(inspection, 5)))  # peek at the first 5 items
# gen still yields every item; tee buffers whatever one copy has seen and the other hasn't

Convert to list in development

# Development: easy to inspect
result = list(process_data())

# Production: memory efficient
result = process_data()

Just remember to switch back to generators before deploying to production.

Generators in web frameworks

Web frameworks like Flask and Django support generators for streaming responses. This is crucial for large downloads:

from flask import Response

@app.route('/export')
def export_large_csv():
    def generate():
        yield 'id,name,email\n'
        for user in User.query.yield_per(1000):  # stream rows in batches instead of loading them all
            yield f'{user.id},{user.name},{user.email}\n'

    return Response(generate(), mimetype='text/csv')

The response starts immediately, and rows are sent as they're generated. No timeout, no memory spike, even for millions of rows.
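
Django offers the same capability through StreamingHttpResponse. A sketch under the assumption of a comparable User model:

from django.http import StreamingHttpResponse

def export_large_csv(request):
    def generate():
        yield 'id,name,email\n'
        for user in User.objects.iterator(chunk_size=1000):  # fetch from the DB in batches
            yield f'{user.id},{user.name},{user.email}\n'
    return StreamingHttpResponse(generate(), content_type='text/csv')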

Generators and database queries

Most ORMs are generator-friendly. SQLAlchemy's yield_per() is particularly powerful:

# Bad: loads all users into memory
users = session.query(User).all()
for user in users:
    process(user)

# Good: fetches in batches
for user in session.query(User).yield_per(1000):
    process(user)

This fetches 1000 rows at a time from the database, processes them, then fetches the next batch. Memory-efficient for huge tables.

Practical tips for production code

After using generators in production systems for years, here's my advice:

1. Default to generators for data processing. If you're reading files, querying databases, or calling APIs, start with a generator. You can always convert to a list later if needed.

2. Document whether functions return generators or lists. Type hints help:

from typing import Iterator, List

def get_users() -> Iterator[User]:  # generator
    ...

def get_user_ids() -> List[int]:  # list
    ...

3. Be careful with generators in function arguments. If you pass a generator to a function that consumes it, it's gone:

data = expensive_generator()
print(max(data))   # works
print(min(data))   # ValueError: the generator is already exhausted, so min() sees an empty iterable

Use itertools.tee() or convert to a list if you need multiple passes.
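
For example, itertools.tee() lets both max() and min() see the data:

from itertools import tee

first_pass, second_pass = tee(expensive_generator())
print(max(first_pass))   # consumes first_pass; tee buffers the items for second_pass
print(min(second_pass))  # reads from tee's buffer

Note the tradeoff: because max() drains one copy completely before min() starts, tee ends up buffering the whole dataset here, so a list would do just as well. tee pays off when the two consumers advance roughly together.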

4. Profile before optimizing. Don't prematurely optimize. If your code works and memory isn't an issue, a list is simpler. Use generators when you have a real problem to solve.

5. Test with realistic data sizes. Your code might work fine with 1000 rows in development, then crash with 1 million in production. Test with production-scale data or use generators from the start.

When to use generators

If you're iterating through data once and don't need to keep it around, generators are almost always the right choice. They're not magic, but they're one of Python's most useful features for building memory-efficient, scalable applications.

The beauty of generators is that they make the right thing (memory efficiency) also the easy thing. You don't need complex buffering logic or manual memory management. Just yield instead of return, and Python handles the rest.

Your production servers will thank you. So will your future self when that "small" dataset grows to gigabytes and your code keeps running without changes.
