This article provides an in-depth guide on how to create a cache file in Python, outlining several methods from simple, built-in solutions to more advanced third-party libraries.
Caching is a crucial optimization technique that stores frequently accessed data in a temporary location, speeding up application performance by avoiding repetitive, expensive computations or network requests.
Method 1: Manual file caching with JSON or Pickle
This approach is the most fundamental, involving manually writing and reading data from a file on disk. It offers full control but requires careful handling of file I/O, error management, and data serialization.
Using the json module
This is suitable for caching data that can be easily represented in the JSON format, such as dictionaries and lists.
Pros:
- Human-readable: The cache file is a plain text file, which is easy to inspect and debug.
- Interoperable: JSON is a language-agnostic format, so the cache could potentially be read by other programs.
Cons:
- Data type limitations: It can only serialize standard JSON-compatible Python data types.
- Performance: For very large data, JSON can be slower than binary formats.
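The data type limitation is easy to hit in practice: `json.dumps` raises `TypeError` on values such as `datetime` objects unless you supply a fallback serializer via its `default` parameter. A minimal illustration:

```python
import json
from datetime import datetime

record = {"user": "alice", "last_seen": datetime(2024, 1, 15, 12, 30)}

# A plain dump fails: datetime is not a JSON-serializable type
try:
    json.dumps(record)
except TypeError as e:
    print(f"Cannot cache as-is: {e}")

# Workaround: convert unsupported values to strings before writing
encoded = json.dumps(record, default=str)
print(encoded)  # last_seen becomes "2024-01-15 12:30:00"
```

Note that this workaround is lossy: on reload the value comes back as a plain string, so the caller must convert it back (e.g. with `datetime.fromisoformat`) if the original type matters.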
Example: Caching API results with JSON
import os
import json
import time

CACHE_DIR = 'cache_files'

def get_data_from_api(user_id):
    """Simulates a slow API request."""
    print(f"Fetching data for user {user_id} from the API...")
    time.sleep(2)  # Simulate network delay
    return {"id": user_id, "name": f"User {user_id}", "status": "active"}

def get_user_data(user_id, cache_timeout=300):
    """Fetches user data, using a JSON file cache."""
    if not os.path.exists(CACHE_DIR):
        os.makedirs(CACHE_DIR)
    cache_file = os.path.join(CACHE_DIR, f"user_{user_id}.json")
    # Check if a non-expired cache file exists
    if os.path.exists(cache_file):
        file_mod_time = os.path.getmtime(cache_file)
        if time.time() - file_mod_time < cache_timeout:
            print(f"Retrieving user {user_id} data from cache...")
            with open(cache_file, 'r') as f:
                return json.load(f)
    # If no valid cache, fetch from API and save
    user_data = get_data_from_api(user_id)
    with open(cache_file, 'w') as f:
        json.dump(user_data, f)
    return user_data

# First call: Fetches and caches data
print(get_user_data(101))
# Second call: Retrieves data from the cache
print(get_user_data(101))
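One fragility of the pattern above is worth noting: if the process dies mid-write, the cache file is left truncated and the next `json.load` fails. A common hardening, sketched here, is to write to a temporary file in the same directory and atomically swap it into place with `os.replace`:

```python
import json
import os
import tempfile

def save_cache_atomic(cache_file, data):
    """Write JSON to a temp file, then atomically rename it over the target.

    os.replace is atomic on POSIX (and on Windows for same-volume paths),
    so readers see either the old file or the new one, never a partial write.
    """
    cache_dir = os.path.dirname(cache_file) or "."
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
        os.replace(tmp_path, cache_file)
    except BaseException:
        os.unlink(tmp_path)
        raise

save_cache_atomic("user_101.json", {"id": 101, "status": "active"})
```

The temp file must live on the same filesystem as the target for the rename to be atomic, which is why it is created in the cache directory rather than the system default temp location.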
Using the pickle module
The pickle module serializes arbitrary Python objects into a binary format. This is ideal for caching complex objects that are not supported by JSON.
Pros:
- Versatile: Can serialize a wide range of Python objects, including custom classes.
- Faster: Serialization is generally quicker than JSON, especially for large datasets.
Cons:
- Security risks: Deserializing a pickle file from an untrusted source can execute malicious code.
- Python-specific: The binary format is not interoperable with other programming languages.
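If a cache file could be read or modified by other parties, one mitigation the pickle documentation itself suggests is to sign the serialized bytes with hmac and verify the signature before unpickling. A minimal sketch, with SECRET_KEY standing in for a key you would load from a secure location:

```python
import hashlib
import hmac
import pickle

SECRET_KEY = b"replace-with-a-real-secret"  # placeholder for illustration

def dump_signed(obj, path):
    """Serialize obj and prepend an HMAC-SHA256 signature of the payload."""
    payload = pickle.dumps(obj)
    digest = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    with open(path, "wb") as f:
        f.write(digest + payload)

def load_signed(path):
    """Verify the signature before unpickling; refuse tampered files."""
    with open(path, "rb") as f:
        data = f.read()
    digest, payload = data[:32], data[32:]  # SHA-256 digest is 32 bytes
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(digest, expected):
        raise ValueError("Cache file failed integrity check; refusing to unpickle")
    return pickle.loads(payload)

dump_signed({"id": 202}, "signed_cache.pickle")
print(load_signed("signed_cache.pickle"))
```

This only guarantees the file was written by someone holding the key; it does not encrypt the contents or make pickle safe for data signed by an attacker who knows the key.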
Example: Caching complex objects with Pickle
import os
import pickle
import time

CACHE_DIR = 'cache_files'

class UserObject:
    def __init__(self, user_id, name):
        self.user_id = user_id
        self.name = name

    def __repr__(self):
        return f"<UserObject id={self.user_id}, name='{self.name}'>"

def generate_complex_user(user_id):
    """Simulates generating a complex object."""
    print(f"Generating complex user object for {user_id}...")
    time.sleep(1)
    return UserObject(user_id, f"Complex User {user_id}")

def get_cached_user_object(user_id):
    """Fetches or creates a user object, using a pickle file cache."""
    if not os.path.exists(CACHE_DIR):
        os.makedirs(CACHE_DIR)
    cache_file = os.path.join(CACHE_DIR, f"user_obj_{user_id}.pickle")
    if os.path.exists(cache_file):
        print(f"Retrieving user object for {user_id} from pickle cache...")
        with open(cache_file, 'rb') as f:
            return pickle.load(f)
    user_obj = generate_complex_user(user_id)
    with open(cache_file, 'wb') as f:
        pickle.dump(user_obj, f)
    return user_obj

# First run: Generates and caches the object
print(get_cached_user_object(202))
# Second run: Loads the object directly from the cache
print(get_cached_user_object(202))
Method 2: Caching with the shelve module
The shelve module is part of Python's standard library and provides a simple, dictionary-like interface for persistent object storage on disk. It handles the serialization process automatically using pickle.
Pros:
- Ease of use: It operates just like a dictionary, so you don't need to manage individual files.
- Persistence: Data is stored on disk and persists between program executions.
Cons:
- Performance: Can be less performant than a dedicated database for very large caches.
- Write-back overhead: If you enable the writeback option, it uses more memory and can be slow to close the shelf.
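The writeback trade-off also hides a common gotcha: without writeback=True, mutating a value fetched from the shelf does not persist, because the shelf only writes back on assignment. A small demonstration:

```python
import shelve

# Without writeback, in-place mutation of a stored object is silently lost
with shelve.open("demo_shelf") as cache:
    cache["scores"] = [10, 20]
    cache["scores"].append(30)   # mutates a temporary copy only

with shelve.open("demo_shelf") as cache:
    print(cache["scores"])       # [10, 20] -- the append was lost

# writeback=True keeps fetched objects in memory and flushes them on close
with shelve.open("demo_shelf", writeback=True) as cache:
    cache["scores"].append(30)

with shelve.open("demo_shelf") as cache:
    print(cache["scores"])       # [10, 20, 30]
```

The alternative to writeback is to do the mutation on a local variable and reassign it (`temp = cache["scores"]; temp.append(30); cache["scores"] = temp`), which avoids the memory overhead.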
Example: Persistent caching with shelve
import shelve
import time

def expensive_lookup(key):
    """Simulates an expensive operation."""
    print(f"Performing expensive lookup for key '{key}'...")
    time.sleep(2)
    return f"Result for {key}"

# Create or open the shelf file
with shelve.open('my_cache.db') as cache:
    # First access: Performs the expensive lookup and caches it
    if 'data_key' not in cache:
        cache['data_key'] = expensive_lookup('data_key')

    # Second access: Retrieves the result from the shelf
    print("Retrieving from cache...")
    print(cache['data_key'])

    # Example of a new item
    if 'another_key' not in cache:
        cache['another_key'] = expensive_lookup('another_key')
    print("Retrieving from cache...")
    print(cache['another_key'])
# Data is now stored in 'my_cache.db' and persists after the program exits
Method 3: Third-party libraries for robust caching
For more demanding applications, several excellent third-party libraries provide advanced features, including cache expiration policies and thread/process safety.
diskcache
diskcache is a fast and robust disk-backed caching library that supports concurrent access and advanced features like time-based expiration and size limits.
Pros:
- Scalable: Excellent for large caches that exceed available memory.
- Concurrent: Safe for use in multi-threaded or multi-process applications.
- Feature-rich: Includes a key-value store, deque, and a function decorator.
Example: Caching with diskcache
from diskcache import Cache
import time

def slow_function(x):
    """A function that takes a long time to compute."""
    print(f"Executing slow function with x={x}...")
    time.sleep(1)
    return x * x

# Initialize a cache instance
with Cache('my_disk_cache') as cache:
    # Use the cache directly for key-value storage
    cache.set('key1', 'value1')
    print(f"Retrieved from diskcache: {cache.get('key1')}")

    # Use it as a function decorator
    @cache.memoize(expire=60)  # Cache results for 60 seconds
    def memoized_slow_function(x):
        return slow_function(x)

    print(f"First call: {memoized_slow_function(5)}")
    print(f"Second call (cached): {memoized_slow_function(5)}")
joblib
joblib is a library for parallel computing in Python that also offers a simple, decorator-based disk caching mechanism. It's particularly useful for caching the results of heavy computations in scientific and machine learning applications.
Pros:
- Simple decorator: Easily cache function calls with minimal code.
- Efficient: Avoids recomputing expensive results for the same inputs.
Example: Caching with joblib
from joblib import Memory
import time

# Create a memory object that will store cached data in the specified directory
memory = Memory('./joblib_cache', verbose=0)

@memory.cache
def expensive_calculation(n):
    """Simulates a costly computation."""
    print(f"Running expensive calculation for {n}...")
    time.sleep(2)
    return sum(range(n))

# First call: Runs the calculation and caches the result
print(expensive_calculation(10000000))
# Second call with the same input: Loads the result from the cache
print(expensive_calculation(10000000))
Choosing the right caching method
| Method | Best For | Considerations |
|---|---|---|
| Manual (JSON/Pickle) | Simple scripts or when you need full control over file management. | Requires manual logic for cache checks and expiration. pickle is not secure with untrusted sources. |
| shelve | Quick-and-dirty persistent object storage. | Limited in features compared to dedicated libraries; less performant for very large caches. |
| diskcache | Large-scale, high-performance applications with concurrent access needs. | More powerful but has a steeper learning curve than shelve. |
| joblib | Caching computationally intensive functions, especially in scientific computing. | Primarily focused on function result caching rather than general key-value storage. |