At its core, Dogpile provides a locking interface around a “value creation” function.
The interface supports several levels of usage, starting from a very rudimentary one and then moving to more intricate patterns that deal with particular scenarios. The examples here use successively more of these features, working toward how a fully featured caching system might be constructed around Dogpile.
It’s anticipated that most users of Dogpile will be using it indirectly via the dogpile.cache caching front-end. If you fall into this category, then the short answer is no: you do not need to learn the Dogpile API directly.
Dogpile provides core internals to the dogpile.cache package, which provides a simple-to-use caching API, rudimentary backends for Memcached and others, and easy hooks to add new backends. Users of dogpile.cache don’t need to know or access Dogpile’s APIs directly, though a rough understanding of the general idea is always helpful.
Using the core Dogpile APIs described here directly implies you’re building your own resource-usage system outside, or in addition to, the one dogpile.cache provides.
A simple example:
from dogpile import Dogpile

# store a reference to a "resource", some
# object that is expensive to create.
the_resource = [None]

def some_creation_function():
    # create the resource here
    the_resource[0] = create_some_resource()

def use_the_resource():
    # some function that uses
    # the resource. Won't reach
    # here until some_creation_function()
    # has completed at least once.
    the_resource[0].do_something()

# create Dogpile with 3600 second
# expiry time
dogpile = Dogpile(3600)

with dogpile.acquire(some_creation_function):
    use_the_resource()
Above, some_creation_function() will be called when Dogpile.acquire() is first called. The remainder of the with block then proceeds. Concurrent threads which call Dogpile.acquire() during this initial period will be blocked until some_creation_function() completes.
Once the creation function has completed successfully the first time, new calls to Dogpile.acquire() will call some_creation_function() each time the “expiretime” has been reached, allowing only a single thread to call the function. Concurrent threads which call Dogpile.acquire() during this period will fall through, and not be blocked. It is expected that the “stale” version of the resource remain available at this time while the new one is generated.
By default, Dogpile uses Python’s threading.Lock() to synchronize among threads within a process. This can be altered to support any kind of locking as we’ll see in a later section.
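The blocking and fall-through behavior described above can be illustrated with a small standard-library sketch. This is a toy stand-in written for this explanation, not Dogpile's implementation; the class name `MiniDogpile` and its internals are hypothetical.

```python
import threading
import time
from contextlib import contextmanager


class MiniDogpile:
    """Toy illustration of the locking behavior; not Dogpile's own code."""

    def __init__(self, expiretime):
        self.expiretime = expiretime
        self.createdtime = None          # None means "never created"
        self._mutex = threading.Lock()

    def _is_expired(self):
        return time.time() - self.createdtime > self.expiretime

    @contextmanager
    def acquire(self, creator):
        if self.createdtime is None:
            # No value exists yet: every caller waits for the first creation.
            with self._mutex:
                if self.createdtime is None:
                    creator()
                    self.createdtime = time.time()
        elif self._is_expired() and self._mutex.acquire(blocking=False):
            # Value is stale: exactly one thread regenerates it, while
            # the others skip this branch and keep using the stale value.
            try:
                if self._is_expired():
                    creator()
                    self.createdtime = time.time()
            finally:
                self._mutex.release()
        yield
```

The non-blocking `acquire(blocking=False)` call is what lets concurrent threads fall through during regeneration instead of queueing behind the creating thread.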
The dogpile lock includes a more intricate mode of usage to optimize the usage of a cache like Memcached. The difficulties Dogpile addresses in this mode are:
To use this mode, the steps are as follows:
Example:
from dogpile import Dogpile, NeedRegenerationException

def get_value_from_cache():
    value = my_cache.get("some key")
    if value is None:
        raise NeedRegenerationException()
    return value

def create_and_cache_value():
    value = my_expensive_resource.create_value()
    my_cache.put("some key", value)
    return value

dogpile = Dogpile(3600, init=True)

def get_value():
    # "return" is only valid inside a function body
    with dogpile.acquire(create_and_cache_value, get_value_from_cache) as value:
        return value
Note that get_value_from_cache() should not raise NeedRegenerationException a second time directly after create_and_cache_value() has been called.
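To make the control flow of this mode concrete, here is a minimal standard-library sketch, assuming a cache-read function that raises when the value is missing and a creation function that both stores and returns the new value. `ValueDogpile` and `NeedsRegeneration` are hypothetical names for this illustration and do not reproduce Dogpile's internals.

```python
import threading
import time


class NeedsRegeneration(Exception):
    """Stand-in for NeedRegenerationException in this sketch."""


class ValueDogpile:
    def __init__(self, expiretime):
        self.expiretime = expiretime
        self.createdtime = None
        self._mutex = threading.Lock()

    def get(self, creator, value_fn):
        seen = self.createdtime
        try:
            value = value_fn()
            have_value = True
        except NeedsRegeneration:
            value, have_value = None, False

        expired = seen is None or time.time() - seen > self.expiretime
        if have_value and not expired:
            return value

        # Wait for the mutex only if there is no stale value to return.
        if self._mutex.acquire(blocking=not have_value):
            try:
                if self.createdtime != seen:
                    # Another thread regenerated while we waited.
                    return value_fn()
                # The creator's return value is used directly, so the
                # cache is not read again right after creation.
                value = creator()
                self.createdtime = time.time()
            finally:
                self._mutex.release()
        return value
```

In this sketch, a thread that finds a stale value while another thread is regenerating simply returns the stale value, whereas a thread that finds no value at all waits for the creation to finish.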
Dogpile is part of an effort to “break up” the Beaker package into smaller, simpler components (which also work better). Here, we illustrate how to approximate Beaker’s “cache decoration” function, to decorate any function and store the value in Memcached. We create a Python decorator function called cached() which provides caching for the output of a single function. It’s given the “key” which we’d like to use in Memcached, and internally it makes use of its own Dogpile object that is dedicated to managing this one function/key:
import pylibmc
from dogpile import Dogpile, NeedRegenerationException

# pylibmc.Client expects a list of server addresses
mc_pool = pylibmc.ThreadMappedPool(pylibmc.Client(["localhost"]))

def cached(key, expiration_time):
    """A decorator that will cache the return value of a function
    in memcached given a key."""

    def get_value():
        with mc_pool.reserve() as mc:
            value = mc.get(key)
            if value is None:
                raise NeedRegenerationException()
            return value

    dogpile = Dogpile(expiration_time, init=True)

    def decorate(fn):
        def gen_cached():
            value = fn()
            with mc_pool.reserve() as mc:
                # pylibmc stores values with set()
                mc.set(key, value)
            return value

        def invoke():
            with dogpile.acquire(gen_cached, get_value) as value:
                return value

        return invoke

    return decorate
Above we can decorate any function as:
@cached("some key", 3600)
def generate_my_expensive_value():
    return slow_database.lookup("stuff")
The Dogpile lock will ensure that only one thread at a time performs slow_database.lookup(), and only every 3600 seconds, unless Memcached has removed the value in which case it will be called again as needed.
In particular, Dogpile’s system lets us call the Memcached get() function at most once per access, where Beaker’s system calls it twice, and it doesn’t make us call get() again when we have just created the value.
The patterns so far have illustrated how to use a single, persistently held Dogpile object which maintains a thread-based lock for the lifespan of some particular value. The Dogpile object is also responsible for maintaining the last known “creation time” of the value, which is available through the Dogpile.createdtime attribute.
For an application that may deal with an arbitrary number of cache keys retrieved from a remote service, this approach must be revised so that we don’t need to store a Dogpile object for every possible key in our application’s memory.
The two challenges here are:
The approach is another one derived from Beaker, where we will use a registry that can provide a unique Dogpile object given a particular key, ensuring that all concurrent threads use the same object, but then releasing the object to the Python garbage collector when this usage is complete. The NameRegistry object provides this functionality, again constructed around the notion of a creation function that is only invoked as needed. We will also instruct the Dogpile.acquire() method to use a “creation time” value that we retrieve from the cache, via the value_and_created_fn parameter, which supersedes the value_fn we used earlier. value_and_created_fn expects a function that returns a tuple of (value, created_at), where it’s assumed both have been retrieved from the cache backend:
import pylibmc
import pickle
import time
from dogpile import Dogpile, NeedRegenerationException, NameRegistry

# pylibmc.Client expects a list of server addresses
mc_pool = pylibmc.ThreadMappedPool(pylibmc.Client(["localhost"]))

def create_dogpile(key, expiration_time):
    return Dogpile(expiration_time)

dogpile_registry = NameRegistry(create_dogpile)

def get_or_create(key, expiration_time, creation_function):
    def get_value():
        with mc_pool.reserve() as mc:
            value = mc.get(key)
            if value is None:
                raise NeedRegenerationException()
            # deserialize a tuple
            # (value, createdtime)
            return pickle.loads(value)

    def gen_cached():
        value = creation_function()
        with mc_pool.reserve() as mc:
            # serialize a tuple
            # (value, createdtime)
            value = (value, time.time())
            # store under the same key that get_value() reads
            mc.set(key, pickle.dumps(value))
        return value

    dogpile = dogpile_registry.get(key, expiration_time)

    with dogpile.acquire(gen_cached, value_and_created_fn=get_value) as value:
        return value
Stepping through the above code:
An example usage of the completed function:
import urllib2

def get_some_value(key):
    """retrieve a datafile from a slow site based on the given key."""
    def get_data():
        return urllib2.urlopen(
            "http://someslowsite.com/some_important_datafile_%s.json" % key
        ).read()
    return get_or_create(key, 3600, get_data)

my_data = get_some_value("somekey")
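The registry idea used in this section, one shared object per key that is released once nobody holds it, can be sketched with the standard library's weak references. This is only an illustration of the concept; whether NameRegistry is implemented this way is an assumption, and `MiniRegistry` is a hypothetical name.

```python
import threading
import weakref


class MiniRegistry:
    """Hands out one shared object per key while anyone still holds it."""

    def __init__(self, creator):
        self._creator = creator
        # Entries vanish automatically once no strong reference remains.
        self._values = weakref.WeakValueDictionary()
        self._mutex = threading.Lock()

    def get(self, key, *args, **kw):
        # The mutex ensures concurrent callers for the same key
        # receive the same object rather than creating duplicates.
        with self._mutex:
            try:
                return self._values[key]
            except KeyError:
                value = self._values[key] = self._creator(key, *args, **kw)
                return value
```

Because the dictionary holds only weak references, memory use is bounded by the number of keys in active use rather than by the number of keys ever seen.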
The final twist on the caching pattern is to fix the issue of the Dogpile mutex itself being local to the current process. When a handful of threads all go to access some key in our cache, they will access the same Dogpile object which internally can synchronize their activity using a Python threading.Lock. But in this example we’re talking to a Memcached cache. What if we have many servers which all access this cache? We’d like all of these servers to coordinate together so that we don’t just prevent the dogpile problem within a single process, we prevent it across all servers.
To accomplish this, we need an object that can coordinate processes. In this example we’ll use a file-based lock as provided by the lockfile package, which uses a unix-symlink concept to provide a filesystem-level lock (which also has been made threadsafe). Another strategy may base itself directly off the Unix flock() call (available in Python as fcntl.flock()), and still another approach is to lock within Memcached itself, using a recipe such as that described at Using Memcached as a Distributed Locking Service.
The type of lock chosen here is based on a tradeoff between global availability and reliable performance. The file-based lock will perform more reliably than the memcached lock, but may be difficult to make accessible to multiple servers (with NFS being the most likely option, which would eliminate the possibility of the flock() call). The memcached lock, on the other hand, provides the perfect scope, being available from the same memcached server that the cached value itself comes from; however, the lock may vanish in some cases, which means we could still get a cache-regeneration pileup in that case.
What all of these locking schemes have in common is that unlike the Python threading.Lock object, they all need access to an actual key which acts as the symbol that all processes will coordinate upon. This is where the key argument to our create_dogpile() function introduced in Scaling Dogpile against Many Keys comes in. The example can remain the same, except for the changes below to just that function:
import lockfile
import os
from hashlib import sha1

# ... other imports and setup from the previous example

def create_dogpile(key, expiration_time):
    lock_path = os.path.join("/tmp", "%s.lock" % sha1(key).hexdigest())
    return Dogpile(
        expiration_time,
        # use the path computed above
        lock=lockfile.FileLock(lock_path)
    )

# ... everything else from the previous example
Above, the only change is the lock argument passed to the constructor of Dogpile. For a given key “some_key”, we first generate a hex digest of it as a quick way to remove any filesystem-unfriendly characters, then use lockfile.FileLock() to create a lock against the file /tmp/53def077a4264bd3183d4eb21b1f56f883e1b572.lock. Any number of Dogpile objects in various processes will now coordinate with each other, using this common filename as the “baton” against which creation of a new value proceeds.
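For comparison, the flock()-based strategy mentioned earlier could look roughly like the following on a Unix system. This is a sketch under assumptions: `FlockMutex` and `lock_path_for` are hypothetical names, and whether Dogpile's lock argument expects exactly this acquire()/release() interface is not confirmed here.

```python
import fcntl
import hashlib
import os
import tempfile


def lock_path_for(key, directory=None):
    """Map an arbitrary cache key to a filesystem-safe lock filename."""
    directory = directory or tempfile.gettempdir()
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return os.path.join(directory, "%s.lock" % digest)


class FlockMutex:
    """Cross-process mutex built on flock(2); Unix only."""

    def __init__(self, path):
        self.path = path
        self._fd = None

    def acquire(self, wait=True):
        fd = os.open(self.path, os.O_CREAT | os.O_RDWR)
        flags = fcntl.LOCK_EX if wait else fcntl.LOCK_EX | fcntl.LOCK_NB
        try:
            fcntl.flock(fd, flags)
        except BlockingIOError:
            # The lock is held elsewhere and we were asked not to wait.
            os.close(fd)
            return False
        self._fd = fd
        return True

    def release(self):
        fcntl.flock(self._fd, fcntl.LOCK_UN)
        os.close(self._fd)
        self._fd = None
```

Each call to `acquire()` opens the file anew, so two `FlockMutex` objects pointing at the same path contend with each other whether they live in the same process or in different ones.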
A less prominent feature of Dogpile ported from Beaker is the ability to provide a mutex against the actual resource being read and created, so that the creation function can perform certain tasks only after all reader threads have finished. An example of this is when the creation function has prepared a new datafile to replace the old one, and would like to switch in the new file only when other threads have finished using the old one.
To enable this feature, use SyncReaderDogpile. SyncReaderDogpile.acquire_write_lock() then provides a safe-write lock for the critical section where readers should be blocked:
from dogpile import SyncReaderDogpile

dogpile = SyncReaderDogpile(3600)

# takes no arguments, like the other creation functions;
# it uses the module-level "dogpile" object
def some_creation_function():
    create_expensive_datafile()
    with dogpile.acquire_write_lock():
        replace_old_datafile_with_new()

# usage:
with dogpile.acquire(some_creation_function):
    read_datafile()
With the above pattern, SyncReaderDogpile will allow concurrent readers to read from the current version of the datafile as the create_expensive_datafile() function proceeds with its job of generating the information for a new version. When the data is ready to be written, the SyncReaderDogpile.acquire_write_lock() call will block until all current readers of the datafile have completed (that is, they’ve finished their own Dogpile.acquire() blocks). The some_creation_function() function then proceeds, as new readers are blocked until this function finishes its work of rewriting the datafile.
Note that the SyncReaderDogpile approach is useful when working with a resource that itself does not support concurrent access while being written, namely flat files and possibly some forms of DBM file. It is not needed when dealing with a datasource that already provides a high level of concurrency, such as a relational database, Memcached, or NoSQL store. Currently, the SyncReaderDogpile object only synchronizes within the current process among multiple threads; it won’t at this time protect from concurrent access by multiple processes. Beaker did support this behavior, however, using lock files, and this functionality may be re-added in a future release.
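The reader/writer coordination described in this section can be sketched with a condition variable from the standard library. This is a simplified illustration of the general technique, assuming threads within one process; `ReadWriteMutex` is a hypothetical name and not the class SyncReaderDogpile uses internally.

```python
import threading
from contextlib import contextmanager


class ReadWriteMutex:
    """Many concurrent readers, or one writer with no readers."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writing = False

    @contextmanager
    def read(self):
        with self._cond:
            while self._writing:
                self._cond.wait()
            self._readers += 1
        try:
            yield
        finally:
            with self._cond:
                self._readers -= 1
                self._cond.notify_all()

    @contextmanager
    def write(self):
        with self._cond:
            while self._writing:
                self._cond.wait()
            self._writing = True          # stops new readers from entering
            while self._readers:
                self._cond.wait()         # let current readers drain
        try:
            yield
        finally:
            with self._cond:
                self._writing = False
                self._cond.notify_all()
```

Setting the writing flag before waiting for readers to drain means new readers queue up behind the writer, which mirrors the behavior described above where new readers are blocked until the datafile has been rewritten.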