{% extends "layout.html" %} {% block title %}getting started{% endblock %} {% block head %} {% endblock %} {% block body %}

fedmsg analytics

This is the web portal for statscache, a system to continuously produce statistics and analyses based on current and historical traffic on the fedmsg bus. It works in tandem with the datagrepper service to ensure that its data represents the results of processing the continuous, unbroken history of all fedmsg traffic.

exploration

Check out the dashboard to see what kind of stats we're collecting so far. If you see a model that looks interesting, click on it to go to the direct feed, where you can drill down into the model's entire, historical data set{# and play around with some graphs#}.

architecture

In and of itself, statscache is just a plugin-based framework to support continuous, in-order message processing. It is built on top of the fedmsg consumption interface, which is itself a thin layer atop the underlying Moksha framework. Functionally, statscache acts like any other listener on the fedmsg bus, simply passing received messages on to plugins, but it offers several enhancements to the underlying frameworks that make it particularly amenable to statistics gathering:

extensibility

The plugin-based architecture, combined with the power of SQLAlchemy and the convenience of fedmsg.meta, makes it almost trivial to add a new analysis model to statscache, without even needing any knowledge of the specific message representation. To get statscache to load and run your plugin, all you have to do is include its class as an entry-point under statscache.plugin in your package's setup.py. Shown here is a plugin to measure the relative 'popularity' of users, as determined by the number of messages about them on the fedmsg bus.

from statscache.plugins import BasePlugin, BaseModel
from datetime import datetime
import fedmsg.meta
import sqlalchemy as sa

class Model(BaseModel):
    # inherited from BaseModel: timestamp = sa.Column(sa.DateTime, ...)
    references = sa.Column(sa.Integer, nullable=False)
    username = sa.Column(sa.UnicodeText, nullable=False, index=True)

class Plugin(BasePlugin):
    name = "popularity"
    summary = "Number of messages regarding a user"
    description = """
    Track the total number of direct references to a username over time.
    """

    model = Model

    def __init__(self, *args, **kwargs):
        super(Plugin, self).__init__(*args, **kwargs)
        self.pending = []

    def process(self, message):
        """ Process a single message and cache the results internally """
        timestamp = datetime.fromtimestamp(message['timestamp'])
        for username in fedmsg.meta.msg2usernames(message):
            self.pending.append((timestamp, username))

    def update(self, session):
        """ Commit all cached results to the database """
        for (timestamp, username) in self.pending:
            previous = session.query(self.model)\
                       .filter(self.model.username == username)\
                       .order_by(self.model.timestamp.desc())\
                       .first()
            session.add(self.model(
                timestamp=timestamp,
                username=username,
                references=getattr(previous, 'references', 0) + 1
            ))
        self.pending = []
        session.commit()
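As noted above, statscache discovers plugins through the statscache.plugin entry-point group. A minimal setup.py registering the example plugin might look like the following sketch (the package and module names are placeholders; only the entry-point group name comes from the text above):

```python
# Hypothetical setup.py for a package shipping the "popularity" plugin.
# "statscache_popularity" is an illustrative module name; the entry-point
# group "statscache.plugin" is what statscache scans at start-up.
from setuptools import setup

setup(
    name='statscache-popularity',
    version='0.1.0',
    py_modules=['statscache_popularity'],
    install_requires=['statscache'],
    entry_points={
        'statscache.plugin': [
            'popularity = statscache_popularity:Plugin',
        ],
    },
)
```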

continuity

Unless explicitly overridden, statscache delivers to plugins each and every message successfully published to the fedmsg bus. Of course, downtime does happen, and that has to be accounted for. Luckily, datagrepper keeps a continuous record of the message history on the fedmsg bus (subject to its own reliability). On start-up, each plugin reports the timestamp of the last message it successfully processed, and statscache transparently uses datagrepper to fill in the gap in the plugin's view of the message history.
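To make the backfill concrete, the kind of datagrepper query involved can be sketched as follows. This is only an illustration of the gap-filling idea, not statscache's internal code; the /raw endpoint and its start, order, rows_per_page, and page parameters follow datagrepper's public API:

```python
from urllib.parse import urlencode

DATAGREPPER_URL = "https://apps.fedoraproject.org/datagrepper/raw"

def backfill_url(last_timestamp, page=1, rows_per_page=100):
    """Build a datagrepper query covering the gap since `last_timestamp`
    (seconds since the epoch). Requesting 'order=asc' yields the missed
    messages oldest-first, so a plugin can replay them in their
    original order before resuming live processing.
    """
    params = urlencode({
        'start': last_timestamp,
        'order': 'asc',
        'rows_per_page': rows_per_page,
        'page': page,
    })
    return DATAGREPPER_URL + '?' + params
```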

ordering

Although the fedmsg libraries do support backlog processing, old messages are simply interwoven with new ones, making it a very delicate process. Not only is this inconvenient (for example, when calculating running statistics), it can be highly problematic, as an interruption during this process can be practically impossible to recover from.

With statscache, this weak point is avoided by guaranteeing strict ordering: each and every message is delivered after those that precede it and before those that succeed it, as determined by their official timestamps. This does mean that restoring from a prolonged downtime can take quite a long time, as all backprocessing must complete prior to the processing of any new messages.
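The ordering guarantee can be illustrated with a toy sketch (this is not statscache's implementation): given a replayed backlog and a buffer of live messages, each already sorted by timestamp, a lazy timestamp-ordered merge delivers every message after those that precede it and before those that succeed it:

```python
import heapq

def ordered_stream(backlog, live):
    """Interleave two timestamp-sorted message iterables into one strictly
    timestamp-ordered stream. Because backlog timestamps precede live ones
    in practice, the backlog drains completely before live processing
    resumes, matching the behavior described above.
    """
    return heapq.merge(backlog, live, key=lambda m: m['timestamp'])
```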

concurrency

Without concurrency, statscache would have to severely restrict the complexity of message processing performed by plugins or risk crumbling under high traffic on the bus. Built on top of Twisted, statscache's concurrency primitives easily facilitate a work-queue usage pattern, which allows expensive computations to accumulate during periods of high message frequency without impairing the real-time processing of incoming messages. When activity eventually cools down, the worker threads can churn through the accumulated tasks and get things caught up.
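Statscache itself builds this pattern on Twisted, but the work-queue idea can be sketched with standard-library primitives: the listener keeps enqueueing cheap work items while a worker thread churns through the expensive processing at its own pace.

```python
import queue
import threading

def worker(tasks, results):
    """Drain tasks from the queue until a None sentinel arrives. The
    enqueueing (listener) side never blocks on this processing."""
    while True:
        item = tasks.get()
        if item is None:  # sentinel: shut down
            break
        results.append(item * item)  # stand-in for an expensive analysis
        tasks.task_done()

tasks = queue.Queue()
results = []
t = threading.Thread(target=worker, args=(tasks, results))
t.start()
for n in range(5):  # "messages" may arrive faster than they are processed
    tasks.put(n)
tasks.join()        # the worker catches up once traffic cools down
tasks.put(None)
t.join()
```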

restoration

Successfully recovering from outages is a necessity for reliable and accurate analytics, and statscache was created with this in mind. If a crash occurs at any point during execution, statscache is able to recover fully without data loss.

integration

In production, statscache can be used as a backend service via its REST API. Currently, statscache supports exporting data models in either CSV or JSON format, although the serialization mechanism is extensible (check out the source!). The API will be very familiar if you've programmed against datagrepper before. For more details, please see the reference page.
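As a sketch of what consuming the two export formats looks like on the client side (the payloads and field names below are illustrative only; see the reference page for the real endpoints), the same records come back either as JSON objects or as CSV rows:

```python
import csv
import io
import json

# Hypothetical payloads mimicking one model row exported in each format.
json_payload = '[{"timestamp": "2015-07-01T12:00:00", "username": "alice", "references": 3}]'
csv_payload = "timestamp,username,references\r\n2015-07-01T12:00:00,alice,3\r\n"

rows_from_json = json.loads(json_payload)
rows_from_csv = list(csv.DictReader(io.StringIO(csv_payload)))

# Both formats carry the same records; only the types differ,
# since CSV fields are always strings.
assert rows_from_json[0]['username'] == rows_from_csv[0]['username'] == 'alice'
```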

{% endblock %}