Author: | Jacob Smullyan |
---|---|
Contact: | jsmullyan@gmail.com |
Organization: | WNYC New York Public Radio |
Date: | 2007-01-02 |
Revision: | 132 |
Version: | 0.62 |
Copyright: | Copyright 2006 WNYC New York Public Radio. |
Ewa (East-West Audio) is an application that manages the production of spliced mp3 files. It is meant to solve a common audio production problem facing producers of audio for the web who would like to add extra content such as credits and promotional or underwriting messages to their mp3s and be able to update those messages periodically without having to remaster all their mp3s from scratch.
Ewa makes a distinction between content files and extra files. The content files contain the material of main interest; the extra files are the promotional material. Ewa assumes that one resultant mp3 is normally an aggregation of exactly one content file with any number of extra files. [1]
In order to know what extra files to combine with a particular content file, ewa consults a rule, which is a function that takes the name of the requested file and returns an ordered list of files that should be combined. The rule ewa consults is usually a special kind of rule called a rule list that contains a list of sub-rules; the first sub-rule that matches, i.e., that returns a non-empty list, is the return value of the parent rule. Ewa provides a expressive mini-language for specifying such rule lists; with it, rules can apply only for filenames that match glob or regular expression patterns, or that match date formats, or combinations of such criteria with "and", "or", and "not" operators. Also, rules can be made active only for certain date ranges, so you can add configuration in January that will only become effective in February. If the rule system isn't flexible enough and you have special needs, it is also feasible to plug in your own, implemented in Python.
Ewa also manages transcoding the extra audio to match the content files with which it may be spliced. Master files for each piece of extra audio are placed in a directory managed by ewa, and ewa transcodes them as needed, leaving the transcoded files in another managed directory for future runs. The masters may be in mp3, wav, or aiff format.
Ewa can operate either in batch mode, in which case it produces combined files for the content files specified on the command line, or in server mode.
The batch mode, controlled by the script ewabatch, is convenient if you can't run your own persistent processes on the web server serving your audio files; you can use it to produce static files on a machine under your own control and rsync them up to the web server however often you need. Batch mode can operate either on individual files or recursively on entire directories.
If you are using ewabatch to generate all your files, there is no need for any integration with the web server at all; you just need a cron job to generate the files periodically and perhaps rsync them. But if you have a large number of files, this can become unwieldy. Not only must each new file must be processed before going live, but the costs in time and bandwidth of changing intros for a large number of mp3s and having to rsync them up to your webserver may be prohibitive if undertaken frequently. It is highly inefficient to have to move thousands of files just because an intro has changed. The ewa server gets around this problem.
The server mode, controlled by the script ewa, is a persistent daemon running a simple WSGI application, normally connected to a web server via FCGI or SCGI, which generates the composite files on demand, caching them on the filesystem and rebuilding them upon request with a configurable frequency. The path to the composite file is passed to the webserver via the X-Sendfile technique (which originated with lighttpd and is also supported by apache with the mod_xsendfile module; nginx also has a similar feature). The webserver is responsible for returning the actual file over HTTP, and ewa does not need to do IO; as a result, ewa processes each request extremely quickly, and files are served at almost the same speed as static files, with excellent scaleability.
Ewa has a few limitations that the user should be aware of.
Some or all of these may be addressed in future revisions, depending on community interest.
You will occasionally read that mp3s cannot be reliably spliced, as mp3 frames may store information used by later frames in the bit reservoir. This is not quite true; the reality is that mp3s cannot be reliably cut and spliced. In ewa, all the mp3s are spliced on preexisting mp3 boundaries; they are not cut (except to drop a bad frame at the end of a file). Obviously, the last frame in an mp3 does not store content in the bit reservoir for subsequent frames. Therefore, the bit reservoir does not present a problem for ewa.
Ewa attempts to produce spliced files that are without bad frames; to do so, it looks at the frames preceding frame boundaries and discard broken ones. However, ewa also attempts to splice very quickly, and hence cannot scan entire mp3s to clean them; if the mp3s going into ewa are broken, the ones coming out will be too.
Ewa has been developed and tested on Linux, but should work fine on any flavor of BSD, including Mac OS X, and commercial UNIX implementations. It hasn't been tested on Windows, but in future might work there in whole or in part. Please note that some parts of this manual presuppose a UNIX platform.
Ewa is written in Python, and requires Python 2.4 or later. In addition, the following Python packages need to be installed:
To run tests you also need:
Ewa also requires that lame be installed for transcoding. To run the ewa server, you need to run an http server that supports X-Sendfile or something equivalent: either lighttpd, apache with mod_xsendfile, or possibly nginx.
Ewa releases are available in binary and source form from http://cheeseshop.python.org/pypi/ewa.
If you want to follow the bleeding edge development version, you can check out the latest source code from our subversion repository:
svn co svn://svn.wnyc.org/public/ewa/trunk ewa
To install, if you already have setuptools installed, you can simply do:
easy_install ewa
Or, if you have already installed the source tarball and have unpacked it, cd into it and type:
easy_install .
or equivalently:
python setup.py install
The latter will install setuptools if you don't already have it.
Ewa expects audio to be stored in a directory structure like:
- $basedir/main
- Your content mp3s go here; you manage this directory and can organize it however you like. Ewa needs read access to it.
- $basedir/extras/masters
- Your "extra" files -- intros, outros, ads, etc. -- go here; you manage this directory also. Ewa needs read access to it also.
- $basedir/extras/transcoded
- Ewa manages this directory and needs write access to it; it stores transcoded versions of the audio files extras/masters here.
- $targetdir
- Ewa manages this directory and needs write access to it; this is where it stores the spliced files.
basedir and targetdir are configuration-defined. You must specify basedir in ewa.conf; targetdir will default to $basedir/combined if not otherwise specified.
Some care is necessary to ensure that file permissions will be right for your deployment, especially if you are running both the ewa server and ewa batch processes, as a variety of users may then be creating files in the managed directories.
One approach is to create a user and group that the ewa server will run as, give ownership of the managed directories to it, and make them both group-writeable and the group permissions sticky. On Linux, you might do this:
groupadd ewa useradd -g ewa -s /bin/false -d $targetdir -c "ewa user" ewa chown -r ewa:ewa $targetdir $basedir/extras/transcoded chmod -r g+ws $targetdir $basedir/extras/transcoded
While you are at it, creating directories for ewa's pid file and log file isn't a bad idea:
mkdir -p /var/{run,log}/ewa && chown ewa /var/{run,log}/ewa
In ewa.conf you'll want to set the user and group variables to match the user and group you created. If you do this, ewa and ewabatch will need to be run as root (in the case of ewabatch, most conveniently through sudo), but will drop credentials to your user/group before it creates any files.
Ewa has two configuration files: ewa.conf, for adminstrative options, and a rule configuration file, which is used to determine the playlists.
ewa.conf is written in Python; keys defined there that don't start with an underscore become attributes of the ewa.config.Config object. The following are meaningful keys:
if you want to rotate your logfiles, set this to one of the following:
The rule file can be written either in Python or in a special configuration mini-language, ewaconf. [3]
A rule file in Python format gives you maximum flexibility, at the cost of requiring you to know Python and understand the ewa API. The Python file can contain anything as long as it defines a global with the name rules, which should be a Python callable that, when called, returns an iterator that yields symbolic names for the files that should be combined. (These names will be interpreted as file paths relative to the extras/masters managed directory, unless they have the Python attribute is_original set to a true value, in which case, they will interpreted as file paths relative to the main managed directory.) With this hook you can load into ewa just about any sort of rule system that you might like to devise.
Ewa's default rule configuration format is designed to make it easy to define a list of rules that say, for a given mp3 file, what files ewa should combine to make an aggregate file, and in what order. The rules are consulted in order, and checked to see if they match the input mp3 file; the first one that matches returns a list of files to combine, and those are then combined. ewaconf only supports a limited number of rule types, but nonetheless the system is quite powerful.
A rule is normally written in the form:
condition [options]: pre: [file1,file2...] post: [file1,file2...]
where a condition is a glob pattern, a regex pattern, or a date specification, or combinations of the above with with the logical operators and, or, and not. The pre and post lists indicate what files should go before or after the main content file in the aggregate file ewa produces. Condition options are put in brackets after the condition and separated by commas; they can either be a single symbol, such as F or I, or a name-value pair, separated by =. For example:
bigband*.mp3 [I]: pre: [bigbandintro.wav] post: [bigbandoutro.wav] regex:schwartz.*: pre: [] post: [] and(09/01/2006 - 11/01/2006 [F,fmt=YYYYMMDD], or(lopate/*, bl/*): pre: [] post: [specialoutro.mp3]
The regular expression follows Python regular expression rules. If you want a regex to ignore case, you can pass the I option. Two other regex options are supported: U (unicode) and L (locale). These correspond to the same options in the Python re module. For more information, see the official Python documentation.
Globs support only one option: I. By default, globs are case-sensitive, but if this option is passed they will ignore case. (Globs are implemented with Python's fnmatch module.)
Both globs and regexes can contain arbitary characters if they are delimited with either single or double quotation marks. They can also be written without quotation marks, with some restrictions. Spaces are not permitted for either; for regexes, colons and commas must be escaped with a preceding backslash. Unquoted globs are furthermore restricted to alphanumeric characters, forward slashes, asterisks, question marks, underscores, and periods. When in doubt, quote.
Hint
Both globs and regexes need to match the entire path to requested file, relative to the main content file directory ($basedir/main); and furthermore globs and regexes have different matching behavior, in that a regex will match as long it matches against the beginning of the target string, but a glob needs to match all the way to the end. So if someone requests http://bozoland.org/dingdong/frogling.mp3, the path against which your pattern will be matched will be dingdong/frogling.mp3, without a leading forward slash. *frogling* would match it, as would regex:.*frogling; frogling.mp3 wouldn't, and neither would dingdong, but regex:dingdong would.
The date options are F, T, and the name-value option fmt. F and T are incompatible. T is the default (so its use is actually not necessary except perhaps for readability); it means that the condition will return true only if the current time matches against the date range specified.
F means that the date is matched against the filename using a regular expression derived from a format (the fmt option); the default format is MMDDYYYY. Formats may be specified with the following symbols:
Any additional characters in the format become a literal part of the regular expression. The fmt option has no meaning and may not be used when matching against the current time.
If the pre and post lists are both empty, the special form default may be used. Also, if a rule applies unconditionally, the condition may be omitted. Therefore, the following four forms are equivalent:
*: pre: [], post: [] *: default pre: [], post: [] default
For regex rules, it is possible for the filenames in the pre and post lists to back-reference named groups in the matching regex. Named or numbered group references can be used, with either a shell-like interpolation style:
regex:^/shows/(?P<showname>[^/]+)/.*\.mp3: pre: [intro/$showname.mp3, ad/${showname}.mp3] post: [notices/$1.mp3, outro/${1}.mp3]
or the style used by backreferences in Python regular expression expansions:
regex:^/shows/(?P<showname>[^/]+)/.*\.mp3: pre: [intro/\\g<showname>.mp3] post: [outro/\\1.mp3]
Warning
Back-references can be used with compound conditions only if they refer to the last matching element in the compound condition -- and if the last element is itself compound, the last matching element of it, etc. For instance, the first of the next two rules will work, and the second will not:
# if this matches, the match result of the regex will be # returned, and the back-reference will work and(>01-01-2001, (and(*nougat*, regex:"(foo|bar)"))): pre: [$1.mp3] post: [] # if this matches, the match result of the date match # will be returned, and back-references don't work # with those, so the literal string '$1.mp3' will be # used instead -- probably not what you want and((and(*nougat*, regex:"(foo|bar)"), >01-01-2001)): pre: [$1.mp3] post: []
With and, the last matching element will always the very last element. With or, however, that is not the case -- as soon as one of an or compound condition's sub-matchers matches, that match is returned and subsequent sub-matchers are ignored.
It is convenient under some circumstances to nest lists of rules, with a conditional qualifier shared by all of them. To do this, enclose the nested list of rules in matching brackets:
regex:shows/(?P<showname>[^/]+)/.*: [ <=09-01-2005 [F]: default 09-02-2005 - 10-14-2006 [F]: pre: [intro/$showname.mp3] post: [] >10-15-2006 [F]: pre: [current.mp3] post: [current.mp3] ]
For a complete reference, see the grammar specification below.
The ewabatch script, when run with the -t option, will perform a syntax check on both ewa.conf and the rulefile, and either exit with a Syntax OK message or blow up with a possibly helpful traceback.
Two recommended options for integrating ewa with a web server are discussed below. [4]
First of all, enable fastcgi in ewa.conf. If you are using lighttpd in version 1.4.11 or lower, set sendfile_header to 'X-LIGHTTPD-send-file'.
Then use something like the following lighttpd configuration:
server.modules = ( "mod_access", "mod_fastcgi", "mod_accesslog", "mod_staticfile" ) server.document-root = "/path/to/basedir" server.errorlog = "/var/log/lighttpd/error.log" server.port = 80 accesslog.filename = "/var/log/lighttpd/access.log" fastcgi.server = ( "/" => ( "127.0.0.1" => ( "host" => "127.0.0.1", "port" => 5000, "check-local" => "disable", # note: this is for lighttpd < 1.5. # for 1.5, apparently you do instead: # proxy-core.allow-x-sendfile = "enable" "allow-x-send-file" => "enable" ) ) )
TBD. This should be a fairly straightforward combination of mod_scgi and mod_xsendfile.
Below are summaries of the commandline options of ewa and ewabatch, and also for a third less important program, ewasplice, which provides lower-level access to ewa's splicing facilities.
usage: ewabatch [options] [files]
Produces a combined MP3 file according to the specified rules.
-h, --help | show this help message and exit |
-c CONFIGFILE, --config=CONFIGFILE | |
path to ewa config file | |
-r, --recursive | |
recurse through directories | |
--rulefile=RULEFILE | |
specify a rulefile | |
-d, --debug | print debugging information |
-n, --dry-run | don't do anything, just print what would be done |
-e ENGINE, --engine=ENGINE | |
which splicing engine to use (default ewa splicer, mp3cat, or sox) | |
-a, --absolute | interpret file paths relative to the filesystem rather than the basedir (default: no) |
-t, --configtest | |
just test the config file for syntax errors |
Hint
With both ewabatch and ewa, if you don't specify a config file, ewa will look for it in ~/.ewa/ewa.conf and /etc/ewa.conf.
usage: ewa [options]
Starts ewa's WSGI application that produces combined MP3 files according to the specified rules.
-h, --help | show this help message and exit |
-c CONFIGFILE, --config=CONFIGFILE | |
path to ewa config file | |
-D, --nodaemonize | |
don't daemonize, regardless of config settings |
usage: ewasplice [options] files
This utility splices MP3 files together using the ewa splicer, but doesn't use the managed directories or perform automatic transcoding. You have to specify a file as "tagfile" so it knows where to get id3 tags.
-h, --help | show this help message and exit |
-o OUT, --output=OUT | |
output file (default: stdout) | |
-t TAGFILE, --tagfile=TAGFILE | |
tag file | |
-d, --debug | print debugging information |
-s, --sanitycheck | |
sanity check the input mp3 files | |
-e ENGINE, --engine=ENGINE | |
which splicing engine to use (default ewa splicer, mp3cat, or sox) |
The below is an EBNF grammar for the rule configuration format:
grammar := cond_rule [','? cond_rule]* rulelist := '[' cond_rule [','? cond_rule]* ']' cond_rule := [cond ':']? rule rule := simplerule | rulelist simplerule := prelist ','? postlist | postlist ','? prelist | 'default' prelist := 'pre' ':' speclist postlist := 'post' ':' speclist speclist := '[' [specifier [',' specifier]*]? ']' specifier := string string := BAREWORD | QWORD cond := cond_expr | simple_cond cond_expr := cond_op '(' cond [',' cond]+ ')' cond_expr := NOT '(' cond ')' cond_op := 'and' | 'or' simple_cond := regex | glob | datespec regex := BAREREGEX condopts? | QREGEX condopts? glob := string condopts? datespec := daterange condopts? daterange := [date '-' date] | [ datecompare date ] | date datecompare := '<' | '<=' | '>' | '>=' | '=' date := DATE | DATETIME condopts := '[' condopt [',' condopt]* ']' condopt := BAREWORD | BAREWORD '=' BAREWORD
The tokens that the lexer must produce will be:
- BAREWORD
- an unquoted string with alphanumeric characters, asterisks, backslashes, question marks, underscores, or periods.
- QWORD
- a string delimited by single or double quotation marks. Internal quotation marks of the same type used as the delimiter must be escaped.
- BAREREGEX
- a string that matches a regex; should start with regex:, followed by an unquoted string with the same restrictions as BAREWORD above.
- QREGEX
- like a BAREREGEX, but the regex, after the regex: prefix, is delimited by single or double quotation marks, and escaping (except of quotation marks) is not necessary.
- DATE
- MM-DD-YYYY format. The separator can also be a slash (/) or a period (.), but the same separator must be used in both positions.
- DATETIME
- MM-DD-YYYY HHMM format. The separator can also be a slash or period, as with DATE, and the space before the hour can be either a space or the previously used separator.
- DEFAULT
- 'default'
- PRE
- 'pre'
- POST
- 'post'
- AND
- 'and'
- OR
- 'or'
- OP
- '<', '<=', '>', '>=', '='
- DASH
- '-'
- COMMA
- ','
- COLON
- ':'
- LBRACK
- '['
- RBRACK
- ']'
- LPAREN
- '('
- RPAREN
- ')'
Any text on a line after a pound sign (#) is a comment and is ignored. Whitespace, including line returns, is ignored between tokens. Indentation may be freely used to clarify patterns.
# test rule file. # comments and blank lines are ignored. 08/01/2005-12/01/2006: default >08/08/2006: pre: [lumpy.mp3], post: [] shows/bl/*: pre: [intro/bl.mp3, ad/generic.mp3] post: [outro/bl.mp3] regex:"^/shows/(?P<nick>[a-z][a-z0-9]+)/(?P=nick).*\.mp3" [I]: pre: [intro/newyear.mp3], post: [] =01/20/2001 [F, fmt=MMDDYYYY]: [ shows/studio/*: pre: [] post: [outro/studio.mp3] shows/pingpong/*: pre: [foomanchu.mp3,bingo.mp3] post: [] pre: [plop.mp3], post: [] ] default
[1] | There are use cases in which you might want more than one content file -- one for each segment of a radio program, for instance -- but this usage is not currently supported. |
[2] | The stated default value of this config variable, and of the several following which refer to the configuration of the FCGI and SCGI daemons, are actually enforced by flup; the value help in Config object for all of them is actually None. |
[3] | Actually, there is a third format -- a special dialect of JSON -- but it isn't very useful and may be dropped in a future release. |
[4] | Other options are possible. In addition to nginx, mentioned elsewhere, it would be possible run ewa's WSGI application in another WSGI container or even a CGI. With Apache's mod_rewrite it is possible to detect whether a static file is available and serve it directly if so, and only call a splicing backend if not, which, if X-Sendfile were not available, could accomplish much the same thing with an external redirect. |