The EDDIE-Tool (commonly just called EDDIE) is an agent for system, network and security monitoring. It is highly customizable and easily extendable. It has been designed to be as platform-independent as possible, with platform-specific code limited to a small group of modules, making it easily portable to new platforms. It is fully written in Python and the configuration has a Python "look-and-feel" to it, although no Python or coding skills are necessary to configure it.
This user's manual is specific to EDDIE-Tool versions 0.29 and above, as some significant changes were made to improve the configuration. These changes can be read here. The user's manual for earlier versions can be read here.
You need to download the following:
The global configurables are usually in eddie.cf and are listed below:
The EDDIE configuration follows the standard Python code format. Just as Python indicates the methods or child objects of an object by indenting them beneath the parent object definition, EDDIE indicates the sub-objects or parameters of a directive object by indenting them beneath the parent object definition. For example, a part of the configuration may look like:
group testing:
    PING testping:
        host="10.0.0.1"
        numpings=10
        rule="not alive"
        action=email("chris", "%(host)s failed ping")

FILE file1:
    file='/tmp/file1.tmp'
    scanperiod='2m'
    rule='not exists'
    action=ticker("%(file)s does not exist", timeout=1)
    act2ok=ticker("%(file)s now exists", timeout=1)

A config group called "testing" is defined, then the PING directive "testping" is configured inside this group because it is indented. Similarly, all testping's arguments are indented as they belong to the PING directive configuration. The second directive, FILE called "file1", is at the same indentation level as the group definition (i.e., not indented) and is therefore a global directive. Thus, all hosts using this example config would execute the FILE directive, but only those hosts in the "testing" group would execute the PING directive.
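The indentation rule described above can be illustrated with a small Python sketch. This is not EDDIE's actual parser; `parse_indented` is a hypothetical helper that only shows how indentation depth can decide which group or directive a line belongs to.

```python
def parse_indented(text):
    """Return (parent_header, line) pairs based purely on indentation."""
    stack = []    # (indent_width, header) for each currently open block
    result = []
    for raw in text.splitlines():
        if not raw.strip():
            continue
        indent = len(raw) - len(raw.lstrip(" "))
        line = raw.strip()
        # Close every open block that is not less indented than this line.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        parent = stack[-1][1] if stack else None
        result.append((parent, line))
        if line.endswith(":"):
            stack.append((indent, line[:-1]))
    return result


config = """\
group testing:
    PING testping:
        host="10.0.0.1"
        rule="not alive"
FILE file1:
    file='/tmp/file1.tmp'
    rule='not exists'
"""

parents = dict((line, parent) for parent, line in parse_indented(config))
print(parents["PING testping:"])   # group testing
print(parents["FILE file1:"])      # None, i.e. a global directive
```

The PING directive reports the group as its parent because it is indented beneath it, while the unindented FILE directive has no parent at all.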
DIRECTIVE name:
    argument1=value1
    [argument2=value2 ...]

where "DIRECTIVE" is the directive name, like PROC or FS, and "name" is the user-defined, unique name of this directive object. The arguments customize the directive appropriately. Some arguments are directive-specific while others are common to all directives.
PROC test:
    name='syslogd'
    rule='not exists'
    scanperiod='30s'
    action=email("alert@my.domain","syslogd is not running")

This is an example definition of a PROC directive, called 'test'. It contains the PROC-specific argument, 'name'. 'rule', 'scanperiod' and 'action' are arguments which are common to all directives. Some arguments are optional while others are required, and errors will be raised if they are missing. In this example 'name' and 'rule' are required. 'scanperiod' and 'action' are optional.
An EDDIE configuration can be simple to get basic monitoring started quickly and made as complicated as required to perform advanced operations. A simple example rules file is shown below to monitor basic services on a host. This rules file, named simple.rules, would be placed in the same directory as eddie.cf and eddie.cf would contain the entry
# Process checks
PROC syslogd:
    name='syslogd'
    rule='not exists'
    action=email('root', '%(name)s is not running on %(h)s')

PROC inetd:
    name='inetd'
    rule='not exists'
    action=email('root', '%(name)s is not running on %(h)s')

PROC sshd:
    name='sshd'
    rule='not exists'
    action=email('root', '%(name)s is not running on %(h)s')

# Filesystem checks
FS root:
    fs='/'
    rule='pctused > 90'
    action=email('root', '%(mountpt)s over 90%% on %(h)s')

FS varlog:
    fs='/var/log'
    rule='pctused > 90'
    action=email('root', '%(mountpt)s over 90%% on %(h)s')

# Service Port checks
SP smtp_port:
    port='smtp'
    protocol='tcp'
    bindaddr='0.0.0.0'
    rule='not exists'
    action=email('root', '%(protocol)s/%(port)s on %(h)s is not listening')

SP http_port:
    port='http'
    protocol='tcp'
    bindaddr='0.0.0.0'
    rule='not exists'
    action=email('root', '%(protocol)s/%(port)s on %(h)s is not listening')

# System statistics checks
SYS loadaverage:
    rule="loadavg1 > 3.00"
    scanperiod='1m'
    action=email('root', '%(h)s load-average > 3.00')
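The `%(name)s` placeholders and the doubled `%%` in the action strings above look like Python's dictionary-based string formatting. Assuming that is how the substitution behaves (this manual does not state it outright), a short sketch shows what an expanded message would look like. The variable values here are made up for illustration.

```python
# Hypothetical values a directive might supply for substitution.
variables = {"mountpt": "/var/log", "h": "host1", "pctused": 93.5}

template = "%(mountpt)s over 90%% on %(h)s"
message = template % variables   # '%%' collapses to a literal percent sign
print(message)                   # /var/log over 90% on host1
```

This is why a literal percent sign in a message has to be written as `%%`.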
DIRECTIVE name:
    arg1=value1
    arg2=value2
    argn=valuen

Where "DIRECTIVE" is the name of the directive itself (see Built-in Directives); "name" is a user-defined name of the directive definition (the directive ID is usually constructed as "DIRECTIVE.name", e.g., "FS.root", and will appear in the logs, console, etc); "args" are arguments to define what the directive should do and how it should do it. Some arguments are common to all directives and others are specific to that type of directive.
FS export00_grow:
    fs='/export/00'
    scanperiod='1m'
    history = 1
    rule='(pctused - history[1].pctused) > 5'
    action=email('root','%(mountpt)s grew to %(pctused)s')

Example 2, alert if average filesystem growth over last 3 sample periods is too high:
FS export01_avg_grow:
    fs='/export/01'
    scanperiod='1m'
    history = 3
    rule='(pctused - history[3].pctused) / 3.0 > 10'
    action=email('root','%(mountpt)s grew to %(pctused)s')

Note that at startup, rules will not run until enough history data is available. So, in the previous example, the directive would wait for three scanperiods before it had enough history data (three sample periods) to be able to evaluate the rule. [Eddie 0.31+]
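The way a history-based rule waits for enough samples can be sketched in Python. This is an illustration only, not EDDIE's implementation: `Sample` and `evaluate` are hypothetical names, and the sketch assumes `history[1]` refers to the most recent earlier sample, as the first growth example suggests.

```python
from collections import namedtuple

Sample = namedtuple("Sample", ["pctused"])


def evaluate(rule, current, previous, depth):
    """Evaluate rule once `depth` earlier samples exist; None means 'not yet'."""
    if len(previous) < depth:
        return None
    # history[1] is the newest earlier sample; index 0 is unused padding.
    history = [None] + previous[::-1][:depth]
    env = {"pctused": current.pctused, "history": history}
    return bool(eval(rule, {"__builtins__": {}}, env))


rule = "(pctused - history[1].pctused) > 5"
samples = [Sample(40.0), Sample(42.0), Sample(50.0)]
results = [evaluate(rule, s, samples[:i], 1) for i, s in enumerate(samples)]
print(results)   # [None, False, True]
```

The first scan yields no verdict because no earlier sample exists yet; the rule only fires on the third scan, when usage jumps by 8 points.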
checktime='day=="mon" or day=="tue"'
checktime='day in weekdays and hour > 18'

[Eddie 0.33+]
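A checktime expression of this kind can be pictured as a Python expression evaluated against a few time-derived names. The sketch below is an assumption-laden illustration: `checktime_allows` is a hypothetical helper, and the meaning of `weekdays` (Monday to Friday, as three-letter names) is assumed rather than taken from this manual.

```python
import time

# Assumed meaning of `weekdays`; not confirmed by the manual.
WEEKDAYS = ("mon", "tue", "wed", "thu", "fri")
DAY_NAMES = ("mon", "tue", "wed", "thu", "fri", "sat", "sun")


def checktime_allows(expr, when):
    """Evaluate a checktime-style expression for a time.struct_time."""
    env = {
        "day": DAY_NAMES[when.tm_wday],   # tm_wday: Monday is 0
        "hour": when.tm_hour,
        "weekdays": WEEKDAYS,
    }
    return bool(eval(expr, {"__builtins__": {}}, env))


tuesday_evening = time.strptime("2006-01-03 19:30", "%Y-%m-%d %H:%M")
saturday_evening = time.strptime("2006-01-07 19:30", "%Y-%m-%d %H:%M")

print(checktime_allows('day=="mon" or day=="tue"', tuesday_evening))
print(checktime_allows('day in weekdays and hour > 18', saturday_evening))
```

The first call is allowed because 3 January 2006 was a Tuesday; the second is not, because the 7th was a Saturday.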
As rule expressions are evaluated in a Python environment, links to related Python documentation are provided below.
System Monitoring:
COM
COM is a generic directive used to perform custom
checks that no other directive covers. It simply executes the
given command in a sub-shell, and captures the stdout/stderr and
return value for testing by the directive rule.
Security note: if EDDIE is run as root, the config files should not
be world-writable, because directives like COM can execute arbitrary
commands on the system.
COM-specific Arguments:
cmd="/bin/ls /tmp/*.tmp | wc -l"
rule='out == "test"'            # true if stdout is just "test"
rule='out.find("test") != -1'   # true if stdout contains "test"
rule='int(out) > 5'             # true if out (converted to an integer) is > 5
rule='int(ret) != 0'            # true if return value of the cmd is not 0
rule='int(outfield1) != 0'      # true if stdout field 1 is not 0
action=email("alert", "the command '%(cmd)s' failed.")
# Check load average (the hard way, without using SYS)
COM loadavg:
    cmd="uptime | cut -d, -f4 | awk '{print $3}'"
    rule="float(out) > 6.0"
    action=email("alert", "Load on %(h)s is > 6.0")

# Check number of netscapes running
COM count_ns:
    cmd="ps -ef | grep netscape | wc -l"
    rule="int(out) > 3.0"
    action=email("alert", "There are %(out)s netscapes running on %(h)s")

# A variation on checking load average, using 'outfield' variables
COM loadavg:
    cmd="uptime | cut -d, -f4"
    rule="float(outfield3) > 6.0"
    action=ticker("Load on %(h)s is %(outfield3)s", timeout=1)
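To make the COM variables more concrete, here is a minimal Python sketch of running a command in a sub-shell and exposing `out`, `ret` and the `outfieldN` fields to a rule. It is not EDDIE's code: `run_com` is a hypothetical helper, the `err` name for captured stderr is an assumption, and the sketch assumes fields are split on whitespace.

```python
import subprocess


def run_com(cmd):
    """Run cmd in a sub-shell and build the variables a COM rule could test."""
    proc = subprocess.run(cmd, shell=True, stdout=subprocess.PIPE,
                          stderr=subprocess.PIPE, universal_newlines=True)
    env = {
        "out": proc.stdout.strip(),
        "err": proc.stderr.strip(),   # assumed name, for illustration
        "ret": proc.returncode,
    }
    # outfield1, outfield2, ... hold whitespace-separated fields of stdout.
    for i, field in enumerate(env["out"].split(), start=1):
        env["outfield%d" % i] = field
    return env


env = run_com("echo 3 files found")
fired = eval("int(outfield1) > 2", {"__builtins__": {"int": int}}, env)
print(env["out"], env["ret"], fired)
```

As the security note above implies, anything passed to a shell this way runs with the privileges of the monitoring process.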
FILE
This is a directive for performing checks on files or changes to files.
Rules can be written based on any changes to the file metadata, like
modification date, size, ownership, permissions, etc. It can also
pick up changes to the file itself, which can be useful as a security
check.
FILE-specific Arguments:
# Alert when /etc/passwd changes
FILE passwd_change:
    file='/etc/passwd'
    rule='mtime != lastmtime'
    action=email('alert','%(file)s has been modified.')

# Alert when 'ps' changes
FILE ps_change:
    file='/bin/ps'
    rule='md5 != lastmd5'
    action=email('alert','%(file)s has changed.')

# Alert if file not owned by root
FILE file_root:
    file='/usr/local/bin/testfile'
    rule='uid != 0'
    action=email('alert','%(file)s uid is %(uid)s.')

## Simple test that cron is working
## crontab should have an entry like:
## 0,15,30,45 * * * * /bin/touch /var/run/eddie/cron.test
FILE cron_test:
    file='/var/run/eddie/cron.test'
    rule='exists and mtime < (now-15*60)'   # file modified over 15 minutes ago
    action=email("alert", "Cron test failed.", "%(file)s mtime=%(mtime)s now=%(now)s")

# Make sure this file isn't a symlink
FILE check_file:
    file='/etc/passwd'
    rule='issymlink'
    action=email('alert','%(file)s should not be a symlink !!')

# Alert if a file disappears
FILE file_missing:
    file='/etc/passwd'
    rule='missing'
    action=email('alert','%(file)s has disappeared')
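The kind of comparison behind a rule such as 'md5 != lastmd5' can be sketched in Python. This is only an illustration of the idea, not how EDDIE gathers its data: `file_facts` is a hypothetical helper that reuses some of the variable names from the examples above.

```python
import hashlib
import os
import tempfile


def file_facts(path):
    """Collect a few facts about path, named after the variables used above."""
    if not os.path.lexists(path):
        return {"exists": False}
    st = os.lstat(path)
    facts = {"exists": True, "mtime": st.st_mtime, "uid": st.st_uid,
             "size": st.st_size, "issymlink": os.path.islink(path)}
    if os.path.isfile(path):
        with open(path, "rb") as handle:
            facts["md5"] = hashlib.md5(handle.read()).hexdigest()
    return facts


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "watched.txt")
    with open(path, "w") as handle:
        handle.write("version 1\n")
    last = file_facts(path)
    with open(path, "w") as handle:
        handle.write("version 2\n")
    now = file_facts(path)
    changed = now["md5"] != last["md5"]   # mirrors rule='md5 != lastmd5'
    gone = file_facts(os.path.join(tmp, "absent.txt"))

print(changed, gone["exists"])
```

Comparing checksums catches content changes even when the modification time has been reset, which is why it is useful as a security check.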
FS
The FS directive is used to perform checks on local filesystems.
Alerting when the filesystem is full would be the most common use for
this directive.
FS-specific Arguments:
fs='/var/log'
rule='pctused > 95.0'   # true if filesystem over 95% full
rule='avail < 350000'   # true if less than about 350MB available
action=email("alert", "%(mountpt)s at %(pctused)s%% - %(avail)d of %(size)d remain")
# alert if / over 95% full
FS root:
    fs='/'
    rule='pctused > 95'
    action=email("alert", "%(mountpt)s is over 95%% full on %(h)s")

# alert if /var has less than 100MB available
FS var:
    fs='/var'
    rule='avail < 100*1024'
    action=email("alert", "%(mountpt)s has only %(avail)dkB free on %(h)s")
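As a rough illustration of where values like `pctused` and `avail` could come from, the following Python sketch uses the standard library's `shutil.disk_usage`. It is not EDDIE's data collector: `fs_facts` is a hypothetical helper, and the sketch assumes sizes are expressed in kB as the `avail < 100*1024` example suggests.

```python
import shutil


def fs_facts(mountpt):
    """Return filesystem figures in kB plus the percentage used."""
    usage = shutil.disk_usage(mountpt)
    pctused = 100.0 * usage.used / usage.total if usage.total else 0.0
    return {
        "mountpt": mountpt,
        "size": usage.total // 1024,
        "used": usage.used // 1024,
        "avail": usage.free // 1024,
        "pctused": pctused,
    }


facts = fs_facts("/")
# Combine the two example rules above into one expression.
alert = eval("pctused > 95 or avail < 100*1024", {"__builtins__": {}}, facts)
print("%(mountpt)s at %(pctused).1f%%, %(avail)d kB available" % facts)
```

Note that the percentage reported by tools such as df may differ slightly, since they usually account for space reserved for the super-user.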
HTTP
This is a directive for performing remote HTTP and HTTPS tests against
web servers.
The elapsed connection time is recorded, and all related
connection variables are made available, such as response code, headers and
returned message body, as well as error information if the connection
failed.
SSL-support must be compiled into Python for HTTPS connections.
The POST method is not yet supported.
HTTP-specific Arguments:
request_timeout
argument);
and 0 (false) if not.
[Eddie 0.33+]
If failed:
If not failed:
# Check our web site is up.
HTTP website:
    url='http://www.my.domain.name/index.html'
    rule='failed'
    action=email('alert', '%(url)s failed', 'exception: %(exception)s\nerrno: %(errno)s\nerrstr: %(errstr)s')

# Check a certain page hasn't disappeared.
HTTP mypage:
    url='http://www.my.domain.name/~fred/fred.html'
    rule='failed or not ok'
    action=email('fred', '%(url)s failed')

# Store web site response time in RRD db.
HTTP web_time:
    url='http://our.website.com/'
    rule='not failed'
    scanperiod='5m'
    action=elvinrrd('http-%(h)s_%(hostname)s', 'time=%(time)f')
    actelse=email('alert', 'Connection failed to %(url)s')
IF
The IF directive provides a mechanism for testing network interfaces.
Interfaces listed in "netstat -i" are available for testing.
The test can be simply whether the interface exists on a host or not;
or it can be a more complex rule based on various statistics about that
interface.
IF-specific Arguments:
rule='not exists' # true if interface doesn't exist
# alert if eth0 interface has disappeared
IF ethexists:
    name='eth0'
    rule='not exists'
    action=email('alert', 'interface %(name)s has disappeared on %(h)s')

# alert if input packet errors are greater than 10% (Solaris)
IF ierrs:
    name='hme0'
    rule="100.0*ierrs/ipkts > 10.0"
    action=email('alert', 'input packet error > 10%% on %(name)s')
LOGSCAN
The LOGSCAN directive provides a facility to watch files for important
messages. Every line in the file can be matched, or for a busy system,
selective lines can be picked out using a regular expression pattern.
Commonly the resulting lines are emailed to an admin, but any standard
EDDIE action could also be performed with the results.
This directive works by initially finding the end of the file on its
first 'scan' and storing this location. On the second and subsequent
scans, the directive will scan all the new lines of the file that have
been added since the previous scan, and finish by storing the new location
of the end-of-file.
If, however, the file has been truncated
(i.e., perhaps a log rotation has occurred) the directive will
scan all lines from the start of the truncated file.
Note: it is possible that some lines may be missed in between scanning
the file and the file being truncated (or log rotation) if the scanperiod
is not short enough. It is recommended that the scanperiod be short if the
file is updated frequently (i.e., for a busy logfile).
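The scanning behaviour described above (remember the end of the file, read only what was appended, restart from the top after truncation) can be sketched in Python as follows. This is a minimal illustration under those stated rules, not EDDIE's implementation; `LogScanner` is a hypothetical class.

```python
import os
import re
import tempfile


class LogScanner:
    def __init__(self, path, regex=".*"):
        self.path = path
        self.pattern = re.compile(regex)
        self.offset = None            # unknown until the first scan

    def scan(self):
        size = os.path.getsize(self.path)
        if self.offset is None:       # first scan: only remember end-of-file
            self.offset = size
            return []
        if size < self.offset:        # file shrank: assume truncation/rotation
            self.offset = 0
        with open(self.path, "rb") as handle:
            handle.seek(self.offset)
            data = handle.read()
            self.offset = handle.tell()
        lines = data.decode("utf-8", "replace").splitlines()
        return [line for line in lines if self.pattern.search(line)]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "messages")
    with open(path, "w") as handle:
        handle.write("old line\n")
    scanner = LogScanner(path, regex="error")
    first = scanner.scan()
    with open(path, "a") as handle:
        handle.write("disk error\nall fine\n")
    second = scanner.scan()
    with open(path, "w") as handle:   # simulate a log rotation
        handle.write("new error\n")
    third = scanner.scan()

print(first, second, third)
```

The sketch also shows the limitation noted above: lines written after the last scan but before the truncation are never seen.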
LOGSCAN-specific arguments:
rule='matchedcount > 0'   # true if 1 or more lines matched (default rule)
rule='lines != ""'        # true if some lines matched (same result as above)
action=email("alert", "%(matchedcount)d lines matched scan, they are:\n%(lines)s")
# Email all entries from /var/log/messages to alert every 12 hours.
LOGSCAN messages:
    file='/var/log/messages'
    regex='.*'
    scanperiod='12h'
    action=email("alert", "%(h)s:%(file)s", "-- Logscan matched %(matchedcount)d lines: --\n%(lines)s")

# Email lines from /var/log/httpd/error_log, and ignore "notice" messages
# (the square brackets are escaped so that the literal text "[notice]" is matched)
LOGSCAN httpd_error:
    file='/var/log/httpd/error_log'
    regex='.*\[notice\].*'
    negate=true
    action=email("alert", "%(h)s:%(file)s", "-- Logscan matched %(matchedcount)d lines: --\n%(lines)s")
METASTAT
This directive, part of the Solaris directives module, allows simple
checks to be performed on Solaris Disksuite devices.
Currently it only checks whether any metadevices require maintenance,
but will be expanded in the future.
METASTAT-specific Arguments:
rule='need_maintenance' # true if a metadevice needs maintenance
# Check if any metadevice requires maintenance
METASTAT maintenance:
    rule='need_maintenance'
    action=email('alert', 'A metadevice on %(h)s requires maintenance')
NET
The NET directive provides an interface to the kernel network statistics
usually provided by a call to 'netstat -s'. Simple or complex rules can
be written using these statistics.
Linux Note: network stats counters are now collected from '/proc/net/snmp'.
Try a 'cat /proc/net/snmp' to see what counters are available.
NET-specific arguments:
rule='TcpInErrs > 0' # true if TcpInErrs counter greater than 0 (Linux)
action=email("alert", "There have been %(TcpInErrs)d TCP Input Errors on %(h)s")
# alert if any UDP input errors (Solaris)
NET udpinerr:
    rule="udpInErrors > 0"
    action=email('alert', '%(h)s has had %(udpInErrors)s UDP input errors')
PID
The PID directive is used to perform simple checks using the pid files
which some programs generate. The most basic check is whether the pid file
exists or not, which can often indicate whether the program is running or
not; the second most basic check makes sure the pid found in the pid file
also belongs to a process in the process table.
PID-specific Arguments:
pidfile="/var/run/syslog.pid"
rule='not exists'               # true if pidfile not found
rule='exists and not running'   # true if pidfile exists but pid not valid running process
action=email("alert", "the pid-file %(pidfile)s does not exist")
# alert if the sshd pid file doesn't exist
PID sshdpid1:
    pidfile='/var/run/sshd.pid'
    rule='not exists'
    action=email("alert", "sshd pid file not found on %(h)s")

# alert if the sshd pid doesn't match the process table
PID sshdpid2:
    pidfile='/var/run/sshd.pid'
    rule='exists and not running'
    action=email("alert", "sshd pid not a valid process on %(h)s")
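The two checks described for PID (does the pid file exist, and does its pid belong to a live process) can be sketched in Python for Unix-like systems. This is an illustration, not EDDIE's code: `pid_facts` is a hypothetical helper, and it relies on signal 0, which on Unix tests for a process without disturbing it. Do not run this approach on Windows, where `os.kill` behaves differently.

```python
import os
import tempfile


def pid_facts(pidfile):
    """Return 'exists' and 'running' flags for a pid file (Unix only)."""
    facts = {"exists": os.path.exists(pidfile), "running": False, "pid": None}
    if not facts["exists"]:
        return facts
    try:
        with open(pidfile) as handle:
            facts["pid"] = int(handle.read().strip())
        os.kill(facts["pid"], 0)      # signal 0: existence check only
        facts["running"] = True
    except PermissionError:
        facts["running"] = True       # process exists but is owned by another user
    except (ValueError, ProcessLookupError):
        pass                          # unparsable file or no such process
    return facts


with tempfile.TemporaryDirectory() as tmp:
    pidfile = os.path.join(tmp, "demo.pid")
    missing = pid_facts(pidfile)
    with open(pidfile, "w") as handle:
        handle.write("%d\n" % os.getpid())
    present = pid_facts(pidfile)

print(missing["exists"], present["exists"], present["running"])
```

A stale pid file left behind by a crashed daemon would give 'exists' true and 'running' false, which is the case the second example rule targets.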
PING
This directive provides a facility for checking the availability of
hosts on a network. It allows ICMP ping checks to be performed and
rules and actions can be written based on whether the remote host is
alive, packet loss and round trip times.
PING-specific arguments:
rule='not alive'           # true if the host did not respond
rule='pktloss > 50.0'      # true if greater than 50% packet loss
rule='avgtriptime > 1.5'   # true if avg RTT greater than 1.5 seconds
action=email("alert", "avg RTT from %(h)s to %(host)s is %(avgtriptime)f")
# Alert if host not responding
PING foo:
    host="foo.domain.name"
    rule="not alive"
    action=email('alert', 'host foo is not responding to pings')

# Alert via ticker if there is any packet-loss.
PING badpings:
    host='10.0.0.5'
    numpings=20
    rule='pktloss > 0.0'
    scanperiod='1m'
    action=ticker("%(host)s packetloss=%(pktloss)0.1f%% avgrtt=%(avgtriptime)f sec")
POP3TIMING
The POP3TIMING directive is used to measure the performance of a POP3
server. EDDIE connects to the given POP3 server/port and logs in as
the given user, then performs some standard commands before closing
the connection. Each step of the connection is timed, and the results
are stored in variables to be used by the action(s).
Besides timing information, this directive can also be used to perform
basic checks on a POP3 server. A variable is set if the connection
fails, so simple rules can be written to test this.
POP3TIMING-specific Arguments:
server='10.0.0.12'
server='pop3.my.domain:10110'
user='fred'
password='foo'
rule='not connected'        # true if connection failed
rule='connecttime > 2.0'    # true if time to connect over 2 seconds
rule='connecttime+authtime+listtime+retrtime > 5.0'   # true if whole session took over 5 seconds
action=email("alert", "connection to %(server)s took %(connecttime)f seconds")
POP3TIMING pop3test:
    server='pop3.domain.com'
    user='fred'
    password='foo'
    rule='connected'
    action=email('mary', 'host=%(server)s, username=%(username)s, connecttime=%(connecttime)f, authtime=%(authtime)f, listtime=%(listtime)f, retrtime=%(retrtime)f')
    actelse=email('alert', 'POP3 connection to %(server)s failed')
PORT
The PORT directive tests remote TCP based services. The simplest test
is to determine whether the service is accepting remote connections on
a given TCP port.
The test can be made more complex
by defining send and expect strings. The send string
will be sent to the remote host after connecting, and any reply will
be matched against the expect string (a regular expression).
The check fails if the result does not match.
PORT-specific arguments:
rule='not alive'            # true if the connection could not be opened
rule='not matched'          # true if the received string did not match the expect string
rule='connect_time > 0.5'   # true if connect time greater than 0.5 seconds
# check that 10.0.0.5 is accepting connections on port 80
PORT webcheck:
    host='10.0.0.5'
    port=80
    rule='not alive'
    action=email('alert', 'port 80 not responding on 10.0.0.5')

# check that a host is accepting connections on port 25
# and greets with an SMTP "220" banner
PORT smtpcheck:
    host='ahost.domain.name'
    port=25
    expect='220 '
    rule='not alive or not matched'
    action=email('alert', 'port 25 problem on ahost.domain.name')
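The connect, send and expect sequence described for PORT can be sketched with Python's standard `socket` and `re` modules. This is an illustration rather than EDDIE's implementation: `port_check` is a hypothetical helper, and the demo talks to a throwaway local server that greets like an SMTP daemon instead of a real remote host.

```python
import re
import socket
import threading
import time


def port_check(host, port, send=None, expect=None, timeout=5.0):
    """Connect, optionally send a string, and match any reply against expect."""
    facts = {"alive": False, "matched": False, "connect_time": None, "recv": ""}
    start = time.time()
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return facts                  # could not open the connection
    facts["alive"] = True
    facts["connect_time"] = time.time() - start
    try:
        if send is not None:
            conn.sendall(send.encode("ascii"))
        if expect is not None:
            facts["recv"] = conn.recv(1024).decode("ascii", "replace")
            facts["matched"] = re.search(expect, facts["recv"]) is not None
    except OSError:
        pass                          # treat read/write errors as "not matched"
    finally:
        conn.close()
    return facts


# Throwaway local server so the example does not need a real remote host.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)


def greet_once():
    client, _ = server.accept()
    client.sendall(b"220 demo ready\r\n")
    client.close()


threading.Thread(target=greet_once, daemon=True).start()
result = port_check("127.0.0.1", server.getsockname()[1], expect="220 ")
server.close()
print(result["alive"], result["matched"])
```

A rule such as 'not alive or not matched' would stay false for this result, since the connection opened and the banner matched the expect pattern.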
PROC
The PROC directive is used to perform process checks. In the simplest
case it is used to check if a process is not running when it should be
(or running when it should not be). More complex rules can also be
written, using most of the process statistics such as memory-usage,
owner, percentage cpu used, running time, etc.
PROC-specific Arguments:
name='syslogd'
rule='not exists'    # true if process not running
rule='pcpu > 50.0'   # true if process using over 50% CPU
action=email("alert", "the %(name)s process is not running")
# alert if cron is not running
PROC cron:
    name='cron'
    rule='not exists'
    action=email("alert", "cron is not running on %(h)s")

# syslog has a memory leak - alert if using over 50MB
PROC syslogmem:
    name='syslogd'
    rule='vsz > 50*1024'
    action=email("alert", "%(name)s is using %(vsz)d kBytes")
RADIUS
The RADIUS directive provides a facility for performing radius authentication
checks.
RADIUS-specific Arguments:
RADIUS radtest:
    server='radius.domain.name:1812'
    secret='s3cr3t'
    user='bob@domain.name'
    password='b0bm@t3'
    rule='not passed'
    action='email("alert", "radius FAILED to %(host)s:%(port)d")'
SP
The SP directive is used to perform checks on listening service ports.
These can be either TCP or UDP ports. The simplest use is to check if
nothing is currently listening on the given port, protocol and bind
address combination.
SP-specific Arguments:
rule='not exists' # true if port not listening for connections
action=email("alert", "the port %(protocol)s/%(port)s bound to %(bindaddr)s is not listening on %(h)s")
# alert if nothing listening on http port
SP http:
    port='http'
    protocol='tcp'
    bindaddr='0.0.0.0'
    rule='not exists'
    action=email('alert', 'http port not bound to on %(h)s')

# alert if nothing listening on tcp port 22 on 10.0.0.5
SP sshport:
    port=22
    protocol='tcp'
    bindaddr='10.0.0.5'
    rule='not exists'
    action=email('alert', '%(protocol)s port %(bindaddr)s:%(port)s not listening')
DBI
The DBI directive is used to perform database queries (typically SQL), and check
the results.
DBI-specific Arguments:
# test that our postgresql server is alive and responding to requests properly
DBI postgresql_check:
    dbtype='pg'
    host='localhost'
    database='monitoring'
    user='monitoring'
    password='sshhh'
    query='select * from monitoring'
    rule='not connected or results != 1 or result1 != 42'
    action=email(ALERT_EMAIL, 'PostgreSQL DB %(database)s failed test', 'Query: %(query)s\nConnected: %(connected)s\nError: %(errmsg)s')

# alert if too many connections to the Postgres database
DBI db_connections:
    dbtype='pg'
    host='localhost'
    database='mydb'
    user='pgsql'
    password='sekrit'
    query='select count(1) from pg_stat_activity'
    rule='connected and results > 0 and result1 > 40'
    action=email('alert', 'Database %(database)s on %(h)s: too many connections (currently %(result1)s)')
    console='%(database)s on %(host)s : connections = %(result1)d'
SMTP
This directive makes a connection to an SMTP server and returns
the elapsed response time.
SMTP-specific Arguments:
SMTP smtp_test:
    server='mail.mydomain.com'
    rule='connected'
    action=email('alert', "SMTP connection to %(server)s:%(port)s took %(connecttime)s secs")
SNMP
This directive provides an SNMP client to retrieve data from remote
hosts and devices via the SNMP protocol. Multiple values can be
retrieved in one call. Standard EDDIE rules can then perform tests
on the retrieved data, or the data could be stored in RRD files using
the elvinrrd action (for instance).
SNMP-specific Arguments:
# Fetch a counter from a device
SNMP foo:
    host='alt1.domain.name'
    oid='1.3.6.1.4.1.1872.2.1.1.6.0'
    community='private'
    rule='response > 0'
    maxretry=10
    action=email('alert', 'Head for the lifeboats: %(snmpresponse)s')

SNMP router_traffic:
    scanperiod='5m'
    host='10.0.0.1'
    oid='1.3.6.1.2.1.2.2.1.10.2, 1.3.6.1.2.1.2.2.1.16.2'
    community='special'
    rule='not failed'
    maxretry=10
    action=elvinrrd("net-router_BRI01", "ibytes=%(response1)s", "obytes=%(response2)s")
STORE
The STORE directive is still being developed and tested.
It will be documented at a later date.
SYS
The SYS directive provides an interface to the kernel's system statistics.
Simple or complex rules can be written using these statistics.
SYS-specific arguments:
rule='loadavg1 > 2.0' # true if 1-min load-average > 2.0
action=email("alert", "The loadavg on %(h)s is %(loadavg1)0.2f")
# alert if 1 minute load average > 2
SYS loadavg1:
    rule="loadavg1 > 2.0"
    action=email('alert', '%(h)s has a loadavg1 of %(loadavg1)0.2f')
DISK
The DISK directive provides an interface to the kernel's disk I/O statistics.
Simple or complex rules can be written using these statistics.
This requires the data collector diskdevice:DiskStatistics
(which is only available on Solaris and Win32 at time of writing).
[Eddie 0.35+]
Directive-specific arguments:
/usr/bin/kstat -p -c disk
action=email("alert", "Disk %(device)s : rbytes=%(nread)s wbytes=%(nwritten)s")
# /dev/md/dsk/d20 == /var : send read/write counters to RRD
DISK md20_thruput:
    device='md20'
    scanperiod='5m'
    rule='True'   # always perform action
    action='elvinrrd("disk-%(h)s_%(device)s", "rbytes=%(nread)s", "wbytes=%(nwritten)s")'
TAPE
The TAPE directive provides an interface to the kernel's tape I/O statistics.
Simple or complex rules can be written using these statistics.
This has almost exactly the same functionality as the DISK directive.
This requires the data collector diskdevice:TapeStatistics
(which is only available on Solaris at time of writing).
[Eddie 0.35+]
Directive-specific arguments:
/usr/bin/kstat -p -c tape
action=email("alert", "Tape %(device)s : rbytes=%(nread)s wbytes=%(nwritten)s")
# st65 == TAPE : send tape read/write counters to RRD
TAPE st65_thruput:
    device='st65'
    scanperiod='5m'
    rule='True'   # always perform action
    action=elvinrrd("tape-%(h)s_%(device)s", "rbytes=%(nread)s", "wbytes=%(nwritten)s")
log
log performs message logging to a file, the tty where eddie was
executed, or to syslog. The destination depends on the via argument, which is
the second parameter. If via looks like "XXX.YYY", then it is assumed
that syslog type logging is desired. If via begins with a "/",
then it is assumed that logging to a file is desired. If via is
the string "tty", then the message will go to the tty where eddie
was executed. You may specify multiple vias by separating them with a ";",
as in "FACILITY.LEVEL;/path/to/file1.txt;/path/to/file2.log".
Format:
# generate a syslog notification using the LOG_DAEMON facility and LOG_ALERT level
action=log("There is a problem on %(h)s", "DAEMON.ALERT")

# append a message to a log file
action=log("There is a problem on %(h)s", "/var/log/eddie_disk.log")

# display a message on the tty that eddie was started on, and append to eddie.log
action=log("There is a problem on %(h)s", "tty;/var/log/eddie.log")
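The rules for interpreting the via argument (a leading "/" means a file, the string "tty" means the terminal, "XXX.YYY" means syslog, and ";" separates several destinations) can be sketched in Python. `classify_vias` is a hypothetical helper written for illustration; it only classifies the destinations and does not perform any logging.

```python
def classify_vias(via):
    """Split a via string on ';' and classify each destination."""
    targets = []
    for item in via.split(";"):
        item = item.strip()
        if not item:
            continue
        if item == "tty":
            targets.append(("tty", None))
        elif item.startswith("/"):    # checked before '.', as paths may contain dots
            targets.append(("file", item))
        elif "." in item:
            facility, level = item.split(".", 1)
            targets.append(("syslog", (facility, level)))
        else:
            raise ValueError("unrecognised via: %r" % item)
    return targets


targets = classify_vias("DAEMON.ALERT;/var/log/eddie.log;tty")
print(targets)
```

Checking for the leading "/" before looking for a dot matters, because file names such as eddie.log would otherwise be mistaken for a FACILITY.LEVEL pair.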
email
email performs message emailing.
How it goes about sending the email depends on your SENDMAIL and SMTP_SERVERS Eddie config options.
Format:
# generate an email alert
action=email("me@mydomain.com,them@myotherdomain.com", "There is a problem on %(h)s", "Problem age: %(problemage)s")
system
system allows execution of operating system commands.
Format:
# run command to rotate the web log file
action=system("rotate /var/log/web_log")
restart
Runs the command "/etc/init.d/(name) start". Usually used to restart a dead daemon.
Format:
# restart the httpd server
action=restart("httpd")
nice
Change the "nice" value of a running process, either up or down.
Note that raising a process's priority (i.e., lowering its nice value)
requires eddie to be running as super-user.
The process acted upon is the current pid in the dictionary, so
this action only works for PROC and PID directives.
Format:
# change the execution of the process to take a little less time
action=nice("+", 5)

# de-prioritize the process
action=nice(20)
eddielog
This action allows for logging messages to the log file that eddie is
configured to use. Depending on the ADMINLEVEL setting, the message
may also (eventually, depending on ADMIN_NOTIFY setting) get emailed
to the ADMIN.
Format:
# generate an informational message to the eddie log file
action=eddielog("Disk issue on %(h)s: used level is %(pctused)s%%")

# generate a high-priority message to the eddie log file
# (and probably to the ADMIN as well, eventually)
action=eddielog("Disk issue on %(h)s: used level is %(pctused)s%%", 9)
ticker
Send a ticker-type message to an Elvin listener.
Format:
# send a ticker-type message
action=ticker("%(file)s does not exist", timeout=1)
page
Send a page to the specified recipients. Currently implemented as an email.
Format:
# send a page to the ADMIN_PAGER alias
action=pager(ADMIN_PAGER, "Host %(server)s is inaccessible")

# send a page to a Sprint phone
action=pager("734657XXXX@messaging.sprintpcs.com", "Host %(server)s is inaccessible")
elvindb
Send information to a database listener via Elvin.
Data to insert in db can be specified in the data argument as
'col1=data1, col2=data2, col3=data3' or if data is not specified
it will use values sent previously.
Format:
# send data to table "MYTABLE" via elvindb
action=elvindb("MYTABLE", "host=%(h)s,load1=%(load1)s,load5=%(load5)s")
elvinrrd
Send information to a RRDtool database listener via Elvin.
Format:
# send the one-minute load average every minute for this host
SYS loadavg1_rrd:
    rule='True'   # always true
    scanperiod='1m'
    action="elvinrrd('loadavg1-%(h)s', 'loadavg1=%(loadavg1)f')"
netsaint
Send information to a NetSaint listener via Elvin.
Format:
# send the free memory size to the NetSaint consumer
action=netsaint("EddieMem", "Free memory on %(h)s: %(memfree)s", 1)
N name:
    Level 0:
        action1, [action2, ...]
    Level 1:
        action1, [action2, ...]
    Level n:
        action1, [action2, ...]

Levels should range from 0 to 9, with 9 being the most critical. E.g.:
N COMMONALERT:
    # Info
    Level 0:
        email(INFO_EMAIL,INFO)

    # Warning
    Level 1:
        email(ALERT_EMAIL,WARN)

    # Alert
    Level 2:
        email(ALERT_EMAIL,ALERT),ticker(ALERT_P)

    # Serious Alert
    Level 3:
        email(ALERT_EMAIL,ALERT),email(ONCALL_EMAIL,ALERT_P),ticker(ALERT_P)

Actions are defined like function calls, and multiple actions are separated by commas. See Actions for more information.
M groupname:
    MSG msgname1:
        "string1"
        "string2"
    MSG msgname2:
        "string3"
        "string4"

or
M groupname:
    M subgroupname1:
        MSG msgname1:
            "string1"
            "string2"
        MSG msgname2:
            "string3"
            "string4"
    M subgroupname2:
        MSG msgname3:
            "string5"
            "string6"

E.g.:
# Define common messages. These are used by the COMMONALERT notification object
M commonmsg:
    # Define a subgroup of messages to be used by PROC directive actions
    M proc:
        # Warning-level message for email
        MSG WARN:
            "Warning: %(name)s on %(h)s not running"
            "The %(name)s process on %(h)s is not running"

        # Warning-level message for paging or tickertape
        MSG WARN_P:
            "Warn: The %(name)s daemon on %(h)s is not running."
            ""

        # Alert-level message for email
        MSG ALERT:
            "Alert: %(name)s on %(h)s not running"
            """ALERT: The %(name)s daemon on %(h)s is not running. %(problemage)s %(problemfirstdetect)s """

        # Alert-level message for paging or tickertape
        MSG ALERT_P:
            "ALERT: The %(name)s daemon on %(h)s is not running."
            ""
By default every directive is shown in the Console output in the format "<ID> - <state>". This can be changed with the console directive argument, or the directive can be hidden entirely by setting this argument to None.
Substitution variables available to the console argument string are:
Directive examples:
# check root filesystem usage
FS rootfs:
    fs='/'
    rule="pctused > 95"
    action=email("root", "%(mountpt)s at %(pctused)s%%")
    console='%(state)s %(pctused)s%%'

# email me load average every 5mins
SYS loadavg5:
    rule="True"
    action=email('chris', '%(h)s loadavg5: %(sysloadavg5).02f')
    scanperiod='5m'
    console="loadavg5=%(sysloadavg5).02f"

# store root filesystem data in RRD (don't show on Console)
FS root_rrd:
    fs='/'
    rule="True"
    scanperiod='5m'
    action=elvinrrd("fs-%(h)s_root", "used=%(fsused)s", "size=%(fssize)s")
    console=None
Console example:
$ telnet localhost 33343
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
Eddie Console Gateway
FS.rootfs - ok 33%
SYS.loadavg5 - loadavg5=0.14
Connection closed by foreign host.
Solaris:
System stats from '/usr/bin/uptime':
    uptime - time since last boot (string)
    users - number of logged on users (int)
    loadavg1 - 1 minute load average (float)
    loadavg5 - 5 minute load average (float)
    loadavg15 - 15 minute load average (float)

System counters from '/usr/bin/vmstat -s' (see vmstat(1M)):
    ctr_swap_ins - (long)
    ctr_swap_outs - (long)
    ctr_pages_swapped_in - (long)
    ctr_pages_swapped_out - (long)
    ctr_total_address_trans_faults_taken - (long)
    ctr_page_ins - (long)
    ctr_page_outs - (long)
    ctr_pages_paged_in - (long)
    ctr_pages_paged_out - (long)
    ctr_total_reclaims - (long)
    ctr_reclaims_from_free_list - (long)
    ctr_micro_hat_faults - (long)
    ctr_minor_as_faults - (long)
    ctr_major_faults - (long)
    ctr_copyonwrite_faults - (long)
    ctr_zero_fill_page_faults - (long)
    ctr_pages_examined_by_the_clock_daemon - (long)
    ctr_revolutions_of_the_clock_hand - (long)
    ctr_pages_freed_by_the_clock_daemon - (long)
    ctr_forks - (long)
    ctr_vforks - (long)
    ctr_execs - (long)
    ctr_cpu_context_switches - (long)
    ctr_device_interrupts - (long)
    ctr_traps - (long)
    ctr_system_calls - (long)
    ctr_total_name_lookups - (long)
    ctr_toolong - (long)
    ctr_user_cpu - (long)
    ctr_system_cpu - (long)
    ctr_idle_cpu - (long)
    ctr_wait_cpu - (long)

Process/memory stats from '/usr/bin/vmstat' (see vmstat(1M)):
    procs_running - number of processes running (int)
    procs_blocked - number of processes blocked (int)
    procs_waiting - number of processes waiting (int)
    mem_swapfree - amount of free swap (kB) (int)
    mem_free - amount of free RAM (kB) (int)

Linux:
    loadavg1 - 1min load average (float)
    loadavg5 - 5min load average (float)
    loadavg15 - 15min load average (float)
    ctr_uptime - uptime in seconds (float)
    ctr_uptimeidle - idle uptime in seconds (float)
    ctr_cpu_user - total cpu in user space (int)
    ctr_cpu_nice - total cpu in user nice space (int)
    ctr_cpu_system - total cpu in system space (int)
    ctr_cpu_idle - total cpu in idle thread (int)
    ctr_cpu%d_user - per cpu in user space (e.g., cpu0, cpu1, etc) (int)
    ctr_cpu%d_nice - per cpu in user nice space (e.g., cpu0, cpu1, etc) (int)
    ctr_cpu%d_system - per cpu in system space (e.g., cpu0, cpu1, etc) (int)
    ctr_cpu%d_idle - per cpu in idle thread (e.g., cpu0, cpu1, etc) (int)
    ctr_pages_in - pages read in (int)
    ctr_pages_out - pages written out (int)
    ctr_pages_swapin - swap pages read in (int)
    ctr_pages_swapout - swap pages written out (int)
    ctr_interrupts - number of interrupts received (int)
    ctr_contextswitches - number of context switches (int)
    ctr_processes - number of processes started (I think?) (int)
    boottime - time of boot (epoch) (int)

HP-UX:
System stats from '/usr/bin/uptime':
    uptime - (string)
    users - (int)
    loadavg1 - (float)
    loadavg5 - (float)
    loadavg15 - (float)

System counters from '/usr/bin/vmstat -s' (see vmstat(1)):
    ctr_swap_ins - (long)
    ctr_swap_outs - (long)
    ctr_pages_swapped_in - (long)
    ctr_pages_swapped_out - (long)
    ctr_total_address_trans_faults_taken - (long)
    ctr_page_ins - (long)
    ctr_page_outs - (long)
    ctr_pages_paged_in - (long)
    ctr_pages_paged_out - (long)
    ctr_reclaims_from_free_list - (long)
    ctr_total_page_reclaims - (long)
    ctr_intransit_blocking_page_faults - (long)
    ctr_zero_fill_pages_created - (long)
    ctr_zero_fill_page_faults - (long)
    ctr_executable_fill_pages_created - (long)
    ctr_executable_fill_page_faults - (long)
    ctr_swap_text_pages_found_in_free_list - (long)
    ctr_inode_text_pages_found_in_free_list - (long)
    ctr_revolutions_of_the_clock_hand - (long)
    ctr_pages_scanned_for_page_out - (long)
    ctr_pages_freed_by_the_clock_daemon - (long)
    ctr_cpu_context_switches - (long)
    ctr_device_interrupts - (long)
    ctr_traps - (long)
    ctr_system_calls - (long)
    ctr_Page_Select_Size_Successes_for_Page_size_4K - (long)
    ctr_Page_Select_Size_Successes_for_Page_size_16K - (long)
    ctr_Page_Select_Size_Successes_for_Page_size_64K - (long)
    ctr_Page_Select_Size_Successes_for_Page_size_256K - (long)
    ctr_Page_Select_Size_Failures_for_Page_size_16K - (long)
    ctr_Page_Select_Size_Failures_for_Page_size_64K - (long)
    ctr_Page_Select_Size_Failures_for_Page_size_256K - (long)
    ctr_Page_Allocate_Successes_for_Page_size_4K - (long)
    ctr_Page_Allocate_Successes_for_Page_size_16K - (long)
    ctr_Page_Allocate_Successes_for_Page_size_64K - (long)
    ctr_Page_Allocate_Successes_for_Page_size_256K - (long)
    ctr_Page_Allocate_Successes_for_Page_size_64M - (long)
    ctr_Page_Demotions_for_Page_size_16K - (long)
The format for specifying time is either:
© Chris Miles 2002-2005
$Id: manual.html 910 2007-12-10 12:48:51Z chris $