Reducing Load on a Distributed Nagios Installation

Our environment consists of a central Nagios server running on a dual Xeon 3 GHz machine with 4 GB RAM and three satellites (collection stations), each running on a single 3 GHz Xeon machine with 2 GB RAM. Each satellite reports its 2,200 checks to the main server as passive checks via NSCA. That results in about 6,600 passive checks to the master from the distributed systems plus 1,200 passive checks delivered by monitored systems, totalling 8,200 checks on the master.

The Problem

Upon setting up this distributed environment, we immediately noticed that the central Nagios instance (I’ll call it the master) started swapping after about twenty minutes. At that point the master’s process table contained over six hundred (600) nagios processes, and the number kept growing every few seconds. After an hour or two, the system crashed because it ran out of swap space.

Messages on the Nagios mailing list and other information on the web always pointed to the service_reaper_frequency configuration setting, which we’d already set as low as possible. (A FAQ entry posted by Ethan Galstad describes our symptoms exactly; unfortunately, his solution didn’t help us at all.)

We then recompiled Nagios with debugging enabled (-DDEBUG0 -DDEBUG1 -DDEBUG2 -DDEBUG3), guessing that the problem lay in the internal IPC pipe which is created between the main Nagios process and its children. Indeed, a process list taken on the Linux box with ps alx showed the six hundred or more children waiting to write to a pipe (pipe_w).

It turns out that the main Nagios process, upon reading the passive checks from the command file (nagios.cmd), starts a host check for each host from which a service is retrieved that reports an Unknown state (of which our installation reported many). These host checks take so long (even though all the hosts are readily available) that the read_svc_message() routine doesn’t drain the pipe fast enough (at least not on our installation), so all the children block while writing to the pipe.

The situation escalates from one minute to the next, as more and more Nagios children are created, each trying to write its result to the IPC pipe.

Our Solution

To prove our theory, we simply replaced the check_icmp program on the master with a dummy:

#!/bin/sh
echo "Hello"
exit 0

without even restarting or reloading the master Nagios. We immediately saw the pipe-reading code in read_svc_message() being called very quickly, and the number of children dropped to a very low number (somewhere in the low tens).

It became clear to us that the problem lay in Nagios calling check_icmp for each critical or unknown passive service check (of which there were many at that moment) reported by the distributed Nagios installations. While it was executing check_icmp, the main Nagios process was blocked and didn’t handle any of the results it was getting from its own children.

We decided to move check_icmp out of Nagios: we gather the required ICMP (ping) data out of band, store it in a database, and replace the check_icmp plugin with a program which simply retrieves the last available value from the database.

Implementation

We installed the following programs and files in /usr/local/nagios/nag; all paths are relative to that directory (but can of course easily be changed). The programs could simply pipe their output to each other, but we wanted the intermediate files for inspection. Feel free to change that if you don’t need them.

Getting the Host Addresses

The grab-ip-addresses script runs hourly from cron, retrieves the list of addresses from the Nagios configuration files, and stores them in a file called nagios.iplist. That file is the input to getpings, which uses fping to do the actual pinging.

#!/bin/sh

cd `dirname $0`
 
HOSTFILES="../etc/hosts.cfg"
 
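# Pull the value of each "address" directive out of the host definitions.
# The list is written to a temporary file and then renamed, so getpings
# never reads a half-written nagios.iplist.
cat ${HOSTFILES} |
        grep '^[[:space:]][[:space:]]*address' |
        awk '{ print $2 ; }' > nagios.iplist.$$ && mv nagios.iplist.$$ nagios.iplist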
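To make the extraction step concrete, here is a minimal sketch of the same idea run against made-up input. The two host definitions and the function names extract_addresses and sample_config are assumptions for illustration only; a real hosts.cfg contains many more directives.

```shell
#!/bin/sh
# Sketch only: the host definitions below are invented sample data.

# Print the value of every "address" directive read from standard input.
extract_addresses() {
    awk '$1 == "address" { print $2 }'
}

# Emit a tiny, hypothetical hosts.cfg fragment.
sample_config() {
    cat <<'EOF'
define host {
        host_name       router1
        address         10.0.242.2
}
define host {
        host_name       server1
        address         10.2.0.4
}
EOF
}

sample_config | extract_addresses
```

Matching on the first field rather than on leading whitespace makes the sketch indifferent to whether the file is indented with tabs or spaces.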

Cron

The crontab entries for grab-ip-addresses and getpings are:

09  * * * *  /usr/local/nagios/nag/grab-ip-addresses
*/5 * * * *  /usr/local/nagios/nag/getpings

Doing the Ping...

getpings invokes fping with the list of IP addresses produced above. It saves the raw output of fping in fping.list and then passes that output to todb.pl, which writes the data to the MySQL database. With the -C 2 option, fping prints each IP address followed by two round-trip times; a time of ‘-’ indicates that no reply was received:

10.0.242.2     : 0.71 0.71
10.2.0.4       : 2.40 1.75
10.66.68.1     : 71.82 -
10.84.28.130   : - -

After producing that list (which can take a while), getpings uses the trivial Perl program todb.pl to insert or update rows in the MySQL database table. If only one of the ping times is reported as unavailable (-), a status of 1 (WARNING) is written to the table. If both times are unavailable, the status is 2 (CRITICAL); otherwise it is 0 (OK).

#!/bin/sh

cd `dirname $0`
 
/usr/sbin/fping -B1.5 -r1 -C 2 -q < nagios.iplist  > /dev/null  2> fping.list
./todb.pl < fping.list 2>/dev/null
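The status rule described above can be sketched on its own, without a database. This is only an illustration: the function name classify_pings is hypothetical, and the input is the sample fping output shown earlier. It relies on the fact that, with two pings per host, the number of lost replies equals the status value.

```shell
#!/bin/sh
# Sketch only: reads "ip : t1 t2" lines and prints "ip status".

classify_pings() {
    awk '
        $2 == ":" {
            lost = 0
            if ($3 == "-") lost++
            if ($4 == "-") lost++
            # 0 = OK, 1 = WARNING (one reply lost), 2 = CRITICAL (both lost)
            print $1, lost
        }'
}

classify_pings <<'EOF'
10.0.242.2     : 0.71 0.71
10.2.0.4       : 2.40 1.75
10.66.68.1     : 71.82 -
10.84.28.130   : - -
EOF
```

For the four sample lines this prints statuses 0, 0, 1 and 2 respectively.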

...and Storing the Results ...

#!/usr/bin/perl
use strict;
use DBI;
 
my $database = "nag";
my $host = 'localhost';
my $dsn = "DBI:mysql:$database:$host";
my $dbuser = "username";
my $dbpass = "password";
my $dbh;
 
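# On failure connect() returns undef, so the error text must come from
# $DBI::errstr rather than from the (undefined) handle.
$dbh = DBI->connect( $dsn, $dbuser, $dbpass)
        or die "Can't connect to $dsn: " . $DBI::errstr;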
 
my $ins = $dbh->prepare("INSERT INTO ping (ip,stat,avtime) VALUES (?, ?, ?)");
my $upd = $dbh->prepare("UPDATE ping SET stat = ?, avtime = ? WHERE ip = ?");
 
while (<>) {
        next unless (/^\d+\.\d+\.\d+\.\d+/);	# IP address at begin of line
        my ($ip, $rest) = split(/\s+:\s+/, $_, 2);
        my $avtime;
        my ($t1, $t2) = split(/\s+/, $rest);
 
        my $stat = 0;
        $stat = 1 if ($t1 eq '-' || $t2 eq '-');    # one reply lost: WARNING
        $stat = 2 if ($t1 eq '-' && $t2 eq '-');
 
        $t1 = 0.0 if ($t1 eq '-');
        $t2 = 0.0 if ($t2 eq '-');
        $avtime = ($t1 + $t2) / 2.0;    # mean of both times; a lost reply counts as 0.0
 
        $ins->execute($ip, $stat, $avtime);     # ignore errors for duplicate keys
        $upd->execute($stat, $avtime, $ip) or
                warn("Cannot update $ip: " . $upd->errstr);
}
$ins->finish;
$upd->finish;
$dbh->disconnect;
exit 0;

... in a Database

I chose to use a small MySQL database table for storing the results, although it is overkill. An embedded SQLite database or even GDBM (for which Perl and C support are readily available) or similar would have more than sufficed. The advantage of MySQL (for us) is that we can use the same data set over the network from the satellite systems, without having to do the ICMP checks redundantly.

The database contains a single table, appropriately called ping, with this schema:

CREATE TABLE `ping` (
  `ip` varchar(15) NOT NULL DEFAULT '',
  `stat` int(1) DEFAULT NULL,
  `modif` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `avtime` float DEFAULT NULL,
  PRIMARY KEY  (`ip`)
)

A SELECT * FROM ping then reports data like the following:

+--------------+------+---------------------+---------+
| ip           | stat | modif               | avtime  |
+--------------+------+---------------------+---------+
| 10.0.242.2   |    0 | 2005-10-13 14:31:37 |  0.71   |
| 10.2.0.4     |    0 | 2005-10-13 14:31:37 |  0.64   |
| 10.66.68.1   |    1 | 2005-10-13 14:31:37 | 71.82   |
| 10.84.28.130 |    2 | 2005-10-13 12:36:37 |    -0   |
+--------------+------+---------------------+---------+

If you’ll be using the xping plugin remotely, don’t forget to give it permission to do so by adjusting the MySQL permissions:

GRANT SELECT ON dbname.ping TO username@'%' IDENTIFIED BY 'password';

xping Plugin

xping is the name of the new Nagios plugin which will replace check_icmp in the Nagios configuration. A few sample runs based on the data collected above should show:

$ /usr/local/nagios/libexec/xping -H 10.2.0.4
OK - 10.2.0.4: 0.64 1.54
$ /usr/local/nagios/libexec/xping -H 10.84.28.130
CRITICAL - 10.84.28.130: -0 -0

The program will of course exit with the appropriate exit code to indicate success or failure to Nagios.
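As a reminder of the plugin convention involved, Nagios reads one line of output and interprets exit code 0 as OK, 1 as WARNING, 2 as CRITICAL and 3 as UNKNOWN. The following sketch shows that mapping in shell; the function names status_label and report are hypothetical and this is not how xping itself is written.

```shell
#!/bin/sh
# Sketch only: maps a numeric status to the label and exit code Nagios expects.

status_label() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        *) echo "UNKNOWN" ;;
    esac
}

# Print one plugin-style line and return the matching exit code.
# Anything outside 0..2 is reported as UNKNOWN (exit code 3).
report() {
    ip=$1 stat=$2 avtime=$3
    case "$stat" in
        0|1|2) ;;
        *) stat=3 ;;
    esac
    echo "$(status_label "$stat") - $ip: $avtime"
    return "$stat"
}

report 10.2.0.4 0 0.64
```

Clamping unexpected values to UNKNOWN keeps a bad database row from being mistaken for a healthy host.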

xping is installed in /usr/local/nagios/libexec/xping. It supports the -H option (the only one which is actually used for the lookup) and accepts the option letters w, c, and t, which are ignored completely. This ought to make it possible to have xping be a drop-in replacement for check_icmp. Option D can be used to specify the database host to connect to. The source code of xping is presented here. It can be streamlined further if deemed necessary, but we think this’ll do the trick for the time being.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sysexits.h>
#include <time.h>
#include <getopt.h>
#include <mysql.h>
 
/*
 * $Header$
 *
 * $Log$
 */
 
#define DBHOST  "localhost"
#define DBBASE  "nag"
#define DBUSER  "username"
#define DBPASS  "password"
 
static char *statii[] = { "OK", "WARNING", "CRITICAL", "UNKNOWN" };
 
int main(int argc, char **argv)
{
    MYSQL db;
    MYSQL_ROW row;
    MYSQL_RES *result;
    char sql[5120];
    int rc;
    char *ip = NULL;
    char *dbhost = DBHOST;
    int ch;
    extern char *optarg;
 
 
    /* Original is called with $USER1$/check_icmp -H $HOSTADDRESS$ -w 4900.0,80% -c 5000.0,100% -t 5
     */
    while ((ch = getopt(argc, argv, "H:D:w:c:t:")) != EOF) {
        switch (ch) {
        case 'H':
            ip = strdup(optarg);
            break;
        case 'D':
            dbhost = strdup(optarg);
            break;
        case 'w':
        case 'c':
        case 't':
            break;
        default:
            fprintf(stderr, "Usage: %s -H hostaddress [-D dbhost] -w xx -c xx -t xx\n", *argv);
            exit(EX_USAGE);
        }
    }
 
    if (ip == NULL) {
        fprintf(stderr, "Usage: %s -H IP-address\n", *argv);
        exit(EX_USAGE);
    }
 
    mysql_init(&db);
    if (!mysql_real_connect(&db, dbhost, DBUSER, DBPASS, DBBASE, 0, NULL, 0)) {
        fprintf(stderr, "Failed to connect to database: Error: %s\n",
            mysql_error(&db));
        return 3;       /* UNKNOWN: without the database the state is not known */
    }
 
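    /* snprintf keeps an over-long -H argument from overflowing sql[].
     * NB: ip is not escaped; it is assumed to come from the Nagios config. */
    snprintf(sql, sizeof(sql),
        "SELECT ip, stat, avtime FROM ping WHERE ip = '%s'", ip);

    rc = mysql_query(&db, sql);
    if (rc) {
        fprintf(stderr, "Failed to execute SQL command {%s}: Error (%d): %s\n",
            sql, mysql_errno(&db), mysql_error(&db));
        rc = 3;         /* UNKNOWN rather than the raw MySQL return code */
        goto end;
    }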
 
    rc = 3; // UNKNOWN
    result = mysql_store_result(&db);
    if (result) {
        my_ulonglong nrows = mysql_num_rows(result);
 
        if (nrows == 1) {
            row = mysql_fetch_row(result);
 
            /*
             * Original Nagios output is
             *  OK - 10.71.21.1: rta 45.872ms, lost 0%|rta=45.872ms;200.000;500.000;0; pl=0%;40;80;;
             *  CRITICAL - 10.0.8.160: rta nan, lost 100%|rta=0.000ms;200.000;500.000;0; pl=100%;40;80;;
             */
 
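            /* A NULL or out-of-range stat is treated as UNKNOWN (3) */
            rc = row[1] ? atoi(row[1]) : 3;
            if (rc < 0 || rc > 2)
                rc = 3;

            if (rc != 3)
                printf("%s - %s: %s\n",
                    statii[rc], ip, row[2] ? row[2] : "-");
        }
        mysql_free_result(result);
    }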
 
   end:
        mysql_close(&db);
 
    if (rc == 3) {      // IP not found
        printf("UNKNOWN - %s not found in pingDB\n", ip);
    }
    return (rc);
}

Voila!

For our installation, this worked wonders. Our main Nagios server is quite idle now (Thomas & Günther? remember: you owe me 3GB RAM ;-) )

And that is all.

Future

We’ll see how this works out during the next few months.

After thinking about this implementation a bit more, next time I’ll probably implement it with a CDB which we can then just rsync out to the collection stations, thereby avoiding:

  1. the overhead of an (otherwise unnecessary) MySQL database
  2. the TCP overhead incurred by each call to the xping plugin, which retrieves database results over a (possibly unavailable) network
  3. a quite heavyweight MySQL connection for each xping invocation (of which there are several thousand per five-minute interval)
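To illustrate the idea of a local lookup file, here is a rough sketch in which a plain text file stands in for the CDB; it is not a CDB implementation. The file layout ("ip status avtime" per line) and the function name lookup_ping are assumptions made for this example.

```shell
#!/bin/sh
# Sketch only: a plain text file stands in for the CDB mentioned above.
# Each line is assumed to hold "ip status avtime".

lookup_ping() {
    # $1 = data file, $2 = IP address; prints "status avtime", or nothing
    awk -v ip="$2" '$1 == ip { print $2, $3; exit }' "$1"
}

datafile=$(mktemp) || exit 1
cat > "$datafile" <<'EOF'
10.0.242.2 0 0.71
10.84.28.130 2 0
EOF

lookup_ping "$datafile" 10.84.28.130
rm -f "$datafile"
```

A file like this could be copied to the collection stations with rsync, so each lookup needs neither a network round trip nor a database connection.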

Jan-Piet Mens 2005-10-15 11:41

 
nagios/icmp.txt · Last modified: 2006-09-22 13:44
 