This document offers an introduction to writing translators for the Nagios monitoring system. It provides two case studies - a simple translator for the Kerberos component, and a more complex example monitoring the Cosign system.
A translator is a perl class which provides a set of methods as defined in LCFG::Monitoring::Interfaces::Translator.
It is responsible for taking in a set of LCFG resources and producing one (or more) fragments of Nagios configuration information.
Only a single instance of a translator class will exist at any one time; that instance's translate() or notifyUnchanged() method will be called once for every machine where the translator's component is being monitored.
The LCFG monitoring system architecture attempts to optimise translation as much as possible. For this reason, if a machine's configuration profile has not changed, the translate() method will not be called, and the configuration fragment last produced by the translator will be reused. Translators may override this behaviour by providing an alwaysRun() method which returns true. Note, however, that this will have a significant effect on the performance of the monitoring system as a whole.
In effect, translators are run in the following fashion:
$translator = new <YourTranslator> ($server, $component);

push @configuration, $translator->startPass();

foreach $machine (@machines) {
    if ($profileChanged || $translator->alwaysRun()) {
        my $fragment = $translator->translate($machine, $componentProfile);
        push @configuration, $fragment;
        $configurationCache{$machine} = $fragment;
    } else {
        $translator->notifyUnchanged($machine);
        push @configuration, $configurationCache{$machine};
    }
}

push @configuration, $translator->finishPass();
By the end of the run, the @configuration array contains the configuration of the monitoring system. This is obviously a simplified description - in particular, it illustrates only one component's translator being invoked.
As mentioned above, the formal interface description of these methods is contained in the interface definition - LCFG::Monitoring::Interfaces::Translator. However, here's a more informal look at each of these methods.
This creates a new instance of your translator class. Generally, you'll be able to inherit this from either the LCFG::Monitoring::Nagios::Translators::Base or LCFG::Monitoring::Nagios::Translators::SimpleCheck classes (of which more below).
The constructor takes two arguments. $server is an LCFG::Monitoring::Nagios::Server object, which provides access to a number of server-specific objects maintaining information about the entire nagios configuration state. In particular, its userManager() method provides access to an implementation of LCFG::Monitoring::Nagios::Interfaces::UserManager, which must be used by your translate() function to register any contact details it stores (of which more below). $component is the name of the component being monitored, which will usually be the same as the name of your class.
This is called at the start of each fresh configuration run. It can be used to output static configuration data that should be included at the start of the configuration file, or to reset per-run data held within the $translator object.
This is the meat of the translation system, and as such will be discussed in much more detail below. It takes the name of the machine (in $machine) and a LCFG::XML::Node tree rooted at the component's data, and returns a LCFG::Monitoring::Interfaces::ConfigFragment configuration fragment.
Return true if the translator should always be invoked, regardless of whether the machine's profile has changed or not. Setting alwaysRun() may simplify the implementation of complex translators, but has significant performance penalties. In particular, the monitoring system doesn't parse unchanged profiles unless explicitly required to do so in order to execute a translator; setting a translator to alwaysRun() will force parsing of every machine that translator is used on for every compilation cycle.
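For translators that genuinely need this behaviour, the override is trivial - a minimal sketch:

sub alwaysRun {
    # Force this translator to run on every compilation cycle, even for
    # machines whose profiles are unchanged. Use sparingly - see above.
    return 1;
}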
Inform the translator that the profile for $machine is unchanged. This method does not produce any output; it's designed for use in complex translators as a means of maintaining internal state information.
Signals to the translator that the current compilation run is complete. The translator may use this opportunity to output a further configuration fragment, possibly based on composite results generated throughout the run. Translators which decide to do this must save their own internal state, such that the output of this function remains identical even when machines are unmodified (and so are processed by notifyUnchanged(), rather than by translate()).
Before considering translator construction in more detail, it is important to understand some portions of Nagios configuration. The Nagios manual goes into all of this in far more detail, and should be used as the definitive reference. What follows is just an introduction.
Nagios supports two different kinds of service checks - passive and active. With passive checks an external script informs Nagios when the status of a service changes. With active checks, Nagios invokes a script (referred to as a plugin in the Nagios documentation) at regular intervals, and uses the result of that script to determine the state of the service.
In the following section, we only consider active checks. The majority of network services are best checked using active checks, and the mechanism is both simpler to understand, and to implement. The use of passive checks is discussed in the ADVANCED TOPICS section towards the end of this document.
Nagios configuration is written in terms of objects, which have the following syntax:
define <object-type> {
        use        <parent>
        name       <object-reference-name>
        ...
        register   <0|1>
}
<object-type> is the thing that the object defines. This document will only concern itself with "command" or "service" objects, although many others are possible.
<parent> gives a mechanism for inheriting settings from other objects. In order to perform inheritance, this should be set to the <object-reference-name> (as specified for the name attribute) of the object whose settings are to be copied.
<object-reference-name> is the inheritance key. It should not be confused with the many other *_name options.
register controls whether the object directly affects the final configuration. Objects which are not registered (have a value of 0) are not used directly by the system, although they may be included by reference.
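To illustrate how use, name and register fit together, here is a sketch (not taken from the shipped configuration - the template name and the extra directives are purely illustrative) of an unregistered template object and a service inheriting from it:

define service {
        name                    my-service-template
        max_check_attempts      3
        notification_interval   60
        register                0
}

define service {
        use                     my-service-template
        host_name               myhost
        service_description     My Templated Service
        check_command           check_service!option1
        contact_groups          mygroup
}

The first object is never used directly by Nagios; the second picks up its settings through the use directive, exactly as the 'default-service' parent is used below.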
In order to monitor a service, Nagios needs a number of settings to be defined. The monitoring system provides a parent 'default-service', from which useful defaults may be inherited, such that a typical service description becomes
define service {
        use                     default-service
        host_name               myhost
        service_description     My Service
        check_command           check_service!option1!option2
        contact_groups          mygroup
}
This configures a service called 'My Service' which is to be monitored on myhost by running the command 'check_service' (this isn't a command in the filesystem sense - we'll talk more about commands in a moment), and to report any failures to the Nagios contact group mygroup. Most of these configuration options are somewhat self-explanatory, but we'll talk through them anyway!
The name of the host that the service is to be monitored on. For the rest of the monitoring system to work correctly, this must be set to the value passed in to translate() as $machine - it should not be expanded, canonicalised, or otherwise fiddled with.
The name of the service being monitored. This can be any free text string, but it must be unique on a given machine (so no two translators should use the same name, and a translator which produces multiple service objects must use a different service_description for each one).
The first part of this is a reference to a command object (see below) describing the command to run. The items after that, separated by the ! character, are arguments that are passed into the command object.
This is the name of a Nagios group that should be contacted when the service fails. Nagios groups are managed by the UserManager class held within the Server object passed into the translator at construction, and all groups must be registered with this class by calling the addContactGroup($contact, $machine, $component) method of that class. See the examples below.
Writing these service structures by hand, and then turning them into perl configuration fragment instances is a bit of a laborious process, so some helper classes are provided. The LCFG::Monitoring::Nagios::Fragment::Service class will produce a fragment that embodies a service definition. See the class documentation for full details, but to implement the minimal service object detailed above you could write:
$fragment = new LCFG::Monitoring::Nagios::Fragment::Service('default-service');
$fragment->hostName('myhost');
$fragment->serviceDescription('My Service');
$fragment->checkCommand('check_service!option1!option2');
$fragment->contactGroups('mygroup');
The constructor is called with the name of the Nagios object to inherit defaults from (no inheritance will be performed if this is left empty), then each attribute is defined by an appropriate method call.
As hinted at above, a Nagios command object must be defined for each plugin which is to be used. This object defines the path on the filesystem of the command, and the way in which the Nagios !-separated arguments are translated into a command line string.
Many Nagios plugins ship with pre-defined command objects, either in the /etc/nagios/commands.cfg file, or as individual configuration files in /etc/nagios/commands.d. If you are writing, or packaging, a command, it is strongly recommended that you distribute your command's configuration object as a file in /etc/nagios/commands.d. However, if you are using an existing command that has no configuration object, or the existing object is unsuitable, you may wish to generate a suitable command object as part of your translator. Command objects should be output as a command fragment from the startPass() method of your translator - an example of this is given below.
Commands rarely inherit, and no default command definition is provided. A typical command object looks like:
define command {
        command_name    check_service
        command_line    $USER1$/check_service -H$HOSTADDRESS$ -p $ARG1$ -D $ARG2$
}
This is the name of the command, as referenced by the check_command option in a service object. Command names must be unique across the configuration, so it's recommended they use some element of the component name in order to ensure uniqueness.
This is the command line that is run in order to invoke the command. There's a fair amount of variable substitution that happens here. $USER1$ is expanded to the Nagios plugins directory (/usr/lib/nagios/plugins/). $HOSTADDRESS$ is the IP address of the host the command is being invoked on. $ARG1$, $ARG2$ and so on are the ! separated optional arguments specified as part of the check_command option in the service configuration object.
As with services, commands have a class to make constructing them simpler - LCFG::Monitoring::Nagios::Fragment::Command. The above definition could be created by writing:
$command = new LCFG::Monitoring::Nagios::Fragment::Command();
$command->commandName('check_service');
$command->commandLine('$USER1$/check_service -H$HOSTADDRESS$ -p $ARG1$ -D $ARG2$');
Note that the dollars in the commandLine call mean that you have to be very careful with your quoting in order to ensure that perl does not attempt to expand them as perl variables.
Now that we have seen all of the complexity, let's step back and look at some mechanisms by which trivial translators may be written. A helper class LCFG::Monitoring::Nagios::Translators::SimpleCheck is provided which can do all of the hard work in writing a translator. It is, however, only suitable for use in particular situations.
SimpleCheck is designed to provide a rapid way of writing translators that monitor simple, single, services. It will only work for services that can be completely monitored by the use of a single check command, that require no additional service configuration beyond that in default-service, and that are prepared to use a standard set of LCFG resources to configure the component. In particular, the name of the contact group must be contained in the component resource nagios_groups.
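In a machine's profile, that typically means setting something like the following (a sketch - the group name, and the use of mSET, are illustrative rather than prescriptive):

!kerberos.nagios_groups    mSET(kdc-admins)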
Here's an example of a translator based on the SimpleCheck object, which monitors the KDCs (configured by the Kerberos component).
package LCFG::Monitoring::Nagios::Translators::kerberos;

use strict;
use warnings;

use base qw(LCFG::Monitoring::Nagios::Translators::SimpleCheck);

sub checkCommand {
    my ($self, $machine, $profile) = @_;

    return("check_krb5!INF.ED.AC.UK!/etc/nagios.keytab!nagios/duffus.inf.ed.ac.uk");
}

sub serviceDescription {
    return("Kerberos");
}

1;
Let's break this down into individual bits.
The first line declares the package name. This must be LCFG::Monitoring::Nagios::Translators:: followed by the name of the component being monitored.
The 'use' section configures perl's maximum warning level, in order to detect code errors better, and defines the package as inheriting from LCFG::Monitoring::Nagios::Translators::SimpleCheck. It's that bit that lets the rest be so straightforward - all of the hard work is being done by the SimpleCheck class.
The 'checkCommand' function defines the check_command line to be included in the eventual service definition. In the example above, this will always check a KDC for the realm INF.ED.AC.UK, with a number of other fixed options. This isn't ideal, as not every machine using the Kerberos component is a KDC, and not every KDC will be for the INF.ED.AC.UK realm. We'll look at how to fix that below.
The 'serviceDescription' method returns what the service_description for the service being defined is. As noted earlier, it must be unique on any given machine.
As noted above, the current checkCommand isn't ideal - it doesn't check any of the monitored service's resources, and so isn't really adapting to the system it's monitoring. We should check to see whether the machine is actually configured to run a KDC, and which realm it's providing a KDC for. We should also allow the monitoring service to be moved between machines, so the hostname of the check principal shouldn't be hardcoded.
Here's a more complex version of that checkCommand ...
use LCFG::Monitoring::Exception;
use Sys::Hostname;

...

sub checkCommand {
    my ($self, $machine, $profile) = @_;

    my $type = $profile->k('type')->d
        or die LCFG::Monitoring::Exception::RunTime->new
            ("Unable to load type from profile");
    my $realm = $profile->k('realm')->d
        or die LCFG::Monitoring::Exception::RunTime->new
            ("Unable to load realm from profile");

    if ($type ne "master" && $type ne "slave") {
        warn "kerberos: Monitoring machine '$machine' that is neither master nor slave";
        return "";
    }

    my $hostname = hostname;
    return("check_krb5!$realm!/etc/nagios.keytab!nagios/$hostname");
}
There are a number of important new concepts here; let's work through them in the order they appear...
We load resources in from our profile. The mechanism for doing this is documented in gory detail in the next section. Suffice to say that $type and $realm are set to the kerberos.type and kerberos.realm resources, respectively.
We die() if we don't have the correct resources. In the monitoring system a call to die() isn't fatal - it's simply a way of throwing an exception. Translators which die() will be logged both in the nagios system, and by sending an email to the contact_groups for that component. There is a wide range of possible exceptions, documented in LCFG::Monitoring::Exception; you can also define your own by inheriting from a suitable, existing, Exception class.
We warn() if someone has attempted to configure us to monitor machines that aren't KDCs. The more machines the monitoring system has to cope with the harder its work is, so it's important not to silently add null actions into the workload. If a machine shouldn't be monitored, raise an error. warn() puts the text it's called with into the Nagios syslog facility, and then continues execution. Throwing an exception with die() would also have been appropriate in this case.
If we're not supposed to be monitoring the machine, we simply return an empty string. The SimpleCheck class detects this, and skips this host.
Otherwise, we return a check_command string as before, but this time with the realm pulled out of the LCFG resources for the machine we're monitoring.
In the earlier section we saw a couple of examples of pulling resources out of LCFG profiles. The monitoring system has a more complex view of LCFG profiles than many of the original LCFG perl modules, so it's worth taking a slight detour here to consider these, and how they affect our representation of resources.
Firstly, let us consider what a profile looks like in 'traditional' LCFG format. For simple (non-list) resources this is fairly straightforward, with straightforward attribute names. For list resources, things become more complex. We have a resource that contains a list of keys for that resource, and then a set of list elements for each key, with names being constructed with some _'s and glue. For example, a single level list would become (the item on the left is the attribute name, the next column is the value)
keys             key1 key2 key3
elementA_key1    blob
elementB_key1    blobby
elementA_key2    blobby blobby
elementB_key2    blobby blobby blobby
... and so on.
Things become even more scary for nested lists. The LCFG compiler jumps through a large number of hoops in order to translate this into a parse tree which looks something like
keys
 |
 |----key1
 |      |-----elementA----blob
 |      |-----elementB----blobby
 |
 |----key2
        |-----elementA----blobby blobby
        |-----elementB----blobby blobby blobby
The usefulness of this tree representation can be seen more clearly when dealing with a multi-level list. Say that 'elementA' is, in itself a list, with members subElementA and subElementB, and keys keyA1 and keyA2. Our parse tree then becomes
keys
 |
 |----key1
 |      |-----elementA
 |      |        |--------keyA1
 |      |        |           |-------subElementA----wibble
 |      |        |           |-------subElementB----wobble
 |      |        |
 |      |        |--------keyA2
 |      |                     |-------subElementA----flibble
 |      |                     |-------subElementB----plib
 |      |
 |      |-----elementB----blobby
 |
 |----key2
        |-----elementA
        |        |--------keyA3
        |        |           |-------subElementA----wibble again
        |        |           |-------subElementB----wobble again
        |        |
        |        |--------keyA4
        |                     |-------subElementA----flibble again
        |                     |-------subElementB----plib again
        |
        |-----elementB----blobby blobby blobby
Within conventional LCFG syntax there are multiple ways that this could be represented. One mechanism is
keys                   key1 key2
elementA_key1          keyA1 keyA2
subElementA_keyA1      wibble
subElementB_keyA1      wobble
subElementA_keyA2      flibble
subElementB_keyA2      plib
elementB_key1          blobby
elementA_key2          keyA3 keyA4
subElementA_keyA3      wibble again
subElementB_keyA3      wobble again
subElementA_keyA4      flibble again
subElementB_keyA4      plib again
elementB_key2          blobby blobby blobby
As you can see, a significant amount of knowledge about where resources sit in the tree is being lost. A slightly more verbose alternative representation is
keys                      key1 key2
elementA_key1             keyA1 keyA2
subElementA_keyA1_key1    wibble
subElementB_keyA1_key1    wobble
subElementA_keyA2_key1    flibble
subElementB_keyA2_key1    plib
elementB_key1             blobby
elementA_key2             keyA3 keyA4
subElementA_keyA3_key2    wibble again
subElementB_keyA3_key2    wobble again
subElementA_keyA4_key2    flibble again
subElementB_keyA4_key2    plib again
elementB_key2             blobby blobby blobby
There is now slightly more information in the resource names, but it can still be difficult to discover a resource's position within the parse tree.
As hinted earlier, the monitoring system uses a different system of resource representation, where the name of a resource directly corresponds to its position in the parse tree. This representation can best be understood as a dotted syntax, where the dot separates each node in the tree. So, the above would become:
keys                           key1 key2
keys.key1.elementA             keyA1 keyA2
keys.key1.keyA1.subElementA    wibble
keys.key1.keyA1.subElementB    wobble
keys.key1.keyA2.subElementA    flibble
keys.key1.keyA2.subElementB    plib
keys.key1.elementB             blobby
keys.key2.elementA             keyA3 keyA4
keys.key2.keyA3.subElementA    wibble again
keys.key2.keyA3.subElementB    wobble again
keys.key2.keyA4.subElementA    flibble again
keys.key2.keyA4.subElementB    plib again
keys.key2.elementB             blobby blobby blobby
This syntax may be used when calling the lookup() function on the profile passed into the translate() or checkCommand() methods. For example,
$var = $profile->lookup("keys.key2.elementB");
In this situation, a lookup on a leaf node returns the value held in that node, a lookup on an internal node returns a list of the names of that node's children.
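For instance, against the parse tree above, the two cases look something like this (a sketch - it assumes lookup() is called in list context when reading an internal node):

my $value    = $profile->lookup("keys.key2.elementB");   # leaf node: 'blobby blobby blobby'
my @children = $profile->lookup("keys.key2");            # internal node: ('elementA', 'elementB')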
There is also a more complex lookup mechanism available by directly traversing the parse tree, starting with the contents of $profile.
$node->k($child) will return the XML node of the child $child of the node $node.
$node->d will return the data held within the current leaf node.
$node->kids() will return a list of all of the child nodes.
So, for example, to access keys.key2.elementB you could write
$var = $profile->k('keys')->k('key2')->k('elementB')->d
This probably seems clumsy when compared to the lookup syntax, but where it comes into its own is in iterating across lists. For example, again referring to the above parse tree
foreach my $node ($profile->k('keys')->kids()) {
    print $node->k('elementB')->d;
}
would print out the contents of all of the elementB's in the above tree. Or,
foreach my $node ($profile->k('keys')->kids()) {
    foreach my $subnode ($node->k('elementA')->kids()) {
        print $subnode->k('subElementA')->d;
    }
}
prints all four subElementA values.
It is hoped to introduce a further, cleaner, syntax for object oriented parse tree manipulation at some point in the future.
Finally, before we move on to looking at more complicated translator examples, it is useful to look behind the scenes of our Kerberos example. As SimpleCheck won't be able to assist with what follows, discarding it now and looking at a version of the kerberos translator written without it will make the later translators easier to follow.
package LCFG::Monitoring::Nagios::Translators::kerberos;

use strict;
use warnings;

use base qw(LCFG::Monitoring::Nagios::Translators::Escalate);

use Sys::Hostname;

sub translate {
    my ($self, $machine, $profile) = @_;

    my $type = $profile->k('type')->d
        or die LCFG::Monitoring::Exception::RunTime->new
            ("Unable to load type from profile");
    my $realm = $profile->k('realm')->d
        or die LCFG::Monitoring::Exception::RunTime->new
            ("Unable to load realm from profile");

    if ($type ne "master" && $type ne "slave") {
        warn "kerberos: Monitoring machine '$machine' that is neither master nor slave";
        return "";
    }

    my $contact = $self->handleContacts($machine, $profile);

    my $fragment = new LCFG::Monitoring::Nagios::Fragment::Service ("default-service");
    $fragment->hostName($machine);
    $fragment->serviceDescription("Kerberos KDC");
    $fragment->checkCommand("check_krb5!$realm!/etc/nagios.keytab!nagios/".hostname);
    $fragment->contactGroups($contact);

    return $self->mergeEscalation($fragment, $profile);
}

sub notifyUnchanged {
    my ($self) = @_;
    return;
}

1;
Again, there are a fair number of new concepts to be considered here.
Firstly, note that whilst we're no longer inheriting from LCFG::Monitoring::Nagios::Translators::SimpleCheck, we're now inheriting from LCFG::Monitoring::Nagios::Translators::Escalate. This class provides some helper functions for dealing with escalations, which are discussed in more detail in the next section. Classes which do not require escalation support, or which wish to implement it differently, may inherit directly from LCFG::Monitoring::Nagios::Translators::Base. See the documentation for those classes for full details of the functionality they provide.
We've now also got an explicit translate() method - as discussed many lines ago, this is at the heart of the monitoring system. This translate() method is a typical, simple, example, so let's examine it in more detail.
The first few lines should be familiar from the checkCommand method we had earlier - they pull the type and realm resources from the profile, and warn() if that information is not correct.
Then, we start something new. We call the handleContacts() method (provided by the Base class) to extract the contacts from the nagios_groups resource, register them with the UserManager, and return us a list of contacts in a form that we can pass to Nagios. In simple cases, handleContacts() is the best way of performing this operation - lower level methods of dealing with contacts are discussed in the Advanced Topics section at the end.
Then, we create and set up a Nagios Service configuration fragment, as discussed earlier. Note that this fragment uses Nagios's own inheritance rules to inherit many defaults from the pre-defined 'default-service' object.
Finally, we merge in any necessary escalation information, as discussed in the next section.
In typical operation, Nagios will contact the users within a service's contactGroups attribute to notify them of that service's failure. However, it is sometimes desirable to notify more people the longer a service remains broken. To this end, Nagios provides a complex escalations system, to which the monitoring system provides an interface.
Escalations are provided through the LCFG::Monitoring::Nagios::Escalation class, the documentation for which provides an LCFG oriented view of how to configure which escalations will be performed. From a component author's perspective, all that is required to add escalation support is for the constructEscalation
method of this class to be called for each service for which escalations are required. This method takes two arguments - the first being the service fragment to be escalated, and the second being the LCFG profile from which configuration resources can be obtained.
In the simplest form, a component would construct an instance of the LCFG::Monitoring::Nagios::Escalation
class in its constructor, call constructEscalation
for each service fragment, and merge all of the resulting fragments into the return from its translator. A number of helper mechanisms exist to simplify this task. The mergeEscalation method will return a fragment containing both the supplied service fragment and the escalation fragment, simplifying use with translators which only return a service fragment.
In addition, a base class LCFG::Monitoring::Nagios::Translators::Escalate is provided which takes care of constructing the escalation class, and provides mergeEscalation
and constructEscalation
as methods of the translator itself. It is this usage of mergeEscalation
that can be seen in the example above.
Translator authors requiring either a more complex model of escalations, or more flexibility than detailed here, should examine the LCFG::Monitoring::Nagios::Escalation class documentation, and construct their own escalation definition fragments for each service providing escalation support.
Before progressing to the final worked example, let's take one more stop off along the way. This time, we're going to look at a translator for the Jabber service. There are a few interesting issues here that make this translator worth considering.
Firstly, whilst the Nagios plugins packages ship a plugin command to monitor a Jabber service, they don't include a Nagios command definition. As shipped, the plugin doesn't correctly interrogate our local service, and needs some additional configuration information to work correctly. So, we have to provide the check command definition within the translator code.
Secondly, we have to monitor two services from within the translator. Jabberd provides both a normal and an SSL service on two different ports.
package LCFG::Monitoring::Nagios::Translators::jabberd;

use strict;
use warnings;

use base qw(LCFG::Monitoring::Nagios::Translators::Escalate);

use LCFG::Monitoring::Nagios::Fragment::Command;
use LCFG::Monitoring::ConfigBundle;

sub startPass {
    my ($self) = @_;

    # The XML that check_jabber expects to be returned differs in ordering
    # from what our server returns.
    my $expectxml = "\"<?xml version='1.0'?>".
        "<stream:stream xmlns:stream='http://etherx.jabber.org/streams' ".
        "xmlns='jabber:client'\"";

    my $commands = new LCFG::Monitoring::ConfigBundle();

    my $normal = new LCFG::Monitoring::Nagios::Fragment::Command();
    $normal->commandName("check_jabber");
    $normal->commandLine('$USER1$/check_jabber -H$HOSTADDRESS$ -p 5222 -e'.$expectxml);
    $commands->addFragment($normal);

    my $ssl = new LCFG::Monitoring::Nagios::Fragment::Command();
    $ssl->commandName("check_jabber_ssl");
    $ssl->commandLine('$USER1$/check_jabber -H$HOSTADDRESS$ -p 5223 --ssl -D$ARG1$ -e'.$expectxml);
    $commands->addFragment($ssl);

    return $commands;
}

sub translate {
    my ($self, $machine, $profile) = @_;

    my $contact = $self->handleContacts($machine, $profile);

    my $bundle = new LCFG::Monitoring::ConfigBundle();

    my $normal = new LCFG::Monitoring::Nagios::Fragment::Service("default-service");
    $normal->hostName($machine);
    $normal->serviceDescription("Jabber");
    $normal->checkCommand("check_jabber");
    $normal->contactGroups($contact);
    $bundle->addFragment($normal);
    $bundle->addFragment($self->constructEscalation($normal, $profile));

    if ($profile->k('c2s_pemfile')->d || $profile->k('c2s_oldpemfile')->d) {
        my $ssl = new LCFG::Monitoring::Nagios::Fragment::Service ("default-service");
        $ssl->hostName($machine);
        $ssl->serviceDescription("Jabber SSL");
        $ssl->checkCommand("check_jabber_ssl!28");
        # Hard coded 28 days for certificate validity
        $ssl->contactGroups($contact);
        $bundle->addFragment($ssl);
        $bundle->addFragment($self->constructEscalation($ssl, $profile));
    }

    return $bundle;
}

sub notifyUnchanged {
    my ($self) = @_;
    return;
}

1;
Once again, let's examine the differences between this and the previous code. We've gained a new method, startPass(). This is called at the start of every translation run, and may return a configuration fragment which is included at the start of the Nagios configuration file. (It has a partner in crime, finishPass(), which is called at the end of a run and can put fragments at the end of the file, but we're not using that here.)
In startPass() we define two Command configuration fragments, defining the check_jabber and check_jabber_ssl check commands. We use an LCFG::Monitoring::ConfigBundle to aggregate these two fragments into a single fragment to be returned. A configuration bundle is simply a set of one or more fragments which can be treated as a single configuration fragment by the rest of the monitoring system. Bundles preserve ordering, so fragments appear in the eventual configuration file in the order they are added to the bundle.
translate() works in a similar fashion to the previous example - we determine our contact group, and register it with the user manager. Instead of just creating a single Service definition, we create two, and use a ConfigBundle to aggregate these together. Note that we have to define an escalation for each of the two Service definitions, and that we explicitly add these escalation definitions to our bundle.
Having worked your way through all of that, you've pretty much seen all of the monitoring system you'll need for day-to-day coding. Let's round off by working through defining the monitoring for a relatively complex component - the one which manages the cosign system. NB: This is now a historical discussion, the cosign component having since been refactored to hand some of its tasks off to other components. However, it is still a good example of a more complex component.
The first thing to consider is what should be monitored. Whilst this will change from component to component, a good rule of thumb is that all of the services that a component launches should be monitored individually. In Cosign's case, there are four services that the component launches: normal Apache, Apache SSL, cosignd, and 'monsterd'. Monsterd is only present if the service is replicated.
Next, Nagios plugins to monitor each service must be assembled. For the first two, Apache-based, services the existing http plugins will suffice. A quick read of /etc/nagios/commands.cfg reveals that the 'check_http' command is defined. However, there's no command defined for HTTPS services, so we'll have to write a command definition for this, using the existing plugin.
As cosign's cosignd and monsterd services are somewhat obscure, there aren't any distributed Nagios plugins to monitor them. A check at http://nagiosexchange.org/ reveals that no-one else has written them either. However, nagios ships with a general purpose check_tcp plugin which can be used to interrogate arbitrary services. Whilst this can't check that a service is completely functional, it can connect to arbitrary ports, and require a particular response from the server.
For cosignd, we know that it runs on port 6663, and returns '220 2 Collaborative Web Single Sign-On' when it's working correctly, so a check command like
/usr/lib/nagios/plugins/check_tcp -H osprey.inf.ed.ac.uk -p 6663 \ --expect "220 2 Collaborative Web Single Sign-On"
can be used. It's worth testing these commands manually before integrating them into a translator.
monster doesn't run on a network port, and so can't be monitored remotely. We'll ignore monster for now, although a mechanism for monitoring it locally would be highly desirable.
Here's the cosign translator in all of its glory. If you've been following along with the previous sections, there shouldn't be anything new in here.
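The full listing isn't reproduced here; instead, here is a minimal sketch of the shape such a translator takes, built only from the pieces discussed above. The command names check_cosign_https and check_cosignd, and the three service descriptions, are illustrative assumptions rather than the real component's choices.

package LCFG::Monitoring::Nagios::Translators::cosign;

use strict;
use warnings;

use base qw(LCFG::Monitoring::Nagios::Translators::Escalate);

use LCFG::Monitoring::Nagios::Fragment::Command;
use LCFG::Monitoring::ConfigBundle;

sub startPass {
    my ($self) = @_;

    my $commands = new LCFG::Monitoring::ConfigBundle();

    # No HTTPS command ships with the plugins, so define one using check_http.
    my $https = new LCFG::Monitoring::Nagios::Fragment::Command();
    $https->commandName("check_cosign_https");
    $https->commandLine('$USER1$/check_http -H$HOSTADDRESS$ --ssl');
    $commands->addFragment($https);

    # cosignd has no dedicated plugin; interrogate it with check_tcp.
    my $cosignd = new LCFG::Monitoring::Nagios::Fragment::Command();
    $cosignd->commandName("check_cosignd");
    $cosignd->commandLine('$USER1$/check_tcp -H$HOSTADDRESS$ -p 6663 '.
                          '--expect "220 2 Collaborative Web Single Sign-On"');
    $commands->addFragment($cosignd);

    return $commands;
}

sub translate {
    my ($self, $machine, $profile) = @_;

    my $contact = $self->handleContacts($machine, $profile);
    my $bundle  = new LCFG::Monitoring::ConfigBundle();

    # One service per daemon the component launches (monster is ignored,
    # as discussed above).
    foreach my $check (["Cosign Apache",     "check_http"],
                       ["Cosign Apache SSL", "check_cosign_https"],
                       ["Cosign cosignd",    "check_cosignd"]) {
        my $service = new LCFG::Monitoring::Nagios::Fragment::Service("default-service");
        $service->hostName($machine);
        $service->serviceDescription($check->[0]);
        $service->checkCommand($check->[1]);
        $service->contactGroups($contact);
        $bundle->addFragment($service);
        $bundle->addFragment($self->constructEscalation($service, $profile));
    }

    return $bundle;
}

sub notifyUnchanged {
    my ($self) = @_;
    return;
}

1;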
If you're using the SimpleCheck or Escalate base classes, then you must include a set of standard resources within your component's default file. It is good practice to include these resources in every component, as the monitoring system depends on some of them as 'fallbacks' should your component fail to execute correctly. The list of resources is contained in the 'nagios_component-1.def' defaults file - however, including this file directly is not recommended. Many LCFG-using sites will not wish to have a monitoring service at all, and some may wish to deploy the monitoring framework but use a different monitoring technology. So, a 'monitoring-1.def' header is provided, which your component's defaults should include. On sites using Nagios this header will provide the contents of nagios_component-1.def.
You may also wish to define additional, monitoring specific, resources. In order to make components as portable as possible, you should structure your defaults file such that these resources are only included for sites which use Nagios. To help with this, NAGIOS_MONITORING will be defined by monitoring-1.def when Nagios resources should be included.
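As a sketch (the nagios_extra_checks resource, and the exact include syntax, are illustrative only), a component's defaults file might be structured along these lines:

#include <monitoring-1.def>

#ifdef NAGIOS_MONITORING
/* Resources which only make sense on sites running the Nagios monitoring system */
nagios_extra_checks
#endif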
The convention is that translator code lives in a 'nagios' subdirectory of the component source, with the filename 'component.pm' (with component being replaced by the name of the component in question.)
For an LCFG component, the lcfg-reltool build system automatically copes with the building of the translator, but additions need to be made to the 'specfile' of the component. Edit that file and, after the defaults-s@SCHEMA@ package definition, and before the %clean section, add a definition for the new 'nagios' translator package as follows:
%package nagios
Summary: Nagios translator for @LCFG_FULLNAME@
Group: @LCFG_GROUP@/Translators/Nagios
BuildArch: noarch

%description nagios
Nagios translator for the LCFG @LCFG_NAME@ component.

@LCFGCONFIGMSG@

%files nagios
%defattr(-,root,root)
@LCFGPOD@/LCFG::Monitoring::Nagios::Translators::*.pod
%{perl_vendorlib}/LCFG/Monitoring/Nagios/Translators/*.pm
%doc %{_mandir}/man3/
The lcfg monitoring system ships with a utility program lcfg-monitor-test which can be used to check the functioning of your translator before shipping it to the server. If you haven't already, you should install the LCFG Nagios package set, by including the dice/options/nagios_packages.h header file.
Before running this you must 'make' in your component. You can then do
lcfg-monitor-test --libs=./blib/lib <component> <machine>
where component is the name of the component you're testing the translator for, and machine is a machine running this service whose resources you want to use.
lcfg-monitor-test will output the results of running your translator. It's important that you check these to ensure there are no errors - some results will be output even if the translator dies part way through processing.
Once everything's been tested and built, we're ready to unleash the new module on the server. We need to be careful of the ordering here, as if the server receives the configuration resources before the translator is installed, it will flag the translator as 'broken', and won't try to reload it until the monitoring service is restarted.
Firstly, the new translator package must be added to the package list for the monitoring server. There is a 'live' packages file for this purpose; translators should be listed there initially for testing, and then moved over to the production headers.
Then, once the change has propagated, updaterpms needs to be run on the monitoring server.
om nagios.updaterpms run
If the RPM hasn't installed (either because the profile hasn't propagated, or because of another error), you must stop here - as the next step will cause problems!
Then, we add the necessary resources to configure monitoring to the profile or headers of the service being monitored. As part of the earlier testing, you've probably added the component specific resources - at this point you add
#include <dice/options/nagios-client.h>

!nagios_client.components   mADD(myComponent)
This will get propagated to the monitoring server, and it will now start monitoring your component.
This section is a compilation of a number of topics that will only concern those writing either more complex translators, or translators that integrate further with the rest of the monitoring system.
We've talked in the preceding sections about the UserManager, although interaction with it has always been through the handleContacts
method. Some translators, particularly those which encompass multiple services, will require a more complex relationship with the UserManager.
The UserManager exists to keep track of all of the contact groups, and individual contacts, handled by the Nagios system. Nagios requires that there be exactly one contact group entry per group, and one contact entry per user. The UserManager handles this by receiving all of the contact group information from all of the translators during a compilation run, and combining it into a single group list. It then pulls the list of group members out of the appropriate database, and constructs configuration directives for each group and group member. Support for different databases may be achieved by creating different UserManager classes; however, all user managers are guaranteed to implement the interface LCFG::Monitoring::Nagios::Interfaces::UserManager.
Translators which are handling their own contact groups lists should simply call the addContactGroup
method of the server's userManager
attribute with each group they are adding. This method takes three arguments, all of which are text strings: the group being added; the machine currently being translated; and the name of the component. So, in a typical translator, you might have a code fragment similar to
my @contacts = split(/ /, $contactListFromLcfg);

foreach (@contacts) {
    $self->server->userManager->addContactGroup($_, $machine, $self->component);
}

$service->contactGroups(join(",", @contacts));
Note that the LCFG contact group list is space separated, but the Nagios one is comma separated.
In situations where the translator is not run, because the underlying configuration information is unchanged, the server will take care of adding the relevant contact groups on its behalf - there is no need to call the user manager from the translator's notifyUnchanged
method.
In situations where a component's translator defines multiple Nagios services, it is possible that these services may depend on each other. Defining dependencies means that in the event of a service failing, services which are dependent upon it will not have notifications sent - only the 'master' service will send failure messages. In this way, dependencies can help prevent notification storms.
A typical example of this is in the Apache virtual host case. It may be desirable to monitor each of the virtual hosts provided by Apache, to check that the web applications they provide are still operational. However, if the main Apache process dies, only a single notification should be sent, rather than one for each of the (potentially hundreds) of virtual services hosted by the machine.
Nagios's dependency system is somewhat complex, and can be difficult to get your head around. To help with this, a simple method is provided by the LCFG::Monitoring::Nagios::Fragment::ServiceDependency class. Calling createDependency($service, $dependentService)
returns a configuration fragment marking the $dependentService as being reliant upon $service.
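In the Apache example above, that might look something like this sketch (it assumes createDependency is called as a class method, and that $apache and $vhost are service fragments built earlier in translate()):

my $dependency = LCFG::Monitoring::Nagios::Fragment::ServiceDependency
                     ->createDependency($apache, $vhost);

# While the main Apache check is failing, notifications for the virtual
# host check will now be suppressed.
$bundle->addFragment($dependency);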
In many cases, the service that a given component is dependent upon may not be local to the current component, or even to the current host. LCFG has no real language for expressing these kinds of cross-host dependencies, so the monitoring system has implemented one of its own, using a system of unique tags. A translator may declare any given service as providing a particular tag (a free-text string), and another service may be marked as depending upon a tag. This requires only that tags be coordinated between services; the service and host names can then be changed at will. Only one service may provide a tag at any given time, but many services may depend upon a particular tag.
The system of providing, and depending upon, tags is governed by the LCFG::Monitoring::Nagios::DependencyManager object, which is available to translators through the dependencyManager
method of their server.
In order to register a service fragment as providing a particular tag, a translator should call the addProvides
method of the dependency manager, with the tag name, the service fragment for which the tag is being registered, the current machine, and the current component. For example:
$self->server->dependencyManager->addProvides("My_Tag", $service, $machine, $self->component);
As noted before, it is an error (and the dependencyManager will throw an exception) to register the same tag for multiple services. There is nothing to stop a single service providing multiple tags, however. If a service is a member of a group, and the dependency should be expressed against a group, rather than against each individual service, then clusters should be used instead (see CLUSTERING, below).
To declare a service fragment as dependent upon a particular tag, the addDependency
method of the dependency manager is used. It is called with the tag name, the service fragment for which the dependency is being declared, the current machine, and the current component. For example:
$self->server->dependencyManager->addDependency("My_Tag", $service, $machine, $self->component);
A single service may register multiple dependencies, and the same tag may be depended upon by multiple different services.
Dependencies are evaluated at the end of a compilation run. Dependencies which have been declared against non-existent tags will only be detected at this stage, and will not result in compilation errors - just warnings in the server logs.
In Nagios terminology, a cluster is a set of more than one machine that redundantly provides a particular piece of functionality - for instance, a set of KDCs, or of AFS database servers. Clusters may be monitored so that notifications are only raised when more than a certain number of machines in the cluster are down, and can be hugely useful when expressing interservice dependencies (not currently supported).
Clusters are looked after by a Cluster Manager, provided by the LCFG::Monitoring::Nagios::ClusterManager class, which translators can access through $self->server->clusterManager. In order to register a service fragment as belonging to a particular cluster, translators should call the addCluster
method with the name of the cluster, the service fragment, the name of the machine they are translating for, and their component.
Typically, this looks like:
$self->server->clusterManager->addCluster("My_Cluster_Name", $service, $machine, $self->component);
In addition to being registered as a cluster, a cluster also provides its cluster name as a dependency. This means that cluster names and dependency tags occupy the same namespace, but it allows the creation of dependencies against groups of machines. The cluster manager will also add service group definitions to the configuration for all groups that it has defined.
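Putting the two managers together, a translator for a replicated service can register each member machine with a cluster, while a translator for something which relies on that service as a whole declares a dependency against the cluster name. A sketch (the 'KDC_Cluster' tag is illustrative):

# In the translator for the replicated service, run once per member machine:
$self->server->clusterManager->addCluster("KDC_Cluster", $service,
                                           $machine, $self->component);

# In the translator for a service which relies on the cluster as a whole:
$self->server->dependencyManager->addDependency("KDC_Cluster", $dependentService,
                                                 $machine, $self->component);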
So far, we have only considered services where the monitoring is scheduled by Nagios itself, and performed by Nagios plugins. It is possible to monitor services using external scripts, and simply push the results of those scripts into Nagios. This is referred to as passive monitoring.
There are three potential use cases for passive, rather than active, monitoring. The first is to keep an eye on services which are not visible on the network. In this model there would be a regularly executed script which would check on the state of the service, and submit its results back to Nagios - in much the same way as plugins do at the moment, with the exception that the scheduling of the script is outwith Nagios's control (and the script can run on any machine on the network). The second use case is to deal with events, such as SNMP traps. In this case the script is only run when the event occurs, and it then submits the appropriate result to Nagios. The final model is reporting regular system tasks (for example, AFS volume releases) - a script performs the tasks, and as well as logging the output, it produces a Nagios status message.
In the first and third cases, as well as being interested in the results of the check, we're also bothered about its recency - not receiving results from a service check may be as big a problem as that check notifying us of service failure. Nagios has a concept of 'freshness' in order to implement this, which we'll discuss more below.
Nagios provides a mechanism, NSCA, for allowing hosts to submit passive checks to the central Nagios server. However, the authentication and access control in this mechanism is pretty primitive, and hard to tailor to the needs of sites which already have distributed access control systems. The Nagios portion of the LCFG Monitoring system provides a pluggable framework for configuring a system to accept passive checks. This framework may be used to configure NSCA, as well as more powerful systems such as one based upon remctl. Central to all of these systems is the idea that a passive check for a given service, running on a given machine, may only be submitted by particular authentication identities.
In order to receive service checks, the monitoring system must be configured with an appropriate service. These services may be produced by translators in exactly the same way as services which perform active checks, with a few exceptions.
In the service definition, the check command should be set to a dummy value, rather than a real plugin. The service should be configured to accept passive checks, and should have active checks disabled. Something like the following will achieve this:
$service->checkCommand('check_passive');
$service->activeChecksEnabled(0);
$service->passiveChecksEnabled(1);
Secondly, if passive checks are to be submitted through the standard mechanism detailed above, the translator must register the service with the Passive Check Manager, along with an identity that will be used to submit the passive check results. These registrations are used to build up the ACL lists which control which identities can submit check results for which services. The translator may do this by calling
$self->server->passiveManager->addPassiveCheck($service, "myidentity@MYREALM", $machine, $self->component);
The implementor must now ensure that some mechanism exists to submit passive checks to this service. In our first use case this must be done regularly, perhaps by something that is called by cron. In the second and third use cases the submission occurs as a result of some event. Passive results may be submitted through the 'nagios-remctl-send' command line script, or by using the LCFG::Monitoring::Nagios::RemctlSend class.
The command line tool is invoked with
nagios-remctl-send --keytab /etc/nagiospassive.keytab --server nagios.example.org --service ServiceName -- /usr/lib/nagios/plugins/check_my_service <options>
where the keytab contains key material for the principal given to the addPassiveCheck call, the server is the name of the nagios server to which the results will be sent, and ServiceName is the name of the service registered with addPassiveCheck. This will then run the specified plugin with the given options, and send its output and exit status code to the server.
If you wish to communicate the result of a command run, rather than the output of a monitoring plugin, the --text
and --result
options may be used to communicate a textual result and status code to the server. These must match the standard output from Nagios plugins, as specified in http://nagiosplug.sourceforge.net/developer-guidelines.html. In particular, the text should be a single line of less than 80 characters in the format
SERVICE_STATUS: Information text
where SERVICE_STATUS
is typically OK, WARNING or CRITICAL, and the result code should be between 0 and 3 with the following meanings:
0 : OK
1 : WARNING
2 : CRITICAL
3 : UNKNOWN
Note that Nagios uses purely the result code to determine the state of the service. The text message is for information only, and is not parsed by the system.
For example:
nagios-remctl-send --keytab /etc/nagiospassive.keytab --server nagios.example.org --service BeerGlass --result 1 --text "WARNING: Almost empty"
reports a warning that the BeerGlass service is almost empty.
This can also be achieved by a script using the LCFG::Monitoring::Nagios::RemctlSend class:
use LCFG::Monitoring::Nagios::RemctlSend;

my $sender = LCFG::Monitoring::Nagios::RemctlSend->new
    (keytab  => "/etc/nagiospassive.keytab",
     server  => "nagios.example.org",
     service => "BeerGlass",
     result  => 1,
     text    => "WARNING: Almost empty");

$sender->send();
Finally, there are a few other issues which should be considered. If the check results are expected at regular intervals (for instance, from regular remote server checks, or as a result of a nightly cronjob), Nagios should be told to report an error when those checks do not appear. This can be done by configuring Nagios's freshness options in the translator, as follows:
$service->checkFreshness(1);
$service->freshnessThreshold(60);
The freshnessThreshold is the number of seconds after which Nagios will consider a check as being stale. You should set this to slightly more than the frequency with which you expect check results to be submitted, in order to avoid false positives. When a result becomes stale, Nagios attempts to perform an active check against a server - if you have used the 'check_passive' command as detailed above, this will simply return a CRITICAL warning.