Nagios Passive Agent - Configuration

Getting Started

NPA comes with some example configuration which is commented out in the npa.xml file. These files are used for all the configuration:

npa.xml	All checks and parameters
log4j.xml	Logging output
defaults.groovy	Settings for the nagios server, HTTP proxies, authentication and so on.
wrapper.conf	Settings for the daemon configuration, including JAVA_HOME.

The files are usually located in /usr/local/npa/config (UNIX/Linux) or c:\npa\config (Windows).

Using Default Configuration

The default configuration contains several common checks which will run every 3 mins by default:

chk_disk_free will check for any disks with 10% or less free space and trigger a warning, or 5% or less, which will trigger a critical status. It will loop through all standard filesystems.
chk_cpu_pct will check the percentage CPU usage on the all CPUs. It will trigger a warning at 90%, and critical at 100%.
chk_disk_busy_pct will check how busy the disks are. If any are over 80% busy, it will trigger a warning. 90% will trigger a critical status.

These checks are fairly basic. You can edit the thresholds yourself, and add additional checks in the npa.xml file. It is worth setting up new servers with just the default checks to start with though. This way, you can be sure communication back to nagios is working, before you configure additional checks.

An Explanation of the Config Files

Here's a breakdown of configuring a check. Checks are members of check groups. All checks must be a member of a checkgroup:

...
    <check-group name="os" type="nagios" interval="120000">
...

Interval: All checks inside this check group will execute every 120000 milliseconds, which is two minutes.
Name: is an arbitrary name for your group
Type: This could allow other types to be specified in the future. Currently it should always be set to "nagios"

Now, lets see a breakdown of the check itself:

...
        <check name="chk_disk_free" warn="10" crit="5" type="LTE">
            <nagiosServiceName>chk_disk_free</nagiosServiceName>
            <unitType>percent</unitType>
            <volume>ALL</volume>
         </check>
...

Name: Check name. This corresponds to the internal check name and must correspond to an actual check type.
Warn: This is the warning threshold which triggers a warning level in Nagios.
Crit: This is the critical trigger level for nagios.
Type: This is the type of comparison made of the output. LTE=Less than or equal to. See comparison types below.

The tags beneath the check tag are different for each check type, and certain ones will be expected for each check. However, nagiosServiceName is always used to submit back the check result to Nagios. This corresponds to the check name configured at the nagios server, so should be configured appropriately.

This particular check (chk_disk_free) requires the unitType and volume settings to operate correctly. If you do not specify all required options, you will see lines in the npa.log file indicating this.

A note on hostnames

Currently NPA is not very forgiving on hostnames. It will submit the check result back to nagios with the hostname, as given by the Java host.name system property. This often, corresponds to the short name of the host. This behavior is not very flexible and may be changed in the future.

Using Additional Checks

NPA supports external checks using any Nagios plugin. To do this, a configured command is executed on the operating system at the specified interval. There is an example of how to configure this in the npa.xml file.

Some Notes on External Checks:

Please check Nagios plugins manually before configuring NPA to run them for you. This will ensure it is easy to satisfy any individual configuration requirements for these plugins, before trying to debug the NPA part too.

With external plugins, NPA simply runs the script/executable you provide, and parses the output, following the nagios plugin standard. Therefore, the plugin must satisfy the nagios standards in order for the check results to be valid, and for the performance data to be correctly sent back to nagios. These are covered in detail at the following site.

The full path of the plugin should be used - relative paths do not presently work, even if you have saved the plugins within the NPA subdirectories.

Here is an example configuration:

        <check name="chk_nagios">
            <nagiosServiceName>chk_test</nagiosServiceName>
            <saveMetrics>true</saveMetrics>
            <scriptName>/home/chris/work/projects/monitoring/dummy-nagios.sh</scriptName>
            <scriptType>shell</scriptType>
            <scriptArgs>-t -U -p</scriptArgs>
        </check>

Notes on Individual Checks

chk_disk_free

This check checks the space of each of the filesystems or a single specified filesystem on a server. It can be run on all platforms, and is a part of the OSGatherer/OSCheck classes in NPA. It has options to warn based on a percentage free threshold, and megabytes free.

Here is a set of typical XML configuration strings:

Check Type	Warning Threshold	Critical Threshold
/u01 filesystem <= 100MB	<= 500MB	<= 100MB
Any disk free <= 5%	<= 10%	<= 5%
/u02 filesystem >= 90%	>= 80%	>= 90%

As you can see, as with all NPA checks, you can specify the thresholds as greater-than or less-than options. This allows some flexibility. Usually we would configure disk checks as a percentage free space, but this is really a matter of preference.

When to Use MB Free instead of Percentage and When to Separate Volume Checks

Sometimes disks can be too large to give a reasonable warning range for the other disks in the system, without triggering too quickly. For example:

Disk 1: 1TB
Disk 2: 10GB

When disk1 is 90% full, it will still have 100GB free, which is a lot of space! On the other hand, disk2 will only have 1GB free. Obviously the disk condition on disk2 is much more serious. In a case like this, it would be wise to write different disk checks for each disk. You can do this by adding the volume parameter to the check, as per the above examples, and changing the nagiosServiceName parameter to indicate to disk you are checking. E.g. chk_disk_free_u01. At the nagios end, you then need to add a new service check to that server. Sometimes with checks like this, the service check will have to be created, since it is not generic across all servers.

chk_disk_busy_pct

This check uses iostat (UNIX/Linux) and wmic (Windows) to check how busy each of the physical disks in the system is. It makes no distinction for local/SAN disks and checks them the same way. It will trigger warning and critical thresholds, if any of the disks is greater than a certain threshold. It doesn't presently have the ability to set individual volumes, but I will add the facility in a later release. Here's a good example of the best way to configure it:

Check Type	Warning Threshold	Critical Threshold	XML Config
Disks is 90% busy	>= 80%	>= 90%	<check name="chk_disk_busy_pct" warn="85" crit="95" type="GTE"> <nagiosServiceName>chk_disk_busy_pct</nagiosServiceName> </check>

chk_tablespaces

This check checks oracle tablespaces and tries to determine how much free space is available. It supports checks on a percentage, or number of megabytes free. It uses the following SQL:

select df.tablespace_name
,      sum(max_size_mb-total_used_mb) as free_space_mb
,      sum(max_size_mb) as max_size_mb
,      sum(total_size_mb) as current_size_mb
,      (sum(total_used_mb)/sum(max_size_mb))*100 as pct_used
FROM (select distinct df1.file_id, (decode(df1.autoextensible, 'NO', nvl(df1.bytes,1), 'YES', maxbytes)/1024/1024) as max_size_mb,
(nvl(df1.bytes,1)/1024/1024) as total_size_mb,
((nvl(df1.bytes,0)-nvl(fr1.bytes,0))/1024/1024) as total_used_mb,
df1.autoextensible as auto_ext
from dba_data_files df1 
left outer join
(select sum(bytes) bytes, file_id from dba_free_space group by file_id) fr1 
on df1.file_id = fr1.file_id) calc, dba_data_files df
where tablespace_name like '%'
and df.file_id = calc.file_id
group by tablespace_name
order by tablespace_name;

If this check returns a warning or critical result along with valid output, it is certain that the tablespace needs to be looked at. It takes into account whether autoextend is turned on on a datafile, so max_size_mb includes the maximum auto extend sizes too (which is usually 32GB). If the check triggers, usually adding a datafile is all that is required. It is recommended to run this check only every half-an-hour or so, as checks every minute could affect performance on the database.

Occasionally, it may trigger because you have configured a tablespace which is not yet being used. To avoid this, create a small datafile for the tablespace, with autoextend turned on. This will ensure that the tablespace has room to expand once it comes into use. For example (with OMF turned on):

alter tablespace case_notes add datafile size 5m autoextend on next 50m;

Here's some examples on configuration:

Check Type Warning Threshold Critical Threshold XML Config

Any Tablespace <= 5% free

<= 10%

<= 5%


<check name="chk_tablespace" warn="10" crit="5" type="LTE">
	    <nagiosServiceName>chk_tablespaces</nagiosServiceName>
	    <unitType>percent</unitType>
	    <tablespace_name>ALL</tablespace_name>
   	    <host>oracle10g.corelogic.local</host>
	    <user>fw34x</user>
	    <password>fw34x</password>
	    <database>ora10test</database>
	    <port>1521</port>
	</check>

chk_http

This check checks a web site address for a valid return code. For instance, 200. 404, or 500 error codes would produce a critical status. It is fairly simple to setup, and monitor. It also has a threshold in milliseconds for the page response to be received. This will then trigger warning or critical status levels if the web server is running particularly slow. Here is a typical example of how to configure it:

Check Type Warning Threshold Critical Threshold XML Config

Page response >= 8 seconds

>= 5 seconds

>= 8 seconds


<check name="chk_http" warn="5000" crit="8000" type="GTE">
    <nagiosServiceName>chk_http_fwlive</nagiosServiceName>
	 <serverpath>http://linuxapp3.corelogic.local:7778</serverpath>
	 <url>/fwlive</url>
	 <host>linuxapp3</host>
</check>

The host parameter needs to match the nagios hostname setting for checks to be correctly accepted. It should be set to the hostname, as determined by the output of the 'hostname' command on the monitored host. The same is true for the nagiosServiceName, which will need to be a configured service check for that host, on the nagios server.

chk_nagios

This is a pseudo check. It runs an external script at a scheduled interval with specified criteria, and then returns the results of that check directly. This allows NPA to run any existing Nagios compatible script.

It's quite easy to configure - here is an example:

Check Type Warning Threshold Critical Threshold XML Config

External Script


	<check name="chk_nagios">
	    <nagiosServiceName>chk_test</nagiosServiceName>
	    <saveMetrics>true</saveMetrics>
	    <scriptName>/home/chris/work/projects/monitoring/dummy-nagios.sh</scriptName>
	    <scriptType>shell</scriptType>
	    <scriptArgs>-t -U -p -w 1 -c 2</scriptArgs>
	</check>

As you can see, with chk_nagios you do not need crit and warn parameters for the check. These should instead be specified as part of the scriptArgs setting, which are the arguments passed to the script on the command line. With chk_nagios, only the scheduling and submission parts of NPA are used to execute and return the results of the check. This means you should be able to test the check fully from the command line before submitting.

A note on users

The script will be executed as the user which the NPA daemon is running as (usually NPA on UNIX/Linux). Since this is the case, you need to make sure that the command has the correct permissions when executed as that user. If it doesn't, you can configure a sudo setting to allow it to be executed as another user. For example, commands which need to run as the oracle user, would need settings something like below:

<scriptName>/sbin/sudo</scriptName>
<scriptArgs><scriptName>-u oracle /path/to/script arg1 arg2</scriptName>

Notice that the only parameter for the scriptName setting is the executable. This is required, and you should not add the sudo arguments to this parameter - they should be instead specified at the start of the scriptArgs setting.