Nagios Monitoring Standards & Guidelines

Naming Standards

Hostname Naming Standards:

  • Pattern: host.domain
  • Example:

Each hostname will be a fully qualified domain name.

Host Groups Naming Standards:

  • Pattern: Operatingsystem_cc
  • Examples: Linux_us, Windows_us, Cisco_jp, Hpux_us

Each host group name consists of the operating system followed by a two-letter, lower-case country code.

Service Groups Naming Standards:

  • Pattern: Servicegroup_loc_env_cc
  • Examples: Apache_ext_prod_us, Drupal_int_test_us
  1. Must start with a descriptive name of the service type.
  2. Should contain the network location: external (ext) or internal (int).
  3. Must state the environment: prod (production) or test.
  4. Must end with a two-letter, lower-case country code.
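
The four rules above can be checked mechanically. The following is a minimal sketch in Python, not part of the standard itself. The helper name parse_service_group is hypothetical, the characters allowed in the descriptive name are an assumption (a capital letter followed by letters or digits), and the network location is treated as required, as in the pattern.

```python
import re

# Assumed pattern: Servicegroup_loc_env_cc, e.g. Apache_ext_prod_us.
# The character set allowed in the descriptive name is an assumption.
SERVICE_GROUP_RE = re.compile(
    r"^(?P<name>[A-Z][A-Za-z0-9]*)"  # rule 1: descriptive service type
    r"_(?P<loc>ext|int)"             # rule 2: network location
    r"_(?P<env>prod|test)"           # rule 3: environment
    r"_(?P<cc>[a-z]{2})$"            # rule 4: lower-case country code
)


def parse_service_group(group_name):
    """Return the parts of a service group name, or None if it breaks the rules."""
    match = SERVICE_GROUP_RE.match(group_name)
    return match.groupdict() if match else None


if __name__ == "__main__":
    for candidate in ("Apache_ext_prod_us", "Drupal_int_test_us", "apache_prod_US"):
        print(candidate, "->", parse_service_group(candidate))
```

A check like this could run before new definitions are loaded, so that badly named groups are rejected early.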

Service Naming Standards:

  • Pattern: Servicename_host_cc
  • Examples: Httpd_hostname_us, Mysql_hostname_us, Ping_hostname_us
  1. Must start with a capitalized, descriptive name of the service type.
  2. Must contain the host name.
  3. Must end with a two-letter, lower-case country code.

Contact Groups Naming Standards:

  • Pattern: Contactgroup_dept
  • Examples: Admins_noc, Helpdesk_noc, SA_noc
  1. Will have a name that is descriptive of the department.
  2. Will contain the abbreviated name to help identify the group in the company.

Contact Naming Standards:

  • Pattern: FLast
  • Example: jsmith
  1. Will match the user's Windows login name (first initial followed by last name).

Monitoring Frequency Standards

The default will be to monitor services every 2 minutes.

When a check detects an error, the service will be rechecked 3 times at 30-second intervals before a notification is sent.

If the production service is not critical, the service can be checked less frequently.

Important test systems will be checked every 15 minutes and, on failure, rechecked 5 more times at 5-minute intervals before a notification is sent. (Only important test systems will be monitored, and they could be down for 25-45 minutes with no alerts.)
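
The intervals above determine how long a failure can go unreported. The following Python sketch works through that arithmetic under a simplified model, which is an assumption: a failure is first noticed at the next regular check and is then confirmed by the stated number of rechecks. The function name notification_delay_minutes is hypothetical.

```python
def notification_delay_minutes(check_interval, retry_interval, rechecks):
    """Return (best_case, worst_case) minutes from failure to notification.

    Simplified model (an assumption): a failure is first noticed at the next
    regular check, then confirmed by `rechecks` further checks spaced
    `retry_interval` minutes apart before the notification goes out.
    """
    confirm_time = rechecks * retry_interval
    # Best case: the service fails just before a regular check runs.
    # Worst case: it fails just after one, so a full interval passes first.
    return confirm_time, check_interval + confirm_time


if __name__ == "__main__":
    # Production default: every 2 minutes, 3 rechecks 30 seconds apart.
    print("production:", notification_delay_minutes(2, 0.5, 3))
    # Important test systems: every 15 minutes, 5 rechecks 5 minutes apart.
    print("test:", notification_delay_minutes(15, 5, 5))
```

Under this model a production failure is reported after 1.5 to 3.5 minutes and a test-system failure after 25 to 40 minutes. The model ignores check execution time and scheduling latency, so real delays can be longer.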

Where Monitoring Should Occur

The goal of our monitoring design is for each system to run as many of its own monitoring checks as possible, reducing the load on the main Nagios monitoring server.  This allows more frequent checks and makes the monitoring server more stable.


Common Status Code Standards

The 4 status codes that we will standardize on are:

0 = OK: the process ran to completion and is running within acceptable parameters.
1 = Warning: the process did not fail, but it is in a state where some action may be required.
2 = Error: an error occurred with the process and action needs to be taken.
3 = Unknown: the state of the process could not be determined and should be checked.
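
As an illustration of how a check might map its result onto these four codes, here is a minimal Python sketch of a disk-usage check. The thresholds (80% and 90%), the function names, and the output wording are all assumptions for illustration, not part of this standard.

```python
import shutil

# Status codes from the standard above.
OK, WARNING, ERROR, UNKNOWN = 0, 1, 2, 3
LABELS = ("OK", "WARNING", "ERROR", "UNKNOWN")


def classify(percent_used, warn_at=80.0, error_at=90.0):
    """Map a usage percentage to a status code; thresholds are illustrative."""
    if percent_used is None or not 0 <= percent_used <= 100:
        return UNKNOWN
    if percent_used >= error_at:
        return ERROR
    if percent_used >= warn_at:
        return WARNING
    return OK


def main(path="/"):
    """Check disk usage for `path`, print one status line, return the code."""
    try:
        usage = shutil.disk_usage(path)
        percent_used = 100.0 * usage.used / usage.total
    except (OSError, ZeroDivisionError):
        # The measurement itself failed, so the state is unknown.
        percent_used = None
    status = classify(percent_used)
    detail = "unavailable" if percent_used is None else f"{percent_used:.1f}% used"
    print(f"DISK {LABELS[status]} - {path}: {detail}")
    return status
```

To run this as a check, end the script with sys.exit(main()) (after importing sys) so that the status reaches Nagios as the process exit code.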