Monitoring Strategy and Architecture

The attached child pages contain information about IT monitoring at Nu Skin: the current state of monitoring, research on best practices, and a strategy for where we should take monitoring.  We want to move toward a model where we manage to service levels rather than only responding to outage events.

Executive Summary

  1. Nu Skin has too many monitoring tools.
  2. Nu Skin needs a better-coordinated monitoring effort across groups.
  3. Nu Skin can reduce the annual labor and maintenance costs of monitoring tools by migrating off some of them.
  4. Nu Skin should enlist consultants to help clean up our monitoring, but should not spend significantly more money on licenses for new tools.

Gartner Recommendations on IT Event Monitoring

Large enterprises should consider a multitier event management hierarchy, pushing some event processing and correlation out to the managed IT element at the bottom of the hierarchy to reduce the overflow of unnecessary events, using specialized event management tools to gain additional depth in specific IT domains at the middle tier of the hierarchy, and placing a general purpose manager of managers (MoMs) product at the top tier of the architecture to achieve a single, integrated view of events from a wide range of IT infrastructure elements.

  • It is a good practice to push as much element-specific event correlation as possible down to the lowest tiers of the ECA architecture, even as far down as the managed element itself.
  • Implement a tiered event correlation architecture, where event management products in each IT technology domain filter and pass data on to allow IT operations to view only the most important multidomain data at the highest-level MoM or BSM console.
  • Identify the information you need at the highest-level MoM to help define which products pass data, which data is passed, when data is passed, and how data is filtered and correlated, as well as to highlight gaps in monitoring that will need to be filled.
  • Preprocess event data at lower tiers in the ECA architecture, so as not to overload the MoM or BSM console.
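
The tiered approach described above can be illustrated with a minimal sketch. It is not taken from Gartner or from any specific product; the Event fields, severity names, and function names are all hypothetical and exist only for illustration. Each domain-level manager drops low-severity events and collapses repeats before forwarding, so the top-tier MoM receives only a small, pre-filtered feed.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical severity scale; a real tool would define its own levels.
SEVERITY = {"info": 0, "warning": 1, "critical": 2}


@dataclass(frozen=True)
class Event:
    domain: str    # e.g. "database", "network" (illustrative names)
    element: str   # the managed element that raised the event
    check: str     # what was being checked
    severity: str  # one of the SEVERITY keys


def correlate_domain(events, min_severity="critical"):
    """Lower tier: drop events below min_severity and collapse
    repeats of the same (element, check) into a single event."""
    threshold = SEVERITY[min_severity]
    seen = {}
    for event in events:
        if SEVERITY[event.severity] < threshold:
            continue  # handled inside the owning group, never forwarded
        seen.setdefault((event.element, event.check), event)
    return list(seen.values())


def manager_of_managers(domain_feeds):
    """Top tier: merge the already-filtered feeds into one view by domain."""
    view = defaultdict(list)
    for feed in domain_feeds:
        for event in feed:
            view[event.domain].append(event)
    return dict(view)


# Made-up sample events for two domains.
raw_db = [
    Event("database", "db01", "tablespace", "warning"),
    Event("database", "db01", "listener", "critical"),
    Event("database", "db01", "listener", "critical"),  # repeat
]
raw_net = [
    Event("network", "sw-core", "link", "info"),
    Event("network", "sw-core", "link", "critical"),
]

view = manager_of_managers([correlate_domain(raw_db), correlate_domain(raw_net)])
for domain, events in view.items():
    print(domain, [(e.element, e.check) for e in events])
```

In this sketch five raw events are reduced to two at the top tier. The design choice is that filtering rules live with the domain that understands them, while the MoM only merges what it is given.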

Background

Past projects have focused on tool selection, not on tool implementation. We have purchased many different monitoring tools at Nu Skin, often with the desire to combine all of our monitoring into one giant tool.  This has not been successful.  In fact, we have started several projects to select the perfect tool, bought the tool, and then implemented only a few of our alerts in it, without retiring any of our existing monitoring tools.  We are good at selecting tools but not good at implementing them all the way.

We have an elaborate, decentralized, disorganized, and inefficient, but effective, approach to monitoring our systems.  Nu Skin monitors its systems in many different ways.  Given our lack of a monitoring strategy, each group has implemented its own monitoring to ensure the health of the systems it is responsible for.  In total, we monitor IT systems in 17 different ways (including the 7 tools that Carl S. was using).

Our IT systems groups take a very high degree of ownership for keeping their respective systems up.  In practice, this means that 99% of the time events are detected and fixed within the individual groups before they ever cause a critical problem.

Recommendations

Promote the continued use of low-level event managers by the DBA, System Admin, and Network groups, but add a general-purpose manager at the top tier of the architecture to achieve a single, integrated view of events.

Specifically, replace

  1. Sitescope 6
  2. Sitescope 8
  3. HP Openview
  4. Nagios 

with Nagios XI, the commercial version.

In the 2010 budget, we have broken the costs down as follows:

  • One time license cost
  • Consulting cost
  • One year support
  • Server H/W cost

Benefits

This would lower the annual support costs for tools, reduce the number of tools we use for monitoring, and provide an integrated view of critical events.

The money would be spent mostly on implementation, not on licenses or support.