Revision as of 12:49, 28 November 2007

Nagios Integration with COSMOS

Change History

Name:	Date:	Revised Sections:
Ali Mehregani	11/19/2007	Initial version
Ali Mehregani	11/27/2007	Modified based on Mark Weitzel and Valentina Popescu's comments

Workload Estimation

Rough workload estimate in ONE person week
Process	Sizing	Names of people doing the work
Design	15
Code	25
Test	15
Documentation	2
Build and infrastructure	1
Code review, etc.*	2
TOTAL	60

Terminologies/Acronyms

The terms and acronyms below are used throughout this document. Each is defined as it is used here:

Term	Definition
MDR	Management Data Repository
CMDBf	Specification for a CMDB that federates between multiple MDRs
CMDB	Configuration Management Database
CBE	Common Base Event - A standard that defines a common format for logging
SML	Service Modeling Language - An XML based language used for modeling
SML Model	A set of SML compliant resources
SML Repository	Any SML model together with the set of COSMOS APIs used to add new SML resources to the model and to query it.
CMDBf query	MDRs make data available via a query service defined in the CMDBf specification. The input and output of a CMDBf query are structured XML documents described in the specification.
Host	A host in Nagios terms is any entity on a network that can be monitored (e.g. desktop, router, printer, etc...)
Host check	A host check in Nagios corresponds to running a command that will indicate the status of a host
Service	There are two types of services that can be monitored by Nagios on a host: public and private. Examples of public services are HTTP, FTP, POP3, SSH, etc... and examples of private services are CPU utilization, memory consumption, disk space, power consumption, etc...
Service check	Analogous to host check, a service check involves running a command that will check the status of a service
Command	A Nagios command is either a shell executable or a Perl script that performs a specific task (e.g. host/service check)

Introduction

The COSMOS vision is captured in the definition of the word cosmos: "The world or universe regarded as an orderly, harmonious system". The intention of the project is to apply the same concept to the world of system management. Complementary standards such as CMDBf, SML, and WS-Notification, together with Web 2.0 technologies, are making this vision a reality. The overall COSMOS vision is to provide an extensible framework, based on a set of accepted standards, that simplifies the task of building an ecosystem out of existing system management tooling.

In line with that vision, Nagios can help to illustrate how standards deliver value to an open source community and to adopters that intend to provide higher-level management capabilities on top of COSMOS. Industry standards can help to integrate commercial solutions with well-established monitoring infrastructures such as Nagios. The end goal of this integration effort is to develop a framework around WS-Notification and the CMDBf specification using Nagios as an exemplary consumer.

The next two sections give a brief overview of Nagios and WS-Notification.

What is Nagios?

Nagios is a system and network monitoring application that is capable of detecting abnormal behavior and notifying administrators of it. Administrators define what is monitored, and how, using a set of flat-file configurations. There are three primary atomic entities in Nagios:

Host	A physical device on a network that is intended to be monitored (e.g. a desktop, printer, router, switch, hub, etc...).
Service	Indicates the specific component of a host that should be monitored (e.g. CPU utilization, memory consumption, HTTP, etc...)
Command	A utility that allows for a host/service check, notification handling, alerts, etc.... For example, check_CPU can be a command used to monitor the CPU utilization on a particular host.

An administrator is required to define hosts, services, and commands to effectively monitor a set of resources. The actual monitoring of a host/service is not done by Nagios. It is instead done by add-on plug-ins that are defined as individual commands. This architecture makes it possible to monitor virtually any aspect of a system that can be checked automatically. Many plug-ins are already available for monitoring common hosts/services in a typical networking environment. Where these fall short, administrators can write their own plug-in to monitor an uncommon host/service. The data collected from plug-ins is logged to flat files. Nagios itself does not persist events to a database, but plug-ins are available to direct events to an RDBMS such as MySQL.
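As an illustration of the flat-file object definitions described above, the following minimal sketch parses `define <type> { ... }` blocks into dictionaries. The sample host, service, and command values are hypothetical, and the parser is a simplification written for this document; it is not part of Nagios or COSMOS and ignores features such as templates and included files.

```python
import re
from collections import defaultdict

# Hypothetical sample configuration; names and addresses are illustrative only.
SAMPLE_CONFIG = """
define host {
    host_name   web01
    alias       Web server 1
    address     192.0.2.10
}

define service {
    host_name            web01
    service_description  HTTP
    check_command        check_http
}

define command {
    command_name  check_http
    command_line  $USER1$/check_http -H $HOSTADDRESS$
}
"""

# Matches "define <type> { <body> }" across multiple lines.
DEFINE_BLOCK = re.compile(r"define\s+(\w+)\s*\{(.*?)\}", re.DOTALL)


def parse_objects(text):
    """Return {object_type: [{directive: value}, ...]} for each define block."""
    objects = defaultdict(list)
    for object_type, body in DEFINE_BLOCK.findall(text):
        directives = {}
        for line in body.splitlines():
            line = line.split(";", 1)[0].strip()  # drop trailing ';' comments
            if not line or line.startswith("#"):
                continue
            parts = line.split(None, 1)
            directives[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
        objects[object_type].append(directives)
    return dict(objects)


objects = parse_objects(SAMPLE_CONFIG)
```

In this sketch, `objects["service"][0]["check_command"]` names the command that performs the check, which mirrors how a service refers to a plug-in through a command definition.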

The Nagios service runs on Linux but it is capable of monitoring desktops running Windows via its plug-in architecture. As part of its monitoring solution, Nagios also provides an alerting mechanism that broadcasts a problem to sets of contacts or contact groups. A notification handler can also be registered to take certain actions based on incoming events (e.g. storing status information in an RDBMS). The diagram below, extracted from Nagios documentation, pictorially depicts the components:

There is also a web-based UI included that provides reporting and limited administration capabilities. A screen shot of the Nagios web-based UI is included below. The next section describes the scope and the value of this enhancement.

See Nagios user guide to find out more about its capabilities.

What is WS-Notification?

WS-Notification is an umbrella for a set of specifications that describe the publishing and subscription of events in the context of Web services. Three specifications fall under WS-Notification:

WS-BaseNotification
WS-BrokeredNotification
WS-Topics

The first specification describes the basic interfaces and calls required by notification producers and consumers, the second describes a middle tier (a broker) between a producer and a consumer, and the third describes the structure of topics for publishing and subscription.

COSMOS intends to provide a notification broker as part of its framework for publication and subscription of events. The notification broker should not be confused with the broker that resides in the management domain. They are separate components with different functionalities. There is a separate enhancement under development for the notification broker and its implementation detail will not be included in this document.

The following terms are commonly used in the context of WS-Notification:

Term	Definition
Notification Producer	An entity that creates a notification message
Notification Consumer	An entity receiving a notification message
Subscription	The act of registering interest in receiving notifications on a set of topics
Publishing	The act of advertising the intent to produce notifications on a set of topics
Topic	Topics are used to categorize the notification messages produced. Topics can be organized into hierarchies
Topic Space	A forest of topic trees (i.e. a set of topic trees)
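The relationship between topics and a topic space can be sketched as a small data structure. This is only an illustration of "a forest of topic trees": the slash-separated path syntax and the class names are assumptions made for this sketch and are not the topic expression syntax defined by WS-Topics.

```python
class TopicNode:
    """One topic in a hierarchy; children are its sub-topics."""

    def __init__(self, name):
        self.name = name
        self.children = {}


class TopicSpace:
    """A forest of topic trees, keyed by root topic name."""

    def __init__(self):
        self.roots = {}

    def add(self, path):
        """Register a topic given as a '/'-separated path (assumed syntax)."""
        level = self.roots
        for part in path.split("/"):
            node = level.setdefault(part, TopicNode(part))
            level = node.children

    def contains(self, path):
        """Return True if the topic path has been registered."""
        level = self.roots
        for part in path.split("/"):
            if part not in level:
                return False
            level = level[part].children
        return True

    def descendants(self, path):
        """Return the sorted full paths of every topic below the given one."""
        level = self.roots
        for part in path.split("/"):
            if part not in level:
                return []
            level = level[part].children
        found = []
        stack = [(path, level)]
        while stack:
            prefix, children = stack.pop()
            for name, node in children.items():
                full = prefix + "/" + name
                found.append(full)
                stack.append((full, node.children))
        return sorted(found)


space = TopicSpace()
space.add("host/web01/status")
space.add("host/web02/status")
space.add("service/web01/http")
```

Here `space.roots` holds two trees (`host` and `service`), and a subscriber interested in everything under `host` could enumerate `space.descendants("host")`.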

Purpose

The purpose of this document is to describe how COSMOS, and by extension commercial vendors, can leverage an existing installation of Nagios via industry standard interfaces.

Scope

There are a number of areas where COSMOS can add value to Nagios. The areas can be summarized into three categories:

Standardized Query Capability
Publication and Subscription of Nagios Events
Reporting on WS-Notifications

Standardized Query Capability

The contribution of a CMDBf query service on top of a Nagios server will provide a standardized mechanism for querying the configuration items managed by Nagios. A CMDBf query service will also allow Nagios to participate in a federating CMDB environment. A well-known query service will make it easier to integrate multiple Nagios servers and/or commercial-based solutions under one infrastructure.

There are 10 different object types defined in Nagios:

1. Hosts
2. Host Groups
3. Services
4. Service Groups
5. Contacts
6. Contact Groups
7. Commands
8. Time Periods
9. Notification Escalations
10. Notification and Execution Dependencies

The first 6 object types are examples of configuration items that can be exposed via a CMDBf query service. Operational data such as the status of a host/service will not be exposed via the query service. This information will instead be published to a notification broker described in the next section.
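To make the idea of a standardized query concrete, the sketch below builds a query document that asks for all items of one exposed Nagios object type. The element names (`query`, `itemTemplate`, `recordConstraint`, `recordType`) and the CMDBf namespace URI reflect one reading of the CMDBf 1.0 specification and should be verified against it; the Nagios record namespace and the type names are hypothetical placeholders, not something defined by COSMOS.

```python
import xml.etree.ElementTree as ET

# Assumed CMDBf data model namespace; verify against the specification.
CMDBF_NS = "http://cmdbf.org/schema/1-0-0/datamodel"
# Hypothetical namespace for Nagios record types.
NAGIOS_NS = "http://example.org/nagios"

# The six object types this document proposes to expose as configuration items.
EXPOSED_TYPES = ["host", "hostgroup", "service", "servicegroup", "contact", "contactgroup"]

ET.register_namespace("cmdbf", CMDBF_NS)


def build_query(object_type):
    """Return a query document (as a string) selecting items of one record type."""
    if object_type not in EXPOSED_TYPES:
        raise ValueError(f"{object_type!r} is not exposed via the query service")
    query = ET.Element(f"{{{CMDBF_NS}}}query")
    template = ET.SubElement(query, f"{{{CMDBF_NS}}}itemTemplate", {"id": object_type + "-items"})
    constraint = ET.SubElement(template, f"{{{CMDBF_NS}}}recordConstraint")
    ET.SubElement(
        constraint,
        f"{{{CMDBF_NS}}}recordType",
        {"namespace": NAGIOS_NS, "localName": object_type},
    )
    return ET.tostring(query, encoding="unicode")


host_query = build_query("host")
```

Operational data such as host status is deliberately rejected by this sketch, in keeping with the statement above that status is published to the notification broker rather than queried.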

Publication and Subscription of Nagios Events

Assuming the existence of a notification broker in COSMOS, Nagios can publish a set of topics to indicate the status of the monitored hosts and services. The notification broker can disseminate messages to any client that subscribes to the published topics. This mechanism allows clients to process events generated by multiple monitoring solutions. Consider the following example:

Assume the existence of three data managers:

A provisioning solution capable of deploying software to multiple nodes
A Nagios server monitoring a cluster of nodes
A second Nagios server monitoring a different cluster of nodes

Also assume that the two Nagios servers are used to monitor the current patch level on a set of Windows nodes. The Nagios servers can publish a topic to the notification broker to describe the patch level on each Windows-based node. The provisioning solution can then subscribe to this topic and deploy any necessary update when available. The figure below depicts the example. The notification broker is assumed to reside in the management domain and the events are assumed to be persisted by a notification manager:
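The patch-level example can be sketched with a small in-memory broker. This is a conceptual illustration only: the class names, topic string, and message fields are hypothetical, and a real deployment would exchange WS-Notification messages over Web services rather than call Python functions.

```python
class NotificationBroker:
    """Minimal in-memory stand-in for a notification broker."""

    def __init__(self):
        self._subscribers = {}

    def subscribe(self, topic, callback):
        """Register a consumer callback for one topic."""
        self._subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic, message):
        """Deliver a message to every subscriber; return how many received it."""
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(message)
        return len(callbacks)


class ProvisioningSolution:
    """Consumer that records a deployment for every out-of-date node."""

    def __init__(self, required_level):
        self.required_level = required_level
        self.deployed = []

    def on_patch_level(self, message):
        if message["patch_level"] < self.required_level:
            self.deployed.append(message["node"])


PATCH_TOPIC = "windows/patch-level"  # hypothetical topic name

broker = NotificationBroker()
provisioner = ProvisioningSolution(required_level=3)
broker.subscribe(PATCH_TOPIC, provisioner.on_patch_level)

# Two Nagios servers, each monitoring a different cluster, publish to the same topic.
broker.publish(PATCH_TOPIC, {"producer": "nagios-a", "node": "win-01", "patch_level": 2})
broker.publish(PATCH_TOPIC, {"producer": "nagios-b", "node": "win-07", "patch_level": 3})
```

The provisioning solution never needs to know how many Nagios servers exist; it reacts only to the topic, which is the decoupling the example is meant to show.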

Reporting on WS-Notifications

A number of reports are available under the Nagios web-based UI: trends, availability, alert histograms/history/summary, notifications, and event logs. The views are generated based on events per Nagios server. There is no mechanism available to aggregate results from multiple Nagios servers.

COSMOS, on the other hand, can generate reports based on aggregated data from multiple notification producers. The reports will be generated from messages reported to the notification broker. Any Nagios server instance or commercial solution producing notification messages will be able to use a set of general reports that provide an overview of the events that have been reported. Adopters can alternatively provide a custom report template that generates reports based on a subset of the messages reported to the notification manager.

This assumes the existence of a data manager that subscribes to all topics published to the notification broker. The data manager can persist all events reported by the notification broker. This component is depicted as the notification manager in the figure shown in the previous section.
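The kind of aggregation described here can be sketched as follows, assuming the notification manager has persisted messages as simple records. The field names (`producer`, `host`, `status`) and the availability calculation are assumptions made for illustration; actual reports are expected to be generated with BIRT.

```python
from collections import Counter


def availability_by_host(messages):
    """Return {host: fraction of status messages reporting 'UP'}, across all producers."""
    totals = Counter()
    up = Counter()
    for message in messages:
        totals[message["host"]] += 1
        if message["status"] == "UP":
            up[message["host"]] += 1
    return {host: up[host] / totals[host] for host in totals}


# Hypothetical messages persisted from two Nagios servers.
persisted = [
    {"producer": "nagios-a", "host": "web01", "status": "UP"},
    {"producer": "nagios-a", "host": "web01", "status": "DOWN"},
    {"producer": "nagios-b", "host": "db01", "status": "UP"},
    {"producer": "nagios-b", "host": "db01", "status": "UP"},
]

report = availability_by_host(persisted)
```

Because the function ignores which producer sent each message, the same report covers hosts from every Nagios server, which is what the per-server Nagios UI cannot do.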

Requirements

The following requirements fall within the scope of the Nagios/COSMOS integration:

Provide the capability of querying the configuration items of a Nagios server using the CMDBf query APIs
Publish a topic space related to Nagios situations to the notification broker
Notify the notification broker when a situation related to a topic is reached
Provide a set of effective reports for analyzing notification messages created by Nagios servers

Use Cases

The following use cases outline some of the typical tasks that COSMOS users will perform to accomplish an objective.

Use Case 1: Adding a machine to the asset database

Assumption: The COSMOS framework is successfully installed with the asset repository MDR

User opens a browser and points to the URL of the COSMOS client
User right clicks the asset database and selects 'Define Object'. A form is displayed under the details pane for the user to populate the required fields. The form can contain multiple pages that cleanly break down the flow of user actions.
The fields are populated and the 'Finish' button is pressed.
Client will indicate that it is writing the data to the database. The user is either prompted with an error message or a confirmation message to indicate success. In case of user error, the form is returned to be corrected.

Use Case 2: Monitoring of a host in Nagios

Assumption: The COSMOS framework is successfully installed with the asset repository MDR and the Nagios data collector.

User opens a browser and points to the URL of the COSMOS client
User right clicks the Nagios item displayed under the data manager navigator and selects 'Define Object'. A form is displayed under the details pane for the user to populate the required fields. The form can contain multiple pages that cleanly break down the flow of user actions.
User selects the type of object that is to be defined, which in this case happens to be a host. The form is populated by all items from registered data managers that are candidates of the object type selected. In this case all resources that are candidates of being monitored are queried from the asset database and displayed in the details pane.
Where possible, items of corresponding data stores are displayed to make it easier for a field to be populated. For example, the items of an employee database can be displayed for the contact/contact group fields of a host definition.
The user either has the option of selecting a discovered host or populating the fields manually. In the case where a discovered host is selected, the fields of the form are populated based on the selected host.
User clicks 'Finish' to start monitoring the host

Use Case 3: Monitoring of a service defined with Nagios

Assumption: The COSMOS framework is successfully installed with the asset repository MDR and the Nagios data collector.

User opens a browser and points to the URL of the COSMOS client
User right clicks the Nagios item displayed under the data manager navigator and selects 'Define Object'. A form is displayed under the details pane for the user to populate the required fields. The form can contain multiple pages that cleanly break down the flow of user actions.
User selects the type of object that is to be defined, which in this case happens to be a service. The form is populated by all available services on the associated host. For example, a configuration database can be queried by the client to retrieve all available services on an associated host.
Where possible, items of corresponding data stores are displayed to make it easier for a field to be populated. For example, the items of an employee database can be displayed for the contact/contact group fields of a service definition.
The user has the option of either selecting a discovered service or populating the fields manually. In the case where a discovered service is selected, the fields of the form are populated based on the selected service.
User clicks 'Finish' to start monitoring the service

Use Case 4: Viewing the status of hosts/services being monitored

Assumption: The COSMOS framework is successfully installed with the Nagios data collector. It's assumed that one or more host/service is configured for monitoring.

User opens a browser and points to the URL of the COSMOS client
User right clicks the Nagios item displayed under the data manager navigator and selects 'Display Monitoring Resources'. A tree of hosts and services is displayed in the details pane with corresponding icons that indicate the result of the last status check of each host/service. See use case 5 for details about finding more information about a failed host/service.

Use Case 5: Determining the problem associated with a host/service

Assumption: The COSMOS framework is successfully installed with the Nagios data collector. It's assumed that one or more host/service is configured for monitoring and at least one host/service is down. The context of the use case is the status navigation tree.

User right clicks a host/service that is indicated to be down and selects 'Display Information'
A BIRT report is generated to display the events of the selected host/service that led to its downtime.

Use Case 6: Generating reports based on host availability

Assumption: The COSMOS framework is successfully installed with the Nagios data collector.

User opens a browser and points to the URL of the COSMOS client
User right clicks the Nagios item displayed under the data manager navigator and selects 'Generate Report > Availability'. A BIRT report is generated and displayed to show the general availability of hosts and services being monitored.

Use Case 7: Removing a host being monitored

Assumption: The COSMOS framework is successfully installed with the Nagios data collector. It's assumed that one or more host/service is configured for monitoring.

User opens a browser and points to the URL of the COSMOS client
User right clicks the Nagios item displayed under the data manager navigator and selects 'Display Monitoring Resources'. A tree of hosts and services is displayed in the details pane.
User right clicks a host and selects 'Remove'. The user is prompted with a message that indicates the implication of removing a host.
The user clicks OK to proceed. The host is removed from Nagios and the status navigation tree is updated to reflect the user action.

Implementation Detail

Integrating Nagios will span features across three sub-projects: Data Collection, Resource Modeling, and Data Visualization. The features can be categorized into three different areas:

Nagios Data Manager Package
Resource Modeling MDRs
Nagios Client

The diagram below displays the interaction between the COSMOS and Nagios components:

Nagios Data Manager Package

A package will need to be included to register the Nagios server as a data manager with the COSMOS framework. The package will need to:

Discover the management domain and register itself as a data manager with the advertised brokers
Discover the CBE data manager to determine its end point. This may be code that needs to run periodically until the CBE repository registers itself with the management domain. Keep in mind that there is no ordering of how data managers are registered. There is always the possibility of the CBE data manager registering itself after the Nagios data manager.
After discovering the CBE data manager, activate a Nagios plug-in that will redirect all events to the CBE repository. This step will require Nagios to be reconfigured and restarted. The user will need to be prompted before restarting the Nagios process.

The Nagios packaging code base is expected to be checked into the Data Collection subproject.

Resource Modeling MDRs

As part of illustrating the seamless integration of multiple MDRs with a system monitoring application, there will be two additional MDRs added to the Resource Modeling subproject. The asset repository will also need to be modified to ensure a smooth integration. The two new MDRs will be:

Configuration MDR - contains configuration detail about what is stored and running on a host
Employee MDR - contains information about staff members

The first database will be used to discover services that can be monitored on a specific host and the second database will be used to display a list of employees that can be included in the contact list of a host or service definition. Both MDRs are expected to be implemented on top of the SML repository which already includes a CMDBf query capability. The SML repository code will need to be refactored to extract out any code that is specific to the asset repository.

Nagios Client

The data visualization subproject is expected to contribute the following functionalities:

The ability to define hosts and services
The ability to generate reports on Nagios events, notifications, alerts, etc.
The ability to write objects to the asset, configuration, and employee MDRs
Nagios specific views to visualize the status of the monitored objects

Task Breakdown

This section breaks the work down into individual tasks, grouped by subproject. Symbols indicate the enhancement that each work item falls under.

Resource Modeling

Φ Refactor any code necessary to provide write capability to the asset repository
Φ Refactor the data center SML model to make it fit better with the resource model that Nagios uses
Φ Refactor the CMDBf query code for the asset based repository to provide any additional queries that the client will need
Φ Provide a model mapping from the asset model to the Nagios model
Ψ Refactor the SML repository code to provide a common plug-in that multiple SML based repositories can use
Ψ Provide an employee based model using SML
Ψ Extend the SML repository code to provide an employee database
Ψ Provide a CMDBf query implementation for the employee database
Ψ Provide a model mapping from the employee model to the Nagios model
Ω Provide a configuration based model using SML
Ω Extend the SML repository code to provide a configuration database
Ω Provide a CMDBf query implementation for the configuration database
Ω Provide a model mapping from the configuration model to the Nagios model
β Use the programming model to plug the employee MDR into the COSMOS framework
β Use the programming model to plug the configuration MDR into the COSMOS framework

Enhancements:
Φ [Nagios]Generalize the asset repository and the data center model
Ψ [Nagios]Provide an employee MDR based on the SML repository
Ω [Nagios]Provide a configuration MDR based on the SML repository
β [Nagios]Add the employee and configuration MDRs to the COSMOS framework
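Several of the tasks above call for a model mapping from an MDR model to the Nagios model. As a rough sketch of what the asset-to-Nagios mapping might produce, the function below turns an asset record into a Nagios host definition. The asset field names (`name`, `description`, `ip_address`) are hypothetical, since the asset model is not specified in this document; `host_name`, `alias`, `address`, and `contact_groups` are standard Nagios host directives.

```python
def asset_to_host_definition(asset, contact_groups):
    """Render a Nagios host definition from a (hypothetical) asset record."""
    directives = [
        ("host_name", asset["name"]),
        ("alias", asset.get("description", asset["name"])),
        ("address", asset["ip_address"]),
        ("contact_groups", ",".join(contact_groups)),
    ]
    # Left-align directive names in a 16-character column for readability.
    body = "\n".join(f"    {key:<16}{value}" for key, value in directives)
    return "define host {\n" + body + "\n}\n"


definition = asset_to_host_definition(
    {"name": "db01", "description": "Database server", "ip_address": "192.0.2.20"},
    contact_groups=["admins", "dba"],
)
```

The contact groups here would come from the employee MDR, while the remaining fields come from the asset repository, which is why both mappings appear in the task list.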

Data Collection

Φ Define a mapping between Nagios events and CBE events
Φ Provide a Nagios plug-in to forward events to the CBE data manager
β Provide a mechanism to register a Nagios server as a data manager
β Provide administrative capabilities that the client can invoke

Enhancements:
Φ [Nagios]Provide a Nagios plug-in to log events as CBEs to the CBE data manager
β [Nagios]Register a Nagios monitoring server as a data manager
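The mapping between Nagios events and CBE events is left to be defined by the first task above. The sketch below shows one possible shape for it. The property names (`creationTime`, `severity`, `msg`, `sourceComponentId`) follow the Common Base Event format, but the state-to-severity table and the input event fields are assumptions made for illustration, not a mapping agreed by the project.

```python
from datetime import datetime, timezone

# Assumed mapping from Nagios service states to CBE severity values
# (0 = unknown, 10 = information, 30 = warning, 50 = critical).
STATE_TO_SEVERITY = {"OK": 10, "WARNING": 30, "CRITICAL": 50, "UNKNOWN": 0}


def nagios_event_to_cbe(event):
    """Convert a (hypothetical) Nagios event record into a simplified CBE-like dict."""
    created = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc)
    return {
        "creationTime": created.isoformat(),
        "severity": STATE_TO_SEVERITY.get(event["state"], 0),
        "msg": event["output"],
        "sourceComponentId": {
            "location": event["host"],
            "component": event.get("service", "host"),
        },
    }


cbe = nagios_event_to_cbe(
    {
        "timestamp": 0,
        "host": "web01",
        "service": "HTTP",
        "state": "CRITICAL",
        "output": "Connection refused",
    }
)
```

A plug-in that forwards events to the CBE data manager would apply a conversion of this kind to each event before sending it.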

Data Visualization

Φ Provide actions to write to the asset MDR
Φ Provide the forms necessary to write data to the asset MDR
Φ Provide actions to write to the employee MDR
Φ Provide the forms necessary to write data to the employee MDR
Φ Provide actions to write to the configuration MDR
Φ Provide the forms necessary to write data to the configuration MDR
Ψ Provide actions to define objects on the Nagios data manager
Ψ Provide the forms necessary to define the Nagios objects
Ψ Provide actions to perform administrative tasks on the Nagios data manager
Ψ Define a framework that allows for an MDR to be replaced/added as part of defining objects for Nagios
Ω Provide a navigator that displays the status of hosts and services monitored on Nagios
β Provide reporting capabilities for viewing host/service events
β Provide two general reporting capabilities on Nagios events (e.g. availability and alert history)

Enhancements:
Φ [Nagios]Provide write actions and forms for the asset, configuration, and employee MDR
Ψ [Nagios]Provide actions and forms for defining Nagios objects
Ω [Nagios] Provide a status navigator for the Nagios data manager
β [Nagios] Provide reporting capabilities for the Nagios events

Open Issues/Questions

All reviewer feedback should go in the talk page for Nagios integration with COSMOS.



Difference between revisions of "COSMOS Design 188390"
