Alert Tuning Solution Accelerator

Published: November 22, 2004
On This Page
Executive Summary
Introduction
Alert Tuning Overview
Processes and Activities
Appendices

Executive Summary

In the increasingly important and complex world of information technology (IT) operations, it is essential to implement a robust and reliable systems management infrastructure based on proven methods. Service Monitoring enables data center managers to increase operational efficiency and to achieve higher availability for mission-critical applications and Microsoft® Windows® services. Increased levels of performance can be achieved on Windows platforms through the implementation of Service Monitoring, which incorporates best-practice guidance in planning, designing, building, and deploying Microsoft Operations Manager (MOM) 2005 to monitor Windows applications and services.

An overwhelming volume of alerts has been ranked as one of the most crippling issues faced by IT operations. False alerts can create “noise,” which adds costs in headcount, causes inefficiencies for operators having to navigate through an overload of alerts, and perhaps most importantly, creates operational ineffectiveness—resulting from delays in response to legitimate alerts. In such an environment, it can be desirable to adjust Management Packs for a lower level of alert noise.

Under most circumstances, Management Packs will be applicable for the majority of organizations without any adjustments (such as alert tuning). This document is intended to assist large organizations, with complex deployments, in understanding how to utilize alert tuning to achieve the maximum benefit from MOM 2005 and its Management Packs. This process might require a significant up-front investment but, for enterprise IT organizations, it can yield significant benefits over time.

Alert tuning offers increased operational efficiency through:

A reduction or prevention of service incidents through the use of proactive remedial action.

Faster and more effective responses to service incidents.

Improved overall availability of services.

An increase in user satisfaction.

Introduction

Document Purpose

This guide provides detailed information about alert tuning for organizations that have deployed, or are considering deploying, Microsoft Operations Manager (MOM) in a data center or other type of enterprise computing environment. This information consists of prescriptive guidance and custom Microsoft SQL Server™ reports.

Background

Alert tuning is a practice that is part of the Service Monitoring and Control (SMC) service management function (SMF). The SMC SMF is one of 21 SMFs (shown in Figure 1) defined and described in the Microsoft Operations Framework (MOF) Process Model. Every SMF within MOF benefits from some aspect of SMC because these functions are inherent to ongoing process improvement. This is especially true in the Operating Quadrant of the MOF Process Model, where the SMFs are closely interrelated.

Figure 1. MOF Process Model and related SMFs

Appendix B: Key Performance Indicators contains statistics that should be reviewed to understand the performance of SMC as well as to identify opportunities for improvement.

Intended Audience

This guide is intended for the IT professional in data centers and in large and enterprise organizations. MOM administrators, help-desk personnel, and others involved in service monitoring and control should find this guide helpful. It assumes that the reader is familiar with the intent, background, and fundamental concepts of MOF, MOM, and other Microsoft technologies discussed. (Links to further information are contained in Appendix A: Resources.) An overview of MOF and its companion, Microsoft Solutions Framework (MSF), is available at http://www.microsoft.com/technet/itsolutions/cits/mo/mof/default.mspx.

Terminology

This guide uses terminology that is current for MOM 2005. Table 1 lists the changes in terminology between MOM 2000 Service Pack 1 (SP1) and MOM 2005.

Table 1. Terminology Changes Between MOM 2000 SP1 and MOM 2005

MOM 2000 SP1                               MOM 2005

zone configuration group (middle tier)     source management group

master configuration group (top tier)      destination management group

DCAM                                       (MOM) Management Server

processing rule group                      rule group

processing rule                            rule

Feedback

Please direct questions and feedback about this guide to msmfeed@microsoft.com.

Alert Tuning Overview

Goals and Objectives

This chapter specifies the reasons for implementing alert tuning; in addition, it lists key phases and high-level requirements as they pertain to alert tuning.

Alert tuning is the process of reviewing a Management Pack to determine its applicability to a specific environment; it also involves developing an IT operations plan that is built around Microsoft Operations Manager (MOM). Management Packs are designed to produce alerts only for conditions that require action on the part of an administrator. However, variations in specific operating environments can lead to “noise,” a barrage of false alerts that can become overwhelming for operators to process and that overshadows alerts with real value. Noise is often a symptom of a systems-monitoring capability that has not been optimized. Sources of alert noise can include false positives, false negatives, nonactionable alerts, and multiple alerts with the same root cause.

By adjusting Management Packs for a lower level of alert noise, the highest possible level of operational efficiency can be achieved. The result is that IT operations is able to introduce a more qualified, internally developed Management Pack into the MOM environment. This makes the MOM tool an even more relevant and trusted source for SMC alerts.

The primary goal of alert tuning, therefore, is to increase operational efficiency through the enhanced effectiveness of MOM and the overall effectiveness of service monitoring and control.

The successful implementation of alert tuning achieves the following objectives:

Reduction in the number of false alerts

Rapid resolution of actual and potential service breaches through the identification of actionable alerts

Reduction in investigation of service breaches through the identification of valid alerts

Availability of up-to-date infrastructure performance data from an efficiently running MOM infrastructure

Key Phases

Alert tuning is accomplished in seven steps:

1. Alert Tuning Preparation. Gain consensus on the tuning activities, ensure the entrance criteria are met, and create the lab environments for the specific MOM Management Pack being reviewed.

2. Health Specification/Health Model Creation and Review. Manually trace the Health Specification, Health Model, and Management Pack, as applicable. These are reviewed for validity, event-to-alert mapping, rules mapping, and the ability to act upon the findings (that is, actionability).

3. Isolated Lab Validation. Test the net effects of introducing the Management Pack on a management group that carries no production load. This allows the introduction to be assessed in isolation on a scaled-down MOM infrastructure and client base.

4. Pre-Production Lab Review. Analyze the Management Pack behavior in a multihomed lab environment. This environment should include actual production conditions.

5. Preparation and Deployment Review. Perform a final analysis of all results and prepare for deployment of the tuned Management Pack into production.

6. Deployment of the Tuned Management Pack. Transfer the Management Pack from the pre-production environment to the production environment.

7. Run-Time Alert Tuning. Provide ongoing tuning of the Management Pack once it has been introduced to a production environment through ongoing assessment, tuning optimization, and feedback to development.

These phases are discussed in Chapter 4, “Processes and Activities.”

High-Level Requirements

The successful implementation of alert tuning includes the following requirements:

A good understanding of the Service Monitoring and Control SMF and fulfillment of its requirements. This SMF is available at http://www.microsoft.com/technet/itsolutions/cits/mo/smf/smfsmc.mspx.

Completion of entrance criteria as outlined in the “Alert Tuning Preparation” section, later in this document.

An intermediate to advanced understanding of MOM. Further information is contained in the MOM 2005 documentation.

Processes and Activities

This chapter shows what processes and activities must be completed to implement alert tuning—from initial activities, such as gaining consensus on the tuning activities, to final activities, such as using feedback to improve subsequent implementation cycles.

When implementing alert tuning, organizations should adhere to the Microsoft Solutions Framework (MSF) life cycle and project-focused guidance. MSF provides a flexible and scalable framework for planning, building, and deploying business-driven solutions. More specifically, the MSF Process Model—with its Envisioning, Planning, Developing, Stabilizing, and Deploying Phases—should be applied to the implementation process.

Alert tuning also requires close coordination with other Service Monitoring and Control (SMC) activities. These activities include the six core processes that an IT organization follows to fully adopt SMC. Figure 2 illustrates the relationship between these SMC activities and the steps involved when implementing alert tuning.

Figure 2. Steps in alert tuning as they relate to the six core processes of SMC

Those involved in alert tuning should be familiar with the MOF Service Monitoring and Control SMF (which describes these six core processes) and with the guidance related to it. Further information is available at http://www.microsoft.com/technet/itsolutions/cits/mo/smf/smfsmc.mspx.

The following sections describe the steps required for implementing alert tuning.

Alert Tuning Preparation

Overview

The first step in implementing alert tuning is alert tuning preparation. The objective is to ensure the entrance criteria are met, obtain customer and IT consensus regarding tuning activities, and create the lab environments for the specific MOM Management Pack being reviewed.

Entrance Criteria

This preparation step includes the following entrance criteria:

1. Select a Management Pack to tune, which can be either:

   a. A Management Pack that has been acquired from a vendor for a commercial off-the-shelf (COTS) software product.

   b. A Management Pack that has been created by IT operations for an internally developed application.

2. Formalize the Health Model or Health Specification, which can be either:

   a. A complete Health Model (for internally developed software), which was the basis for the Management Pack created by IT operations.

   b. A Health Specification, which is derived from the vendor Management Pack for a COTS application. This can also be a manual walkthrough of the Management Pack itself.

A Health Specification (also called a Health Model for internally developed software) documents significant information used for monitoring a specific component. This may include all actionable events, event exposure and behavior, and instrumentation protocols and behavior. Ideally, this information is directly codified into a language or configuration dataset that MOM can use. Further information about the Health Model and Health Specification is contained in the SMC SMF, available at http://www.microsoft.com/technet/itsolutions/cits/mo/smf/smfsmc.mspx.

Preparation Activities

The following figure illustrates the four key activities needed to prepare for alert tuning, which are detailed in this section. However, in addition to performing these activities, you should consult both the SMC SMF and MSF when implementing any service-management capability.

Figure 3. Activities in the preparation process for alert tuning

Form Agreement on Tuning Activities

The first preparation step is for all the involved teams to reach consensus on the following activities:

Scope of the review process

Appropriate and required participants

General schedule for the review process

Other resources required

Table 2 provides a sample timeline (in business days). Actual schedules might vary; for first-run organizations, additional time for tuning execution might be required, since these Management Packs often present more opportunities for improvement.

Table 2. Sample Alert Tuning Timeline

Identify Participants
1. Identify alert tuning team and service owners. (5 Days)
2. Formalize criteria for alert tuning deliverables. (5 Days)

Review Health Model/Health Specification
1. Deliver model/specification documents.
2. Review material, provide feedback, and integrate feedback into candidate Management Pack. (15 Days)

Test and Deploy Management Pack
1. Deliver build of Management Pack to IT operations.
2. Deploy Management Pack to isolated lab.
   a. Install Management Pack into lab management group and run Management Pack against lab agents, capturing key performance indicators. (See Appendix B: Key Performance Indicators, for a description of each.) (2 Days)
   b. Gather and report results of lab run. (1 Day)
   c. Buffer for resolution of possible performance issues discovered in Management Pack lab pass. (2 Days)
3. Deploy Management Pack to pre-production management group.
   a. Import the Management Pack into the pre-production management group.
   b. Review Management Pack in pre-production management group. (20 Days)

Review Alert Tuning
1. Gather results of pre-production run of Management Pack. (5 Days)
2. Sign off. (2 Days)

Total time: 57 Days

Define Roles and Responsibilities

The alert tuning process requires the cooperation and participation of a number of people from groups throughout IT. These teams work together in a virtual-team capacity to complete all activities. Some of the activities, for example, making change requests that are associated with the Management Pack, are only possible or relevant if the organization has access to the Management Pack development team. In short, these roles include:

IT operations. The IT operations team is responsible for the engineering, implementation, and support of monitoring and manageability infrastructure.

Development team. The development team is responsible for producing enterprise applications that are typically internally developed or that are highly customized or extended. This is an optional role that is only applicable when the Management Pack is developed by the same organization that is reviewing the Management Pack.

Service owner and subject matter experts (SMEs). The service owner and SME virtual team consists of one or more individuals from within the company who are responsible for engineering and upper-tier support of the underlying technology or product. This team is generally responsible for evaluating the technical accuracy of alerts and Management Packs from a qualitative perspective.

All the roles within the alert tuning process have particular tasks for which they are responsible. The following is a detailed list of these responsibilities according to role.

Responsibilities Shared by All Alert Tuning Roles

Establish the scope and exit criteria for the given Management Pack being reviewed. Establishing a common understanding of what work is going to be done during the review process and what criteria the review process will be based upon is vital to the success of the alert tuning process. The teams involved should agree on at least the following criteria prior to beginning the alert tuning process:

Resource commitment. An approximation of how much time per week each team can devote to the alert tuning process.

The scope of tuning. Based on the complexity of the Management Pack and the availability of each team, the alert tuning process can range from very limited to very involved. For Management Packs that contain no scripts and for which IT operations cannot provide technical expertise, the review process might not include isolated lab testing or qualitative review by a service owner or by SMEs. For a Management Pack that contains many scripts, and for which IT operations has many teams that will depend on the service’s monitoring, the alert tuning process might require multiple stages of isolated lab testing; it will also require more time for a complete evaluation of the Management Pack. The complexity of the Management Pack needs to be fully understood up front by all teams; additionally, the scope of the alert tuning process needs to be agreed upon by all teams involved.

The exit criteria. This needs to be established so that all teams know what they are accountable for delivering at the end of the alert tuning process. This can include, but not be limited to, an overall evaluation of the alert-to-ticket ratio (ATR) that the Management Pack provides, the commitment that the development team will have toward resolving outstanding bugs and improvements based on severity and priority, the number of agents to which the Management Pack will be deployed, and the decision of whether IT operations will implement all or portions of the Management Pack into production after the alert tuning process.

Establish the timeline for the alert tuning process. All teams should agree to the start and end dates of the alert tuning process. While this initial schedule can be altered as needed, and as agreed upon by all teams, it is necessary to agree on a general timeline.

Follow standard communications. Representatives from each team must subscribe to the Management Pack alert e-mail distribution list, which will be configured as a notification group in the pre-production MOM infrastructure. Representatives will receive all alerts in the form of e-mails.

Responsibilities for IT Operations

Coordinate the overall alert tuning process. IT operations is generally responsible for leading the alert tuning process. This includes facilitating recurring meetings and discussions over the course of alert tuning. IT operations is responsible for sending out regular status reports during the alert tuning process, which outline outstanding issues and progress to date. There will be instances where project-manager representatives from the development teams fulfill all or part of these responsibilities.

Perform isolated lab analysis. Provide analysis of the Management Pack being reviewed in an isolated lab prior to it being deployed into the pre-production environment to detect possible performance issues. This step is primarily performed on Management Packs that contain scripts. In instances where this step is deemed necessary, IT operations will import the Management Pack into an isolated lab environment and run the Management Pack against a limited set of agents to ensure that the Management Pack does not introduce any adverse performance impacts.

Configure and support the pre-production MOM infrastructure. IT operations is responsible for ensuring the following are in place with respect to the pre-production MOM infrastructure:

The pre-production MOM infrastructure is in working order.

All necessary MOM agents are members of the pre-production management group.

The necessary version of the Management Pack being reviewed is imported in a timely manner and is properly configured according to any release notes or other documentation. This can come from a vendor or can be developed internally.

All necessary notification groups have been configured within MOM to allow alert forwarding via e-mail to the relevant consumers for review of the alerts. IT operations will need to set up a distribution list for people to join to receive the alert e-mails that are forwarded to the notification group.

There will be some instances in which the service owners maintain their own pre-production MOM infrastructure. If they host the Management Pack being reviewed within their infrastructure, the responsibilities listed herein would pertain to the service owners.

Provide support for the service owners and subject matter experts to make customizations to the Management Pack when required. IT operations is responsible for implementing changes in a timely manner to a given Management Pack whenever a change to a component is deemed necessary. IT operations is responsible for documenting what change was made and providing justification for the change. Additionally, IT operations must, wherever relevant, submit a bug or change request associated with the Management Pack. This allows the development team to review and possibly incorporate the change into the default configuration of the Management Pack.

Document and advocate bugs and change requests associated with the Management Pack. As the service owners and SMEs are reviewing the Management Pack, they might raise concerns with various aspects of it. The IT operations team will triage the validity of the issue being raised and present the issue to the alert tuning project team for consideration. Where applicable, IT operations will generate a bug or a change request to get resolution on the issue addressed in the Management Pack. There might be instances where the service owners or SMEs are comfortable in leading this entire process.

Provide ongoing quantitative data of the Management Pack. During the course of the alert tuning process, IT operations is responsible for archiving the necessary information from the pre-production management group to provide ongoing and historic quantitative data for the day-to-day review of Management Pack performance. IT operations is also responsible for providing a means for quick analysis of this information.

Provide the final quantitative analysis of the Management Pack. At the end of the alert tuning process, IT operations is responsible for providing a final quantitative analysis of both the impact that the Management Pack has had on the MOM infrastructure and the overall quality and acceptability of the Management Pack.

Responsibilities for the Development Team

Provide technical representation from the development or application team responsible for developing or managing (in COTS) the underlying technology that the Management Pack is monitoring. In order to ensure that proposed bugs and changes for a given Management Pack can be addressed in a timely manner, a representative from the development or application teams responsible for the underlying technology is required. It is the responsibility of the development team’s representative to ensure that the internal team is involved or that an alternative arrangement is made.

Supply up-to-date copies of specification and model documents at the beginning of the alert tuning process. The development team will provide the alert tuning team with an initial copy of the most recent specification and model documents, and will need to give a timely response to feedback provided to these documents.

Provide continuous up-to-date copies of the Management Pack under review. The development team is accountable for providing the alert tuning team with an initial copy of the Management Pack being reviewed, as well as copies of the updated Management Pack in a timely manner at least once during the review process. A typical alert tuning process, which reviews a Management Pack of low-to-moderate complexity, will require at least two builds of an internally developed Management Pack: the initial build, and one build halfway through the process that incorporates changes based on IT operations’ feedback to that date. If the Management Pack is received from a vendor, the most recently updated version should be acquired.

Communicate feedback criteria and expectations to IT operations, service owners, and SMEs to ensure that the development team is receiving sufficient feedback. To ensure that the alert tuning process is as beneficial as possible, the development team is accountable for providing initial direction concerning the feedback that they are looking for. Likewise, the development team should continuously provide guidance on the feedback needed over the course of the alert tuning process. Wherever possible, all feedback criteria and expectations should be communicated and agreed upon in the beginning and documented in the Alert Tuning Exit Criteria. A sample template for status reports can be found in Appendix C: Template for Project Status Reports.

Responsibilities for the Service Owner and Subject Matter Experts

Provide technical representation from the relevant support teams to ensure the best possible technical review is given to the Management Pack. The service owners are responsible for finding at least one SME who will be able to dedicate the necessary effort to be involved in the alert tuning process. The time commitment of a SME is approximated as follows:

Two to five hours for initial orientation meetings and reading of the alert tuning process documentation

Eight hours for the complete review of the Management Pack specification documents

One hour per day for the ongoing review of the Management Pack as it is running in the pre-production management group

Four to eight hours for the final qualitative analysis of the Management Pack

In some instances, the service owner will also be a SME and will therefore satisfy all necessary roles.

Provide ongoing qualitative review of the Management Pack. The primary role of the service owner and the SME is to provide qualitative review of the Management Pack and to raise issues they find based on their review. For the purposes of the alert tuning process, although reducing alert volume is the primary goal, all aspects of the Management Pack are open to review and discussion. SMEs are encouraged to review and propose changes to all attributes provided within the Management Pack.

Provide the final qualitative assessment of the Management Pack. The service owners and SMEs are responsible for providing final qualitative assessment of the Management Pack. This includes a detailed analysis of each rule that has generated an alert during the course of the alert tuning process. The analysis will also include the evaluation of the rule’s actionability, validity, quality of the knowledge-base content, suppression, and other relevant feedback.

Develop Isolated and Pre-Production Lab Environment

The lab environment needs to be in place prior to the alert tuning review processes. This environment is illustrated in Figure 4.

Figure 4. Alert tuning infrastructure for isolated and pre-production lab environment

Isolated Lab Environment

The isolated lab environment is used to exercise the agent components of a new Management Pack in order to analyze the impact on the managed node. Ideally, the isolated lab environment would be a scaled-down version of the production MOM client and server infrastructure. This typically has the following configuration:

MOM server with MOM Management Server and database

Two or three managed nodes representing the operating systems and applications (applicable to the Management Pack) used in production. For example, if there are many servers running Windows Server™ 2003 and Windows 2000 in the production environment, there should be at least one of each type in the isolated lab.

The lab network should be isolated.

All agents in the isolated lab should be running the application service or technology that the Management Pack monitors.

The agents should be undergoing as close to no load as possible. The purpose of this test is not to see how the Management Pack performs when load is introduced but rather to understand how it affects a nearly idle system. This allows for better isolation.

Pre-Production Lab Environment

The pre-production lab environment is used to evaluate the quality of the Management Pack. This typically has the following configuration:

Install and Deploy Alert Tuning Reports

Overview

There are three reports connected with alert tuning. One displays the number of alert counts between two dates based on various conditions, such as for a Management Pack or a combination of Management Pack and computer group. Another report provides information about the alerts that are frequently generated, their names, and descriptions. A third report displays the total number of alerts for a Management Pack and MOM for ranges within different weeks. Further information about each of these reports is contained in Appendix D: Alert Tuning Reports.
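To illustrate the kind of query that sits behind the first report, the following is a minimal sketch, not the shipped report logic. It assumes a hypothetical Alert table in the OnePoint database with Name and TimeRaised columns; verify those names against your schema, or rely on the installed stored procedures (listed later in this section), whose behavior is defined by the shipped SQL scripts.

# Illustrative sketch only: counts alerts raised between two dates in the
# OnePoint database. The table and column names (Alert, Name, TimeRaised)
# are assumptions; the shipped reports use their own stored procedures.
import pyodbc

def alert_counts_between(server, start_date, end_date, database="OnePoint"):
    # Windows authentication is assumed; adjust the driver and credentials as needed.
    conn = pyodbc.connect(
        f"DRIVER={{SQL Server}};SERVER={server};DATABASE={database};"
        "Trusted_Connection=yes"
    )
    sql = """
        SELECT Name, COUNT(*) AS AlertCount
        FROM Alert
        WHERE TimeRaised BETWEEN ? AND ?
        GROUP BY Name
        ORDER BY AlertCount DESC
    """
    rows = conn.cursor().execute(sql, start_date, end_date).fetchall()
    return [(row.Name, row.AlertCount) for row in rows]

if __name__ == "__main__":
    # Server name reused from the installation example later in this section.
    for name, count in alert_counts_between("ffl-na-mom-01", "2004-11-01", "2004-11-08"):
        print(f"{count:6d}  {name}")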

This section provides step-by-step procedures to install the Alert Tuning Reports application onto a local machine. The application can be installed onto a server by providing, at the time of installation, the name of the server and the databases from which the application procures data. Once deployed, the reports can be viewed on the Web.

Installing Alert Tuning Reports

Before installation and deployment, ensure that:

The server running MOM is installed and configured.

The OnePoint database is set up and running.

The report server is installed.

The Alert Tuning Reports application is installed using the AlertTuningReports.msi provided to the user.

To install the Alert Tuning Reports application

1. Double-click AlertTuningReports.msi. The Alert Tuning Reports wizard begins.

2. On the Welcome page of the wizard, click Next.

3. On the Select Installation Folder page, browse to the location where the application is to be installed (for instance, C:\Alert Tuning Reports). Indicate who the installation is for by selecting either Just me or Everyone. If the path field is left blank, the application will be installed in the default path C:\Program Files\Microsoft\Alert Tuning Reports\.

4. On the Confirm Installation page, click Next.

5. On the Enter Installation Details page, in the text box type the name of the server hosting the OnePoint database (for example, ffl-na-mom-01). If the MOM database is not OnePoint, type the correct database name.

6. If the MOM database is not running on the default instance of SQL Server, select the Select an Instance check box.

7. Select the instance name from the drop-down list, and then click Install.

8. On the Installation Complete page, click Close.

Validating Installation of Alert Tuning Reports

To verify successful installation of the reports

1. Navigate to the location where the application is installed. There should be a folder called MSMReports that contains the following files and folders:

ReportScripts – a folder containing the SQL script for the stored procedures to be uploaded onto the server
AlertCountByDates.rdl
AlertCountByDevice.rdl
AlertCountByProcessingRules.rdl
AlertTuningReports.rptproj
AlertTuningReports.rptproj.user
AlertTuningReports.sln
AlertTuningReports.suo
OnePoint.rds

2. From Control Panel, in Add or Remove Programs, check for Alert Tuning Reports.

3. Make sure the following stored procedures are present in the OnePoint database that the Alert Tuning Reports are using:

sp_AlertByProcessRules
sp_AlertCountByDevices
sp_GetNumberOfDays
sp_GetTopLevelComputerGroups
sp_PrintdaysBetweenStartDateandEndDate

4. Make sure the following user-defined function is present in the OnePoint database: fn_ProcessRuleGroupsMembers
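For environments with several management groups, this presence check can be scripted. The following is a minimal sketch (not part of the Solution Accelerator) that queries the standard INFORMATION_SCHEMA.ROUTINES view for the stored procedures and the user-defined function listed above; the server name and the use of Windows authentication are assumptions to adapt to your environment.

# Illustrative check that the Alert Tuning Reports routines exist in OnePoint.
import pyodbc

EXPECTED = {
    "sp_AlertByProcessRules",
    "sp_AlertCountByDevices",
    "sp_GetNumberOfDays",
    "sp_GetTopLevelComputerGroups",
    "sp_PrintdaysBetweenStartDateandEndDate",
    "fn_ProcessRuleGroupsMembers",
}

def missing_routines(server, database="OnePoint"):
    # Windows authentication is assumed; adjust the driver or credentials as needed.
    conn = pyodbc.connect(
        f"DRIVER={{SQL Server}};SERVER={server};DATABASE={database};"
        "Trusted_Connection=yes"
    )
    rows = conn.cursor().execute(
        "SELECT ROUTINE_NAME FROM INFORMATION_SCHEMA.ROUTINES"
    ).fetchall()
    installed = {row.ROUTINE_NAME for row in rows}
    return EXPECTED - installed

if __name__ == "__main__":
    missing = missing_routines("ffl-na-mom-01")  # server name from the install example
    if missing:
        print("Missing routines:", ", ".join(sorted(missing)))
    else:
        print("All Alert Tuning Reports routines are present.")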

Deploying Alert Tuning Reports

To view the reports on a Web page, the reports must be deployed on the report server. For information on how to do this, either proceed to the following steps, or refer to Reporting Services Books Online, available at http://www.microsoft.com/sql/reporting/.

To deploy the reports

1. Log on to the server that hosts the report server.

2. Open Report Manager. Or, from Internet Explorer, type the address of the site as http://localhost/Reports if you are accessing the reports from a local machine. Otherwise, if you are accessing them remotely, type the address of the site as http://<machinename>/Reports. This opens the following home page.

Figure 5. Report Manager home page

3. Click New Folder. To create a new folder with the name “Alert Tuning Reports,” type Alert Tuning Reports in the Name text box. Type an appropriate description, and then click OK.

4. On the home page, click the Alert Tuning Reports link, and then click Upload File.

5. In the File to upload box, browse to the location where you have installed the application, and select the report definition (RDL) file. For example, choose AlertCountByDates.rdl to upload the Alert Count By Dates report. In the Name box, type the appropriate name as shown in the following screenshot, and then click OK.

Note The three report RDLs must be uploaded individually. Uploading all the RDLs together is not supported.

Figure 6. Uploading the RDL file

6. From the home page that appears once the RDL is uploaded, click New Data Source. The following screenshot appears. (Steps 7 to 10 indicate the details regarding the data source that must be supplied to create the data source for the reports.)

Figure 7. Creating the data source

7. In the Name box, type the name of the data source (for example, OnePoint), and in the Description box, type a description if needed.

8. From the Connection Type drop-down list, set the connection to Microsoft SQL Server, and in the Connection String box, type the connection string used in the OnePoint.rds file. It should be data source=<SQL Server>; initial Catalog=OnePoint.

9. Click Credentials stored securely in the report server, and type the credentials (user name and password) of a user who has privileges to access the database.

10. Select the Use as Windows credentials when connecting to the data source check box, and then click Apply.

11. From the home page, click Alert Tuning Reports. From the Alert Tuning Reports page, click Show Details.

Figure 8. Showing details of the uploaded report

12. Click the Edit icon of the uploaded report.

13. Click the Properties tab. On the Report Properties page, click Data Sources on the left side of the screen. In the Location box, browse to and then associate the created data source (OnePoint) with this report, as shown in the following screenshot.

Figure 9. Associating the data source with the RDL file

14. Click OK, and the following screen appears.

Figure 10. The data source has been associated with the RDL file.

All the uploaded reports will need to be associated with the data source using the process just explained.

Validating Deployment of Alert Tuning Reports

To verify whether the reports have been successfully deployed

1. Log on to the report server.

2. Navigate to the site http://localhost/Reports. There should be a folder named Alert Tuning Reports.

3. Click Alert Tuning Reports. The following page will appear.

Figure 11. Alert Tuning page with the reports and data source uploaded

Uninstalling Alert Tuning Reports

To uninstall the Alert Tuning Reports application

1. In Control Panel, click Add or Remove Programs. Click Remove a program, and from the Currently Installed Programs list, click Alert Tuning Reports, and then click Remove.

2. When prompted to remove the reports, click Yes.

3. Navigate to the location where Alert Tuning Reports was installed. There should be only one file (reportconfigurer.InstallState) in the folder.

4. Remove the file manually.

This will remove Alert Tuning Reports from the server where it was installed and not from the report server. To remove the reports from the report server, follow the guidance provided in Reporting Services Books Online, available at http://www.microsoft.com/sql/reporting/.

Health Model/Health Specification Creation and Review

Overview

The second step in implementing alert tuning is the creation and review of the Health Model and Health Specification. The objective of this process is to perform a manual validation of the event lists and alerts as defined in the Health Model and Health Specification. The value of this activity is in its holistic view of the instrumentation from a service-owner perspective. This establishes a common understanding of how the Management Pack will function, and provides a first-round “validation on paper” of the strategy applied to the Management Pack.

This step may not be applicable in all situations, such as a vendor-provided Management Pack for which a Health Specification is not available. However, in many cases, IT operations might make substantial changes or additions to the vendor-provided Management Packs, and these changes should be reviewed in this step. Also, in the case of framework applications, development teams might create applications or extensions to the frameworks, and Management Packs are then created for the new functionality.

Creation Activities

The Health Model defines what it means for a system to be healthy (operating within normal conditions) or unhealthy (failed or degraded) as well as the transitions in and out of such states. Good information on a system’s health is necessary for the maintenance and diagnosis of running systems. The contents of the Health Model become the basis for system events and instrumentation on which monitoring and automated recovery are built. All too often, system information is supplied in a developer-centric way, which does not help the administrator know what is going on. Monitoring becomes unusable when this happens, and real problems become lost. The Health Model helps to determine what kinds of information should be provided and how the system or the administrator should respond to the information.

Users want to know at a glance if there is a problem in their systems. Many ask for a simple red or green indicator to identify a problem with an application or service, security, configuration, or resource. From this alert, they can then further investigate the affected machine or application. Users also want to know that when a condition is resolved or no longer true, the state will return to “OK.”

Creation of the Health Model includes the following activities:

1. Document all management instrumentation exposed by an application or service.

2. Document all service health states and transitions that the application can experience when running.

3. Determine the instrumentation—events, traces, performance counters, and Windows Management Instrumentation (WMI) objects and probes—necessary to detect, verify, diagnose, and recover from bad or degraded health states.

4. Document all dependencies, diagnostic steps, and possible recovery actions.

5. Identify which conditions will require intervention from an administrator.

6. Improve the model over time by incorporating feedback from customers, product support, and testing resources.

The Health Model is initially built from the management instrumentation exposed by an application. By analyzing this instrumentation and the system-failure modes, Service Monitoring and Control (SMC) can identify where the application lacks the proper instrumentation.
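The following is a minimal, purely illustrative sketch of how the documented health states, transitions, and supporting instrumentation might be captured as structured data during this activity. The component name, state names, event IDs, and recovery steps shown are hypothetical placeholders, not part of any shipped Health Model or Management Pack.

# Illustrative structure for recording a Health Model during this activity.
from dataclasses import dataclass, field
from typing import List

@dataclass
class HealthTransition:
    # A real Health Model documents every transition and the instrumentation
    # (events, traces, counters, WMI probes) that detects it.
    from_state: str
    to_state: str
    detected_by: List[str]          # for example, event IDs or performance counters
    diagnostic_steps: List[str]
    recovery_actions: List[str]
    requires_administrator: bool

@dataclass
class HealthModel:
    component: str
    states: List[str]
    transitions: List[HealthTransition] = field(default_factory=list)

# Hypothetical entry for an illustrative service component
model = HealthModel(
    component="Order Processing Service",
    states=["Healthy", "Degraded", "Failed"],
)
model.transitions.append(HealthTransition(
    from_state="Healthy",
    to_state="Degraded",
    detected_by=["Event ID 2001 (queue backlog)", "Queue Length counter above threshold"],
    diagnostic_steps=["Check connectivity to the downstream database"],
    recovery_actions=["Restart the service if the backlog persists"],
    requires_administrator=True,
))
print(f"{model.component}: {len(model.transitions)} documented transition(s)")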

Further information about the Health Model is contained in the Design for Operations white paper available at http://www.microsoft.com/windowsserver2003/techinfo/overview/designops.mspx.

It is common for an IT organization to purchase commercial off-the-shelf (COTS) software. A set of documented information that is identical to the Health Model also needs to be created for all COTS software. However, because COTS software is not developed internally, the term Health Specification is used here to differentiate it from the Health Model. The Health Specification material for COTS software is created by IT operations (such as the SMC staff) and not developers, and it is designed for COTS software and other purchased service components. In some instances, COTS software is accompanied by its MOM Management Pack, and the documentation of the Management Pack in this case serves as its Health Specification. If COTS software is not accompanied by its Management Pack, it requires staff from IT operations to manually create documentation based on the observed behavior of the COTS software once it is installed in the operational environment.

Review Activities

Conduct a thorough review of the Health Specification and/or Health Model to check for compliance with service-monitoring standards, accuracy, and actionability. Organize review teams and obtain the Health Specification or Health Model for the Management Pack that will be tuned. The following review activities need to occur:

1. Conduct an initial review of the Management Pack design and its approach, using the model as a basis. The development team performs this activity for internally created applications, whereas the operations team performs it for vendor Management Packs (for COTS).

2. Conduct a cursory review of the material for field overloading, which is the misuse of fields and attributes. This step will make sure that these are used appropriately. For example, the Name and Description fields should not contain unique identifiers, and unique identifiers should be unique and organized.

3. Conduct a line-by-line review of the material, giving special attention to the following areas:

   a. Names. Ensure they make sense and are applicable to the condition they are used for.

   b. Event IDs. Make sure they are not duplicated in this or any Management Packs that might be used together.

   c. Any documented suppression. Validate that it makes sense and applies correctly to the situation it is used for.

   d. Descriptive fields. Make sure the text is understandable and provides adequate information.

4. Conduct a review of the associated knowledge-base material, and assess its completeness and actionability.

5. Include the results of this activity in the Health Model, Health Specification, Management Pack, or any of its supporting documentation.

Isolated Lab Validation

Overview

The third step in implementing alert tuning involves using the isolated lab to test the effects of the Management Pack on the agent. The actual test duration in the lab is typically three days, although this can vary depending on resources and on Management Pack size and complexity. If there are conditions that require retesting or further investigation, the run might also be longer. This stage is important not only for optimization; it also serves to protect the pre-production environment used in the next stage.

Validation Activities

The following activities are performed in the isolated lab that was created in the Alert Tuning Preparation step:

1. Install the Management Pack into the isolated lab.

2. Manually tune the Management Pack script frequency to once per minute:

   a. On the MOM Operator console, click each rule group in the Management Pack.

   b. Go into each respective rule and sort the list by the Response column.

   c. Look at the properties for all rules that have script responses.

   d. If the provider type is a timed event, alter the frequency to every one minute.

3. Use System Monitor to capture the behavior of specific instances on the agents, as shown in Table 3.

Table 3. Capturing the Behavior of Instances on Agents

Performance Object    Instance                       Counter
Process               MOMService                     Working Set
Process               MOMService                     Private Bytes
Process               MOMService                     % Processor Time
Process               MOMService                     Handle Count
Process               All (Any) MOMHost Instances    Working Set
Process               All (Any) MOMHost Instances    Private Bytes
Process               All (Any) MOMHost Instances    % Processor Time
Process               All (Any) MOMHost Instances    Handle Count
Processor             All                            % Processor Time

Response to Deviation (applies to every counter listed): After the deviation has been investigated, feedback should be given to the development team, if the Management Pack or application was created internally, or to IT operations, if the Management Pack was created by the vendor for a COTS product.

4. Analyze the results of the MOM server health counters, as in Table 4, and share them with the review team. Any significant anomalies should be reported to the development team for tuning of the Management Pack. The development team should review these anomalies and correct any scripts before proceeding to the pre-production stage. Particular MOM server counters should be monitored on the Management Server in the isolated lab to assess how much MOM data is generated by the Management Pack in the lab. The Db Alert Insert Simple Count counter will provide a general sense of the quantity of certain types of data the Management Pack will produce. The Queue Space Percent Used counter will illustrate how well the isolated lab environment can process the data. The queue percentage should remain consistently low; if it does not, this could be an indicator that the Management Pack could overwhelm the production environment.

Note This evaluation is to be balanced with the fact that script execution is at an abnormally frequent interval.

Table 4. Analyzing the Results

Counter          Analysis

Working Set

The Working Set values should have no steady increase.

There should be no exorbitant usage.

For example, in a lab with a server running Microsoft Windows Server 2003, and with a Dual Pentium III-800 using the Microsoft Active Directory® Management Pack, over a three-day viewing interval, the value should be seen as a flat line. (If sufficiently zoomed-in or observed over a short time interval, this would look like a very wavy line.) The value should be approximately 15 megabytes (MB). However, this value can be justifiably higher based on the number of scripts in the Management Pack.

Private Bytes

The Private Bytes values will have an initial peak and then will stabilize with a horizontal line over the three-day time span.

There should be no sustained deviation. A line resembling a stair step followed by a horizontal line, and then repeating with an upwards progression, indicates a possible memory leak.

% Processor Time

The % Processor Time values will vary depending on actual hardware infrastructure. However, they should be low, around three percent to ten percent. Exceeding this range means that the agent components might adversely affect the overall performance.

The individual values should be compared against Processor (all instances); if Processor (all instances) is low overall, the conditions might be normal. However, if the individual values are high compared to the overall, it might indicate a problem with the Management Pack.

Handle Count

The Handle Count values will fluctuate immensely. At the initiation of a Management Pack script, they will increase, but they should then fall proportionately at the completion of script execution when handles are released.

Over time, if the Handle Count values continue to increase, this may indicate that the scripts are not properly releasing handles. Tracking their mean values over time is one measure of the existence of this problem.

These results should be used to create performance baselines for agent behavior. A baseline is important to determine anomalies in the pre-production and production environments.
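As one way to apply the heuristics in Table 4, the sketch below reads a counter log exported from System Monitor as CSV and flags counters whose values trend steadily upward over the run, the pattern Table 4 associates with memory or handle leaks. The file layout assumed here (a timestamp column followed by one column per counter) matches a typical System Monitor CSV export, but column names and sampling intervals will differ per environment, so treat this strictly as an illustrative starting point rather than part of the Solution Accelerator.

# Illustrative analysis of an exported counter log for sustained increases.
import csv
import sys

def sustained_increase(values, window=10):
    """Return True if consecutive window averages keep rising, a rough
    indicator of a leak rather than normal fluctuation."""
    if len(values) < window * 2:
        return False
    chunks = [values[i:i + window] for i in range(0, len(values) - window + 1, window)]
    averages = [sum(chunk) / len(chunk) for chunk in chunks]
    return all(later > earlier for earlier, later in zip(averages, averages[1:]))

def analyze(csv_path):
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)                 # first column is the sample time
        series = {name: [] for name in header[1:]}
        for row in reader:
            for name, value in zip(header[1:], row[1:]):
                try:
                    series[name].append(float(value))
                except ValueError:
                    pass                      # skip blank or non-numeric samples
    for name, values in series.items():
        if sustained_increase(values):
            print(f"Possible leak pattern: {name}")

if __name__ == "__main__":
    analyze(sys.argv[1])                      # path to the exported counter log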

Pre-Production Lab Review

Overview

The fourth step in implementing alert tuning is validating the pre-production environment. The objective of the pre-production validation is to assess and refine the Management Pack using stimulus in order to tune and reduce alerts that are not valid or actionable (noise).

Review Criteria

The primary means of assessing and refining a Management Pack is to put it into pre-production and then to review the alerts that are produced by the Management Pack. These actions ensure that the alerts effectively and accurately monitor for a given service. Not every alert that the Management Pack produces needs to be triaged from start to finish. However, this process is intended to confirm that each rule that is generating alerts is properly configured to watch for and provide an alert on actionable and valid issues. In addition, this process is intended to ensure that, for any given failure that would merit the creation of a ticket, ideally only one alert is generated by any given rule within the Management Pack. Each rule should be evaluated according to the following criteria:

Actionability. An alert is actionable if it tells you what went wrong and how to fix it. In order to be actionable, the alert text (such as the subject line and description) and the related knowledge-base content must do the following:

Provide a concise description of what failure has occurred.

Contain information that is precise and does not mislead; also, the name and description of the alert should be meaningful and make sense.

Give clear steps to identify and resolve the issue.

The underlying alert must be a condition that requires some action on the part of an administrator. If an alert is telling you that “everything is ok” or that “something has failed but it will fix itself,” the Management Pack should not be generating alerts on it.

Actionability is defined at the rule level; if one instance of an alert from a rule is found to be actionable, all alerts generated by that rule are considered actionable. A non-actionable alert would be missing one or more of the criteria just listed.

Validity. An alert is valid if the following are true:

The alert generated by the rule raises an issue that can be confirmed at the moment of the alert. Alerts that report something that occurred sometime in the past are not valid.

The alert generated raises an issue that has in fact occurred. If the alert's text indicates that a device is offline, but upon further investigation when the issue is triaged the actual state of the device does not support the alert, the alert is not valid.

Validity is something that can vary from alert to alert. A rule could potentially catch events that have only a degree of validity. In these instances where a rule is found to be periodically invalid, the issue should be raised as a bug against the rule, and documentation of valid and invalid instances of the alert should be provided.

Suppression. For any given failure that is to be detected by a Management Pack, there should be one, and only one, alert generated by the Management Pack stating that an issue has occurred. If the same rules generate multiple alerts that all point to the same central failure, effective suppression has not been achieved, and the issue should be addressed.

Table 5 offers a general guide on what action to take based on the result of the three evaluation criteria for the alerts generated by a rule.

Table 5. Qualitative Review Characteristics and Actions

Actionable; Suppression: Good

Valid – Characteristic: This is the target state that alert tuning is striving for. Prescription: There is no issue.

Invalid – Characteristic: Even if an alert is actionable and has good suppression, it must be valid. Prescription: Triage alert instances. Submit a bug to create better logic regarding when to provide an alert about these events.

Actionable; Suppression: Bad

Valid – Characteristic: If the rule is a performance threshold rule, it is possible that an alert is valid and actionable, but that the threshold value is too lax or too sensitive. If that is the case, the threshold should be re-evaluated and, if appropriate, the change should be proposed to the development team. Prescription: Alter the suppression rules to allow for better duplicate alert suppression; submit a bug to have the Management Pack altered.

Invalid – Characteristic: Even if an alert is actionable and has good suppression, it must be valid. Prescription: Triage alert instances. Submit a bug to create better logic regarding when to provide an alert about these events.

Non-Actionable; Poor Documentation

Valid – Characteristic: The alert is valid, but the alert’s subject line, description, and/or knowledge base content do not effectively explain the source of the problem and/or how to resolve the issue. Prescription: A bug should be submitted to the IT operations or development team to resolve this issue.

Invalid – Characteristic: A non-actionable, invalid alert is in the worst condition. Prescription: Disable alerting and/or the entire rule, and submit a bug to the IT operations or development team.

Non-Actionable; Don’t Care: Informational

Valid – Characteristic: The alert informs you of a state that is worth knowing about, but which requires no repair. If the alert does not correlate to something that requires action, there should not be an alert generated. An example of this is an alert that states “Backup has completed successfully.” These are events that you should store in MOM, but they do not require anyone’s action when they occur. Prescription: Alerting should be disabled for the rule, and a bug should be submitted to the IT operations or development team.

Invalid – Characteristic: A non-actionable, invalid alert is in the worst condition. Prescription: Disable alerting and/or the entire rule, and submit a bug to the IT operations or development team.

Non-Actionable; Don’t Care: Unimportant

Valid – Characteristic: The alert informs you of a state that is not worth knowing about and requires no repair. Prescription: The rule should be disabled, and a bug should be submitted to the IT operations or development team.

Invalid – Characteristic: A non-actionable, invalid alert is in the worst condition. Prescription: Disable alerting and/or the entire rule, and submit a bug to the IT operations or development team.
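The decision logic in Table 5 can also be expressed as a small helper for keeping triage notes consistent across reviewers. The sketch below simply mirrors the table's prescriptions; the parameter names and category labels are chosen here for readability and are not taken from any MOM tooling.

# Illustrative mapping of the Table 5 evaluation to a recommended action.
def prescription(valid, actionable, suppression_ok=True, nonactionable_reason=None):
    """nonactionable_reason is one of: 'poor_documentation', 'informational',
    'unimportant' (only used when the rule is valid but non-actionable)."""
    if not valid:
        if actionable:
            return ("Triage alert instances; submit a bug to create better logic "
                    "regarding when to provide an alert about these events.")
        return ("Disable alerting and/or the entire rule, and submit a bug to the "
                "IT operations or development team.")
    if actionable:
        if suppression_ok:
            return "There is no issue; this is the target state."
        return ("Alter the suppression rules to allow for better duplicate alert "
                "suppression; submit a bug to have the Management Pack altered.")
    if nonactionable_reason == "poor_documentation":
        return ("Submit a bug to the IT operations or development team to improve "
                "the alert text and knowledge base content.")
    if nonactionable_reason == "informational":
        return ("Disable alerting for the rule, and submit a bug to the IT "
                "operations or development team.")
    return "Disable the rule, and submit a bug to the IT operations or development team."

# Example: a valid, actionable rule that generates duplicate alerts for one failure
print(prescription(valid=True, actionable=True, suppression_ok=False))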

Preparation for Evaluation

Using the pre-production environment created in the Alert Tuning Preparation step, perform the following activities:

1. Install the Management Pack into the pre-production environment.

2. Configure performance counters on selected client machines to collect counters identical to the isolated lab.

3. Manually configure the Management Pack to e-mail results to a mail public folder.

4. Create an alert rule that will forward all alerts in e-mail to the corresponding public folder of the Management Pack. This step should be completed for every rule group in the Management Pack containing rules that will generate alerts. This is performed so that there is a convenient and common repository of alert details during this process. All the alert tuning roles should be granted access to this mail public folder. Perform the following actions to complete this step:

   a. Create a new alert rule. The Alert Processing Rule Properties wizard will begin.

   b. Select the box next to Only match alerts generated by rules in the following group. This setting will automatically resolve to Generated by this rule group. Click Next.

   c. Click Next again to accept Always processing data.

   d. In the response section, click Add, and then select Send a notification to a notification group.

   e. Click the Notification tab, and select the appropriate notification group from the drop-down list.

   f. Click the E-mail Format tab, and select Custom e-mail format.

   g. Change the subject to read $Alert Name$ [$Domain$\\$Computer$] and leave the Message section as is.

   h. Click OK to accept the notification steps, click Next, and then click Next again to skip over the Knowledge Base section.

   i. Name the rule “Alert Tuning Forward Alerts to Public Folder,” and then click Finish.

Evaluation and Analysis Activities

The activities associated with evaluating and analyzing the Management Pack in the pre-production environment are illustrated in Figure 12. The conclusion of the cycle is determined up front by the Alert Tuning project team and is based on time and exit criteria. A description of these activities is integrated into the comprehensive steps that are presented in this section.

Figure 12. Evaluation and analysis activities

Step 1. Accessing the “Alert Stream” Generated by a Management Pack

The primary impetus for this phase of the Management Pack review process is the alerts that the Management Pack generates. By reviewing these alerts, it will become apparent which rules need to be reviewed and refined.

Once a Management Pack has completed the isolated lab validation and is confirmed to be suitable for deployment into the pre-production environment, a mail public folder will be set up to which the alerts from the Management Pack will be forwarded.

Installing and Using the MOM Operator Console

Now that alerts are being sent to the reviewers, they will need to get access to the MOM Operator console. This allows them to read further details about the alerts and their associated events, and to validate the associated knowledge-base content.

To grant reviewer access to the MOM Operator console

1. Ensure that the participants are part of the MOM Pre-Production Security group as well as the Users and Operators groups.

2. Run the installation for the MOM Operator console. Refer to the MOM 2005 User Guide for detailed instructions on this installation.

Step 2. Prioritizing What to Evaluate

When prioritizing what to tune in a newly installed Management Pack, rather than tediously stepping through each e-mailed alert and validating them one by one (this discussion assumes that a common method of sending alerts is through e-mail), approach alerts on a rule-by-rule basis. Move from the highest overall alert generator to the lowest. To see which rules are in fact generating the most alerts, review the alert tuning reports bundled with this guide. Further information is contained in Appendix D: Alert Tuning Reports.

To view the alert tuning reports

1. Verify that the correct input parameters are set. For example, for the Alert Count by Processing Rules report, make sure you have chosen the correct start and end dates and the correct Management Pack to investigate.

2. The Alert Count by Processing Rules report shows the count of raised and total alerts broken down per rule and listed in descending order, from the highest number of raised alerts to the lowest. When approaching the alert mail being forwarded, focus on alerts from the top “raised alert” generators, and move down the list from there. The total alert count is the number of alerts generated by MOM without regard to suppression; it represents a raw count of alerts raised by a rule. With proper suppression, MOM can detect identical alerts and suppress the subsequent instances after the first one has been received. The number of alerts remaining after suppression has been applied is the raised alert count.

   Negative alerts (alerts that indicate a correction or “good” state) raised by rules that have state monitoring enabled will have a status of “No Problem.” State-monitoring alerts and conditions should not be included in the alert tuning process.

3. Supplementary reports are also included to help manage the pre-production environment and to aid in prioritization and opportunity targeting. These reports include:

   1. Alert Count by Device. This report is used to rank alert volumes generated by each machine participating in the pre-production environment. It is important for detecting abnormalities in the pre-production environment due to problematic systems or scheduled deployments. The abnormalities could also be the result of a known change. These abnormalities can generate excessive volume that could skew the results of the Alert Count by Processing Rules report.

   2. Alert Count by Day. This report is used to rank the alert volumes generated on a daily basis for the hosts participating in the pre-production environment. It is important for detecting abnormalities in the pre-production environment because of scheduled deployments, or a known change or fix. For example, if the sampling period is one week, and there is a known patch rollout on Wednesday and Thursday, the alerts generated by rules on the specific systems receiving a patch for those two days should be discounted.
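
As a complement to these reports, the same prioritization logic can be sketched outside MOM. The following Python example assumes a hypothetical export of alert records (rule name, computer, timestamp); it ranks rules by alert volume while discounting alerts raised on specific systems during a known change window, as described for the Alert Count by Day report. The record layout and host names are illustrative and are not produced by MOM itself.

# Sketch: rank rules by alert volume, discounting a known change window.
# Assumes a hypothetical list of alert records exported from MOM; the field
# names and values are illustrative, not a MOM interface.
from collections import Counter
from datetime import datetime

alerts = [
    {"rule": "MOM Agent Status Monitoring", "computer": "HOST01",
     "time": datetime(2004, 3, 10, 5, 40)},
    {"rule": "Disk Free Space Low", "computer": "HOST02",
     "time": datetime(2004, 3, 11, 2, 15)},
    # ... more exported alert records ...
]

# Known patch rollout: discount alerts from patched hosts on these days.
patched_hosts = {"HOST02"}
change_window = (datetime(2004, 3, 10), datetime(2004, 3, 12))

def counted(alert):
    in_window = change_window[0] <= alert["time"] < change_window[1]
    return not (in_window and alert["computer"] in patched_hosts)

ranking = Counter(a["rule"] for a in alerts if counted(a))
for rule, count in ranking.most_common():
    print(f"{count:6d}  {rule}")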

Step 3. Triaging a Given Rule’s Alerts

Using the report, find the rule that generates the highest number of alerts. Make sure that this rule has not yet been reviewed and that it has no outstanding issue based on the report.

To triage the rule’s alerts

1. Open one of the alerts generated by the rule. This can be found in the alert sent to the public folder or distribution group, as in the following sample alert e-mail. (A sketch for parsing these forwarded e-mails appears after this procedure.)

===============================================
From: mom-e@examplecompany.com [mailto:mom-e@examplecompany.com] 
Sent: Tuesday, March 10, 2004 5:40 AM
To: MP Reviewer
Subject: ::MOM MP:: MOM Agent Status Monitoring [PREPROD\HOST01]
 
Severity:  Error
Status:  New
Source:  Microsoft Operations Manager
Name:  MOM Agent Status Monitoring
Description:  The agent service on Computer HOST01 in domain PREPROD
may be unavailable.  The agent on this computer failed to heartbeat 
but it did respond to a ping within the allotted time.  The last 
heartbeat received from the agent was 2/10/2004 05:25:11.
Domain:  PREPROD
Agent:  HOST01
Time:  3/10/2004 05:40:20
Owner: 
(view with <Web server not defined. Object ID is {C4AFC9A4-C8B6-43C6-8B0C-
887493839CA1}>)
===============================================

2. Find the alert in the MOM Operator console. Now that there is an instance of an alert from the rule in question, the next step is to look the alert up in the MOM Operator console to get complete details on the issue.

   1. Open the MOM Operator console by going to Start -> All Programs -> Microsoft Operations Manager 2005 -> MOM Operator console. The MOM Operator console will open to the default view of Alerts – All.

   2. From the tabular menu on the right-hand side of the screen, change from the Views tab to the My Views tab. Here you will create a custom view that shows just the alert you are interested in, based on the globally unique identifier (GUID) of the alert.

   3. Right-click somewhere in the empty space of the My Views tab. When the context menu appears, select New -> Alerts View. The Alert View Properties wizard will open.

   4. In the Which type of alert view do you want to create list, select Alerts that satisfy specified criteria, and then click Next.

   5. In the Which alerts do you want to view list, scroll down and select with specified GUID. Ensure that all other check boxes are cleared. In the bottom section of the window, the line “with specified GUID” will appear. Click the specified link. In the text box, paste the alert’s GUID from the alert e-mail. (The GUID can be found at the very end of the sample alert e-mail, where it reads “Object ID is {<GUID>}.”) After you have pasted the GUID into the text field, click OK, and then click Next.

   6. Type whatever name and description you want to give the view, and then click Finish. The view has now been created and added to your custom views.

   7. In the tree view within My Views, click the view you just created to see the alert. The result screen should look like the following.

Figure 13. Alert details

   8. Once an alert has been selected, further details become available in the lower section of the screen, including properties, events, and knowledge-base information. Review the alert name and the details forwarded with the alert.

   9. Confirm that the name and the description of the alert are meaningful and make sense. In addition, confirm that the events associated with this alert should indeed be associated with it. Try to answer the question “Is this alert actionable?” and confirm that the details of the alert are not too vague. Check to see if this alert is actually a standalone failure or if it can be associated with another failure or alert. This would be a factor of effective correlation.

   10. Check the rule configuration to see if it has Enable State Alert Properties flagged. This indicates that the rule is used for state monitoring. Check the logic for the state change and validate its accuracy.

      Note: This step requires that reviewers be MOM administrators and know how to use the MOM Administrator console. These properties cannot be viewed through the MOM Operator console.

   11. Refer to the qualitative review criteria for validity and suppression in the “Pre-Production Lab Review” section, earlier in this document.

3. Review the knowledge base associated with the alert. One of the greatest benefits of using MOM, and specifically the MOM Operator console, is that knowledge-base content can be directly associated with the alerts being generated. This knowledge-base content is provided by the development groups who develop the Management Pack; it comes ready to use with the Management Pack. Read through the knowledge-base content provided, and ensure it answers the following questions:

   What is the problem?

   How bad is it, will it get worse, and how does it affect the health or effectiveness of the product or computer?

   What are the steps to confirm the failed state?

   What are the steps to resolve the failed state?

   If any one of these points is missing from the knowledge base, the alert is not actionable, and a bug needs to be submitted against the Management Pack to have the knowledge-base content altered.

4. Take the steps prescribed in the knowledge base to triage and resolve the issue.

5. Once you have looked through the knowledge base and generally confirmed its validity, go through it once more, and follow the steps prescribed to triage the issue on the machine that generated the alert. In addition, after confirming that the issue exists, follow the proposed resolution steps from the knowledge base.

6. If you find issues at any point while you are looking at the knowledge base, the alert is considered “not actionable.” This should be raised as an issue to the Alert Tuning team (IT operations), and a bug should be created against the Management Pack.

7. Refer to the qualitative review criteria for actionability and suppression in the “Pre-Production Lab Review” section, earlier in this document.

Once the actionability, validity and suppression have been determined, refer to the MOM 2005 documentation for detailed instructions on how to change alert suppression and the alert knowledge base, and on how to disable or delete a rule.
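
Because triage starts from the forwarded alert e-mail, it can help to pull the key fields out of the message automatically. The following is a minimal Python sketch that parses an e-mail in the format of the sample shown in step 1 and extracts the alert name, computer, and the GUID needed for the “with specified GUID” view; the parsing logic and the saved file name are assumptions based on that sample, not a MOM interface.

# Sketch: extract fields from a forwarded MOM alert e-mail.
# The field layout is assumed from the sample e-mail shown earlier; the
# multi-line Description field is not handled here.
import re

def parse_alert_mail(body: str) -> dict:
    fields = {}
    for key in ("Severity", "Status", "Name", "Domain", "Agent"):
        match = re.search(rf"^{key}:\s*(.+)$", body, re.MULTILINE)
        if match:
            fields[key.lower()] = match.group(1).strip()
    # The GUID appears at the end of the message: "Object ID is {<GUID>}".
    # The GUID may wrap across lines, so strip any whitespace inside it.
    guid_match = re.search(r"Object ID is \{([^}]+)\}", body)
    fields["guid"] = re.sub(r"\s+", "", guid_match.group(1)) if guid_match else None
    return fields

with open("alert_mail.txt") as f:      # hypothetical saved copy of the e-mail
    alert = parse_alert_mail(f.read())
print(alert["name"], alert["agent"], alert["guid"])  # paste the GUID into the view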

Step 4. Review Completion

The trigger to conclude the review is time-based. This is agreed upon up front during the Alert Tuning Preparation step. At the end of the review, the team should collect all artifacts for final analysis.

Preparation and Deployment Review

Overview

The fifth step in implementing alert tuning is conducting the Preparation and Deployment review. The objective is to perform a final analysis of all results and to prepare for deployment of the tuned Management Pack into production.

Review Activities (Exit Criteria)

Using the results from the pre-production environment, analyze the final report for the following conditions:

Alert-to-Ticket Ratio (ATR)

Maximum acceptable performance levels for an agent affected by the Management Pack

Percentage of the Management Pack that will be put into production

Alert-to-Ticket Ratio (ATR)

The ATR metric-based criteria are as follows:

What is the maximum number of alerts that a Management Pack can generate that corresponds to one trouble ticket being created?

What is the target percentage of rules that meet the ATR?

The ATR value is derived from the pre-production qualitative assessment. Alerts that are valid, have good suppression, and are actionable are considered to have an ATR of 1:1. This means that after the Management Pack is implemented in production, each alert generated by its rules will result in a unique ticket. Alerts that are valid but have less than ideal suppression or actionability will have a less optimal ATR value, such as 2:1 or 3:1.

The suggested ATR value is no more than two alerts for a given ticket. The suggested target percentage of rules that meet ATR is 90 percent or higher. The ATR metric is only applicable to rules that generate alerts. This needs to be accounted for when calculating ATR metrics.

This metric result provides workload guidance in preparation for the production implementation of the Management Pack. For example, if the Management Pack generally succeeds and achieves the 1:1 ATR, alerts will require minimal additional MOM operator handling. However, if alerts for the Management Pack are determined to be valid and actionable but do not meet the 1:1 ATR (for example, 2:1 or another acceptable value), MOM operators will have to perform additional investigation and ticketing for each incident.
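
To make the ATR arithmetic concrete, the following Python sketch computes the per-rule ATR from alert and ticket counts and checks the suggested exit criteria (no more than two alerts per ticket for at least 90 percent of alert-generating rules). The rule names and counts are made up for illustration; in practice they come from the pre-production assessment and the ticketing system.

# Sketch: compute per-rule ATR and check the suggested exit criteria.
# Counts below are illustrative only.
per_rule = {
    # rule name: (alerts raised, tickets created)
    "MOM Agent Status Monitoring": (12, 12),   # 1:1
    "Disk Free Space Low":         (20, 10),   # 2:1
    "Service Restart Detected":    (30,  6),   # 5:1
}

MAX_ATR = 2.0          # suggested: no more than two alerts per ticket
TARGET_PERCENT = 90.0  # suggested: 90 percent or more of rules meet the ATR

meeting = 0
for rule, (alerts, tickets) in per_rule.items():
    atr = alerts / tickets if tickets else float("inf")
    ok = atr <= MAX_ATR
    meeting += ok
    print(f"{rule}: ATR {atr:.1f}:1 {'meets' if ok else 'misses'} target")

percent = 100.0 * meeting / len(per_rule)
print(f"{percent:.0f}% of alert-generating rules meet the ATR target "
      f"(target: {TARGET_PERCENT:.0f}%)")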

Maximum Acceptable Performance Levels

Over the duration of the pre-production run for the Management Pack, performance counters should be captured for the key MOM services on the agents to which the Management Pack applies. If these counters exceed baseline values, further investigation may be needed to confirm that the usage is within acceptable bounds. Keep in mind that baselines from the pre-production environment differ from the isolated lab results because the clients used in the pre-production environment also carry production-level workloads. The isolated lab also has the script frequency tuned to run every minute, which is very different from the settings used in production.
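
As an illustration of this baseline comparison, the following Python sketch flags captured counter averages for an agent that exceed the pre-production baseline values. The counter names, baseline figures, and tolerance are assumptions for the example, not values supplied by MOM.

# Sketch: flag agent performance counters that exceed baseline values.
# Counter names, baselines, and the 20% tolerance are illustrative only.
baseline = {
    "MOM Service % Processor Time": 5.0,
    "MOM Service Private Bytes (MB)": 60.0,
}
observed = {
    "MOM Service % Processor Time": 7.5,
    "MOM Service Private Bytes (MB)": 58.0,
}
TOLERANCE = 1.20  # investigate anything more than 20% above baseline

for counter, base in baseline.items():
    value = observed.get(counter)
    if value is not None and value > base * TOLERANCE:
        print(f"Investigate {counter}: {value} vs. baseline {base}")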

Percentage of Management Pack to Be Put into Production

Based on the exit review, the service owners should project what percentage of the Management Pack they intend to put into production. This includes the following activities:

Issues stemming from actionability should result in the Management Pack being corrected in its naming, description, or knowledge base.

Issues stemming from validity should result in the rule being corrected or disabled where appropriate.

Issues stemming from suppression should be corrected prior to production implementation.

The existing agent configuration in the pre-production environment can be removed or uninstalled after the review cycle at the discretion of the Alert Tuning team. This pre-production agent base can be kept in use for future cycles of the alert tuning process.

Deployment of the Tuned Management Pack

Once you are comfortable with the performance of the Management Pack, the sixth step is to export it from the pre-production environment and import it to the production environment. It is not necessary to uninstall the multihomed agent from the production environment, but you can do so if you deem it necessary. Follow the MOM 2005 documentation for information on how to uninstall a multihomed agent.

Run-Time Alert Tuning

Overview

The seventh and final step in implementing alert tuning is conducting run-time reviews and improvements. The objective is to provide ongoing tuning of the Management Pack once it has been introduced to a production environment, through ongoing assessment, tuning optimization and feedback to development. Although the activities are mostly identical to the pre-production stage (and thus the details are not repeated in this section), the run-time review process runs on an as-needed basis, driven by the IT organization’s policy for continuous improvement.

For more information about what to consider during the run-time review, see the Service Monitoring and Control SMF, available at http://www.microsoft.com/technet/itsolutions/cits/mo/smf/smfsmc.mspx.

Run-Time Alert Tuning Activities

The activities associated with the run-time review are sequenced in Figure 14. The timing of this continuous cycle is determined by the IT operations team within its internal operating level agreements (OLAs) and service monitoring policies.

Figure 14. Run-time alert tuning activities

Step 1. Conduct Regularly Scheduled Reviews

Even though the Management Pack has been reviewed in the alert tuning process, it can still generate a fair number of alerts when it is first installed in a full-scale production environment. However, this number will not be nearly as high as it would be if the Management Pack had not been reviewed.

The review process in production requires close interaction between various personnel in IT operations. The reviews are primarily based on operator feedback. Operators in their daily duties are fully aware of the alerts that are problematic or that require additional tuning. The reviews can be conducted on a monthly basis, depending on volume.

Step 2. Assess for Validity, Actionability, and Suppression

After discussing operator-observed conditions, investigate the rules that might require tuning, including the following steps:

1. Find the alert in the MOM 2005 Operator console. Now that there is an instance of an alert from the rule in question, the next task is to look up the alert in the MOM 2005 Operator console to get complete details on the issue. Perform the following actions:

   1. Review the alert and the details of the instance.

   2. Review the details of the alert. Confirm that the name and the description of the alert are meaningful and make sense. In addition, confirm that the events associated with this alert should indeed be associated with it. The key is to review the alert from the perspective of a person who has little expertise in supporting this technology. Try to answer the question “Is this alert actionable?” and confirm that the details of the alert are not too vague. Check to see if this alert is actually a standalone failure or if it can be associated with another failure or alert. This would be a factor of effective suppression.

   3. Check the rule configuration to see if it has Enable State Alert Properties flagged. This indicates that the rule is used for state monitoring. Check the logic for the state change, and validate its accuracy.

   4. Refer to the qualitative review criteria for validity and suppression in the “Pre-Production Lab Review” section, earlier in this document.

2. Knowledge-base review. Review the knowledge base associated with the alert.

   One of the greatest benefits of using MOM, and specifically the MOM Operator console, is that knowledge-base content can be directly associated with the alerts being generated. This knowledge-base content is provided by the development groups who develop the Management Pack; it comes ready to use with the Management Pack. Read through the knowledge-base content provided, and ensure it answers the following questions:

   What is the problem?

   How bad is it, will it get worse, and how does it affect the health or effectiveness of the product or computer?

   What are the steps to confirm the failed state?

   What are the steps to resolve the failed state?

   If any one of these points is missing from the knowledge base, the alert is not actionable; a bug needs to be submitted against the Management Pack to have the knowledge-base content altered. The following actions need to be performed:

   1. Take the steps prescribed in the knowledge base to triage and resolve the issue.

   2. Once the steps in the knowledge base have been followed and the alert’s validity is generally confirmed, follow the prescribed steps again to triage the issue on the machine that generated the alert. After confirming that the issue exists, follow the proposed resolution steps from the knowledge base.

   3. If you find issues at any point while you are looking at the knowledge base, the alert is considered “not actionable.” This should be raised as an issue to the Alert Tuning team (IT operations), and a bug should be created against the Management Pack.

   4. Refer to the qualitative review criteria for actionability and suppression in the “Pre-Production Lab Review” section, earlier in this document.

Step 3. Provide Feedback

During run time, IT operations (Service Monitoring and Control) will typically perform the previous activities. The results of the analysis should be presented to the Service Monitoring and Control team for review and further analysis. Feedback will then be used for the next SMC Implement cycle. This helps ensure continuous improvement of the Management Pack and MOM infrastructure. Examples of improvement can include removal of specific alert conditions, additional suppression, or editing of alert attributes such as the description or the knowledge base.

Appendices

Appendix A: Resources

Overview of Microsoft Operations Framework (MOF) and Microsoft Solutions Framework (MSF), available at http://www.microsoft.com/technet/itsolutions/cits/mo/mof/default.mspx

Service Monitoring and Control (SMC) SMF, available at http://www.microsoft.com/technet/itsolutions/cits/mo/smf/smfsmc.mspx

Appendix B: Key Performance Indicators

The following statistics should be reviewed to understand the performance of SMC as well as to identify opportunities for improvement. Each value is mapped over predefined timeframes (such as daily, weekly, or monthly).

Alert-to-Ticket Ratio (ATR). This is a key statistic that indicates the quality of SMC alerts. The goal is to achieve a 1:1 ratio between alerts and tickets. This indicates that each alert is valid and has a well-defined and well-documented problem set associated with it.

Number of Tickets with No Alerts. A high count of tickets with no alerts is an indication that monitoring missed critical events. This statistic can be used as a starting point for improving instrumentation and rules.

Number of Events per Alert. As rules and correlation improve, this count should increase. Often, multiple events are triggered; however, there is typically only one true source of the issue. A high events-per-alert count can also indicate opportunities for reducing the number of exposed events.

Number of Invalid Alerts. Alerts that are generated with incorrect fault determination should be carefully reviewed and corrected. The number of invalid alerts might increase during the initial deployment of new infrastructure components and services; however, it should drastically decrease with better rules and event filtering.

Number of Non-Actionable Alerts. This refers to alerts that are generated with insufficient descriptions or poor documentation that does not allow a corrective action to be determined. The number of non-actionable alerts might increase, especially during the introduction of a new Management Pack; however, it should drastically decrease (ideally to zero) after the alert tuning process.
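
The following Python sketch shows how several of these indicators might be derived from exported alert and ticket records over a reporting period. The record layout is an assumption for illustration and is not a MOM or ticketing-system interface.

# Sketch: derive a few SMC key performance indicators from exported records.
# The record layout and values below are assumed for illustration.
alerts = [
    {"id": "A1", "events": 3, "valid": True,  "actionable": True,  "ticket": "T1"},
    {"id": "A2", "events": 5, "valid": False, "actionable": True,  "ticket": None},
    {"id": "A3", "events": 1, "valid": True,  "actionable": False, "ticket": "T2"},
]
tickets = ["T1", "T2", "T3"]  # T3 was opened with no corresponding alert

ticketed = {a["ticket"] for a in alerts if a["ticket"]}
atr = len(alerts) / len(ticketed) if ticketed else float("inf")
tickets_without_alerts = len(set(tickets) - ticketed)
events_per_alert = sum(a["events"] for a in alerts) / len(alerts)
invalid = sum(not a["valid"] for a in alerts)
non_actionable = sum(not a["actionable"] for a in alerts)

print(f"ATR: {atr:.1f}:1")
print(f"Tickets with no alerts: {tickets_without_alerts}")
print(f"Events per alert: {events_per_alert:.1f}")
print(f"Invalid alerts: {invalid}")
print(f"Non-actionable alerts: {non_actionable}")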

Appendix C: Template for Project Status Reports

Appendix D: Alert Tuning Reports

The Alert Tuning Solution Accelerator contains three reports that provide information on the alert volumes produced by MOM. These values are used to identify the tuning opportunity for a given processing rule. They can also be used to improve accuracy of alert tuning by quantifying anomalous conditions during the tuning cycle. The reports included are:

Alert Count By Processing Rules

Alert Count By Dates

Alert Count By Device

This appendix gives instructions on how to navigate and access the reports and their data.

Accessing the Reports

After installing and deploying the reports by following the prescriptive guidance given in the “Install and Deploy Alert Tuning Reports” section, view the reports at http://localhost/reports.

Figure 15. Home page

Double-click Alert Tuning Reports, and it will display the three bundled reports. Click Show Details, and the following Web page appears.

Figure 16. Alert Tuning Reports page

The features visible in the preceding figure are used to control or change the properties of the folder. For example, to delete the folder, select the check box at the far left of the screen, and then click Delete above it. The Edit option opens the following page, where you can change the security- and name-related properties of the reports folder.

Figure 17. Alert Tuning Reports Properties page

All the reports contain the count of raised alerts and total alerts for different scenarios. These counts are defined as follows:

The raised alert count represents the number of distinct alerts that MOM captures.

The total alert count includes the repeat counts of all the distinct alerts, in addition to the raised alert count.
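
A small example can make the distinction between the two counts concrete. The following Python sketch uses made-up alert records with MOM-style repeat counts to compute both values; the record fields are illustrative rather than an actual report query.

# Sketch: raised vs. total alert counts, using MOM-style repeat counts.
# Each record is one raised (distinct) alert plus the number of suppressed
# duplicates recorded against it; the records themselves are made up.
raised_alerts = [
    {"rule": "MOM Agent Status Monitoring", "repeat_count": 4},
    {"rule": "MOM Agent Status Monitoring", "repeat_count": 0},
    {"rule": "Disk Free Space Low",         "repeat_count": 9},
]

raised_count = len(raised_alerts)                                # 3 distinct alerts
total_count = sum(1 + a["repeat_count"] for a in raised_alerts)  # 3 raised + 13 repeats = 16

print(f"Raised alerts: {raised_count}, total alerts: {total_count}")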

Alert Count By Processing Rules

The Alert Count By Processing Rules report displays the count of raised and total alerts that a rule captures between a start and end date. The rules are filtered on the basis of the Management Pack that the user selects. To go to this report, click the AlertCountByProcessingRules link on the Alert Tuning Reports page. The screen shows an Enter Start Date box, an Enter End Date box, a Select Management Pack drop-down box, and a View Report button.

To view the report

1. Type a valid start date in the Enter Start Date box (mm:dd:yyyy format).

2. Type a valid end date in the Enter End Date box (mm:dd:yyyy format).

3. Select one of the Management Packs that are listed in the Select Management Pack box.

Figure 18. AlertCountByProcessingRules Data Input page

4. Click View Report.

The results are displayed with three columns, as shown in the following screenshot:

Processing Rule. The name of the processing rule, under the selected Management Pack, that generated alerts between the chosen start and end dates.

Raised Alerts. The number of alerts raised after suppression has been applied.

Total Alerts. The number of alerts generated without regard to suppression, including the repeat counts of the raised alerts.

The rows are sorted by raised alerts in descending order.

Figure 19. AlertCountByProcessingRule Results page

Alert Count By Dates

The Alert Count By Dates report gives the distribution of the raised alerts and the total alerts for a particular Management Pack for all the days between the chosen start and end date. The distribution is done initially based on the week ranges, but it is also possible to drill down to the day ranges. Moreover, this report gives a comparative picture of the number of alerts raised for a particular Management Pack and the number raised for all the Management Packs taken together for a day. To view this report, click the AlertCountByDates link on the Alert Tuning Reports page.

To run and view this report

1. In the Enter Start Date box, type a start date (mm:dd:yyyy format).

2. In the Enter End Date box, type an end date (mm:dd:yyyy format).

3. From the Select Management Pack drop-down list, select the Management Pack for which you want to see the report. This list contains all the Management Pack names that are associated with the MOM server to which the application is pointing.

Figure 20. AlertCountByDates Data Input page

4. Click View Report at the top-right corner.

5. The default view of the report is shown in the following figure. It contains the week ranges between the start and the end date; it also shows the total number of alerts for the selected Management Pack and for all the Management Packs together.

Figure 21. AlertCountByDates Results page

6. Click the plus (+) symbol next to any of the week ranges to expand it and to view the daily alert counts for the selected Management Pack and for all Management Packs taken together.

Figure 22. AlertCountByDates Results page, expanded

7. Click the Date column (for example, 6/16/2004) to navigate to the Alert Count By Processing Rules report. The report opens as shown in the following screenshot.

Figure 23. AlertCountByProcessingRules page from AlertCountByDates report

This report displays the alert details for the selected Management Pack on that date.

Alert Count By Device

The Alert Count By Device report gives the number of alerts raised by a computer that runs the MOM agent service. The agent computers for the MOM server are grouped into computer groups; each computer group consists of computers with similar properties. This report lists the computers on the basis of the computer group that is selected. It also shows the raised and total alerts for all the computers under the selected computer group, for a given Management Pack, between two selected dates. Clicking the AlertCountByDevice link on the Alert Tuning Reports page opens the page shown in the following figure.

To run this report

1. Select the Management Pack name from the Select Management Pack drop-down list.

2. Select a computer group from the Select Computer Group drop-down list.

3. In the Enter Start Date box, type a start date (mm:dd:yyyy format).

4. In the Enter End Date box, type an end date (mm:dd:yyyy format).

Figure 24. AlertCountByDevice Data Input page

5. Click View Report to the far right of the screen (not shown in the preceding figure) to view the report for the condition that you have chosen. The report appears as follows.

Figure 25. AlertCountByDevice Results page

The Computer column shows all the computers under the selected computer group that have raised alerts between the start and end dates and for the selected Management Pack. The rows in this report are sorted on the basis of the raised alerts in descending order.


 
