System monitoring and incident response

8 September, 2015 - 16:42

Organizations of almost any size will have some type of help desk function. When a user has a problem, they call for help and sooner or later they usually get it. Larger organizations may include another activity called an operations control center. In some organizations the help desk and operations control center are part of the same activity, i.e., the help desk serves as an operations center as well as providing user assistance. In other organizations the two functions are managed separately. The important point here is that the functions are being accomplished, not how they are organized.

We are concerned primarily with what are normally considered to be operations center functions. The two functions are system monitoring and incident response. Quite logically, we cannot expect someone to respond to an incident before he or she has discovered that it has occurred. Detecting incidents is the function of system monitoring. Incidents can be defined as any type of event that impacts the confidentiality, integrity or availability of an IT-enabled information service. As we have discussed above, there are a lot of potential threats to an information system. When a threat materializes, it becomes an incident. An operator may back up old data over new data, a lightning strike may lead to a power surge that destroys a critical piece of equipment, or a hacker may break into the system and steal or modify data. The list of possible incidents is limitless.

The challenge in maintaining system availability is to detect incidents and respond to them quickly and effectively. While we are not trying to turn the readers of this text into ops center personnel, organizational managers should have knowledge of the system monitoring and incident response functions if their organizations significantly depend upon IT-enabled services. That is, organizational managers should ensure that the system monitoring and incident response functions are aligned to the operational needs of the organization.

First, the organization needs to decide whether it is willing to invest in proactive system monitoring as opposed to reactive system monitoring. Reactive monitoring simply means that the activity will be attempting to detect incidents as soon as possible after they occur and then respond to them. At its most basic level, there is virtually no monitoring going on at all. When someone calls to complain, the incident is recorded and the response is initiated. But there are also specific computer applications that can be used to monitor systems so that operations center personnel may receive an alarm that a component has failed before any user has detected that failure. These are typically referred to as systems or network management applications. These applications will monitor designated services or system components. When the component stops or fails to perform correctly, the application will initiate some type of alarm or other notification. Some systems management applications have been designed to send email or telephone personnel to advise them that an alarm has occurred.

Proactive monitoring is really a more ambitious form of reactive monitoring. Often, systems may provide some indication that something is not quite right before there is an actual failure. For example, if a hard drive, a computer component on which information is stored, becomes full, an important application might fail. With proactive monitoring, the systems management application monitors hard disk usage and sends an alarm when it reaches 80% of capacity. The IT support activity can either clear off some data or install a larger hard disk before a failure actually occurs. That is proactive system monitoring.
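The disk usage example can be sketched in a few lines of Python. This is only an illustration, not how any particular systems management application works: the function names and the message format are invented for this sketch, and the 80% threshold is simply the figure used in the example above.

```python
import shutil
from typing import Optional

WARN_THRESHOLD = 0.80  # fraction of capacity; the 80% figure from the example


def disk_usage_fraction(path: str = "/") -> float:
    """Return the fraction of the disk holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def check_disk(used_fraction: float, threshold: float = WARN_THRESHOLD) -> Optional[str]:
    """Return an alarm message once usage reaches the threshold, else None."""
    if used_fraction >= threshold:
        return f"ALARM: disk is {used_fraction:.0%} full (threshold {threshold:.0%})"
    return None


if __name__ == "__main__":
    message = check_disk(disk_usage_fraction("/"))
    print(message or "Disk usage is below the warning threshold.")
```

In practice a check like this would run on a schedule, and the alarm would be routed to operations center personnel rather than printed.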

The key to both reactive and proactive system monitoring is to understand the system baseline. Operations center personnel want to have a very accurate understanding of what the system looks like when everything is working properly. Monitoring then becomes largely a function of detecting deviations from the system baseline. For example, imagine that an organization typically uses very little Internet connectivity during the evening hours. The network administrator notices that for the last three nights there has been a lot of network utilization starting at 2 am. It might be quite legitimate traffic: the organization may have decided to back up its data to an offsite location during the early morning hours so that the backup does not interfere with normal system use. However, it could also mean that an intruder has compromised the organization's systems and is copying confidential data, or perhaps using the organization's computers to send spam out onto the Internet. Under reactive monitoring, the organization may not respond. After all, nothing appears broken and no one has complained. Under proactive monitoring, the organization detects the incident, in this case the early morning network use, and investigates the cause of that traffic so that an appropriate response can be taken.
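One simple way to express "deviation from the baseline" is to compare a new measurement with the average and spread of past measurements. The sketch below assumes nightly traffic totals are already being recorded; the sample figures, the function name, and the three-standard-deviation cutoff are all illustrative assumptions rather than values from the text.

```python
from statistics import mean, stdev


def deviates_from_baseline(baseline, observed, max_sigmas=3.0):
    """Return True if `observed` is unusually high compared with `baseline`.

    "Unusually high" here means more than `max_sigmas` standard deviations
    above the baseline mean. `baseline` needs at least two measurements.
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any increase counts as a deviation.
        return observed > mu
    return (observed - mu) / sigma > max_sigmas


# Hypothetical 2 am traffic (in megabytes) for the previous seven nights.
quiet_nights = [12, 15, 11, 14, 13, 12, 16]

print(deviates_from_baseline(quiet_nights, 15))   # an ordinary night
print(deviates_from_baseline(quiet_nights, 480))  # the suspicious night
```

A flagged deviation is only a prompt to investigate. As the example above notes, the traffic may turn out to be a legitimate offsite backup.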

This example leads us to the second operations center function, incident response. Incident response refers to the actions that an organization takes in response to detected incidents. Upon incident detection, the first action is to minimize or contain damage resulting from the incident. The second response is to restore the service. Depending upon the organization and the type of incident, a range of other responses may be appropriate. There may be a need to provide notifications to key personnel. If there are multiple incidents occurring at the same time, a prioritization scheme may be required to determine which incidents are likely to cause the greatest damage to the organization. Effective organizations have a method of documenting incidents, such as a trouble ticketing system, and will perform after-action analysis of incidents to determine whether there are recurring patterns.
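A prioritization scheme can be as simple as sorting open incidents by their expected damage. The sketch below is one hypothetical scheme: the `Incident` fields, the 1 to 5 impact scale, and the tie-breaking rule are assumptions made for illustration, and a real organization would choose criteria that match its own operations.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    ticket_id: int
    description: str
    impact: int          # hypothetical scale: 1 (minor) to 5 (severe)
    users_affected: int


def prioritize(incidents):
    """Order incidents so the most damaging come first.

    Higher impact wins; ties are broken by the number of users affected.
    """
    return sorted(
        incidents,
        key=lambda incident: (incident.impact, incident.users_affected),
        reverse=True,
    )


open_incidents = [
    Incident(101, "Printer offline on second floor", impact=1, users_affected=8),
    Incident(102, "Email server not responding", impact=4, users_affected=250),
    Incident(103, "Payroll database unreachable", impact=4, users_affected=12),
]

for incident in prioritize(open_incidents):
    print(incident.ticket_id, incident.description)
```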

Given that there is a near-infinite number of possible incidents, there is also a near-infinite number of possible responses. If a circuit to the Internet fails, the operations center will typically look to see if there is a problem with the organization's equipment. If the organization cannot isolate the problem to its equipment, then the appropriate response is to notify the telephone company or Internet service provider.

As indicated in the introduction to this section, there are a variety of ways that this function may be organized.

Many organizations have adopted a three-tiered response model. The three-tiered response model reflects the observation that some operational personnel are quite knowledgeable while others are less so. The more knowledgeable, highly skilled personnel are paid more while less knowledgeable personnel are paid less. Organizations have learned that it is economically beneficial to have the less skilled personnel working on the easier incidents while the highly skilled personnel work on the more difficult incidents.

The tiered response model supports this objective. Less skilled individuals serve as the first responders. They record the incident information and resolve as many incidents as they are capable of resolving. Tougher incidents get passed to a second tier of fairly skilled personnel for resolution. Ideally, most incidents can be resolved in the first two tiers. However, sometimes things are really complicated and the organization needs to bring in highly experienced professionals to resolve them. Some organizations will have these highly skilled individuals on their staffs, but others might rely on outside personnel to provide this third level of support. These individuals tend to be very expensive, and an organization does not want them to spend their time working on easier problems that less-skilled, lower-paid staff are capable of handling.
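The escalation logic of the tiered model can be sketched as follows. The 1 to 10 difficulty scale and the ceiling assigned to each tier are hypothetical numbers chosen for illustration. In reality, an incident's difficulty is rarely known in advance and is discovered as each tier attempts a resolution.

```python
# Hypothetical ceilings: the hardest difficulty (on a 1-10 scale)
# that each tier is assumed to be able to resolve.
TIER_MAX_DIFFICULTY = {1: 3, 2: 7, 3: 10}


def assign_tier(difficulty: int) -> int:
    """Escalate through the tiers in order and return the first tier
    able to resolve an incident of the given difficulty."""
    for tier in sorted(TIER_MAX_DIFFICULTY):
        if difficulty <= TIER_MAX_DIFFICULTY[tier]:
            return tier
    raise ValueError("difficulty exceeds what any tier can handle")


for difficulty in (2, 5, 9):
    print(f"difficulty {difficulty} -> tier {assign_tier(difficulty)}")
```

Because the tiers are tried in order, the expensive third tier only sees incidents that the first two tiers cannot resolve, which is the economic point of the model.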

Before leaving the topic of system monitoring and incident response, there is one other subject that merits discussion. We previously mentioned the ITIL framework as providing best management practices for managing IT operations. The authors of ITIL have found it useful to distinguish between incident management and problem management. The incident response process just described falls under the category of incident management. Problem management is a bit trickier. If incidents are events that result in system failures, what are problems? Under ITIL, problems are the causes that underlie incidents. To explain, let us go back and reconsider the lightning strike that fried some of our equipment. The incident was the equipment failure. The equipment could no longer fulfill its intended function because a surge of electricity melted critical circuitry.

The problem, then, might be defined as the fact that lightning is a recurring hazard in that particular geographic area. The organization may stock spare equipment and recover the service fairly quickly. However, services will still be disrupted, and replacing the equipment could prove expensive depending upon how often lightning can be expected to strike. Or the problem might be defined even more broadly as an unstable power supply. Lightning may be one cause of power fluctuations, but there might be a variety of other reasons that power fluctuates. Problem management attempts to broadly define the problem, e.g. unstable power, and determine how that problem can best be managed. While replacing fried equipment can resolve the incident, other measures are required if the organization wants to avoid service failures resulting from unstable power. The examination of outages over a period of time to see how many result from unstable power is an example of trend analysis. When properly conducted, the combination of documenting incidents, conducting trend analyses, and resolving identified problems can greatly increase the availability of system services, reduce the number of incidents that must be fielded, and may allow for reductions in the number of IT support staff.
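A basic trend analysis of this kind amounts to counting documented incidents by their underlying cause. The sketch below assumes each trouble ticket records a `cause` field; the ticket data and field names are invented for illustration.

```python
from collections import Counter


def incidents_by_cause(tickets):
    """Count how many documented incidents share each underlying cause."""
    return Counter(ticket["cause"] for ticket in tickets)


# Hypothetical tickets from a trouble ticketing system.
tickets = [
    {"id": 1, "service": "file server", "cause": "unstable power"},
    {"id": 2, "service": "email", "cause": "disk full"},
    {"id": 3, "service": "router", "cause": "unstable power"},
    {"id": 4, "service": "file server", "cause": "unstable power"},
    {"id": 5, "service": "web site", "cause": "operator error"},
]

trend = incidents_by_cause(tickets)
for cause, count in trend.most_common():
    print(f"{cause}: {count} incident(s)")
```

A cause that accounts for a large share of incidents, such as unstable power in this invented data, is a candidate for problem management rather than repeated incident-by-incident repair.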

Organizational managers might start to think that this discussion is getting technical and that the IT personnel should be taking care of these problems. Certainly, an organization should staff its operations center with technically competent IT staff. However, as was discussed above, organization management needs to determine what level of risk it is able to tolerate and ultimately determine the capabilities of system monitoring and incident response functions.