Health roll-up and correlation

Question

Hi,

I know that in SCOM we have event correlation; that is, we can correlate multiple "similar" events into an existing alert if the conditions match.

I also know that we have health roll-up, where, say, when X number of child objects are unhealthy, the roll-up becomes unhealthy. We could therefore choose to have the alert at the roll-up level or at the child level.

But is there any way at the moment (or are there any plans) to allow a combination of the two, as SolarWinds and ScienceLogic do? In other words, alert at child level (and therefore get more specific details in the alert) as long as the parent is deemed healthy, BUT...do NOT alert at child level when the parent is deemed unhealthy, and instead let the roll-up alert. At the moment (as far as I understand), we can only have one or the other, as we simply choose to enable or disable the monitor. We still need the child alert up to the point where we can say that there is something wrong at a higher level.

In SolarWinds, if for example a switch is down, you can tell it to put the child objects such as ports in an unmonitored state so that you don't get a port alert and the switch alert.

In ScienceLogic, you can configure masking/suppression that acts in a similar way by silencing child devices if the parent is unhealthy.

I know Kevin did something similar as part of a demo for consolidating heartbeat alerts when a Gateway Server goes down, but it almost feels like Microsoft are missing a trick at the moment.

Our main monitoring tool is SCOM, but we also have ScienceLogic (used by our service provider) for cloud servers and SolarWinds for core network devices. I am concerned because SCOM is currently noisy with these duplicated scenarios, and our tooling is under review at the moment. I know all these tools have their advantages and disadvantages, but I sense that if we cannot do something to correlate the noise from SCOM, one of the other tools may be chosen instead.

I am happy to author something, but I don't want to end up almost re-inventing the way SCOM works just to make it happen.

Thanks

Andrew

Answer

Hi Andrew,

I think I understand the logic here, but I am not quite sure this can be achieved the same way with SCOM (or at least not without some complex authoring). What I would do in your case is post this on the new System Center Operations Manager Feedback Site. Maybe the SCOM Product Group will find it interesting and address it:

System Center Operations Manager Feedback Site
https://feedback.azure.com/d365community/forum/2a49c9ee-4436-ec11-b6e6-00224824730c

Regards
Stoyan Chalakov

Answer

I agree with Stoyan. There is nothing out of the box, so it would probably take some PowerShell to evaluate health states and alert appropriately.

Do you have a link to Kevin's presentation? "I know Kevin did something similar as part of a demo for consolidating heartbeat alerts when a Gateway Server goes down"

It might be possible to review the code and reverse engineer something similar, but it is almost certainly going to be on a roll-up by roll-up basis, and so not easily extensible or scalable as a "fix" for roll-ups in general.

Regards

Graham

Answer

Thanks all for your replies, and apologies for the delay (time flies!) in coming back on this. Yeah, I think for now we can at least target the main ones creating thousands of incidents per month, so things like the heartbeat alert and failed to connect. That would massively reduce the totals.

The link for Kevin's post is https://scomathon.com/blog/automating-maintenance-mode-for-computers-behind-a-gateway-scom-management-pack-by-kevin-holman/

I have been toying with the idea of doing something in Orchestrator as well, as we also use a custom solution from Kelverion for bringing in SolarWinds alerts. This would also allow us to see what network issues are going on at a higher level (core network devices are monitored with SolarWinds). But I stalled on this, as it was getting a bit complex to come up with something that would map an object in SCOM as being part of a switch in SolarWinds, for example, and of course to work out how it would be presented in our CMDB.

Answer

"In other words, alert at child level (and therefore get more specific details in the alert) as long as the parent is deemed healthy, BUT...do NOT alert at child level when the parent is deemed unhealthy, and instead let the roll-up alert."

To distill your request, basically you don't want the child instances to generate alerts if the parent is in a critical state.
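As a rough illustration of that rule only (not of how SCOM, SolarWinds or ScienceLogic implement it), here is a minimal Python sketch. Everything in it is hypothetical: the `Alert` class, the `parent_of` and `health` dictionaries and the object names are made up for the example, and no SCOM API is involved. It assumes you already have the parent/child relationships and the current health states available from somewhere.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical health state labels, used only for this sketch.
HEALTHY = "Healthy"
CRITICAL = "Critical"


@dataclass
class Alert:
    source: str  # name of the object that raised the alert
    name: str    # alert title


def suppress_child_alerts(
    alerts: List[Alert],
    parent_of: Dict[str, str],
    health: Dict[str, str],
) -> List[Alert]:
    """Keep an alert unless any ancestor of its source object is Critical."""
    kept = []
    for alert in alerts:
        masked = False
        seen = set()  # guards against accidental cycles in parent_of
        ancestor = parent_of.get(alert.source)
        while ancestor is not None and ancestor not in seen:
            seen.add(ancestor)
            if health.get(ancestor) == CRITICAL:
                masked = True
                break
            ancestor = parent_of.get(ancestor)
        if not masked:
            kept.append(alert)
    return kept


# Made-up example: switch1 is down, switch2 is fine.
parent_of = {"switch1/port1": "switch1", "switch2/port1": "switch2"}
health = {"switch1": CRITICAL, "switch2": HEALTHY}
alerts = [
    Alert("switch1", "Switch down"),
    Alert("switch1/port1", "Port down"),
    Alert("switch2/port1", "Port down"),
]

kept = suppress_child_alerts(alerts, parent_of, health)
# The switch1 alert and the switch2/port1 alert are kept;
# the switch1/port1 alert is masked by its Critical parent.
```

The parent's own alert is never masked here because only ancestors are checked, so the roll-up alert still fires while the child alerts underneath it are dropped.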

You can do some customization using the command channel and PowerShell, but I would look more closely at what you want to achieve here.

What is the actual issue? Too many alerts?

Where? On the SCOM Dashboard? Email/IM alert notifications? SQL database?

Even with your switch and ports example, dashboards can be filtered and scoped. The same goes for IM alert notifications.
