Intermittent DOWN alerts from switches that are UP

Discussion:

Stijn Van den Bruel

2012-11-08 14:47:42 UTC

Stijn Van den Bruel [http://community.zenoss.org/people/stijnvb] created the discussion

"Intermittent DOWN alerts from switches that are UP"

To view the discussion, visit: http://community.zenoss.org/message/69756#69756

--------------------------------------------------------------
Since we upgraded from Zenoss 3 to 4.2, we are experiencing strange, intermittend down alertrs from almost all of our switches.
The amount of alerts vary depening on several factors, but a simple example of one of our remote (20 km away with fiber in between) locations with around 25 switches results in 100+ alerts (thus 200+ DOWN + clear emails)Â in a 2 hour timespan. This is the same with switches under our desks and in the serverroom, connected straight to the core switches.

What we have found out:
- It seems that switches further away (both distance and more hops, thus probably a litt more latency, but never more than 10 ms), are more affected then switches closer to our core-switches
- If we ping a switch IP from our local workstations or a server (not Zenoss), we do not see any drops or latency problems with a consistent latency of a few ms, even while we are receiving alerts from that switch
- If we ping a switch which we are gettings loads of alerts from from the Zenoss linux box, we are suddenly not getting any more alerts for that switch and it keeps on showing healthy in Zenoss. I have tried to ping a whole lot of switches at the same time, logging the output of each to a file, which then results in no alerts from any of those switches ...
- In those ping output logs, we see that on all of these switches, we get a "truncated" pong from another IP ... This can be a random IP in the network. I have suspected this may have something to do with the alerts we are receiving.

What we have tried:
- Our collectors configuration is currently as follows:
--- Ping tries: 2 (this has been set to 3 without much effect)
--- Ping Cycle Time: 5 (this has been moved down from 60 in order to generate more alerts for testing purpose)
--- Maximum ping packets in flight: 200 (I have moved this up way higher for testing purposes without any positive impact)
- Move back to our Zenoss 3 VM, which doesn't generate any bogus alerts
- Install Zenoss 4.2 again, both with deploy script and manually
- Probably lots of things which I don't remember right now
- Installed SP1

Our Zenoss 3 installation was running on Debian. We are now running CentOS as Debian does not seem to be supported properly at this point.
All our switches are HP Procurve

Has anyone experienced anything like this? Should we go back to version 3?

Looking forward to any help or suggestions!

Stijn
--------------------------------------------------------------

Reply to this message by replying to this email -or- go to the discussion on Zenoss Community
[http://community.zenoss.org/message/69756#69756]

Start a new discussion in zenoss-users by email
[discussions-community-forums-zenoss--***@community.zenoss.org] -or- at Zenoss Community
[http://community.zenoss.org/choose-container!input.jspa?contentType=1&containerType=14&container=2003]

Floyd Strimling

2012-11-08 15:40:15 UTC

Permalink

Floyd Strimling [http://community.zenoss.org/people/fstrimling] created the discussion

"Re: Intermittent DOWN alerts from switches that are UP"

To view the discussion, visit: http://community.zenoss.org/message/69777#69777

--------------------------------------------------------------
Stijn,

We have not seen issues with thise before.

Can you please post your zenping debug log so we can take a closer look?

Cheers,

Floyd
--------------------------------------------------------------

Reply to this message by replying to this email -or- go to the discussion on Zenoss Community
[http://community.zenoss.org/message/69777#69777]

Start a new discussion in zenoss-users by email
[discussions-community-forums-zenoss--***@community.zenoss.org] -or- at Zenoss Community
[http://community.zenoss.org/choose-container!input.jspa?contentType=1&containerType=14&container=2003]

Stijn Van den Bruel

2012-11-09 08:08:07 UTC

Permalink

Stijn Van den Bruel [http://community.zenoss.org/people/stijnvb] created the discussion

"Re: Intermittent DOWN alerts from switches that are UP"

To view the discussion, visit: http://community.zenoss.org/message/69759#69759

--------------------------------------------------------------
Hi Floyd,

I did not know about this log. there seems to be a consistent increase of "missed runs".
Does this have to do with the 5 seconds interval I have set up? As mentioned before, we had the same problem (less alerts, because of less checks, i suppose) when the interval was 60 seconds.

2012-11-08 13:28:17,113 INFO zen.maintenance: Performing periodic maintenance
2012-11-08 13:28:17,114 INFO zen.zenping: Counter eventCount, value 621582
2012-11-08 13:28:17,116 INFO zen.zenping: 207 devices processed (0 datapoints)
2012-11-08 13:28:17,120 INFO zen.collector.scheduler: Tasks: 208 Successful_Runs: 37394 Failed_Runs: 0 Missed_Runs: 104280 Queued_Tasks: 0 Running_Tasks: 1
2012-11-08 13:33:17,122 INFO zen.maintenance: Performing periodic maintenance
2012-11-08 13:33:17,123 INFO zen.zenping: Counter eventCount, value 621690
2012-11-08 13:33:17,125 INFO zen.zenping: 207 devices processed (0 datapoints)
2012-11-08 13:33:17,131 INFO zen.collector.scheduler: Tasks: 208 Successful_Runs: 37406 Failed_Runs: 0 Missed_Runs: 104328 Queued_Tasks: 0 Running_Tasks: 1
2012-11-08 13:38:17,133 INFO zen.maintenance: Performing periodic maintenance
2012-11-08 13:38:17,134 INFO zen.zenping: Counter eventCount, value 621806
2012-11-08 13:38:17,135 INFO zen.zenping: 207 devices processed (0 datapoints)
2012-11-08 13:38:17,138 INFO zen.collector.scheduler: Tasks: 208 Successful_Runs: 37418 Failed_Runs: 0 Missed_Runs: 104376 Queued_Tasks: 0 Running_Tasks: 1
--------------------------------------------------------------

Reply to this message by replying to this email -or- go to the discussion on Zenoss Community
[http://community.zenoss.org/message/69759#69759]

Start a new discussion in zenoss-users by email
[discussions-community-forums-zenoss--***@community.zenoss.org] -or- at Zenoss Community
[http://community.zenoss.org/choose-container!input.jspa?contentType=1&containerType=14&container=2003]

Stijn Van den Bruel

2012-11-09 08:55:10 UTC

Permalink

Stijn Van den Bruel

2012-11-09 09:20:44 UTC

Permalink

Stijn Van den Bruel

2012-11-09 12:40:19 UTC

Permalink

Doug Syer

2012-11-10 18:53:57 UTC

Permalink

Doug Syer [http://community.zenoss.org/people/dsyer%40nwnit.com] created the discussion

"Re: Intermittent DOWN alerts from switches that are UP"

To view the discussion, visit: http://community.zenoss.org/message/69827#69827

--------------------------------------------------------------
Are you binding the ping datasource to all your ethernet interfaces?Â Did you set the manged ip to an interface that has more than one ip bound to it, ie like an hsrp interface?
--------------------------------------------------------------

Reply to this message by replying to this email -or- go to the discussion on Zenoss Community
[http://community.zenoss.org/message/69827#69827]

Start a new discussion in zenoss-users by email
[discussions-community-forums-zenoss--***@community.zenoss.org] -or- at Zenoss Community
[http://community.zenoss.org/choose-container!input.jspa?contentType=1&containerType=14&container=2003]

Stijn Van den Bruel

2012-11-12 07:32:11 UTC

Permalink