Discussion:
Intermittent DOWN alerts from switches that are UP
Stijn Van den Bruel
2012-11-08 14:47:42 UTC
Stijn Van den Bruel [http://community.zenoss.org/people/stijnvb] created the discussion

"Intermittent DOWN alerts from switches that are UP"

To view the discussion, visit: http://community.zenoss.org/message/69756#69756

--------------------------------------------------------------
Since we upgraded from Zenoss 3 to 4.2, we have been experiencing strange, intermittent DOWN alerts from almost all of our switches.
The number of alerts varies depending on several factors, but as a simple example: one of our remote locations (20 km away, connected by fiber) with around 25 switches produces 100+ alerts (thus 200+ DOWN + CLEAR emails) in a 2-hour timespan. The same happens with switches under our desks and in the server room, connected straight to the core switches.

What we have found out:
- It seems that switches further away (both in distance and in hops, thus probably a little more latency, but never more than 10 ms) are more affected than switches closer to our core switches
- If we ping a switch IP from our local workstations or a server (not Zenoss), we do not see any drops or latency problems with a consistent latency of a few ms, even while we are receiving alerts from that switch
- If we ping, from the Zenoss Linux box itself, a switch that is generating loads of alerts, the alerts for that switch suddenly stop and it keeps showing as healthy in Zenoss. I have tried pinging a whole lot of switches at the same time, logging the output of each to a file, which results in no alerts from any of those switches ...
- In those ping output logs, we see that for all of these switches we get a "truncated" pong from another IP ... This can be a random IP in the network. I suspect this may have something to do with the alerts we are receiving.
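
One way to quantify the stray replies described in the last point is to scan each ping log for replies whose source address differs from the address that was pinged. The following is only a minimal sketch, assuming the logs are plain numeric Linux `ping` output with reply lines of the form `64 bytes from <ip>: icmp_seq=...`; the function name `stray_replies` and the sample addresses are hypothetical and not taken from the thread.

```python
import re
from collections import Counter

# Matches reply lines as printed by Linux ping in numeric mode, e.g.
# "64 bytes from 192.0.2.10: icmp_seq=1 ttl=63 time=1.2 ms"
REPLY_RE = re.compile(r"bytes from (\d{1,3}(?:\.\d{1,3}){3})")


def stray_replies(target_ip, log_lines):
    """Count replies in a ping log that did not come from target_ip."""
    strays = Counter()
    for line in log_lines:
        match = REPLY_RE.search(line)
        if match and match.group(1) != target_ip:
            strays[match.group(1)] += 1
    return strays


if __name__ == "__main__":
    # Made-up sample data; the addresses are documentation placeholders.
    sample = [
        "64 bytes from 192.0.2.10: icmp_seq=1 ttl=63 time=1.2 ms",
        "8 bytes from 192.0.2.77: icmp_seq=1 ttl=63 (truncated)",
        "64 bytes from 192.0.2.10: icmp_seq=2 ttl=63 time=1.1 ms",
    ]
    for source, count in stray_replies("192.0.2.10", sample).items():
        print(f"{count} unexpected reply(ies) from {source}")
```

Running this per log file (one file per pinged switch) would show whether the unexpected sources cluster around a few addresses or really are spread randomly across the network.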

What we have tried:
- Our collectors configuration is currently as follows:
--- Ping tries: 2 (this has been set to 3 without much effect)
--- Ping Cycle Time: 5 (this has been moved down from 60 in order to generate more alerts for testing purposes)
--- Maximum ping packets in flight: 200 (I have moved this up way higher for testing purposes without any positive impact)
- Moved back to our Zenoss 3 VM, which doesn't generate any bogus alerts
- Installed Zenoss 4.2 again, both with the deploy script and manually
- Probably lots of things which I don't remember right now
- Installed SP1

Our Zenoss 3 installation was running on Debian. We are now running CentOS as Debian does not seem to be supported properly at this point.
All our switches are HP Procurve

Has anyone experienced anything like this? Should we go back to version 3?

Looking forward to any help or suggestions!

Stijn
--------------------------------------------------------------

Reply to this message by replying to this email -or- go to the discussion on Zenoss Community
[http://community.zenoss.org/message/69756#69756]

Start a new discussion in zenoss-users by email
[discussions-community-forums-zenoss--***@community.zenoss.org] -or- at Zenoss Community
[http://community.zenoss.org/choose-container!input.jspa?contentType=1&containerType=14&container=2003]
Floyd Strimling
2012-11-08 15:40:15 UTC
Floyd Strimling [http://community.zenoss.org/people/fstrimling] created the discussion

"Re: Intermittent DOWN alerts from switches that are UP"

To view the discussion, visit: http://community.zenoss.org/message/69777#69777

--------------------------------------------------------------
Stijn,

We have not seen issues like this before.

Can you please post your zenping debug log so we can take a closer look?

Cheers,

Floyd
--------------------------------------------------------------

Stijn Van den Bruel
2012-11-09 08:08:07 UTC
Stijn Van den Bruel [http://community.zenoss.org/people/stijnvb] created the discussion

"Re: Intermittent DOWN alerts from switches that are UP"

To view the discussion, visit: http://community.zenoss.org/message/69759#69759

--------------------------------------------------------------
Hi Floyd,

I did not know about this log. There seems to be a consistent increase in "missed runs".
Does this have to do with the 5-second interval I have set up? As mentioned before, we had the same problem (fewer alerts, because of fewer checks, I suppose) when the interval was 60 seconds.

2012-11-08 13:28:17,113 INFO zen.maintenance: Performing periodic maintenance
2012-11-08 13:28:17,114 INFO zen.zenping: Counter eventCount, value 621582
2012-11-08 13:28:17,116 INFO zen.zenping: 207 devices processed (0 datapoints)
2012-11-08 13:28:17,120 INFO zen.collector.scheduler: Tasks: 208 Successful_Runs: 37394 Failed_Runs: 0 Missed_Runs: 104280 Queued_Tasks: 0 Running_Tasks: 1
2012-11-08 13:33:17,122 INFO zen.maintenance: Performing periodic maintenance
2012-11-08 13:33:17,123 INFO zen.zenping: Counter eventCount, value 621690
2012-11-08 13:33:17,125 INFO zen.zenping: 207 devices processed (0 datapoints)
2012-11-08 13:33:17,131 INFO zen.collector.scheduler: Tasks: 208 Successful_Runs: 37406 Failed_Runs: 0 Missed_Runs: 104328 Queued_Tasks: 0 Running_Tasks: 1
2012-11-08 13:38:17,133 INFO zen.maintenance: Performing periodic maintenance
2012-11-08 13:38:17,134 INFO zen.zenping: Counter eventCount, value 621806
2012-11-08 13:38:17,135 INFO zen.zenping: 207 devices processed (0 datapoints)
2012-11-08 13:38:17,138 INFO zen.collector.scheduler: Tasks: 208 Successful_Runs: 37418 Failed_Runs: 0 Missed_Runs: 104376 Queued_Tasks: 0 Running_Tasks: 1
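
The growth of the counters between these summaries can be worked out mechanically. The sketch below is an illustration only: it assumes the `Successful_Runs` and `Missed_Runs` values are cumulative and that the summaries are 300 seconds apart (as the timestamps above suggest); the function name `run_deltas` is hypothetical. For the three scheduler lines above, each 5-minute window adds 12 successful and 48 missed runs, i.e. roughly one completed run every 25 seconds, which would suggest that a 5-second cycle is far too short for this device count.

```python
import re

# Matches the zen.collector.scheduler summary lines shown above.
SCHED_RE = re.compile(r"Successful_Runs: (\d+) .*Missed_Runs: (\d+)")


def run_deltas(log_lines):
    """Return (successful, missed) increases between consecutive summaries."""
    counts = [
        (int(m.group(1)), int(m.group(2)))
        for m in map(SCHED_RE.search, log_lines)
        if m
    ]
    return [
        (cur_ok - prev_ok, cur_missed - prev_missed)
        for (prev_ok, prev_missed), (cur_ok, cur_missed) in zip(counts, counts[1:])
    ]


# The three scheduler lines from the log excerpt above.
LOG = """\
2012-11-08 13:28:17,120 INFO zen.collector.scheduler: Tasks: 208 Successful_Runs: 37394 Failed_Runs: 0 Missed_Runs: 104280 Queued_Tasks: 0 Running_Tasks: 1
2012-11-08 13:33:17,131 INFO zen.collector.scheduler: Tasks: 208 Successful_Runs: 37406 Failed_Runs: 0 Missed_Runs: 104328 Queued_Tasks: 0 Running_Tasks: 1
2012-11-08 13:38:17,138 INFO zen.collector.scheduler: Tasks: 208 Successful_Runs: 37418 Failed_Runs: 0 Missed_Runs: 104376 Queued_Tasks: 0 Running_Tasks: 1
""".splitlines()

INTERVAL_SECONDS = 300  # assumed spacing between maintenance summaries

for ok, missed in run_deltas(LOG):
    # Seconds per completed run is a rough estimate, not a measured value.
    per_run = INTERVAL_SECONDS / ok if ok else float("inf")
    print(f"successful +{ok}, missed +{missed}, ~{per_run:.0f} s per completed run")
```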
--------------------------------------------------------------

Stijn Van den Bruel
2012-11-09 08:55:10 UTC
Stijn Van den Bruel [http://community.zenoss.org/people/stijnvb] created the discussion

"Re: Intermittent DOWN alerts from switches that are UP"

To view the discussion, visit: http://community.zenoss.org/message/69791#69791

--------------------------------------------------------------
I have now changed the ping cycle from 5 to 30 and packets in flight to 300.
I am now not seeing "Missed Runs".

We'll try turning on the alerts again with these settings!
--------------------------------------------------------------

Stijn Van den Bruel
2012-11-09 09:20:44 UTC
Stijn Van den Bruel [http://community.zenoss.org/people/stijnvb] created the discussion

"Re: Intermittent DOWN alerts from switches that are UP"

To view the discussion, visit: http://community.zenoss.org/message/69792#69792

--------------------------------------------------------------
Although I'm not getting any "Missed_Runs" any more, I'm still getting loads of alerts (40+ DOWN/CLEAR emails in 20 minutes) ...
--------------------------------------------------------------

Stijn Van den Bruel
2012-11-09 12:40:19 UTC
Stijn Van den Bruel [http://community.zenoss.org/people/stijnvb] created the discussion

"Re: Intermittent DOWN alerts from switches that are UP"

To view the discussion, visit: http://community.zenoss.org/message/69795#69795

--------------------------------------------------------------
I have now installed a completely new physical server (we used a virtual one before) with CentOS 6.3 and Zenoss 4.2.
Without any further configuration, I changed the Max ping failures to 10 seconds and max ping packets in flight, and added our server and 2 switch networks (1 remote and our local).
I now have a total of 175 devices discovered, and have set up a new trigger which sends me an alert without any delay. In the past 10 minutes I have received over 120 emails (including CLEAR messages, thus 60 DOWN alerts).

On our "production" installation we had a 5 minute delay on our server alerts. I am now getting alerts on both servers and switches, so we can assume all devices are affected.

We never had any issues with Zenoss version 3.
--------------------------------------------------------------

Doug Syer
2012-11-10 18:53:57 UTC
Doug Syer [http://community.zenoss.org/people/dsyer%40nwnit.com] created the discussion

"Re: Intermittent DOWN alerts from switches that are UP"

To view the discussion, visit: http://community.zenoss.org/message/69827#69827

--------------------------------------------------------------
Are you binding the ping datasource to all your Ethernet interfaces? Did you set the managed IP to an interface that has more than one IP bound to it, e.g. an HSRP interface?
--------------------------------------------------------------

Stijn Van den Bruel
2012-11-12 07:32:11 UTC
Stijn Van den Bruel [http://community.zenoss.org/people/stijnvb] created the discussion

"Re: Intermittent DOWN alerts from switches that are UP"

To view the discussion, visit: http://community.zenoss.org/message/69837#69837

--------------------------------------------------------------
On the Zenoss server side, we have only 1 IP bound to a network bond, which consists of 4 physical network interfaces.
As we have tried several Zenoss installations with 1 network interface, resulting in the exact same issue, this shouldn't be a problem.

We are getting alerts both from almost all our switches and all our servers (which as mentioned before are connected to the same vlan and core switch as the Zenoss server). Most of our switches have just one mgmt IP, the same goes for the servers.
--------------------------------------------------------------
