Stijn Van den Bruel
2012-11-08 14:47:42 UTC
Stijn Van den Bruel [http://community.zenoss.org/people/stijnvb] created the discussion
"Intermittent DOWN alerts from switches that are UP"
To view the discussion, visit: http://community.zenoss.org/message/69756#69756
--------------------------------------------------------------
Since we upgraded from Zenoss 3 to 4.2, we have been experiencing strange, intermittent DOWN alerts from almost all of our switches.
The number of alerts varies depending on several factors, but as a simple example: one of our remote locations (20 km away, connected over fiber) with around 25 switches generates 100+ alerts (and thus 200+ DOWN and clear emails) in a two-hour timespan. The same happens with switches under our desks and in the server room, connected straight to the core switches.
What we have found out:
- Switches further away (both in distance and in hop count, so probably a little more latency, though never more than 10 ms) seem to be more affected than switches closer to our core switches
- If we ping a switch IP from our local workstations or from a server (not Zenoss), we see no drops or latency problems, just a consistent latency of a few ms, even while we are receiving alerts for that switch
- If we ping a switch that we are getting loads of alerts from, from the Zenoss Linux box itself, we suddenly stop getting alerts for that switch and it keeps showing as healthy in Zenoss. I have tried pinging a whole batch of switches at the same time, logging the output of each to a file, which results in no alerts from any of those switches (see the sketch after this list)
- In those ping output logs, we see that on all of these switches we get a "truncated" reply from another IP, which can be a random IP on the network. I suspect this may have something to do with the alerts we are receiving
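For anyone who wants to reproduce the parallel ping-and-log test, here is a minimal sketch of what we run on the Zenoss box. The switch IPs are placeholders, and the regex assumes Linux iputils ping output (reply lines of the form "64 bytes from 10.0.1.1: icmp_seq=1 ..."), so adjust it for other platforms. It also flags the "truncated reply from another IP" symptom described above.

#!/usr/bin/env python3
# Ping several switches in parallel, log each one's output to a file,
# and flag replies that arrive from an IP other than the one pinged.
# Minimal sketch: switch IPs are placeholders.
import re
import subprocess
import threading

SWITCHES = ["10.0.1.1", "10.0.1.2", "10.0.1.3"]  # placeholder IPs

def ping_and_log(target):
    # 7200 one-second probes covers roughly the two-hour window above
    proc = subprocess.Popen(["ping", "-c", "7200", target],
                            stdout=subprocess.PIPE, text=True)
    with open("ping-%s.log" % target, "w") as log:
        for line in proc.stdout:
            log.write(line)
            m = re.search(r"bytes from ([0-9.]+)", line)
            if m and m.group(1) != target:
                # e.g. a "truncated" reply coming from a random IP
                print("%s: reply from unexpected source %s"
                      % (target, m.group(1)))

threads = [threading.Thread(target=ping_and_log, args=(ip,))
           for ip in SWITCHES]
for t in threads:
    t.start()
for t in threads:
    t.join()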
What we have tried:
- Our collector configuration is currently as follows:
--- Ping tries: 2 (we also tried 3, without much effect)
--- Ping Cycle Time: 5 (moved down from 60 in order to generate more alerts for testing purposes)
--- Maximum ping packets in flight: 200 (I have also tried much higher values for testing purposes, without any positive impact; see the sketch after this list)
- Moved back to our Zenoss 3 VM, which doesn't generate any bogus alerts
- Installed Zenoss 4.2 again, both with the deploy script and manually
- Probably lots of other things which I don't remember right now
- Installed SP1
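For reference, here is roughly how we understand these settings: "ping tries" is the number of probes per device per cycle, "ping cycle time" is the seconds between cycles, and "maximum ping packets in flight" caps how many probes are outstanding at once. The toy sketch below (not Zenoss's actual zenping code; the target IPs are placeholders) just illustrates the in-flight cap as a semaphore bounding concurrent probes.

#!/usr/bin/env python3
# Toy illustration of the collector settings above: "tries" probes per
# device, with a semaphore capping how many probes are in flight at once.
# Not Zenoss's zenping implementation; target IPs are placeholders.
import subprocess
import threading

PING_TRIES = 2        # probes per device per cycle
MAX_IN_FLIGHT = 200   # cap on concurrently outstanding probes
TARGETS = ["10.0.1.%d" % i for i in range(1, 26)]  # ~25 placeholder switches

slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)

def probe(ip):
    with slots:  # blocks if MAX_IN_FLIGHT probes are already outstanding
        up = subprocess.call(["ping", "-c", str(PING_TRIES), "-W", "1", ip],
                             stdout=subprocess.DEVNULL) == 0
    print("%s: %s" % (ip, "UP" if up else "DOWN"))

threads = [threading.Thread(target=probe, args=(ip,)) for ip in TARGETS]
for t in threads:
    t.start()
for t in threads:
    t.join()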
Our Zenoss 3 installation was running on Debian. We are now running CentOS, as Debian does not seem to be properly supported at this point.
All our switches are HP ProCurve.
Has anyone experienced anything like this? Should we go back to version 3?
Looking forward to any help or suggestions!
Stijn
--------------------------------------------------------------
"Intermittent DOWN alerts from switches that are UP"
To view the discussion, visit: http://community.zenoss.org/message/69756#69756
--------------------------------------------------------------
Since we upgraded from Zenoss 3 to 4.2, we are experiencing strange, intermittend down alertrs from almost all of our switches.
The amount of alerts vary depening on several factors, but a simple example of one of our remote (20 km away with fiber in between) locations with around 25 switches results in 100+ alerts (thus 200+ DOWN + clear emails)Â in a 2 hour timespan. This is the same with switches under our desks and in the serverroom, connected straight to the core switches.
What we have found out:
- It seems that switches further away (both distance and more hops, thus probably a litt more latency, but never more than 10 ms), are more affected then switches closer to our core-switches
- If we ping a switch IP from our local workstations or a server (not Zenoss), we do not see any drops or latency problems with a consistent latency of a few ms, even while we are receiving alerts from that switch
- If we ping a switch which we are gettings loads of alerts from from the Zenoss linux box, we are suddenly not getting any more alerts for that switch and it keeps on showing healthy in Zenoss. I have tried to ping a whole lot of switches at the same time, logging the output of each to a file, which then results in no alerts from any of those switches ...
- In those ping output logs, we see that on all of these switches, we get a "truncated" pong from another IP ... This can be a random IP in the network. I have suspected this may have something to do with the alerts we are receiving.
What we have tried:
- Our collectors configuration is currently as follows:
--- Ping tries: 2 (this has been set to 3 without much effect)
--- Ping Cycle Time: 5 (this has been moved down from 60 in order to generate more alerts for testing purpose)
--- Maximum ping packets in flight: 200 (I have moved this up way higher for testing purposes without any positive impact)
- Move back to our Zenoss 3 VM, which doesn't generate any bogus alerts
- Install Zenoss 4.2 again, both with deploy script and manually
- Probably lots of things which I don't remember right now
- Installed SP1
Our Zenoss 3 installation was running on Debian. We are now running CentOS as Debian does not seem to be supported properly at this point.
All our switches are HP Procurve
Has anyone experienced anything like this? Should we go back to version 3?
Looking forward to any help or suggestions!
Stijn
--------------------------------------------------------------
Reply to this message by replying to this email -or- go to the discussion on Zenoss Community
[http://community.zenoss.org/message/69756#69756]
Start a new discussion in zenoss-users by email
[discussions-community-forums-zenoss--***@community.zenoss.org] -or- at Zenoss Community
[http://community.zenoss.org/choose-container!input.jspa?contentType=1&containerType=14&container=2003]