Delay between FS threshold breach and event

Discussion:

James Stewart

2011-12-05 04:25:30 UTC

James Stewart [http://community.zenoss.org/people/amorphic] created the discussion

"Delay between FS threshold breach and event"

To view the discussion, visit: http://community.zenoss.org/message/63034#63034

--------------------------------------------------------------
Hello all,

I have just experienced a strange problem. A filesystem on one of our Linux servers started to fill up, eventually breaching first its Error threshold @ 12:45 and then its Critical threshold @ 12:55. This can be seen in the graph for the filesystem in question:

[Image not available: graph of the filesystem's usage]

However, we did not receive any events for the Error threshold breach. We did receive a single event for the Critical threshold breach, but it did not arrive until about 40 minutes after the breach occurred (and it was subsequently auto-cleared):

[Image not available: the Critical threshold event]

As the filesystem usage is graphed in Zenoss and I do not see any errors for this host in zenperfsnmp.log, I can only assume that the data was collected correctly by zenperfsnmp.

So, can anybody suggest why this delay between the threshold breach and the resulting event may have occurred? There was no unusual load on my Zenoss server at the time and I have not seen any problems like this previously.

Failing that, any suggestions as to where I might look for further clues?

Cheers,

James
--------------------------------------------------------------

Reply to this message by replying to this email -or- go to the discussion on Zenoss Community
[http://community.zenoss.org/message/63034#63034]

Start a new discussion in zenoss-users by email
[discussions-community-forums-zenoss--***@community.zenoss.org] -or- at Zenoss Community
[http://community.zenoss.org/choose-container!input.jspa?contentType=1&containerType=14&container=2003]

Chet Luther

2011-12-07 15:25:28 UTC

Chet Luther [http://community.zenoss.org/people/cluther] created the discussion

"Re: Delay between FS threshold breach and event"

To view the discussion, visit: http://community.zenoss.org/message/63100#63100

--------------------------------------------------------------
Could you provide the exact configuration of each of these thresholds?
--------------------------------------------------------------


James Stewart

2011-12-07 23:45:29 UTC

James Stewart [http://community.zenoss.org/people/amorphic] created the discussion

"Re: Delay between FS threshold breach and event"

To view the discussion, visit: http://community.zenoss.org/message/63133#63133

--------------------------------------------------------------
Sure Chet, thanks for your interest...

This is based on the standard Filesystem monitoring template applied to /Server, polling OID 1.3.6.1.2.1.25.2.3.1.6 to get a value for usedBlocks.

I had a requirement to be able to set custom thresholds for every filesystem on every server. Creating local template copies for this would be a mess, so I have a slightly more complex setup to determine my Critical, Error and Warning thresholds. For each server I have 3 Custom Properties (each shown with examples):

cFilesystemCritical: '/|95 /boot|80 /home|95 /opt|95 /tmp|90 /var|90'
cFilesystemError: '/|90 /boot|70 /home|90 /opt|90 /tmp|80 /var|80'
cFilesystemWarning: '/|85 /boot|60 /home|95 /opt|95 /tmp|70 /var|70'

I then use a one-liner in the 'Maximum Value' field of each threshold to dynamically obtain the threshold for that filesystem like so, (for the Critical threshold):

here.totalBlocks * float(here.device().getProperty('cFilesystemCritical').strip().split(here.name())[1].split()[0].split('|')[1]) / 100

A little complicated, but it has worked fine for a long time and continues to raise timely filesystem utilisation threshold breach events across many servers on a daily basis.
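
For readers following the one-liner, here is a rough standalone sketch of the same lookup, unrolled into two small functions. The names parse_thresholds and max_used_blocks are made up for illustration and are not part of Zenoss; the property string is the cFilesystemCritical example from above, and the block count of 1000000 is an arbitrary stand-in for here.totalBlocks.

```python
def parse_thresholds(prop):
    """Turn a string such as '/|95 /boot|80' into {'/': 95.0, '/boot': 80.0}."""
    thresholds = {}
    for entry in prop.split():
        # Each entry is '<mount point>|<percent>'; split on the last '|'.
        mount, _, percent = entry.rpartition('|')
        thresholds[mount] = float(percent)
    return thresholds


def max_used_blocks(total_blocks, prop, mount):
    """Return the used-block count at which the threshold for `mount` is reached."""
    return total_blocks * parse_thresholds(prop)[mount] / 100


# Hypothetical usage: 1000000 stands in for here.totalBlocks.
critical = '/|95 /boot|80 /home|95 /opt|95 /tmp|90 /var|90'
print(max_used_blocks(1000000, critical, '/boot'))  # 800000.0
```

This is plain Python and cannot be pasted into the 'Maximum Value' field as is. One behavioural difference from the one-liner: it matches mount points exactly, so a property that listed /var/log before /var would not return the /var/log percentage for /var.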

In the case above, it seems there was a delay somewhere in the chain of:

snmp poll->snmp value returned->threshold applied->event raised

As the filesystem utilisation was graphed and there were no errors in the zenperfsnmp log, I can only assume that the SNMP poll returned data in a timely fashion. This leads me to believe that the delay occurred while processing the obtained value. However, 40 minutes between the threshold breach and the event being raised seems very strange...

Any ideas/help would be greatly appreciated...

J.
--------------------------------------------------------------
