Obviously, the central application that my employer runs is still running on OpenVMS (or I'd probably be out of a job). When a standardised monitoring solution was chosen for implementation here, little attention was paid to monitoring OpenVMS, and consequently the chosen product had no operating-system-specific agent to support the platform.
The monitoring solution does, however, support SNMP, but when we asked the consultants hired to configure it to build a simple dashboard for the central application, it all proved too difficult for them.
Recently, I put a proof of concept dashboard together from a number of tools that I was already using, and a custom MIB to describe some parts of the application.
The screen was originally designed with just the three large indicators, which are what matters to the business. The two smaller indicators cover critical middleware that's important to the Infrastructure Team.
Simple as the screen is, there are a number of technologies behind it, and the data for the indicators is gathered in a number of different ways.
The "Server Status" indicator displays the state of the four production machines running the application. The machines are in a homogenous VMS Cluster, so the loss of a machine, even two machines, it not catastrophic. The indicator displays the up/down status of the machines by asking Nagios for their latest ping states.
The "Login Status" indicator displays the state of the application itself. The business has a number of locations, and each of these locations can have end user access permitted or denied based on the status of overnight runs and other things (for example, maintenance windows). If 100% of locations allow logins, the indicator is green, If no locations are allowing logins, the indicator is red. Else it's amber. The information for this indicator is obtained by an eSNMP subagent specifically coded for the dashboard. It populates one OID's value with the number of valid locations, and another OID's value with the number of locations that currently permit logins.
The "Response Time" indicator's data is also made available by the same custom eSNMP subagent. The subagent code regularly runs a canned application query against real data in the application, and records how long it takes (more on this in a minute). The state of the indicator is based on "acceptable" response time as defined by the end users.
The "Samba" indicator displays the state of the SMB service on the cluster. This is obtained as the last status that Nagios has about the state of the service.
And lastly, the "Attunity" indicator displays the state of this middleware layer. The Infrastructure Team already had a monitoring job for this service that ran every five minutes via an external scheduling system. It was a simple matter to add an SNMP SET command to the command procedure to set the value of an OID in the custom eSNMP subagent so the dashboard code could retrieve it.
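That extra step amounts to a single SNMP SET. The real version is one more line in a DCL command procedure, but for illustration here's the same idea expressed in Python over Net-SNMP's snmpset (the OID, community string, and 0/1 encoding of the state are placeholders):

```python
# Placeholder illustration of the SET the monitoring job performs; the
# real version is one extra line of DCL. OID, community string, and the
# integer encoding of the Attunity state are assumptions.
import subprocess

OID_ATTUNITY = ".1.3.6.1.4.1.99999.1.4"

def publish_attunity_state(is_up):
    subprocess.check_call(
        ["snmpset", "-v2c", "-c", "private", "vmscluster",
         OID_ATTUNITY, "i", "1" if is_up else "0"])
```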
The dashboard generation code itself is written in Python, and has the job of retrieving a number of SNMP OID values, as well as the last states that Nagios has recorded for specific hosts and services. From this information, the HTML is regenerated every 60 seconds.
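The whole loop is not much more than this sketch, which reuses the helpers above; the output path, markup, thresholds, and host names are invented for illustration:

```python
# Sketch of the generation loop, building on the helpers sketched earlier.
# Output path, markup, thresholds, and host names are invented.
import time

PROD_HOSTS = ["vms1", "vms2", "vms3", "vms4"]   # placeholder names

def render(states):
    rows = "".join('<div class="indicator %s">%s</div>' % (colour, name)
                   for name, colour in states.items())
    return "<html><body>%s</body></html>" % rows

while True:
    up = hosts_up(PROD_HOSTS)
    states = {
        "Server Status": "green" if up == len(PROD_HOSTS)
                         else ("red" if up == 0 else "amber"),
        "Login Status": login_status(),
        "Response Time": response_status(),
        # "Samba" and "Attunity" are gathered the same two ways...
    }
    with open("/var/www/html/dashboard.html", "w") as f:
        f.write(render(states))
    time.sleep(60)
```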
The other thing worth mentioning about this setup is the care taken to ensure that SNMP GET requests can't affect the production system, particularly where the response time test is concerned. The eSNMP subagent does not run the response time test code when an SNMP GET is received, but rather returns the result from the last time it ran.
The test itself runs in a separate kernel thread (implemented with pthreads) inside the eSNMP subagent, on a periodic basis and independently of any SNMP activity.
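The subagent is C, but the pattern is simple enough to sketch in Python: a background worker runs the timed query on its own schedule and caches the result, and the GET path only ever reads the cache.

```python
# Illustration (in Python; the real subagent is C with pthreads) of keeping
# the timed test completely decoupled from SNMP GET handling.
import threading
import time

_last_ms = None
_lock = threading.Lock()

def run_canned_query():
    """Stand-in for the real application query."""
    time.sleep(0.5)

def worker(interval=300):
    global _last_ms
    while True:
        start = time.monotonic()
        run_canned_query()
        elapsed = (time.monotonic() - start) * 1000.0
        with _lock:
            _last_ms = elapsed
        time.sleep(interval)

def on_snmp_get():
    """The GET handler never runs the test; it just returns the cached value."""
    with _lock:
        return _last_ms

threading.Thread(target=worker, daemon=True).start()
```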
(Also, this project was used as a demonstration of why it's worth learning more than one language as a programmer. Languages used in this project include C, DCL, ASN.1, Python, HTML, CSS, and Bash.)
Posted at August 28, 2013 12:30 PM