Tkwatcher is a tcl based program for monitoring a system.
ANNOUNCING release 1.4 of tkwatcher. (See the end for a Changelog.)
Tkwatcher is a tcl program that allows monitoring and analysis of
program output. It can use any tcl based shell including tclsh,
expect, wish, or scotty. It runs with tcl-7.3 and tcl-8.0 and should
work with future versions.
By default it reports problems by email, but it can also send problem
reports to a file or to standard output. The reports can be human or
program readable. In addition to these reporting modes, it can log
errors using external programs to permit syslogging, paging or other
real-time notification of errors.
It was inspired by the program watcher by Kenneth Ingham (not Inghman
as mistyped in the docs these many years. Hope to fix this typo in
version 1.5, sorry Mr. Ingham), but adds features that I found
lacking in the original watcher. Among those features are the ability to:
- select portions of the controlfile
- print command headers in the error messages
- select individual lines from a command output stream using absolute
positions, or a regular expression
- perform and test calculations based on the input data
- specify multiple tests on a value that are anded together to
  determine if a warning should be issued
- set thresholds for reports when all other tests are positive. This
  allows the user to set thresholds low enough for problems to be
  caught early, while excessive noise is eliminated by waiting until
  enough low threshold tests are passed.
This tool has been used to:
- monitor disk space
- look for stalled jobs in the print queue
- look for runaway processes chewing up cpu time
- monitor swap space changes
- look for problems with network interfaces:
    - excessive collision rates
    - bad rpc calls
    - excessive bad xid's indicating that an nfs server
      is overloaded
- monitor for swapping and paging activity
- verify operation of the X windows font server
- monitor the ntp network for hosts with excessive deviance
  from established norms
- monitor for excessive ping round trip time
- verify that required daemons are running on the system
- look for excessive copies of certain programs running
on the system
- monitor users who keep software licenses for more than a preset
amount of time
- watch for changing ip to ethernet mappings in the arp cache indicating an arp attack.
It permits you to parse any command output line, extracting values from
the line and performing tests on these values. The list of tests
includes:
- is the value in a given numeric range?
- is a required value present?
- is the value equal/not equal to X?
- is the value equal/not equal to X or Y or Z or A ...?
- has the value increased/decreased by a certain amount
  from the previous run?
- has the value increased/decreased by a certain percentage
  from the previous run?
- has the value changed or not changed since the last run?
You can perform calculations on multiple values to apply tests to a
calculated value. You can also write arbitrary tcl procedures that can
be used for special purpose tests.
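As a rough illustration of what the numeric tests above compute, the
following tcl sketch expresses three of them as standalone procedures.
The procedure names (in_range, delta_up, percent_up) and their argument
order are hypothetical and are not taken from the tkwatcher source; the
real test keywords are described in the man page.

```tcl
# Hypothetical helpers for illustration only; not part of tkwatcher.

# True (1) when $value lies inside the closed range $low..$high.
proc in_range {value low high} {
    expr {$value >= $low && $value <= $high}
}

# True (1) when $value has risen by more than $amount since the
# previous run.
proc delta_up {value previous amount} {
    expr {($value - $previous) > $amount}
}

# True (1) when $value has risen by more than $percent percent of the
# previous value. A previous value of 0 is treated as "no result" here,
# which is an arbitrary choice made for this sketch.
proc percent_up {value previous percent} {
    if {$previous == 0} { return 0 }
    expr {(($value - $previous) * 100.0 / $previous) > $percent}
}

puts [in_range 60 0 80]
puts [delta_up 95 80 10]
puts [percent_up 120 100 10]
```

In the sample entry later in this file, a report appears to be issued
when a value falls outside its range, so a real test would probably act
on the negation of in_range.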
To give a better idea of what it can do, a sample entry is shown
below:
set watch(6,disk) {"df -k"
    "disk space"
    {
        {"=1"    {header 0-end %H}}
        {"^/mnt" {ignore 0 %s} }
        {"^/var" {filesystem 0 %k} {varcapacity 4 %d}}
        {"^/"    {filesystem 0 %k} {capacity 4 %d}}
    }
    {   {capacity delta_up 10 range 0 80}
        {capacity delta_up 30 range 0 60 severity alert}
        {capacity range 0 99 severity emerg orgroup a}
        {capacity range 0 98 delta_up 0 severity alert orgroup a}
        {capacity range 0 90 delta_up 0 orgroup a}
        {varcapacity range 0 75 change}
        {varcapacity delta_up 10 range 0 50 severity warning}
    }
}
df -k produces output that looks like:
Filesystem  1k-blocks     Used  Available  Use%  Mounted on
/dev/hda5     1763113  1001124     670868   60%  /
/dev/hda6      763113   125344     637769   16%  /var
/dev/hda1     2044208  1244848     799360   61%  /disk/hda1
/dev/hdc       634944   634944          0  100%  /mnt/cdrom
/dev/sda4       98078    23740      74338   24%  /mnt/dos/zip
This checks for the following conditions:
- the capacity of the /var filesystem has changed and is above 75%
  full: generates a warning
- the capacity of the /var filesystem has increased by 10 and is
  above 50% full: generates a warning
- capacity increasing by more than 10% when the disk is more than 80% full
- capacity increasing by more than 30% when the disk is more than 60% full
- capacity at more than 99% gets a severity emerg warning
- capacity at more than 98% that has increased gets a severity alert warning
- capacity at more than 90% that is increasing gets a warning
  Only one of the previous three will generate a report, even if the disk
  has just gone up to 100% full.
- capacity measurements for any removable disks mounted under the /mnt
  filesystem are ignored.
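To make the field extraction step more concrete, here is a minimal tcl
sketch of pulling field 4 (counting from 0) out of one df line and
reading it as an integer. The helper name field_value is hypothetical,
and treating %d as a scan format is an assumption made for this sketch;
tkwatcher's own parser may work differently. It assumes tcl 8.3 or
later for regexp -all -inline.

```tcl
# Hypothetical helper, not part of tkwatcher: return whitespace-separated
# field number $index (counting from 0) of $line, parsed with the scan
# format $fmt.
proc field_value {line index fmt} {
    # Collect runs of non-blank characters, so repeated spaces between
    # columns do not produce empty fields.
    set fields [regexp -all -inline {\S+} $line]
    set raw [lindex $fields $index]
    if {[scan $raw $fmt value] != 1} {
        error "cannot parse field $index ('$raw') with format $fmt"
    }
    return $value
}

set line "/dev/hda5 1763113 1001124 670868 60% /"
# Select the line with a regular expression, then extract the Use% column.
if {[regexp {^/} $line]} {
    puts "capacity = [field_value $line 4 %d]"
}
```

Scanning "60%" with %d stops at the percent sign, so the value that
would be tested is the integer 60.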
The distribution consists of:
- Readme - this file
- tkwatcher.1 - man page for tkwatcher
- Watcherfile - a sample tkwatcher controlfile
- tkwatcher - the tcl script
- ed.scr - an ed script to strip comments and extra debug stuff
from tkwatcher
- tests - directory of various tests. Some examples may be useful.
Tkwatcher should run provided tclsh is in your path. Some external
commands, such as mail and logger, are used to report problems. If the
defaults in the tkwatcher script don't work for you, change the
defaults in the command array located near the top of the tkwatcher
file. For example, the default command for reporting via syslog
(which works under linux) is:
set command(syslog) "/usr/bin/logger -t tkwatcher -i -p daemon."
If it is in /usr/ucb/logger, simply change the line to read:
set command(syslog) "/usr/ucb/logger -t tkwatcher -i -p daemon."
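If the same script has to run on hosts with different layouts, one
possible approach is to probe for the binary rather than edit the path
by hand. The sketch below is not how tkwatcher itself sets its
defaults; the helper name first_executable is hypothetical, and the two
candidate paths are simply the ones mentioned above.

```tcl
# Hypothetical helper, not part of tkwatcher: return the first path in
# $paths that names an executable file, or an empty string if none does.
proc first_executable {paths} {
    foreach path $paths {
        if {[file executable $path]} {
            return $path
        }
    }
    return ""
}

set logger [first_executable {/usr/bin/logger /usr/ucb/logger}]
if {$logger != ""} {
    # Same arguments as the default shown above, with the probed path.
    set command(syslog) "$logger -t tkwatcher -i -p daemon."
}
```

When neither path exists, command(syslog) is left untouched, so whatever
default is already set in the script stays in effect.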
Those interested in increasing its performance may want to hard code
the path to the tcl interpreter in the file.
Other than that, chmod +x the file (if needed), put it on your path,
create a Watcherfile, and you should be all set.
For security reasons, the location of the watcherfile and history
file should not be writable by anybody except the user running
tkwatcher.
To reduce tkwatcher's size (and thus increase its startup speed) and
to remove debugging calls (also increasing its speed), execute the
following command:
grep -v "^#" ed.scr | ed tkwatcher
This will create the file tkwatcher.strip that can be used with the
tracing subsystems, F for function trace, t for watcherfile trace,
or e for viewing handled exceptions. The -v flag will work since it
uses the "t" watcherfile tracing subsystem.
Tkwatcher is copyright 1995, 2000 John P. Rouillard
(rouilj@ieeeSTOPSPAM.org)
Tkwatcher may not be sold.
Tkwatcher may be redistributed freely.
There are no other restrictions on how this program may be used.
There is no warranty with this program. It works for me, your
mileage may vary.
Written by: John Rouillard (rouilj@ieeeSTOPSPAM.org) with thanks to
Mark Lamourine.
CHANGELOG: Changes since 1.3:
Many cosmetic fixes, including code reorg to put configurable bits
closer to the top of the file.
Tcl procedure interface now better defined. Multiple field arguments are
now properly passed to tcl proc. Added an example tcl proc to
Watcherfile and updated the manual page.
The history file name is now derived from the control
filename. Previously the file ~/.watcher.hist was always used. A
control filename that is > 9 characters will result in a history file
name > 14 characters and may be a problem on some systems. But I
consider those systems archaic and I really don't want to support them
in the mainline code.
Added the ability to specify a regular expression for parsing data
from a selected field. This is a more generic version of the prefix
elimination function that was present in earlier and current versions
of tkwatcher.
Added ability to designate fields counting from right to left so that
you can select the rightmost space separated field. From the
Ideas file in the original watcher.
Added ability to specify column ranges in reverse order to grab columns
by counting from the end. E.g., a range of 39-32 would grab the data in the
line that is 39 to 32 columns from the end. Note that this doesn't reverse
the data in the line; it merely specifies columns counting from the right.
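The idea of counting from the right can be sketched in plain tcl with
end-relative indices. The helper names below are hypothetical, and the
exact column numbering tkwatcher uses (for instance whether the last
column is 0 or 1) is not stated here, so treat the offsets as an
assumption. The sketch assumes tcl 8.3 or later.

```tcl
# Hypothetical helpers for illustration only; not part of tkwatcher.

# Return the $n-th space-separated field counting from the right,
# where 1 is the rightmost field.
proc field_from_right {line n} {
    set fields [regexp -all -inline {\S+} $line]
    return [lindex $fields end-[expr {$n - 1}]]
}

# Return the characters between $far and $near positions before the
# last character of $line. The text is not reversed; only the way the
# columns are counted changes.
proc columns_from_right {line far near} {
    return [string range $line end-$far end-$near]
}

set line "alpha beta gamma delta"
puts [field_from_right $line 1]
puts [field_from_right $line 2]
puts [columns_from_right $line 4 0]
```

With this input the three calls print delta, gamma and delta, which shows
that selecting from the right leaves the selected text in its original
order.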
If the command to be parsed does not start properly (not found, can't
exec, etc.), tkwatcher doesn't bail out with a tcl error. Instead it
reports the error through the output reporting mechanism.
Fixed a couple of tracing subsystem typos.
Fixed a bug that resulted in tkwatcher not reporting a "required" error
when there were no matching parse lines.
Fixed a bug where a prefix elimination parse string "iddpr2-%d"
caused the history file to save bogus information. This happened when
the prefix string failed to parse the information properly. This
happens a lot in column based parsing when a command rearranges the
output columns as numbers increase. The historyfile data is now
blanked out for the failed parse. Since tkwatcher reports
the failed parse, the owner will be notified of the problem, so I
don't consider loss of history info an issue.
Fixed a bug that prevented execution of other actions after a value,
novalue, enum or noenum action.
Fixed a bug that made the action 'value "a value with spaces"' or
novalue not work correctly. Enum and noenum were not affected by this bug.
Previous history value is displayed (if it exists) properly when a tcl proc is called.
A tcl proc can now stop evaluation of other actions by setting the
variable stop_eval to 1.
Obtaining tkwatcher and more info/examples
TkWatcher can be obtained from
ftp://ftp.cerias.purdue.edu/pub/tools/unix/sysutils/tkwatcher/tkwatcher1_4.tar.gz
More info can be found in the man page, which also has annotated
examples. The tarball has a full working Watcherfile example. I have edited the
Watcherfile to obscure/eliminate host names and IP
addresses. Hopefully I did not break anything in the process.
-- John Rouillard