Tkwatcher is a tcl-based program for monitoring a system.

ANNOUNCING release 1.4 of tkwatcher. (See the end for a Changelog.)

Purpose

Tkwatcher is a tcl program that allows monitoring and analysis of program output. It can use any tcl-based shell, including tclsh, expect, wish, or scotty. It runs with tcl-7.3 and tcl-8.0 and should work with future versions.

By default it reports problems by email, but it can also send problem reports to a file or to standard output. The reports can be human-readable or program-readable. In addition to these reporting modes, it can log errors using external programs to permit syslogging, paging, or other real-time notification of errors.

Benefits over Watcher

It was inspired by the program watcher by Kenneth Ingham (not Inghman as mistyped in the docs these many years. Hope to fix this typo in version 1.5, sorry Mr. Ingham), but adds features that I found lacking in the original watcher. Among those features are the ability to:

select portions of the controlfile

print command headers in the error messages

select individual lines from a command output stream using absolute positions, or a regular expression

perform and test calculations based on the input data

specify multiple tests on a value that are anded together to determine if a warning should be issued.

set thresholds for reports when all other tests are positive. This allows the user to set thresholds low enough for problems to be caught early, but excessive noise is eliminated by waiting until enough low threshold tests are passed.

What it can monitor

This tool has been used to:

monitor disk space
look for stalled jobs in the print queue
look for runaway processes chewing up cpu time
monitor swap space changes
look for problems with network interfaces:
- excessive collision rates
- bad rpc calls
- excessive bad xids indicating that an nfs server is overloaded
monitor for swapping and paging activity
verify operation of the X windows font server
monitor the ntp network for hosts that deviate excessively from established norms
monitor for excessive ping round trip time
verify that required daemons are running on the system
look for excessive copies of certain programs running on the system
monitor users who keep software licenses for more than a preset amount of time
watch for changing ip to ethernet mappings in the arp cache, which can indicate an arp attack

It lets you parse any line of command output, extract values from that line, and perform tests on those values.

Types of tests it can perform

The list of tests includes:

is the value in a given numeric range?
is a required value present?
is the value equal/not equal to X?
is the value equal/not equal to X or Y or Z or A ...?
has the value increased/decreased by a certain amount from the previous run?
has the value increased/decreased by a certain percentage from the previous run?
has the value changed or not changed since the last run?

You can perform calculations on multiple values to apply tests to a calculated value. You can also write arbitrary tcl procedures that can be used for special purpose tests.
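As a rough illustration of what such tests compute, here is a minimal sketch in plain tcl. The procedure names (outside_range, increased_by, increased_by_percent) are made up for this example and are not tkwatcher's own procedures or its tcl proc interface. It also assumes that a range test flags a value lying outside the given bounds, which is how the disk example later in this file reads.

```tcl
# Hypothetical helpers for illustration only; not tkwatcher code.

# 1 when value lies outside [low, high], i.e. outside the acceptable range.
proc outside_range {value low high} {
    return [expr {$value < $low || $value > $high}]
}

# 1 when value rose by at least `amount` since the previous run.
proc increased_by {value previous amount} {
    return [expr {$value - $previous >= $amount}]
}

# 1 when value rose by at least `percent` percent of the previous value.
# A previous value of 0 never counts as a percentage increase.
proc increased_by_percent {value previous percent} {
    if {$previous == 0} { return 0 }
    return [expr {($value - $previous) * 100.0 / $previous >= $percent}]
}

# Several tests on one value are anded: warn only when all of them fire.
set capacity 85
set last_capacity 70
if {[outside_range $capacity 0 80] && [increased_by $capacity $last_capacity 10]} {
    puts "warning: capacity is $capacity, was $last_capacity"
}
```

With the sample values the script prints "warning: capacity is 85, was 70", since 85 is outside 0-80 and the rise of 15 is at least 10.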

To give a better idea of what it can do, a sample entry is shown below:

set watch(6,disk) {"df -k"
         "disk space"
         {
            {"=1" {header 0-end %H}}
            {"^/mnt" {ignore 0 %s}}
            {"^/var" {filesystem 0 %k} {varcapacity 4 %d}}
            {"^/" {filesystem 0 %k} {capacity 4 %d}}
         }
         {
            {capacity delta_up 10 range 0 80}
            {capacity delta_up 30 range 0 60 severity alert}
            {capacity range 0 99 severity emerg orgroup a}
            {capacity range 0 98 delta_up 0 severity alert orgroup a}
            {capacity range 0 90 delta_up 0 orgroup a}
            {varcapacity range 0 75 change}
            {varcapacity delta_up 10 range 0 50 severity warning}
         }
     }

df -k produces output that looks like:

Filesystem           1k-blocks      Used Available Use% Mounted on
/dev/hda5              1763113   1001124    670868  60% /
/dev/hda6               763113    125344    637769  16% /var
/dev/hda1              2044208   1244848    799360  61% /disk/hda1
/dev/hdc                634944    634944         0 100% /mnt/cdrom
/dev/sda4                98078     23740     74338  24% /mnt/dos/zip

This checks for:

/var capacity that has changed and is above 75% full generates a warning
/var capacity that has increased by 10 and is above 50% full generates a warning

capacity increasing by more than 10% when the disk is more than 80% full
capacity increasing by more than 30% when the disk is more than 60% full (severity alert)

capacity at more than 99% gets a severity emerg warning
capacity at more than 98% that has increased gets a severity alert warning
capacity at more than 90% that has increased gets a warning

Only one of the previous three, which share orgroup a, will generate a report, even if the disk has just gone up to 100% full.

Capacity measurements for any removable disks mounted under /mnt are ignored.
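To make the field selection concrete, the sketch below mimics the effect of {filesystem 0 %k} {capacity 4 %d} on one line of the df -k output shown above. It is not tkwatcher's parser: parse_df_line is a made-up name, the meaning of %k is not reproduced (field 0 is simply kept as a label), and the code assumes tcl 8.3 or later for "regexp -all -inline".

```tcl
# Illustration only; not tkwatcher's real parsing code.
proc parse_df_line {line} {
    # Split on runs of whitespace into fields numbered from 0.
    set fields [regexp -all -inline {\S+} $line]
    set filesystem [lindex $fields 0]
    # %d reads the leading integer of field 4 and stops at the "%" sign.
    if {[scan [lindex $fields 4] %d capacity] != 1} {
        error "cannot parse capacity in: $line"
    }
    return [list $filesystem $capacity]
}

puts [parse_df_line "/dev/hda5   1763113   1001124    670868  60% /"]
```

For the sample line this prints "/dev/hda5 60"; the capacity value 60 is what the range and delta_up actions would then be applied to.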

Files List

The distribution consists of:

Readme - this file
tkwatcher.1 - man page for tkwatcher
Watcherfile - a sample tkwatcher controlfile
tkwatcher - the tcl script
ed.scr - an ed script to strip comments and extra debug stuff from tkwatcher
tests - directory of various tests. Some examples may be useful.

Installation

Tkwatcher should run provided tclsh is in your path. Some external commands, such as mail and logger, are used to report problems. If the defaults in the tkwatcher script don't work for you, change the defaults in the command array located near the top of the tkwatcher file. For example, the default command for reporting via syslog (which works under linux) is:

  set command(syslog) "/usr/bin/logger -t tkwatcher -i -p daemon."

If logger is in /usr/ucb/logger on your system, simply change the line to read:

  set command(syslog) "/usr/ucb/logger -t tkwatcher -i -p daemon."
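If the same tkwatcher file has to run on hosts with different layouts, the path could be picked at startup instead of edited by hand. This is only a sketch of that idea, not something tkwatcher ships with; first_executable is a hypothetical helper and the two paths are the ones mentioned above.

```tcl
# Hypothetical helper, not part of tkwatcher: return the first path in
# the list that names an executable regular file, or "" if none does.
proc first_executable {paths} {
    foreach path $paths {
        if {[file isfile $path] && [file executable $path]} {
            return $path
        }
    }
    return ""
}

# Keep the existing default unless one of the candidates is found.
set logger [first_executable {/usr/bin/logger /usr/ucb/logger}]
if {$logger != ""} {
    set command(syslog) "$logger -t tkwatcher -i -p daemon."
}
```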

Those interested in increasing its performance may want to hard-code the path to the tcl interpreter in the file.

Other than that, chmod +x the file (if needed), put it on your path, create a Watcherfile, and you should be all set.

For security reasons, the location of the Watcherfile and history file should not be writable by anyone except the user running tkwatcher.

To reduce tkwatcher's size (and thus increase its startup speed) and to remove debugging calls (also increasing its speed), execute the following command:

grep -v "^#" ed.scr | ed tkwatcher

This will create the file tkwatcher.strip, which can be used with the tracing subsystems: F for function trace, t for watcherfile trace, or e for viewing handled exceptions. The -v flag will work, since it uses the "t" watcherfile tracing subsystem.

Tkwatcher is copyright 1995, 2000 John P. Rouillard (rouilj@ieeeSTOPSPAM.org). Tkwatcher may not be sold. Tkwatcher may be redistributed freely. There are no other restrictions on how this program may be used. There is no warranty with this program. It works for me; your mileage may vary.

Written by: John Rouillard (rouilj@ieeeSTOPSPAM.org) with thanks to Mark Lamourine.

CHANGELOG: Changes since 1.3:

Many cosmetic fixes, including code reorg to put configurable bits closer to the top of the file.

Tcl procedure interface now better defined. Multiple field arguments are now properly passed to tcl proc. Added an example tcl proc to Watcherfile and updated the manual page.

The history file name is now derived from the control filename. Previously the file ~/.watcher.hist was always used. A control filename longer than 9 characters will result in a history file name longer than 14 characters, which may be a problem on some systems. But I consider those systems archaic and I really don't want to support them in the mainline code.

Added the ability to specify a regular expression for parsing data from a selected field. This is a more generic version of the prefix elimination function that was present in earlier and current versions of tkwatcher.

Added ability to designate fields counting from right to left so that you can select the rightmost space-separated field. This comes from the Ideas file in the original watcher.
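The idea of right-to-left field numbering can be sketched in plain tcl as follows. This does not show the Watcherfile syntax for the feature, which is documented in the man page; field_from_right is a made-up helper, numbering from 0 is an assumption, and the code assumes tcl 8.3 or later.

```tcl
# Illustration only: field n counted from the right, with 0 as the
# rightmost whitespace-separated field (the numbering is an assumption).
proc field_from_right {line n} {
    set fields [regexp -all -inline {\S+} $line]
    return [lindex $fields end-$n]
}

set line "/dev/hda6   763113    125344    637769  16% /var"
puts [field_from_right $line 0]
puts [field_from_right $line 1]
```

For the sample line this prints /var and then 16%.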

Added ability to specify column ranges in reverse order to grab columns by counting from the end. For example, a range of 39-32 would grab the data in the line that was 39 to 32 columns from the end. Note that it doesn't reverse the data in the line; it merely specifies columns counting from the right.
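In plain tcl, counting columns from the end of a line looks roughly like this. It is a sketch of the idea only: columns_from_end is a made-up helper, and whether tkwatcher counts the last column as 0 or 1 is not stated here, so the example assumes 0.

```tcl
# Illustration only: return the characters lying between `far` and `near`
# columns from the end of the line, in their original left-to-right order.
# The last character is taken to be column 0 (an assumption).
proc columns_from_end {line far near} {
    return [string range $line end-$far end-$near]
}

puts [columns_from_end "abcdefghij" 4 2]
```

This prints fgh: the data is not reversed, only the column positions are counted from the right.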

If the command to be parsed cannot be started properly (not found, can't exec, etc.), tkwatcher no longer bails out with a tcl error. It reports the error through the output reporting mechanism.

Fixed a couple of tracing subsystem typos.

Fixed a bug that resulted in tkwatcher not reporting a "required" error when there were no matching parse lines.

Fixed a bug where a prefix elimination parse string such as "iddpr2-%d" caused the history file to save bogus information. This happened when the prefix string failed to parse the information properly, which occurs often in column-based parsing when a command rearranges the output columns as numbers increase. The history file data is now blanked out for the failed parse. Since tkwatcher reports the failed parse, the owner will be notified of the problem, so I don't consider loss of history info an issue.

Fixed a bug that prevented execution of other actions after a value, novalue, enum or noenum action.

Fixed a bug that made the action 'value "a value with spaces"' or novalue not work correctly. Enum and noenum were not affected by this bug.

The previous history value (if it exists) is now displayed properly when a tcl proc is called.

A tcl proc can now stop evaluation of other actions by setting the variable stop_eval to 1.

Obtaining tkwatcher and more info/examples

Tkwatcher can be obtained from

ftp://ftp.cerias.purdue.edu/pub/tools/unix/sysutils/tkwatcher/tkwatcher1_4.tar.gz

More info is available in the man page, which also has annotated examples. The tarball has a full working Watcherfile example, which is also available here. I have edited the Watcherfile to obscure/eliminate host names and IP addresses. Hopefully I did not break anything in the process.

-- John Rouillard