Manpage of tkwatcher

tkwatcher

Section: User Commands (1)
Updated: 12/4/1995

NAME

tkwatcher - monitor (watch) user specified command output

SYNOPSIS

tkwatcher [-A] [-a allow_string]... [-d deny_string]...

     [-D [system:]level] [-f controlfile] [-h historyfile]

     [-H hostname] [-M min[.sec]] [-m email_address]

     [-o [outputfile]] [-R report_method_list] [-S] [-v]

DESCRIPTION

tkwatcher runs a series of commands and parses the output of those commands. It compares the values in the parsed output against a series of tests. If any of the tests fail, a warning is reported. Once all of the commands have been run and tested, tkwatcher reports all of the warnings to an interested party.

The default reporting mechanism is email; alternatively, output can be written to a file or to standard output. Syslog, pager, or other real time notification methods can also be used.

The controlfile is used to specify the commands to run, the parsing methods, and the tests.

In contrast to its name, tkwatcher is written in basic tcl, and will run under tclsh, wish, expect, expectk, scotty or just about any other tcl based shell you can imagine.

It was inspired by the program watcher written by Kenneth Inghman, but adds features that I found lacking in the original watcher. Among those features are the ability to:

: select portions of the controlfile
: print command headers in the error messages
: select individual lines from a command output stream using absolute positions, or a regular expression
: select portions of the line using regular expressions as well as white space delimited fields and character ranges.
: perform and test calculations based on the input data
: specify multiple tests on a value that can be combined to determine if a warning should be issued.
: set thresholds for reports when all other tests are positive. This allows the user to set thresholds low enough for problems to be caught early, but excessive noise is eliminated by waiting until enough low threshold tests are passed.

Control File

The syntax for the control file has no resemblance to the format for the original watcher. This was caused by the need to support additional functionality in the control file.

Comment lines in the control file start with a pound sign (#) as the first non whitespace character. They cannot occur inside of a set command.

The control file is a tcl file, and as such uses the tcl syntax. A single associative array called "watch" is defined. The comma separated list of indexes for the elements of the watch array are matched against the regular expressions passed in using the -a and -d flags. By convention, numeric indices are run by default, even if an explicit allow list is defined using -a. For example:


set watch(12,!box7,xs1,memory) { "vmstat -s" ... }

would always be run unless -A is specified. If "-a xs1", or "-a memory" were specified, this test would also be run. However, if "-a box7" was specified, this test would not be run since the presence of "!box7" in the index prevents execution of that watch rule.

Because the control file is run by tkwatcher, it can be used to provide ancillary functions such as:

: saving old history files for later analysis.
: creating files required by the commands in the control file. For example: the command "ps -ef | fgrep -f /daemons | wc -l" can be used to monitor for the presence of daemon processes. The daemons file could be created by commands in the control file.

Just because these tricks can be done does not imply that they should be done, but these abilities have been useful in the past and may just be useful to the reader. A fair knowledge of tcl coding is required to make use of this feature. If you wish to use this feature the procedure Watcher_proc should be defined in the controlfile. This procedure will be called just before tkwatcher processes the watch array.
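As a rough illustration of the second trick, a control file might define Watcher_proc along the following lines. This is only a sketch: it assumes tkwatcher calls Watcher_proc with no arguments, and the file path and daemon names are hypothetical.

```tcl
# Assumption: tkwatcher calls Watcher_proc with no arguments.
proc Watcher_proc {} {
    # Hypothetical path and daemon names, for illustration only.
    set daemon_file /tmp/daemons
    set fh [open $daemon_file w]
    foreach name {sendmail inetd} {
        # One pattern per line, as expected by "fgrep -f".
        puts $fh $name
    }
    close $fh
}
```

A watch entry could then run a command such as "ps -ef | fgrep -f /tmp/daemons | wc -l" against the freshly written file.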

The format of the entries in the control file is a 4 element list in tcl standard style. Braces { , } enclose lists, and nested braces create lists of lists.

Let's take a look at a sample control file entry (numbers enclosed in parentheses are for reference and are not part of the example):


set watch(4) { "nfsstat -c"  (1)
       "nfs client statistics"  (2)
       {                                            (3)
         {=3 {header 0-end %H}}
         {=4 {calls_rpc_server 0 %d}
             {badcalls_rpc_server 1 %d}
             {retransmits 2 %d} {badxid 3 %d}
             {timeout 4 %d}
             {timers_expire 7 %d}
         }
         {"^null" {header 0-end %H}}
         {=10 {getattr% 3 %d} {readlink% 12 %d} }
       }
       {                                             (4)
         {badcalls_rpc_server range 0 0}
         {badcalls_rpc_server pdelta 2%}
         {retrans_% 
                calc 100.0*@retransmits@/@calls_rpc_server@ 
                range 0 5}
         {retrans_badxid 
                calc 
                100.0*(@retransmits@-@badxid@)/@retransmits@ 
                  range 0 5}
       }
  }

The first element (1) is the command string. It is a one element list (string) that is passed to the shell. Each line of the output of this command will be analyzed by the parse table (3) and action table (4). In this example, the command "nfsstat -c" will be run.

The second element (2) is the description string. This string describes the purpose of the test. This description is reported in the warning message.

The third element (3) is a list of lists (parse table) describing what lines to parse and how to parse them into tokens.

The fourth element (4) is a list of lists (action table) describing how to test the tokens generated by applying the parse table.

The first and second elements are self explanatory, but the format of the third and fourth elements requires additional explanation.

parse table

The third element is a list of lists describing what lines to parse and how to parse them into tokens. Each sublist (parse entry) has the form:

line_selector {token_name column-spec format_spec}...

where line selector is:

all

the word all matches every line of the command output.

none

the word none doesn't match any lines.

=<number>

which matches the <number>'th line. The last alternative is:

<regexp>

a tcl regular expression. You should quote with a \ any non-alphanumeric characters in the regular expression that are special to the regular expression parser. Otherwise a seemingly valid regular expression may not work.

This list of characters includes: "*+?()[].^$\|".

See regexp(n) the tcl regular expression man page for the full details.
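The effect of an unquoted special character can be seen directly in a Tcl shell. The sketch below uses a made-up input line; it only demonstrates the behaviour of Tcl's regexp command, not how tkwatcher reads selectors from the control file.

```tcl
# Hypothetical input line; "192x168" is not a dotted address.
set line "eth0 192x168 up"

# Unquoted "." matches any character, so this selector matches anyway.
set loose [regexp {192.168} $line]

# Quoting the "." with a backslash restricts the match to a literal dot.
set strict [regexp {192\.168} $line]

puts "loose=$loose strict=$strict"
```

Here the loose pattern matches and the strict one does not, which is the kind of surprise the quoting advice above is meant to prevent.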

The "token_name" is a string of characters without whitespace or the "@" character. The names (variable names) are used in the action table entries to refer to these parsed values.

column specifier

The "column-spec" defines where on the line the item is. It can be field or column based. The specifications are:

n

a whitespace delimited field counting left to right. Multiple instances of whitespace characters (tab, space) are treated as if they were a single whitespace character. NOTE: Fields are numbered starting with 0 not 1, so the first field on the line is 0, the second field is 1, etc.

-n

a whitespace delimited field counting right to left. Multiple instances of whitespace characters (tab, space) are treated as if they were a single whitespace character. NOTE: Fields are numbered starting with 0 not 1, so the first (last) field on the line is -0, the second (to last) field is -1, etc.

n1-n2

Include all characters in columns running from n1 to n2 (note that both endpoints are included). All lines start at column 0 and the special symbol "end" can be used to specify the end of the line.

If n2 < n1 (e.g. 39-32) or n1 is "end", the columns are counted from right to left rather than from left to right. This is the column equivalent to the "-n" field based specifier.

A specifier of "end-10" would take the last 10 characters from the line. "11-2" would skip the last character and take the next 10 at the end. "end-0" would capture the entire line just as would "0-end". Note that the string is not reversed, only the method of calculating the indices is changed.
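The field based specifiers (n and -n) can be pictured with a few lines of Tcl. This is a minimal sketch of the numbering rules described above, not tkwatcher's own code; the procedure name field_at is hypothetical.

```tcl
# Hypothetical helper illustrating the "n" and "-n" field specifiers.
proc field_at {line spec} {
    # Runs of spaces and tabs count as a single separator.
    set fields [regexp -all -inline {[^ \t]+} $line]
    if {[string match -* $spec]} {
        # -n counts from the right: -0 is the last field, -1 the one before.
        set n [string range $spec 1 end]
        return [lindex $fields end-$n]
    }
    # n counts from the left, starting at 0.
    return [lindex $fields $spec]
}

set line "/dev/dsk/c0t3d0s4 96455 42328 44487 49% /var"
puts [field_at $line 0]
puts [field_at $line 4]
puts [field_at $line -0]
```

For the df line shown, field 0 is the device name, field 4 is the capacity column, and field -0 is the mount point.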

format specifier

The format_spec can be a regular expression enclosed in "/" marks and with one parenthesized subexpression. This subexpression is the value that is saved. For example with input of the form:


13:04:36

a regular expression like:


/..:..:(..)/

would select the seconds from the timestamp. This can be used for any string that needs to be scanned. It always produces a string of type "%s". However, tcl automatically converts strings to numbers when needed, so even regular expression specified values can be used in calc actions. See regexp(n) the Tcl regular expression manual page for the details. Pay particular attention to the specification for subMatchVar which is the functionality that this is based on.
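The underlying Tcl mechanism can be tried on its own. The sketch below applies the same expression with regexp and a subMatchVar; the variable names are chosen for illustration only.

```tcl
set timestamp "13:04:36"

# "whole" receives the full match, "seconds" the parenthesized subexpression.
if {[regexp {..:..:(..)} $timestamp whole seconds]} {
    # The captured value is a string, but expr converts it when needed.
    set next [expr {$seconds + 1}]
    puts "seconds=$seconds next=$next"
}
```

The captured string can be used directly in arithmetic, which is what allows regular expression values to feed calc actions.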

If the format_spec is not a regular expression, then it identifies the item type:

%d

the item is an integer.

%f

the item is a floating point (real) number.

%s

the item is a string.

You can specify characters in front of %s, %d or %f to eliminate a prefix. For example, if the string to be scanned is "printer iddpr2 now printing iddpr2-365 enabled", then with the spec {job_num 4 iddpr2-%s}, job_num has the value "365".

There is one caveat when using prefix elimination with %s: the string assigned by %s will end at the first whitespace character in the string to be scanned. If you need to have embedded whitespace in your scanned string, you will have to use explicit column ranges, or regular expressions, and NOT USE prefix elimination. See scan(n) the Tcl scan manual page for further details.
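Both points can be checked with Tcl's scan command. The sketch below assumes that the format_spec is handed to scan more or less as written; the variable names are illustrative.

```tcl
set line "printer iddpr2 now printing iddpr2-365 enabled"

# Field 4 (counting from 0) is "iddpr2-365".
set field [lindex [regexp -all -inline {[^ \t]+} $line] 4]

# The literal prefix "iddpr2-" is consumed, and %s captures the rest.
scan $field "iddpr2-%s" job_num

# %s stops at the first whitespace character, so " enabled" is not captured.
scan "iddpr2-365 enabled" "iddpr2-%s" truncated

# %d stops at the first non-digit, so a trailing "%" is ignored.
scan "47%" "%d" capacity

puts "job_num=$job_num truncated=$truncated capacity=$capacity"
```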

%k

the item is a keyword. Keywords are used to differentiate between different lines that are parsed the same. For example consider the entry:

set watch(7) {"df -k"
          "disk space"
          {
            {"=1" {header 0-end %H}}
            {"^/" {filesystem 5 %k}
                  {capacity 4 %d}
            }
          }
          {
            {capacity delta 5}
            {capacity range 0 90}
            {capacity delta 25 range 0 70
                 severity alert}
            {capacity range 0 99 severity emerg}
          }
      }

with the lines:

/dev/dsk/c0t3d0s6 192423 80705 92478 47% /usr
/dev/dsk/c0t3d0s4 96455 42328 44487 49% /var

The keyword is the name of the mounted filesystem (i.e. /usr, /var). It lets delta (or any other history operation) retrieve the previous run's capacity value for the same filesystem. Without it, the current capacity of /usr could be compared against the previous run's capacity for /var, since /var is the last input line scanned.

%H

the item starts a header line. This line is saved and reported if there are any warnings for this command. This allows you to have reports with the labels for the output data displayed in the report. It makes interpretation of the data much easier.

%h

the item is concatenated to the header line. This allows the user to build up multi-line headers by using %H to set the first line, and %h to add additional lines.

The parse rules are scanned in the same order as they are entered. The first parse rule whose line selector matches the input line is the ONLY parse rule applied to that line. For example:

Given an input line (generated by df) of:

/dev/dsk/c0t3d0s4 96455 42328 44487 49% /var

and the parse rules:

        {
                { "/var$" {vcapacity 4 %d}}
                { "^/" {capacity 4 %d} }
        }

The first entry (/var$) will match the line, and the second entry (^/) will not be used. This is useful for excluding certain lines or for special casing their attributes. For example, you can have a lower capacity threshold (say 80%) for the /var filesystem while still having a 90% capacity threshold for all other filesystems. By placing the parsing rule for /var first and giving it a different token name, you can attach a separate test to that token, so that a /var capacity of 85% generates a warning.

NOTE: THIS IS A CHANGE FROM PREVIOUS VERSIONS OF TKWATCHER.

If you were counting on fall-through to apply multiple parse rules to a single input line, it won't work anymore.
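Put together, the /var special case might look like the entry below. It is a sketch assembled from the fragments in this section; the index "7,disk" and the exact thresholds are illustrative.

```tcl
set watch(7,disk) { "df -k"
    "disk space"
    {
        {"=1"    {header 0-end %H}}
        {"/var$" {filesystem 5 %k} {vcapacity 4 %d}}
        {"^/"    {filesystem 5 %k} {capacity 4 %d}}
    }
    {
        {vcapacity range 0 80}
        {capacity range 0 90}
    }
}
```

Because the /var rule comes first, a /var line only ever defines vcapacity, so only the 80% test applies to it. Every other filesystem line falls through to the "^/" rule and is tested against 90%.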

action table

The fourth list of lists (action table) is where the tests and other actions are carried out. Each action line has the form:

name action action_args <action action_args>...

The name is either the name of a token specified in the parse table, a global variable name surrounded by "@" signs, or in the case of tcl and calc types is simply a name for the calculation or tcl code.

As indicated above, multiple actions can be placed in an action line. Output is generated only when all of the actions would generate output. As soon as one test would not produce output, no other actions or tests are performed. This results in the actions being ANDed. Different action lines provide an OR operation. This OR operation is not short-circuited.

Unlike the rules in the parse table, ALL applicable action lines are applied, not just the first. This produces as much information from the given command output line as is possible, to make problem determination easier. There are mechanisms (last, orgroup, iftrue, iffalse) that can be used to limit the actions that are applied to reduce the amount of output.

When using actions, remember that they are ANDed together and short-circuited. Thus, actions that make decisions should be placed after actions that do not make decisions. In the list below, the first line of each entry summarizes the important features of the action.

If decision is yes, then the failure of that action stops evaluation of further actions on that line. If warning is yes, the action can generate a warning. If flow is yes, it affects flow of control through the action table.

Note that tcl and calc actions must occur first on action lines. Otherwise, tkwatcher will abort when attempting to process them.

actions

The actions that are defined and their arguments are:

calc <expression>

decision: yes, warning: no, flow: no

The calc action takes an expression that can use values from the parsed string. To insert the value from the parsed string put @ signs around the name specified in the parse list. This works on numbers only. Division by zero is caught and silently ignored. Set debug level e:2 to see warnings. NOTE: tcl does integer division unless one of the operands is a real number, so it is not uncommon to see (@variable@*1.0) in calculations just to get the conversion to real division.

If you need to use the results of one calculation in another calculation, use the "global" action to save the result of your calculations. All non-global variables/token_names are destroyed when a new input line is read.

If a calculation fails because not all of the variables have been replaced, or there is a syntax error, no other commands in the action line are run.
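The substitution and the integer division pitfall can be sketched in plain Tcl. The helper calc_sketch below is hypothetical and is not tkwatcher's implementation; it simply replaces each @name@ with a value and hands the result to expr.

```tcl
# Hypothetical helper: replace each @name@ with its value, then evaluate.
# "values" is a flat list of name/value pairs.
proc calc_sketch {expression values} {
    set map {}
    foreach {name value} $values {
        lappend map "@$name@" $value
    }
    # Evaluate the substituted text as a Tcl expression (numbers only).
    return [expr [string map $map $expression]]
}

# Integer operands give integer division; one real operand forces real division.
set int_div  [expr {7 / 2}]
set real_div [expr {(7 * 1.0) / 2}]

set pct [calc_sketch {100.0*@retransmits@/@calls_rpc_server@} \
             {retransmits 3 calls_rpc_server 200}]
puts "int_div=$int_div real_div=$real_div pct=$pct"
```

Writing the constant as 100.0 rather than 100 has the same effect as the (@variable@*1.0) idiom mentioned above.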

tcl <proc name> {<arguments>}

decision: maybe, warning: maybe, flow: maybe

NOTE: the tcl action is not yet fully implemented. A better API needs to be developed. The description below is correct however.

The tcl action provides a general mechanism for manipulating data including string, boolean and file operations. Arguments are specified in the same way as the calc action with @ signs around any variables. Multiple arguments should be enclosed in a single set of braces. Unlike calc, non-numeric variables are valid. The value returned from the tcl proc is left as the value against which other tests are executed. A descriptive error message is saved using "uplevel 1 {set result <error message>}". If the tcl procedure should prevent evaluation of further tests, the variable stop_eval should be set using "uplevel 1 {set stop_eval 1}".

Knowledge of the internals of tkwatcher is required to use this in any non-trivial way since there is no interface that defines historical variables and other values gleaned by tkwatcher. This should not be used by the faint of heart. I have never really had a need to use it, but somebody convinced me that adding it was a good idea. It could be use to create composite values dependent on advanced parsing of multiple lines of input. See the sample Watcherfile for a simple tcl procedure example.

If the procedure fails because not all of the variables have been replaced, or there is a syntax error, no other commands in the action line are run.

global <global_name>

decision: no, warning: no, flow: no

The global action promotes a variable parsed from a line to a variable of the name <global_name> that persists for the entire command. For example a command produces:


        18962792 total packets received
        0 bad header checksums
        0 with size smaller than minimum
        0 with data size < data length
        0 with header length < data size
        0 with data length < header length
        355284 fragments received

and you want to calculate some percentage of the total packets received. Remember that the parse variables are erased on every new input line, so some mechanism must be used to save the value of total packets received. The following control file entry accomplishes this task:


{ ... {
        { =1 {tot_pkts_ip 0 %d}}
        { =2 {bad_header_sums 0 %d}}
      }
      {
        {tot_pkts_ip global total_packets_ip}
        {bad_header_sums_% calc 100.0*@bad_header_sums@/@total_packets_ip@
                change}
      }
}

If the global keyword was not available this calculation could not be performed since the value of tot_pkts_ip would not be defined when the second line is scanned. (You want to make your calc expression have at least one non-global variable. If you don't the calculation will be evaluated for every line in the input after the lines that defined the global variables.)

You should be careful in choosing names for global variables. If a line local (parse) variable exists with the same name as the global variable, its value will be used instead of the global variable. If you need to do calculations between values gathered from different commands, try running the commands together as in this example:


set watch(4) { {sh -c "nfsstat -c; nfsstat -s"}
       "nfs client statistics"
       {
         ... 
       }
       {
        ...
       }
  }

severity level

decision: no, warning: no, flow: no

The severity action associates an importance level with an action line. The defined severities are taken from the syslog severities. In descending order of importance they are: "emerg alert crit err warning notice". The default level for all actions/tests is notice. The default level does not produce any special actions on output. Setting the severity level causes a string like "SEVERITY **** emerg" to be prepended to each report section in the reports, or causes a change in the level at which the entry is sent to syslog, or some other noticeable change for other reporting methods.

disable

decision: yes, warning: no, flow: no

If present this line will never report. Useful for temporarily disabling a test that is generating noise on multiple hosts.

last

decision: no, warning: no, flow: yes

If the action line that this is a part of would produce a warning, then this is the last action line that is used. The next command input line is read and the parse table is applied to it.

orgroup name

decision: no, warning: no, flow: yes

This adds the action line to a group of or'ed commands that are short-circuited. The group is named "name". Once one of the commands in the name group produces output, evaluation of other action lines in the same group is suppressed. Note that action lines are evaluated in order of entry, so the most specific/highest criticality lines should be entered first in an orgroup.

required

decision: no, warning: yes, flow: no

The required action is used to specify that a variable must be defined during the run of the command. If all of the lines in the command have been read and processed, and the name for the action line has not been defined, a warning is issued. This can be used to verify that daemon programs such as sendmail, inetd, etc. are running. Note that flow commands may cause required to behave strangely, so put all of the required commands at the top of the action table. Also, required really should not be combined with any actions except for: cycle, suppress, last, severity, and orgroup.
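A daemon check built on required might look like the entry below. This is a sketch only: the index, the "ps -e" command, and the token names are illustrative, and it assumes the process ID is the first field of each ps output line.

```tcl
set watch(20,daemons) { "ps -e"
    "required daemons"
    {
        {"sendmail" {sendmail_pid 0 %d}}
        {"inetd"    {inetd_pid 0 %d}}
    }
    {
        {sendmail_pid required}
        {inetd_pid required severity crit}
    }
}
```

If no line of the ps output matches "sendmail", the token sendmail_pid is never defined, and the required action reports it once the command output has been fully processed.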

iftrue

decision: yes, warning: no, flow: no

If the current value (as set from a calc, or tcl, or from the parse) is not zero, continue with the other actions.

iffalse

decision: yes, warning: no, flow: no

If the current value (as set from a calc, or tcl, or from the parse) is zero, continue with the other actions.

range low high

decision: yes, warning: yes, flow: no

The range test action is applied to numbers. The value of the name must fall between the low and high numbers; if not, a warning is printed.

value n

novalue n

decision: yes, warning: yes, flow: no

The value test action can be used for numbers or strings. If the value of the "name" is not "n" a warning is printed. For novalue, if the value of the "name" is "n" a warning is printed. To embed spaces in the value "n" enclose the string in {}'s or "'s.

enum a b c ...

noenum a b c ...

decision: yes, warning: yes, flow: no, Must be last

The enum test action is a generalization of the value test action; the value test action is retained only for historic compatibility, and an enum test action with a single argument is equivalent to it. The value of "name" must be one of a, b, or c. This can be used for numbers or strings. For the noenum form, the value of "name" must not be a, b, or c. To embed spaces in the values a, b, etc., enclose the value in {}'s or "'s. This test MUST be the last on the line: because there is no way to determine the end of the list of values a, b, c ..., all further values (including any actions) are made part of the enum/noenum test.

rdelta low high

decision: yes, warning: yes, flow: no

The rdelta test action calculates the signed difference (delta) between the current value and the value from the previous run, and makes sure that the delta falls in the range between low and high. If not, a warning is printed. This works on numerical values. Low and high can be positive or negative values. The previous value is always subtracted from the current value for the determination of the difference. This results in positive differences for increases in the value from the last run and negative differences for decreases in the value from the last run.

delta allowed_change

decision: yes, warning: yes, flow: no

If the absolute change in the value of "name" is greater than the allowed change, a warning is printed. This works only on numerical values. The allowed change should be a positive number.

pdelta allowed_percent_change[%]

decision: yes, warning: yes, flow: no

The pdelta test action calculates the percentage change in the value of "name" since the last time tkwatcher was run. The percent sign after the number is optional. If the percentage change in the variable is greater than the allowed change shown, a warning is printed. This works only on numerical values. The allowed percent change should be a positive number.

The delta and pdelta tests also support "_up" and "_down" forms. These forms are used to create rising and falling style thresholds. While delta_up and delta_down tests can be done using rdelta tests, they are common enough that the _up and _down forms were created. For example:



        {disk_capacity delta_up 10}
        {disk_capacity delta 10}
        {disk_capacity delta_down 10}

will report if the disk capacity (amount of disk space in use) has gone up by more than 10 since the previous run (a useful thing to know); has changed by more than 10 since the previous run, regardless of whether it has gone up or down; or has gone down by more than 10 since the previous run (an unlikely but happy occurrence).
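The three forms differ only in which direction of change they consider. The sketch below captures that reading of the text; the procedure delta_exceeded is hypothetical, and it assumes a warning is issued only when the change is strictly greater than the allowed change.

```tcl
# Hypothetical helper: returns 1 if the named test would warn, else 0.
proc delta_exceeded {kind previous current allowed} {
    # The previous value is subtracted from the current value.
    set diff [expr {$current - $previous}]
    switch -- $kind {
        delta      { return [expr {abs($diff) > $allowed}] }
        delta_up   { return [expr {$diff > $allowed}] }
        delta_down { return [expr {-$diff > $allowed}] }
        default    { error "unknown kind: $kind" }
    }
}

# Disk capacity fell from 85 to 70 between runs.
puts [delta_exceeded delta      85 70 10]
puts [delta_exceeded delta_up   85 70 10]
puts [delta_exceeded delta_down 85 70 10]
```

With a drop of 15, the delta and delta_down forms would warn while delta_up would stay silent.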

change

decision: yes, warning: yes, flow: no

If there was any change in the value of "name" from the previous run it will print a warning. This works for both strings and numbers.

nochange

decision: yes, warning: yes, flow: no

If there was not a change in the value of "name" from the previous run it will print a warning. This works for both strings and numbers.

suppress count

decision: yes, warning: no, flow: no

The suppress keyword suppresses the action line for count runs following a warning being printed by the action line. This is useful for reducing the amount of noise from certain monitoring operations.

cycle hits period

decision: no, warning: yes and suppresses, flow: no

The cycle keyword suppresses a warning message if there have not been "hits" hits in "period" runs including the current run.

An example may help here. We have a warning that a filesystem is too full. This is the first warning. The cycle statement is "cycle 2 4". The first time through, this action line will suppress the warning. The next time the action statement is run, the filesystem is back to normal. The third time the action is run, the filesystem is too full. We now have two positive hits in the previous 4 (or fewer) runs, and the cycle statement permits the warning. The next time the action runs, the filesystem is normal and no warning is produced. The fifth time the action is run, the filesystem has filled up, and again we have at least 2 warnings in the previous 4 runs, so a warning is produced.
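The "cycle 2 4" walk-through can be reproduced with a short Tcl sketch. The procedure cycle_permits is hypothetical; it models the history as a list of 0/1 flags (1 meaning the test failed on that run) and is not taken from tkwatcher's source.

```tcl
# Hypothetical helper: history is a list of 0/1 flags, oldest first,
# with the current run last. Returns 1 if the warning is permitted.
proc cycle_permits {history hits period} {
    # Look only at the most recent "period" runs (fewer if history is short).
    set window [lrange $history end-[expr {$period - 1}] end]
    set count 0
    foreach flag $window { incr count $flag }
    # Warn only if the current run failed and enough recent runs failed too.
    return [expr {[lindex $history end] && $count >= $hits}]
}

# Full, normal, full, normal, full: the five runs described above.
set history {}
set decisions {}
foreach failed {1 0 1 0 1} {
    lappend history $failed
    lappend decisions [cycle_permits $history 2 4]
}
puts $decisions
```

The resulting decisions are 0 0 1 0 1: the first failure is suppressed, and the third and fifth runs produce warnings.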

OPTIONS

The options for tkwatcher are described below. Note that the options cannot be grouped together. Each option must stand alone on the command line. Thus a command line parameter of "-ASv" is invalid.

-a allow_string

If the index of the watch array is matched by the allow string, optionally followed by a period and a number or by just a number, then the corresponding watch element is run. If something is explicitly allowed, it cannot be denied. If multiple "-a allow_string" options are specified, each allow string is compared, and if any match occurs, the watch array element is run. This makes it easy to run special tests, e.g. "-a interface_speed", in addition to the default tests (which have numeric indices). Multiple tests can be defined by using index names of the form "disk.1", "disk.2", "disk.3", ... "disk.30", which will all be matched by an allow string of disk. Also, if the allow string is matched with an ! mark in front of it, that item is not run. This makes it easy to exclude a test for a given host in the case of a known problem of indeterminate or very long duration. Note that allowing a single integer, e.g. "-a 1", will choose every entry whose index starts with a 1. Changing it to "-a '^1$'" will do what was intended.
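One pattern that is consistent with this description is sketched below. It is a guess at the behaviour, not tkwatcher's actual matching code: the procedure allow_matches is hypothetical, it tests a single index element, and it ignores the ! handling.

```tcl
# Hypothetical helper: does an allow string select one index element?
# Assumed rule: the allow string, then an optional period, then optional digits.
proc allow_matches {allow element} {
    set pattern [format {^(?:%s)\.?[0-9]*$} $allow]
    return [regexp -- $pattern $element]
}

puts [allow_matches disk disk.30]
puts [allow_matches 1 12]
puts [allow_matches {^1$} 12]
puts [allow_matches {^1$} 1]
```

Under this assumed rule, "disk" selects disk.30, a bare "1" also selects 12, and anchoring the allow string as '^1$' restricts the match to the index 1 alone.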

-A

Disables numeric watch array indices unless explicitly allowed. So "-A -a interface" will run only tests whose indices match interface.

-d deny_string

If the deny string is matched by the index of the watch array, then the corresponding watch element is not run. If multiple "-d deny_string" are specified, each deny_string is compared to the index. If any deny string matches, the watch element is not run. An element can not be denied if it is explicitly allowed with the -a flag. As with the allow flag, an optional period and numbers can follow the deny string and will be matched.

-D [system:]level

Set the debug level and the systems to which it applies. The debug level is an integer; the higher the number, the more verbose the output. Anything above level 2 is not very useful for those who don't actively debug tkwatcher. The systems are listed in the source code. A sample would be: Cr:4. This will generate level 4 debugging output for the "C" cycle and "r" reporting systems. The "system:" portion of the debugging spec is optional, so just the number "4" is also a valid argument for the "-D" flag and will produce debugging output for all systems. Useful options for debugging control files are "-D t:1", or "-D t:2". These will produce a list of commands, or commands and command output lines, respectively.

-f controlfile

Set the name of the control file. The default control file is "./Watcherfile". If it doesn't exist, the file ~/.Watcherfile is used.

-h historyfile

Set the location of the history file in which tkwatcher stores the values of the previous run. The default name of the history file is the name of the control file with ".hist" appended.

-H hostname

hostname should be a fully qualified host name. This is just a shortcut way of specifying the allow options for the hostname and its nodename. If the hostname is foo.bar.edu,

-H foo.bar.edu

acts like "-a foo.bar.edu" and "-a foo".

-M min[.sec]

Allows you to run tkwatcher as a daemon. The min parameter is the number of minutes tkwatcher should sleep between runs over the watch array. The minimum value is 1. Since the subprocesses run by tkwatcher can take a lot of resources, the sec parameter is provided to allow tkwatcher to sleep for sec seconds between invocations of the commands in the watch array. It can be set to 0 to provide no delay. During daemon startup no attempt is made to fork and close descriptors; redirection of file descriptors and backgrounding of the job must be done in the invoking shell.

The control file is only read once. A new tkwatcher must be started if you wish to change the controlfile and have the new control file used.

-m email_address

The default email address for notification is the user running tkwatcher. An alternate address can be specified using this flag.

-o [outputfile]

The default output mechanism is to send email. Using this flag email is turned off, and the output can be redirected to a file. If the file argument is missing then the output is sent to standard output.

-R report_method_list

Enables alternate reporting mechanisms. Email, or file output is still enabled even if one of these methods is used. The first three reporting mechanisms are useful in conjunction with the -o flag in that a real-time warning can be sent, and the output can be perused by logging into the affected machine and reading the output file. The message is passed on the command line to the external program. The alternate reporting mechanisms are:

syslog: uses an external program such as /usr/ucb/logger to report a one line summary of the warning. Severity actions are translated directly into syslog severity levels.
pager: uses an external program to generate a page.
x: passes output to an external program for real-time display. While it is called X, and intended for use with a popup X windows based messaging system, any real time messaging system: zephyr, inform, or other program could be used as well. The program must accept the message to be transmitted on the command line. A simple shell wrapper can be used to do this manipulation if required.
When used with a real X popup program, this report mode can work well with the daemon mode above since DISPLAY and authentication info is readily available.
machine: when used with -o, this modifies the output report to generate a line of "^" delimited data for each error. These lines are more easily parsed by other reporting programs. The format of the lines is a number of fields separated by ^ characters. The field elements are: severity, description, key information, type of test (range, enum etc), the trigger, the name of the variable, the current value of the variable, the previous value of the variable, the header line for the command, the line from the command that was analyzed. If any of the values are not available, two ^^ are placed adjacent in the output. If the description, header or line fields contain "^" characters, they are replaced by "-" characters. Other lines produced have only one field and are either warning/error lines with three stars "***" in them, or the lines are empty, or they are the end sentinel line.
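A consumer of the machine format could split each line on the ^ character, as in the sketch below. The procedure parse_machine_line, the short field names, and the sample line (including the form of the trigger field) are all hypothetical; only the field order comes from the description above.

```tcl
# Hypothetical helper: turn one "^" delimited report line into a
# name/value list. Returns an empty list for lines that do not have
# exactly ten fields (warning/error lines, empty lines, the sentinel).
proc parse_machine_line {line} {
    set names {severity description key test trigger
               name current previous header line}
    set values [split $line ^]
    if {[llength $values] != [llength $names]} {
        return {}
    }
    set record {}
    foreach n $names v $values {
        lappend record $n $v
    }
    return $record
}

# Made-up sample; the empty field between "95" and the header stands
# for an unavailable previous value (two adjacent ^ characters).
set sample "alert^disk space^/var^range^0 90^capacity^95^^Filesystem kbytes used avail capacity Mounted on^/dev/dsk/c0t3d0s4 96455 91000 5455 95% /var"
array set rec [parse_machine_line $sample]
puts "$rec(severity): $rec(name) = $rec(current) on $rec(key)"
```

Returning a flat name/value list keeps the helper usable with "array set", as shown, without depending on newer Tcl features.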

-S

Causes empty mail messages to be suppressed. This stops the mail message with a subject line of "no watcher problem" from being sent. I would suggest that at least one run of tkwatcher per day not use this flag. Not using "-S" tests the mail system on the hosts.

-v

This is equivalent to "-D t:1".

SECURITY CONSIDERATIONS

Since the controlfile is tcl code, and is loaded into a running tkwatcher, any operation may be done from it. You should never use a controlfile that you do not trust.

See the BUGS and SECURITY BUGS sections for a few more concerns.

CUSTOMIZATION and INSTALLATION

Change the #! line at the top of the script to point to your tcl based shell. I like "expect -d" myself when debugging, but use plain tclsh in production.

There are two arrays that can be customized. The command associative array is used to set the commands for the pager, x, syslog and mail. Check this and make substitutions for your programs and paths where necessary.

The other array is the report array that can be used to customize the reporting strings for the output mechanisms. Knowledge of the internals of tkwatcher's reporting code is needed to do this successfully.

EXAMPLES

Here is a rather interesting (and partly contrived) example (the numbers in parentheses down the left hand side are reference numbers and are not part of the array definition):


set watch(4,nfs,network) { "nfsstat -c" "nfs client stats"
     {
        {=3 {header 0-end %H}}
        {=4 {calls_rpc_server 0 %d} {badcalls_rpc_server 1 %d}
            {retransmits 2 %d} {badxid 3 %d} {timeout 4 %d}
            {timers_expire 7 %d}
        }
        {=10 {getattr 2 %d} {readlink% 12 %d} }
     }
     {
 (1)    {calls_rpc_server global glob_c_r_s}
        {badcalls_rpc_server delta_up 5}
        {retrans_% calc 100.0*@retransmits@/@calls_rpc_server@ 
                        range 0 5 delta_up 0}
        {retrans_badxid calc 
               100*(@retransmits@-@badxid@)/@retransmits@
               range 50 100 delta_down 0 }
 (2)    {getattr% calc @getattr@*100.0/@glob_c_r_s@ 
               global g_getattr%
               range 0 5 }
 (3)    {@g_getattr%@ calc @g_getattr%@>@readlink%@ value 1}
 (4)    {@g_getattr%@ orgroup b delta_up 10 severity crit}
 (5)    {timeout_badxid calc @timeout@>@badxid@ iftrue
               calc (@timeout@*1.0)/@badxid@ range 2 10000 }
 (6)    {timeout_badxid_1 calc @timeout@<@badxid@ iftrue
               calc (@timeout@*1.0)/@badxid@ range 0 0.5
               delta_up 0 }
     }
 }

Note the use of iftrue in rules 5 and 6. We don't know in advance whether the number of timeouts or the number of badxids will be larger, but we need to know this to set up the ranges and the delta_up or delta_down tests. So each rule starts with a small calc that determines which value is larger, and the iftrue test stops the evaluation of the rule when its calculation is for the opposite condition.
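The guard logic of rules 5 and 6 can be sketched in Python as follows. This is an illustration of the description, not tkwatcher's code: the function name is hypothetical, the range tests (2 to 10000 for rule 5, 0 to 0.5 for rule 6) are deliberately left out, and the handling of a zero badxid count is an assumption made only to keep the sketch from dividing by zero.

```python
def select_ratio_rule(timeout, badxid):
    """Return (rule_number, ratio) for the rule whose iftrue guard passes.

    Returns None when neither guard is true (the two counts are equal).
    """
    if timeout > badxid:
        rule = 5          # guard: calc @timeout@>@badxid@ iftrue
    elif timeout < badxid:
        rule = 6          # guard: calc @timeout@<@badxid@ iftrue
    else:
        return None       # equal counts: both guards are false
    if badxid == 0:
        # Assumption of this sketch: report an undefined ratio instead
        # of dividing by zero.
        return (rule, None)
    # Same expression as in the rules: (@timeout@*1.0)/@badxid@
    return (rule, (timeout * 1.0) / badxid)
```

For example, select_ratio_rule(30, 10) selects rule 5 with a ratio of 3.0, while select_ratio_rule(10, 40) selects rule 6 with a ratio of 0.25.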

We also see the global action used in rules 1 and 2. In rule 2 the result of the getattr% calculation (getattr calls as a percentage of total calls) is saved in g_getattr%. This saved value is then used in rules 3 and 4.

Rules 3 and 4 will not be executed until after rule 2 has been executed, since the variable g_getattr% is created by the global statement in rule 2. In rule 3 we use the g_getattr% global variable in another (boolean) calculation to make sure that the percentage of readlink calls is always less than the percentage of getattr calls. Note that rule 3 is only executed when the readlink% variable is defined, so it will only be executed when line 10 is read.
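The "only executed when its variables are defined" behavior can be pictured with a small Python sketch. It models the behavior described here and is not how tkwatcher is implemented; the helper names and the two sets of defined variables are hypothetical.

```python
import re

# Matches a @name@ reference as used in the example rules.
REF = re.compile(r"@([^@]+)@")


def referenced(rule_text):
    """Names that appear as @name@ in a rule (hypothetical helper)."""
    return REF.findall(rule_text)


def can_run(rule_text, defined):
    """True when every @name@ reference in the rule is already defined."""
    return all(name in defined for name in referenced(rule_text))


rule3 = "@g_getattr%@ calc @g_getattr%@>@readlink%@ value 1"

# After line 4 of the nfsstat output only the RPC counters exist.
after_line_4 = {"calls_rpc_server", "glob_c_r_s", "retransmits", "badxid"}
# After line 10, getattr and readlink% are parsed and rule 2 has set g_getattr%.
after_line_10 = after_line_4 | {"getattr", "readlink%", "g_getattr%"}
```

With these sets, can_run(rule3, after_line_4) is false and can_run(rule3, after_line_10) is true, which matches the ordering described for rules 2 and 3.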

Rule 4 adds one more test to the calculated global value (g_getattr%): it reports a critical problem if the value has increased by more than 10. Rule 4 is executed for every line after the 10th input line because g_getattr% is defined, so it is executed for lines 11, 12 and so on. If the rule produces output on line 10, the orgroup will prevent output on lines 11 through the end. It will also prevent execution of the delta_up action, if and only if output was produced. If output was not produced, all of rule 4 is executed for every additional line in the output of "nfsstat -c". While this is a performance issue, it is inconsequential from a reporting point of view.

FILES

./Watcherfile, ~/.Watcherfile: The default tkwatcher command file.
./Watcherfile.hist, ~/.Watcherfile.hist: The default tkwatcher history files. The name of the history file is derived from the name of the control file that is used.

DIAGNOSTICS

The control file is only sourced, so any tcl syntax errors will result in diagnostics from tcl.

The calculations are done using tcl as well, so tcl diagnostics about improper expression syntax and errors are generated.

Warnings about command failures are printed.

BUGS

The original cut of this was 8k and encompassed all of the functions of the original Watcher. It took me 5 hours to produce. I am now nearing the 100 hour mark, and it is nearing the 70Kb mark with comments and debug statements. It is only 48K without them. However, it is still too large for my tastes. Then again, performance isn't shabby: it still takes more wall clock time for my processes to run than it does for tkwatcher to analyze them, and the memory usage isn't bad either.

A mistyped command line can erase your Watcherfile. For example, tkwatcher -o Watcherfile will result in an empty Watcherfile. The correct syntax is: tkwatcher -o -f Watcherfile. Arguably this is a misfeature of the command line syntax.

The delta_up/down and pdelta_up/down may not work right if the values being checked are negative.

This program has been tested through use, not formal testing, so combining actions in certain interesting ways may have unexpected results.

There are enough ways of reducing chatter that injudicious use may result in loss of useful info. Then again, who wants to wade through 100 mail messages about the same problem?

There is no formal parsing or syntax checking of the control file in the program since it would slow it down too much. Even reading in syntax checking code can be painful. There should be a separate syntax checking/pretty printing program for this. Since the control file is just a tcl script, "tclsh watcherfile" can be used as a syntax check for the code. It doesn't perform any semantic checks however.

The sense of the testing actions is not always consistent, and it helps to have a warped mind when attempting to create certain types of tests.

It should be possible to customize the report_format array without having to know the internals of tkwatcher.

Keywords should be able to contain an equals sign. The current practice of changing it to a - is not satisfactory.

The command line parser should allow grouping of options. It should at the least warn about invalid groupings.

Reporting mechanisms other than email or standard output (e.g. paging, real-time reporting using x or inform, and syslog) haven't been tested as well as I would like.

If the value of a variable contains the @ character, the tcl and calc actions will be fooled into thinking that variable substitution failed. As a result the calc and tcl actions won't be executed. Use sed or tr as part of the input command to substitute another character for the @ character.
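A small Python sketch may help show why a stray @ is a problem. It assumes, purely for illustration, that a leftover "@" after substitution is what signals a failed substitution; the function names, the sample values, and the choice of "_" as the replacement character are all hypothetical.

```python
import re


def substitute(expr, values):
    """Replace each @name@ with its value; unknown names are left alone."""
    return re.sub(
        r"@([^@]+)@",
        lambda m: str(values.get(m.group(1), m.group(0))),
        expr,
    )


def substitution_failed(expr):
    # Assumed check: any "@" left over means a variable was not replaced.
    return "@" in expr


values = {"owner": "root@host", "count": "3"}
bad = substitute("@owner@ @count@", values)

# Same effect as piping the input command through tr or sed so that
# "@" becomes another character before tkwatcher sees it.
clean = {name: value.replace("@", "_") for name, value in values.items()}
good = substitute("@owner@ @count@", clean)
```

Here substitution_failed(bad) is true even though every variable was defined, while substitution_failed(good) is false.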

SECURITY BUGS

When calling an alternate report mode, the report should be passed to the stdin of the report command rather than passing it on the command line. This will prevent the unintended execution of code which is a security risk. Exactly how a cracker could make use of this bug is unclear, but I have been at this too long to ignore the risk. I am sure that bugtraq will have an exploit for this if tkwatcher is in wide use.
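The stdin approach suggested here could look roughly like the following Python sketch. It is not part of tkwatcher: send_report is a hypothetical helper, and the report program in the example is a stand-in that echoes its standard input. The point is that the message never becomes part of a command line, so nothing in it can be parsed as shell syntax or as an option.

```python
import subprocess
import sys


def send_report(program_argv, message):
    """Run the report program and feed it the message on stdin.

    program_argv is a fixed argument list, executed without a shell.
    """
    return subprocess.run(
        program_argv,
        input=message,
        text=True,
        capture_output=True,
        check=True,
    )


# Stand-in report program: a Python one-liner that echoes its stdin.
echo_stdin = [sys.executable, "-c",
              "import sys; sys.stdout.write(sys.stdin.read())"]
message = "load high; $(id) `date`\n"
result = send_report(echo_stdin, message)
```

The shell metacharacters in the sample message arrive at the report program unchanged, as ordinary text.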

tkwatcher should really test the controlfile to make sure it is a real file and not a link. Also it should refuse to write/read any historyfile that has world write permissions. Tcl 7 doesn't support any way of testing for these on an open file handle, and any other way of testing produces a race condition. Better that it should not give people a false sense of security.
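For comparison, a language that exposes fstat on an open descriptor can make these checks without the race. The Python sketch below is an illustration only (open_checked is a hypothetical name, and it assumes a POSIX system that provides O_NOFOLLOW): the file is opened first, and the checks are then made on the descriptor that was actually opened, not on the path.

```python
import os
import stat


def open_checked(path):
    """Open path read-only, refusing symlinks and world-writable files.

    Returns an open file descriptor; raises OSError on refusal.
    """
    # O_NOFOLLOW makes open() fail if the final path component is a symlink.
    fd = os.open(path, os.O_RDONLY | os.O_NOFOLLOW)
    try:
        info = os.fstat(fd)  # inspects the file that was actually opened
        if not stat.S_ISREG(info.st_mode):
            raise PermissionError(f"{path}: not a regular file")
        if info.st_mode & stat.S_IWOTH:
            raise PermissionError(f"{path}: world writable")
    except Exception:
        os.close(fd)
        raise
    return fd
```

The caller is responsible for closing the returned descriptor, for example by wrapping it with os.fdopen.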