Manpage of tkwatcher
tkwatcher
Section: User Commands (1)
Updated: 12/4/1995
NAME
tkwatcher - monitor (watch) user specified command output
SYNOPSIS
tkwatcher
[-A]
[-a allow_string]...
[-d deny_string]...
[-D [system:]level]
[-f controlfile]
[-h historyfile]
[-H hostname]
[-M min[.sec]]
[-m email_address]
[-o [outputfile]]
[-R report_method_list]
[-S]
[-v]
DESCRIPTION
tkwatcher runs a series of commands, and parses the output of those
commands. It compares the values in the parsed output against a
series of tests. If any of the tests fail, a warning is reported.
Once all of the commands have been run and tested, tkwatcher reports all
of the warnings to an interested party.
The default reporting mechanism is email; output can instead be sent to
a file or to standard output. Syslog, pager, or other real-time
notification methods can also be used.
The controlfile is used to specify the commands to run, the parsing
methods, and the tests.
Despite its name, tkwatcher is written in basic tcl, and will
run under tclsh, wish, expect, expectk, scotty or just about any other
tcl based shell you can imagine.
It was inspired by the program watcher written by Kenneth Ingham, but
adds features that I found lacking in the original watcher. Among
those features are the ability to:
-
select portions of the controlfile
-
print command headers in the error messages
-
select individual lines from a command output stream using absolute
positions, or a regular expression
-
select portions of the line using regular expressions as well as white
space delimited fields and character ranges.
-
perform and test calculations based on the input data
-
specify multiple tests on a value that can be combined to determine
if a warning should be issued.
-
set thresholds for reports when all other tests are positive. This
allows the user to set thresholds low enough for problems to be caught
early, but excessive noise is eliminated by waiting until enough low
threshold tests are passed.
Control File
The syntax for the control file has no resemblance to the format for
the original watcher. This was caused by the need to support
additional functionality in the control file.
Comment lines in the control file start with a pound sign (#) as the
first non whitespace character. They cannot occur inside of a set
command.
The control file is a tcl file, and as such uses the tcl syntax. A
single associative array called "watch" is defined. The comma
separated list of indexes for the elements of the watch array are
matched against the regular expressions passed in using the -a and -d
flags. By convention, numeric indices are run by default, even if an
explicit allow list is defined using -a. For example:
-
set watch(12,!box7,xs1,memory) { "vmstat -s" ... }
would always be run unless -A is specified. If "-a xs1", or "-a
memory" were specified, this test would also be run. However, if "-a
box7" was specified, this test would not be run since the presence of
"!box7" in the index prevents execution of that watch rule.
Because the control file is run by tkwatcher, it can be used
to provide ancillary functions such as:
-
saving old history files for later analysis.
-
creating files required by the commands in the control file. For
example: the command "ps -ef | fgrep -f /daemons | wc -l" can be used
to monitor for the presence of daemon processes. The daemons file
could be created by commands in the control file.
Just because these tricks can be done does not imply that they should
be done, but these abilities have been useful in the past and may just
be useful to the reader. A fair knowledge of tcl coding is required to
make use of this feature. If you wish to use this feature the
procedure Watcher_proc should be defined in the controlfile. This
procedure will be called just before tkwatcher processes the watch
array.
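As a minimal sketch of such an ancillary function, the control file
fragment below defines Watcher_proc so that it writes a pattern file of
the kind used by the fgrep example above. The assumption that the
procedure is called with no arguments, the file location, and the daemon
names are all hypothetical; check the tkwatcher source for the actual
calling convention.

```tcl
# Hypothetical location for the pattern file; not taken from tkwatcher.
set daemons_file /tmp/daemons

# Assumption: tkwatcher calls Watcher_proc with no arguments.
proc Watcher_proc {} {
    global daemons_file
    # One fgrep pattern per line; the daemon names are examples only.
    set fh [open $daemons_file w]
    puts $fh "sendmail"
    puts $fh "inetd"
    close $fh
}
```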
The format of the entries in the control file is a 4 element list in
tcl standard style. Braces { , } enclose lists, and nested braces
create lists of lists.
Let's take a look at a sample control file entry (numbers enclosed in
parentheses are for reference and are not part of the example):
-
set watch(4) { "nfsstat -c" (1)
"nfs client statistics" (2)
{ (3)
{=3 {header 0-end %H}}
{=4 {calls_rpc_server 0 %d}
{badcalls_rpc_server 1 %d}
{retransmits 2 %d} {badxid 3 %d}
{timeout 4 %d}
{timers_expire 7 %d}
}
{"^null" {header 0-end %H}}
{=10 {getattr% 3 %d} {readlink% 12 %d} }
}
{ (4)
{badcalls_rpc_server range 0 0}
{badcalls_rpc_server pdelta 2%}
{retrans_%
calc 100.0*@retransmits@/@calls_rpc_server@
range 0 5}
{retrans_badxid
calc
100.0*(@retransmits@-@badxid@)/@retransmits@
range 0 5}
}
}
The first element (1) is the command string. It is a one element list
(string) that is passed to the shell. Each line of the output of this
command will be analyzed by the parse table (3) and action table
(4). In this example, the command "nfsstat -c" will be run.
The second element (2) is the description string. This string describes
the purpose of the test. This description is reported in the warning
message.
The third element (3) is a list of lists (parse table) describing what
lines to parse and how to parse them into tokens.
The fourth element is a list of lists (action table) describing
how to test the tokens generated by applying the parse table.
The first and second elements are self explanatory, but the format of
the third and fourth elements requires additional explanation.
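Because each entry is an ordinary tcl list, its four elements can be
picked apart with the standard list commands. The sketch below uses a
cut-down df entry purely to show the nesting; the variable names are
illustrative and this is not necessarily how tkwatcher itself reads the
array.

```tcl
# A cut-down entry: command, description, parse table, action table.
set watch(7) {"df -k"
    "disk space"
    {
        {"^/" {filesystem 5 %k} {capacity 4 %d}}
    }
    {
        {capacity range 0 90}
    }
}

set command      [lindex $watch(7) 0]   ;# df -k
set description  [lindex $watch(7) 1]   ;# disk space
set parse_table  [lindex $watch(7) 2]
set action_table [lindex $watch(7) 3]

# First parse entry: a line selector followed by token specifications.
set entry    [lindex $parse_table 0]
set selector [lindex $entry 0]          ;# ^/
set tokens   [lrange $entry 1 end]      ;# {filesystem 5 %k} {capacity 4 %d}
```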
parse table
The third element is a list of lists describing what lines to parse
and how to parse them into tokens. Each sublist (parse entry) has the
form:
line_selector {token_name column-spec format_spec}...
where line selector is:
- all
-
the word all matches every line of the command output.
- none
-
the word none doesn't match any lines.
- =<number>
-
which matches the <number>'th line. The last alternative is:
- <regexp>
-
a tcl regular expression. You should quote with a \ any
non-alphanumeric characters in the regular expression that are special
to the regular expression parser. Otherwise a seemingly valid regular
expression may not work.
This list of characters includes:
"*+?()[].^$\|".
See regexp(n), the tcl regular expression man page, for the full details.
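The effect of leaving a special character unquoted can be seen directly
with the tcl regexp command. This is only an illustration of regexp
itself; the sample strings are made up.

```tcl
# An unquoted "." matches any character, so this pattern also matches "125".
set loose [regexp {1.5} "125"]

# Quoting the "." restricts the match to a literal period.
set strict_miss [regexp {1\.5} "125"]
set strict_hit  [regexp {1\.5} "load 1.5"]
```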
The "token_name" is a string of characters without whitespace or the
"@" character. The names (variable names) are used in the action
table entries to refer to these parsed values.
column specifier
The "column-spec" defines where on the line the item is. It can be
field or column based. The specifications are:
- n
-
a whitespace delimited field counting left to right. Multiple
instances of whitespace characters (tab, space) are treated as if they
were a single whitespace character. NOTE: Fields are numbered starting
with 0 not 1, so the first field on the line is 0, the second field is
1 etc.
- -n
-
a whitespace delimited field counting right to left. Multiple
instances of whitespace characters (tab, space) are treated as if they
were a single whitespace character. NOTE: Fields are numbered starting
with 0 not 1, so the first (last) field on the line is -0, the second
(to last) field is -1 etc.
- n1-n2
-
Include all characters in columns running from n1 to n2 (note that
both endpoints are included). All lines start at column 0 and the
special symbol "end" can be used to specify the end of the line.
If n2 < n1 (e.g. 39-32) or n1 is "end", the columns are counted from
right to left rather than from left to right. This is the
column equivalent to the "-n" field based specifier.
A specifier of "end-10" would take the last 10 characters from the
line. "11-2" would skip the last character and take the next 10 at the
end. "end-0" would capture the entire line just as would "0-end". Note
that the string is not reversed; only the method of calculating the
indices is changed.
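The field based specifiers ("n" and "-n") can be sketched in a few lines
of tcl. The procedure name select_field is hypothetical and this is not
tkwatcher's own code; it only mirrors the numbering rules described
above. Column ranges are not covered.

```tcl
# Sketch only: pick a whitespace delimited field as described for the
# "n" and "-n" specifiers.
proc select_field {line spec} {
    # Runs of tabs and spaces count as a single separator.
    set fields [regexp -all -inline {\S+} $line]
    if {[string index $spec 0] eq "-"} {
        # "-0" is the last field, "-1" the second to last, and so on.
        set n [string range $spec 1 end]
        return [lindex $fields end-$n]
    }
    # Fields are numbered from 0, left to right.
    return [lindex $fields $spec]
}

set line "/dev/dsk/c0t3d0s4   96455 42328 44487 49% /var"
select_field $line 4    ;# 49%
select_field $line -0   ;# /var
select_field $line -1   ;# 49%
```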
format specifier
The format_spec can be a regular expression enclosed in "/" marks and
with one parenthesized subexpression. This subexpression is the value
that is saved. For example with input of the form:
-
13:04:36
a regular expression like:
-
/..:..:(..)/
would select the seconds from the timestamp. This can be used for any
string that needs to be scanned. It always produces a string of type
"%s". However, tcl automatically converts strings to numbers when
needed, so even regular expression specified values can be used in
calc actions. See regexp(n) the Tcl regular expression manual page for
the details. Pay particular attention to the specification for
subMatchVar which is the functionality that this is based on.
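The timestamp example corresponds to the following use of regexp with a
subMatchVar. The variable names whole, seconds and next are
illustrative only.

```tcl
# The parenthesized subexpression lands in the second match variable.
regexp {..:..:(..)} "13:04:36" whole seconds

# $seconds holds the string "36"; expr treats it as a number when needed.
set next [expr {$seconds + 1}]
```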
If the format_spec is not a regular expression, then it identifies the
item type:
- %d
-
the item is an integer.
- %f
-
the item is a floating point (real) number.
- %s
-
the item is a string.
You can specify characters in front of %s, %d or %f to eliminate a
prefix. For example, suppose the string to be scanned is "printer iddpr2
now printing iddpr2-365 enabled". With the spec {job_num 4 iddpr2-%s},
job_num has the value "365".
One caveat applies when using prefix elimination with %s: the string
assigned by %s will end at the first whitespace character in the string
to be
scanned. If you need to have embedded whitespace in your scanned
string, you will have to use explicit column ranges, or regular
expressions, and NOT USE prefix elimination. See scan(n) the Tcl scan
manual page for further details.
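Prefix elimination and the whitespace caveat can both be tried with the
tcl scan command. This sketch assumes the format is handed to scan more
or less unchanged, as the reference to scan(n) suggests; the second
input string is made up.

```tcl
# Field 4 of the example line is "iddpr2-365"; the literal prefix in the
# format is consumed and %s captures what follows.
scan "iddpr2-365" "iddpr2-%s" job_num          ;# job_num is 365

# %s stops at the first whitespace character, so "bar" is not captured.
scan "iddpr2-foo bar" "iddpr2-%s" truncated    ;# truncated is foo
```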
- %k
-
the item is a keyword. Keywords are used to differentiate between
different lines that are parsed the same. For example consider the
entry:
-
set watch(7) {"df -k"
"disk space"
{
{"=1" {header 0-end %H}}
{"^/" {filesystem 5 %k}
{capacity 4 %d}
}
}
{
{capacity delta 5}
{capacity range 0 90}
{capacity delta 25 range 0 70
severity alert}
{capacity range 0 99 severity emerg}
}
}
-
with the lines:
/dev/dsk/c0t3d0s6 192423 80705 92478 47% /usr
/dev/dsk/c0t3d0s4 96455 42328 44487 49% /var
-
the keyword is the name of the mounted filesystem (i.e. /usr,
/var). It is used to allow the delta (or any history operation) to
access the value for capacity for the same filesystem from the
previous run's historical data, otherwise we could be comparing the
current values for capacity on /usr against the previous run's
capacity for /var, since /var is the last input line scanned.
- %H
-
the item starts a header line. This line is saved and reported if
there are any warnings for this command. This allows you to have
reports with the labels for the output data displayed in the report.
It makes interpretation of the data much easier.
- %h
-
the item is concatenated to the header line. This allows the user to
build up multi-line headers by using %H to set the first line, and %h
to add additional lines.
The parse rules are scanned in the same order as they are entered.
The first parse rule whose line selector matches the input line is the
ONLY parse rule applied to that line. For example:
- Given an input line (generated by df) of:
-
-
/dev/dsk/c0t3d0s4 96455 42328 44487 49% /var
and the parse rules:
-
{
{ "/var$" {vcapacity 4 %d}}
{ "^/" {capacity 4 %d} }
}
The first entry (/var$) will match the line, and the second entry (^/)
will not be used. This is useful for excluding or special casing the
attributes for certain lines. For example you can have a lower
capacity threshold (say 80%) for the /var filesystem, while still
having a 90% capacity threshold for all other filesystems. By placing
the parsing rule for /var first, and changing the token names, you
create a warning that would be generated by a var capacity of 85%.
NOTE: THIS IS A CHANGE FROM PREVIOUS VERSIONS OF TKWATCHER.
If you were counting on fall-through to apply multiple parse rules to
a single input line, it won't work anymore.
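The first-match rule can be sketched as follows. The helper
first_matching_entry is hypothetical and handles only regular expression
selectors (not all, none or =<number>); it illustrates the ordering
behaviour rather than tkwatcher's actual implementation.

```tcl
# Return the first parse entry whose regular expression selector matches
# the line, or an empty list if none does.
proc first_matching_entry {line parse_table} {
    foreach entry $parse_table {
        set selector [lindex $entry 0]
        if {[regexp -- $selector $line]} {
            return $entry
        }
    }
    return {}
}

set parse_table {
    {"/var$" {vcapacity 4 %d}}
    {"^/" {capacity 4 %d}}
}
set line "/dev/dsk/c0t3d0s4 96455 42328 44487 49% /var"
set entry [first_matching_entry $line $parse_table]
set token [lindex [lindex $entry 1] 0]   ;# vcapacity
```

Swapping the two entries would make the "^/" rule win for every
filesystem line, and vcapacity would never be set.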
action table
The fourth list of lists (action table) is where the tests and other
actions are carried out. Each action line has the form:
name action action_args <action action_args>...
The name is either the name of a token specified in the parse table, a
global variable name surrounded by "@" signs, or in the case of tcl
and calc types is simply a name for the calculation or tcl code.
As indicated above, multiple actions can be placed in an action line.
Output is generated only when all of the actions would generate
output. As soon as one test would not produce output, no other
actions or tests are performed. This results in the actions being
ANDed. Different action lines provide an OR operation. This or
operation is not short-circuited.
Unlike the rules in the parse table, ALL applicable action lines are
applied, not just the first. This produces as much information from
the given command output line as is possible, to make problem
determination easier. There are mechanisms (last, orgroup,
iftrue, iffalse) that can be used to limit the actions that are
applied to reduce the amount of output.
When using actions, remember that they are anded together and
short-circuited. Thus, actions that make decisions should be placed
after actions that do not make decisions. In the list below, the first
line of each entry summarizes the important features of that action.
If decision is yes, then the failure of that action stops evaluation
of further actions on that line. If warning is yes, the action can
generate a warning. If flow is yes, it affects flow of control
through the action table.
Note that tcl and calc actions must occur first on action lines.
Otherwise, tkwatcher will abort when attempting to process them.
actions
The actions that are defined and their arguments are:
- calc <expression>
-
- decision: yes, warning: no, flow: no
-
The calc action takes an expression that can use values from the
parsed string. To insert the value from the parsed string put @ signs
around the name specified in the parse list. This works on numbers
only. Division by zero is caught and silently ignored. Set debug
level e:2 to see warnings. NOTE: tcl does integer division unless one
of the operands is a real number, so it is not uncommon to see
(@variable@*1.0) in calculations just to get the conversion to real
division.
If you need to use the results of one calculation in another
calculation, use the "global" action to save the result of your
calculations. All non-global variables/token_names are destroyed when
a new input line is read.
If a calculation fails because not all of the variables have been
replaced, or there is a syntax error, no other commands in the action
line are run.
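The substitution and the integer division pitfall can be sketched in
plain tcl. The helper calc_sketch and the sample values are
hypothetical; tkwatcher's own substitution code may differ.

```tcl
# Replace each @name@ with its value, then evaluate the expression.
proc calc_sketch {expression values} {
    set map {}
    foreach {name value} $values {
        lappend map "@$name@" $value
    }
    return [expr [string map $map $expression]]
}

set tokens {retransmits 5 calls_rpc_server 200}
set percent [calc_sketch {100.0*@retransmits@/@calls_rpc_server@} $tokens]

# Integer operands give integer division; multiplying by 1.0 forces a
# real result.
set int_ratio  [calc_sketch {@retransmits@/@calls_rpc_server@} $tokens]
set real_ratio [calc_sketch {(@retransmits@*1.0)/@calls_rpc_server@} $tokens]
```

Here percent is 2.5, int_ratio is 0 and real_ratio is 0.025.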
- tcl <proc name> {<arguments>}
-
- decision: maybe, warning: maybe, flow: maybe
-
NOTE: the tcl action is not yet fully implemented. A better API needs
to be developed. The description below is correct however.
The tcl action provides a general mechanism for manipulating data
including string, boolean and file operations. Arguments are specified
in the same way as the calc action with @ signs around any
variables. Multiple arguments should be enclosed in a single set of
braces. Unlike calc, non-numeric variables are valid. The value
returned from the tcl proc is left as the value against which other
tests are executed. A descriptive error message is saved using
"uplevel 1 {set result <error message>}". If the tcl procedure should
prevent evaluation of further tests, the variable stop_eval should be
set using "uplevel 1 {set stop_eval 1}".
Knowledge of the internals of tkwatcher is required to use this in any
non-trivial way since there is no interface that defines historical
variables and other values gleaned by tkwatcher. This should not be
used by the faint of heart. I have never really had a need to use it,
but somebody convinced me that adding it was a good idea. It could be
used to create composite values dependent on advanced parsing of
multiple lines of input. See the sample Watcherfile for a simple tcl
procedure example.
If the procedure fails because not all of the variables have been
replaced, or there is a syntax error, no other commands in the action
line are run.
- global <global_name>
-
- decision: no, warning: no, flow: no
-
The global action promotes a variable parsed from a line to a variable
of the name <global_name> that persists for the entire command. For
example a command produces:
-
18962792 total packets received
0 bad header checksums
0 with size smaller than minimum
0 with data size < data length
0 with header length < data size
0 with data length < header length
355284 fragments received
-
and you want to calculate some percentage of the total packets
received. Remember that the parse variables are erased on every new
input line, so some mechanism must be used to save the value of total
packets received. The following control file entry accomplishes this
task:
-
{ ... {
{ =1 {tot_pkts_ip 0 %d}}
{ =2 {bad_header_sums 0 %d}}
}
{
{tot_pkts_ip global total_packets_ip}
{bad_header_sums_% calc 100.0*@bad_header_sums@/@total_packets_ip@
change}
}
}
If the global keyword was not available this calculation could not be
performed since the value of tot_pkts_ip would not be defined when the
second line is scanned. (You want to make your calc expression have at
least one non-global variable. If you don't the calculation will be
evaluated for every line in the input after the lines that defined the
global variables.)
You should be careful in choosing names for global variables. If a
line local (parse) variable exists with the same name as the global
variable, its value will be used instead of the global variable. If
you need to do calculations between values gathered from different
commands, try running the commands together as in this example:
-
set watch(4) { {sh -c "nfsstat -c; nfsstat -s"}
"nfs client statistics"
{
...
}
{
...
}
}
- severity level
-
- decision: no, warning: no, flow: no
-
The severity action associates an importance level with an action
line. The defined severities are taken from the syslog severities. In
descending order of importance they are: "emerg alert crit err warning
notice". The default level for all actions/tests is notice. The
default level does not produce any special actions on output. Setting
the severity level causes a string like "SEVERITY **** emerg" to be
prepended to each report section in the reports, or causes a change in
the level at which the entry is sent to syslog, or some other
noticeable change for other reporting methods.
- disable
-
- decision: yes, warning: no, flow: no
-
If present this line will never report. Useful for temporarily
disabling a test that is generating noise on multiple hosts.
- last
-
- decision: no, warning: no, flow: yes
-
If the action line that this is a part of would produce a warning,
then this is the last action line that is used. The next command input
line is read and the parse table is applied to it.
- orgroup name
-
- decision: no, warning: no, flow: yes
-
This adds the action line to a group of or'ed commands that are
short-circuited. The group is named "name". Once one of the commands
in the name group produces output, evaluation of other action lines in
the same group is suppressed. Note that action lines are evaluated in
order of entry, so the most specific/highest criticality lines should
be entered first in an orgroup.
- required
-
- decision: no, warning: yes, flow: no
-
The required action is used to specify that a variable must be defined
during the run of the command. If all of the lines in the command have
been read and processed, and the name for the action line has not been
defined, a warning is issued. This can be used to verify that daemon
programs such as sendmail, inetd etc are running. Note that flow
commands may cause required to behave strangely, so put all of the
required commands at the top of the action table. Also required really
should not be combined with any actions except for: cycle, suppress,
last, severity, and orgroup.
- iftrue
-
- decision: yes, warning: no, flow: no
-
If the current value (as set from a calc, or tcl, or from the parse)
is not zero, continue with the other actions.
- iffalse
-
- decision: yes, warning: no, flow: no
-
If the current value (as set from a calc, or tcl, or from the parse)
is zero, continue with the other actions.
- range low high
-
- decision: yes, warning: yes, flow: no
-
The range test action is applied to numbers, the value of the name
must fall between the low and high numbers, if not a warning is
printed.
- value n
-
- novalue n
-
- decision: yes, warning: yes, flow: no
-
The value test action can be used for numbers or strings. If the value
of the "name" is not "n" a warning is printed. For novalue, if the
value of the "name" is "n" a warning is printed. To embed spaces in
the value "n" enclose the string in {}'s or "'s.
- enum a b c ...
-
- noenum a b c ...
-
- decision: yes, warning: yes, flow: no, Must be last
-
the enum test action is the generalization of the value test
action. Indeed the value test action is present only for historic
compatibility. An enum test action with a single argument is
equivalent. The value of "name" must be one of a or b or c. This can
be used for numbers or strings. For the noenum form, the value of
"name" must not be a, b or c. To embed spaces in the values a, b
etc. enclose the value in {}'s or "'s. This test MUST be the last on
the line, because there is no way to determine the end of the list of
values a, b , c ... all further values (including any actions) are
made part of the enum/noenum test.
- rdelta low high
-
- decision: yes, warning: yes, flow: no
-
the rdelta test action calculates the absolute change
(delta), and makes sure that the delta falls in the range between low
and high. If not, a warning is printed. This works on numerical
values. Low and high can be positive or negative values. The previous
value is always subtracted from the current value for the
determination of the difference. This results in positive differences
for increases in the value from the last run and negative differences
for decreases in the value from the last run.
- delta allowed_change
-
- decision: yes, warning: yes, flow: no
-
if the absolute change in the value of "name" is greater than the
allowed change, a warning is printed. This works only on numerical
values. The allowed change should be a positive number.
- pdelta allowed_percent_change[%]
-
- decision: yes, warning: yes, flow: no
-
the pdelta test action calculates the percentage change in the value
of "name" since the last time tkwatcher was run. The percent sign
after the number is optional. If the percentage change in the variable
is greater than the allowed change shown, a warning is printed. This
works only on numerical values. The allowed percent change should be a
positive number.
The delta and pdelta tests also support "_up" and "_down" forms.
These forms are used to create rising and falling style thresholds.
While delta_up and delta_down tests can be done using rdelta tests,
they are common enough that the _up and _down forms were created. For
example:
-
{disk_capacity delta_up 10}
{disk_capacity delta 10}
{disk_capacity delta_down 10}
will report if the disk capacity (amount of disk space in use) has
gone up by 10 since the previous run (a useful thing to know); has
changed by 10 since the previous run regardless of whether it has
gone up or down; or has gone down by 10 since the previous run (an
unlikely but happy occurrence).
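The three forms can be compared side by side with a small sketch. The
helper delta_warns is hypothetical, and it assumes that all three forms
use a strict "greater than" comparison, mirroring the description of
delta; the boundary behaviour of the _up and _down forms is not stated
in this page.

```tcl
# Return 1 when a warning would be raised, 0 otherwise.
# Assumption: strict "greater than" for every form.
proc delta_warns {form previous current allowed} {
    set diff [expr {$current - $previous}]
    switch -- $form {
        delta      { return [expr {abs($diff) > $allowed}] }
        delta_up   { return [expr {$diff > $allowed}] }
        delta_down { return [expr {-$diff > $allowed}] }
        default    { error "unknown form: $form" }
    }
}

delta_warns delta_up   70 85 10   ;# 1: rose by 15
delta_warns delta_down 70 85 10   ;# 0: did not fall
delta_warns delta      85 70 10   ;# 1: changed by 15, direction ignored
```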
- change
-
- decision: yes, warning: yes, flow: no
-
if there was any change in the value of "name" from the previous run
it will print a warning. This works for both strings and numbers.
- nochange
-
- decision: yes, warning: yes, flow: no
-
if there was not a change in the value of "name" from the previous run
it will print a warning. This works for both strings and numbers.
- suppress count
-
- decision: yes, warning: no, flow: no
-
The suppress keyword suppresses the action line for count runs
following a warning being printed by the action line. This is useful
for reducing the amount of noise from certain monitoring operations.
- cycle hits period
-
- decision: no, warning: yes and suppresses, flow: no
-
The cycle keyword suppresses a warning message if there have not been
"hits" hits in "period" runs including the current run. An example may
help here. We have a warning that a filesystem is too full. This is
the first warning. The cycle statement is "cycle 2 4". The first time
through, this action line will suppress the warning. The next time the
action statement is run, the filesystem is back to normal. The third
time the action is run, the filesystem is too full. We now have two
positive hits in the previous 4 (or fewer) runs, and the cycle
statement permits the warning. The next time the action runs, the
filesystem is normal and no warning is produced. The fifth time the
action is run, the filesystem has filled up, and again we have at
least 2 warnings in the previous 4 runs, so a warning is produced.
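The walk-through above can be replayed with a short sketch. The helper
cycle_permits and its history representation are hypothetical; they
model only the counting rule described here, not tkwatcher's history
file.

```tcl
# history is a list of 0/1 flags, oldest first, where 1 means the test
# failed on that run; the last element is the current run.
proc cycle_permits {history hits period} {
    set window [lrange $history end-[expr {$period - 1}] end]
    set count 0
    foreach flag $window {
        incr count $flag
    }
    # Only a failing current run can produce a warning at all.
    return [expr {[lindex $history end] && $count >= $hits}]
}

cycle_permits {1}         2 4   ;# 0: one hit so far
cycle_permits {1 0 1}     2 4   ;# 1: two hits within the last four runs
cycle_permits {1 0 1 0}   2 4   ;# 0: current run is normal
cycle_permits {1 0 1 0 1} 2 4   ;# 1: hits on runs three and five
```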
OPTIONS
The options for tkwatcher are described below. Note that the options
can not be grouped together. Each option must stand alone on the
command line. Thus a command line parameter of "-ASv" is invalid.
- -a allow_string
-
If the index of the watch array is matched by the allow string with an
optional period and number, or just a number appended, then the
corresponding watch element is run. If something is explicitly
allowed, it can not be denied. If multiple "-a allow_strings" are
specified, each allow string is compared, and if a match occurs, the
watch array element is run. This makes it easy to run special tests
e.g. "-a interface_speed" in addition to the default tests (which
have numeric indices). Multiple tests can be defined by using index
names of the form: "disk.1", "disk.2", "disk.3", ... "disk.30" which
will all be matched by an allow string of disk. Also, if the allow
string is matched with an ! mark in front of it, that item is not
run. This makes it easy to exclude a test for a given host in the case
of a known problem of indeterminate or very long duration. Note that
allowing a single integer, e.g. "-a 1", will choose every line that
starts with a 1. Changing it to "-a '^1$'" will do what was intended.
- -A
-
Disables numeric watch array indices unless explicitly allowed. So
"-A -a interface" will run only tests whose indices match interface.
- -d deny_string
-
If the deny string is matched by the index of the watch array, then
the corresponding watch element is not run. If multiple "-d
deny_string" are specified, each deny_string is compared to the
index. If any deny string matches, the watch element is not run. An
element can not be denied if it is explicitly allowed with the -a
flag. As with the allow flag, an optional period and numbers can
follow the deny string and will be matched.
- -D [system:]level
-
Set the debug level and the systems to which it applies. The debug
level is an integer; the higher the number, the more verbose the
output. Anything above level 2 is not very useful for those who don't
actively debug tkwatcher. The systems are listed in the source code. A
sample would be: Cr:4. This will generate level 4 debugging output for
the "C" cycle and "r" reporting systems. The "system:" portion of the
debugging spec is optional, so just the number "4" is also a valid
argument for the "-D" flag and will produce debugging output for all
systems. Useful options for debugging control files are "-D t:1", or
"-D t:2". This will produce a list of commands, or commands and
command output lines.
- -f controlfile
-
Set the name of the control file. The default control file is
"./Watcherfile". If it doesn't exist, the file ~/.Watcherfile is
used.
- -h historyfile
-
Set the location of the history file in which tkwatcher stores the
values of the previous run. By default, the name of the history file is
the name of the control file with ".hist" appended.
- -H hostname
-
hostname should be a fully qualified host name. This is just a
shortcut way of specifying the allow options for the hostname and its
nodename. If the hostname is foo.bar.edu,
-H foo.bar.edu
acts like "-a foo.bar.edu" and "-a foo".
- -M min[.sec]
-
Allows you to run tkwatcher as a daemon. The min parameter is the
number of minutes tkwatcher should sleep between runs over the watch
array. The minimum value is 1. Since the subprocesses run by tkwatcher
can take a lot of resources, the sec parameter is provided to allow
tkwatcher to sleep for sec seconds between invocation of the commands
in the watch array. It can be set to 0 to provide no delay. During
daemon startup, no attempt is made to fork and close descriptors;
redirection of file descriptors and backgrounding of the job must be
done in the invoking shell.
The control file is only read once. A new tkwatcher must be started if
you wish to change the controlfile and have the new control file used.
- -m email_address
-
The default email address for notification is the user running
tkwatcher. An alternate address can be specified using this flag.
- -o [outputfile]
-
The default output mechanism is to send email. Using this flag email
is turned off, and the output can be redirected to a file. If the file
argument is missing then the output is sent to standard output.
- -R report_method_list
-
Enables alternate reporting mechanisms. Email, or file output is still
enabled even if one of these methods is used. The first three
reporting mechanisms are useful in conjunction with the -o flag in
that a real-time warning can be sent, and the output can be perused by
logging into the affected machine and reading the output file.
The message is passed on the command line to the external program. The
alternate reporting mechanisms are:
-
- syslog
-
uses an external program such as /usr/ucb/logger to report a one
line summary of the warning. Severity actions are translated
directly into syslog severity levels.
- pager
-
uses an external program to generate a page.
- x
-
passes output to an external program for real-time display. While
it is called X, and intended for use with a popup X windows based
messaging system, any real time messaging system: zephyr, inform, or
other program could be used as well. The program must accept the
message to be transmitted on the command line. A simple shell wrapper
can be used to do this manipulation if required.
When used with a real X popup program, this report mode can work well
with the daemon mode above since DISPLAY and authentication info is
readily available.
- machine
-
when used with -o, this modifies the output report to generate a line
of "^" delimited data for each error. These lines are more easily
parsed by other reporting programs. The format of the lines is a
number of fields separated by ^ characters. The field elements are:
severity, description, key information, type of test (range, enum
etc), the trigger, the name of the variable, the current value of the
variable, the previous value of the variable, the header line for the
command, the line from the command that was analyzed. If any of the
values are not available, two ^^ are placed adjacent in the output.
If the description, header or line fields contain "^" characters, they
are replaced by "-" characters. Other lines produced have only one
field and are either warning/error lines with three stars "***" in
them, or the lines are empty, or they are the end sentinel line.
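A program consuming the machine format could split each record on "^"
as sketched below. The field names follow the order given above, but
the names themselves and the sample record are invented for
illustration; real output may differ in detail.

```tcl
# Field names in the order listed above; the sample record is made up.
set names {severity description key test trigger name current previous header line}
set record "crit^disk space^/var^range^0 90^capacity^95^^Filesystem kbytes used avail capacity Mounted on^/dev/dsk/c0t3d0s4 96455 42328 44487 95% /var"

# split keeps empty fields, so a missing value (two adjacent ^
# characters) shows up as an empty string.
foreach name $names value [split $record ^] {
    set field($name) $value
}
```

After this runs, field(previous) is empty because the sample record has
no previous value.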
- -S
-
causes empty mail messages to be suppressed. This stops the mail
message with a subject line of "no watcher problem" from being sent. I
would suggest that at least one run of tkwatcher per day not use
this flag. Not using "-S" tests the mail system on the hosts.
- -v
-
This is equivalent to "-D t:1".
SECURITY CONSIDERATIONS
Since the controlfile is tcl code, and is loaded into a running
tkwatcher, any operation may be done from it. You should never use a
controlfile that you do not trust.
See the bugs and security bugs section for a few more concerns.
CUSTOMIZATION and INSTALLATION
Change the #! line at the top of the script to point to your tcl based
shell. I like "expect -d" myself when debugging, but use plain tclsh
in production.
There are two arrays that can be customized. The command associative
array is used to set the commands for the pager, x, syslog and mail.
Check this and make substitutions for your programs and paths where
necessary.
The other array is the report array that can be used to customize the
reporting strings for the output mechanisms. Knowledge of the
internals of tkwatcher's reporting code is needed to do this
successfully.
EXAMPLES
Here is a rather interesting (and partly contrived) example (the
numbers in parentheses down the left hand side are reference numbers
and are not part of the array definition):
-
set watch(4,nfs,network) { "nfsstat -c" "nfs client stats"
{
{=3 {header 0-end %H}}
{=4 {calls_rpc_server 0 %d} {badcalls_rpc_server 1 %d}
{retransmits 2 %d} {badxid 3 %d} {timeout 4 %d}
{timers_expire 7 %d}
}
{=10 {getattr 2 %d} {readlink% 12 %d} }
}
{
(1) {calls_rpc_server global glob_c_r_s}
{badcalls_rpc_server delta_up 5}
{retrans_% calc 100.0*@retransmits@/@calls_rpc_server@
range 0 5 delta_up 0}
{retrans_badxid calc
100*(@retransmits@-@badxid@)/@retransmits@
range 50 100 delta_down 0 }
(2) {getattr% calc @getattr@*100.0/@glob_c_r_s@
global g_getattr%
range 0 5 }
(3) {@g_getattr%@ calc @g_getattr%@>@readlink%@ value 1}
(4) {@g_getattr%@ orgroup b delta_up 10 severity crit}
(5) {timeout_badxid calc @timeout@>@badxid@ iftrue
calc (@timeout@*1.0)/@badxid@ range 2 10000 }
(6) {timeout_badxid_1 calc @timeout@<@badxid@ iftrue
calc (@timeout@*1.0)/@badxid@ range 0 0.5
delta_up 0 }
}
}
Note the use of iftrue in action lines 5 and 6. We don't know in
advance whether the number of timeouts or the number of badxids will
be larger. However, to set up the ranges and the delta_up or
delta_down tests, we need to know this. So each line starts with a
small calc that determines which value is larger, and the iftrue test
stops the evaluation of the line if it holds the calculation for the
opposite condition.
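The guard logic of rules 5 and 6 can be modelled outside tkwatcher. The Python sketch below is an illustration only: it assumes that a range test warns when the value falls outside inclusive bounds, it leaves out the delta_up test of rule 6 because that needs the history file, and it skips the division when badxid is zero. The function name is invented.

```python
def timeout_badxid_warnings(timeout, badxid):
    """Model rules 5 and 6: pick the range test that matches the data."""
    warnings = []
    if badxid == 0:
        # Assumption: nothing sensible to report without a divisor.
        return warnings

    # Rule 5: the guard "timeout > badxid" plays the role of iftrue.
    if timeout > badxid:
        ratio = (timeout * 1.0) / badxid
        if not (2 <= ratio <= 10000):
            warnings.append("timeout_badxid out of range 2..10000: %g" % ratio)

    # Rule 6: the opposite guard, with its own range.
    if timeout < badxid:
        ratio = (timeout * 1.0) / badxid
        if not (0 <= ratio <= 0.5):
            warnings.append("timeout_badxid_1 out of range 0..0.5: %g" % ratio)

    return warnings
```

At most one of the two guards can be true for a given pair of values, so each ratio is only ever checked against the range written for it.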
Also we see a use of the global action in rules 1 and 2. In rule 2
the result of the getattr% (getattr calls as a percentage of
total calls) calculation is saved. Then this saved value is used in
rules 3 and 4.
Rules 3 and 4 will not be executed until after rule 2 has been executed
since the variable g_getattr% is created by the global statement in rule 2.
In rule 3 we use the g_getattr% global variable in another (boolean)
calculation to make sure that the percentage of readlink calls is
always less than the percentage of getattr calls. Note that rule 3 is
only executed when the readlink% variable is defined, so it will only
be executed when line 10 is read.
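The arithmetic behind rules 2 and 3 is small enough to work through by hand. The Python sketch below mirrors the two calc expressions. It assumes that "value 1" passes only when the boolean calculation yields 1; the function names and the numbers in the comments are made up for illustration.

```python
def getattr_percent(getattr_calls, total_calls):
    """Rule 2: getattr calls as a percentage of all RPC calls."""
    return getattr_calls * 100.0 / total_calls


def rule3_passes(g_getattr_pct, readlink_pct):
    """Rule 3: the comparison yields 1 or 0, and the test wants 1."""
    return int(g_getattr_pct > readlink_pct) == 1


# With 250 getattr calls out of 10000 total calls the percentage is 2.5,
# which also sits inside the "range 0 5" test of rule 2.
pct = getattr_percent(250, 10000)
```

With that value, a readlink percentage of 1 passes rule 3 and a readlink percentage of 3 does not.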
Rule 4 adds one more test to the calculated global value
(g_getattr%) that reports a critical problem if it has increased by
more than 10. Rule 4 is executed for every line after the 10th
input line because g_getattr% is defined. So for lines 11,
12... rule 4 is executed. If the rule produces output on line
10, the orgroup will prevent output on lines 11...end. It will also
prevent execution of the delta_up action, but only if output was
produced. If output was not produced, all of rule 4 is executed for
every additional line in the output of "nfsstat -c". While this is a
performance issue, it is inconsequential from a reporting point of
view.
FILES
- ./Watcherfile, ~/.Watcherfile
-
The default tkwatcher control files.
- ./Watcherfile.hist, ~/.Watcherfile.hist
-
The default tkwatcher history files. The name of the history file is
derived from the name of the control file that is used.
DIAGNOSTICS
The control file is only sourced, so any tcl syntax errors will result
in diagnostics from tcl.
The calculations are done using tcl as well, so tcl diagnostics
about improper expression syntax and errors are generated.
Warnings about command failures are printed.
BUGS
The original cut of this was 8K and encompassed all of the functions
of the original Watcher. It took me 5 hours to produce. I am now
nearing the 100 hour mark, and it is nearing the 70K mark with
comments and debug statements. It is only 48K without them. However
it is still too large for my tastes. Then again performance isn't
shabby: it still takes more wall clock time for my processes to run
than it does for tkwatcher to analyze them, and the memory usage isn't
bad either.
A mistyped command line can erase your watcherfile. For example,
"tkwatcher -o Watcherfile" will result in an empty watcherfile. The
correct syntax is: "tkwatcher -o -f Watcherfile". Arguably this is a
misfeature of the command line syntax.
The delta_up/down and pdelta_up/down may not work right if the values
being checked are negative.
This program has been tested through use, not formal testing, so
combining actions in certain interesting ways may have unexpected
results.
There are enough ways of reducing chatter that non-judicious use may
result in loss of useful info. Then again who wants to wade through
100 mail messages about the same problem.
There is no formal parsing or syntax checking of the control file in
the program since it would slow it down too much. Even reading in
syntax checking code can be painful. There should be a separate syntax
checking/pretty printing program for this. Since the control file is
just a tcl script, "tclsh watcherfile" can be used as a syntax check
for the code. It doesn't perform any semantic checks however.
The sense of the testing actions is not always consistent, and it
helps to have a warped mind when attempting to create certain types of
tests.
It should be possible to customize the report_format array without
having to know the internals of tkwatcher.
Keywords should be able to contain an equals sign. The current
practice of changing it to a - is not satisfactory.
The command line parser should allow grouping of options. It should at
the least warn about invalid groupings.
Reporting mechanisms other than email or standard output (e.g. paging,
real time reporting using x or inform, and syslog) haven't been tested
as well as I would like.
If the value of a variable contains the @ character, the tcl and calc
actions will be fooled into thinking that variable substitution
failed. As a result the calc and tcl actions won't be executed.
Use sed or tr as part of the input command to substitute another
character for the @ character.
SECURITY BUGS
When calling an alternate report mode, the report should be passed to
the stdin of the report command rather than passing it on the command
line. This will prevent the unintended execution of code which is a
security risk. Exactly how a cracker could make use of this bug is
unclear, but I have been at this too long to ignore the risk. I am
sure that bugtraq will have an exploit for this if tkwatcher is in
wide use.
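To make the difference concrete, the Python sketch below passes a report to a child process on its standard input, so the report text never appears on a command line and is never seen by a shell. It is not how tkwatcher invokes its report commands. The helper name is invented, and the stand-in child (the Python interpreter echoing its stdin) only takes the place of a real pager or popup program.

```python
import subprocess
import sys


def send_report(command, report):
    """Run command (an argv list, no shell) and feed report on stdin."""
    return subprocess.run(
        command,
        input=report,          # delivered on the child's stdin
        text=True,
        capture_output=True,
        check=True,            # raise if the report command fails
    )


# Stand-in report command: copies stdin to stdout unchanged.
echo_command = [sys.executable, "-c",
                "import sys; sys.stdout.write(sys.stdin.read())"]

# Shell metacharacters in the report are just data on this path.
result = send_report(echo_command, "disk full; `id` $(date)\n")
```

The report command then has to read its message from stdin, which is the change this section asks for.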
tkwatcher should really test the controlfile to make sure it is a real
file and not a link. Also it should refuse to write/read any
historyfile that has world write permissions. Tcl 7 doesn't support any
way of testing for these on an open file handle, and any other way of
testing produces a race condition. Better that it should not give
people a false sense of security.
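For comparison, this is roughly what such a check looks like in a language that can inspect an already open file handle, which closes the gap between the test and the use. The Python sketch below is an illustration only and assumes a POSIX system that provides O_NOFOLLOW. The function name is made up, and nothing like it exists in tkwatcher.

```python
import os
import stat


def open_checked(path):
    """Open path for reading, refusing links and world writable files."""
    # O_NOFOLLOW makes the open itself fail if path is a symbolic link.
    fd = os.open(path, os.O_RDONLY | os.O_NOFOLLOW)
    try:
        # fstat inspects the handle we hold, not whatever the name
        # points at a moment later, so there is no window for a swap.
        info = os.fstat(fd)
        if not stat.S_ISREG(info.st_mode):
            raise ValueError("%s is not a regular file" % path)
        if info.st_mode & stat.S_IWOTH:
            raise PermissionError("%s is world writable" % path)
    except Exception:
        os.close(fd)
        raise
    return os.fdopen(fd, "r")
```

A caller would use the returned file object for all further reads rather than reopening the file by name.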
TODO
A formal test suite needs to be written. This code has been tested
through more than a year of use, but it really needs formal testing.
Implement HUP handling to force reread of the control file when in
daemon mode.
AUTHOR
John Rouillard (rouiljNO@SPAMieee.org) with thanks to Mark
Lamourine <mlamouriNO@SPAMBBN.COM> released July 2000.
SEE ALSO
wish(1), tclsh(1), expect(1), scotty(1), Tcl(n), regexp(n) (tcl
regexp), expr(n) (tcl expression operator), scan(n) (tcl scan operator
for %s, %d, %f specifiers), sscanf(3), Keeping Watch over the Flocks
by Night (and day) by Kenneth Ingham, Summer 1987 Usenix proceedings.
Index
- NAME
- SYNOPSIS
- DESCRIPTION
- Control File
- parse table
- column specifier
- format specifier
- action table
- actions
- OPTIONS
- SECURITY CONSIDERATIONS
- CUSTOMIZATION and INSTALLATION
- EXAMPLES
- FILES
- DIAGNOSTICS
- BUGS
- SECURITY BUGS
- TODO
- AUTHOR
- SEE ALSO