To understand this document

You need a working knowledge of the basic configuration files that configure your instances, and some scripting skills. Some troubleshooting experience helps a lot.

See hostlint's HTML document, or the master source package HTML document, for an overview of these structures.

Netlint creates a report that reveals many facts about the local instance. The reports from many instances (hosts, or other types) are aggregated on a reporting host, where they are compared for discrepancies. For example, 10 hosts on a given network may report the netmask of the network as a /24, while 2 hosts report a /26. In that case some person should arbitrate the conflict to ensure that all the network peers have the same (correct) mask.
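The netmask comparison above can be sketched in a few lines of Python. This is only an illustration, not netlint's actual aggregation code: the `(host, segment, prefix)` tuples and the function name are assumptions, and it presumes the reporting host already knows which hosts share a segment.

```python
from collections import Counter, defaultdict


def find_mask_conflicts(reports):
    """Flag segments whose hosts disagree about the prefix length.

    reports: iterable of (host, segment, prefix_len) tuples. The tuple
    layout is hypothetical, not netlint's real report format.
    """
    by_segment = defaultdict(dict)
    for host, segment, prefix in reports:
        by_segment[segment][host] = prefix

    conflicts = {}
    for segment, masks in by_segment.items():
        counts = Counter(masks.values())
        if len(counts) < 2:
            continue  # every host agrees, nothing to arbitrate
        majority, _ = counts.most_common(1)[0]
        conflicts[segment] = {
            "majority": majority,
            "dissenters": {h: p for h, p in masks.items() if p != majority},
        }
    return conflicts
```

The majority value is only a hint about where to look first. The larger group may be the one that is wrong, which is why a person still arbitrates.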

At first glance it may seem that there would be few of these to compare, and that it would be unlikely that many mistakes would be found. Experience has shown both of those opinions to be naive.

Close the loop with netlint

Netlint works a lot like hostlint in that it collects a list of items from the running system. The difference is that netlint collects raw facts: it doesn't apply any local filter to the data. Here is a list from the base set:
the duplex and speed of our network interfaces
Hosts whose configuration does not match their link partner's cause no end of trouble.
bonded network devices have a common configuration
Obvious.
/etc/hosts has a mapping for our hostname
If the hosts file doesn't match DNS then boot-time services may start with the wrong IP.
who we accept mail for (via sendmail)
who we expect to accept mail for us (from MX)
Mail is used for more than people. It is great for reports (like this one).
check /etc/netmasks vs ifconfig
This is often overlooked, but some programs absolutely depend on this file.
report NFS resources expected or provided
report NTP servers expected or not reachable and current stratum
check OVO server configuration
check for dead name servers
All of these have obvious impacts.
report on sshd configuration
look for missing SSHFP DNS records
The combination of these allows you to get rid of most of the pesky ssh host authentication prompts.
report on expected syslog peers
When syslogd is pointed at a host you don't have a report for, that might be bad.
report OS information from uname (or the like)
Cross-check for master source push/pull configuration.
report the last boot-time of the instance
Used as part of triage.
report the default time zone
How does 1 host get a different time zone on the same subnet?
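To make "raw facts" concrete, here is a rough Python sketch of one item from the list: finding what the hosts file maps our hostname to. It is not taken from netlint itself, and the function name is made up for illustration. Note that it only reports what it finds; it does not judge whether the answer is right.

```python
def hosts_addresses(hosts_text, hostname):
    """Return every address that hosts-file text maps to hostname.

    Hypothetical helper: it collects the raw mapping and leaves any
    comparison (against DNS, against peers) to the reporting host.
    """
    wanted = hostname.lower()
    found = []
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0]  # drop trailing comments
        fields = line.split()
        if len(fields) < 2:
            continue  # blank line, or an address with no names
        names = [name.lower() for name in fields[1:]]
        if wanted in names:
            found.append(fields[0])
    return found
```

On a live host the text would come from reading /etc/hosts and the name from `socket.gethostname()`. An empty list, or more than one address, is a fact worth reporting.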

How to use this application

A mortal application login's crontab runs netlint at least once a week on every host. E-mailed output from that task is processed on a central reporting host to collate and prioritize the messages. The admins review the feedback report every Monday to prevent minor errors from becoming bigger issues. (The jobs are staggered across a 4 hour window, so the reports do not all come in at the same time.)
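One way to stagger the jobs without keeping a schedule table is to derive each host's start time from its own name. The sketch below is an assumption about how that could be done, not a description of the real crontab: the start hour, the Sunday run day, and both function names are invented for the example.

```python
import zlib

WINDOW_MINUTES = 4 * 60  # the 4 hour window described above


def stagger_offset(hostname, window=WINDOW_MINUTES):
    """Map a hostname to a stable minute offset inside the window."""
    return zlib.crc32(hostname.encode("utf-8")) % window


def crontab_fields(hostname, start_hour=2, weekday=0):
    """Build the five crontab time fields for this host's weekly run.

    start_hour and weekday (0 is Sunday) are illustrative defaults.
    start_hour must be 20 or less so the window does not cross midnight.
    """
    offset = stagger_offset(hostname)
    hour = start_hour + offset // 60
    minute = offset % 60
    return "%d %d * * %d" % (minute, hour, weekday)
```

Because the offset depends only on the hostname, a rebuilt host lands in the same slot it had before, and no central coordination is needed.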

Missing e-mail reports are taken very seriously.
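Spotting a missing report is a set comparison between the hosts we expect to hear from and the hosts that actually sent mail. A minimal sketch, assuming the reporting host has some inventory list to compare against (where that list comes from is outside what this page describes):

```python
def audit_reports(expected_hosts, received_hosts):
    """Compare the inventory against the hosts that mailed a report.

    Returns (missing, unexpected) as sorted lists. Both inputs are
    plain iterables of hostnames; the inventory source is hypothetical.
    """
    expected = set(expected_hosts)
    received = set(received_hosts)
    missing = sorted(expected - received)
    unexpected = sorted(received - expected)
    return missing, unexpected
```

The second list is a bonus: a report from a host that is not in the inventory deserves a look too.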

When a new instance is created, after the process finishes the final reboot, it runs netlint to report the initial state of the host. This offers the admin a chance to check e-mail and read the (short) report to close the loop on any unexpected values.

Part of the triage list for a production issue is to run netlint if there is some reason to believe that the network configuration or basic system configuration has been corrupted. This is a quick check that can be compared to the last e-mail report to see what may have changed.
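The comparison against the last e-mail report can be as simple as a line diff. The sketch below uses Python's standard `difflib` and treats both reports as opaque text, so it makes no assumption about netlint's output format; the function name is made up.

```python
import difflib


def report_changes(last_report, current_report):
    """Return only the lines that were removed (-) or added (+)."""
    diff = list(difflib.unified_diff(
        last_report.splitlines(),
        current_report.splitlines(),
        fromfile="last-report",
        tofile="current-run",
        lineterm="",
        n=0,  # no context lines, just the changes
    ))
    # The first two lines are the ---/+++ file headers; skip them by
    # position, then keep the change lines and drop the @@ hunk markers.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]
```

An empty result means the quick check found nothing new, and triage can move on to the next item on the list.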

Summary

Netlint is not rocket science: it is a good way to do statistical feedback on a population of instances that all (should) share common features, or depend on peer services.

It should never be expanded into the opinion business; that's hostlint's job (see that page).


$Id: netlint.html,v 1.2 2012/07/11 17:30:09 ksb Exp $