-------------------------------------------------------------- This story was printed from ZDNet Australia. --------------------------------------------------------------
Network management and debugging September 16, 2001 URL: http://www.zdnet.com.au/reviews/software/productivity/soa/Network-management-and-debugging/0,139023447,120153891,00.htm
From Chapter 20 of UNIX Systems Administration Handbook, Third Edition
Because networks increase the number of interdependencies among machines, they tend to magnify problems. As the saying goes, "Networking is when you can't get any work done because of the failure of a machine you have never even heard of." Network management is the art and science of keeping a network healthy. It generally includes the following tasks:
As your network grows, management procedures should become more automated. On a network consisting of several different subnets joined with switches or routers, you may want to start automating management tasks with shell scripts and simple programs. If you have a WAN or a complex local network, you should consider installing a dedicated network management station with special software.
In some cases, your organisation's reliability needs will dictate the sophistication of your network management system. A problem with the network can bring all work to a standstill. If your site cannot tolerate downtime, it may well be worthwhile to obtain and install a high-end enterprise network management system.
Unfortunately, even the best network management system cannot prevent all failures. It is critical to have a well-documented network and a high-quality staff available to handle the inevitable collapses.
Troubleshooting a network
Several good tools are available for debugging a network at the TCP/IP layer. Most give low-level information, so you must understand the main ideas of TCP/IP and routing in order to use the debugging tools. On the other hand, network issues can also stem from problems with higher-level protocols such as DNS, NFS, and HTTP. You might want to read through Chapter 13, TCP/IP Networking, and Chapter 14, Routing, before tackling this chapter.
In this section, we start with some general troubleshooting strategy. We then cover several essential tools, including ping, traceroute, netstat, tcpdump, and snoop. We don't discuss the arp command in this chapter, though it, too, is a useful debugging tool; see page 286 for more information.
When your network is broken, chances are that you'll be in quite a rush to repair it. Stop right there! It's important to take a moment and consider how to approach the problem before jumping into action. The biggest mistake you can make is to introduce poorly planned changes into an already failing network.
Before you attack your network, consider these principles:
This last point deserves a bit more discussion. As described on page 265, the architecture of TCP/IP defines several layers of abstraction at which components of the network can function. For example, HTTP depends on TCP, TCP depends on IP, IP depends on the Ethernet protocol, and the Ethernet protocol depends on the integrity of the network cable. You can dramatically reduce the amount of time spent debugging a problem if you first figure out which layer is misbehaving. Ask yourself questions like these as you work up (or down) the stack:
Once you've identified where the problem lies, take a step back and consider the effect your subsequent tests and prospective fixes will have on other services and hosts.
Ping: Check to see if a host is alive
The ping command is embarrassingly simple, but in many situations it is all you need. It sends an ICMP ECHO_REQUEST packet to a target host and waits to see if the host answers back. Despite its simplicity, ping is one of the workhorses of network debugging.
You can use ping to check the status of individual hosts and to test segments of the network. Routing tables, physical networks, and gateways are all involved in processing a ping, so the network must be more or less working for ping to succeed. If ping doesn't work, you can be pretty sure that nothing more sophisticated will work either. However, this rule does not apply to networks that block ICMP echo requests with a firewall. Make sure that a firewall isn't interfering with your debugging before you conclude that the target host is ignoring a ping. You might consider disabling a meddlesome firewall for a short period of time to facilitate debugging.
Every vendor provides a ping. Most versions of ping run in an infinite loop unless a packet count argument is given. Under Solaris, ping -s provides the extended output that other versions use by default. Once you've had your fill of pinging, type the interrupt character (usually Control-C). Here's an example:
% ping beast
The output for beast shows the host's IP address, the ICMP sequence number of each response packet, and the round trip travel time. The most obvious thing that the output above tells you is that the server beast is alive and connected to the network.
On a healthy network, ping can allow you to determine if a host is down. Conversely, when a remote host is known to be up and in good working order, ping can give you useful information about the health of the network. ping packets are routed by the usual IP mechanisms, and a successful round trip means that all networks and gateways lying between the source and destination are working correctly, at least to a first approximation.
The ICMP sequence number is a particularly valuable piece of information. Discontinuities in the sequence indicate dropped packets. Despite the fact that IP does not guarantee the delivery of packets, a healthy network should drop very few of them. Lost-packet problems are important to track down because they tend to be masked by higher-level protocols. The network may appear to function correctly, but it will be much slower than it ought to be, not only because of the retransmitted packets but also because of the protocol overhead needed to detect and manage them.
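As a rough illustration of reading sequence numbers for gaps, the sketch below scans a saved ping transcript and lists the sequence numbers that never came back. It assumes reply lines contain a field of the form icmp_seq=N, which is common but not universal across ping versions; the sample transcript and the address in it are made up for the example.

```python
import re


def missing_sequence_numbers(ping_output):
    """Return ICMP sequence numbers that are absent from a ping transcript.

    Assumes each reply line carries a field like 'icmp_seq=N'. Only gaps
    between the first and last reply seen can be detected this way.
    """
    seen = {int(n) for n in re.findall(r"icmp_seq=(\d+)", ping_output)}
    if not seen:
        return []
    return [n for n in range(min(seen), max(seen) + 1) if n not in seen]


# Hypothetical transcript: replies 2 and 3 were lost.
sample = """\
64 bytes from 10.0.0.5: icmp_seq=0 ttl=255 time=0.9 ms
64 bytes from 10.0.0.5: icmp_seq=1 ttl=255 time=1.1 ms
64 bytes from 10.0.0.5: icmp_seq=4 ttl=255 time=1.0 ms
"""

print(missing_sequence_numbers(sample))
```

A check like this cannot see packets lost before the first reply or after the last one, so the summary statistics that ping prints on exit remain the more complete picture.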
To track down the cause of disappearing packets, first run traceroute (see the next section) to discover the route that packets are taking to the target host. Then ping the intermediate gateways in sequence to discover which link is dropping packets. To pin down the problem, you need to send a statistically significant number of packets. The network fault will generally lie on the link between the last gateway that you can ping without significant loss of packets and the gateway beyond it.
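The procedure above can be sketched in a few lines once you have loss counts for each gateway. This is only an illustration: the gateway names and packet counts are hypothetical, and the 5 percent cutoff for "significant" loss is an assumption chosen for the example, not a figure from the text.

```python
def find_lossy_link(hops, threshold=0.05):
    """Locate the first link showing significant packet loss.

    hops: list of (gateway, packets_sent, packets_received) tuples,
          ordered from the nearest gateway to the farthest.
    threshold: loss fraction treated as significant (assumed value).

    Returns (last_good_gateway, first_bad_gateway), or None if every
    gateway answered with acceptable loss. last_good_gateway is None
    when the very first hop is already lossy.
    """
    last_good = None
    for gateway, sent, received in hops:
        if sent <= 0:
            raise ValueError("packets_sent must be positive")
        loss = (sent - received) / sent
        if loss > threshold:
            return (last_good, gateway)
        last_good = gateway
    return None


# Hypothetical results of pinging each gateway 500 times.
results = [("gw1", 500, 499), ("gw2", 500, 498), ("gw3", 500, 430)]
print(find_lossy_link(results))
```

Sending several hundred packets per gateway, as in the sample data, is what makes the comparison between hops meaningful; a handful of packets would not separate a faulty link from ordinary noise.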
The round trip time reported by ping gives you insight into the overall performance of a path through a network. Moderate variations in round trip time do not usually indicate problems. Packets may occasionally be delayed by tens or hundreds of milliseconds for no apparent reason; that's just the way IP and UNIX work. You should expect to see a fairly consistent round trip time for the majority of packets, with occasional lapses. Many of today's routers implement rate-limited responses to ICMP packets, which means that a router may delay responding to your ping if it is already dealing with a lot of ICMP traffic.
The ping program allows you to send echo request packets of any size. By using a packet larger than the MTU of the network (1,500 bytes for Ethernet), you can force fragmentation to take place. This practice will help you to identify media errors or other low-level issues such as problems with a congested ATM network.
Under Solaris and HP-UX, you simply add the desired packet size to the end of the ping command:
% ping cuinfo.cornell.edu 1500
Red Hat Linux and FreeBSD require you to specify the desired size in bytes with the -s flag. Because excessively large packets can cause network problems, FreeBSD restricts the use of this option to root. (Note: The 1998 Ping of Death attack that could crash both UNIX and Windows systems was executed simply by transmission of an overly large ping packet. When the fragmented packet was reassembled, it filled the default memory buffer and crashed the machine.)
# ping -s 1500 cuinfo.cornell.edu
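To see why a 1,500-byte ping forces fragmentation on Ethernet, it helps to work through the arithmetic. The sketch below assumes IPv4 with a 20-byte header and no options, an 8-byte ICMP header, and that the size you give ping counts only the ICMP data bytes; check your own ping's manual page, since conventions differ.

```python
IP_HEADER = 20    # bytes; assumes IPv4 with no options
ICMP_HEADER = 8   # bytes


def fragments_needed(icmp_data_bytes, mtu=1500):
    """Estimate how many IP fragments one echo request will occupy."""
    ip_payload = icmp_data_bytes + ICMP_HEADER
    # Every fragment except the last must carry a multiple of 8 data bytes.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    # Integer ceiling division, with a floor of one packet.
    return max(1, -(-ip_payload // per_fragment))


for size in (56, 1472, 1473, 1500):
    print(size, fragments_needed(size))
```

Under these assumptions, 1,472 data bytes is the largest echo request that fits in a single Ethernet frame; anything larger is split into two or more fragments.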
Use the ping command with the following caveats in mind. First, it is hard to distinguish the failure of a network from the failure of a server with only the ping command. A failed ping just tells you that something is wrong.
Second, a ping does not guarantee much about the target machine's state. Echo request packets are handled within the IP protocol stack and do not require a server process to be running on the probed host. A response guarantees only that a machine is powered on and has not experienced a kernel panic. You'll need higher-level methods to verify the availability of individual services such as HTTP and DNS.
traceroute, written by Van Jacobson, lets you discover the sequence of gateways that an IP packet travels through to reach its destination. Almost all modern operating systems come with some version of traceroute. The syntax is simply
traceroute hostname
There are a variety of options, most of which are not important in daily use. As usual, the hostname can be specified either symbolically or numerically. The output is simply a list of hosts, starting with the first gateway and ending at the destination. For example, a traceroute from the host jaguar to the host drevil produces the following output:
% traceroute drevil
From this output we can tell that jaguar is exactly three hops away from drevil, and we can see which gateways are involved in the connection. The round trip time for each gateway is also shown: three samples for each hop are measured and displayed. A typical traceroute between Internet hosts can include ten or twenty hops.
traceroute works by setting the time-to-live (TTL, actually "hop count to live") field of an outbound packet to an artificially low number. As packets arrive at a gateway, their TTL is decreased. When a gateway decreases the TTL to 0, it discards the packet and sends an ICMP "time exceeded" message back to the originating host.
The first few traceroute packets have their TTL set to 1. The first gateway to see such a packet (xor-gw2 in this case) determines that the TTL has been exceeded and notifies jaguar of the dropped packet by sending back an ICMP message. The sender's IP address in the header of the error packet identifies the gateway; traceroute looks up this address in DNS to find the gateway's hostname.
To identify the second-hop gateway, a second round of packets with TTL fields set to 2 is sent out. The first gateway routes the packets and decreases their TTL by 1. At the second gateway, the packets are then dropped and ICMP error messages generated as before. This process continues until the TTL is equal to the number of hops to the destination host and the packets reach their destination successfully.
Most routers send their ICMP messages from the interface "closest" to your host. If you run traceroute backwards from the destination host, you will probably see different IP addresses being used to identify the same set of routers.
Since traceroute sends three packets for each value of the TTL field, you may sometimes observe an interesting artifact. If an intervening gateway multiplexes traffic across several routes, the packets might be returned by different hosts; in this case, traceroute simply prints them all.
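The TTL trick is easy to model without touching a real network. The sketch below is a pure simulation under simplifying assumptions: one probe per TTL value rather than three, every gateway answers, and the path is fixed. The names xor-gw2 and drevil come from the example above; example-gw is a hypothetical stand-in for the middle hop.

```python
def probe(path, ttl):
    """Send one simulated packet along path with the given starting TTL.

    path lists the gateways in order, with the destination last.
    Returns (responder, reached_destination).
    """
    if not path:
        raise ValueError("path must not be empty")
    for index, host in enumerate(path):
        if index == len(path) - 1:
            return host, True    # packet delivered to the destination
        ttl -= 1                 # each gateway decrements the TTL
        if ttl == 0:
            return host, False   # discarded; gateway reports "time exceeded"
    raise AssertionError("unreachable")


def simulate_traceroute(path, max_ttl=30):
    """Raise the TTL one step at a time until the destination answers."""
    hops = []
    for ttl in range(1, max_ttl + 1):
        responder, arrived = probe(path, ttl)
        hops.append(responder)
        if arrived:
            break
    return hops


route = ["xor-gw2", "example-gw", "drevil"]
print(simulate_traceroute(route))
```

Each pass through the loop corresponds to one line of traceroute output: the host that sent back the error (or, on the final pass, the destination itself).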
Let's look at a more interesting example from a host at colorado.edu to xor.com:
rupertsberg% traceroute xor.com
This output shows that packets must traverse five of our internal gateways before leaving the colorado.edu network (cs-gw3-faculty to cuatm-gw). The next-hop gateway on the BRAN network (204.131.62.6) doesn't have a name in DNS. After two hops in coop.net, we arrive at xor.com.
At hop 8, we see a star in place of one of the round trip times. This notation indicates that no error packet was received in response to the probe. In this case, the cause is probably congestion, but that is not the only possibility. traceroute relies on low-priority ICMP packets, which many routers are smart enough to drop in preference to "real" traffic. A few stars shouldn't send you into a panic.
If you see stars in all of the round trip time fields for a given gateway, no "time exceeded" messages are arriving from that machine. Perhaps the gateway is simply down. Sometimes, a gateway will be configured to silently discard packets with expired TTLs. In this case, you will still be able to see through the silent host to the gateways beyond. Another possibility is that the gateway's error packets are slow to return and that traceroute has stopped waiting for them by the time they arrive.
Some firewalls block ICMP "time exceeded" messages entirely. If there's one of these firewalls along the path, you won't get information about any of the gateways beyond it. However, you can still determine the total number of hops to the destination because the probe packets will eventually get all the way there. Also, some firewalls may block the outbound UDP datagrams that traceroute sends to trigger the ICMP responses. This problem causes traceroute to report no useful information at all.
A slow link does not necessarily indicate a malfunction. Some physical networks have a naturally high latency.
Sluggishness can also be a sign of congestion on the receiving network, especially if the network uses a CSMA/CD technology that makes repeated attempts to transmit a packet (Ethernet is one example). Inconsistent round trip times would support such a hypothesis, since collisions increase the randomness of the network's behavior.
Sometimes, you may see the notation !N instead of a star or round trip time. It indicates that the current gateway sent back a "network unreachable" error, meaning that it doesn't know how to route your packet. Other possibilities include !H for "host unreachable" and !P for "protocol unreachable." A gateway that gives you any of these error messages will usually be the last hop you can get to. That host usually has a routing problem (possibly caused by a broken link): either its static routes are wrong or dynamic protocols have failed to propagate a usable route to the destination.
If traceroute doesn't seem to be working for you (or is working incredibly slowly), it may be timing out while trying to resolve the hostnames of gateways by using DNS. If DNS is broken on the host you are tracing from, use traceroute -n to request numeric output. This option prevents the use of DNS; it may be the only way to get traceroute to function on a crippled network.
netstat provides a wealth of information about the state of your computer's networking software, including interface statistics, routing information, and connection tables. There isn't really a unifying theme to the different sets of output, except for the fact that they all relate to the network. Every system provides netstat, but since the command is kind of a "kitchen sink," different systems understand somewhat different options. Here, we discuss the four most common uses of netstat:
Monitoring the status of network connections
% netstat -a
The preceding example was run on the host nimi. It shows several inbound SSH connections, an outbound telnet connection, and a bunch of ports listening for other connections. Also of note are the lines showing the protocol as tcp46, which belong to services listening on IPv6 sockets that also accept IPv4 connections.
Addresses are shown as hostname.service, where the service is a port number. For well-known services, netstat shows the port symbolically, using the mapping defined in /etc/services. You can obtain numeric addresses with the -n option. Remember, if your DNS is broken, netstat will be painful to use without the -n flag.
Send-Q and Recv-Q show the sizes of the send and receive queues for the connection on the local host; the queue sizes on the other end of a TCP connection might be different. They should tend toward 0 and at least not be consistently nonzero. Of course, if you are running netstat over a network terminal, the send queue for your connection will probably never be 0.
The connection state has meaning only for TCP; UDP is a connectionless protocol. The most common states you'll see are ESTABLISHED for currently active connections, LISTEN for servers waiting for connections (not normally shown without -a), and TIME_WAIT for connections in the process of closing.
This display is primarily useful for debugging higher-level problems once you have determined that basic networking facilities are working correctly. It lets you verify that servers are set up correctly and facilitates the diagnosis of certain types of miscommunication, particularly with TCP. For example, a connection that stays in state SYN_SENT identifies a process that is trying to contact a nonexistent or inaccessible network server.
If netstat shows a lot of connections stuck waiting to complete the TCP handshake (state SYN_RCVD, shown as SYN_RECV on Linux), your host is probably unable to handle the number of connections being requested. This inadequacy may be due to kernel tuning limitations or even to malicious flooding.
Inspecting interface configuration information
This host has two network interfaces: one for regular traffic and a "backlan" connection called evolve-bl. Ipkts and Opkts report the number of packets that have been received and transmitted on each interface since the machine was booted. Ierrs and Oerrs show the number of input and output errors; many different types of errors are counted in these buckets, and it is normal for a few to show up.
Errors should be less than 1 percent of the associated packets. If your error rate is high, compare the rates of several neighboring machines. A large number of errors on a single machine suggests a problem with that machine's interface or connection. An error rate that is high everywhere most likely indicates a media problem.
Collisions indicate a loaded network; errors often indicate cabling problems. Although a collision is a type of error, it is counted separately by netstat. The Collis column gives the number of collisions that were experienced while packets were being sent. (Note: This field has meaning only on broadcast-based networks such as Ethernet.) Use this number to calculate the percentage of output packets (Opkts) that result in collisions. In the example above, the collision rate on interface hme0 is about 0.6 percent and the collision rate on interface hme1 is 1.3 percent. On a properly functioning network, collisions should be less than 5 percent of output packets, and anything over 15 percent indicates serious congestion problems.
netstat can also monitor a specific interface in real time, although the flags to request this behavior are different on each version of UNIX. The following commands give interface statistics at one-second intervals. The output that is shown is adapted from a FreeBSD system.
solaris% netstat -i 1
In this example, the collision rate is running at 20 to 30 percent. The network is probably very slow and possibly even unusable.
netstat's continuous mode is especially useful for tracking down the source of errors. netstat -i can alert you to the existence of problems, but it can't tell you whether the errors came from a continuous, low-level problem or from a brief but catastrophic event. Observing the network over time under a variety of load conditions will give you a much better impression of what's going on. Try running ping with a large ping packet size while you watch the output of netstat.
Examining the routing table
Destinations and gateways can be displayed either as hostnames or as IP addresses; the -n flag requests numeric output. The Flags characterize the route: U means up (active), G is a gateway, and H is a host route. The D flag (not shown) indicates a route resulting from an ICMP redirect. G and H together indicate a host route that passes through an intermediate gateway. The remaining fields give statistics on the route: the current number of TCP connections using the route, the number of packets sent, and the interface used. Remember that this output varies slightly among operating systems.
Use this form of netstat to check on the health of your machine's routing table. It's particularly important to verify that the system has a default route and that it is correct. On some systems, the default route is represented by an all-0 destination address (0.0.0.0); on others, the word "default" appears instead.
Viewing operational statistics for various network protocols
ip:
The absence of checksum errors indicates a clean hardware connection. It is important to check that packets are not getting dropped because of lack of memory (bufs in this example, but often referred to as "mbufs"). (Note: To get more details about memory usage by network services on Solaris and FreeBSD, try using the -m flag with netstat.)
icmp:
The number of echo requests, responses generated, and echo replies all match. Note that "destination unreachable" messages can still be generated even when all packets are apparently forwardable. Bad packets can eventually reach a gateway that rejects them, and error messages are then sent back along the gateway chain.
tcp:
It's a good idea to develop a feel for the normal ranges of these statistics so that you can recognise pathological states.
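One way to build that feel is to reduce the raw interface counters to the percentages discussed earlier in this section: errors under 1 percent of the associated packets, collisions under 5 percent of output packets, and anything over 15 percent a sign of serious congestion. The sketch below does that arithmetic; the function names and the counter values are hypothetical, and in practice you would copy the numbers from your own netstat -i output.

```python
def percent(part, whole):
    """Return part as a percentage of whole (0.0 if whole is zero)."""
    return 100.0 * part / whole if whole else 0.0


def collision_verdict(collision_pct):
    """Classify a collision rate using the thresholds quoted in the text."""
    if collision_pct > 15:
        return "serious congestion"
    if collision_pct >= 5:
        return "above the healthy range"
    return "healthy"


def interface_report(ipkts, ierrs, opkts, oerrs, collis):
    """Summarise one interface's counters as percentages."""
    collision_pct = percent(collis, opkts)
    return {
        "input_error_pct": percent(ierrs, ipkts),
        "output_error_pct": percent(oerrs, opkts),
        "collision_pct": collision_pct,
        "errors_ok": percent(ierrs, ipkts) < 1 and percent(oerrs, opkts) < 1,
        "collisions": collision_verdict(collision_pct),
    }


# Hypothetical counters for a lightly loaded interface.
report = interface_report(ipkts=100000, ierrs=12, opkts=80000, oerrs=3, collis=480)
print(report)
```

Because the counters accumulate from boot time, percentages computed from two snapshots taken some minutes apart (subtracting the earlier values from the later ones) say more about current conditions than the lifetime totals do.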
Packet sniffers are useful both for solving problems you know about and for discovering entirely new problems. It's a good idea to take an occasional sniff of your network to make sure the traffic is in order.
Since packet sniffers need to be able to intercept traffic that the local machine would not normally receive (or at least, pay attention to), the underlying network hardware must allow access to every packet. Broadcast technologies such as Ethernet work fine, as do some types of token ring network on which the sender of a packet removes it from the ring after it has made a complete circuit.
Since packet sniffers need to see as much of the raw network traffic as possible, they can be thwarted by network switches, which by design try to limit the propagation of "unnecessary" packets. However, it can still be informative to try out a sniffer on a switched network. You may discover problems related to broadcast or multicast packets. Depending on your switch vendor, you may be surprised at how much traffic you can see.
In addition to having potential access to all network packets, the interface hardware must provide a way to actually transport those packets up to the software layer. Packet addresses are normally checked in hardware, and only broadcast/multicast packets and those addressed to the local host are relayed to the kernel. In "promiscuous mode," an interface lets the kernel read all packets on the network, even the ones intended for other hosts.
Packet sniffers understand many of the packet formats used by standard UNIX daemons, and they can often print out packets in a human-readable form. This capability makes it easier to track the flow of a conversation between two programs. Some sniffers print the ASCII contents of a packet in addition to the packet header, which can be useful for investigating high-layer protocols.
Since some of these protocols send information (and even passwords) across the network as cleartext, you must exercise caution to avoid invading the privacy of your users.
Each of our example operating systems comes with a packet sniffer. The sniffer must read data from a raw network device, so it must run as root. Although the root limitation serves to decrease the chance that normal users will listen in on your network traffic, it is really not much of a barrier. Some sites choose to remove the sniffers from most hosts to reduce the chance of abuse. If nothing else, you should check your systems' interfaces to be sure they are not running in promiscuous mode without your knowledge or consent.
snoop: Solaris's packet sniffer
Solaris includes a packet sniffer called snoop. It takes arguments on the command line that specify how to behave and what packets to collect. snoop can filter packets based on host, protocol, packet type, and port number, among other things.
With no arguments, snoop collects packets from the first interface it finds, which is usually also the first interface listed by netstat -i (excluding the loopback). To specify a particular interface, use the -d device flag, where device is the name of the interface as reported by netstat -i (often hme0 for the first Ethernet interface). Using the -V flag gives you a little more information, and the -v flag gives you several lines of detail on each packet.
snoop's command-line language is quite sophisticated, and it is well documented in the snoop man page. Expressions can be created with primitives such as host, port, tcp, udp, and ip. Simple expressions can be combined with primitives such as and, or, and not.
Let's look at a couple of examples. Below is the output of a snoop session that might be useful for debugging mail between the hosts evolve and xor.com. We overspecified the filters to snoop to give a better example:
# snoop host chimchim and host evolve and tcp port 25
You should read the command and arguments above like this: "Capture all packets between the hosts chimchim and evolve which involve TCP port 25."
This example shows one line for each packet that was collected. The packet's source is written first, and the destination appears in the second column. The remainder of the line contains information from the highest layer of the packet, such as protocol, port, and the first few bytes of the packet's data (we cut out a few columns from this example to save space).
If you telnet to a host and run snoop there, you must filter out the traffic from your telnet session. Otherwise, output to your terminal will get caught in a loop as it is displayed on the virtual terminal, sent across the telnet session, and captured again. To ignore all traffic to or from the host evolve, you would use a command such as:
# snoop not host evolve
If we were investigating a failing DNS server named mrhat, we might use the following command line:
# snoop host mrhat | grep DNS
This command incorporates a grep to further limit the packets that are displayed.
nettl: HP-UX's packet sniffer
nettl is part of HP-UX's Network Tracing and Logging package. By default, nettl logging is started at boot time. Unless you want to use nettl to collect data indefinitely, it is wise to disable it until you need it. Edit the /etc/rc.config.d/nettl file and set the NETTL variable to 0. nettl reads its configuration information from /etc/nettlgen.conf.
tcpdump: king of sniffers
By default, tcpdump tunes in on the first network interface that it comes across. If it chooses the wrong interface, you can force an interface with the -i flag. If DNS is broken or you just don't want tcpdump doing name lookups, use the -n option. This option is important because slow DNS service can cause the filter to start dropping packets before they can be dealt with by tcpdump. The -v flag increases the information you see about packets, and -vv gives you even more data. Finally, tcpdump can store packets to a file with the -w flag and can read them back in with the -r flag.
For example, the following output comes from the machine jaguar.xor.com. The filter specification host jaguar limits the display of packets to those that directly involve the machine jaguar, either as source or as destination.
# tcpdump host jaguar
The first packet shows jaguar sending a DNS lookup request about cs.colorado.edu to xor.com. The response is the actual name of the machine for which that name is an alias, which is mroe.cs.colorado.edu. The third packet is a reverse lookup of mroe's IP address, and the fourth packet contains the expected response. The tcpdump man page contains several good examples of advanced filtering along with a complete listing of primitives.
Networks have grown rapidly in size and value over the last decade, and along with that growth has come the need for an efficient way to manage them. Commercial vendors and standards organisations have approached this challenge in many different ways. The most significant developments have been the introduction of several standard device management protocols and a glut of high-level products that exploit those protocols.
Network management protocols provide a standard way of probing a device to discover its configuration, health, and network connections. In addition, they allow some of this information to be modified so that network management can be standardised across different kinds of machinery and performed from a central location.
The most common management protocol used with TCP/IP is the Simple Network Management Protocol, SNMP. Despite its name, SNMP is actually quite complex. It defines a hierarchical namespace of management data and a way to read and write the data at each node. It also defines a way for managed entities ("agents") to send event notification messages ("traps") to management stations. The protocol itself is simple; most of SNMP's complexity lies above the protocol layer in the conventions for constructing the namespace and the conventions for formatting data items within a node. SNMP is widely supported.
Several other standards are floating around out there. Many of them originate from the Distributed Management Task Force (DMTF), which is responsible for concepts such as WBEM (Web-Based Enterprise Management), DMI (Desktop Management Interface), and CIM (the Common Information Model). Some of these concepts, particularly DMI, have been embraced by several major vendors and may become a useful complement to (or even a replacement for) SNMP. For now, however, the vast majority of network management takes place over SNMP.
Since SNMP is only an abstract protocol, you need both a server program ("agent") and a client ("manager") to make use of it. (Perhaps counterintuitively, the server side of SNMP represents the thing being managed, and the client side is the manager.) Clients range from simple command-line utilities to dedicated management stations that graphically display networks and faults in eye-popping colour. Dedicated network management stations are the primary reason for the existence of management protocols. Most products let you build a topographic model of the network as well as a logical model; the two are presented together on-screen, along with a continuous indication of the status of each component. Just as a chart can reveal the hidden meaning in a page of numbers, a network management station can summarize the state of a large network in a way that's easily accepted by a human brain. This kind of executive summary is almost impossible to get any other way. A major advantage of management-by-protocol is that it promotes all kinds of network hardware onto a level playing field. UNIX systems are all basically similar, but routers, switches, and other low-level components are not. With SNMP, they all speak a common language and can be probed, reset, and configured from a central location. It's nice to have one consistent interface to all the network's hardware.
When SNMP first became widely used in the early 1990s, it started a mini gold rush. Hundreds of companies have come out with SNMP management packages. Also, many hardware and software vendors ship an SNMP agent as part of their product.
Before we dive into the gritty details of SNMP, we should note that the terminology associated with it is some of the most wretched technobabble to be found in the UNIX arena. The standard names for SNMP concepts and objects will actively lead you away from an understanding of what's going on. The people responsible for this state of affairs should have their keyboards smashed.
SNMP organisation
Translated into English, this means that SNMP defines a hierarchical namespace of variables whose values are tied to "interesting" parameters of the system. The basic data types that an SNMP variable can contain are integer, string, and null. These can be combined into sequences of the basic types, and a sequence can be instantiated repeatedly to form a table. Most implementations support a variety of other data types as well.
The SNMP hierarchy is very much like a filesystem. However, a dot is used as the separator character, and each node is given a number rather than a name. By convention, nodes are also given text names for ease of reference, but this naming is really just a high-level convenience and not a feature of the hierarchy (it is similar in principle to the mapping of hostnames to IP addresses).
For example, the OID that refers to the uptime of the system is 1.3.6.1.2.1.1.3. This OID is also known by the human-readable name iso.org.dod.internet.mgmt.mib-2.system.sysUpTime.
The top levels of the SNMP hierarchy are political artifacts and generally do not contain useful data. In fact, useful data can currently be found only beneath the OID iso.org.dod.internet.mgmt (numerically, 1.3.6.1.2).
The basic SNMP MIB for TCP/IP (MIB-I) defines access to common management data: information about the system, its interfaces, address translation, and protocol operations (IP, ICMP, TCP, UDP, and others). A later and more complete reworking of this MIB (called MIB-II) is defined in RFC1213. Most vendors that provide an SNMP server support MIB-II. Table 20.1 presents a sampling of nodes from the MIB-II namespace.
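The relationship between numeric OIDs and their text names can be illustrated with a small lookup table. The sketch below covers only the one path spelled out above (the sysUpTime example); a real manager loads these names from MIB definition files, and the function name here is made up for the illustration.

```python
# Names for each numeric prefix along the sysUpTime path quoted in the text.
OID_NAMES = {
    (1,): "iso",
    (1, 3): "org",
    (1, 3, 6): "dod",
    (1, 3, 6, 1): "internet",
    (1, 3, 6, 1, 2): "mgmt",
    (1, 3, 6, 1, 2, 1): "mib-2",
    (1, 3, 6, 1, 2, 1, 1): "system",
    (1, 3, 6, 1, 2, 1, 1, 3): "sysUpTime",
}


def oid_to_name(oid):
    """Translate a dotted numeric OID into its dotted text form.

    Nodes missing from OID_NAMES are left as numbers, much as a node
    with no registered name would be.
    """
    numbers = tuple(int(part) for part in oid.split("."))
    labels = []
    for depth in range(1, len(numbers) + 1):
        prefix = numbers[:depth]
        labels.append(OID_NAMES.get(prefix, str(prefix[-1])))
    return ".".join(labels)


print(oid_to_name("1.3.6.1.2.1.1.3"))
print(oid_to_name("1.3.6.1.4"))
```

Keying the table on the whole numeric prefix, rather than on a single number, reflects the filesystem analogy: the same number can mean different things under different parents.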
Table 20.1 Selected OIDs from MIB-II
(Note: OID is relative to iso.org.dod.internet.mgmt.mib-2.)

In addition to the basic MIB, there are MIBs for various kinds of hardware interfaces and protocols. There are MIBs for individual vendors and MIBs for particular hardware products. A MIB for you, a MIB for me, catch that MIB behind the tree.

A MIB is only a convention about the naming of management data. It must be backed up with agent-side code that maps between the SNMP namespace and the device's actual state to be useful. Code for the basic MIB (now MIB-II) comes with most UNIX SNMP agents. Some agents are extensible to include supplemental MIBs, and some are not.

SNMP protocol operations
Get and set are the basic operations for reading and writing data to a node identified by a specific OID. Get-next is used to step through a MIB hierarchy, as well as to read the contents of tables.

A trap is an unsolicited, asynchronous notification from server (agent) to client (manager) that reports the occurrence of an interesting event or condition. Several standard traps are defined, including "I've just come up" notifications, traps that report the failure or recovery of a network link, and traps for various routing and authentication problems. Many other not-so-standard traps are in common use, including some that simply watch the values of other SNMP variables and fire off a message when a specified range is exceeded. The mechanism by which the destinations of trap messages are specified depends on the implementation of the agent.

Since SNMP messages can potentially modify configuration information, some security mechanism is needed. The simplest version of SNMP security is based on the concept of an SNMP "community name," which is really just a horribly obfuscated way of saying "password." There's usually one community name for read-only access and another that allows writing.

Version 3 of the SNMP standard introduced access control methods with higher security. Although support for these schemes is still somewhat limited in production network hardware, it is reasonable to expect this situation to change soon.

RMON: Remote monitoring MIB
RMON is defined in RFC1757, which became a draft standard in 1995. The MIB is broken up into nine "RMON groups." Each group contains a different set of network statistics. If you have a large network with many WAN connections, you should consider buying probes to reduce the SNMP traffic across your WAN links. Once you have access to statistical summaries from the RMON probes, there's usually no need to gather raw data remotely. Many switches and routers support RMON and will store at least some network statistics.
Many OS and network hardware vendors ship their products with SNMP agents that can run right out of the box. The read-only community string is usually set to "public," and the write community string is occasionally set to "private" or "secret." We recently saw a list of dozens of vendors that follow this practice. Although it can be handy for system administrators, it is equally useful for hackers. If you decide to enable SNMP, be sure to configure your agents to use hard-to-guess community strings for both write and read access.

Solaris and HP-UX are shipped with decent SNMP agents. FreeBSD includes UCD SNMP in the /usr/ports/net/ucd-snmp directory. Red Hat Linux has no SNMP support in its standard distribution.

In the following sections we first describe the Solaris and HP-UX agents. We then talk a bit about the UCD SNMP package, which we recommend for systems that do not come with their own agent.

SNMP on Solaris
The main SNMP agent is /usr/lib/snmp/snmpdx, which reads its configuration from the file /etc/snmp/conf/snmpd.conf. In this file, you can specify the values of many MIB variables and also set the agent's general configuration. For example, you can set the system description string (sysdescr), the trap host or hosts (trap), and the community strings (read-community, write-community). After you modify this file, kill and restart snmpdx to force your changes to take effect.

snmpdx also reads security information from /etc/snmp/conf/snmpdx.acl. In this file, you can list the IP addresses of hosts that should be allowed access to the local SNMP agent. Each set of hosts can have its own read and write community names. These features can dramatically increase the security of SNMP; unfortunately, all restrictions are turned off by default.

An off-the-shelf Solaris installation boots with two DMI-related processes. The first of these is /usr/lib/dmi/dmispd, which answers DMI queries directly. The second is /usr/lib/dmi/snmpXdmid, which translates SNMP requests into DMI requests and passes them on to dmispd. Once dmispd responds, snmpXdmid passes the responses back to the SNMP server, snmpdx. SNMP/DMI translations are defined by files in the /var/dmi/map directory. Only two variable translations are defined by default, so unless you are planning on adding more, you should really have no reason to run snmpXdmid.

If you don't have DMI management software or don't plan on using it, you can prevent both DMI processes from starting at boot time by renaming /etc/rc3.d/S77dmi to /etc/rc3.d/s77dmi. If you just want to disable snmpXdmid, you should rename its configuration file from snmpXdmid.conf to snmpXdmid.conf.orig.

SNMP on HP-UX
The master agent is /usr/sbin/snmpdm, but it should never be run directly. Use the shell script /usr/sbin/snmpd instead. In addition to starting snmpdm, the snmpd script starts the subagents that are responsible for gathering data. The agent reads its configuration from /etc/SnmpAgent.d/snmpd.conf. Configuration information can also be specified on the snmpd command line. Only five keywords can be used within snmpd.conf. They're illustrated in the following example:
# SNMP configuration for disaster.xor.com
The get-community-name and set-community-name keywords set the SNMP community strings (aka passwords) that a client must provide to read and write data values. There can be more than one instance of each. However, access control cannot be subdivided: any name listed in any set-community-name statement is valid for any supported operation.

The trap-dest keyword specifies the name or IP address of an SNMP client that is to receive trap notifications. There can be several trap destinations; all traps are sent to all destinations.

The location and contact keywords set the values of the MIB-II sysLocation and sysContact OIDs.

You can control the amount of logging that snmpd generates with the -m flag:

snmpd -m logmask

The logmask should be a bitwise OR of your choice of the option flags in Table 20.2.
Table 20.2 Option flag values for HP-UX snmpd
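For readers unfamiliar with bit masks, here is a minimal Python sketch of how a logmask could be assembled. The flag names and numeric values in LOG_FLAGS are placeholders invented for illustration, not HP's actual values; take the real ones from Table 20.2 or the HP-UX snmpd documentation. The helper names build_logmask and snmpd_command are likewise our own.

```python
# HYPOTHETICAL flag names and values, for illustration only. Replace them
# with the real option flags from Table 20.2 before using this anywhere.
LOG_FLAGS = {
    "auth_failures": 1,
    "errors": 2,
    "config_requests": 4,
    "packets": 8,
}


def build_logmask(*names):
    """Combine the named flags into one integer with bitwise OR."""
    mask = 0
    for name in names:
        mask |= LOG_FLAGS[name]
    return mask


def snmpd_command(mask):
    """Return the argument list for starting snmpd with the given logmask."""
    return ["/usr/sbin/snmpd", "-m", str(mask)]


mask = build_logmask("auth_failures", "errors")
print(" ".join(snmpd_command(mask)))
```

Because each flag occupies its own bit, ORing the same flag in twice is harmless, and a single integer is enough to carry any combination of options.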
Unfortunately, HP's SNMP agent does not use syslog. You can specify the location of its log file with -l log; the default is /var/adm/snmpd.log.

The UCD SNMP agent
The UCD distribution is now the authoritative free SNMP implementation for UNIX. We recommend it highly for systems with no SNMP implementation of their own. It includes an SNMP agent, some command-line tools, and even a library for developing SNMP-aware applications. We discuss the agent in some detail here and take a look at the command-line tools later in the chapter. The latest version is available from the Web at ucd-snmp.ucdavis.edu.

As in other implementations, the agent collects information about the local host and serves it to SNMP managers across the network. The default installation includes MIBs for network interface, memory, disk, process, and CPU statistics. The agent is easily extensible since it can execute an arbitrary UNIX command and return the command's output as an SNMP response. You can use this feature to monitor almost anything on your system with SNMP.

By default, the agent is installed as /usr/sbin/snmpd. It is usually started at boot time and reads its configuration information from files in the /etc/snmp directory. The most important of these files is snmpd.conf, which contains most of the configuration information and comes shipped with a bunch of sample data collection methods enabled. Although the intention of the UCD authors seems to have been for users to edit only the snmpd.local.conf file, you must edit the snmpd.conf file at least once to disable any default data collection methods that you do not plan to use.

The UCD SNMP configure script lets you specify a default log file and a couple of other local settings. You can use snmpd -l to specify an alternate log file or -s to direct log messages to syslog. Table 20.3 shows a list of snmpd's most important flags. We recommend that you always use the -a flag. For debugging, you should use the -V, -d, or -D flags, each of which gives progressively more information.

Table 20.3 Useful flags for UCD's snmpd
It's worth mentioning that many useful SNMP-related Perl modules are available. Look on CPAN for the latest information if you are interested in writing your own network management scripts. (Note: CPAN, the Comprehensive Perl Archive Network, is an amazing collection of useful Perl modules. Check it out at www.cpan.org.)
We begin this section by exploring the simplest SNMP management tools: the commands provided with the UCD SNMP package. These commands are useful for familiarizing yourself with SNMP, and they're also great for one-off checks of specific OIDs. Next, we look at MRTG, a program that generates historical graphs of SNMP values, and NOCOL, an event-based monitoring system. We conclude with some recommendations of what to look for when purchasing a commercial system.

The UCD SNMP tools
in the UCD SNMP package
In addition to their value on the command line, these programs are tremendously handy in simple scripts. It is often helpful to have snmpget save interesting data values to a text file every few minutes. (Use cron to implement the scheduling; see Chapter 9, Periodic Processes.) snmpwalk is another useful tool. Starting at a specified OID (or at the beginning of the MIB, by default), this command repeatedly makes "get next" calls to an agent. This behavior results in a complete list of available OIDs and their associated values. Here's a sample snmpwalk of the host jaguar ("public" is the community string):
% snmpwalk jaguar public
In this example, we see some general information about the system, followed by statistics about the host's network interfaces, lo0 and eth0. Depending on the MIBs supported by the agent you are managing, a complete dump can run to hundreds of lines.

MRTG: The Multi-Router Traffic Grapher
MRTG runs regularly from cron and can collect data from any SNMP source. Each time the program runs, new data is stored and new graph images are created.

MRTG is free and offers several attractive features. First, it maintains a zero-maintenance, statically-sized database; the software stores only enough data to create the necessary graphs. For example, MRTG could store one sample every minute for a day, one sample every hour for a week, and one sample every week for a year. This consolidation scheme lets you maintain important historical information without having to store unimportant details or to consume your time with database administration.

Second, MRTG can record and graph any SNMP variable. You're free to collect whatever data you want. When combined with the UCD SNMP agent, MRTG can provide a historical perspective on almost any system or network resource.

The future of MRTG lies in a new package, RRDtool, by the same author. RRDtool is similar in concept to MRTG, but with improved data consolidation and graphing features. Unlike MRTG, RRDtool does not offer any data collection methods of its own. Instead, a separate piece of software must collect the data. Currently, Jeff Allen's Cricket tool is the best choice for this role. Cricket is not limited to collecting SNMP data; it can pull in data from almost any network source. Since it is written in Perl, it's easy to add new data sources.

Tobi Oetiker's home page at ee-staff.ethz.ch/~oetiker provides links to the current versions of MRTG, RRDtool, and Cricket.

NOCOL: Network Operation Center OnLine
The distribution includes monitor programs that supervise a variety of common points of failure. You can whip up new monitors in Perl, or even in C if you are feeling ambitious. For notification, the distribution can send email, generate Web reports, display status through a curses interface, and use a dial-up modem to page you. As with monitor programs, it's easy to roll your own notification methods.

If you cannot afford a commercial network management tool, we suggest giving strong consideration to NOCOL. The software works very well for networks of fewer than 100 hosts and devices. You can read more at www.netplex-tech.com.

Commercial management platforms
Data gathering flexibility: It's important for management tools to be able to collect data from sources other than SNMP. Many packages include the ability to gather data from almost any network service. For example, some packages can make SQL database queries, check DNS records, and connect to Web servers.

User interface quality: Expensive systems often offer a custom GUI or a Web interface. The most well-marketed packages today all tout the ability to understand XML templates for data presentation. Although the UI often seems like just more marketing hype, it is important to have an interface that relays information clearly, simply, and comprehensibly.

Value: Some management packages come at a stiff price. HP's OpenView is both one of the most expensive and one of the most widely adopted network management systems. For many corporations, there is a definite value in being able to say that your site is managed by a high-end commercial system. If that isn't so important to your organisation, you should look at the other end of the spectrum for free tools like MRTG and NOCOL.

Automated discovery: Many systems offer the ability to "discover" your network. Through a combination of broadcast pings, SNMP requests, ARP table lookups, and DNS queries, they are able to identify all your local hosts and devices. All the discovery implementations we have seen work pretty well, but none are very accurate on a complex (or heavily firewalled) network.

Reporting features: Many products can send alert email, activate pagers, and automatically generate tickets for popular trouble-tracking systems. Make sure that the platform you choose allows for flexible reporting; who knows what electronic devices you will be dealing with in a few years?

Configuration management: Some vendors step far beyond monitoring and alerting. They offer the ability to manage actual host and device configurations.
For example, CiscoWorks provides an interface that lets you change a router's configuration in addition to monitoring its state with SNMP. Because device configuration information allows for a deeper analysis of network problems, we predict that many packages will develop along these lines in the future.
Cisco Online. Internetworking Technology Overview: SNMP. www.cisco.com.

Hunt, Craig, and Gigi Estabrook. TCP/IP Network Administration, Second Edition. Sebastopol: O'Reilly & Associates. 1998.

Stallings, William. SNMP, SNMPv2, SNMPv3, and RMON 1 and 2, Third Edition. Reading, MA: Addison-Wesley. 1999.

You may find the following RFCs to be useful as well. Instead of citing the actual titles of the RFCs, we have described their contents. The actual titles are an unhelpful jumble of buzzwords and SNMP jargon.
Published by Prentice-Hall Professional and Technical Reference Copyright © 2000 All Rights Reserved. ISBN: 0-13020-601-6