Pathfinder 0.1
--------------

(If you want to skip all the conceptual material, go to the end, where there
is concrete output from a PathFinder instance currently monitoring 14000
hosts and 30000 routers.)

I was fascinated by the idea of monitoring connectivity on the Internet.
This involves keeping track of routes and of the responsiveness of routers
on the Internet. The Internet has a large number of hosts, so one of the
questions we considered was whether it would be possible to monitor large
quantities of routers (>10k), and what kind of hardware would enable us to
do that. The result of the following tests is that an old 1U Telemetrybox
at the colo can be used to monitor more than 100,000 routes (300,000
routers) without a major impact on the system performance of the box.

Router / Host Ratio:

It was said that the average route to a host on the Internet passes through
10 routers. It turns out that many routes share the same routers: the
actual ratio I observed was that a new host on average added 3 new routers
that needed to be monitored. Our initial projection that we would need to
monitor 100k routers to track paths to 10k hosts was therefore wrong; we
only need to monitor about 30k routers for that.

Design:

I took the sources of the MTR tool, which is frequently used to monitor
paths on the Internet, and used some of its pieces to build a new tool that
I called "PathFinder", designed to scale to huge quantities of monitored
hosts.

Characteristics:

- Small footprint (13K executable).
- Simplification by omitting DNS lookups (they can be added later by the
  programs displaying the data). A script can be used to translate
  hostnames to IP addresses before feeding them to PathFinder.
- Simplification by using ASCII files for configuration and logging. A
  database can load the ASCII data, or database support can be integrated
  later. The ASCII format is suitable for direct import into MySQL.
- One process does it all, scheduling pings as well as managing the data.
  This simplifies the implementation.
  It also means that adding additional processors will do no good; the
  process is essentially I/O bound anyway. Using netsaint to schedule
  operations at this scale is suicide, so the tool needs to do its own
  scheduling.
- A prime hash for all search operations matching IP addresses minimizes
  CPU resource use.
- Statistics are kept in RAM, derived from the raw data, and can be dumped
  at configurable intervals to keep the load on a database or on storage as
  low as needed. The more storage is available, the higher the volume and
  detail of the information obtained by PathFinder. Calculations show that
  100k routes / 10k hosts would consume less than 6 Mbyte of RAM, which is
  significantly less than a web browser (24-40 Mbyte).
- Network bandwidth use is settable by configuring the number of pings per
  second to perform.
- The maximum hit rate per second on monitored hosts can be capped so that
  routers are not flooded with traffic. This also redirects pings away from
  the routers on the common path, which are otherwise hit many times over.
- Traces path changes. (This could be used for alarms when connectivity
  problems develop, but that is not yet done.)

Network Bandwidth Use:
----------------------

PathFinder uses 60-byte ping packets. The number of pings can be configured
with the --rate option. The default is 100 pings per second, which results
in a bandwidth use of around 10 kbyte/sec.

Possible configurations:

Rate     Bandwidth
1        Not measurable
100      10 kbyte/s (default)
1000     100 kbyte/s - this setting would use more than half of a T1 line
10000    1 Mbyte/s - uses all the bandwidth I am allowed to use at the
         colo (10 Mbit/s) and is the highest setting I have tried

Tests were done on a 433 MHz Celeron with 128 Mbyte of RAM at Level 3 and
showed no noticeable impact on system performance at any level. At 10000
pings per second the CPU use was at 70%, but the system was still
responsive. At 1000 pings/second the CPU use was 35%, but idle time was
still > 80%.
Most of the CPU time was spent in system mode, which indicates that the
limiting factor is the network interface card and the speed at which data
gets back and forth to it. A better NIC might improve things.

Memory use:
-----------

With around 9000 hosts and 23000 routers to monitor, the process used 4.6
Mbyte of RAM.

Resource use:

Per monitored path    140 bytes
Per router             50 bytes

100k routers would take 5 Mbyte and 10k paths 1.4 Mbyte, which together is
less than 6 Mbyte.

Disk Space:
-----------

For each sample taken:

Per monitored path    180 bytes
Per router             40 bytes

A data dump for 100k routers and 10k hosts would take 2 Mbyte for the path
data and 4 Mbyte for the router performance data. The dump to ASCII files
is configurable via the --dump-interval option. The following capacities
are needed to store historical data; a shorter interval allows more detail
to be analyzed for reports.

Interval    Per 5min    Per Hour    Per Day
5min        6M          100M        2.5G
1hr         -           6M          200M
24hr        -           -           6M

Adding compression cuts these figures to about a third. Data is stored in
a format that allows easy compression without conversion to another
format, and summary sets can be stored in the same structure.

Data is stored in ASCII records separated by newlines. Fields are
separated by tabs.

paths-* structure:

1. Target IP
2. Number of times this route was taken
3. Number of hops
4. List of routers traveled through (up to 30)

routers-* structure:

1. Router IP
2. Number of pings sent to the IP
3. Number of pings returned
4. Best response time in usec
5. Average response time in usec
6. Worst response time in usec

Router impact:
--------------

Monitoring 100000 routers at 1000 pings per second means that a sample of
each router is taken on average every 2 minutes. This is a minimal impact
on the systems surveyed. The 5-minute statistics would not be very useful
since they would only include 2-3 samples; one-hour data dumps would
result in about 30 samples, which is better.
If a 5-minute resolution is needed, then the rate could be increased to
10000 pings per second, but then the 1U system needs to be dedicated to
that task. Every router would be visited on average every 10-20 seconds,
giving us 20-30 samples within 5 minutes. The network bandwidth use at
10000 pings per second is so high, though, that I would suggest talking
with our connectivity providers first.

Sample Sessions on slave3.openrock.net
--------------------------------------

I had some problems getting enough IP addresses for this test. I started
with all the public Debian mirrors around the world (187), but that did
not get me very far. I tried Network Solutions' FTP site for the .com
zone, but the .com, .net and .org zones had been removed because spammers
were abusing them. I did find the zone file for the .edu zone, though. I
ran a variety of patterns over the zone file to extract all sorts of IPs
from it. The DNS resolution took about 8 hours, but at the end I had
14052 IP addresses.

slave3:/home/christoph# pfd --rate=1000 --write-interval=5 --daemon --floodrate=1
--- Reading Hostlist from /etc/pathfinder.list
net_add_netpath: duplicate IP 157.201.100.203
net_add_netpath: duplicate IP 157.201.100.202
net_add_netpath: duplicate IP 216.125.103.10
--- Starting to monitor 14052 paths. Ping Interval=1.00ms Flood Delay=1000.00ms Data Dump=5.00min
--- Dumping Routing Path Data to 14052 hosts into /var/log/pathfinder/paths-976410738
--- Dumping Router Data for 29838 routers into /var/log/pathfinder/routers-976410738
--- Dumping Routing Path Data to 14052 hosts into /var/log/pathfinder/paths-976411038
--- Dumping Router Data for 29969 routers into /var/log/pathfinder/routers-976411038

For the 14052 routes we are monitoring, we need to monitor around 30000
routers.
christoph@slave3:~$ ls -l /var/log/pathfinder
-rw-r--r--   1 root   root   2814949 Dec  9 17:12 paths-976410738
-rw-r--r--   1 root   root   2965116 Dec  9 17:17 paths-976411038
-rw-r--r--   1 root   root    982383 Dec  9 17:12 routers-976410738
-rw-r--r--   1 root   root    615031 Dec  9 17:17 routers-976411038

Around 3 Mbyte of path data and around 1 Mbyte of router performance data
every 5 minutes.

TOP output:

 5:09pm up 29 days, 1:50, 2 users, load average: 0.00, 0.00, 0.00
34 processes: 32 sleeping, 2 running, 0 zombie, 0 stopped
CPU states: 1.7% user, 4.3% system, 0.0% nice, 93.8% idle
Mem:  127992K av, 120864K used,   7128K free, 25332K shrd, 61232K buff
Swap: 128516K av,    368K used, 128148K free              37400K cached

  PID USER     PRI NI SIZE  RSS SHARE STAT LIB %CPU %MEM  TIME COMMAND
27261 root      19  0 5100 5100   528 R      0  5.1  3.9  0:03 pfd
27269 christop   3  0 1160 1160   688 R      0  0.9  0.9  0:00 top
    1 root       0  0  380  376   320 S      0  0.0  0.2  0:05 init
    2 root       0  0    0    0     0 SW     0  0.0  0.0  0:00 kflushd

This shows that the process uses just 5% CPU and that the system is quite
idle at 1000 pings/second. The process uses 5.1 Mbyte for the data. No
swapping or anything of that ilk happens.

VMSTAT output:

christoph@slave3:~$ vmstat 5
 procs                  memory    swap          io     system         cpu
 r b w  swpd  free  buff cache  si so    bi bo    in    cs  us sy id
 0 0 0   368  7040 61232 37400   0  0     0  0     2     6   0  0 14
 1 0 0   368  7016 61232 37400   0  0     0  0  1825  1470   2  5 93
 1 0 0   368  7000 61232 37400   0  0     0  1  1760  1332   2  4 94
 1 0 0   368  6976 61232 37400   0  0     0  0  1834  1475   1  7 92
 1 0 0   368  6968 61232 37400   0  0     0  0  1793  1382   1  4 95
 1 0 0   368  6952 61232 37400   0  0     0  0  1767  1338   1  5 94

VMSTAT shows the CPU >90% idle. No swapping occurs. It is heavy on
interrupts.
Next attempt using 10000 pings per second:
------------------------------------------

slave3:/home/christoph# pfd --rate=10000 --write-interval=5 --daemon --floodrate=1
--- Reading Hostlist from /etc/pathfinder.list
net_add_netpath: duplicate IP 157.201.100.203
net_add_netpath: duplicate IP 157.201.100.202
net_add_netpath: duplicate IP 216.125.103.10
--- Starting to monitor 14052 paths. Ping Interval=0.10ms Flood Delay=1000.00ms Data Dump=5.00min
--- Dumping Routing Path Data to 14052 hosts into /var/log/pathfinder/paths-976411606
--- Dumping Router Data for 30023 routers into /var/log/pathfinder/routers-976411606

TOP output:

 5:22pm up 29 days, 2:04, 2 users, load average: 0.45, 0.16, 0.04
34 processes: 31 sleeping, 3 running, 0 zombie, 0 stopped
CPU states: 12.6% user, 65.1% system, 0.0% nice, 22.1% idle
Mem:  127992K av, 122004K used,   5988K free, 25348K shrd, 61232K buff
Swap: 128516K av,    368K used, 128148K free              37676K cached

  PID USER     PRI NI SIZE  RSS SHARE STAT LIB %CPU %MEM  TIME COMMAND
27287 root      15  0 5828 5828   528 R      0 76.7  4.5  0:37 pfd
27288 christop   2  0 1164 1164   692 R      0  0.9  0.9  0:00 top
    1 root       0  0  380  376   320 S      0  0.0  0.2  0:05 init
    2 root       0  0    0    0     0 SW     0  0.0  0.0  0:00 kflushd

We are pushing it here: 76% of the CPU is used by PathFinder and the
system is only 22% idle.

VMSTAT output:

christoph@slave3:/var/log/pathfinder$ vmstat 5
 procs                  memory    swap          io     system         cpu
 r b w  swpd  free  buff cache  si so    bi bo    in    cs  us sy id
 1 0 0   368  6132 61232 37676   0  0     0  0     3     6   0  0 15
 1 0 0   368  6108 61232 37676   0  0     0  0 15247  2251  14 64 22
 1 0 0   368  6084 61232 37676   0  0     0  1 15087  2404  10 67 23
 1 0 0   368  6064 61232 37676   0  0     0  0 14975  2480  11 66 22
 1 0 0   368  6036 61232 37676   0  0     0  0 14921  2413  12 64 24
 1 0 0   368  6012 61232 37676   0  0     0  0 14479  2487  13 62 25

This confirms the resource situation. Interrupt use is way up. Note that
there is still no swapping, since the whole data set fits neatly into
memory (<6 Mbyte!). The limitation is the speed of the NIC.
We can saturate a 10 Mbit link with pings without a problem. We would need
a NIC with higher intelligence (and less thirst for interrupts) to get
more speed.

I think there would be no issue with monitoring even 10 times as much as
done here. With 10000 pings per second we have a sample rate of 25-30 per
data dump for 30k routers. For 100k hosts we would have 2-4 samples per 5
minutes of around 300k routers at 10000 pings per second. The data
structures would grow tenfold, up to 60 Mbyte, as would the storage
requirements. That could be done with a dedicated 1U box, or even better
with a modern PC.

One million routers would be possible with a 1U box by:

1. Going to an hourly sample cycle (30 times as much data!)
2. Running at 10k pings/second (1k/sec could even work if samples need
   not be too frequent!)
3. Having a machine with 500 Mbyte of RAM. PathFinder itself would need
   180 Mbyte of RAM.

I hope this proves that monitoring lots of hosts would work....

Christoph Lameter, December 9th 2000