Random ramblings

All | General | Motorbikes | Networking | Odds & Sods | Soapbox | Solaris
« One year to learn,... | Main | OS personality test »
20050523 Monday May 23, 2005

Peer pressure

This makes me happy. A dutch ISP peering with 150 odd other ASs at several IXes. Done with 4 Quagga boxes, one of which has about 190 BGP sessions.

Speaking of which, I'm testing a set of nice changes to make bgpd much more responsive. At present, dealing with peer-down events is roughly about Order(n*p) (I think, if I have that right. Well, I know I don't - it's far more complex than that..), for n number of prefixes and p number of BGP peers (where each peer is advertising/receiving essentially same set of prefixes), with the 'p' component having a large influence.

The problem was that all this work was done in one thread. As Quagga processes use a simple cooperative/user-space thread model, staying in one thread for a long time is not good. In the worst case it can cause us to miss protocol-related timers and hence drop sessions, particularly on slow machines or fast machines with lots of peers, and various points on a curve in between them. Which obviously is not not good at all.

A big factor in the long delay in this 'peer-down' thread was that communicating route removals/updates to the 'zebra' daemon (which manages the 'RIB' and talking to the kernel) was done synchronously. Andrew Schorr updated the 'zclient' code to use asynchronous IO, deferring IO to threads, which had made a big difference, but still not big enough. A user with lots of sessions (2 upstream sessions feeding on to 70 or 80 other downstream peers iirc) on a P-III class machine reported there were still long delays if either of the 2 upstream peer sessions were cleared. So I've been working on a set of changes to allow arbitrary work to be deferred and done asynchronously via work-queues. The results of this are looking pretty good.

Eg, an Athlon 800MHz dealing with two upstream sessions, both providing full 160k prefix feeds, being cleared in short succession, with two other 'downstream' routers to send the withdrawals on to (note that, like VMS, commands are 'best-matched', so you need only type as many characters of a word as are sufficient to uniquely identify it, eg 'show' can be typed as 'sh','show work-queues' as 'sh w' and 'show thread cpu' as 'sh th c'):

# sh ip bgp summary
BGP router identifier 212.17.55.49, local AS number 64512
53129 BGP AS-PATH entries
2194 BGP community entries

Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
192.168.0.11    4 64513       4   19602        0    0    0 00:01:19        0
XXX.XXX.XXX.23  4 XX650   30750       4        0    0    0 00:01:30   159875
212.17.55.52    4 64514       0       0        0    0    0 never    Idle (Admin)
212.17.55.53    4 64512       3   40578        0    0    0 00:01:30        0
XXX.XXX.XX.17   4 XX639       0       0        0    0    0 never    Idle (Admin)
XXX.XXX.XX.18   4 XX639   30267       4        0    0    0 00:01:29   160316
XXX.XXX.XX.19   4 XX639       0       0        0    0    0 never    Idle (Admin)

Total number of neighbors 7
# sh thread cpu 
                      CPU (user+system): Real (wall-clock):
Runtime(ms)   Invoked Avg uSec Max uSecs Avg uSec Max uSecs  Type  Thread
   9991.458     63390      157     11999      167     24505 R      bgp_read
    845.869     61041       13    395940       14    411720    E   bgp_event
      0.000        31        0         0       18        31   T    bgp_routeadv_timer
      0.000         7        0         0       20        28   T    bgp_import
      7.998        40      199      1000      135       303  W     vty_flush
   4170.371      6459      645     23996      675     23921  W     bgp_write
      0.000         4        0         0      136       253   T    bgp_keepalive_timer
      3.999         1     3999      3999     4379      4379 R      vty_accept
      1.000         3      333      1000      231       275 R      bgp_accept
      0.000         5        0         0      333       415   T    bgp_start_timer
     24.997        39      640      4999      627      5213 R      vty_read
    669.899         1   669899    669899   726771    726771   T    bgp_scan_timer
      0.000         1        0         0      226       226    E   zlookup_connect
   7451.895       636    11716     34995    12846     35991      B work_queue_run
      0.000         1        0         0      692       692    E   zclient_connect
     27.995        96      291      1000      399       912  W     zclient_flush_data
      0.000        17        0         0       51       131 R      zclient_read
  23195.481    131772      176    669899      188    726771 RWTEXB TOTAL
# clear ip bgp XXX.XXX.XXX.23 
# clear ip bgp XXX.XXX.XXX.18  
# sh work-queues
    List      (ms)      Q. Runs       Cycle Counts   
   Items  Delay  Hold     Total     Best  Gran.   Avg. Name
     460     10   500       713     5951    569   1282 process_main_queue
       0     10   500         0        0      1      0 process_rsclient_queue
  305620      0   100       370       63     23     41 clear_route_node
# sh w
    List      (ms)      Q. Runs       Cycle Counts   
   Items  Delay  Hold     Total     Best  Gran.   Avg. Name
     906     10   500      1022     6741    771   1411 process_main_queue
       0     10   500         0        0      1      0 process_rsclient_queue
  113062      0   100      4496       81     29     46 clear_route_node
# sh w
    List      (ms)      Q. Runs       Cycle Counts   
   Items  Delay  Hold     Total     Best  Gran.   Avg. Name
     464     10   500      1256     6741    845   1402 process_main_queue
       0     10   500         0        0      1      0 process_rsclient_queue
   20146      0   100      7018       81     29     42 clear_route_node
# sh w
    List      (ms)      Q. Runs       Cycle Counts   
   Items  Delay  Hold     Total     Best  Gran.   Avg. Name
     222     10   500      1289     6741    845   1380 process_main_queue
       0     10   500         0        0      1      0 process_rsclient_queue
    5095      0   100      7472       81     29     42 clear_route_node
# sh w
    List      (ms)      Q. Runs       Cycle Counts   
   Items  Delay  Hold     Total     Best  Gran.   Avg. Name
  130370     10   500      1308     6741    845   1370 process_main_queue
       0     10   500         0        0      1      0 process_rsclient_queue
       0      0   100      7632       81     29     42 clear_route_node
# sh w
    List      (ms)      Q. Runs       Cycle Counts   
   Items  Delay  Hold     Total     Best  Gran.   Avg. Name
    9131     10   500      1427     6741    883   1367 process_main_queue
       0     10   500         0        0      1      0 process_rsclient_queue
       0      0   100      7632       81     29     42 clear_route_node
# sh w
    List      (ms)      Q. Runs       Cycle Counts   
   Items  Delay  Hold     Total     Best  Gran.   Avg. Name
    2932     10   500      1501     6741    883   1339 process_main_queue
       0     10   500         0        0      1      0 process_rsclient_queue
       0      0   100      7632       81     29     42 clear_route_node
# sh ip bg su
BGP router identifier 212.17.55.49, local AS number 64512
35110 BGP AS-PATH entries
1598 BGP community entries

Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
192.168.0.11    4 64513       8   55318        0    0    0 00:05:39        0
XXX.XXX.XXX.23  4 XX650   50202      12        0    0    0 00:03:34   111862
212.17.55.52    4 64514       0       0        0    0    0 never    Idle (Admin)
212.17.55.53    4 64512       7   86094        0    0    0 00:05:50        0
XXX.XXX.XX.17   4 XX639       0       0        0    0    0 never    Idle (Admin)
XXX.XXX.XX.18   4 XX639   50743      11        0    0    0 00:01:32   128103
XXX.XXX.XX.19   4 XX639       0       0        0    0    0 never    Idle (Admin)

Total number of neighbors 7
# sh w       
    List      (ms)      Q. Runs       Cycle Counts   
   Items  Delay  Hold     Total     Best  Gran.   Avg. Name
  259887     10   500      1607     6741    883   1304 process_main_queue
       0     10   500         0        0      1      0 process_rsclient_queue
       0      0   100      7632       81     29     42 clear_route_node
# sh w
    List      (ms)      Q. Runs       Cycle Counts   
   Items  Delay  Hold     Total     Best  Gran.   Avg. Name
       0     10   500      1724     6741    995   1399 process_main_queue
       0     10   500         0        0      1      0 process_rsclient_queue
       0      0   100      7632       81     29     42 clear_route_node
# sh th cp
                      CPU (user+system): Real (wall-clock):
Runtime(ms)   Invoked Avg uSec Max uSecs Avg uSec Max uSecs  Type  Thread
  22680.529    124500      182     17997      230    109761 R      bgp_read
   1885.720    122093       15    395940       17    411720    E   bgp_event
      0.000       113        0         0       19       414   T    bgp_routeadv_timer
      1.000        26       38      1000       19        78   T    bgp_import
     30.997       203      152      1000      829     52001  W     vty_flush
  16739.445     17026      983    253962     1163    298907  W     bgp_write
     71.990        22     3272     42994     4033     43105   T    bgp_keepalive_timer
      3.999         1     3999      3999     4379      4379 R      vty_accept
      1.000         3      333      1000      231       275 R      bgp_accept
      1.000         1     1000      1000      527       527   T    bgp_connect_timer

      1.000         7      142      1000      386       546   T    bgp_start_timer
   1133.823       202     5612    316951     6472    331332 R      vty_read
   3428.480         6   571413    725890   662915    822071   T    bgp_scan_timer
      0.000         1        0         0      226       226    E   zlookup_connect
 154337.544      9356    16496    111982    22551    541611      B work_queue_run
      0.000         1        0         0      692       692    E   zclient_connect
     27.995        96      291      1000      399       912  W     zclient_flush_data
      0.000        17        0         0       51       131 R      zclient_read
 200344.522    273674      732    725890      976    822071 RWTEXB TOTAL

The above shows the 2 upstream sessions being cleared in short succession (clear ip bgp XXX.XXX.XXX.XX), and then shows the work-queues filling up and being processed. 'show thread cpu' shows statistics for all the threads which have run (the types are Read IO, Write IO, Timer, Event, directly eXecuted and Background), including the all important maximum amount of time a thread has ever taken. We can also see the queue filling up then draining again as the sessions come back up.

The maximum amount of time we spent in a single thread was 822ms wall-clock, in bgp_scan_timer, which has not yet been 'work-queued'. Previously we'd probably have, on this machine, spent about 20s to 40s running in bgp_event when the sessions were cleared. Runtime which is instead now deferred and done in approximately 10ms slices via work_queue_run. Now bgpd, in this test-case, 'blocks' for just 411ms, and that was not even incurred due to the clears, as it was present in the first output of the thread statistics.

Additionally, with the work-queue approach, we're now able to make an optimisation we couldn't previously. By delaying the scheduling of the work-queue we can 'gather-up' work and avoid needlessly processing the same prefix twice, if it is already scheduled to be processed.

I'm quite proud of this work. There's a wee bit more to do (take care of a race in particular [done]) but it means we should now be able to scale to a lot of peers (or alternatively, have far better behaviour on slow machines with moderate numbers of sessions. There apparently is a 1024 peer select() limit, at least on old Linux machines!

There's obviously a cost to this, in terms of RAM to store the work-queue entries (minimal: 2.3MB on x86 for 300k items [correction: about 5.8MB]) and some run-time overhead. But it's well worth it, RAM is cheap and 2MBish is small compared to 100MB needed for two full views; runtime overhead is worth it too, given it allows work to be sliced up and benefits slow machines. Even with these changes, we fit two full-views, with soft-reconfiguration for one of the views, into about 106MB of RAM (bgpd) and a further 45MB for zebra, on x86 (about 168MB for one view + 2 soft-reconfigs of full-view for bgpd and 88MB for zebra on x86-64).

Anyway, yay! Now I'm even happier :).

( May 23 2005, 08:27:57 AM IST ) Permalink Comments [1]

Trackback URL: http://blogs.sun.com/paulj/entry/peer_pressure
Comments:

HI,

a Cisco 7513 with 28 MB of DRAM, can it support default and Partial routes received from an ISP?

cisco RSP4 (R5000) processor with 262144K/2072K bytes of memory.
R5000 CPU at 200Mhz, Implementation 35, Rev 2.1, 512KB L2 Cache

Please advise.

Regards,

Masood

Posted by Masood Aboolai on May 11, 2008 at 08:09 PM IST #

Post a Comment:

Name:
E-Mail:
URL:

Your Comment:

HTML Syntax: NOT allowed

Calendar

RSS Feeds

Search

Links

Navigation

Referers