Monday May 23, 2005 This makes me happy. A dutch ISP peering with 150 odd other ASs at several IXes. Done with 4 Quagga boxes, one of which has about 190 BGP sessions.
Speaking of which, I'm testing a set of nice changes to make bgpd much more responsive. At present, dealing with peer-down events is roughly about Order(n*p) (I think, if I have that right. Well, I know I don't - it's far more complex than that..), for n number of prefixes and p number of BGP peers (where each peer is advertising/receiving essentially same set of prefixes), with the 'p' component having a large influence.
The problem was that all this work was done in one thread. As Quagga processes use a simple cooperative/user-space thread model, staying in one thread for a long time is not good. In the worst case it can cause us to miss protocol-related timers and hence drop sessions, particularly on slow machines or fast machines with lots of peers, and various points on a curve in between them. Which obviously is not not good at all.
A big factor in the long delay in this 'peer-down' thread was that communicating route removals/updates to the 'zebra' daemon (which manages the 'RIB' and talking to the kernel) was done synchronously. Andrew Schorr updated the 'zclient' code to use asynchronous IO, deferring IO to threads, which had made a big difference, but still not big enough. A user with lots of sessions (2 upstream sessions feeding on to 70 or 80 other downstream peers iirc) on a P-III class machine reported there were still long delays if either of the 2 upstream peer sessions were cleared. So I've been working on a set of changes to allow arbitrary work to be deferred and done asynchronously via work-queues. The results of this are looking pretty good.
Eg, an Athlon 800MHz dealing with two upstream sessions, both providing full 160k prefix feeds, being cleared in short succession, with two other 'downstream' routers to send the withdrawals on to (note that, like VMS, commands are 'best-matched', so you need only type as many characters of a word as are sufficient to uniquely identify it, eg 'show' can be typed as 'sh','show work-queues' as 'sh w' and 'show thread cpu' as 'sh th c'):
# sh ip bgp summary
BGP router identifier 212.17.55.49, local AS number 64512
53129 BGP AS-PATH entries
2194 BGP community entries
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
192.168.0.11 4 64513 4 19602 0 0 0 00:01:19 0
XXX.XXX.XXX.23 4 XX650 30750 4 0 0 0 00:01:30 159875
212.17.55.52 4 64514 0 0 0 0 0 never Idle (Admin)
212.17.55.53 4 64512 3 40578 0 0 0 00:01:30 0
XXX.XXX.XX.17 4 XX639 0 0 0 0 0 never Idle (Admin)
XXX.XXX.XX.18 4 XX639 30267 4 0 0 0 00:01:29 160316
XXX.XXX.XX.19 4 XX639 0 0 0 0 0 never Idle (Admin)
Total number of neighbors 7
# sh thread cpu
CPU (user+system): Real (wall-clock):
Runtime(ms) Invoked Avg uSec Max uSecs Avg uSec Max uSecs Type Thread
9991.458 63390 157 11999 167 24505 R bgp_read
845.869 61041 13 395940 14 411720 E bgp_event
0.000 31 0 0 18 31 T bgp_routeadv_timer
0.000 7 0 0 20 28 T bgp_import
7.998 40 199 1000 135 303 W vty_flush
4170.371 6459 645 23996 675 23921 W bgp_write
0.000 4 0 0 136 253 T bgp_keepalive_timer
3.999 1 3999 3999 4379 4379 R vty_accept
1.000 3 333 1000 231 275 R bgp_accept
0.000 5 0 0 333 415 T bgp_start_timer
24.997 39 640 4999 627 5213 R vty_read
669.899 1 669899 669899 726771 726771 T bgp_scan_timer
0.000 1 0 0 226 226 E zlookup_connect
7451.895 636 11716 34995 12846 35991 B work_queue_run
0.000 1 0 0 692 692 E zclient_connect
27.995 96 291 1000 399 912 W zclient_flush_data
0.000 17 0 0 51 131 R zclient_read
23195.481 131772 176 669899 188 726771 RWTEXB TOTAL
# clear ip bgp XXX.XXX.XXX.23
# clear ip bgp XXX.XXX.XXX.18
# sh work-queues
List (ms) Q. Runs Cycle Counts
Items Delay Hold Total Best Gran. Avg. Name
460 10 500 713 5951 569 1282 process_main_queue
0 10 500 0 0 1 0 process_rsclient_queue
305620 0 100 370 63 23 41 clear_route_node
# sh w
List (ms) Q. Runs Cycle Counts
Items Delay Hold Total Best Gran. Avg. Name
906 10 500 1022 6741 771 1411 process_main_queue
0 10 500 0 0 1 0 process_rsclient_queue
113062 0 100 4496 81 29 46 clear_route_node
# sh w
List (ms) Q. Runs Cycle Counts
Items Delay Hold Total Best Gran. Avg. Name
464 10 500 1256 6741 845 1402 process_main_queue
0 10 500 0 0 1 0 process_rsclient_queue
20146 0 100 7018 81 29 42 clear_route_node
# sh w
List (ms) Q. Runs Cycle Counts
Items Delay Hold Total Best Gran. Avg. Name
222 10 500 1289 6741 845 1380 process_main_queue
0 10 500 0 0 1 0 process_rsclient_queue
5095 0 100 7472 81 29 42 clear_route_node
# sh w
List (ms) Q. Runs Cycle Counts
Items Delay Hold Total Best Gran. Avg. Name
130370 10 500 1308 6741 845 1370 process_main_queue
0 10 500 0 0 1 0 process_rsclient_queue
0 0 100 7632 81 29 42 clear_route_node
# sh w
List (ms) Q. Runs Cycle Counts
Items Delay Hold Total Best Gran. Avg. Name
9131 10 500 1427 6741 883 1367 process_main_queue
0 10 500 0 0 1 0 process_rsclient_queue
0 0 100 7632 81 29 42 clear_route_node
# sh w
List (ms) Q. Runs Cycle Counts
Items Delay Hold Total Best Gran. Avg. Name
2932 10 500 1501 6741 883 1339 process_main_queue
0 10 500 0 0 1 0 process_rsclient_queue
0 0 100 7632 81 29 42 clear_route_node
# sh ip bg su
BGP router identifier 212.17.55.49, local AS number 64512
35110 BGP AS-PATH entries
1598 BGP community entries
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
192.168.0.11 4 64513 8 55318 0 0 0 00:05:39 0
XXX.XXX.XXX.23 4 XX650 50202 12 0 0 0 00:03:34 111862
212.17.55.52 4 64514 0 0 0 0 0 never Idle (Admin)
212.17.55.53 4 64512 7 86094 0 0 0 00:05:50 0
XXX.XXX.XX.17 4 XX639 0 0 0 0 0 never Idle (Admin)
XXX.XXX.XX.18 4 XX639 50743 11 0 0 0 00:01:32 128103
XXX.XXX.XX.19 4 XX639 0 0 0 0 0 never Idle (Admin)
Total number of neighbors 7
# sh w
List (ms) Q. Runs Cycle Counts
Items Delay Hold Total Best Gran. Avg. Name
259887 10 500 1607 6741 883 1304 process_main_queue
0 10 500 0 0 1 0 process_rsclient_queue
0 0 100 7632 81 29 42 clear_route_node
# sh w
List (ms) Q. Runs Cycle Counts
Items Delay Hold Total Best Gran. Avg. Name
0 10 500 1724 6741 995 1399 process_main_queue
0 10 500 0 0 1 0 process_rsclient_queue
0 0 100 7632 81 29 42 clear_route_node
# sh th cp
CPU (user+system): Real (wall-clock):
Runtime(ms) Invoked Avg uSec Max uSecs Avg uSec Max uSecs Type Thread
22680.529 124500 182 17997 230 109761 R bgp_read
1885.720 122093 15 395940 17 411720 E bgp_event
0.000 113 0 0 19 414 T bgp_routeadv_timer
1.000 26 38 1000 19 78 T bgp_import
30.997 203 152 1000 829 52001 W vty_flush
16739.445 17026 983 253962 1163 298907 W bgp_write
71.990 22 3272 42994 4033 43105 T bgp_keepalive_timer
3.999 1 3999 3999 4379 4379 R vty_accept
1.000 3 333 1000 231 275 R bgp_accept
1.000 1 1000 1000 527 527 T bgp_connect_timer
1.000 7 142 1000 386 546 T bgp_start_timer
1133.823 202 5612 316951 6472 331332 R vty_read
3428.480 6 571413 725890 662915 822071 T bgp_scan_timer
0.000 1 0 0 226 226 E zlookup_connect
154337.544 9356 16496 111982 22551 541611 B work_queue_run
0.000 1 0 0 692 692 E zclient_connect
27.995 96 291 1000 399 912 W zclient_flush_data
0.000 17 0 0 51 131 R zclient_read
200344.522 273674 732 725890 976 822071 RWTEXB TOTAL
The above shows the 2 upstream sessions being cleared in short succession (clear ip bgp XXX.XXX.XXX.XX), and then shows the work-queues filling up and being processed. 'show thread cpu' shows statistics for all the threads which have run (the types are Read IO, Write IO, Timer, Event, directly eXecuted and Background), including the all important maximum amount of time a thread has ever taken. We can also see the queue filling up then draining again as the sessions come back up.
The maximum amount of time we spent in a single thread was 822ms wall-clock, in bgp_scan_timer, which has not yet been 'work-queued'. Previously we'd probably have, on this machine, spent about 20s to 40s running in bgp_event when the sessions were cleared. Runtime which is instead now deferred and done in approximately 10ms slices via work_queue_run. Now bgpd, in this test-case, 'blocks' for just 411ms, and that was not even incurred due to the clears, as it was present in the first output of the thread statistics.
Additionally, with the work-queue approach, we're now able to make an optimisation we couldn't previously. By delaying the scheduling of the work-queue we can 'gather-up' work and avoid needlessly processing the same prefix twice, if it is already scheduled to be processed.
I'm quite proud of this work. There's a wee bit more to do (take care of a race in particular [done]) but it means we should now be able to scale to a lot of peers (or alternatively, have far better behaviour on slow machines with moderate numbers of sessions. There apparently is a 1024 peer select() limit, at least on old Linux machines!
There's obviously a cost to this, in terms of RAM to store the work-queue entries (minimal: 2.3MB on x86 for 300k items [correction: about 5.8MB]) and some run-time overhead. But it's well worth it, RAM is cheap and 2MBish is small compared to 100MB needed for two full views; runtime overhead is worth it too, given it allows work to be sliced up and benefits slow machines. Even with these changes, we fit two full-views, with soft-reconfiguration for one of the views, into about 106MB of RAM (bgpd) and a further 45MB for zebra, on x86 (about 168MB for one view + 2 soft-reconfigs of full-view for bgpd and 88MB for zebra on x86-64).
Anyway, yay! Now I'm even happier :).
( May 23 2005, 08:27:57 AM IST ) Permalink Comments [1]
HI,
a Cisco 7513 with 28 MB of DRAM, can it support default and Partial routes received from an ISP?
cisco RSP4 (R5000) processor with 262144K/2072K bytes of memory.
R5000 CPU at 200Mhz, Implementation 35, Rev 2.1, 512KB L2 Cache
Please advise.
Regards,
Masood
Posted by Masood Aboolai on May 11, 2008 at 08:09 PM IST #