Thursday May 28, 2009 One thing I very much hope, is that the Sun software group (Sun here on) and the general free-software world (which these days is equivalent to the general Unix eco-system) will continue to converge.
I hope Sun continues to engage the wider free-software using community and embrace it more fully, as it is doing with OpenSolaris (the distro) and the work to build up a wide body of packages. Specifically I think Sun should:
Simply put: It's a divisive measure, by design - especially when Sun is seen to be happy with the GPL for other projects. It sets OpenSolaris aside from other communities. It greatly hinders code re-use. The costs of this are definitely high, the benefits uncertain.
I hope the greater free-software using world also considers more carefully the importance of Sun to it. Sun has long contributed both code and technologies to the Unix eco-system and to free software. Further, the Unix eco-system has always been distinguished (e.g. from Wintel) by its healthy competition. Regardless of which Unix (inc Linux) you favour, the presence of strong, competing Unixes benefits you - even if don't like everything about them. It may be useful to keep that in mind.
I really believe that the best future for OpenSolaris demands much greater co-operation and inter-mingling with other free software projects. I hope Sun does too, and I hope the greater community will encourage and welcome them.
(See you at my new blog).
( May 28 2009, 07:25:06 PM IST ) Permalink Comments [3]I'm leaving Sun, as I have to take up summer work elsewhere. I have a new blog which I'll try keep updated.
I'm very grateful for my time at Sun: for the people I've met, the experiences gained, and the support given to me. Hopefully I'll run into people again...
( May 28 2009, 05:35:07 PM IST ) Permalink Comments [0]The UK is projected to borrow £175bn over the next 2 years. The UK is projected to spend approximately £16bn in total, through to 2010 on the Afghanistan and Iraqi wars. Precise figures seem hard to find, but some £7bn approx had been spent through to 2007, so a figure of around £6bn seems a reasonable, rough projection for 2009-2010 spending (winding down in Iraq has been matched by escalation in Afghanistan). These figures are, it seems, by and large supplementary to the main UK DoD budget of circa £36bn/annum.
So something like 3% of that massive 175Bn of borrowing is simply going toward that remaining, failed project of Blair in Afghanistan. That's a huge burden. Will UK tax-payers accept cuts in public-services, while paying for continuing, futile, death and destruction in Afghanistan?
( Apr 26 2009, 03:53:06 PM IST ) Permalink Comments [0]Sniff.. my beautiful Aprilia RS250 disappears tomorrow, sold to a collector.
The money's going on maintenance on my other collectable, my '00 Mini (the real mini in that picture, not that other shameless BMW rip-off of the name) which my parents are using at the moment.
( Apr 24 2009, 01:05:51 AM IST ) Permalink Comments [0]As per an earlier blog, I'm busy outside of Sun (and Quagga) for the time being, except, however, for the summer when I need to work. Ideally I'd work at Sun, however that seems uncertain, if not unlikely. I'd really like to keep working on Quagga. So if anyone knows of any opportunities, do get in touch.
( Apr 23 2009, 09:58:04 AM IST ) Permalink Comments [0]The Insecure, Great, British Pædosieve: The use of web-cache software
In my previous post on the Great-British Pædowall (or sieve more accurately, given its usefulness) I mentioned potential insecurities generally. Some web-URL-blocking systems make use of the Squid Cache to implement the URL-matching-and-blocking functionality. This, I would argue, is a qualitative security risk, in at least two dimensions.
It is, I think, widely accepted that software detects are unavoidable and proportional to the complexity of a body of software. The rate of defects in software can be minimised, e.g. through certain, more costly, development practices and by restricting the complexity of the software (e.g. the amount of code, features, etc).
Squid is a long-standing software project to develop web-caching (not blocking) software. As it was designed to cache web content, it has to have a fairly sophisticated understanding of HTTP - not entirely trivial and certainly significantly more complex than applying string filters. Further, Squid can act in a distributed manner, co-operating with other caches by exchanging information with them - to this end it supports various inter-cache/proxy protocols (ICP, WCCP, etc.). Each of which, obviously, requires its own body of network exposed code. Additionally, it contains support for management protocols, both in-band (HTTP cache_object requests) and out-of-band (e.g. SNMP) which can be used to retrieve information - again requiring specific, network-exposed lumps of code.
Squid was developed via normal, open source development practices. The project originates from the 1990s, an era when the Internet was still a more trusting, less hostile place. Its coding practices reflect this to some extent: raw handling of network-supplied input abounds - this is (sadly) still very common in C/C++ network software generally (Squid is C++ now, but much of the code is imported C from the original Squid; there is little, if any, use of input-sanitising abstractions). Further, deployment in "friendly" environments is reasonably typical for Squid (e.g. corporate, where nearly all users are reasonably responsible adults who are contractually bound to be "friendly", on pain of substantial punishment) .
So a) Squid is a reasonably featureful and complex piece of software, with several relatively independent, parallel bodies of code that can be invoked by a remote, untrusted user; b) There is no reason to think any special defect-minimisation processes were used to develop Squid. In combination, this means we should expect Squid to have a fairly large number of defects that can be triggered by an external attacker - some proportion of which will have disruptive consequences, possibly even leading to the execution of network-supplied code. A glance at the various vulnerability databases on the internet would seem to confirm this suspicion: Squid is subject to fairly regular reports, against pretty much all the protocol bits mentioned above. It is reasonable to think that many defects remain and that some would be easily found with fuzz-testing.
Simply put this means a Squid cache will expose information and/or behave in ways which its operator did not anticipate. These ways may be by design - through features the operator was not fully aware of - or by defect. Behaviours that are caused solely by having put this large piece of software in the path, and could be fixed simply by removing it, without any loss of functionality (according to my argument in the earlier post).
The risks here are in two dimensions, a) Risks to the operator; b) Risks to the users. This may seem obvious, but it's an important distinction because the operator is likely to care more about the former risks than the latter. I.e. an ISP might take pre-emptive action to mitigate a risk to itself, but is less likely to expend similar effort on mitigating b. This is somewhat speculative, but reasonably safe in the face of various lost-customer-data incidents seen in other industries.
Risks to the operator:
Risks to the user:
Finally, a typical Squid transparent proxy has no way to ensure that the original IP destination address matches the requested host. Squid will only see the HTTP-layer request, and can only route that. Therefore such an ISP is, obviously, operating an open-proxy as far as its customers are concerned. Whether there are any qualitative security problems with this, I don't know, but you can gain some amount of plausible deniability by setting a known filtered host as your HTTP proxy (e.g. for browsing, or for BitTorrent) and sending your own X-Forwarded-For header with various other customer IPs. Squid will add your IP to the end of the X-Forwarded-For, but the recipient may not be able to tell which was the real one.
Recommendations:
Much of the investigation into this was done by an unnamed collaberator. Also, Richard Clayton's paper on BT Cleanfeed contains very useful information (e.g. techniques to find filtered hosts to test with). tcptraceroute and socat are very useful tools.
Updates: murb informs me that there are transparent-proxy patches for Squid and Linux. It's not clear to me whether these currently would allow Squid to match on the original destination IP though.
It appears the X-Forwarded-For should consistently point to the source behind the ISP pædo-sieve. That said, this header might not be logged by default by many systems; it's also conceivable that the code responsible for updating this header may contain exploitable defects.
( Feb 12 2009, 04:20:30 PM GMT ) Permalink Comments [0]To avoid some confusion: I have been on leave of absence from Sun for a good few months now. Other than, possibly, July-September of this year, my time is committed outside of Sun until mid-2010.
Further, I have few quality cycles to spare for Quagga. I do not regularly read the Quagga lists at this time. I'm not reviewing patches, never mind integrating them. Other people are going to have to take up the slack, I'm afraid.
( Feb 12 2009, 07:14:52 AM GMT ) Permalink Comments [0]Note to phone system operators: Relative format caller-ID is evil
Why, oh why, do phone operators configure their systems to send relative phone numbers as the caller-ID? Which muppet made that decision?
Ever since countries, under the ITU, agreed to a common international, direct-dial numbering standards, phone numbers share a global, hierarchical number-space. This number-space is rooted (notionally) at "+", and leads to numbers like "+44 141 bcd efgh". Such a number is "fully-qualified" and uniquely identifies some port to the phone system, globally. E.g. this example is a number in the UK, nominally in Glasgow (the 141 code). One little nit is that the "+" can translate to different codes in different systems. Thankfully many countries use "00"; further, GSM phones can just dial "+", it seems (exactly how this works I don't know, but it seems to work with multiple phones on multiple networks in various countries - presumably it's part of the standard).
It also possible to have "relative" phone numbers, which assume part of the prefix (aka Subscriber Trunk Dialling) - alternatively, you can think of this as "bottom up" or "local" dialling. E.g. if one is in the UK, there is no need to dial +44, just dial "0141 bcd efgh"; if one is in Glasgow (and certain other localities) there is no need to dial the 141 part, just dial "bcd efgh"; if one is dialing from a phone sharing the same bcd part as its primary number then that possibly can be dropped too (an exchange-local call, though the "bcd" part need not uniquely identify an exchange), just dial "efgh".
So basically a fully-qualified number can be dialled from any public phone system in the world, and it should just work, while a relative number only works in certain places. The problem is that many operators today send Caller-ID in relative form. E.g. if someone rings my UK mobile, I will see something like "078a bcd efgh" or "0141 bcd efgh" on my phone's display. This lead to various annoyances, not least:
All these problems would be avoided if phone networks just sent fully-qualified, E.164-form numbers as their Caller-ID. Relative numbers == inconvenient == fail == evil.
Even more helpful are BT: BTs landline network disallows E.164 dialling! You can't dial "+44 141 bcd efgh", (you can't even dial "0141 bcd efgh" iirc if in Glasgow). This is a big, stinking pile of FAIL if you've got a home VoIP/POTS router and you want it to dynamically select whether to dial-out via BT or via VoIP - you need to apply a bunch of horrid number-rewriting rules to stand a chance of this working, and lower-end VoIP/POTS routers might not even be capable of doing so. I can't understand why BT would disallow fully-qualified dialling to UK numbers.. BT just suck. (I think this works fine with Eircom in Ireland).
To summarise:
Craig Murray's new book, "The Catholic Orangemen of Togo", is now available (also available on Amazon, and other retailers). He's had to self-publish, as his original publisher withdrew due to legal threats from Tim Spicer. Craig has also made the book available to download for free via the internet:
The book is a prequal to "Murder in Samarkand", covering Craig's time in Africa from 1997 to 2001. It's the expected mix of intelligent observation and comments on far-flung countries, geo-politics particularly, as seen by a senior British diplomat; along with more personal tales of people, bravado and women (not always endearing). I pretty much couldn't put it down (electronically speaking) till I'd finished it! Both a very interesting and entertaining read.
( Jan 12 2009, 07:03:30 AM GMT ) Permalink Comments [0]I'm curious if there's any well-reasoned argument, preferably with data, that shows that constructing a giant conspiracy to get kids to believe that a fat man delivers presents to kids across the world, in one night, in a flying, reindeer-driven sled, is a good idea, does anyone know?
Maybe it's just me, but it confused me greatly that all these adults enjoyed lying to me to make me think men in obviously fake beards knew whether I was good or not.
These same adults who would tell me lying was bad and chastise me if I did so. That I was asked to believe in two different variants of the beardy-with-presents story1 didn't aid its credibility Also, it distressed my poor little sister, who was completely enthralled by the whole thing, terribly when she found out at age 4 or 5 or so that there was no Santa (it might have been me who told her - she hasn't forgiven me to this day).
Perhap's there's some parental joy to be had in deceiving your child so fully, that I havn't yet had the chance to experience.. Does this Santa conspiracy really serve children, or more the adults?
1. The other one, Sinter Klaas, was a bit less fantastical and more plausible though. He was a bishop and dressed like those funny old men you'd see in church sometimes; he arrived by boat from Spain, with a bunch of Moorish helpers; he'd arrive a week or more in advance; he travelled the country on a white horse traditionally, but would avail of modern transport; he left the presents on your front-door step on the evening of the 6th of December. A sceptical child could still imagine he distributed the presents in advance, with the parents colluding in the final delivery (my uncle always disappeared before the knock on the door, I noticed). Still, I couldn't quite understand why adults so enjoyed us believing in this strange beardy - but it didn't matter as long as I got presents.
( Dec 16 2008, 04:03:57 PM GMT ) Permalink Comments [5]Choosing better BGP MRAI values
The IETF IDR working group is hopefully in the process of deprecating the default MRAI values given in [RFC4271]. The below text is from a defunct draft I wrote on what is known about the MRAI, with minor modifications to reflect some of the discussion on IDR. It suggests that 5s would be a better value for the general Internet, BGP MRAI (1s or lower for iBGP). Posted here for archival purposes.
The proper functioning of the [BGP] routing protocol is of great importance to the Internet. Issues regarding matters of its stability and convergence have been documented widely, such as in [BGP-STAB], [bgp-converge] and [Potaroo0607].
One such issue is the effect of 'Minimum Route Advertisement Interval' (MRAI).
The Minimum Route Advertisement Interval (MRAI) timer is specified in [RFC4271]. This timer acts to rate-limit updates, on a per-destination basis. [BGP] suggests values of 30s and 5s for this interval for eBGP and iBGP respectively. The MRAI must also be applied to withdrawals according to [RFC4271], a change from the earlier RFC1771.
Some implementations apply this rate-limiting on a per-peer basis, presumably an adequate approximation. Some implementations apply it to withdrawal methods (often called "WRATE" in the literature). Some implementations do not apply MRAI at all.
The MRAI timer serves to suppress messages which BGP would otherwise send out to describe transitory states, and so allow BGP to converge with significantly fewer messages sent. This beneficial effect of the MRAI timer, in terms of # of messages, increases as the timer is increased until an optimum value is reached, after which the beneficial effect stabilises. [bgp-converge] [mrai-final]
In terms of convergence time, a similar beneficial effect is seen as the MRAI increases to near the same optimum value. However as the timer value is increased past this point, the convergence time increases again linearly. The scale of this increase is significantly worse with WRATE, i.e. applying the MRAI to withdrawals has an adverse effect on BGP convergence time. [bgp-converge] [mrai-final]
The optimum MRAI timer value is dependent on several factors, most particularly the topology in its layout and propagation times. The optimum value will differ between different subsets of the Internet. [mrai-final]
It is believed to be infeasible to try directly calculate this value. However a useful approximation can be made from the diameter of the topology if it is known, along with some assumptions about the the topology, such as the latency between nodes. [mrai-internet]
The interaction between extensions to BGP designed to improve convergence, such as those that allow propagation of additional and/or backup paths, and the MRAI timer is as yet unknown. However, it seems reasonable to speculate these extensions might have the effect of leading to a lower optimum MRAI than would be indicated by an approximation based on the diameter of the BGP topology. Further work on these questions would be useful.
As the MRAI helps eliminate some updates, it interacts with flap-damping. The lower the MRAI timer, the greater the risk of crossing below the threshold of the optimum value. If that threshold is crossed, there will be an increased number of updates somewhere within the BGP system, and hence an increased risk of paths being dampened which otherwise would not.
So, in presence of significant flap-damping deployment and given the uncertainty of what the optimum is, it is reasonable to err towards selecting a value of the MRAI timer significantly higher than the optimum.
However, given that flap-damping increasingly is discouraged [RIPE-378] in Internet routing, this particular need to be conservative in the choice of MRAI timer value may be less important.
The current recommended value of 30s may be far higher than is optimal for the Internet, based on observations of certain parameters related to its topology. In [mrai-final] it is suggested that the optimal value may be between 5s ('semi-safe') to 15s ('safe'). The estimation of the 'safe' value here is of no relevance if WRATE is universally deployed, as in such a case the 'semi-safe' value will be the same as the 'safe' value. Further empirical work by the same authors [mrai-internet] suggests that the optimal, Internet MRAI may be below 5s.
Further, [BGP-STAB] and [Potaroo0607] argue that operational conditions (e.g. different routers using different MRAI values) mean the MRAI is having an adverse effect even on the number of messages sent, and so further exacerbating convergence problems in the global BGP system, such as path hunting. The [BGP-STAB] document goes further still and argues that MRAI be deprecated in favour of some better way of damping BGP UPDATES, however there are no clear proposals before the IDR as of this writing for such changes to BGP.
Though there is an optimum value for the MRAI, it's unlikely that it can be determined empirically or otherwise for the general Internet. It may even not be possible, as the optimum MRAI will differ for different subsets of the Internet. Some degree of guesstimation at a reasonable value for the MRAI is required, which is an exercise in risk; whether to err towards fast convergence at the risk of a disproportionate increase in BGP messaging, or to err to the side of an optimal number of messages at the expense of convergence.
Arguably, economising on bandwidth and control-plane processing power is of less importance than the convergence time of BGP, compared to times past. Presuming this, any new recommendations for the MRAI should seek to err slightly to the side of convergence, rather than erring towards minimising BGP traffic.
Further, if we assume most implementations apply the MRAI to withdrawals, then the Internet BGP topology effectively is WRATE-enabled, and [mrai-final] suggests there is even less benefit to erring toward a higher MRAI.
There may be risks in mismatched MRAI values between speakers in an AS as revised MRAI values are deployed. The MRAI values in [RFC4271] were deliberately specified to be lower for iBGP than for eBGP, so as to allow interior routing to converge while minimising the effect on eBGP state. So with mismatched values there is an increased risk that the external stability of an AS's routes would be affected by transient, internal states.
This last risk suggests that the existing iBGP/VPN values should be the lower-bound for any conservative revision of the eBGP MRAI value.
The most definite risk of lowering the MRAI is the increased risk of flap-damping, if the value is set too much below the optimum. Therefore, taking into account estimations of that optimum is required. That said, at least one BGP implementation by default does not apply any MRAI at all.
The suggested default values for the MinRouteAdvertisementIntervalTimer given in [RFC4271] are almost certainly too high. The author agrees with [mrai-internet] and suggests that values 5s or less for eBGP connections, and 1s or less for iBGP connections, would be more suited to Internet topologies. Opinions here differ, some think these values err too low for safety and others think the MRAI timer should be removed altogether.
These values may not be suitable for topologies which differ from the Internet, be that in scale, arrangement or otherwise. Such non-Internet, BGP topologies likely would have lower optimum values, assuming they are always significantly smaller in scale than the Internet BGP topology.
The author would like to thank Manav Bhatia for his helpful review and comments; as well as Robert Raszuk, Samita Chakrabarti, Danny McPherson and Jeffrey Haas for their useful comments; dissenting or otherwise.
The authors of the cited documents are thanked for their contributions to the understanding of BGP, of which this document is a simple summary.
BGP: A Border Gateway Protocol 4 (BGP-4) Y. Rekhter, T. Li and S. Hares, January 2006.
BGP-STAB: BGP Stability Improvements, T. Li and G. Huston, June 2007.
BGP-DAMP: BGP Route Flap Damping, C. Villamizar, R. Chandra and R. Govindan, Nov. 1998.
Potaroo0607: Damping BGP G. Huston, June 2007.
RIPE-378: RIPE RRG: Recommendations on Route-flap Damping P. Smith, C. Panigl, May 2006.
bgp-converge: An Experimental Analysis of BGP Convergence Time T. Griffin and B. Premore, Nov. 2001, Proc. of ICNP pages 53-61.
mrai-final: An Experimental Study of the BGP Rate-limiting Timer J. Qui, R. Hao and X. Li, June 2003, Bell Labs Technical Memo ITD-03-44604H.
mrai-internet: The Optimal Rate-Limiting Timer of BGP for Routing Convergence J. Qui, R. Hao and X. Li, IEICE TRANS. Comm. Vol.E88–B, No. 4.
( Dec 16 2008, 07:21:28 AM GMT ) Permalink Comments [0]Why the Great-British-Pædowall is a dumb idea
It seems all my HTTP traffic to various sites, including wikipedia, blogger and blogspot, is being transparently rerouted through a Squid proxy by my ISP. This proxy checks the URIs against a block-list provided by the Internet Watch Foundation, and returns 403s for any URIs deemed to contain potentially illegal material - particularly any material of pædohilic interest.
Now, don't get me wrong, I'll stand in line with everyone else to condemn those depraved and sick people who would participate in the abuse of children. However it seems, in all the calls of "Won't someone think of the children?", that we've managed to throw the important principle of proportionality out the window by filtering everyone's internet. This is just a fantastically dumb idea, given the low efficacy of this system relative to the impact and risks imposed on law-abiding society generally.
It's still trivial to share illicit images via HTTP, as there's just no way the IWF can stay ahead of all the new images posted to all the various image and file sharing sites across the internet. Even if they could make a dent on HTTP file-sharing, there are various other protocols - some even designed specifically for encrypted sharing of files.
The impact of this filtering system on generally, law-abiding users:
The filtering proxies must maintain state on all active HTTP requests, which becomes easily available (at a minimum) to employees of the ISP. Further, poor security practices and/or configuration mistakes can allow this information to be viewed by others (e.g. all customers of the ISP, as was the case for quite a while with at least one UK ISP). Obviously, historical logging of requests is trivial to enable.
Systems deployed today to filter out child-abuse images may tomorrow be appropriated for less universally welcomed purposes. E.g. why stop at child-abuse, why not track copyright infringers (another horseman of the internet apocalypse)? This infrastructure could very quickly be appropriated for more sinister purposes.
Basically:
1. Even funnier/tragic is that the IWF blacklisted an article URI, rather than the URI of the offensive image - so the latter is still viewable (e.g. use google cache to view the article). Basically, the UK have made people who seem ignorant of how the internet works the gatekeepers to it.
Update: There's a really interesting thread at UKCrypto about this - posts by Clive Feather and Peter Sommer are particularly interesting (thanks to murb).
There's another aspect to all of this: As the IWF, though governmentally-recognised, are a private organisation they are not covered by the Freedom of Information Act. So this system is completely out of the reach of the powers of oversight available to the general public, despite it having been put in the place at the behest of the government, by the threat of regulation.
Also, its worth noting that one possible argument for the efficacy of this system is that it protects ordinary people from accidentally being exposed to this material. However this argument appears to be struck down by, apparently learned, commentators in the above discussion who point out that is extremely rare (to the point of being almost unheard of) to accidently stumble on child abuse images.
Certainly, in my experience of using the internet for 14 odd years, I don't recall ever seeing anything approaching such. At least, not until the IWF managed to publicise a certain image on Wikipedia...
( Dec 09 2008, 03:16:24 PM GMT ) Permalink Comments [1]Path-hunting is the tendency of BGP, when a path is withdrawn by a remote AS, to "hunt" from one path to another, before finally converging. The problem has been described elsewhere, particularly by Geoff Huston (e.g. in this ISOC column on BGP dampening). Manav Bhatia also has a nice, graphic explanation of BGP path hunting. Tony Li and Geoff Huston additionally have authored a very interesting draft on the problem, draft-li-bgp-stability - which analogises BGP convergence with wave-front propogation, where the "front" of convergence bends with differences in path-propogation times, and expands out where ASes connect with multiple others.
This problem is very similar to[5], if not a special-case of, the well-known "count to infinity" problem inherent to the Bellman-Ford, distance-vector protocols. With infinity being constrained, by BGPs built-in split-horizon behaviour, to the number of simple paths in the topology.
Path Hunting overview
In short: given some converged subset of the BGP topology, with A and B connected by a set of one or more distinct paths {1, ..., n}, with the time taken for a message to propogate down path x of t[x] so that the set of paths has a directly corresponding set of propogation times T = {t[0], ..., t[n]}. Then for any announcement which passes through A to B, BGP will reach a "useful convergence"[1] by or within a time of:
The latter convergence time being due to the 'hunting' behaviour of BGP, and indeed consistent with the well-known "Good news travels quickly, bad news travels slowly"[3] behaviour of distance-vector protocols.
E.g. imagine A announces some prefix to C, and imagine (for simplicity) that B does not re-announce paths to E or H, with simple paths between A and B of {ABC,ACDEB,ACDFGHB,ACDFGEB,ACDEGHB}. I.e.:
-A--C----------B
\ / \
D------E H
\ \ /
F------G
So after a time of minimum(T), B will have received at least one UPDATE corresponding to a valid path and would be able to forward packets toward A for the prefix. Though the system is not converged, any further UPDATEs B receives can only cause it to select a better path, so the system is at least "usefully-converged" - it can route packets.
After a time of maximum(T), the state at each speakers will be fully converged, and B may then have the following paths in its RIB (arbitrarily making assumptions about preferences speakers may have), for whatever prefix it is, from among which B may pick its preferred path:
Note that B does not know about the ACDFGEB or ACDEGHB paths, however if the DE edge were withdrawn, E potentially could switch to G (and issue an update to B when it does so) and similarly with the FG edge and latter path. B can pick anyone of the above as its preferred path. I.e. though these paths are hidden from B, they can still come into play.
Imagine A now withdraws the prefix from C - so the connectivity to the prefix is now lost as far as the rest of the graph goes. After a time of minimum(T) B will receive at least one message (that might be a withdrawal from C, but it could just as well be an UPDATE from E as E hunts its preferred path over to G). Imagine it happens to come from the peer which it had selected as best, and results in B picking sa new best path via another peer (e.g. because its a withdrawal, or an UPDATE with less favourable path-attributes). Then the next update arrives from the peer newly preferred, and so B picks again, and so on. So B's best-path may well change between the following best-paths:
I.e. before B can converge on "there is no path", it potentially will first "hunt" through various paths that are already dead[2]. Yet, B *did* receive a message which was triggered by As withdrawal - several in fact! If only B could use the first message to see that, as it was A which withdrew the prefix, that therefore all paths via A must be dead, then it could short-circuit the whole hunting process!
Possible Solutions
Huston and Li propose various possible changes to BGP to mitigate hunting, such as removing the MRAI timer, band-stop filtering of updates, path-exploration damping, etc, all of which are implementable without change to the BGP protocol itself. Other proposals include Brian Dickson's second-best-backup proposal which seeks to speed up convergence by providing visibility of additional paths besides the best-path (such as the ACDFGEB path which is initially hidden to B in the example above), and also to provide a fast-path for withdrawals (this proposal requires changes to BGP). None of these proposals actually solve path-hunting, they instead try to either filter it out and/or speed it up.
Hunting can eliminated in BGP if it could be extended with mechanisms to allow BGP to recognise nodal changes in state, across paths. Other DV protocols (DSDV, AODV) solve similar problems by numbering messages/routes with a sequence-number. This sequenced-DV approach is problematic with BGP as it tends to require either that routers in an AS somehow maintain a shared counter, or that BGP must describe a more detailed router-router topology (rather than todays abstract AS-AS topology).
Some possible mechanisms that eliminate hunting:
A possible means to allow for convergence in O(minimum(T)) for all kinds of state-changes is described by Pei, Azuma, Massey and Zhang in "BGP-RCN: improving BGP convergence through root cause notification". Speakers include an additional attribute of information with updates/withdrawals, called the "Root Cause Notification" consisting of a list of (ASN,sequence number) tuples. This allows a receiving speaker to compare announcements received via different paths and determine relative recency, so allowing it to always act on the most recent information.
The sequence number dependency is a significant barrier to deployment.
This is essentially a simplified version of BGP-RCN, that does away with the problematic sequence number. When withdrawing a prefix, a speaker can include information that "poisons" nodes (or even edges). E.g. if A sends some kind of "Poison: A" attribute with its withdrawal, and if that attribute will propogate along with the withdrawal, then B can conclude from the first withdrawal it receives that all paths that go via A are defunct. BGP can now converge on a prefix-withdrawn state in minimum(T). The downside being that convergence to the prefix-announced state instead occurs in maximum(T) - though it is possible that BGP can be "usefully converged" prior to that, which would be a more useful state than path-hunting.
So we can "fix" path-hunting, and maximise withdrawal convergence without introducing any co-dependent state with only mild changes to BGP, though at some expense to announcement convergence.
A further observation to be made is that IGP routing has switched from distance-vector to link-state (or "map-distribution") because of the same convergence limitations. Further, there are proposals for BGP security extensions which overlay a link-stateish topology onto BGP, and the BGP-RCN approach is not far from a link-state protocol either. So it is perhaps not completely unreasonable to imagine whether some kind of topological-state mechanism could be bolted onto BGP, that could be used to augment the path-selection process.
So...
Solutions to the hunting problem either are imperfect local-hueristics (which may even make convergence worse in cases where they fail), or protocol extensions with fairly significant trade-offs (either in complexity or adverse side-effects). It seems unlikely there's any magic bullet, given the length of experience we have with Bellman-Ford/DV protocols.
--
Corrections and comments would be appreciated.
Update: Added footnote 5.
1. "Useful convergence" being where speakers all have converged on some particular state, OR are in a certain semi-converged state such that any further messages still propogating can only cause speakers to transition their routes from an existing, valid path to another valid path (this can not be the case if there are any withdrawal messages still to propogate). I.e. forwarding remains loop and blackhole free.
2. This is a problem as it means that if, outside of the subset we're consider, B has other, less-preferred connections, then B will have to wait max(T) before it can decide there are no valid paths in this subset and start to consider paths outside of this subset.
3. Does anyone know the original source of this? (Douglas Comer, "Internetworking with TCP/IP", Prentice-Hall, 1991, seems to quote it).
4. Described only in private correspondence, to date.
5. In the sense that the cause of the count-to-infinity problem was due to the source event of a routing update being indeterminable. Split-horizon cut-out the count-infinity problem for direct loops between neighbours, by preventing updates from propogating endlessly. The addition of a record-of-path (e.g. AS_PATH in BGP) solved count-infinity for multi-hop loops, by preventing updates from propogating along any but simple paths. The remaining problem is of updates travelling along multiple, parallel paths with different propogation times, preventing us from recognising which message received is the most recent - fixed with sequence numbers in some DV protocols. Each modification has pared down DV protocols slightly, restricting where valid information can flow and how it is recognised.
( Aug 25 2008, 09:14:38 PM IST ) Permalink Comments [2]Tomorrow's big story: The piano player and the kid mimed their performance too!
From the coming weekend's Sunday broadsheets: In-depth expose interviews historian who claims to have evidence that the archer-lights-cauldron scene at the Los Angeles olympics was faked too!!!!!
News just in from 2012: All theatrics deemed fake by IOC. Opening ceremony in London to consist of a reading of selected DNA sequences by Richard Dawkins, and the sotto voce proving of theorems from "Principia Mathematica" by the Froncysyllte Male Voice Choir, followed by some sport.
( Aug 13 2008, 09:24:33 AM IST ) Permalink Comments [1]Wholesale DSL reform: Possibilities...
Following on from my, somewhat rambling, post on wholesale DSL, here I'll speculate on possible reforms that could be made.
Alluded to as one of the "difficult to do" solutions. It's not at all viable today, but its interesting to consider what improvements would need to be made to internet protocols for this option to at least be possible. E.g. BGP is extremely management intensive. Even simple, non-protocol changes like better defaults could make a difference here. In terms of protocol, things like in-band transfer of policy information and auto-configuration of neighbours could simplify management greatly.
In other words, allow the telco-ISP run a single IP data-plane at the head-end for its own DSL customers, so allowing local switching of traffic within an exchange, if the telco-ISP so desires. At the moment, regulations would tend to prevent the telco discriminating between its wholesale-ISP customers and its own ISP in such a way. If it's the case that there are enough trunk-link bandwidth savings to be made to outweigh the initial costs of re-engineering and further additional managements costs, then it'll happen (at which point there'll be the tricky problem of what to do about the wholesale-ISPs not enjoying this advantage). If not, we'll know it doesn't matter - no harm done. I.e. loosen the regulations so that we can at least allow one player to obtain data on whether we need to re-think whole-sale DSL..
This probably wouldn't work in the UK, where BT appears pretty well divided up. Might still work in Ireland, where Eircom appears still to be a single entity.
The benefits of the wholesale-DSL model in terms of the economic-efficiency of the network are actually negative[2]; both in terms of the congruence of IP/geography and in terms of the significant additional complexity and points-of-failure[1], which incur direct costs (engineering and sustaining) and indirect costs (more downtime for customers). Further, managing an IP network is well-understood, and the reliability of the whole-sale ISPs is reliant on the operational competence of the telco.
The benefits of the competition introduced by the wholesale-DSL model therefore must lie primarily in customer-facing services. Therefore we could reform things by allowing the telco to manage the entire network stack (terminating PPP where it wished, assigning IPs, etc..), while leaving customer-services and further value-added services (domains, email accounts, etc..) to the "ISPs".
1. E.g. in Ireland, the few (4 odd?) "BRAS" routers operated by Eircom, which exist to dispatch PPPoE sessions to the various ISPs, via L2TP/IP. The wholesale-ISP must operate their own access-routers in addition to the "BRAS" - with obvious implications for failure rates. In a simpler model, such dispatching wouldn't be needed and PPP sessions could be terminated on cheaper (and so more numerous and perhaps more widely distributed) access-routers.
2. As per previous post: determined by comparison with single-operator networks.
( Aug 10 2008, 05:41:30 PM IST ) Permalink Comments [0]