Some months back I spoke about setting my friend and partner in crime up on a pfSense router. Part of the reason for doing this was that he had dual WAN links and we could put them into a redundant configuration. I stress now, this wasn’t the primary reason, more of an added bonus. It was therefore with some surprise that I got the following text message on Tuesday:
And my internet is down
My first thoughts were that there had either been a power cut or some muppet had put a digger through the BT wiring. So, away from my computer, with some trepidation I called back ready to hear the worst. As it turns out, everything was normal: the modems had DSL sync and the router was up and running normally. The only problem was that there was no internet. I then went into fault-finding mode: down the phone, working from memory, talking with somebody not familiar with pfSense, all the while trying to interpret system logs out of context. In the end I determined that:
- Both WAN interfaces were down and disabled
- PPP was failing to connect
- The apinger (Alarm Pinger) service was shut down because there were no valid targets
None of this was good. Bearing in mind we were working on the basis that there were redundant WAN links, how could both of them be offline at the same time? Our minds then jumped to the question of what if there’s a problem with pfSense: how can you support a business with a router that throws a wobbly sometimes? At this point we moved to the next step and I dispatched my friend to go and find a spare BT router in a box, while I made my way back to a computer. As I got back to my computer, something caught my eye in my news feed: an article from The Register about BT Broadband issues. *Click.* A quick read later, I messaged my friend back saying not to worry, it looks like a BT issue. Sure enough, later that night, the Internet sprang back to life and all was well with the world. BT for their part put the outage down to an issue with one of their core routers.
From this incident I took away a couple of key experience points. None of this is earth-shatteringly new, but sometimes a gentle reminder is needed:
Perfect redundancy cannot be achieved…
Achieving resilience through redundancy is about eliminating single points of failure. Adding redundant components at some levels is easy; at others, it’s surprisingly difficult. Things under your control, such as computers and the local network, are easy enough. Computers can be given redundant power supplies and network cards. Switches and routers can be configured for high availability. As soon as you leave the area under your control, everything changes.
Take the example of my friend’s setup. Yes, he has dual WAN links, but they run over two pairs in the same cable. They go to the same DSLAM, and on to the same exchange. Therefore any physical damage to the cabling, street boxes, or exchange will take down both connections. The solution would be to have dual connections to separate exchanges via separate street boxes over separate cables in separate ducts. Even then, in the case of this outage, the Internet would still have gone down, since the problem was with the ISP. The solution here is to use a separate ISP, although that won’t stop all problems. One option would be to use a different type of connection such as satellite or 3G/4G; however, this also presents problems.
… and if it could, you wouldn’t pay for it!
In the IT world, we use the formula ALE = SLE x ARO to calculate the cost of a failure risk. The ALE (Annualised Loss Expectancy) is a measure of how much a system failure will cost per year. It is calculated by multiplying the SLE (Single Loss Expectancy) by the ARO (Annualised Rate of Occurrence). The SLE is the cost of a single failure, while the ARO is the number of times such a failure can be expected in a year.
Take as an example an outage that costs you £2000 every time it happens, and that happens once every two years. Here the SLE is £2000 and the ARO is 0.5, or half an occurrence per year, so the ALE is £1000. This means that it would be worth spending at most £1000 per year to mitigate the risk using a redundant link. In reality, spending 100% of the ALE would be foolish, since the risk might not crystallise. The question then becomes: how much is worth spending? The European Network and Information Security Agency (ENISA) published a document, Introduction to Return on Security Investment, which sets out a formula for calculating the return on security investment (ROSI) as a percentage. This formula is ROSI = (ALE Reduction - Mitigation Costs) / Mitigation Costs.
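The ALE arithmetic above is simple enough to do in your head, but a minimal sketch in Python makes it easy to try other figures. The function name is my own invention for illustration, and the numbers are just the ones from the example:

```python
def annualised_loss_expectancy(sle: float, aro: float) -> float:
    """Return ALE = SLE x ARO.

    sle: cost of a single failure (Single Loss Expectancy)
    aro: expected number of failures per year (Annualised Rate of Occurrence)
    """
    if sle < 0 or aro < 0:
        raise ValueError("SLE and ARO must be non-negative")
    return sle * aro


# Figures from the example: £2000 per outage, one outage every two years.
ale = annualised_loss_expectancy(sle=2000, aro=0.5)
print(f"ALE: £{ale:.2f} per year")
```

Plugging in your own outage cost and frequency gives the upper bound on what a mitigation could sensibly cost per year.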
Assume then that the mitigation is a satellite connection, which costs £50 per month and will prevent the network from going down. The ALE reduction is £1000 because you won’t have any further outages, while the mitigation cost is £600 per year. The ROSI is therefore £400 / £600, or about 67%. This means that you’re spending £600 per year to avoid an expected £1000 of losses, a net saving of £400, so this particular mitigation pays for itself. Working the formula in reverse, the break-even point (a ROSI of 0%) is where the mitigation costs equal the ALE reduction, so anything costing more than £1000 per year is redundancy you wouldn’t pay for. If we want the investment to return at least 100%, the mitigation costs can be no more than half the ALE reduction, so in this case we wouldn’t want to spend more than £500 per year on a redundant solution.
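For anyone who wants to play with the numbers, here is a rough sketch of the ROSI formula in Python. The function names are hypothetical, and `max_cost_for_target` is simply my own rearrangement of the formula to find the largest annual spend that still achieves a chosen ROSI; it assumes, as the example does, that the mitigation removes the whole ALE.

```python
def rosi(ale_reduction: float, mitigation_cost: float) -> float:
    """ROSI = (ALE reduction - mitigation cost) / mitigation cost, as a fraction."""
    if mitigation_cost <= 0:
        raise ValueError("mitigation cost must be positive")
    return (ale_reduction - mitigation_cost) / mitigation_cost


def max_cost_for_target(ale_reduction: float, target_rosi: float) -> float:
    """Largest annual mitigation cost that still achieves target_rosi.

    Rearranging the ROSI formula gives cost = ALE reduction / (1 + target).
    """
    if target_rosi <= -1:
        raise ValueError("target ROSI must be greater than -100%")
    return ale_reduction / (1 + target_rosi)


# Figures from the example: £50 per month satellite link, £1000 ALE reduction.
annual_cost = 50 * 12
print(f"ROSI: {rosi(1000, annual_cost):.1%}")
print(f"Max spend for 0% ROSI: £{max_cost_for_target(1000, 0.0):.2f}")
print(f"Max spend for 100% ROSI: £{max_cost_for_target(1000, 1.0):.2f}")
```

A ROSI of 0% means the mitigation costs exactly what it saves, while 100% means it saves twice what it costs.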