[RPKI-Discuss] [Community-Discuss] 06 April 2019 RPKI incident - Postmortem report

Thu Apr 11 10:44:40 UTC 2019

Hi Owen,

On 2019-04-10 21:39:47+02:00 Owen DeLong wrote:

If I understand it correctly, however, ALL of the RIRs mirror all of the other RIRs' data, so you should be able to get a complete set of data from any RIR. The TAL is very long term cacheable (at least 30 days).

The TAL is good for as long as the validity period of the cert that it points at, which in this case is until 2027. It also contains only the public key, not the certificate fingerprint, so if that cert is rolled with the same keypair, the TAL is still good.
The RIRs mirror each other's IRR data, but not each other's publication mirrors (as far as I know), but that's not relevant either...
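
To illustrate the point about the TAL carrying a public key rather than a certificate fingerprint, here is a toy sketch in Python. It does no real X.509 or TAL parsing: the TrustAnchorCert and Tal classes, the tal_accepts function, the URI, the key bytes, the serial numbers, and the exact dates are all made up for illustration. Only the comparison logic is the point.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TrustAnchorCert:
    """Toy stand-in for a trust anchor certificate (hypothetical fields)."""
    public_key: bytes   # stands in for the subjectPublicKeyInfo
    serial: int
    not_after: date


@dataclass(frozen=True)
class Tal:
    """Toy stand-in for a TAL: where to fetch the cert, and the expected key."""
    uri: str
    public_key: bytes


def tal_accepts(tal: Tal, cert: TrustAnchorCert, today: date) -> bool:
    # The TAL pins the key, not the whole certificate, so a reissued
    # certificate with the same key still matches. The certificate
    # itself must still be within its validity period.
    return tal.public_key == cert.public_key and today <= cert.not_after


# Hypothetical values throughout.
tal = Tal(uri="rsync://rpki.example.net/ta/root.cer", public_key=b"key-A")
original = TrustAnchorCert(b"key-A", serial=1, not_after=date(2027, 1, 1))
rolled_same_key = TrustAnchorCert(b"key-A", serial=2, not_after=date(2030, 1, 1))
rolled_new_key = TrustAnchorCert(b"key-B", serial=3, not_after=date(2030, 1, 1))
today = date(2019, 4, 11)

print(tal_accepts(tal, original, today))         # True
print(tal_accepts(tal, rolled_same_key, today))  # True: same keypair
print(tal_accepts(tal, rolled_new_key, today))   # False: new TAL needed
```

A fingerprint-based TAL would have rejected the second case as well, since any reissue changes the certificate bytes.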

What happened in this instance wasn’t an issue of the TAL becoming unavailable, but the TAL (and related certificates) going past their expiration date without being renewed.

Correct. I didn't say unavailable. I said unverifiable.

Further, absent inter-RIR IPv6 transfers (which to the best of my knowledge are only permitted from RIPE to RIPE), that circumstance cannot occur.

Assuming no human error in RIR operations - but see for example this thread. If we're doing cryptographic guarantees, then let's do it!

This is most certainly a corner case, but is at least theoretically possible given that all RIRs claim 0/0 in their root certs.
Only if Inter-RIR IPv6 transfers are permitted. So, this is, in fact, a valid argument against a current ARIN proposal to allow such transfers, but in the current instance, it’s not actually a concern.

That's a stretch, but I suppose so. I'd say it's an argument against claiming default for the sake of convenience.

A more likely scenario is that an existing mis-origination that is being dropped as Invalid suddenly becomes Not Found, and wins path selection, thereby misdirecting traffic.
I did mention that as a possible consequence below.

Yes, I see you did.

I'm not aware of any such cases on our network from this last outage, but it's possible that they went undetected. The likelihood of this case increases substantially as more operators begin to drop Invalids.
My point earlier was that it actually shouldn’t. If there is growth in the number of rejected routes from RPKI, that indicates that operators aren’t doing their due diligence when that happens. A route should only be dropped by RPKI for a very short period of time while the originating ASN and/or upstreams of the originating ASN are contacted to cease and desist the announcement of this route.

I don't think that's the likely scenario at all. I think we'll see an increasing number of persistent accidental mis-origins, which don't get cleaned up because they're being discarded by people doing OV, and therefore not causing an outage anywhere - until the covering VRP disappears.
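
For anyone following along, the Invalid to Not Found transition can be sketched in a few lines of Python. This is a simplified reading of the origin validation procedure in RFC 6811, not any validator's actual code. The VRP class and validate function are names I made up, and the prefix and AS numbers are documentation values.

```python
from dataclasses import dataclass
from ipaddress import ip_network
from typing import Iterable


@dataclass(frozen=True)
class VRP:
    """Validated ROA Payload: prefix, max length, authorised origin AS."""
    prefix: str
    max_length: int
    asn: int


def validate(route_prefix: str, origin_asn: int, vrps: Iterable[VRP]) -> str:
    """Return 'Valid', 'Invalid' or 'NotFound' for a route."""
    route = ip_network(route_prefix)
    covered = False
    for vrp in vrps:
        vrp_net = ip_network(vrp.prefix)
        # A VRP covers the route if the route is equal to, or more
        # specific than, the VRP prefix (same address family only).
        if route.version != vrp_net.version or not route.subnet_of(vrp_net):
            continue
        covered = True
        # A covering VRP matches if the length and origin AS also agree.
        if route.prefixlen <= vrp.max_length and vrp.asn == origin_asn:
            return "Valid"
    # Covered but never matched is Invalid; not covered at all is NotFound.
    return "Invalid" if covered else "NotFound"


vrps = [VRP("192.0.2.0/24", 24, 64496)]

# Mis-origination by AS64511 while the covering VRP is published.
print(validate("192.0.2.0/24", 64511, vrps))  # Invalid
# Same announcement after the VRP disappears from the validated set.
print(validate("192.0.2.0/24", 64511, []))    # NotFound
```

Nothing about the announcement changes between the two calls. Only the VRP set does, which is why a mis-origin that has been silently discarded for months can reappear the moment the covering VRP goes away.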

Becoming bestpath in a densely-interconnected network using a forged-origin hijack in the face of a ROA that has all its covered prefixes in the DFZ is actually not trivial, and often not possible, because you lose on path-length.
If you’re forging a major provider’s origin AS, sure. If you’re forging a remote single-homed customer of a small rural ISP that buys from a reseller of a reseller of a transit backbone, it gets fairly trivial if you can find a cooperative (or inattentive) network close to a major IX (not so hard currently).

Too true. The solution, of course, being for networks close to that stub AS performing OV.

This is even more true in networks that filter peers and customers by prefix, as you have less chance of exploiting a higher local-pref to overcome path-length.
As a result, the protection that OV offers is not meaningless.
Yes, but you only need to tie the path length, and you don’t necessarily need to hijack 100% of traffic to achieve a useful result, depending on the goal of your particular hijack.
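
The path-length argument on both sides can be sketched with a heavily simplified bestpath comparison in Python. Only the first two decision steps are modelled (higher local-pref, then shorter AS path). Real implementations have many more tie-breakers, and the Path class, the best_paths function, and all AS numbers and local-pref values here are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Path:
    name: str
    local_pref: int
    as_path: Tuple[int, ...]  # leftmost is the neighbour, rightmost the origin


def preference_key(p: Path) -> Tuple[int, int]:
    # Higher local-pref wins first, then shorter AS path.
    return (-p.local_pref, len(p.as_path))


def best_paths(paths: List[Path]) -> List[Path]:
    """Return every path that survives the two modelled steps."""
    best = min(preference_key(p) for p in paths)
    return [p for p in paths if preference_key(p) == best]


# AS64496 is the legitimate origin; AS64511 forges it by appending
# 64496 to its own announcement, which costs one extra hop.
legit = Path("legit", 100, (64500, 64496))
forged_far = Path("forged-far", 100, (64501, 64511, 64496))
forged_tied = Path("forged-tied", 100, (64511, 64496))
forged_customer = Path("forged-customer", 200, (64501, 64511, 64496))

print([p.name for p in best_paths([legit, forged_far])])
# ['legit']: the forged path loses on length
print([p.name for p in best_paths([legit, forged_tied])])
# ['legit', 'forged-tied']: a tie, decided by later tie-breakers
print([p.name for p in best_paths([legit, forged_customer])])
# ['forged-customer']: higher local-pref overrides path length
```

The third case is the one that prefix filtering of peers and customers takes away, and the second is the tie that is enough to attract a share of traffic.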

OV is also a prerequisite for validating the entire path against stated policy, various mechanisms for which are currently WIP in SIDROPS and GROW.
Yes, but none of those has much promise of producing a useful result in the foreseeable future.  Once they do, then I might consider this serious vs. merely operational.

Note, I'm not talking about BGPsec here - I'm yet to be convinced that'll be a thing in my ops-lifetime. But things like AS-Cones and ASPA could be operationalised pretty fast in the scheme of things.

Nonetheless, even if one wants to take RPKI seriously, a quick review of the RFCs and IETF guidance on the matter shows that the worst case scenario for an RIR outage on ROA publication should be that routing reverts to its pre-RPKI unauthenticated state. It should not cause any sort of outage (except to the extent you might start accepting routes you previously rejected).
If you’re rejecting routes for RPKI validation failure, you should be tracking down the advertisers and getting those situations corrected. If you’re doing that, then any such outages should be somewhere between minimal and non-existent.
If issuing party tools are unavailable to resource holders, they will be unable to effect the correction.
True. But it’s going to take longer to get to the point of being ready to make the correction than my understanding of the duration of this outage.

I don't get this, explain pls?

Did any packets go the wrong way due to the AfriNIC outage? Was there any actual operational impact?
I suspect not. I suspect that this is a lot of handwaving about a non-issue.

I suspect some, though certainly few. The same incident at a time when more networks are performing OV could look very different, and I'm simply suggesting that we look closely at the options before that happens.
As I understand it, AfriNIC has already modified their human procedures to address exactly this. What am I missing?
I suspect no actual packets were harmed in the making of this email thread.

Possibly nothing. All I'm saying is this is a complex area that warrants discussion between operators.

Don’t get me wrong, I’m all for making AfriNIC’s systems more resilient and more available, but, I think we also need to consider the actual impact of failures and not over-react to failures without impact.
Based on the information in the post mortem, it does not look like a systems failure, but purely human error. Taking the humans out of the loop on that monthly maintenance would involve compromising the integrity of the private key and thus reduce the validity of the RPKI data. As such, I’m not convinced that there is a problem here to solve beyond the procedural changes that AfriNIC says they have already implemented.

I'm not arguing for any specific change. I would point out that when there are humans in the system, then a human failure *is* a system failure, and we should be clear on what failure modes can and cannot be tolerated.
Yep… And the design specs of RPKI call for this kind of failure to be tolerable. If that’s not acceptable, then go address the RFCs and redesign RPKI to be more resilient.

If making RPKI stable in operations requires a protocol change, that's fine by me. I don't think anyone (myself included) quite understands what that needs to look like yet. Hence this discussion.

Cheers,

Ben

Owen