[RPKI-Discuss] [Community-Discuss] 06 April 2019 RPKI incident - Postmortem report

Wed Apr 10 18:29:05 UTC 2019

Hi Ben, all,

I’m glad that we are finally seeing some discussion on this list. The community had been rather quiet on RPKI until the incident.

As previously mentioned, the AFRINIC RPKI system has two CAs: one Offline CA and one Online CA. The CRLs and MFTs of the Online CA are refreshed automatically on a daily basis. However, the Offline CA, which is also the Trust Anchor (TA) CA, needs manual intervention to refresh its CRLs and MFTs, and this is done on a monthly basis. The nature of the Offline CA is such that it cannot (SHOULD NOT) be remotely accessible for automated services.

This would change if one day there were a single trust anchor; each RIR would then no longer need to maintain an offline CA.

The only way to fix human errors is through dedicated processes and proper safeguards.
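
As an illustration of what such a safeguard could look like, here is a minimal sketch of an expiry check that flags CRLs or MFTs whose nextUpdate is approaching, so that a manual refresh is not missed. The object names, dates and the 7-day margin are hypothetical and are not taken from the AFRINIC setup; a real check would read nextUpdate from the published objects.

```python
from datetime import datetime, timedelta, timezone


def needs_refresh(next_update, now, margin=timedelta(days=7)):
    """Return True when an object expires within `margin` of `now`."""
    return next_update - now <= margin


# Hypothetical reference time and nextUpdate values, for illustration only.
now = datetime(2020, 1, 1, tzinfo=timezone.utc)
objects = {
    "offline-ca.crl": datetime(2020, 1, 5, tzinfo=timezone.utc),
    "offline-ca.mft": datetime(2020, 1, 5, tzinfo=timezone.utc),
    "online-ca.mft": datetime(2020, 1, 20, tzinfo=timezone.utc),
}

# Names of the objects that should trigger a reminder or an alarm.
due = sorted(name for name, nu in objects.items() if needs_refresh(nu, now))
print(due)
```

Run daily from monitoring, a check of this kind would raise a reminder well before a manually refreshed object goes stale.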

> On 10 Apr 2019, at 18:00, Ben Maddison via RPKI-Discuss <rpki-discuss at afrinic.net> wrote:
> 
> That's not entirely true. A partial outage (where for example a single TAL becomes unverifiable, as in this case) may lead to a missing ROA for a prefix that remains covered by other ROAs issued under other TALs.
> 
> Consider ROAs:
> {prefix: 2001:db8::/32, maxLength: 48, asn: 65000, tal: AFRINIC}
> {prefix: 2001:db8:f00::/48, maxLength: 48, asn: 65001, tal: RIPE}
> 
> With the above, a route 2001:db8:f00::/48 via 65000_65001 will have a status Valid.
> If the RIPE TAL fails verification, it will become Invalid.
> 
> This is most certainly a corner case, but is at least theoretically possible given that all RIRs claim 0/0 in their root certs.

That assumes the /48 was transferred from AFRINIC to RIPE. I believe that’s why somebody recommended using minimal ROAs :). In that case, if the RIPE TAL fails and no remaining ROA covers the prefix, the route 2001:db8:f00::/48 originated by 65001 would simply be tagged as NOT-FOUND.
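
To make the outcomes concrete, here is a small sketch of route origin validation along the lines of RFC 6811, applied to the two ROAs from Ben's example. The `Roa` class and the `validate` and `without_tal` helpers are names made up for this illustration, not part of any validator; real relying-party software does considerably more.

```python
from dataclasses import dataclass
from ipaddress import ip_network


@dataclass(frozen=True)
class Roa:
    prefix: str
    max_length: int
    asn: int
    tal: str


def validate(route_prefix, origin_asn, roas):
    """Return 'Valid', 'Invalid' or 'NotFound' for a route and its origin AS."""
    route = ip_network(route_prefix)
    covered = False
    for roa in roas:
        roa_net = ip_network(roa.prefix)
        # A ROA covers the route when the route prefix lies inside the ROA prefix.
        if route.version != roa_net.version or not route.subnet_of(roa_net):
            continue
        covered = True
        # A covering ROA matches when the length is allowed and the origin agrees.
        if route.prefixlen <= roa.max_length and roa.asn == origin_asn:
            return "Valid"
    return "Invalid" if covered else "NotFound"


def without_tal(roas, tal):
    """Simulate a TAL failing verification by dropping all of its ROAs."""
    return [r for r in roas if r.tal != tal]


ROAS = [
    Roa("2001:db8::/32", 48, 65000, "AFRINIC"),
    Roa("2001:db8:f00::/48", 48, 65001, "RIPE"),
]

all_tals = validate("2001:db8:f00::/48", 65001, ROAS)
ripe_down = validate("2001:db8:f00::/48", 65001, without_tal(ROAS, "RIPE"))
no_cover = validate("2001:db8:f00::/48", 65001, [])
print(all_tals, ripe_down, no_cover)  # Valid Invalid NotFound
```

Note that in RFC 6811 coverage depends on the prefix alone and not on maxLength, so the route only falls back to NOT-FOUND when no remaining ROA covers the /48 at all.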

> I'm not arguing for any specific change. I would point out that when there are humans in the system, then a human failure *is* a system failure, and we should be clear on what failure modes can and cannot be tolerated.

Agreed. As long as the system isn’t made 100% available, secure and resilient, such issues are bound to happen.

Just to give some context, AFRINIC RPKI has been operational since 2011, and we are currently at around 2.85% IPv4 coverage and 1.22% IPv6 coverage, with 114 activated RPKI engines. We are the slowest region in terms of adoption. I believe this is changing [1], and I hope that the demand for more robust RPKI services will translate into additional, dedicated resources for this critical service.

Cheers,
Amreesh

[1] https://afnog.org/pipermail/afnog/2019-April/003681.html