We’ve posted before about what happens when things go wrong on GOV.UK, and about how we classify and prioritise incidents.

Every incident teaches us something new about our technology or the way we communicate with each other. Each one also gives us the opportunity to improve our incident management process so we can minimise disruption to users in the future.

In May 2016 we committed to blogging about every severity 1 or severity 2 incident, as well as severity 3 incidents if they’re particularly interesting.

This post is a roundup of 3 incidents that GOV.UK encountered between May and June 2016.

31 May 2016 – problem uploading attachments using the Whitehall publishing application

What users saw

Between 2:30pm and 6pm, attachments uploaded by government publishers and editors appeared to be stuck in virus scanning. Users of the GOV.UK website were not directly affected. This was a severity 3 incident.

Cause of the problem

Supplier issues meant that the virtual private network (VPN) between our production environment and our disaster recovery setup stopped working. This prevented attachments from being copied across, leaving uploads in a pending state.

Steps taken to prevent this from happening again

We updated the relevant script so it would log an error, but continue with subsequent attachment uploads.
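As a rough illustration of the "log and continue" behaviour, here is a minimal Python sketch. It is not the actual GOV.UK script: the function names `copy_all` and `copy_one` and the logger name are hypothetical, and the real copy step is stood in for by a callable passed in by the caller.

```python
import logging
from typing import Callable, Iterable, List

# Hypothetical logger name, chosen for this example only.
logger = logging.getLogger("attachment_sync")


def copy_all(attachments: Iterable[str], copy_one: Callable[[str], None]) -> List[str]:
    """Try to copy every attachment; log failures and carry on.

    Returns the list of attachment paths that could not be copied.
    """
    failed: List[str] = []
    for path in attachments:
        try:
            copy_one(path)
        except Exception:
            # Record the error with a traceback, then move on to the next
            # attachment instead of letting one failure block the queue.
            logger.exception("Failed to copy attachment %s", path)
            failed.append(path)
    return failed
```

Returning the failed paths means a caller can retry them later or report them, rather than losing track of what was skipped.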

We no longer use a VPN in the uploading assets process.

In future, we’d like to add monitoring to alert us if the number of unprocessed attachment uploads crosses a reasonable threshold.
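A check like this could look something like the sketch below. The threshold values, the `Alert` type and the `check_unprocessed` function are all assumptions made for illustration; the post does not say what a reasonable threshold would be or which monitoring tool would run the check.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    level: str    # "ok", "warning" or "critical"
    message: str


def check_unprocessed(count: int, warning: int = 20, critical: int = 50) -> Alert:
    """Classify the number of unprocessed uploads against two thresholds.

    The default thresholds are placeholders, not real GOV.UK values.
    """
    if count < 0:
        raise ValueError("count must not be negative")
    if count >= critical:
        return Alert("critical", f"{count} attachment uploads are unprocessed")
    if count >= warning:
        return Alert("warning", f"{count} attachment uploads are unprocessed")
    return Alert("ok", f"{count} attachment uploads are unprocessed")
```

Using two levels lets a slowly growing backlog raise a warning before it becomes urgent.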

2 June 2016 – draft editions incorrectly available on live site

What users saw

For approximately 24 hours from 11am on 2 June, the driving and transport browse page linked to draft editions of 3 new pages that had not yet been published. Because they were still in draft, they should not have been live on the site, nor linked from the browse page. This was a severity 2 incident.

Cause of the problem

A change to the publishing API, made as part of dependency resolution work, contained a bug that allowed draft content to reach the live site. We already had tests in place to check that draft content wasn’t being sent to the live content store, but they had not been updated to reflect the dependency resolution work.

Steps taken to prevent this from happening again

We recognised that our documentation did not adequately cover the critical parts of the publishing API or what happens if they go wrong (for example, the potential consequences of draft content making it into the live system). To address this, we created new documentation on our internal wiki, available to the entire GOV.UK team.

We also expanded the tests we run to check that content is sent to the right place (the draft content store or the live content store).
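To give a flavour of this kind of test, here is a simplified Python sketch. It is not the publishing API's real code or test suite: `InMemoryStore`, `send_to_store` and the `state` and `base_path` fields are hypothetical stand-ins, and the routing rule (drafts to the draft store, published items to the live store) is a deliberate simplification.

```python
from typing import Dict


class InMemoryStore:
    """Stand-in for a content store, keyed by base path."""

    def __init__(self) -> None:
        self.items: Dict[str, dict] = {}

    def put(self, item: dict) -> None:
        self.items[item["base_path"]] = item


def send_to_store(item: dict, draft_store: InMemoryStore, live_store: InMemoryStore) -> None:
    """Route an item to exactly one store, based on its state."""
    if item["state"] == "published":
        live_store.put(item)
    elif item["state"] == "draft":
        draft_store.put(item)
    else:
        # Refuse unknown states rather than guessing where they belong.
        raise ValueError(f"unknown state: {item['state']!r}")


def test_draft_never_reaches_live_store() -> None:
    draft, live = InMemoryStore(), InMemoryStore()
    send_to_store({"base_path": "/new-page", "state": "draft"}, draft, live)
    assert "/new-page" in draft.items
    assert "/new-page" not in live.items
```

The important assertion is the negative one: the test fails if a draft item ever appears in the live store, whatever else changes in the routing code.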

19 June 2016 – users unable to submit anonymous feedback

What users saw

For just over an hour on 19 June, a number of users were unable to submit feedback using the anonymous contact form on GOV.UK. In addition, publishing of one document was affected until the problem was resolved. This was a severity 3 incident.

Cause of the problem

The publishing API’s experiments code (which tests if proposed changes to our code will optimise it) produced a large amount of data, causing the machine hosting Redis (a database used for fast data retrieval) to run out of memory. This meant the publishing API could not serve requests quickly enough and feedback couldn’t be processed.

Steps taken to prevent this from happening again

We put steps in place to make sure the Whitehall application continues to render pages if the publishing API is down. We also removed the experiments code from the publishing API, since it is no longer required.
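One common way to keep rendering pages when a dependency is unavailable is to fall back to the last good response. The Python sketch below shows that general pattern only; the post does not describe how the Whitehall application actually does this, and `fetch_with_fallback`, `fetch` and `cache` are hypothetical names.

```python
from typing import Callable, Dict, Optional


def fetch_with_fallback(
    key: str,
    fetch: Callable[[str], dict],
    cache: Dict[str, dict],
) -> Optional[dict]:
    """Fetch fresh data, falling back to the last good copy if the API is down.

    Returns None when the API is unavailable and nothing has been cached,
    so the caller can render the page without that data.
    """
    try:
        value = fetch(key)
    except (ConnectionError, TimeoutError):
        # The upstream API is unreachable or too slow: serve stale data
        # instead of failing the whole page.
        return cache.get(key)
    cache[key] = value  # remember the last good response
    return value
```

The trade-off is that users may briefly see slightly out-of-date data, which is usually preferable to an error page.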

Paul Heron is a delivery manager on GOV.UK. You can follow him on Twitter.

Original source – Inside GOV.UK
