ibis died (again)
Sep 27th, 2013 by Tim Watts

Restarted Apache – this affected the following sites:

  • http://ems.kcl.ac.uk/
  • http://epidoc.cch.kcl.ac.uk/
  • http://insaph.kcl.ac.uk/
  • http://iwpe.org.uk/
  • http://socrates.kcl.ac.uk/
  • http://wiki.digitalclassicist.org/
  • http://worldpoetryinenglish.org.uk/
  • http://www.cch.kcl.ac.uk/
  • http://www.digitalclassicist.org/
  • http://www.ems.kcl.ac.uk/
  • http://www.epidoc.cch.kcl.ac.uk/
  • http://www.insaph.kcl.ac.uk/
  • http://www.iwpe.org.uk/
  • http://www.socrates.kcl.ac.uk/
  • http://www.worldpoetryinenglish.org.uk/

All sites are working again.

Thursday’s systems failure
Sep 27th, 2013 by Tim Watts

[Photo: MCB labelled 'B' tripped]

Sometime between 1am and 2am on Thursday morning (26th Sept 2013) the VMware cluster failed.

After some initial diagnostics from offsite, I concluded the problem was local to our rack and that half the power feeds had failed in cabinet M8 at ULCC. ULCC confirmed the problem was not a wider one.

I discovered that the circuit breaker next to the label B in the photo had tripped. This feeds one half of the cabinet. There is no good reason the breaker should have tripped – it is a 16A breaker with a Type C trip characteristic, which is pretty resistant to false trips.

In theory our VMware cluster should tolerate half the power going off. However, there are two problems:

i) There is a bug with the Dell PowerConnect network switches that route disk data (iSCSI) between the SAN (disk array) and the VMware host servers. If the “wrong” switch (unit 1 in this case) fails, then unit 2’s ports go offline.

ii) The SAN has 3 power supplies. However, these are in a “2+1” arrangement, meaning two must remain working for the unit to function and only one can fail before it shuts down. That is tricky to arrange when you only have 2 main distribution circuits: whichever circuit feeds two of the three PSUs becomes a single point of failure. A “1+2” arrangement would be better, or even a “1+1”.
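As a rough illustration of why this is awkward (a toy model only – the circuit labels and the PSU-to-circuit mapping are my assumptions, not the actual ULCC wiring), the sketch below checks the “2+1” rule against each single-circuit failure:

# Toy model: can a "2+1" SAN ride out the loss of one power circuit?
# Circuit labels and the PSU-to-circuit mapping are illustrative assumptions.

PSUS_REQUIRED = 2  # "2+1": two of the three PSUs must stay powered

def san_survives(psu_feeds, failed_circuit):
    """Return True if enough PSUs remain powered after one circuit fails."""
    powered = [psu for psu, circuit in psu_feeds.items() if circuit != failed_circuit]
    return len(powered) >= PSUS_REQUIRED

# With only two distribution circuits, one of them has to feed two of the
# three PSUs -- here circuit A feeds PSU1 and PSU3.
feeds = {"PSU1": "A", "PSU2": "B", "PSU3": "A"}

for circuit in ("A", "B"):
    print(f"Circuit {circuit} fails -> SAN stays up: {san_survives(feeds, circuit)}")
# Circuit A fails -> SAN stays up: False  (two PSUs lost at once)
# Circuit B fails -> SAN stays up: True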

I reset the breaker and waited for VMware to restart essential services (the vCenter VM mainly).

All VMs needed a reset: although they were still technically running, they had lost access to their disks for long enough to upset the guest OS.

Thanks to Nagios it was fairly easy to see which VMs still had problems – there are quite a few order-related dependencies between VMs (e.g. the database servers and NFS servers need to be up before other servers).
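For illustration only – the VM names and the dependency map below are invented, not our actual Nagios configuration – a dependency-ordered restart boils down to a topological sort:

# Minimal sketch: restart VMs in dependency order (dependencies first).
# The VM names and the dependency map are purely illustrative.
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each VM maps to the set of VMs that must be up before it starts.
depends_on = {
    "db01":  set(),
    "nfs01": set(),
    "web01": {"db01", "nfs01"},
    "web02": {"db01", "nfs01"},
    "wiki":  {"db01"},
}

restart_order = list(TopologicalSorter(depends_on).static_order())
print(restart_order)
# e.g. ['db01', 'nfs01', 'web01', 'web02', 'wiki'] -- the database and NFS
# servers come up before the front-end servers that rely on them.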

The systems were mostly restored by 2pm and fully restored by 3pm.

Mitigating further problems

Network

i) The network switch problem is tricky. Ideally we would replace the switches under warranty.

However, there are a lot of cables that would have to be put back into the correct ports, and all systems would be offline for most of a day to do that.

What I did do was swap the power feeds for switch 1 and switch 2 in case the same MCB trips again. This should allow the remaining switch to stay operational.

ii) SAN PSUs: I have moved the feed for PSU3 into the next cabinet. This mitigates the risk of that MCB tripping again, but the unit is still at risk from a failure of one of the main ULCC circuits.
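Extending the earlier toy model (the feed and circuit labels, and the assumption that both cabinets hang off the same two main ULCC circuits, are mine for illustration), the new arrangement rides out a single MCB trip in cabinet M8 but not the loss of a whole main circuit:

# Toy model continued: PSU3 now takes its feed from the next cabinet.
# Feed/circuit labels and the shared-circuit assumption are illustrative.
PSUS_REQUIRED = 2  # the SAN still needs two of its three PSUs powered

# Each PSU -> (cabinet feed / MCB, upstream ULCC circuit).
new_feeds = {
    "PSU1": ("M8-feed-A",   "A"),
    "PSU2": ("M8-feed-B",   "B"),
    "PSU3": ("next-feed-B", "B"),
}

def survives(failed, level):
    """Check whether enough PSUs stay powered after a feed (MCB) or circuit failure."""
    idx = 0 if level == "feed" else 1
    powered = [psu for psu, path in new_feeds.items() if path[idx] != failed]
    return len(powered) >= PSUS_REQUIRED

print(survives("M8-feed-B", "feed"))  # True  -- an MCB trip now loses only PSU2
print(survives("B", "circuit"))       # False -- main circuit B still feeds two PSUs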

Back from holiday
Sep 9th, 2013 by Tim Watts

Systems all noted working, with the exception of:

  1. SAWS – memory problems with Tomcat.
  2. Backups: poms.cch.kcl.ac.uk failed again. Something weird with that VM instance.
  3. Stormont Papers – web service failure.
  4. Several password change requests from external people.