Sydney rail outage caused by 2 capacitors (and bad system design)

Full report at
http://www.railcorp.info/__data/assets/pdf_file/0008/9719/110429-Signal_System_Report.pdf

On the morning of April 12th, 2011, the ATRICS system located at
Sydenham Signal Box suffered a failure which resulted in a loss of
control for points, routes and controlled signals for all areas under
the control of the Signal Box. All rail traffic was requested by radio
broadcast to stop until advised otherwise.
This failure caused significant disruption across approximately 40% of
the metropolitan rail network. As a consequence of the failure there
were 847 trains delayed and 240 trains cancelled. The On Time Reporting
(OTR) figures for morning and afternoon peak periods were 75.5% and
67.6% respectively.
An investigation was commenced on the morning of the 12th April to
examine the cause of the failure and to provide recommendations on
improvements to both hardware and software to improve the resilience of
the control system to these types of failures.
The Advanced Train Running Information Control System (ATRICS) is a
fully integrated train management system that provides a simple means
for a signaller to control a rail network.
The Sydenham Signal Box uses ATRICS to control a very substantial part
of the Metropolitan Rail Network covering the following areas:
Hurstville, Liverpool, Sefton, Sutherland, Sydenham, Sydenham Bankstown,
Wardell Road, Wolli Creek and the Airport Line.
The ATRICS system is a computerised system that utilises a Local Area
Network (LAN) and a Wide Area Network (WAN) to communicate between the
various devices that make up the system. The network is protected from
other networks in order to prevent external cyber attacks such as
hacking and viruses. The Sydenham LAN is made up of a combination of
switches, routers and firewalls in order to achieve the required
availability and integrity. The LAN comprises four switches which are
connected in a ring topology to provide device and network fault
tolerance and to be adaptive to network changes.
At 7:36:52 on the 12th April 2011, one of the network switches that
forms part of the ATRICS LAN at Sydenham (Sydhm_sw1) detected that the
adjacent switch was no longer communicating. At 7:37:10 the same network
switch detected that the adjacent switch had resumed
communicating. This pattern repeated regularly for sw1 and for other
switches connected to the network. This pattern indicates that there was
not a complete failure but that one of the network switches was cycling
from a failed to an operational state. As a result the Sydenham LAN
became caught in a cycle where it was continually trying to reconfigure
itself to address the changing state of the network.
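
The cycling pattern described above can be illustrated with a minimal
sketch of flap detection: count how often a link changes state inside a
sliding time window and flag it once a threshold is reached. This is
not how the Sydenham switches or ATRICS work; the class name
FlapDetector, the threshold of 3 transitions and the 60 second window
are all assumptions made for illustration. Only the first two
timestamps come from the report; the later events are hypothetical.

```python
from collections import deque
from datetime import datetime, timedelta


class FlapDetector:
    """Flags a link as flapping when it changes state too often.

    Hypothetical helper: max_transitions and window are illustrative
    values, not figures taken from the report.
    """

    def __init__(self, max_transitions=3, window=timedelta(seconds=60),
                 initial_state="up"):
        self.max_transitions = max_transitions
        self.window = window
        self.last_state = initial_state  # assume the link starts healthy
        self.transitions = deque()       # timestamps of recent state changes

    def observe(self, when, state):
        """Record an 'up'/'down' observation; return True if flapping."""
        if state != self.last_state:
            self.transitions.append(when)
            self.last_state = state
        # Discard transitions that have fallen out of the sliding window.
        while self.transitions and when - self.transitions[0] > self.window:
            self.transitions.popleft()
        return len(self.transitions) >= self.max_transitions


t0 = datetime(2011, 4, 12, 7, 36, 52)
events = [
    (t0, "down"),                              # from the report
    (datetime(2011, 4, 12, 7, 37, 10), "up"),  # from the report
    (t0 + timedelta(seconds=36), "down"),      # hypothetical repeat
    (t0 + timedelta(seconds=54), "up"),        # hypothetical repeat
]

detector = FlapDetector()
flags = [detector.observe(when, state) for when, state in events]
print(flags)
```

A network that recognises a flapping neighbour in this way could hold
it out of service instead of reconfiguring on every state change, which
is the kind of cycle the report describes.
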
All external data communication stopped while the Sydenham LAN was
trying to reconfigure itself. At 07:36:55, due to the disruption across
the LAN, applications on all 22 RCS machines started losing data
connections and, as a consequence, controlling panels started losing
their functionality at 07:38:02.
The faulty network switch was identified by local technical staff. The
disaster recovery process was initiated at 07:48:03. This involved the
staged restart of all signal control servers and workstations. The first
workstation area became functional at 08:10:15 and full control on all
workstations was restored at 08:52, the faulty switch having been
powered off at 08:46.
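
To put the times quoted above side by side, the following sketch
computes how long after the first detected fault each milestone
occurred. The timestamps are those given in the summary; where the
summary gives only hours and minutes (08:46 and 08:52), zero seconds
are assumed. The helper name at() and the milestone labels are my own.

```python
from datetime import datetime, timedelta


def at(hms):
    """Parse a clock time on 12 April 2011, the day of the incident."""
    return datetime.strptime("2011-04-12 " + hms, "%Y-%m-%d %H:%M:%S")


# Milestones as quoted in the summary, in chronological order.
timeline = [
    ("switch fault first detected", at("07:36:52")),
    ("applications start losing data connections", at("07:36:55")),
    ("panels start losing functionality", at("07:38:02")),
    ("disaster recovery initiated", at("07:48:03")),
    ("first workstation area functional", at("08:10:15")),
    ("faulty switch powered off", at("08:46:00")),  # seconds assumed
    ("full control restored", at("08:52:00")),      # seconds assumed
]

start = timeline[0][1]
elapsed = {label: when - start for label, when in timeline}

for label, delta in elapsed.items():
    print(f"{str(delta):>8}  {label}")
```

On these figures, full control returned about 1 hour 15 minutes after
the first detected fault, and roughly 6 minutes after the faulty switch
was powered off.
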
The Passenger Information system became functional at 09:15, but quite a
few platform indicators had been blanked to avoid confusing passengers.
Full functionality of all platform indicators was restored at 16:15.
An investigation into the incident commenced later that same day. The
faulty switch and all logs and evidence were quarantined and analysed.
During the investigation it was found there were two related failures.
The first one was the failure of the Sydenham ATRICS network switch and
the second one was a failure of the Revesby Microlok signalling
interlocking.

According to the available evidence and the tests performed, it can be
concluded that the trigger for the event was the failure of two
electrolytic capacitors in the Sydenham LAN switch sydhm_sw2. Due to the
nature of the capacitor failure, the switch partially failed, which
placed the whole LAN into a partially failed state.
The root cause of the incident was the failure of the Sydenham LAN;
however, the major contributors to the duration of the outage were the
inability of the ATRICS software to manage the scenario created by the
failure of the LAN, and the slowness of the Sydenham LAN, which are
still under investigation.
The root cause of the Revesby failure has been identified as a weakness
in the configuration of the signalling interlocking at the site. In the
configuration that applies at Revesby, there is a possibility that under
failure conditions, point controls issued by ATRICS may not be
executed by the interlocking. (The interlocking fails to a safe state,
preventing movement of trains.)
The investigation identified seven recommendations to improve management
and robustness of the control system in response to the incident.
The vulnerability exposed by the incident is an inherent part of the
design of the Sydenham LAN and ATRICS software. This vulnerability
remains in the Sydenham system and priority should be given to its
mitigation or elimination.
