Sydney rail outage caused by 2 capacitors (and bad system design)

Full report at

formatting link

On the morning of April 12th, 2011, the ATRICS system located at Sydenham Signal Box suffered a failure which resulted in a loss of control for points, routes and controlled signals for all areas under the control of the Signal Box. All rail traffic was requested by radio broadcast to stop until advised otherwise. This failure caused significant disruption across approximately 40% of the metropolitan rail network. As a consequence of the failure there were 847 trains delayed and 240 trains cancelled. The On Time Reporting (OTR) figures for the morning and afternoon peak periods were 75.5% and 67.6% respectively.

An investigation was commenced on the morning of the 12th April to examine the cause of the failure and to provide recommendations on improvements to both hardware and software to improve the resilience of the control system to these types of failures.

The Advanced Train Running Information Control System (ATRICS) is a fully integrated train management system that provides a simple means for a signaller to control a rail network. The Sydenham Signal Box uses ATRICS to control a very substantial part of the Metropolitan Rail Network covering the following areas: Hurstville, Liverpool, Sefton, Sutherland, Sydenham, Sydenham Bankstown, Wardell Road, Wolli Creek, Sutherland and Airport Line.

The ATRICS system is a computerised system that utilises a Local Area Network (LAN) and a Wide Area Network (WAN) to communicate between the various devices that make up the system. The network is isolated from other networks in order to prevent external cyber attacks such as hacking, viruses, etc. The Sydenham LAN is made up of a combination of switches, routers and firewalls in order to achieve the required availability and integrity. The LAN comprises four switches which are connected in a ring topology to provide device and network fault tolerance and to be adaptive to network changes.

At 07:36:52 on the 12th April 2011, one of the network switches that forms part of the ATRICS LAN at Sydenham (Sydhm_sw1) detected that the adjacent switch was no longer communicating. At 07:37:10 the same network switch detected that the non-communicating switch had resumed communicating. This pattern repeated regularly for sw1 and for other switches connected to the network. It indicates that there was not a complete failure but that one of the network switches was cycling between a failed and an operational state. As a result the Sydenham LAN became caught in a cycle where it was continually trying to reconfigure itself to address the changing state of the network.
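
To see why a switch that keeps cycling can be worse than one that simply dies, here is a rough Python sketch. It is a toy model, not anything taken from the report: it assumes every state change forces the ring to reconverge and that no traffic flows until a reconvergence finishes undisturbed. Only the 07:36:52 start time and the 18-second gap to 07:37:10 come from the text above; the function name, the number of flaps and both convergence times are made up for illustration.

```python
from datetime import datetime, timedelta

def usable_seconds(events, convergence, end):
    """Seconds the LAN is usable between the first event and `end`.

    events: sorted datetimes at which a link changed state.
    convergence: time the ring needs to settle after a change (assumed).
    """
    usable = timedelta(0)
    for i, t in enumerate(events):
        settled = t + convergence
        nxt = events[i + 1] if i + 1 < len(events) else end
        # The LAN is only usable if it settles before the next change.
        if settled < nxt:
            usable += nxt - settled
    return usable.total_seconds()

# From the report: first fault at 07:36:52, recovery seen 18 s later.
START = datetime(2011, 4, 12, 7, 36, 52)
FLAP_GAP = timedelta(seconds=18)
# Assumption: the switch keeps changing state every 18 s, ten times.
events = [START + k * FLAP_GAP for k in range(10)]
END = events[-1] + FLAP_GAP

for seconds in (30, 5):  # hypothetical convergence times
    up = usable_seconds(events, timedelta(seconds=seconds), END)
    print(f"convergence {seconds:>2} s -> usable for {up:.0f} of 180 s")
```

Under these assumed numbers, a network that needs longer to settle than the interval between flaps never becomes usable at all, which matches the "continually trying to reconfigure itself" behaviour described above.
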
All external data communication stopped while the Sydenham LAN was trying to reconfigure itself. At 07:36:55, due to the disruption across the LAN, applications on all 22 RCS machines started losing data connections, and as a consequence the controlling panels started losing their functionality at 07:38:02.

The faulty network switch was identified by local technical staff. The disaster recovery process was initiated at 07:48:03. This involved the staged restart of all signal control servers and workstations. The first workstation area became functional at 08:10:15, and full control on all workstations was restored at 08:52, with the faulty switch powered off at 08:46. The Passenger Information system became functional at 09:15, but quite a few platform indicators had been blanked to avoid confusing passengers. Full functionality of all platform indicators was restored at 16:15.

An investigation into the incident commenced later that same day. The faulty switch and all logs and evidence were quarantined and analysed. During the investigation it was found there were two related failures. The first was the failure of the Sydenham ATRICS network switch and the second was a failure of the Revesby Microlok signalling interlocking.
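
For a feel of how long each stage took, the timestamps quoted above can be turned into elapsed times with a few lines of Python. This is only arithmetic on the published figures. The labels are my own shorthand, and since 08:46 and 08:52 are given only to the minute, the script assumes :00 seconds for them.

```python
from datetime import datetime

DAY = "2011-04-12 "
# Timestamps as quoted in the report summary; the labels are paraphrased.
timeline = [
    ("first switch fault detected", "07:36:52"),
    ("RCS applications start losing connections", "07:36:55"),
    ("control panels start losing functionality", "07:38:02"),
    ("disaster recovery initiated", "07:48:03"),
    ("first workstation area functional", "08:10:15"),
    ("faulty switch powered off", "08:46:00"),  # seconds assumed
    ("full control restored", "08:52:00"),      # seconds assumed
]

def elapsed(timeline):
    """Map each label to seconds elapsed since the first entry."""
    parsed = [
        (label, datetime.strptime(DAY + t, "%Y-%m-%d %H:%M:%S"))
        for label, t in timeline
    ]
    start = parsed[0][1]
    return {label: int((ts - start).total_seconds()) for label, ts in parsed}

for label, secs in elapsed(timeline).items():
    minutes, seconds = divmod(secs, 60)
    print(f"+{minutes:3d} min {seconds:02d} s  {label}")
```

On these figures, roughly 11 minutes passed before disaster recovery was started and about 75 minutes before full control was back.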

According to the available evidence and the tests performed, it can be concluded that the trigger for the event was the failure of two electrolytic capacitors in the Sydenham LAN switch sydhm_sw2. Due to the nature of the capacitor failure, the switch partially failed, which placed the whole LAN into a partially failed state. The root cause of the incident was the failure of the Sydenham LAN; however, the major contributors to the duration of the outage were the inability of the ATRICS software to manage the scenario created by the failure of the LAN, and the slowness of the Sydenham LAN, which are still under investigation.

The root cause of the Revesby failure has been identified as a weakness in the configuration of the signalling interlocking at the site. In the configuration that applies at Revesby, there is a possibility that under failure conditions, point controls issued by ATRICS may not be executed by the interlocking. (The interlocking fails to a safe state, preventing movement of trains.)

The investigation identified seven recommendations to improve management and robustness of the control system in response to the incident. The vulnerability exposed by the incident is an inherent part of the design of the Sydenham LAN and ATRICS software. This vulnerability remains in the Sydenham system and priority should be given to its mitigation or elimination.

keithr