Reliability, Availability, and Serviceability (RAS) Extensions
==============================================================

.. contents::
    :depth: 2

.. |EHF| replace:: Exception Handling Framework
.. |TF-A| replace:: Trusted Firmware-A

This document describes |TF-A| support for Arm Reliability, Availability, and
Serviceability (RAS) extensions. RAS is a mandatory extension for Armv8.2 and
later CPUs, and also an optional extension to the base Armv8.0 architecture.

In conjunction with the |EHF|, support for the RAS extensions enables a
firmware-first paradigm for handling platform errors: exceptions resulting from
errors are routed to and handled in EL3. Said errors are Synchronous External
Aborts (SEA), Asynchronous External Aborts (signalled as SErrors), and Fault
Handling and Error Recovery interrupts. The |EHF| document mentions various
`error handling use-cases`__.

.. __: exception-handling.rst#delegation-use-cases

For the description of Arm RAS extensions, Standard Error Records, and the
precise definition of RAS terminology, please refer to the Arm Architecture
Reference Manual. The rest of this document assumes familiarity with the
architecture and terminology.

Overview
--------

As mentioned above, the RAS support in |TF-A| enables routing to and handling of
exceptions resulting from platform errors in EL3. It allows the platform to
define an External Abort handler, and to register RAS nodes and interrupts. The
RAS framework also provides `helpers`__ for accessing Standard Error Records as
introduced by the RAS extensions.

.. __: `Standard Error Record helpers`_

The build option ``RAS_EXTENSION``, when set to ``1``, includes the RAS
framework in the run time firmware; ``EL3_EXCEPTION_HANDLING`` and
``HANDLE_EA_EL3_FIRST`` must also be set to ``1``.

.. _ras-figure:

.. image:: ../draw.io/ras.svg

See more on `Engaging the RAS framework`_.

Platform APIs
-------------

The RAS framework allows the platform to define handlers for External Abort,
Uncontainable Errors, Double Fault, and errors arising from EL3 execution.
Please refer to the porting guide for the `RAS platform API descriptions`__.

.. __: ../getting_started/porting-guide.rst#external-abort-handling-and-ras-support

Registering RAS error records
-----------------------------

RAS nodes are components in the system capable of signalling errors to PEs
through one of the notification mechanisms: SEAs, SErrors, or interrupts. RAS
nodes contain one or more error records, which are registers through which the
nodes advertise various properties of the signalled error. Arm recommends that
error records are implemented in the Standard Error Record format. The RAS
architecture allows for error records to be accessible via system or
memory-mapped registers.

The platform should enumerate the error records, providing for each of them:

- A handler to probe error records for errors;
- When the probing identifies an error, a handler to handle it;
- For a memory-mapped error record, its base address and size in KB; for a
  system register-accessed record, the start index of the record and the number
  of contiguous records from that index;
- Any node-specific auxiliary data.

With this information supplied, when the run time firmware receives an error
notification through one of these mechanisms, the RAS framework can iterate
through and probe the error records for errors, and invoke the appropriate
handler to handle each one.
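The probe-then-handle flow described above can be pictured with the following
stand-alone sketch. It is not the |TF-A| implementation: ``struct demo_record``,
``demo_dispatch()``, and the example callbacks are hypothetical names used only
for illustration, and the handler signature is deliberately simplified.

.. code:: c

    #include <assert.h>
    #include <stddef.h>

    /* Hypothetical stand-in for one enumerated error record (not the TF-A type). */
    struct demo_record {
        int (*probe)(const struct demo_record *rec, int *probe_data);
        int (*handler)(const struct demo_record *rec, int probe_data);
        const void *aux;    /* Node-specific auxiliary data */
    };

    /*
     * Probe every record; for each one reporting an error, invoke its handler.
     * Returns the number of errors handled.
     */
    static unsigned int demo_dispatch(const struct demo_record *recs, size_t count)
    {
        unsigned int n_handled = 0U;

        assert((recs != NULL) || (count == 0U));

        for (size_t i = 0U; i < count; i++) {
            int probe_data = 0;

            /* A non-zero return from the probe means an error was detected. */
            if (recs[i].probe(&recs[i], &probe_data) != 0) {
                (void)recs[i].handler(&recs[i], probe_data);
                n_handled++;
            }
        }

        return n_handled;
    }

    /* Example callbacks: one quiet record, and one that always reports an error. */
    static int demo_last_probe_data = -1;

    static int demo_probe_quiet(const struct demo_record *rec, int *probe_data)
    {
        (void)rec;
        (void)probe_data;
        return 0;
    }

    static int demo_probe_faulty(const struct demo_record *rec, int *probe_data)
    {
        (void)rec;
        *probe_data = 7;    /* Arbitrary value passed on to the handler */
        return 1;
    }

    static int demo_handler(const struct demo_record *rec, int probe_data)
    {
        (void)rec;
        demo_last_probe_data = probe_data;
        return 0;
    }

    static const struct demo_record demo_records[] = {
        { demo_probe_quiet, demo_handler, NULL },
        { demo_probe_faulty, demo_handler, NULL },
    };

In this sketch, ``demo_dispatch(demo_records, 2U)`` invokes ``demo_handler``
once, for the second record only, with the ``probe_data`` value that
``demo_probe_faulty`` supplied.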

The RAS framework provides macros to populate error record information. The
macros are versioned, and the latest version as of this writing is 1. Each
macro creates a structure of type ``struct err_record_info`` from its
arguments, which is later passed to the probe and error handlers.

For memory-mapped error records:

.. code:: c

    ERR_RECORD_MEMMAP_V1(base_addr, size_num_k, probe, handler, aux)

And, for system register ones:

.. code:: c

    ERR_RECORD_SYSREG_V1(idx_start, num_idx, probe, handler, aux)

The probe handler must have the following prototype:

.. code:: c

    typedef int (*err_record_probe_t)(const struct err_record_info *info,
                    int *probe_data);

The probe handler must return a non-zero value if an error was detected, or 0
otherwise. The ``probe_data`` output parameter can be used to pass any useful
information resulting from probe to the error handler (see `below`__). For
example, it could return the index of the record.

.. __: `Standard Error Record helpers`_

The error handler must have the following prototype:

.. code:: c

    typedef int (*err_record_handler_t)(const struct err_record_info *info,
                   int probe_data, const struct err_handler_data *const data);

The ``data`` constant parameter describes the various properties of the error,
including the reason for the error, exception syndrome, and also ``flags``,
``cookie``, and ``handle`` parameters from the `top-level exception handler`__.

.. __: interrupt-framework-design.rst#el3-interrupts

The platform is expected to populate an array using the macros above, and
register it with the RAS framework using the macro
``REGISTER_ERR_RECORD_INFO()``, passing it the name of the array describing the
records. Note that the macro must be used in the same file where the array is
defined.

Standard Error Record helpers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The |TF-A| RAS framework provides probe handlers for Standard Error Records, for
both memory-mapped and System Register accesses:

.. code:: c

    int ras_err_ser_probe_memmap(const struct err_record_info *info,
                    int *probe_data);

    int ras_err_ser_probe_sysreg(const struct err_record_info *info,
                    int *probe_data);

When the platform enumerates error records, for those records in the Standard
Error Record format, these helpers may be used instead of the platform rolling
out its own. Both helpers above:

- Return a non-zero value when an error is detected in a Standard Error Record;
- Set ``probe_data`` to the index of the error record upon detecting an error.
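As a rough illustration of what such a probe does, the sketch below scans a
simulated block of memory-mapped records and reports the index of the first one
whose status register is marked valid. It is not the |TF-A| helper: the
``demo_`` names are hypothetical, and the layout used here (64 bytes per
record, status register at offset ``0x10``, valid bit at bit 30) is an
assumption that should be checked against the Arm RAS specification.

.. code:: c

    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Assumed layout: 64-byte records, i.e. eight 64-bit words per record. */
    #define DEMO_WORDS_PER_RECORD   8U
    /* Assumed status register position: byte offset 0x10, i.e. word 2. */
    #define DEMO_STATUS_WORD        2U
    /* Assumed position of the "valid" bit in the status register. */
    #define DEMO_STATUS_V_BIT       (UINT64_C(1) << 30)

    /* Simulated register block for three records (all clear initially). */
    static uint64_t demo_regs[3U * DEMO_WORDS_PER_RECORD];

    /*
     * Returns non-zero and sets *probe_data to the record index if any record
     * in the group has its valid bit set; returns 0 otherwise.
     */
    static int demo_ser_probe(const volatile uint64_t *base,
                              unsigned int num_records, int *probe_data)
    {
        assert(base != NULL);
        assert(probe_data != NULL);

        for (unsigned int i = 0U; i < num_records; i++) {
            uint64_t status =
                base[(i * DEMO_WORDS_PER_RECORD) + DEMO_STATUS_WORD];

            if ((status & DEMO_STATUS_V_BIT) != 0U) {
                *probe_data = (int)i;
                return 1;
            }
        }

        return 0;
    }

Passing the record index through ``probe_data`` lets the error handler go
straight to the record in question without scanning the group again.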

Registering RAS interrupts
--------------------------

RAS nodes can signal errors to the PE by raising Fault Handling and/or Error
Recovery interrupts. For the firmware-first handling paradigm for interrupts to
work, the platform must set up and register with the |EHF|. See `Interaction
with Exception Handling Framework`_.

For each RAS interrupt, the platform has to provide a structure of type
``struct ras_interrupt``, containing:

- Interrupt number;
- The associated error record information (pointer to the corresponding
  ``struct err_record_info``);
- Optionally, a cookie.

The platform is expected to define an array of ``struct ras_interrupt``, and
register it with the RAS framework using the macro
``REGISTER_RAS_INTERRUPTS()``, passing it the name of the array. Note that the
macro must be used in the same file where the array is defined.

The array of ``struct ras_interrupt`` must be sorted in the increasing order of
interrupt number. This allows for fast lookup of handlers in order to service
RAS interrupts.
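The benefit of the sorting requirement can be seen in a minimal binary search
sketch. The structure and function below are hypothetical stand-ins that mirror
the three items listed above; the field names are assumptions and this is not
the |TF-A| code.

.. code:: c

    #include <assert.h>
    #include <stddef.h>

    /* Hypothetical stand-in for a RAS interrupt descriptor. */
    struct demo_ras_interrupt {
        unsigned int intr_number;   /* Interrupt number (the sort key) */
        const void *err_record;     /* Associated error record information */
        void *cookie;               /* Optional cookie */
    };

    /*
     * Binary search over a table sorted by increasing interrupt number.
     * Returns the matching entry, or NULL if the interrupt is not registered.
     */
    static const struct demo_ras_interrupt *demo_find_interrupt(
            const struct demo_ras_interrupt *table, size_t count,
            unsigned int intr)
    {
        size_t lo = 0U;
        size_t hi = count;  /* Search the half-open range [lo, hi) */

        assert((table != NULL) || (count == 0U));

        while (lo < hi) {
            size_t mid = lo + ((hi - lo) / 2U);

            if (table[mid].intr_number == intr) {
                return &table[mid];
            }

            if (table[mid].intr_number < intr) {
                lo = mid + 1U;
            } else {
                hi = mid;
            }
        }

        return NULL;
    }

    /* Example table; the interrupt numbers are arbitrary. */
    static const struct demo_ras_interrupt demo_table[] = {
        { 32U, NULL, NULL },
        { 47U, NULL, NULL },
        { 90U, NULL, NULL },
    };

If the table were not sorted, a search like this could miss registered
interrupts, which is why the ordering is a requirement rather than a
recommendation.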

Double-fault handling
---------------------

A Double Fault condition arises when an error is signalled to the PE while
handling of a previously signalled error is still underway. When a Double Fault
condition arises, the Arm RAS extensions only require the handler to perform an
orderly shutdown of the system, as recovery may be impossible.

The RAS extensions part of Armv8.4 introduced new architectural features to deal
with Double Fault conditions, specifically, the introduction of ``NMEA`` and
``EASE`` bits to ``SCR_EL3`` register. These were introduced to assist EL3
software which runs part of its entry/exit routines with exceptions momentarily
masked. In such systems, External Aborts/SErrors are not handled immediately
when they occur, but only after the exceptions are unmasked again.

|TF-A|, for legacy reasons, executes all of EL3 with all exceptions unmasked.
This means that all exceptions routed to EL3 are handled immediately. |TF-A|
is thus able to detect Double Fault conditions in software, without needing
the intended advantages of the Armv8.4 Double Fault architecture extensions.

Double faults are fatal: handling terminates at the platform double fault
handler, which doesn't return.
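One way to picture software detection of a Double Fault is an "error handling
in progress" marker that is checked on entry to error handling. The sketch
below only illustrates that idea and is not how |TF-A| implements it; all names
are hypothetical, a single global stands in for per-PE state, and the stand-in
double fault handler returns so that the example can be exercised.

.. code:: c

    #include <assert.h>
    #include <stdbool.h>

    /* Hypothetical state; a real implementation would track this per PE. */
    static bool demo_handling_in_progress;
    static unsigned int demo_double_faults;  /* Detections, for illustration */

    /* Stand-in for the platform double fault handler (the real one is fatal). */
    static void demo_double_fault_handler(void)
    {
        demo_double_faults++;
    }

    /*
     * Called on entry to error handling. Returns false if another error was
     * already being handled, that is, if a Double Fault was detected.
     */
    static bool demo_error_enter(void)
    {
        if (demo_handling_in_progress) {
            demo_double_fault_handler();
            return false;
        }

        demo_handling_in_progress = true;
        return true;
    }

    /* Called once handling of the current error has completed. */
    static void demo_error_exit(void)
    {
        assert(demo_handling_in_progress);
        demo_handling_in_progress = false;
    }

Detection of this kind depends on errors being taken as soon as they are
signalled; with exceptions masked, the second error would stay pending and
would not reach the check until it was unmasked.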

Engaging the RAS framework
--------------------------

Enabling RAS support is a platform choice constructed from three distinct, but
related, build options:

- ``RAS_EXTENSION=1`` includes the RAS framework in the run time firmware;

- ``EL3_EXCEPTION_HANDLING=1`` enables handling of exceptions at EL3. See
  `Interaction with Exception Handling Framework`_;

- ``HANDLE_EA_EL3_FIRST=1`` enables routing of External Aborts and SErrors to
  EL3.

The RAS support in |TF-A| introduces a default implementation of
``plat_ea_handler``, the External Abort handler in EL3. When ``RAS_EXTENSION``
is set to ``1``, it'll first call the ``ras_ea_handler()`` function, which is
the top-level RAS exception handler. ``ras_ea_handler`` is responsible for
iterating through platform-supplied error records, probing them, and, when an
error is identified, looking up and invoking the corresponding error handler.

Note that, if the platform chooses to override the ``plat_ea_handler`` function
and intends to use the RAS framework, it must explicitly call
``ras_ea_handler()`` from within.
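The shape of such an override might look like the following stand-alone sketch.
The ``demo_`` functions are hypothetical stand-ins so that the example compiles
on its own; the parameter list and the "non-zero means handled" return
convention are assumptions, so check the porting guide for the actual
prototypes.

.. code:: c

    #include <stddef.h>
    #include <stdint.h>

    /* Counts aborts that nobody handled; a real platform would report and stop. */
    static unsigned int demo_unhandled_count;

    /*
     * Stand-in for the top-level RAS handler. For the demo, it pretends that
     * only aborts with reason 1 correspond to a registered error record.
     */
    static int demo_ras_ea_handler(unsigned int ea_reason, uint64_t syndrome,
                                   void *cookie, void *handle, uint64_t flags)
    {
        (void)syndrome;
        (void)cookie;
        (void)handle;
        (void)flags;

        return (ea_reason == 1U) ? 1 : 0;
    }

    /* Platform override: let the RAS framework try first, then fall back. */
    static void demo_plat_ea_handler(unsigned int ea_reason, uint64_t syndrome,
                                     void *cookie, void *handle, uint64_t flags)
    {
        if (demo_ras_ea_handler(ea_reason, syndrome, cookie, handle, flags) != 0) {
            return;     /* The RAS framework found and handled the error */
        }

        /* Platform-specific fallback for aborts the framework did not handle. */
        demo_unhandled_count++;
    }

Calling the RAS handler before any platform-specific processing keeps the
registered probe and error handlers in the path even when the default External
Abort handler is replaced.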

Similarly, for RAS interrupts, the framework defines
``ras_interrupt_handler()``. The RAS framework arranges for it to be invoked
when a RAS interrupt is taken at EL3. The function bisects the
platform-supplied sorted array of interrupts to look up the error record
information associated with the interrupt number. The error handler for that
record is then invoked to handle the error.

Interaction with Exception Handling Framework
---------------------------------------------

As mentioned in earlier sections, the RAS framework interacts with the |EHF| to
arbitrate handling of RAS exceptions with others that are routed to EL3. This
means that the platform must partition a `priority level`__ for handling RAS
exceptions. The platform must then define the macro ``PLAT_RAS_PRI`` to the
priority level used for RAS exceptions. Platforms would typically want to
allocate the highest secure priority for RAS handling.

.. __: exception-handling.rst#partitioning-priority-levels

Handling of both `interrupt`__ and `non-interrupt`__ exceptions follows the
sequences outlined in the |EHF| documentation. That is, for interrupts, the
priority management is implicit; but for non-interrupt exceptions, it is
explicit, using `EHF APIs`__.

.. __: exception-handling.rst#interrupt-flow
.. __: exception-handling.rst#non-interrupt-flow
.. __: exception-handling.rst#activating-and-deactivating-priorities

----

*Copyright (c) 2018, Arm Limited and Contributors. All rights reserved.*