Reliability, Availability, and Serviceability (RAS) Extensions
**************************************************************

This document describes |TF-A| support for Arm Reliability, Availability, and
Serviceability (RAS) extensions. RAS is a mandatory extension for Armv8.2 and
later CPUs, and also an optional extension to the base Armv8.0 architecture.

For the description of Arm RAS extensions, Standard Error Records, and the
precise definition of RAS terminology, please refer to the Arm Architecture
Reference Manual and `RAS Supplement`_. The rest of this document assumes
familiarity with the architecture and terminology.

**IMPORTANT NOTE**: The TF-A implementation assumes that if the RAS extension is
present then FEAT_IESB is also implemented.

There are two philosophies for handling RAS errors from the Non-secure world
point of view:

- :ref:`Firmware First Handling (FFH)`
- :ref:`Kernel First Handling (KFH)`

.. _Firmware First Handling (FFH):

Firmware First Handling (FFH)
=============================

Introduction
------------

EAs and Error interrupts corresponding to NS nodes are handled first in firmware:

- Errors are signaled back to the NS world via a suitable mechanism
- The kernel is prohibited from accessing the RAS error records directly
- Firmware creates CPER records for the kernel to navigate and process
- Firmware signals the error back to the kernel via SDEI

Overview
--------

FFH works in conjunction with `Exception Handling Framework`. Exceptions resulting from
errors in the Non-secure world are routed to and handled in EL3. Said errors are Synchronous
External Aborts (SEA), Asynchronous External Aborts (signalled as SErrors), and Fault
Handling and Error Recovery interrupts.
The RAS Framework in TF-A allows the platform to define an external abort handler and to
register RAS nodes and interrupts. It also provides `helpers`__ for accessing Standard
Error Records as introduced by the RAS extensions.

.. __: `Standard Error Record helpers`_

.. _Kernel First Handling (KFH):

Kernel First Handling (KFH)
===========================

Introduction
------------

EAs originating from or attributed to the NS world are handled first in NS, and the
kernel navigates the standard error records directly.

- KFH is the default handling mode if the platform does not explicitly enable FFH mode.
- KFH mode does not need any EL3 involvement except for the reflection of errors back
  to the lower EL. This happens when there is an error (EA) in the system which is not yet
  signaled to the PE while executing at a lower EL. During entry into EL3 the errors (EA)
  are synchronized, causing the async EA to pend at EL3.

67
68Error Syncronization at EL3 entry
69=================================
70
71During entry to EL3 from lower EL, if there is any pending async EAs they are either
72reflected back to lower EL (KFH) or handled in EL3 itself (FFH).
73
74|Image 1|
Manish Pandeyd419e222023-02-13 12:39:17 +000075
TF-A build options
==================

- **ENABLE_FEAT_RAS**: Enable RAS extension feature at EL3.
- **HANDLE_EA_EL3_FIRST_NS**: Required for FFH.
- **RAS_TRAP_NS_ERR_REC_ACCESS**: Trap Non-secure access of RAS error record registers.
- **RAS_EXTENSION**: Deprecated macro, equivalent to ENABLE_FEAT_RAS and
  HANDLE_EA_EL3_FIRST_NS put together.

RAS internal macros:

- **FFH_SUPPORT**: Gets enabled if **HANDLE_EA_EL3_FIRST_NS** is enabled.

The RAS feature depends on some other TF-A build flags:

- **EL3_EXCEPTION_HANDLING**: Required for FFH.
- **FAULT_INJECTION_SUPPORT**: Required for testing the RAS feature on the FVP platform.

TF-A Tests
==========

RAS functionality is regularly tested in TF-A CI using the `RAS test group`_, which has
multiple configurations for testing lower EL External Aborts.

All the tests are written in TF-A Tests, which runs as the NS-EL2 payload.

- **FFH without RAS extension**

  *fvp-ea-ffh,fvp-ea-ffh:fvp-tftf-fip.tftf-aemv8a-debug*

  A couple of tests, one each for sync EA and async EA from a lower EL, which get handled in EL3.
  They inject External Aborts (sync/async) which trap to EL3; FVP has a handler which gracefully
  handles these errors and returns to TF-A Tests.

  Build Configs : **HANDLE_EA_EL3_FIRST_NS** , **PLATFORM_TEST_EA_FFH**

- **FFH with RAS extension**

  Three tests:

  - *fvp-ras-ffh,fvp-single-fault:fvp-tftf-fip.tftf-aemv8a.fi-debug*

    Inject an unrecoverable RAS error, which gets handled in EL3.

  - *fvp-ras-ffh,fvp-uncontainable:fvp-tftf.fault-fip.tftf-aemv8a.fi-debug*

    Inject uncontainable RAS errors, which cause the platform to panic.

  - *fvp-ras-ffh,fvp-ras-ffh-nested:fvp-tftf-fip.tftf-ras_ffh_nested-aemv8a.fi-debug*

    Test nested exception handling at EL3 for synchronized async EAs. Inject an SError in a lower EL
    which remains pending until EL3 is entered through an SMC call. On encountering a pending
    async EA at EL3 entry, EL3 handles the async EA first (nested exception) before handling the
    original SMC call.

- **KFH with RAS extension**

  A couple of tests in the group:

  - *fvp-ras-kfh,fvp-ras-kfh:fvp-tftf-fip.tftf-aemv8a.fi-debug*

    Inject and handle RAS errors in TF-A Tests (no EL3 involvement).

  - *fvp-ras-kfh,fvp-ras-kfh-reflect:fvp-tftf-fip.tftf-ras_kfh_reflection-aemv8a.fi-debug*

    Reflection of synchronized errors from EL3 to TF-A Tests; two tests, one each for reflecting
    in the IRQ and SMC paths.

RAS Framework
=============

.. _ras-figure:

.. image:: ../resources/diagrams/draw.io/ras.svg

Platform APIs
-------------

The RAS framework allows the platform to define handlers for External Abort,
Uncontainable Errors, Double Fault, and errors arising from EL3 execution. Please
refer to :ref:`RAS Porting Guide <External Abort handling and RAS Support>`.

Registering RAS error records
-----------------------------

RAS nodes are components in the system capable of signalling errors to PEs
through one of the notification mechanisms: SEAs, SErrors, or interrupts. RAS
nodes contain one or more error records, which are registers through which the
nodes advertise various properties of the signalled error. Arm recommends that
error records are implemented in the Standard Error Record format. The RAS
architecture allows for error records to be accessible via system or
memory-mapped registers.

The platform should enumerate the error records, providing for each of them:

- A handler to probe error records for errors;
- When the probing identifies an error, a handler to handle it;
- For a memory-mapped error record, its base address and size in KB; for a system
  register-accessed record, the start index of the record and the number of
  contiguous records from that index;
- Any node-specific auxiliary data.

With this information supplied, when the run time firmware is notified through
one of the notification mechanisms, the RAS framework can iterate through and
probe error records for errors, and invoke the appropriate handler to handle them.

The RAS framework provides macros to populate error record information. The
macros are versioned, and the latest version as of this writing is 1. Each
macro creates a structure of type ``struct err_record_info`` from its arguments;
these structures are later passed to the probe and error handlers.

For memory-mapped error records:

.. code:: c

    ERR_RECORD_MEMMAP_V1(base_addr, size_num_k, probe, handler, aux)

And, for system register ones:

.. code:: c

    ERR_RECORD_SYSREG_V1(idx_start, num_idx, probe, handler, aux)

The probe handler must have the following prototype:

.. code:: c

    typedef int (*err_record_probe_t)(const struct err_record_info *info,
                    int *probe_data);

The probe handler must return a non-zero value if an error was detected, or 0
otherwise. The ``probe_data`` output parameter can be used to pass any useful
information resulting from the probe to the error handler (see `below`__). For
example, it could return the index of the record.

.. __: `Standard Error Record helpers`_

The error handler must have the following prototype:

.. code:: c

    typedef int (*err_record_handler_t)(const struct err_record_info *info,
                    int probe_data, const struct err_handler_data *const data);

The ``data`` constant parameter describes the various properties of the error,
including the reason for the error, exception syndrome, and also ``flags``,
``cookie``, and ``handle`` parameters from the :ref:`top-level exception handler
<EL3 interrupts>`.

The platform is expected to populate an array using the macros above, and register
it with the RAS framework using the macro ``REGISTER_ERR_RECORD_INFO()``,
passing it the name of the array describing the records. Note that the macro
must be used in the same file where the array is defined.
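
The probe-then-handle contract described above can be modelled in a few lines
of standalone C. This is only a sketch under stated assumptions: the structure
layouts, the ``demo_*`` names, and the choice to stop at the first record that
reports an error are hypothetical stand-ins for illustration, not the TF-A
definitions, which come from the RAS framework headers and the
``ERR_RECORD_*_V1`` macros.

.. code:: c

    #include <assert.h>
    #include <stddef.h>

    /* Stand-in types so the sketch compiles alone (hypothetical layouts). */
    struct err_handler_data { unsigned int ea_reason; };
    struct err_record_info;

    typedef int (*err_record_probe_t)(const struct err_record_info *info,
                    int *probe_data);
    typedef int (*err_record_handler_t)(const struct err_record_info *info,
                    int probe_data, const struct err_handler_data *const data);

    struct err_record_info {
        err_record_probe_t probe;
        err_record_handler_t handler;
        void *aux_data;
    };

    /* Hypothetical node state: index of the record in error, or -1 if none. */
    static int pending_record = -1;
    static int last_handled = -1;

    /* Probe: return non-zero on error and report the record index. */
    static int demo_probe(const struct err_record_info *info, int *probe_data)
    {
        (void)info;
        if (pending_record < 0)
            return 0;
        *probe_data = pending_record;
        return 1;
    }

    /* Handler: consume the index produced by the probe and clear the error. */
    static int demo_handler(const struct err_record_info *info, int probe_data,
                    const struct err_handler_data *const data)
    {
        (void)info;
        (void)data;
        last_handled = probe_data;
        pending_record = -1;
        return 0;
    }

    /* The platform-side enumeration, written out by hand for the model. */
    static const struct err_record_info demo_records[] = {
        { .probe = demo_probe, .handler = demo_handler, .aux_data = NULL },
    };

    /* Model of the framework walk: returns 1 if an error was found and handled. */
    static int demo_dispatch(const struct err_record_info *records, size_t num,
                    const struct err_handler_data *data)
    {
        assert(records != NULL);
        for (size_t i = 0; i < num; i++) {
            int probe_data = 0;

            if (records[i].probe(&records[i], &probe_data) != 0) {
                (void)records[i].handler(&records[i], probe_data, data);
                return 1;
            }
        }
        return 0;
    }

In a real platform port the array entries would be produced by
``ERR_RECORD_MEMMAP_V1()`` or ``ERR_RECORD_SYSREG_V1()`` and the walk is done by
the framework itself; the model only shows how ``probe_data`` carries
information from the probe to the handler.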

Standard Error Record helpers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The |TF-A| RAS framework provides probe handlers for Standard Error Records, for
both memory-mapped and System Register accesses:

.. code:: c

    int ras_err_ser_probe_memmap(const struct err_record_info *info,
                    int *probe_data);

    int ras_err_ser_probe_sysreg(const struct err_record_info *info,
                    int *probe_data);

When the platform enumerates error records, for those records in the Standard
Error Record format, these helpers may be used instead of the platform rolling
out its own. Both helpers above:

- Return a non-zero value when an error is detected in a Standard Error Record;
- Set ``probe_data`` to the index of the error record upon detecting an error.

Registering RAS interrupts
--------------------------

RAS nodes can signal errors to the PE by raising Fault Handling and/or Error
Recovery interrupts. For the firmware-first handling paradigm for interrupts to
work, the platform must set up and register with |EHF|. See `Interaction with
Exception Handling Framework`_.

For each RAS interrupt, the platform has to provide a structure of type ``struct
ras_interrupt``:

- Interrupt number;
- The associated error record information (pointer to the corresponding
  ``struct err_record_info``);
- Optionally, a cookie.

The platform is expected to define an array of ``struct ras_interrupt``, and
register it with the RAS framework using the macro
``REGISTER_RAS_INTERRUPTS()``, passing it the name of the array. Note that the
macro must be used in the same file where the array is defined.

The array of ``struct ras_interrupt`` must be sorted in increasing order of
interrupt number. This allows for fast lookup of handlers in order to service RAS
interrupts.
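
The sorting requirement exists so that the lookup can be a binary search. The
following standalone sketch illustrates the idea; the field names of
``struct ras_interrupt``, the ``demo_*`` identifiers, and the interrupt numbers
are assumptions made for the example rather than the framework's definitions.

.. code:: c

    #include <assert.h>
    #include <stddef.h>

    struct err_record_info;  /* opaque for the purpose of this sketch */

    /* Hypothetical layout mirroring the three items listed above. */
    struct ras_interrupt {
        unsigned int intr_number;
        struct err_record_info *err_record;
        void *cookie;
    };

    /* Example table, sorted by increasing interrupt number (made-up values). */
    static const struct ras_interrupt demo_interrupts[] = {
        { .intr_number = 32U },
        { .intr_number = 40U },
        { .intr_number = 57U },
    };

    /* Binary search over the half-open range [lo, hi); NULL if not found. */
    static const struct ras_interrupt *demo_find_interrupt(
                    const struct ras_interrupt *intrs, size_t num,
                    unsigned int intr)
    {
        size_t lo = 0U;
        size_t hi = num;

        assert((intrs != NULL) || (num == 0U));
        while (lo < hi) {
            size_t mid = lo + ((hi - lo) / 2U);

            if (intrs[mid].intr_number == intr)
                return &intrs[mid];
            if (intrs[mid].intr_number < intr)
                lo = mid + 1U;
            else
                hi = mid;
        }
        return NULL;
    }

If the array were not sorted, the search could miss a registered interrupt, which
is why the ordering is a requirement rather than a recommendation.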

Double-fault handling
---------------------

A Double Fault condition arises when an error is signalled to the PE while
handling of a previously signalled error is still underway. When a Double Fault
condition arises, the Arm RAS extensions only require the handler to perform an
orderly shutdown of the system, as recovery may be impossible.

The RAS extensions part of Armv8.4 introduced new architectural features to deal
with Double Fault conditions, specifically, the introduction of ``NMEA`` and
``EASE`` bits to the ``SCR_EL3`` register. These were introduced to assist EL3
software which runs part of its entry/exit routines with exceptions momentarily
masked. In such systems, External Aborts/SErrors are not immediately
handled when they occur, but only after the exceptions are unmasked again.

|TF-A|, for legacy reasons, executes all EL3 code with all exceptions unmasked.
This means that all exceptions routed to EL3 are handled immediately. |TF-A|
is thus able to detect Double Fault conditions in software, without needing
the intended advantages of the Armv8.4 Double Fault architecture extensions.

Double faults are fatal: they terminate at the platform double fault handler and
do not return.

Engaging the RAS framework
--------------------------

Enabling RAS support is a platform choice.

The RAS support in |TF-A| introduces a default implementation of
``plat_ea_handler``, the External Abort handler in EL3. When ``ENABLE_FEAT_RAS``
is set to ``1``, it first calls the ``ras_ea_handler()`` function, which is the
top-level RAS exception handler. ``ras_ea_handler`` is responsible for iterating
through the platform-supplied error records, probing them, and when an error is
identified, looking up and invoking the corresponding error handler.

Note that, if the platform chooses to override the ``plat_ea_handler`` function
and intends to use the RAS framework, it must explicitly call
``ras_ea_handler()`` from within.
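
A platform override that keeps the RAS framework engaged might be shaped as in
the sketch below. To keep the example standalone, ``ras_ea_handler()`` is
replaced by a local stub; its parameter list, its "non-zero means handled"
return convention, and the ``demo_*`` counters are assumptions made for
illustration, so check the TF-A headers for the actual prototypes.

.. code:: c

    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    static unsigned int demo_ras_calls;       /* times the RAS stub was called */
    static unsigned int demo_fallback_calls;  /* times the platform path ran */

    /*
     * Stub standing in for the framework function. The signature and the
     * return convention are assumed, not taken from the TF-A sources.
     */
    static int ras_ea_handler(unsigned int ea_reason, uint64_t syndrome,
                    void *cookie, void *handle, uint64_t flags)
    {
        (void)ea_reason;
        (void)cookie;
        (void)handle;
        (void)flags;
        demo_ras_calls++;
        /* Pretend that any non-zero syndrome maps to a registered record. */
        return (syndrome != 0U) ? 1 : 0;
    }

    /* Platform override: give the RAS framework the first chance. */
    void plat_ea_handler(unsigned int ea_reason, uint64_t syndrome,
                    void *cookie, void *handle, uint64_t flags)
    {
        if (ras_ea_handler(ea_reason, syndrome, cookie, handle, flags) != 0)
            return;  /* the RAS framework dealt with the error */

        /* Platform-specific handling for everything else goes here. */
        demo_fallback_calls++;
    }

The point of the sketch is only the ordering: the override calls into the RAS
framework explicitly before applying any platform-specific policy.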

Similarly, for RAS interrupts, the framework defines
``ras_interrupt_handler()``. The RAS framework arranges for it to be invoked
when a RAS interrupt is taken at EL3. The function bisects the platform-supplied
sorted array of interrupts to look up the error record information associated
with the interrupt number. The error handler for that record is then invoked to
handle the error.

Interaction with Exception Handling Framework
---------------------------------------------

As mentioned in earlier sections, the RAS framework interacts with the |EHF| to
arbitrate handling of RAS exceptions with others that are routed to EL3. This
means that the platform must partition a :ref:`priority level <Partitioning
priority levels>` for handling RAS exceptions. The platform must then define
the macro ``PLAT_RAS_PRI`` to the priority level used for RAS exceptions.
Platforms would typically want to allocate the highest secure priority for
RAS handling.

Handling of both :ref:`interrupt <interrupt-flow>` and :ref:`non-interrupt
<non-interrupt-flow>` exceptions follows the sequences outlined in the |EHF|
documentation. That is, for interrupts, priority management is implicit; for
non-interrupt exceptions, it is explicit, using :ref:`EHF APIs
<Activating and Deactivating priorities>`.

--------------

*Copyright (c) 2018-2023, Arm Limited and Contributors. All rights reserved.*

.. _RAS Supplement: https://developer.arm.com/documentation/ddi0587/latest
.. _RAS Test group: https://git.trustedfirmware.org/ci/tf-a-ci-scripts.git/tree/group/tf-l3-boot-tests-ras?h=refs/heads/master

.. |Image 1| image:: ../resources/diagrams/bl31-exception-entry-error-synchronization.png