Reliability, Availability, and Serviceability (RAS) Extensions
**************************************************************

This document describes |TF-A| support for Arm Reliability, Availability, and
Serviceability (RAS) extensions. RAS is a mandatory extension for Armv8.2 and
later CPUs, and also an optional extension to the base Armv8.0 architecture.

For the description of Arm RAS extensions, Standard Error Records, and the
precise definition of RAS terminology, please refer to the Arm Architecture
Reference Manual and `RAS Supplement`_. The rest of this document assumes
familiarity with the architecture and its terminology.

From the Non-secure world's point of view, there are two philosophies for
handling RAS errors:

- :ref:`Firmware First Handling (FFH)`
- :ref:`Kernel First Handling (KFH)`

.. _Firmware First Handling (FFH):

Firmware First Handling (FFH)
=============================

Introduction
------------

External Aborts (EAs) and error interrupts corresponding to NS nodes are handled
first in firmware:

- Errors are signalled back to the NS world via a suitable mechanism
- The kernel is prohibited from accessing the RAS error records directly
- Firmware creates CPER records for the kernel to navigate and process
- Firmware signals the error back to the kernel via SDEI

Overview
--------

FFH works in conjunction with `Exception Handling Framework`. Exceptions resulting from
errors in the Non-secure world are routed to and handled in EL3. These errors are
Synchronous External Aborts (SEAs), Asynchronous External Aborts (signalled as
SErrors), and Fault Handling and Error Recovery interrupts.

The RAS framework in TF-A allows the platform to define an external abort handler and to
register RAS nodes and interrupts. It also provides `helpers`__ for accessing Standard
Error Records as introduced by the RAS extensions.

.. __: `Standard Error Record helpers`_

.. _Kernel First Handling (KFH):

Kernel First Handling (KFH)
===========================

Introduction
------------

EAs originating from, or attributed to, the NS world are handled first in NS, and
the kernel navigates the Standard Error Records directly.

**KFH can be supported in a platform without TF-A being aware of it, but there are a
few corner cases where TF-A needs special handling, which is currently missing and
will be added in the future.**

TF-A build options
==================

- **ENABLE_FEAT_RAS**: Enable RAS extension feature at EL3.
- **HANDLE_EA_EL3_FIRST_NS**: Required for FFH.
- **RAS_TRAP_NS_ERR_REC_ACCESS**: Trap Non-secure access of RAS error record registers.
- **RAS_EXTENSION**: Deprecated macro, equivalent to ENABLE_FEAT_RAS and
  HANDLE_EA_EL3_FIRST_NS put together.

RAS internal macros:

- **FFH_SUPPORT**: Gets enabled if **HANDLE_EA_EL3_FIRST_NS** is enabled.

The RAS feature depends on some other TF-A build flags:

- **EL3_EXCEPTION_HANDLING**: Required for FFH.
- **FAULT_INJECTION_SUPPORT**: Required for testing the RAS feature on the FVP platform.

RAS Framework
=============

.. _ras-figure:

.. image:: ../resources/diagrams/draw.io/ras.svg

Platform APIs
-------------

The RAS framework allows the platform to define handlers for External Abort,
Uncontainable Errors, Double Fault, and errors arising from EL3 execution. Please
refer to :ref:`RAS Porting Guide <External Abort handling and RAS Support>`.

Registering RAS error records
-----------------------------

RAS nodes are components in the system capable of signalling errors to PEs
through one of the notification mechanisms: SEAs, SErrors, or interrupts. RAS
nodes contain one or more error records, which are registers through which the
nodes advertise various properties of the signalled error. Arm recommends that
error records are implemented in the Standard Error Record format. The RAS
architecture allows for error records to be accessible via system or
memory-mapped registers.

The platform should enumerate the error records, providing for each of them:

- A handler to probe error records for errors;
- When the probing identifies an error, a handler to handle it;
- For a memory-mapped error record, its base address and size in KB; for a system
  register-accessed record, the start index of the record and number of
  contiguous records from that index;
- Any node-specific auxiliary data.

With this information supplied, when the run time firmware is notified through
one of these mechanisms, the RAS framework can iterate through and probe error
records for errors, and invoke the appropriate handler to handle them.

The RAS framework provides macros to populate error record information. The
macros are versioned, and the latest version as of this writing is 1. Each
macro creates a structure of type ``struct err_record_info`` from its arguments,
which is later passed to the probe and error handlers.

For memory-mapped error records:

.. code:: c

    ERR_RECORD_MEMMAP_V1(base_addr, size_num_k, probe, handler, aux)

And, for system register ones:

.. code:: c

    ERR_RECORD_SYSREG_V1(idx_start, num_idx, probe, handler, aux)

The probe handler must have the following prototype:

.. code:: c

    typedef int (*err_record_probe_t)(const struct err_record_info *info,
                    int *probe_data);

The probe handler must return a non-zero value if an error was detected, or 0
otherwise. The ``probe_data`` output parameter can be used to pass any useful
information resulting from probe to the error handler (see `below`__). For
example, it could return the index of the record.

.. __: `Standard Error Record helpers`_

The error handler must have the following prototype:

.. code:: c

    typedef int (*err_record_handler_t)(const struct err_record_info *info,
                   int probe_data, const struct err_handler_data *const data);

The ``data`` constant parameter describes the various properties of the error,
including the reason for the error, exception syndrome, and also ``flags``,
``cookie``, and ``handle`` parameters from the :ref:`top-level exception handler
<EL3 interrupts>`.

The platform is expected to populate an array using the macros above, and register
it with the RAS framework using the macro ``REGISTER_ERR_RECORD_INFO()``,
passing it the name of the array describing the records. Note that the macro
must be used in the same file where the array is defined.

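As an illustration of how the pieces above fit together, the following is a
minimal, self-contained sketch of the probe-then-handle flow. It is not the
TF-A implementation: the structure layouts, field names, the ``fake_status``
array, and every ``example_*`` name are hypothetical stand-ins so that the code
compiles on its own. Only the two handler prototypes are taken from this
document. A real platform would build its array with the ``V1`` macros and
register it with ``REGISTER_ERR_RECORD_INFO()`` instead.

.. code:: c

    #include <stddef.h>

    /* Simplified stand-ins for the framework types; not the TF-A definitions. */
    struct err_record_info;
    struct err_handler_data {
        unsigned int ea_reason;
    };

    typedef int (*err_record_probe_t)(const struct err_record_info *info,
                    int *probe_data);
    typedef int (*err_record_handler_t)(const struct err_record_info *info,
                   int probe_data, const struct err_handler_data *const data);

    struct err_record_info {
        unsigned int idx_start;        /* first record index in the group */
        unsigned int num_idx;          /* number of contiguous records */
        err_record_probe_t probe;
        err_record_handler_t handler;
    };

    /* Hypothetical status words standing in for the hardware error records. */
    static unsigned int fake_status[4];
    static int last_handled_idx = -1;

    /* Report the first record in the group that has a pending error. */
    static int example_probe(const struct err_record_info *info, int *probe_data)
    {
        for (unsigned int i = 0U; i < info->num_idx; i++) {
            unsigned int idx = info->idx_start + i;

            if (fake_status[idx] != 0U) {
                *probe_data = (int)idx;
                return 1;
            }
        }
        return 0;
    }

    /* Clear the record identified by the probe and remember its index. */
    static int example_handler(const struct err_record_info *info, int probe_data,
                    const struct err_handler_data *const data)
    {
        (void)info;
        (void)data;
        fake_status[probe_data] = 0U;
        last_handled_idx = probe_data;
        return 0;
    }

    /* Two record groups; a real platform would build these with the V1 macros. */
    static const struct err_record_info example_records[] = {
        { .idx_start = 0U, .num_idx = 2U,
          .probe = example_probe, .handler = example_handler },
        { .idx_start = 2U, .num_idx = 2U,
          .probe = example_probe, .handler = example_handler },
    };

    /* Probe every group and handle each reported error; return the count. */
    int example_dispatch(const struct err_handler_data *data)
    {
        size_t num = sizeof(example_records) / sizeof(example_records[0]);
        int handled = 0;

        for (size_t i = 0U; i < num; i++) {
            const struct err_record_info *info = &example_records[i];
            int probe_data = 0;

            if (info->probe(info, &probe_data) != 0) {
                (void)info->handler(info, probe_data, data);
                handled++;
            }
        }
        return handled;
    }

In this sketch the probe passes the record index to the handler through
``probe_data``, which mirrors the usage suggested for that parameter above.
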
Standard Error Record helpers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The |TF-A| RAS framework provides probe handlers for Standard Error Records, for
both memory-mapped and System Register accesses:

.. code:: c

    int ras_err_ser_probe_memmap(const struct err_record_info *info,
                    int *probe_data);

    int ras_err_ser_probe_sysreg(const struct err_record_info *info,
                    int *probe_data);

When the platform enumerates error records, for those records in the Standard
Error Record format, these helpers may be used instead of the platform rolling
out its own. Both helpers above:

- Return non-zero value when an error is detected in a Standard Error Record;
- Set ``probe_data`` to the index of the error record upon detecting an error.

Registering RAS interrupts
--------------------------

RAS nodes can signal errors to the PE by raising Fault Handling and/or Error
Recovery interrupts. For the firmware-first handling paradigm for interrupts to
work, the platform must set up and register with |EHF|. See `Interaction with
Exception Handling Framework`_.

For each RAS interrupt, the platform has to provide a structure of type
``struct ras_interrupt``, containing:

- Interrupt number;
- The associated error record information (pointer to the corresponding
  ``struct err_record_info``);
- Optionally, a cookie.

The platform is expected to define an array of ``struct ras_interrupt``, and
register it with the RAS framework using the macro
``REGISTER_RAS_INTERRUPTS()``, passing it the name of the array. Note that the
macro must be used in the same file where the array is defined.

The array of ``struct ras_interrupt`` must be sorted in increasing order of
interrupt number. This allows for fast lookup of handlers in order to service RAS
interrupts.

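To show why the ordering requirement matters, here is a small sketch of a
binary search over a sorted interrupt array, using the standard C ``bsearch()``
function. The structure layout, the interrupt numbers, and all ``example_*``
and ``rec_*`` names are assumptions made for illustration; the framework's own
lookup code and the real ``struct ras_interrupt`` definition may differ.

.. code:: c

    #include <stddef.h>
    #include <stdlib.h>

    /* Placeholder for the error record information; not the TF-A definition. */
    struct err_record_info {
        int id;
    };

    /* Assumed layout, following the three items listed above. */
    struct ras_interrupt {
        const struct err_record_info *err_record;
        unsigned int intr_number;
        void *cookie;
    };

    static const struct err_record_info rec_a = { 1 };
    static const struct err_record_info rec_b = { 2 };
    static const struct err_record_info rec_c = { 3 };

    /* Made-up interrupt numbers, listed in increasing order as required. */
    static const struct ras_interrupt example_interrupts[] = {
        { .intr_number = 40U, .err_record = &rec_a },
        { .intr_number = 57U, .err_record = &rec_b },
        { .intr_number = 92U, .err_record = &rec_c },
    };

    /* bsearch() comparator: the key is an interrupt number, elem an entry. */
    static int example_cmp(const void *key, const void *elem)
    {
        unsigned int k = *(const unsigned int *)key;
        unsigned int n = ((const struct ras_interrupt *)elem)->intr_number;

        return (k > n) - (k < n);
    }

    /* Return the entry for intr_number, or NULL if it is not a RAS interrupt. */
    const struct ras_interrupt *example_lookup(unsigned int intr_number)
    {
        return bsearch(&intr_number, example_interrupts,
                       sizeof(example_interrupts) / sizeof(example_interrupts[0]),
                       sizeof(example_interrupts[0]), example_cmp);
    }

If the array were not sorted, the binary search could miss entries that are
present, so an interrupt might go unserviced.
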
Double-fault handling
---------------------

A Double Fault condition arises when an error is signalled to the PE while
handling of a previously signalled error is still underway. When a Double Fault
condition arises, the Arm RAS extensions only require the handler to perform an
orderly shutdown of the system, as recovery may be impossible.

The RAS extensions part of Armv8.4 introduced new architectural features to deal
with Double Fault conditions, specifically, the introduction of ``NMEA`` and
``EASE`` bits to ``SCR_EL3`` register. These were introduced to assist EL3
software which runs part of its entry/exit routines with exceptions momentarily
masked, meaning that, in such systems, External Aborts/SErrors are not immediately
handled when they occur, but only after the exceptions are unmasked again.

|TF-A|, for legacy reasons, executes all of EL3 with all exceptions unmasked.
This means that all exceptions routed to EL3 are handled immediately. |TF-A|
is thus able to detect a Double Fault condition in software, without needing
the intended advantages of the Armv8.4 Double Fault architecture extensions.

Double faults are fatal: handling terminates at the platform double fault
handler, which doesn't return.

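The idea of detecting a Double Fault in software can be sketched with a simple
"handling in progress" marker. This is only a conceptual model under stated
assumptions and is not how |TF-A| implements the check: every name below is
hypothetical, a real system would track the state per PE, and a real double
fault handler would shut the system down rather than return.

.. code:: c

    #include <stdbool.h>
    #include <stddef.h>

    /* Set while an error is being handled. */
    static bool error_in_progress;
    static unsigned int double_fault_count;

    /* Hypothetical stand-in; a real platform double fault handler never returns. */
    static void example_double_fault_handler(void)
    {
        double_fault_count++;
    }

    /* Run 'body' as the error handler, flagging any error that arrives meanwhile. */
    void example_handle_error(void (*body)(void))
    {
        if (error_in_progress) {
            example_double_fault_handler();
            return; /* reached only because the stand-in returns */
        }

        error_in_progress = true;
        if (body != NULL)
            body();
        error_in_progress = false;
    }

    /* Simulates a second error being signalled while the first is handled. */
    void example_nested_error(void)
    {
        example_handle_error(NULL);
    }

Calling ``example_handle_error(example_nested_error)`` models the second error
arriving before the first handler has finished, which is the condition
described at the start of this section.
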
Engaging the RAS framework
--------------------------

Enabling RAS support is a platform choice.

The RAS support in |TF-A| introduces a default implementation of
``plat_ea_handler``, the External Abort handler in EL3. When ``ENABLE_FEAT_RAS``
is set to ``1``, it'll first call the ``ras_ea_handler()`` function, which is the
top-level RAS exception handler. ``ras_ea_handler`` is responsible for iterating
through platform-supplied error records, probing them, and, when an error is
identified, looking up and invoking the corresponding error handler.

Note that, if the platform chooses to override the ``plat_ea_handler`` function
and intends to use the RAS framework, it must explicitly call
``ras_ea_handler()`` from within.

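A platform override along those lines might look like the sketch below. The
parameter lists and the "non-zero means handled" return convention are
assumptions, not taken from this document, and should be checked against the
TF-A headers. ``ras_ea_handler()`` is stubbed here only so that the sketch
compiles on its own, and ``example_unhandled_ea()`` is a hypothetical platform
fallback.

.. code:: c

    #include <stdint.h>

    /*
     * Stand-in for the framework entry point. In a real port the declaration
     * comes from the TF-A RAS headers; this stub pretends that only syndrome
     * 0x1 matches a registered error record.
     */
    int ras_ea_handler(unsigned int ea_reason, uint64_t syndrome, void *cookie,
                    void *handle, uint64_t flags)
    {
        (void)ea_reason;
        (void)cookie;
        (void)handle;
        (void)flags;
        return (syndrome == 0x1U) ? 1 : 0;
    }

    static unsigned int unhandled_ea_count;

    /* Hypothetical fallback for aborts that the RAS framework did not claim. */
    static void example_unhandled_ea(uint64_t syndrome)
    {
        (void)syndrome;
        unhandled_ea_count++;
    }

    /* Platform override: give the RAS framework the first chance at the error. */
    void plat_ea_handler(unsigned int ea_reason, uint64_t syndrome, void *cookie,
                    void *handle, uint64_t flags)
    {
        if (ras_ea_handler(ea_reason, syndrome, cookie, handle, flags) != 0)
            return;

        example_unhandled_ea(syndrome);
    }

What a platform does with an abort that no registered record claims is a
platform policy decision; the counter here merely marks where that policy
would go.
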
Similarly, for RAS interrupts, the framework defines
``ras_interrupt_handler()``. The RAS framework arranges for it to be invoked
when a RAS interrupt is taken at EL3. The function bisects the platform-supplied
sorted array of interrupts to look up the error record information associated
with the interrupt number. The error handler for that record is then invoked to
handle the error.

Interaction with Exception Handling Framework
---------------------------------------------

As mentioned in earlier sections, the RAS framework interacts with the |EHF| to
arbitrate handling of RAS exceptions with others that are routed to EL3. This
means that the platform must partition a :ref:`priority level <Partitioning
priority levels>` for handling RAS exceptions. The platform must then define
the macro ``PLAT_RAS_PRI`` to the priority level used for RAS exceptions.
Platforms would typically want to allocate the highest secure priority for
RAS handling.

Handling of both :ref:`interrupt <interrupt-flow>` and :ref:`non-interrupt
<non-interrupt-flow>` exceptions follows the sequences outlined in the |EHF|
documentation. That is, for interrupts, the priority management is implicit; but
for non-interrupt exceptions, it is explicit using :ref:`EHF APIs
<Activating and Deactivating priorities>`.

--------------

*Copyright (c) 2018-2023, Arm Limited and Contributors. All rights reserved.*

.. _RAS Supplement: https://developer.arm.com/documentation/ddi0587/latest