Reliability, Availability, and Serviceability (RAS) Extensions
==============================================================

.. |EHF| replace:: Exception Handling Framework
.. |TF-A| replace:: Trusted Firmware-A

This document describes |TF-A| support for Arm Reliability, Availability, and
Serviceability (RAS) extensions. RAS is a mandatory extension for Armv8.2 and
later CPUs, and also an optional extension to the base Armv8.0 architecture.

In conjunction with the |EHF|, support for the RAS extensions enables a
firmware-first paradigm for handling platform errors: exceptions resulting from
errors are routed to and handled in EL3. The errors in question are Synchronous
External Aborts (SEA), Asynchronous External Aborts (signalled as SErrors), and
Fault Handling and Error Recovery interrupts. The |EHF| document mentions
various `error handling use-cases`__.

.. __: exception-handling.rst#delegation-use-cases

For the description of Arm RAS extensions, Standard Error Records, and the
precise definition of RAS terminology, please refer to the Arm Architecture
Reference Manual. The rest of this document assumes familiarity with the
architecture and its terminology.

Overview
--------

As mentioned above, the RAS support in |TF-A| enables routing to and handling of
exceptions resulting from platform errors in EL3. It allows the platform to
define an External Abort handler, and to register RAS nodes and interrupts. The
RAS framework also provides `helpers`__ for accessing Standard Error Records as
introduced by the RAS extensions.

.. __: `Standard Error Record helpers`_

The build option ``RAS_EXTENSION``, when set to ``1``, includes the RAS
framework in the run time firmware; ``EL3_EXCEPTION_HANDLING`` and
``HANDLE_EA_EL3_FIRST`` must also be set to ``1``.

.. _ras-figure:

.. image:: ../resources/diagrams/draw.io/ras.svg

See more on `Engaging the RAS framework`_.

Platform APIs
-------------

The RAS framework allows the platform to define handlers for External Abort,
Uncontainable Errors, Double Fault, and errors arising from EL3 execution.
Please refer to the porting guide for the `RAS platform API descriptions`__.

.. __: ../getting_started/porting-guide.rst#external-abort-handling-and-ras-support

Registering RAS error records
-----------------------------

RAS nodes are components in the system capable of signalling errors to PEs
through one of the notification mechanisms: SEAs, SErrors, or interrupts. RAS
nodes contain one or more error records, which are registers through which the
nodes advertise various properties of the signalled error. Arm recommends that
error records are implemented in the Standard Error Record format. The RAS
architecture allows for error records to be accessible via system or
memory-mapped registers.

The platform should enumerate the error records, providing for each of them:

- A handler to probe error records for errors;
- When the probing identifies an error, a handler to handle it;
- For a memory-mapped error record, its base address and size in KB; for a
  system register-accessed record, the start index of the record and the number
  of contiguous records from that index;
- Any node-specific auxiliary data.

With this information supplied, when the run time firmware receives a
notification through one of these mechanisms, the RAS framework can iterate
through the error records, probe each of them for errors, and invoke the
appropriate handler for any error found.

The RAS framework provides macros to populate error record information. The
macros are versioned, and the latest version as of this writing is 1. Each
macro creates a structure of type ``struct err_record_info`` from its
arguments; that structure is later passed to the probe and error handlers.

For memory-mapped error records:

.. code:: c

    ERR_RECORD_MEMMAP_V1(base_addr, size_num_k, probe, handler, aux)

And, for system register ones:

.. code:: c

    ERR_RECORD_SYSREG_V1(idx_start, num_idx, probe, handler, aux)

The probe handler must have the following prototype:

.. code:: c

    typedef int (*err_record_probe_t)(const struct err_record_info *info,
                    int *probe_data);

The probe handler must return a non-zero value if an error was detected, or 0
otherwise. The ``probe_data`` output parameter can be used to pass any useful
information resulting from probe to the error handler (see `below`__). For
example, it could return the index of the record.

.. __: `Standard Error Record helpers`_

The error handler must have the following prototype:

.. code:: c

    typedef int (*err_record_handler_t)(const struct err_record_info *info,
                    int probe_data, const struct err_handler_data *const data);

The ``data`` constant parameter describes the various properties of the error,
including the reason for the error, exception syndrome, and also ``flags``,
``cookie``, and ``handle`` parameters from the `top-level exception handler`__.

.. __: interrupt-framework-design.rst#el3-interrupts

The platform is expected to populate an array using the macros above, and
register it with the RAS framework using the macro
``REGISTER_ERR_RECORD_INFO()``, passing it the name of the array describing the
records. Note that the macro must be used in the same file where the array is
defined.
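
The probe-then-handle flow described in this section can be modelled in
standalone C. The sketch below is illustrative only: ``struct model_record``,
``struct model_node``, and every ``model_*`` function are hypothetical stand-ins
rather than |TF-A| definitions, the handler takes fewer parameters than
``err_record_handler_t``, and the policy of visiting every record (instead of
stopping at the first error) is an assumption made for the example.

.. code:: c

    #include <assert.h>
    #include <stddef.h>

    struct model_record;

    /* Simplified stand-ins for the probe and handler prototypes. */
    typedef int (*model_probe_t)(const struct model_record *rec, int *probe_data);
    typedef int (*model_handler_t)(const struct model_record *rec, int probe_data);

    /* Hypothetical record descriptor: two handlers plus auxiliary data. */
    struct model_record {
        model_probe_t probe;     /* returns non-zero when an error is pending */
        model_handler_t handler; /* invoked only after a successful probe */
        void *aux;               /* node-specific auxiliary data */
    };

    /* Hypothetical node state, reached through the aux pointer. */
    struct model_node {
        int error_pending; /* non-zero while the node reports an error */
        int record_index;  /* index the probe passes on as probe_data */
        int handled_index; /* last index seen by the handler, or -1 */
    };

    /* Probe: report an error if one is pending and pass on the record index. */
    int model_probe(const struct model_record *rec, int *probe_data)
    {
        const struct model_node *node = rec->aux;

        if (node->error_pending == 0)
            return 0;
        *probe_data = node->record_index;
        return 1;
    }

    /* Handler: note which record was handled and clear the pending error. */
    int model_handler(const struct model_record *rec, int probe_data)
    {
        struct model_node *node = rec->aux;

        node->handled_index = probe_data;
        node->error_pending = 0;
        return 0;
    }

    /*
     * Probe every record in order and invoke the handler of each record that
     * reports an error. Returns the number of errors handled, or -1 as soon
     * as a handler fails.
     */
    int model_ras_dispatch(const struct model_record *records, size_t count)
    {
        int handled = 0;

        assert(records != NULL || count == 0U);
        for (size_t i = 0; i < count; i++) {
            int probe_data = 0;

            if (records[i].probe(&records[i], &probe_data) == 0)
                continue;
            if (records[i].handler(&records[i], probe_data) != 0)
                return -1;
            handled++;
        }
        return handled;
    }

Calling ``model_ras_dispatch()`` on an array in which one node has a pending
error invokes only that node's handler; a second call then finds nothing left
to handle.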

Standard Error Record helpers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The |TF-A| RAS framework provides probe handlers for Standard Error Records, for
both memory-mapped and System Register accesses:

.. code:: c

    int ras_err_ser_probe_memmap(const struct err_record_info *info,
                    int *probe_data);

    int ras_err_ser_probe_sysreg(const struct err_record_info *info,
                    int *probe_data);

When the platform enumerates error records, for those records in the Standard
Error Record format, these helpers may be used instead of platform-specific
probe handlers. Both helpers above:

- Return a non-zero value when an error is detected in a Standard Error Record;
- Set ``probe_data`` to the index of the error record upon detecting an error.
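
The contract in the two bullets above can be illustrated with a small
standalone model. It is a sketch under stated assumptions, not the helpers'
implementation: ``model_ser_probe()`` and ``MODEL_ERR_STATUS_V`` are
hypothetical names, and the error record status values are supplied as a plain
array instead of being read from memory-mapped or system registers. The only
architectural fact relied upon is that bit 30 (``V``) of an ``ERR<n>STATUS``
register marks the record as holding a valid error.

.. code:: c

    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    /* ERR<n>STATUS.V (bit 30): the record holds a valid error. */
    #define MODEL_ERR_STATUS_V (UINT64_C(1) << 30)

    /*
     * Scan a group of status values (stand-ins for ERR<n>STATUS reads).
     * On the first valid record, store its index in *probe_data and return 1.
     * Return 0, leaving *probe_data untouched, if no record is valid.
     */
    int model_ser_probe(const uint64_t *status, size_t num_records,
                        int *probe_data)
    {
        assert(status != NULL && probe_data != NULL);
        for (size_t i = 0; i < num_records; i++) {
            if ((status[i] & MODEL_ERR_STATUS_V) != 0U) {
                *probe_data = (int)i;
                return 1;
            }
        }
        return 0;
    }

A platform error handler receiving ``probe_data`` from such a probe knows which
record of the group to inspect, without scanning the group again.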

Registering RAS interrupts
--------------------------

RAS nodes can signal errors to the PE by raising Fault Handling and/or Error
Recovery interrupts. For the firmware-first handling paradigm for interrupts to
work, the platform must set up and register with |EHF|. See `Interaction with
Exception Handling Framework`_.

For each RAS interrupt, the platform has to provide a structure of type
``struct ras_interrupt``, containing:

- Interrupt number;
- The associated error record information (pointer to the corresponding
  ``struct err_record_info``);
- Optionally, a cookie.

The platform is expected to define an array of ``struct ras_interrupt``, and
register it with the RAS framework using the macro
``REGISTER_RAS_INTERRUPTS()``, passing it the name of the array. Note that the
macro must be used in the same file where the array is defined.

The array of ``struct ras_interrupt`` must be sorted in increasing order of
interrupt number. This allows for fast lookup of handlers in order to service
RAS interrupts.
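
The sorting requirement exists because a sorted array can be bisected. The
following standalone sketch shows such a lookup; ``struct model_ras_interrupt``
and ``model_find_interrupt()`` are hypothetical stand-ins whose layout and
names are assumptions made for illustration, not the framework's own
definitions.

.. code:: c

    #include <assert.h>
    #include <stddef.h>

    /* Hypothetical stand-in for one entry of the platform's interrupt array. */
    struct model_ras_interrupt {
        unsigned int intr_number; /* interrupt number, the sort key */
        const void *err_record;   /* associated error record information */
        void *cookie;             /* optional cookie */
    };

    /*
     * Binary search over an array sorted by increasing intr_number.
     * Returns the matching entry, or NULL if the number is not registered.
     */
    const struct model_ras_interrupt *model_find_interrupt(
            const struct model_ras_interrupt *table, size_t count,
            unsigned int intr_number)
    {
        size_t lo = 0;
        size_t hi = count; /* search the half-open range [lo, hi) */

        assert(table != NULL || count == 0U);
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;

            if (table[mid].intr_number == intr_number)
                return &table[mid];
            if (table[mid].intr_number < intr_number)
                lo = mid + 1;
            else
                hi = mid;
        }
        return NULL;
    }

The lookup takes a number of steps logarithmic in the array length, but it
silently misses entries if the array is not sorted, which is why the ordering
is a requirement rather than a recommendation.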

Double-fault handling
---------------------

A Double Fault condition arises when an error is signalled to the PE while
handling of a previously signalled error is still underway. When a Double Fault
condition arises, the Arm RAS extensions only require the handler to perform an
orderly shutdown of the system, as recovery may be impossible.

The RAS extensions in Armv8.4 introduced new architectural features to deal
with Double Fault conditions, specifically the ``NMEA`` and ``EASE`` bits in the
``SCR_EL3`` register. These were introduced to assist EL3 software which runs
part of its entry/exit routines with exceptions momentarily masked. In such
systems, External Aborts/SErrors are not handled immediately when they occur,
but only after the exceptions are unmasked again.

|TF-A|, for legacy reasons, runs all of its EL3 code with all exceptions
unmasked. This means that all exceptions routed to EL3 are handled immediately.
|TF-A| is thus able to detect Double Fault conditions in software, without
needing the intended advantages of the Armv8.4 Double Fault architecture
extensions.

Double faults are fatal: handling terminates at the platform double fault
handler, which does not return.

Engaging the RAS framework
--------------------------

Enabling RAS support is a platform choice constructed from three distinct, but
related, build options:

- ``RAS_EXTENSION=1`` includes the RAS framework in the run time firmware;

- ``EL3_EXCEPTION_HANDLING=1`` enables handling of exceptions at EL3. See
  `Interaction with Exception Handling Framework`_;

- ``HANDLE_EA_EL3_FIRST=1`` enables routing of External Aborts and SErrors to
  EL3.

The RAS support in |TF-A| introduces a default implementation of
``plat_ea_handler``, the External Abort handler in EL3. When ``RAS_EXTENSION``
is set to ``1``, it first calls the ``ras_ea_handler()`` function, which is the
top-level RAS exception handler. ``ras_ea_handler()`` is responsible for
iterating through the platform-supplied error records, probing them, and, when
an error is identified, looking up and invoking the corresponding error handler.

Note that, if the platform chooses to override the ``plat_ea_handler`` function
and intends to use the RAS framework, it must explicitly call
``ras_ea_handler()`` from within.

Similarly, for RAS interrupts, the framework defines
``ras_interrupt_handler()``. The RAS framework arranges for it to be invoked
when a RAS interrupt is taken at EL3. The function bisects the platform-supplied
sorted array of interrupts to look up the error record information associated
with the interrupt number. The error handler for that record is then invoked to
handle the error.

Interaction with Exception Handling Framework
---------------------------------------------

As mentioned in earlier sections, the RAS framework interacts with the |EHF| to
arbitrate handling of RAS exceptions with others that are routed to EL3. This
means that the platform must partition a `priority level`__ for handling RAS
exceptions. The platform must then define the macro ``PLAT_RAS_PRI`` to the
priority level used for RAS exceptions. Platforms would typically want to
allocate the highest secure priority for RAS handling.

.. __: exception-handling.rst#partitioning-priority-levels

Handling of both `interrupt`__ and `non-interrupt`__ exceptions follows the
sequences outlined in the |EHF| documentation. That is, for interrupts, the
priority management is implicit; for non-interrupt exceptions, it is explicit,
using `EHF APIs`__.

.. __: exception-handling.rst#interrupt-flow
.. __: exception-handling.rst#non-interrupt-flow
.. __: exception-handling.rst#activating-and-deactivating-priorities

----

*Copyright (c) 2018, Arm Limited and Contributors. All rights reserved.*