Reliability, Availability, and Serviceability (RAS) Extensions
**************************************************************

This document describes |TF-A| support for Arm Reliability, Availability, and
Serviceability (RAS) extensions. RAS is a mandatory extension for Armv8.2 and
later CPUs, and also an optional extension to the base Armv8.0 architecture.

For the description of Arm RAS extensions, Standard Error Records, and the
precise definition of RAS terminology, please refer to the Arm Architecture
Reference Manual and `RAS Supplement`_. The rest of this document assumes
familiarity with the architecture and its terminology.

**IMPORTANT NOTE**: The TF-A implementation assumes that if the RAS extension is
present, then FEAT_IESB is also implemented.

There are two philosophies for handling RAS errors from the Non-secure world's
point of view:

- :ref:`Firmware First Handling (FFH)`
- :ref:`Kernel First Handling (KFH)`

.. _Firmware First Handling (FFH):

Firmware First Handling (FFH)
=============================

Introduction
------------

EAs and error interrupts corresponding to NS nodes are handled first in firmware:

- Errors are signaled back to the NS world via a suitable mechanism
- The kernel is prohibited from accessing the RAS error records directly
- Firmware creates CPER records for the kernel to navigate and process
- Firmware signals the error back to the kernel via SDEI

Overview
--------

FFH works in conjunction with `Exception Handling Framework`. Exceptions resulting from
errors in the Non-secure world are routed to and handled in EL3. These errors are Synchronous
External Aborts (SEA), Asynchronous External Aborts (signalled as SErrors), and Fault Handling
and Error Recovery interrupts.
The RAS framework in TF-A allows the platform to define an external abort handler and to
register RAS nodes and interrupts. It also provides `helpers`__ for accessing Standard
Error Records as introduced by the RAS extensions.

.. __: `Standard Error Record helpers`_

.. _Kernel First Handling (KFH):

Kernel First Handling (KFH)
===========================

Introduction
------------

EAs originating from or attributed to the NS world are handled first in NS, and the kernel
navigates the standard error records directly.

- KFH is the default handling mode if the platform does not explicitly enable FFH mode.
- KFH mode does not need any EL3 involvement except for the reflection of errors back
  to the lower EL. This happens when there is an error (EA) in the system which is not yet
  signaled to the PE while executing at a lower EL. During entry into EL3, the errors (EA) are
  synchronized, causing the async EA to pend at EL3.

Error Synchronization at EL3 entry
==================================

During entry to EL3 from a lower EL, any pending async EAs are either
reflected back to the lower EL (KFH) or handled in EL3 itself (FFH).

|Image 1|

TF-A build options
==================

- **ENABLE_FEAT_RAS**: Enable RAS extension feature at EL3.
- **HANDLE_EA_EL3_FIRST_NS**: Required for FFH.
- **RAS_TRAP_NS_ERR_REC_ACCESS**: Trap Non-secure access of RAS error record registers.
- **RAS_EXTENSION**: Deprecated macro, equivalent to ENABLE_FEAT_RAS and
  HANDLE_EA_EL3_FIRST_NS put together.

RAS internal macros:

- **FFH_SUPPORT**: Gets enabled if **HANDLE_EA_EL3_FIRST_NS** is enabled.

The RAS feature depends on some other TF-A build flags:

- **EL3_EXCEPTION_HANDLING**: Required for FFH.
- **FAULT_INJECTION_SUPPORT**: Required for testing the RAS feature on the FVP platform.

TF-A Tests
==========

RAS functionality is regularly tested in TF-A CI using the `RAS test group`_, which has multiple
configurations for testing lower EL External Aborts.

All the tests are written in TF-A Tests, which runs as an NS-EL2 payload.

- **FFH without RAS extension**

  *fvp-ea-ffh,fvp-ea-ffh:fvp-tftf-fip.tftf-aemv8a-debug*

  A couple of tests, one each for sync EA and async EA from a lower EL, which get handled in EL3.
  The tests inject External Aborts (sync/async) which trap to EL3. FVP has a handler which
  gracefully handles these errors and returns to TF-A Tests.

  Build configs: **HANDLE_EA_EL3_FIRST_NS**, **PLATFORM_TEST_EA_FFH**

- **FFH with RAS extension**

  Three tests:

  - *fvp-ras-ffh,fvp-single-fault:fvp-tftf-fip.tftf-aemv8a.fi-debug*

    Inject an unrecoverable RAS error, which gets handled in EL3.

  - *fvp-ras-ffh,fvp-uncontainable:fvp-tftf.fault-fip.tftf-aemv8a.fi-debug*

    Inject uncontainable RAS errors, which cause the platform to panic.

  - *fvp-ras-ffh,fvp-ras-ffh-nested:fvp-tftf-fip.tftf-ras_ffh_nested-aemv8a.fi-debug*

    Test nested exception handling at EL3 for synchronized async EAs. Inject an SError in a lower
    EL, which remains pending until EL3 is entered through an SMC call. On encountering the pending
    async EA at EL3 entry, EL3 handles the async EA first (nested exception) before handling the
    original SMC call.

- **KFH with RAS extension**

  A couple of tests in the group:

  - *fvp-ras-kfh,fvp-ras-kfh:fvp-tftf-fip.tftf-aemv8a.fi-debug*

    Inject and handle RAS errors in TF-A Tests (no EL3 involvement).

  - *fvp-ras-kfh,fvp-ras-kfh-reflect:fvp-tftf-fip.tftf-ras_kfh_reflection-aemv8a.fi-debug*

    Reflection of synchronized errors from EL3 to TF-A Tests. There are two tests, one each for
    reflection in the IRQ and SMC paths.

RAS Framework
=============

.. _ras-figure:

.. image:: ../resources/diagrams/draw.io/ras.svg

Platform APIs
-------------

The RAS framework allows the platform to define handlers for External Abort,
Uncontainable Errors, Double Fault, and errors arising from EL3 execution. Please
refer to :ref:`RAS Porting Guide <External Abort handling and RAS Support>`.

Registering RAS error records
-----------------------------

RAS nodes are components in the system capable of signalling errors to PEs
through one of the notification mechanisms: SEAs, SErrors, or interrupts. RAS
nodes contain one or more error records, which are registers through which the
nodes advertise various properties of the signalled error. Arm recommends that
error records are implemented in the Standard Error Record format. The RAS
architecture allows for error records to be accessible via system or
memory-mapped registers.

The platform should enumerate the error records, providing for each of them:

- A handler to probe error records for errors;
- When the probing identifies an error, a handler to handle it;
- For a memory-mapped error record, its base address and size in KB; for a system
  register-accessed record, the start index of the record and the number of
  contiguous records from that index;
- Any node-specific auxiliary data.

With this information supplied, when the run time firmware receives one of the
notification mechanisms, the RAS framework can iterate through and probe error
records for errors, and invoke the appropriate handler to handle each one.

The RAS framework provides macros to populate error record information. The
macros are versioned, and the latest version as of this writing is 1. Each of
these macros creates a structure of type ``struct err_record_info`` from its
arguments, which is later passed to the probe and error handlers.

For memory-mapped error records:

.. code:: c

    ERR_RECORD_MEMMAP_V1(base_addr, size_num_k, probe, handler, aux)

And, for system register ones:

.. code:: c

    ERR_RECORD_SYSREG_V1(idx_start, num_idx, probe, handler, aux)

The probe handler must have the following prototype:

.. code:: c

    typedef int (*err_record_probe_t)(const struct err_record_info *info,
                    int *probe_data);

The probe handler must return a non-zero value if an error was detected, or 0
otherwise. The ``probe_data`` output parameter can be used to pass any useful
information resulting from the probe to the error handler (see `below`__). For
example, it could return the index of the record.

.. __: `Standard Error Record helpers`_

The error handler must have the following prototype:

.. code:: c

    typedef int (*err_record_handler_t)(const struct err_record_info *info,
                    int probe_data, const struct err_handler_data *const data);

The ``data`` constant parameter describes the various properties of the error,
including the reason for the error, exception syndrome, and also ``flags``,
``cookie``, and ``handle`` parameters from the :ref:`top-level exception handler
<EL3 interrupts>`.

The platform is expected to populate an array using the macros above, and
register it with the RAS framework using the macro ``REGISTER_ERR_RECORD_INFO()``,
passing it the name of the array describing the records. Note that the macro
must be used in the same file where the array is defined.

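The probe-then-handle flow described in this section can be illustrated with a
small, self-contained sketch. It does not use the real TF-A headers:
``struct err_record_info`` and ``struct err_handler_data`` below are simplified
stand-ins that only mirror the prototypes shown above, and ``demo_probe``,
``demo_handler``, ``demo_records`` and ``demo_dispatch`` are hypothetical names.
In a real port, the array would be populated with the ``ERR_RECORD_*_V1`` macros
and registered with ``REGISTER_ERR_RECORD_INFO()``.

.. code:: c

    #include <assert.h>
    #include <stddef.h>

    /* Simplified stand-ins for the framework types (assumptions, not TF-A's). */
    struct err_handler_data {
        unsigned int ea_reason;    /* why the handler was entered */
    };

    struct err_record_info;

    typedef int (*err_record_probe_t)(const struct err_record_info *info,
                                      int *probe_data);
    typedef int (*err_record_handler_t)(const struct err_record_info *info,
                                        int probe_data,
                                        const struct err_handler_data *const data);

    struct err_record_info {
        err_record_probe_t probe;
        err_record_handler_t handler;
        const void *aux_data;      /* node-specific auxiliary data */
    };

    /* Hypothetical probe: aux_data points at an int that is non-zero while
     * the simulated node has an error pending. */
    static int demo_probe(const struct err_record_info *info, int *probe_data)
    {
        const int *pending = info->aux_data;

        if (*pending == 0)
            return 0;
        *probe_data = 0;           /* index of the record holding the error */
        return 1;
    }

    static int demo_handled_count;

    /* Hypothetical handler: only counts how often it was invoked. */
    static int demo_handler(const struct err_record_info *info, int probe_data,
                            const struct err_handler_data *const data)
    {
        (void)info; (void)probe_data; (void)data;
        demo_handled_count++;
        return 0;
    }

    static int demo_node0_pending;        /* no error pending */
    static int demo_node1_pending = 1;    /* error pending */

    static const struct err_record_info demo_records[] = {
        { demo_probe, demo_handler, &demo_node0_pending },
        { demo_probe, demo_handler, &demo_node1_pending },
    };

    /* Probe every record and invoke the handler of each one that reports an
     * error. Returns the number of records handled. */
    static int demo_dispatch(const struct err_record_info *records, size_t num,
                             const struct err_handler_data *data)
    {
        int handled = 0;

        assert(records != NULL);
        for (size_t i = 0; i < num; i++) {
            int probe_data = 0;

            if (records[i].probe(&records[i], &probe_data) != 0) {
                (void)records[i].handler(&records[i], probe_data, data);
                handled++;
            }
        }
        return handled;
    }

Calling ``demo_dispatch(demo_records, 2, &data)`` with the initial state above
invokes ``demo_handler`` once, for the second record only.
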
Standard Error Record helpers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The |TF-A| RAS framework provides probe handlers for Standard Error Records, for
both memory-mapped and System Register accesses:

.. code:: c

    int ras_err_ser_probe_memmap(const struct err_record_info *info,
                    int *probe_data);

    int ras_err_ser_probe_sysreg(const struct err_record_info *info,
                    int *probe_data);

When the platform enumerates error records, for those records in the Standard
Error Record format, these helpers may be used instead of the platform rolling
out its own. Both helpers above:

- Return a non-zero value when an error is detected in a Standard Error Record;
- Set ``probe_data`` to the index of the error record upon detecting an error.

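As a rough illustration of what a Standard Error Record probe does, the sketch
below scans a group of status values and reports the first one whose valid bit
is set. It assumes that the ``V`` (valid) bit is bit 30 of ``ERR<n>STATUS``, as
defined by the RAS architecture. The status values are passed in an ordinary
array rather than read from hardware, and ``demo_ser_probe`` and
``DEMO_ERR_STATUS_V`` are hypothetical names, not the |TF-A| helpers themselves.

.. code:: c

    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Assumption: ERR<n>STATUS.V is bit 30 (the record holds valid data). */
    #define DEMO_ERR_STATUS_V    (UINT64_C(1) << 30)

    /* Return 1 and store the record index in *probe_data if any status value
     * has its V bit set; return 0 and leave *probe_data untouched otherwise. */
    static int demo_ser_probe(const uint64_t *status, size_t num_records,
                              int *probe_data)
    {
        assert(status != NULL && probe_data != NULL);

        for (size_t i = 0; i < num_records; i++) {
            if ((status[i] & DEMO_ERR_STATUS_V) != 0U) {
                *probe_data = (int)i;
                return 1;
            }
        }
        return 0;
    }

The return value and ``probe_data`` convention match the two properties listed
above, so an error handler receiving ``probe_data`` knows which record to read.
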
Registering RAS interrupts
--------------------------

RAS nodes can signal errors to the PE by raising Fault Handling and/or Error
Recovery interrupts. For the firmware-first handling paradigm for interrupts to
work, the platform must set up and register with |EHF|. See `Interaction with
Exception Handling Framework`_.

For each RAS interrupt, the platform has to provide a structure of type ``struct
ras_interrupt`` containing:

- Interrupt number;
- The associated error record information (pointer to the corresponding
  ``struct err_record_info``);
- Optionally, a cookie.

The platform is expected to define an array of ``struct ras_interrupt``, and
register it with the RAS framework using the macro
``REGISTER_RAS_INTERRUPTS()``, passing it the name of the array. Note that the
macro must be used in the same file where the array is defined.

The array of ``struct ras_interrupt`` must be sorted in increasing order of
interrupt number. This allows for fast lookup of handlers in order to service
RAS interrupts.

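The sorting requirement exists because a sorted array can be searched by
bisection. The following sketch shows such a lookup under simplified
assumptions: ``struct demo_ras_interrupt`` is a stand-in whose field names are
illustrative, and ``demo_find_interrupt`` is a hypothetical function, not the
framework's own code.

.. code:: c

    #include <assert.h>
    #include <stddef.h>

    /* Simplified stand-in; the real structure is defined by the framework. */
    struct demo_ras_interrupt {
        unsigned int intr_number;
        const void *err_record;    /* would point at a struct err_record_info */
        void *cookie;
    };

    /* Binary search over an array sorted by increasing intr_number.
     * Returns the matching entry, or NULL if the interrupt is not listed. */
    static const struct demo_ras_interrupt *
    demo_find_interrupt(const struct demo_ras_interrupt *table, size_t num,
                        unsigned int intr)
    {
        size_t lo = 0, hi = num;   /* search the half-open range [lo, hi) */

        assert(table != NULL || num == 0);
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;

            if (table[mid].intr_number == intr)
                return &table[mid];
            if (table[mid].intr_number < intr)
                lo = mid + 1;
            else
                hi = mid;
        }
        return NULL;
    }

If the array were not sorted, a search like this could miss a registered
interrupt, which is why the ordering is a requirement rather than a
recommendation.
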
Double-fault handling
---------------------

A Double Fault condition arises when an error is signalled to the PE while
handling of a previously signalled error is still underway. When a Double Fault
condition arises, the Arm RAS extensions only require the handler to perform an
orderly shutdown of the system, as recovery may be impossible.

The RAS extensions part of Armv8.4 introduced new architectural features to deal
with Double Fault conditions, specifically, the introduction of ``NMEA`` and
``EASE`` bits to the ``SCR_EL3`` register. These were introduced to assist EL3
software which runs part of its entry/exit routines with exceptions momentarily
masked. In such systems, External Aborts/SErrors are not immediately
handled when they occur, but only after the exceptions are unmasked again.

|TF-A|, for legacy reasons, executes entirely in EL3 with all exceptions unmasked.
This means that all exceptions routed to EL3 are handled immediately. |TF-A|
is thus able to detect Double Fault conditions in software, without needing
the intended advantages of the Armv8.4 Double Fault architecture extensions.

Double faults are fatal: handling terminates at the platform double fault
handler, which does not return.

Engaging the RAS framework
--------------------------

Enabling RAS support is a platform choice.

The RAS support in |TF-A| introduces a default implementation of
``plat_ea_handler``, the External Abort handler in EL3. When ``ENABLE_FEAT_RAS``
is set to ``1``, it first calls the ``ras_ea_handler()`` function, which is the
top-level RAS exception handler. ``ras_ea_handler`` is responsible for iterating
through platform-supplied error records, probing them, and, when an error is
identified, looking up and invoking the corresponding error handler.

Note that, if the platform chooses to override the ``plat_ea_handler`` function
and intends to use the RAS framework, it must explicitly call
``ras_ea_handler()`` from within.

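A platform override might therefore be structured as in the sketch below. To
keep it self-contained, ``ras_ea_handler()`` is replaced by a stub. The
parameter list (``ea_reason``, ``syndrome``, ``cookie``, ``handle``, ``flags``)
and the stub's return convention (non-zero when an error was handled) are
assumptions, so check the porting guide for the exact prototype.
``demo_plat_ea_handler`` and ``demo_platform_fallback`` are hypothetical names.

.. code:: c

    #include <assert.h>
    #include <stdint.h>

    static int demo_ras_result;        /* what the stubbed RAS handler reports */
    static int demo_fallback_calls;    /* how often the fallback path ran */

    /* Stub standing in for ras_ea_handler(); non-zero means "handled". */
    static int demo_ras_ea_handler(unsigned int ea_reason, uint64_t syndrome,
                                   void *cookie, void *handle, uint64_t flags)
    {
        (void)ea_reason; (void)syndrome; (void)cookie; (void)handle; (void)flags;
        return demo_ras_result;
    }

    /* Hypothetical platform policy for aborts the RAS framework did not claim. */
    static void demo_platform_fallback(unsigned int ea_reason, uint64_t syndrome)
    {
        (void)ea_reason; (void)syndrome;
        demo_fallback_calls++;         /* a real platform would likely panic */
    }

    /* Shape of an overriding handler: the RAS framework gets the first chance. */
    static void demo_plat_ea_handler(unsigned int ea_reason, uint64_t syndrome,
                                     void *cookie, void *handle, uint64_t flags)
    {
        if (demo_ras_ea_handler(ea_reason, syndrome, cookie, handle, flags) != 0)
            return;                    /* handled by a registered error record */

        demo_platform_fallback(ea_reason, syndrome);
    }

The ordering is the point of the sketch: platform-specific policy runs only for
aborts that no registered error record accounted for.
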
Similarly, for RAS interrupts, the framework defines
``ras_interrupt_handler()``. The RAS framework arranges for it to be invoked
when a RAS interrupt is taken at EL3. The function bisects the platform-supplied
sorted array of interrupts to look up the error record information associated
with the interrupt number. The error handler for that record is then invoked to
handle the error.

Interaction with Exception Handling Framework
---------------------------------------------

As mentioned in earlier sections, the RAS framework interacts with the |EHF| to
arbitrate handling of RAS exceptions with others that are routed to EL3. This
means that the platform must partition a :ref:`priority level <Partitioning
priority levels>` for handling RAS exceptions. The platform must then define
the macro ``PLAT_RAS_PRI`` to the priority level used for RAS exceptions.
Platforms would typically want to allocate the highest secure priority for
RAS handling.

Handling of both :ref:`interrupt <interrupt-flow>` and :ref:`non-interrupt
<non-interrupt-flow>` exceptions follows the sequences outlined in the |EHF|
documentation. That is, for interrupts, the priority management is implicit; but
for non-interrupt exceptions, it is explicit, using :ref:`EHF APIs
<Activating and Deactivating priorities>`.

--------------

*Copyright (c) 2018-2023, Arm Limited and Contributors. All rights reserved.*

.. _RAS Supplement: https://developer.arm.com/documentation/ddi0587/latest
.. _RAS Test group: https://git.trustedfirmware.org/ci/tf-a-ci-scripts.git/tree/group/tf-l3-boot-tests-ras?h=refs/heads/master

.. |Image 1| image:: ../resources/diagrams/bl31-exception-entry-error-synchronization.png