docs: Add RAS framework documentation

Change-Id: Ibf2b21b12ebc0af5815fc6643532a3be9100bf02
Signed-off-by: Jeenu Viswambharan <jeenu.viswambharan@arm.com>
diff --git a/docs/ras.rst b/docs/ras.rst
new file mode 100644
index 0000000..4c82022
--- /dev/null
+++ b/docs/ras.rst
@@ -0,0 +1,258 @@
+RAS support in Trusted Firmware-A
+=================================
+
+.. section-numbering::
+    :suffix: .
+
+.. contents::
+    :depth: 2
+
+.. |EHF| replace:: Exception Handling Framework
+.. |TF-A| replace:: Trusted Firmware-A
+
+This document describes |TF-A| support for Arm Reliability, Availability, and
+Serviceability (RAS) extensions. RAS is a mandatory extension for Armv8.2 and
+later CPUs, and also an optional extension to the base Armv8.0 architecture.
+
+In conjunction with the |EHF|, support for the RAS extension enables the
+firmware-first paradigm for handling platform errors, in which exceptions
+resulting from errors (viz. Synchronous External Abort (SEA), Asynchronous
+External Abort (signalled as SErrors), and Fault Handling and Error Recovery
+interrupts) are routed to and handled in EL3. The |EHF| document mentions
+various `error handling use-cases`__.
+
+.. __: exception-handling.rst#delegation-use-cases
+
+For the description of Arm RAS extensions, Standard Error Records, and the
+precise definition of RAS terminology, please refer to the Arm Architecture
+Reference Manual. The rest of this document assumes familiarity with the
+architecture and terminology.
+
+Overview
+--------
+
+As mentioned above, the RAS support in |TF-A| enables routing to and handling of
+exceptions resulting from platform errors in EL3. It allows the platform to
+define an External Abort handler, and to register RAS nodes and interrupts. The
+RAS framework also provides `helpers`__ for accessing Standard Error Records as
+introduced by the RAS extensions.
+
+.. __: `Standard Error Record helpers`_
+
+The build option ``RAS_EXTENSION``, when set to ``1``, includes the RAS
+framework in the run time firmware; ``EL3_EXCEPTION_HANDLING`` and
+``HANDLE_EA_EL3_FIRST`` must also be set to ``1``.
+
+.. _ras-figure:
+
+.. image:: draw.io/ras.svg
+
+See more on `Engaging the RAS framework`_.
+
+Platform APIs
+-------------
+
+The RAS framework allows the platform to define handlers for External Abort,
+Uncontainable Errors, Double Fault, and errors arising from EL3 execution.
+Please refer to the porting guide for the `RAS platform API descriptions`__.
+
+.. __: porting-guide.rst#external-abort-handling-and-ras-support
+
+Registering RAS error records
+-----------------------------
+
+RAS nodes are components in the system capable of signalling errors to PEs
+through one of the notification mechanisms: SEAs, SErrors, or interrupts. RAS
+nodes contain one or more error records, which are registers through which the
+nodes advertise various properties of the signalled error. Arm recommends that
+error records are implemented in the Standard Error Record format. The RAS
+architecture allows for error records to be accessible via system or
+memory-mapped registers.
+
+The platform should enumerate the error records, providing for each of them:
+
+-  A handler to probe error records for errors;
+-  When the probing identifies an error, a handler to handle it;
+-  For a memory-mapped error record, its base address and size in KB; for a
+   system register-accessed record, the start index of the record and number
+   of contiguous records from that index;
+-  Any node-specific auxiliary data.
+
+With this information supplied, when the run time firmware receives one of
+these notifications, the RAS framework can iterate through and probe error
+records for errors, and invoke the appropriate handler to handle them.
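+
+The probe-then-handle flow described above can be sketched roughly as follows.
+This is only an illustration, not the framework's implementation:
+``struct demo_record``, ``demo_dispatch()``, ``demo_probe()``, and
+``demo_handler()`` are hypothetical stand-ins chosen so that the example
+compiles on its own, and the handler here takes fewer arguments than the real
+prototype.
+

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for an enumerated error record. */
struct demo_record;

typedef int (*demo_probe_t)(const struct demo_record *rec, int *probe_data);
typedef int (*demo_handler_t)(const struct demo_record *rec, int probe_data);

struct demo_record {
    demo_probe_t probe;     /* returns non-zero when an error is pending */
    demo_handler_t handler; /* invoked only when the probe reports an error */
    void *aux;              /* node-specific auxiliary data */
};

/*
 * Probe each record in turn; for every record whose probe reports an error,
 * invoke its handler. Returns the number of errors that were handled.
 */
static int demo_dispatch(const struct demo_record *recs, size_t num)
{
    int handled = 0;

    for (size_t i = 0; i < num; i++) {
        int probe_data = 0;

        if (recs[i].probe(&recs[i], &probe_data) != 0) {
            recs[i].handler(&recs[i], probe_data);
            handled++;
        }
    }

    return handled;
}

/*
 * Demo probe (assumption): aux points at an int that is non-zero while an
 * error is pending; that value is forwarded to the handler as probe_data.
 */
static int demo_probe(const struct demo_record *rec, int *probe_data)
{
    const int *pending = rec->aux;

    if (*pending == 0)
        return 0;

    *probe_data = *pending;
    return 1;
}

/* Demo handler: remember the last probe_data it was asked to handle. */
static int demo_last_handled;

static int demo_handler(const struct demo_record *rec, int probe_data)
{
    (void)rec;
    demo_last_handled = probe_data;
    return 0;
}
```

+
+In this sketch each record is probed once per notification, and whatever the
+probe writes to ``probe_data`` is forwarded unchanged to the handler.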
+
+The RAS framework provides macros to populate error record information. The
+macros are versioned, and the latest version as of this writing is 1. Each
+macro creates a structure of type ``struct err_record_info`` from its
+arguments; this structure is later passed to the probe and error handlers.
+
+For memory-mapped error records:
+
+.. code:: c
+
+    ERR_RECORD_MEMMAP_V1(base_addr, size_num_k, probe, handler, aux)
+
+And, for system register ones:
+
+.. code:: c
+
+    ERR_RECORD_SYSREG_V1(idx_start, num_idx, probe, handler, aux)
+
+The probe handler must have the following prototype:
+
+.. code:: c
+
+    typedef int (*err_record_probe_t)(const struct err_record_info *info,
+                    int *probe_data);
+
+The probe handler must return a non-zero value if an error was detected, or 0
+otherwise. The ``probe_data`` output parameter can be used to pass any useful
+information resulting from probe to the error handler (see `below`__). For
+example, it could return the index of the record.
+
+.. __: `Standard Error Record helpers`_
+
+The error handler must have the following prototype:
+
+.. code:: c
+
+    typedef int (*err_record_handler_t)(const struct err_record_info *info,
+               int probe_data, const struct err_handler_data *const data);
+
+The ``data`` constant parameter describes the various properties of the error,
+viz. the reason for the error, exception syndrome, and also ``flags``,
+``cookie``, and ``handle`` parameters from the `top-level exception handler`__.
+
+.. __: interrupt-framework-design.rst#el3-interrupts
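+
+As an illustration of the two prototypes above, a platform-specific probe and
+handler pair might look like the following sketch. The structure bodies are
+stand-ins so that the example compiles on its own; the real definitions come
+from the |TF-A| headers and contain different fields. The ``status``,
+``num_records``, and ``syndrome`` members and all ``plat_demo_*`` and
+``demo_*_fn`` names are assumptions made for this example.
+

```c
#include <stdint.h>

/*
 * Stand-in definitions (assumptions): the real structures are defined by
 * TF-A and carry more, and different, fields.
 */
struct err_record_info {
    const uint64_t *status;   /* hypothetical: one status word per record */
    unsigned int num_records; /* hypothetical: number of records in group */
};

struct err_handler_data {
    uint64_t syndrome;        /* hypothetical: exception syndrome */
};

/* Prototypes as given in the text above. */
typedef int (*err_record_probe_t)(const struct err_record_info *info,
                int *probe_data);
typedef int (*err_record_handler_t)(const struct err_record_info *info,
           int probe_data, const struct err_handler_data *const data);

/*
 * Hypothetical probe: report the first record whose status word is non-zero,
 * and pass its index to the handler through probe_data.
 */
static int plat_demo_probe(const struct err_record_info *info, int *probe_data)
{
    for (unsigned int i = 0; i < info->num_records; i++) {
        if (info->status[i] != 0U) {
            *probe_data = (int)i;
            return 1;
        }
    }

    return 0;
}

/* Hypothetical handler: only records which index it was asked to handle. */
static int plat_demo_last_index = -1;

static int plat_demo_handler(const struct err_record_info *info,
           int probe_data, const struct err_handler_data *const data)
{
    (void)info;
    (void)data;
    plat_demo_last_index = probe_data;
    return 0;
}

/* Assigning to the typedefs lets the compiler check the signatures. */
static const err_record_probe_t demo_probe_fn = plat_demo_probe;
static const err_record_handler_t demo_handler_fn = plat_demo_handler;
```

+
+The probe communicates only through its return value and ``probe_data``, so
+the handler does not need to scan the records a second time.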
+
+The platform is expected to populate an array using the macros above, and
+register it with the RAS framework using ``REGISTER_ERR_RECORD_INFO()``,
+passing the macro the name of the array describing the records. Note that the
+macro must be used in the same file where the array is defined.
+
+Standard Error Record helpers
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The |TF-A| RAS framework provides probe handlers for Standard Error Records, for
+both memory-mapped and System Register accesses:
+
+.. code:: c
+
+    int ras_err_ser_probe_memmap(const struct err_record_info *info,
+                int *probe_data);
+
+    int ras_err_ser_probe_sysreg(const struct err_record_info *info,
+                int *probe_data);
+
+When the platform enumerates error records, for those records in the Standard
+Error Record format, these helpers may be used instead of writing custom ones.
+Both helpers above:
+
+-  Return a non-zero value when an error is detected in a Standard Error Record;
+-  Set ``probe_data`` to the index of the error record upon detecting an error.
+
+Registering RAS interrupts
+--------------------------
+
+RAS nodes can signal errors to the PE by raising Fault Handling and/or Error
+Recovery interrupts. For the firmware-first handling paradigm for interrupts to
+work, the platform must set up and register with the |EHF|. See `Interaction
+with Exception Handling Framework`_.
+
+For each RAS interrupt, the platform has to provide a structure of type
+``struct ras_interrupt`` containing:
+
+-  Interrupt number;
+-  The associated error record information (pointer to the corresponding
+   ``struct err_record_info``);
+-  Optionally, a cookie.
+
+The platform is expected to define an array of ``struct ras_interrupt``, and
+register it with the RAS framework using the macro
+``REGISTER_RAS_INTERRUPTS()``, passing it the name of the array. Note that the
+macro must be used in the same file where the array is defined.
+
+The array of ``struct ras_interrupt`` must be sorted in the increasing order of
+interrupt number. This allows for fast lookup of handlers in order to service
+RAS interrupts.
+
+Double-fault handling
+---------------------
+
+A Double Fault condition arises when an error is signalled to the PE while
+handling of a previously signalled error is still underway. When a Double Fault
+condition arises, the Arm RAS extensions only require the handler to perform
+an orderly shutdown of the system, as recovery may be impossible.
+
+The RAS extensions in Armv8.4 introduced new architectural features to deal
+with Double Fault conditions, specifically the ``NMEA`` and ``EASE`` bits in
+the ``SCR_EL3`` register. These were introduced to assist EL3 software which
+runs part of its entry/exit routines with exceptions momentarily masked. In
+such systems, External Aborts/SErrors are not handled immediately when they
+occur, but only after the exceptions are unmasked again.
+
+|TF-A|, for legacy reasons, executes all EL3 code with all exceptions unmasked.
+This means that all exceptions routed to EL3 are handled immediately. |TF-A|
+is thus able to detect Double Fault conditions in software, without needing
+the intended advantages of the Armv8.4 Double Fault architecture extensions.
+
+Double faults are fatal; handling terminates at the platform double fault
+handler, which does not return.
+
+Engaging the RAS framework
+--------------------------
+
+Enabling RAS support is a platform choice that requires three distinct but
+related build options to be set together:
+
+-  ``RAS_EXTENSION=1`` includes the RAS framework in the run time firmware;
+
+-  ``EL3_EXCEPTION_HANDLING=1`` enables handling of exceptions at EL3. See
+   `Interaction with Exception Handling Framework`_;
+
+-  ``HANDLE_EA_EL3_FIRST=1`` enables routing of External Aborts and SErrors to
+   EL3.
+
+The RAS support in |TF-A| introduces a default implementation of
+``plat_ea_handler``, the External Abort handler in EL3. When ``RAS_EXTENSION``
+is set to ``1``, it first calls the ``ras_ea_handler()`` function, which is the
+top-level RAS exception handler. ``ras_ea_handler`` is responsible for iterating
+through platform-supplied error records, probing them, and, when an error is
+identified, looking up and invoking the corresponding error handler.
+
+Note that, if the platform chooses to override the ``plat_ea_handler`` function
+and intends to use the RAS framework, it must explicitly call
+``ras_ea_handler()`` from within.
+
+Similarly, for RAS interrupts, the framework defines
+``ras_interrupt_handler()``. The RAS framework arranges for it to be invoked
+when a RAS interrupt is taken at EL3. The function bisects the
+platform-supplied sorted array of interrupts to look up the error record
+information associated with the interrupt number. The error handler for that
+record is then invoked to handle the error.
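+
+The lookup described above can be sketched as a plain binary search over the
+sorted array. This is an illustrative sketch under stated assumptions rather
+than the framework's code: ``struct demo_ras_interrupt`` and its member names,
+and ``demo_find_interrupt()``, are hypothetical.
+

```c
#include <stddef.h>

/* Hypothetical stand-in for one entry of the platform's interrupt array. */
struct demo_ras_interrupt {
    unsigned int intr_number; /* interrupt number; array is sorted on this */
    void *err_record;         /* would point to the error record information */
    void *cookie;             /* optional platform cookie */
};

/*
 * Binary search over an array sorted by increasing interrupt number.
 * Returns the matching entry, or NULL if the interrupt is not registered.
 */
static const struct demo_ras_interrupt *demo_find_interrupt(
        const struct demo_ras_interrupt *tbl, size_t num, unsigned int intr)
{
    size_t lo = 0;
    size_t hi = num; /* search window is [lo, hi) */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (tbl[mid].intr_number == intr)
            return &tbl[mid];

        if (tbl[mid].intr_number < intr)
            lo = mid + 1;
        else
            hi = mid;
    }

    return NULL;
}
```

+
+The search only works if the array is kept sorted, which is why the ordering
+requirement on the ``struct ras_interrupt`` array matters.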
+
+Interaction with Exception Handling Framework
+---------------------------------------------
+
+As mentioned in earlier sections, the RAS framework interacts with the |EHF|
+to arbitrate handling of RAS exceptions with others that are routed to EL3.
+This means that the platform must partition a `priority level`__ for handling
+RAS exceptions. The platform must then define the macro ``PLAT_RAS_PRI`` to
+the priority level used for RAS exceptions. Platforms would typically want to
+allocate the highest secure priority for RAS handling.
+
+.. __: exception-handling.rst#partitioning-priority-levels
+
+Handling of both `interrupt`__ and `non-interrupt`__ exceptions follows the
+sequences outlined in the |EHF| documentation. That is, for interrupts, the
+priority management is implicit; but for non-interrupt exceptions, it is
+explicit using `EHF APIs`__.
+
+.. __: exception-handling.rst#interrupt-flow
+.. __: exception-handling.rst#non-interrupt-flow
+.. __: exception-handling.rst#activating-and-deactivating-priorities
+
+----
+
+*Copyright (c) 2018, Arm Limited and Contributors. All rights reserved.*