PSCI Performance Measurements on Arm Juno Development Platform
==============================================================

This document summarises the findings of performance measurements of key
operations in the Trusted Firmware-A Power State Coordination Interface (PSCI)
implementation, using the in-built Performance Measurement Framework (PMF) and
runtime instrumentation timestamps.

Method
------

We used the `Juno R1 platform`_ for these tests, which has one cluster of 4 x
Cortex-A53 CPUs and one cluster of 2 x Cortex-A57 CPUs, running at the
following frequencies:

+-----------------+--------------------+
| Domain          | Frequency (MHz)    |
+=================+====================+
| Cortex-A57      | 900 (nominal)      |
+-----------------+--------------------+
| Cortex-A53      | 650 (underdrive)   |
+-----------------+--------------------+
| AXI subsystem   | 533                |
+-----------------+--------------------+

Juno supports CPU, cluster and system power down states, corresponding to power
levels 0, 1 and 2 respectively. It does not support any retention states.

We used the upstream `TF master as of 31/01/2017`_, building the platform using
the ``ENABLE_RUNTIME_INSTRUMENTATION`` option:

.. code:: shell

    make PLAT=juno ENABLE_RUNTIME_INSTRUMENTATION=1 \
        SCP_BL2=<path/to/scp-fw.bin> \
        BL33=<path/to/test-fw.bin> \
        all fip

When using the debug build of TF, there was no noticeable difference in the
results.

The tests are based on an Arm-internal test framework. The release build of this
framework was used because the results in the debug build became skewed; the
console output prevented some of the tests from executing in parallel.

The tests consist of both parallel and sequential tests, which are broadly
described as follows:

- **Parallel Tests** This type of test powers on all the non-lead CPUs and
  brings them and the lead CPU to a common synchronization point. The lead CPU
  then initiates the test on all CPUs in parallel.

- **Sequential Tests** This type of test powers on each non-lead CPU in
  sequence. The lead CPU initiates the test on a non-lead CPU then waits for the
  test to complete before proceeding to the next non-lead CPU. The lead CPU then
  executes the test on itself.
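
The two test styles can be pictured with a small host-side sketch in which
threads stand in for CPUs. This is only an illustration of the synchronization
pattern and not the Arm-internal framework; ``run_test``, ``parallel_tests``
and ``sequential_tests`` are hypothetical names.

.. code:: python

    import threading

    NUM_CPUS = 6
    LEAD_CPU = 4  # the lead CPU in the results below


    def run_test(cpu, log, lock):
        """Hypothetical stand-in for the PSCI call under test."""
        with lock:
            log.append(cpu)


    def parallel_tests():
        """All CPUs meet at a barrier, then start the test together."""
        log, lock = [], threading.Lock()
        barrier = threading.Barrier(NUM_CPUS)

        def worker(cpu):
            barrier.wait()  # common synchronization point
            run_test(cpu, log, lock)

        threads = [threading.Thread(target=worker, args=(cpu,))
                   for cpu in range(NUM_CPUS)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return log


    def sequential_tests():
        """Run the test on each non-lead CPU in turn, then on the lead CPU."""
        log, lock = [], threading.Lock()
        for cpu in range(NUM_CPUS):
            if cpu == LEAD_CPU:
                continue
            t = threading.Thread(target=run_test, args=(cpu, log, lock))
            t.start()
            t.join()  # wait for completion before the next CPU
        run_test(LEAD_CPU, log, lock)
        return log

In the parallel case the completion order depends on how the lock is acquired,
whereas the sequential case always finishes with the lead CPU.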

In the results below, CPUs 0-3 refer to CPUs in the little cluster (A53) and
CPUs 4-5 refer to CPUs in the big cluster (A57). In all cases CPU 4 is the lead
CPU.

``PSCI_ENTRY`` refers to the time taken from entering the TF PSCI implementation
to the point the hardware enters the low power state (WFI). Referring to the TF
runtime instrumentation points, this corresponds to:
``(RT_INSTR_ENTER_HW_LOW_PWR - RT_INSTR_ENTER_PSCI)``.

``PSCI_EXIT`` refers to the time taken from the point the hardware exits the low
power state to exiting the TF PSCI implementation. This corresponds to:
``(RT_INSTR_EXIT_PSCI - RT_INSTR_EXIT_HW_LOW_PWR)``.

``CFLUSH_OVERHEAD`` refers to the part of ``PSCI_ENTRY`` taken to flush the
caches. This corresponds to: ``(RT_INSTR_EXIT_CFLUSH - RT_INSTR_ENTER_CFLUSH)``.
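
As a minimal sketch, the three intervals defined above can be derived from a
set of timestamps as follows. The dictionary layout and the ``psci_metrics``
helper are assumptions made for illustration, and the timestamp values are
made up; only the instrumentation point names come from TF.

.. code:: python

    def psci_metrics(ts):
        """Derive the three reported intervals from raw timestamps.

        ``ts`` maps runtime instrumentation point names to timestamps,
        all expressed in the same unit.
        """
        return {
            "PSCI_ENTRY": (ts["RT_INSTR_ENTER_HW_LOW_PWR"]
                           - ts["RT_INSTR_ENTER_PSCI"]),
            "PSCI_EXIT": (ts["RT_INSTR_EXIT_PSCI"]
                          - ts["RT_INSTR_EXIT_HW_LOW_PWR"]),
            "CFLUSH_OVERHEAD": (ts["RT_INSTR_EXIT_CFLUSH"]
                                - ts["RT_INSTR_ENTER_CFLUSH"]),
        }


    # Made-up timestamps (us), chosen only to illustrate the arithmetic.
    example = {
        "RT_INSTR_ENTER_PSCI": 100,
        "RT_INSTR_ENTER_CFLUSH": 110,
        "RT_INSTR_EXIT_CFLUSH": 115,
        "RT_INSTR_ENTER_HW_LOW_PWR": 127,
        "RT_INSTR_EXIT_HW_LOW_PWR": 5000,
        "RT_INSTR_EXIT_PSCI": 5020,
    }
    metrics = psci_metrics(example)

Note that the cache flush interval lies inside the ``PSCI_ENTRY`` interval, so
``CFLUSH_OVERHEAD`` can never exceed ``PSCI_ENTRY`` for the same call.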

Note that there is very little variance observed in the values given (~1us),
although the values for each CPU are sometimes interchanged, depending on the
order in which locks are acquired. Also, there is very little variance observed
between executing the tests sequentially in a single boot or rebooting between
tests.

Given that runtime instrumentation using PMF is invasive, there is a small
(unquantified) overhead on the results. PMF uses the generic counter for
timestamps, which runs at 50 MHz on Juno.
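
A 50 MHz counter advances once every 20 ns, which bounds the resolution of
every figure in this document. The conversion can be sketched as below; the
helper names are hypothetical and are not part of PMF.

.. code:: python

    COUNTER_FREQ_HZ = 50_000_000  # generic counter frequency on Juno


    def ticks_to_ns(ticks, freq_hz=COUNTER_FREQ_HZ):
        """Convert a counter tick delta to whole nanoseconds."""
        return ticks * 1_000_000_000 // freq_hz


    def ticks_to_us(ticks, freq_hz=COUNTER_FREQ_HZ):
        """Convert a counter tick delta to microseconds."""
        return ticks * 1_000_000 / freq_hz

For example, a delta of 1350 ticks corresponds to 27 us, and a single tick to
20 ns.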

Results and Commentary
----------------------

``CPU_SUSPEND`` to deepest power level on all CPUs in parallel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+-------+---------------------+--------------------+--------------------------+
| CPU   | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
+=======+=====================+====================+==========================+
| 0     | 27                  | 20                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 1     | 114                 | 86                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 2     | 202                 | 58                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 3     | 375                 | 29                 | 94                       |
+-------+---------------------+--------------------+--------------------------+
| 4     | 20                  | 22                 | 6                        |
+-------+---------------------+--------------------+--------------------------+
| 5     | 290                 | 18                 | 206                      |
+-------+---------------------+--------------------+--------------------------+

A large variance in ``PSCI_ENTRY`` and ``PSCI_EXIT`` times across CPUs is
observed due to TF PSCI lock contention. In the worst case, CPU 3 has to wait
for the 3 other CPUs in the cluster (0-2) to complete ``PSCI_ENTRY`` and release
the lock before proceeding.

The ``CFLUSH_OVERHEAD`` times for CPUs 3 and 5 are higher because they are the
last CPUs in their respective clusters to power down, therefore both the L1 and
L2 caches are flushed.

The ``CFLUSH_OVERHEAD`` time for CPU 5 is a lot larger than that for CPU 3
because the L2 cache size for the big cluster is a lot larger (2MB) compared to
the little cluster (1MB).
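
As a rough sanity check, the ratio of the two flush times can be compared with
the ratio of the L2 cache sizes. This is only a back-of-the-envelope
comparison using the figures quoted above; it assumes nothing about how many
cache lines were dirty or about the different core frequencies.

.. code:: python

    # L2 sizes and CFLUSH_OVERHEAD values for CPU 5 (big) and CPU 3 (little),
    # taken from the text and table above.
    L2_SIZE_MB = {"big": 2, "little": 1}
    CFLUSH_US = {"big": 206, "little": 94}

    size_ratio = L2_SIZE_MB["big"] / L2_SIZE_MB["little"]
    time_ratio = CFLUSH_US["big"] / CFLUSH_US["little"]

The flush time ratio comes out at about 2.2, close to the 2x difference in L2
size.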

``CPU_SUSPEND`` to power level 0 on all CPUs in parallel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+-------+---------------------+--------------------+--------------------------+
| CPU   | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
+=======+=====================+====================+==========================+
| 0     | 116                 | 14                 | 8                        |
+-------+---------------------+--------------------+--------------------------+
| 1     | 204                 | 14                 | 8                        |
+-------+---------------------+--------------------+--------------------------+
| 2     | 287                 | 13                 | 8                        |
+-------+---------------------+--------------------+--------------------------+
| 3     | 376                 | 13                 | 9                        |
+-------+---------------------+--------------------+--------------------------+
| 4     | 29                  | 15                 | 7                        |
+-------+---------------------+--------------------+--------------------------+
| 5     | 21                  | 15                 | 8                        |
+-------+---------------------+--------------------+--------------------------+

There is no lock contention in TF generic code at power level 0, but the large
variance in ``PSCI_ENTRY`` times across CPUs is due to lock contention in Juno
platform code. The platform lock is used to mediate access to a single SCP
communication channel. This is compounded by the SCP firmware waiting for each
AP CPU to enter WFI before making the channel available to other CPUs, which
effectively serializes the SCP power down commands from all CPUs.
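
If the power down commands are serialized, each waiting CPU should see its
``PSCI_ENTRY`` time grow by roughly one channel occupancy period over the CPU
ahead of it. The sketch below checks this against the little cluster values
in the table above; treating the step as a single fixed cost is a simplifying
assumption.

.. code:: python

    # PSCI_ENTRY times (us) for little cluster CPUs 0-3, from the table above.
    entry_us = [116, 204, 287, 376]

    # Increment between consecutive CPUs.
    steps = [later - earlier for earlier, later in zip(entry_us, entry_us[1:])]
    mean_step = sum(steps) / len(steps)

The increments are 88, 83 and 89 us, so each additional CPU in the queue adds
a nearly constant delay of about 87 us.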

On platforms with a more efficient CPU power down mechanism, it should be
possible to make the ``PSCI_ENTRY`` times smaller and consistent.

The ``PSCI_EXIT`` times are consistent across all CPUs because TF does not
require locks at power level 0.

The ``CFLUSH_OVERHEAD`` times for all CPUs are small and consistent since only
the cache associated with power level 0 is flushed (L1).

``CPU_SUSPEND`` to deepest power level on all CPUs in sequence
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+-------+---------------------+--------------------+--------------------------+
| CPU   | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
+=======+=====================+====================+==========================+
| 0     | 114                 | 20                 | 94                       |
+-------+---------------------+--------------------+--------------------------+
| 1     | 114                 | 20                 | 94                       |
+-------+---------------------+--------------------+--------------------------+
| 2     | 114                 | 20                 | 94                       |
+-------+---------------------+--------------------+--------------------------+
| 3     | 114                 | 20                 | 94                       |
+-------+---------------------+--------------------+--------------------------+
| 4     | 195                 | 22                 | 180                      |
+-------+---------------------+--------------------+--------------------------+
| 5     | 21                  | 17                 | 6                        |
+-------+---------------------+--------------------+--------------------------+

The ``CFLUSH_OVERHEAD`` times for lead CPU 4 and all CPUs in the non-lead cluster
are large because all other CPUs in the cluster are powered down during the
test. The ``CPU_SUSPEND`` call powers down to the cluster level, requiring a
flush of both L1 and L2 caches.

The ``CFLUSH_OVERHEAD`` time for CPU 4 is a lot larger than those for the little
CPUs because the L2 cache size for the big cluster is a lot larger (2MB)
compared to the little cluster (1MB).

The ``PSCI_ENTRY`` and ``CFLUSH_OVERHEAD`` times for CPU 5 are low because lead
CPU 4 continues to run while CPU 5 is suspended. Hence CPU 5 only powers down to
level 0, which only requires an L1 cache flush.

``CPU_SUSPEND`` to power level 0 on all CPUs in sequence
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+-------+---------------------+--------------------+--------------------------+
| CPU   | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
+=======+=====================+====================+==========================+
| 0     | 22                  | 14                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 1     | 22                  | 14                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 2     | 21                  | 14                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 3     | 22                  | 14                 | 5                        |
+-------+---------------------+--------------------+--------------------------+
| 4     | 17                  | 14                 | 6                        |
+-------+---------------------+--------------------+--------------------------+
| 5     | 18                  | 15                 | 6                        |
+-------+---------------------+--------------------+--------------------------+

Here the times are small and consistent since there is no contention and it is
only necessary to flush the cache to power level 0 (L1). This is the best case
scenario.

The ``PSCI_ENTRY`` times for CPUs in the big cluster are slightly smaller than
for the CPUs in the little cluster due to greater CPU performance.

The ``PSCI_EXIT`` times are generally lower than in the previous test because
the cluster remains powered on throughout the test and there is less code to
execute on power on (for example, no need to enter CCI coherency).

``CPU_OFF`` on all non-lead CPUs in sequence then ``CPU_SUSPEND`` on lead CPU to deepest power level
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The test sequence here is as follows:

1. Call ``CPU_ON`` and ``CPU_OFF`` on each non-lead CPU in sequence.

2. Program wake up timer and suspend the lead CPU to the deepest power level.

3. Call ``CPU_ON`` on each non-lead CPU to get the timestamps from each CPU.

+-------+---------------------+--------------------+--------------------------+
| CPU   | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
+=======+=====================+====================+==========================+
| 0     | 110                 | 28                 | 93                       |
+-------+---------------------+--------------------+--------------------------+
| 1     | 110                 | 28                 | 93                       |
+-------+---------------------+--------------------+--------------------------+
| 2     | 110                 | 28                 | 93                       |
+-------+---------------------+--------------------+--------------------------+
| 3     | 111                 | 28                 | 93                       |
+-------+---------------------+--------------------+--------------------------+
| 4     | 195                 | 22                 | 181                      |
+-------+---------------------+--------------------+--------------------------+
| 5     | 20                  | 23                 | 6                        |
+-------+---------------------+--------------------+--------------------------+

The ``CFLUSH_OVERHEAD`` times for all little CPUs are large because all other
CPUs in that cluster are powered down during the test. The ``CPU_OFF`` call
powers down to the cluster level, requiring a flush of both L1 and L2 caches.

The ``PSCI_ENTRY`` and ``CFLUSH_OVERHEAD`` times for CPU 5 are small because
lead CPU 4 is running and CPU 5 only powers down to level 0, which only requires
an L1 cache flush.

The ``CFLUSH_OVERHEAD`` time for CPU 4 is a lot larger than those for the little
CPUs because the L2 cache size for the big cluster is a lot larger (2MB)
compared to the little cluster (1MB).

The ``PSCI_EXIT`` times for CPUs in the big cluster are slightly smaller than
for CPUs in the little cluster due to greater CPU performance. These times
are generally greater than the ``PSCI_EXIT`` times in the ``CPU_SUSPEND`` tests
because there is more code to execute in the "on finisher" compared to the
"suspend finisher" (for example, GIC redistributor register programming).

``PSCI_VERSION`` on all CPUs in parallel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Since very little code is associated with ``PSCI_VERSION``, this test
approximates the round trip latency for handling a fast SMC at EL3 in TF.

+-------+-------------------+
| CPU   | TOTAL TIME (ns)   |
+=======+===================+
| 0     | 3020              |
+-------+-------------------+
| 1     | 2940              |
+-------+-------------------+
| 2     | 2980              |
+-------+-------------------+
| 3     | 3060              |
+-------+-------------------+
| 4     | 520               |
+-------+-------------------+
| 5     | 720               |
+-------+-------------------+

The times for the big CPUs are less than the little CPUs due to greater CPU
performance.

We suspect the time for lead CPU 4 is shorter than CPU 5 due to subtle cache
effects, given that these measurements are at the nanosecond level.
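
To put these figures in context, they can be expressed as generic counter
ticks. The sketch below assumes the reported values are direct conversions of
tick deltas at the 50 MHz counter frequency mentioned in the Method section;
the document does not state whether any averaging was applied.

.. code:: python

    TICK_NS = 1_000_000_000 // 50_000_000  # 20 ns per tick at 50 MHz

    # TOTAL TIME (ns) per CPU, from the table above.
    total_time_ns = {0: 3020, 1: 2940, 2: 2980, 3: 3060, 4: 520, 5: 720}

    ticks = {cpu: ns // TICK_NS for cpu, ns in total_time_ns.items()}
    all_whole = all(ns % TICK_NS == 0 for ns in total_time_ns.values())

Every value is a whole number of 20 ns ticks, and the gap between CPU 4 and
CPU 5 amounts to only 10 ticks, so small effects are visible at this scale.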

--------------

*Copyright (c) 2019-2020, Arm Limited and Contributors. All rights reserved.*

.. _Juno R1 platform: https://static.docs.arm.com/100122/0100/arm_versatile_express_juno_r1_development_platform_(v2m_juno_r1)_technical_reference_manual_100122_0100_05_en.pdf
.. _TF master as of 31/01/2017: https://git.trustedfirmware.org/TF-A/trusted-firmware-a.git/tree/?id=c38b36d