+=============================================================================+
| Copyright (c) 2010-2022 Rambus, Inc. and/or its subsidiaries.               |
|                                                                             |
| Subject : PEC API Implementation Notes                                      |
| Product : SLAD API                                                          |
| Date    : 02 December, 2022                                                 |
|                                                                             |
+=============================================================================+

The SLAD API is a set of APIs, one of which is the Packet Engine Control
(PEC) API. The driver implementation specifics of these APIs are described
in short documents that serve as addenda to the API specifications. This
document describes the PEC API.

This document uses the term "configurable" to indicate that a parameter or
option is build-time configurable in the driver. Please refer to the User Guide
of the driver for details.


PEC API
-------

One PEC implementation is available: ARM (Autonomous Ring Mode).

Packet engine device
The PEC API implementation can support multiple packet processing
devices. These devices are identified by the interface ID parameter
in the PEC API functions and are sometimes referred to as rings.

ARM mode
ARM mode uses DMA, can queue many jobs, handles commands and
results asynchronously (but always in order) and also supports
fragmented data buffers (scatter/gather). The implementation
supports a single multi-threaded application per ring, allowing
concurrent and independent use of PEC_SA_Register,
PEC_SA_UnRegister, PEC_Packet_Put and PEC_Packet_Get. Neither
PEC_Packet_Put nor PEC_Packet_Get is re-entrant, but one thread
can call PEC_Packet_Put while another thread calls PEC_Packet_Get.

Ensure that the Token Data, Packet Data and Context Data DMA
buffers are not reused or freed, either by the execution context
that uses the PEC and DMABuf API functions or by any other execution
context, until the engine has fully processed the result
descriptor(s) referring to the packet associated with these buffers.
This is required not only when an in-place packet transform uses the
same Packet Data DMA buffer as input and output buffer, but also when
different DMA buffers are used for the packet processing input and
output data.

Byte ordering (endianness)
The SA and Token buffers are considered arrays of 32-bit integers
in host-native byte ordering. The driver will change the byte
order if this is required (configurable). The data buffers are
considered byte arrays and the driver will not touch these.

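To illustrate what a byte-order change on a word-oriented buffer involves,
the sketch below reverses the bytes of every 32-bit word in place. The names
swap32 and words_swap_inplace are hypothetical and are not part of the
driver, which performs its own conversion when configured to do so. Byte
arrays such as packet data would not be passed through such a routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reverse the byte order of one 32-bit word. */
uint32_t swap32(uint32_t w)
{
    return ((w & 0x000000FFu) << 24) |
           ((w & 0x0000FF00u) << 8)  |
           ((w & 0x00FF0000u) >> 8)  |
           ((w & 0xFF000000u) >> 24);
}

/* Swap every word of a word-oriented buffer (SA or Token style) in place.
 * Hypothetical helper for illustration only. */
void words_swap_inplace(uint32_t *words, size_t word_count)
{
    size_t i;

    assert(words != NULL || word_count == 0);
    for (i = 0; i < word_count; i++)
        words[i] = swap32(words[i]);
}
```

Applying the swap twice restores the original buffer, which is why the same
routine can serve for both directions of the conversion.
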
Bounce Buffers
Bounce Buffer support can be removed (configurable) from the driver to
reduce footprint.
For ARM mode, the implementation can bounce the SA, Token and data buffers
if these are not DMA-safe. This requires that the buffer was registered
using DMABuf_Register with an unsupported AllocatorRef.
An SA buffer that requires bouncing is copied to a bounce buffer in
PEC_SA_Register and copied back by PEC_SA_UnRegister.
For a token buffer that requires bouncing, a bounce buffer is created by
PEC_Packet_Put and released by PEC_Packet_Get.
For a data buffer that requires bouncing (because either the source or the
destination requires it), a single bounce buffer is created by
PEC_Packet_Put, sized to the larger of the source and destination buffers.
The engine then performs an in-place operation in the bounce buffer.
PEC_Packet_Get copies the result to the destination buffer and releases the
bounce buffer.

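The data-buffer case can be modelled in a few lines. This is a sketch, not
driver code: bounce_model_t, bounce_put and bounce_get are hypothetical
names, malloc stands in for a DMA-safe allocator, and copying the source
into the bounce buffer on the put side is an assumption implied by the
in-place operation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical model of one data bounce buffer. */
typedef struct {
    unsigned char *buf;   /* NULL when no bounce buffer is held */
    size_t size;          /* larger of source and destination size */
} bounce_model_t;

/* Put side: create one buffer sized to the larger of source and
 * destination and (assumption) copy the source data into it.
 * Returns 0 on success, -1 when allocation fails. */
int bounce_put(bounce_model_t *b, const unsigned char *src,
               size_t src_bytes, size_t dst_bytes)
{
    assert(b != NULL);
    b->size = (src_bytes > dst_bytes) ? src_bytes : dst_bytes;
    b->buf = malloc(b->size ? b->size : 1);
    if (b->buf == NULL)
        return -1;
    if (src_bytes > 0)
        memcpy(b->buf, src, src_bytes);
    return 0;
}

/* Get side: copy the in-place result to the destination and release
 * the bounce buffer. Returns the number of bytes copied. */
size_t bounce_get(bounce_model_t *b, unsigned char *dst, size_t result_bytes)
{
    size_t n;

    assert(b != NULL);
    n = (result_bytes < b->size) ? result_bytes : b->size;
    if (n > 0)
        memcpy(dst, b->buf, n);
    free(b->buf);
    b->buf = NULL;
    b->size = 0;
    return n;
}
```

Between the two calls the engine would transform the content of buf in
place; in this model a test can simply modify buf to stand in for that step.
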
Descriptor grouping
PEC_Packet_Get and PEC_Packet_Put can each process up to a
configurable number of descriptors in one call.

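A caller with more descriptors than one call accepts has to submit them in
groups. The loop below sketches that pattern against a hypothetical submit
callback; submit_fn_t, submit_in_groups and the ring_accept stub are
illustrative names, and the callback signature is not the PEC_Packet_Put
signature.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: try to queue `count` descriptors starting at
 * index `first`; returns how many were accepted (possibly fewer). */
typedef size_t (*submit_fn_t)(size_t first, size_t count, void *ctx);

/* Submit `total` descriptors in groups of at most `max_per_call`.
 * Stops early when a call accepts nothing. Returns the number submitted. */
size_t submit_in_groups(size_t total, size_t max_per_call,
                        submit_fn_t submit, void *ctx)
{
    size_t done = 0;

    assert(submit != NULL);
    if (max_per_call == 0)
        return 0;
    while (done < total) {
        size_t want = total - done;
        size_t got;

        if (want > max_per_call)
            want = max_per_call;
        got = submit(done, want, ctx);
        if (got > want)
            got = want;          /* defensive: never count more than asked */
        if (got == 0)
            break;               /* nothing accepted: caller retries later */
        done += got;
    }
    return done;
}

/* Demo stub: models a ring with *ctx free slots. */
size_t ring_accept(size_t first, size_t count, void *ctx)
{
    size_t *free_slots = ctx;
    size_t n = (count < *free_slots) ? count : *free_slots;

    (void)first;
    *free_slots -= n;
    return n;
}
```

The return value tells the caller how many descriptors still need to be
submitted once results have been retrieved and slots are free again.
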
Queuing
The ring can queue a configurable number of jobs. This can be set to
hundreds or even thousands, at the cost of some memory footprint. A large
ring can avoid queuing in software and can also give better performance
(it avoids an idle engine due to an empty ring).

Scatter / Gather support
The ARM mode supports the scatter/gather extension (configurable) of the
PEC API.
If the SrcPkt_Handle in the command descriptor is a PEC SG_List, the
packet is assumed to use gather.
If the DstPkt_Handle in the command descriptor is a PEC SG_List, the
packet is assumed to use scatter.

The application is responsible for setting up both the gather list
in SrcPkt_Handle and the scatter list in DstPkt_Handle for each packet.
The application is responsible for allocating and releasing the individual
gather and scatter buffers.

The buffers used for the Scatter and/or Gather data are not bounced and
must be allocated and provided to the driver as DMA-safe buffers. This
can be achieved by using the driver's DMABuf API.

Continuous Scatter mode
The ARM mode supports continuous scatter mode, which can be enabled per
ring. When continuous scatter mode is enabled for a ring, the result
packets are written to a sequence of destination buffers supplied by the
function PEC_Scatter_Preload(). Each result packet will occupy one
or more destination buffers (it will be scattered), depending on the
packet length and the sizes of the destination buffers.

It is not determined in advance which destination buffers will be
used for the result packet of a certain PEC_Packet_Put(). In a
typical use case, the application pre-allocates a number of
buffers, each of the same size, and submits them to the
PEC_Scatter_Preload() function. It then regularly calls
PEC_Scatter_Preload() to refill the supply of destination buffers as
buffers are consumed by result packets.

Continuous scatter mode can be used even if the driver is configured
without scatter-gather support, but in that case each packet is required
to fit into a single destination buffer.

Continuous scatter mode is not supported with LAC flows.

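When sizing the preloaded buffer pool it helps to know how many buffers a
result packet needs at minimum. The sketch below assumes equally sized
destination buffers; the function names are hypothetical. The result is a
lower bound only: this document states elsewhere that at least one buffer
is used even for a zero-length packet and that the engine may in some cases
use more buffers than the packet length requires.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Lower bound on the number of equally sized destination buffers that a
 * result packet of `packet_bytes` occupies. Never less than one. */
size_t scatter_min_buffers(size_t packet_bytes, size_t buffer_bytes)
{
    size_t n;

    assert(buffer_bytes > 0);
    n = packet_bytes / buffer_bytes;
    if (packet_bytes % buffer_bytes != 0)
        n++;                       /* partially filled last buffer */
    return (n == 0) ? 1 : n;       /* empty packet still takes one buffer */
}

/* Without scatter-gather support every packet must fit in one buffer. */
bool fits_single_buffer(size_t packet_bytes, size_t buffer_bytes)
{
    return packet_bytes <= buffer_bytes;
}
```

For example, with 2048-byte buffers a 5000-byte result needs at least three
buffers, and a 2049-byte result does not fit a single buffer.
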
Redirection
Some configurations of the hardware support redirection. A packet
originally submitted with PEC_Packet_Put() can have its result appear
on a different ring or on the inline interface. A packet originally
received on the inline interface can have its result appear on
a ring.

Any ring from which packets can be redirected must be configured with
continuous scatter mode. Any ring towards which packets can be redirected
must also be configured with continuous scatter mode. When redirection is
possible, the sequence of packets submitted with PEC_Packet_Put() and
the sequence of result packets retrieved with PEC_Packet_Get()
on the same ring are no longer related to one another.

When a packet submitted with PEC_Packet_Put() is redirected to the
inline interface, no result descriptor will be received with
PEC_Packet_Get(). When a packet is received on the inline interface and
it is redirected to a ring, a result descriptor will appear for which no
corresponding command descriptor was submitted with PEC_Packet_Put().

Command Descriptor fields
User_p is fully supported on rings that do not use continuous scatter mode
and allows the user to match results to commands. User_p is not supported
on rings with continuous scatter mode.

The Control1 field is not used by this implementation. The PEC
API function PEC_CD_Control_Write is not implemented. Instead, use
the IOToken API to pass an array of 32-bit words via the
InputToken_p field. The application is responsible for allocating
this array. The HW_Services field in the data structure passed to
the IOToken API specifies the exact packet flow or, alternatively,
a record invalidation command. The values to be filled in are
provided by the firmware API.

Control2 can be used to specify the engine on which a packet must be
processed. This can be useful for protocols like TLS, where subsequent
packets of the same data stream must be processed on the same engine to
ensure in-order processing and in-order assignment of the sequence numbers.
Bit 5 in Control2 is set when the engine is specified; the engine ID
is then put in bits 4..0. Otherwise, the Control2 field should be all zero.

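The Control2 layout described above can be expressed as a small helper.
This is a sketch: the macro and function names are hypothetical, and the
use of uint32_t for the field is an assumption rather than the type
declared in the PEC API headers.

```c
#include <assert.h>
#include <stdint.h>

#define CONTROL2_ENGINE_VALID  (1u << 5)  /* bit 5: an engine is specified */
#define CONTROL2_ENGINE_MASK   0x1Fu      /* bits 4..0: engine ID */

/* Control2 value that selects the given engine (ID 0..31). */
uint32_t control2_for_engine(unsigned int engine_id)
{
    assert(engine_id <= CONTROL2_ENGINE_MASK);
    return CONTROL2_ENGINE_VALID | (engine_id & CONTROL2_ENGINE_MASK);
}

/* Control2 value when no engine is specified: all zero. */
uint32_t control2_any_engine(void)
{
    return 0u;
}
```

Selecting engine 3 yields 0x23, while engine 0 yields 0x20, which differs
from the all-zero value that leaves the choice of engine open.
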
LAC packet flow:
- A valid Token_Handle must always be provided for each packet and
  Token_WordCount must be set to the exact size in words of the token.
- SA_Handle1 must point to the main SA, which must be registered by
  PEC_SA_Register.
- SA_Handle2 must be a null handle.
- SA_WordCount is not used.
- DstPkt_Handle must always be provided, also for input-only operations.
  When no destination buffer is required, it can be set to SrcPkt_Handle.
- The TokenHeaderWord passed to the IOToken API must be filled in.
- The Offset_ByteCount passed to the IOToken API is not used.

Other packet flows:
- Token_Handle is the null handle and Token_WordCount is zero.
- SA_Handle1 is the null handle if classification is used, else it is the
  DMABuf handle representing a transform record.
- SA_Handle2 must be a null handle.
- SA_WordCount is not used.
- DstPkt_Handle must always be provided (except for continuous
  scatter mode), also for input-only operations.
  When no destination buffer is required, it can be set to SrcPkt_Handle.
  When continuous scatter mode is enabled, a destination handle must not
  be provided.
- The Offset_ByteCount field passed to the IOToken API specifies the
  number of bytes at the start of each packet that will be passed
  unchanged and are not part of the packet to be processed.
- The NextHeader field passed to the IOToken API specifies the Next
  Header field for IPsec packet flows that do not use network header
  processing.

Record invalidation commands:
- Token_Handle is the null handle and Token_WordCount is zero.
- SA_Handle1 must point to the SA, transform or flow record to be
  invalidated.
- SA_Handle2 must be a null handle.
- SA_WordCount is not used.
- SrcPkt_Handle and DstPkt_Handle must both be null handles.
- SrcPkt_ByteCount is zero.

Note: A record invalidation command may be submitted via the PEC API
to the engine only when the engine has no packets being processed
for this record.

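The field rules listed above lend themselves to a sanity check before
submission. The sketch below uses a hypothetical, simplified model of the
command descriptor (cmd_model_t) in which a NULL pointer stands in for a
null DMABuf handle; it is not the PEC_CommandDescriptor_t layout. Two
assumptions are made: a valid LAC token is taken to be non-empty, and the
record handle of an invalidation command is modelled as SA_Handle1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the command descriptor fields discussed above. */
typedef struct {
    const void  *Token_Handle;
    unsigned int Token_WordCount;
    const void  *SA_Handle1;
    const void  *SA_Handle2;
    const void  *SrcPkt_Handle;
    const void  *DstPkt_Handle;
    unsigned int SrcPkt_ByteCount;
} cmd_model_t;

/* Rules for the LAC packet flow. */
bool cmd_is_valid_lac(const cmd_model_t *c)
{
    assert(c != NULL);
    return c->Token_Handle != NULL &&
           c->Token_WordCount > 0 &&     /* assumption: token is non-empty */
           c->SA_Handle1 != NULL &&
           c->SA_Handle2 == NULL &&
           c->DstPkt_Handle != NULL;     /* required even for input-only */
}

/* Rules for a record invalidation command. */
bool cmd_is_valid_invalidation(const cmd_model_t *c)
{
    assert(c != NULL);
    return c->Token_Handle == NULL &&
           c->Token_WordCount == 0 &&
           c->SA_Handle1 != NULL &&      /* record to be invalidated */
           c->SA_Handle2 == NULL &&
           c->SrcPkt_Handle == NULL &&
           c->DstPkt_Handle == NULL &&
           c->SrcPkt_ByteCount == 0;
}
```

Such checks only cover the descriptor fields; the requirement that no
packets are in flight for a record being invalidated has to be tracked by
the application separately.
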
Result Descriptor fields
User_p is fully supported on rings that do not use continuous scatter mode
and allows the user to match results to commands. User_p is not supported
on rings with continuous scatter mode.

SrcPkt_Handle and DstPkt_Handle are the same as provided in the command
descriptor. DstPkt_p is the host address for DstPkt_Handle.
On rings with continuous scatter mode, these fields are the NULL handle
and NULL pointer. On rings with continuous scatter mode, the NumParticles
field will be the number of scatter buffers used by the result packet.
At least one scatter buffer will be used, even if the result packet
has zero length. In some cases, the number of scatter buffers is
higher than would be required by the result packet.

DstPkt_ByteCount and Bypass_WordCount are extracted from the engine
result descriptor as described in the engine datasheet under PE_LENGTH.
Bypass_WordCount should be the same as was provided in the command
descriptor.

For operations that do not require output buffers, such as hash
operations, applications must set the SrcPkt_Handle and DstPkt_Handle
fields of the PEC_CommandDescriptor_t descriptor to the same handle.
The advantage of this solution is that the driver still checks that
the output buffer handle is not NULL for all operations and can detect
errors in applications that do not provide a correct output buffer handle.
The disadvantage is that for hash operations this degrades performance,
because the driver copies the bounce buffer back to the original buffer
and performs the PostDMA operation, neither of which is needed.

Status1 and Status2 reflect up to two words from the result token that
contain relevant status information. Use the function
PEC_RD_Status_Read to extract this information in an
engine-independent form.

More status information is passed in an array of 32-bit words via
the OutputToken_p field. The IOToken API can be used to extract
information from this array. The application is responsible for allocating
this array before the call to PEC_Packet_Get.

Notify Requests (callbacks)
In ARM mode, two notify requests (commands and results) are
supported. The result notification callback is only invoked in
interrupt mode. The command notification callback is invoked
from within PEC_Packet_Get.

SA Invalidation
In order to remove an SA from the system, carry out the following
operations in the specified order.
- Submit a special command descriptor with a transform record invalidation
  command via PEC_Packet_Put(). This command will remove the record from
  the record cache of the packet engine.
- Wait until the corresponding result descriptor is received via
  PEC_Packet_Get().
- Call PEC_SA_UnRegister(). This call takes care of CPU cache
  coherency, endianness conversion and bounce buffers, whichever applies.
- At this point the DMA buffer of the SA can be reused for a different
  purpose or it can be freed.

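The required ordering can be captured as a small state machine that an
application might keep per SA. This is a sketch with hypothetical names
(sa_state_t, sa_event_t, sa_removal_step); it does not call the PEC API and
only tracks which step is allowed next.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum {
    SA_ACTIVE,              /* registered and possibly in use */
    SA_INVALIDATE_PENDING,  /* invalidation command has been submitted */
    SA_INVALIDATED,         /* its result descriptor has been received */
    SA_UNREGISTERED         /* unregistered: buffer may be reused or freed */
} sa_state_t;

typedef enum {
    EV_INVALIDATE_PUT,      /* invalidation command submitted */
    EV_INVALIDATE_RESULT,   /* matching result descriptor retrieved */
    EV_UNREGISTER           /* SA unregistered */
} sa_event_t;

/* Advance the removal sequence. Returns false and leaves the state
 * unchanged when the event is out of order. */
bool sa_removal_step(sa_state_t *state, sa_event_t ev)
{
    assert(state != NULL);
    if (*state == SA_ACTIVE && ev == EV_INVALIDATE_PUT) {
        *state = SA_INVALIDATE_PENDING;
    } else if (*state == SA_INVALIDATE_PENDING && ev == EV_INVALIDATE_RESULT) {
        *state = SA_INVALIDATED;
    } else if (*state == SA_INVALIDATED && ev == EV_UNREGISTER) {
        *state = SA_UNREGISTERED;
    } else {
        return false;
    }
    return true;
}

/* The SA buffer may be reused or freed only after the last step. */
bool sa_buffer_may_be_freed(sa_state_t state)
{
    return state == SA_UNREGISTERED;
}
```

An attempt to unregister an SA that has not been invalidated is rejected by
the model, which mirrors the order given in the list above.
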
Banks for DMA resources
DMA-safe buffers for each data type must be allocated with the
correct Bank parameter (in DMABuf_Properties_t).
Buffers for SA records must be allocated with Bank=1; all other
buffers must be allocated with Bank=0. On 64-bit hosts, SA buffers
must be allocated in a 4GB memory range, which is taken care of by
using Bank=1.

SA resources
The functions PEC_SA_Register and PEC_SA_UnRegister take three
DMABuf handles as parameters. The first of these is always the
DMABuf handle representing the SA, the second is always a null
handle and the third is the null handle if the SA does not have
an ARC4 state record.

If the SA does have an ARC4 state record, the SA_Handle3
parameter represents the ARC4 state record. This DMABuf handle
must represent the ARC4 state part within the SA buffer: the
application is expected to use DMABuf_Register (AllocatorRef=='R')
to register that subset of the SA buffer as a DMABuf handle.

Multiple Applications
The current implementation supports multiple rings (tested with
two rings) and each ring can be used by a separate application,
independently of other rings. The applications can run
concurrently, as long as each application uses a different
ring. The use of PEC_Packet_Put and PEC_Packet_Get by different
concurrent applications requires no locking. The implementation
of PEC_SA_UnRegister contains the required locking to support
multiple applications.

Concurrent Context Synchronization (CCS)
The PEC API implementation supports a single multi-threaded application
per interface ID, allowing concurrent and independent use of
PEC_SA_Register, PEC_SA_UnRegister, PEC_Packet_Put and PEC_Packet_Get.
Multiple applications using the PEC API are also supported, but each must
use a different interface ID.

Note: although the PEC_SA_Register and PEC_SA_UnRegister functions take
InterfaceId as an input parameter, these functions ignore it because
they do nothing that is specific to a packet I/O (ring) interface.

The PEC API implementation provides synchronization mechanisms for
concurrent contexts invoking the API functions. The API can be used
by multi-threaded user-space and kernel-space applications. The latter
can invoke the API functions from the user process as well as from softirq
contexts. Mixing user process execution with softirq contexts is also
supported. Both Linux Uni-Processor (UP) and Symmetric Multi-Processor
(SMP) kernel configurations are supported.

The PEC API allows for non-blocking synchronization of concurrent contexts
invoking the API functions for different interface IDs. The only
exceptions are the PEC_Init and PEC_UnInit functions, which both allow
just one execution context at a time, even for different interface IDs.
Also, for the PEC_UnInit function to succeed, no context may be executing
the PEC_Packet_Put or PEC_Packet_Get function code for the same
interface ID.

For optimal utilization of the packet engine, the PEC API user should
allow one context to invoke PEC_Packet_Put while another invokes
PEC_Packet_Get for the same interface ID. Note that having multiple
concurrent contexts invoking the PEC_Packet_Put function for the same
interface ID will not improve performance, because this function does not
allow more than one execution context at a time for one interface ID. The
same applies to the PEC_Packet_Get function.

When a PEC API function detects that it competes for a resource that is
in use by another context executing the PEC code, it does not block but
returns PEC_STATUS_BUSY. The caller should call the function again
shortly after.

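A caller can wrap such a function in a bounded retry loop. The sketch
below is a model, not driver code: status_model_t stands in for the PEC
status type (ST_BUSY plays the role of PEC_STATUS_BUSY), and op_fn_t,
call_with_retry and the busy_then_ok stub are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the PEC status codes. */
typedef enum { ST_OK = 0, ST_BUSY = 1, ST_ERROR = 2 } status_model_t;

/* Hypothetical operation that may report BUSY instead of blocking. */
typedef status_model_t (*op_fn_t)(void *ctx);

/* Call `op` until it stops reporting BUSY or `max_attempts` is reached.
 * Returns the last status; the attempt count goes to *attempts_out. */
status_model_t call_with_retry(op_fn_t op, void *ctx,
                               unsigned int max_attempts,
                               unsigned int *attempts_out)
{
    status_model_t st = ST_BUSY;
    unsigned int n = 0;

    assert(op != NULL);
    while (n < max_attempts) {
        st = op(ctx);
        n++;
        if (st != ST_BUSY)
            break;
        /* A real caller would yield or back off briefly here. */
    }
    if (attempts_out != NULL)
        *attempts_out = n;
    return st;
}

/* Demo stub: reports BUSY while the counter in ctx is non-zero. */
status_model_t busy_then_ok(void *ctx)
{
    unsigned int *remaining = ctx;

    if (*remaining > 0) {
        (*remaining)--;
        return ST_BUSY;
    }
    return ST_OK;
}
```

Bounding the number of attempts keeps a softirq or other non-sleeping
context from spinning indefinitely when the competing context is slow.
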
Debugging
The PEC_Put_Dump() and PEC_Get_Dump() functions can be used to print
the administration, the cached data and the ring buffer content of the
command ring and the result ring, respectively. A slot corresponds to
a descriptor in the ring. These functions can be used to debug the packet
I/O functionality.


<end of document>