trace: Update documentation

Revamp the documentation for the new features, including a description of
the new features and documentation for the trace command.

Signed-off-by: Simon Glass <sjg@chromium.org>
diff --git a/.gitattributes b/.gitattributes
index 1879a2d..d5931f0 100644
--- a/.gitattributes
+++ b/.gitattributes
@@ -4,3 +4,4 @@
 *.bmp binary
 *.ttf binary
 *.gz binary
+*.png binary
diff --git a/doc/develop/pics/flamegraph.png b/doc/develop/pics/flamegraph.png
new file mode 100644
index 0000000..dbdd27e
--- /dev/null
+++ b/doc/develop/pics/flamegraph.png
Binary files differ
diff --git a/doc/develop/pics/flamegraph_timing.png b/doc/develop/pics/flamegraph_timing.png
new file mode 100644
index 0000000..b388b8b
--- /dev/null
+++ b/doc/develop/pics/flamegraph_timing.png
Binary files differ
diff --git a/doc/develop/pics/flamegraph_zoom.png b/doc/develop/pics/flamegraph_zoom.png
new file mode 100644
index 0000000..38b6844
--- /dev/null
+++ b/doc/develop/pics/flamegraph_zoom.png
Binary files differ
diff --git a/doc/develop/pics/kernelshark.png b/doc/develop/pics/kernelshark.png
new file mode 100644
index 0000000..3450c6b
--- /dev/null
+++ b/doc/develop/pics/kernelshark.png
Binary files differ
diff --git a/doc/develop/pics/kernelshark_fg.png b/doc/develop/pics/kernelshark_fg.png
new file mode 100644
index 0000000..a54efd1
--- /dev/null
+++ b/doc/develop/pics/kernelshark_fg.png
Binary files differ
diff --git a/doc/develop/trace.rst b/doc/develop/trace.rst
index 5c7802d..8425d84 100644
--- a/doc/develop/trace.rst
+++ b/doc/develop/trace.rst
@@ -15,13 +15,17 @@
 The trace feature uses GCC's instrument-functions feature to trace all
 function entry/exit points. These are then recorded in a memory buffer.
 The memory buffer can be saved to the host over a network link using
-tftpput or by writing to an attached memory device such as MMC.
+tftpput or by writing to an attached storage device such as MMC.
 
 On the host, the file is first converted with a tool called 'proftool',
 which extracts useful information from it. The resulting trace output
 resembles that emitted by Linux's ftrace feature, so can be visually
-displayed by pytimechart.
+displayed by kernelshark (see kernelshark_) and used with
+'trace-cmd report' (see trace_cmd_).
 
+It is also possible to produce a flame graph for use with flamegraph.pl
+(see flamegraph_pl_).
+
 
 Quick-start using Sandbox
 -------------------------
@@ -30,7 +34,7 @@
 way of trying out tracing before you use it on your actual board. To do
 this, follow these steps:
 
-Add the following to config/sandbox_defconfig
+Add the following to `config/sandbox_defconfig`:
 
 .. code-block:: c
 
@@ -75,7 +79,7 @@
              16 maximum observed call depth
              15 call depth limit
           1,275,767 calls not traced due to depth
-    =>trace calls 0 e00000
+    =>trace calls 1000000 e00000
     Call list dumped to 00000000, size 0xae0a40
     =>print
     baudrate=115200
@@ -87,7 +91,7 @@
     stdout=serial
 
     Environment size: 117/8188 bytes
-    =>host save host 0 trace 0 ${profoffset}
+    =>host save hostfs - 1000000 trace ${profoffset}
     11405888 bytes written in 10 ms (1.1 GiB/s)
     =>reset
 
@@ -96,18 +100,107 @@
 
 .. code-block:: console
 
-    $ ./sandbox/tools/proftool -m sandbox/System.map -p trace dump-ftrace >trace.txt
+    $ ./sandbox/tools/proftool -m sandbox/System.map -t trace dump-ftrace >trace.dat
 
-Finally run pytimechart to display it
+Finally run kernelshark to display it (note it only works with `.dat` files!):
 
 .. code-block:: console
 
-    $ pytimechart trace.txt
+    $ kernelshark trace.dat
 
-Using this tool you can zoom and pan across the trace, with the function
-calls on the left and little marks representing the start and end of each
+Using this tool you can view the trace records and see the timestamp for each
 function.
 
+.. image:: pics/kernelshark.png
+  :width: 800
+  :alt: Kernelshark showing function-trace records
+
+
+To see the records on the console, use trace-cmd:
+
+.. code-block:: console
+
+    $ trace-cmd report trace.dat | less
+    cpus=1
+          u-boot-1     [000]     3.116364: function:             initf_malloc
+          u-boot-1     [000]     3.116375: function:             initf_malloc
+          u-boot-1     [000]     3.116386: function:             initf_bootstage
+          u-boot-1     [000]     3.116396: function:                bootstage_init
+          u-boot-1     [000]     3.116408: function:                   malloc
+          u-boot-1     [000]     3.116418: function:                      malloc_simple
+          u-boot-1     [000]     3.116429: function:                         alloc_simple
+          u-boot-1     [000]     3.116441: function:                         alloc_simple
+          u-boot-1     [000]     3.116449: function:                      malloc_simple
+          u-boot-1     [000]     3.116457: function:                   malloc
+
+Note that `pytimechart` is obsolete so cannot be used anymore.
+
+There is a -f option available to select a function graph:
+
+.. code-block:: console
+
+    $ ./sandbox/tools/proftool -m sandbox/System.map -t trace -f funcgraph dump-ftrace >trace.dat
+
+Again, you can use kernelshark or trace-cmd to look at the output. In this case
+you will see the time taken by each function shown against its exit record.
+
+.. image:: pics/kernelshark_fg.png
+  :width: 800
+  :alt: Kernelshark showing function-graph records
+
+.. code-block:: console
+
+    $ trace-cmd report trace.dat | less
+    cpus=1
+              u-boot-1     [000]     3.116364: funcgraph_entry:        0.011 us   |    initf_malloc();
+              u-boot-1     [000]     3.116386: funcgraph_entry:                   |    initf_bootstage() {
+              u-boot-1     [000]     3.116396: funcgraph_entry:                   |      bootstage_init() {
+              u-boot-1     [000]     3.116408: funcgraph_entry:                   |        malloc() {
+              u-boot-1     [000]     3.116418: funcgraph_entry:                   |          malloc_simple() {
+              u-boot-1     [000]     3.116429: funcgraph_entry:        0.012 us   |            alloc_simple();
+              u-boot-1     [000]     3.116449: funcgraph_exit:         0.031 us   |            }
+              u-boot-1     [000]     3.116457: funcgraph_exit:         0.049 us   |          }
+              u-boot-1     [000]     3.116466: funcgraph_entry:        0.063 us   |        memset();
+              u-boot-1     [000]     3.116539: funcgraph_exit:         0.143 us   |        }
+
+Flame graph
+-----------
+
+Some simple flame graph options are available as well, using the dump-flamegraph
+command:
+
+.. code-block:: console
+
+    $ ./sandbox/tools/proftool -m sandbox/System.map -t trace dump-flamegraph >trace.fg
+    $ flamegraph.pl trace.fg >trace.svg
+
+You can load the .svg file into a viewer. If you use Chrome (and some other
+programs) you can click around and zoom in and out.
+
+.. image:: pics/flamegraph.png
+  :width: 800
+  :alt: Chrome showing the flamegraph.pl output
+
+.. image:: pics/flamegraph_zoom.png
+  :width: 800
+  :alt: Chrome showing zooming into the flamegraph.pl output
+
+
+A timing variant is also available, which gives an idea of how much time is
+spend in each call stack:
+
+.. code-block:: console
+
+    $ ./sandbox/tools/proftool -m sandbox/System.map -t trace dump-flamegraph -f timing >trace.fg
+    $ flamegraph.pl trace.fg >trace.svg
+
+Note that trace collection does slow down execution so the timings will be
+inflated. They should be used to guide optimisation. For accurate boot timings,
+use bootstage.
+
+.. image:: pics/flamegraph_timing.png
+  :width: 800
+  :alt: Chrome showing flamegraph.pl output with timing
 
 CONFIG Options
 --------------
@@ -138,6 +231,11 @@
 CONFIG_TRACE_EARLY_ADDR
     Address of early trace buffer
 
+CONFIG_TRACE_CALL_DEPTH_LIMIT
+    Sets the limit on trace call-depth. For a broad view, 10 is typically
+    sufficient. Setting this too large creates enormous traces and distorts
+    the overall timing considerable.
+
 
 Building U-Boot with Tracing Enabled
 ------------------------------------
@@ -148,6 +246,26 @@
 config files.
 
 
+Board requirements
+------------------
+
+Trace data collection relies on a microsecond timer, accessed through
+`timer_get_us()`. So the first thing you should do is make sure that
+this produces sensible results for your board. Suitable sources for
+this timer include high resolution timers, PWMs or profile timers if
+available. Most modern SOCs have a suitable timer for this.
+
+See `add_ftrace()` for where `timer_get_us()` is called. The `notrace`
+attribute must be used on each function called by `timer_get_us()` since
+recursive calls to `add_ftrace()` will cause a fault::
+
+   trace: recursion detected, disabling
+
+You cannot use driver model to obtain the microsecond timer, since tracing
+may be enabled before driver model is set up. Instead, provide a low-level
+function which accesses the timer, setting it up if needed.
+
+
 Collecting Trace Data
 ---------------------
 
@@ -155,21 +273,22 @@
 limit of the trace buffer size you have specified. Once that is exhausted
 no more data will be collected.
 
-Collecting trace data has an affect on execution time/performance. You
+Collecting trace data affects execution time and performance. You
 will notice this particularly with trivial functions - the overhead of
 recording their execution may even exceed their normal execution time.
 In practice this doesn't matter much so long as you are aware of the
 effect. Once you have done your optimizations, turn off tracing before
-doing end-to-end timing.
+doing end-to-end timing using bootstage.
 
 The best time to start tracing is right at the beginning of U-Boot. The
 best time to stop tracing is right at the end. In practice it is hard
 to achieve these ideals.
 
-This implementation enables tracing early in board_init_f(). This means
+This implementation enables tracing early in `board_init_r()`, or
+`board_init_f()` when `TRACE_EARLY` is enabled. This means
 that it captures most of the board init process, missing only the
 early architecture-specific init. However, it also misses the entire
-SPL stage if there is one.
+SPL stage if there is one. At present tracing is not supported in SPL.
 
 U-Boot typically ends with a 'bootm' command which loads and runs an
 OS. There is useful trace data in the execution of that bootm
@@ -179,39 +298,11 @@
 variable at this point. This variable should have a short script which
 collects the trace data and writes it somewhere.
 
-Trace data collection relies on a microsecond timer, accessed through
-timer_get_us(). So the first think you should do is make sure that
-this produces sensible results for your board. Suitable sources for
-this timer include high resolution timers, PWMs or profile timers if
-available. Most modern SOCs have a suitable timer for this. Make sure
-that you mark this timer (and anything it calls) with
-notrace so that the trace library can
-use it without causing an infinite loop.
-
-
-Commands
---------
-
-The trace command has variable sub-commands:
-
-stats
-    Display tracing statistics
-
-pause
-    Pause tracing
-
-resume
-    Resume tracing
-
-funclist [<addr> <size>]
-    Dump a list of functions into the buffer
-
-calls  [<addr> <size>]
-    Dump function call trace into buffer
+Controlling the trace
+---------------------
 
-If the address and size are not given, these are obtained from environment
-variables (see below). In any case the environment variables are updated
-after the command runs.
+U-Boot provides a command-line interface to the trace system for controlling
+tracing and accessing the trace data. See :doc:`../usage/cmd/trace`.
 
 
 Environment Variables
@@ -264,39 +355,94 @@
 TFTP. After this, U-Boot will boot the OS normally, albeit a little
 later.
 
+For a filesystem you may do something like::
+
+    trace calls 10000000 1000000;
+    save mmc 1:1 10000000 /trace ${profoffset}
+
+The trace buffer format is internal to the trace system. It consists of a
+header, a call count for each function site, followed by a list of trace
+records, once for each function call.
+
 
-Converting Trace Output Data
-----------------------------
+Converting Trace Output Data (proftool)
+---------------------------------------
 
 The trace output data is kept in a binary format which is not documented
-here. To convert it into something useful, you can use proftool.
+here. See the `trace.h` header file if you are interested. To convert it into
+something useful, you can use proftool.
 
 This tool must be given the U-Boot map file and the trace data received
-from running that U-Boot. It produces a text output file.
+from running that U-Boot. It produces a binary output file.
 
-Options
+It is also possible to provide a configuration file to indicate which functions
+should be included or dropped during conversion. This file consists of lines
+like::
+
+   include-func <regex>
+   exclude-func <regex>
+
+where <regex> is a regular expression matched against function names. It
+allows some functions to be dropped from the trace when producing ftrace
+records.
+
+Options:
+
+-c <config_file>
+    Specify the optional configuration file, to control which functions are
+    included in the output.
+
+-f <format>
+    Specifies the format to use (see below)
 
 -m <map_file>
-    Specify U-Boot map file
+    Specify U-Boot map file (`System.map`)
+
+-o <output file>
+    Specify the output filename
 
--p <trace_file>
-    Specify profile/trace file
+-t <trace_file>
+    Specify trace file, the data saved from U-Boot
+
+-v <0-4>
+    Specify the verbosity, where 0 is the minimum and 4 is for debugging.
 
 Commands:
 
+dump-ftrace:
+    Write a binary dump of the file in Linux ftrace format. Two options are
+    available:
+
+    function
+        write function-call records (caller/callee)
+
-dump-ftrace
-    Write a text dump of the file in Linux ftrace format to stdout
+    funcgraph
+        write function entry/exit records (graph)
 
+    This format can be used with kernelshark_ and trace_cmd_.
+
+dump-flamegraph
+    Write a list of stack records useful for producing a flame graph. Two
+    options are available:
+
+    calls
+        create a flamegraph of stack frames
+
+    timing
+        create a flamegraph of microseconds for each stack frame
+
+    This format can be used with flamegraph_pl_.
 
 Viewing the Trace Data
 ----------------------
 
-You can use pytimechart for this (sudo apt-get pytimechart might work on
-your Debian-style machine, and use your favourite search engine to obtain
-documentation). It expects the file to have a .txt extension. The program
-has terse user interface but is very convenient for viewing U-Boot
-profile information.
+You can use kernelshark_ for a GUI, but note that version 2.0.x was broken. If
+you have that version you could try building it from source.
 
+The file must have a .dat extension or it is ignored. The program has terse
+user interface but is very convenient for viewing U-Boot profile information.
+
+Also available is trace_cmd_ which provides a command-line interface.
 
 Workflow Suggestions
 --------------------
@@ -329,7 +475,9 @@
 There is a function call depth limit (set to 15 by default). When the
 stack depth goes above this then no tracing information is recorded.
 The maximum depth reached is recorded and displayed by the 'trace stats'
-command.
+command. While it might be tempting to set the depth limit quite high, this
+can dramatically increase the size of the trace output as well as the execution
+time.
 
 
 Future Work
@@ -346,5 +494,10 @@
 - Compression of trace information
 
 
-Simon Glass <sjg@chromium.org>
-April 2013
+.. sectionauthor:: Simon Glass <sjg@chromium.org>
+.. April 2013
+.. Updated January 2023
+
+.. _kernelshark: https://kernelshark.org/
+.. _trace_cmd: https://www.trace-cmd.org/
+.. _flamegraph_pl: https://github.com/brendangregg/FlameGraph/blob/master/flamegraph.pl
diff --git a/doc/usage/cmd/trace.rst b/doc/usage/cmd/trace.rst
new file mode 100644
index 0000000..3bdf4f0
--- /dev/null
+++ b/doc/usage/cmd/trace.rst
@@ -0,0 +1,163 @@
+.. SPDX-License-Identifier: GPL-2.0+:
+
+trace command
+=============
+
+Synopis
+-------
+
+::
+
+    trace stats
+    trace pause
+    trace resume
+    trace funclist [<addr> <size>]
+    trace calls [<addr> <size>]
+
+Description
+-----------
+
+The *trace* command is used to control the U-Boot tracing system. It allows
+tracing to be paused and resumed, shows statistics for traces and provides a
+way to dump out the trace information.
+
+
+trace stats
+~~~~~~~~~~~
+
+This display tracing statistics, as follows:
+
+function sites
+    Functions are binned as a way of reducing the amount of space needed to
+    hold all the function information. This is controlled by FUNC_SITE_SIZE in
+    the trace.h header file. The usual value is 4, which provides the finest
+    granularity (assuming a minimum instruction size of 4 bytes) which means
+    that every function can be resolved individually.
+
+function calls
+    Total number of function calls, including those which were not traced due
+    to buffer space. This count does not include functions which exceeded the
+    depth limit.
+
+untracked function calls
+    Total number of function calls which did not appear in the U-Boot image.
+    This can happen if a function is called outside the normal code area.
+
+traced function calls
+    Total number of function calls which were actually traced, i.e. are included
+    in the recorded trace data.
+
+dropped due to overflow
+    If the trace buffer was exhausted then this shows the number of records that
+    were dropped. Try reducing the depth limit or expanding the buffer size.
+
+maximum observed call depth
+    Maximum observed call depth while tracing.
+
+calls not traced due to depth
+    Counts the number of function calls that were not recorded because they
+    exceeded the maximum call depth.
+
+max function calls
+    Maximum number of function calls which can be recorded in the trace buffer,
+    given its size. Once `function calls` hits this value, recording stops.
+
+trace buffer
+    Address of trace buffer
+
+call records
+    Address of first trace record. This is near the start of the trace buffer,
+    after the function-call counts.
+
+
+trace pause
+~~~~~~~~~~~
+
+Pauses tracing, so that no more data is added to the trace buffer.
+
+
+trace resume
+~~~~~~~~~~~~
+
+Resumes tracing, so that new function calls are added to the trace buffer if
+there is sufficient space.
+
+
+trace funclist [<addr> <size>]
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Dumps a list of functions into the provided buffer. The file uses a format
+specific to U-Boot: a header, following by the function offset and call count.
+
+If the address and size are not given, these are obtained from
+:ref:`develop/trace:environment variables`. In any case the environment
+variables are updated after the command runs.
+
+The resulting data should be written out to the host, e.g. using Ethernet or
+a filesystem. There are no tools provided to read this sdata.
+
+
+trace calls [<addr> <size>]
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Dumps a list of function calls into the provided buffer. The file uses a format
+specific to U-Boot: a header, following by the list of calls. The proftool
+tool can be used to convert this information ready for further analysis.
+
+
+Example
+-------
+
+::
+
+    => trace stats
+            269,252 function sites
+         38,025,059 function calls
+                  3 untracked function calls
+          7,382,690 traced function calls
+                 17 maximum observed call depth
+                 15 call depth limit
+         68,667,432 calls not traced due to depth
+         22,190,112 max function calls
+
+    trace buffer 6c000000 call records 6c20de78
+    => trace resume
+    => trace pause
+
+This shows that resuming the trace causes the buffer to overflow::
+
+    => trace stats
+            269,252 function sites
+         49,573,694 function calls
+                  3 untracked function calls
+         22,190,112 traced function calls (8289848 dropped due to overflow)
+                 17 maximum observed call depth
+                 15 call depth limit
+         68,667,432 calls not traced due to depth
+         22,190,112 max function calls
+
+    trace buffer 6c000000 call records 6c20de78
+    => trace funcs 30000000 0x100000
+    Function trace dumped to 30000000, size 0x1e70
+
+This shows collecting and writing out the result trace data:
+
+::
+    => trace calls 20000000 0x10000000
+    Call list dumped to 20000000, size 0xfdf21a0
+    => save mmc 1:1 20000000 /trace ${profoffset}
+    File System is consistent
+    file found, deleting
+    update journal finished
+    File System is consistent
+    update journal finished
+    266281376 bytes written in 18584 ms (13.7 MiB/s)
+
+From here you can use proftool to convert it:
+
+.. code-block:: bash
+
+    tools/proftool -m System.map -t trace -o asc.fg dump-ftrace
+
+
+.. _`ACPI specification`: https://uefi.org/sites/default/files/resources/ACPI_6_3_final_Jan30.pdf
diff --git a/doc/usage/index.rst b/doc/usage/index.rst
index 13e6939..cde7dcb 100644
--- a/doc/usage/index.rst
+++ b/doc/usage/index.rst
@@ -84,6 +84,7 @@
    cmd/sm
    cmd/temperature
    cmd/tftpput
+   cmd/trace
    cmd/true
    cmd/ums
    cmd/ut