trace: Update documentation

Revamp the documentation to describe the new features and add documentation
for the trace command.

Signed-off-by: Simon Glass <sjg@chromium.org>
diff --git a/doc/develop/trace.rst b/doc/develop/trace.rst
index 5c7802d..8425d84 100644
--- a/doc/develop/trace.rst
+++ b/doc/develop/trace.rst
@@ -15,13 +15,17 @@
 The trace feature uses GCC's instrument-functions feature to trace all
 function entry/exit points. These are then recorded in a memory buffer.
 The memory buffer can be saved to the host over a network link using
-tftpput or by writing to an attached memory device such as MMC.
+tftpput or by writing to an attached storage device such as MMC.
 
 On the host, the file is first converted with a tool called 'proftool',
 which extracts useful information from it. The resulting trace output
 resembles that emitted by Linux's ftrace feature, so can be visually
-displayed by pytimechart.
+displayed by kernelshark (see kernelshark_) and used with
+'trace-cmd report' (see trace_cmd_).
 
+It is also possible to produce a flame graph for use with flamegraph.pl
+(see flamegraph_pl_).
+
 
 Quick-start using Sandbox
 -------------------------
@@ -30,7 +34,7 @@
 way of trying out tracing before you use it on your actual board. To do
 this, follow these steps:
 
-Add the following to config/sandbox_defconfig
+Add the following to `configs/sandbox_defconfig`:
 
 .. code-block:: c
 
@@ -75,7 +79,7 @@
              16 maximum observed call depth
              15 call depth limit
           1,275,767 calls not traced due to depth
-    =>trace calls 0 e00000
+    =>trace calls 1000000 e00000
     Call list dumped to 00000000, size 0xae0a40
     =>print
     baudrate=115200
@@ -87,7 +91,7 @@
     stdout=serial
 
     Environment size: 117/8188 bytes
-    =>host save host 0 trace 0 ${profoffset}
+    =>host save hostfs - 1000000 trace ${profoffset}
     11405888 bytes written in 10 ms (1.1 GiB/s)
     =>reset
 
@@ -96,18 +100,107 @@
 
 .. code-block:: console
 
-    $ ./sandbox/tools/proftool -m sandbox/System.map -p trace dump-ftrace >trace.txt
+    $ ./sandbox/tools/proftool -m sandbox/System.map -t trace dump-ftrace >trace.dat
 
-Finally run pytimechart to display it
+Finally run kernelshark to display it (note it only works with `.dat` files!):
 
 .. code-block:: console
 
-    $ pytimechart trace.txt
+    $ kernelshark trace.dat
 
-Using this tool you can zoom and pan across the trace, with the function
-calls on the left and little marks representing the start and end of each
+Using this tool you can view the trace records and see the timestamp for each
 function.
 
+.. image:: pics/kernelshark.png
+  :width: 800
+  :alt: Kernelshark showing function-trace records
+
+
+To see the records on the console, use trace-cmd:
+
+.. code-block:: console
+
+    $ trace-cmd report trace.dat | less
+    cpus=1
+          u-boot-1     [000]     3.116364: function:             initf_malloc
+          u-boot-1     [000]     3.116375: function:             initf_malloc
+          u-boot-1     [000]     3.116386: function:             initf_bootstage
+          u-boot-1     [000]     3.116396: function:                bootstage_init
+          u-boot-1     [000]     3.116408: function:                   malloc
+          u-boot-1     [000]     3.116418: function:                      malloc_simple
+          u-boot-1     [000]     3.116429: function:                         alloc_simple
+          u-boot-1     [000]     3.116441: function:                         alloc_simple
+          u-boot-1     [000]     3.116449: function:                      malloc_simple
+          u-boot-1     [000]     3.116457: function:                   malloc
+
+Note that `pytimechart` is obsolete, so it can no longer be used.
+
+There is a -f option available to select a function graph:
+
+.. code-block:: console
+
+    $ ./sandbox/tools/proftool -m sandbox/System.map -t trace -f funcgraph dump-ftrace >trace.dat
+
+Again, you can use kernelshark or trace-cmd to look at the output. In this case
+you will see the time taken by each function shown against its exit record.
+
+.. image:: pics/kernelshark_fg.png
+  :width: 800
+  :alt: Kernelshark showing function-graph records
+
+.. code-block:: console
+
+    $ trace-cmd report trace.dat | less
+    cpus=1
+              u-boot-1     [000]     3.116364: funcgraph_entry:        0.011 us   |    initf_malloc();
+              u-boot-1     [000]     3.116386: funcgraph_entry:                   |    initf_bootstage() {
+              u-boot-1     [000]     3.116396: funcgraph_entry:                   |      bootstage_init() {
+              u-boot-1     [000]     3.116408: funcgraph_entry:                   |        malloc() {
+              u-boot-1     [000]     3.116418: funcgraph_entry:                   |          malloc_simple() {
+              u-boot-1     [000]     3.116429: funcgraph_entry:        0.012 us   |            alloc_simple();
+              u-boot-1     [000]     3.116449: funcgraph_exit:         0.031 us   |            }
+              u-boot-1     [000]     3.116457: funcgraph_exit:         0.049 us   |          }
+              u-boot-1     [000]     3.116466: funcgraph_entry:        0.063 us   |        memset();
+              u-boot-1     [000]     3.116539: funcgraph_exit:         0.143 us   |        }
+
+Flame graph
+-----------
+
+Some simple flame graph options are available as well, using the dump-flamegraph
+command:
+
+.. code-block:: console
+
+    $ ./sandbox/tools/proftool -m sandbox/System.map -t trace dump-flamegraph >trace.fg
+    $ flamegraph.pl trace.fg >trace.svg
+
+You can load the .svg file into a viewer. If you use Chrome (and some other
+programs) you can click around and zoom in and out.
+
+.. image:: pics/flamegraph.png
+  :width: 800
+  :alt: Chrome showing the flamegraph.pl output
+
+.. image:: pics/flamegraph_zoom.png
+  :width: 800
+  :alt: Chrome showing zooming into the flamegraph.pl output
+
+
+A timing variant is also available, which gives an idea of how much time is
+spent in each call stack:
+
+.. code-block:: console
+
+    $ ./sandbox/tools/proftool -m sandbox/System.map -t trace dump-flamegraph -f timing >trace.fg
+    $ flamegraph.pl trace.fg >trace.svg
+
+Note that trace collection does slow down execution so the timings will be
+inflated. They should only be used to guide optimisation. For accurate boot timings,
+use bootstage.
+
+.. image:: pics/flamegraph_timing.png
+  :width: 800
+  :alt: Chrome showing flamegraph.pl output with timing
 
 CONFIG Options
 --------------
@@ -138,6 +231,11 @@
 CONFIG_TRACE_EARLY_ADDR
     Address of early trace buffer
 
+CONFIG_TRACE_CALL_DEPTH_LIMIT
+    Sets the limit on trace call-depth. For a broad view, 10 is typically
+    sufficient. Setting this too large creates enormous traces and distorts
+    the overall timing considerably.
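+
+    For example, to use a lower limit in your defconfig (the value here is
+    purely illustrative):
+
+    .. code-block:: c
+
+        CONFIG_TRACE_CALL_DEPTH_LIMIT=10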
+
 
 Building U-Boot with Tracing Enabled
 ------------------------------------
@@ -148,6 +246,26 @@
 config files.
 
 
+Board requirements
+------------------
+
+Trace data collection relies on a microsecond timer, accessed through
+`timer_get_us()`. So the first thing you should do is make sure that
+this produces sensible results for your board. Suitable sources for
+this timer include high resolution timers, PWMs or profile timers if
+available. Most modern SoCs have a suitable timer for this.
+
+See `add_ftrace()` for where `timer_get_us()` is called. The `notrace`
+attribute must be used on `timer_get_us()` and each function it calls, since
+recursive calls to `add_ftrace()` will cause a fault::
+
+   trace: recursion detected, disabling
+
+You cannot use driver model to obtain the microsecond timer, since tracing
+may be enabled before driver model is set up. Instead, provide a low-level
+function which accesses the timer, setting it up if needed.
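+
+As a rough sketch only (the register address, tick rate and the `my_timer_*`
+names below are invented purely for illustration), such a low-level function
+might look like this:
+
+.. code-block:: c
+
+    #include <common.h>
+    #include <asm/io.h>
+
+    /* Illustrative values: point these at your SoC's free-running counter */
+    #define MY_TIMER_COUNT_REG      ((void *)0x12340010)
+    #define MY_TIMER_TICKS_PER_US   24
+
+    /* anything called from timer_get_us() must be marked notrace as well */
+    static notrace void my_timer_setup(void)
+    {
+        /* enable or configure the hardware counter here, if needed */
+    }
+
+    notrace ulong timer_get_us(void)
+    {
+        static bool inited;
+
+        if (!inited) {
+            my_timer_setup();
+            inited = true;
+        }
+
+        /* convert free-running counter ticks to microseconds */
+        return readl(MY_TIMER_COUNT_REG) / MY_TIMER_TICKS_PER_US;
+    }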
+
+
 Collecting Trace Data
 ---------------------
 
@@ -155,21 +273,22 @@
 limit of the trace buffer size you have specified. Once that is exhausted
 no more data will be collected.
 
-Collecting trace data has an affect on execution time/performance. You
+Collecting trace data affects execution time and performance. You
 will notice this particularly with trivial functions - the overhead of
 recording their execution may even exceed their normal execution time.
 In practice this doesn't matter much so long as you are aware of the
 effect. Once you have done your optimizations, turn off tracing before
-doing end-to-end timing.
+doing end-to-end timing using bootstage.
 
 The best time to start tracing is right at the beginning of U-Boot. The
 best time to stop tracing is right at the end. In practice it is hard
 to achieve these ideals.
 
-This implementation enables tracing early in board_init_f(). This means
+This implementation enables tracing early in `board_init_r()`, or
+`board_init_f()` when `TRACE_EARLY` is enabled. This means
 that it captures most of the board init process, missing only the
 early architecture-specific init. However, it also misses the entire
-SPL stage if there is one.
+SPL stage if there is one. At present tracing is not supported in SPL.
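+
+To capture the pre-relocation phase as well, enable early tracing in your
+defconfig using the early-trace options described in 'CONFIG Options' above
+(the size and address here are purely illustrative and must suit your board's
+memory layout):
+
+.. code-block:: c
+
+    CONFIG_TRACE_EARLY=y
+    CONFIG_TRACE_EARLY_SIZE=0x00100000
+    CONFIG_TRACE_EARLY_ADDR=0x00100000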
 
 U-Boot typically ends with a 'bootm' command which loads and runs an
 OS. There is useful trace data in the execution of that bootm
@@ -179,39 +298,11 @@
 variable at this point. This variable should have a short script which
 collects the trace data and writes it somewhere.
 
-Trace data collection relies on a microsecond timer, accessed through
-timer_get_us(). So the first think you should do is make sure that
-this produces sensible results for your board. Suitable sources for
-this timer include high resolution timers, PWMs or profile timers if
-available. Most modern SOCs have a suitable timer for this. Make sure
-that you mark this timer (and anything it calls) with
-notrace so that the trace library can
-use it without causing an infinite loop.
-
-
-Commands
---------
-
-The trace command has variable sub-commands:
-
-stats
-    Display tracing statistics
-
-pause
-    Pause tracing
-
-resume
-    Resume tracing
-
-funclist [<addr> <size>]
-    Dump a list of functions into the buffer
-
-calls  [<addr> <size>]
-    Dump function call trace into buffer
+Controlling the trace
+---------------------
 
-If the address and size are not given, these are obtained from environment
-variables (see below). In any case the environment variables are updated
-after the command runs.
+U-Boot provides a command-line interface to the trace system for controlling
+tracing and accessing the trace data. See :doc:`../usage/cmd/trace`.
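+
+For example, collection can be paused around commands which are not of
+interest, then checked with 'trace stats'::
+
+    =>trace pause
+    ... commands which should not be traced ...
+    =>trace resume
+    =>trace stats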
 
 
 Environment Variables
@@ -264,39 +355,94 @@
 TFTP. After this, U-Boot will boot the OS normally, albeit a little
 later.
 
+For a filesystem you may do something like::
+
+    trace calls 10000000 1000000;
+    save mmc 1:1 10000000 /trace ${profoffset}
+
+The trace buffer format is internal to the trace system. It consists of a
+header, a call count for each function site, followed by a list of trace
+records, one for each function call.
+
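+As a purely conceptual sketch (the real structures are defined in `trace.h`
+and the names and fields here are only indicative), the data is laid out
+along these lines:
+
+.. code-block:: c
+
+    /* conceptual only - refer to trace.h for the real definitions */
+    struct trace_output_hdr {
+        enum trace_chunk_type type;     /* function counts or call records */
+        size_t rec_count;               /* number of records which follow */
+    };
+
+    struct trace_call {                 /* one record per function call */
+        u32 func;                       /* function being entered/exited */
+        u32 caller;                     /* address of the caller */
+        u32 flags;                      /* entry/exit flag, etc. */
+    };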
 
-Converting Trace Output Data
-----------------------------
+Converting Trace Output Data (proftool)
+---------------------------------------
 
 The trace output data is kept in a binary format which is not documented
-here. To convert it into something useful, you can use proftool.
+here. See the `trace.h` header file if you are interested. To convert it into
+something useful, you can use proftool.
 
 This tool must be given the U-Boot map file and the trace data received
-from running that U-Boot. It produces a text output file.
+from running that U-Boot. It produces a binary output file.
 
-Options
+It is also possible to provide a configuration file to indicate which functions
+should be included or dropped during conversion. This file consists of lines
+like::
+
+   include-func <regex>
+   exclude-func <regex>
+
+where <regex> is a regular expression matched against function names. It
+allows some functions to be dropped from the trace when producing ftrace
+records.
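+
+For example, a configuration file which drops all EFI-related functions from
+the output might contain just this line (the expression is purely
+illustrative)::
+
+   exclude-func efi_.*
+
+The file is then passed to proftool with the `-c` option described below.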
+
+Options:
+
+-c <config_file>
+    Specify the optional configuration file, to control which functions are
+    included in the output.
+
+-f <format>
+    Specify the format to use (see below)
 
 -m <map_file>
-    Specify U-Boot map file
+    Specify U-Boot map file (`System.map`)
+
+-o <output file>
+    Specify the output filename
 
--p <trace_file>
-    Specify profile/trace file
+-t <trace_file>
+    Specify trace file, the data saved from U-Boot
+
+-v <0-4>
+    Specify the verbosity, where 0 is the minimum and 4 is for debugging.
 
 Commands:
 
+dump-ftrace:
+    Write a binary dump of the file in Linux ftrace format. Two options are
+    available:
+
+    function
+        write function-call records (caller/callee)
+
-dump-ftrace
-    Write a text dump of the file in Linux ftrace format to stdout
+    funcgraph
+        write function entry/exit records (graph)
 
+    This format can be used with kernelshark_ and trace_cmd_.
+
+dump-flamegraph
+    Write a list of stack records useful for producing a flame graph. Two
+    options are available:
+
+    calls
+        create a flamegraph of stack frames
+
+    timing
+        create a flamegraph of microseconds for each stack frame
+
+    This format can be used with flamegraph_pl_.
 
 Viewing the Trace Data
 ----------------------
 
-You can use pytimechart for this (sudo apt-get pytimechart might work on
-your Debian-style machine, and use your favourite search engine to obtain
-documentation). It expects the file to have a .txt extension. The program
-has terse user interface but is very convenient for viewing U-Boot
-profile information.
+You can use kernelshark_ for a GUI, but note that version 2.0.x was broken. If
+you have that version you could try building it from source.
 
+The file must have a .dat extension or it is ignored. The program has a terse
+user interface but is very convenient for viewing U-Boot profile information.
+
+Also available is trace_cmd_ which provides a command-line interface.
 
 Workflow Suggestions
 --------------------
@@ -329,7 +475,9 @@
 There is a function call depth limit (set to 15 by default). When the
 stack depth goes above this then no tracing information is recorded.
 The maximum depth reached is recorded and displayed by the 'trace stats'
-command.
+command. While it might be tempting to set the depth limit quite high, this
+can dramatically increase the size of the trace output as well as the execution
+time.
 
 
 Future Work
@@ -346,5 +494,10 @@
 - Compression of trace information
 
 
-Simon Glass <sjg@chromium.org>
-April 2013
+.. sectionauthor:: Simon Glass <sjg@chromium.org>
+.. April 2013
+.. Updated January 2023
+
+.. _kernelshark: https://kernelshark.org/
+.. _trace_cmd: https://www.trace-cmd.org/
+.. _flamegraph_pl: https://github.com/brendangregg/FlameGraph/blob/master/flamegraph.pl