Source code

Revision control

Copy as Markdown

Other Tools

Profiler Code Overview
######################
This is an overview of the code that implements the Profiler inside Firefox
with some details around tricky subjects, or pointers to more detailed
documentation and/or source code.
It assumes familiarity with Firefox development, including Mercurial (hg), mach,
moz.build files, Try, Phabricator, etc.
It also assumes knowledge of the user-visible part of the Firefox Profiler, that
is: How to use the Firefox Profiler, and what profiles contain that is shown
when capturing a profile. See the main website https://profiler.firefox.com, and
its `documentation <https://profiler.firefox.com/docs/>`_.
For just an "overview", it may look like a huge amount of information, but the
Profiler code is indeed quite expansive, so it takes a lot of words to explain
even just a high-level view of it! For on-the-spot needs, it should be possible
to search for some terms here and follow the clues. But for long-term
maintainers, it would be worth skimming this whole document to get a grasp of
the domain, and return to get some more detailed information before diving into
the code.
WIP note: This document should be correct at the time it is written, but the
profiler code constantly evolves to respond to bugs or to provide new exciting
features, so this document could become obsolete in parts! It should still be
useful as an overview, but its correctness should be verified by looking at the
actual code. If you notice any significant discrepancy or broken links, please
help by
*****
Terms
*****
This is the common usage for some frequently-used terms, as understood by the
Dev Tools team. But incorrect usage can sometimes happen, context is key!
* **profiler** (a): Generic name for software that enables the profiling of
* **Profiler** (the): All parts of the profiler code inside Firefox.
* **Base Profiler** (the): Parts of the Profiler that live in
mozglue/baseprofiler, and can be used from anywhere, but has limited
functionality.
* **Gecko Profiler** (the): Parts of the Profiler that live in tools/profiler,
and can only be used from other code in the XUL library.
* **Profilers** (the): Both the Base Profiler and the Gecko Profiler.
* **profiling session**: This is the time during which the profiler is running
and collecting data.
* **profile** (a): The output from a profiling session, either as a file, or a
shared viewable profile on https://profiler.firefox.com
* **Profiler back-end** (the): Other name for the Profiler code inside Firefox,
to distinguish it from...
* **Profiler front-end** (the): The website https://profiler.firefox.com that
displays profiles captured by the back-end.
* **Firefox Profiler** (the): The whole suite comprised of the back-end and front-end.
******************
Guiding Principles
******************
When working on the profiler, here are some guiding principles to keep in mind:
* Low profiling overhead in cpu and memory. For the Profiler to provide the best
value, it should stay out of the way and consume as few resources (in time and
memory) as possible, so as not to skew the actual Firefox code too much.
* Common data structures and code should be in the Base Profiler when possible.
WIP note: Deduplication is slowly happening, see
This document focuses on the Profiler back-end, and mainly the Gecko Profiler
(because this is where most of the code lives, the Base Profiler is mostly a
subset, originally just a cut-down version of the Gecko Profiler); so unless
specified, descriptions below are about the Gecko Profiler, but know that
there may be some equivalent code in the Base Profiler as well.
* Use appropriate programming-language features where possible to reduce coding
errors in both our code, and our users' usage of it. In C++, this can be done
by using a specific class/struct types for a given usage, to avoid misuse
(e.g., an generic integer representing a **process** could be incorrectly
given to a function expecting a **thread**; we have specific types for these
instead, more below.)
* Follow the
* Whenever possible, write tests (if not present already) for code you add or
modify -- but this may be too difficult in some case, use good judgement and
at least test manually instead.
******************
Profiler Lifecycle
******************
Here is a high-level view of the Base **or** Gecko Profiler lifecycle, as part
of a Firefox run. The following sections will go into much more details.
* Profiler initialization, preparing some common data.
* Threads de/register themselves as they start and stop.
* During each User/test-controlled profiling session:
* Profiler start, preparing data structures that will store the profiling data.
* Periodic sampling from a separate thread, happening at a user-selected
frequency (usually once every 1-2 ms), and recording snapshots of what
Firefox is doing:
* CPU sampling, measuring how much time each thread has spent actually
running on the CPU.
* Stack sampling, capturing a stack of functions calls from whichever leaf
function the program is in at this point in time, up to the top-most
caller (i.e., at least the ``main()`` function, or its callers if any).
Note that unlike most external profilers, the Firefox Profiler back-end
is capable or getting more useful information than just native functions
calls (compiled from C++ or Rust):
* Labels added by Firefox developers along the stack, usually to identify
regions of code that perform "interesting" operations (like layout, file
I/Os, etc.).
* JavaScript function calls, including the level of optimization applied.
* Java function calls.
* At any time, Markers may record more specific details of what is happening,
e.g.: User operations, page rendering steps, garbage collection, etc.
* Optional profiler pause, which stops most recording, usually near the end of
a session so that no data gets recorded past this point.
* Profile JSON output, generated from all the recorded profiling data.
* Profiler stop, tearing down profiling session objects.
* Profiler shutdown.
Note that the Base Profiler can start earlier, and then the data collected so
far, as well as the responsibility for periodic sampling, is handed over to the
Gecko Profiler:
#. (Firefox starts)
#. Base Profiler init
#. Base Profiler start
#. (Firefox loads the libxul library and initializes XPCOM)
#. Gecko Profiler init
#. Gecko Profiler start
#. Handover from Base to Gecko
#. Base Profiler stop
#. (Bulk of the profiling session)
#. JSON generation
#. Gecko Profiler stop
#. Gecko Profiler shutdown
#. (Firefox ends XPCOM)
#. Base Profiler shutdown
#. (Firefox exits)
Base Profiler functions that add data (mostly markers and labels) may be called
from anywhere, and will be recorded by either Profiler. The corresponding
functions in Gecko Profiler can only be called from other libxul code, and can
only be recorded by the Gecko Profiler.
Whenever possible, Gecko Profiler functions should be preferred if accessible,
as they may provide extended functionality (e.g., better stacks with JS in
markers). Otherwise fallback on Base Profiler functions.
***********
Directories
***********
* Non-Profiler supporting code
replacements for C++ std library facilities.
Mutex base classes.
Stack-walking functions.
Timestamps and time durations.
Data structures like arrays, strings.
Threading functions.
* Profiler back-end
Base Profiler code, usable from anywhere in Firefox. Because it lives in
mozglue, it's loaded right at the beginning, so it's possible to start the
profiler very early, even before Firefox loads its big&heavy "xul" library.
Public headers, may be #included from anywhere.
Main implementation code.
Special stack-walking code for Linux.
Unit tests.
Gecko Profiler code, only usable from the xul library. That library is
loaded a short time after Firefox starts, so the Gecko Profiler is not able
to profile the early phase of the application, Base Profiler handles that,
and can pass its collected data to the Gecko Profiler when the latter
starts.
Public headers, may be #included from most libxul code.
Main implementation code.
Control from JS, and multi-process/IPC code.
Special stack-walking code for Linux.
Middleware code for about:profiling and devtools panel functionality.
* js, starting with
JavaScript engine support, mostly to capture JS stacks.
* `toolkit/components/extensions/schemas/geckoProfiler.json <https://searchfox.org/mozilla-central/source/toolkit/components/extensions/schemas/geckoProfiler.json>`_ -
File that needs to be updated when Profiler features change.
* Profiler front-end
* Out of scope for this document, but its code and bug repository can be found at:
https://github.com/firefox-devtools/profiler . Sometimes work needs to be
done on both the back-end of the front-end, especially when modifying the
back-end's JSON output format.
*******
Headers
*******
The most central public header is
from which almost everything else can be found, it can be a good starting point
for exploration.
It includes other headers, which together contain important top-level macros and
functions.
WIP note: GeckoProfiler.h used to be the header that contained everything!
To better separate areas of functionality, and to hopefully reduce compilation
times, parts of it have been split into smaller headers, and this work will
MOZ_GECKO_PROFILER and Macros
=============================
Mozilla officially supports the Profiler on `tier-1 platforms
Windows, macos, Linux and Android.
There is also some code running on tier 2-3 platforms (e.g., for FreeBSD), but
the team at Mozilla is not obligated to maintain it; we do try to keep it
running, and some external contributors are keeping an eye on it and provide
patches when things do break.
To reduce the burden on unsupported platforms, a lot of the Profilers code is
only compiled when ``MOZ_GECKO_PROFILER`` is #defined. This means that some
public functions may not always be declared or implemented, and should be
surrounded by guards like ``#ifdef MOZ_GECKO_PROFILER``.
Some commonly-used functions offer an empty definition in the
non-``MOZ_GECKO_PROFILER`` case, so these functions may be called from anywhere
without guard.
Other functions have associated macros that can always be used, and resolve to
nothing on unsupported platforms. E.g.,
``PROFILER_REGISTER_THREAD`` calls ``profiler_register_thread`` where supported,
otherwise does nothing.
WIP note: There is an effort to eventually get rid of ``MOZ_GECKO_PROFILER`` and
its associated macros, see
RAII "Auto" macros and classes
==============================
A number of functions are intended to be called in pairs, usually to start and
then end some operation. To ease their use, and ensure that both functions are
always called together, they usually have an associated class and/or macro that
may be called only once. This pattern of using an object's destructor to ensure
that some action always eventually happens, is called
common prefix "auto".
E.g.: In ``MOZ_GECKO_PROFILER`` builds,
instantiates an
object, which calls ``profiler_init`` when constructed, and
``profiler_shutdown`` when destroyed.
*********************
Platform Abstractions
*********************
This section describes some platform abstractions that are used throughout the
Profilers. (Other platform abstractions will be described where they are used.)
Process and Thread IDs
======================
The Profiler back-end often uses process and thread IDs (aka "pid" and "tid"),
which are commonly just a number.
For better code correctness, and to hide specific platform details, they are
encapsulated in opaque types
and
These types should be used wherever possible.
When interfacing with other code, they may be converted using the member
functions ``FromNumber`` and ``ToNumber``.
To find the current process or thread ID, use
or
The main thread ID is available through
(assuming
was called when the application started -- especially important in stand-alone
test programs.)
And
is a quick way to find out if the current thread is the main thread.
Locking
=======
The locking primitives in PlatformMutex.h are not supposed to be used as-is, but
through a user-accessible implementation. For the Profilers, this is in
In addition to the usual ``Lock``, ``TryLock``, and ``Unlock`` functions,
objects have a name (which may be helpful when debugging),
they record the thread on which they are locked (making it possible to know if
the mutex is locked on the current thread), and in ``DEBUG`` builds there are
assertions verifying that the mutex is not incorrectly used recursively, to
verify the correct ordering of different Profiler mutexes, and that it is
unlocked before destruction.
Mutexes should preferably be locked within C++ block scopes, or as class
members, by using
Some classes give the option to use a mutex or not (so that single-threaded code
can more efficiently bypass locking operations), for these we have
and
There is also a special type of shared lock (aka RWLock, see
which may be locked in multiple threads (through ``LockShared`` or preferably
or locked exclusively, preventing any other locking (through ``LockExclusive`` or preferably
*********************
Main Profiler Classes
*********************
Diagram showing the most important Profiler classes, see details in the
following sections:
(As noted, the "RegisteredThread" classes are now obsolete in the Gecko
Profiler, see the "Thread Registration" section below for an updated diagram and
description.)
.. image:: profilerclasses-20220913.png
***********************
Profiler Initialization
***********************
and
must be called from the main thread, and are used to prepare important aspects
of the profiler, including:
* Making sure the main thread ID is recorded.
* Handling ``MOZ_PROFILER_HELP=1 ./mach run`` to display the command-line help.
* Creating the ``CorePS`` instance -- more details below.
* Registering the main thread.
* Initializing some platform-specific code.
* Handling other environment variables that are used to immediately start the
profiler, with optional settings provided in other env-vars.
CorePS
======
has a single instance that should live for the duration of the Firefox
application, and contains important information that could be needed even when
the Profiler is not running.
It includes:
* A static pointer to its single instance.
* The process start time.
* JavaScript-specific data structures.
* A list of registered
used to keep track of the tabs that this process handles.
* A list of
used to record things like the process memory usage.
* The process name, and optionally the "eTLD+1" (roughly sub-domain) that this
process handles.
* In the Base Profiler only, a list of
WIP note: This storage has been reworked in the Gecko Profiler (more below),
and in practice the Base Profiler only registers the main thread. This should
eventually disappear as part of the de-duplication work
*******************
Thread Registration
*******************
Threads need to register themselves in order to get fully profiled.
This section describes the main data structures that record the list of
registered threads and their data.
WIP note: There is some work happening to add limited profiling of unregistered
threads, with the hope that more and more functionality could be added to
eventually use the same registration data structures.
Diagram showing the relevant classes, see details in the following sub-sections:
.. image:: profilerthreadregistration-20220913.png
ProfilerThreadRegistry
======================
The
contains a list of ``OffThreadRef`` objects.
Each ``OffThreadRef`` points to a ``ProfilerThreadRegistration``, and restricts
access to a safe subset of the thread data, and forces a mutex lock if necessary
(more information under ProfilerThreadRegistrationData below).
ProfilerThreadRegistration
==========================
A
contains a lot of information relevant to its thread, to help with profiling it.
This data is accessible from the thread itself through an ``OnThreadRef``
object, which points to the ``ThreadRegistration``, and restricts access to a
safe subset of thread data, and forces a mutex lock if necessary (more
information under ProfilerThreadRegistrationData below).
ThreadRegistrationData and accessors
====================================
contains a hierarchy of classes that encapsulate all the thread-related data.
``ThreadRegistrationData`` contains all the actual data members, including:
* Some long-lived
containing the thread name, its registration time, the thread ID, and whether
it's the main thread.
* A ``ProfilingStack`` that gathers developer-provided pseudo-frames, and JS
frames.
* Some platform-specific ``PlatformData`` (usually required to actually record
profiling measurements for that thread).
* A pointer to the top of the stack.
* A shared pointer to the thread's ``nsIThread``.
* A pointer to the ``JSContext``.
* An optional pre-allocated ``JsFrame`` buffer used during stack-sampling.
* Some JS flags.
* Sleep-related data (to avoid costly sampling while the thread is known to not
be doing anything).
* The current ``ThreadProfilingFeatures``, to know what kind of data to record.
* When profiling, a pointer to a ``ProfiledThreadData``, which contains some
more data needed during and just after profiling.
As described in their respective code comments, each data member is supposed to
be accessed in certain ways, e.g., the ``JSContext`` should only be "written
from thread, read from thread and suspended thread". To enforce these rules,
data members can only be accessed through certain classes, which themselves can
only be instantiated in the correct conditions.
The accessor classes are, from base to most-derived:
* ``ThreadRegistrationData``, not an accessor itself, but it's the base class
with all the ``protected`` data.
* ``ThreadRegistrationUnlockedConstReader``, giving unlocked ``const`` access to
the ``ThreadRegistrationInfo``, ``PlatformData``, and stack top.
* ``ThreadRegistrationUnlockedConstReaderAndAtomicRW``, giving unlocked
access to the atomic data members: ``ProfilingStack``, sleep-related data,
``ThreadProfilingFeatures``.
* ``ThreadRegistrationUnlockedRWForLockedProfiler``, giving access that's
protected by the Profiler's main lock, but doesn't require a
``ThreadRegistration`` lock, to the ``ProfiledThreadData``
* ``ThreadRegistrationUnlockedReaderAndAtomicRWOnThread``, giving unlocked
mutable access, but only on the thread itself, to the ``JSContext``.
* ``ThreadRegistrationLockedRWFromAnyThread``, giving locked access from any
thread to mutex-protected data: ``ThreadProfilingFeatures``, ``JsFrame``,
``nsIThread``, and the JS flags.
* ``ThreadRegistrationLockedRWOnThread``, giving locked access, but only from
the thread itself, to the ``JSContext`` and a JS flag-related operation.
* ``ThreadRegistration::EmbeddedData``, containing all of the above, and stored
as a data member in each ``ThreadRegistration``.
To recapitulate, if some code needs some data on the thread, it can use
``ThreadRegistration`` functions to request access (with the required rights,
like a mutex lock).
To access data about another thread, use similar functions from
``ThreadRegistry`` instead.
You may find some examples in the implementations of the functions in
ProfilerThreadState.h (see the following section).
ProfilerThreadState.h functions
===============================
The
header provides a few helpful functions related to threads, including:
* ``profiler_is_active_and_thread_is_registered``
* ``profiler_thread_is_being_profiled`` (for the current thread or another
thread, and for a given set of features)
* ``profiler_thread_is_sleeping``
**************
Profiler Start
**************
There are multiple ways to start the profiler, through command line env-vars,
and programmatically in C++ and JS.
The main public C++ function is
It takes all the features specifications, and returns a promise that gets
resolved when the Profiler has fully started in all processes (multi-process
profiling is described later in this document, for now the focus will be on each
process running its instance of the Profiler). It first calls ``profiler_init``
if needed, and also ``profiler_stop`` if the profiler was already running.
The main implementation, which can be called from multiple sources, is
It performs a number of operations to start the profiling session, including:
* Record the session start time.
* Pre-allocate some work buffer to capture stacks for markers on the main thread.
* In the Gecko Profiler only: If the Base Profiler was running, take ownership
of the data collected so far, and stop the Base Profiler (we don't want both
trying to collect the same data at the same time!)
* Create the ActivePS, which keeps track of most of the profiling session
information, more about it below.
* For each registered thread found in the ``ThreadRegistry``, check if it's one
of the threads to profile, and if yes set the appropriate data into the
corresponding ``ThreadRegistrationData`` (including informing the JS engine to
start recording profiling data).
* On Android, start the Java sampler.
* If native allocations are to be profiled, setup the appropriate hooks.
* Start the audio callback tracing if requested.
* Set the public shared "active" state, used by many functions to quickly assess
whether to actually record profiling data.
ActivePS
========
has a single instance at a time, that should live for the length of the
profiling session.
It includes:
* The session start time.
* A way to track "generations" (in case an old ActivePS still lives when the
next one starts, so that in-flight data goes to the correct place.)
* Requested features: Buffer capacity, periodic sampling interval, feature set,
list of threads to profile, optional: specific tab to profile.
* The profile data storage buffer and its chunk manager (see "Storage" section
below for details.)
* More data about live and dead profiled threads.
* Optional counters for per-process CPU usage, and power usage.
* A pointer to the ``SamplerThread`` object (see "Periodic Sampling" section
below for details.)
*******
Storage
*******
During a session, the profiling data is serialized into a buffer, which is made
of "chunks", each of which contains "blocks", which have a size and the "entry"
data.
During a profiling session, there is one main profile buffer, which may be
started by the Base Profiler, and then handed over to the Gecko Profiler when
the latter starts.
The buffer is divided in chunks of equal size, which are allocated before they
are needed. When the data reaches a user-set limit, the oldest chunk is
recycled. This means that for long-enough profiling sessions, only the most
recent data (that could fit under the limit) is kept.
Each chunk stores a sequence of blocks of variable length. The chunk itself
only knows where the first full block starts, and where the last block ends,
which is where the next block will be reserved.
To add an entry to the buffer, a block is reserved, the size is written first
(so that readers can find the start of the next block), and then the entry bytes
are written.
The following sessions give more technical details.
leb128iterator.h
================
contains some functions to read and write unsigned "LEB128" numbers
(`LEB128 on wikipedia <https://en.wikipedia.org/wiki/LEB128>`_).
They are an efficient way to serialize numbers that are usually small, e.g.,
numbers up to 127 only take one byte, two bytes up to 16,383, etc.
ProfileBufferBlockIndex
=======================
encapsulates a block index that is known to be the valid start of a block. It is
created when a block is reserved, or when trusted code computes the start of a
block in a chunk.
The more generic
type is used when working inside blocks.
ProfileBufferChunk
==================
is a variable-sized object. It contains:
* A public copyable header, itself containing:
* The local offset to the first full block (a chunk may start with the end of
a block that was started at the end of the previous chunk). That offset in
the very first chunk is the natural start to read all the data in the
buffer.
* The local offset past the last reserved block. This is where the next block
should be reserved, unless it points past the end of this chunk size.
* The timestamp when the chunk was first used.
* The timestamp when the chunk became full.
* The number of bytes that may be stored in this chunk.
* The number of reserved blocks.
* The global index where this chunk starts.
* The process ID writing into this chunk.
* An owning unique pointer to the next chunk. It may be null for the last chunk
in a chain.
* In ``DEBUG`` builds, a state variable, which is used to ensure that the chunk
goes through a known sequence of states (e.g., Created, then InUse, then
Done, etc.) See the sequence diagram
* The actual buffer data.
Because a ProfileBufferChunk is variable-size, it must be created through its
static ``Create`` function, which takes care of allocating the correct amount
of bytes, at the correct alignment.
Chunk Managers
==============
ProfilerBufferChunkManager
--------------------------
defines the interface of classes that manage chunks.
Concrete implementations are responsible for:
* Creating chunks for their user, with a mechanism to pre-allocate chunks before they are actually needed.
* Taking back and owning chunks when they are "released" (usually when full).
* Automatically destroying or recycling the oldest released chunks.
* Giving temporary access to extant released chunks.
ProfileBufferChunkManagerSingle
-------------------------------
manages a single chunk.
That chunk is always the same, it is never destroyed. The user may use it and
optionally release it. The manager can then be reset, and that one chunk will
be available again for use.
A request for a second chunk would always fail.
This manager is short-lived and not thread-safe. It is useful when there is some
limited data that needs to be captured without blocking the global profiling
buffer, usually one stack sample. This data may then be extracted and quickly
added to the global buffer.
ProfileBufferChunkManagerWithLocalLimit
---------------------------------------
implements the ``ProfileBufferChunkManager`` interface fully, managing a number
of chunks, and making sure their total combined size stays under a given limit.
This is the main chunk manager user during a profiling session.
Note: It also implements the ``ProfileBufferControlledChunkManager`` interface,
this is explained in the later section "Multi-Process Profiling".
It is thread-safe, and one instance is shared by both Profilers.
ProfileChunkedBuffer
====================
uses a ``ProfilerBufferChunkManager`` to store data, and handles the different
C++ types of data that the Profilers want to read/write as entries in buffer
chunks.
Its main function is ``ReserveAndPut``:
* It takes an invocable object (like a lambda) that should return the size of
the entry to store, this is to potentially avoid costly operations just to
compute a size, when the profiler may not be running.
* It attempts to reserve the space in its chunks, requesting a new chunk if
necessary.
* It then calls a provided invocable object with a
which offers a range of functions to help serialize C++ objects. The
de/serialization functions are found in specializations of
and
More "put" functions use ``ReserveAndPut`` to more easily serialize blocks of
memory, or C++ objects.
``ProfileChunkedBuffer`` is optionally thread-safe, using a
``BaseProfilerMaybeMutex``.
WIP note: Using a mutex makes this storage too noisy for profiling some
real-time (like audio processing).
at switching to using atomic variables instead.
An alternative would be to use a totally separate non-thread-safe buffers for
each real-time thread that requires it (see
ProfileBuffer
=============
uses a ``ProfileChunkedBuffer`` to store data, and handles the different kinds
of entries that the Profilers want to read/write.
Each entry starts with a tag identifying a kind. These kinds can be found in
There are "legacy" kinds, which are small fixed-length entries, such as:
Categories, labels, frame information, counters, etc. These can be stored in
And there are "modern" kinds, which have variable sizes, such as: Markers, CPU
running times, full stacks, etc. These are more directly handled by code that
can access the underlying ``ProfileChunkedBuffer``.
The other major responsibility of a ``ProfileChunkedBuffer`` is to read back all
this data, sometimes during profiling (e.g., to duplicate a stack), but mainly
at the end of a session when generating the output JSON profile.
*****************
Periodic Sampling
*****************
Probably the most important job of the Profiler is to sample stacks of a number
of running threads, to help developers know which functions get used a lot when
performing some operation on Firefox.
This is accomplished from a special thread, which regularly springs into action
and captures all this data.
SamplerThread
=============
manages the information needed during sampling. It is created when the profiler
starts, and is stored inside the ``ActivePS``, see above for details.
It includes:
* A ``Sampler`` object that contains platform-specific details, which are
implemented in separate files like platform-win32.cpp, etc.
* The same generation index as its owning ``ActivePS``.
* The requested interval between samples.
* A handle to the thread where the sampling happens, its main function is
* A list of callbacks to invoke after the next sampling. These may be used by
tests to wait for sampling to actually happen.
* The unregistered-thread-spy data, and an optional handle on another thread
that takes care of "spying" on unregistered thread (on platforms where that
operation is too expensive to run directly on the sampling thread).
The ``Run()`` function takes care of performing the periodic sampling work:
(more details in the following sections)
* Retrieve the sampling parameters.
* Instantiate a ``ProfileBuffer`` on the stack, to capture samples from other threads.
* Loop until a ``break``:
* Lock the main profiler mutex, and do:
* Check if sampling should stop, and break from the loop.
* Clean-up exit profiles (these are profiles sent from dying sub-processes,
and are kept for as long as they overlap with this process' own buffer range).
* Record the CPU utilization of the whole process.
* Record the power consumption.
* Sample each registered counter, including the memory counter.
* For each registered thread to be profiled:
* Record the CPU utilization.
* If the thread is marked as "still sleeping", record a "same as before"
sample, otherwise suspend the thread and take a full stack sample.
* On some threads, record the event delay to compute the
(un)responsiveness. WIP note: This implementation may change.
* Record profiling overhead durations.
* Unlock the main profiler mutex.
* Invoke registered post-sampling callbacks.
* Spy on unregistered threads.
* Based on the requested sampling interval, and how much time this loop took,
compute when the next sampling loop should start, and make the thread sleep
for the appropriate amount of time. The goal is to be as regular as
possible, but if some/all loops take too much time, don't try too hard to
catch up, because the system is probably under stress already.
* Go back to the top of the loop.
* If we're here, we hit a loop ``break`` above.
* Invoke registered post-sampling callbacks, to let them know that sampling
stopped.
CPU Utilization
===============
CPU Utilization is stored as a number of milliseconds that a thread or process
has spent running on the CPU since the previous sampling.
Implementations are platform-dependent, and can be found in
and
Power Consumption
=================
Energy probes added in 2022.
Stacks
======
Stacks are the sequence of calls going from the entry point in the program
(generally ``main()`` and some OS-specific functions above), down to the
function where code is currently being executed.
Native Frames
-------------
Compiled code, from C++ and Rust source.
Label Frames
------------
Pseudo-frames with arbitrary text, added from any language, mostly C++.
JS, Wasm Frames
---------------
Frames corresponding to JavaScript functions.
Java Frames
-----------
Recorded by the JavaSampler.
Stack Merging
-------------
The above types of frames are all captured in different ways, and when finally
taking an actual stack sample (apart from Java), they get merged into one stack.
All frames have an associated address in the call stack, and can therefore be
merged mostly by ordering them by this stack address. See
for the implementation details.
Counters
========
Counters are a special kind of probe, which can be continuously updated during
profiling, and the ``SamplerThread`` will sample their value at every loop.
Memory Counter
--------------
This is the main counter. During a profiling session, hooks into the memory
manager keep track of each de/allocation, so at each sampling we know how many
operations were performed, and what is the current memory usage compared to the
previous sampling.
Profiling Overhead
==================
The ``SamplerThread`` records timestamps between parts of its sampling loop, and
records this as the sampling overhead. This may be useful to determine if the
profiler itself may have used too much of the computer resources, which could
skew the profile and give wrong impressions.
Unregistered Thread Profiling
=============================
At some intervals (not necessarily every sampling loop, depending on the OS),
the profiler may attempt to find unregistered threads, and record some
information about them.
WIP note: This feature is experimental, and data is captured in markers on the
main thread. More work is needed to put this data in tracks like regular
registered threads, and capture more data like stack samples and markers.
*******
Markers
*******
Markers are events with a precise timestamp or time range, they have a name, a
category, options (out of a few choices), and optional marker-type-specific
payload data.
Before describing the implementation, it is useful to be familiar with how
markers are natively added from C++, because this drives how the implementation
takes all this information and eventually outputs it in the final JSON profile.
Adding Markers from C++
=======================
Implementation
==============
The main function that records markers is
It's a variadic templated function that takes the different the expected
arguments, first checks if the marker should actually be recorded (the profiler
should be running, and the target thread should be profiled), and then calls
into the deeper implementation function ``AddMarkerToBuffer`` with a reference
to the main profiler buffer.
takes the marker type as an object, removes it from the function parameter list,
and calls the next function with the marker type as an explicit template
parameter, and also a pointer to the function that can capture the stack
(because it is different between Base and Gecko Profilers, in particular the
latter one knows about JS).
From here, we enter the land of
which employs some heavy template techniques, in order to most efficiently
serialize the given marker payload arguments, in order to make them
deserializable when outputting the final JSON. In previous implementations, for
each new marker type, a new C++ class derived from a payload abstract class was
required, that had to implement all the constructors and virtual functions to:
* Create the payload object.
* Serialize the payload into the profile buffer.
* Deserialize from the profile buffer to a new payload object.
* Convert the payload into the final output JSON.
Now, the templated functions automatically take care of serializing all given
function call arguments directly (instead of storing them somewhere first), and
preparing a deserialization function that will recreate them on the stack and
directly call the user-provided JSONification function with these arguments.
Continuing from the public ``AddMarkerToBuffer``,
sets some defaults if not specified by the caller: Target the current thread,
use the current time.
Then if a stack capture was requested, attempt to do it in
the most efficient way, using a pre-allocated buffer if possible.
WIP note: This potential allocation should be avoided in time-critical thread.
There is already a buffer for the main thread (because it's the busiest thread),
but there could be more pre-allocated threads, for specific real-time thread
that need it, or picked from a pool of pre-allocated buffers. See
From there, `AddMarkerWithOptionalStackToBuffer <https://searchfox.org/mozilla-central/search?q=AddMarkerWithOptionalStackToBuffer>`_
handles ``NoPayload`` markers (usually added with ``PROFILER_MARKER_UNTYPED``)
in a special way, mostly to avoid the extra work associated with handling
payloads. Otherwise it continues with the following function.
`MarkerTypeSerialization<MarkerType>::Serialize <symbol:_ZN7mozilla28base_profiler_markers_detail23MarkerTypeSerialization9SerializeERNS_20ProfileChunkedBufferERKNS_18ProfilerStringViewIcEERKNS_14MarkerCategoryEONS_13MarkerOptionsEDpRKTL0__>`_
retrieves the deserialization tag associated with the marker type. If it's the
first time this marker type is used,
`Streaming::TagForMarkerTypeFunctions <symbol:_ZN7mozilla28base_profiler_markers_detail9Streaming25TagForMarkerTypeFunctionsEPFvRNS_24ProfileBufferEntryReaderERNS_12baseprofiler20SpliceableJSONWriterEEPFNS_4SpanIKcLy18446744073709551615EEEvEPFNS_12MarkerSchemaEvE,_ZN7mozilla28base_profiler_markers_detail9Streaming25TagForMarkerTypeFunctionsEPFvRNS_24ProfileBufferEntryReaderERNS_12baseprofiler20SpliceableJSONWriterEEPFNS_4SpanIKcLm18446744073709551615EEEvEPFNS_12MarkerSchemaEvE,_ZN7mozilla28base_profiler_markers_detail9Streaming25TagForMarkerTypeFunctionsEPFvRNS_24ProfileBufferEntryReaderERNS_12baseprofiler20SpliceableJSONWriterEEPFNS_4SpanIKcLj4294967295EEEvEPFNS_12MarkerSchemaEvE>`_
adds it to the global list (which stores some function pointers used during
deserialization).
Then the main serialization happens in
`StreamFunctionTypeHelper<decltype(MarkerType::StreamJSONMarkerData)>::Serialize <symbol:_ZN7mozilla28base_profiler_markers_detail24StreamFunctionTypeHelperIFT_RNS_12baseprofiler20SpliceableJSONWriterEDpT0_EE9SerializeERNS_20ProfileChunkedBufferERKNS_18ProfilerStringViewIcEERKNS_14MarkerCategoryEONS_13MarkerOptionsEhDpRKS6_>`_.
Deconstructing this mouthful of an template:
* ``MarkerType::StreamJSONMarkerData`` is the user-provided function that will
eventually produce the final JSON, but here it's only used to know the
parameter types that it expects.
* ``StreamFunctionTypeHelper`` takes that function prototype, and can extract
its argument by specializing on ```R(SpliceableJSONWriter&, As...)``, now
``As...`` is a parameter pack matching the function parameters.
* Note that ``Serialize`` also takes a parameter pack, which contains all the
referenced arguments given to the top ``AddBufferToMarker`` call. These two
packs are supposed to match, at least the given arguments should be
convertible to the target pack parameter types.
* That specialization's ``Serialize`` function calls the buffer's ``PutObjects``
variadic function to write all the marker data, that is:
* The entry kind that must be at the beginning of every buffer entry, in this
* The common marker data (options first, name, category, deserialization tag).
* Then all the marker-type-specific arguments. Note that the C++ types
are those extracted from the deserialization function, so we know that
whatever is serialized here can be later deserialized using those same
types.
The deserialization side is described in the later section "JSON output of
Markers".
Adding Markers from Rust
========================
Adding Markers from JS
======================
Adding Markers from Java
========================
*************
Profiling Log
*************
During a profiling session, some profiler-related events may be recorded using
The resulting JSON object is added near the end of the process' JSON generation,
in a top-level property named "profilingLog". This object is free-form, and is
not intended to be displayed, or even read by most people. But it may include
interesting information for advanced users, or could be an early temporary
prototyping ground for new features.
See "profileGatheringLog" for another log related to late events.
WIP note: This was introduced shortly before this documentation, so at this time
it doesn't do much at all.
***************
Profile Capture
***************
Usually at the end of a profiling session, a profile is "captured", and either
saved to disk, or sent to the front-end https://profiler.firefox.com for
analysis. This section describes how the captured data is converted to the
Gecko Profiler JSON format.
FailureLatch
============
is used during the JSON generation, in order to catch any unrecoverable error
(such as running Out Of Memory), to exit the process early, and to forward the
error to callers.
There are two main implementations, suffixed "source" as they are the one source
of failure-handling, which is passed as ``FailureLatch&`` throughout the code:
is an "infallible" latch, meaning that it doesn't expect any failure. So if
a failure actually happened, the program would immediately terminate! (This
was the default behavior prior to introducing these latches.)
is a "fallible" latch, it will record the first failure that happens, and
"latch" into the failure state. The code should regularly examine this state,
and return early when possible. Eventually this failure state may be exposed
to end users.
ProgressLogger, ProportionValue
===============================
is used to track the progress of a long operation, in this case the JSON
generation process.
To match how the JSON generation code works (as a tree of C++ functions calls),
each ``ProgressLogger`` in a function usually records progress from 0 to 100%
locally inside that function. If that function calls a sub-function, it gives it
a sub-logger, which in the caller function is set to represent a local sub-range
(like 20% to 40%), but to the called function it will look like its own local
``ProgressLogger`` that goes from 0 to 100%. The very top ``ProgressLogger``
converts the deepest local progress value to the corresponding global progress.
Progress values are recorded in
which effectively record fractional value with no loss of precision.
This progress is most useful when the parent process is waiting for child
processes to do their work, to make sure progress does happen, otherwise to stop
waiting for frozen processes. More about that in the "Multi-Process Profiling"
section below.
JSONWriter
==========
offers a simple way to create a JSON stream (start/end collections, add
elements, etc.), and calls back into a provided
to output characters.
While these classes live outside of the Profiler directories, it may sometimes be
worth maintaining and/or modifying them to better serve the Profiler's needs.
But there are other users, so be careful not to break other things!
SpliceableJSONWriter and SpliceableChunkedJSONWriter
====================================================
Because the Profiler deals with large amounts of data (big profiles can take
tens to hundreds of megabytes!), some specialized wrappers add better handling
of these large JSON streams.
is a subclass of ``JSONWriter``, and allows the "splicing" of JSON strings,
i.e., being able to take a whole well-formed JSON string, and directly inserting
it as a JSON object in the target JSON being streamed.
It also offers some functions that are often useful for the Profiler, such as:
* Converting a timestamp into a JSON object in the stream, taking care of keeping a nanosecond precision, without unwanted zeroes or nines at the end.
* Adding a number of null elements.
* Adding a unique string index, and add that string to a provided unique-string list if necessary. (More about UniqueStrings below.)
is a subclass of ``SpliceableJSONWriter``. Its main attribute is that it provides its own writer
which stores the stream as a sequence of "chunks" (heap-allocated buffers).
It starts with a chunk of a default size, and writes incoming data into it,
later allocating more chunks as needed. This avoids having massive buffers being
resized all the time.
It also offers the same splicing abilities as its parent class, but in case an
incoming JSON string comes from another ``SpliceableChunkedJSONWriter``, it's
able to just steal the chunks and add them to its list, thereby avoiding
expensive allocations and copies and destructions.
UniqueStrings
=============
Because a lot of strings would be repeated in profiles (e.g., frequent marker
names), such strings are stored in a separate JSON array of strings, and an
index into this list is used instead of that full string object.
Note that these unique-string indices are currently only located in specific
spots in the JSON tree, they cannot be used just anywhere strings are accepted.
stores this list of unique strings in a ``SpliceableChunkedJSONWriter``.
Given a string, it takes care of storing it if encountered for the first time,
and inserts the index into a target ``SpliceableJSONWriter``.
JSON Generation
===============
The "Gecko Profile Format" can be found at
The implementation in the back-end is
It outputs each JSON top-level JSON object, mostly in sequence. See the code for
how each object is output. Note that there is special handling for samples and
markers, as explained in the following section.
ProcessStreamingContext and ThreadStreamingContext
--------------------------------------------------
In JSON profiles, samples and markers are separated by thread and by
samples/markers. Because there are potentially tens to a hundred threads, it
would be very costly to read the full profile buffer once for each of these
groups. So instead the buffer is read once, and all samples and markers are
handled as they are read, and their JSON output is sent to separate JSON
writers.
contains all the information to facilitate this output, including a list of
which each contain one ``SpliceableChunkedJSONWriter`` for the samples, and one
for the markers in this thread.
When reading entries from the profile buffer, samples and markers are found by
their ``ProfileBufferEntryKind``, and as part of deserializing either kind (more
about each below), the thread ID is read, and determines which
``ThreadStreamingContext`` will receive the JSON output.
At the end of this process, all ``SpliceableChunkedJSONWriters`` are efficiently
spliced (mainly a pointer move) into the final JSON output.
JSON output of Samples
----------------------
This work is done in
`ProfileBuffer::DoStreamSamplesAndMarkersToJSON <https://searchfox.org/mozilla-central/search?q=DoStreamSamplesAndMarkersToJSON>`_.
From the main ``ProfileChunkedBuffer``, each entry is visited, its
``ProfileBufferEntryKind`` is read first, and for samples all frames from
captured stack are converted to the appropriate JSON.
is used to de-duplicate frames and even sub-stacks:
* Each unique frame string is written into a JSON array inside a
``SpliceableChunkedJSONWriter``, and its index is the frame identifier.
* Each stack level is also de-duplicated, and identifies the associated frame
string, and points at the calling stack level (i.e., closer to the root).
* Finally, the identifier for the top of the stack is stored, along with a
timestamp (and potentially some more information) as the sample.
For example, if we have collected the following samples:
#. A -> B -> C
#. A -> B
#. A -> B -> D
The frame table would contain each frame name, something like:
``["A", "B", "C", "D"]``. So the frame containing "A" has index 0, "B" is at 1,
etc.
The stack table would contain each stack level, something like:
``[[0, null], [1, 0], [2, 1], [3, 1]]``. ``[0, null]`` means the frame is 0
("A"), and it has no caller, it's the root frame. ``[1, 0]`` means the frame is
1 ("B"), and its caller is stack 0, which is just the previous one in this
example.
And the three samples stored in the thread data would be therefore be: 2, 1, 3
(E.g.: "2" points in the stack table at the frame [2,1] with "C", and from them
down to "B", then "A").
All this contains all the information needed to reconstruct all full stack
samples.
JSON output of Markers
----------------------
This also happens
`inside ProfileBuffer::DoStreamSamplesAndMarkersToJSON <https://searchfox.org/mozilla-central/search?q=DoStreamSamplesAndMarkersToJSON>`_.
When a ``ProfileBufferEntryKind::Marker`` is encountered,
reads the ``MarkerOptions`` (stored as explained above), which include the
thread ID, identifying which ``ThreadStreamingContext``'s
``SpliceableChunkedJSONWriter`` to use.
After that, the common marker data (timing, category, etc.) is output.
Then the ``Streaming::DeserializerTag`` identifies which type of marker this is.
The special case of ``0`` (no payload) means nothing more is output.
Otherwise some more common data is output as part of the payload if present, in
particular the "inner window id" (used to match markers with specific html
frames), and stack.
WIP note: Some of these may move around in the future, see
others.
In case of a C++-written payload, the ``DeserializerTag`` identifies the
``MarkerDataDeserializer`` function to use. This is part of the heavy templated
code in BaseProfilerMarkersDetail.h, the function is defined as
which outputs the marker type name, and then each marker payload argument. The
latter is done by using the user-defined ``MarkerType::StreamJSONMarkerData``
parameter list, and recursively deserializing each parameter from the profile
buffer into an on-stack variable of a corresponding type, at the end of which
``MarkerType::StreamJSONMarkerData`` can be called with all of these arguments
at it expects, and that function does the actual JSON streaming as the user
programmed.
*************
Profiler Stop
*************
See "Profiler Start" and do the reverse!
There is some special handling of the ``SampleThread`` object, just to ensure
that it gets deleted outside of the main profiler mutex being locked, otherwise
this could result in a deadlock (because it needs to take the lock before being
able to check the state variable indicating that the sampling loop and thread
should end).
*****************
Profiler Shutdown
*****************
See "Profiler Initialization" and do the reverse!
One additional action is handling the optional ``MOZ_PROFILER_SHUTDOWN``
environment variable, to output a profile if the profiler was running.
***********************
Multi-Process Profiling
***********************
All of the above explanations focused on what the profiler is doing is each
process: Starting, running and collecting samples, markers, and more data,
outputting JSON profiles, and stopping.
But Firefox is a multi-process program, since
`Electrolysis aka e10s <https://wiki.mozilla.org/Electrolysis>`_ introduce child
processes to handle web content and extensions, and especially since
`Fission <https://wiki.mozilla.org/Project_Fission>`_ forced even parts of the
same webpage to run in separate processes, mainly for added security. Since then
Firefox can spawn many processes, sometimes 10 to 20 when visiting busy sites.
The following sections explains how profiling Firefox as a whole works.
IPC (Inter-Process Communication)
=================================
As a quick summary, some message-passing function-like declarations live in
and corresponding ``SendX`` and ``RecvX`` C++ functions are respectively
generated in
and virtually declared (for user implementation) in
During Profiling
================
Exit profiles
-------------
One IPC message that is not in PProfiler.ipdl, is
in
It's called from
just before a child process ends, and if the profiler was running, to ensure
that the profile data is collected and sent to the parent, for storage in its
``ActivePS``.
See
for details. Note that the current "buffer position at gathering time" (which is
effectively the largest ``ProfileBufferBlockIndex`` that is present in the
global profile buffer) is recorded. Later,
looks at the **smallest** ``ProfileBufferBlockIndex`` still present in the
buffer (because early chunks may have been discarded to limit memory usage), and
discards exit profiles that were recorded before, because their data is now
older than anything stored in the parent.
Profile Buffer Global Memory Control
------------------------------------
Each process runs its own profiler, with each its own profile chunked buffer. To
keep the overall memory usage of all these buffers under the user-picked limit,
processes work together, with the parent process overseeing things.
Diagram showing the relevant classes, see details in the following sub-sections:
.. image:: fissionprofiler-20200424.png
ProfileBufferControlledChunkManager
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
allows a controller to get notified about all chunk updates, and to force the
destruction/recycling of old chunks.
implements it.
contains all information related to chunk changes: How much memory is currently
used by the local chunk manager, how much has been "released" (and therefore
could be destroyed/recycled), and a list of all chunks that were released since
the previous update; it also has a special state meaning that the child is
shutting down so there won't be updates anymore. An ``Update`` may be "folded"
into a previous one, to create a combined update equivalent to the two separate
ones one after the other.
Update Handling in the ProfilerChild
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
When the profiler starts in a child process, the ``ProfilerChild``
These updates are stored and folded into previous ones (if any). At some point,
will be received, and any update can be forwarded to the parent. The local
update is cleared, ready to store future updates.
Update Handling in the ProfilerParent
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
When the profiler starts AND when there are child processes, the
`ProfilerParent's ProfilerParentTracker <https://searchfox.org/mozilla-central/search?q=ProfilerParentTracker>`_
creates
which starts to listen for updates from the local chunk manager.
The ``ProfilerParentTracker`` is also responsible for keeping track of child
processes, and to regularly
`send them AwaitNextChunkManagerUpdate messages <https://searchfox.org/mozilla-central/search?q=SendAwaitNextChunkManagerUpdate>`_,
that the child's ``ProfilerChild`` answers to with updates. The update may
indicate that the child is shutting down, in which case the tracker will stop
tracking it.
All these updates (from the local chunk manager, and from child processes' own
chunk managers) are processed in
`ProfileBufferGlobalController::HandleChunkManagerNonFinalUpdate <https://searchfox.org/mozilla-central/search?q=HandleChunkManagerNonFinalUpdate>`_.
Based on this stream of updates, it is possible to calculate the total memory
used by all profile buffers in all processes, and to keep track of all chunks
that have been "released" (i.e., are full, and can be destroyed). When the total
memory usage reaches the user-selected limit, the controller can lookup the
oldest chunk, and get it destroyed (either a local call for parent chunks, or by
sending
to the owning child).
Historical note: Prior to Fission, the Profiler used to keep one fixed-size
circular buffer in each process, but as Fission made the possible number of
processes unlimited, the memory consumption grew too fast, and required the
implementation of the above system. But there may still be mentions of
"circular buffers" in the code or documents; these have effectively been
replaced by chunked buffers, with centralized chunk control.
Gathering Child Profiles
========================
When it's time to capture a full profile, the parent process performs its own
JSON generation (as described above), and sends
to all child processes, which will make them generate their JSON profile and
send it back to the parent.
All child profiles, including the exit profiles collected during profiling, are
stored as elements of a top-level array with property name "processes".
During the gathering phase, while the parent is waiting for child responses, it
regularly sends
to all child processes that have not sent their profile yet, and the parent
expects responses within a short timeframe. The response carries a progress
value. If at some point two messages went with no progress was made anywhere
(either there was no response, or the progress value didn't change), the parent
assumes that remaining child processes may be frozen indefinitely, stops the
gathering and considers the JSON generation complete.
During all of the above work, events are logged (especially issues with child
processes), and are added at the end of the JSON profile, in a top-level object
with property name "profileGatheringLog". This object is free-form, and is not
intended to be displayed, or even read by most people. But it may include
interesting information for advanced users regarding the profile-gathering
phase.