---
name: perf-investigation
description: >
Structured performance opportunity investigation for SpiderMonkey (the Firefox JavaScript engine).
Use this skill when the user wants to investigate JS engine performance, profile SpiderMonkey,
find optimization opportunities, write performance patches, or evaluate benchmark regressions.
Trigger on mentions of: profiling JS, SpiderMonkey performance, JIT optimization, benchmark
regression analysis, shell benchmarking, or any request to make JS workloads faster.
The methodology is described mostly for the JS shell but can be adapted to browser investigation.
allowed-tools: Bash(searchfox-cli *) Bash(profiler-cli *) Bash(samply *) Bash(mach *) Python(*.py) Markdown(*.md)
---
# SpiderMonkey Performance Investigation
This skill guides a structured, evidence-driven performance investigation for the SpiderMonkey
JavaScript engine. The methodology has four phases: **hypothesis generation**, **evidence
gathering**, **patch writing**, and **evaluation**. Each phase builds on the last: resist the
urge to skip ahead to writing patches before you have empirical evidence that a change will help.
When asked to create multiple patches, iterate through the phases each time to ensure each patch
is independently validated and measured. **Always commit the current patch before moving on to
a new one if you are creating multiple patches**. This makes each patch easier to review and
its contribution easier to measure.
The end result of this skill will be a summary of the investigation, and one or more patches
that measurably improve the performance of the targeted workload, with each patch describing
supporting evidence and measured impact.
## Prerequisites
The user should provide:
- A workload to investigate (a JS file, benchmark suite, or instructions to reproduce)
- A build configuration or existing shell to use
You have access to:
- `samply` — sampling profiler that produces Firefox Profiler-compatible output
- `profiler-cli` — for analyzing profiles. This can also be used to investigate Gecko
profiler profiles if the investigation is being done in the browser.
- `searchfox-cli` — source code search for the Firefox codebase
For more details on how to use these tools, load the "profiler-analysis" skill, which
also explains how to install them if needed.
You can create an `artifacts/` directory for working files; it is excluded from version control.
## Phase 1: Hypothesis Generation
The goal is to identify where time is being spent and form testable hypotheses about what
could be improved.
### 1.1 Prepare the build
Use an **opt-nodebug** (optimized, no debug checks) build. Debug builds
distort profiles with assertion overhead.
The user should provide or confirm the mozconfig to use. The key settings for an opt-nodebug
build are:
```
ac_add_options --enable-optimize
ac_add_options --disable-debug
```
If the user hasn't specified a mozconfig, ask them — build configurations vary across
machines and the user will know which obj-dir and config is appropriate for their setup.
**Always run the shell with `--strict-benchmark-mode` when investigating performance.**
This flag validates the runtime configuration and will error if something would produce
unreliable numbers (e.g. JIT is disabled unexpectedly). Generating profiles without this
flag risks producing misleading data.
### 1.2 Establish the workload
Examine the workload to understand what it does. If the workload has an iteration count or
loop parameter, determine an appropriate count so that **the workload runs for at least 30
seconds under profiling**. Statistical profilers need sufficient samples to produce
meaningful data — short runs produce noisy profiles where real hotspots are hard to
distinguish from sampling noise.
For targeted micro-optimizations (e.g. improving a single opcode or a specific stub), longer
runs (60s+) may be necessary to accumulate enough samples in the specific code path of
interest.
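To see why run length matters, here is a rough sketch of the sampling noise on a hotspot's measured share. It assumes a 1000 Hz sampling rate (believed to be samply's default, but check your setup) and treats samples as independent, which real profiles only approximate. The function name is made up for this illustration.

```python
import math

def hotspot_relative_error(fraction, seconds, rate_hz=1000.0):
    """Approximate relative standard error of a hotspot's measured share.

    Assumption: each sample is an independent Bernoulli trial, so the number
    of samples landing in the hotspot is binomially distributed.
    """
    total_samples = seconds * rate_hz
    expected_hits = total_samples * fraction
    std_hits = math.sqrt(total_samples * fraction * (1.0 - fraction))
    return std_hits / expected_hits

# Compare a 2 second run with the 30s and 60s targets for a 1% hotspot.
for seconds in (2, 30, 60):
    err = hotspot_relative_error(fraction=0.01, seconds=seconds)
    print(f"{seconds:>3}s run, 1% hotspot: +/- {err * 100:.1f}% relative error")
```

Under these assumptions a 1% hotspot is only known to within roughly 22% of its own value after 2 seconds, versus about 6% after 30 seconds and 4% after 60.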
If the workload driver supports iteration configuration, prefer that.
Otherwise, wrap it:
```js
for (let i = 0; i < ITERATIONS; i++) {
  load("workload.js"); // or call the main function
}
```
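One way to pick the iteration count is to time a short trial run and scale it linearly. The sketch below assumes per-iteration cost is roughly constant, which JIT warm-up will violate for the first iterations, so treat the result as a starting point. The function name and the trial numbers are hypothetical.

```python
import math

def iterations_for_target(trial_iterations, trial_seconds, target_seconds=30.0):
    """Scale a short trial run linearly to reach the target duration."""
    if trial_iterations <= 0 or trial_seconds <= 0:
        raise ValueError("trial run must have positive iterations and duration")
    # Multiply before dividing to limit floating-point rounding error.
    return math.ceil(target_seconds * trial_iterations / trial_seconds)

# Hypothetical trial: 50 iterations took 1.2 seconds.
print(iterations_for_target(50, 1.2))        # aim for the 30s minimum
print(iterations_for_target(50, 1.2, 60.0))  # aim for 60s for micro-optimizations
```

Time the trial under the profiler if possible, since the 30 second target applies to the profiled run.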
### 1.3 Profile
Record a profile with samply. Always set `IONPERF=func` and `PERF_SPEW_DIR` so that
JIT-compiled functions appear with readable names in the profile instead of raw addresses.
The overhead is negligible:
```bash
mkdir -p artifacts/perf-spew
PERF_SPEW_DIR=artifacts/perf-spew IONPERF=func \
  samply record --save-only -o artifacts/profile.json.gz -- \
  ./obj-opt-nodebug/dist/bin/js --strict-benchmark-mode workload.js
```
Using `--save-only` avoids opening the browser and gives you a local file you can analyze
with `profiler-cli`. Save profiles to the `artifacts/` directory; you may need to gzip
the profile for profiler-cli to read it.
For deeper JIT investigation (e.g. understanding what IR the JIT emitted for a hot
function), use `IONPERF=ir` instead — see `references/advanced-tools.md`.
### 1.4 Analyze the profile
Start broad and narrow down. Looking at the profile, answer some of the following questions:
1. What are the top CPU consumers?
2. What does the call tree look like top-down?
3. Who calls a hot function?
4. What does a specific function's time look like with callees collapsed?
For Speedometer profiles, always use `--focus-marker="-async,-sync"` to exclude async idle
time between benchmark iterations.
### 1.5 Form hypotheses
Based on the profile data, form specific, testable hypotheses. Good hypotheses look like:
- "Function X is called Y times from path Z — reducing call frequency by caching result W should save ~N% of its self time"
- "The JIT is spending M% of time in IC stubs for property access pattern P — a specialized stub for this pattern could reduce that"
- "Allocation pressure in function F is causing N% GC time — pretenuring could help"
Bad hypotheses (avoid these):
- "Let's tune the inlining threshold" — tuning existing knobs tends to overfit to the current benchmark state rather than making general engine progress
- "This function seems slow, let's rewrite it" — without understanding *why* it's slow
## Phase 2: Evidence Gathering
Before writing a patch, gather enough evidence to be confident the hypothesis is sound.
### 2.1 Source investigation
Use `searchfox-cli` to read the relevant code and understand its current behavior.
Also use it for blame on the relevant lines, and check the git history of the relevant files.
This often explains why things are the way they are.
### 2.2 Instrumentation
Profiling shows *where* time is spent but not always *why*. When your hypothesis depends on
runtime state (data distributions, cache hit rates, list lengths, frequency of code paths),
add temporary instrumentation to measure it directly.
Use MOZ_LOG or JS_LOG for instrumentation.
```cpp
// The `debug` channel should be unused; you can also add your own channel.
JS_LOG(debug, Debug, "list length: %zu, sorted: %s",
       list.length(), isSorted ? "yes" : "no");
```
**Throttle instrumentation output** when it would fire on every iteration — use a counter
to log every Nth occurrence, or accumulate statistics and log a summary. Unthrottled logging
in a hot path will drown the output and slow the workload enough to distort measurements.
```cpp
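// Temporary instrumentation only: these statics are not thread-safe.
static uint32_t callCount = 0;
static size_t totalLength = 0;
totalLength += list.length();
if (++callCount % 10000 == 0) {
  JS_LOG(debug, Debug, "after %u calls: avg length = %zu",
         callCount, totalLength / callCount);
}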
```
Re-run with `MOZ_LOG=debug:5` to see the output.
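Once the log is captured, a small script can turn it into summary statistics. This is a minimal sketch that assumes the message format from the first `JS_LOG` example above (`list length: N, sorted: yes|no`) and ignores whatever prefix the logger adds. The function name and sample lines are illustrative, not real output.

```python
import re
import statistics

# Matches the message body only; search() skips any logger prefix on the line.
LINE_RE = re.compile(r"list length: (\d+), sorted: (yes|no)")

def summarize(lines):
    """Aggregate list-length log lines into a small summary dict."""
    lengths, sorted_count = [], 0
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # unrelated log output
        lengths.append(int(match.group(1)))
        sorted_count += match.group(2) == "yes"
    if not lengths:
        return None
    return {
        "count": len(lengths),
        "median_length": statistics.median(lengths),
        "max_length": max(lengths),
        "sorted_fraction": sorted_count / len(lengths),
    }

# Illustrative lines; in practice pass an open log file instead of this list.
sample = [
    "list length: 3, sorted: yes",
    "some unrelated line",
    "list length: 120, sorted: no",
    "list length: 5, sorted: yes",
]
print(summarize(sample))
```

A median and maximum are often more informative than a mean here, because a hypothesis such as "lists are almost always short" depends on the shape of the distribution.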
In a browser build you can add profiler markers instead of logging; these can be read
with the Gecko profiler and `profiler-cli`.
### 2.3 Re-run with instrumentation
Run the instrumented build and collect the data. This confirms whether your hypothesis
about runtime behavior is correct before you invest in writing a real patch.
## Phase 3: Patch Writing
Now that you have evidence, write the patch.
### 3.1 Design for measurability
Where possible, gate the optimization behind a **JS::Prefs preference** so you can do an
apples-to-apples comparison on the same binary. This eliminates build-to-build variation
as a confounding factor and makes it trivial to re-measure later.
To add the pref, add an entry to `StaticPrefList.yaml`:
```yaml
- name: javascript.options.experimental.my_optimization
  type: bool
  value: true
  mirror: always
  set_spidermonkey_pref: always
```
Then guard the code path:
```cpp
if (JS::Prefs::experimental_my_optimization()) {
  // new path (default: on)
} else {
  // old path
}
```
Use `set_spidermonkey_pref: always` (not `startup`) so the pref can be toggled via
`--setpref` without requiring a restart:
```bash
# Measure with optimization (default):
./js --strict-benchmark-mode workload.js
# Measure without:
./js --strict-benchmark-mode --setpref experimental.my_optimization=false workload.js
```
Note that pref-gating is not always feasible. For changes on extremely hot paths (tight
JIT loops, inline caches), the branch on the pref check itself can be costly enough to
distort measurements. In those cases, fall back to saving the obj-dir from a build without
the patch and comparing against a build with the patch applied.
**Note: you can't save just the `js` binary, because it depends on dynamically linked
libraries. Always save the whole obj-dir, or create a different mozconfig**.
### 3.2 Add development logging
During patch development, add `JS_LOG` logging to the debug channel to verify the new
code path is being taken where expected. Throttle by a counter to avoid flooding output.
Do a run with the instrumentation logging enabled to confirm it fires when, where, and as
often as expected. Remove or reduce this logging before the patch is finalized.
### 3.3 Microbenchmark
For a given optimization it is often worthwhile to also write a microbenchmark that
demonstrates, in the _absolute most ideal circumstances for the optimization_, what kind
of result is achievable. This is not a replacement for measuring the real workload,
but it is a useful sanity check that the optimization works as intended and has
the potential to produce the expected impact. It can also help when deciding whether to
keep patches that are effective in the microbenchmark but don't show a clear impact under
the real workload.
### 3.4 Multiple patches
When investigating multiple optimization opportunities:
- Develop each patch independently so its contribution can be measured in isolation
- Commit each patch separately with a clear message describing the change, the hypothesis
  it aims to address, the evidence in favour, and the testing results.
- At the end of optimization, present:
  1. **Total improvement** from baseline (no patches) to all patches applied
  2. **Individual contribution** of each patch measured independently
  3. Any **interactions** between patches (does applying A make B more or less effective?)
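The three numbers above can be related with a little arithmetic. The sketch below assumes that independent patches would combine multiplicatively on run time. That is one reasonable model, not the only one. The function names and timings are hypothetical placeholders, not measured data.

```python
def improvement_pct(baseline, patched):
    """Percentage reduction in run time relative to baseline."""
    return (baseline - patched) / baseline * 100.0

def interaction_report(baseline, with_a, with_b, with_both):
    """Compare the measured combined result with an independence estimate."""
    # Assumption: if A and B did not interact, their time ratios would multiply.
    expected_both = baseline * (with_a / baseline) * (with_b / baseline)
    return {
        "a_pct": improvement_pct(baseline, with_a),
        "b_pct": improvement_pct(baseline, with_b),
        "total_pct": improvement_pct(baseline, with_both),
        "expected_total_pct": improvement_pct(baseline, expected_both),
    }

# Hypothetical mean times in seconds.
report = interaction_report(baseline=10.0, with_a=9.0, with_b=9.5, with_both=8.2)
for name, value in report.items():
    print(f"{name}: {value:.2f}%")
```

If `total_pct` is clearly above `expected_total_pct`, the patches reinforce each other; if it is clearly below, they overlap. Only read meaning into the gap when it is larger than the measurement noise.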
## Phase 4: Evaluation
### 4.1 Performance measurement
Run the workload with and without the patch (using the pref toggle or separate builds).
If `hyperfine` is available, you can use it. If not, start with 5 runs of each configuration, collecting timing results into arrays.
```bash
# With pref-gated optimization — collect results into a file:
for i in $(seq 1 5); do
  ./js --strict-benchmark-mode --setpref experimental.my_optimization=true workload.js \
    2>&1 | tee -a artifacts/results_with.txt
done
for i in $(seq 1 5); do
  ./js --strict-benchmark-mode --setpref experimental.my_optimization=false workload.js \
    2>&1 | tee -a artifacts/results_without.txt
done
```
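If the workload does not print its own timing, wall-clock time can be collected straight into a list from Python. This is a minimal sketch: `time_runs` is a made-up helper, and the command shown is a placeholder so the example runs anywhere. Wall-clock time includes process startup, so prefer the workload's own reported score when it has one.

```python
import subprocess
import sys
import time

def time_runs(command, runs=5):
    """Run `command` `runs` times and return wall-clock durations in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises if the run fails, so failed runs are never timed.
        subprocess.run(command, check=True, stdout=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations

# Placeholder command; substitute the real shell invocation, for example
# ["./js", "--strict-benchmark-mode", "workload.js"].
placeholder = [sys.executable, "-c", "sum(range(100000))"]
print(time_runs(placeholder, runs=3))
```

The returned lists can be used directly as the `baseline` and `patched` arrays in the significance test.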
After collecting initial results, use a Python script to assess whether the sample size
is sufficient. Use the **Mann-Whitney U test** (non-parametric, robust to non-normal
distributions common in benchmark data) to test for significance:
```python
# /// script
# dependencies = [
# "numpy",
# "scipy",
# ]
# ///
# use `uv run script.py` and deps should be automatically installed
import numpy as np
from scipy import stats
baseline = np.array([...]) # times without patch
patched = np.array([...]) # times with patch
stat, p_value = stats.mannwhitneyu(baseline, patched, alternative='two-sided')
effect_size = (np.mean(baseline) - np.mean(patched)) / np.mean(baseline) * 100
print(f"Baseline: {np.mean(baseline):.2f} +/- {np.std(baseline):.2f}")
print(f"Patched: {np.mean(patched):.2f} +/- {np.std(patched):.2f}")
print(f"Effect: {effect_size:.2f}%")
print(f"p-value: {p_value:.4f}")
if p_value > 0.05:
    print("Result not statistically significant at p<0.05; consider more runs")
```
If the p-value is borderline (0.01 < p < 0.10) or the effect size is small relative to
the observed variance, collect additional runs and retest. But **do not exceed 20 runs per
configuration** — if 20 runs on each side still can't produce a significant result, the
effect is too close to the noise floor to be meaningfully measured this way. That's a signal
to step back and reconsider: either the optimization isn't having the expected impact, or
the workload needs to be restructured to isolate the effect better (e.g. more iterations
of the hot path, a more targeted microbenchmark).
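The retest rule can be written down as a small decision function. This is one possible reading of the guidance above, using the 0.01 and 0.05 thresholds and the 20-run cap from the text. The function name and return labels are invented for illustration, and it deliberately ignores the effect-size-versus-variance check.

```python
def next_step(p_value, runs_per_config, max_runs=20, alpha=0.05, clear=0.01):
    """Return 'accept', 'more-runs', or 'reconsider' for a measurement."""
    if p_value <= clear:
        return "accept"          # clearly significant, no more runs needed
    if runs_per_config < max_runs:
        return "more-runs"       # borderline or not significant yet: retest
    # At the cap: accept a significant result, otherwise step back.
    return "accept" if p_value < alpha else "reconsider"

print(next_step(0.004, 5))   # accept
print(next_step(0.07, 10))   # more-runs
print(next_step(0.07, 20))   # reconsider
```

A "reconsider" result is the signal to restructure the workload or revisit the hypothesis, not to keep collecting runs.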
### 4.2 Profile the patched build
Don't just measure — profile again to confirm the patch is having the expected effect.
The profile should show reduced time in the targeted code path. If it doesn't, investigate
why.
### 4.3 Safety evaluation
After each patch is written, but before it's committed, **run the correctness test suites.**
Both suites below must pass. Test with the opt-nodebug build first (because you already have
it), but also test with an opt-debug build, since many debug-only assertions catch errors
that an opt-nodebug build would miss.
```bash
./mach jit-test
./mach jstests
```
If the patch touches **GC-related code**, run both suites with `--jitflags=all` for more
thorough coverage:
```bash
./mach jit-test --jitflags=all
./mach jstests --jitflags=all
```
Beyond the test suites, consider adding test cases to address:
- Edge cases the optimization might mishandle
- Whether the patch changes general-purpose code paths that could regress other workloads
## Investigation document
Produce a summary document (outside the source tree, e.g. in `artifacts/`) that records:
1. **Objective**: What workload was being investigated and why
2. **Methodology**: Build configuration, profiling setup, iteration counts
3. **Hypotheses investigated**: For each hypothesis:
   - What the profile data suggested
   - What evidence was gathered (instrumentation results, source analysis)
   - Whether a patch was written and what it does
   - Measured performance impact (with numbers and variance)
4. **Hypotheses rejected**: Hypotheses that were investigated but didn't pan out, and why —
this is valuable for future investigators
5. **Results**: Summary of total improvement achieved, per-patch breakdown
6. **Remaining opportunities**: Observations from profiling that weren't pursued but could
be investigated in future work
## Anti-patterns to avoid
- **Patching without evidence**: Never write an optimization patch based on intuition alone.
Profile first, instrument if needed, then patch.
- **Knob tuning**: Adjusting existing heuristic thresholds (inlining limits, IC stub counts,
GC triggers) tends to overfit to the specific benchmark. Prefer structural improvements
that make the engine generally better over threshold adjustments that win one benchmark.
- **Measuring too few iterations**: A single run or a 2-second profile is not reliable.
Ensure sufficient samples for statistical confidence.
- **Forgetting `--strict-benchmark-mode`**: Without this flag, the shell may be in a
configuration that produces misleading numbers. Always use it.
- **Comparing across builds without controlling for noise**: Use pref-gated patches or
carefully controlled build pairs. Random rebuild-to-rebuild variation can mask or
exaggerate real differences.
- **Mixing together independent changes in a single patch**.
- **Advocating for changes that can't even be measured on a targeted microbenchmark**:
If the optimization can't show a clear improvement in an idealized scenario, it's
unlikely to produce meaningful improvement in the real workload.