SKILL.md - mozsearch

---

name: perf-investigation

description: >

  Structured performance opportunity investigation for SpiderMonkey (the Firefox JavaScript engine).

  Use this skill when the user wants to investigate JS engine performance, profile SpiderMonkey,

  find optimization opportunities, write performance patches, or evaluate benchmark regressions.

  Trigger on mentions of: profiling JS, SpiderMonkey performance, JIT optimization, benchmark

  regression analysis, shell benchmarking, or any request to make JS workloads faster.

  The methodolgy is described mostly for the JS shell but can be adapted to browser investigation.

allowed-tools: Bash(searchfox-cli *) Bash(profiler-cli *) Bash(samply *) Bash(mach *) Python(*.py) Markdown(*.md)

---

# SpiderMonkey Performance Investigation

This skill guides a structured, evidence-driven performance investigation for the SpiderMonkey

JavaScript engine. The methodology has four phases: **hypothesis generation**, **evidence

gathering**, **patch writing**, and **evaluation**. Each phase builds on the last: resist the

urge to skip ahead to writing patches before you have empirical evidence that a change will help.

When asked to create multiple patches, iterate through the phases each time to ensure each patch

is independently validated and measured. **Always create commits before moving onto a new patch

if you are creating multiple patches**. This will make it easier to review and to measure

contribution.

The end result of this skill will be a summary of the investigation, and one or more patches

that measurably improve the performance of the targeted workload, with each patch describing

supporting evidence and measured impact.

## Prerequisites

The user should provide:

- A workload to investigate (a JS file, benchmark suite, or instructions to reproduce)

- A build configuration or existing shell to use

You have access to:

- `samply` — sampling profiler that produces Firefox Profiler-compatible output

- `profiler-cli` — for analyzing profiles. This can also be used to investigate Gecko

  profiler profiles if the investigation is being done in the browser.

- `searchfox-cli` — source code search for the Firefox codebase

For more details on how to use these tools load the "profiler-analysis" skill, which

will also hint on how to get the tools installed if needed.

An `artifacts/` directory can be created and this is excluded from version control.

## Phase 1: Hypothesis Generation

The goal is to identify where time is being spent and form testable hypotheses about what

could be improved.

### 1.1 Prepare the build

Use an **opt-nodebug** (optimized, no debug checks) build. Debug builds

distort profiles with assertion overhead.

The user should provide or confirm the mozconfig to use. The key settings for an opt-nodebug

build are:

```

ac_add_options --enable-optimize

ac_add_options --disable-debug

```

If the user hasn't specified a mozconfig, ask them — build configurations vary across

machines and the user will know which obj-dir and config is appropriate for their setup.

**Always run the shell with `--strict-benchmark-mode` when investigating performance.**

This flag validates the runtimeconfiguration and will error if something would produce

unreliable numbers (e.g. JIT is disabled unexpectedly). Generating profiles without this

flag risks producing misleading data.

### 1.2 Establish the workload

Examine the workload to understand what it does. If the workload has an iteration count or

loop parameter, determine an appropriate count so that **the workload runs for at least 30

seconds under profiling**. Statistical profilers need sufficient samples to produce

meaningful data — short runs produce noisy profiles where real hotspots are hard to

distinguish from sampling noise.

For targeted micro-optimizations (e.g. improving a single opcode or a specific stub), longer

runs (60s+) may be necessary to accumulate enough samples in the specific code path of

interest.

If the workload driver supports iteration configuration, prefer that.

Otherwise, wrap it:

```js

for (let i = 0; i < ITERATIONS; i++) {

    load("workload.js");  // or call the main function

```

### 1.3 Profile

Record a profile with samply. Always set `IONPERF=func` and `PERF_SPEW_DIR` so that

JIT-compiled functions appear with readable names in the profile instead of raw addresses.

The overhead is negligible:

```bash

mkdir -p artifacts/perf-spew

PERF_SPEW_DIR=artifacts/perf-spew IONPERF=func \

    samply record --save-only -o artifacts/profile.json.gz -- \

    ./obj-opt-nodebug/dist/bin/js --strict-benchmark-mode workload.js

```

Using `--save-only` avoids opening the browser and gives you a local file you can analyze

with `profiler-cli`. Save profiles to the `artifacts/` directory; you may need to gzip

the profile for profiler-cli to read it.

For deeper JIT investigation (e.g. understanding what IR the JIT emitted for a hot

function), use `IONPERF=ir` instead — see `references/advanced-tools.md`.

### 1.4 Analyze the profile

Start broad and narrow down: Looking at the profile, answer some of the following questionsfile:

1. What are the top CPU consumers?

2. What does the call tree look like top-down?

3. Who calls a hot function?

4. What does a specific function's time look like with callees collapsed?

For Speedometer profiles, always use `--focus-marker="-async,-sync"` to exclude async idle

time between benchmark iterations.

### 1.5 Form hypotheses

Based on the profile data, form specific, testable hypotheses. Good hypotheses look like:

- "Function X is called Y times from path Z — reducing call frequency by caching result W should save ~N% of its self time"

- "The JIT is spending M% of time in IC stubs for property access pattern P — a specialized stub for this pattern could reduce that"

- "Allocation pressure in function F is causing N% GC time — pretenuring could help"

Bad hypotheses (avoid these):

- "Let's tune the inlining threshold" — tuning existing knobs tends to overfit to the current benchmark state rather than making general engine progress

- "This function seems slow, let's rewrite it" — without understanding *why* it's slow

## Phase 2: Evidence Gathering

Before writing a patch, gather enough evidence to be confident the hypothesis is sound.

### 2.1 Source investigation

Use `searchfox-cli` to understand the relevant code and understand the current behavior.

Use searchfox-cli for blame on relevant code, as well as git history on relevant files.

This might provide context on why things are the way they are.

### 2.2 Instrumentation

Profiling shows *where* time is spent but not always *why*. When your hypothesis depends on

runtime state (data distributions, cache hit rates, list lengths, frequency of code paths),

add temporary instrumentation to measure it directly.

Use MOZ_LOG or JS_LOG for instrumentation.

```cpp

JS_LOG(debug /* you can also add your own channel, but debug should be unused */, Debug, "list length: %zu, sorted: %s",

           list.length(), isSorted ? "yes" : "no");

```

**Throttle instrumentation output** when it would fire on every iteration — use a counter

to log every Nth occurrence, or accumulate statistics and log a summary. Unthrottled logging

in a hot path will drown the output and slow the workload enough to distort measurements.

```cpp

static uint32_t callCount = 0;

if (++callCount % 10000 == 0) {

    JS_LOG_FMT(debug, Debug, "after %u calls: avg length = %zu",

               callCount, totalLength / callCount);

```

Re run with `MOZ_LOG=debug:5` to see the output.

In a browser build you can add profiler markers instead of logging which can be read through

gecko-profiling and the profiler-cli.

### 2.3 Re-run with instrumentation

Run the instrumented build and collect the data. This confirms whether your hypothesis

about runtime behavior is correct before you invest in writing a real patch.

## Phase 3: Patch Writing

Now that you have evidence, write the patch.

### 3.1 Design for measurability

Where possible, gate the optimization behind a **JS::Prefs preference** so you can do

apples-to-apples comparison on the same binary. This eliminates build-to-build variation

as a confounding factor and makes it trivial to re-measure later.

To add the pref, add an entry to `StaticPrefList.yaml`:

```yaml

- name: javascript.options.experimental.my_optimization

  type: bool

  value: true

  mirror: always

  set_spidermonkey_pref: always

```

Then guard the code path:

```cpp

if (JS::Prefs::experimental_my_optimization()) {

    // new path (default: on)

} else {

    // old path

```

Use `set_spidermonkey_pref: always` (not `startup`) so the pref can be toggled via

`--setpref` without requiring a restart:

```bash

# Measure with optimization (default):

./js --strict-benchmark-mode workload.js

# Measure without:

./js --strict-benchmark-mode --setpref experimental.my_optimization=false workload.js

```

Note that pref-gating is not always feasible. For changes on extremely hot paths (tight

JIT loops, inline caches), the branch on the pref check itself can be costly enough to

distort measurements. In those cases, fall back to saving the obj-dir from a build without

the patch and comparing against a build with the patch applied.

**Note: You can't save -just- a `js` binary, as there are dynamically linked libraries.

Always save the obj-dir, or create a different mozconfig**.

### 3.2 Add development logging

During patch development, add `JS_LOG` logging to the debug channel to verify the new

code path is being taken where expected. Throttle by a counter to avoid flooding output.

Do a run with the instrumentation logging to ensure the logging fires when/where/as-much

as expected. Remove or reduce this logging before the patch is finalized.

### 3.3 Microbenchmark

For a given optimization is is often compelling to also generate a microbenchmark which

demonstrates in the _absolute most ideal circumstances for the optimization_ what kind

of result is achievable. This is not a replacement for measuring the real workload,

but can be a useful sanity check that the optimization is working as intended and has

the potential to produce the expected impact, and can help in choosing to keep

patches which are effective in the microbenchmark but don't show good impact under the

real workload.

### 3.4 Multiple patches

When investigating multiple optimization opportunities:

- Develop each patch independently so its contribution can be measured in isolation

- Commit each patch separately with a clear message describing the change and the hypothesis

  aims to address, evidence in favour and testing results.

- At the end of optimziation, present:

  1. **Total improvement** from baseline (no patches) to all patches applied

  2. **Individual contribution** of each patch measured independently

  3. Any **interactions** between patches (does applying A make B more or less effective?)

## Phase 4: Evaluation

### 4.1 Performance measurement

Run the workload with and without the patch (using the pref toggle or separate builds).

If `hyperfine` is available, you can use that if. If not, start with 5 runs of each configuration, collecting timing results into arrays.

```bash

# With pref-gated optimization — collect results into a file:

for i in $(seq 1 5); do

    ./js --strict-benchmark-mode --setpref experimental.my_optimization=true workload.js \

        2>&1 | tee -a artifacts/results_with.txt

done

for i in $(seq 1 5); do

    ./js --strict-benchmark-mode --setpref experimental.my_optimization=false workload.js \

        2>&1 | tee -a artifacts/results_without.txt

done

```

After collecting initial results, use a Python script to assess whether the sample size

is sufficient. Use the **Mann-Whitney U test** (non-parametric, robust to non-normal

distributions common in benchmark data) to test for significance:

```python

# /// script

# dependencies = [

#   "numpy",

#   "scipy",

# ]

# ///

# use `uv run script.py` and deps should be automaticaly installed

import numpy as np

from scipy import stats

baseline = np.array([...])  # times without patch

patched = np.array([...])   # times with patch

stat, p_value = stats.mannwhitneyu(baseline, patched, alternative='two-sided')

effect_size = (np.mean(baseline) - np.mean(patched)) / np.mean(baseline) * 100

print(f"Baseline: {np.mean(baseline):.2f} +/- {np.std(baseline):.2f}")

print(f"Patched:  {np.mean(patched):.2f} +/- {np.std(patched):.2f}")

print(f"Effect:   {effect_size:.2f}%")

print(f"p-value:  {p_value:.4f}")

if p_value > 0.05:

    print("Result not statistically significant at p<0.05 — consider more runs")

```

If the p-value is borderline (0.01 < p < 0.10) or the effect size is small relative to

the observed variance, collect additional runs and retest. But **do not exceed 20 runs per

configuration** — if 20 runs on each side still can't produce a significant result, the

effect is too close to the noise floor to be meaningfully measured this way. That's a signal

to step back and reconsider: either the optimization isn't having the expected impact, or

the workload needs to be restructured to isolate the effect better (e.g. more iterations

of the hot path, a more targeted microbenchmark).

### 4.2 Profile the patched build

Don't just measure — profile again to confirm the patch is having the expected effect.

The profile should show reduced time in the targeted code path. If it doesn't, investigate

why.

### 4.3 Safety evaluation

After each patch is written, but before it's commited, **run the correctness test suites.**

Both of these must pass. Test with opt-nodebug first (because you have the build) but

also test with an opt-debug build as well, as there are many debug-only assertions

that catch errors that are needed to be evaluated.

```bash

./mach jit-test

./mach jstests

```

If the patch touches **GC-related code**, run both suites with `--jitflags=all` for more

thorough coverage:

```bash

./mach jit-test --jitflags=all

./mach jstests --jitflags=all

```

Beyond the test suites, consider adding test cases to address

- Edge cases the optimization might mishandle

- Whether the patch changes general-purpose code paths that could regress other workloads

## Investigation document

Produce a summary document (outside the source tree, e.g. in `artifacts/`) that records:

1. **Objective**: What workload was being investigated and why

2. **Methodology**: Build configuration, profiling setup, iteration counts

3. **Hypotheses investigated**: For each hypothesis:

   - What the profile data suggested

   - What evidence was gathered (instrumentation results, source analysis)

   - Whether a patch was written and what it does

   - Measured performance impact (with numbers and variance)

4. **Hypotheses rejected**: Hypotheses that were investigated but didn't pan out, and why —

   this is valuable for future investigators

5. **Results**: Summary of total improvement achieved, per-patch breakdown

6. **Remaining opportunities**: Observations from profiling that weren't pursued but could

   be investigated in future work

## Anti-patterns to avoid

- **Patching without evidence**: Never write an optimization patch based on intuition alone.

  Profile first, instrument if needed, then patch.

- **Knob tuning**: Adjusting existing heuristic thresholds (inlining limits, IC stub counts,

  GC triggers) tends to overfit to the specific benchmark. Prefer structural improvements

  that make the engine generally better over threshold adjustments that win one benchmark.

- **Measuring too few iterations**: A single run or a 2-second profile is not reliable.

  Ensure sufficient samples for statistical confidence.

- **Forgetting `--strict-benchmark-mode`**: Without this flag, the shell may be in a

  configuration that produces misleading numbers. Always use it.

- **Comparing across builds without controlling for noise**: Use pref-gated patches or

  carefully controlled build pairs. Random rebuild-to-rebuild variation can mask or

  exaggerate real differences.

- **Mixing together independnet changes in a single patch**.

- Advocating for changes that can't even be measured on a targeted microbenchmark.

  If the optimization can't show a clear improvement in an idealized scenario, it's

  unlikely to produce meaningful improvement in the real workload.

Revision control

Copy as Markdown

Other Tools