JSON_FORMAT.md - mozsearch

# Worker Pool Activity Data Format Documentation

This document describes the JSON file formats created by `fetch-worker-data.js`.

## Overview

The script generates daily JSON files containing Taskcluster worker pool activity

data, fetched from STMO (BigQuery). The format uses string tables, parallel arrays,

and differential time compression to minimize file size.

Three file types are produced per day:

1. **Index**: `index.json` - List of available dates, used by the dashboard

   at https://tests.firefox.dev/workers.html to know which dates can be loaded.

2. **Summary**: `workers-YYYY-MM-DD.json` - Small file (~5 MB) with timestamps,

   resolution, task queue, and project. Loaded immediately by the dashboard for

   stats computation, time series rendering, and project filtering.

3. **Tasks**: `workers-YYYY-MM-DD-tasks.json` - Full task data (~20 MB) with all

   fields. Loaded on demand when the user clicks a Profiler button.

---

## Index File (`index.json`)

```json

  "dates": ["2026-03-04", "2026-03-03", "2026-03-02", ...]

```

Dates are sorted in descending order (newest first). Up to 21 days of history

are maintained.

---

## Summary File (`workers-YYYY-MM-DD.json`)

Small file for fast initial dashboard load. Contains only the fields needed

to compute per-pool stats, draw time-series activity tracks, and populate

project filter buttons.

### Top-Level Structure

```json

  "metadata": { ... },

  "tables": { ... },

  "tasks": { ... }

```

### metadata

```json

  "date": "2026-03-04",

  "generatedAt": "2026-03-05T03:12:45.123Z",

  "taskCount": 243684

```

### tables

```json

  "taskQueueIds": ["gecko-t/t-linux-docker-noscratch-amd", ...],

  "resolutions": ["completed", "failed", "exception - canceled", ...],

  "projects": ["mozilla-central", "try", "autoland", ...]

```

### tasks

Parallel arrays of length `taskCount`:

```json

  "scheduled": [1709510400000, 150, 0, 23, ...],

  "started": [5432, null, 1234, ...],

  "resolved": [65432, 3600000, 12345, ...],

  "resolutionIds": [0, 0, 2, ...],

  "taskQueueIdIds": [0, 1, 0, ...],

  "projectIds": [0, null, 1, ...]

```

See [Time Compression](#time-compression) and [Resolutions](#resolutions) below

for encoding details.

---

## Tasks File (`workers-YYYY-MM-DD-tasks.json`)

Full task data loaded on demand for Firefox Profiler integration.

### Top-Level Structure

```json

  "metadata": { ... },

  "tables": { ... },

  "tasks": { ... },

  "workerInfo": { ... },

  "taskGroupInfo": { ... }

```

### metadata

Same structure as the summary file.

### tables

All strings are deduplicated and stored once. Tables are sorted by frequency

of use (most referenced entries first) to reduce JSON size (frequently used strings get smaller numeric indices).

```json

  "labels": ["test-linux2404-64/opt-mochitest-plain-5", ...],

  "projects": ["mozilla-central", "try", "autoland", ...],

  "taskQueueIds": ["releng-hardware/gecko-t-linux-2404-wayland", ...],

  "resolutions": ["completed", "failed", "exception - canceled", ...],

  "workerGroups": ["us-central1-b", ...],

  "workerIds": ["7732042218797547089", ...],

  "priorities": ["low", "lowest", "medium", ...],

  "users": ["cron@mozilla-central", "user@example.com", ...],

  "taskGroupIds": ["YJJe4a0CRIqbAmcCo8n63w", ...]

```

### tasks

Parallel arrays of length `taskCount`:

```json

  "scheduled": [1709510400000, 150, 0, 23, ...],

  "started": [5432, null, 1234, ...],

  "resolved": [65432, 3600000, 12345, ...],

  "resolutionIds": [0, 0, 2, ...],

  "taskIds": ["YJJe4a0CRIqbAmcCo8n63w", "XPPf5b1DRJrcBndDp9o74x.1", ...],

  "labelIds": [0, 1, 2, ...],

  "priorityIds": [0, 0, 1, ...],

  "taskGroupIdIds": [0, 0, 1, ...],

  "userIds": [0, 1, 0, ...],

  "taskQueueIdIds": [0, 1, 0, ...],

  "workerIdIds": [0, null, 1, ...],

  "runCosts": [0.012244, 0, 0.003456, ...]

```

#### Task IDs

Task IDs include a `.runId` suffix only when the run ID is greater than 0:

- Run 0: `"YJJe4a0CRIqbAmcCo8n63w"`

- Run 1: `"YJJe4a0CRIqbAmcCo8n63w.1"`

#### Index References

- `resolutionIds[i]`: Index into `tables.resolutions`

- `labelIds[i]`: Index into `tables.labels`

- `priorityIds[i]`: Index into `tables.priorities`

- `taskGroupIdIds[i]`: Index into `tables.taskGroupIds`

- `userIds[i]`: Index into `tables.users`

- `taskQueueIdIds[i]`: Index into `tables.taskQueueIds`

- `workerIdIds[i]`: Index into `tables.workerIds` (`null` if never assigned)

#### Run Costs

`runCosts[i]` is the cost in USD, rounded to 6 decimal places. `0` when

not available.

### workerInfo

Per-worker arrays indexed by worker ID (index into `tables.workerIds`).

```json

  "workerGroupIds": [0, 1, 0, ...]

```

- `workerGroupIds[w]`: Index into `tables.workerGroups` for worker `w`

Each worker belongs to exactly one worker group.

### taskGroupInfo

Per-task-group arrays indexed by task group ID (index into

`tables.taskGroupIds`).

```json

  "projectIds": [0, 1, null, ...]

```

- `projectIds[g]`: Index into `tables.projects` for task group `g`

  (`null` for task groups with no project, typically decision tasks)

Each task group belongs to at most one project.

---

## Shared Encoding Details

### Time Compression

Both files use the same time encoding. Tasks are sorted by scheduled time.

- **`scheduled`**: Differential compression. `scheduled[0]` is an absolute

  millisecond timestamp. `scheduled[i]` (for i > 0) is the delta from

  `scheduled[i-1]`. Since tasks are sorted, deltas are small non-negative

  integers.

- **`started`**: Offset from the task's own scheduled time (i.e., queue wait

  time in ms). `null` if the task never started.

- **`resolved`**: Offset from the task's own scheduled time (i.e., total

  task lifetime in ms).

To reconstruct absolute timestamps:

```javascript

const abs = new Array(n);

abs[0] = data.tasks.scheduled[0];

for (let i = 1; i < n; i++) {

  abs[i] = abs[i - 1] + data.tasks.scheduled[i];

// abs[i] is the absolute scheduled time for task i

const startedMs = data.tasks.started[i] !== null

  ? abs[i] + data.tasks.started[i] : null;

const resolvedMs = abs[i] + data.tasks.resolved[i];

```

### Resolutions

The `resolutions` table merges the task `state` and `reason_resolved` fields:

- When both are the same: just the value (e.g., `"completed"`, `"failed"`)

- When different: `"state - reason"` (e.g., `"exception - canceled"`,

  `"exception - deadline-exceeded"`)

---

## Example: Reconstruct a Full Task Row

Using the tasks file:

```javascript

const d = data;

const i = 42;

// Reconstruct absolute scheduled time

let scheduledMs = d.tasks.scheduled[0];

for (let j = 1; j <= i; j++) {

  scheduledMs += d.tasks.scheduled[j];

const startedMs = d.tasks.started[i] !== null

  ? scheduledMs + d.tasks.started[i] : null;

const resolvedMs = scheduledMs + d.tasks.resolved[i];

const taskId = d.tasks.taskIds[i];

const label = d.tables.labels[d.tasks.labelIds[i]];

const resolution = d.tables.resolutions[d.tasks.resolutionIds[i]];

const priority = d.tables.priorities[d.tasks.priorityIds[i]];

const taskQueue = d.tables.taskQueueIds[d.tasks.taskQueueIdIds[i]];

const cost = d.tasks.runCosts[i];

// Task group and project

const tgIdx = d.tasks.taskGroupIdIds[i];

const taskGroupId = tgIdx !== null ? d.tables.taskGroupIds[tgIdx] : null;

const projectIdx = tgIdx !== null ? d.taskGroupInfo.projectIds[tgIdx] : null;

const project = projectIdx !== null ? d.tables.projects[projectIdx] : null;

// User

const userIdx = d.tasks.userIds[i];

const user = userIdx !== null ? d.tables.users[userIdx] : null;

// Worker (may be null if task never ran)

const wIdx = d.tasks.workerIdIds[i];

const workerId = wIdx !== null ? d.tables.workerIds[wIdx] : null;

const workerGroup = wIdx !== null

  ? d.tables.workerGroups[d.workerInfo.workerGroupIds[wIdx]] : null;

```

---

## Data Characteristics

- **Task count**: ~240K tasks per day

- **Summary file size**: ~5 MB uncompressed per day

- **Tasks file size**: ~20 MB uncompressed per day

- **String tables**: Sorted by frequency (most used entries have lowest indices)

- **History**: 21 days maintained via CI artifact re-upload

- **Source**: STMO query 112377 (BigQuery)

- **Schedule**: Generated daily at 03:00 UTC via CI cron