# Sync manager
We've identified the need for a "sync manager" (although we have yet to
identify a good name for it!) This manager will be responsible for managing
"global sync state" and coordinating each engine.
At a very high level, the sync manager is responsible for all syncing. So far,
so obvious. However, given our architecture, it's possible to identify a
key architectural split.
* The embedding application will be responsible for most high-level operations.
For example, the app itself will choose how often regular syncs should
happen, what environmental concerns apply (eg, should I only sync on WiFi?),
letting the user choose exactly what to sync, and so forth.
* A lower-level component will be responsible for the direct interaction with
the engines and with the various servers needed to perform a sync. It will
also have the ultimate responsibility to not cause harm to the service (for
example, it is likely to enforce some kind of rate limiting and to ensure
that backoff requests from the service are honored).
Because all application-services engines are written in Rust, it's tempting to
suggest that this lower-level component also be written in Rust and everything
"just works", but there are a few complications here:
* For iOS, we hope to integrate with older engines which aren't written in
Rust, even if iOS does move to the new Sync Manager.
* For Desktop, we hope to start by reusing the existing "sync manager"
implemented by Desktop, then start moving individual engines across.
* There may be some cross-crate issues even for the Rust implemented engines.
Or more specifically, we'd like to avoid assuming any particular linkage or
packaging of Rust implemented engines.
Even with these complications, we expect there to be a number of high-level
components, each written in a platform specific language (eg, Kotlin or Swift)
and a single lower-level component to be implemented in Rust and delivered
as part of the application-services library - but that's not a free-pass.
Why "a number of high-level components"? Because that is the thing which
understands the requirements of the embedding application. For example, Android
may end up with a single high-level component in the android-components repo
and shared between all Android components. Alternatively, the Android teams
may decide the sync manager isn't generic enough to share, so each app will
have their own. iOS will probably end up with its own and you could imagine
a future where Desktop does too - but they should all be able to share the
low level component.
# The responsibilities of the Sync Manager.
The primary responsibilities of the "high level" portion of the sync manager are:
* Manage all FxA interaction. The low-level component will have a way to
communicate auth related problems, but it is the high-level component
which takes concrete FxA action.
* Expose all UI for the user to choose what to sync and coordinate this with
the low-level component. Note that because these choices can be made on any
connected device, these choices must be communicated in both directions.
* Implement timers or any other mechanism to fully implement the "sync
scheduler", including any policy decisions such as only syncing on WiFi,
etc.
* Implement a UI so the user can "sync now".
* Collect telemetry from the low-level component, probably augment it, then
submit it to the telemetry pipeline.
The primary responsibilities of the "low level" portion of the sync manager are:
* Manage the `meta/global`, `crypto/keys` and `info/collections` resources,
and interact with each engine as necessary based on the content of these
resources.
* Manage interaction with the token server.
* Enforce constraints necessary to ensure the entire ecosystem is not
subject to undue load. For example, this component should ignore attempts to
sync continuously, or to sync when the services have requested backoff (see
the sketch after this list).
* Manage the "clients" collection - we probably can't ignore this any longer,
especially for bookmarks (as desktop will send a wipe command on bookmark
restore, and things will "be bad" if we don't see that command).
* Define a minimal "service state" so certain things can be coordinated with
the high-level component. Examples of these states are "everything seems ok",
"the service requested we backoff for some period", "an authentication error
occurred", and possibly others.
* Perform, or coordinate, the actual sync of the Rust implemented engines -
from the containing app's POV, there's a single "sync now" entry-point (in
practice there might be a couple, but conceptually there's a single way to
sync). Note that, as discussed below, how non-Rust implemented engines are
managed is TBD.
* Manage the collection of (but *not* the submission of) telemetry from the
various engines.
* Expose APIs and otherwise coordinate with the high-level component.
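To make the backoff/rate-limiting responsibility concrete, here's a minimal
sketch (all names hypothetical, none of this is final API) of the kind of
guard the low-level component might keep so that sync attempts during a
server-requested backoff period are declined:

```rust
use std::time::{Duration, SystemTime};

// Hypothetical guard for honouring server-requested backoff.
struct BackoffState {
    // When the server last requested backoff, and for how long.
    requested_at: Option<SystemTime>,
    duration: Duration,
}

impl BackoffState {
    // True if a sync attempt should be declined right now.
    fn should_decline(&self, now: SystemTime) -> bool {
        match self.requested_at {
            Some(when) => now < when + self.duration,
            None => false,
        }
    }
}
```
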
Stuff we aren't quite sure where it fits includes:
* Coordination with non-Rust implemented engines. These engines are almost
certainly going to be implemented in the same language as the high-level
component, which will make integration simpler. However, the low-level
component will almost certainly need some information about these engines for
populating info/collections etc. For now, we are punting on this until things
become a bit clearer.
# Implementation Details.
The above has been carefully written to try and avoid implementation details -
the intent is that it's an overview of the architecture without any specific
implementation decisions.
These next sections start getting specific, so implementation choices need to
be made, and thus will possibly be more contentious.
In other words, get your spray-cans ready because there's a bikeshed being built!
However, let's start small and make some general observations.
## Current implementations and challenges with the Rust components
* Some apps only care about a subset of the engines - Lockbox is one such app
and only cares about a single collection/engine. It might be the case that
Lockbox uses a generic application-services library with many engines
available, even though it only wants logins. Thus, the embedding application
is the only thing which knows which engines should be considered to "exist".
It may be that the app layer passes an engine to the sync manager, or that the
sync manager knows via some magic how to obtain these handles.
* Some apps will use a combination of Rust components and "legacy"
engines. For example, iOS is moving some of the engines to using Rust
components, while other engines will be ported after delivery of the
sync manager, if they are ported at all. We also plan to introduce some
Rust engines into desktop without integrating the "sync manager".
* The Rust components themselves are designed to be consumed as individual
components - the "logins" component doesn't know anything about the
"bookmarks" component.
There are a couple of gotchas in the current implementations too - there's an
issue when certain engines don't yet appear in meta/global - see bug 1479929
for all the details.
The tl;dr of the above is that each Rust component should be capable of
working with different sync managers. That said, let's not over-engineer
this and pretend we can design a single, canonical thing that will not need
changing as we consider desktop and iOS.
## State, state and more state. And then some state.
There's loads of state here. The app itself has some state. The high-level
Sync Manager component will have state, the low-level component will have state,
and each engine has state. Some of this state will need to be persisted (either
on the device or on the storage servers) and some of this state can be considered
ephemeral and lives only as long as the app.
A key challenge will be defining this state in a coherent way, with clear
boundaries between the pieces, in an effort to decouple the various bits
so Desktop and iOS can fit into this world.
This state management should also provide the correct degree of isolation for
the various components - each engine should only know about state which
directly impacts how it works. For example, the keys used to encrypt
a collection should only be exposed to that specific engine, and there's no
need for one engine to know what info/collections returns for other engines,
nor whether the device is currently connected to WiFi.
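As an illustration of that isolation (all names here are hypothetical), the
low-level component might hold the full picture while handing each engine only
the narrow view it needs:

```rust
use std::collections::HashMap;

// Hypothetical sketch of state isolation. The low-level component holds
// everything it learned from crypto/keys and info/collections...
struct GlobalSyncState {
    collection_keys: HashMap<String, Vec<u8>>,
    last_modified: HashMap<String, u64>,
}

// ...but an engine only ever sees the slice relevant to its collection.
struct EngineView {
    key: Vec<u8>,
    last_modified: u64,
}

impl GlobalSyncState {
    fn view_for(&self, collection: &str) -> Option<EngineView> {
        Some(EngineView {
            key: self.collection_keys.get(collection)?.clone(),
            last_modified: *self.last_modified.get(collection)?,
        })
    }
}
```
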
A thorn here is for persisted state - it would be ideal if the low-level
component could avoid needing to persist any state, so it can avoid any
kind of storage abstraction. We have a couple of ways of managing this:
* The state which needs to be persisted is quite small, so we could delegate
state storage to the high-level component in an opaque way, as this
high-level component almost certainly already has storage requirements, such
as storing the "choose what to sync" preferences.
* The low-level component could add its own storage abstraction. This would
isolate the high-level component from this storage requirement, but would
add complexity to the sync manager - for example, it would need to be passed
a directory where it should create a file or database.
We'll probably go with the former.
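A minimal sketch of the former option, assuming a hypothetical `KeyValueStore`
that stands in for whatever storage the high-level component already has (eg,
wherever the "choose what to sync" preferences live); the blob stays completely
opaque to the high-level component:

```rust
// Hypothetical storage owned by the high-level component.
trait KeyValueStore {
    fn get(&self, key: &str) -> Option<String>;
    fn set(&mut self, key: &str, value: &str);
}

// Stand-in for the low-level component's sync entry-point, which accepts
// the previous blob and returns a new one.
fn low_level_sync(_previous_state: Option<String>) -> String {
    "opaque-state-blob".to_string()
}

fn sync_once(store: &mut dyn KeyValueStore) {
    // Replay whatever was saved after the previous sync (None the first time).
    let previous = store.get("sync.persisted_state");
    let new_state = low_level_sync(previous);
    // Persist the (possibly updated) opaque blob for next time.
    store.set("sync.persisted_state", &new_state);
}
```
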
# Implementation plan for the low-level component.
Let's try and move into actionable decisions for the implementation. We expect
the implementation of the low-level component to happen first, followed very
closely by the implementation of the high-level component for Android. So we
focus first on these.
## Clients Engine
The clients engine includes some meta-data about each client. We've decided
we can't replace the clients engine with the FxA device record and we can't
simply drop this engine entirely.
Of particular interest is "commands" - these involve communicating with the
engine regarding commands targeting it, and accepting commands to be sent to
other devices. Note that outgoing commands are likely to not originate from a
sync, but instead from other actions, such as "restore bookmarks".
However, because the only current requirement for commands is to wipe the
store, and because you could anticipate "wipe" also being used when remotely
disconnecting a device (eg, when a device is lost or stolen), our lives would
probably be made much simpler by initially supporting only per-engine wipe
commands.
Note that there has been some discussion about not implementing the clients
engine and replacing "commands" with some other mechanism. However, we have
decided not to do that because the implementation isn't considered too
difficult, and because desktop will probably require a number of changes to
remove it (eg, "synced tabs" will not work correctly without a client record
with the same guid as the clients engine.)
Note however that unlike desktop, we will use the FxA device ID as the client
ID. Because FxA device IDs are more ephemeral than sync IDs, it will be
necessary for engines using this ID to track the most-recent ID they synced
with so the old record can be deleted when a change is detected.
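A minimal sketch of that tracking requirement (names hypothetical): each
engine persists the device ID it last synced with, and when a change is
detected, the record written under the old ID becomes a candidate for deletion:

```rust
// Hypothetical sketch: detect that our FxA device ID changed since the last
// sync, so the client record written under the old ID can be deleted.
struct ClientsState {
    // The FxA device ID used as our client ID on the previous sync,
    // persisted between runs; None before the first sync.
    last_synced_device_id: Option<String>,
}

impl ClientsState {
    // Returns the stale client ID to delete (if any), then records the
    // current ID for next time.
    fn note_device_id(&mut self, current: &str) -> Option<String> {
        let stale = match &self.last_synced_device_id {
            Some(old) if old != current => Some(old.clone()),
            _ => None,
        };
        self.last_synced_device_id = Some(current.to_string());
        stale
    }
}
```
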
## Collections vs engines vs stores vs preferences vs APIs
For the purposes of the sync manager, we define:
* An *engine* is the unit exposed to the user - an "engine" can be enabled
or disabled. There is a single set of canonical "engines" used across the
entire sync ecosystem - ie, desktop and mobile devices all need to agree
about what engines exist and what the identifier for an engine is.
* An *API* is the unit exposed to the application layer for general
application functionality. Application-services has 'places' and 'logins'
APIs, and these are what the application uses to store and fetch items. Each
'API' may have one or more 'stores' (although the application layer will
generally not interact directly with a store).
* A *store* is the code which actually syncs. This is largely an
implementation detail. There may be multiple stores per engine (eg, the
"history" engine may have "history" and "forms" stores) and a single 'API'
may expose multiple stores (eg, the 'places' API will expose history and
bookmarks stores).
* A *collection* is a unit of storage on a server. It's even more of an
implementation detail than a store. For example, you might imagine a future
where the "history" store uses multiple "collections" to help with containers.
In practice, this means that the high-level component should only need to care
about an *engine* (for exposing a choice of what to sync to the user) and an
*API* (for interacting with the data managed by that API). The low-level
component will manage the mapping of engines to stores.
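As a concrete illustration of that mapping (the store names follow the
examples above, but the real mapping is a hypothetical implementation detail
of the low-level component):

```rust
// The user-visible engines, matching the straw-man at the end of this doc.
enum Engine {
    History,
    Bookmarks,
    Passwords,
}

// Hypothetical engine-to-store mapping, private to the low-level component.
fn stores_for_engine(engine: &Engine) -> Vec<&'static str> {
    match engine {
        // One engine may map to multiple stores...
        Engine::History => vec!["history", "forms"],
        // ...or to exactly one.
        Engine::Bookmarks => vec!["bookmarks"],
        Engine::Passwords => vec!["passwords"],
    }
}
```
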
## The declined list
This document isn't going to outline the history of how "declined" is used, nor
talk about how this might change in the future. For the purposes of the sync
manager, we have the following hard requirements:
* The low-level component needs to know what the currently declined set of
engines is for the purposes of re-populating `meta/global`.
* The low-level component needs to know when the declined set needs to change
based on user input (ie, when the user opts in to or out of a particular
engine on this device)
* The high-level component needs to be aware that the set of declined engines
may change on every sync (ie, when the user opts in to or out of a particular
engine on another device)
A complication is that due to networks being unreliable, there's an inherent
conflict between "what is the current state?" and "what state changes are
requested?". For example, if the user changes the state of an engine while
there's no network, then exits the app, how do we ensure the user's new state
is updated next time the app starts? What if the user has since made a
different state request on a different device? Is the state as last-known on
this device considered canonical?
To clarify, consider:
* User on this device declines logins. This device now believes logins is
disabled but history is enabled, but is unable to write this to the server
due to no network.
* The user declines history on a different device, but doesn't change logins.
This device does manage to write the new list to the server.
* This device restarts and the network is up. It believes history is enabled
but logins is not - however, the state on the server is the exact opposite.
How does this device react?
(On the plus side, this is an extreme edge-case which none of our existing
implementations handle "correctly" - which is easy to say, because there's
no real definition for "correctly")
Regardless, the low-level component will not pretend to hide this complexity
(ie, it will ignore it!). The low-level component will allow the app to ask
for state changes as part of a sync, and will return what's on the server at
the end of every sync. The app is then free to build whatever experience
it desires around this.
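Concretely, using the straw-man structures from the end of this document, the
flow might look like this sketch: the app queues the user's change to ride
along with the next sync, then treats whatever comes back as the new canonical
set (`update_choose_what_to_sync_ui` is a hypothetical UI hook):

```rust
// Sketch only - SyncManager, SyncParams etc are the straw-man structures
// defined at the end of this document.
fn user_declined_logins(manager: &SyncManager) -> Result<()> {
    let params = SyncParams {
        engines: None, // sync everything
        reason: SyncReason::User,
        // Request the state change as part of this sync...
        engine_state_changes: vec![EngineState {
            engine: Engine::Passwords,
            enabled: false,
        }],
        persisted_state: None, // persisted-state handling elided here
    };
    let result = manager.sync(params)?;
    // ...and treat whatever the server now says as canonical, even if it
    // differs from what this device asked for.
    if let Some(declined) = result.declined_engines {
        update_choose_what_to_sync_ui(&declined); // hypothetical UI hook
    }
    Ok(())
}
```
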
## Disconnecting from Sync
The low-level component needs to have the ability to disconnect all engines
from Sync. Engines which are declined should also be reset.
Because we will need wipe() functionality to implement the clients engine,
and because Lockbox wants to wipe on disconnect, we will provide disconnect
and wipe functionality.
# Specific deliverables for the low-level component.
Breaking the above down into actionable tasks which can be worked on somewhat
concurrently, we will deliver:
## The API
A straw-man for the API we will expose to the high-level components. This
probably isn't too big, but we should do this as thoroughly as we can. In
particular, ensure we have covered:
* Declined management - how the app changes the declined list and how it learns
of changes from other devices.
* How telemetry gets handed from the low-level to the high-level.
* The "state" - in particular, how the high-level component understands the
auth state is wrong, and whether the service is in a degraded mode (eg,
server requested backoff)
* How the high-level component might specify "special" syncs, such as "just
one engine" or "this is a pre-sleep, quick-as-possible sync", etc
There's a straw-man proposal for this at the end of the document.
## A command-line (and possibly Android) utility.
We should build a utility (or two) which can stand in for the high-level
component, for testing and demonstration purposes.
This is something like places-utils.rs and the little utility Grisha has
been using. This utility should act like a real client (ie, it should have
an FxA device record, etc) and it should use the low-level component in
exactly the same way we expect real products to use it.
Because it is just a consumer of the low-level component, it will force us to
confront some key issues, such as how to get references to engines stored in
multiple crates, how to present a unified "state" for things like auth errors,
etc.
## The "clients" engine
The initial work for the clients engine can probably be done without too
much regard for how things are tied together - for example, much work could
be done without caring how we get a reference to engines across crates.
## State work
Implementing what's needed so we can expose the correct state to the high-level
manager for things like auth errors, backoff semantics, etc.
## Tie it together and other misc things.
There will be lots of loose ends to clean up - things like telemetry, etc.
# Follow-up with non-Rust engines.
We have identified that iOS will, at least in the short term, want the
sync manager to be implemented in Swift. This will be responsible for
syncing both the Swift and Rust implemented engines.
At some point in the future, Desktop may do the same - we will have both
Rust and JS implemented engines which need to be coordinated. We ignore this
requirement for now.
This approach still has a fairly easy time coordinating with the Rust
implemented engines - the FFI will need to expose the relevant sync
entry-points to be called by Swift, but the Swift code can hard-code the
Rust engines it has and make explicit calls to these entry-points.
This Swift code will need to create the structures identified below, but this
shouldn't be too much of a burden as it already has the information necessary
to do so (ie, it already has info/collections etc)
TODO: dig into the Swift code and make sure this is sane.
# Details
While we use Rust struct definitions here, it's important to keep in mind
that, as mentioned above, we'll need to support the manager being written in
something other than Rust, and to support engines written in languages other
than Rust.
The structures below are a straw-man, but hopefully capture all the information
that needs to be passed around.
```rust
// We want to define a list of "engine IDs" - ie, canonical strings which
// refer to what the user perceives as an "engine" - but as above, these
// *do not* correspond 1:1 with either "stores" or "collections" (eg, "history"
// refers to 2 stores, and in a future world, might involve 3 collections).
enum Engine {
    History,   // The "History" and "Forms" stores.
    Bookmarks, // The "Bookmark" store.
    Passwords,
}

impl Engine {
    fn as_str(&self) -> &'static str {
        match self {
            Engine::History => "history",
            Engine::Bookmarks => "bookmarks",
            Engine::Passwords => "passwords",
        }
    }
}

// A struct which reflects an engine's enabled/declined state.
struct EngineState {
    engine: Engine,
    enabled: bool,
}

// A straw-man for the reasons why a sync is happening.
enum SyncReason {
    Scheduled,
    User,
    PreSleep,
    Startup,
}

// A straw-man for the general status.
enum ServiceStatus {
    Ok,
    // Some general network issue.
    NetworkError,
    // Some apparent issue with the servers.
    ServiceError,
    // Some external FxA action needs to be taken.
    AuthenticationError,
    // We declined to do anything for backoff or rate-limiting reasons.
    BackedOff,
    // Something else - you need to check the logs for more details.
    OtherError,
}

// Info we need from FxA to sync. This is roughly our Sync15StorageClientInit
// structure with the FxA device ID.
struct AccountInfo {
    key_id: String,
    access_token: String,
    tokenserver_url: Url,
    device_id: String,
}

// Instead of massive param and result lists, we use structures.
// This structure is passed to each and every sync.
struct SyncParams {
    // The engines to Sync. None means "sync all".
    engines: Option<Vec<Engine>>,
    // Why this sync is being done.
    reason: SyncReason,
    // Any state changes which should be done as part of this sync.
    engine_state_changes: Vec<EngineState>,
    // An opaque state "blob". This should be persisted by the app so it is
    // reused next sync.
    persisted_state: Option<String>,
}

struct SyncResult {
    // The general health.
    service_status: ServiceStatus,
    // The result for each engine.
    engine_results: HashMap<Engine, Result<()>>,
    // The list of declined engines, or None if we failed to get that far.
    declined_engines: Option<Vec<Engine>>,
    // When we are allowed to sync again. If > now() then there's some kind
    // of back-off. Note that it's not strictly necessary for the app to
    // enforce this (ie, it can keep asking us to sync, but we might decline).
    // But we might not too - eg, we might try a user-initiated sync.
    next_sync_allowed_at: Timestamp,
    // New opaque state which should be persisted by the embedding app and
    // supplied the next time Sync is called.
    persisted_state: String,
    // Telemetry. Nailing this down is TBD.
    telemetry: Option<JSONValue>,
}

struct SyncManager {}

impl SyncManager {
    // Initialize the sync manager with the set of Engines known by this
    // application without regard to the enabled/declined states.
    // XXX - still TBD is how we will pass "stores" around - it may be that
    // this function ends up taking an `impl Store`.
    fn init(&self, engines: Vec<&str>) -> Result<()>;

    fn sync(&self, params: SyncParams) -> Result<SyncResult>;

    // Interrupt any current syncs. Note that this can be called from a
    // different thread.
    fn interrupt() -> Result<()>;

    // Disconnect this device from sync. This may "reset" the stores, but will
    // not wipe local data.
    fn disconnect(&self) -> Result<()>;

    // Wipe all local data for all local stores. This can be done after
    // disconnecting.
    // There's no exposed way to wipe the remote store - while it's possible
    // stores will want to do this, there's no need to expose this to the user.
    fn wipe(&self) -> Result<()>;
}
```
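To show how these pieces are expected to hang together, here's a hypothetical
driver for the straw-man API above - roughly what the command-line utility or
a high-level component would do. `load_state`, `save_state`,
`notify_auth_problem` and `reschedule` are stand-ins for app-side
functionality, not part of the proposed API:

```rust
// Hypothetical driver for the straw-man API above.
fn sync_now(manager: &SyncManager) -> Result<()> {
    let params = SyncParams {
        engines: None, // sync all engines
        reason: SyncReason::User,
        engine_state_changes: vec![],
        persisted_state: load_state(), // the blob saved after the last sync
    };
    let result = manager.sync(params)?;
    // Always persist the new opaque blob for next time.
    save_state(&result.persisted_state);
    match result.service_status {
        // The high-level component owns FxA, so it takes action here.
        ServiceStatus::AuthenticationError => notify_auth_problem(),
        // Don't ask for another sync before `next_sync_allowed_at`.
        ServiceStatus::BackedOff => reschedule(result.next_sync_allowed_at),
        _ => {}
    }
    Ok(())
}
```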