0002-database-corruption.md

# Handling Database Corruption

* Status: accepted

* Date: 2021-06-08

## Context and Problem Statement

Some of our users have corrupt SQLite databases and this makes the related

component unusable.  The best way to deal with corrupt databases is to simply

delete the database and start fresh (#2628).  However, we only want to do this

for persistent errors, not transient errors like programming logic errors, disk

full, etc.  This ADR deals with 2 related questions:

  * A) When and how do we identify corrupted databases?

  * B) What do we do when we identify corrupted databases?

## Decision Drivers

* Deleting valid user data should be avoided at almost any cost

* Keeping a corrupted database around is almost as bad.  It currently prevents

  the component from working at all.

* We don't currently have a good way to distinguish between persistent and

  transient errors, but this can be improved by reviewing telemetry and sentry

  data.

## Considered Options

* A) When and how do we identify corrupted databases?

  * 1: Assume all errors when opening a database are from corrupt databases

  * 2: Check for errors when opening a database and compare against known corruption error types

  * 3: Check for errors for all database operations and compare against known corruption error types

* B) What do we do when we identify corrupted databases?

  * 1: Delete the database file and recreate the database

  * 2: Move the database file and recreate the database

  * 3: Have the component fail

## Decision Outcome

* A2: Check for errors when opening a database and compare against known corruption error types

* B1: Delete the database file and recreate the database

Decision B follows from the choice of A.  Since we're being conservative in

identifying errors, we can delete the database file with relative confidence.

"Check for errors for all database operations and compare against known

corruption error types" also seems like a reasonable solution that we may

pursue in the future, but we decided to wait for now.  Checking for errors

during opening time is the simpler solution to implement and should fix the

issue in many cases.  The plan is to implement that first, then monitor

sentry/telemetry to decide what to do next.

# Pros and Cons of the Options

### A1: Assume all errors when opening a database are from corrupt databases

* Good, because the sentry data indicates that many errors happen during opening time

* Good, because migrations are especially likely to trigger corruption errors

* Good, because it's a natural time to delete the database -- the consumer code

  hasn't run any queries yet and doesn't have any open connections.

* Bad, because it will delete valid user data in several situations that are

  relatively common: migration logic errors, OOM errors, Disk full.

### A2: Check for errors when opening a database and compare against known corruption error types (Decided)

* Good, because should eliminate the possibility of deleting valid user data.

* Good, because the sentry data indicates that many errors happen during opening time

* Good, because it's a natural time to delete the database -- the consumer code

  hasn't run any queries yet and doesn't have any open connections.

* Bad, because we don't currently have a good list corruption errors

### A3: Check for errors for all database operations and compare against known corruption error types

* Good, because the sentry data indicates that many errors happen outside of opening time

* Good, because should eliminate the possibility of deleting valid user data.

* Bad, because the consumer code probably doesn't expect the database to be

  deleted and recreated in the middle of a query.  However, this is just an

  extreme case of normal database behavior -- for example any given row can be

  deleted during a sync.

* Bad, because we don't currently have a good list corruption errors

### B1: Delete the database file and recreate the database (Decided)

* Good, because it would allow users with corrupted databases to use the

  affected components again

* Bad, because any misidentification will lead to data loss.

### B2: Move the database file and recreate the database

This option would be similar to 1, but instead of deleting the file we would

move it to a backup location.  When we started up, we could look for backup

files and try to import lost data.

* Good, because if we misidentify corrupt databases, then we have the

  possibility of recovering the data

* Good, because it allows a way for users to delete their data (in theory).

  If the consumer code executed a `wipe()` on the database, we could also

  delete any backup data.

* Bad, because it's very difficult to write a recovery function that merged

  deleted data with any new data.  This function would be fairly hard to test

  and it would be easy to introduce a new logic error.

* Bad, because it adds significant complexity to the database opening code

* Bad, because the user experience would be strange.  A user would open the

  app, discover that their data was gone, then sometime later discover that the

  data is back again.

### B3: Return a failure code

* Good, because this option leaves no chance of user data being deleted

* Good, because it's the simplest to implement

* Bad, because the component will not be usable if the database is corrupt

* Bad, because the user's data is potentially exposed in the corrupted database

  file and we don't provide any way for them to delete it.