Source code
Revision control
Copy as Markdown
Other Tools
//! SeaHash: A blazingly fast, portable hash function with proven statistical guarantees.
//!
//! SeaHash is a hash function with performance better than (around 3-20% improvement) xxHash and
//! MetroHash. Furthermore, SeaHash has mathematically provable statistical guarantees.
//!
//! SeaHash is a portable hash function, meaning that the output is not dependent on the hosting
//! architecture, and makes no assumptions on endianness or the alike. This stable layout allows it
//! to be used for on-disk/permanent storage (e.g. checksums).
//!
//! # Design, advantages, and features
//!
//! - **High quality**: It beats most other general purpose hash functions because it provides full
//! avalanche inbetween state updates.
//! - **Performance**: SeaHash beats every high-quality (grading 10/10 in smhasher) hash function
//! that I know of.
//! - **Provable quality guarantees**: Contrary to most other non-cryptographic hash function,
//! SeaHash can be proved to satisfy the avalanche criterion as well as BIC.
//! - **Parallelizable**: Consists of multiple, independent states to take advantage of ILP and/or
//! software threads.
//! - **Bulk reads**: Reads 8 or 4 bytes a time.
//! - **Stable and portable**: Does not depend on the target architecture, and produces a stable
//! value, which is only changed in major version bumps.
//! - **Keyed**: Designed to not leak the seed/key. Note that it has not gone through
//! cryptoanalysis yet, so the keyed version shouldn't be relied on when security is needed.
//! - **Hardware accelerateable**: SeaHash is designed such that ASICs can implement it with really
//! high performance.
//!
//! # A word of warning!
//!
//! This is **not** a cryptographic function, and it certainly should not be used as one. If you
//! want a good cryptographic hash function, you should use SHA-3 (Keccak) or BLAKE2.
//!
//! It is not secure, nor does it aim to be. It aims to have high quality pseudorandom output and
//! few collisions, as well as being fast.
//!
//! # Benchmark
//!
//! On normal hardware, it is expected to run with a rate around 5.9-6.7 GB/S on a 2.5 GHz CPU.
//! Further improvement can be seen when hashing very big buffers in parallel.
//!
//! | Function | Quality | Cycles per byte (lower is better) | Author
//! |-------------|---------------|-----------------------------------|-------------------
//! | **SeaHash** | **Excellent** | **0.24** | **Ticki**
//! | xxHash | Excellent | 0.31 | Collet
//! | MetroHash | Excellent | 0.35 | Rogers
//! | Murmur | Excellent | 0.64 | Appleby
//! | Rabin | Medium | 1.51 | Rabin
//! | CityHash | Excellent | 1.62 | Pike, Alakuijala
//! | LoseLose | Terrible | 2.01 | Kernighan, Ritchie
//! | FNV | Poor | 3.12 | Fowler, Noll, Vo
//! | SipHash | Pseudorandom | 3.21 | Aumasson, Bernstein
//! | CRC | Good | 3.91 | Peterson
//! | DJB2 | Poor | 4.13 | Bernstein
//!
//! ## Ideal architecture
//!
//! SeaHash is designed and optimized for the most common architecture in use:
//!
//! - Little-endian
//! - 64-bit
//! - 64 or more bytes cache lines
//! - 4 or more instruction pipelines
//! - 4 or more 64-bit registers
//!
//! Anything that does not hold the above requirements will perform worse by up to 30-40%. Note that
//! this means it is still faster than CityHash (~1 GB/S), MurMurHash (~2.6 GB/S), FNV (~0.5 GB/S),
//! etc.
//!
//! # Achieving the performance
//!
//! Like any good general-purpose hash function, SeaHash reads 8 bytes at once effectively reducing
//! the running time by an order of ~5.
//!
//! Secondly, SeaHash achieves the performance by heavily exploiting Instruction-Level Parallelism.
//! In particular, it fetches 4 integers in every round and independently diffuses them. This
//! yields four different states, which are finally combined.
//!
//! # Statistical guarantees
//!
//! SeaHash comes with certain proven guarantees about the statistical properties of the output:
//!
//! 1. Pick some _n_-byte sequence, _s_. The number of _n_-byte sequence colliding with _s_ is
//! independent of the choice of _s_ (all equivalence class have equal size).
//! 2. If you flip any bit in the input, the probability for any bit in the output to be flipped is
//! 0.5.
//! 3. The hash value of a sequence of uniformly distributed bytes is itself uniformly distributed.
//!
//! The first guarantee can be derived through deduction, by proving that the diffusion function is
//! bijective (reverse the XORs and find the congruence inverses to the primes).
//!
//! The second guarantee requires more complex calculations: Construct a matrix of probabilities
//! and set one to certain (1), then apply transformations through the respective operations. The
//! proof is a bit long, but relatively simple.
//!
//! The third guarantee requires proving that the hash value is a tree, such that:
//! - Leafs represents the input values.
//! - Single-child nodes reduce to the diffusion of the child.
//! - Multiple-child nodes reduce to the sum of the children.
//!
//! Then simply show that each of these reductions transform uniformly distributed variables to
//! uniformly distributed variables.
//!
//! # Inner workings
//!
//! In technical terms, SeaHash follows a alternating 4-state length-padded Merkle–Damgård
//! construction with an XOR-diffuse compression function (click to enlarge):
//!
//! []
//!
//! It starts with 4 initial states, then it alternates between them (increment, wrap on 4) and
//! does XOR with the respective block. When a state has been visited the diffusion function (f) is
//! applied. The very last block is padded with zeros.
//!
//! After all the blocks have been gone over, all the states are XOR'd to the number of bytes
//! written. The sum is then passed through the diffusion function, which produces the final hash
//! value.
//!
//! The diffusion function is drawn below.
//!
//! ```notest
//! x ← px
//! x ← x ⊕ ((x ≫ 32) ≫ (x ≫ 60))
//! x ← px
//! ```
//!
//! The advantage of having four completely segregated (note that there is no mix round, so they're
//! entirely independent) states is that fast parallelism is possible. For example, if I were to
//! hash 1 TB, I can spawn up four threads which can run independently without _any_
//! intercommunication or synchronization before the last round.
//!
//! If the diffusion function (f) was cryptographically secure, it would pass cryptoanalysis
//! trivially. This might seem irrelevant, as it clearly isn't cryptographically secure, but it
//! tells us something about the inner semantics. In particular, any diffusion function with
//! sufficient statistical quality will make up a good hash function in this construction.
//!
//! Read [the blog post](http://ticki.github.io/blog/seahash-explained/) for more details.
//!
//! # ASIC version
//!
//! SeaHash is specifically designed such that it can be efficiently implemented in the form of
//! ASIC while only using very few transistors.
//!
//! # Specification
//!
//! See the [`reference`](./reference) module.
//!
//! # Credits
//!
//! Aside for myself (@ticki), there are couple of other people who have helped creating this.
//! Joshua Landau suggested using the [PCG family of diffusions](http://www.pcg-random.org/),
//! created by Melissa E. O'Neill. Sokolov Yura spotted multiple bugs in SeaHash.
#![warn(missing_docs)]
#![cfg_attr(all(not(test), not(feature = "use_std")), no_std)]
#[cfg(all(not(test), not(feature = "use_std")))]
extern crate core as std;
pub use buffer::{hash, hash_seeded, State};
pub use stream::SeaHasher;
mod buffer;
mod helper;
pub mod reference;
mod stream;
#[cfg(feature = "use_std")]
mod impl_std;