A perceptual hash is a fingerprint of a multimedia file derived from features of its content. Unlike cryptographic hash functions, which rely on the avalanche effect (a small change in the input produces a drastic change in the output), perceptual hashes are "close" to one another if the underlying features are similar.
Perceptual hashes must be robust enough to withstand transformations or "attacks" on a given input, yet sensitive enough to distinguish between dissimilar files. Such attacks can include rotation, skew, contrast adjustment and different compression formats. These challenges make perceptual hashing an interesting field of study at the forefront of computer science research.
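To make the contrast concrete, here is a minimal Python sketch. It is not pHash code: the 64-bit perceptual hash values are made up for illustration, and closeness is assumed to be measured by Hamming distance (the number of differing bits), which is one common choice for fixed-length binary hashes.

```python
import hashlib


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two integers differ."""
    return bin(a ^ b).count("1")


def sha256_as_int(data: bytes) -> int:
    """Interpret a SHA-256 digest as one 256-bit integer."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


# Cryptographic hash: changing one character of the input changes
# a large share of the 256 output bits (the avalanche effect).
crypto_distance = hamming_distance(
    sha256_as_int(b"perceptual hash"),
    sha256_as_int(b"perceptual hasi"),
)

# Perceptual hash: hypothetical 64-bit values for an image and a
# slightly altered copy. Only two bits differ, so they count as "close".
original = 0x8F3C5A1297E4B06D
altered = original ^ 0b101
perceptual_distance = hamming_distance(original, altered)

print("SHA-256 bits changed:", crypto_distance, "of 256")
print("perceptual bits changed:", perceptual_distance, "of 64")
```

A small distance between two perceptual hashes suggests similar content; how small is "small enough" depends on the hash and the application.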
pHash is an open source software library released under the GPLv3 license that implements several perceptual hashing algorithms, and provides a C-like API to use those functions in your own programs. pHash itself is written in C++.
01.28.2010 pHash 0.8.1 released. Minor bug fixes for MH image hash and compilation with older gcc releases.
01.25.2010 pHash 0.8.0 released. A new perceptual hash has been added based on the Marr/Mexican hat wavelet, the JNI has been greatly improved, and several bugs have been fixed.
12.23.2009 pHash 0.7.2 released. Fixed a bug when building on systems where mremap is not present.
12.20.2009 pHash 0.7.1 released. Updated the Java bindings to use the new DCT video hash, removed the need for FFTW, included a spec file for creating RPMs, and cleaned up the code in general.
12.12.2009 pHash 0.7 released. Fixed a bug in the perceptual text hash to make the hash truly cyclic (credit to Xiaofan Lin for discovering the bug). The library now works with the latest CImg versions, as well as on Windows and BSD systems.
10.07.2009 pHash 0.6 released. The new release contains a variable length DCT video hash which supersedes the previous video hash.
09.12.2009 The end-stopped wavelets are taking longer than anticipated, so in the meantime we've been devoting more time to improving the video hash to handle longer videos. Look for it in the 0.6 release.
07.22.2009 The next version of pHash is in the works and will include a new image hash based on Gabor and end-stopped wavelets, leading to better feature extraction. We will also be improving the video hash to account for longer videos. Stay tuned!
07.02.2009 pHash 0.5 released.
06.29.2009 Custom index technique added for quick storage, search and retrieval of all hash values within a given distance of a query. This technique uses a specially developed file format for persistent storage and can be used for virtually any size hash and distance metric. Preliminary testing reveals a 300% improvement in search time over a simple linear search. For image or audio hashes, additional storage amounts to less than 0.05% of the space used by the actual files. To be included in the 0.5 release!
06.22.2009 Support for text hashing is now in the library. Although support is currently limited to plain-text documents encoded in UTF-8, the functions allow for a quick scan of documents to find string matches and their offsets. Expect this in the next release.
06.05.2009 Changed the build system to use the GNU autoconf tools. This should make the pHash library and program files easier to build and install.
04.15.2009 Java bindings for all pHash library functions.
02.03.2009 pHash now supports hashing for audio files. Derived from frequency spectrum data along the Bark scale, this hash is based on characteristics that tend to be the most prominent for the human auditory system. Furthermore, the number of hashes generated per file varies according to the number of samples in the audio file, so short clips can be matched to longer sound files. Naturally, the longer the clip, the more successful the match will be. So far, this has proven to work well with 30 second music clips altered by MP3 compression, simulated telephone filtering, or both.
11.04.2008 The DCT hash method has been adapted to video. This is useful for short video clips only, since the entire video is condensed to a fixed length hash.
10.24.2008 Support for an image hash based on the discrete cosine transform. The DCT is a quick and efficient method to compute a hash based on frequency data of the underlying image. While it is generally not sophisticated enough to identify visually similar images in any semantically meaningful way, it is fairly robust against minor distortions of the image, such as blurring, rotation and different compression formats.
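As a rough illustration of how a DCT-based image hash can work, the Python sketch below follows one common formulation; it is not taken from the pHash sources. It assumes the image has already been converted to grayscale and reduced to a 32x32 grid, keeps an 8x8 block of low-frequency coefficients (skipping the first row and column, which carry the overall brightness), and sets one bit per coefficient depending on whether it lies above the median. The grid size, block size, median threshold and test image are all assumptions.

```python
import math
import statistics


def dct_2d(pixels):
    """Naive, unnormalized 2D DCT-II of a square grid (list of rows)."""
    n = len(pixels)
    # basis[k][i] is the k-th cosine basis function sampled at position i.
    basis = [[math.cos(math.pi / n * (i + 0.5) * k) for i in range(n)]
             for k in range(n)]
    # Transform every row, then every column of the row-transformed grid.
    rows = [[sum(row[x] * basis[u][x] for x in range(n)) for u in range(n)]
            for row in pixels]
    return [[sum(rows[y][u] * basis[v][y] for y in range(n)) for u in range(n)]
            for v in range(n)]


def dct_hash(pixels, block=8):
    """Pack one bit per low-frequency coefficient into an integer."""
    coeffs = dct_2d(pixels)
    low = [coeffs[v][u]
           for v in range(1, block + 1)
           for u in range(1, block + 1)]
    median = statistics.median(low)
    value = 0
    for c in low:
        value = (value << 1) | int(c > median)
    return value


# Synthetic 32x32 "image" and a uniformly brightened copy of it.
image = [[(7 * x + 13 * y + x * y) % 256 for x in range(32)] for y in range(32)]
brighter = [[p + 10 for p in row] for row in image]

distance = bin(dct_hash(image) ^ dct_hash(brighter)).count("1")
print("bits changed by brightening:", distance, "of 64")
```

A uniform brightness change mostly affects the coefficient that the sketch skips, which is why the two hashes stay close. A real implementation would also have to handle decoding, grayscale conversion and resizing, and would use a fast DCT rather than this naive one.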
Potential applications include copyright protection, similarity search for media files, or even digital forensics. For example, YouTube could maintain a database of hashes submitted by major movie producers for the movies to which they hold the copyright. If a user then uploads one of those videos to YouTube, its hash will be almost identical to the stored one, and the upload can be flagged as a possible copyright violation. The audio hash could be used to automatically tag MP3 files with proper ID3 information, while the text hash could be used for plagiarism detection.
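The flagging scenario can be sketched in a few lines of Python. This is a simplified illustration, not pHash's API: it assumes each item is summarized by a single 64-bit hash, uses a plain linear scan, and the labels, hash values and distance threshold of 10 are hypothetical.

```python
def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two integers differ."""
    return bin(a ^ b).count("1")


def find_matches(query, known_hashes, max_distance=10):
    """Return (distance, label) pairs within max_distance, closest first.

    This is a simple linear scan over every stored hash.
    """
    hits = [(hamming_distance(query, stored), label)
            for label, stored in known_hashes.items()]
    return sorted(hit for hit in hits if hit[0] <= max_distance)


# Hypothetical database of hashes submitted by rights holders.
known = {
    "movie_a": 0x0F0F0F0F0F0F0F0F,
    "movie_b": 0xF0F0F0F0F0F0F0F0,
}

# An upload whose hash differs from movie_a in two bits,
# for example after re-encoding.
upload = known["movie_a"] ^ 0b1001

flagged = find_matches(upload, known)
print(flagged)
```

The scan touches every stored hash, so for a large database it would be replaced by an index that supports queries within a given distance, such as the one described in the 06.29.2009 entry above.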
Have another use for pHash? Let us know!