What is a perceptual hash?

A perceptual hash is a fingerprint of a multimedia file derived from features of its content. Unlike cryptographic hash functions, which rely on the avalanche effect (a small change in the input leads to a drastic change in the output), perceptual hashes are "close" to one another if the features are similar.
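The contrast can be illustrated with a small sketch. It is not pHash code: the "image" is a made-up 8x8 grayscale gradient, and the perceptual hash is a toy average hash (one bit per pixel, set when the pixel is brighter than the mean) chosen only because it fits in a few lines. The names average_hash, sha256_int and hamming are hypothetical helpers.

```python
import hashlib

def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two integers differ."""
    return bin(a ^ b).count("1")

def average_hash(pixels):
    """Toy perceptual hash: one bit per pixel, set when the pixel is above the mean."""
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def sha256_int(pixels):
    """SHA-256 digest of the raw pixel bytes, as a 256-bit integer."""
    return int.from_bytes(hashlib.sha256(bytes(pixels)).digest(), "big")

# Hypothetical 8x8 grayscale image: a horizontal gradient, repeated on every row.
original = [x * 36 for _ in range(8) for x in range(8)]

# The same image with a single pixel brightened slightly.
altered = list(original)
altered[10] += 3

perceptual_distance = hamming(average_hash(original), average_hash(altered))
crypto_distance = hamming(sha256_int(original), sha256_int(altered))

print("perceptual hash bits changed:", perceptual_distance, "of 64")
print("SHA-256 bits changed:", crypto_distance, "of 256")
```

The toy hash does not change for this edit, while the SHA-256 digest changes in a large share of its bits. Real perceptual hashes use more robust features than a plain mean threshold.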

Relevance of Perceptual Hashing

Perceptual hashes must be robust enough to withstand transformations or "attacks" on a given input, yet sensitive enough to distinguish between dissimilar files. Such attacks can include rotation, skew, contrast adjustment, and different compression formats. These challenges make perceptual hashing an interesting field of study and an active area of computer science research.

What is pHash?

pHash is an open source software library released under the GPLv3 license that implements several perceptual hashing algorithms, and provides a C-like API to use those functions in your own programs. pHash itself is written in C++.

News and Updates:

04.23.2013 pHash 0.9.6 released. Fixed some compilation errors and warnings, and updated the automake files to support building on Gentoo.

11.23.2012 pHash 0.9.5 released. Fixed a compilation problem caused by the use of deprecated FFmpeg functions.

10.20.2011 Cumulix 1.0. Cumulix is an extremely fast and scalable cloud-based image search and retrieval system based on pHash Pro and Neo4j.

01.31.2011 pHash 0.9.4 released. Added the radial image hash to the Java bindings, fixed compilation on Mac OS X with the complex header type, and fixed the examples' linking to pthread.

12.24.2010 MVPTree v1.0 New download available. The MVPTree is a generic distance-based indexing structure to store n-dimensional data points. The distance function is configurable as well as the type of data.

10.29.2010 AudioScout v1.0 audio content indexing software released! A scalable audio content indexing solution for managing a collection of audio files, AudioScout is a set of distributed servers that index audio signals based on low-level features in the signal, not simply on the filename or metadata. This makes it well suited to uses such as duplicate detection or copyright protection. It works on both music and speech content. Read the preliminary paper or contact us for further details.

08.15.2010 pHash 0.9.3 released. Fixed a bug with the auxiliary header file causing mp3 support to break.

08.15.2010 pHash 0.9.2 released. Fixed a bug in the audio perceptual hash when converting from stereo to mono for WAV/Ogg/FLAC audio files.

06.15.2010 pHash 0.9.1 released. Removed the dependency on FFmpeg for audio functions (now using the libmpg123, libsndfile and libsamplerate libraries), cleaned up the Java bindings, fixed a bug in determining the number of CPUs on Mac OS X, fixed a bug in the multi-threaded image, audio and video functions, and added preliminary bindings for PHP and C#.

03.28.2010 pHash 0.9.0 released. Multithreading support added for hash functions, the audio hash can now read Ogg and FLAC files, and the image hash can handle RGBA files. Fixed a heap corruption bug in the MVP storage functions.

01.28.2010 pHash 0.8.1 released. Minor bug fixes for MH image hash and compilation with older gcc releases.

01.25.2010 pHash 0.8.0 released. A new perceptual hash has been added based on the Marr/Mexican hat wavelet, the JNI has been greatly improved, and several bugs have been fixed.

12.23.2009 pHash 0.7.2 released. Fixed a bug when building on systems where mremap is not present.

12.20.2009 pHash 0.7.1 released. Updated the Java bindings to use the new DCT video hash, removed the need for FFTW, included a spec file for creating RPMs, and did general code cleanup.

12.12.2009 pHash 0.7 released. Fixed a bug in the perceptual text hash to make the hash truly cyclic (credit to Xiaofan Lin for discovering the bug). The library now works with the latest CImg versions, as well as on Windows and BSD systems.

10.07.2009 pHash 0.6 released. The new release contains a variable-length DCT video hash which supersedes the previous video hash.

09.12.2009 The end-stopped wavelets are taking longer than anticipated, so in the meantime we've been devoting more time to improving the video hash to handle longer videos. Look for it in the 0.6 release.

07.22.2009 The next version of pHash is in the works and will include a new image hash based on Gabor and end-stopped wavelets, leading to better feature extraction. We will also be improving the video hash to account for longer videos. Stay tuned!

07.02.2009 pHash 0.5 released.

06.29.2009 Custom index technique added for quick storage, search and retrieval of all hash values within a given distance of a query. This technique uses a specially developed file format for persistent storage and can be used for virtually any size hash and distance metric. Preliminary testing reveals a 300% improvement in search time over a simple linear search. For image or audio hashes, additional storage amounts to less than 0.05% of the space used by the actual files. To be included in the 0.5 release!

06.22.2009 Support for textual hashing is now in the library. Although support is limited to plain-text documents encoded in UTF-8 for now, the functions allow for a quick scan of documents to find string matches and their offsets. Expect this in the next release.

06.05.2009 Changed the build system to use the GNU autoconf tools. This should make it easier to build and install the pHash library and program files.

04.15.2009 Java bindings for all pHash library functions.

02.03.2009 pHash now supports hashing for audio files. Derived from frequency spectrum data along the Bark scale, this hash is based on characteristics that tend to be the most prominent for the human auditory system. Furthermore, the number of hashes generated per file varies according to the number of samples in the audio file, so short clips can be matched to longer sound files. Naturally, the longer the clip, the more reliable the match. So far, this has proven to work well with 30-second music clips altered by MP3 compression, telephone-simulated filtering, or both.

11.04.2008 The DCT hash method has been adapted to video. This is useful for short video clips only, since the entire video is condensed to a fixed-length hash.

10.24.2008 Support for an image hash based on the discrete cosine transform. The DCT is a quick and efficient method to compute a hash based on frequency data of the underlying image. While it is generally not sophisticated enough to identify visually similar images in any semantically meaningful way, it is fairly robust against minor distortions of the image, such as blurring, rotation and different compression formats.
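To give a feel for how a DCT-based image hash can work, here is a minimal sketch of the general idea, not the pHash implementation. It assumes the image has already been converted to grayscale and resized to 32x32, keeps an 8x8 block of low-frequency coefficients (skipping the DC term), and sets one bit per coefficient depending on whether it lies above the median. The sizes, the function name dct_hash, and the random test images are all assumptions made for illustration. The example checks a brightness and contrast change only; it does not test blurring, rotation or recompression.

```python
import math
import random

N = 32   # side length of the (already resized, grayscale) input; an assumption
K = 8    # side length of the low-frequency block that is kept; an assumption

# DCT-II basis for frequencies 1..K. Frequency 0 (the DC term) is skipped,
# so a uniform brightness offset does not influence the hash.
BASIS = [[math.cos((2 * x + 1) * u * math.pi / (2 * N)) for x in range(N)]
         for u in range(1, K + 1)]

def dct_hash(image):
    """Return a 64-bit hash of an N x N grayscale image given as a list of rows."""
    # Transform each row, keeping only horizontal frequencies 1..K.
    rows = [[sum(p * b for p, b in zip(row, basis)) for basis in BASIS]
            for row in image]
    # Transform the columns of that result, keeping vertical frequencies 1..K.
    # Normalization constants are omitted: they scale every kept coefficient
    # equally and so do not affect the median comparison.
    coeffs = [sum(rows[y][u] * basis[y] for y in range(N))
              for basis in BASIS for u in range(K)]
    # One bit per coefficient: 1 if it lies above the median, else 0.
    ordered = sorted(coeffs)
    median = (ordered[len(ordered) // 2 - 1] + ordered[len(ordered) // 2]) / 2
    bits = 0
    for c in coeffs:
        bits = (bits << 1) | (1 if c > median else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

rng = random.Random(0)
base = [[rng.randrange(0, 200) for _ in range(N)] for _ in range(N)]
# Same picture with higher contrast and brightness (no clipping occurs here).
adjusted = [[1.1 * p + 10 for p in row] for row in base]
# A second, unrelated random picture.
other = [[rng.randrange(0, 200) for _ in range(N)] for _ in range(N)]

near = hamming(dct_hash(base), dct_hash(adjusted))
far = hamming(dct_hash(base), dct_hash(other))
print("brightness/contrast change:", near, "bits differ")
print("unrelated image:", far, "bits differ")
```

Because the DC term is dropped and the threshold is the median of the kept coefficients, a uniform brightness offset or contrast scaling leaves the bits essentially unchanged, while an unrelated image typically differs in roughly half of them.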

That's great, but what is it good for?

Potential applications include copyright protection, similarity search for media files, and digital forensics. For example, YouTube could maintain a database of hashes submitted by the major movie producers for the movies whose copyright they hold. If a user then uploads one of those videos to YouTube, its hash will be almost identical to a stored one, and the upload can be flagged as a possible copyright violation. The audio hash could be used to automatically tag MP3 files with proper ID3 information, while the text hash could be used for plagiarism detection.
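The copyright scenario boils down to a nearest-match lookup. The sketch below is a simplified illustration rather than pHash's API: it assumes each item is summarized by a single 64-bit hash, compares hashes by Hamming distance with a plain linear scan, and uses a made-up threshold of 8 bits. The hash values, labels and the function find_matches are hypothetical.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

def find_matches(query, database, max_distance=8):
    """Return (label, distance) pairs for stored hashes within max_distance bits,
    nearest first. The default threshold is an arbitrary illustrative value."""
    hits = [(label, hamming(query, stored)) for label, stored in database.items()]
    close = [hit for hit in hits if hit[1] <= max_distance]
    return sorted(close, key=lambda hit: hit[1])

# Hypothetical database of hashes submitted by rights holders.
database = {
    "movie_a": 0xF0F0F0F0F0F0F0F0,
    "movie_b": 0x123456789ABCDEF0,
    "movie_c": 0x0F0F0F0F0F0F0F0F,
}

# Hash of a user upload: movie_a's hash with two bits flipped,
# standing in for a re-encoded copy.
upload = 0xF0F0F0F0F0F0F0F0 ^ 0b101

matches = find_matches(upload, database)
for label, distance in matches:
    print(f"possible match: {label} ({distance} bits differ)")
```

A linear scan is fine for a handful of hashes. For large collections, a distance-based index is the usual way to avoid comparing the query against every stored hash.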

Have another use for pHash? Let us know!