# Functionality
The raw file hash is used in the DAM to identify files with identical content (not metadata). The same raw file hash most likely points to identical image files. One differing pixel would lead to a different raw file hash.
When image files are uploaded into the DAM, the file processor `rawfilehash` automatically generates a hash from the images' content data. The method of hash generation depends on the file format (details).
The raw hash is written into the fields `rhash` and `rhash_sec`. The content of field `rhash_sec` describes the "quality" of the hash. Both fields are part of module "Files".
By default, the raw file hash creation is executed after the derivative generation. Since the entire file must be analyzed, this process requires increased computing effort for larger files.
To generate the raw file hash for image files that were already uploaded into the DAM, select them and use the toolbox action "Refresh previews".
Configuration Options:
- An image's raw hash can be filtered for and used to identify duplicates (search configuration guide).
- If required, the file processor for generating the raw file hash can be customized (options).
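As a rough illustration of duplicate detection, the following sketch groups file records by their raw hash. It does not use the DAM's search API; the record layout (a list of dictionaries with `id`, `rhash` and `rhash_sec` keys) and the function name `find_duplicates` are assumptions made for this example only.

```python
from collections import defaultdict


def find_duplicates(records):
    """Group file records by rhash and keep only groups with more than one file.

    The record structure is hypothetical; in the DAM the values would come
    from the fields rhash and rhash_sec of module "Files".
    """
    groups = defaultdict(list)
    for rec in records:
        rhash = rec.get("rhash")
        if rhash:  # files without a raw hash cannot be compared and are skipped
            groups[rhash].append(rec["id"])
    return {h: ids for h, ids in groups.items() if len(ids) > 1}


# Made-up sample data for illustration.
records = [
    {"id": 1, "rhash": "a1b2", "rhash_sec": "true"},
    {"id": 2, "rhash": "c3d4", "rhash_sec": "true"},
    {"id": 3, "rhash": "a1b2", "rhash_sec": "true"},
    {"id": 4, "rhash": None, "rhash_sec": None},
]
duplicates = find_duplicates(records)
```

Here `duplicates` maps each raw hash that occurs more than once to the IDs of the files sharing it.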
# Supported File Formats
The following image file formats are currently supported:
file format | mimetype | specification for hash creation |
---|---|---|
TIFF | image/tiff | TYPE_TIFF |
JPEG | image/jpeg | TYPE_JPEG |
PNG | image/png | TYPE_PNG |
Windows Bitmap | image/bmp | TYPE_BMP |
PSD | image/x-photoshop | TYPE_PSD |
AI | application/illustrator | TYPE_AI |
PostScript (EPS) | application/postscript | TYPE_EPS_IMAGE, TYPE_EPS_VECTOR |
PDF (limitation) | application/pdf | TYPE_PDF_PHOTOSHOP, TYPE_PDF_ARTWORK |
- For raster formats, `rhash_sec` is always set to "true".
- For vector formats, `rhash_sec` is only set to "true" if the "strict mode" is active (it is by default) and the file contains an instance UUID. Without an instance UUID, it can be assumed that the representation is identical, but there may still be differences in content.
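The two rules above can be expressed as a small decision function. This is a minimal sketch, not the file processor's actual code: the function name and its parameters are hypothetical, and the caller is assumed to already know whether the format is vector-based.

```python
def rhash_sec(is_vector_format, strict_mode=True, instance_id=None):
    """Return True if the raw hash can be regarded as reliable.

    Hypothetical helper mirroring the documented rules:
    - raster formats: always True
    - vector formats: True only in strict mode and with an instance UUID
    """
    if not is_vector_format:
        return True
    return bool(strict_mode) and instance_id is not None


raster_result = rhash_sec(is_vector_format=False)
vector_with_uuid = rhash_sec(True, strict_mode=True, instance_id="example-instance-id")
vector_without_uuid = rhash_sec(True, strict_mode=True, instance_id=None)
vector_not_strict = rhash_sec(True, strict_mode=False, instance_id="example-instance-id")
```

Only the first two calls return `True`; a vector file without an instance UUID, or any vector file processed with strict mode switched off, yields `False`.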
# PDF Limitation
Please note that a raw file hash can only be created for one-page PDF documents. PDF documents that consist of more than one page will not be considered during generation and will get no raw file hash.
# Parameters Used
The following parameters are available for creating a raw hash.
Please note: The parameters actually used for an image file's raw hash depend on its specific type (list of all types).
General Parameters:
parameter | description |
---|---|
filetype | file type |
width | width in pixels or pt |
height | height in pixels or pt |
resolution | resolution, mostly in dpi |
channel_depth | bit depth per color channel |
im_signature | hash value of a generated raster image of the file |
colorspace | color space |
channelcount | number of color and alpha channels |
orientation | rotation of the display in relation to the basic data |
File Type Specific Parameters:
parameter | description |
---|---|
has_alpha | alpha channel available (true/false) |
colorprofile | color profile |
clippingpaths | names of the clipping paths or work paths |
paths | SVG data of the paths |
layers | layer names and their signatures (hashes) |
palette and colorType | color type and optional color table (for PNG) |
Interlace | interlace mode (for PNG) |
instanceID | UUID of the document instance. This is only used in strict mode (details). It also changes if metadata in the document was changed with an Adobe program. |
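To illustrate how a set of parameters like those listed above could be reduced to a single hash value, the following sketch serializes a parameter dictionary in a canonical order and hashes the result. The actual serialization and hash algorithm used by `rawfilehash` are not described here; JSON and SHA-256 are assumptions chosen for the example, as are the function name `raw_hash` and the sample values.

```python
import hashlib
import json


def raw_hash(params):
    """Hash a parameter dictionary independently of key order.

    Assumption: JSON with sorted keys as canonical form, SHA-256 as hash.
    """
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Made-up parameter values for a JPEG file.
image_a = {"filetype": "TYPE_JPEG", "width": 800, "height": 600,
           "im_signature": "abc123", "colorspace": "sRGB"}
# Same parameters, different key order.
image_b = {"colorspace": "sRGB", "im_signature": "abc123",
           "height": 600, "width": 800, "filetype": "TYPE_JPEG"}
# One parameter differs.
image_c = dict(image_a, width=801)

hash_a = raw_hash(image_a)
hash_b = raw_hash(image_b)
hash_c = raw_hash(image_c)
```

With this approach `hash_a` equals `hash_b`, while `hash_c` differs because a single parameter changed.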
# File Type Specific Hash Generation
Depending on an image file's type and specification, the following parameters are used to generate its raw hash:
file type | parameters used |
---|---|
TYPE_TIFF |
|
TYPE_JPEG |
|
TYPE_PNG |
|
TYPE_BMP |
|
TYPE_PSD |
|
TYPE_AI |
|
TYPE_EPS_IMAGE |
|
TYPE_EPS_VECTOR |
|
TYPE_PDF_PHOTOSHOP |
|
TYPE_PDF_ARTWORK |
|
* possibly available, but not supported by the file format
# Distinction from DAM's File Hash
4App DAM comes with the standard field `file_hash`, a hash to identify identical files. This hash also considers metadata, e.g. IPTC, EXIF or XMP data. For most image files, every save leads to a new `file_hash` because, for example, timestamps are also written to the file. If you want to find files with identical content data (independent of their metadata), you first have to separate content data from metadata.
The raw file hash field `rhash` is used to identify DAM files with identical content. For this purpose, the hash is generated from the content data only, not from the metadata.
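The difference between the two hashes can be shown with a simplified model in which a file consists of a metadata part and a content part. This is only a conceptual sketch: real image files interleave metadata and pixel data, and the functions, the byte strings and the use of SHA-256 are assumptions for illustration.

```python
import hashlib


def whole_file_hash(metadata: bytes, content: bytes) -> str:
    """Hash over everything in the file, comparable in spirit to file_hash."""
    return hashlib.sha256(metadata + content).hexdigest()


def content_only_hash(metadata: bytes, content: bytes) -> str:
    """Hash over the content data only, comparable in spirit to rhash."""
    return hashlib.sha256(content).hexdigest()


pixels = b"\x00\x01\x02\x03"            # made-up content data
saved_monday = b"ModifyDate=2024-01-01"  # made-up metadata
saved_tuesday = b"ModifyDate=2024-01-02"

file_hashes_equal = (whole_file_hash(saved_monday, pixels)
                     == whole_file_hash(saved_tuesday, pixels))
content_hashes_equal = (content_only_hash(saved_monday, pixels)
                        == content_only_hash(saved_tuesday, pixels))
```

Re-saving with a new timestamp changes the whole-file hash (`file_hashes_equal` is `False`) but leaves the content-only hash untouched (`content_hashes_equal` is `True`).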
# Hit Rate and Known Issues
Using the raw file hash to identify image files with identical content generally yields a high hit rate. However, files with different content may still receive the same hash, and files with the same content may receive different hashes. This is particularly the case with the vector-based file formats EPS, AI, and PDF.
Since TIFF files can contain thousands of TIFF tags whose usage and data types are not clearly defined, problems are also conceivable here.
PSD is a proprietary file format of Adobe Systems that changes or expands with each new version of Adobe Photoshop. If a document uses layers or many alpha channels or spot colors, check whether seemingly identical files are really the same.
# Strict Mode
Strict mode is a parameter of the file processor `rawfilehash` and is active by default. It causes the instance UUID to be included in the hash for the vector formats AI, EPS, and PDF. This means that each time such a document is saved using an Adobe program, a new hash is created. To prevent this, strict mode can be set to inactive (false), which potentially increases the hit rate (details).
If the strict mode is inactive, false hits must still be considered: For vector-based formats, the content cannot be fully analyzed, and vector elements that do not change the representation do not result in a different hash.
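The effect of strict mode on vector files can be sketched as follows. The hash input is modeled as a parameter dictionary to which the instance UUID is added only when strict mode is active. The function name, the JSON serialization, SHA-256 and the sample values are all assumptions for illustration, not the file processor's real behavior.

```python
import hashlib
import json


def vector_raw_hash(params, instance_id, strict_mode=True):
    """Hash vector-file parameters, adding instanceID only in strict mode.

    Assumption: JSON with sorted keys as canonical form, SHA-256 as hash.
    """
    data = dict(params)  # copy so the caller's dictionary is not modified
    if strict_mode and instance_id is not None:
        data["instanceID"] = instance_id
    canonical = json.dumps(data, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same artwork saved twice; only the (made-up) instance ID differs.
artwork = {"filetype": "TYPE_AI", "width": 595, "height": 842,
           "im_signature": "def456"}
first_save = "instance-0001"
second_save = "instance-0002"

strict_equal = (vector_raw_hash(artwork, first_save, strict_mode=True)
                == vector_raw_hash(artwork, second_save, strict_mode=True))
relaxed_equal = (vector_raw_hash(artwork, first_save, strict_mode=False)
                 == vector_raw_hash(artwork, second_save, strict_mode=False))
```

In strict mode the two saves get different hashes (`strict_equal` is `False`); with strict mode inactive they match (`relaxed_equal` is `True`), which raises the hit rate at the cost of possible false hits.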