# Functionality

The raw file hash is used in the DAM to identify files with identical content (not metadata). The same raw file hash most likely points to identical image files. One differing pixel would lead to a different raw file hash.
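
The sensitivity to a single pixel can be illustrated with a minimal sketch. It assumes the content is available as raw pixel bytes and uses SHA-256 purely for illustration; the function name `content_hash` is hypothetical, and the algorithm actually used by the DAM is not stated here.

```python
import hashlib

def content_hash(pixels: bytes) -> str:
    """Hash raw pixel bytes only (illustrative; not the DAM's actual algorithm)."""
    return hashlib.sha256(pixels).hexdigest()

# Two tiny 2x2 grayscale "images" that differ in exactly one pixel value.
image_a = bytes([0, 0, 0, 0])
image_b = bytes([0, 0, 0, 1])

same_content_matches = content_hash(image_a) == content_hash(bytes([0, 0, 0, 0]))
one_pixel_differs = content_hash(image_a) != content_hash(image_b)
print(same_content_matches, one_pixel_differs)
```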

When image files are uploaded into the DAM, the file processor rawfilehash automatically generates a hash from the images' content data. The method of hash generation depends on the file format (see "File Type Specific Hash Generation").
The raw hash is written into the fields rhash and rhash_sec. The field rhash_sec describes the "quality" of the hash. Both fields are part of the module "Files".

By default, the raw file hash creation is executed after the derivative generation. Since the entire file must be analyzed, this process requires increased computing effort for larger files.
To generate the raw file hash for image files that were already uploaded into the DAM, select them and use the toolbox action "Refresh previews".

Configuration Options

  • An image's raw hash can be filtered for and used to identify duplicates (search configuration guide).
  • If required, the file processor for generating the raw file hash can be customized (options).
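
Outside the DAM's own search, the idea of duplicate detection by raw hash can be sketched as follows. This is only an illustration on exported records: the record layout (`id`, `rhash`, `rhash_sec` as dictionary keys) and the function `find_duplicates` are assumptions, not part of the DAM's API.

```python
from collections import defaultdict

def find_duplicates(records):
    """Group file records by rhash and return groups with more than one file."""
    groups = defaultdict(list)
    for record in records:
        if record.get("rhash"):  # files without a raw hash cannot be compared
            groups[record["rhash"]].append(record["id"])
    return {rhash: ids for rhash, ids in groups.items() if len(ids) > 1}

# Hypothetical export of four files from the module "Files".
records = [
    {"id": 1, "rhash": "abc", "rhash_sec": True},
    {"id": 2, "rhash": "abc", "rhash_sec": True},
    {"id": 3, "rhash": "def", "rhash_sec": False},
    {"id": 4, "rhash": None, "rhash_sec": None},
]
duplicates = find_duplicates(records)
print(duplicates)
```

Matches whose rhash_sec is not "true" would still need a manual check before being treated as duplicates.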

# Supported File Formats

The following image file formats are currently supported:

| File format | MIME type | Specification for hash creation |
| --- | --- | --- |
| TIFF | image/tiff | TYPE_TIFF |
| JPEG | image/jpeg | TYPE_JPEG |
| PNG | image/png | TYPE_PNG |
| Windows Bitmap | image/bmp | TYPE_BMP |
| PSD | image/x-photoshop | TYPE_PSD |
| AI | application/illustrator | TYPE_AI |
| PostScript (EPS) | application/postscript | TYPE_EPS_IMAGE, TYPE_EPS_VECTOR |
| PDF (see "PDF Limitation") | application/pdf | TYPE_PDF_PHOTOSHOP, TYPE_PDF_ARTWORK |

  • For raster formats, rhash_sec is always set to "true".
  • For vector formats, rhash_sec is only set to "true" if the "strict mode" is active (it is by default) and the file contains an instance UUID. Without an instance UUID, it can be assumed that the representation is identical, but there may still be differences in content.
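
The two rules for rhash_sec can be written down as a small decision function. This is a sketch, not the processor's code: the function name and arguments are hypothetical, and the set of specifications treated as "vector" is an assumption derived from which types list instanceID among their parameters.

```python
# Assumption: specifications that list instanceID are treated as vector formats.
VECTOR_SPECS = {"TYPE_AI", "TYPE_EPS_VECTOR", "TYPE_PDF_PHOTOSHOP", "TYPE_PDF_ARTWORK"}

def rhash_sec(spec: str, strict_mode: bool = True, has_instance_uuid: bool = False) -> bool:
    """Return the expected value of rhash_sec for a given specification."""
    if spec in VECTOR_SPECS:
        # Vector formats: only reliable in strict mode with an instance UUID.
        return strict_mode and has_instance_uuid
    # Raster formats are always marked as reliable.
    return True

print(rhash_sec("TYPE_JPEG"))
print(rhash_sec("TYPE_AI", strict_mode=True, has_instance_uuid=True))
print(rhash_sec("TYPE_AI", strict_mode=True, has_instance_uuid=False))
```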

# PDF Limitation

Please note that a raw file hash can only be created for one-page PDF documents. PDF documents that consist of more than one page will not be considered during generation and will get no raw file hash.

# Parameters Used

The following parameters are available for creating a raw hash.
Please note: The parameters actually used for an image file's raw hash depend on its specific type (list of all types).

General Parameters:

| Parameter | Description |
| --- | --- |
| filetype | file type |
| width | width in pixels or pt |
| height | height in pixels or pt |
| resolution | resolution, mostly in dpi |
| channel_depth | bit depth per color channel |
| im_signature | hash value of a generated raster image of the file |
| colorspace | color space |
| channelcount | number of color and alpha channels |
| orientation | rotation of the display in relation to the basic data |

File Type Specific Parameters:

| Parameter | Description |
| --- | --- |
| has_alpha | alpha channel available (true/false) |
| colorprofile | color profile |
| clippingpaths | names of the clipping paths or work paths |
| paths | SVG data of the paths |
| layers | layer names and their signatures (hash) |
| palette and colorType | color type and optional color table (for PNG) |
| Interlace | interlace mode (for PNG) |
| instanceID | UUID of the document instance. This is only used in strict mode (see "Strict Mode"). It also changes if metadata in the document was changed with an Adobe program. |

# File Type Specific Hash Generation

Depending on an image file's type and specification, the following parameters are used to generate its raw hash:

| File type | Parameters used |
| --- | --- |
| TYPE_TIFF | all general parameters, has_alpha, colorprofile, clippingpaths, paths, layers |
| TYPE_JPEG | all general parameters, has_alpha, colorprofile, clippingpaths, paths |
| TYPE_PNG | all general parameters, has_alpha, colorprofile, clippingpaths *, paths *, palette and colorType, Interlace |
| TYPE_BMP | all general parameters, has_alpha, colorprofile, clippingpaths *, paths *, layers |
| TYPE_PSD | all general parameters, has_alpha, colorprofile, clippingpaths, paths, layers |
| TYPE_AI | filetype, width, height, resolution, im_signature, instanceID (only used in strict mode) |
| TYPE_EPS_IMAGE | all general parameters, has_alpha, colorprofile, clippingpaths, paths |
| TYPE_EPS_VECTOR | filetype, width, height, resolution, im_signature, instanceID (only used in strict mode) |
| TYPE_PDF_PHOTOSHOP | filetype, width, height, resolution, im_signature, has_alpha, colorprofile, clippingpaths, paths, channelCount, instanceID (only used in strict mode) |
| TYPE_PDF_ARTWORK | filetype, width, height, resolution, im_signature, instanceID (only used in strict mode) |

* may be present in the file, but is not supported by the file format
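
The type-dependent selection of parameters can be sketched as a lookup followed by a hash over the selected values. Everything beyond the parameter names is an assumption made for illustration: the functions `select_parameters` and `hash_parameters`, the JSON serialization, the use of SHA-256, and the sample property values. Only two specifications are shown.

```python
import hashlib
import json

GENERAL = ["filetype", "width", "height", "resolution", "channel_depth",
           "im_signature", "colorspace", "channelcount", "orientation"]

# Only two specifications are shown; the others follow the same pattern.
PARAMS_BY_TYPE = {
    "TYPE_JPEG": GENERAL + ["has_alpha", "colorprofile", "clippingpaths", "paths"],
    "TYPE_PDF_ARTWORK": ["filetype", "width", "height", "resolution",
                         "im_signature", "instanceID"],
}

def select_parameters(spec: str, properties: dict) -> dict:
    """Keep only the parameters that count for the given specification."""
    return {name: properties.get(name) for name in PARAMS_BY_TYPE[spec]}

def hash_parameters(selected: dict) -> str:
    """Hash a stable serialization of the selected parameters (illustrative)."""
    payload = json.dumps(selected, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

# Hypothetical properties extracted from a JPEG file.
jpeg = {"filetype": "jpeg", "width": 800, "height": 600, "resolution": 300,
        "channel_depth": 8, "im_signature": "9f2c", "colorspace": "sRGB",
        "channelcount": 3, "orientation": 1, "has_alpha": False,
        "colorprofile": "sRGB IEC61966-2.1", "clippingpaths": [], "paths": [],
        "layers": ["not used for TYPE_JPEG"]}

selected = select_parameters("TYPE_JPEG", jpeg)
print(sorted(selected))
print(hash_parameters(selected))
```

Because "layers" is not in the TYPE_JPEG list, changing it in this sketch does not change the resulting hash.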

# Distinction from DAM's File Hash

4App DAM comes with the standard field file_hash, a hash used to identify identical files. This hash also covers metadata, e.g. IPTC, EXIF or XMP data. For most image files, every save produces a new file_hash, for example because timestamps are written to the file. To find files with identical content data (independent of their metadata), the content data first has to be separated from the metadata.

Raw-File-Hash field rhash is used to identify DAM files with identical content. For this purpose, the hash is generated from the content data only, not from the metadata.
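
The difference between the two hashes can be shown with a simplified model in which a file consists of a metadata part and a content part. This is a sketch under that assumption; the function names, the byte layout, and SHA-256 are illustrative and do not describe how the DAM computes file_hash or rhash.

```python
import hashlib

def file_hash(metadata: bytes, content: bytes) -> str:
    """Hash over the whole file: metadata and content (models file_hash)."""
    return hashlib.sha256(metadata + content).hexdigest()

def raw_hash(content: bytes) -> str:
    """Hash over the content data only (models rhash)."""
    return hashlib.sha256(content).hexdigest()

# The same pixels saved twice; only a timestamp in the metadata differs.
pixels = bytes([10, 20, 30, 40])
first_save = (b"ModifyDate=2024-01-01", pixels)
second_save = (b"ModifyDate=2024-06-01", pixels)

file_hash_changed = file_hash(*first_save) != file_hash(*second_save)
raw_hash_changed = raw_hash(first_save[1]) != raw_hash(second_save[1])
print(file_hash_changed, raw_hash_changed)
```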

# Hit Rate and Known Issues

Identifying image files with identical content via the raw file hash can be assumed to have a high hit rate. However, files with different content may still receive the same hash, and files with the same content may receive different hashes. This applies in particular to the vector-based file formats EPS, AI, and PDF.
TIFF files can contain thousands of TIFF tags whose usage and data types are not clearly defined, so problems are conceivable here as well.
PSD is a proprietary file format of Adobe Systems that changes and expands with each new major version of "Adobe Photoshop". If documents use layers or many alpha channels or spot colors, it should be checked whether seemingly identical files are really the same.

# Strict Mode

The strict mode is a parameter in file processor rawfilehash and active by default. It causes the instance UUID to be included in the hash for vector formats AI, EPS, and PDF. This means that each time such a document is saved using an Adobe program, a new hash is created. To prevent this, the strict mode can be set to inactive (false), which would potentially increase the hit rate (details).

If the strict mode is inactive, false hits must still be considered: For vector-based formats, the content cannot be fully analyzed, and vector elements that do not change the representation do not result in a different hash.
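
The effect of strict mode on a vector document that is saved again with an Adobe program can be sketched as follows. The function `vector_raw_hash`, the serialization, and the sample values are assumptions for illustration; only the parameter names and the role of instanceID come from this page.

```python
import hashlib
import json

def vector_raw_hash(properties: dict, strict_mode: bool = True) -> str:
    """Hash the vector parameters; instanceID only counts in strict mode."""
    names = ["filetype", "width", "height", "resolution", "im_signature"]
    if strict_mode:
        names.append("instanceID")
    selected = {name: properties.get(name) for name in names}
    payload = json.dumps(selected, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

# Hypothetical AI document before and after a save that only renews the UUID.
before_save = {"filetype": "ai", "width": 595, "height": 842, "resolution": 72,
               "im_signature": "9f2c", "instanceID": "uuid-1"}
after_save = dict(before_save, instanceID="uuid-2")

strict_changed = vector_raw_hash(before_save) != vector_raw_hash(after_save)
relaxed_changed = (vector_raw_hash(before_save, strict_mode=False)
                   != vector_raw_hash(after_save, strict_mode=False))
print(strict_changed, relaxed_changed)
```

With strict mode inactive, the two saves match in this sketch, which is the increased hit rate described above; it also shows why such matches rest on the rendered representation only.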
