Content Hashed Duplicate File Report

Details

Summary

This report utilizes the new added file content hash feature of version 4.0.  The 4.0 scan policy definition gives a new option to Generate file content hashes for All Files or Files uploaded since the last scan.  This option prompts the AgentFS to generate a SHA256 hash of the file content and store the hash in the database where it can be compared against other files that match in content.

Details

Line 10 contains the paths to be reported against.  Modify these paths appropriate to your environment.

The fullpath_hash is a hash of the full path string stored for quick comparison. In SQL you can reference single or multiple paths using the = or IN operand respectively.

The path_hash() function can be used to query inline the fullpath_hash for the desired path(s).

Code
 WITH
    root_path(fullpath, fullpath_hash, ns_left, ns_right, path_type, scan_id) AS (SELECT sd.fullpath,
          sd.fullpath_hash,
          sd.ns_left,
          sd.ns_right,
          sd.path_type,
          sd.scan_id
        FROM srs.scan_data AS sd
          INNER JOIN srs.scans AS s ON s.id = sd.scan_id
        WHERE (sd.fullpath_hash IN (srs.path_hash('\\srs-m1.sp.cctec.org\Shares\Atlanta\Employees'),srs.path_hash('\\srs-m1.sp.cctec.org\Shares2\mp'))) AND
          (sd.path_type = 2) AND
          (s.progress_status = 4) AND
          (s.scan_type = 1)),
    q(fullpath, size, create_time, modify_time, access_time, name, item_count, total_hash_size, content_hash) AS (SELECT sd.fullpath,
          sd.size,
          sd.create_time,
          sd.modify_time,
          sd.access_time,
          sd.name,
          COUNT(*) OVER (PARTITION BY sd.content_hash) AS item_count,
          Sum(sd.size) OVER (PARTITION BY sd.content_hash) AS total_hash_size,
          srs.bytes_to_hex_string(sd.content_hash) as content_hash
        FROM srs.scan_data AS sd
          INNER JOIN root_path AS rp ON (rp.scan_id = sd.scan_id) AND (rp.ns_left <= sd.ns_left) AND (rp.ns_right >= sd.ns_right)
        WHERE sd.path_type = 1)
SELECT
	q.fullpath,
	q.item_count,
	srs.byte_string(q.size) AS size_string,
	srs.byte_string(CAST(q.total_hash_size AS BIGINT)) AS total_size_string,
	(q.total_hash_size - q.size) AS wasted_space,
	srs.byte_string(CAST((q.total_hash_size - q.size) AS BIGINT)) AS wasted_space_string,
	q.total_hash_size,
	q.size,
	q.content_hash
FROM
    q
WHERE
    (q.item_count >= 2) AND
    (q.size > 1)
ORDER BY q.fullpath
 
Author
rlagger
Last modified
Thursday, January 7, 2021 - 11:49
Properties
Supported Version
Report Category
Includes a Report Layout
Yes
Databases
Rating
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0
Downloads
Sample Report
AttachmentSize
duplicate file - hash.pdf291.53 KB
Preview Images