Content Hash Duplicate File Report

Content Hash Duplicate File Report

Details

Summary

This report utilizes the new added file content hash feature of version 4.0.  The 4.0 scan policy definition gives a new option to Generate file content hashes for All Files or Files uploaded since the last scan.  This option prompts the AgentFS to generate a SHA256 hash of the file content and store the hash in the database where it can be compared against other files that match in content.

New with version 4.1 is the ability to manage the report paths outside of the query via the Report Designer. 

Details

The report layout included in the versioned zip files does contain updates appropriate to the version.

Version 4.0

Line 10, contains the paths to be reported against.  Modify these paths appropriate to your environment.

The fullpath_hash is a hash of the full path string stored for quick comparison. In SQL you can reference single or multiple paths using the = or IN operand respectively.

The path_hash() function can be used to query inline the fullpath_hash for the desired path(s).

Line 39, contains the minimum item duplicate count to be included in the report.

Line 40, contians the minimum size (in bytes) file to be included in the report.

Version 4.1

The report paths are managed in the Report Designer not in the query.

Line 33, contains the minimum item duplicate count to be included in the report.

Line 34, contians the minimum size (in bytes) file to be included in the report.

Code
 WITH
    root_path(fullpath, fullpath_hash, ns_left, ns_right, path_type, scan_id) AS (SELECT sd.fullpath,
          sd.fullpath_hash,
          sd.ns_left,
          sd.ns_right,
          sd.path_type,
          sd.scan_id
        FROM srs.scan_data AS sd
          INNER JOIN srs.scans AS s ON s.id = sd.scan_id
        WHERE (sd.fullpath_hash IN (srs.path_hash('\\srs-m1.sp.cctec.org\Shares\Atlanta\Employees'),srs.path_hash('\\srs-m1.sp.cctec.org\Shares2\mp'))) AND
          (sd.path_type = 2) AND
          (s.progress_status = 4) AND
          (s.scan_type = 1)),
    q(fullpath, size, create_time, modify_time, access_time, name, item_count, total_hash_size, content_hash) AS (SELECT sd.fullpath,
          sd.size,
          sd.create_time,
          sd.modify_time,
          sd.access_time,
          sd.name,
          COUNT(*) OVER (PARTITION BY sd.content_hash) AS item_count,
          Sum(sd.size) OVER (PARTITION BY sd.content_hash) AS total_hash_size,
          srs.bytes_to_hex_string(sd.content_hash) as content_hash
        FROM srs.scan_data AS sd
          INNER JOIN root_path AS rp ON (rp.scan_id = sd.scan_id) AND (rp.ns_left <= sd.ns_left) AND (rp.ns_right >= sd.ns_right)
        WHERE sd.path_type = 1)
SELECT
	q.fullpath,
	q.item_count,
	srs.byte_string(q.size) AS size_string,
	srs.byte_string(CAST(q.total_hash_size AS BIGINT)) AS total_size_string,
	(q.total_hash_size - q.size) AS wasted_space,
	srs.byte_string(CAST((q.total_hash_size - q.size) AS BIGINT)) AS wasted_space_string,
	q.total_hash_size,
	q.size,
	q.content_hash
FROM
    q
WHERE
    (q.item_count >= 2) AND
    (q.size >= 10 * 1024 * 1024)
ORDER BY q.fullpath
 
 WITH
    q AS (SELECT sd.fullpath,
          sd.size,
          sd.create_time,
          sd.modify_time,
          sd.access_time,
          sd.name,
          COUNT(*) OVER (PARTITION BY sd.content_hash) AS item_count,
          Sum(sd.size) OVER (PARTITION BY sd.content_hash) AS total_hash_size,
          srs.bytes_to_hex_string(sd.content_hash) as content_hash
        FROM srs.scan_data AS sd
          INNER JOIN #tmp_cq_fs_paths AS cp ON cp.scan_id = sd.scan_id
             AND cp.ns_left <= sd.ns_left
             AND cp.ns_right >= sd.ns_right
             AND cp.is_filesystem_scan = 'true'
             AND cp.is_current = 'true'
        WHERE sd.path_type = 1
        	AND sd.content_hash IS NOT NULL
        )
SELECT
	q.fullpath,
	q.item_count,
	srs.byte_string(q.size) AS size_string,
	srs.byte_string(CAST(q.total_hash_size AS BIGINT)) AS total_size_string,
	(q.total_hash_size - q.size) AS wasted_space,
	srs.byte_string(CAST((q.total_hash_size - q.size) AS BIGINT)) AS wasted_space_string,
	q.total_hash_size,
	q.size,
	q.content_hash
FROM
    q
WHERE
    (q.item_count >= 2) AND
    (q.size > 10 * 1024 * 1024)
ORDER BY q.fullpath
 
 WITH
  q AS (SELECT sd.fullpath,
        sd.size,
        sd.create_time,
        sd.modify_time,
        sd.access_time,
        sd.name,
        COUNT(*) OVER (PARTITION BY sd.content_hash) AS item_count,
        Sum(sd.size) OVER (PARTITION BY sd.content_hash) AS total_hash_size,
        srs.bytes_to_hex_string(sd.content_hash) as content_hash
      FROM srs.scan_data AS sd
        INNER JOIN tmp_cq_fs_paths AS cp ON (cp.scan_id = sd.scan_id) 
            AND (cp.ns_left <= sd.ns_left) 
            AND (cp.ns_right >= sd.ns_right)
            AND (cp.is_filesystem_scan = 'true')
            AND (cp.is_current = 'true')
      WHERE sd.path_type = 1
        AND sd.content_hash IS NOT NULL
)
SELECT
	q.fullpath,
	q.item_count,
	srs.byte_string(q.size) AS size_string,
	srs.byte_string(CAST(q.total_hash_size AS BIGINT)) AS total_size_string,
	(q.total_hash_size - q.size) AS wasted_space,
	srs.byte_string(CAST((q.total_hash_size - q.size) AS BIGINT)) AS wasted_space_string,
	q.total_hash_size,
	q.size,
	q.content_hash
FROM
    q
WHERE
    (q.item_count >= 2) AND
    (q.size > > 10 * 1024 * 1024)
ORDER BY q.fullpath
 
Author
rlagger
Last modified
Friday, April 15, 2022 - 09:03
Properties
Supported Version
Report Category
Includes a Report Layout
Yes
Sample Report
AttachmentSize
duplicate file - hash.pdf291.53 KB
Preview Images