Content Hash Duplicate File Report
Content Hash Duplicate File Report
Summary
This report utilizes the new added file content hash feature of version 4.0. The 4.0 scan policy definition gives a new option to Generate file content hashes for All Files or Files uploaded since the last scan. This option prompts the AgentFS to generate a SHA256 hash of the file content and store the hash in the database where it can be compared against other files that match in content.
New with version 4.1 is the ability to manage the report paths outside of the query via the Report Designer.
Details
The report layout included in the versioned zip files does contain updates appropriate to the version.
Version 4.0
Line 10, contains the paths to be reported against. Modify these paths appropriate to your environment.
The fullpath_hash is a hash of the full path string stored for quick comparison. In SQL you can reference single or multiple paths using the = or IN operand respectively.
The path_hash() function can be used to query inline the fullpath_hash for the desired path(s).
Line 39, contains the minimum item duplicate count to be included in the report.
Line 40, contians the minimum size (in bytes) file to be included in the report.
Version 4.1
The report paths are managed in the Report Designer not in the query.
Line 33, contains the minimum item duplicate count to be included in the report.
Line 34, contians the minimum size (in bytes) file to be included in the report.
WITH root_path(fullpath, fullpath_hash, ns_left, ns_right, path_type, scan_id) AS (SELECT sd.fullpath, sd.fullpath_hash, sd.ns_left, sd.ns_right, sd.path_type, sd.scan_id FROM srs.scan_data AS sd INNER JOIN srs.scans AS s ON s.id = sd.scan_id WHERE (sd.fullpath_hash IN (srs.path_hash('\\srs-m1.sp.cctec.org\Shares\Atlanta\Employees'),srs.path_hash('\\srs-m1.sp.cctec.org\Shares2\mp'))) AND (sd.path_type = 2) AND (s.progress_status = 4) AND (s.scan_type = 1)), q(fullpath, size, create_time, modify_time, access_time, name, item_count, total_hash_size, content_hash) AS (SELECT sd.fullpath, sd.size, sd.create_time, sd.modify_time, sd.access_time, sd.name, COUNT(*) OVER (PARTITION BY sd.content_hash) AS item_count, Sum(sd.size) OVER (PARTITION BY sd.content_hash) AS total_hash_size, srs.bytes_to_hex_string(sd.content_hash) as content_hash FROM srs.scan_data AS sd INNER JOIN root_path AS rp ON (rp.scan_id = sd.scan_id) AND (rp.ns_left <= sd.ns_left) AND (rp.ns_right >= sd.ns_right) WHERE sd.path_type = 1) SELECT q.fullpath, q.item_count, srs.byte_string(q.size) AS size_string, srs.byte_string(CAST(q.total_hash_size AS BIGINT)) AS total_size_string, (q.total_hash_size - q.size) AS wasted_space, srs.byte_string(CAST((q.total_hash_size - q.size) AS BIGINT)) AS wasted_space_string, q.total_hash_size, q.size, q.content_hash FROM q WHERE (q.item_count >= 2) AND (q.size >= 10 * 1024 * 1024) ORDER BY q.fullpath
WITH q AS (SELECT sd.fullpath, sd.size, sd.create_time, sd.modify_time, sd.access_time, sd.name, COUNT(*) OVER (PARTITION BY sd.content_hash) AS item_count, Sum(sd.size) OVER (PARTITION BY sd.content_hash) AS total_hash_size, srs.bytes_to_hex_string(sd.content_hash) as content_hash FROM srs.scan_data AS sd INNER JOIN #tmp_cq_fs_paths AS cp ON cp.scan_id = sd.scan_id AND cp.ns_left <= sd.ns_left AND cp.ns_right >= sd.ns_right AND cp.is_filesystem_scan = 'true' AND cp.is_current = 'true' WHERE sd.path_type = 1 AND sd.content_hash IS NOT NULL ) SELECT q.fullpath, q.item_count, srs.byte_string(q.size) AS size_string, srs.byte_string(CAST(q.total_hash_size AS BIGINT)) AS total_size_string, (q.total_hash_size - q.size) AS wasted_space, srs.byte_string(CAST((q.total_hash_size - q.size) AS BIGINT)) AS wasted_space_string, q.total_hash_size, q.size, q.content_hash FROM q WHERE (q.item_count >= 2) AND (q.size > 10 * 1024 * 1024) ORDER BY q.fullpath
WITH q AS (SELECT sd.fullpath, sd.size, sd.create_time, sd.modify_time, sd.access_time, sd.name, COUNT(*) OVER (PARTITION BY sd.content_hash) AS item_count, Sum(sd.size) OVER (PARTITION BY sd.content_hash) AS total_hash_size, srs.bytes_to_hex_string(sd.content_hash) as content_hash FROM srs.scan_data AS sd INNER JOIN tmp_cq_fs_paths AS cp ON (cp.scan_id = sd.scan_id) AND (cp.ns_left <= sd.ns_left) AND (cp.ns_right >= sd.ns_right) AND (cp.is_filesystem_scan = 'true') AND (cp.is_current = 'true') WHERE sd.path_type = 1 AND sd.content_hash IS NOT NULL ) SELECT q.fullpath, q.item_count, srs.byte_string(q.size) AS size_string, srs.byte_string(CAST(q.total_hash_size AS BIGINT)) AS total_size_string, (q.total_hash_size - q.size) AS wasted_space, srs.byte_string(CAST((q.total_hash_size - q.size) AS BIGINT)) AS wasted_space_string, q.total_hash_size, q.size, q.content_hash FROM q WHERE (q.item_count >= 2) AND (q.size > > 10 * 1024 * 1024) ORDER BY q.fullpath
Attachment | Size |
---|---|
4.1 Content Hash Duplicate File Report.zip | 4.67 KB |
4.0 Content Hash Duplicate File Report.zip | 1.78 KB |
Attachment | Size |
---|---|
duplicate file - hash.pdf | 291.53 KB |