Script to locate dups ... read it Sheqel =p

Minitokyo


Hi there xD

I wrote this to locate 100% dups (byte-identical files). It will find exact copies of the same file in the gallery (there are some ... ok, a lot, especially in the scans section). Wallpaper dups are often edited to remove credits, and this script won't find those.

The script first compares file sizes (level 1 check). If two files have the same size, it reads the first chunk (up to 1024 bytes) of each and compares them (level 2 check). If those are equal too, it performs a full comparison of both files (level 3 check = match).
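A rough sketch of those three checks for a single pair of files, just to show the idea. The name files_identical is made up for this example and is not part of the script below; it also streams the files chunk by chunk instead of loading them whole, which is my assumption about what would be lighter on memory.

```php
<?php
// Hypothetical helper, not part of the script below: returns true only
// when both files pass all three checks described above.
function files_identical($a, $b, $chunk = 1024) {
    // Level 1: files of different sizes can never be the same file.
    if (filesize($a) !== filesize($b)) return false;

    $fa = fopen($a, "rb");
    $fb = fopen($b, "rb");
    if ($fa === false || $fb === false) {
        if ($fa) fclose($fa);
        if ($fb) fclose($fb);
        return false;
    }

    // Level 2 is the first pass of this loop (the first $chunk bytes);
    // level 3 is the loop running on to the end of both files.
    $same = true;
    while ($same && !feof($fa) && !feof($fb)) {
        $same = (fread($fa, $chunk) === fread($fb, $chunk));
    }
    fclose($fa);
    fclose($fb);
    return $same;
}
```

The loop stops at the first chunk that differs, so two files that only share a size cost one small read each.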

It can be quite heavy on the server, so I suggest running it only during a maintenance window. Also, it might not detect everything when there are more than 2 dups (e.g. 3 identical files), since it only tests files 2 by 2 (each file against its neighbour in the size-ordered list), so there IS a point in running it again after deleting dups.
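One possible way around the 2-by-2 limit would be to group files instead of comparing neighbours: first by size, then by a hash of the content. This is only a sketch under assumptions: group_dups is a made-up name, and md5_file stands in for the byte-by-byte check, so it would be wise to confirm a group with a full comparison before deleting anything.

```php
<?php
// Hypothetical helper: groups the files of one folder so that every group
// holds 2 or more files with the same size and the same MD5 hash.
function group_dups($path) {
    $by_size = array();
    foreach (scandir($path) as $name) {
        $full = rtrim($path, "/") . "/" . $name;
        if (is_file($full)) $by_size[filesize($full)][] = $full;
    }

    $groups = array();
    foreach ($by_size as $same_size) {
        if (count($same_size) < 2) continue; // unique size, cannot be a dup
        $by_hash = array();
        foreach ($same_size as $full) {
            // Assumption: the hash replaces the full comparison here.
            $by_hash[md5_file($full)][] = $full;
        }
        foreach ($by_hash as $same_hash) {
            if (count($same_hash) > 1) $groups[] = $same_hash;
        }
    }
    return $groups;
}
```

With this, 3 copies of the same file come back as one group of 3, even when a different file of the same size sits between them in the listing.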

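The script has an anti-loop guard: the loops that walk the file list abort after a fixed number of steps.

The script doesn't change or delete anything, it just lists the dups.

Tested on 200+ files, but I don't know how it behaves on 200000 files ><;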
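About big folders: the script below walks its chained list once for every file it inserts, so the ordering step gets slower much faster than the file count grows. A possible alternative, sketched here with a made-up helper name (sizes_sorted), is to collect the sizes in a plain array and let PHP's built-in asort order them. I haven't timed this on a large gallery.

```php
<?php
// Hypothetical alternative to the chained list: returns name => size,
// ordered by size, so files of equal size end up next to each other.
function sizes_sorted($path, $names) {
    $sizes = array();
    foreach ($names as $name) {
        $sizes[$name] = filesize(rtrim($path, "/") . "/" . $name);
    }
    asort($sizes); // built-in sort that keeps the name keys
    return $sizes;
}
```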

<?php

locatedups(""); // pass the path to scan, relative to this script's folder (ex.: locatedups("uploads/images"); )

function locatedups($path) {

$anti_loop_magic_number = 200000; // if there are more files than this number, increase this number =/
$very_similar_threshold = 1024; // bytes needed to be equal to go to full check

$files = listaArquivosPath($path); // gets all files on that path
$files_ex = array();
if ((substr($path,strlen($path)-1,1)) != "/" ) $path .= "/";

$asize = 0;
$first = -1;
// add to files_ex ordered by size, using a singly linked list
foreach ($files as $trash => $fname) {
$tempfile = array ( "name" => $fname,
"size" => filesize($path.$fname) );
$pointer = $first;
$prev = -1;
$al = 0;
while ($pointer != -1 && $al < $anti_loop_magic_number) {
if ($tempfile["size"] < $files_ex[$pointer]['size']) break;
$prev = $pointer;
$pointer = $files_ex[$pointer]['next'];
$al ++;
}
if ($al == $anti_loop_magic_number) die("Anti-loop activated");

if ($pointer == $first) { // I'm the new first file
$tempfile['next'] = $first; // Next after me is the actual first
$first = $asize; // I'm the new first
} else { // I'm in the middle/ending of the list
$tempfile['next'] = $files_ex[$prev]['next']; // My next is the next of the one before me
$files_ex[$prev]['next'] = $asize; // The one before me now points to me
}
$files_ex[$asize] = $tempfile;
$asize++;
}

$pointer = $first;
$al = 0;
$ssf = 0; $vsf = 0; $matches = array();
while ($pointer != -1 && $al < $anti_loop_magic_number) {
$next = $files_ex[$pointer]['next'];
if ($next != -1) {
if ($files_ex[$pointer]['size'] == $files_ex[$next]['size']) {
$ssf++;
// same filesizes ... there is a chance it's the same file! compare them!
$pointer_fc = fileread($path.$files_ex[$pointer]['name'],true,$very_similar_threshold);
$next_fc = fileread($path.$files_ex[$next]['name'],true,$very_similar_threshold);
if ($pointer_fc === $next_fc) { // first chunks are equal! Amazingly high chance of being the same file, perform full comparison (=== so numeric-looking contents are not compared as numbers)
$vsf++;
$pointer_fc = fileread($path.$files_ex[$pointer]['name']);
$next_fc = fileread($path.$files_ex[$next]['name']);
if ($pointer_fc === $next_fc) { // full contents are equal
// MATCH!
array_push($matches,$path.$files_ex[$pointer]['name']." = ".$path.$files_ex[$next]['name']);
}
}
}
}
$pointer = $next;
$al ++;
}
if ($al == $anti_loop_magic_number) die("Anti-loop activated (l2)");
echo "File comparison complete:<BR><BR>";
echo "Files detected: <B>$asize</B><BR>";
echo "Files with same size: <B>$ssf</B><BR>";
echo ".. among them, total very similar: <B>$vsf</B><BR>";
echo ".... and total detected to be the same: <B>".count($matches)."</B><BR><BR>";
echo "Listing matches:<BR><BR>";
foreach($matches as $trash => $text)
echo $text."<BR>";

}

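function fileread($ofile,$firstchuck=false,$chucksize=1024) {
if (is_file($ofile)) {
$fd = fopen ($ofile, "rb"); // binary-safe read
$saida = "";
// fread returns up to $chucksize bytes no matter where the line breaks are
// (fgets stops at the first newline and reads at most $chucksize-1 bytes)
while (!feof($fd)) {
$data = fread($fd,$chucksize);
if ($data === false) break; // read error
$saida.=$data;
if ($firstchuck) break; // I want only the first chunk
}
fclose($fd);
return $saida;
} else
return "File not found: ".$ofile;
}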

function listaArquivosPath($path,$eregfilter='^(.*)$') {

$array = array();
if (!is_dir($path))
return $array;
if ($handle = opendir($path)) {
while (false !== ($file = readdir($handle)))
// eregi() was removed in PHP 7; preg_match with the "i" flag does the same case-insensitive test
if ($file != "." and $file != ".." and is_file($path."/".$file) and preg_match('~'.$eregfilter.'~i',$file))
$array[] = $file; // regular files only, subfolders are skipped
closedir($handle); // close only a handle that was actually opened
}
return $array;
}


?>

---- edit, since the thread was closed ----

Jinzhou is still friendly and understanding with people who want to help, nothing changes ^^

  • Jinzhou
  • Retired Moderator
  • 3y 9wk ago

Please put this in a technical inquiry. Thank you.
