Is there a hard drive file organizer that will ...


MEB

"98 Guy" <98@Guy.com> wrote in message news:49047427.19372C62@Guy.com...
| thanatoid wrote:
|
| > > It would be easy enough to look for such utilities on
| > > Google
| >
| > I thought I'd spare 98 Guy that statement -)
|
| What you get are tons and tons of programs that will "catalog" your
| files (especially multi-media files - music, movies, etc).
|
| > > you could have done the same with Windows===> find
| > >
| > > then sorting and deleting as needed
|
| Given perhaps several hundred thousand files on any given drive,
| multiply that by 1 to 2 dozen drives, and you're going to spend hours
| rounding up, sorting, comparing, and aggregating all the files you want
| from all of them.
|
| Perhaps you still don't understand what I'm trying to do.

I think I get what you're thinking about, but when you bring in corporate
structure and potential issues {as you did previously}, then that should be
handled by the server/network setup, group policies, synchronization
aspects, and other network possibilities.
Your question IS viable for a user who has failed to apply sensible usage,
or even for a small network without a centralized server or more, but it
fails to address and understand how large networks ensure these things do
NOT occur.
If your company or another has these types of difficulties, then you/they
need to re-think your/their networking setup, PARTICULARLY when using
Microsoft servers and OSs.
ANY user who fails to put in place some form of directory/file type/specific
activity/temp folders-master files/synchronization policies/etc. system will
ALWAYS end up with tons of JUNK. So if you actually think about it, what
you're asking for is ANOTHER third-party program to correct your own
failure to properly set up your own usage.
IF you're referring to yourself, with issues across dozens of drives, then
I would question why you haven't been applying your own method of removing
old files and multiple duplicates on a regular basis. Even 98 had
synchronization abilities.... moreover, I would question why you don't use
CD-ROMs and/or DVDs and multiple burner drives rather than HDs; it seems like
a tremendous waste of money and a poorly thought-out use of those
resources.
If you're swapping hard drives, then why haven't you labeled them for
SPECIFIC usage?

--
MEB
http://peoplescounsel.org
a Peoples' counsel
_ _
~~
 

thanatoid

98 Guy <98@Guy.com> wrote in news:4904757E.5C19DCA6@Guy.com:

> thanatoid wrote:
>
>> > but for a SOHO situation where you have perhaps 5 to 10
>> > years worth of computer use by an office with 2, 5, 10
>> > or 20 people, you tend to build up a collection of hard
>> > drives that one day you want to organize and retrieve
>> > the contents of to make them available to others, and
>> > to wipe the original drives before donating or
>> > discarding.

>>
>> Ever heard of a network?

>
> What's that got to do with the paragraph above?
>
> If I have 10 copies of the same .xls or .doc file spread
> across 10 hard drives, putting all those drives on a
> network isn't going to change the fact that there are 10
> copies of the same file accessible to everyone on the
> network, instead of just one copy.


Your understanding of networks is not nearly as impeccable as
your logic of finding duplicates on one drive, on which I
commented in my previous reply.

In any case, discussing the problem endlessly is not going to
make it go away, so either get to work, or quit your job. You
have been given all the info short of the phone number of
someone who will do it for you for free.

It's a nasty job, I admit. I suggest better planning next time
(it may not have been YOU that set things up this way, but it
appears to be in your lap now) and there is obviously little
control over what individuals do in that place. There are
various ways of dealing with such problems, and a network server
(which, BTW would contain ONE copy of each file which people can
access and work on as they need to) is one of the starting points.
Careful supervision of what people actually DO on their
individual workstations and their qualifications is the second.

If you want this to actually be a success you may have to tell
everybody to go on vacation for a while so they don't instantly
mess up every little bit you've managed to do.

Reminds me of when I worked for a lunatic who wouldn't ever use
a pen; he ONLY used pencils - for when he made his endless
mistakes/corrections.


--
Those who cast the votes decide nothing. Those who count the
votes decide everything.
- Josef Stalin

NB: Not only is my KF over 4 KB and growing, I am also filtering
everything from discussions.microsoft and google groups, so no
offense if you don't get a reply/comment unless I see you quoted
in another post.
 

Franc Zabkar

On Sun, 26 Oct 2008 13:56:34 +0000, "J. P. Gilliver (John)"
<G6JPG@soft255.demon.co.uk> put finger to keyboard and composed:

>In message <sa78g4hh81gdhr2k51vndb81dm429id32l@4ax.com>, Franc Zabkar
><fzabkar@iinternode.on.net> writes
>[]
>>>http://www.david-taylor.pwp.blueyonder.co.uk/software/disk.html#FindDuplicates

>[]
>>That looks like a nice program but I've been running it all day and
>>it's still only a fraction of the way through the comparisons. However

>
>That is the problem I've found with it; it also slows down the PC a bit
>when it's been running a while. The solution is just to hit the stop
>button: it checks files in descending order of size, so by the time it
>has slowed to a crawl, it will have compared the large files. (I've seen
>it spend ages comparing a 44 byte file!) When you hit the stop button,
>you _don't_ lose what it has found so far. Once you've dealt with those,
>you can set it going again, and (assuming you've not _left_ big
>duplicates in place), it will start with the remaining duplicates, back
>at its higher starting speed.
>
>EasyCleaner, from http://personal.inet.fi/business/toniarts/ecleane.htm
>(which is a free set of utilities I think anyone should have anyway)
>includes a duplicate finder which I think uses the same engine as
>FindDup, but starts with the littlest files. (I have a feeling it might
>not have the slowdown, either.)


AFAICS, a fundamental flaw in duplicate finder software is that it
relies on direct binary comparisons. With programs like FindDup, if we
have 3 files of equal size, then we would need to compare file1 with
file2, file1 with file3, and file2 with file3. This requires 6 reads.
For n equally sized files, the number of reads is n(n-1).

Alternatively, if we relied on MD5 checksums, then each file would
only need to be read once.
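
(Roughly the idea, as a Python sketch - the folder path below is made up
and this isn't how FindDup or FastSum actually work internally, just the
general approach: group files by size first, then hash each candidate once.)

import hashlib, os
from collections import defaultdict

def md5_of(path, chunk=1 << 20):
    # read the file once, in 1 MB chunks, and return its MD5 digest
    h = hashlib.md5()
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk)
            if not block:
                break
            h.update(block)
    return h.hexdigest()

def find_duplicates(root):
    by_size = defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in names:
            p = os.path.join(dirpath, name)
            try:
                by_size[os.path.getsize(p)].append(p)
            except OSError:
                pass                    # unreadable file - skip it
    groups = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue                    # unique size, can't have a duplicate
        for p in paths:
            groups[(size, md5_of(p))].append(p)
    return {k: v for k, v in groups.items() if len(v) > 1}

for key, paths in find_duplicates("D:/old-drives").items():
    print(key, paths)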

>>that is probably a reflection on my poor housekeeping. In any case it
>>seems to me that the author would benefit greatly by using 98Guy's
>>approach, ie calculating and comparing MD5 checksums. IIRC, FastSum
>>took less than 30 minutes on my 450MHz box.

>[]
>For finding "what's eating my disc", I haven't come across anything to
>beat Steffen Gerlach's Scanner, from
>http://www.steffengerlach.de/freeware/
> this is what I can only describe as a hierarchical piecharter, and you
>should try it. Of course, it must be rubbish, as it's only a 164K
>download ... There's also a piecharter in David Taylor's area (same page
>as FindDup IIRR), and as part of EasyCleaner (again, I think uses David
>Taylor's code), and you can go up and down the levels in those, but I'm
>not aware of anything that has a hierarchical display like Scanner.


I *love* small utility software. At the moment I'm playing with
Windows CE in a small GPS device. It reminds me what can be done with
a small amount of resources, eg a 16KB calculator, 23KB task manager,
6.5KB screen capture utility.

- Franc Zabkar
--
Please remove one 'i' from my address when replying by email.
 

FromTheRafters

> AFAICS, a fundamental flaw in duplicate finder software is that it
> relies on direct binary comparisons. With programs like FindDup, if we
> have 3 files of equal size, then we would need to compare file1 with
> file2, file1 with file3, and file2 with file3. This requires 6 reads.
> For n equally sized files, the number of reads is n(n-1).
>
> Alternatively, if we relied on MD5 checksums, then each file would
> only need to be read once.


So...once it is found to be the same checksum, what should the
program do next? How important are these files? A fundamental
flaw would be to trust MD5 checksums as an indication that the
files are indeed duplicates. You can mostly trust MD5 checksums
to indicate two files are different, but the other way around?
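
(Rough numbers behind that caution, purely as an illustration: an MD5
digest is only 128 bits, so by the pigeonhole principle distinct files
*must* share digests once files are longer than 16 bytes - though hitting
such a pair by accident is another matter entirely.)

md5_space = 2 ** 128                # possible MD5 digests
files_of_1kb = 2 ** (8 * 1024)      # possible 1 KB files
print(files_of_1kb > md5_space)     # True - collisions must exist somewhere
print(2 ** 64)                      # rough birthday bound: ~2**64 random
                                    # files before an accidental collision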
 

Bill in Co.

FromTheRafters wrote:
>> AFAICS, a fundamental flaw in duplicate finder software is that it
>> relies on direct binary comparisons. With programs like FindDup, if we
>> have 3 files of equal size, then we would need to compare file1 with
>> file2, file1 with file3, and file2 with file3. This requires 6 reads.
>> For n equally sized files, the number of reads is n(n-1).
>>
>> Alternatively, if we relied on MD5 checksums, then each file would
>> only need to be read once.

>
> So...once it is found to be the same checksum, what should the
> program do next? How important are these files?


> A fundamental
> flaw would be to trust MD5 checksums as an indication that the
> files are indeed duplicates.


Since when? What is the statistical likelihood of that being true?

> You can mostly trust MD5 checksums
> to indicate two files are different, but the other way around?
 

Franc Zabkar

On Sun, 26 Oct 2008 16:23:16 -0400, "FromTheRafters"
<erratic@nomail.afraid.org> put finger to keyboard and composed:

>> AFAICS, a fundamental flaw in duplicate finder software is that it
>> relies on direct binary comparisons. With programs like FindDup, if we
>> have 3 files of equal size, then we would need to compare file1 with
>> file2, file1 with file3, and file2 with file3. This requires 6 reads.
>> For n equally sized files, the number of reads is n(n-1).
>>
>> Alternatively, if we relied on MD5 checksums, then each file would
>> only need to be read once.

>
>So...once it is found to be the same checksum, what should the
>program do next? How important are these files? A fundamental
>flaw would be to trust MD5 checksums as an indication that the
>files are indeed duplicates. You can mostly trust MD5 checksums
>to indicate two files are different, but the other way around?


OK, I retract my ill-informed comment, but it still seems to me that
the benefits far outweigh the risks. FindDup has been running for the
past 18 hours or so as I write this, so I'm happy to accept a 30-minute
alternative. In any case, all programs appear to require that the user
decide whether or not a file can be safely deleted. To this end, the
programmer could allow for a binary comparison in those cases where
there is any doubt.

- Franc Zabkar
--
Please remove one 'i' from my address when replying by email.
 

98 Guy

"Bill in Co." wrote:

> > A fundamental flaw would be to trust MD5 checksums as an
> > indication that the files are indeed duplicates.

>
> Since when? What is the statistical likelihood of that being
> true?


If there were no malicious intent or source involved, I'd say the odds
are pretty low. But even if you had 2 identical hashes, it's simple
enough to just see if the files are the same length, and if they are,
then do a byte-by-byte comparison.
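
(Something like this is all I mean - a Python sketch with made-up file
names; Python's filecmp does the byte-by-byte part.)

import filecmp, os

def really_identical(a, b):
    if os.path.getsize(a) != os.path.getsize(b):
        return False                         # different lengths: not duplicates
    return filecmp.cmp(a, b, shallow=False)  # full byte-by-byte comparison

# only files that already share an MD5 hash would get this extra check
print(really_identical("report-2007.xls", "backup/report-2007.xls"))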
 

J. P. Gilliver (John)

In message <6og9g4ppkkap70p8ljdadfh8v8g6q6isc3@4ax.com>, Franc Zabkar
<fzabkar@iinternode.on.net> writes
[]
>AFAICS, a fundamental flaw in duplicate finder software is that it
>relies on direct binary comparisons. With programs like FindDup, if we
>have 3 files of equal size, then we would need to compare file1 with
>file2, file1 with file3, and file2 with file3. This requires 6 reads.
>For n equally sized files, the number of reads is n(n-1).


Yes, if none of the comparisons match. If file 1 is found to be the same
as file 2, then there's no need to compare file 3 with both of them, only
with one.
>
>Alternatively, if we relied on MD5 checksums, then each file would
>only need to be read once.


It's a while since I played with FindDup - but I think using checksums
of some sort is one of its configuration options.
[]
>>http://www.steffengerlach.de/freeware/
>> this is what I can only describe as a hierarchical piecharter, and you
>>should try it. Of course, it must be rubbish, as it's only a 164K

[]
>I *love* small utility software. At the moment I'm playing with
>Windows CE in a small GPS device. It reminds me what can be done with
>a small amount of resources, eg a 16KB calculator, 23KB task manager,
>6.5KB screen capture utility.

[]
Of course, I was being sarcastic - I like small util.s too, not just for
the intrinsic appeal, but because they tend to run more quickly and with
fewer problems.

I still haven't found anything to beat flamer.com - OK, it is only a
fire simulator, but how it manages to do it in 437 bytes (4xx, anyway) I
still don't know. (Works under everything I've tried up to XP.)
--
J. P. Gilliver. UMRA: 1960/<1985 MB++G.5AL(+++)IS-P--Ch+(p)Ar+T[?]H+Sh0!:`)DNAf
Lada for sale - see www.autotrader.co.uk

This trip should be called "Driving Miss Crazy" - Emma Wilson, on crossing the
southern United States with her mother, Ann Robinson, 2003 or 2004
 

FromTheRafters

"Franc Zabkar" <fzabkar@iinternode.on.net> wrote in message
news:rpn9g4h6d10d3kv20ud3j02e2phuq66ucg@4ax.com...
> On Sun, 26 Oct 2008 16:23:16 -0400, "FromTheRafters"
> <erratic@nomail.afraid.org> put finger to keyboard and composed:
>
>>> AFAICS, a fundamental flaw in duplicate finder software is that it
>>> relies on direct binary comparisons. With programs like FindDup, if we
>>> have 3 files of equal size, then we would need to compare file1 with
>>> file2, file1 with file3, and file2 with file3. This requires 6 reads.
>>> For n equally sized files, the number of reads is n(n-1).
>>>
>>> Alternatively, if we relied on MD5 checksums, then each file would
>>> only need to be read once.

>>
>>So...once it is found to be the same checksum, what should the
>>program do next? How important are these files? A fundamental
>>flaw would be to trust MD5 checksums as an indication that the
>>files are indeed duplicates. You can mostly trust MD5 checksums
>>to indicate two files are different, but the other way around?

>
> OK, I retract my ill-informed comment, but it still seems to me that
> the benefits far outweigh the risks. FindDup has been running for the
> past 18 hours or so as I write this, so I'm happy to accept a 30-minute
> alternative. In any case, all programs appear to require that the user
> decide whether or not a file can be safely deleted. To this end, the
> programmer could allow for a binary comparison in those cases where
> there is any doubt.


It all depends on the risk you are willing to assume. It would be nice
to have a hybrid case where you could switch between the MD5 mode
and the byte-by-byte mode depending on such factors as the type or
location of the files, etc.
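
(Something along these lines, say - a toy Python sketch; the extensions
and the "valuable" folder test are invented examples of such a policy.)

import filecmp, os

CHEAP_OK = {".jpg", ".jpeg", ".mp3", ".avi"}     # big, low-value: trust the hash

def same_file(a, b, md5_a, md5_b):
    if md5_a != md5_b:
        return False                             # different hashes: different files
    ext = os.path.splitext(a)[1].lower()
    if ext in CHEAP_OK and "valuable" not in a.lower():
        return True                              # MD5 match is good enough here
    return filecmp.cmp(a, b, shallow=False)      # important stuff: verify byte by byte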
 

FromTheRafters

"Bill in Co." <not_really_here@earthlink.net> wrote in message
news:%23RrPjC7NJHA.1908@TK2MSFTNGP04.phx.gbl...
> FromTheRafters wrote:
>>> AFAICS, a fundamental flaw in duplicate finder software is that it
>>> relies on direct binary comparisons. With programs like FindDup, if we
>>> have 3 files of equal size, then we would need to compare file1 with
>>> file2, file1 with file3, and file2 with file3. This requires 6 reads.
>>> For n equally sized files, the number of reads is n(n-1).
>>>
>>> Alternatively, if we relied on MD5 checksums, then each file would
>>> only need to be read once.

>>
>> So...once it is found to be the same checksum, what should the
>> program do next? How important are these files?

>
>> A fundamental
>> flaw would be to trust MD5 checksums as an indication that the
>> files are indeed duplicates.

>
> Since when?


Forever.

Checksums are often smaller than the file they are derived from
(that's kinda the point, eh?).

> What is the statistical likelihood of that being true?


Greater than zero.
 

FromTheRafters

> Your understanding of networks is not nearly as impeccable as
> your logic of finding duplicates on one drive on which I
> commented in my previous reply.


Okay, so as this thread reaches its EOL, it may interest
someone that all might not be as it seems.

I'm not sure about modern disk operating systems, but
some older ones would not actually make a copy when
asked to do so. Rather, they would make another full
path to the same data on disk (why waste space with
redundant data?). Copying to another disk, or partition
on the same disk, would actually necessitate a copy
and would take longer as a result. When access was
made to the file, and it was modified, then the path used
to access that file would point to a newly created file
while the *original* would still be accessed from the
other paths.

So, deleting duplicate files on a single drive in this case
would only clean up the file system without freeing up
any hard drive space.
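
(On file systems that expose link information you can spot that case
before deleting anything - a rough Python sketch, paths invented.)

import os

def same_underlying_file(a, b):
    sa, sb = os.stat(a), os.stat(b)
    return (sa.st_dev, sa.st_ino) == (sb.st_dev, sb.st_ino)   # one file, two names

def is_hard_linked(path):
    return os.stat(path).st_nlink > 1    # more than one directory entry

# deleting one of these "duplicates" frees no space if this prints True
print(same_underlying_file("docs/report.doc", "archive/report.doc"))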
 

Bill in Co.

FromTheRafters wrote:
> "Bill in Co." <not_really_here@earthlink.net> wrote in message
> news:%23RrPjC7NJHA.1908@TK2MSFTNGP04.phx.gbl...
>> FromTheRafters wrote:
>>>> AFAICS, a fundamental flaw in duplicate finder software is that it
>>>> relies on direct binary comparisons. With programs like FindDup, if we
>>>> have 3 files of equal size, then we would need to compare file1 with
>>>> file2, file1 with file3, and file2 with file3. This requires 6 reads.
>>>> For n equally sized files, the number of reads is n(n-1).
>>>>
>>>> Alternatively, if we relied on MD5 checksums, then each file would
>>>> only need to be read once.
>>>
>>> So...once it is found to be the same checksum, what should the
>>> program do next? How important are these files?

>>
>>> A fundamental
>>> flaw would be to trust MD5 checksums as an indication that the
>>> files are indeed duplicates.

>>
>> Since when?

>
> Forever.
>
> Checksums are often smaller than the file they are derived from
> (that's kinda the point, eh?).


No, that's not the point. Your statement was that the checksums did not
assure the integrity of the file, whatsoever - i.e., that two files could
have the same hash values and yet be different, which I still say is *highly*
unlikely. A statistically insignificant probability, so that using hash
values is often prudent and is much more expedient, of course.
 

FromTheRafters

"Bill in Co." <not_really_here@earthlink.net> wrote in message
news:ugvJco8NJHA.3636@TK2MSFTNGP05.phx.gbl...
> FromTheRafters wrote:
>> "Bill in Co." <not_really_here@earthlink.net> wrote in message
>> news:%23RrPjC7NJHA.1908@TK2MSFTNGP04.phx.gbl...
>>> FromTheRafters wrote:
>>>>> AFAICS, a fundamental flaw in duplicate finder software is that it
>>>>> relies on direct binary comparisons. With programs like FindDup, if we
>>>>> have 3 files of equal size, then we would need to compare file1 with
>>>>> file2, file1 with file3, and file2 with file3. This requires 6 reads.
>>>>> For n equally sized files, the number of reads is n(n-1).
>>>>>
>>>>> Alternatively, if we relied on MD5 checksums, then each file would
>>>>> only need to be read once.
>>>>
>>>> So...once it is found to be the same checksum, what should the
>>>> program do next? How important are these files?
>>>
>>>> A fundamental
>>>> flaw would be to trust MD5 checksums as an indication that the
>>>> files are indeed duplicates.
>>>
>>> Since when?

>>
>> Forever.
>>
>> Checksums are often smaller than the file they are derived from
>> (that's kinda the point, eh?).

>
> No, that's not the point. Your statement was that the checksums did not
> assure the integrity of the file, whatsoever


I didn't say anything about the integrity of a file, and I also didn't
say 'whatsoever'. You can still read what I said above.

If you want to ensure they are duplicates - compare the files exactly.
If you only need to be reasonably sure they are duplicates, checksums
are adequate.

> - i.e., that two files could have the same hash values and yet be
> different, which I still say is *highly* unlikely.


Highly unlikely - yes. But files can be highly valuable too. Just
how fast does such a program need to be? How much speed
is worth how much accuracy?

> A statistically insignificant probability, so that using hash values is
> often prudent and is much more expedient, of course.


True, but to aim toward accuracy instead of speed is not a flaw.
 

thanatoid

"FromTheRafters" <erratic@nomail.afraid.org> wrote in
news:#mS$aA8NJHA.3876@TK2MSFTNGP04.phx.gbl:

>> Your understanding of networks is not nearly as impeccable as
>> your logic of finding duplicates on one drive on which I
>> commented in my previous reply.

>
> Okay, so as this thread reaches its EOL, it may interest
> someone that all might not be as it seems.
>
> I'm not sure about modern disk operating systems, but
> some older ones would not actually make a copy when
> asked to do so. Rather, they would make another full
> path to the same data on disk (why waste space with
> redundant data). Copying to another disk, or partition
> on the same disk, would actually necessitate a copy
> and would take longer as a result. When access was
> made to the file, and it was modified, then the path used
> to access that file would point to a newly created file
> while the *original* would still be accessed from the
> other paths.
>
> So, deleting duplicate files on a single drive in this case
> would only clean up the file system without freeing up
> any harddrive space.


OR deleting duplicates, it would seem (don't want to read it
again, see below).

Thanks for the headache. What a nightmare.


--
Those who cast the votes decide nothing. Those who count the
votes decide everything.
- Josef Stalin

NB: Not only is my KF over 4 KB and growing, I am also filtering
everything from discussions.microsoft and google groups, so no
offense if you don't get a reply/comment unless I see you quoted
in another post.
 

Bill in Co.

FromTheRafters wrote:
> "Bill in Co." <not_really_here@earthlink.net> wrote in message
> news:ugvJco8NJHA.3636@TK2MSFTNGP05.phx.gbl...
>> FromTheRafters wrote:
>>> "Bill in Co." <not_really_here@earthlink.net> wrote in message
>>> news:%23RrPjC7NJHA.1908@TK2MSFTNGP04.phx.gbl...
>>>> FromTheRafters wrote:
>>>>>> AFAICS, a fundamental flaw in duplicate finder software is that it
>>>>>> relies on direct binary comparisons. With programs like FindDup, if
>>>>>> we
>>>>>> have 3 files of equal size, then we would need to compare file1 with
>>>>>> file2, file1 with file3, and file2 with file3. This requires 6 reads.
>>>>>> For n equally sized files, the number of reads is n(n-1).
>>>>>>
>>>>>> Alternatively, if we relied on MD5 checksums, then each file would
>>>>>> only need to be read once.
>>>>>
>>>>> So...once it is found to be the same checksum, what should the
>>>>> program do next? How important are these files?
>>>>
>>>>> A fundamental
>>>>> flaw would be to trust MD5 checksums as an indication that the
>>>>> files are indeed duplicates.
>>>>
>>>> Since when?
>>>
>>> Forever.
>>>
>>> Checksums are often smaller than the file they are derived from
>>> (that's kinda the point, eh?).

>>
>> No, that's not the point. Your statement was that the checksums did not
>> assure the integrity of the file, whatsoever

>
> I didn't say anything about the integrity of a file, and I also didn't
> say 'whatsoever'. You can still read what I said above.
>
> If you want to ensure they are duplicates - compare the files exactly.
> If you only need to be reasonably sure they are duplicates, checksums
> are adequate.


"Very reasonably sure" is correct.

>> - i.e., that two files could have the same hash values and yet be
>> different, which I still say is *highly* unlikely.

>
> Highly unlikely -yes. But files can be highly valuable too. Just
> how fast does such a program need to be? How much speed
> is worth how much accuracy?


That's the question, isn't it? Considering the difference in speed, and
for most of our applications, I'd say the hash checksum approach does just
fine. :)

>> A statistically insignificant probability, so that using hash values is
>> often prudent and is much more expedient, of course.

>
> True, but to aim toward accuracy instead of speed is not a flaw.


And there is a point of diminishing returns. Prudence comes in here, i.e.,
using the appropriate technique for the case at hand.
 

Franc Zabkar

On Sun, 26 Oct 2008 21:46:24 -0400, "FromTheRafters"
<erratic@nomail.afraid.org> put finger to keyboard and composed:

>
>"Bill in Co." <not_really_here@earthlink.net> wrote in message
>news:ugvJco8NJHA.3636@TK2MSFTNGP05.phx.gbl...
>> FromTheRafters wrote:
>>> "Bill in Co." <not_really_here@earthlink.net> wrote in message
>>> news:%23RrPjC7NJHA.1908@TK2MSFTNGP04.phx.gbl...
>>>> FromTheRafters wrote:
>>>>>> AFAICS, a fundamental flaw in duplicate finder software is that it
>>>>>> relies on direct binary comparisons. With programs like FindDup, if we
>>>>>> have 3 files of equal size, then we would need to compare file1 with
>>>>>> file2, file1 with file3, and file2 with file3. This requires 6 reads.
>>>>>> For n equally sized files, the number of reads is n(n-1).
>>>>>>
>>>>>> Alternatively, if we relied on MD5 checksums, then each file would
>>>>>> only need to be read once.
>>>>>
>>>>> So...once it is found to be the same checksum, what should the
>>>>> program do next? How important are these files?
>>>>
>>>>> A fundamental
>>>>> flaw would be to trust MD5 checksums as an indication that the
>>>>> files are indeed duplicates.
>>>>
>>>> Since when?
>>>
>>> Forever.
>>>
>>> Checksums are often smaller than the file they are derived from
>>> (that's kinda the point, eh?).

>>
>> No, that's not the point. Your statement was that the checksums did not
>> assure the integrity of the file, whatsoever

>
>I didn't say anything about the integrity of a file, and I also didn't
>say 'whatsoever'. You can still read what I said above.
>
>If you want to ensure they are duplicates - compare the files exactly.
>If you only need to be reasonably sure they are duplicates, checksums
>are adequate.
>
>> - i.e., that two files could have the same hash values and yet be
>> different, which I still say is *highly* unlikely.

>
>Highly unlikely -yes. But files can be highly valuable too. Just
>how fast does such a program need to be? How much speed
>is worth how much accuracy?
>
>> A statistically insignificant probability, so that using hash values is
>> often prudent and is much more expedient, of course.

>
>True, but to aim toward accuracy instead of speed is not a flaw.


Sorry, bad choice of word on my part. However, speed and accuracy, or
speed and safety, are legitimate compromises that we make on a daily
basis. For example, our residential speed limit has been reduced from
60kph to 50kph in the interests of public safety, but we could easily
have a zero road toll if we reduced the limit all the way to 1kph.
Similarly, I could have left FindDup running for several more hours (I
killed it after about 24), but the inconvenience finally got to me.
I'd rather go for speed with something like FastSum, and safeguard
against unlikely losses with a total backup. In fact, I wonder why it
is that no antivirus product seems to be able to reliably detect *all
known* viruses. Is this an intentional compromise of speed versus
security? For example, I used to download Trend Micro's pattern file
updates manually for some time, and noticed that the ZIP files grew to
as much as 23MB until about a year (?) ago, when they suddenly shrank to
only 15MB. Have Trend Micro decided to exclude extinct or rare viruses
from their database, or have they really found a more efficient way to
do things?

- Franc Zabkar
--
Please remove one 'i' from my address when replying by email.
 

FromTheRafters

"Franc Zabkar" <fzabkar@iinternode.on.net> wrote in message
news:upjag4hvtmrrrnldacv22l39huud20jt74@4ax.com...
[snip]

> Sorry, bad choice of word on my part. However, speed and accuracy, or
> speed and safety, are legitimate compromises that we make on a daily
> basis. For example, our residential speed limit has been reduced from
> 60kph to 50kph in the interests of public safety, but we could easily
> have a zero road toll if we reduced the limit all the way to 1kph.
> Similarly, I could have left FindDup running for several more hours (I
> killed it after about 24), but the inconvenience finally got to me.
> I'd rather go for speed with something like FastSum, and safeguard
> against unlikely losses with a total backup.


Having read some of your excellent posts, I was sure you would
know where I was coming from with those comments.

> In fact, I wonder why it
> is that no antivirus product seems to be able to reliably detect *all
> known* viruses.


Virus detection is reducible to "The Halting Problem".
http://claymania.com/halting-problem.html

Add to that the many methods applied by viruses to make the task
more difficult for the detector.

Heuristics is a less accurate but faster approach, and it reminds me of
the current topic (only more markedly). It seems a shame to treat a
near-100%-accurate byte-by-byte method as equivalent to an MD5 hash when,
in the virus-detection world, 100% is a pipe dream and heuristics must be
dampened to keep false positives from getting out of hand.

Many of the better AV programs use a mixture of methods, including
but not limited to the above.

> Is this an intentional compromise of speed versus
> security? For example, I used to download Trend Micro's pattern file
> updates manually for some time, and noticed that the ZIP files grew to
> as much a 23MB until about a year (?) ago when they suddenly shrank to
> only 15MB. Have Trend Micro decided to exclude extinct or rare viruses
> from their database, or have they really found a more efficient way to
> do things?


You might find this interesting:

http://us.trendmicro.com/imperia/md/content/us/pdf/threats/securitylibrary/perry-vb2008.pdf

I highly suspect that old (extinct?) viruses will still be detected.
 

FromTheRafters

"Bill in Co." <not_really_here@earthlink.net> wrote in message
news:%23c5PZw%23NJHA.1144@TK2MSFTNGP05.phx.gbl...
> FromTheRafters wrote:
>> "Bill in Co." <not_really_here@earthlink.net> wrote in message
>> news:ugvJco8NJHA.3636@TK2MSFTNGP05.phx.gbl...
>>> FromTheRafters wrote:
>>>> "Bill in Co." <not_really_here@earthlink.net> wrote in message
>>>> news:%23RrPjC7NJHA.1908@TK2MSFTNGP04.phx.gbl...
>>>>> FromTheRafters wrote:
>>>>>>> AFAICS, a fundamental flaw in duplicate finder software is that it
>>>>>>> relies on direct binary comparisons. With programs like FindDup, if
>>>>>>> we
>>>>>>> have 3 files of equal size, then we would need to compare file1 with
>>>>>>> file2, file1 with file3, and file2 with file3. This requires 6
>>>>>>> reads.
>>>>>>> For n equally sized files, the number of reads is n(n-1).
>>>>>>>
>>>>>>> Alternatively, if we relied on MD5 checksums, then each file would
>>>>>>> only need to be read once.
>>>>>>
>>>>>> So...once it is found to be the same checksum, what should the
>>>>>> program do next? How important are these files?
>>>>>
>>>>>> A fundamental
>>>>>> flaw would be to trust MD5 checksums as an indication that the
>>>>>> files are indeed duplicates.
>>>>>
>>>>> Since when?
>>>>
>>>> Forever.
>>>>
>>>> Checksums are often smaller than the file they are derived from
>>>> (that's kinda the point, eh?).
>>>
>>> No, that's not the point. Your statement was that the checksums did
>>> not
>>> assure the integrity of the file, whatsoever

>>
>> I didn't say anything about the integrity of a file, and I also didn't
>> say 'whatsoever'. You can still read what I said above.
>>
>> If you want to ensure they are duplicates - compare the files exactly.
>> If you only need to be reasonably sure they are duplicates, checksums
>> are adequate.

>
> "Very reasonably sure" is correct.
>
>>> - i.e., that two files could have the same hash values and yet be
>>> different, which I still say is *highly* unlikely.

>>
>> Highly unlikely -yes. But files can be highly valuable too. Just
>> how fast does such a program need to be? How much speed
>> is worth how much accuracy?

>
> That's the question, isn't it? Considering the difference in speed, and
> for most of our applications, I'd say the hash checksum approach does just
> fine. :)
>
>>> A statistically insignificant probability, so that using hash values is
>>> often prudent and is much more expedient, of course.

>>
>> True, but to aim toward accuracy instead of speed is not a flaw.

>
> And there is a point of diminishing returns. Prudence comes in here
> i.e., using the appropriate technique for the case at hand.


We agree then! :o)

I think a hybrid approach would be best. For instance, filetypes like JPEG
are rather large, and I value them much lower than I do PDF, DOC, and
even some JPEGs, depending on their location. Plus, that puts the
responsibility on the user, who made the informed decision to use a
pretty-nearly-flawless approach instead of a most-nearly-flawless one in
the event of a disaster.

Writing such a program and selling it as just as good as, but faster than,
byte-by-byte comparisons could leave one open to a lawsuit.
 