All of lore.kernel.org
 help / color / mirror / Atom feed
* btrfs-image hash collision option, super slow
@ 2017-11-12  0:18 Chris Murphy
  2017-11-12  0:42 ` Hugo Mills
  0 siblings, 1 reply; 8+ messages in thread
From: Chris Murphy @ 2017-11-12  0:18 UTC (permalink / raw)
  To: Btrfs BTRFS

OK this might be in the stupid questions category, but I'm not
understanding the purpose of computing hash collisions with -ss. Or
more correctly, why it's taking so much longer than -s.

It seems like what we'd want is every filename to have the same hash,
but for the file to go through a PBKDF so the hashes we get aren't
(easily) brute forced. So I totally understand that -ss should take
much longer than -s, but this is at least two orders magnitude longer
(so far). That's why I'm confused.

-s option on this file system took 5 minutes, start to finish.
-ss option is at 8 hours and counting.

The other part I'm not groking is that some filenames fail with:

WARNING: cannot find a hash collision for 'Tool', generating garbage,
it won't match indexes

So? That seems like an undesirable outcome. And if it were just being
pushed through a PBKDF function, it wouldn't fail. Every
file/directory "Tool" would get the same hash on *this* run of
btrs-image. If I run it again, or someone else runs it, they'd get
some other hash (same hashes for each instance of "Tool" on their
filesystem).



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: btrfs-image hash collision option, super slow
  2017-11-12  0:18 btrfs-image hash collision option, super slow Chris Murphy
@ 2017-11-12  0:42 ` Hugo Mills
  2017-11-12  0:54   ` Chris Murphy
  0 siblings, 1 reply; 8+ messages in thread
From: Hugo Mills @ 2017-11-12  0:42 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

[-- Attachment #1: Type: text/plain, Size: 4509 bytes --]

On Sat, Nov 11, 2017 at 05:18:33PM -0700, Chris Murphy wrote:
> OK this might be in the stupid questions category, but I'm not
> understanding the purpose of computing hash collisions with -ss. Or
> more correctly, why it's taking so much longer than -s.
> 
> It seems like what we'd want is every filename to have the same hash,
> but for the file to go through a PBKDF so the hashes we get aren't
> (easily) brute forced. So I totally understand that -ss should take
> much longer than -s, but this is at least two orders magnitude longer
> (so far). That's why I'm confused.
> 
> -s option on this file system took 5 minutes, start to finish.
> -ss option is at 8 hours and counting.
> 
> The other part I'm not groking is that some filenames fail with:
> 
> WARNING: cannot find a hash collision for 'Tool', generating garbage,
> it won't match indexes
> 
> So? That seems like an undesirable outcome. And if it were just being
> pushed through a PBKDF function, it wouldn't fail. Every
> file/directory "Tool" would get the same hash on *this* run of
> btrs-image. If I run it again, or someone else runs it, they'd get
> some other hash (same hashes for each instance of "Tool" on their
> filesystem).

   In the FS tree, you can go from the inode of the file to its name
(where the inode is in the index, and the name is stored in the
corresponding data item). Alternatively, you can go from the filename
to the inode. In the latter case, since the keys are a structured 17
byte object, you obviously can't fit the whole filename into the key,
so the filename is hashed (using, IIRC, CRC32), and it's the hash that
appears in the key of the index.

   When an image is made without the -s options, the whole metadata is
stored, including all the filenames in the data items. For some
people, that's a security risk, and they don't want their filenames
leaking out, so -s exists to put junk in the filename records.
However, it doesn't change the hashes in the index to correspond with
the modified filenames, because that would at minimum require the
whole tree to be rebuilt (because all the items would have different
hashes, and hence different ordering in the index). This is a bad
thing for debugging, because you're not getting the details of the
tree as it was in the broken filesystem. So, in this case, the image
is actually broken, because the filenames don't match the hashes.

   Most of the time, that's absolutely fine, because the thing being
debugged is somewhere else, and it doesn't matter that "ls" on the
restored FS won't work right.

   However, in some (possibly hypothetical) cases, it _does_ matter,
and you do need the hashes to match the filenames. This is where -ss
comes in. We can't generate random filenames and then take the hashes
of those, because of the undesirability of rewriting the whole FS tree
to reindex it with the changed hashes. So, what -ss tries to do is
stick with the original hashes and find arbitrary filenames which
match them. It's (I think) CRC32, so it shouldn't be too hard, but
it's still non-trivial amounts of work to reverse engineer a
human-readable ASCII filename which hashes to a given value.
Particularly if, as was the case when Josef wrote it, a simple
brute-force algorithm was used.

   It could definitely be improved -- I believe there are some good
(but non-trivial) algorithms for finding preimages for CRC32 checksums
out there. It's just that btrfs-image doesn't use them. However, it's
not an option that's needed very often, so it's probably not worth
putting in the effort to fix it up. (I definitely remember Josef
commenting on IRC when he wrote -s and -ss that it could almost
certainly be done more efficiently, but he had bigger fish to fry at
the time, like fixing the broken FS he was working on)

   As to the thing where it's not finding a pre-image at all -- I'm
guessing here, but it's possible that this is a case where two of the
orginal filenames hashed to the same value. If that happens, one of
the hashes is incremented by a small integer in a predictable way
before storage. So it may be that the resulting value isn't mappable
to an ASCII pre-image, or that the search just gives up before finding
one.

   Hugo.

-- 
Hugo Mills             | Yes, this is an example of something that becomes
hugo@... carfax.org.uk | less explosive as a one-to-one cocrystal with TNT.
http://carfax.org.uk/  | (Hexanitrohexaazaisowurtzitane)
PGP: E2AB1DE4          |                                            Derek Lowe

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: btrfs-image hash collision option, super slow
  2017-11-12  0:42 ` Hugo Mills
@ 2017-11-12  0:54   ` Chris Murphy
  2017-11-12 14:05     ` Piotr Pawłow
  0 siblings, 1 reply; 8+ messages in thread
From: Chris Murphy @ 2017-11-12  0:54 UTC (permalink / raw)
  To: Hugo Mills, Chris Murphy, Btrfs BTRFS

On Sat, Nov 11, 2017 at 5:42 PM, Hugo Mills <hugo@carfax.org.uk> wrote:
> On Sat, Nov 11, 2017 at 05:18:33PM -0700, Chris Murphy wrote:
>> OK this might be in the stupid questions category, but I'm not
>> understanding the purpose of computing hash collisions with -ss. Or
>> more correctly, why it's taking so much longer than -s.
>>
>> It seems like what we'd want is every filename to have the same hash,
>> but for the file to go through a PBKDF so the hashes we get aren't
>> (easily) brute forced. So I totally understand that -ss should take
>> much longer than -s, but this is at least two orders magnitude longer
>> (so far). That's why I'm confused.
>>
>> -s option on this file system took 5 minutes, start to finish.
>> -ss option is at 8 hours and counting.
>>
>> The other part I'm not groking is that some filenames fail with:
>>
>> WARNING: cannot find a hash collision for 'Tool', generating garbage,
>> it won't match indexes
>>
>> So? That seems like an undesirable outcome. And if it were just being
>> pushed through a PBKDF function, it wouldn't fail. Every
>> file/directory "Tool" would get the same hash on *this* run of
>> btrs-image. If I run it again, or someone else runs it, they'd get
>> some other hash (same hashes for each instance of "Tool" on their
>> filesystem).
>
>    In the FS tree, you can go from the inode of the file to its name
> (where the inode is in the index, and the name is stored in the
> corresponding data item). Alternatively, you can go from the filename
> to the inode. In the latter case, since the keys are a structured 17
> byte object, you obviously can't fit the whole filename into the key,
> so the filename is hashed (using, IIRC, CRC32), and it's the hash that
> appears in the key of the index.
>
>    When an image is made without the -s options, the whole metadata is
> stored, including all the filenames in the data items. For some
> people, that's a security risk, and they don't want their filenames
> leaking out, so -s exists to put junk in the filename records.
> However, it doesn't change the hashes in the index to correspond with
> the modified filenames, because that would at minimum require the
> whole tree to be rebuilt (because all the items would have different
> hashes, and hence different ordering in the index). This is a bad
> thing for debugging, because you're not getting the details of the
> tree as it was in the broken filesystem. So, in this case, the image
> is actually broken, because the filenames don't match the hashes.
>
>    Most of the time, that's absolutely fine, because the thing being
> debugged is somewhere else, and it doesn't matter that "ls" on the
> restored FS won't work right.
>
>    However, in some (possibly hypothetical) cases, it _does_ matter,
> and you do need the hashes to match the filenames. This is where -ss
> comes in. We can't generate random filenames and then take the hashes
> of those, because of the undesirability of rewriting the whole FS tree
> to reindex it with the changed hashes. So, what -ss tries to do is
> stick with the original hashes and find arbitrary filenames which
> match them. It's (I think) CRC32, so it shouldn't be too hard, but
> it's still non-trivial amounts of work to reverse engineer a
> human-readable ASCII filename which hashes to a given value.
> Particularly if, as was the case when Josef wrote it, a simple
> brute-force algorithm was used.
>
>    It could definitely be improved -- I believe there are some good
> (but non-trivial) algorithms for finding preimages for CRC32 checksums
> out there. It's just that btrfs-image doesn't use them. However, it's
> not an option that's needed very often, so it's probably not worth
> putting in the effort to fix it up. (I definitely remember Josef
> commenting on IRC when he wrote -s and -ss that it could almost
> certainly be done more efficiently, but he had bigger fish to fry at
> the time, like fixing the broken FS he was working on)
>
>    As to the thing where it's not finding a pre-image at all -- I'm
> guessing here, but it's possible that this is a case where two of the
> orginal filenames hashed to the same value. If that happens, one of
> the hashes is incremented by a small integer in a predictable way
> before storage. So it may be that the resulting value isn't mappable
> to an ASCII pre-image, or that the search just gives up before finding
> one.
>

Super explanation, thanks Hugo. Maybe it's worth an additional note in
the man page, just how expensive it is. I'm definitely regretting not
starting this imaging in a tmux session.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: btrfs-image hash collision option, super slow
  2017-11-12  0:54   ` Chris Murphy
@ 2017-11-12 14:05     ` Piotr Pawłow
  2017-11-13  3:28       ` Duncan
  2017-11-13  3:42       ` Chris Murphy
  0 siblings, 2 replies; 8+ messages in thread
From: Piotr Pawłow @ 2017-11-12 14:05 UTC (permalink / raw)
  To: Chris Murphy, Hugo Mills, Btrfs BTRFS


>>    It could definitely be improved -- I believe there are some good
>> (but non-trivial) algorithms for finding preimages for CRC32 checksums
>> out there. It's just that btrfs-image doesn't use them.

I implemented a faster method using a reverse CRC32 table, which is in btrfs-progs since release 4.13.1.

On my 20GB root fs SSD:

# echo 3 >/proc/sys/vm/drop_caches && time btrfs-image -s /dev/mapper/pp-root /dev/null >/dev/null 2>&1

real    0m22,023s
user    0m14,812s
sys    0m2,508s
# echo 3 >/proc/sys/vm/drop_caches && time btrfs-image -ss /dev/mapper/pp-root /dev/null >/dev/null 2>&1

real    0m31,977s
user    0m17,984s
sys    0m2,632s

Slower, but not 2 orders of magnitude slower :)

> The other part I'm not groking is that some filenames fail with:
>
> WARNING: cannot find a hash collision for 'Tool', generating garbage,
> it won't match indexes

Unfortunately there are no CRC32 collisions for any file name having 4 or less characters when you have to keep the same file name length, and there may be no collisions for longer file names when you limit the result to ASCII only.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: btrfs-image hash collision option, super slow
  2017-11-12 14:05     ` Piotr Pawłow
@ 2017-11-13  3:28       ` Duncan
  2017-11-13  3:42       ` Chris Murphy
  1 sibling, 0 replies; 8+ messages in thread
From: Duncan @ 2017-11-13  3:28 UTC (permalink / raw)
  To: linux-btrfs

Piotr Pawłow posted on Sun, 12 Nov 2017 15:05:44 +0100 as excerpted:

>> The other part I'm not groking is that some filenames fail with:
>>
>> WARNING: cannot find a hash collision for 'Tool', generating garbage,
>> it won't match indexes
> 
> Unfortunately there are no CRC32 collisions for any file name having 4
> or less characters when you have to keep the same file name length, and
> there may be no collisions for longer file names when you limit the
> result to ASCII only.

Hmm...  Sounds reasonable, tho I hadn't thought/read of it before.  But 
now that I am...

What's the statistical collision spread as the number of characters 
increases?

More worrying than not having /any/ collisions might be having only one, 
or for that matter, some known small number less than say 1024, and 
certainly less than 64 or so, since for the only one collision 
possibility, once you know it wasn't the one you have, you know which one 
it actually was, and chances are, with a bit of information about the 
target in question, such as a suspected crime or the department at a 
competing company the filesystem was from, 64 or even something like 1024 
possibilities are trivial to look up in some rainbow table somewhere and 
see which possibilities look interesting.   Get a few of those and pretty 
soon you can start putting a story together.

So it might make sense to (at least have an option to, maybe -sss) force 
garbage for anything under say 6 chars or whatever, depending on how fast 
the number of collisions rise, to avoid the information leak due to too 
few collisions issue, particularly given that once we assume people 
consider it worth going to the trouble in the first place, they likely 
/are/ going to be concerned about such low number of collisions 
information leakage.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: btrfs-image hash collision option, super slow
  2017-11-12 14:05     ` Piotr Pawłow
  2017-11-13  3:28       ` Duncan
@ 2017-11-13  3:42       ` Chris Murphy
  2017-11-13  8:02         ` Piotr Pawłow
  1 sibling, 1 reply; 8+ messages in thread
From: Chris Murphy @ 2017-11-13  3:42 UTC (permalink / raw)
  To: Piotr Pawłow; +Cc: Chris Murphy, Hugo Mills, Btrfs BTRFS

On Sun, Nov 12, 2017 at 7:05 AM, Piotr Pawłow <pp@siedziba.pl> wrote:
>
>>>    It could definitely be improved -- I believe there are some good
>>> (but non-trivial) algorithms for finding preimages for CRC32 checksums
>>> out there. It's just that btrfs-image doesn't use them.
>
> I implemented a faster method using a reverse CRC32 table, which is in btrfs-progs since release 4.13.1.
>
> On my 20GB root fs SSD:
>
> # echo 3 >/proc/sys/vm/drop_caches && time btrfs-image -s /dev/mapper/pp-root /dev/null >/dev/null 2>&1
>
> real    0m22,023s
> user    0m14,812s
> sys    0m2,508s
> # echo 3 >/proc/sys/vm/drop_caches && time btrfs-image -ss /dev/mapper/pp-root /dev/null >/dev/null 2>&1
>
> real    0m31,977s
> user    0m17,984s
> sys    0m2,632s
>
> Slower, but not 2 orders of magnitude slower :)


Strange. I was using 4.3.3 and it had been running for over 9 hours at
the time I finally cancelled it.


>
>> The other part I'm not groking is that some filenames fail with:
>>
>> WARNING: cannot find a hash collision for 'Tool', generating garbage,
>> it won't match indexes
>
> Unfortunately there are no CRC32 collisions for any file name having 4 or less characters when you have to keep the same file name length, and there may be no collisions for longer file names when you limit the result to ASCII only.

Gotcha.



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: btrfs-image hash collision option, super slow
  2017-11-13  3:42       ` Chris Murphy
@ 2017-11-13  8:02         ` Piotr Pawłow
  2017-11-13 18:18           ` Chris Murphy
  0 siblings, 1 reply; 8+ messages in thread
From: Piotr Pawłow @ 2017-11-13  8:02 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Hugo Mills, Btrfs BTRFS

W dniu 13.11.2017 o 04:42, Chris Murphy pisze:
> Strange. I was using 4.3.3 and it had been running for over 9 hours at
> the time I finally cancelled it.

If you're compiling from source, the usual advice would be to "make clean" and make sure you're using the correct executable.

If your fs is very large then caching may be a problem. With the brute force method it was a good idea to cache the results. With the "reverse crc" method caching is not very useful. It's only marginally faster on my root fs, and on larger filesystems searching the cache will be slower than finding collisions. You can remove the code and see if it makes any difference:

diff --git a/image/main.c b/image/main.c
index 4cffbdba..f77b1504 100644
--- a/image/main.c
+++ b/image/main.c
@@ -500,19 +500,9 @@ static char *find_collision(struct metadump_struct *md, char *name,
                            u32 name_len)
 {
        struct name *val;
-       struct rb_node *entry;
-       struct name tmp;
        int found;
        int i;
 
-       tmp.val = name;
-       tmp.len = name_len;
-       entry = tree_search(&md->name_tree, &tmp.n, name_cmp, 0);
-       if (entry) {
-               val = rb_entry(entry, struct name, n);
-               free(name);
-               return val->sub;
-       }
 
        val = malloc(sizeof(struct name));
        if (!val) {
@@ -548,7 +538,6 @@ static char *find_collision(struct metadump_struct *md, char *name,
                }
        }
 
-       tree_insert(&md->name_tree, &val->n, name_cmp);
        return val->sub;
 }

>
>> Unfortunately there are no CRC32 collisions for any file name having 4 or less characters when you have to keep the same file name length, and there may be no collisions for longer file names when you limit the result to ASCII only.
> Gotcha.

Yeah, it also means for short file names an attacker can easily find the real name by finding all collisions and filtering out obviously nonsensical names, so it's more of an obfuscation than sanitization :/


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: btrfs-image hash collision option, super slow
  2017-11-13  8:02         ` Piotr Pawłow
@ 2017-11-13 18:18           ` Chris Murphy
  0 siblings, 0 replies; 8+ messages in thread
From: Chris Murphy @ 2017-11-13 18:18 UTC (permalink / raw)
  To: Piotr Pawłow; +Cc: Chris Murphy, Hugo Mills, Btrfs BTRFS

On Mon, Nov 13, 2017 at 1:02 AM, Piotr Pawłow <pp@siedziba.pl> wrote:
> W dniu 13.11.2017 o 04:42, Chris Murphy pisze:
>> Strange. I was using 4.3.3 and it had been running for over 9 hours at
>> the time I finally cancelled it.
>
> If you're compiling from source, the usual advice would be to "make clean" and make sure you're using the correct executable.

Interesting, using btrfs-progs-4.13.3-1.fc28.x86_64, it goes super
fast with -ss (minutes). But when using 4.3.3 built myself (nothing
special, autogen, configure, make) it's super slow, hours.




-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2017-11-13 18:18 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-12  0:18 btrfs-image hash collision option, super slow Chris Murphy
2017-11-12  0:42 ` Hugo Mills
2017-11-12  0:54   ` Chris Murphy
2017-11-12 14:05     ` Piotr Pawłow
2017-11-13  3:28       ` Duncan
2017-11-13  3:42       ` Chris Murphy
2017-11-13  8:02         ` Piotr Pawłow
2017-11-13 18:18           ` Chris Murphy

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.