* algorithm for half-md4 used in htree directories @ 2021-10-01 11:49 Avi Deitcher 2021-10-03 12:47 ` Avi Deitcher 0 siblings, 1 reply; 15+ messages in thread From: Avi Deitcher @ 2021-10-01 11:49 UTC (permalink / raw) To: linux-ext4 Hi, I have been trying to understand the algorithm used for the "half-md4" in htree-structured directories. Going through the code (and trying not to get into reverse engineering), it looks like it is part of md4 but not entirely? Yet any subset I take doesn't quite line up with what I see in an actual sample. What is the algorithm it is using to turn an entry of, e.g., "file125" into the appropriate hash. I did run a live sample, and try to get some form of correlation between the actual md4 hash (16 bytes) of the above to the actual entry (4 bytes) shown by debugfs, without much luck. What basic thing am I missing? Separately, how does the seed play into it? Thanks Avi ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: algorithm for half-md4 used in htree directories 2021-10-01 11:49 algorithm for half-md4 used in htree directories Avi Deitcher @ 2021-10-03 12:47 ` Avi Deitcher 2021-10-03 16:43 ` Andreas Dilger 0 siblings, 1 reply; 15+ messages in thread From: Avi Deitcher @ 2021-10-03 12:47 UTC (permalink / raw) To: linux-ext4 I can narrow down the question further. In my live sample, one of the entries in the tree is for a directory named "dir155". If I run "dx_hash dir155", I get: # debugfs -R "dx_hash dir155" /var/lib/file.img debugfs 1.46.2 (28-Feb-2021) Hash of dir155 is 0x16279534 (minor 0x0) If I look in the tree with "htree_dump", I get: # debugfs -R "htree_dump /testdir" /var/lib/file.img debugfs 1.46.2 (28-Feb-2021) .... Entry #0: Hash 0x00000000, block 1 Reading directory block 1, phys 6459 168 0x00d11d98-b9b6b16b (16) dir155 332 0x009edafe-77de7d72 (16) dir319 That hash for dir155 does not match what dx_hash gave. If I try to take the code from fs/ext4/hash.c and build a small program to calculate the hash, I get: $ ./md4 dir155 MD4: d90278a1 25182ac7 a02e56be c3f30f04 hash: 0x25182ac6 minor: 0xa02e56be Clearly that isn't what is in the tree. What basic am I missing? On Fri, Oct 1, 2021 at 2:49 PM Avi Deitcher <avi@deitcher.net> wrote: > > Hi, > > I have been trying to understand the algorithm used for the "half-md4" > in htree-structured directories. Going through the code (and trying > not to get into reverse engineering), it looks like it is part of md4 > but not entirely? Yet any subset I take doesn't quite line up with > what I see in an actual sample. > > What is the algorithm it is using to turn an entry of, e.g., "file125" > into the appropriate hash. I did run a live sample, and try to get > some form of correlation between the actual md4 hash (16 bytes) of the > above to the actual entry (4 bytes) shown by debugfs, without much > luck. > > What basic thing am I missing? > > Separately, how does the seed play into it? > > Thanks > Avi -- Avi Deitcher avi@deitcher.net Follow me http://twitter.com/avideitcher Read me http://blog.atomicinc.com ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: algorithm for half-md4 used in htree directories 2021-10-03 12:47 ` Avi Deitcher @ 2021-10-03 16:43 ` Andreas Dilger 2021-10-04 7:57 ` Avi Deitcher 0 siblings, 1 reply; 15+ messages in thread From: Andreas Dilger @ 2021-10-03 16:43 UTC (permalink / raw) To: Avi Deitcher; +Cc: linux-ext4 On Oct 3, 2021, at 06:47, Avi Deitcher <avi@deitcher.net> wrote: > > I can narrow down the question further. In my live sample, one of the > entries in the tree is for a directory named "dir155". > > If I run "dx_hash dir155", I get: > > # debugfs -R "dx_hash dir155" /var/lib/file.img > debugfs 1.46.2 (28-Feb-2021) > Hash of dir155 is 0x16279534 (minor 0x0) > > If I look in the tree with "htree_dump", I get: > > # debugfs -R "htree_dump /testdir" /var/lib/file.img > debugfs 1.46.2 (28-Feb-2021) > .... > Entry #0: Hash 0x00000000, block 1 > Reading directory block 1, phys 6459 > 168 0x00d11d98-b9b6b16b (16) dir155 332 0x009edafe-77de7d72 (16) dir319 > > That hash for dir155 does not match what dx_hash gave. If I try to > take the code from fs/ext4/hash.c and build a small program to > calculate the hash, I get: > > $ ./md4 dir155 > MD4: d90278a1 25182ac7 a02e56be c3f30f04 > hash: 0x25182ac6 > minor: 0xa02e56be > > Clearly that isn't what is in the tree. What basic am I missing? One important factor is that the directory hash has an initial seed to prevent pathological cases where the user can construct thousands of directory entries that have a hash collision. Looking at the code explains this in the comment for __ext4fs_dirhash(). The seed itself comes from sbi->s_hash_seed and is stored in the per-directory hinfo.seed to be used when counting the filename hash. In theory there could be a per-directory hash, but it appears to be a constant for the whole filesystem. Cheers, Andreas > >> On Fri, Oct 1, 2021 at 2:49 PM Avi Deitcher <avi@deitcher.net> wrote: >> >> Hi, >> >> I have been trying to understand the algorithm used for the "half-md4" >> in htree-structured directories. Going through the code (and trying >> not to get into reverse engineering), it looks like it is part of md4 >> but not entirely? Yet any subset I take doesn't quite line up with >> what I see in an actual sample. >> >> What is the algorithm it is using to turn an entry of, e.g., "file125" >> into the appropriate hash. I did run a live sample, and try to get >> some form of correlation between the actual md4 hash (16 bytes) of the >> above to the actual entry (4 bytes) shown by debugfs, without much >> luck. >> >> What basic thing am I missing? >> >> Separately, how does the seed play into it? >> >> Thanks >> Avi > > > > -- > Avi Deitcher > avi@deitcher.net > Follow me http://twitter.com/avideitcher > Read me http://blog.atomicinc.com ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: algorithm for half-md4 used in htree directories 2021-10-03 16:43 ` Andreas Dilger @ 2021-10-04 7:57 ` Avi Deitcher 2021-10-11 15:30 ` Avi Deitcher 0 siblings, 1 reply; 15+ messages in thread From: Avi Deitcher @ 2021-10-04 7:57 UTC (permalink / raw) To: Andreas Dilger; +Cc: linux-ext4 Hi Andreas, I had looked in __ext4fs_dirhash(). Yes, it does reference the seed - and create a default if none is there at the filesystem level - but it doesn't appear to use it, in that function. hinfo is populated in the function - hash, minor-hash, seed - but it never uses the seed to manipulate the hash. Are you saying that it is at a higher level? i.e. __ext4fs_dirhash() is the *first* step, and there is further processing to get the actual hash? I did walk up the stack, but couldn't figure out. Thanks for stepping in Avi On Sun, Oct 3, 2021 at 7:43 PM Andreas Dilger <adilger@dilger.ca> wrote: > > On Oct 3, 2021, at 06:47, Avi Deitcher <avi@deitcher.net> wrote: > > > > I can narrow down the question further. In my live sample, one of the > > entries in the tree is for a directory named "dir155". > > > > If I run "dx_hash dir155", I get: > > > > # debugfs -R "dx_hash dir155" /var/lib/file.img > > debugfs 1.46.2 (28-Feb-2021) > > Hash of dir155 is 0x16279534 (minor 0x0) > > > > If I look in the tree with "htree_dump", I get: > > > > # debugfs -R "htree_dump /testdir" /var/lib/file.img > > debugfs 1.46.2 (28-Feb-2021) > > .... > > Entry #0: Hash 0x00000000, block 1 > > Reading directory block 1, phys 6459 > > 168 0x00d11d98-b9b6b16b (16) dir155 332 0x009edafe-77de7d72 (16) dir319 > > > > That hash for dir155 does not match what dx_hash gave. If I try to > > take the code from fs/ext4/hash.c and build a small program to > > calculate the hash, I get: > > > > $ ./md4 dir155 > > MD4: d90278a1 25182ac7 a02e56be c3f30f04 > > hash: 0x25182ac6 > > minor: 0xa02e56be > > > > Clearly that isn't what is in the tree. What basic am I missing? > > One important factor is that the directory hash has an initial seed > to prevent pathological cases where the user can construct thousands > of directory entries that have a hash collision. > > Looking at the code explains this in the comment for __ext4fs_dirhash(). > The seed itself comes from sbi->s_hash_seed and is stored in the > per-directory hinfo.seed to be used when counting the filename hash. > In theory there could be a per-directory hash, but it appears to be a > constant for the whole filesystem. > > Cheers, Andreas > > > > >> On Fri, Oct 1, 2021 at 2:49 PM Avi Deitcher <avi@deitcher.net> wrote: > >> > >> Hi, > >> > >> I have been trying to understand the algorithm used for the "half-md4" > >> in htree-structured directories. Going through the code (and trying > >> not to get into reverse engineering), it looks like it is part of md4 > >> but not entirely? Yet any subset I take doesn't quite line up with > >> what I see in an actual sample. > >> > >> What is the algorithm it is using to turn an entry of, e.g., "file125" > >> into the appropriate hash. I did run a live sample, and try to get > >> some form of correlation between the actual md4 hash (16 bytes) of the > >> above to the actual entry (4 bytes) shown by debugfs, without much > >> luck. > >> > >> What basic thing am I missing? > >> > >> Separately, how does the seed play into it? > >> > >> Thanks > >> Avi > > > > > > > > -- > > Avi Deitcher > > avi@deitcher.net > > Follow me http://twitter.com/avideitcher > > Read me http://blog.atomicinc.com -- Avi Deitcher avi@deitcher.net Follow me http://twitter.com/avideitcher Read me http://blog.atomicinc.com ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: algorithm for half-md4 used in htree directories 2021-10-04 7:57 ` Avi Deitcher @ 2021-10-11 15:30 ` Avi Deitcher 2021-10-11 20:20 ` Theodore Ts'o 0 siblings, 1 reply; 15+ messages in thread From: Avi Deitcher @ 2021-10-11 15:30 UTC (permalink / raw) To: linux-ext4 Does someone know how this is constructed and used? On Mon, Oct 4, 2021 at 12:57 AM Avi Deitcher <avi@deitcher.net> wrote: > > Hi Andreas, > > I had looked in __ext4fs_dirhash(). Yes, it does reference the seed - > and create a default if none is there at the filesystem level - but it > doesn't appear to use it, in that function. hinfo is populated in the > function - hash, minor-hash, seed - but it never uses the seed to > manipulate the hash. > > Are you saying that it is at a higher level? i.e. __ext4fs_dirhash() > is the *first* step, and there is further processing to get the actual > hash? I did walk up the stack, but couldn't figure out. > > Thanks for stepping in > Avi > > On Sun, Oct 3, 2021 at 7:43 PM Andreas Dilger <adilger@dilger.ca> wrote: > > > > On Oct 3, 2021, at 06:47, Avi Deitcher <avi@deitcher.net> wrote: > > > > > > I can narrow down the question further. In my live sample, one of the > > > entries in the tree is for a directory named "dir155". > > > > > > If I run "dx_hash dir155", I get: > > > > > > # debugfs -R "dx_hash dir155" /var/lib/file.img > > > debugfs 1.46.2 (28-Feb-2021) > > > Hash of dir155 is 0x16279534 (minor 0x0) > > > > > > If I look in the tree with "htree_dump", I get: > > > > > > # debugfs -R "htree_dump /testdir" /var/lib/file.img > > > debugfs 1.46.2 (28-Feb-2021) > > > .... > > > Entry #0: Hash 0x00000000, block 1 > > > Reading directory block 1, phys 6459 > > > 168 0x00d11d98-b9b6b16b (16) dir155 332 0x009edafe-77de7d72 (16) dir319 > > > > > > That hash for dir155 does not match what dx_hash gave. If I try to > > > take the code from fs/ext4/hash.c and build a small program to > > > calculate the hash, I get: > > > > > > $ ./md4 dir155 > > > MD4: d90278a1 25182ac7 a02e56be c3f30f04 > > > hash: 0x25182ac6 > > > minor: 0xa02e56be > > > > > > Clearly that isn't what is in the tree. What basic am I missing? > > > > One important factor is that the directory hash has an initial seed > > to prevent pathological cases where the user can construct thousands > > of directory entries that have a hash collision. > > > > Looking at the code explains this in the comment for __ext4fs_dirhash(). > > The seed itself comes from sbi->s_hash_seed and is stored in the > > per-directory hinfo.seed to be used when counting the filename hash. > > In theory there could be a per-directory hash, but it appears to be a > > constant for the whole filesystem. > > > > Cheers, Andreas > > > > > > > >> On Fri, Oct 1, 2021 at 2:49 PM Avi Deitcher <avi@deitcher.net> wrote: > > >> > > >> Hi, > > >> > > >> I have been trying to understand the algorithm used for the "half-md4" > > >> in htree-structured directories. Going through the code (and trying > > >> not to get into reverse engineering), it looks like it is part of md4 > > >> but not entirely? Yet any subset I take doesn't quite line up with > > >> what I see in an actual sample. > > >> > > >> What is the algorithm it is using to turn an entry of, e.g., "file125" > > >> into the appropriate hash. I did run a live sample, and try to get > > >> some form of correlation between the actual md4 hash (16 bytes) of the > > >> above to the actual entry (4 bytes) shown by debugfs, without much > > >> luck. > > >> > > >> What basic thing am I missing? > > >> > > >> Separately, how does the seed play into it? > > >> > > >> Thanks > > >> Avi > > > > > > > > > > > > -- > > > Avi Deitcher > > > avi@deitcher.net > > > Follow me http://twitter.com/avideitcher > > > Read me http://blog.atomicinc.com > > > > -- > Avi Deitcher > avi@deitcher.net > Follow me http://twitter.com/avideitcher > Read me http://blog.atomicinc.com -- Avi Deitcher avi@deitcher.net Follow me http://twitter.com/avideitcher Read me http://blog.atomicinc.com ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: algorithm for half-md4 used in htree directories 2021-10-11 15:30 ` Avi Deitcher @ 2021-10-11 20:20 ` Theodore Ts'o 2021-10-12 2:58 ` Avi Deitcher 0 siblings, 1 reply; 15+ messages in thread From: Theodore Ts'o @ 2021-10-11 20:20 UTC (permalink / raw) To: Avi Deitcher; +Cc: linux-ext4 On Mon, Oct 11, 2021 at 08:30:36AM -0700, Avi Deitcher wrote: > Does someone know how this is constructed and used? > > On Mon, Oct 4, 2021 at 12:57 AM Avi Deitcher <avi@deitcher.net> wrote: > > > > Hi Andreas, > > > > I had looked in __ext4fs_dirhash(). Yes, it does reference the seed - > > and create a default if none is there at the filesystem level - but it > > doesn't appear to use it, in that function. hinfo is populated in the > > function - hash, minor-hash, seed - but it never uses the seed to > > manipulate the hash. The seed is used to initialize the buf array, so long as the seed is not all zero's. If it is all zeros, then the default seed is used instead (right above this bit of code: if (hinfo->seed) { for (i = 0; i < 4; i++) { if (hinfo->seed[i]) { memcpy(buf, hinfo->seed, sizeof(buf)); break; } } } The legacy hash doesn't use the seed, yes. But for the other hash types (hash_version), they mix the filename (in different ways depending on the hash type. For example, for half md4: case DX_HASH_HALF_MD4: p = name; while (len > 0) { (*str2hashbuf)(p, len, in, 8); half_md4_transform(buf, in); ^^^ len -= 32; p += 32; } minor_hash = buf[2]; hash = buf[1]; break; When the hash seed is different, that means the initial state of the buf array will different, and this influences the resulting hash. Cheers, - Ted ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: algorithm for half-md4 used in htree directories 2021-10-11 20:20 ` Theodore Ts'o @ 2021-10-12 2:58 ` Avi Deitcher 2021-10-12 17:30 ` Theodore Ts'o 0 siblings, 1 reply; 15+ messages in thread From: Avi Deitcher @ 2021-10-12 2:58 UTC (permalink / raw) To: Theodore Ts'o; +Cc: linux-ext4 Aha. I missed that the seed is injected into buf before passing it into half_md4_transform. I was looking at it as just the empty buffer before the first iteration of the loop (or, in my case, since I was testing with a 6 char filename, the only iteration). I will repeat my experiment with that and see if I can tease it out. Thanks, Ted! On Mon, Oct 11, 2021 at 1:20 PM Theodore Ts'o <tytso@mit.edu> wrote: > > On Mon, Oct 11, 2021 at 08:30:36AM -0700, Avi Deitcher wrote: > > Does someone know how this is constructed and used? > > > > On Mon, Oct 4, 2021 at 12:57 AM Avi Deitcher <avi@deitcher.net> wrote: > > > > > > Hi Andreas, > > > > > > I had looked in __ext4fs_dirhash(). Yes, it does reference the seed - > > > and create a default if none is there at the filesystem level - but it > > > doesn't appear to use it, in that function. hinfo is populated in the > > > function - hash, minor-hash, seed - but it never uses the seed to > > > manipulate the hash. > > The seed is used to initialize the buf array, so long as the seed is > not all zero's. If it is all zeros, then the default seed is used > instead (right above this bit of code: > > if (hinfo->seed) { > for (i = 0; i < 4; i++) { > if (hinfo->seed[i]) { > memcpy(buf, hinfo->seed, sizeof(buf)); > break; > } > } > } > > The legacy hash doesn't use the seed, yes. But for the other hash > types (hash_version), they mix the filename (in different ways > depending on the hash type. For example, for half md4: > > case DX_HASH_HALF_MD4: > p = name; > while (len > 0) { > (*str2hashbuf)(p, len, in, 8); > half_md4_transform(buf, in); > ^^^ > len -= 32; > p += 32; > } > minor_hash = buf[2]; > hash = buf[1]; > break; > > When the hash seed is different, that means the initial state of the > buf array will different, and this influences the resulting hash. > > Cheers, > > - Ted -- Avi Deitcher avi@deitcher.net Follow me http://twitter.com/avideitcher Read me http://blog.atomicinc.com ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: algorithm for half-md4 used in htree directories 2021-10-12 2:58 ` Avi Deitcher @ 2021-10-12 17:30 ` Theodore Ts'o 2021-10-13 2:20 ` Avi Deitcher 2021-10-15 18:43 ` Avi Deitcher 0 siblings, 2 replies; 15+ messages in thread From: Theodore Ts'o @ 2021-10-12 17:30 UTC (permalink / raw) To: Avi Deitcher; +Cc: linux-ext4 On Mon, Oct 11, 2021 at 07:58:00PM -0700, Avi Deitcher wrote: > Aha. I missed that the seed is injected into buf before passing it > into half_md4_transform. I was looking at it as just the empty buffer > before the first iteration of the loop (or, in my case, since I was > testing with a 6 char filename, the only iteration). > > I will repeat my experiment with that and see if I can tease it out. BTW, if you are looking for a userspace implementation of the hash, it's available in libext2fs in e2fsprogs. Cheers, - Ted ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: algorithm for half-md4 used in htree directories 2021-10-12 17:30 ` Theodore Ts'o @ 2021-10-13 2:20 ` Avi Deitcher 2021-10-15 18:43 ` Avi Deitcher 1 sibling, 0 replies; 15+ messages in thread From: Avi Deitcher @ 2021-10-13 2:20 UTC (permalink / raw) To: Theodore Ts'o; +Cc: linux-ext4 Yep, right here https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git/tree/lib/ext2fs/dirhash.c Hey, that's your name on it, Ted! I am close, must be messing it up somehow. I literally copied the majority of that (actually, some slight variant, but basically the same) into a standalone main.c, removed any library dependencies by copying those in so I can get a standalone, and I still am not quite getting it. Maybe I am messing up the seed? dump2fs of the superblock shows: Default directory hash: half_md4 Directory Hash Seed: d64563bc-ea93-4aaf-a943-4657711ed153 and debugfs of the hash tree shows: Root node dump: Reserved zero: 0 Hash Version: 1 Info length: 8 Indirect levels: 1 Flags: 0 Number of entries (count): 2 Number of entries (limit): 123 Checksum: 0x49f0afdc Entry #0: Hash 0x00000000, block 124 Entry #1: Hash 0x78b6e3b8, block 128 Entry #0: Hash 0x00000000, block 124 Number of entries (count): 113 Number of entries (limit): 126 Checksum: 0x78407270 Entry #0: Hash 0x00000000, block 1 Entry #1: Hash 0x00f48688, block 193 ... So it has the hash version correct (I also gdb-ed through my little program. Maybe I am getting the u32 order wrong in the seed? Or the endianness? I should just create a gist with this, shouldn't I? On Tue, Oct 12, 2021 at 10:30 AM Theodore Ts'o <tytso@mit.edu> wrote: > > On Mon, Oct 11, 2021 at 07:58:00PM -0700, Avi Deitcher wrote: > > Aha. I missed that the seed is injected into buf before passing it > > into half_md4_transform. I was looking at it as just the empty buffer > > before the first iteration of the loop (or, in my case, since I was > > testing with a 6 char filename, the only iteration). > > > > I will repeat my experiment with that and see if I can tease it out. > > BTW, if you are looking for a userspace implementation of the hash, > it's available in libext2fs in e2fsprogs. > > Cheers, > > - Ted -- Avi Deitcher avi@deitcher.net Follow me http://twitter.com/avideitcher Read me http://blog.atomicinc.com ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: algorithm for half-md4 used in htree directories 2021-10-12 17:30 ` Theodore Ts'o 2021-10-13 2:20 ` Avi Deitcher @ 2021-10-15 18:43 ` Avi Deitcher 2021-10-15 19:10 ` Theodore Ts'o 1 sibling, 1 reply; 15+ messages in thread From: Avi Deitcher @ 2021-10-15 18:43 UTC (permalink / raw) To: Theodore Ts'o; +Cc: linux-ext4 I am absolutely stumped. I tried the seed as four u32 as is on disk (i.e. big-endian); four u32 little-endian; one long little-endian array of bytes (I have no idea why that would make sense, but worth trying); zeroed out so it gets the default. No one gives a consistent solution. As far as I can tell: hash tells you which intermediate block to look in, minor hash tells you which leaf block to look in, and then you scan. So it is pretty easy to see in what range the minor and major hash should be, but no luck. I put up a gist with debugfs and source and output. https://gist.github.com/deitch/53b01a90635449e7674babfe7e7dd002 Anyone who feels like a look-see, I would much appreciate it (and if they figure it out, owe a beer if ever in the same city). On Tue, Oct 12, 2021 at 10:30 AM Theodore Ts'o <tytso@mit.edu> wrote: > > On Mon, Oct 11, 2021 at 07:58:00PM -0700, Avi Deitcher wrote: > > Aha. I missed that the seed is injected into buf before passing it > > into half_md4_transform. I was looking at it as just the empty buffer > > before the first iteration of the loop (or, in my case, since I was > > testing with a 6 char filename, the only iteration). > > > > I will repeat my experiment with that and see if I can tease it out. > > BTW, if you are looking for a userspace implementation of the hash, > it's available in libext2fs in e2fsprogs. > > Cheers, > > - Ted -- Avi Deitcher avi@deitcher.net Follow me http://twitter.com/avideitcher Read me http://blog.atomicinc.com ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: algorithm for half-md4 used in htree directories 2021-10-15 18:43 ` Avi Deitcher @ 2021-10-15 19:10 ` Theodore Ts'o 2021-10-15 19:43 ` Avi Deitcher 2021-10-15 19:50 ` Theodore Ts'o 0 siblings, 2 replies; 15+ messages in thread From: Theodore Ts'o @ 2021-10-15 19:10 UTC (permalink / raw) To: Avi Deitcher; +Cc: linux-ext4 On Fri, Oct 15, 2021 at 11:43:07AM -0700, Avi Deitcher wrote: > I am absolutely stumped. I tried the seed as four u32 as is on disk > (i.e. big-endian); four u32 little-endian; one long little-endian > array of bytes (I have no idea why that would make sense, but worth > trying); zeroed out so it gets the default. No one gives a consistent > solution. > > As far as I can tell: hash tells you which intermediate block to look > in, minor hash tells you which leaf block to look in, and then you > scan. So it is pretty easy to see in what range the minor and major > hash should be, but no luck. > > I put up a gist with debugfs and source and output. > https://gist.github.com/deitch/53b01a90635449e7674babfe7e7dd002 > > Anyone who feels like a look-see, I would much appreciate it (and if > they figure it out, owe a beer if ever in the same city). I'm really curious *why* you are trying to reverse engineer the implementation. What are you trying to do? In any case, you're mostly right about what hash and minor_hash are for. The 32-bit hash value is only thing that we use in the hashed B+ tree which is used for hash directories. The 32-bit minor hash is used to form a 64-bit number that gets used when we need to support things like NFSv3 directory cursors, and POSIX telldir/seekdir (which is a massive headache for modern file system, since it assumes that a 64-bit "offset" is all you get to reliably provide the POSIX telldir/seekdir/readdir guarantees even when the directory is getting large number of directory entries added and deleted without limit between the telldir and the seekdir call). As far as what you are doing wrong, I'll point out that *this* program (attached below) provides the correct result. Running this through a debugger and comparing it with your implrementation is left as an exercise for the reader --- but why do you want to try to make your own implementation, when you could just pull down e2fsprogs, compile it, and then *use* the provided ext2_fs library for whatever the heck you are trying to do? - Ted #include <stdio.h> #include <stdlib.h> #include <et/com_err.h> #include <uuid/uuid.h> #include <ext2fs/ext2_fs.h> #include <ext2fs/ext2fs.h> int main(int argc, char **argv) { uuid_t buf; unsigned int *p; int i; ext2_dirhash_t hash, minor_hash; errcode_t retval; uuid_parse("d64563bc-ea93-4aaf-a943-4657711ed153", buf); p = (unsigned int *) buf; for (i=0; i < 4; i++) { printf("buf[%d] = 0x%08x\n", i, p[i]); } retval = ext2fs_dirhash(1, "dir478", strlen("dir478"), p, &hash, &minor_hash); printf("dirhash results: retval=%u, hash=0x%08x, minor_hash=0x%08x\n", i, hash, minor_hash); exit(0); } % gcc -g -o /tmp/foo /tmp/foo.c -luuid -lext2fs -lcom_err % /tmp/foo buf[0] = 0xbc6345d6 buf[1] = 0xaf4a93ea buf[2] = 0x574643a9 buf[3] = 0x53d11e71 dirhash results: retval=4, hash=0x012225e2, minor_hash=0x3f08755d ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: algorithm for half-md4 used in htree directories 2021-10-15 19:10 ` Theodore Ts'o @ 2021-10-15 19:43 ` Avi Deitcher 2021-10-15 20:30 ` Theodore Ts'o 2021-10-15 19:50 ` Theodore Ts'o 1 sibling, 1 reply; 15+ messages in thread From: Avi Deitcher @ 2021-10-15 19:43 UTC (permalink / raw) To: Theodore Ts'o; +Cc: linux-ext4 Thanks, Ted, I will try yours and step through it to figure out what is off. You ask a fair question: other than madness, why would someone want to recreate the exact algorithm? I have had a number of cases where I have needed to manipulate disks, filesystems, partition tables, etc. from within a non-C-source program. The options have been: fork/exec to some external program; run a VM where I can mount the fs and manipulate content as needed. Those both work, but have had issues in various environments. I made the mistake of saying, "well, all of this is just manipulating bytes on a disk image or block device; how hard could it be?" :-) So understanding the algorithm actually becomes important. I probably could link the library in if I am working on languages that support it (go, rust), but not all do, and there are reasons that is more difficult for the target use case. I was long hoping to finish with go and move onto Rust by now, then possibly some others, but this is not what I get paid for, so catch as catch can. Does that answer? Feel free to say, "madness" too; I won't disagree. Avi On Fri, Oct 15, 2021 at 12:10 PM Theodore Ts'o <tytso@mit.edu> wrote: > > On Fri, Oct 15, 2021 at 11:43:07AM -0700, Avi Deitcher wrote: > > I am absolutely stumped. I tried the seed as four u32 as is on disk > > (i.e. big-endian); four u32 little-endian; one long little-endian > > array of bytes (I have no idea why that would make sense, but worth > > trying); zeroed out so it gets the default. No one gives a consistent > > solution. > > > > As far as I can tell: hash tells you which intermediate block to look > > in, minor hash tells you which leaf block to look in, and then you > > scan. So it is pretty easy to see in what range the minor and major > > hash should be, but no luck. > > > > I put up a gist with debugfs and source and output. > > https://gist.github.com/deitch/53b01a90635449e7674babfe7e7dd002 > > > > Anyone who feels like a look-see, I would much appreciate it (and if > > they figure it out, owe a beer if ever in the same city). > > I'm really curious *why* you are trying to reverse engineer the > implementation. What are you trying to do? > > In any case, you're mostly right about what hash and minor_hash are > for. The 32-bit hash value is only thing that we use in the hashed B+ > tree which is used for hash directories. The 32-bit minor hash is > used to form a 64-bit number that gets used when we need to support > things like NFSv3 directory cursors, and POSIX telldir/seekdir (which > is a massive headache for modern file system, since it assumes that a > 64-bit "offset" is all you get to reliably provide the POSIX > telldir/seekdir/readdir guarantees even when the directory is getting > large number of directory entries added and deleted without limit > between the telldir and the seekdir call). > > As far as what you are doing wrong, I'll point out that *this* program > (attached below) provides the correct result. Running this through a > debugger and comparing it with your implrementation is left as an > exercise for the reader --- but why do you want to try to make your > own implementation, when you could just pull down e2fsprogs, compile > it, and then *use* the provided ext2_fs library for whatever the heck > you are trying to do? > > - Ted > > #include <stdio.h> > #include <stdlib.h> > > #include <et/com_err.h> > #include <uuid/uuid.h> > #include <ext2fs/ext2_fs.h> > #include <ext2fs/ext2fs.h> > > int main(int argc, char **argv) > { > uuid_t buf; > unsigned int *p; > int i; > ext2_dirhash_t hash, minor_hash; > errcode_t retval; > > uuid_parse("d64563bc-ea93-4aaf-a943-4657711ed153", buf); > p = (unsigned int *) buf; > for (i=0; i < 4; i++) { > printf("buf[%d] = 0x%08x\n", i, p[i]); > } > > retval = ext2fs_dirhash(1, "dir478", strlen("dir478"), p, > &hash, &minor_hash); > printf("dirhash results: retval=%u, hash=0x%08x, minor_hash=0x%08x\n", > i, hash, minor_hash); > > exit(0); > } > > % gcc -g -o /tmp/foo /tmp/foo.c -luuid -lext2fs -lcom_err > % /tmp/foo > buf[0] = 0xbc6345d6 > buf[1] = 0xaf4a93ea > buf[2] = 0x574643a9 > buf[3] = 0x53d11e71 > dirhash results: retval=4, hash=0x012225e2, minor_hash=0x3f08755d -- Avi Deitcher avi@deitcher.net Follow me http://twitter.com/avideitcher Read me http://blog.atomicinc.com ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: algorithm for half-md4 used in htree directories 2021-10-15 19:43 ` Avi Deitcher @ 2021-10-15 20:30 ` Theodore Ts'o 0 siblings, 0 replies; 15+ messages in thread From: Theodore Ts'o @ 2021-10-15 20:30 UTC (permalink / raw) To: Avi Deitcher; +Cc: linux-ext4 On Fri, Oct 15, 2021 at 12:43:33PM -0700, Avi Deitcher wrote: > Thanks, Ted, I will try yours and step through it to figure out what is off. > > You ask a fair question: other than madness, why would someone want to > recreate the exact algorithm? > > I have had a number of cases where I have needed to manipulate disks, > filesystems, partition tables, etc. from within a non-C-source > program. The options have been: fork/exec to some external program; > run a VM where I can mount the fs and manipulate content as needed. > Those both work, but have had issues in various environments. > > I made the mistake of saying, "well, all of this is just manipulating > bytes on a disk image or block device; how hard could it be?" :-) > So understanding the algorithm actually becomes important. I think once you take a look at all of the "byte manipulation" that is needed for any kind of non-trivial file system operation, you're probably better off trying to figure out how to link the library in. > I probably could link the library in if I am working on languages that > support it (go, rust), but not all do, and there are reasons that is > more difficult for the target use case. Have you looked at SWIG? SWIG suppotrs a *lot* of lanaguages, including Tcl, Python, Perl, Guile, Java, Ruby, Mzscheme, PHP, OCaml, C#, Lua, R, Octave, Go, D, Javascript, Scilab, etc. If you end up writing the equivalent of libext2fs for one language, it's really not going to help you all that much for another language. I also note you've not really been specific about "the target use case". Is that something you'd be willing to say more about? In any case, if you're interested in implementing SWIG bindings for libext2fs, that is certainly something we could discuss integrating into e2fsprogs, so that other people could also benefit from that work. Let me know if you're interested! - Ted ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: algorithm for half-md4 used in htree directories 2021-10-15 19:10 ` Theodore Ts'o 2021-10-15 19:43 ` Avi Deitcher @ 2021-10-15 19:50 ` Theodore Ts'o 2021-10-18 16:56 ` Avi Deitcher 1 sibling, 1 reply; 15+ messages in thread From: Theodore Ts'o @ 2021-10-15 19:50 UTC (permalink / raw) To: Avi Deitcher; +Cc: linux-ext4 Oh, and taking a quick look at your program, here's at least one of the bugs: static void calculate(char *name) { ^^^^^^^^^^ ... __ext4fs_dirhash(name, sizeof(name), &hinfo); ^^^^^^^^^^^^ With apologies to the movie "The Princess Bride"[1]: You fell victim to one of the classic blunders! The most famous is to never get involved in a land war in Asia, but only slightly less well-known is this: 'taking the size of a C pointer is generally not what you had wanted to do'! :-) [1] https://www.youtube.com/watch?v=R7TFPQqglb4 - Ted ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: algorithm for half-md4 used in htree directories 2021-10-15 19:50 ` Theodore Ts'o @ 2021-10-18 16:56 ` Avi Deitcher 0 siblings, 0 replies; 15+ messages in thread From: Avi Deitcher @ 2021-10-18 16:56 UTC (permalink / raw) To: Theodore Ts'o; +Cc: linux-ext4 Yes, it definitely was my silly use of sizeof() instead of strlen(). Switch to strlen(), and my test program's output (using little-endian for each u32) gives the exact same output. Looks like I owe you that beer (and happy to share it with you) next time we are in the same place! On Fri, Oct 15, 2021 at 10:50 PM Theodore Ts'o <tytso@mit.edu> wrote: > > Oh, and taking a quick look at your program, here's at least one of > the bugs: > > static void calculate(char *name) { > ^^^^^^^^^^ > ... > __ext4fs_dirhash(name, sizeof(name), &hinfo); > ^^^^^^^^^^^^ > > With apologies to the movie "The Princess Bride"[1]: > > You fell victim to one of the classic blunders! The most famous > is to never get involved in a land war in Asia, but only slightly > less well-known is this: 'taking the size of a C pointer is > generally not what you had wanted to do'! :-) > > [1] https://www.youtube.com/watch?v=R7TFPQqglb4 > > - Ted -- Avi Deitcher avi@deitcher.net Follow me http://twitter.com/avideitcher Read me http://blog.atomicinc.com ^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2021-10-18 16:56 UTC | newest] Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-10-01 11:49 algorithm for half-md4 used in htree directories Avi Deitcher 2021-10-03 12:47 ` Avi Deitcher 2021-10-03 16:43 ` Andreas Dilger 2021-10-04 7:57 ` Avi Deitcher 2021-10-11 15:30 ` Avi Deitcher 2021-10-11 20:20 ` Theodore Ts'o 2021-10-12 2:58 ` Avi Deitcher 2021-10-12 17:30 ` Theodore Ts'o 2021-10-13 2:20 ` Avi Deitcher 2021-10-15 18:43 ` Avi Deitcher 2021-10-15 19:10 ` Theodore Ts'o 2021-10-15 19:43 ` Avi Deitcher 2021-10-15 20:30 ` Theodore Ts'o 2021-10-15 19:50 ` Theodore Ts'o 2021-10-18 16:56 ` Avi Deitcher
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.