On Dec 27, 2018, at 10:41 AM, Peter Maydell wrote:
>
> On Thu, 27 Dec 2018 at 17:19, Florian Weimer wrote:
>> We have a bit of an interesting problem with respect to the d_off
>> field in struct dirent.
>>
>> When running a 64-bit kernel on certain file systems, notably ext4,
>> this field uses the full 63 bits even for small directories (strace -v
>> output, wrapped here for readability):
>>
>> getdents(3, [
>>   {d_ino=1494304, d_off=3901177228673045825, d_reclen=40, d_name="authorized_keys", d_type=DT_REG},
>>   {d_ino=1494277, d_off=7491915799041650922, d_reclen=24, d_name=".", d_type=DT_DIR},
>>   {d_ino=1314655, d_off=9223372036854775807, d_reclen=24, d_name="..", d_type=DT_DIR}
>> ], 32768) = 88
>>
>> When running in 32-bit compat mode, this value is somehow truncated to
>> 31 bits, for both the getdents and the getdents64 (!) system call (at
>> least on i386).
>
> Yes -- look for hash2pos() and friends in fs/ext4/dir.c.
> The ext4 code in the kernel uses a 32 bit hash if (a) the kernel
> is 32 bit (b) this is a compat syscall (c) some other bit of
> the kernel asked it to via the FMODE_32BITHASH flag (currently only
> NFS does that I think).
>
> As you note, this causes breakage for userspace programs which
> need to implement an API/ABI with 32-bit offset but which only
> have access to the kernel's 64-bit offset API/ABI.

This is (IMHO) a bit of an oxymoron, isn't it?  Applications using the
64-bit API, but storing the value in a 32-bit field?  The same problem
would exist for filesystems with 64-bit inodes or 64-bit file offsets
trying to store these values in 32-bit variables.  It might work most
of the time, but it can also break randomly.

> I think the best fix for this would be for the kernel to either
> (a) consistently use a 32-bit hash or (b) to provide an API
> so that userspace can use the FMODE_32BITHASH flag the way
> that kernel-internal users already can.

It would be relatively straightforward to add a "32bitapi" mount option
to return a 32-bit directory hash to userspace for operations on that
mountpoint (ext4 doesn't have 64-bit inode numbers yet).  However, I
can't think of an easy way to do this on a per-process basis without
just having it call the 32-bit API directly.

> I couldn't think of or find any existing way for userspace
> to get the right results here, which is why
> 32-bit-guest-on-64-bit-host QEMU doesn't work on these filesystems
> (depending on what exactly the guest's libc etc do).
>
>> the 32-bit getdents system call emulation in a 64-bit qemu-user
>> process would just silently truncate the d_off field as part of
>> the translation, not reporting an error.
>> [...]
>> This truncation has always been a bug; it breaks telldir/seekdir
>> at least in some cases.
>
> Yes; you can't fit a quart into a pint pot, so if the guest
> only handles 32-bit offsets then truncation is about all we
> can do. This works fine if offsets are offsets, assuming the
> directory isn't so enormous it would have broken the guest
> anyway. I'm not aware of any issues with this other than the
> oddball ext4 offsets-are-hashes situation -- could you expand
> on the telldir/seekdir issue? (I suppose we should probably
> make QEMU's syscall emulation layer return "no more entries"
> rather than entries with truncated hashes.)

For ext4 at least, you could just shift the high 32-bit part of the
64-bit hash down into a 32-bit value in telldir(), and shift it back
up when seekdir() is called.

Cheers, Andreas