linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Andreas Dilger <adilger@dilger.ca>
To: Peter Maydell <peter.maydell@linaro.org>
Cc: Florian Weimer <fw@deneb.enyo.de>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Linux API <linux-api@vger.kernel.org>,
	Ext4 Developers List <linux-ext4@vger.kernel.org>,
	lucho@ionkov.net, libc-alpha@sourceware.org,
	Arnd Bergmann <arnd@arndb.de>,
	ericvh@gmail.com, hpa@zytor.com,
	lkml - Kernel Mailing List <linux-kernel@vger.kernel.org>,
	QEMU Developers <qemu-devel@nongnu.org>,
	rminnich@sandia.gov, v9fs-developer@lists.sourceforge.net
Subject: Re: [Qemu-devel] d_off field in struct dirent and 32-on-64 emulation
Date: Thu, 27 Dec 2018 17:23:28 -0700	[thread overview]
Message-ID: <9C6A7D45-CF53-4C61-B5DD-12CA0D419972@dilger.ca> (raw)
In-Reply-To: <CAFEAcA92m4vhzjJ+B=mP_o6Wfhx1XSKo3uWxah3osh=u5UXFuw@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 3621 bytes --]

On Dec 27, 2018, at 10:41 AM, Peter Maydell <peter.maydell@linaro.org> wrote:
> 
> On Thu, 27 Dec 2018 at 17:19, Florian Weimer <fw@deneb.enyo.de> wrote:
>> We have a bit of an interesting problem with respect to the d_off
>> field in struct dirent.
>> 
>> When running a 64-bit kernel on certain file systems, notably ext4,
>> this field uses the full 63 bits even for small directories (strace -v
>> output, wrapped here for readability):
>> 
>> getdents(3, [
>>  {d_ino=1494304, d_off=3901177228673045825, d_reclen=40, d_name="authorized_keys", d_type=DT_REG},
>>  {d_ino=1494277, d_off=7491915799041650922, d_reclen=24, d_name=".", d_type=DT_DIR},
>>  {d_ino=1314655, d_off=9223372036854775807, d_reclen=24, d_name="..", d_type=DT_DIR}
>> ], 32768) = 88
>> 
>> When running in 32-bit compat mode, this value is somehow truncated to
>> 31 bits, for both the getdents and the getdents64 (!) system call (at
>> least on i386).
> 
> Yes -- look for hash2pos() and friends in fs/ext4/dir.c.
> The ext4 code in the kernel uses a 32 bit hash if (a) the kernel
> is 32 bit (b) this is a compat syscall (b) some other bit of
> the kernel asked it to via the FMODE_32BITHASH flag (currently only
> NFS does that I think).
> 
> As you note, this causes breakage for userspace programs which
> need to implement an API/ABI with 32-bit offset but which only
> have access to the kernel's 64-bit offset API/ABI.

This is (IMHO) a bit of an oxymoron, isn't it?  Applications using
the 64-bit API, but storing the value in a 32-bit field?  The same
problem would exist for filesystems with 64-bit inodes or 64-bit
file offsets trying to store these values in 32-bit variables.
It might work most of the time, but it can also break randomly.

> I think the best fix for this would be for the kernel to either
> (a) consistently use a 32-bit hash or (b) to provide an API
> so that userspace can use the FMODE_32BITHASH flag the way
> that kernel-internal users already can.

It would be relatively straight forward to add a "32bitapi" mount
option to return a 32-bit directory hash to userspace for operations
on that mountpoint (ext4 doesn't have 64-bit inode numbers yet).
However, I can't think of an easy way to do this on a per-process
basis without just having it call the 32-bit API directly.

> I couldn't think of or find any existing way for userspace
> to get the right results here, which is why
> 32-bit-guest-on-64-bit-host QEMU doesn't work on these filesystems
> (depending on what exactly the guest's libc etc do).
> 
>> the 32-bit getdents system call emulation in a 64-bit qemu-user
>> process would just silently truncate the d_off field as part of
>> the translation, not reporting an error.
>> [...]
>> This truncation has always been a bug; it breaks telldir/seekdir
>> at least in some cases.
> 
> Yes; you can't fit a quart into a pint pot, so if the guest
> only handles 32-bit offsets then truncation is about all we
> can do. This works fine if offsets are offsets, assuming the
> directory isn't so enormous it would have broken the guest
> anyway. I'm not aware of any issues with this other than the
> oddball ext4 offsets-are-hashes situation -- could you expand
> on the telldir/seekdir issue? (I suppose we should probably
> make QEMU's syscall emulation layer return "no more entries"
> rather than entries with truncated hashes.)

For ext4 at least, you could just shift the high 32-bit part of
the 64-bit hash down into a 32-bit value in telldir(), and
shift it back up when seekdir() is called.

Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 873 bytes --]

  reply	other threads:[~2018-12-28  0:23 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-12-27 17:18 d_off field in struct dirent and 32-on-64 emulation Florian Weimer
2018-12-27 17:41 ` [Qemu-devel] " Peter Maydell
2018-12-28  0:23   ` Andreas Dilger [this message]
2018-12-28 11:18     ` Peter Maydell
2018-12-28 23:16       ` Andreas Dilger
2018-12-29  0:12         ` Peter Maydell
2018-12-29  1:54           ` Matthew Wilcox
2018-12-29 16:49             ` Andy Lutomirski
2018-12-30 13:59               ` Peter Maydell
2018-12-29  2:11       ` Theodore Y. Ts'o
2018-12-29  2:37         ` Dominique Martinet
2018-12-29  3:14           ` Theodore Y. Ts'o
2018-12-29  4:04             ` [V9fs-developer] " Dominique Martinet
     [not found] ` <C65D3222-723F-4C0B-AF02-38488C302E84@amacapital.net>
2018-12-27 17:56   ` Florian Weimer
2018-12-27 17:58 ` Adhemerval Zanella
2018-12-27 18:09   ` Florian Weimer
2018-12-28 11:53     ` Adhemerval Zanella
2018-12-28 11:56       ` Florian Weimer
2018-12-28 12:01         ` Florian Weimer
2018-12-28 12:21           ` Adhemerval Zanella
2018-12-31 17:03       ` Joseph Myers
2019-01-02 13:16         ` Adhemerval Zanella
2018-12-28  2:23 ` Dmitry V. Levin
2018-12-28  7:38   ` Florian Weimer
2018-12-28 15:26 ` Andy Lutomirski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=9C6A7D45-CF53-4C61-B5DD-12CA0D419972@dilger.ca \
    --to=adilger@dilger.ca \
    --cc=arnd@arndb.de \
    --cc=ericvh@gmail.com \
    --cc=fw@deneb.enyo.de \
    --cc=hpa@zytor.com \
    --cc=libc-alpha@sourceware.org \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=lucho@ionkov.net \
    --cc=peter.maydell@linaro.org \
    --cc=qemu-devel@nongnu.org \
    --cc=rminnich@sandia.gov \
    --cc=v9fs-developer@lists.sourceforge.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).