From: Peter Maydell <>
To: Andreas Dilger <>
Cc: Florian Weimer <>,
	linux-fsdevel <>,
	Linux API <>,
	Ext4 Developers List <>,
	Arnd Bergmann <>,
	lkml - Kernel Mailing List <>,
	QEMU Developers <>
Subject: Re: [Qemu-devel] d_off field in struct dirent and 32-on-64 emulation
Date: Fri, 28 Dec 2018 11:18:18 +0000
Message-ID: <>
In-Reply-To: <>

On Fri, 28 Dec 2018 at 00:23, Andreas Dilger <> wrote:
> On Dec 27, 2018, at 10:41 AM, Peter Maydell <> wrote:
> > As you note, this causes breakage for userspace programs which
> > need to implement an API/ABI with 32-bit offset but which only
> > have access to the kernel's 64-bit offset API/ABI.
> This is (IMHO) a bit of an oxymoron, isn't it?  Applications using
> the 64-bit API, but storing the value in a 32-bit field?

I didn't say "which choose to store the value in a 32-bit field",
I said "which have to implement an API/ABI which has 32-bit fields".
In QEMU's case, we use the host kernel's ABI, which has 64-bit
offset fields. We implement a syscall ABI for the guest binary
we are running under emulation, which may have 32-bit offset fields
(for instance, if we are running a 32-bit Arm binary). Both of
these ABIs are fixed -- QEMU doesn't have a choice here, it
just has to make the best effort it can with what the host kernel
provides it, to provide the semantics the guest binary needs.
My suggestion in this thread is that the host kernel provides
a wider range of facilities so that QEMU can do the job it's
trying to do.
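
To make the constraint concrete, here is a minimal sketch (not QEMU's
actual code) of the narrowing step an emulator faces when it copies a
host directory entry into a guest structure with a 32-bit offset field.
The helper name guest_store_d_off is hypothetical, and a signed 32-bit
guest field is an assumption:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: try to store a 64-bit host d_off value in a
 * guest ABI field that is only 32 bits wide (assumed signed here).
 * Returns false when the host value cannot be represented, in which
 * case *guest_off is left untouched.
 */
static bool guest_store_d_off(int64_t host_off, int32_t *guest_off)
{
    if (host_off < INT32_MIN || host_off > INT32_MAX) {
        return false;   /* no faithful 32-bit representation exists */
    }
    *guest_off = (int32_t)host_off;
    return true;
}
```

When the helper returns false, there is no value the emulator can hand
to the guest that the host kernel would later accept back unchanged.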

>  The same
> problem would exist for filesystems with 64-bit inodes or 64-bit
> file offsets trying to store these values in 32-bit variables.
> It might work most of the time, but it can also break randomly.

In general inodes and offsets start from 0 and work up --
so almost all of the time they don't actually overflow.
The problem with ext4 directory hash "offsets" is that they
overflow all the time and immediately, so instead of "works
unless you have a weird edge case" like all the other filesystems,
it's "never works".
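
The difference can be illustrated with a small sketch. The cookie
layout below (hash bits in the upper half of a 64-bit value) is an
assumption made for illustration, loosely modelled on how ext4 packs
its directory hash; the function names are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumed layout, for illustration only: the major hash (shifted right
 * by one bit) occupies the upper 32 bits, the minor hash the lower 32.
 */
static uint64_t hash_cookie(uint32_t major, uint32_t minor)
{
    return ((uint64_t)(major >> 1) << 32) | minor;
}

/* True if the offset fits in a non-negative signed 32-bit field. */
static bool fits_in_31_bits(uint64_t off)
{
    return off <= (uint64_t)INT32_MAX;
}
```

A sequential offset such as 5 passes fits_in_31_bits(), whereas under
this layout any cookie whose major hash is 2 or greater does not, and
a hash value is essentially never that small.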

> > I think the best fix for this would be for the kernel to either
> > (a) consistently use a 32-bit hash or (b) to provide an API
> > so that userspace can use the FMODE_32BITHASH flag the way
> > that kernel-internal users already can.
> It would be relatively straightforward to add a "32bitapi" mount
> option to return a 32-bit directory hash to userspace for operations
> on that mountpoint (ext4 doesn't have 64-bit inode numbers yet).
> However, I can't think of an easy way to do this on a per-process
> basis without just having it call the 32-bit API directly.

The problem is that there is no 32-bit API in some cases
(unless I have misunderstood the kernel code) -- not all
host architectures implement compat syscalls or allow them
to be called from 64-bit processes or implement all the older
syscall variants that had smaller offsets. If there was a guaranteed
"this syscall always exists and always gives me 32-bit offsets"
we could use it.

> For ext4 at least, you could just shift the high 32-bit part of
> the 64-bit hash down into a 32-bit value in telldir(), and
> shift it back up when seekdir() is called.

Yes, that has been suggested, but it seemed a bit dubious
to bake in knowledge of ext4's internal implementation details.
Can we rely on this as an ABI promise that will always work
for all versions of all file systems going forwards?
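
For reference, the suggested shift amounts to something like the sketch
below. It assumes the significant bits of the cookie sit in the upper
32 bits, which is exactly the filesystem-specific detail in question;
the function names are hypothetical, and the low 32 bits are discarded
on the way down:

```c
#include <stdint.h>

/*
 * telldir() direction: keep only the upper 32 bits of the host cookie.
 * Assumption: those bits are the ones the filesystem needs in order to
 * find the position again.
 */
static uint32_t cookie_to_guest(uint64_t host_off)
{
    return (uint32_t)(host_off >> 32);
}

/*
 * seekdir() direction: rebuild a 64-bit cookie from the guest value.
 * The low 32 bits come back as zero, so the round trip is lossy.
 */
static uint64_t guest_to_cookie(uint32_t guest_off)
{
    return (uint64_t)guest_off << 32;
}
```

A cookie of 0x12345678deadbeef comes back as 0x1234567800000000, so
this only behaves well if the filesystem tolerates losing the low half.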

-- PMM
