All of lore.kernel.org
 help / color / mirror / Atom feed
From: John Hubbard <jhubbard@nvidia.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>,
	Christoph Hellwig <hch@infradead.org>,
	Ilya Dryomov <idryomov@gmail.com>, Jens Axboe <axboe@kernel.dk>,
	Jeff Layton <jlayton@kernel.org>, <linux-xfs@vger.kernel.org>,
	<linux-fsdevel@vger.kernel.org>, <linux-block@vger.kernel.org>,
	<ceph-devel@vger.kernel.org>, <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>,
	John Hubbard <jhubbard@nvidia.com>
Subject: [PATCH 0/5] bio: Direct IO: convert to pin_user_pages_fast()
Date: Fri, 21 Aug 2020 21:20:54 -0700	[thread overview]
Message-ID: <20200822042059.1805541-1-jhubbard@nvidia.com> (raw)

Hi,

This converts the Direct IO block/bio layer over to use FOLL_PIN pages
(those acquired via pin_user_pages*()). This effectively converts
several file systems (ext4, for example) that use the common Direct IO
routines. See "Remaining work", below for a bit more detail there.

Quite a few approaches have been considered over the years. This one is
inspired by Christoph Hellwig's July, 2019 observation that there are
only 5 ITER_ types, and we can simplify handling of them for Direct IO
[1]. After working through how bio submission and completion works, I
became convinced that this is the simplest and cleanest approach to
conversion.

Not content to let well enough alone, I then continued on to the
unthinkable: adding a new flag to struct bio, whose "short int" flags
field was full, thuse triggering an expansion of the field from 16, to
32 bits. This allows for a nice assertion in bio_release_pages(), that
the bio page release mechanism matches the page acquisition mechanism.
This is especially welcome for a change that affects a lot of callers
and could really make a mess if there is a bug somewhere.

I'm unable to spot any performance implications, either theoretically or
via (rather light) performance testing, from enlarging bio.bi_flags, but
I suspect that there are maybe still valid reasons for having such a
tiny bio.bi_flags field. I just have no idea what they are. (Hardware
that knows the size of a bio? No, because there would be obvious
build-time assertions, and comments about such a constraint.) Anyway, I
can drop that patch if it seems like too much cost for too little
benefit.

And finally, as long as we're all staring at the iter_iov code, I'm
including a nice easy ceph patch, that removes one more caller of
iter_iov_get_pages().

Design notes ============

This whole approach depends on certain concepts:

1) Each struct bio instance must not mix different types of pages:
FOLL_PIN and non-FOLL_PIN pages. (By FOLL_PIN I'm referring to pages
that were acquired and pinned via pin_user_page*() routines.)
Fortunately, this is already an enforced constraint for bio's, as
evidenced by the existence and use of BIO_NO_PAGE_REF.

2) Christoph Hellwig's July, 2019 observation that there are
only 5 ITER_ types, and we can simplify handling of them for Direct IO
[1]. Accordingly, this series implements the following pseudocode:

Direct IO behavior:

    ITER_IOVEC:
        pin_user_pages_fast();
        break;

    ITER_KVEC:    // already elevated page refcount, leave alone
    ITER_BVEC:    // already elevated page refcount, leave alone
    ITER_PIPE:    // just, no :)
    ITER_DISCARD: // discard
        return -EFAULT or -ENVALID;

...which works for callers that already have sorted out which case they
are in. Such as, Direct IO in the block/bio layers.

Now, this does leave ITER_KVEC and ITER_BVEC unconverted, but on the
other hand, it's not clear that these are actually affected in the real
world, by the get_user_pages()+filesystem interaction problems of [2].
If it turns out to matter, then those can be handled too, but it's just
more refactoring and surgery to do so.

Testing
=======

Performance: no obvious regressions from running fio (direct=1: Direct
IO) on both SSD and NVMe drives.

Functionality: selected non-destructive bare metal xfstests on xfs,
ext4, btrfs, orangefs filesystems, plus LTP tests.

Note that I have only a single x86 64-bit test machine, though.

Remaining work
==============

Non-converted call sites for iter_iov_get_pages*() at the
moment include: net, crypto, cifs, ceph, vhost, fuse, nfs/direct,
vhost/scsi.

About-to-be-converted sites (in a subsequent patch) are: Direct IO for
filesystems that use the generic read/write functions.

[1] https://lore.kernel.org/kvm/20190724061750.GA19397@infradead.org/

[2] "Explicit pinning of user-space pages":
    https://lwn.net/Articles/807108/


John Hubbard (5):
  iov_iter: introduce iov_iter_pin_user_pages*() routines
  mm/gup: introduce pin_user_page()
  bio: convert get_user_pages_fast() --> pin_user_pages_fast()
  bio: introduce BIO_FOLL_PIN flag
  fs/ceph: use pipe_get_pages_alloc() for pipe

 block/bio.c               | 29 +++++++------
 block/blk-map.c           |  7 +--
 fs/ceph/file.c            |  3 +-
 fs/direct-io.c            | 30 ++++++-------
 fs/iomap/direct-io.c      |  2 +-
 include/linux/blk_types.h |  5 ++-
 include/linux/mm.h        |  2 +
 include/linux/uio.h       |  9 +++-
 lib/iov_iter.c            | 91 +++++++++++++++++++++++++++++++++++++--
 mm/gup.c                  | 30 +++++++++++++
 10 files changed, 169 insertions(+), 39 deletions(-)

-- 
2.28.0


             reply	other threads:[~2020-08-22  4:22 UTC|newest]

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-08-22  4:20 John Hubbard [this message]
2020-08-22  4:20 ` [PATCH 1/5] iov_iter: introduce iov_iter_pin_user_pages*() routines John Hubbard
2020-08-22  4:20 ` [PATCH 2/5] mm/gup: introduce pin_user_page() John Hubbard
2020-08-22  4:20 ` [PATCH 3/5] bio: convert get_user_pages_fast() --> pin_user_pages_fast() John Hubbard
2020-08-22  4:20 ` [PATCH 4/5] bio: introduce BIO_FOLL_PIN flag John Hubbard
2020-08-23  6:25   ` Christoph Hellwig
2020-08-23  6:57     ` John Hubbard
2020-08-24  7:36       ` John Hubbard
2020-08-24  9:20     ` Jens Axboe
2020-08-24 14:42       ` Christoph Hellwig
2020-08-27  3:25   ` [bio] 37abbdc72e: WARNING:at_block/bio.c:#bio_release_pages kernel test robot
2020-08-27  3:25     ` kernel test robot
2020-08-27  3:25     ` [LTP] " kernel test robot
2020-08-27  3:59     ` John Hubbard
2020-08-27  3:59       ` John Hubbard
2020-08-27  3:59       ` [LTP] " John Hubbard
2020-08-22  4:20 ` [PATCH 5/5] fs/ceph: use pipe_get_pages_alloc() for pipe John Hubbard
2020-08-24 10:53   ` Jeff Layton
2020-08-24 10:53     ` Jeff Layton
2020-08-24 17:54     ` John Hubbard
2020-08-24 18:47       ` Jeff Layton
2020-08-24 18:47         ` Jeff Layton
2020-08-24 18:54         ` Matthew Wilcox
2020-08-24 19:02           ` John Hubbard
2020-08-25  1:20           ` [PATCH v2] " John Hubbard
2020-08-25 16:22             ` Jeff Layton
2020-08-25 16:22               ` Jeff Layton
2020-08-25  1:54 ` [PATCH 0/5] bio: Direct IO: convert to pin_user_pages_fast() Al Viro
2020-08-25  2:07   ` Al Viro
2020-08-25  2:13     ` John Hubbard
2020-08-25  2:07   ` John Hubbard
2020-08-25  2:22     ` Al Viro
2020-08-25  2:27       ` John Hubbard

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200822042059.1805541-1-jhubbard@nvidia.com \
    --to=jhubbard@nvidia.com \
    --cc=akpm@linux-foundation.org \
    --cc=axboe@kernel.dk \
    --cc=ceph-devel@vger.kernel.org \
    --cc=hch@infradead.org \
    --cc=idryomov@gmail.com \
    --cc=jlayton@kernel.org \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.