Linux-Block Archive on lore.kernel.org
From: Dave Chinner <david@fromorbit.com>
To: Jens Axboe <axboe@kernel.dk>
Cc: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org,
	linux-block@vger.kernel.org, linux-arch@vger.kernel.org,
	hch@lst.de, jmoyer@redhat.com, avi@scylladb.com
Subject: Re: [PATCH 12/15] io_uring: add support for pre-mapped user IO buffers
Date: Thu, 17 Jan 2019 09:09:38 +1100
Message-ID: <20190116220938.GR4205@dastard>
In-Reply-To: <9db63405-6797-9305-3ce1-fdc11edbf49c@kernel.dk>

On Wed, Jan 16, 2019 at 02:20:53PM -0700, Jens Axboe wrote:
> On 1/16/19 1:53 PM, Dave Chinner wrote:
> > On Wed, Jan 16, 2019 at 10:50:00AM -0700, Jens Axboe wrote:
> >> If we have fixed user buffers, we can map them into the kernel when we
> >> setup the io_context. That avoids the need to do get_user_pages() for
> >> each and every IO.
> > .....
> >> +			return -ENOMEM;
> >> +	} while (atomic_long_cmpxchg(&ctx->user->locked_vm, cur_pages,
> >> +					new_pages) != cur_pages);
> >> +
> >> +	return 0;
> >> +}
> >> +
> >> +static int io_sqe_buffer_unregister(struct io_ring_ctx *ctx)
> >> +{
> >> +	int i, j;
> >> +
> >> +	if (!ctx->user_bufs)
> >> +		return -EINVAL;
> >> +
> >> +	for (i = 0; i < ctx->sq_entries; i++) {
> >> +		struct io_mapped_ubuf *imu = &ctx->user_bufs[i];
> >> +
> >> +		for (j = 0; j < imu->nr_bvecs; j++) {
> >> +			set_page_dirty_lock(imu->bvec[j].bv_page);
> >> +			put_page(imu->bvec[j].bv_page);
> >> +		}
> > 
> > Hmmm, so we call set_page_dirty() when the gup reference is dropped...
> > 
> > .....
> > 
> >> +static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
> >> +				  unsigned nr_args)
> >> +{
> > 
> > .....
> > 
> >> +		down_write(&current->mm->mmap_sem);
> >> +		pret = get_user_pages_longterm(ubuf, nr_pages, FOLL_WRITE,
> >> +						pages, NULL);
> >> +		up_write(&current->mm->mmap_sem);
> > 
> > Thought so. This has the same problem as RDMA w.r.t. using
> > file-backed mappings for the user buffer.  It is not synchronised
> > against truncate, hole punches, async page writeback cleaning the
> > page, etc, and so can lead to data corruption and/or kernel panics.
> > 
> > It also can't be used with DAX because the above problems are
> > actually a use-after-free of storage space, not just a dangling
> > page reference that can be cleaned up after the gup pin is dropped.
> > 
> > Perhaps, at least until we solve the GUP problems w.r.t. file backed
> > pages and/or add and require file layout leases for these references,
> > we should error out if the user buffer pages are file-backed
> > mappings?
> 
> Thanks for taking a look at this.
> 
> I'd be fine with that restriction, especially since it can get relaxed
> down the line. Do we have an appropriate API for this?  And why isn't
> get_user_pages_longterm() that exact API already?

get_user_pages_longterm() is the right thing to use to ensure DAX
doesn't trip over this - it's effectively just get_user_pages()
with a "if (vma_is_fsdax(vma))" check in it to abort and return
-EOPNOTSUPP. IOWs, this is safe on DAX but it's not safe on anything
else. :/
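
To make the distinction concrete, here is a rough userspace model of the
two policies, not kernel code. struct mock_vma and both mock_check_*()
helpers are hypothetical names invented for illustration; the "fsdax"
flag stands in for vma_is_fsdax(vma), and "file_backed" is assumed to
stand in for a non-NULL vma->vm_file. The first helper models the only
thing the longterm variant rejects today, the second models the stricter
"no file-backed pages at all" restriction suggested earlier in the
thread.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the few vma properties that matter here. */
struct mock_vma {
	bool file_backed;	/* assumed to model vma->vm_file != NULL */
	bool fsdax;		/* models vma_is_fsdax(vma) */
};

/* Models the current longterm behaviour: only fs-DAX mappings are refused. */
static int mock_check_longterm(const struct mock_vma *vmas, size_t nr)
{
	assert(vmas != NULL || nr == 0);

	for (size_t i = 0; i < nr; i++) {
		if (vmas[i].fsdax)
			return -EOPNOTSUPP;
	}
	return 0;
}

/* Models the stricter proposal: refuse any file-backed mapping as well. */
static int mock_check_no_file_backed(const struct mock_vma *vmas, size_t nr)
{
	int ret = mock_check_longterm(vmas, nr);

	if (ret)
		return ret;

	for (size_t i = 0; i < nr; i++) {
		if (vmas[i].file_backed)
			return -EOPNOTSUPP;
	}
	return 0;
}
```

In this model a page cache backed mapping passes the first check but
fails the second, which is exactly the class of buffer that is exposed
to the truncate and writeback races described above.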

Unfortunately, disallowing userspace GUP pins on non-DAX file backed
pages will break existing "mostly just work" userspace apps all over
the place. And so right now there are discussions ongoing about how
to make gup references avoid the writeback races and be able to be
seen/tracked by other kernel infrastructure (see the long, long
thread "[PATCH 0/2] put_user_page*(): start converting the call
sites" on -fsdevel). Progress is slow, but I think we're starting to
close in on a workable solution.

FWIW, this doesn't solve the "long term user pin will block
filesystem operations until unpin" problem; that's what moving to
revocable file layout leases is intended to solve. Patches adding a
user API for this were posted some time ago, but we've got to solve
the other problems first....

> Would seem that most
> (all?) callers of this API are currently broken then.

Yup, there's a long, long history of machines using userspace RDMA
panicking because filesystems have detected or tripped over invalid
page cache state during writeback attempts. This is not a new
problem....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
