Re: [PATCH v2 4/7] iov_iter: new iov_iter_pin_pages*() routines

From: Al Viro <viro@zeniv.linux.org.uk>
To: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@infradead.org>,
	John Hubbard <jhubbard@nvidia.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Jens Axboe <axboe@kernel.dk>, Miklos Szeredi <miklos@szeredi.hu>,
	"Darrick J . Wong" <djwong@kernel.org>,
	Trond Myklebust <trond.myklebust@hammerspace.com>,
	Anna Schumaker <anna@kernel.org>,
	David Hildenbrand <david@redhat.com>,
	Logan Gunthorpe <logang@deltatee.com>,
	linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-xfs@vger.kernel.org, linux-nfs@vger.kernel.org,
	linux-mm@kvack.org, LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v2 4/7] iov_iter: new iov_iter_pin_pages*() routines
Date: Thu, 22 Sep 2022 03:22:48 +0100	[thread overview]
Message-ID: <YyvG+Oih2A37Grcf@ZenIV> (raw)
In-Reply-To: <20220915081625.6a72nza6yq4l5etp@quack3>

On Thu, Sep 15, 2022 at 10:16:25AM +0200, Jan Kara wrote:

> > How would that work?  What protects the area where you want to avoid running
> > into pinned pages from previously acceptable page getting pinned?  If "they
> > must have been successfully unmapped" is a part of what you are planning, we
> > really do have a problem...
> 
> But this is a very good question. So far the idea was that we lock the
> page, unmap (or writeprotect) the page, and then check pincount == 0 and
> that is a reliable method for making sure page data is stable (until we
> unlock the page & release other locks blocking page faults and writes). But
> once suddently ordinary page references can be used to create pins this
> does not work anymore. Hrm.
> 
> Just brainstorming ideas now: So we'd either need to obtain the pins early
> when we still have the virtual address (but I guess that is often not
> practical but should work e.g. for normal direct IO path) or we need some
> way to "simulate" the page fault when pinning the page, just don't map it
> into page tables in the end. This simulated page fault could be perhaps
> avoided if rmap walk shows that the page is already mapped somewhere with
> suitable permissions.

OK.  As far as I can see, the rules are along the lines of
	* creator of ITER_BVEC/ITER_XARRAY is responsible for pages being safe.
	  That includes
		* page known to be locked by caller
		* page being privately allocated and not visible to anyone else
		* iterator being data source
		* page coming from pin_user_pages(), possibly as the result of
		  iov_iter_pin_pages() on ITER_IOVEC/ITER_UBUF.
	* ITER_PIPE pages are always safe
	* pages found in ITER_BVEC/ITER_XARRAY are safe, since the iterator
	  had been created with such.
My preference would be to have iov_iter_get_pages() and friends pin if and
only if we have data-destination iov_iter that is user-backed.  For
data-source user-backed we only need FOLL_GET, and for all other flavours
(ITER_BVEC, etc.) we only do get_page(), if we need to grab any references
at all.

What I'd like to have is the understanding of the places where we drop
the references acquired by iov_iter_get_pages().  How do we decide
whether to unpin?  E.g. pipe_buffer carries a reference to page and no
way to tell whether it's a pinned one; results of iov_iter_get_pages()
on ITER_IOVEC *can* end up there, but thankfully only from data-source
(== WRITE, aka.  ITER_SOURCE) iov_iter.  So for those we don't care.
Then there's nfs_request; AFAICS, we do need to pin the references in
those if they are coming from nfs_direct_read_schedule_iovec(), but
not if they come from readpage_async_filler().  How do we deal with
coalescence, etc.?  It's been a long time since I really looked at
that code...  Christoph, could you give any comments on that one?

Note, BTW, that nfs_request coming from readpage_async_filler() have
pages locked by caller; the ones from nfs_direct_read_schedule_iovec()
do not, and that's where we want them pinned.  Resulting page references
end up (after quite a trip through data structures) stuffed into struct
rpc_rqst ->rc_recv_buf.pages[] and when a response arrives from server,
they get picked by xs_read_bvec() and fed to iov_iter_bvec().  In one
case it's safe since the pages are locked; in another - since they would
come from pin_user_pages().  The call chain at the time they are used
has nothing to do with the originator - sunrpc is looking at the arrived
response to READ that matches an rpc_rqst that had been created by sender
of that request and safety is the sender's responsibility.