All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jan Kara <jack@suse.cz>
To: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>, David Howells <dhowells@redhat.com>,
	John Hubbard <jhubbard@nvidia.com>,
	Al Viro <viro@zeniv.linux.org.uk>,
	Christoph Hellwig <hch@infradead.org>,
	Jens Axboe <axboe@kernel.dk>, Jeff Layton <jlayton@kernel.org>,
	Logan Gunthorpe <logang@deltatee.com>,
	linux-fsdevel@vger.kernel.org, linux-block@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH v7 0/8] iov_iter: Improve page extraction (ref, pin or just list)
Date: Tue, 24 Jan 2023 11:29:31 +0100	[thread overview]
Message-ID: <20230124102931.g7e33syuhfo7s36h@quack3> (raw)
In-Reply-To: <Y87E5HAo7ZoHyrbE@casper.infradead.org>

On Mon 23-01-23 17:33:24, Matthew Wilcox wrote:
> On Mon, Jan 23, 2023 at 05:42:18PM +0100, Jan Kara wrote:
> > On Mon 23-01-23 16:31:32, Matthew Wilcox wrote:
> > > On Fri, Jan 20, 2023 at 05:55:48PM +0000, David Howells wrote:
> > > >  (3) Make the bio struct carry a pair of flags to indicate the cleanup
> > > >      mode.  BIO_NO_PAGE_REF is replaced with BIO_PAGE_REFFED (equivalent to
> > > >      FOLL_GET) and BIO_PAGE_PINNED (equivalent to BIO_PAGE_PINNED) is
> > > >      added.
> > > 
> > > I think there's a simpler solution than all of this.
> > > 
> > > As I understand the fundamental problem here, the question is
> > > when to copy a page on fork.  We have the optimisation of COW, but
> > > O_DIRECT/RDMA/... breaks it.  So all this page pinning is to indicate
> > > to the fork code "You can't do COW to this page".
> > > 
> > > Why do we want to track that information on a per-page basis?  Wouldn't it
> > > be easier to have a VM_NOCOW flag in vma->vm_flags?  Set it the first
> > > time somebody does an O_DIRECT read or RDMA pin.  That's it.  Pages in
> > > that VMA will now never be COWed, regardless of their refcount/mapcount.
> > > And the whole "did we pin or get this page" problem goes away.  Along
> > > with folio->pincount.
> > 
> > Well, but anon COW code is not the only (planned) consumer of the pincount.
> > Filesystems also need to know whether a (shared pagecache) page is pinned
> > and can thus be modified behind their backs. And for that VMA tracking
> > isn't really an option.
> 
> Bleh, I'd forgotten about that problem.  We really do need to keep
> track of which pages are under I/O for this case, because we need to
> tell the filesystem that they are now available for writeback.
> 
> That said, I don't know that we need to keep track of it in the
> pages themselves.  Can't we have something similar to rmap which
> keeps track of a range of pinned pages, and have it taken care of
> at a higher level (ie unpin the pages in the dio_iodone_t rather
> than in the BIO completion handler)?

We could but bear in mind that there are more places than just (the two)
direct IO paths. There are many GUP users that can be operating on mmapped
pagecache and that can modify page data. Also note that e.g. direct IO is
rather performance sensitive so any CPU overhead that is added to that path
gets noticed. That was actually the reason why we keep pinned pages on the
LRU - John tried to remove them while pinning but the overhead was not
acceptable.

Finally, since we need to handle pinning for anon pages as well, just
(ab)using the page refcount looks like the least painful solution.

> I'm not even sure why pinned pagecache pages remain on the LRU.
> They should probably go onto the unevictable list with the mlocked
> pages, then on unpin get marked dirty and placed on the active list.
> There's no point in writing back a pinned page since it can be
> written to at any instant without any part of the kernel knowing.

True but as John said sometimes we need to writeout even pinned page - e.g.
on fsync(2). For some RDMA users which keep pages pinned for days or
months, this is actually crutial...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

  parent reply	other threads:[~2023-01-24 10:29 UTC|newest]

Thread overview: 55+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-01-20 17:55 [PATCH v7 0/8] iov_iter: Improve page extraction (ref, pin or just list) David Howells
2023-01-20 17:55 ` [PATCH v7 1/8] iov_iter: Define flags to qualify page extraction David Howells
2023-01-21 13:01   ` Christoph Hellwig
2023-01-20 17:55 ` [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator David Howells
2023-01-21 13:01   ` Christoph Hellwig
2023-01-21 13:10   ` Christoph Hellwig
2023-01-21 13:30   ` David Howells
2023-01-21 13:33     ` Christoph Hellwig
2023-01-23 11:28   ` David Hildenbrand
2023-01-23 11:51   ` David Howells
2023-01-23 13:11     ` David Hildenbrand
2023-01-23 13:19     ` David Howells
2023-01-23 13:24       ` David Hildenbrand
2023-01-23 19:56         ` John Hubbard
2023-01-26 22:15         ` Al Viro
2023-01-26 23:41           ` David Hildenbrand
2023-01-27  0:05           ` David Howells
2023-01-27  0:20             ` David Hildenbrand
2023-01-23 13:38       ` David Howells
2023-01-23 14:20         ` David Hildenbrand
2023-01-23 14:48           ` Christoph Hellwig
2023-01-23 16:11         ` Jan Kara
2023-01-23 16:17           ` Christoph Hellwig
2023-01-23 23:07           ` John Hubbard
2023-01-24  5:57             ` Christoph Hellwig
2023-01-24  6:55               ` John Hubbard
2023-01-23 12:00   ` David Howells
2023-01-23 12:00     ` David Howells
2023-01-20 17:55 ` [PATCH v7 3/8] mm: Provide a helper to drop a pin/ref on a page David Howells
2023-01-20 17:55 ` [PATCH v7 4/8] block: Rename BIO_NO_PAGE_REF to BIO_PAGE_REFFED and invert the meaning David Howells
2023-01-21 13:04   ` Christoph Hellwig
2023-01-23  9:38   ` David Howells
2023-01-23  9:56     ` Christoph Hellwig
2023-01-20 17:55 ` [PATCH v7 5/8] block: Add BIO_PAGE_PINNED David Howells
2023-01-21 13:05   ` Christoph Hellwig
2023-01-20 17:55 ` [PATCH v7 6/8] block: Make bio structs pin pages rather than ref'ing if appropriate David Howells
2023-01-21 13:07   ` Christoph Hellwig
2023-01-23 11:28   ` David Howells
2023-01-23 14:49     ` Christoph Hellwig
2023-01-20 17:55 ` [PATCH v7 7/8] block: Fix bio_flagged() so that gcc can better optimise it David Howells
2023-01-20 17:55 ` [PATCH v7 8/8] mm: Renumber FOLL_GET and FOLL_PIN down David Howells
2023-01-20 18:59   ` Matthew Wilcox
2023-01-20 19:18   ` David Howells
2023-01-23 16:31 ` [PATCH v7 0/8] iov_iter: Improve page extraction (ref, pin or just list) Matthew Wilcox
2023-01-23 16:42   ` Jan Kara
2023-01-23 17:33     ` Matthew Wilcox
2023-01-23 22:53       ` John Hubbard
2023-01-24 10:29       ` Jan Kara [this message]
2023-01-24 13:21         ` Christoph Hellwig
2023-01-23 16:38 ` David Howells
2023-01-23 16:42   ` Matthew Wilcox
2023-01-23 17:25     ` Jan Kara
2023-01-24 10:24       ` David Hildenbrand
2023-01-23 17:19   ` David Howells
2023-01-23 18:04     ` Matthew Wilcox

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230124102931.g7e33syuhfo7s36h@quack3 \
    --to=jack@suse.cz \
    --cc=axboe@kernel.dk \
    --cc=dhowells@redhat.com \
    --cc=hch@infradead.org \
    --cc=jhubbard@nvidia.com \
    --cc=jlayton@kernel.org \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=logang@deltatee.com \
    --cc=viro@zeniv.linux.org.uk \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.