LKML Archive on lore.kernel.org
 help / Atom feed
From: Jan Kara <jack@suse.cz>
To: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jan Kara <jack@suse.cz>, Matthew Wilcox <willy@infradead.org>,
	Nicholas Piggin <npiggin@gmail.com>,
	john.hubbard@gmail.com, Michal Hocko <mhocko@kernel.org>,
	Christopher Lameter <cl@linux.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Al Viro <viro@zeniv.linux.org.uk>,
	linux-mm@kvack.org, LKML <linux-kernel@vger.kernel.org>,
	linux-rdma <linux-rdma@vger.kernel.org>,
	linux-fsdevel@vger.kernel.org, John Hubbard <jhubbard@nvidia.com>
Subject: Re: [PATCH 0/2] mm/fs: put_user_page() proposal
Date: Tue, 10 Jul 2018 09:51:54 +0200
Message-ID: <20180710075154.lhaxaf6q35fnvdiq@quack2.suse.cz> (raw)
In-Reply-To: <20180709195657.GA29026@ziepe.ca>

On Mon 09-07-18 13:56:57, Jason Gunthorpe wrote:
> On Mon, Jul 09, 2018 at 09:47:40PM +0200, Jan Kara wrote:
> > On Mon 09-07-18 10:16:51, Matthew Wilcox wrote:
> > > On Mon, Jul 09, 2018 at 06:08:06PM +0200, Jan Kara wrote:
> > > > On Mon 09-07-18 18:49:37, Nicholas Piggin wrote:
> > > > > The problem with blocking in clear_page_dirty_for_io is that the fs is
> > > > > holding the page lock (or locks) and possibly others too. If you
> > > > > expect to have a bunch of long term references hanging around on the
> > > > > page, then there will be hangs and deadlocks everywhere. And if you do
> > > > > not have such log term references, then page lock (or some similar lock
> > > > > bit) for the duration of the DMA should be about enough?
> > > > 
> > > > There are two separate questions:
> > > > 
> > > > 1) How to identify pages pinned for DMA? We have no bit in struct page to
> > > > use and we cannot reuse page lock as that immediately creates lock
> > > > inversions e.g. in direct IO code (which could be fixed but then good luck
> > > > with auditing all the other GUP users). Matthew had an idea and John
> > > > implemented it based on removing page from LRU and using that space in
> > > > struct page. So we at least have a way to identify pages that are pinned
> > > > and can track their pin count.
> > > > 
> > > > 2) What to do when some page is pinned but we need to do e.g.
> > > > clear_page_dirty_for_io(). After some more thinking I agree with you that
> > > > just blocking waiting for page to unpin will create deadlocks like:
> > > 
> > > Why are we trying to writeback a page that is pinned?  It's presumed to
> > > be continuously redirtied by its pinner.  We can't evict it.
> > 
> > So what should be a result of fsync(file), where some 'file' pages are
> > pinned e.g. by running direct IO? If we just skip those pages, we'll lie to
> > userspace that data was committed while it was not (and it's not only about
> > data that has landed in those pages via DMA, you can have first 1k of a page
> > modified by normal IO in parallel to DMA modifying second 1k chunk). If
> > fsync(2) returns error, it would be really unexpected by userspace and most
> > apps will just not handle that correctly. So what else can you do than
> > block?
> 
> I think as a userspace I would expect the 'current content' to be
> flushed without waiting..

Yes but the problem is we cannot generally write out a page whose contents
is possibly changing (e.g. RAID5 checksums would then be wrong). But maybe
using bounce pages (and keeping original page still dirty) in such case would
be worth it - originally I thought using bounce pages would not bring us
much but now seeing problems with blocking in more detail maybe they are
worth the trouble after all...

> If you block fsync() then anyone using a RDMA MR with it will just
> dead lock. What happens if two processes open the same file and
> one makes a MR and the other calls fsync()? Sounds bad.

Yes, that's one of the reasons why we were discussing revoke mechanisms for
long term pins. But with bounce pages we could possibly avoid that (except
for cases like DAX + truncate where it's really unavoidable but there it's
a new functionality so mandating revoke and returning error otherwise is
fine I guess).

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

  reply index

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-07-09  8:05 john.hubbard
2018-07-09  8:05 ` [PATCH 1/2] mm: introduce put_user_page(), placeholder version john.hubbard
2018-07-09 10:08   ` kbuild test robot
2018-07-09 18:48     ` John Hubbard
2018-07-09 15:53   ` Jason Gunthorpe
2018-07-09 16:11     ` Jan Kara
2018-07-09  8:05 ` [PATCH 2/2] goldfish_pipe/mm: convert to the new put_user_page() call john.hubbard
2018-07-09  8:49 ` [PATCH 0/2] mm/fs: put_user_page() proposal Nicholas Piggin
2018-07-09 16:08   ` Jan Kara
2018-07-09 17:16     ` Matthew Wilcox
2018-07-09 19:47       ` Jan Kara
2018-07-09 19:56         ` Jason Gunthorpe
2018-07-10  7:51           ` Jan Kara [this message]
2018-07-09 20:00         ` Matthew Wilcox
2018-07-10  8:21           ` Jan Kara
2018-07-09 16:27 ` Jan Kara

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180710075154.lhaxaf6q35fnvdiq@quack2.suse.cz \
    --to=jack@suse.cz \
    --cc=cl@linux.com \
    --cc=dan.j.williams@intel.com \
    --cc=jgg@ziepe.ca \
    --cc=jhubbard@nvidia.com \
    --cc=john.hubbard@gmail.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-rdma@vger.kernel.org \
    --cc=mhocko@kernel.org \
    --cc=npiggin@gmail.com \
    --cc=viro@zeniv.linux.org.uk \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

LKML Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/lkml/0 lkml/git/0.git
	git clone --mirror https://lore.kernel.org/lkml/1 lkml/git/1.git
	git clone --mirror https://lore.kernel.org/lkml/2 lkml/git/2.git
	git clone --mirror https://lore.kernel.org/lkml/3 lkml/git/3.git
	git clone --mirror https://lore.kernel.org/lkml/4 lkml/git/4.git
	git clone --mirror https://lore.kernel.org/lkml/5 lkml/git/5.git
	git clone --mirror https://lore.kernel.org/lkml/6 lkml/git/6.git
	git clone --mirror https://lore.kernel.org/lkml/7 lkml/git/7.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 lkml lkml/ https://lore.kernel.org/lkml \
		linux-kernel@vger.kernel.org linux-kernel@archiver.kernel.org
	public-inbox-index lkml


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-kernel


AGPL code for this site: git clone https://public-inbox.org/ public-inbox