Linux-Fsdevel Archive on lore.kernel.org
From: Jan Kara <jack@suse.cz>
To: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>,
	john.hubbard@gmail.com, Matthew Wilcox <willy@infradead.org>,
	Michal Hocko <mhocko@kernel.org>,
	Christopher Lameter <cl@linux.com>,
	Jason Gunthorpe <jgg@ziepe.ca>,
	Dan Williams <dan.j.williams@intel.com>, Jan Kara <jack@suse.cz>,
	Al Viro <viro@zeniv.linux.org.uk>,
	linux-mm@kvack.org, LKML <linux-kernel@vger.kernel.org>,
	linux-rdma <linux-rdma@vger.kernel.org>,
	linux-fsdevel@vger.kernel.org,
	Christian Benvenuti <benve@cisco.com>,
	Dennis Dalessandro <dennis.dalessandro@intel.com>,
	Doug Ledford <dledford@redhat.com>,
	Mike Marciniszyn <mike.marciniszyn@intel.com>
Subject: Re: [PATCH 0/4] get_user_pages*() and RDMA: first steps
Date: Wed, 3 Oct 2018 18:08:36 +0200
Message-ID: <20181003160836.GF24030@quack2.suse.cz>
In-Reply-To: <20180929084608.GA3188@redhat.com>

On Sat 29-09-18 04:46:09, Jerome Glisse wrote:
> On Fri, Sep 28, 2018 at 07:28:16PM -0700, John Hubbard wrote:
> > Actually, the latest direction on that discussion was toward periodically
> > writing back, even while under RDMA, via bounce buffers:
> > 
> >   https://lkml.kernel.org/r/20180710082100.mkdwngdv5kkrcz6n@quack2.suse.cz
> > 
> > I still think that's viable. Of course, there are other things besides 
> > writeback (see below) that might also lead to waiting.
> 
> Writeback under bounce buffers is fine. Looking back at the links you
> provided, the solution that was discussed was blocking in page_mkclean(),
> which is horrible from my point of view.

Yeah, after looking into it for some time, we figured that waiting for page
pins in page_mkclean() isn't really going to fly due to deadlocks. So we
came up with the bounce buffers idea which should solve that nicely.

> > > With the solution put forward here you can potentially wait _forever_ for
> > > the driver that holds a pin to drop it. This was the point I was trying to
> > > get across during LSF/MM.
> > 
> > I agree that just blocking indefinitely is generally unacceptable for kernel
> > code, but we can probably avoid it for many cases (bounce buffers), and
> > if we think it is really appropriate (file system unmounting, maybe?) then
> > maybe tolerate it in some rare cases.  
> > 
> > > You cannot fix broken hardware that decided to
> > > use GUP for a feature it can't reliably provide because the hardware is
> > > not capable of behaving.
> > > 
> > > Because code is easier, here is what I meant:
> > > 
> > > https://cgit.freedesktop.org/~glisse/linux/commit/?h=gup&id=a5dbc0fe7e71d347067579f13579df372ec48389
> > > https://cgit.freedesktop.org/~glisse/linux/commit/?h=gup&id=01677bc039c791a16d5f82b3ef84917d62fac826
> > > 
> > 
> > While that may work sometimes, I don't think it is reliable enough to trust for
> > identifying pages that have been gup-pinned. There's just too much overloading of
> > other mechanisms going on there, and if we pile on top with this constraint of "if you
> > have +3 refcounts, and this particular combination of page counts and mapcounts, then
> > you're definitely a long-term pinned page", I think users will find a lot of corner
> > cases for us that break that assumption. 
> 
> So mapcount == refcount (modulo the extra references for mapping and
> private) should hold. Here are the cases where it does not:
>     - page being migrated
>     - page being isolated from LRU
>     - mempolicy changes against the page
>     - page cache lookup
>     - some file system activities
>     - I am likely missing a couple here, I am doing this from memory
> 
> What matters is that all of the above are transitory: the extra reference
> only lasts for as long as it takes for the action to finish (migration,
> mempolicy change, ...).
> 
> So skipping those false-positive pages while reclaiming likely makes sense,
> the blocking free buffer maybe not.

Well, as John wrote, these page refcount checks are fragile (and actually
filesystem dependent, as some filesystems hold a page reference from their
page->private data and some don't). So I think we really need a new,
reliable mechanism for tracking page references from GUP. John is working
towards that.
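
To make the fragility concrete, here is a minimal userspace sketch of the
refcount-vs-mapcount heuristic discussed above. It is only a model under
stated assumptions: struct page_model, expected_refs(), maybe_gup_pinned(),
model_get() and model_put() are hypothetical names invented for
illustration, not kernel code and not taken from the commits linked above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of a page; NOT the kernel's struct page. */
struct page_model {
	int refcount;		/* total references held on the page */
	int mapcount;		/* number of page-table mappings */
	bool in_pagecache;	/* assumed: page cache holds one reference */
	bool fs_private_ref;	/* assumed: fs holds one ref via page->private */
};

/* References the heuristic believes it can explain. */
static int expected_refs(const struct page_model *p)
{
	return p->mapcount + (p->in_pagecache ? 1 : 0) +
	       (p->fs_private_ref ? 1 : 0);
}

/*
 * The heuristic: any reference that is not explained by mappings, the
 * page cache, or page->private is treated as a possible GUP pin.
 */
static bool maybe_gup_pinned(const struct page_model *p)
{
	return p->refcount > expected_refs(p);
}

/*
 * Take or drop one extra reference. In this model a GUP pin and a
 * transient user (migration, LRU isolation, ...) look exactly the same.
 */
static void model_get(struct page_model *p)
{
	p->refcount++;
}

static void model_put(struct page_model *p)
{
	assert(p->refcount > 0);
	p->refcount--;
}
```

In this model, one model_get() on an otherwise idle page makes
maybe_gup_pinned() return true whether the reference came from GUP or from
a transient user. Conversely, if the checker assumes fs_private_ref for a
filesystem that does not actually take that reference, a real pin is
exactly cancelled out and goes undetected.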

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


Thread overview: 29+ messages
2018-09-28  5:39 john.hubbard
2018-09-28  5:39 ` [PATCH 1/4] mm: get_user_pages: consolidate error handling john.hubbard
2018-09-28  5:39 ` [PATCH 3/4] infiniband/mm: convert to the new put_user_page() call john.hubbard
2018-09-28 15:39   ` Jason Gunthorpe
2018-09-29  3:12     ` John Hubbard
2018-09-29 16:21       ` Matthew Wilcox
2018-09-29 19:19         ` Jason Gunthorpe
2018-10-01 12:50         ` Christoph Hellwig
2018-10-01 15:29           ` Matthew Wilcox
2018-10-01 15:51             ` Christoph Hellwig
2018-10-01 14:35       ` Dennis Dalessandro
2018-10-03  5:40         ` John Hubbard
2018-10-03 16:27       ` Jan Kara
2018-10-03 23:19         ` John Hubbard
2018-09-28  5:39 ` [PATCH 2/4] mm: introduce put_user_page(), placeholder version john.hubbard
2018-10-03 16:22   ` Jan Kara
2018-10-03 23:23     ` John Hubbard
2018-09-28  5:39 ` [PATCH 4/4] goldfish_pipe/mm: convert to the new release_user_pages() call john.hubbard
2018-09-28 15:29 ` [PATCH 0/4] get_user_pages*() and RDMA: first steps Jerome Glisse
2018-09-28 19:06   ` John Hubbard
2018-09-28 21:49     ` Jerome Glisse
2018-09-29  2:28       ` John Hubbard
2018-09-29  8:46         ` Jerome Glisse
2018-10-01  6:11           ` Dave Chinner
2018-10-01 12:47             ` Christoph Hellwig
2018-10-02  1:14               ` Dave Chinner
2018-10-03 16:21                 ` Jan Kara
2018-10-01 15:31             ` Jason Gunthorpe
2018-10-03 16:08           ` Jan Kara [this message]
