linux-kernel.vger.kernel.org archive mirror
From: John Hubbard <jhubbard@nvidia.com>
To: Christopher Lameter <cl@linux.com>, <john.hubbard@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>, <linux-mm@kvack.org>,
	Al Viro <viro@zeniv.linux.org.uk>,
	Christian Benvenuti <benve@cisco.com>,
	Christoph Hellwig <hch@infradead.org>,
	Dan Williams <dan.j.williams@intel.com>,
	Dave Chinner <david@fromorbit.com>,
	Dennis Dalessandro <dennis.dalessandro@intel.com>,
	Doug Ledford <dledford@redhat.com>,
	Ira Weiny <ira.weiny@intel.com>, Jan Kara <jack@suse.cz>,
	Jason Gunthorpe <jgg@ziepe.ca>,
	Jerome Glisse <jglisse@redhat.com>,
	Matthew Wilcox <willy@infradead.org>,
	Michal Hocko <mhocko@kernel.org>,
	Mike Rapoport <rppt@linux.ibm.com>,
	Mike Marciniszyn <mike.marciniszyn@intel.com>,
	Ralph Campbell <rcampbell@nvidia.com>,
	Tom Talpey <tom@talpey.com>, LKML <linux-kernel@vger.kernel.org>,
	<linux-fsdevel@vger.kernel.org>
Subject: Re: [PATCH v3 1/1] mm: introduce put_user_page*(), placeholder versions
Date: Thu, 7 Mar 2019 19:15:24 -0800
Message-ID: <3cc3c382-2505-3b6c-ec58-1f14ebcb77e8@nvidia.com>
In-Reply-To: <010001695b3d2701-3215b423-7367-44d6-98bc-64fc2f84264a-000000@email.amazonses.com>

On 3/7/19 6:58 PM, Christopher Lameter wrote:
> On Wed, 6 Mar 2019, john.hubbard@gmail.com wrote:
> 
>> Dave Chinner's description of this is very clear:
>>
>>     "The fundamental issue is that ->page_mkwrite must be called on every
>>     write access to a clean file backed page, not just the first one.
>>     How long the GUP reference lasts is irrelevant, if the page is clean
>>     and you need to dirty it, you must call ->page_mkwrite before it is
>>     marked writeable and dirtied. Every. Time."
>>
>> This is just one symptom of the larger design problem: filesystems do not
>> actually support get_user_pages() being called on their pages, and letting
>> hardware write directly to those pages--even though that pattern has been
>> going on since about 2005 or so.
> 
> Can we distinguish between real filesystems that actually write to a
> backing device and the special filesystems (like hugetlbfs, shm and
> friends) that are like anonymous memory and do not require
> ->page_mkwrite() in the same way as regular filesystems?

Yes. I'll change the wording in the commit message to say "real filesystems
that actually write to a backing device", instead of "filesystems". That
does help, thanks.

> 
> The use that I have seen in my section of the world has been restricted to
> RDMA and get_user_pages being limited to anonymous memory and those
> special filesystems. And if the RDMA memory is of such type then the use
> in the past and present is safe.

Agreed.

> 
> So a logical other approach would be to simply not allow the use of
> long term get_user_pages() on real filesystem pages. I hope this patch
> supports that?

This patch neither prevents nor provides that. What this patch does is
provide a prerequisite to clear identification of pages that have had
get_user_pages() called on them.
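
To make that concrete, here is a rough user-space sketch of the idea. It is
not the kernel code: `struct page`, `get_page()` and `put_page()` below are
toy stand-ins I made up for illustration, and only the name
`put_user_page()` comes from the patch subject. The assumption is that the
placeholder does nothing beyond dropping the reference, so behavior is
unchanged for now, but every gup release funnels through one call that can
later treat such pages differently.

```c
#include <assert.h>

/* Toy stand-in for the kernel's struct page: only a reference count. */
struct page {
	int refcount;
};

/* Toy model of taking a page reference (what a gup call would do). */
static inline void get_page(struct page *page)
{
	page->refcount++;
}

/* Toy model of dropping a page reference. */
static inline void put_page(struct page *page)
{
	assert(page->refcount > 0);
	page->refcount--;
}

/*
 * Placeholder release for a page obtained via get_user_pages().
 * In this sketch it only forwards to put_page(); the value is that
 * gup releases now have their own entry point, separate from
 * ordinary put_page() callers.
 */
static inline void put_user_page(struct page *page)
{
	put_page(page);
}
```

A caller that pinned a page through gup would release it with
`put_user_page(page)` instead of `put_page(page)`; the reference count ends
up the same either way in this model.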

> 
> It is customary after all that a file read or write operation involve one
> single file(!) and that what is written either comes from or goes to
> memory (anonymous or special memory filesystem).
> 
> If you have an mmapped memory segment with a regular device backed file
> then you already have one file associated with a memory segment and a
> filesystem that does take care of synchronizing the contents of the memory
> segment to a backing device.
> 
> If you now perform RDMA or device I/O on such a memory segment then you
> will have *two* different devices interacting with that memory segment. I
> think that ought not to happen and not be supported out of the box. It
> will be difficult to handle and the semantics will be hard for users to
> understand.
> 
> What could happen is that the filesystem could agree on request to allow
> third party I/O to go to such a memory segment. But that needs to be well
> defined and clearly and explicitly handled by some mechanism in user space
> that has well defined semantics for data integrity for the filesystem as
> well as the RDMA or device I/O.
> 

Those discussions are underway. Dave Chinner and others have been talking
about filesystem leases, for example. The key point here is that we'll still
need, in any of these approaches, to be able to identify the gup-pinned
pages. And there are lots (100+) of call sites to change. So I figure we'd
better get that started.
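
For a sense of what a single call-site change looks like, here is a small
user-space sketch. Everything in it is a toy model: `struct page`,
`set_page_dirty()`, `put_page()` and the two `driver_release_*()` functions
are hypothetical stand-ins, and `put_user_pages_dirty()` is my assumption
about the name and shape of one of the array variants implied by the
`put_user_page*()` in the subject line. Treat it as an illustration of the
mechanical conversion, not as the patch itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct page: a reference count and a dirty flag. */
struct page {
	int refcount;
	bool dirty;
};

static inline void put_page(struct page *page)
{
	assert(page->refcount > 0);
	page->refcount--;
}

static inline void set_page_dirty(struct page *page)
{
	page->dirty = true;
}

/* Placeholder: same effect as put_page(), but a distinct entry point. */
static inline void put_user_page(struct page *page)
{
	put_page(page);
}

/* Assumed array variant: dirty each page, then release it as a gup page. */
static inline void put_user_pages_dirty(struct page **pages, size_t npages)
{
	for (size_t i = 0; i < npages; i++) {
		set_page_dirty(pages[i]);
		put_user_page(pages[i]);
	}
}

/* Hypothetical call site before conversion: open-coded loop. */
static inline void driver_release_old(struct page **pages, size_t npages)
{
	for (size_t i = 0; i < npages; i++) {
		set_page_dirty(pages[i]);
		put_page(pages[i]);
	}
}

/* The same hypothetical call site after conversion. */
static inline void driver_release_new(struct page **pages, size_t npages)
{
	put_user_pages_dirty(pages, npages);
}
```

In this model both versions leave the pages in the same state; the
converted one just makes it visible that the pages came from gup.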

thanks,
-- 
John Hubbard
NVIDIA
