Linux-Fsdevel Archive on lore.kernel.org
From: Jerome Glisse <jglisse@redhat.com>
To: John Hubbard <jhubbard@nvidia.com>
Cc: Jan Kara <jack@suse.cz>, Matthew Wilcox <willy@infradead.org>,
	Dave Chinner <david@fromorbit.com>,
	Dan Williams <dan.j.williams@intel.com>,
	John Hubbard <john.hubbard@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linux MM <linux-mm@kvack.org>,
	tom@talpey.com, Al Viro <viro@zeniv.linux.org.uk>,
	benve@cisco.com, Christoph Hellwig <hch@infradead.org>,
	Christopher Lameter <cl@linux.com>,
	"Dalessandro, Dennis" <dennis.dalessandro@intel.com>,
	Doug Ledford <dledford@redhat.com>,
	Jason Gunthorpe <jgg@ziepe.ca>, Michal Hocko <mhocko@kernel.org>,
	mike.marciniszyn@intel.com, rcampbell@nvidia.com,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>
Subject: Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
Date: Tue, 18 Dec 2018 21:07:24 -0500
Message-ID: <20181219020723.GD4347@redhat.com>
In-Reply-To: <9f43d124-2386-7bfd-d90b-4d0417f51ccd@nvidia.com>

On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:
> On 12/18/18 1:30 AM, Jan Kara wrote:
> > On Mon 17-12-18 10:34:43, Matthew Wilcox wrote:
> >> On Mon, Dec 17, 2018 at 01:11:50PM -0500, Jerome Glisse wrote:
> >>> On Mon, Dec 17, 2018 at 08:58:19AM +1100, Dave Chinner wrote:
> >>>> Sure, that's a possibility, but that doesn't close off any race
> >>>> conditions because there can be DMA into the page in progress while
> >>>> the page is being bounced, right? AFAICT this ext3+DIF/DIX case is
> >>>> different in that there is no 3rd-party access to the page while it
> >>>> is under IO (ext3 arbitrates all access to it's metadata), and so
> >>>> nothing can actually race for modification of the page between
> >>>> submission and bouncing at the block layer.
> >>>>
> >>>> In this case, the moment the page is unlocked, anyone else can map
> >>>> it and start (R)DMA on it, and that can happen before the bio is
> >>>> bounced by the block layer. So AFAICT, block layer bouncing doesn't
> >>>> solve the problem of racing writeback and DMA direct to the page we
> >>>> are doing IO on. Yes, it reduces the race window substantially, but
> >>>> it doesn't get rid of it.
> >>>
> >>> So the event flow is:
> >>>     - userspace create object that match a range of virtual address
> >>>       against a given kernel sub-system (let's say infiniband) and
> >>>       let's assume that the range is an mmap() of a regular file
> >>>     - device driver do GUP on the range (let's assume it is a write
> >>>       GUP) so if the page is not already map with write permission
> >>>       in the page table than a page fault is trigger and page_mkwrite
> >>>       happens
> >>>     - Once GUP return the page to the device driver and once the
> >>>       device driver as updated the hardware states to allow access
> >>>       to this page then from that point on hardware can write to the
> >>>       page at _any_ time, it is fully disconnected from any fs event
> >>>       like write back, it fully ignore things like page_mkclean
> >>>
> >>> This is how it is to day, we allowed people to push upstream such
> >>> users of GUP. This is a fact we have to live with, we can not stop
> >>> hardware access to the page, we can not force the hardware to follow
> >>> page_mkclean and force a page_mkwrite once write back ends. This is
> >>> the situation we are inheriting (and i am personnaly not happy with
> >>> that).
> >>>
> >>> From my point of view we are left with 2 choices:
> >>>     [C1] break all drivers that do not abide by the page_mkclean and
> >>>          page_mkwrite
> >>>     [C2] mitigate as much as possible the issue
> >>>
> >>> For [C2] the idea is to keep track of GUP per page so we know if we
> >>> can expect the page to be written to at any time. Here is the event
> >>> flow:
> >>>     - driver GUP the page and program the hardware, page is mark as
> >>>       GUPed
> >>>     ...
> >>>     - write back kicks in on the dirty page, lock the page and every
> >>>       thing as usual , sees it is GUPed and inform the block layer to
> >>>       use a bounce page
> >>
> >> No.  The solution John, Dan & I have been looking at is to take the
> >> dirty page off the LRU while it is pinned by GUP.  It will never be
> >> found for writeback.
> >>
> >> That's not the end of the story though.  Other parts of the kernel (eg
> >> msync) also need to be taught to stay away from pages which are pinned
> >> by GUP.  But the idea is that no page gets written back to storage while
> >> it's pinned by GUP.  Only when the last GUP ends is the page returned
> >> to the list of dirty pages.
> > 
> > We've been through this in:
> > 
> > https://lore.kernel.org/lkml/20180709194740.rymbt2fzohbdmpye@quack2.suse.cz/
> > 
> > back in July. You cannot just skip pages for fsync(2). So as I wrote above -
> > memory cleaning writeback can skip pinned pages. Data integrity writeback
> > must be able to write pinned pages. And bouncing is one reasonable way how
> > to do that.
> > 
> > This writeback decision is pretty much independent from the mechanism by
> > which we are going to identify pinned pages. Whether that's going to be
> > separate counter in struct page, using page->_mapcount, or separately
> > allocated data structure as you now promote.
> > 
> > I currently like the most the _mapcount suggestion from Jerome but I'm not
> > really attached to any solution as long as it performs reasonably and
> > someone can make it working :) as I don't have time to implement it at
> > least till January.
> > 
> 
> OK, so let's take another look at Jerome's _mapcount idea all by itself (using
> *only* the tracking pinned pages aspect), given that it is the lightest weight
> solution for that.  
> 
> So as I understand it, this would use page->_mapcount to store both the real
> mapcount, and the dma pinned count (simply added together), but only do so for
> file-backed (non-anonymous) pages:
> 
> 
> __get_user_pages()
> {
> 	...
> 	get_page(page);
> 
> 	if (!PageAnon)
> 		atomic_inc(page->_mapcount);
> 	...
> }
> 
> put_user_page(struct page *page)
> {
> 	...
> 	if (!PageAnon)
> 		atomic_dec(&page->_mapcount);
> 
> 	put_page(page);
> 	...
> }
> 
> ...and then in the various consumers of the DMA pinned count, we use page_mapped(page)
> to see if any mapcount remains, and if so, we treat it as DMA pinned. Is that what you 
> had in mind?

Mostly, with two extra observations:
    [1] We only need to know the pin count when a write back kicks in
    [2] We need to protect GUP code with wait_on_page_writeback() in case
        GUP is racing with a write back that might not see the
        elevated mapcount in time.

So for [2]:

__get_user_pages()
{
    get_page(page);

    if (!PageAnon(page)) {
        atomic_inc(&page->_mapcount);
+       if (PageWriteback(page)) {
+           // Assume we are racing and the current write back will not
+           // see the elevated mapcount, so wait for the current write
+           // back and force a page fault
+           wait_on_page_writeback(page);
+           // force slow path that will fault again
+       }
    }
}

For [1], needing the pin count only during write back makes page_mkclean()
the perfect spot to check for it, so:

int page_mkclean(struct page *page)
{
    int cleaned = 0;
+   int real_mapcount = 0;
    struct address_space *mapping;
    struct rmap_walk_control rwc = {
        .arg = (void *)&cleaned,
        .rmap_one = page_mkclean_one,
        .invalid_vma = invalid_mkclean_vma,
+       .mapcount = &real_mapcount,
    };

    BUG_ON(!PageLocked(page));

    if (!page_mapped(page))
        return 0;

    mapping = page_mapping(page);
    if (!mapping)
        return 0;

    // rmap_walk() needs to change to count mappings and return the
    // value in .mapcount (an easy change)
    rmap_walk(page, &rwc);

    // Big fat comment to explain what is going on
+   if ((page_mapcount(page) - real_mapcount) > 0) {
+       SetPageDMAPined(page);
+   } else {
+       ClearPageDMAPined(page);
+   }

    // Maybe we want to leverage the int nature of return value so that
    // we can express more than cleaned/truncated and express cleaned/
    // truncated/pinned for benefit of caller and that way we do not
    // even need one bit as page flags above.

    return cleaned;
}
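
To make the arithmetic in that check concrete, here is a minimal userspace
sketch. It is only a model under stated assumptions: struct pin_model_page
and every pin_model_*() function are hypothetical names, the counts start at
0 rather than using the kernel's real _mapcount encoding, and the
cpu_mappings field stands in for whatever a modified rmap_walk() would
report. Nothing here is kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of a file-backed page (not struct page). */
struct pin_model_page {
    int mapcount;      /* CPU mappings and GUP pins, simply added together */
    int cpu_mappings;  /* what an rmap walk would actually find */
    bool dma_pinned;   /* stands in for the proposed page flag */
};

/* Model of a CPU mapping being established: both views go up. */
static void pin_model_map(struct pin_model_page *p)
{
    p->mapcount++;
    p->cpu_mappings++;
}

/* Model of GUP on a file-backed page: only the combined count goes up. */
static void pin_model_gup(struct pin_model_page *p)
{
    p->mapcount++;
}

/* Model of put_user_page(): drop one pin, which must exist. */
static void pin_model_put_user_page(struct pin_model_page *p)
{
    assert(p->mapcount > p->cpu_mappings);
    p->mapcount--;
}

/*
 * Model of the check at the end of page_mkclean(): whatever the rmap
 * walk cannot account for is treated as a pin. Returns the pin count.
 */
static int pin_model_mkclean_check(struct pin_model_page *p)
{
    int real_mapcount = p->cpu_mappings; /* would come from rmap_walk() */
    int pins = p->mapcount - real_mapcount;

    p->dma_pinned = pins > 0;
    return pins;
}
```

With two CPU mappings and one GUP, the model reports one pin and sets the
flag; after the matching put_user_page() it reports zero and clears it.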

You do not want to change page_mapped(); I do not see a need for that.

Then the whole discussion between Jan and Dave seems to indicate that
the bounce mechanism will need to be in the fs layer and that we can
not reuse the bio bounce mechanism. This means that more work is needed
at the fs level (so that filesystems do not freak out on a bounce page).
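
As a rough illustration of the decision Jan described (memory cleaning
write back can skip pinned pages, data integrity write back must write
them, with bouncing as one way to do that), here is a userspace sketch.
All names (bounce_page, bounce_wb_mode, bounce_pick_source) are
hypothetical and this is not any filesystem's API; it only shows which
buffer the IO would read from in each case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define BOUNCE_PAGE_SIZE 4096

/* Hypothetical write back modes: memory cleaning vs data integrity. */
enum bounce_wb_mode { BOUNCE_WB_CLEANING, BOUNCE_WB_INTEGRITY };

/* Hypothetical userspace stand-in for a page cache page. */
struct bounce_page {
    unsigned char data[BOUNCE_PAGE_SIZE];
    bool dma_pinned;
};

/*
 * Pick the buffer the IO should be issued from:
 *  - not pinned: write the page itself;
 *  - pinned, cleaning write back: skip the page (NULL);
 *  - pinned, integrity write back: snapshot into the bounce buffer so
 *    the IO sees data that the device can no longer change under it.
 */
static const unsigned char *bounce_pick_source(const struct bounce_page *page,
                                               enum bounce_wb_mode mode,
                                               unsigned char *bounce)
{
    assert(page != NULL && bounce != NULL);

    if (!page->dma_pinned)
        return page->data;
    if (mode == BOUNCE_WB_CLEANING)
        return NULL;

    memcpy(bounce, page->data, BOUNCE_PAGE_SIZE);
    return bounce;
}
```

The sketch says nothing about how a real filesystem would track the bounce
page afterwards, which is exactly the extra fs-level work mentioned above.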

Note that there are a few gotchas where we need to preserve the pin count,
mostly in the truncate code path, which can remove a page from the page
cache and overwrite the mapcount in the process. This would need to be
fixed so that the mapcount is not overwritten; otherwise put_user_page()
sets the mapcount to an invalid value, leaving the page in a bad state
that will at some point trigger a kernel BUG_ON().
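
The truncate gotcha can be modelled in userspace as follows. This is a
sketch with hypothetical names (trunc_page and the trunc_*() helpers) and a
simplified count that starts at 0; it does not reflect the actual truncate
code, only the difference between overwriting the combined count and
removing just the CPU mappings from it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: mapcount holds CPU mappings plus GUP pins. */
struct trunc_page {
    int mapcount;
    int cpu_mappings;
};

/* Problematic variant: truncate overwrites the count, losing the pins. */
static void trunc_overwrite(struct trunc_page *p)
{
    p->mapcount = 0;
    p->cpu_mappings = 0;
}

/* Suggested variant: drop only the CPU mappings, keep the pins. */
static void trunc_preserve_pins(struct trunc_page *p)
{
    assert(p->mapcount >= p->cpu_mappings);
    p->mapcount -= p->cpu_mappings;
    p->cpu_mappings = 0;
}

/*
 * Model of put_user_page(): drop one pin and report whether the count
 * is still valid (a negative value is the "bad state" in this model).
 */
static bool trunc_put_user_page(struct trunc_page *p)
{
    p->mapcount--;
    return p->mapcount >= 0;
}
```

Starting from two CPU mappings and one pin, the overwriting variant leaves
the later put_user_page() with a negative count, while the preserving
variant ends at zero.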

I am not saying block truncate; I am saying make sure it does not
erase the pin count and keep truncating happily. How to handle truncate
is a discussion to have with each existing GUP user, to see what they
want to do.

Obviously a somewhat deeper analysis of all the spots that use mapcount is
needed to check that we are not breaking anything, but off the top of my
head I can not think of anything bad (migrate will abort, and other things
will assume the page is mapped even if it is only in a hardware page
table, ...).

Cheers,
Jérôme


