Linux-kselftest Archive on lore.kernel.org
From: John Hubbard <jhubbard@nvidia.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: "Al Viro" <viro@zeniv.linux.org.uk>,
	"Christoph Hellwig" <hch@infradead.org>,
	"Dan Williams" <dan.j.williams@intel.com>,
	"Dave Chinner" <david@fromorbit.com>,
	"Ira Weiny" <ira.weiny@intel.com>, "Jan Kara" <jack@suse.cz>,
	"Jason Gunthorpe" <jgg@ziepe.ca>,
	"Jonathan Corbet" <corbet@lwn.net>,
	"Jérôme Glisse" <jglisse@redhat.com>,
	"Kirill A . Shutemov" <kirill@shutemov.name>,
	"Michal Hocko" <mhocko@suse.com>,
	"Mike Kravetz" <mike.kravetz@oracle.com>,
	"Shuah Khan" <shuah@kernel.org>,
	"Vlastimil Babka" <vbabka@suse.cz>,
	linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kselftest@vger.kernel.org, linux-rdma@vger.kernel.org,
	linux-mm@kvack.org, LKML <linux-kernel@vger.kernel.org>,
	"John Hubbard" <jhubbard@nvidia.com>
Subject: [PATCH v2 0/8] mm/gup: track FOLL_PIN pages (follow on from v12)
Date: Tue, 28 Jan 2020 19:24:09 -0800
Message-ID: <20200129032417.3085670-1-jhubbard@nvidia.com>

OK, as requested, I've split the tracking patch into 6 smaller patches,
and it should be *much* easier to understand and review now.

============================================================
Changes since v1:

* Split the tracking patch into 6 smaller patches.

* Rebased onto today's linux-next/akpm (there weren't any conflicts).

* Fixed an "unsigned int" vs. "int" problem in gup_benchmark, reported
  by Nathan Chancellor. (I don't see it in my local builds, probably
  because they use gcc, but an LLVM test found the mismatch.)

* Fixed a huge page pincount problem (add/subtract vs.
  increment/decrement), spotted by Jan Kara.
============================================================

There is a reasonable case to be made for merging two of the patches
(patches 4 and 5), given that patch 4 provides tracking that has upper
limits on the number of pins that can be done with huge pages. Let me
know if anyone wants those merged, but unless there is some weird chance
of someone grabbing patch 4 and not patch 5, I don't really see the
need. Meanwhile, it's easier to review in this form.

Also, patch 3 has been revived. Earlier reviewers asked for it to be
merged into the tracking patch (one cannot please everyone, heh), but
now it's back out on its own.

This series activates tracking of FOLL_PIN pages. This is in support of fixing
the get_user_pages()+DMA problem described in [1]-[4].

It is based on today's (Jan 28) linux-next (branch: akpm),
commit 280e9cb00b41 ("drivers/media/platform/sti/delta/delta-ipc.c: fix
read buffer overflow")

There is a git repo and branch, for convenience in reviewing:

    git@github.com:johnhubbard/linux.git
            track_user_pages_v2_linux-next_akpm_28Jan2020

FOLL_PIN support is (so far) in mmotm and linux-next. However, the
patch to use FOLL_PIN to track pages was *not* submitted, because Leon
saw an RDMA test suite failure that involved (I think) page refcount
overflows when huge pages were used.

This patch definitively solves that kind of overflow problem, by adding
an exact pincount, for compound pages (of order > 1), in the 3rd struct
page of a compound page. If available, that form of pincounting is used,
instead of the GUP_PIN_COUNTING_BIAS approach. Thanks again to Jan Kara
for that idea.

Here's the last reviewed version of the tracking patch (v11):

  https://lore.kernel.org/r/20191216222537.491123-1-jhubbard@nvidia.com

Jan Kara had provided a reviewed-by tag for that, but I've had to remove
it (again) here, due to having changed the patch "a little bit", in
order to add the feature described above.

Other interesting changes:

* dump_page(): added one or two new things to report for compound
  pages: head refcount (for all compound pages), and map_pincount (for
  compound pages of order > 1).

* Documentation/core-api/pin_user_pages.rst: removed the "TODO" for the
  huge page refcount upper limit problems, and added notes about how it
  works now. Also added a note about the dump_page() enhancements.

* Added some comments in gup.c and mm.h, to explain that there are two
  ways to count pinned pages: exact (for compound pages of order > 1)
  and fuzzy (GUP_PIN_COUNTING_BIAS: for all other pages).
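
For readers new to the two schemes, here is a minimal userspace sketch of
how they differ. It is a model, not the kernel code: struct model_page and
the model_*() helpers are hypothetical names, and the bias value of 1024 is
an assumption about how GUP_PIN_COUNTING_BIAS is defined in mm.h.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: mirrors GUP_PIN_COUNTING_BIAS, taken here to be 1024. */
#define MODEL_PIN_COUNTING_BIAS 1024

/* Hypothetical stand-in for struct page; only the fields the model needs. */
struct model_page {
	int refcount;      /* ordinary references (plus bias for fuzzy pins) */
	int pincount;      /* exact pin count, used only if has_pincount */
	bool has_pincount; /* models a compound page of order > 1 */
};

/* Take one pin, using the exact counter when the page has one. */
static void model_pin(struct model_page *page)
{
	if (page->has_pincount) {
		page->refcount += 1;
		page->pincount += 1;
	} else {
		page->refcount += MODEL_PIN_COUNTING_BIAS;
	}
}

/* Drop one pin that was taken by model_pin(). */
static void model_unpin(struct model_page *page)
{
	if (page->has_pincount) {
		assert(page->pincount >= 1 && page->refcount >= 1);
		page->refcount -= 1;
		page->pincount -= 1;
	} else {
		assert(page->refcount >= MODEL_PIN_COUNTING_BIAS);
		page->refcount -= MODEL_PIN_COUNTING_BIAS;
	}
}

/* Exact answer when a pincount exists; otherwise a refcount heuristic. */
static bool model_maybe_pinned(const struct model_page *page)
{
	if (page->has_pincount)
		return page->pincount > 0;
	return page->refcount >= MODEL_PIN_COUNTING_BIAS;
}
```

In this model the fuzzy check can report a false positive: an unpinned
page that happens to hold 1024 ordinary references looks pinned, whereas
a page with a dedicated pincount is only reported once a pin is taken.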

============================================================
General notes about the tracking patch:

This is a prerequisite to solving the problem of proper interactions
between file-backed pages and [R]DMA activities, as discussed in [1],
[2], [3], [4] and in a remarkable number of email threads since about
2017. :)

In contrast to earlier approaches, the page tracking can be
incrementally applied to the kernel call sites that, until now, have
been simply calling get_user_pages() ("gup"). In other words, opt-in by
changing from this:

    get_user_pages() (sets FOLL_GET)
    put_page()

to this:

    pin_user_pages() (sets FOLL_PIN)
    unpin_user_page()
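
The reason the release call has to change together with the acquire call
can be seen in a small userspace model of the refcount arithmetic for an
ordinary (non-compound) page. This is only a sketch: struct model_page and
the model_*() helpers are hypothetical, and the bias of 1024 is an assumed
value for GUP_PIN_COUNTING_BIAS.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: mirrors GUP_PIN_COUNTING_BIAS, taken here to be 1024. */
#define MODEL_PIN_COUNTING_BIAS 1024

/* Hypothetical stand-in for an ordinary page; refcount only. */
struct model_page {
	int refcount;
};

/* FOLL_GET style: one plain reference, released by model_put_page(). */
static void model_get_page(struct model_page *page)
{
	page->refcount += 1;
}

static void model_put_page(struct model_page *page)
{
	assert(page->refcount >= 1);
	page->refcount -= 1;
}

/* FOLL_PIN style: a biased reference, released by model_unpin_page(). */
static void model_pin_page(struct model_page *page)
{
	page->refcount += MODEL_PIN_COUNTING_BIAS;
}

static void model_unpin_page(struct model_page *page)
{
	assert(page->refcount >= MODEL_PIN_COUNTING_BIAS);
	page->refcount -= MODEL_PIN_COUNTING_BIAS;
}

/* Heuristic check: does the refcount suggest at least one pin? */
static bool model_maybe_pinned(const struct model_page *page)
{
	return page->refcount >= MODEL_PIN_COUNTING_BIAS;
}
```

In the model, a get/put pair never makes the page look pinned, a pin/unpin
pair does so for exactly the duration of the pin, and pairing a pin with a
plain put leaves the refcount elevated by the bias minus one.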

============================================================
Next steps:

* Convert more subsystems from get_user_pages() to pin_user_pages().
* Work with Ira and others to connect this all up with file system
  leases.

[1] Some slow progress on get_user_pages() (Apr 2, 2019):
    https://lwn.net/Articles/784574/

[2] DMA and get_user_pages() (LPC: Dec 12, 2018):
    https://lwn.net/Articles/774411/

[3] The trouble with get_user_pages() (Apr 30, 2018):
    https://lwn.net/Articles/753027/

[4] LWN kernel index: get_user_pages()
    https://lwn.net/Kernel/Index/#Memory_management-get_user_pages


John Hubbard (8):
  mm: dump_page: print head page's refcount, for compound pages
  mm/gup: split get_user_pages_remote() into two routines
  mm/gup: pass a flags arg to __gup_device_* functions
  mm/gup: track FOLL_PIN pages
  mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages
  mm/gup: /proc/vmstat: pin_user_pages (FOLL_PIN) reporting
  mm/gup_benchmark: support pin_user_pages() and related calls
  selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN
    coverage

 Documentation/core-api/pin_user_pages.rst  |  47 +--
 include/linux/mm.h                         | 109 ++++-
 include/linux/mm_types.h                   |   7 +-
 include/linux/mmzone.h                     |   2 +
 include/linux/page_ref.h                   |  10 +
 mm/debug.c                                 |  22 +-
 mm/gup.c                                   | 460 ++++++++++++++++-----
 mm/gup_benchmark.c                         |  71 +++-
 mm/huge_memory.c                           |  29 +-
 mm/hugetlb.c                               |  44 +-
 mm/page_alloc.c                            |   2 +
 mm/rmap.c                                  |   6 +
 mm/vmstat.c                                |   2 +
 tools/testing/selftests/vm/gup_benchmark.c |  15 +-
 tools/testing/selftests/vm/run_vmtests     |  22 +
 15 files changed, 681 insertions(+), 167 deletions(-)

-- 
2.25.0



Thread overview: 18+ messages
2020-01-29  3:24 John Hubbard [this message]
2020-01-29  3:24 ` [PATCH v2 1/8] mm: dump_page: print head page's refcount, for compound pages John Hubbard
2020-01-29 11:25   ` Kirill A. Shutemov
2020-01-29 22:26     ` John Hubbard
2020-01-29 22:59       ` Matthew Wilcox
2020-01-30  6:23         ` John Hubbard
2020-01-30  6:30           ` John Hubbard
2020-01-29  3:24 ` [PATCH v2 2/8] mm/gup: split get_user_pages_remote() into two routines John Hubbard
2020-01-29  3:24 ` [PATCH v2 3/8] mm/gup: pass a flags arg to __gup_device_* functions John Hubbard
2020-01-29  3:24 ` [PATCH v2 4/8] mm/gup: track FOLL_PIN pages John Hubbard
2020-01-29 13:51   ` Kirill A. Shutemov
2020-01-30  6:44     ` John Hubbard
2020-01-30 11:31       ` Kirill A. Shutemov
2020-01-31  3:19         ` John Hubbard
2020-01-29  3:24 ` [PATCH v2 5/8] mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages John Hubbard
2020-01-29  3:24 ` [PATCH v2 6/8] mm/gup: /proc/vmstat: pin_user_pages (FOLL_PIN) reporting John Hubbard
2020-01-29  3:24 ` [PATCH v2 7/8] mm/gup_benchmark: support pin_user_pages() and related calls John Hubbard
2020-01-29  3:24 ` [PATCH v2 8/8] selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN coverage John Hubbard
