All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v7 00/14] prohibit pinning pages in ZONE_MOVABLE
@ 2021-01-22  3:37 Pavel Tatashin
  2021-01-22  3:37 ` [PATCH v7 01/14] mm/gup: don't pin migrated cma pages in movable zone Pavel Tatashin
                   ` (13 more replies)
  0 siblings, 14 replies; 22+ messages in thread
From: Pavel Tatashin @ 2021-01-22  3:37 UTC (permalink / raw)
  To: pasha.tatashin, linux-kernel, linux-mm, akpm, vbabka, mhocko,
	david, osalvador, dan.j.williams, sashal, tyhicks,
	iamjoonsoo.kim, mike.kravetz, rostedt, mingo, jgg, peterz,
	mgorman, willy, rientjes, jhubbard, linux-doc, ira.weiny,
	linux-kselftest

Changelog
---------
v7
- Added reviewed-by's
- Fixed a compile bug on non-mmu builds reported by robot

v6
  Small update, but I wanted to send it out quicker, as it removes a
  controversial patch and replaces it with something sane.
- Removed forcing FOLL_WRITE for longterm gup, instead added a patch to
  skip zero pages during migration.
- Added reviewed-by's and minor log changes.

v5
- Added the following patches to the beginning of series, which are fixes
   to the other existing problems with CMA migration code:
	mm/gup: check every subpage of a compound page during isolation
	mm/gup: return an error on migration failure
	mm/gup: check for isolation errors also at the beginning of series
	mm/gup: do not allow zero page for pinned pages
- remove .gfp_mask/.reclaim_idx changes from mm/vmscan.c
- update movable zone header comment in patch 8 instead of patch 3, fix
  the comment
- Added acked, sign-offs
- Updated commit logs based on feedback
- Addressed issues reported by Michal and Jason.
- Remove:
	#define PINNABLE_MIGRATE_MAX	10
	#define PINNABLE_ISOLATE_MAX	100
   Instead: fail on the first migration failure, and retry isolation
   forever as their failures are transient.

- In self-set addressed some of the comments from John Hubbard, updated
  commit logs, and added comments. Renamed gup->flags with gup->test_flags.

v4
- Address page migration comments. New patch:
  mm/gup: limit number of gup migration failures, honor failures
  Implements the limiting number of retries for migration failures, and
  also check for isolation failures.
  Added a test case into gup_test to verify that pages never long-term
  pinned in a movable zone, and also added tests to fault both in kernel
  and in userland.
v3
- Merged with linux-next, which contains clean-up patch from Jason,
  therefore this series is reduced by two patches which did the same
  thing.
v2
- Addressed all review comments
- Added Reviewed-by's.
- Renamed PF_MEMALLOC_NOMOVABLE to PF_MEMALLOC_PIN
- Added is_pinnable_page() to check if page can be longterm pinned
- Fixed gup fast path by checking is_in_pinnable_zone()
- rename cma_page_list to movable_page_list
- add a admin-guide note about handling pinned pages in ZONE_MOVABLE,
  updated caveat about pinned pages from linux/mmzone.h
- Move current_gfp_context() to fast-path

---------
When page is pinned it cannot be moved and its physical address stays
the same until pages is unpinned.

This is useful functionality to allows userland to implementation DMA
access. For example, it is used by vfio in vfio_pin_pages().

However, this functionality breaks memory hotplug/hotremove assumptions
that pages in ZONE_MOVABLE can always be migrated.

This patch series fixes this issue by forcing new allocations during
page pinning to omit ZONE_MOVABLE, and also to migrate any existing
pages from ZONE_MOVABLE during pinning.

It uses the same scheme logic that is currently used by CMA, and extends
the functionality for all allocations.

For more information read the discussion [1] about this problem.
[1] https://lore.kernel.org/lkml/CA+CK2bBffHBxjmb9jmSKacm0fJMinyt3Nhk8Nx6iudcQSj80_w@mail.gmail.com

Previous versions:
v1
https://lore.kernel.org/lkml/20201202052330.474592-1-pasha.tatashin@soleen.com
v2
https://lore.kernel.org/lkml/20201210004335.64634-1-pasha.tatashin@soleen.com
v3
https://lore.kernel.org/lkml/20201211202140.396852-1-pasha.tatashin@soleen.com
v4
https://lore.kernel.org/lkml/20201217185243.3288048-1-pasha.tatashin@soleen.com
v5
https://lore.kernel.org/lkml/20210119043920.155044-1-pasha.tatashin@soleen.com
v6
https://lore.kernel.org/lkml/20210120014333.222547-1-pasha.tatashin@soleen.com


Pavel Tatashin (14):
  mm/gup: don't pin migrated cma pages in movable zone
  mm/gup: check every subpage of a compound page during isolation
  mm/gup: return an error on migration failure
  mm/gup: check for isolation errors
  mm cma: rename PF_MEMALLOC_NOCMA to PF_MEMALLOC_PIN
  mm: apply per-task gfp constraints in fast path
  mm: honor PF_MEMALLOC_PIN for all movable pages
  mm/gup: do not migrate zero page
  mm/gup: migrate pinned pages out of movable zone
  memory-hotplug.rst: add a note about ZONE_MOVABLE and page pinning
  mm/gup: change index type to long as it counts pages
  mm/gup: longterm pin migration cleanup
  selftests/vm: test flag is broken
  selftests/vm: test faulting in kernel, and verify pinnable pages

 .../admin-guide/mm/memory-hotplug.rst         |   9 +
 include/linux/migrate.h                       |   1 +
 include/linux/mm.h                            |  11 ++
 include/linux/mmzone.h                        |  13 +-
 include/linux/pgtable.h                       |   3 +-
 include/linux/sched.h                         |   2 +-
 include/linux/sched/mm.h                      |  27 +--
 include/trace/events/migrate.h                |   3 +-
 mm/gup.c                                      | 176 ++++++++----------
 mm/gup_test.c                                 |  29 +--
 mm/gup_test.h                                 |   3 +-
 mm/hugetlb.c                                  |   4 +-
 mm/page_alloc.c                               |  33 ++--
 tools/testing/selftests/vm/gup_test.c         |  36 +++-
 14 files changed, 191 insertions(+), 159 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v7 01/14] mm/gup: don't pin migrated cma pages in movable zone
  2021-01-22  3:37 [PATCH v7 00/14] prohibit pinning pages in ZONE_MOVABLE Pavel Tatashin
@ 2021-01-22  3:37 ` Pavel Tatashin
  2021-01-22  3:37 ` [PATCH v7 02/14] mm/gup: check every subpage of a compound page during isolation Pavel Tatashin
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 22+ messages in thread
From: Pavel Tatashin @ 2021-01-22  3:37 UTC (permalink / raw)
  To: pasha.tatashin, linux-kernel, linux-mm, akpm, vbabka, mhocko,
	david, osalvador, dan.j.williams, sashal, tyhicks,
	iamjoonsoo.kim, mike.kravetz, rostedt, mingo, jgg, peterz,
	mgorman, willy, rientjes, jhubbard, linux-doc, ira.weiny,
	linux-kselftest

In order not to fragment CMA the pinned pages are migrated. However,
they are migrated to ZONE_MOVABLE, which also should not have pinned pages.

Remove __GFP_MOVABLE, so pages can be migrated to zones where pinning
is allowed.

Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 mm/gup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/gup.c b/mm/gup.c
index 3e086b073624..24f25b1e9103 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1563,7 +1563,7 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
 	long ret = nr_pages;
 	struct migration_target_control mtc = {
 		.nid = NUMA_NO_NODE,
-		.gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_NOWARN,
+		.gfp_mask = GFP_USER | __GFP_NOWARN,
 	};
 
 check_again:
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v7 02/14] mm/gup: check every subpage of a compound page during isolation
  2021-01-22  3:37 [PATCH v7 00/14] prohibit pinning pages in ZONE_MOVABLE Pavel Tatashin
  2021-01-22  3:37 ` [PATCH v7 01/14] mm/gup: don't pin migrated cma pages in movable zone Pavel Tatashin
@ 2021-01-22  3:37 ` Pavel Tatashin
  2021-01-22  3:37 ` [PATCH v7 03/14] mm/gup: return an error on migration failure Pavel Tatashin
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 22+ messages in thread
From: Pavel Tatashin @ 2021-01-22  3:37 UTC (permalink / raw)
  To: pasha.tatashin, linux-kernel, linux-mm, akpm, vbabka, mhocko,
	david, osalvador, dan.j.williams, sashal, tyhicks,
	iamjoonsoo.kim, mike.kravetz, rostedt, mingo, jgg, peterz,
	mgorman, willy, rientjes, jhubbard, linux-doc, ira.weiny,
	linux-kselftest

When pages are isolated in check_and_migrate_movable_pages() we skip
compound number of pages at a time. However, as Jason noted, it is
not necessary correct that pages[i] corresponds to the pages that
we skipped. This is because it is possible that the addresses in
this range had split_huge_pmd()/split_huge_pud(), and these functions
do not update the compound page metadata.

The problem can be reproduced if something like this occurs:

1. User faulted huge pages.
2. split_huge_pmd() was called for some reason
3. User has unmapped some sub-pages in the range
4. User tries to longterm pin the addresses.

The resulting pages[i] might end-up having pages which are not compound
size page aligned.

Fixes: aa712399c1e8 ("mm/gup: speed up check_and_migrate_cma_pages() on huge page")
Reported-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
---
 mm/gup.c | 19 +++++++------------
 1 file changed, 7 insertions(+), 12 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 24f25b1e9103..16f10d5a9eb6 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1556,26 +1556,23 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
 					unsigned int gup_flags)
 {
 	unsigned long i;
-	unsigned long step;
 	bool drain_allow = true;
 	bool migrate_allow = true;
 	LIST_HEAD(cma_page_list);
 	long ret = nr_pages;
+	struct page *prev_head, *head;
 	struct migration_target_control mtc = {
 		.nid = NUMA_NO_NODE,
 		.gfp_mask = GFP_USER | __GFP_NOWARN,
 	};
 
 check_again:
-	for (i = 0; i < nr_pages;) {
-
-		struct page *head = compound_head(pages[i]);
-
-		/*
-		 * gup may start from a tail page. Advance step by the left
-		 * part.
-		 */
-		step = compound_nr(head) - (pages[i] - head);
+	prev_head = NULL;
+	for (i = 0; i < nr_pages; i++) {
+		head = compound_head(pages[i]);
+		if (head == prev_head)
+			continue;
+		prev_head = head;
 		/*
 		 * If we get a page from the CMA zone, since we are going to
 		 * be pinning these entries, we might as well move them out
@@ -1599,8 +1596,6 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
 				}
 			}
 		}
-
-		i += step;
 	}
 
 	if (!list_empty(&cma_page_list)) {
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v7 03/14] mm/gup: return an error on migration failure
  2021-01-22  3:37 [PATCH v7 00/14] prohibit pinning pages in ZONE_MOVABLE Pavel Tatashin
  2021-01-22  3:37 ` [PATCH v7 01/14] mm/gup: don't pin migrated cma pages in movable zone Pavel Tatashin
  2021-01-22  3:37 ` [PATCH v7 02/14] mm/gup: check every subpage of a compound page during isolation Pavel Tatashin
@ 2021-01-22  3:37 ` Pavel Tatashin
  2021-01-22  3:37 ` [PATCH v7 04/14] mm/gup: check for isolation errors Pavel Tatashin
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 22+ messages in thread
From: Pavel Tatashin @ 2021-01-22  3:37 UTC (permalink / raw)
  To: pasha.tatashin, linux-kernel, linux-mm, akpm, vbabka, mhocko,
	david, osalvador, dan.j.williams, sashal, tyhicks,
	iamjoonsoo.kim, mike.kravetz, rostedt, mingo, jgg, peterz,
	mgorman, willy, rientjes, jhubbard, linux-doc, ira.weiny,
	linux-kselftest

When migration failure occurs, we still pin pages, which means
that we may pin CMA movable pages which should never be the case.

Instead return an error without pinning pages when migration failure
happens.

No need to retry migrating, because migrate_pages() already retries
10 times.

Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
---
 mm/gup.c | 17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 16f10d5a9eb6..88ce41f41543 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1557,7 +1557,6 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
 {
 	unsigned long i;
 	bool drain_allow = true;
-	bool migrate_allow = true;
 	LIST_HEAD(cma_page_list);
 	long ret = nr_pages;
 	struct page *prev_head, *head;
@@ -1608,17 +1607,15 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
 			for (i = 0; i < nr_pages; i++)
 				put_page(pages[i]);
 
-		if (migrate_pages(&cma_page_list, alloc_migration_target, NULL,
-			(unsigned long)&mtc, MIGRATE_SYNC, MR_CONTIG_RANGE)) {
-			/*
-			 * some of the pages failed migration. Do get_user_pages
-			 * without migration.
-			 */
-			migrate_allow = false;
-
+		ret = migrate_pages(&cma_page_list, alloc_migration_target,
+				    NULL, (unsigned long)&mtc, MIGRATE_SYNC,
+				    MR_CONTIG_RANGE);
+		if (ret) {
 			if (!list_empty(&cma_page_list))
 				putback_movable_pages(&cma_page_list);
+			return ret > 0 ? -ENOMEM : ret;
 		}
+
 		/*
 		 * We did migrate all the pages, Try to get the page references
 		 * again migrating any new CMA pages which we failed to isolate
@@ -1628,7 +1625,7 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
 						   pages, vmas, NULL,
 						   gup_flags);
 
-		if ((ret > 0) && migrate_allow) {
+		if (ret > 0) {
 			nr_pages = ret;
 			drain_allow = true;
 			goto check_again;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v7 04/14] mm/gup: check for isolation errors
  2021-01-22  3:37 [PATCH v7 00/14] prohibit pinning pages in ZONE_MOVABLE Pavel Tatashin
                   ` (2 preceding siblings ...)
  2021-01-22  3:37 ` [PATCH v7 03/14] mm/gup: return an error on migration failure Pavel Tatashin
@ 2021-01-22  3:37 ` Pavel Tatashin
  2021-01-22  3:37 ` [PATCH v7 05/14] mm cma: rename PF_MEMALLOC_NOCMA to PF_MEMALLOC_PIN Pavel Tatashin
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 22+ messages in thread
From: Pavel Tatashin @ 2021-01-22  3:37 UTC (permalink / raw)
  To: pasha.tatashin, linux-kernel, linux-mm, akpm, vbabka, mhocko,
	david, osalvador, dan.j.williams, sashal, tyhicks,
	iamjoonsoo.kim, mike.kravetz, rostedt, mingo, jgg, peterz,
	mgorman, willy, rientjes, jhubbard, linux-doc, ira.weiny,
	linux-kselftest

It is still possible that we pin movable CMA pages if there are isolation
errors and cma_page_list stays empty when we check again.

Check for isolation errors, and return success only when there are no
isolation errors, and cma_page_list is empty after checking.

Because isolation errors are transient, we retry indefinitely.

Fixes: 9a4e9f3b2d73 ("mm: update get_user_pages_longterm to migrate pages allocated from CMA region")
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
---
 mm/gup.c | 60 ++++++++++++++++++++++++++++++++------------------------
 1 file changed, 34 insertions(+), 26 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 88ce41f41543..7ecca2d66dff 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1555,8 +1555,8 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
 					struct vm_area_struct **vmas,
 					unsigned int gup_flags)
 {
-	unsigned long i;
-	bool drain_allow = true;
+	unsigned long i, isolation_error_count;
+	bool drain_allow;
 	LIST_HEAD(cma_page_list);
 	long ret = nr_pages;
 	struct page *prev_head, *head;
@@ -1567,6 +1567,8 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
 
 check_again:
 	prev_head = NULL;
+	isolation_error_count = 0;
+	drain_allow = true;
 	for (i = 0; i < nr_pages; i++) {
 		head = compound_head(pages[i]);
 		if (head == prev_head)
@@ -1578,25 +1580,35 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
 		 * of the CMA zone if possible.
 		 */
 		if (is_migrate_cma_page(head)) {
-			if (PageHuge(head))
-				isolate_huge_page(head, &cma_page_list);
-			else {
+			if (PageHuge(head)) {
+				if (!isolate_huge_page(head, &cma_page_list))
+					isolation_error_count++;
+			} else {
 				if (!PageLRU(head) && drain_allow) {
 					lru_add_drain_all();
 					drain_allow = false;
 				}
 
-				if (!isolate_lru_page(head)) {
-					list_add_tail(&head->lru, &cma_page_list);
-					mod_node_page_state(page_pgdat(head),
-							    NR_ISOLATED_ANON +
-							    page_is_file_lru(head),
-							    thp_nr_pages(head));
+				if (isolate_lru_page(head)) {
+					isolation_error_count++;
+					continue;
 				}
+				list_add_tail(&head->lru, &cma_page_list);
+				mod_node_page_state(page_pgdat(head),
+						    NR_ISOLATED_ANON +
+						    page_is_file_lru(head),
+						    thp_nr_pages(head));
 			}
 		}
 	}
 
+	/*
+	 * If list is empty, and no isolation errors, means that all pages are
+	 * in the correct zone.
+	 */
+	if (list_empty(&cma_page_list) && !isolation_error_count)
+		return ret;
+
 	if (!list_empty(&cma_page_list)) {
 		/*
 		 * drop the above get_user_pages reference.
@@ -1616,23 +1628,19 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
 			return ret > 0 ? -ENOMEM : ret;
 		}
 
-		/*
-		 * We did migrate all the pages, Try to get the page references
-		 * again migrating any new CMA pages which we failed to isolate
-		 * earlier.
-		 */
-		ret = __get_user_pages_locked(mm, start, nr_pages,
-						   pages, vmas, NULL,
-						   gup_flags);
-
-		if (ret > 0) {
-			nr_pages = ret;
-			drain_allow = true;
-			goto check_again;
-		}
+		/* We unpinned pages before migration, pin them again */
+		ret = __get_user_pages_locked(mm, start, nr_pages, pages, vmas,
+					      NULL, gup_flags);
+		if (ret <= 0)
+			return ret;
+		nr_pages = ret;
 	}
 
-	return ret;
+	/*
+	 * check again because pages were unpinned, and we also might have
+	 * had isolation errors and need more pages to migrate.
+	 */
+	goto check_again;
 }
 #else
 static long check_and_migrate_cma_pages(struct mm_struct *mm,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v7 05/14] mm cma: rename PF_MEMALLOC_NOCMA to PF_MEMALLOC_PIN
  2021-01-22  3:37 [PATCH v7 00/14] prohibit pinning pages in ZONE_MOVABLE Pavel Tatashin
                   ` (3 preceding siblings ...)
  2021-01-22  3:37 ` [PATCH v7 04/14] mm/gup: check for isolation errors Pavel Tatashin
@ 2021-01-22  3:37 ` Pavel Tatashin
  2021-01-22  3:37 ` [PATCH v7 06/14] mm: apply per-task gfp constraints in fast path Pavel Tatashin
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 22+ messages in thread
From: Pavel Tatashin @ 2021-01-22  3:37 UTC (permalink / raw)
  To: pasha.tatashin, linux-kernel, linux-mm, akpm, vbabka, mhocko,
	david, osalvador, dan.j.williams, sashal, tyhicks,
	iamjoonsoo.kim, mike.kravetz, rostedt, mingo, jgg, peterz,
	mgorman, willy, rientjes, jhubbard, linux-doc, ira.weiny,
	linux-kselftest

PF_MEMALLOC_NOCMA is used ot guarantee that the allocator will not return
pages that might belong to CMA region. This is currently used for long
term gup to make sure that such pins are not going to be done on any CMA
pages.

When PF_MEMALLOC_NOCMA has been introduced we haven't realized that it is
focusing on CMA pages too much and that there is larger class of pages that
need the same treatment. MOVABLE zone cannot contain any long term pins as
well so it makes sense to reuse and redefine this flag for that usecase as
well. Rename the flag to PF_MEMALLOC_PIN which defines an allocation
context which can only get pages suitable for long-term pins.

Also re-name:
memalloc_nocma_save()/memalloc_nocma_restore
to
memalloc_pin_save()/memalloc_pin_restore()
and make the new functions common.

Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/sched.h    |  2 +-
 include/linux/sched/mm.h | 21 +++++----------------
 mm/gup.c                 |  4 ++--
 mm/hugetlb.c             |  4 ++--
 mm/page_alloc.c          |  4 ++--
 5 files changed, 12 insertions(+), 23 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5e088c1bf282..43c4efa4f575 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1576,7 +1576,7 @@ extern struct pid *cad_pid;
 #define PF_SWAPWRITE		0x00800000	/* Allowed to write to swap */
 #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_mask */
 #define PF_MCE_EARLY		0x08000000      /* Early kill for mce process policy */
-#define PF_MEMALLOC_NOCMA	0x10000000	/* All allocation request will have _GFP_MOVABLE cleared */
+#define PF_MEMALLOC_PIN		0x10000000	/* Allocation context constrained to zones which allow long term pinning. */
 #define PF_FREEZER_SKIP		0x40000000	/* Freezer should not count it as freezable */
 #define PF_SUSPEND_TASK		0x80000000      /* This thread called freeze_processes() and should not be frozen */
 
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 1ae08b8462a4..5f4dd3274734 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -270,29 +270,18 @@ static inline void memalloc_noreclaim_restore(unsigned int flags)
 	current->flags = (current->flags & ~PF_MEMALLOC) | flags;
 }
 
-#ifdef CONFIG_CMA
-static inline unsigned int memalloc_nocma_save(void)
+static inline unsigned int memalloc_pin_save(void)
 {
-	unsigned int flags = current->flags & PF_MEMALLOC_NOCMA;
+	unsigned int flags = current->flags & PF_MEMALLOC_PIN;
 
-	current->flags |= PF_MEMALLOC_NOCMA;
+	current->flags |= PF_MEMALLOC_PIN;
 	return flags;
 }
 
-static inline void memalloc_nocma_restore(unsigned int flags)
+static inline void memalloc_pin_restore(unsigned int flags)
 {
-	current->flags = (current->flags & ~PF_MEMALLOC_NOCMA) | flags;
+	current->flags = (current->flags & ~PF_MEMALLOC_PIN) | flags;
 }
-#else
-static inline unsigned int memalloc_nocma_save(void)
-{
-	return 0;
-}
-
-static inline void memalloc_nocma_restore(unsigned int flags)
-{
-}
-#endif
 
 #ifdef CONFIG_MEMCG
 DECLARE_PER_CPU(struct mem_cgroup *, int_active_memcg);
diff --git a/mm/gup.c b/mm/gup.c
index 7ecca2d66dff..857b273e32ac 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1669,7 +1669,7 @@ static long __gup_longterm_locked(struct mm_struct *mm,
 	long rc;
 
 	if (gup_flags & FOLL_LONGTERM)
-		flags = memalloc_nocma_save();
+		flags = memalloc_pin_save();
 
 	rc = __get_user_pages_locked(mm, start, nr_pages, pages, vmas, NULL,
 				     gup_flags);
@@ -1678,7 +1678,7 @@ static long __gup_longterm_locked(struct mm_struct *mm,
 		if (rc > 0)
 			rc = check_and_migrate_cma_pages(mm, start, rc, pages,
 							 vmas, gup_flags);
-		memalloc_nocma_restore(flags);
+		memalloc_pin_restore(flags);
 	}
 	return rc;
 }
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a6bad1f686c5..2d79e515a7a3 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1049,10 +1049,10 @@ static void enqueue_huge_page(struct hstate *h, struct page *page)
 static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
 {
 	struct page *page;
-	bool nocma = !!(current->flags & PF_MEMALLOC_NOCMA);
+	bool pin = !!(current->flags & PF_MEMALLOC_PIN);
 
 	list_for_each_entry(page, &h->hugepage_freelists[nid], lru) {
-		if (nocma && is_migrate_cma_page(page))
+		if (pin && is_migrate_cma_page(page))
 			continue;
 
 		if (PageHWPoison(page))
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b031a5ae0bd5..f92d7c810953 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3813,8 +3813,8 @@ static inline unsigned int current_alloc_flags(gfp_t gfp_mask,
 #ifdef CONFIG_CMA
 	unsigned int pflags = current->flags;
 
-	if (!(pflags & PF_MEMALLOC_NOCMA) &&
-			gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
+	if (!(pflags & PF_MEMALLOC_PIN) &&
+	    gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
 		alloc_flags |= ALLOC_CMA;
 
 #endif
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v7 06/14] mm: apply per-task gfp constraints in fast path
  2021-01-22  3:37 [PATCH v7 00/14] prohibit pinning pages in ZONE_MOVABLE Pavel Tatashin
                   ` (4 preceding siblings ...)
  2021-01-22  3:37 ` [PATCH v7 05/14] mm cma: rename PF_MEMALLOC_NOCMA to PF_MEMALLOC_PIN Pavel Tatashin
@ 2021-01-22  3:37 ` Pavel Tatashin
  2021-01-22  3:37 ` [PATCH v7 07/14] mm: honor PF_MEMALLOC_PIN for all movable pages Pavel Tatashin
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 22+ messages in thread
From: Pavel Tatashin @ 2021-01-22  3:37 UTC (permalink / raw)
  To: pasha.tatashin, linux-kernel, linux-mm, akpm, vbabka, mhocko,
	david, osalvador, dan.j.williams, sashal, tyhicks,
	iamjoonsoo.kim, mike.kravetz, rostedt, mingo, jgg, peterz,
	mgorman, willy, rientjes, jhubbard, linux-doc, ira.weiny,
	linux-kselftest

Function current_gfp_context() is called after fast path. However, soon we
will add more constraints which will also limit zones based on context.
Move this call into fast path, and apply the correct constraints for all
allocations.

Also update .reclaim_idx based on value returned by current_gfp_context()
because it soon will modify the allowed zones.

Note:
With this patch we will do one extra current->flags load during fast path,
but we already load current->flags in fast-path:

__alloc_pages_nodemask()
 prepare_alloc_pages()
  current_alloc_flags(gfp_mask, *alloc_flags);

Later, when we add the zone constrain logic to current_gfp_context() we
will be able to remove current->flags load from current_alloc_flags, and
therefore return fast-path to the current performance level.

Suggested-by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 mm/page_alloc.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f92d7c810953..c93e801a45e9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4981,6 +4981,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
 	}
 
 	gfp_mask &= gfp_allowed_mask;
+	/*
+	 * Apply scoped allocation constraints. This is mainly about GFP_NOFS
+	 * resp. GFP_NOIO which has to be inherited for all allocation requests
+	 * from a particular context which has been marked by
+	 * memalloc_no{fs,io}_{save,restore}.
+	 */
+	gfp_mask = current_gfp_context(gfp_mask);
 	alloc_mask = gfp_mask;
 	if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))
 		return NULL;
@@ -4996,13 +5003,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
 	if (likely(page))
 		goto out;
 
-	/*
-	 * Apply scoped allocation constraints. This is mainly about GFP_NOFS
-	 * resp. GFP_NOIO which has to be inherited for all allocation requests
-	 * from a particular context which has been marked by
-	 * memalloc_no{fs,io}_{save,restore}.
-	 */
-	alloc_mask = current_gfp_context(gfp_mask);
+	alloc_mask = gfp_mask;
 	ac.spread_dirty_pages = false;
 
 	/*
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v7 07/14] mm: honor PF_MEMALLOC_PIN for all movable pages
  2021-01-22  3:37 [PATCH v7 00/14] prohibit pinning pages in ZONE_MOVABLE Pavel Tatashin
                   ` (5 preceding siblings ...)
  2021-01-22  3:37 ` [PATCH v7 06/14] mm: apply per-task gfp constraints in fast path Pavel Tatashin
@ 2021-01-22  3:37 ` Pavel Tatashin
  2021-01-22  3:37 ` [PATCH v7 08/14] mm/gup: do not migrate zero page Pavel Tatashin
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 22+ messages in thread
From: Pavel Tatashin @ 2021-01-22  3:37 UTC (permalink / raw)
  To: pasha.tatashin, linux-kernel, linux-mm, akpm, vbabka, mhocko,
	david, osalvador, dan.j.williams, sashal, tyhicks,
	iamjoonsoo.kim, mike.kravetz, rostedt, mingo, jgg, peterz,
	mgorman, willy, rientjes, jhubbard, linux-doc, ira.weiny,
	linux-kselftest

PF_MEMALLOC_PIN is only honored for CMA pages, extend
this flag to work for any allocations from ZONE_MOVABLE by removing
__GFP_MOVABLE from gfp_mask when this flag is passed in the current
context.

Add is_pinnable_page() to return true if page is in a pinnable page.
A pinnable page is not in ZONE_MOVABLE and not of MIGRATE_CMA type.

Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mm.h       | 11 +++++++++++
 include/linux/sched/mm.h |  6 +++++-
 mm/hugetlb.c             |  2 +-
 mm/page_alloc.c          | 20 +++++++++-----------
 4 files changed, 26 insertions(+), 13 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a5d618d08506..0990a76d5e6f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1117,6 +1117,17 @@ static inline bool is_zone_device_page(const struct page *page)
 }
 #endif
 
+static inline bool is_zone_movable_page(const struct page *page)
+{
+	return page_zonenum(page) == ZONE_MOVABLE;
+}
+
+/* MIGRATE_CMA and ZONE_MOVABLE do not allow pin pages */
+static inline bool is_pinnable_page(struct page *page)
+{
+	return !is_zone_movable_page(page) && !is_migrate_cma_page(page);
+}
+
 #ifdef CONFIG_DEV_PAGEMAP_OPS
 void free_devmap_managed_page(struct page *page);
 DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 5f4dd3274734..a55277b0d475 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -150,12 +150,13 @@ static inline bool in_vfork(struct task_struct *tsk)
  * Applies per-task gfp context to the given allocation flags.
  * PF_MEMALLOC_NOIO implies GFP_NOIO
  * PF_MEMALLOC_NOFS implies GFP_NOFS
+ * PF_MEMALLOC_PIN  implies !GFP_MOVABLE
  */
 static inline gfp_t current_gfp_context(gfp_t flags)
 {
 	unsigned int pflags = READ_ONCE(current->flags);
 
-	if (unlikely(pflags & (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS))) {
+	if (unlikely(pflags & (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS | PF_MEMALLOC_PIN))) {
 		/*
 		 * NOIO implies both NOIO and NOFS and it is a weaker context
 		 * so always make sure it makes precedence
@@ -164,6 +165,9 @@ static inline gfp_t current_gfp_context(gfp_t flags)
 			flags &= ~(__GFP_IO | __GFP_FS);
 		else if (pflags & PF_MEMALLOC_NOFS)
 			flags &= ~__GFP_FS;
+
+		if (pflags & PF_MEMALLOC_PIN)
+			flags &= ~__GFP_MOVABLE;
 	}
 	return flags;
 }
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2d79e515a7a3..77098492a2fd 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1052,7 +1052,7 @@ static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
 	bool pin = !!(current->flags & PF_MEMALLOC_PIN);
 
 	list_for_each_entry(page, &h->hugepage_freelists[nid], lru) {
-		if (pin && is_migrate_cma_page(page))
+		if (pin && !is_pinnable_page(page))
 			continue;
 
 		if (PageHWPoison(page))
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c93e801a45e9..3f17c73ad582 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3807,16 +3807,13 @@ alloc_flags_nofragment(struct zone *zone, gfp_t gfp_mask)
 	return alloc_flags;
 }
 
-static inline unsigned int current_alloc_flags(gfp_t gfp_mask,
-					unsigned int alloc_flags)
+/* Must be called after current_gfp_context() which can change gfp_mask */
+static inline unsigned int gpf_to_alloc_flags(gfp_t gfp_mask,
+					      unsigned int alloc_flags)
 {
 #ifdef CONFIG_CMA
-	unsigned int pflags = current->flags;
-
-	if (!(pflags & PF_MEMALLOC_PIN) &&
-	    gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
+	if (gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
 		alloc_flags |= ALLOC_CMA;
-
 #endif
 	return alloc_flags;
 }
@@ -4472,7 +4469,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	} else if (unlikely(rt_task(current)) && !in_interrupt())
 		alloc_flags |= ALLOC_HARDER;
 
-	alloc_flags = current_alloc_flags(gfp_mask, alloc_flags);
+	alloc_flags = gpf_to_alloc_flags(gfp_mask, alloc_flags);
 
 	return alloc_flags;
 }
@@ -4774,7 +4771,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 
 	reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
 	if (reserve_flags)
-		alloc_flags = current_alloc_flags(gfp_mask, reserve_flags);
+		alloc_flags = gpf_to_alloc_flags(gfp_mask, reserve_flags);
 
 	/*
 	 * Reset the nodemask and zonelist iterators if memory policies can be
@@ -4943,7 +4940,7 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
 	if (should_fail_alloc_page(gfp_mask, order))
 		return false;
 
-	*alloc_flags = current_alloc_flags(gfp_mask, *alloc_flags);
+	*alloc_flags = gpf_to_alloc_flags(gfp_mask, *alloc_flags);
 
 	/* Dirty zone balancing only done in the fast path */
 	ac->spread_dirty_pages = (gfp_mask & __GFP_WRITE);
@@ -4985,7 +4982,8 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
 	 * Apply scoped allocation constraints. This is mainly about GFP_NOFS
 	 * resp. GFP_NOIO which has to be inherited for all allocation requests
 	 * from a particular context which has been marked by
-	 * memalloc_no{fs,io}_{save,restore}.
+	 * memalloc_no{fs,io}_{save,restore}. And PF_MEMALLOC_PIN which ensures
+	 * movable zones are not used during allocation.
 	 */
 	gfp_mask = current_gfp_context(gfp_mask);
 	alloc_mask = gfp_mask;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v7 08/14] mm/gup: do not migrate zero page
  2021-01-22  3:37 [PATCH v7 00/14] prohibit pinning pages in ZONE_MOVABLE Pavel Tatashin
                   ` (6 preceding siblings ...)
  2021-01-22  3:37 ` [PATCH v7 07/14] mm: honor PF_MEMALLOC_PIN for all movable pages Pavel Tatashin
@ 2021-01-22  3:37 ` Pavel Tatashin
  2021-01-22  3:37 ` [PATCH v7 09/14] mm/gup: migrate pinned pages out of movable zone Pavel Tatashin
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 22+ messages in thread
From: Pavel Tatashin @ 2021-01-22  3:37 UTC (permalink / raw)
  To: pasha.tatashin, linux-kernel, linux-mm, akpm, vbabka, mhocko,
	david, osalvador, dan.j.williams, sashal, tyhicks,
	iamjoonsoo.kim, mike.kravetz, rostedt, mingo, jgg, peterz,
	mgorman, willy, rientjes, jhubbard, linux-doc, ira.weiny,
	linux-kselftest

On some platforms ZERO_PAGE(0) might end-up in a movable zone. Do not
migrate zero page in gup during longterm pinning as migration of zero page
is not allowed.

For example, in x86 QEMU with 16G of memory and kernelcore=5G parameter, I
see the following:

Boot#1: zero_pfn  0x48a8d zero_pfn zone: ZONE_DMA32
Boot#2: zero_pfn 0x20168d zero_pfn zone: ZONE_MOVABLE

On x86, empty_zero_page is declared in .bss and depending on the loader
may end up in different physical locations during boots.

Also, move is_zero_pfn() my_zero_pfn() functions under CONFIG_MMU, because
zero_pfn that they are using is declared in memory.c which is compiled
with CONFIG_MMU.

Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
---
 include/linux/mmzone.h  | 4 ++++
 include/linux/pgtable.h | 3 +--
 mm/gup.c                | 2 ++
 3 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ae588b2f87ef..72b0b6eba854 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -427,6 +427,10 @@ enum zone_type {
 	 *    techniques might use alloc_contig_range() to hide previously
 	 *    exposed pages from the buddy again (e.g., to implement some sort
 	 *    of memory unplug in virtio-mem).
+	 * 6. ZERO_PAGE(0), kernelcore/movablecore setups might create
+	 *    situations where ZERO_PAGE(0) which is allocated differently
+	 *    on different platforms may end up in a movable zone. ZERO_PAGE(0)
+	 *    cannot be migrated.
 	 *
 	 * In general, no unmovable allocations that degrade memory offlining
 	 * should end up in ZONE_MOVABLE. Allocators (like alloc_contig_range())
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index ea5c4102c23e..54a740602618 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1118,6 +1118,7 @@ extern void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
 extern void untrack_pfn_moved(struct vm_area_struct *vma);
 #endif
 
+#ifdef CONFIG_MMU
 #ifdef __HAVE_COLOR_ZERO_PAGE
 static inline int is_zero_pfn(unsigned long pfn)
 {
@@ -1142,8 +1143,6 @@ static inline unsigned long my_zero_pfn(unsigned long addr)
 }
 #endif
 
-#ifdef CONFIG_MMU
-
 #ifndef CONFIG_TRANSPARENT_HUGEPAGE
 static inline int pmd_trans_huge(pmd_t pmd)
 {
diff --git a/mm/gup.c b/mm/gup.c
index 857b273e32ac..fdd5cda30a07 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1580,6 +1580,8 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
 		 * of the CMA zone if possible.
 		 */
 		if (is_migrate_cma_page(head)) {
+			if (is_zero_pfn(page_to_pfn(head)))
+				continue;
 			if (PageHuge(head)) {
 				if (!isolate_huge_page(head, &cma_page_list))
 					isolation_error_count++;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v7 09/14] mm/gup: migrate pinned pages out of movable zone
  2021-01-22  3:37 [PATCH v7 00/14] prohibit pinning pages in ZONE_MOVABLE Pavel Tatashin
                   ` (7 preceding siblings ...)
  2021-01-22  3:37 ` [PATCH v7 08/14] mm/gup: do not migrate zero page Pavel Tatashin
@ 2021-01-22  3:37 ` Pavel Tatashin
  2021-01-22  3:37 ` [PATCH v7 10/14] memory-hotplug.rst: add a note about ZONE_MOVABLE and page pinning Pavel Tatashin
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 22+ messages in thread
From: Pavel Tatashin @ 2021-01-22  3:37 UTC (permalink / raw)
  To: pasha.tatashin, linux-kernel, linux-mm, akpm, vbabka, mhocko,
	david, osalvador, dan.j.williams, sashal, tyhicks,
	iamjoonsoo.kim, mike.kravetz, rostedt, mingo, jgg, peterz,
	mgorman, willy, rientjes, jhubbard, linux-doc, ira.weiny,
	linux-kselftest

We should not pin pages in ZONE_MOVABLE. Currently, we do not pin only
movable CMA pages. Generalize the function that migrates CMA pages to
migrate all movable pages. Use is_pinnable_page() to check which
pages need to be migrated

Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/migrate.h        |  1 +
 include/linux/mmzone.h         |  9 ++++-
 include/trace/events/migrate.h |  3 +-
 mm/gup.c                       | 67 +++++++++++++++++-----------------
 4 files changed, 44 insertions(+), 36 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 3a389633b68f..fdf65f23acec 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -27,6 +27,7 @@ enum migrate_reason {
 	MR_MEMPOLICY_MBIND,
 	MR_NUMA_MISPLACED,
 	MR_CONTIG_RANGE,
+	MR_LONGTERM_PIN,
 	MR_TYPES
 };
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 72b0b6eba854..300eb31439cb 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -407,8 +407,13 @@ enum zone_type {
 	 * to increase the number of THP/huge pages. Notable special cases are:
 	 *
 	 * 1. Pinned pages: (long-term) pinning of movable pages might
-	 *    essentially turn such pages unmovable. Memory offlining might
-	 *    retry a long time.
+	 *    essentially turn such pages unmovable. Therefore, we do not allow
+	 *    pinning long-term pages in ZONE_MOVABLE. When pages are pinned and
+	 *    faulted, they come from the right zone right away. However, it is
+	 *    still possible that address space already has pages in
+	 *    ZONE_MOVABLE at the time when pages are pinned (i.e. user has
+	 *    touches that memory before pinning). In such case we migrate them
+	 *    to a different zone. When migration fails - pinning fails.
 	 * 2. memblock allocations: kernelcore/movablecore setups might create
 	 *    situations where ZONE_MOVABLE contains unmovable allocations
 	 *    after boot. Memory offlining and allocations fail early.
diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
index 4d434398d64d..363b54ce104c 100644
--- a/include/trace/events/migrate.h
+++ b/include/trace/events/migrate.h
@@ -20,7 +20,8 @@
 	EM( MR_SYSCALL,		"syscall_or_cpuset")		\
 	EM( MR_MEMPOLICY_MBIND,	"mempolicy_mbind")		\
 	EM( MR_NUMA_MISPLACED,	"numa_misplaced")		\
-	EMe(MR_CONTIG_RANGE,	"contig_range")
+	EM( MR_CONTIG_RANGE,	"contig_range")			\
+	EMe(MR_LONGTERM_PIN,	"longterm_pin")
 
 /*
  * First define the enums in the above macros to be exported to userspace
diff --git a/mm/gup.c b/mm/gup.c
index fdd5cda30a07..0422dd150e80 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -89,11 +89,12 @@ static __maybe_unused struct page *try_grab_compound_head(struct page *page,
 		int orig_refs = refs;
 
 		/*
-		 * Can't do FOLL_LONGTERM + FOLL_PIN with CMA in the gup fast
-		 * path, so fail and let the caller fall back to the slow path.
+		 * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
+		 * right zone, so fail and let the caller fall back to the slow
+		 * path.
 		 */
-		if (unlikely(flags & FOLL_LONGTERM) &&
-				is_migrate_cma_page(page))
+		if (unlikely((flags & FOLL_LONGTERM) &&
+			     !is_pinnable_page(page)))
 			return NULL;
 
 		/*
@@ -1547,17 +1548,17 @@ struct page *get_dump_page(unsigned long addr)
 }
 #endif /* CONFIG_ELF_CORE */
 
-#ifdef CONFIG_CMA
-static long check_and_migrate_cma_pages(struct mm_struct *mm,
-					unsigned long start,
-					unsigned long nr_pages,
-					struct page **pages,
-					struct vm_area_struct **vmas,
-					unsigned int gup_flags)
+#ifdef CONFIG_MIGRATION
+static long check_and_migrate_movable_pages(struct mm_struct *mm,
+					    unsigned long start,
+					    unsigned long nr_pages,
+					    struct page **pages,
+					    struct vm_area_struct **vmas,
+					    unsigned int gup_flags)
 {
 	unsigned long i, isolation_error_count;
 	bool drain_allow;
-	LIST_HEAD(cma_page_list);
+	LIST_HEAD(movable_page_list);
 	long ret = nr_pages;
 	struct page *prev_head, *head;
 	struct migration_target_control mtc = {
@@ -1575,15 +1576,14 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
 			continue;
 		prev_head = head;
 		/*
-		 * If we get a page from the CMA zone, since we are going to
-		 * be pinning these entries, we might as well move them out
-		 * of the CMA zone if possible.
+		 * If we get a movable page, since we are going to be pinning
+		 * these entries, try to move them out if possible.
 		 */
-		if (is_migrate_cma_page(head)) {
+		if (!is_pinnable_page(head)) {
 			if (is_zero_pfn(page_to_pfn(head)))
 				continue;
 			if (PageHuge(head)) {
-				if (!isolate_huge_page(head, &cma_page_list))
+				if (!isolate_huge_page(head, &movable_page_list))
 					isolation_error_count++;
 			} else {
 				if (!PageLRU(head) && drain_allow) {
@@ -1595,7 +1595,7 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
 					isolation_error_count++;
 					continue;
 				}
-				list_add_tail(&head->lru, &cma_page_list);
+				list_add_tail(&head->lru, &movable_page_list);
 				mod_node_page_state(page_pgdat(head),
 						    NR_ISOLATED_ANON +
 						    page_is_file_lru(head),
@@ -1608,10 +1608,10 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
 	 * If list is empty, and no isolation errors, means that all pages are
 	 * in the correct zone.
 	 */
-	if (list_empty(&cma_page_list) && !isolation_error_count)
+	if (list_empty(&movable_page_list) && !isolation_error_count)
 		return ret;
 
-	if (!list_empty(&cma_page_list)) {
+	if (!list_empty(&movable_page_list)) {
 		/*
 		 * drop the above get_user_pages reference.
 		 */
@@ -1621,12 +1621,12 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
 			for (i = 0; i < nr_pages; i++)
 				put_page(pages[i]);
 
-		ret = migrate_pages(&cma_page_list, alloc_migration_target,
+		ret = migrate_pages(&movable_page_list, alloc_migration_target,
 				    NULL, (unsigned long)&mtc, MIGRATE_SYNC,
-				    MR_CONTIG_RANGE);
+				    MR_LONGTERM_PIN);
 		if (ret) {
-			if (!list_empty(&cma_page_list))
-				putback_movable_pages(&cma_page_list);
+			if (!list_empty(&movable_page_list))
+				putback_movable_pages(&movable_page_list);
 			return ret > 0 ? -ENOMEM : ret;
 		}
 
@@ -1645,16 +1645,16 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
 	goto check_again;
 }
 #else
-static long check_and_migrate_cma_pages(struct mm_struct *mm,
-					unsigned long start,
-					unsigned long nr_pages,
-					struct page **pages,
-					struct vm_area_struct **vmas,
-					unsigned int gup_flags)
+static long check_and_migrate_movable_pages(struct mm_struct *mm,
+					    unsigned long start,
+					    unsigned long nr_pages,
+					    struct page **pages,
+					    struct vm_area_struct **vmas,
+					    unsigned int gup_flags)
 {
 	return nr_pages;
 }
-#endif /* CONFIG_CMA */
+#endif /* CONFIG_MIGRATION */
 
 /*
  * __gup_longterm_locked() is a wrapper for __get_user_pages_locked which
@@ -1678,8 +1678,9 @@ static long __gup_longterm_locked(struct mm_struct *mm,
 
 	if (gup_flags & FOLL_LONGTERM) {
 		if (rc > 0)
-			rc = check_and_migrate_cma_pages(mm, start, rc, pages,
-							 vmas, gup_flags);
+			rc = check_and_migrate_movable_pages(mm, start, rc,
+							     pages, vmas,
+							     gup_flags);
 		memalloc_pin_restore(flags);
 	}
 	return rc;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v7 10/14] memory-hotplug.rst: add a note about ZONE_MOVABLE and page pinning
  2021-01-22  3:37 [PATCH v7 00/14] prohibit pinning pages in ZONE_MOVABLE Pavel Tatashin
                   ` (8 preceding siblings ...)
  2021-01-22  3:37 ` [PATCH v7 09/14] mm/gup: migrate pinned pages out of movable zone Pavel Tatashin
@ 2021-01-22  3:37 ` Pavel Tatashin
  2021-01-22  3:37 ` [PATCH v7 11/14] mm/gup: change index type to long as it counts pages Pavel Tatashin
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 22+ messages in thread
From: Pavel Tatashin @ 2021-01-22  3:37 UTC (permalink / raw)
  To: pasha.tatashin, linux-kernel, linux-mm, akpm, vbabka, mhocko,
	david, osalvador, dan.j.williams, sashal, tyhicks,
	iamjoonsoo.kim, mike.kravetz, rostedt, mingo, jgg, peterz,
	mgorman, willy, rientjes, jhubbard, linux-doc, ira.weiny,
	linux-kselftest

Document the special handling of page pinning when ZONE_MOVABLE present.

Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 Documentation/admin-guide/mm/memory-hotplug.rst | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/Documentation/admin-guide/mm/memory-hotplug.rst b/Documentation/admin-guide/mm/memory-hotplug.rst
index 5c4432c96c4b..c6618f99f765 100644
--- a/Documentation/admin-guide/mm/memory-hotplug.rst
+++ b/Documentation/admin-guide/mm/memory-hotplug.rst
@@ -357,6 +357,15 @@ creates ZONE_MOVABLE as following.
    Unfortunately, there is no information to show which memory block belongs
    to ZONE_MOVABLE. This is TBD.
 
+.. note::
+   Techniques that rely on long-term pinnings of memory (especially, RDMA and
+   vfio) are fundamentally problematic with ZONE_MOVABLE and, therefore, memory
+   hot remove. Pinned pages cannot reside on ZONE_MOVABLE, to guarantee that
+   memory can still get hot removed - be aware that pinning can fail even if
+   there is plenty of free memory in ZONE_MOVABLE. In addition, using
+   ZONE_MOVABLE might make page pinning more expensive, because pages have to be
+   migrated off that zone first.
+
 .. _memory_hotplug_how_to_offline_memory:
 
 How to offline memory
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v7 11/14] mm/gup: change index type to long as it counts pages
  2021-01-22  3:37 [PATCH v7 00/14] prohibit pinning pages in ZONE_MOVABLE Pavel Tatashin
                   ` (9 preceding siblings ...)
  2021-01-22  3:37 ` [PATCH v7 10/14] memory-hotplug.rst: add a note about ZONE_MOVABLE and page pinning Pavel Tatashin
@ 2021-01-22  3:37 ` Pavel Tatashin
  2021-01-22  3:37 ` [PATCH v7 12/14] mm/gup: longterm pin migration cleanup Pavel Tatashin
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 22+ messages in thread
From: Pavel Tatashin @ 2021-01-22  3:37 UTC (permalink / raw)
  To: pasha.tatashin, linux-kernel, linux-mm, akpm, vbabka, mhocko,
	david, osalvador, dan.j.williams, sashal, tyhicks,
	iamjoonsoo.kim, mike.kravetz, rostedt, mingo, jgg, peterz,
	mgorman, willy, rientjes, jhubbard, linux-doc, ira.weiny,
	linux-kselftest

In __get_user_pages_locked() i counts number of pages which should be
long, as long is used in all other places to contain number of pages, and
32-bit becomes increasingly small for handling page count proportional
values.

Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 mm/gup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/gup.c b/mm/gup.c
index 0422dd150e80..ddbbee741d5d 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1479,7 +1479,7 @@ static long __get_user_pages_locked(struct mm_struct *mm, unsigned long start,
 {
 	struct vm_area_struct *vma;
 	unsigned long vm_flags;
-	int i;
+	long i;
 
 	/* calculate required read or write permissions.
 	 * If FOLL_FORCE is set, we only require the "MAY" flags.
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v7 12/14] mm/gup: longterm pin migration cleanup
  2021-01-22  3:37 [PATCH v7 00/14] prohibit pinning pages in ZONE_MOVABLE Pavel Tatashin
                   ` (10 preceding siblings ...)
  2021-01-22  3:37 ` [PATCH v7 11/14] mm/gup: change index type to long as it counts pages Pavel Tatashin
@ 2021-01-22  3:37 ` Pavel Tatashin
  2021-01-22  3:37 ` [PATCH v7 13/14] selftests/vm: test flag is broken Pavel Tatashin
  2021-01-22  3:37 ` [PATCH v7 14/14] selftests/vm: test faulting in kernel, and verify pinnable pages Pavel Tatashin
  13 siblings, 0 replies; 22+ messages in thread
From: Pavel Tatashin @ 2021-01-22  3:37 UTC (permalink / raw)
  To: pasha.tatashin, linux-kernel, linux-mm, akpm, vbabka, mhocko,
	david, osalvador, dan.j.williams, sashal, tyhicks,
	iamjoonsoo.kim, mike.kravetz, rostedt, mingo, jgg, peterz,
	mgorman, willy, rientjes, jhubbard, linux-doc, ira.weiny,
	linux-kselftest

When pages are longterm pinned, we must migrated them out of movable zone.
The function that migrates them has a hidden loop with goto. The loop is
to retry on isolation failures, and after successful migration.

Make this code better by moving this loop to the caller.

Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
---
 mm/gup.c | 93 ++++++++++++++++++++++----------------------------------
 1 file changed, 37 insertions(+), 56 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index ddbbee741d5d..4a5eed17f451 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1549,27 +1549,28 @@ struct page *get_dump_page(unsigned long addr)
 #endif /* CONFIG_ELF_CORE */
 
 #ifdef CONFIG_MIGRATION
-static long check_and_migrate_movable_pages(struct mm_struct *mm,
-					    unsigned long start,
-					    unsigned long nr_pages,
+/*
+ * Check whether all pages are pinnable, if so return number of pages.  If some
+ * pages are not pinnable, migrate them, and unpin all pages. Return zero if
+ * pages were migrated, or if some pages were not successfully isolated.
+ * Return negative error if migration fails.
+ */
+static long check_and_migrate_movable_pages(unsigned long nr_pages,
 					    struct page **pages,
-					    struct vm_area_struct **vmas,
 					    unsigned int gup_flags)
 {
-	unsigned long i, isolation_error_count;
-	bool drain_allow;
+	unsigned long i;
+	unsigned long isolation_error_count = 0;
+	bool drain_allow = true;
 	LIST_HEAD(movable_page_list);
-	long ret = nr_pages;
-	struct page *prev_head, *head;
+	long ret = 0;
+	struct page *prev_head = NULL;
+	struct page *head;
 	struct migration_target_control mtc = {
 		.nid = NUMA_NO_NODE,
 		.gfp_mask = GFP_USER | __GFP_NOWARN,
 	};
 
-check_again:
-	prev_head = NULL;
-	isolation_error_count = 0;
-	drain_allow = true;
 	for (i = 0; i < nr_pages; i++) {
 		head = compound_head(pages[i]);
 		if (head == prev_head)
@@ -1609,47 +1610,27 @@ static long check_and_migrate_movable_pages(struct mm_struct *mm,
 	 * in the correct zone.
 	 */
 	if (list_empty(&movable_page_list) && !isolation_error_count)
-		return ret;
+		return nr_pages;
 
+	if (gup_flags & FOLL_PIN) {
+		unpin_user_pages(pages, nr_pages);
+	} else {
+		for (i = 0; i < nr_pages; i++)
+			put_page(pages[i]);
+	}
 	if (!list_empty(&movable_page_list)) {
-		/*
-		 * drop the above get_user_pages reference.
-		 */
-		if (gup_flags & FOLL_PIN)
-			unpin_user_pages(pages, nr_pages);
-		else
-			for (i = 0; i < nr_pages; i++)
-				put_page(pages[i]);
-
 		ret = migrate_pages(&movable_page_list, alloc_migration_target,
 				    NULL, (unsigned long)&mtc, MIGRATE_SYNC,
 				    MR_LONGTERM_PIN);
-		if (ret) {
-			if (!list_empty(&movable_page_list))
-				putback_movable_pages(&movable_page_list);
-			return ret > 0 ? -ENOMEM : ret;
-		}
-
-		/* We unpinned pages before migration, pin them again */
-		ret = __get_user_pages_locked(mm, start, nr_pages, pages, vmas,
-					      NULL, gup_flags);
-		if (ret <= 0)
-			return ret;
-		nr_pages = ret;
+		if (ret && !list_empty(&movable_page_list))
+			putback_movable_pages(&movable_page_list);
 	}
 
-	/*
-	 * check again because pages were unpinned, and we also might have
-	 * had isolation errors and need more pages to migrate.
-	 */
-	goto check_again;
+	return ret > 0 ? -ENOMEM : ret;
 }
 #else
-static long check_and_migrate_movable_pages(struct mm_struct *mm,
-					    unsigned long start,
-					    unsigned long nr_pages,
+static long check_and_migrate_movable_pages(unsigned long nr_pages,
 					    struct page **pages,
-					    struct vm_area_struct **vmas,
 					    unsigned int gup_flags)
 {
 	return nr_pages;
@@ -1667,22 +1648,22 @@ static long __gup_longterm_locked(struct mm_struct *mm,
 				  struct vm_area_struct **vmas,
 				  unsigned int gup_flags)
 {
-	unsigned long flags = 0;
+	unsigned int flags;
 	long rc;
 
-	if (gup_flags & FOLL_LONGTERM)
-		flags = memalloc_pin_save();
-
-	rc = __get_user_pages_locked(mm, start, nr_pages, pages, vmas, NULL,
-				     gup_flags);
+	if (!(gup_flags & FOLL_LONGTERM))
+		return __get_user_pages_locked(mm, start, nr_pages, pages, vmas,
+					       NULL, gup_flags);
+	flags = memalloc_pin_save();
+	do {
+		rc = __get_user_pages_locked(mm, start, nr_pages, pages, vmas,
+					     NULL, gup_flags);
+		if (rc <= 0)
+			break;
+		rc = check_and_migrate_movable_pages(rc, pages, gup_flags);
+	} while (!rc);
+	memalloc_pin_restore(flags);
 
-	if (gup_flags & FOLL_LONGTERM) {
-		if (rc > 0)
-			rc = check_and_migrate_movable_pages(mm, start, rc,
-							     pages, vmas,
-							     gup_flags);
-		memalloc_pin_restore(flags);
-	}
 	return rc;
 }
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v7 13/14] selftests/vm: test flag is broken
  2021-01-22  3:37 [PATCH v7 00/14] prohibit pinning pages in ZONE_MOVABLE Pavel Tatashin
                   ` (11 preceding siblings ...)
  2021-01-22  3:37 ` [PATCH v7 12/14] mm/gup: longterm pin migration cleanup Pavel Tatashin
@ 2021-01-22  3:37 ` Pavel Tatashin
  2021-01-24 23:03   ` John Hubbard
  2021-01-22  3:37 ` [PATCH v7 14/14] selftests/vm: test faulting in kernel, and verify pinnable pages Pavel Tatashin
  13 siblings, 1 reply; 22+ messages in thread
From: Pavel Tatashin @ 2021-01-22  3:37 UTC (permalink / raw)
  To: pasha.tatashin, linux-kernel, linux-mm, akpm, vbabka, mhocko,
	david, osalvador, dan.j.williams, sashal, tyhicks,
	iamjoonsoo.kim, mike.kravetz, rostedt, mingo, jgg, peterz,
	mgorman, willy, rientjes, jhubbard, linux-doc, ira.weiny,
	linux-kselftest

In gup_test both gup_flags and test_flags use the same flags field.
This is broken.

Farther, in the actual gup_test.c all the passed gup_flags are erased and
unconditionally replaced with FOLL_WRITE.

Which means that test_flags are ignored, and code like this always
performs pin dump test:

155  			if (gup->flags & GUP_TEST_FLAG_DUMP_PAGES_USE_PIN)
156  				nr = pin_user_pages(addr, nr, gup->flags,
157  						    pages + i, NULL);
158  			else
159  				nr = get_user_pages(addr, nr, gup->flags,
160  						    pages + i, NULL);
161  			break;

Add a new test_flags field, to allow raw gup_flags to work.
Add a new subcommand for DUMP_USER_PAGES_TEST to specify that pin test
should be performed.
Remove  unconditional overwriting of gup_flags via FOLL_WRITE. But,
preserve the previous behaviour where FOLL_WRITE was the default flag,
and add a new option "-W" to unset FOLL_WRITE.

Rename flags with gup_flags.

With the fix, dump works like this:

root@virtme:/# gup_test  -c
---- page #0, starting from user virt addr: 0x7f8acb9e4000
page:00000000d3d2ee27 refcount:2 mapcount:1 mapping:0000000000000000
index:0x0 pfn:0x100bcf
anon flags: 0x300000000080016(referenced|uptodate|lru|swapbacked)
raw: 0300000000080016 ffffd0e204021608 ffffd0e208df2e88 ffff8ea04243ec61
raw: 0000000000000000 0000000000000000 0000000200000000 0000000000000000
page dumped because: gup_test: dump_pages() test
DUMP_USER_PAGES_TEST: done

root@virtme:/# gup_test  -c -p
---- page #0, starting from user virt addr: 0x7fd19701b000
page:00000000baed3c7d refcount:1025 mapcount:1 mapping:0000000000000000
index:0x0 pfn:0x108008
anon flags: 0x300000000080014(uptodate|lru|swapbacked)
raw: 0300000000080014 ffffd0e204200188 ffffd0e205e09088 ffff8ea04243ee71
raw: 0000000000000000 0000000000000000 0000040100000000 0000000000000000
page dumped because: gup_test: dump_pages() test
DUMP_USER_PAGES_TEST: done

Refcount shows the difference between pin vs no-pin case.
Also change type of nr from int to long, as it counts number of pages.

Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
---
 mm/gup_test.c                         | 23 ++++++++++-------------
 mm/gup_test.h                         |  3 ++-
 tools/testing/selftests/vm/gup_test.c | 15 +++++++++++----
 3 files changed, 23 insertions(+), 18 deletions(-)

diff --git a/mm/gup_test.c b/mm/gup_test.c
index e3cf78e5873e..a6ed1c877679 100644
--- a/mm/gup_test.c
+++ b/mm/gup_test.c
@@ -94,7 +94,7 @@ static int __gup_test_ioctl(unsigned int cmd,
 {
 	ktime_t start_time, end_time;
 	unsigned long i, nr_pages, addr, next;
-	int nr;
+	long nr;
 	struct page **pages;
 	int ret = 0;
 	bool needs_mmap_lock =
@@ -126,37 +126,34 @@ static int __gup_test_ioctl(unsigned int cmd,
 			nr = (next - addr) / PAGE_SIZE;
 		}
 
-		/* Filter out most gup flags: only allow a tiny subset here: */
-		gup->flags &= FOLL_WRITE;
-
 		switch (cmd) {
 		case GUP_FAST_BENCHMARK:
-			nr = get_user_pages_fast(addr, nr, gup->flags,
+			nr = get_user_pages_fast(addr, nr, gup->gup_flags,
 						 pages + i);
 			break;
 		case GUP_BASIC_TEST:
-			nr = get_user_pages(addr, nr, gup->flags, pages + i,
+			nr = get_user_pages(addr, nr, gup->gup_flags, pages + i,
 					    NULL);
 			break;
 		case PIN_FAST_BENCHMARK:
-			nr = pin_user_pages_fast(addr, nr, gup->flags,
+			nr = pin_user_pages_fast(addr, nr, gup->gup_flags,
 						 pages + i);
 			break;
 		case PIN_BASIC_TEST:
-			nr = pin_user_pages(addr, nr, gup->flags, pages + i,
+			nr = pin_user_pages(addr, nr, gup->gup_flags, pages + i,
 					    NULL);
 			break;
 		case PIN_LONGTERM_BENCHMARK:
 			nr = pin_user_pages(addr, nr,
-					    gup->flags | FOLL_LONGTERM,
+					    gup->gup_flags | FOLL_LONGTERM,
 					    pages + i, NULL);
 			break;
 		case DUMP_USER_PAGES_TEST:
-			if (gup->flags & GUP_TEST_FLAG_DUMP_PAGES_USE_PIN)
-				nr = pin_user_pages(addr, nr, gup->flags,
+			if (gup->test_flags & GUP_TEST_FLAG_DUMP_PAGES_USE_PIN)
+				nr = pin_user_pages(addr, nr, gup->gup_flags,
 						    pages + i, NULL);
 			else
-				nr = get_user_pages(addr, nr, gup->flags,
+				nr = get_user_pages(addr, nr, gup->gup_flags,
 						    pages + i, NULL);
 			break;
 		default:
@@ -187,7 +184,7 @@ static int __gup_test_ioctl(unsigned int cmd,
 
 	start_time = ktime_get();
 
-	put_back_pages(cmd, pages, nr_pages, gup->flags);
+	put_back_pages(cmd, pages, nr_pages, gup->test_flags);
 
 	end_time = ktime_get();
 	gup->put_delta_usec = ktime_us_delta(end_time, start_time);
diff --git a/mm/gup_test.h b/mm/gup_test.h
index 90a6713d50eb..887ac1d5f5bc 100644
--- a/mm/gup_test.h
+++ b/mm/gup_test.h
@@ -21,7 +21,8 @@ struct gup_test {
 	__u64 addr;
 	__u64 size;
 	__u32 nr_pages_per_call;
-	__u32 flags;
+	__u32 gup_flags;
+	__u32 test_flags;
 	/*
 	 * Each non-zero entry is the number of the page (1-based: first page is
 	 * page 1, so that zero entries mean "do nothing") from the .addr base.
diff --git a/tools/testing/selftests/vm/gup_test.c b/tools/testing/selftests/vm/gup_test.c
index 6c6336dd3b7f..943cc2608dc2 100644
--- a/tools/testing/selftests/vm/gup_test.c
+++ b/tools/testing/selftests/vm/gup_test.c
@@ -37,13 +37,13 @@ int main(int argc, char **argv)
 {
 	struct gup_test gup = { 0 };
 	unsigned long size = 128 * MB;
-	int i, fd, filed, opt, nr_pages = 1, thp = -1, repeats = 1, write = 0;
+	int i, fd, filed, opt, nr_pages = 1, thp = -1, repeats = 1, write = 1;
 	unsigned long cmd = GUP_FAST_BENCHMARK;
 	int flags = MAP_PRIVATE;
 	char *file = "/dev/zero";
 	char *p;
 
-	while ((opt = getopt(argc, argv, "m:r:n:F:f:abctTLUuwSH")) != -1) {
+	while ((opt = getopt(argc, argv, "m:r:n:F:f:abctTLUuwWSHp")) != -1) {
 		switch (opt) {
 		case 'a':
 			cmd = PIN_FAST_BENCHMARK;
@@ -65,9 +65,13 @@ int main(int argc, char **argv)
 			 */
 			gup.which_pages[0] = 1;
 			break;
+		case 'p':
+			/* works only with DUMP_USER_PAGES_TEST */
+			gup.test_flags |= GUP_TEST_FLAG_DUMP_PAGES_USE_PIN;
+			break;
 		case 'F':
 			/* strtol, so you can pass flags in hex form */
-			gup.flags = strtol(optarg, 0, 0);
+			gup.gup_flags = strtol(optarg, 0, 0);
 			break;
 		case 'm':
 			size = atoi(optarg) * MB;
@@ -93,6 +97,9 @@ int main(int argc, char **argv)
 		case 'w':
 			write = 1;
 			break;
+		case 'W':
+			write = 0;
+			break;
 		case 'f':
 			file = optarg;
 			break;
@@ -140,7 +147,7 @@ int main(int argc, char **argv)
 
 	gup.nr_pages_per_call = nr_pages;
 	if (write)
-		gup.flags |= FOLL_WRITE;
+		gup.gup_flags |= FOLL_WRITE;
 
 	fd = open("/sys/kernel/debug/gup_test", O_RDWR);
 	if (fd == -1) {
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v7 14/14] selftests/vm: test faulting in kernel, and verify pinnable pages
  2021-01-22  3:37 [PATCH v7 00/14] prohibit pinning pages in ZONE_MOVABLE Pavel Tatashin
                   ` (12 preceding siblings ...)
  2021-01-22  3:37 ` [PATCH v7 13/14] selftests/vm: test flag is broken Pavel Tatashin
@ 2021-01-22  3:37 ` Pavel Tatashin
  2021-01-24 23:18   ` John Hubbard
  13 siblings, 1 reply; 22+ messages in thread
From: Pavel Tatashin @ 2021-01-22  3:37 UTC (permalink / raw)
  To: pasha.tatashin, linux-kernel, linux-mm, akpm, vbabka, mhocko,
	david, osalvador, dan.j.williams, sashal, tyhicks,
	iamjoonsoo.kim, mike.kravetz, rostedt, mingo, jgg, peterz,
	mgorman, willy, rientjes, jhubbard, linux-doc, ira.weiny,
	linux-kselftest

When pages are pinned they can be faulted in userland and migrated, and
they can be faulted right in kernel without migration.

In either case, the pinned pages must end-up being pinnable (not movable).

Add a new test to gup_test, to help verify that the gup/pup
(get_user_pages() / pin_user_pages()) behavior with respect to pinnable
and movable pages is reasonable and correct. Specifically, provide a
way to:

1) Verify that only "pinnable" pages are pinned. This is checked
automatically for you.

2) Verify that gup/pup performance is reasonable. This requires
comparing benchmarks between doing gup/pup on pages that have been
pre-faulted in from user space, vs. doing gup/pup on pages that are not
faulted in until gup/pup time (via FOLL_TOUCH). This decision is
controlled with the new -z command line option.

Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
---
 mm/gup_test.c                         |  6 ++++++
 tools/testing/selftests/vm/gup_test.c | 23 +++++++++++++++++++----
 2 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/mm/gup_test.c b/mm/gup_test.c
index a6ed1c877679..d974dec19e1c 100644
--- a/mm/gup_test.c
+++ b/mm/gup_test.c
@@ -52,6 +52,12 @@ static void verify_dma_pinned(unsigned int cmd, struct page **pages,
 
 				dump_page(page, "gup_test failure");
 				break;
+			} else if (cmd == PIN_LONGTERM_BENCHMARK &&
+				WARN(!is_pinnable_page(page),
+				     "pages[%lu] is NOT pinnable but pinned\n",
+				     i)) {
+				dump_page(page, "gup_test failure");
+				break;
 			}
 		}
 		break;
diff --git a/tools/testing/selftests/vm/gup_test.c b/tools/testing/selftests/vm/gup_test.c
index 943cc2608dc2..1e662d59c502 100644
--- a/tools/testing/selftests/vm/gup_test.c
+++ b/tools/testing/selftests/vm/gup_test.c
@@ -13,6 +13,7 @@
 
 /* Just the flags we need, copied from mm.h: */
 #define FOLL_WRITE	0x01	/* check pte is writable */
+#define FOLL_TOUCH	0x02	/* mark page accessed */
 
 static char *cmd_to_str(unsigned long cmd)
 {
@@ -39,11 +40,11 @@ int main(int argc, char **argv)
 	unsigned long size = 128 * MB;
 	int i, fd, filed, opt, nr_pages = 1, thp = -1, repeats = 1, write = 1;
 	unsigned long cmd = GUP_FAST_BENCHMARK;
-	int flags = MAP_PRIVATE;
+	int flags = MAP_PRIVATE, touch = 0;
 	char *file = "/dev/zero";
 	char *p;
 
-	while ((opt = getopt(argc, argv, "m:r:n:F:f:abctTLUuwWSHp")) != -1) {
+	while ((opt = getopt(argc, argv, "m:r:n:F:f:abctTLUuwWSHpz")) != -1) {
 		switch (opt) {
 		case 'a':
 			cmd = PIN_FAST_BENCHMARK;
@@ -110,6 +111,10 @@ int main(int argc, char **argv)
 		case 'H':
 			flags |= (MAP_HUGETLB | MAP_ANONYMOUS);
 			break;
+		case 'z':
+			/* fault pages in gup, do not fault in userland */
+			touch = 1;
+			break;
 		default:
 			return -1;
 		}
@@ -167,8 +172,18 @@ int main(int argc, char **argv)
 	else if (thp == 0)
 		madvise(p, size, MADV_NOHUGEPAGE);
 
-	for (; (unsigned long)p < gup.addr + size; p += PAGE_SIZE)
-		p[0] = 0;
+	/*
+	 * FOLL_TOUCH, in gup_test, is used as an either/or case: either
+	 * fault pages in from the kernel via FOLL_TOUCH, or fault them
+	 * in here, from user space. This allows comparison of performance
+	 * between those two cases.
+	 */
+	if (touch) {
+		gup.gup_flags |= FOLL_TOUCH;
+	} else {
+		for (; (unsigned long)p < gup.addr + size; p += PAGE_SIZE)
+			p[0] = 0;
+	}
 
 	/* Only report timing information on the *_BENCHMARK commands: */
 	if ((cmd == PIN_FAST_BENCHMARK) || (cmd == GUP_FAST_BENCHMARK) ||
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v7 13/14] selftests/vm: test flag is broken
  2021-01-22  3:37 ` [PATCH v7 13/14] selftests/vm: test flag is broken Pavel Tatashin
@ 2021-01-24 23:03   ` John Hubbard
  2021-01-25 14:16       ` Pavel Tatashin
  0 siblings, 1 reply; 22+ messages in thread
From: John Hubbard @ 2021-01-24 23:03 UTC (permalink / raw)
  To: Pavel Tatashin, linux-kernel, linux-mm, akpm, vbabka, mhocko,
	david, osalvador, dan.j.williams, sashal, tyhicks,
	iamjoonsoo.kim, mike.kravetz, rostedt, mingo, jgg, peterz,
	mgorman, willy, rientjes, linux-doc, ira.weiny, linux-kselftest

On 1/21/21 7:37 PM, Pavel Tatashin wrote:
> In gup_test both gup_flags and test_flags use the same flags field.
> This is broken.
> 
> Farther, in the actual gup_test.c all the passed gup_flags are erased and
> unconditionally replaced with FOLL_WRITE.
> 
> Which means that test_flags are ignored, and code like this always
> performs pin dump test:
> 
> 155  			if (gup->flags & GUP_TEST_FLAG_DUMP_PAGES_USE_PIN)
> 156  				nr = pin_user_pages(addr, nr, gup->flags,
> 157  						    pages + i, NULL);
> 158  			else
> 159  				nr = get_user_pages(addr, nr, gup->flags,
> 160  						    pages + i, NULL);
> 161  			break;
> 
> Add a new test_flags field, to allow raw gup_flags to work.
> Add a new subcommand for DUMP_USER_PAGES_TEST to specify that pin test
> should be performed.
> Remove  unconditional overwriting of gup_flags via FOLL_WRITE. But,
> preserve the previous behaviour where FOLL_WRITE was the default flag,
> and add a new option "-W" to unset FOLL_WRITE.
> 
> Rename flags with gup_flags.

Hi Pavel,

Thanks again for fixing this up! Looks good, with a tiny point about the
subject line:

1) To follow convention, the subject line should say what you're doing,
not what the previous condition was. Also, there are several tests in that
directory, so we should say which one. So more like this:

"selftests/vm: gup_test: fix test flag"

That is just a minor documentation point, so either way, feel free to add:


Reviewed-by: John Hubbard <jhubbard@nvidia.com>


thanks,
-- 
John Hubbard
NVIDIA

> 
> With the fix, dump works like this:
> 
> root@virtme:/# gup_test  -c
> ---- page #0, starting from user virt addr: 0x7f8acb9e4000
> page:00000000d3d2ee27 refcount:2 mapcount:1 mapping:0000000000000000
> index:0x0 pfn:0x100bcf
> anon flags: 0x300000000080016(referenced|uptodate|lru|swapbacked)
> raw: 0300000000080016 ffffd0e204021608 ffffd0e208df2e88 ffff8ea04243ec61
> raw: 0000000000000000 0000000000000000 0000000200000000 0000000000000000
> page dumped because: gup_test: dump_pages() test
> DUMP_USER_PAGES_TEST: done
> 
> root@virtme:/# gup_test  -c -p
> ---- page #0, starting from user virt addr: 0x7fd19701b000
> page:00000000baed3c7d refcount:1025 mapcount:1 mapping:0000000000000000
> index:0x0 pfn:0x108008
> anon flags: 0x300000000080014(uptodate|lru|swapbacked)
> raw: 0300000000080014 ffffd0e204200188 ffffd0e205e09088 ffff8ea04243ee71
> raw: 0000000000000000 0000000000000000 0000040100000000 0000000000000000
> page dumped because: gup_test: dump_pages() test
> DUMP_USER_PAGES_TEST: done
> 
> Refcount shows the difference between pin vs no-pin case.
> Also change type of nr from int to long, as it counts number of pages.
> 
> Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
> ---
>   mm/gup_test.c                         | 23 ++++++++++-------------
>   mm/gup_test.h                         |  3 ++-
>   tools/testing/selftests/vm/gup_test.c | 15 +++++++++++----
>   3 files changed, 23 insertions(+), 18 deletions(-)
> 
> diff --git a/mm/gup_test.c b/mm/gup_test.c
> index e3cf78e5873e..a6ed1c877679 100644
> --- a/mm/gup_test.c
> +++ b/mm/gup_test.c
> @@ -94,7 +94,7 @@ static int __gup_test_ioctl(unsigned int cmd,
>   {
>   	ktime_t start_time, end_time;
>   	unsigned long i, nr_pages, addr, next;
> -	int nr;
> +	long nr;
>   	struct page **pages;
>   	int ret = 0;
>   	bool needs_mmap_lock =
> @@ -126,37 +126,34 @@ static int __gup_test_ioctl(unsigned int cmd,
>   			nr = (next - addr) / PAGE_SIZE;
>   		}
>   
> -		/* Filter out most gup flags: only allow a tiny subset here: */
> -		gup->flags &= FOLL_WRITE;
> -
>   		switch (cmd) {
>   		case GUP_FAST_BENCHMARK:
> -			nr = get_user_pages_fast(addr, nr, gup->flags,
> +			nr = get_user_pages_fast(addr, nr, gup->gup_flags,
>   						 pages + i);
>   			break;
>   		case GUP_BASIC_TEST:
> -			nr = get_user_pages(addr, nr, gup->flags, pages + i,
> +			nr = get_user_pages(addr, nr, gup->gup_flags, pages + i,
>   					    NULL);
>   			break;
>   		case PIN_FAST_BENCHMARK:
> -			nr = pin_user_pages_fast(addr, nr, gup->flags,
> +			nr = pin_user_pages_fast(addr, nr, gup->gup_flags,
>   						 pages + i);
>   			break;
>   		case PIN_BASIC_TEST:
> -			nr = pin_user_pages(addr, nr, gup->flags, pages + i,
> +			nr = pin_user_pages(addr, nr, gup->gup_flags, pages + i,
>   					    NULL);
>   			break;
>   		case PIN_LONGTERM_BENCHMARK:
>   			nr = pin_user_pages(addr, nr,
> -					    gup->flags | FOLL_LONGTERM,
> +					    gup->gup_flags | FOLL_LONGTERM,
>   					    pages + i, NULL);
>   			break;
>   		case DUMP_USER_PAGES_TEST:
> -			if (gup->flags & GUP_TEST_FLAG_DUMP_PAGES_USE_PIN)
> -				nr = pin_user_pages(addr, nr, gup->flags,
> +			if (gup->test_flags & GUP_TEST_FLAG_DUMP_PAGES_USE_PIN)
> +				nr = pin_user_pages(addr, nr, gup->gup_flags,
>   						    pages + i, NULL);
>   			else
> -				nr = get_user_pages(addr, nr, gup->flags,
> +				nr = get_user_pages(addr, nr, gup->gup_flags,
>   						    pages + i, NULL);
>   			break;
>   		default:
> @@ -187,7 +184,7 @@ static int __gup_test_ioctl(unsigned int cmd,
>   
>   	start_time = ktime_get();
>   
> -	put_back_pages(cmd, pages, nr_pages, gup->flags);
> +	put_back_pages(cmd, pages, nr_pages, gup->test_flags);
>   
>   	end_time = ktime_get();
>   	gup->put_delta_usec = ktime_us_delta(end_time, start_time);
> diff --git a/mm/gup_test.h b/mm/gup_test.h
> index 90a6713d50eb..887ac1d5f5bc 100644
> --- a/mm/gup_test.h
> +++ b/mm/gup_test.h
> @@ -21,7 +21,8 @@ struct gup_test {
>   	__u64 addr;
>   	__u64 size;
>   	__u32 nr_pages_per_call;
> -	__u32 flags;
> +	__u32 gup_flags;
> +	__u32 test_flags;
>   	/*
>   	 * Each non-zero entry is the number of the page (1-based: first page is
>   	 * page 1, so that zero entries mean "do nothing") from the .addr base.
> diff --git a/tools/testing/selftests/vm/gup_test.c b/tools/testing/selftests/vm/gup_test.c
> index 6c6336dd3b7f..943cc2608dc2 100644
> --- a/tools/testing/selftests/vm/gup_test.c
> +++ b/tools/testing/selftests/vm/gup_test.c
> @@ -37,13 +37,13 @@ int main(int argc, char **argv)
>   {
>   	struct gup_test gup = { 0 };
>   	unsigned long size = 128 * MB;
> -	int i, fd, filed, opt, nr_pages = 1, thp = -1, repeats = 1, write = 0;
> +	int i, fd, filed, opt, nr_pages = 1, thp = -1, repeats = 1, write = 1;
>   	unsigned long cmd = GUP_FAST_BENCHMARK;
>   	int flags = MAP_PRIVATE;
>   	char *file = "/dev/zero";
>   	char *p;
>   
> -	while ((opt = getopt(argc, argv, "m:r:n:F:f:abctTLUuwSH")) != -1) {
> +	while ((opt = getopt(argc, argv, "m:r:n:F:f:abctTLUuwWSHp")) != -1) {
>   		switch (opt) {
>   		case 'a':
>   			cmd = PIN_FAST_BENCHMARK;
> @@ -65,9 +65,13 @@ int main(int argc, char **argv)
>   			 */
>   			gup.which_pages[0] = 1;
>   			break;
> +		case 'p':
> +			/* works only with DUMP_USER_PAGES_TEST */
> +			gup.test_flags |= GUP_TEST_FLAG_DUMP_PAGES_USE_PIN;
> +			break;
>   		case 'F':
>   			/* strtol, so you can pass flags in hex form */
> -			gup.flags = strtol(optarg, 0, 0);
> +			gup.gup_flags = strtol(optarg, 0, 0);
>   			break;
>   		case 'm':
>   			size = atoi(optarg) * MB;
> @@ -93,6 +97,9 @@ int main(int argc, char **argv)
>   		case 'w':
>   			write = 1;
>   			break;
> +		case 'W':
> +			write = 0;
> +			break;
>   		case 'f':
>   			file = optarg;
>   			break;
> @@ -140,7 +147,7 @@ int main(int argc, char **argv)
>   
>   	gup.nr_pages_per_call = nr_pages;
>   	if (write)
> -		gup.flags |= FOLL_WRITE;
> +		gup.gup_flags |= FOLL_WRITE;
>   
>   	fd = open("/sys/kernel/debug/gup_test", O_RDWR);
>   	if (fd == -1) {
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v7 14/14] selftests/vm: test faulting in kernel, and verify pinnable pages
  2021-01-22  3:37 ` [PATCH v7 14/14] selftests/vm: test faulting in kernel, and verify pinnable pages Pavel Tatashin
@ 2021-01-24 23:18   ` John Hubbard
  2021-01-24 23:40     ` John Hubbard
  2021-01-25 14:17       ` Pavel Tatashin
  0 siblings, 2 replies; 22+ messages in thread
From: John Hubbard @ 2021-01-24 23:18 UTC (permalink / raw)
  To: Pavel Tatashin, linux-kernel, linux-mm, akpm, vbabka, mhocko,
	david, osalvador, dan.j.williams, sashal, tyhicks,
	iamjoonsoo.kim, mike.kravetz, rostedt, mingo, jgg, peterz,
	mgorman, willy, rientjes, linux-doc, ira.weiny, linux-kselftest

On 1/21/21 7:37 PM, Pavel Tatashin wrote:
> When pages are pinned they can be faulted in userland and migrated, and
> they can be faulted right in kernel without migration.
> 
> In either case, the pinned pages must end-up being pinnable (not movable).
> 
> Add a new test to gup_test, to help verify that the gup/pup
> (get_user_pages() / pin_user_pages()) behavior with respect to pinnable
> and movable pages is reasonable and correct. Specifically, provide a
> way to:
> 
> 1) Verify that only "pinnable" pages are pinned. This is checked
> automatically for you.
> 
> 2) Verify that gup/pup performance is reasonable. This requires
> comparing benchmarks between doing gup/pup on pages that have been
> pre-faulted in from user space, vs. doing gup/pup on pages that are not
> faulted in until gup/pup time (via FOLL_TOUCH). This decision is
> controlled with the new -z command line option.
> 
> Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
> ---
>   mm/gup_test.c                         |  6 ++++++
>   tools/testing/selftests/vm/gup_test.c | 23 +++++++++++++++++++----
>   2 files changed, 25 insertions(+), 4 deletions(-)
> 

This also looks good. I do see the WARN_ON_ONCE firing in
internal_get_user_pages_fast(), when running with *only* the new -z
option.

I'll poke around the rest of the patchset and see if that is expected
and normal, but either way the test code itself looks correct and seems
to be passing my set of "run a bunch of different gup_test options" here,
so feel free to add:

Reviewed-by: John Hubbard <jhubbard@nvidia.com>

thanks,
-- 
John Hubbard
NVIDIA

> diff --git a/mm/gup_test.c b/mm/gup_test.c
> index a6ed1c877679..d974dec19e1c 100644
> --- a/mm/gup_test.c
> +++ b/mm/gup_test.c
> @@ -52,6 +52,12 @@ static void verify_dma_pinned(unsigned int cmd, struct page **pages,
>   
>   				dump_page(page, "gup_test failure");
>   				break;
> +			} else if (cmd == PIN_LONGTERM_BENCHMARK &&
> +				WARN(!is_pinnable_page(page),
> +				     "pages[%lu] is NOT pinnable but pinned\n",
> +				     i)) {
> +				dump_page(page, "gup_test failure");
> +				break;
>   			}
>   		}
>   		break;
> diff --git a/tools/testing/selftests/vm/gup_test.c b/tools/testing/selftests/vm/gup_test.c
> index 943cc2608dc2..1e662d59c502 100644
> --- a/tools/testing/selftests/vm/gup_test.c
> +++ b/tools/testing/selftests/vm/gup_test.c
> @@ -13,6 +13,7 @@
>   
>   /* Just the flags we need, copied from mm.h: */
>   #define FOLL_WRITE	0x01	/* check pte is writable */
> +#define FOLL_TOUCH	0x02	/* mark page accessed */
>   
>   static char *cmd_to_str(unsigned long cmd)
>   {
> @@ -39,11 +40,11 @@ int main(int argc, char **argv)
>   	unsigned long size = 128 * MB;
>   	int i, fd, filed, opt, nr_pages = 1, thp = -1, repeats = 1, write = 1;
>   	unsigned long cmd = GUP_FAST_BENCHMARK;
> -	int flags = MAP_PRIVATE;
> +	int flags = MAP_PRIVATE, touch = 0;
>   	char *file = "/dev/zero";
>   	char *p;
>   
> -	while ((opt = getopt(argc, argv, "m:r:n:F:f:abctTLUuwWSHp")) != -1) {
> +	while ((opt = getopt(argc, argv, "m:r:n:F:f:abctTLUuwWSHpz")) != -1) {
>   		switch (opt) {
>   		case 'a':
>   			cmd = PIN_FAST_BENCHMARK;
> @@ -110,6 +111,10 @@ int main(int argc, char **argv)
>   		case 'H':
>   			flags |= (MAP_HUGETLB | MAP_ANONYMOUS);
>   			break;
> +		case 'z':
> +			/* fault pages in gup, do not fault in userland */
> +			touch = 1;
> +			break;
>   		default:
>   			return -1;
>   		}
> @@ -167,8 +172,18 @@ int main(int argc, char **argv)
>   	else if (thp == 0)
>   		madvise(p, size, MADV_NOHUGEPAGE);
>   
> -	for (; (unsigned long)p < gup.addr + size; p += PAGE_SIZE)
> -		p[0] = 0;
> +	/*
> +	 * FOLL_TOUCH, in gup_test, is used as an either/or case: either
> +	 * fault pages in from the kernel via FOLL_TOUCH, or fault them
> +	 * in here, from user space. This allows comparison of performance
> +	 * between those two cases.
> +	 */
> +	if (touch) {
> +		gup.gup_flags |= FOLL_TOUCH;
> +	} else {
> +		for (; (unsigned long)p < gup.addr + size; p += PAGE_SIZE)
> +			p[0] = 0;
> +	}
>   
>   	/* Only report timing information on the *_BENCHMARK commands: */
>   	if ((cmd == PIN_FAST_BENCHMARK) || (cmd == GUP_FAST_BENCHMARK) ||
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v7 14/14] selftests/vm: test faulting in kernel, and verify pinnable pages
  2021-01-24 23:18   ` John Hubbard
@ 2021-01-24 23:40     ` John Hubbard
  2021-01-25 14:17       ` Pavel Tatashin
  1 sibling, 0 replies; 22+ messages in thread
From: John Hubbard @ 2021-01-24 23:40 UTC (permalink / raw)
  To: Pavel Tatashin, linux-kernel, linux-mm, akpm, vbabka, mhocko,
	david, osalvador, dan.j.williams, sashal, tyhicks,
	iamjoonsoo.kim, mike.kravetz, rostedt, mingo, jgg, peterz,
	mgorman, willy, rientjes, linux-doc, ira.weiny, linux-kselftest

On 1/24/21 3:18 PM, John Hubbard wrote:
> On 1/21/21 7:37 PM, Pavel Tatashin wrote:
>> When pages are pinned they can be faulted in userland and migrated, and
>> they can be faulted right in kernel without migration.
>>
>> In either case, the pinned pages must end-up being pinnable (not movable).
>>
>> Add a new test to gup_test, to help verify that the gup/pup
>> (get_user_pages() / pin_user_pages()) behavior with respect to pinnable
>> and movable pages is reasonable and correct. Specifically, provide a
>> way to:
>>
>> 1) Verify that only "pinnable" pages are pinned. This is checked
>> automatically for you.
>>
>> 2) Verify that gup/pup performance is reasonable. This requires
>> comparing benchmarks between doing gup/pup on pages that have been
>> pre-faulted in from user space, vs. doing gup/pup on pages that are not
>> faulted in until gup/pup time (via FOLL_TOUCH). This decision is
>> controlled with the new -z command line option.
>>
>> Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
>> ---
>>   mm/gup_test.c                         |  6 ++++++
>>   tools/testing/selftests/vm/gup_test.c | 23 +++++++++++++++++++----
>>   2 files changed, 25 insertions(+), 4 deletions(-)
>>
> 
> This also looks good. I do see the WARN_ON_ONCE firing in
> internal_get_user_pages_fast(), when running with *only* the new -z
> option.
> 
> I'll poke around the rest of the patchset and see if that is expected
> and normal, but either way the test code itself looks correct and seems

The warning that is firing in internal_get_user_pages_fast() is:

	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
				       FOLL_FORCE | FOLL_PIN | FOLL_GET |
				       FOLL_FAST_ONLY)))
		return -EINVAL;

...OK, so this is because "./gup_test -z" invokes get_user_pages_fast(),
which so far does not allow passing in FOLL_TOUCH. Probably because there
is nothing "fast" about touching and possibly faulting in pages. :)

So, again, the test code still looks correct, even though it's possible
to pass in options that run into things that are rejected by gup.c



thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v7 13/14] selftests/vm: test flag is broken
  2021-01-24 23:03   ` John Hubbard
@ 2021-01-25 14:16       ` Pavel Tatashin
  0 siblings, 0 replies; 22+ messages in thread
From: Pavel Tatashin @ 2021-01-25 14:16 UTC (permalink / raw)
  To: John Hubbard
  Cc: LKML, linux-mm, Andrew Morton, Vlastimil Babka, Michal Hocko,
	David Hildenbrand, Oscar Salvador, Dan Williams, Sasha Levin,
	Tyler Hicks, Joonsoo Kim, Mike Kravetz, Steven Rostedt,
	Ingo Molnar, Jason Gunthorpe, Peter Zijlstra, Mel Gorman,
	Matthew Wilcox, David Rientjes, Linux Doc Mailing List,
	Ira Weiny, linux-kselftest

On Sun, Jan 24, 2021 at 6:03 PM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 1/21/21 7:37 PM, Pavel Tatashin wrote:
> > In gup_test both gup_flags and test_flags use the same flags field.
> > This is broken.
> >
> > Farther, in the actual gup_test.c all the passed gup_flags are erased and
> > unconditionally replaced with FOLL_WRITE.
> >
> > Which means that test_flags are ignored, and code like this always
> > performs pin dump test:
> >
> > 155                   if (gup->flags & GUP_TEST_FLAG_DUMP_PAGES_USE_PIN)
> > 156                           nr = pin_user_pages(addr, nr, gup->flags,
> > 157                                               pages + i, NULL);
> > 158                   else
> > 159                           nr = get_user_pages(addr, nr, gup->flags,
> > 160                                               pages + i, NULL);
> > 161                   break;
> >
> > Add a new test_flags field, to allow raw gup_flags to work.
> > Add a new subcommand for DUMP_USER_PAGES_TEST to specify that pin test
> > should be performed.
> > Remove  unconditional overwriting of gup_flags via FOLL_WRITE. But,
> > preserve the previous behaviour where FOLL_WRITE was the default flag,
> > and add a new option "-W" to unset FOLL_WRITE.
> >
> > Rename flags with gup_flags.
>
> Hi Pavel,
>
> Thanks again for fixing this up! Looks good, with a tiny point about the
> subject line:
>
> 1) To follow convention, the subject line should say what you're doing,
> not what the previous condition was. Also, there are several tests in that
> directory, so we should say which one. So more like this:
>
> "selftests/vm: gup_test: fix test flag"
>
> That is just a minor documentation point, so either way, feel free to add:
>
>
> Reviewed-by: John Hubbard <jhubbard@nvidia.com>

Thank you for your review , I will change the subject line in the next update.

Pasha

>
>
> thanks,
> --
> John Hubbard
> NVIDIA
>
> >
> > With the fix, dump works like this:
> >
> > root@virtme:/# gup_test  -c
> > ---- page #0, starting from user virt addr: 0x7f8acb9e4000
> > page:00000000d3d2ee27 refcount:2 mapcount:1 mapping:0000000000000000
> > index:0x0 pfn:0x100bcf
> > anon flags: 0x300000000080016(referenced|uptodate|lru|swapbacked)
> > raw: 0300000000080016 ffffd0e204021608 ffffd0e208df2e88 ffff8ea04243ec61
> > raw: 0000000000000000 0000000000000000 0000000200000000 0000000000000000
> > page dumped because: gup_test: dump_pages() test
> > DUMP_USER_PAGES_TEST: done
> >
> > root@virtme:/# gup_test  -c -p
> > ---- page #0, starting from user virt addr: 0x7fd19701b000
> > page:00000000baed3c7d refcount:1025 mapcount:1 mapping:0000000000000000
> > index:0x0 pfn:0x108008
> > anon flags: 0x300000000080014(uptodate|lru|swapbacked)
> > raw: 0300000000080014 ffffd0e204200188 ffffd0e205e09088 ffff8ea04243ee71
> > raw: 0000000000000000 0000000000000000 0000040100000000 0000000000000000
> > page dumped because: gup_test: dump_pages() test
> > DUMP_USER_PAGES_TEST: done
> >
> > Refcount shows the difference between pin vs no-pin case.
> > Also change type of nr from int to long, as it counts number of pages.
> >
> > Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
> > ---
> >   mm/gup_test.c                         | 23 ++++++++++-------------
> >   mm/gup_test.h                         |  3 ++-
> >   tools/testing/selftests/vm/gup_test.c | 15 +++++++++++----
> >   3 files changed, 23 insertions(+), 18 deletions(-)
> >
> > diff --git a/mm/gup_test.c b/mm/gup_test.c
> > index e3cf78e5873e..a6ed1c877679 100644
> > --- a/mm/gup_test.c
> > +++ b/mm/gup_test.c
> > @@ -94,7 +94,7 @@ static int __gup_test_ioctl(unsigned int cmd,
> >   {
> >       ktime_t start_time, end_time;
> >       unsigned long i, nr_pages, addr, next;
> > -     int nr;
> > +     long nr;
> >       struct page **pages;
> >       int ret = 0;
> >       bool needs_mmap_lock =
> > @@ -126,37 +126,34 @@ static int __gup_test_ioctl(unsigned int cmd,
> >                       nr = (next - addr) / PAGE_SIZE;
> >               }
> >
> > -             /* Filter out most gup flags: only allow a tiny subset here: */
> > -             gup->flags &= FOLL_WRITE;
> > -
> >               switch (cmd) {
> >               case GUP_FAST_BENCHMARK:
> > -                     nr = get_user_pages_fast(addr, nr, gup->flags,
> > +                     nr = get_user_pages_fast(addr, nr, gup->gup_flags,
> >                                                pages + i);
> >                       break;
> >               case GUP_BASIC_TEST:
> > -                     nr = get_user_pages(addr, nr, gup->flags, pages + i,
> > +                     nr = get_user_pages(addr, nr, gup->gup_flags, pages + i,
> >                                           NULL);
> >                       break;
> >               case PIN_FAST_BENCHMARK:
> > -                     nr = pin_user_pages_fast(addr, nr, gup->flags,
> > +                     nr = pin_user_pages_fast(addr, nr, gup->gup_flags,
> >                                                pages + i);
> >                       break;
> >               case PIN_BASIC_TEST:
> > -                     nr = pin_user_pages(addr, nr, gup->flags, pages + i,
> > +                     nr = pin_user_pages(addr, nr, gup->gup_flags, pages + i,
> >                                           NULL);
> >                       break;
> >               case PIN_LONGTERM_BENCHMARK:
> >                       nr = pin_user_pages(addr, nr,
> > -                                         gup->flags | FOLL_LONGTERM,
> > +                                         gup->gup_flags | FOLL_LONGTERM,
> >                                           pages + i, NULL);
> >                       break;
> >               case DUMP_USER_PAGES_TEST:
> > -                     if (gup->flags & GUP_TEST_FLAG_DUMP_PAGES_USE_PIN)
> > -                             nr = pin_user_pages(addr, nr, gup->flags,
> > +                     if (gup->test_flags & GUP_TEST_FLAG_DUMP_PAGES_USE_PIN)
> > +                             nr = pin_user_pages(addr, nr, gup->gup_flags,
> >                                                   pages + i, NULL);
> >                       else
> > -                             nr = get_user_pages(addr, nr, gup->flags,
> > +                             nr = get_user_pages(addr, nr, gup->gup_flags,
> >                                                   pages + i, NULL);
> >                       break;
> >               default:
> > @@ -187,7 +184,7 @@ static int __gup_test_ioctl(unsigned int cmd,
> >
> >       start_time = ktime_get();
> >
> > -     put_back_pages(cmd, pages, nr_pages, gup->flags);
> > +     put_back_pages(cmd, pages, nr_pages, gup->test_flags);
> >
> >       end_time = ktime_get();
> >       gup->put_delta_usec = ktime_us_delta(end_time, start_time);
> > diff --git a/mm/gup_test.h b/mm/gup_test.h
> > index 90a6713d50eb..887ac1d5f5bc 100644
> > --- a/mm/gup_test.h
> > +++ b/mm/gup_test.h
> > @@ -21,7 +21,8 @@ struct gup_test {
> >       __u64 addr;
> >       __u64 size;
> >       __u32 nr_pages_per_call;
> > -     __u32 flags;
> > +     __u32 gup_flags;
> > +     __u32 test_flags;
> >       /*
> >        * Each non-zero entry is the number of the page (1-based: first page is
> >        * page 1, so that zero entries mean "do nothing") from the .addr base.
> > diff --git a/tools/testing/selftests/vm/gup_test.c b/tools/testing/selftests/vm/gup_test.c
> > index 6c6336dd3b7f..943cc2608dc2 100644
> > --- a/tools/testing/selftests/vm/gup_test.c
> > +++ b/tools/testing/selftests/vm/gup_test.c
> > @@ -37,13 +37,13 @@ int main(int argc, char **argv)
> >   {
> >       struct gup_test gup = { 0 };
> >       unsigned long size = 128 * MB;
> > -     int i, fd, filed, opt, nr_pages = 1, thp = -1, repeats = 1, write = 0;
> > +     int i, fd, filed, opt, nr_pages = 1, thp = -1, repeats = 1, write = 1;
> >       unsigned long cmd = GUP_FAST_BENCHMARK;
> >       int flags = MAP_PRIVATE;
> >       char *file = "/dev/zero";
> >       char *p;
> >
> > -     while ((opt = getopt(argc, argv, "m:r:n:F:f:abctTLUuwSH")) != -1) {
> > +     while ((opt = getopt(argc, argv, "m:r:n:F:f:abctTLUuwWSHp")) != -1) {
> >               switch (opt) {
> >               case 'a':
> >                       cmd = PIN_FAST_BENCHMARK;
> > @@ -65,9 +65,13 @@ int main(int argc, char **argv)
> >                        */
> >                       gup.which_pages[0] = 1;
> >                       break;
> > +             case 'p':
> > +                     /* works only with DUMP_USER_PAGES_TEST */
> > +                     gup.test_flags |= GUP_TEST_FLAG_DUMP_PAGES_USE_PIN;
> > +                     break;
> >               case 'F':
> >                       /* strtol, so you can pass flags in hex form */
> > -                     gup.flags = strtol(optarg, 0, 0);
> > +                     gup.gup_flags = strtol(optarg, 0, 0);
> >                       break;
> >               case 'm':
> >                       size = atoi(optarg) * MB;
> > @@ -93,6 +97,9 @@ int main(int argc, char **argv)
> >               case 'w':
> >                       write = 1;
> >                       break;
> > +             case 'W':
> > +                     write = 0;
> > +                     break;
> >               case 'f':
> >                       file = optarg;
> >                       break;
> > @@ -140,7 +147,7 @@ int main(int argc, char **argv)
> >
> >       gup.nr_pages_per_call = nr_pages;
> >       if (write)
> > -             gup.flags |= FOLL_WRITE;
> > +             gup.gup_flags |= FOLL_WRITE;
> >
> >       fd = open("/sys/kernel/debug/gup_test", O_RDWR);
> >       if (fd == -1) {
> >
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v7 13/14] selftests/vm: test flag is broken
@ 2021-01-25 14:16       ` Pavel Tatashin
  0 siblings, 0 replies; 22+ messages in thread
From: Pavel Tatashin @ 2021-01-25 14:16 UTC (permalink / raw)
  To: John Hubbard
  Cc: LKML, linux-mm, Andrew Morton, Vlastimil Babka, Michal Hocko,
	David Hildenbrand, Oscar Salvador, Dan Williams, Sasha Levin,
	Tyler Hicks, Joonsoo Kim, Mike Kravetz, Steven Rostedt,
	Ingo Molnar, Jason Gunthorpe, Peter Zijlstra, Mel Gorman,
	Matthew Wilcox, David Rientjes, Linux Doc Mailing List,
	Ira Weiny, linux-kselftest

On Sun, Jan 24, 2021 at 6:03 PM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 1/21/21 7:37 PM, Pavel Tatashin wrote:
> > In gup_test both gup_flags and test_flags use the same flags field.
> > This is broken.
> >
> > Farther, in the actual gup_test.c all the passed gup_flags are erased and
> > unconditionally replaced with FOLL_WRITE.
> >
> > Which means that test_flags are ignored, and code like this always
> > performs pin dump test:
> >
> > 155                   if (gup->flags & GUP_TEST_FLAG_DUMP_PAGES_USE_PIN)
> > 156                           nr = pin_user_pages(addr, nr, gup->flags,
> > 157                                               pages + i, NULL);
> > 158                   else
> > 159                           nr = get_user_pages(addr, nr, gup->flags,
> > 160                                               pages + i, NULL);
> > 161                   break;
> >
> > Add a new test_flags field, to allow raw gup_flags to work.
> > Add a new subcommand for DUMP_USER_PAGES_TEST to specify that pin test
> > should be performed.
> > Remove  unconditional overwriting of gup_flags via FOLL_WRITE. But,
> > preserve the previous behaviour where FOLL_WRITE was the default flag,
> > and add a new option "-W" to unset FOLL_WRITE.
> >
> > Rename flags with gup_flags.
>
> Hi Pavel,
>
> Thanks again for fixing this up! Looks good, with a tiny point about the
> subject line:
>
> 1) To follow convention, the subject line should say what you're doing,
> not what the previous condition was. Also, there are several tests in that
> directory, so we should say which one. So more like this:
>
> "selftests/vm: gup_test: fix test flag"
>
> That is just a minor documentation point, so either way, feel free to add:
>
>
> Reviewed-by: John Hubbard <jhubbard@nvidia.com>

Thank you for your review , I will change the subject line in the next update.

Pasha

>
>
> thanks,
> --
> John Hubbard
> NVIDIA
>
> >
> > With the fix, dump works like this:
> >
> > root@virtme:/# gup_test  -c
> > ---- page #0, starting from user virt addr: 0x7f8acb9e4000
> > page:00000000d3d2ee27 refcount:2 mapcount:1 mapping:0000000000000000
> > index:0x0 pfn:0x100bcf
> > anon flags: 0x300000000080016(referenced|uptodate|lru|swapbacked)
> > raw: 0300000000080016 ffffd0e204021608 ffffd0e208df2e88 ffff8ea04243ec61
> > raw: 0000000000000000 0000000000000000 0000000200000000 0000000000000000
> > page dumped because: gup_test: dump_pages() test
> > DUMP_USER_PAGES_TEST: done
> >
> > root@virtme:/# gup_test  -c -p
> > ---- page #0, starting from user virt addr: 0x7fd19701b000
> > page:00000000baed3c7d refcount:1025 mapcount:1 mapping:0000000000000000
> > index:0x0 pfn:0x108008
> > anon flags: 0x300000000080014(uptodate|lru|swapbacked)
> > raw: 0300000000080014 ffffd0e204200188 ffffd0e205e09088 ffff8ea04243ee71
> > raw: 0000000000000000 0000000000000000 0000040100000000 0000000000000000
> > page dumped because: gup_test: dump_pages() test
> > DUMP_USER_PAGES_TEST: done
> >
> > Refcount shows the difference between pin vs no-pin case.
> > Also change type of nr from int to long, as it counts number of pages.
> >
> > Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
> > ---
> >   mm/gup_test.c                         | 23 ++++++++++-------------
> >   mm/gup_test.h                         |  3 ++-
> >   tools/testing/selftests/vm/gup_test.c | 15 +++++++++++----
> >   3 files changed, 23 insertions(+), 18 deletions(-)
> >
> > diff --git a/mm/gup_test.c b/mm/gup_test.c
> > index e3cf78e5873e..a6ed1c877679 100644
> > --- a/mm/gup_test.c
> > +++ b/mm/gup_test.c
> > @@ -94,7 +94,7 @@ static int __gup_test_ioctl(unsigned int cmd,
> >   {
> >       ktime_t start_time, end_time;
> >       unsigned long i, nr_pages, addr, next;
> > -     int nr;
> > +     long nr;
> >       struct page **pages;
> >       int ret = 0;
> >       bool needs_mmap_lock =
> > @@ -126,37 +126,34 @@ static int __gup_test_ioctl(unsigned int cmd,
> >                       nr = (next - addr) / PAGE_SIZE;
> >               }
> >
> > -             /* Filter out most gup flags: only allow a tiny subset here: */
> > -             gup->flags &= FOLL_WRITE;
> > -
> >               switch (cmd) {
> >               case GUP_FAST_BENCHMARK:
> > -                     nr = get_user_pages_fast(addr, nr, gup->flags,
> > +                     nr = get_user_pages_fast(addr, nr, gup->gup_flags,
> >                                                pages + i);
> >                       break;
> >               case GUP_BASIC_TEST:
> > -                     nr = get_user_pages(addr, nr, gup->flags, pages + i,
> > +                     nr = get_user_pages(addr, nr, gup->gup_flags, pages + i,
> >                                           NULL);
> >                       break;
> >               case PIN_FAST_BENCHMARK:
> > -                     nr = pin_user_pages_fast(addr, nr, gup->flags,
> > +                     nr = pin_user_pages_fast(addr, nr, gup->gup_flags,
> >                                                pages + i);
> >                       break;
> >               case PIN_BASIC_TEST:
> > -                     nr = pin_user_pages(addr, nr, gup->flags, pages + i,
> > +                     nr = pin_user_pages(addr, nr, gup->gup_flags, pages + i,
> >                                           NULL);
> >                       break;
> >               case PIN_LONGTERM_BENCHMARK:
> >                       nr = pin_user_pages(addr, nr,
> > -                                         gup->flags | FOLL_LONGTERM,
> > +                                         gup->gup_flags | FOLL_LONGTERM,
> >                                           pages + i, NULL);
> >                       break;
> >               case DUMP_USER_PAGES_TEST:
> > -                     if (gup->flags & GUP_TEST_FLAG_DUMP_PAGES_USE_PIN)
> > -                             nr = pin_user_pages(addr, nr, gup->flags,
> > +                     if (gup->test_flags & GUP_TEST_FLAG_DUMP_PAGES_USE_PIN)
> > +                             nr = pin_user_pages(addr, nr, gup->gup_flags,
> >                                                   pages + i, NULL);
> >                       else
> > -                             nr = get_user_pages(addr, nr, gup->flags,
> > +                             nr = get_user_pages(addr, nr, gup->gup_flags,
> >                                                   pages + i, NULL);
> >                       break;
> >               default:
> > @@ -187,7 +184,7 @@ static int __gup_test_ioctl(unsigned int cmd,
> >
> >       start_time = ktime_get();
> >
> > -     put_back_pages(cmd, pages, nr_pages, gup->flags);
> > +     put_back_pages(cmd, pages, nr_pages, gup->test_flags);
> >
> >       end_time = ktime_get();
> >       gup->put_delta_usec = ktime_us_delta(end_time, start_time);
> > diff --git a/mm/gup_test.h b/mm/gup_test.h
> > index 90a6713d50eb..887ac1d5f5bc 100644
> > --- a/mm/gup_test.h
> > +++ b/mm/gup_test.h
> > @@ -21,7 +21,8 @@ struct gup_test {
> >       __u64 addr;
> >       __u64 size;
> >       __u32 nr_pages_per_call;
> > -     __u32 flags;
> > +     __u32 gup_flags;
> > +     __u32 test_flags;
> >       /*
> >        * Each non-zero entry is the number of the page (1-based: first page is
> >        * page 1, so that zero entries mean "do nothing") from the .addr base.
> > diff --git a/tools/testing/selftests/vm/gup_test.c b/tools/testing/selftests/vm/gup_test.c
> > index 6c6336dd3b7f..943cc2608dc2 100644
> > --- a/tools/testing/selftests/vm/gup_test.c
> > +++ b/tools/testing/selftests/vm/gup_test.c
> > @@ -37,13 +37,13 @@ int main(int argc, char **argv)
> >   {
> >       struct gup_test gup = { 0 };
> >       unsigned long size = 128 * MB;
> > -     int i, fd, filed, opt, nr_pages = 1, thp = -1, repeats = 1, write = 0;
> > +     int i, fd, filed, opt, nr_pages = 1, thp = -1, repeats = 1, write = 1;
> >       unsigned long cmd = GUP_FAST_BENCHMARK;
> >       int flags = MAP_PRIVATE;
> >       char *file = "/dev/zero";
> >       char *p;
> >
> > -     while ((opt = getopt(argc, argv, "m:r:n:F:f:abctTLUuwSH")) != -1) {
> > +     while ((opt = getopt(argc, argv, "m:r:n:F:f:abctTLUuwWSHp")) != -1) {
> >               switch (opt) {
> >               case 'a':
> >                       cmd = PIN_FAST_BENCHMARK;
> > @@ -65,9 +65,13 @@ int main(int argc, char **argv)
> >                        */
> >                       gup.which_pages[0] = 1;
> >                       break;
> > +             case 'p':
> > +                     /* works only with DUMP_USER_PAGES_TEST */
> > +                     gup.test_flags |= GUP_TEST_FLAG_DUMP_PAGES_USE_PIN;
> > +                     break;
> >               case 'F':
> >                       /* strtol, so you can pass flags in hex form */
> > -                     gup.flags = strtol(optarg, 0, 0);
> > +                     gup.gup_flags = strtol(optarg, 0, 0);
> >                       break;
> >               case 'm':
> >                       size = atoi(optarg) * MB;
> > @@ -93,6 +97,9 @@ int main(int argc, char **argv)
> >               case 'w':
> >                       write = 1;
> >                       break;
> > +             case 'W':
> > +                     write = 0;
> > +                     break;
> >               case 'f':
> >                       file = optarg;
> >                       break;
> > @@ -140,7 +147,7 @@ int main(int argc, char **argv)
> >
> >       gup.nr_pages_per_call = nr_pages;
> >       if (write)
> > -             gup.flags |= FOLL_WRITE;
> > +             gup.gup_flags |= FOLL_WRITE;
> >
> >       fd = open("/sys/kernel/debug/gup_test", O_RDWR);
> >       if (fd == -1) {
> >
>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v7 14/14] selftests/vm: test faulting in kernel, and verify pinnable pages
  2021-01-24 23:18   ` John Hubbard
@ 2021-01-25 14:17       ` Pavel Tatashin
  2021-01-25 14:17       ` Pavel Tatashin
  1 sibling, 0 replies; 22+ messages in thread
From: Pavel Tatashin @ 2021-01-25 14:17 UTC (permalink / raw)
  To: John Hubbard
  Cc: LKML, linux-mm, Andrew Morton, Vlastimil Babka, Michal Hocko,
	David Hildenbrand, Oscar Salvador, Dan Williams, Sasha Levin,
	Tyler Hicks, Joonsoo Kim, Mike Kravetz, Steven Rostedt,
	Ingo Molnar, Jason Gunthorpe, Peter Zijlstra, Mel Gorman,
	Matthew Wilcox, David Rientjes, Linux Doc Mailing List,
	Ira Weiny, linux-kselftest

On Sun, Jan 24, 2021 at 6:18 PM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 1/21/21 7:37 PM, Pavel Tatashin wrote:
> > When pages are pinned they can be faulted in userland and migrated, and
> > they can be faulted right in kernel without migration.
> >
> > In either case, the pinned pages must end-up being pinnable (not movable).
> >
> > Add a new test to gup_test, to help verify that the gup/pup
> > (get_user_pages() / pin_user_pages()) behavior with respect to pinnable
> > and movable pages is reasonable and correct. Specifically, provide a
> > way to:
> >
> > 1) Verify that only "pinnable" pages are pinned. This is checked
> > automatically for you.
> >
> > 2) Verify that gup/pup performance is reasonable. This requires
> > comparing benchmarks between doing gup/pup on pages that have been
> > pre-faulted in from user space, vs. doing gup/pup on pages that are not
> > faulted in until gup/pup time (via FOLL_TOUCH). This decision is
> > controlled with the new -z command line option.
> >
> > Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
> > ---
> >   mm/gup_test.c                         |  6 ++++++
> >   tools/testing/selftests/vm/gup_test.c | 23 +++++++++++++++++++----
> >   2 files changed, 25 insertions(+), 4 deletions(-)
> >
>
> This also looks good. I do see the WARN_ON_ONCE firing in
> internal_get_user_pages_fast(), when running with *only* the new -z
> option.
>
> I'll poke around the rest of the patchset and see if that is expected
> and normal, but either way the test code itself looks correct and seems
> to be passing my set of "run a bunch of different gup_test options" here,
> so feel free to add:
>
> Reviewed-by: John Hubbard <jhubbard@nvidia.com>

Thank you!

Pasha

>
> thanks,
> --
> John Hubbard
> NVIDIA
>
> > diff --git a/mm/gup_test.c b/mm/gup_test.c
> > index a6ed1c877679..d974dec19e1c 100644
> > --- a/mm/gup_test.c
> > +++ b/mm/gup_test.c
> > @@ -52,6 +52,12 @@ static void verify_dma_pinned(unsigned int cmd, struct page **pages,
> >
> >                               dump_page(page, "gup_test failure");
> >                               break;
> > +                     } else if (cmd == PIN_LONGTERM_BENCHMARK &&
> > +                             WARN(!is_pinnable_page(page),
> > +                                  "pages[%lu] is NOT pinnable but pinned\n",
> > +                                  i)) {
> > +                             dump_page(page, "gup_test failure");
> > +                             break;
> >                       }
> >               }
> >               break;
> > diff --git a/tools/testing/selftests/vm/gup_test.c b/tools/testing/selftests/vm/gup_test.c
> > index 943cc2608dc2..1e662d59c502 100644
> > --- a/tools/testing/selftests/vm/gup_test.c
> > +++ b/tools/testing/selftests/vm/gup_test.c
> > @@ -13,6 +13,7 @@
> >
> >   /* Just the flags we need, copied from mm.h: */
> >   #define FOLL_WRITE  0x01    /* check pte is writable */
> > +#define FOLL_TOUCH   0x02    /* mark page accessed */
> >
> >   static char *cmd_to_str(unsigned long cmd)
> >   {
> > @@ -39,11 +40,11 @@ int main(int argc, char **argv)
> >       unsigned long size = 128 * MB;
> >       int i, fd, filed, opt, nr_pages = 1, thp = -1, repeats = 1, write = 1;
> >       unsigned long cmd = GUP_FAST_BENCHMARK;
> > -     int flags = MAP_PRIVATE;
> > +     int flags = MAP_PRIVATE, touch = 0;
> >       char *file = "/dev/zero";
> >       char *p;
> >
> > -     while ((opt = getopt(argc, argv, "m:r:n:F:f:abctTLUuwWSHp")) != -1) {
> > +     while ((opt = getopt(argc, argv, "m:r:n:F:f:abctTLUuwWSHpz")) != -1) {
> >               switch (opt) {
> >               case 'a':
> >                       cmd = PIN_FAST_BENCHMARK;
> > @@ -110,6 +111,10 @@ int main(int argc, char **argv)
> >               case 'H':
> >                       flags |= (MAP_HUGETLB | MAP_ANONYMOUS);
> >                       break;
> > +             case 'z':
> > +                     /* fault pages in gup, do not fault in userland */
> > +                     touch = 1;
> > +                     break;
> >               default:
> >                       return -1;
> >               }
> > @@ -167,8 +172,18 @@ int main(int argc, char **argv)
> >       else if (thp == 0)
> >               madvise(p, size, MADV_NOHUGEPAGE);
> >
> > -     for (; (unsigned long)p < gup.addr + size; p += PAGE_SIZE)
> > -             p[0] = 0;
> > +     /*
> > +      * FOLL_TOUCH, in gup_test, is used as an either/or case: either
> > +      * fault pages in from the kernel via FOLL_TOUCH, or fault them
> > +      * in here, from user space. This allows comparison of performance
> > +      * between those two cases.
> > +      */
> > +     if (touch) {
> > +             gup.gup_flags |= FOLL_TOUCH;
> > +     } else {
> > +             for (; (unsigned long)p < gup.addr + size; p += PAGE_SIZE)
> > +                     p[0] = 0;
> > +     }
> >
> >       /* Only report timing information on the *_BENCHMARK commands: */
> >       if ((cmd == PIN_FAST_BENCHMARK) || (cmd == GUP_FAST_BENCHMARK) ||
> >
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v7 14/14] selftests/vm: test faulting in kernel, and verify pinnable pages
@ 2021-01-25 14:17       ` Pavel Tatashin
  0 siblings, 0 replies; 22+ messages in thread
From: Pavel Tatashin @ 2021-01-25 14:17 UTC (permalink / raw)
  To: John Hubbard
  Cc: LKML, linux-mm, Andrew Morton, Vlastimil Babka, Michal Hocko,
	David Hildenbrand, Oscar Salvador, Dan Williams, Sasha Levin,
	Tyler Hicks, Joonsoo Kim, Mike Kravetz, Steven Rostedt,
	Ingo Molnar, Jason Gunthorpe, Peter Zijlstra, Mel Gorman,
	Matthew Wilcox, David Rientjes, Linux Doc Mailing List,
	Ira Weiny, linux-kselftest

On Sun, Jan 24, 2021 at 6:18 PM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 1/21/21 7:37 PM, Pavel Tatashin wrote:
> > When pages are pinned they can be faulted in userland and migrated, and
> > they can be faulted right in kernel without migration.
> >
> > In either case, the pinned pages must end-up being pinnable (not movable).
> >
> > Add a new test to gup_test, to help verify that the gup/pup
> > (get_user_pages() / pin_user_pages()) behavior with respect to pinnable
> > and movable pages is reasonable and correct. Specifically, provide a
> > way to:
> >
> > 1) Verify that only "pinnable" pages are pinned. This is checked
> > automatically for you.
> >
> > 2) Verify that gup/pup performance is reasonable. This requires
> > comparing benchmarks between doing gup/pup on pages that have been
> > pre-faulted in from user space, vs. doing gup/pup on pages that are not
> > faulted in until gup/pup time (via FOLL_TOUCH). This decision is
> > controlled with the new -z command line option.
> >
> > Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
> > ---
> >   mm/gup_test.c                         |  6 ++++++
> >   tools/testing/selftests/vm/gup_test.c | 23 +++++++++++++++++++----
> >   2 files changed, 25 insertions(+), 4 deletions(-)
> >
>
> This also looks good. I do see the WARN_ON_ONCE firing in
> internal_get_user_pages_fast(), when running with *only* the new -z
> option.
>
> I'll poke around the rest of the patchset and see if that is expected
> and normal, but either way the test code itself looks correct and seems
> to be passing my set of "run a bunch of different gup_test options" here,
> so feel free to add:
>
> Reviewed-by: John Hubbard <jhubbard@nvidia.com>

Thank you!

Pasha

>
> thanks,
> --
> John Hubbard
> NVIDIA
>
> > diff --git a/mm/gup_test.c b/mm/gup_test.c
> > index a6ed1c877679..d974dec19e1c 100644
> > --- a/mm/gup_test.c
> > +++ b/mm/gup_test.c
> > @@ -52,6 +52,12 @@ static void verify_dma_pinned(unsigned int cmd, struct page **pages,
> >
> >                               dump_page(page, "gup_test failure");
> >                               break;
> > +                     } else if (cmd == PIN_LONGTERM_BENCHMARK &&
> > +                             WARN(!is_pinnable_page(page),
> > +                                  "pages[%lu] is NOT pinnable but pinned\n",
> > +                                  i)) {
> > +                             dump_page(page, "gup_test failure");
> > +                             break;
> >                       }
> >               }
> >               break;
> > diff --git a/tools/testing/selftests/vm/gup_test.c b/tools/testing/selftests/vm/gup_test.c
> > index 943cc2608dc2..1e662d59c502 100644
> > --- a/tools/testing/selftests/vm/gup_test.c
> > +++ b/tools/testing/selftests/vm/gup_test.c
> > @@ -13,6 +13,7 @@
> >
> >   /* Just the flags we need, copied from mm.h: */
> >   #define FOLL_WRITE  0x01    /* check pte is writable */
> > +#define FOLL_TOUCH   0x02    /* mark page accessed */
> >
> >   static char *cmd_to_str(unsigned long cmd)
> >   {
> > @@ -39,11 +40,11 @@ int main(int argc, char **argv)
> >       unsigned long size = 128 * MB;
> >       int i, fd, filed, opt, nr_pages = 1, thp = -1, repeats = 1, write = 1;
> >       unsigned long cmd = GUP_FAST_BENCHMARK;
> > -     int flags = MAP_PRIVATE;
> > +     int flags = MAP_PRIVATE, touch = 0;
> >       char *file = "/dev/zero";
> >       char *p;
> >
> > -     while ((opt = getopt(argc, argv, "m:r:n:F:f:abctTLUuwWSHp")) != -1) {
> > +     while ((opt = getopt(argc, argv, "m:r:n:F:f:abctTLUuwWSHpz")) != -1) {
> >               switch (opt) {
> >               case 'a':
> >                       cmd = PIN_FAST_BENCHMARK;
> > @@ -110,6 +111,10 @@ int main(int argc, char **argv)
> >               case 'H':
> >                       flags |= (MAP_HUGETLB | MAP_ANONYMOUS);
> >                       break;
> > +             case 'z':
> > +                     /* fault pages in gup, do not fault in userland */
> > +                     touch = 1;
> > +                     break;
> >               default:
> >                       return -1;
> >               }
> > @@ -167,8 +172,18 @@ int main(int argc, char **argv)
> >       else if (thp == 0)
> >               madvise(p, size, MADV_NOHUGEPAGE);
> >
> > -     for (; (unsigned long)p < gup.addr + size; p += PAGE_SIZE)
> > -             p[0] = 0;
> > +     /*
> > +      * FOLL_TOUCH, in gup_test, is used as an either/or case: either
> > +      * fault pages in from the kernel via FOLL_TOUCH, or fault them
> > +      * in here, from user space. This allows comparison of performance
> > +      * between those two cases.
> > +      */
> > +     if (touch) {
> > +             gup.gup_flags |= FOLL_TOUCH;
> > +     } else {
> > +             for (; (unsigned long)p < gup.addr + size; p += PAGE_SIZE)
> > +                     p[0] = 0;
> > +     }
> >
> >       /* Only report timing information on the *_BENCHMARK commands: */
> >       if ((cmd == PIN_FAST_BENCHMARK) || (cmd == GUP_FAST_BENCHMARK) ||
> >
>


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2021-01-26  6:35 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-22  3:37 [PATCH v7 00/14] prohibit pinning pages in ZONE_MOVABLE Pavel Tatashin
2021-01-22  3:37 ` [PATCH v7 01/14] mm/gup: don't pin migrated cma pages in movable zone Pavel Tatashin
2021-01-22  3:37 ` [PATCH v7 02/14] mm/gup: check every subpage of a compound page during isolation Pavel Tatashin
2021-01-22  3:37 ` [PATCH v7 03/14] mm/gup: return an error on migration failure Pavel Tatashin
2021-01-22  3:37 ` [PATCH v7 04/14] mm/gup: check for isolation errors Pavel Tatashin
2021-01-22  3:37 ` [PATCH v7 05/14] mm cma: rename PF_MEMALLOC_NOCMA to PF_MEMALLOC_PIN Pavel Tatashin
2021-01-22  3:37 ` [PATCH v7 06/14] mm: apply per-task gfp constraints in fast path Pavel Tatashin
2021-01-22  3:37 ` [PATCH v7 07/14] mm: honor PF_MEMALLOC_PIN for all movable pages Pavel Tatashin
2021-01-22  3:37 ` [PATCH v7 08/14] mm/gup: do not migrate zero page Pavel Tatashin
2021-01-22  3:37 ` [PATCH v7 09/14] mm/gup: migrate pinned pages out of movable zone Pavel Tatashin
2021-01-22  3:37 ` [PATCH v7 10/14] memory-hotplug.rst: add a note about ZONE_MOVABLE and page pinning Pavel Tatashin
2021-01-22  3:37 ` [PATCH v7 11/14] mm/gup: change index type to long as it counts pages Pavel Tatashin
2021-01-22  3:37 ` [PATCH v7 12/14] mm/gup: longterm pin migration cleanup Pavel Tatashin
2021-01-22  3:37 ` [PATCH v7 13/14] selftests/vm: test flag is broken Pavel Tatashin
2021-01-24 23:03   ` John Hubbard
2021-01-25 14:16     ` Pavel Tatashin
2021-01-25 14:16       ` Pavel Tatashin
2021-01-22  3:37 ` [PATCH v7 14/14] selftests/vm: test faulting in kernel, and verify pinnable pages Pavel Tatashin
2021-01-24 23:18   ` John Hubbard
2021-01-24 23:40     ` John Hubbard
2021-01-25 14:17     ` Pavel Tatashin
2021-01-25 14:17       ` Pavel Tatashin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.