* [PATCH 0/3] mm/gup: Fix pin page write cache bouncing on has_pinned
@ 2021-05-06 23:25 Peter Xu
  2021-05-06 23:25 ` [PATCH 1/3] mm/gup_benchmark: Support threading Peter Xu
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Peter Xu @ 2021-05-06 23:25 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, peterx, John Hubbard, Jan Kara, Kirill Shutemov,
	Jason Gunthorpe, Andrew Morton, Kirill Tkhai, Michal Hocko,
	Oleg Nesterov, Jann Horn, Linus Torvalds, Matthew Wilcox,
	Andrea Arcangeli

This series contains 3 patches: the 1st one enables threading for gup_benchmark
in the kselftest.  The latter two patches are collected from Andrea's local
branch and fix a write cache bouncing issue with pinning fast-gup.

To be explicit on the latter two patches:

  - the 2nd patch fixes the performance degradation introduced by has_pinned, then

  - the last patch replaces has_pinned with a bit in mm->flags

For patch 3: originally the plan was to grow has_pinned into a counter fairly
soon, but that has not happened so far, which suggests we can drop it until we
really want such a counter for whatever reason.  As the commit message states,
it saves 4 bytes for each mm without observable regressions.

Regarding testing: see the commit message of patch 2 for detailed testing with
will-it-scale.  Meanwhile, I wrote patch 1 so that we can easily verify the
patchset using the existing kselftest facilities, or regression test it in the
future with that repo if we want.

The numbers below are extra verification tests that I did, besides the commit
message of patch 2, using the new gup_benchmark and 256 cpus.  The test below
was done on a 40-cpu host with Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, and I
can get a similar result (of course the write cache bouncing gets more severe
with even more cores).

After patch 1 applied (a test-only patch, so still using the old kernel):

  $ sudo chrt -f 1 ./gup_test -a  -m 512 -j 40
  PIN_FAST_BENCHMARK: Time: get:459632 put:5990 us
  PIN_FAST_BENCHMARK: Time: get:461967 put:5840 us
  PIN_FAST_BENCHMARK: Time: get:464521 put:6140 us
  PIN_FAST_BENCHMARK: Time: get:465176 put:7100 us
  PIN_FAST_BENCHMARK: Time: get:465960 put:6733 us
  PIN_FAST_BENCHMARK: Time: get:465324 put:6781 us
  PIN_FAST_BENCHMARK: Time: get:466018 put:7130 us
  PIN_FAST_BENCHMARK: Time: get:466362 put:7118 us
  PIN_FAST_BENCHMARK: Time: get:465118 put:6975 us
  PIN_FAST_BENCHMARK: Time: get:466422 put:6602 us
  PIN_FAST_BENCHMARK: Time: get:465791 put:6818 us
  PIN_FAST_BENCHMARK: Time: get:467091 put:6298 us
  PIN_FAST_BENCHMARK: Time: get:467694 put:5432 us
  PIN_FAST_BENCHMARK: Time: get:469575 put:5581 us
  PIN_FAST_BENCHMARK: Time: get:468124 put:6055 us
  PIN_FAST_BENCHMARK: Time: get:468877 put:6720 us
  PIN_FAST_BENCHMARK: Time: get:467212 put:4961 us
  PIN_FAST_BENCHMARK: Time: get:467834 put:6697 us
  PIN_FAST_BENCHMARK: Time: get:470778 put:6398 us
  PIN_FAST_BENCHMARK: Time: get:469788 put:6310 us
  PIN_FAST_BENCHMARK: Time: get:488277 put:7113 us
  PIN_FAST_BENCHMARK: Time: get:486613 put:7085 us
  PIN_FAST_BENCHMARK: Time: get:486940 put:7202 us
  PIN_FAST_BENCHMARK: Time: get:488728 put:7101 us
  PIN_FAST_BENCHMARK: Time: get:487570 put:7327 us
  PIN_FAST_BENCHMARK: Time: get:489260 put:7027 us
  PIN_FAST_BENCHMARK: Time: get:488846 put:6866 us
  PIN_FAST_BENCHMARK: Time: get:488521 put:6745 us
  PIN_FAST_BENCHMARK: Time: get:489950 put:6459 us
  PIN_FAST_BENCHMARK: Time: get:489777 put:6617 us
  PIN_FAST_BENCHMARK: Time: get:488224 put:6591 us
  PIN_FAST_BENCHMARK: Time: get:488644 put:6477 us
  PIN_FAST_BENCHMARK: Time: get:488754 put:6711 us
  PIN_FAST_BENCHMARK: Time: get:488875 put:6743 us
  PIN_FAST_BENCHMARK: Time: get:489290 put:6657 us
  PIN_FAST_BENCHMARK: Time: get:490264 put:6684 us
  PIN_FAST_BENCHMARK: Time: get:489631 put:6737 us
  PIN_FAST_BENCHMARK: Time: get:488434 put:6655 us
  PIN_FAST_BENCHMARK: Time: get:492213 put:6297 us
  PIN_FAST_BENCHMARK: Time: get:491124 put:6173 us

After the whole series applied (new fixed kernel):

  $ sudo chrt -f 1 ./gup_test -a  -m 512 -j 40
  PIN_FAST_BENCHMARK: Time: get:82038 put:7041 us
  PIN_FAST_BENCHMARK: Time: get:82144 put:6817 us
  PIN_FAST_BENCHMARK: Time: get:83417 put:6674 us
  PIN_FAST_BENCHMARK: Time: get:82540 put:6594 us
  PIN_FAST_BENCHMARK: Time: get:83214 put:6681 us
  PIN_FAST_BENCHMARK: Time: get:83444 put:6889 us
  PIN_FAST_BENCHMARK: Time: get:83194 put:7499 us
  PIN_FAST_BENCHMARK: Time: get:84876 put:7369 us
  PIN_FAST_BENCHMARK: Time: get:86092 put:10289 us
  PIN_FAST_BENCHMARK: Time: get:86153 put:10415 us
  PIN_FAST_BENCHMARK: Time: get:85026 put:7751 us
  PIN_FAST_BENCHMARK: Time: get:85458 put:7944 us
  PIN_FAST_BENCHMARK: Time: get:85735 put:8154 us
  PIN_FAST_BENCHMARK: Time: get:85851 put:8299 us
  PIN_FAST_BENCHMARK: Time: get:86323 put:9617 us
  PIN_FAST_BENCHMARK: Time: get:86288 put:10496 us
  PIN_FAST_BENCHMARK: Time: get:87697 put:9346 us
  PIN_FAST_BENCHMARK: Time: get:87980 put:8382 us
  PIN_FAST_BENCHMARK: Time: get:88719 put:8400 us
  PIN_FAST_BENCHMARK: Time: get:87616 put:8588 us
  PIN_FAST_BENCHMARK: Time: get:86730 put:9563 us
  PIN_FAST_BENCHMARK: Time: get:88167 put:8673 us
  PIN_FAST_BENCHMARK: Time: get:86844 put:9777 us
  PIN_FAST_BENCHMARK: Time: get:88068 put:11774 us
  PIN_FAST_BENCHMARK: Time: get:86170 put:15676 us
  PIN_FAST_BENCHMARK: Time: get:87967 put:12827 us
  PIN_FAST_BENCHMARK: Time: get:95773 put:7652 us
  PIN_FAST_BENCHMARK: Time: get:87734 put:13650 us
  PIN_FAST_BENCHMARK: Time: get:89833 put:14237 us
  PIN_FAST_BENCHMARK: Time: get:96186 put:8029 us
  PIN_FAST_BENCHMARK: Time: get:95532 put:8886 us
  PIN_FAST_BENCHMARK: Time: get:95351 put:5826 us
  PIN_FAST_BENCHMARK: Time: get:96401 put:8407 us
  PIN_FAST_BENCHMARK: Time: get:96473 put:8287 us
  PIN_FAST_BENCHMARK: Time: get:97177 put:8430 us
  PIN_FAST_BENCHMARK: Time: get:98120 put:5263 us
  PIN_FAST_BENCHMARK: Time: get:96271 put:7757 us
  PIN_FAST_BENCHMARK: Time: get:99628 put:10467 us
  PIN_FAST_BENCHMARK: Time: get:99344 put:10045 us
  PIN_FAST_BENCHMARK: Time: get:94212 put:15485 us

Summary:

  Old kernel: 477729.97 (+-3.79%)
  New kernel:  89144.65 (+-11.76%)

I'm not sure whether I should add a Fixes tag for patch 2.  If so, it would be:

Fixes: 008cfe4418b3d ("mm: Introduce mm_struct.has_pinned")

Then cc stable for 5.9+.  However I'll skip adding it if no one asks, as this
is a perf fix, and frequent+concurrent pinning should not really happen that
much.

Please review, thanks.

Andrea Arcangeli (2):
  mm: gup: allow FOLL_PIN to scale in SMP
  mm: gup: pack has_pinned in MMF_HAS_PINNED

Peter Xu (1):
  mm/gup_benchmark: Support threading

 fs/proc/task_mmu.c                    |  2 +-
 include/linux/mm.h                    |  2 +-
 include/linux/mm_types.h              | 10 ---
 include/linux/sched/coredump.h        |  1 +
 kernel/fork.c                         |  1 -
 mm/gup.c                              |  9 +--
 tools/testing/selftests/vm/gup_test.c | 94 ++++++++++++++++++---------
 7 files changed, 71 insertions(+), 48 deletions(-)

-- 
2.31.1



^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/3] mm/gup_benchmark: Support threading
  2021-05-06 23:25 [PATCH 0/3] mm/gup: Fix pin page write cache bouncing on has_pinned Peter Xu
@ 2021-05-06 23:25 ` Peter Xu
  2021-05-07  4:37   ` John Hubbard
  2021-05-06 23:25 ` [PATCH 2/3] mm: gup: allow FOLL_PIN to scale in SMP Peter Xu
  2021-05-06 23:25 ` [PATCH 3/3] mm: gup: pack has_pinned in MMF_HAS_PINNED Peter Xu
  2 siblings, 1 reply; 15+ messages in thread
From: Peter Xu @ 2021-05-06 23:25 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, peterx, John Hubbard, Jan Kara, Kirill Shutemov,
	Jason Gunthorpe, Andrew Morton, Kirill Tkhai, Michal Hocko,
	Oleg Nesterov, Jann Horn, Linus Torvalds, Matthew Wilcox,
	Andrea Arcangeli

Add a new parameter "-j N" to support concurrent gup test.
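
For example, an illustrative invocation that runs the pin-fast benchmark with
8 threads (the other flags follow the existing usage; adjust as needed):

  $ sudo ./gup_test -a -m 512 -j 8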

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 tools/testing/selftests/vm/gup_test.c | 94 ++++++++++++++++++---------
 1 file changed, 63 insertions(+), 31 deletions(-)

diff --git a/tools/testing/selftests/vm/gup_test.c b/tools/testing/selftests/vm/gup_test.c
index 1e662d59c502a..3f0fcc697c3fc 100644
--- a/tools/testing/selftests/vm/gup_test.c
+++ b/tools/testing/selftests/vm/gup_test.c
@@ -6,6 +6,7 @@
 #include <sys/mman.h>
 #include <sys/stat.h>
 #include <sys/types.h>
+#include <pthread.h>
 #include "../../../../mm/gup_test.h"
 
 #define MB (1UL << 20)
@@ -15,6 +16,12 @@
 #define FOLL_WRITE	0x01	/* check pte is writable */
 #define FOLL_TOUCH	0x02	/* mark page accessed */
 
+static unsigned long cmd = GUP_FAST_BENCHMARK;
+static int gup_fd, repeats = 1;
+static unsigned long size = 128 * MB;
+/* Serialize prints */
+static pthread_mutex_t print_mutex = PTHREAD_MUTEX_INITIALIZER;
+
 static char *cmd_to_str(unsigned long cmd)
 {
 	switch (cmd) {
@@ -34,17 +41,55 @@ static char *cmd_to_str(unsigned long cmd)
 	return "Unknown command";
 }
 
+void *gup_thread(void *data)
+{
+	struct gup_test gup = *(struct gup_test *)data;
+	int i;
+
+	/* Only report timing information on the *_BENCHMARK commands: */
+	if ((cmd == PIN_FAST_BENCHMARK) || (cmd == GUP_FAST_BENCHMARK) ||
+	     (cmd == PIN_LONGTERM_BENCHMARK)) {
+		for (i = 0; i < repeats; i++) {
+			gup.size = size;
+			if (ioctl(gup_fd, cmd, &gup))
+				perror("ioctl"), exit(1);
+
+			pthread_mutex_lock(&print_mutex);
+			printf("%s: Time: get:%lld put:%lld us",
+			       cmd_to_str(cmd), gup.get_delta_usec,
+			       gup.put_delta_usec);
+			if (gup.size != size)
+				printf(", truncated (size: %lld)", gup.size);
+			printf("\n");
+			pthread_mutex_unlock(&print_mutex);
+		}
+	} else {
+		gup.size = size;
+		if (ioctl(gup_fd, cmd, &gup)) {
+			perror("ioctl");
+			exit(1);
+		}
+
+		pthread_mutex_lock(&print_mutex);
+		printf("%s: done\n", cmd_to_str(cmd));
+		if (gup.size != size)
+			printf("Truncated (size: %lld)\n", gup.size);
+		pthread_mutex_unlock(&print_mutex);
+	}
+
+	return NULL;
+}
+
 int main(int argc, char **argv)
 {
 	struct gup_test gup = { 0 };
-	unsigned long size = 128 * MB;
-	int i, fd, filed, opt, nr_pages = 1, thp = -1, repeats = 1, write = 1;
-	unsigned long cmd = GUP_FAST_BENCHMARK;
+	int filed, i, opt, nr_pages = 1, thp = -1, write = 1, threads = 1;
 	int flags = MAP_PRIVATE, touch = 0;
 	char *file = "/dev/zero";
+	pthread_t *tid;
 	char *p;
 
-	while ((opt = getopt(argc, argv, "m:r:n:F:f:abctTLUuwWSHpz")) != -1) {
+	while ((opt = getopt(argc, argv, "m:r:n:F:f:abcj:tTLUuwWSHpz")) != -1) {
 		switch (opt) {
 		case 'a':
 			cmd = PIN_FAST_BENCHMARK;
@@ -74,6 +119,9 @@ int main(int argc, char **argv)
 			/* strtol, so you can pass flags in hex form */
 			gup.gup_flags = strtol(optarg, 0, 0);
 			break;
+		case 'j':
+			threads = atoi(optarg);
+			break;
 		case 'm':
 			size = atoi(optarg) * MB;
 			break;
@@ -154,8 +202,8 @@ int main(int argc, char **argv)
 	if (write)
 		gup.gup_flags |= FOLL_WRITE;
 
-	fd = open("/sys/kernel/debug/gup_test", O_RDWR);
-	if (fd == -1) {
+	gup_fd = open("/sys/kernel/debug/gup_test", O_RDWR);
+	if (gup_fd == -1) {
 		perror("open");
 		exit(1);
 	}
@@ -185,32 +233,16 @@ int main(int argc, char **argv)
 			p[0] = 0;
 	}
 
-	/* Only report timing information on the *_BENCHMARK commands: */
-	if ((cmd == PIN_FAST_BENCHMARK) || (cmd == GUP_FAST_BENCHMARK) ||
-	     (cmd == PIN_LONGTERM_BENCHMARK)) {
-		for (i = 0; i < repeats; i++) {
-			gup.size = size;
-			if (ioctl(fd, cmd, &gup))
-				perror("ioctl"), exit(1);
-
-			printf("%s: Time: get:%lld put:%lld us",
-			       cmd_to_str(cmd), gup.get_delta_usec,
-			       gup.put_delta_usec);
-			if (gup.size != size)
-				printf(", truncated (size: %lld)", gup.size);
-			printf("\n");
-		}
-	} else {
-		gup.size = size;
-		if (ioctl(fd, cmd, &gup)) {
-			perror("ioctl");
-			exit(1);
-		}
-
-		printf("%s: done\n", cmd_to_str(cmd));
-		if (gup.size != size)
-			printf("Truncated (size: %lld)\n", gup.size);
+	tid = malloc(sizeof(pthread_t) * threads);
+	if (!tid) {
+		perror("malloc()");
+		exit(1);
 	}
+	for (i = 0; i < threads; i++)
+		pthread_create(&tid[i], NULL, gup_thread, &gup);
+	for (i = 0; i < threads; i++)
+		pthread_join(tid[i], NULL);
+	free(tid);
 
 	return 0;
 }
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/3] mm: gup: allow FOLL_PIN to scale in SMP
  2021-05-06 23:25 [PATCH 0/3] mm/gup: Fix pin page write cache bouncing on has_pinned Peter Xu
  2021-05-06 23:25 ` [PATCH 1/3] mm/gup_benchmark: Support threading Peter Xu
@ 2021-05-06 23:25 ` Peter Xu
  2021-05-07  2:34     ` Linus Torvalds
  2021-05-07  6:07   ` John Hubbard
  2021-05-06 23:25 ` [PATCH 3/3] mm: gup: pack has_pinned in MMF_HAS_PINNED Peter Xu
  2 siblings, 2 replies; 15+ messages in thread
From: Peter Xu @ 2021-05-06 23:25 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, peterx, John Hubbard, Jan Kara, Kirill Shutemov,
	Jason Gunthorpe, Andrew Morton, Kirill Tkhai, Michal Hocko,
	Oleg Nesterov, Jann Horn, Linus Torvalds, Matthew Wilcox,
	Andrea Arcangeli

From: Andrea Arcangeli <aarcange@redhat.com>

has_pinned cannot be written by each pin-fast or it won't scale in
SMP. This isn't "false sharing" strictly speaking (it's more like
"true non-sharing"), but it creates the same SMP scalability
bottleneck as "false sharing".
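
In code terms the fix is simply to read the flag before writing it, so that
only the first pin of an mm dirties the shared cacheline and later pin-fast
calls stay read-only.  A minimal sketch of the pattern (the real change is
the small diff below):

 	/* before: every pin-fast writes mm->has_pinned, bouncing the line */
 	if (flags & FOLL_PIN)
 		atomic_set(&mm->has_pinned, 1);

 	/* after: write at most once per mm; subsequent pins only read */
 	if ((flags & FOLL_PIN) && !atomic_read(&mm->has_pinned))
 		atomic_set(&mm->has_pinned, 1);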

To verify the improvement a new "pin_fast.c" program was added to
the will-it-scale benchmark.

== pin_fast.c - start ==
 /* SPDX-License-Identifier: GPL-2.0-or-later */
 #define _GNU_SOURCE
 #include <fcntl.h>
 #include <stdlib.h>
 #include <unistd.h>
 #include <string.h>
 #include <assert.h>
 #include <sys/prctl.h>
 #include <sys/ioctl.h>
 #include <sys/mman.h>

 /* exercises pin_user_pages_fast, requires a kernel with CONFIG_GUP_TEST=y */
 char *testcase_description = "pin_user_pages_fast SMP scalability benchmark";

 static int gup_fd;
 #define NR_PAGES 1024
 #define BUFLEN (getpagesize() * NR_PAGES)

 #define GUP_TEST_MAX_PAGES_TO_DUMP		8
 #define PIN_FAST_BENCHMARK	_IOWR('g', 2, struct gup_test)

 struct gup_test {
 	__u64 get_delta_usec;
 	__u64 put_delta_usec;
 	__u64 addr;
 	__u64 size;
 	__u32 nr_pages_per_call;
 	__u32 flags;
 	/*
 	 * Each non-zero entry is the number of the page (1-based: first page is
 	 * page 1, so that zero entries mean "do nothing") from the .addr base.
 	 */
 	__u32 which_pages[GUP_TEST_MAX_PAGES_TO_DUMP];
 };

 void testcase_prepare(unsigned long nr_tasks)
 {
 	gup_fd = open("/sys/kernel/debug/gup_test", O_RDWR);
 	assert(gup_fd >= 0);
 }

 void testcase(unsigned long long *iterations, unsigned long nr)
 {
 	char *p = aligned_alloc(getpagesize()*512, BUFLEN);
 	assert(p);
 	assert(!madvise(p, BUFLEN, MADV_HUGEPAGE));
 	for (int i = 0; i < NR_PAGES; i++)
 		p[getpagesize()*i] = 0;

 	struct gup_test gup = {
 		.size = BUFLEN,
 		.addr = (unsigned long)p,
 		.nr_pages_per_call = 1,
 	};

 	while (1) {
 		assert(!ioctl(gup_fd, PIN_FAST_BENCHMARK, &gup));

 		(*iterations)++;
 	}

 	free(p);
 }

 void testcase_cleanup(void)
 {
 	assert(!close(gup_fd));
 }
== pin_fast.c - end ==

The pin_fast will-it-scale benchmark was run with 1 thread per CPU
on this 2-NUMA-node system:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
node 0 size: 128792 MB
node 0 free: 126741 MB
node 1 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255
node 1 size: 128944 MB
node 1 free: 127330 MB
node distances:
node   0   1
  0:  10  32
  1:  32  10

Before this commit (average 25617 +- 0.16%):

tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
256,0,0.00,25641,0.17,0
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
256,0,0.00,25652,0.16,0
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
256,0,0.00,25559,0.16,0

After this commit (average 1194790 +- 0.11%):

tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
256,0,0.00,1196513,0.19,0
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
256,0,0.00,1194664,0.19,0
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
256,0,0.00,1193194,0.19,0

This commit increases the SMP scalability of pin_user_pages_fast()
executed by different threads of the same process by more than 4000%.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/gup.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 63a079e361a3d..8b513e1723b45 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1292,7 +1292,7 @@ static __always_inline long __get_user_pages_locked(struct mm_struct *mm,
 		BUG_ON(*locked != 1);
 	}
 
-	if (flags & FOLL_PIN)
+	if (flags & FOLL_PIN && !atomic_read(&mm->has_pinned))
 		atomic_set(&mm->has_pinned, 1);
 
 	/*
@@ -2617,7 +2617,7 @@ static int internal_get_user_pages_fast(unsigned long start,
 				       FOLL_FAST_ONLY)))
 		return -EINVAL;
 
-	if (gup_flags & FOLL_PIN)
+	if (gup_flags & FOLL_PIN && !atomic_read(&current->mm->has_pinned))
 		atomic_set(&current->mm->has_pinned, 1);
 
 	if (!(gup_flags & FOLL_FAST_ONLY))
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 3/3] mm: gup: pack has_pinned in MMF_HAS_PINNED
  2021-05-06 23:25 [PATCH 0/3] mm/gup: Fix pin page write cache bouncing on has_pinned Peter Xu
  2021-05-06 23:25 ` [PATCH 1/3] mm/gup_benchmark: Support threading Peter Xu
  2021-05-06 23:25 ` [PATCH 2/3] mm: gup: allow FOLL_PIN to scale in SMP Peter Xu
@ 2021-05-06 23:25 ` Peter Xu
  2021-05-07  6:42   ` John Hubbard
  2 siblings, 1 reply; 15+ messages in thread
From: Peter Xu @ 2021-05-06 23:25 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, peterx, John Hubbard, Jan Kara, Kirill Shutemov,
	Jason Gunthorpe, Andrew Morton, Kirill Tkhai, Michal Hocko,
	Oleg Nesterov, Jann Horn, Linus Torvalds, Matthew Wilcox,
	Andrea Arcangeli

From: Andrea Arcangeli <aarcange@redhat.com>

The 32-bit has_pinned field can be packed into the MMF_HAS_PINNED bit as a
no-op cleanup.

Any atomic_inc/dec to the mm cacheline shared by all threads in
pin-fast would reintroduce a loss of SMP scalability to pin-fast, so
there's no potential future benefit in keeping an atomic in the mm for
this.

set_bit(MMF_HAS_PINNED) is theoretically a bit slower than
WRITE_ONCE (atomic_set is equivalent to WRITE_ONCE), but the set_bit
(just like atomic_set after the previous commit) still has to be issued
only once per "mm", so the difference between the two will be lost in
the noise.

will-it-scale "mmap2" shows no change in performance with enterprise
config as expected.

will-it-scale "pin_fast" retains the > 4000% SMP scalability
performance improvement against upstream as expected.

This is a noop as far as overall performance and SMP scalability are
concerned.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
[peterx: Fix build for task_mmu.c]
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 fs/proc/task_mmu.c             |  2 +-
 include/linux/mm.h             |  2 +-
 include/linux/mm_types.h       | 10 ----------
 include/linux/sched/coredump.h |  1 +
 kernel/fork.c                  |  1 -
 mm/gup.c                       |  9 +++++----
 6 files changed, 8 insertions(+), 17 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 4c95cc57a66a8..6144571942db9 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1049,7 +1049,7 @@ static inline bool pte_is_pinned(struct vm_area_struct *vma, unsigned long addr,
 		return false;
 	if (!is_cow_mapping(vma->vm_flags))
 		return false;
-	if (likely(!atomic_read(&vma->vm_mm->has_pinned)))
+	if (likely(!test_bit(MMF_HAS_PINNED, &vma->vm_mm->flags)))
 		return false;
 	page = vm_normal_page(vma, addr, pte);
 	if (!page)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d6790ab0cf575..94dc84f6d8658 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1331,7 +1331,7 @@ static inline bool page_needs_cow_for_dma(struct vm_area_struct *vma,
 	if (!is_cow_mapping(vma->vm_flags))
 		return false;
 
-	if (!atomic_read(&vma->vm_mm->has_pinned))
+	if (!test_bit(MMF_HAS_PINNED, &vma->vm_mm->flags))
 		return false;
 
 	return page_maybe_dma_pinned(page);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6613b26a88946..15d79858fadbd 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -435,16 +435,6 @@ struct mm_struct {
 		 */
 		atomic_t mm_count;
 
-		/**
-		 * @has_pinned: Whether this mm has pinned any pages.  This can
-		 * be either replaced in the future by @pinned_vm when it
-		 * becomes stable, or grow into a counter on its own. We're
-		 * aggresive on this bit now - even if the pinned pages were
-		 * unpinned later on, we'll still keep this bit set for the
-		 * lifecycle of this mm just for simplicity.
-		 */
-		atomic_t has_pinned;
-
 		/**
 		 * @write_protect_seq: Locked when any thread is write
 		 * protecting pages mapped by this mm to enforce a later COW,
diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
index dfd82eab29025..bf45badd63e6d 100644
--- a/include/linux/sched/coredump.h
+++ b/include/linux/sched/coredump.h
@@ -73,6 +73,7 @@ static inline int get_dumpable(struct mm_struct *mm)
 #define MMF_OOM_VICTIM		25	/* mm is the oom victim */
 #define MMF_OOM_REAP_QUEUED	26	/* mm was queued for oom_reaper */
 #define MMF_MULTIPROCESS	27	/* mm is shared between processes */
+#define MMF_HAS_PINNED		28	/* FOLL_PIN has run, never cleared */
 #define MMF_DISABLE_THP_MASK	(1 << MMF_DISABLE_THP)
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\
diff --git a/kernel/fork.c b/kernel/fork.c
index 502dc046fbc62..a71e73707ef59 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1026,7 +1026,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	mm_pgtables_bytes_init(mm);
 	mm->map_count = 0;
 	mm->locked_vm = 0;
-	atomic_set(&mm->has_pinned, 0);
 	atomic64_set(&mm->pinned_vm, 0);
 	memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
 	spin_lock_init(&mm->page_table_lock);
diff --git a/mm/gup.c b/mm/gup.c
index 8b513e1723b45..78416b0909873 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1292,8 +1292,8 @@ static __always_inline long __get_user_pages_locked(struct mm_struct *mm,
 		BUG_ON(*locked != 1);
 	}
 
-	if (flags & FOLL_PIN && !atomic_read(&mm->has_pinned))
-		atomic_set(&mm->has_pinned, 1);
+	if (flags & FOLL_PIN && !test_bit(MMF_HAS_PINNED, &mm->flags))
+		set_bit(MMF_HAS_PINNED, &mm->flags);
 
 	/*
 	 * FOLL_PIN and FOLL_GET are mutually exclusive. Traditional behavior
@@ -2617,8 +2617,9 @@ static int internal_get_user_pages_fast(unsigned long start,
 				       FOLL_FAST_ONLY)))
 		return -EINVAL;
 
-	if (gup_flags & FOLL_PIN && !atomic_read(&current->mm->has_pinned))
-		atomic_set(&current->mm->has_pinned, 1);
+	if (gup_flags & FOLL_PIN &&
+	    !test_bit(MMF_HAS_PINNED, &current->mm->flags))
+		set_bit(MMF_HAS_PINNED, &current->mm->flags);
 
 	if (!(gup_flags & FOLL_FAST_ONLY))
 		might_lock_read(&current->mm->mmap_lock);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/3] mm: gup: allow FOLL_PIN to scale in SMP
  2021-05-06 23:25 ` [PATCH 2/3] mm: gup: allow FOLL_PIN to scale in SMP Peter Xu
@ 2021-05-07  2:34     ` Linus Torvalds
  2021-05-07  6:07   ` John Hubbard
  1 sibling, 0 replies; 15+ messages in thread
From: Linus Torvalds @ 2021-05-07  2:34 UTC (permalink / raw)
  To: Peter Xu
  Cc: Linux-MM, Linux Kernel Mailing List, Hugh Dickins, John Hubbard,
	Jan Kara, Kirill Shutemov, Jason Gunthorpe, Andrew Morton,
	Kirill Tkhai, Michal Hocko, Oleg Nesterov, Jann Horn,
	Matthew Wilcox, Andrea Arcangeli

On Thu, May 6, 2021 at 4:25 PM Peter Xu <peterx@redhat.com> wrote:
>
> +       if (flags & FOLL_PIN && !atomic_read(&mm->has_pinned))

Please add parentheses to clarify code like this and make the grouping
much more visually obvious.

Yes, yes, '&' has higher precedence than '&&'. This is not about the
compiler, this is about the humans reading it. So please write it as

        if ((flags & FOLL_PIN) && !atomic_read(&mm->has_pinned))

instead (this problem remains - and the fix is the same - in the 3/3 patch).

Otherwise the series looks fine to me (although admittedly I do find
the commit message to be ridiculously verbose for such a trivial patch
- at some point the actual _point_ of it all gets hidden in the long
commit message).

             Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] mm/gup_benchmark: Support threading
  2021-05-06 23:25 ` [PATCH 1/3] mm/gup_benchmark: Support threading Peter Xu
@ 2021-05-07  4:37   ` John Hubbard
  2021-05-07 14:04     ` Peter Xu
  0 siblings, 1 reply; 15+ messages in thread
From: John Hubbard @ 2021-05-07  4:37 UTC (permalink / raw)
  To: Peter Xu, linux-mm, linux-kernel
  Cc: Hugh Dickins, Jan Kara, Kirill Shutemov, Jason Gunthorpe,
	Andrew Morton, Kirill Tkhai, Michal Hocko, Oleg Nesterov,
	Jann Horn, Linus Torvalds, Matthew Wilcox, Andrea Arcangeli

On 5/6/21 4:25 PM, Peter Xu wrote:
> Add a new parameter "-j N" to support concurrent gup test.

I had a long-standing personal TODO item that said, "decide if it's
worth it, to add pthread support to gup_test". So now, at least one
other person has decided that it *is* worth it, and you've also written
the patch. OK, then. Sweet! :)

This looks correct to me. I have a couple of minor comments below, but
either way you can add:

Reviewed-by: John Hubbard <jhubbard@nvidia.com>


> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   tools/testing/selftests/vm/gup_test.c | 94 ++++++++++++++++++---------
>   1 file changed, 63 insertions(+), 31 deletions(-)
> 
> diff --git a/tools/testing/selftests/vm/gup_test.c b/tools/testing/selftests/vm/gup_test.c
> index 1e662d59c502a..3f0fcc697c3fc 100644
> --- a/tools/testing/selftests/vm/gup_test.c
> +++ b/tools/testing/selftests/vm/gup_test.c
> @@ -6,6 +6,7 @@
>   #include <sys/mman.h>
>   #include <sys/stat.h>
>   #include <sys/types.h>
> +#include <pthread.h>
>   #include "../../../../mm/gup_test.h"
>   
>   #define MB (1UL << 20)
> @@ -15,6 +16,12 @@
>   #define FOLL_WRITE	0x01	/* check pte is writable */
>   #define FOLL_TOUCH	0x02	/* mark page accessed */
>   
> +static unsigned long cmd = GUP_FAST_BENCHMARK;
> +static int gup_fd, repeats = 1;
> +static unsigned long size = 128 * MB;
> +/* Serialize prints */
> +static pthread_mutex_t print_mutex = PTHREAD_MUTEX_INITIALIZER;
> +
>   static char *cmd_to_str(unsigned long cmd)
>   {
>   	switch (cmd) {
> @@ -34,17 +41,55 @@ static char *cmd_to_str(unsigned long cmd)
>   	return "Unknown command";
>   }
>   
> +void *gup_thread(void *data)
> +{
> +	struct gup_test gup = *(struct gup_test *)data;
> +	int i;
> +
> +	/* Only report timing information on the *_BENCHMARK commands: */
> +	if ((cmd == PIN_FAST_BENCHMARK) || (cmd == GUP_FAST_BENCHMARK) ||
> +	     (cmd == PIN_LONGTERM_BENCHMARK)) {
> +		for (i = 0; i < repeats; i++) {
> +			gup.size = size;
> +			if (ioctl(gup_fd, cmd, &gup))
> +				perror("ioctl"), exit(1);
> +
> +			pthread_mutex_lock(&print_mutex);
> +			printf("%s: Time: get:%lld put:%lld us",
> +			       cmd_to_str(cmd), gup.get_delta_usec,
> +			       gup.put_delta_usec);
> +			if (gup.size != size)
> +				printf(", truncated (size: %lld)", gup.size);
> +			printf("\n");
> +			pthread_mutex_unlock(&print_mutex);
> +		}
> +	} else {
> +		gup.size = size;
> +		if (ioctl(gup_fd, cmd, &gup)) {
> +			perror("ioctl");
> +			exit(1);
> +		}
> +
> +		pthread_mutex_lock(&print_mutex);
> +		printf("%s: done\n", cmd_to_str(cmd));
> +		if (gup.size != size)
> +			printf("Truncated (size: %lld)\n", gup.size);
> +		pthread_mutex_unlock(&print_mutex);
> +	}
> +
> +	return NULL;
> +}
> +
>   int main(int argc, char **argv)
>   {
>   	struct gup_test gup = { 0 };
> -	unsigned long size = 128 * MB;
> -	int i, fd, filed, opt, nr_pages = 1, thp = -1, repeats = 1, write = 1;
> -	unsigned long cmd = GUP_FAST_BENCHMARK;
> +	int filed, i, opt, nr_pages = 1, thp = -1, write = 1, threads = 1;


"nthreads" is a little more accurate for this.


>   	int flags = MAP_PRIVATE, touch = 0;
>   	char *file = "/dev/zero";
> +	pthread_t *tid;
>   	char *p;
>   
> -	while ((opt = getopt(argc, argv, "m:r:n:F:f:abctTLUuwWSHpz")) != -1) {
> +	while ((opt = getopt(argc, argv, "m:r:n:F:f:abcj:tTLUuwWSHpz")) != -1) {
>   		switch (opt) {
>   		case 'a':
>   			cmd = PIN_FAST_BENCHMARK;
> @@ -74,6 +119,9 @@ int main(int argc, char **argv)
>   			/* strtol, so you can pass flags in hex form */
>   			gup.gup_flags = strtol(optarg, 0, 0);
>   			break;
> +		case 'j':
> +			threads = atoi(optarg);
> +			break;
>   		case 'm':
>   			size = atoi(optarg) * MB;
>   			break;
> @@ -154,8 +202,8 @@ int main(int argc, char **argv)
>   	if (write)
>   		gup.gup_flags |= FOLL_WRITE;
>   
> -	fd = open("/sys/kernel/debug/gup_test", O_RDWR);
> -	if (fd == -1) {
> +	gup_fd = open("/sys/kernel/debug/gup_test", O_RDWR);
> +	if (gup_fd == -1) {
>   		perror("open");
>   		exit(1);
>   	}
> @@ -185,32 +233,16 @@ int main(int argc, char **argv)
>   			p[0] = 0;
>   	}
>   
> -	/* Only report timing information on the *_BENCHMARK commands: */
> -	if ((cmd == PIN_FAST_BENCHMARK) || (cmd == GUP_FAST_BENCHMARK) ||
> -	     (cmd == PIN_LONGTERM_BENCHMARK)) {
> -		for (i = 0; i < repeats; i++) {
> -			gup.size = size;
> -			if (ioctl(fd, cmd, &gup))
> -				perror("ioctl"), exit(1);
> -
> -			printf("%s: Time: get:%lld put:%lld us",
> -			       cmd_to_str(cmd), gup.get_delta_usec,
> -			       gup.put_delta_usec);
> -			if (gup.size != size)
> -				printf(", truncated (size: %lld)", gup.size);
> -			printf("\n");
> -		}
> -	} else {
> -		gup.size = size;
> -		if (ioctl(fd, cmd, &gup)) {
> -			perror("ioctl");
> -			exit(1);
> -		}
> -
> -		printf("%s: done\n", cmd_to_str(cmd));
> -		if (gup.size != size)
> -			printf("Truncated (size: %lld)\n", gup.size);
> +	tid = malloc(sizeof(pthread_t) * threads);
> +	if (!tid) {
> +		perror("malloc()");
> +		exit(1);
>   	}
> +	for (i = 0; i < threads; i++)
> +		pthread_create(&tid[i], NULL, gup_thread, &gup);

It might be nice to check the return value of pthread_create(), and
just exit out with an error if it fails. On the other hand, it seems
spectacularly unlikely for it to fail here...on the other-other hand,
I always say that, just before writing code that doesn't check an
error, and the error somehow happens anyway and costs someone some
wasted time.
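
Something along these lines, for instance (just a sketch of the suggested
check, reusing the names from this patch):

 	for (i = 0; i < threads; i++) {
 		if (pthread_create(&tid[i], NULL, gup_thread, &gup)) {
 			perror("pthread_create");
 			exit(1);
 		}
 	}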

Your call.

> +	for (i = 0; i < threads; i++)
> +		pthread_join(tid[i], NULL);
> +	free(tid);
>   
>   	return 0;
>   }
> 

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/3] mm: gup: allow FOLL_PIN to scale in SMP
  2021-05-06 23:25 ` [PATCH 2/3] mm: gup: allow FOLL_PIN to scale in SMP Peter Xu
  2021-05-07  2:34     ` Linus Torvalds
@ 2021-05-07  6:07   ` John Hubbard
  2021-05-07 14:13     ` Peter Xu
  1 sibling, 1 reply; 15+ messages in thread
From: John Hubbard @ 2021-05-07  6:07 UTC (permalink / raw)
  To: Peter Xu, linux-mm, linux-kernel
  Cc: Hugh Dickins, Jan Kara, Kirill Shutemov, Jason Gunthorpe,
	Andrew Morton, Kirill Tkhai, Michal Hocko, Oleg Nesterov,
	Jann Horn, Linus Torvalds, Matthew Wilcox, Andrea Arcangeli

On 5/6/21 4:25 PM, Peter Xu wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> has_pinned cannot be written by each pin-fast or it won't scale in
> SMP. This isn't "false sharing" strictly speaking (it's more like
> "true non-sharing"), but it creates the same SMP scalability
> bottleneck of "false sharing".
> 
> To verify the improvement a new "pin_fast.c" program was added to
> the will-it-scale benchmark.
...
> 
> This commits increases the SMP scalability of pin_user_pages_fast()
> executed by different threads of the same process by more than 4000%.
> 

Remarkable! I mean, yes, everyone knows that atomic writes are
"expensive", but this is a fun, dramatic example of just *how*
expensive they can get, once you start doing contended atomic writes.


Reviewed-by: John Hubbard <jhubbard@nvidia.com>

Other notes, that don't have any effect on the above reviewed-by
tag:

On the commit log, I will add a "+1" to the idea of deleting the
pin_fast.c contents from the commit log, and just providing a URL
instead. No need to put C programs in the commit log, IMHO, especially
when you have them elsewhere anyway.


thanks,
--
John Hubbard
NVIDIA


> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   mm/gup.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 63a079e361a3d..8b513e1723b45 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1292,7 +1292,7 @@ static __always_inline long __get_user_pages_locked(struct mm_struct *mm,
>   		BUG_ON(*locked != 1);
>   	}
>   
> -	if (flags & FOLL_PIN)
> +	if (flags & FOLL_PIN && !atomic_read(&mm->has_pinned))
>   		atomic_set(&mm->has_pinned, 1);
>   
>   	/*
> @@ -2617,7 +2617,7 @@ static int internal_get_user_pages_fast(unsigned long start,
>   				       FOLL_FAST_ONLY)))
>   		return -EINVAL;
>   
> -	if (gup_flags & FOLL_PIN)
> +	if (gup_flags & FOLL_PIN && !atomic_read(&current->mm->has_pinned))
>   		atomic_set(&current->mm->has_pinned, 1);
>   
>   	if (!(gup_flags & FOLL_FAST_ONLY))
> 

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/3] mm: gup: pack has_pinned in MMF_HAS_PINNED
  2021-05-06 23:25 ` [PATCH 3/3] mm: gup: pack has_pinned in MMF_HAS_PINNED Peter Xu
@ 2021-05-07  6:42   ` John Hubbard
  2021-05-07  7:39       ` Linus Torvalds
                       ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: John Hubbard @ 2021-05-07  6:42 UTC (permalink / raw)
  To: Peter Xu, linux-mm, linux-kernel
  Cc: Hugh Dickins, Jan Kara, Kirill Shutemov, Jason Gunthorpe,
	Andrew Morton, Kirill Tkhai, Michal Hocko, Oleg Nesterov,
	Jann Horn, Linus Torvalds, Matthew Wilcox, Andrea Arcangeli

On 5/6/21 4:25 PM, Peter Xu wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> has_pinned 32bit can be packed in the MMF_HAS_PINNED bit as a noop
> cleanup.
> 
> Any atomic_inc/dec to the mm cacheline shared by all threads in
> pin-fast would reintroduce a loss of SMP scalability to pin-fast, so
> there's no future potential usefulness to keep an atomic in the mm for
> this.
> 
> set_bit(MMF_HAS_PINNED) will be theoretically a bit slower than
> WRITE_ONCE (atomic_set is equivalent to WRITE_ONCE), but the set_bit
> (just like atomic_set after this commit) has to be still issued only
> once per "mm", so the difference between the two will be lost in the
> noise.
> 
> will-it-scale "mmap2" shows no change in performance with enterprise
> config as expected.
> 
> will-it-scale "pin_fast" retains the > 4000% SMP scalability
> performance improvement against upstream as expected.
> 
> This is a noop as far as overall performance and SMP scalability are
> concerned.

It's nice that you spelled that out. I don't see any technical problems
with the diffs. There are a couple of tiny suggestions, below.

> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> [peterx: Fix build for task_mmu.c]
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   fs/proc/task_mmu.c             |  2 +-
>   include/linux/mm.h             |  2 +-
>   include/linux/mm_types.h       | 10 ----------
>   include/linux/sched/coredump.h |  1 +
>   kernel/fork.c                  |  1 -
>   mm/gup.c                       |  9 +++++----
>   6 files changed, 8 insertions(+), 17 deletions(-)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 4c95cc57a66a8..6144571942db9 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1049,7 +1049,7 @@ static inline bool pte_is_pinned(struct vm_area_struct *vma, unsigned long addr,
>   		return false;
>   	if (!is_cow_mapping(vma->vm_flags))
>   		return false;
> -	if (likely(!atomic_read(&vma->vm_mm->has_pinned)))
> +	if (likely(!test_bit(MMF_HAS_PINNED, &vma->vm_mm->flags)))
>   		return false;
>   	page = vm_normal_page(vma, addr, pte);
>   	if (!page)
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index d6790ab0cf575..94dc84f6d8658 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1331,7 +1331,7 @@ static inline bool page_needs_cow_for_dma(struct vm_area_struct *vma,
>   	if (!is_cow_mapping(vma->vm_flags))
>   		return false;
>   
> -	if (!atomic_read(&vma->vm_mm->has_pinned))
> +	if (!test_bit(MMF_HAS_PINNED, &vma->vm_mm->flags))
>   		return false;
>   
>   	return page_maybe_dma_pinned(page);
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 6613b26a88946..15d79858fadbd 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -435,16 +435,6 @@ struct mm_struct {
>   		 */
>   		atomic_t mm_count;
>   
> -		/**
> -		 * @has_pinned: Whether this mm has pinned any pages.  This can
> -		 * be either replaced in the future by @pinned_vm when it
> -		 * becomes stable, or grow into a counter on its own. We're
> -		 * aggresive on this bit now - even if the pinned pages were
> -		 * unpinned later on, we'll still keep this bit set for the
> -		 * lifecycle of this mm just for simplicity.
> -		 */
> -		atomic_t has_pinned;
> -
>   		/**
>   		 * @write_protect_seq: Locked when any thread is write
>   		 * protecting pages mapped by this mm to enforce a later COW,
> diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
> index dfd82eab29025..bf45badd63e6d 100644
> --- a/include/linux/sched/coredump.h
> +++ b/include/linux/sched/coredump.h
> @@ -73,6 +73,7 @@ static inline int get_dumpable(struct mm_struct *mm)
>   #define MMF_OOM_VICTIM		25	/* mm is the oom victim */
>   #define MMF_OOM_REAP_QUEUED	26	/* mm was queued for oom_reaper */
>   #define MMF_MULTIPROCESS	27	/* mm is shared between processes */
> +#define MMF_HAS_PINNED		28	/* FOLL_PIN has run, never cleared */

How about this instead, so that we effectively retain the comment block
that is otherwise being deleted from mm_types.h:

/*
  * MMF_HAS_PINNED: Whether this mm has pinned any pages.  This can be either
  * replaced in the future by mm.pinned_vm when it becomes stable, or grow into a
  * counter on its own. We're aggressive on this bit for now: even if the pinned
  * pages were unpinned later on, we'll still keep this bit set for the lifecycle
  * of this mm, just for simplicity.
  */
#define MMF_HAS_PINNED		28	/* FOLL_PIN ran. Never cleared. */


>   #define MMF_DISABLE_THP_MASK	(1 << MMF_DISABLE_THP)
>   
>   #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 502dc046fbc62..a71e73707ef59 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1026,7 +1026,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
>   	mm_pgtables_bytes_init(mm);
>   	mm->map_count = 0;
>   	mm->locked_vm = 0;
> -	atomic_set(&mm->has_pinned, 0);
>   	atomic64_set(&mm->pinned_vm, 0);
>   	memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
>   	spin_lock_init(&mm->page_table_lock);
> diff --git a/mm/gup.c b/mm/gup.c
> index 8b513e1723b45..78416b0909873 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1292,8 +1292,8 @@ static __always_inline long __get_user_pages_locked(struct mm_struct *mm,
>   		BUG_ON(*locked != 1);
>   	}
>   
> -	if (flags & FOLL_PIN && !atomic_read(&mm->has_pinned))
> -		atomic_set(&mm->has_pinned, 1);
> +	if (flags & FOLL_PIN && !test_bit(MMF_HAS_PINNED, &mm->flags))
> +		set_bit(MMF_HAS_PINNED, &mm->flags);

I expect this suggestion to be controversial, but I'm going to float it
anyway. The above is a little less clear than it used to be, *and* it is
in two places so far, so how about factoring out a tiny subroutine, like this:

diff --git a/mm/gup.c b/mm/gup.c
index 036ab0de9457..2dc001a7c850 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1270,6 +1270,16 @@ int fixup_user_fault(struct mm_struct *mm,
}
EXPORT_SYMBOL_GPL(fixup_user_fault);

+static void set_mm_has_pinned_flag(unsigned long *mm_flags)
+{
+       /*
+        * Avoid setting the bit unless necessary. This matters a lot with
+        * large SMP machines.
+        */
+       if (!test_bit(MMF_HAS_PINNED, mm_flags))
+               set_bit(MMF_HAS_PINNED, mm_flags);
+}
+
/*
* Please note that this function, unlike __get_user_pages will not
* return 0 for nr_pages > 0 without FOLL_NOWAIT
@@ -1292,8 +1302,8 @@ static __always_inline long __get_user_pages_locked(struct mm_struct *mm,
BUG_ON(*locked != 1);
}

-       if (flags & FOLL_PIN && !test_bit(MMF_HAS_PINNED, &mm->flags))
-               set_bit(MMF_HAS_PINNED, &mm->flags);
+       if (flags & FOLL_PIN)
+               set_mm_has_pinned_flag(&mm->flags);

...which is now very readable, once again.

>   
>   	/*
>   	 * FOLL_PIN and FOLL_GET are mutually exclusive. Traditional behavior
> @@ -2617,8 +2617,9 @@ static int internal_get_user_pages_fast(unsigned long start,
>   				       FOLL_FAST_ONLY)))
>   		return -EINVAL;
>   
> -	if (gup_flags & FOLL_PIN && !atomic_read(&current->mm->has_pinned))
> -		atomic_set(&current->mm->has_pinned, 1);
> +	if (gup_flags & FOLL_PIN &&
> +	    !test_bit(MMF_HAS_PINNED, &current->mm->flags))
> +		set_bit(MMF_HAS_PINNED, &current->mm->flags);
>   
>   	if (!(gup_flags & FOLL_FAST_ONLY))
>   		might_lock_read(&current->mm->mmap_lock);
> 

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/3] mm: gup: pack has_pinned in MMF_HAS_PINNED
  2021-05-07  6:42   ` John Hubbard
@ 2021-05-07  7:39       ` Linus Torvalds
  2021-05-07 11:14     ` Matthew Wilcox
  2021-05-07 14:34     ` Peter Xu
  2 siblings, 0 replies; 15+ messages in thread
From: Linus Torvalds @ 2021-05-07  7:39 UTC (permalink / raw)
  To: John Hubbard
  Cc: Peter Xu, Linux-MM, Linux Kernel Mailing List, Hugh Dickins,
	Jan Kara, Kirill Shutemov, Jason Gunthorpe, Andrew Morton,
	Kirill Tkhai, Michal Hocko, Oleg Nesterov, Jann Horn,
	Matthew Wilcox, Andrea Arcangeli

On Thu, May 6, 2021 at 11:43 PM John Hubbard <jhubbard@nvidia.com> wrote:
>
> +static void set_mm_has_pinned_flag(unsigned long *mm_flags)
> +{
> +       /*
> +        * Avoid setting the bit unless necessary. This matters a lot with
> +        * large SMP machines.
> +        */
> +       if (!test_bit(MMF_HAS_PINNED, mm_flags))
> +               set_bit(MMF_HAS_PINNED, mm_flags);
> +}

Yes, please do split it up like this.

But please make it explicitly inline, and move the comment to above
the function.

And add the important key part to it: that the bit is never cleared.

That idempotent behavior of the "set_bit()" is what makes it safe to
do this non-atomic test-and-set (yes, the "set_bit()" itself is
atomic, but the sequence above is not).

Side note: we do have a few other places where this kind of thing
happens, so it *might* make sense to even make this a generic pattern
in case somebody can come up with a good descriptive name for that
("set_bit_if_not_set()" sounds descriptive, but the subtle
non-atomicity should probably be part of it).
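
Something like this, purely as an illustration (the helper name is
hypothetical, not an existing kernel API):

 static inline void set_bit_if_not_set(int nr, unsigned long *flags)
 {
 	/*
 	 * The plain test_bit() keeps the common case read-only so the
 	 * cacheline isn't dirtied on every call.  This is only safe when
 	 * the bit is never cleared: set_bit() itself is atomic, but the
 	 * test-then-set sequence as a whole is not.
 	 */
 	if (!test_bit(nr, flags))
 		set_bit(nr, flags);
 }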

> +       if (flags & FOLL_PIN)
> +               set_mm_has_pinned_flag(&mm->flags);
>
> ...which is now very readable, once again.

Yes, that does look much better.

Thanks,

              Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/3] mm: gup: pack has_pinned in MMF_HAS_PINNED
  2021-05-07  6:42   ` John Hubbard
  2021-05-07  7:39       ` Linus Torvalds
@ 2021-05-07 11:14     ` Matthew Wilcox
  2021-05-07 14:34     ` Peter Xu
  2 siblings, 0 replies; 15+ messages in thread
From: Matthew Wilcox @ 2021-05-07 11:14 UTC (permalink / raw)
  To: John Hubbard
  Cc: Peter Xu, linux-mm, linux-kernel, Hugh Dickins, Jan Kara,
	Kirill Shutemov, Jason Gunthorpe, Andrew Morton, Kirill Tkhai,
	Michal Hocko, Oleg Nesterov, Jann Horn, Linus Torvalds,
	Andrea Arcangeli

On Thu, May 06, 2021 at 11:42:59PM -0700, John Hubbard wrote:
> +static void set_mm_has_pinned_flag(unsigned long *mm_flags)

mm_set_has_pinned_flag(struct mm_struct *mm), please.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] mm/gup_benchmark: Support threading
  2021-05-07  4:37   ` John Hubbard
@ 2021-05-07 14:04     ` Peter Xu
  0 siblings, 0 replies; 15+ messages in thread
From: Peter Xu @ 2021-05-07 14:04 UTC (permalink / raw)
  To: John Hubbard
  Cc: linux-mm, linux-kernel, Hugh Dickins, Jan Kara, Kirill Shutemov,
	Jason Gunthorpe, Andrew Morton, Kirill Tkhai, Michal Hocko,
	Oleg Nesterov, Jann Horn, Linus Torvalds, Matthew Wilcox,
	Andrea Arcangeli

On Thu, May 06, 2021 at 09:37:37PM -0700, John Hubbard wrote:
> On 5/6/21 4:25 PM, Peter Xu wrote:
> > Add a new parameter "-j N" to support concurrent gup test.
> 
> I had a long-standing personal TODO item that said, "decide if it's
> worth it, to add pthread support to gup_test". So now, at least one
> other person has decided that it *is* worth it, and you've also written
> the patch. OK, then. Sweet! :)
> 
> This looks correct to me. I have a couple of minor comments below, but
> either way you can add:

I'll address them.

> 
> Reviewed-by: John Hubbard <jhubbard@nvidia.com>

Thanks!

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/3] mm: gup: allow FOLL_PIN to scale in SMP
  2021-05-07  6:07   ` John Hubbard
@ 2021-05-07 14:13     ` Peter Xu
  0 siblings, 0 replies; 15+ messages in thread
From: Peter Xu @ 2021-05-07 14:13 UTC (permalink / raw)
  To: John Hubbard, Linus Torvalds
  Cc: linux-mm, linux-kernel, Hugh Dickins, Jan Kara, Kirill Shutemov,
	Jason Gunthorpe, Andrew Morton, Kirill Tkhai, Michal Hocko,
	Oleg Nesterov, Jann Horn, Linus Torvalds, Matthew Wilcox,
	Andrea Arcangeli

On Thu, May 06, 2021 at 11:07:32PM -0700, John Hubbard wrote:
> On 5/6/21 4:25 PM, Peter Xu wrote:
> > From: Andrea Arcangeli <aarcange@redhat.com>
> > 
> > has_pinned cannot be written by each pin-fast or it won't scale in
> > SMP. This isn't "false sharing" strictly speaking (it's more like
> > "true non-sharing"), but it creates the same SMP scalability
> > bottleneck of "false sharing".
> > 
> > To verify the improvement a new "pin_fast.c" program was added to
> > the will-it-scale benchmark.
> ...
> > 
> > This commits increases the SMP scalability of pin_user_pages_fast()
> > executed by different threads of the same process by more than 4000%.
> > 
> 
> Remarkable! I mean, yes, everyone knows that atomic writes are
> "expensive", but this is a fun, dramatic example of just *how*
> expensive they can get, once you start doing contended atomic writes.
> 
> 
> Reviewed-by: John Hubbard <jhubbard@nvidia.com>
> 
> Other notes, that don't have any effect on the above reviewed-by
> tag:
> 
> On the commit log, I will add a "+1" to the idea of deleting the
> pin_fast.c contents from the commit log, and just providing a URL
> instead. No need to put C programs in the commit log, IMHO, especially
> when you have them elsewhere anyway.

I guess it's put there only because it does not exist elsewhere. :)

I didn't run it that way, but I think it needs to be put into tests/ of the
will-it-scale repo and built, something like that.

https://github.com/antonblanchard/will-it-scale/tree/master/tests

But yeah, since we've got the 1st patch which can also reproduce this, I'll
reference that instead in the commit message and I'll be able to shrink it.

I'll also start to use parentheses as Linus suggested.  My thanks to both of
you for the quick comments!

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/3] mm: gup: pack has_pinned in MMF_HAS_PINNED
  2021-05-07  6:42   ` John Hubbard
  2021-05-07  7:39       ` Linus Torvalds
  2021-05-07 11:14     ` Matthew Wilcox
@ 2021-05-07 14:34     ` Peter Xu
  2 siblings, 0 replies; 15+ messages in thread
From: Peter Xu @ 2021-05-07 14:34 UTC (permalink / raw)
  To: John Hubbard, Linus Torvalds, Matthew Wilcox
  Cc: linux-mm, linux-kernel, Hugh Dickins, Jan Kara, Kirill Shutemov,
	Jason Gunthorpe, Andrew Morton, Kirill Tkhai, Michal Hocko,
	Oleg Nesterov, Jann Horn, Linus Torvalds, Matthew Wilcox,
	Andrea Arcangeli

On Thu, May 06, 2021 at 11:42:59PM -0700, John Hubbard wrote:
> > +#define MMF_HAS_PINNED		28	/* FOLL_PIN has run, never cleared */
> 
> How about this instead, so that we effectively retain the comment block
> that is otherwise being deleted from mm.h:
> 
> /*
>  * MMF_HAS_PINNED: Whether this mm has pinned any pages.  This can be either
>  * replaced in the future by mm.pinned_vm when it becomes stable, or grow into a
>  * counter on its own. We're aggresive on this bit for now: even if the pinned
>  * pages were unpinned later on, we'll still keep this bit set for the lifecycle
>  * of this mm, just for simplicity.
>  */
> #define MMF_HAS_PINNED		28	/* FOLL_PIN ran. Never cleared. */

Sure, good to know the comment is still valid!

> > @@ -1292,8 +1292,8 @@ static __always_inline long __get_user_pages_locked(struct mm_struct *mm,
> >   		BUG_ON(*locked != 1);
> >   	}
> > -	if (flags & FOLL_PIN && !atomic_read(&mm->has_pinned))
> > -		atomic_set(&mm->has_pinned, 1);
> > +	if (flags & FOLL_PIN && !test_bit(MMF_HAS_PINNED, &mm->flags))
> > +		set_bit(MMF_HAS_PINNED, &mm->flags);
> 
> I expect this suggestion to be controversial, but I'm going to float it
> anyway. The above is a little less clear than it used to be, *and* it is
> in two places so far, so how about factoring out a tiny subroutine, like this:

Definitely less "controversial" than expected, isn't it? ;)

Thanks for the suggestion, it looks much better indeed.  Also I'll rename the
helper to mm_set_has_pinned_flag() as suggested by Matthew.
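
For the record, combining the suggestions would give roughly the following (a
sketch only; the exact form will be in the next revision):

 /*
  * The MMF_HAS_PINNED bit is never cleared, which is what makes the
  * non-atomic test-then-set below safe; skipping the redundant set_bit()
  * keeps large SMP machines from bouncing the mm->flags cacheline.
  */
 static inline void mm_set_has_pinned_flag(struct mm_struct *mm)
 {
 	if (!test_bit(MMF_HAS_PINNED, &mm->flags))
 		set_bit(MMF_HAS_PINNED, &mm->flags);
 }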

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread

Thread overview: 15+ messages
2021-05-06 23:25 [PATCH 0/3] mm/gup: Fix pin page write cache bouncing on has_pinned Peter Xu
2021-05-06 23:25 ` [PATCH 1/3] mm/gup_benchmark: Support threading Peter Xu
2021-05-07  4:37   ` John Hubbard
2021-05-07 14:04     ` Peter Xu
2021-05-06 23:25 ` [PATCH 2/3] mm: gup: allow FOLL_PIN to scale in SMP Peter Xu
2021-05-07  2:34   ` Linus Torvalds
2021-05-07  6:07   ` John Hubbard
2021-05-07 14:13     ` Peter Xu
2021-05-06 23:25 ` [PATCH 3/3] mm: gup: pack has_pinned in MMF_HAS_PINNED Peter Xu
2021-05-07  6:42   ` John Hubbard
2021-05-07  7:39     ` Linus Torvalds
2021-05-07 11:14     ` Matthew Wilcox
2021-05-07 14:34     ` Peter Xu
