* [PATCH V4][for-next]mm: add a new vector based madvise syscall
@ 2015-12-11  0:03 ` Shaohua Li
  0 siblings, 0 replies; 13+ messages in thread
From: Shaohua Li @ 2015-12-11  0:03 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-api, Kernel-team, Andrew Morton, Rik van Riel, Mel Gorman,
	Hugh Dickins, Johannes Weiner, Andrea Arcangeli, Andi Kleen,
	Minchan Kim, Arnd Bergmann

In jemalloc, free(3) doesn't immediately return memory to the OS, even
when the freed region is page aligned and page sized, in the hope that
the memory can be reused soon. Over time the virtual address space
becomes fragmented and more and more free memory accumulates. When the
amount of free memory is large, jemalloc uses madvise(MADV_DONTNEED) to
actually release it back to the OS.

The madvise call has significant overhead, particularly because of the
TLB flush. jemalloc issues madvise for several virtual address ranges
at a time. Instead of calling madvise once per range, we introduce a
new syscall that purges memory for several ranges in a single call.
This way the TLB flushes for all the ranges can be merged into one big
flush, and mmap_sem locking and kernel/userspace switches are reduced
as well.
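
For illustration, here is a minimal userspace sketch of how an allocator
could batch its purge ranges through the new syscall. The madvisev()
wrapper, the x86-64 syscall number 327 from this patch, and the three-range
layout are assumptions made for the example; there is no glibc wrapper.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <unistd.h>

#define __NR_madvisev 327	/* x86-64 number added by this patch */

/* One syscall for a sorted, non-overlapping vector of page-aligned ranges */
static long madvisev(const struct iovec *iov, unsigned long nr_segs,
		     int behavior)
{
	return syscall(__NR_madvisev, iov, nr_segs, behavior);
}

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	char *buf = mmap(NULL, 16 * page, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;
	memset(buf, 1, 16 * page);		/* fault all pages in */

	/* Three scattered free ranges, purged with one TLB flush instead
	 * of three separate madvise(MADV_DONTNEED) calls. */
	struct iovec ranges[3] = {
		{ buf,             2 * page },
		{ buf + 5 * page,  3 * page },
		{ buf + 12 * page, 4 * page },
	};

	if (madvisev(ranges, 3, MADV_DONTNEED))
		perror("madvisev");

	munmap(buf, 16 * page);
	return 0;
}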

I ran a simple memory allocation benchmark: 32 threads doing random
malloc/free/realloc. The corresponding jemalloc patch that uses this
API is attached.
Without patch:
real    0m18.923s
user    1m11.819s
sys     7m44.626s
Each CPU takes around 3000K/s TLB flush interrupts. perf shows the TLB
flush is among the hottest functions. mmap_sem read locking (because of
page faults) is also heavy.

With patch:
real    0m15.026s
user    0m48.548s
sys     6m41.153s
Each CPU takes around 140k/s TLB flush interrupts. TLB flush isn't hot
at all. mmap_sem read locking (still because of page faults) becomes
the sole hot spot.

Another test mallocs a bunch of memory in 48 threads, then all threads
free the memory. I measure the time of the free phase.
Without patch: 34.332s
With patch:    17.429s
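
For reference, a rough sketch of how that second test can be structured
(the thread count matches the description above; the per-thread allocation
count and size are made-up values, and this is not the actual benchmark):

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define NTHREADS 48
#define NALLOCS  1024			/* per-thread allocations (assumed) */
#define ALLOCSZ  (16 * 1024)		/* per-allocation size (assumed) */

static void *blocks[NTHREADS][NALLOCS];
static pthread_barrier_t barrier;

static void *worker(void *arg)
{
	long id = (long)arg;
	int i;

	/* Allocation phase: every thread mallocs a bunch of memory. */
	for (i = 0; i < NALLOCS; i++) {
		blocks[id][i] = malloc(ALLOCSZ);
		memset(blocks[id][i], 1, ALLOCSZ);
	}

	/* Wait until everyone is done allocating, then free together.
	 * With jemalloc the free phase is where madvise/madvisev dominates. */
	pthread_barrier_wait(&barrier);
	for (i = 0; i < NALLOCS; i++)
		free(blocks[id][i]);
	return NULL;
}

int main(void)
{
	pthread_t tids[NTHREADS];
	struct timespec t0, t1;
	long i;

	pthread_barrier_init(&barrier, NULL, NTHREADS + 1);
	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tids[i], NULL, worker, (void *)i);

	pthread_barrier_wait(&barrier);		/* allocation phase done */
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tids[i], NULL);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("free phase: %.3fs\n", (t1.tv_sec - t0.tv_sec) +
	       (t1.tv_nsec - t0.tv_nsec) / 1e9);
	return 0;
}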

MADV_FREE requires the same TLB flush as MADV_NEED, so the batching
benefit applies to MADV_FREE as well. Other madvise types can see small
benefits too, such as fewer syscalls and less mmap_sem locking.

V3->V4:
- Support all MADV_* (Andi Kleen)
- Add compat syscall (Arnd Bergmann)

V2->V3:
- Delete iov sort (Andrew Morton)
- Support MADV_FREE

V1->V2:
- make madvisev() act exactly like madvise(). The only difference is that
  madvisev() returns an error if a range's start equals start + size.
  Returning 0 (like madvise does) is improper here since there are other
  ranges.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   2 +
 include/linux/compat.h                 |   3 +
 include/linux/syscalls.h               |   3 +
 include/uapi/asm-generic/unistd.h      |   4 +-
 kernel/sys_ni.c                        |   2 +
 mm/madvise.c                           | 281 +++++++++++++++++++++++++++------
 7 files changed, 251 insertions(+), 45 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index cb713df..f65e418 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -384,3 +384,4 @@
 375	i386	membarrier		sys_membarrier
 376	i386	mlock2			sys_mlock2
 377	i386	copy_file_range		sys_copy_file_range
+378	i386	madvisev		sys_madvisev			compat_sys_madvisev
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index dc1040a..158eef4 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -333,6 +333,7 @@
 324	common	membarrier		sys_membarrier
 325	common	mlock2			sys_mlock2
 326	common	copy_file_range		sys_copy_file_range
+327	64	madvisev		sys_madvisev
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
@@ -372,3 +373,4 @@
 543	x32	io_setup		compat_sys_io_setup
 544	x32	io_submit		compat_sys_io_submit
 545	x32	execveat		stub_x32_execveat
+546	x32	madvisev		compat_sys_madvisev
diff --git a/include/linux/compat.h b/include/linux/compat.h
index a76c917..a11ddab 100644
--- a/include/linux/compat.h
+++ b/include/linux/compat.h
@@ -689,6 +689,9 @@ asmlinkage long compat_sys_sendfile64(int out_fd, int in_fd,
 asmlinkage long compat_sys_sigaltstack(const compat_stack_t __user *uss_ptr,
 				       compat_stack_t __user *uoss_ptr);
 
+asmlinkage long compat_sys_madvisev(const struct compat_iovec __user *uvector,
+		compat_ulong_t nr_segs, compat_int_t behavior);
+
 #ifdef __ARCH_WANT_SYS_SIGPENDING
 asmlinkage long compat_sys_sigpending(compat_old_sigset_t __user *set);
 #endif
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 185815c..8df6763 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -892,4 +892,7 @@ asmlinkage long sys_copy_file_range(int fd_in, loff_t __user *off_in,
 
 asmlinkage long sys_mlock2(unsigned long start, size_t len, int flags);
 
+asmlinkage long sys_madvisev(const struct iovec __user *uvector,
+		unsigned long nr_segs, int behavior);
+
 #endif
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 2622b33..23cac4e 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -717,9 +717,11 @@ __SYSCALL(__NR_membarrier, sys_membarrier)
 __SYSCALL(__NR_mlock2, sys_mlock2)
 #define __NR_copy_file_range 285
 __SYSCALL(__NR_copy_file_range, sys_copy_file_range)
+#define __NR_madvisev 286
+__SC_COMP(__NR_madvisev, sys_madvisev, compat_sys_madvisev)
 
 #undef __NR_syscalls
-#define __NR_syscalls 286
+#define __NR_syscalls 287
 
 /*
  * All syscalls below here should go away really,
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 2c5e3a8..8b87f39 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -198,6 +198,8 @@ cond_syscall(sys_munlockall);
 cond_syscall(sys_mlock2);
 cond_syscall(sys_mincore);
 cond_syscall(sys_madvise);
+cond_syscall(sys_madvisev);
+cond_syscall(compat_sys_madvisev);
 cond_syscall(sys_mremap);
 cond_syscall(sys_remap_file_pages);
 cond_syscall(compat_sys_move_pages);
diff --git a/mm/madvise.c b/mm/madvise.c
index f56825b..8d79774 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -21,7 +21,10 @@
 #include <linux/swap.h>
 #include <linux/swapops.h>
 #include <linux/mmu_notifier.h>
-
+#include <linux/uio.h>
+#ifdef CONFIG_COMPAT
+#include <linux/compat.h>
+#endif
 #include <asm/tlb.h>
 
 /*
@@ -564,7 +567,8 @@ static int madvise_hwpoison(int bhv, unsigned long start, unsigned long end)
 
 static long
 madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
-		unsigned long start, unsigned long end, int behavior)
+		unsigned long start, unsigned long end, int behavior,
+		void *data)
 {
 	switch (behavior) {
 	case MADV_REMOVE:
@@ -616,6 +620,62 @@ madvise_behavior_valid(int behavior)
 	}
 }
 
+typedef long (*madvise_iterate_fn)(struct vm_area_struct *vma,
+	struct vm_area_struct **prev, unsigned long start,
+	unsigned long end, int behavior, void *data);
+static int madvise_iterate_vma(unsigned long start, unsigned long end,
+	int *unmapped_error, int behavior, madvise_iterate_fn fn, void *data)
+{
+	struct vm_area_struct *vma, *prev;
+	unsigned long tmp;
+	int error = 0;
+
+	/*
+	 * If the interval [start,end) covers some unmapped address
+	 * ranges, just ignore them, but return -ENOMEM at the end.
+	 * - different from the way of handling in mlock etc.
+	 */
+	vma = find_vma_prev(current->mm, start, &prev);
+	if (vma && start > vma->vm_start)
+		prev = vma;
+
+	for (;;) {
+		/* Still start < end. */
+		error = -ENOMEM;
+		if (!vma)
+			break;
+
+		/* Here start < (end|vma->vm_end). */
+		if (start < vma->vm_start) {
+			*unmapped_error = -ENOMEM;
+			start = vma->vm_start;
+			if (start >= end)
+				break;
+		}
+
+		/* Here vma->vm_start <= start < (end|vma->vm_end) */
+		tmp = vma->vm_end;
+		if (end < tmp)
+			tmp = end;
+
+		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
+		error = fn(vma, &prev, start, tmp, behavior, data);
+		if (error)
+			break;
+		start = tmp;
+		if (prev && start < prev->vm_end)
+			start = prev->vm_end;
+		if (start >= end)
+			break;
+		if (prev)
+			vma = prev->vm_next;
+		else	/* madvise_remove dropped mmap_sem */
+			vma = find_vma(current->mm, start);
+	}
+
+	return error;
+}
+
 /*
  * The madvise(2) system call.
  *
@@ -660,8 +720,7 @@ madvise_behavior_valid(int behavior)
  */
 SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 {
-	unsigned long end, tmp;
-	struct vm_area_struct *vma, *prev;
+	unsigned long end;
 	int unmapped_error = 0;
 	int error = -EINVAL;
 	int write;
@@ -697,56 +756,190 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 	else
 		down_read(&current->mm->mmap_sem);
 
-	/*
-	 * If the interval [start,end) covers some unmapped address
-	 * ranges, just ignore them, but return -ENOMEM at the end.
-	 * - different from the way of handling in mlock etc.
-	 */
-	vma = find_vma_prev(current->mm, start, &prev);
-	if (vma && start > vma->vm_start)
-		prev = vma;
-
 	blk_start_plug(&plug);
-	for (;;) {
-		/* Still start < end. */
-		error = -ENOMEM;
-		if (!vma)
-			goto out;
 
-		/* Here start < (end|vma->vm_end). */
-		if (start < vma->vm_start) {
-			unmapped_error = -ENOMEM;
-			start = vma->vm_start;
-			if (start >= end)
-				goto out;
+	error = madvise_iterate_vma(start, end, &unmapped_error,
+			behavior, madvise_vma, NULL);
+	if (error == 0 && unmapped_error != 0)
+		error = unmapped_error;
+
+	blk_finish_plug(&plug);
+	if (write)
+		up_write(&current->mm->mmap_sem);
+	else
+		up_read(&current->mm->mmap_sem);
+
+	return error;
+}
+
+static long
+madvisev_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
+		unsigned long start, unsigned long end, int behavior,
+		void *data)
+{
+	struct mmu_gather *tlb = data;
+	*prev = vma;
+	if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
+		return -EINVAL;
+
+	switch (behavior) {
+	case MADV_FREE:
+		/*
+		 * XXX: In this implementation, MADV_FREE works like
+		 * MADV_DONTNEED on swapless system or full swap.
+		 */
+		if (get_nr_swap_pages() > 0) {
+			/* MADV_FREE works for only anon vma at the moment */
+			if (!vma_is_anonymous(vma))
+				return -EINVAL;
+			madvise_free_page_range(tlb, vma, start, end);
+			break;
 		}
+		/* passthrough */
+	case MADV_DONTNEED:
+		unmap_vmas(tlb, vma, start, end);
+		break;
+	}
+	return 0;
+}
 
-		/* Here vma->vm_start <= start < (end|vma->vm_end) */
-		tmp = vma->vm_end;
-		if (end < tmp)
-			tmp = end;
+static int do_madvisev(struct iovec *iov, unsigned long nr_segs, int behavior)
+{
+	unsigned long start, end = 0;
+	int unmapped_error = 0;
+	size_t len;
+	struct mmu_gather tlb;
+	int error = 0;
+	int i;
+	int write;
+	struct blk_plug plug;
 
-		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
-		error = madvise_vma(vma, &prev, start, tmp, behavior);
-		if (error)
-			goto out;
-		start = tmp;
-		if (prev && start < prev->vm_end)
-			start = prev->vm_end;
-		error = unmapped_error;
-		if (start >= end)
-			goto out;
-		if (prev)
-			vma = prev->vm_next;
-		else	/* madvise_remove dropped mmap_sem */
-			vma = find_vma(current->mm, start);
+#ifdef CONFIG_MEMORY_FAILURE
+	if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE) {
+		for (i = 0; i < nr_segs; i++) {
+			start = (unsigned long)iov[i].iov_base;
+			len = iov[i].iov_len;
+			error = madvise_hwpoison(behavior, start, start + len);
+			if (error)
+				return error;
+		}
+		return 0;
 	}
-out:
-	blk_finish_plug(&plug);
+#endif
+
+	if (!madvise_behavior_valid(behavior))
+		return -EINVAL;
+
+	for (i = 0; i < nr_segs; i++) {
+		start = (unsigned long)iov[i].iov_base;
+		/* Make sure iovs don't overlap and sorted */
+		if (start & ~PAGE_MASK || start < end)
+			return -EINVAL;
+		len = ((iov[i].iov_len + ~PAGE_MASK) & PAGE_MASK);
+
+		/*
+		 * Check to see whether len was rounded up from small -ve to
+		 * zero
+		 */
+		if (iov[i].iov_len && !len)
+			return -EINVAL;
+
+		end = start + len;
+
+		/*
+		 * end == start returns error (different against madvise).
+		 * return 0 is improper as there are other iovs
+		 */
+		if (end <= start)
+			return -EINVAL;
+
+		iov[i].iov_len = len;
+	}
+
+	write = madvise_need_mmap_write(behavior);
+	if (write)
+		down_write(&current->mm->mmap_sem);
+	else
+		down_read(&current->mm->mmap_sem);
+
+	if (behavior == MADV_DONTNEED || behavior == MADV_FREE) {
+		lru_add_drain();
+		tlb_gather_mmu(&tlb, current->mm,
+			(unsigned long)iov[0].iov_base, end);
+		update_hiwater_rss(current->mm);
+		for (i = 0; i < nr_segs; i++) {
+			start = (unsigned long)iov[i].iov_base;
+			len = iov[i].iov_len;
+
+			error = madvise_iterate_vma(start, start + len,
+				&unmapped_error, behavior, madvisev_vma, &tlb);
+			if (error)
+				break;
+		}
+		tlb_finish_mmu(&tlb, (unsigned long)iov[0].iov_base, end);
+	} else {
+		blk_start_plug(&plug);
+		for (i = 0; i < nr_segs; i++) {
+			start = (unsigned long)iov[i].iov_base;
+			len = iov[i].iov_len;
+
+			error = madvise_iterate_vma(start, start + len,
+				&unmapped_error, behavior, madvise_vma, NULL);
+			if (error)
+				break;
+		}
+		blk_finish_plug(&plug);
+	}
+	if (error == 0 && unmapped_error != 0)
+		error = unmapped_error;
+
 	if (write)
 		up_write(&current->mm->mmap_sem);
 	else
 		up_read(&current->mm->mmap_sem);
+	return error;
+}
+
+/*
+ * The vector madvise(). Like madvise except running for a vector of virtual
+ * address ranges
+ */
+SYSCALL_DEFINE3(madvisev, const struct iovec __user *, uvector,
+	unsigned long, nr_segs, int, behavior)
+{
+	struct iovec iovstack[UIO_FASTIOV];
+	struct iovec *iov = NULL;
+	int error;
+
+	error = rw_copy_check_uvector(CHECK_IOVEC_ONLY, uvector, nr_segs,
+			UIO_FASTIOV, iovstack, &iov);
+	if (error <= 0)
+		return error;
+
+	error = do_madvisev(iov, nr_segs, behavior);
 
+	if (iov != iovstack)
+		kfree(iov);
 	return error;
 }
+
+#ifdef CONFIG_COMPAT
+COMPAT_SYSCALL_DEFINE3(madvisev, const struct compat_iovec __user *, uvector,
+	compat_ulong_t, nr_segs, compat_int_t, behavior)
+{
+	struct iovec iovstack[UIO_FASTIOV];
+	struct iovec *iov = NULL;
+	int error;
+
+	error = compat_rw_copy_check_uvector(CHECK_IOVEC_ONLY, uvector, nr_segs,
+			UIO_FASTIOV, iovstack, &iov);
+	if (error <= 0)
+		return error;
+
+	error = do_madvisev(iov, nr_segs, behavior);
+
+	if (iov != iovstack)
+		kfree(iov);
+	return error;
+}
+#endif
-- 
2.4.6


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH V4][for-next]mm: add a new vector based madvise syscall
  2015-12-11  0:03 ` Shaohua Li
@ 2015-12-11 18:31 ` Andi Kleen
  -1 siblings, 0 replies; 13+ messages in thread
From: Andi Kleen @ 2015-12-11 18:31 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-mm, linux-api, Kernel-team, Andrew Morton, Rik van Riel,
	Mel Gorman, Hugh Dickins, Johannes Weiner, Andrea Arcangeli,
	Andi Kleen, Minchan Kim, Arnd Bergmann

On Thu, Dec 10, 2015 at 04:03:37PM -0800, Shaohua Li wrote:
> In jemalloc, a free(3) doesn't immediately free the memory to OS even
> the memory is page aligned/size, and hope the memory can be reused soon.
> Later the virtual address becomes fragmented, and more and more free
> memory are aggregated. If the free memory size is large, jemalloc uses
> madvise(DONT_NEED) to actually free the memory back to OS.
> 

Looks good to me now.

-Andi


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V4][for-next]mm: add a new vector based madvise syscall
  2015-12-11  0:03 ` Shaohua Li
@ 2016-01-11 19:00 ` Shaohua Li
       [not found]   ` <20160111190021.GA3589410-tb7CFzD8y5b7E6g3fPdp/g2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org>
  -1 siblings, 1 reply; 13+ messages in thread
From: Shaohua Li @ 2016-01-11 19:00 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-api, Kernel-team, Andrew Morton, Rik van Riel, Mel Gorman,
	Hugh Dickins, Johannes Weiner, Andrea Arcangeli, Andi Kleen,
	Minchan Kim, Arnd Bergmann

Hi Andrew,

any chance this can be added to -mm? It still applies to the latest
-next tree.

Thanks,
Shaohua

On Thu, Dec 10, 2015 at 04:03:37PM -0800, Shaohua Li wrote:
> In jemalloc, a free(3) doesn't immediately free the memory to OS even
> the memory is page aligned/size, and hope the memory can be reused soon.
> Later the virtual address becomes fragmented, and more and more free
> memory are aggregated. If the free memory size is large, jemalloc uses
> madvise(DONT_NEED) to actually free the memory back to OS.
> 
> The madvise has significantly overhead paritcularly because of TLB
> flush. jemalloc does madvise for several virtual address space ranges
> one time. Instead of calling madvise for each of the ranges, we
> introduce a new syscall to purge memory for several ranges one time. In
> this way, we can merge several TLB flush for the ranges to one big TLB
> flush. This also reduce mmap_sem locking and kernel/userspace switching.
> 
> I'm running a simple memory allocation benchmark. 32 threads do random
> malloc/free/realloc. Corresponding jemalloc patch to utilize this API is
> attached.
> Without patch:
> real    0m18.923s
> user    1m11.819s
> sys     7m44.626s
> each cpu gets around 3000K/s TLB flush interrupt. Perf shows TLB flush
> is hotest functions. mmap_sem read locking (because of page fault) is
> also heavy.
> 
> with patch:
> real    0m15.026s
> user    0m48.548s
> sys     6m41.153s
> each cpu gets around 140k/s TLB flush interrupt. TLB flush isn't hot at
> all. mmap_sem read locking (still because of page fault) becomes the
> sole hot spot.
> 
> Another test malloc a bunch of memory in 48 threads, then all threads
> free the memory. I measure the time of the memory free.
> Without patch: 34.332s
> With patch:    17.429s
> 
> MADV_FREE does the same TLB flush as MADV_NEED, this also applies to
> MADV_FREE. Other madvise type can have small benefits too, like reduce
> syscalls/mmap_sem locking.
> 
> V3->V4:
> - Support all MADV_* (Andi Kleen)
> - Add compat syscall (Arnd Bergmann)
> 
> V2->V3:
> - Delete iov sort (Andrew Morton)
> - Support MADV_FREE
> 
> V1->V2:
> - make madvisev() acts exactly like madvise(). The only difference is
>   madvisev() returns error if a range's start equals to start + size.
>   Returning 0 (like madvise) is improper here since there are other ranges.
> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Andi Kleen <andi@firstfloor.org>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |   2 +
>  include/linux/compat.h                 |   3 +
>  include/linux/syscalls.h               |   3 +
>  include/uapi/asm-generic/unistd.h      |   4 +-
>  kernel/sys_ni.c                        |   2 +
>  mm/madvise.c                           | 281 +++++++++++++++++++++++++++------
>  7 files changed, 251 insertions(+), 45 deletions(-)
> 
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index cb713df..f65e418 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -384,3 +384,4 @@
>  375	i386	membarrier		sys_membarrier
>  376	i386	mlock2			sys_mlock2
>  377	i386	copy_file_range		sys_copy_file_range
> +378	i386	madvisev		sys_madvisev			compat_sys_madvisev
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index dc1040a..158eef4 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -333,6 +333,7 @@
>  324	common	membarrier		sys_membarrier
>  325	common	mlock2			sys_mlock2
>  326	common	copy_file_range		sys_copy_file_range
> +327	64	madvisev		sys_madvisev
>  
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> @@ -372,3 +373,4 @@
>  543	x32	io_setup		compat_sys_io_setup
>  544	x32	io_submit		compat_sys_io_submit
>  545	x32	execveat		stub_x32_execveat
> +546	x32	madvisev		compat_sys_madvisev
> diff --git a/include/linux/compat.h b/include/linux/compat.h
> index a76c917..a11ddab 100644
> --- a/include/linux/compat.h
> +++ b/include/linux/compat.h
> @@ -689,6 +689,9 @@ asmlinkage long compat_sys_sendfile64(int out_fd, int in_fd,
>  asmlinkage long compat_sys_sigaltstack(const compat_stack_t __user *uss_ptr,
>  				       compat_stack_t __user *uoss_ptr);
>  
> +asmlinkage long compat_sys_madvisev(const struct compat_iovec __user *uvector,
> +		compat_ulong_t nr_segs, compat_int_t behavior);
> +
>  #ifdef __ARCH_WANT_SYS_SIGPENDING
>  asmlinkage long compat_sys_sigpending(compat_old_sigset_t __user *set);
>  #endif
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 185815c..8df6763 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -892,4 +892,7 @@ asmlinkage long sys_copy_file_range(int fd_in, loff_t __user *off_in,
>  
>  asmlinkage long sys_mlock2(unsigned long start, size_t len, int flags);
>  
> +asmlinkage long sys_madvisev(const struct iovec __user *uvector,
> +		unsigned long nr_segs, int behavior);
> +
>  #endif
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index 2622b33..23cac4e 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -717,9 +717,11 @@ __SYSCALL(__NR_membarrier, sys_membarrier)
>  __SYSCALL(__NR_mlock2, sys_mlock2)
>  #define __NR_copy_file_range 285
>  __SYSCALL(__NR_copy_file_range, sys_copy_file_range)
> +#define __NR_madvisev 286
> +__SC_COMP(__NR_madvisev, sys_madvisev, compat_sys_madvisev)
>  
>  #undef __NR_syscalls
> -#define __NR_syscalls 286
> +#define __NR_syscalls 287
>  
>  /*
>   * All syscalls below here should go away really,
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 2c5e3a8..8b87f39 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -198,6 +198,8 @@ cond_syscall(sys_munlockall);
>  cond_syscall(sys_mlock2);
>  cond_syscall(sys_mincore);
>  cond_syscall(sys_madvise);
> +cond_syscall(sys_madvisev);
> +cond_syscall(compat_sys_madvisev);
>  cond_syscall(sys_mremap);
>  cond_syscall(sys_remap_file_pages);
>  cond_syscall(compat_sys_move_pages);
> diff --git a/mm/madvise.c b/mm/madvise.c
> index f56825b..8d79774 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -21,7 +21,10 @@
>  #include <linux/swap.h>
>  #include <linux/swapops.h>
>  #include <linux/mmu_notifier.h>
> -
> +#include <linux/uio.h>
> +#ifdef CONFIG_COMPAT
> +#include <linux/compat.h>
> +#endif
>  #include <asm/tlb.h>
>  
>  /*
> @@ -564,7 +567,8 @@ static int madvise_hwpoison(int bhv, unsigned long start, unsigned long end)
>  
>  static long
>  madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
> -		unsigned long start, unsigned long end, int behavior)
> +		unsigned long start, unsigned long end, int behavior,
> +		void *data)
>  {
>  	switch (behavior) {
>  	case MADV_REMOVE:
> @@ -616,6 +620,62 @@ madvise_behavior_valid(int behavior)
>  	}
>  }
>  
> +typedef long (*madvise_iterate_fn)(struct vm_area_struct *vma,
> +	struct vm_area_struct **prev, unsigned long start,
> +	unsigned long end, int behavior, void *data);
> +static int madvise_iterate_vma(unsigned long start, unsigned long end,
> +	int *unmapped_error, int behavior, madvise_iterate_fn fn, void *data)
> +{
> +	struct vm_area_struct *vma, *prev;
> +	unsigned long tmp;
> +	int error = 0;
> +
> +	/*
> +	 * If the interval [start,end) covers some unmapped address
> +	 * ranges, just ignore them, but return -ENOMEM at the end.
> +	 * - different from the way of handling in mlock etc.
> +	 */
> +	vma = find_vma_prev(current->mm, start, &prev);
> +	if (vma && start > vma->vm_start)
> +		prev = vma;
> +
> +	for (;;) {
> +		/* Still start < end. */
> +		error = -ENOMEM;
> +		if (!vma)
> +			break;
> +
> +		/* Here start < (end|vma->vm_end). */
> +		if (start < vma->vm_start) {
> +			*unmapped_error = -ENOMEM;
> +			start = vma->vm_start;
> +			if (start >= end)
> +				break;
> +		}
> +
> +		/* Here vma->vm_start <= start < (end|vma->vm_end) */
> +		tmp = vma->vm_end;
> +		if (end < tmp)
> +			tmp = end;
> +
> +		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
> +		error = fn(vma, &prev, start, tmp, behavior, data);
> +		if (error)
> +			break;
> +		start = tmp;
> +		if (prev && start < prev->vm_end)
> +			start = prev->vm_end;
> +		if (start >= end)
> +			break;
> +		if (prev)
> +			vma = prev->vm_next;
> +		else	/* madvise_remove dropped mmap_sem */
> +			vma = find_vma(current->mm, start);
> +	}
> +
> +	return error;
> +}
> +
>  /*
>   * The madvise(2) system call.
>   *
> @@ -660,8 +720,7 @@ madvise_behavior_valid(int behavior)
>   */
>  SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
>  {
> -	unsigned long end, tmp;
> -	struct vm_area_struct *vma, *prev;
> +	unsigned long end;
>  	int unmapped_error = 0;
>  	int error = -EINVAL;
>  	int write;
> @@ -697,56 +756,190 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
>  	else
>  		down_read(&current->mm->mmap_sem);
>  
> -	/*
> -	 * If the interval [start,end) covers some unmapped address
> -	 * ranges, just ignore them, but return -ENOMEM at the end.
> -	 * - different from the way of handling in mlock etc.
> -	 */
> -	vma = find_vma_prev(current->mm, start, &prev);
> -	if (vma && start > vma->vm_start)
> -		prev = vma;
> -
>  	blk_start_plug(&plug);
> -	for (;;) {
> -		/* Still start < end. */
> -		error = -ENOMEM;
> -		if (!vma)
> -			goto out;
>  
> -		/* Here start < (end|vma->vm_end). */
> -		if (start < vma->vm_start) {
> -			unmapped_error = -ENOMEM;
> -			start = vma->vm_start;
> -			if (start >= end)
> -				goto out;
> +	error = madvise_iterate_vma(start, end, &unmapped_error,
> +			behavior, madvise_vma, NULL);
> +	if (error == 0 && unmapped_error != 0)
> +		error = unmapped_error;
> +
> +	blk_finish_plug(&plug);
> +	if (write)
> +		up_write(&current->mm->mmap_sem);
> +	else
> +		up_read(&current->mm->mmap_sem);
> +
> +	return error;
> +}
> +
> +static long
> +madvisev_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
> +		unsigned long start, unsigned long end, int behavior,
> +		void *data)
> +{
> +	struct mmu_gather *tlb = data;
> +	*prev = vma;
> +	if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
> +		return -EINVAL;
> +
> +	switch (behavior) {
> +	case MADV_FREE:
> +		/*
> +		 * XXX: In this implementation, MADV_FREE works like
> +		 * MADV_DONTNEED on swapless system or full swap.
> +		 */
> +		if (get_nr_swap_pages() > 0) {
> +			/* MADV_FREE works for only anon vma at the moment */
> +			if (!vma_is_anonymous(vma))
> +				return -EINVAL;
> +			madvise_free_page_range(tlb, vma, start, end);
> +			break;
>  		}
> +		/* passthrough */
> +	case MADV_DONTNEED:
> +		unmap_vmas(tlb, vma, start, end);
> +		break;
> +	}
> +	return 0;
> +}
>  
> -		/* Here vma->vm_start <= start < (end|vma->vm_end) */
> -		tmp = vma->vm_end;
> -		if (end < tmp)
> -			tmp = end;
> +static int do_madvisev(struct iovec *iov, unsigned long nr_segs, int behavior)
> +{
> +	unsigned long start, end = 0;
> +	int unmapped_error = 0;
> +	size_t len;
> +	struct mmu_gather tlb;
> +	int error = 0;
> +	int i;
> +	int write;
> +	struct blk_plug plug;
>  
> -		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
> -		error = madvise_vma(vma, &prev, start, tmp, behavior);
> -		if (error)
> -			goto out;
> -		start = tmp;
> -		if (prev && start < prev->vm_end)
> -			start = prev->vm_end;
> -		error = unmapped_error;
> -		if (start >= end)
> -			goto out;
> -		if (prev)
> -			vma = prev->vm_next;
> -		else	/* madvise_remove dropped mmap_sem */
> -			vma = find_vma(current->mm, start);
> +#ifdef CONFIG_MEMORY_FAILURE
> +	if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE) {
> +		for (i = 0; i < nr_segs; i++) {
> +			start = (unsigned long)iov[i].iov_base;
> +			len = iov[i].iov_len;
> +			error = madvise_hwpoison(behavior, start, start + len);
> +			if (error)
> +				return error;
> +		}
> +		return 0;
>  	}
> -out:
> -	blk_finish_plug(&plug);
> +#endif
> +
> +	if (!madvise_behavior_valid(behavior))
> +		return -EINVAL;
> +
> +	for (i = 0; i < nr_segs; i++) {
> +		start = (unsigned long)iov[i].iov_base;
> +		/* Make sure iovs don't overlap and sorted */
> +		if (start & ~PAGE_MASK || start < end)
> +			return -EINVAL;
> +		len = ((iov[i].iov_len + ~PAGE_MASK) & PAGE_MASK);
> +
> +		/*
> +		 * Check to see whether len was rounded up from small -ve to
> +		 * zero
> +		 */
> +		if (iov[i].iov_len && !len)
> +			return -EINVAL;
> +
> +		end = start + len;
> +
> +		/*
> +		 * end == start returns error (different against madvise).
> +		 * return 0 is improper as there are other iovs
> +		 */
> +		if (end <= start)
> +			return -EINVAL;
> +
> +		iov[i].iov_len = len;
> +	}
> +
> +	write = madvise_need_mmap_write(behavior);
> +	if (write)
> +		down_write(&current->mm->mmap_sem);
> +	else
> +		down_read(&current->mm->mmap_sem);
> +
> +	if (behavior == MADV_DONTNEED || behavior == MADV_FREE) {
> +		lru_add_drain();
> +		tlb_gather_mmu(&tlb, current->mm,
> +			(unsigned long)iov[0].iov_base, end);
> +		update_hiwater_rss(current->mm);
> +		for (i = 0; i < nr_segs; i++) {
> +			start = (unsigned long)iov[i].iov_base;
> +			len = iov[i].iov_len;
> +
> +			error = madvise_iterate_vma(start, start + len,
> +				&unmapped_error, behavior, madvisev_vma, &tlb);
> +			if (error)
> +				break;
> +		}
> +		tlb_finish_mmu(&tlb, (unsigned long)iov[0].iov_base, end);
> +	} else {
> +		blk_start_plug(&plug);
> +		for (i = 0; i < nr_segs; i++) {
> +			start = (unsigned long)iov[i].iov_base;
> +			len = iov[i].iov_len;
> +
> +			error = madvise_iterate_vma(start, start + len,
> +				&unmapped_error, behavior, madvise_vma, NULL);
> +			if (error)
> +				break;
> +		}
> +		blk_finish_plug(&plug);
> +	}
> +	if (error == 0 && unmapped_error != 0)
> +		error = unmapped_error;
> +
>  	if (write)
>  		up_write(&current->mm->mmap_sem);
>  	else
>  		up_read(&current->mm->mmap_sem);
> +	return error;
> +}
> +
> +/*
> + * The vector madvise(). Like madvise except running for a vector of virtual
> + * address ranges
> + */
> +SYSCALL_DEFINE3(madvisev, const struct iovec __user *, uvector,
> +	unsigned long, nr_segs, int, behavior)
> +{
> +	struct iovec iovstack[UIO_FASTIOV];
> +	struct iovec *iov = NULL;
> +	int error;
> +
> +	error = rw_copy_check_uvector(CHECK_IOVEC_ONLY, uvector, nr_segs,
> +			UIO_FASTIOV, iovstack, &iov);
> +	if (error <= 0)
> +		return error;
> +
> +	error = do_madvisev(iov, nr_segs, behavior);
>  
> +	if (iov != iovstack)
> +		kfree(iov);
>  	return error;
>  }
> +
> +#ifdef CONFIG_COMPAT
> +COMPAT_SYSCALL_DEFINE3(madvisev, const struct compat_iovec __user *, uvector,
> +	compat_ulong_t, nr_segs, compat_int_t, behavior)
> +{
> +	struct iovec iovstack[UIO_FASTIOV];
> +	struct iovec *iov = NULL;
> +	int error;
> +
> +	error = compat_rw_copy_check_uvector(CHECK_IOVEC_ONLY, uvector, nr_segs,
> +			UIO_FASTIOV, iovstack, &iov);
> +	if (error <= 0)
> +		return error;
> +
> +	error = do_madvisev(iov, nr_segs, behavior);
> +
> +	if (iov != iovstack)
> +		kfree(iov);
> +	return error;
> +}
> +#endif
> -- 
> 2.4.6
> 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V4][for-next]mm: add a new vector based madvise syscall
@ 2016-01-11 19:06       ` Andrew Morton
  0 siblings, 0 replies; 13+ messages in thread
From: Andrew Morton @ 2016-01-11 19:06 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-mm, linux-api, Kernel-team, Rik van Riel, Mel Gorman,
	Hugh Dickins, Johannes Weiner, Andrea Arcangeli, Andi Kleen,
	Minchan Kim, Arnd Bergmann

On Mon, 11 Jan 2016 11:00:22 -0800 Shaohua Li <shli@fb.com> wrote:

> any chance this can be added to -mm? It can still be applied to latest
> -next tree.

We have an utterly unbelievable amount of MM stuff queued for 4.4. 
I'll be taking a look at madvisev() after 4.5-rc1.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V4][for-next]mm: add a new vector based madvise syscall
@ 2016-02-17  0:08     ` Andrew Morton
  0 siblings, 0 replies; 13+ messages in thread
From: Andrew Morton @ 2016-02-17  0:08 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-mm, linux-api, Kernel-team, Rik van Riel, Mel Gorman,
	Hugh Dickins, Johannes Weiner, Andrea Arcangeli, Andi Kleen,
	Minchan Kim, Arnd Bergmann

On Thu, 10 Dec 2015 16:03:37 -0800 Shaohua Li <shli@fb.com> wrote:

> In jemalloc, a free(3) doesn't immediately free the memory to OS even
> the memory is page aligned/size, and hope the memory can be reused soon.
> Later the virtual address becomes fragmented, and more and more free
> memory are aggregated. If the free memory size is large, jemalloc uses
> madvise(DONT_NEED) to actually free the memory back to OS.
> 
> The madvise has significantly overhead paritcularly because of TLB
> flush. jemalloc does madvise for several virtual address space ranges
> one time. Instead of calling madvise for each of the ranges, we
> introduce a new syscall to purge memory for several ranges one time. In
> this way, we can merge several TLB flush for the ranges to one big TLB
> flush. This also reduce mmap_sem locking and kernel/userspace switching.
> 
> I'm running a simple memory allocation benchmark. 32 threads do random
> malloc/free/realloc.

CPU count?  (Does that matter much?)

> Corresponding jemalloc patch to utilize this API is
> attached.

No it isn't ;)

Who maintains jemalloc?  Are they signed up to actually apply the
patch?  It would be bad to add the patch to the kernel and then find
that the jemalloc maintainers choose not to use it!

> Without patch:
> real    0m18.923s
> user    1m11.819s
> sys     7m44.626s
> each cpu gets around 3000K/s TLB flush interrupt. Perf shows TLB flush
> is hotest functions. mmap_sem read locking (because of page fault) is
> also heavy.
> 
> with patch:
> real    0m15.026s
> user    0m48.548s
> sys     6m41.153s
> each cpu gets around 140k/s TLB flush interrupt. TLB flush isn't hot at
> all. mmap_sem read locking (still because of page fault) becomes the
> sole hot spot.

This is a somewhat underwhelming improvement, given that it's a
synthetic microbenchmark.

> Another test malloc a bunch of memory in 48 threads, then all threads
> free the memory. I measure the time of the memory free.
> Without patch: 34.332s
> With patch:    17.429s

This is more whelming.

Do we have a feel for how much benefit this patch will have for
real-world workloads?  That's pretty important.

> MADV_FREE does the same TLB flush as MADV_NEED, this also applies to

I'll do s/MADV_NEED/MADV_DONTNEED/

> MADV_FREE. Other madvise type can have small benefits too, like reduce
> syscalls/mmap_sem locking.

Could we please get a testcase for the syscall(s) into
tools/testing/selftests/vm?  For long-term maintenance reasons and as a
service to arch maintainers - make it easy for them to check the
functionality without having to roll their own (possibly incomplete)
test app.

I'm not sure *how* we'd develop a test case.  Use mincore()?
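
For what it's worth, a minimal sketch of what a mincore()-based check could
look like (the madvisev() wrapper and the x86-64 syscall number 327 are
assumptions taken from this patch, and this is not an actual
tools/testing/selftests/vm test):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <unistd.h>

#define __NR_madvisev 327	/* x86-64 number from this patch */

int main(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	unsigned char vec[8];
	char *map;
	int i;

	map = mmap(NULL, 8 * psz, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (map == MAP_FAILED)
		return 1;
	memset(map, 1, 8 * psz);		/* make every page resident */

	/* Drop pages 0-1 and 4-5 with a single madvisev() call. */
	struct iovec iov[2] = {
		{ map,           2 * psz },
		{ map + 4 * psz, 2 * psz },
	};
	if (syscall(__NR_madvisev, iov, 2, MADV_DONTNEED)) {
		perror("madvisev");
		return 1;
	}

	/* Pages inside the two ranges should have been zapped, the rest
	 * should still be resident (modulo reclaim, so only a sketch). */
	mincore(map, 8 * psz, vec);
	for (i = 0; i < 8; i++) {
		int resident = vec[i] & 1;
		int dropped = (i < 2) || (i == 4 || i == 5);

		if (resident == dropped) {
			printf("FAIL: page %d resident=%d\n", i, resident);
			return 1;
		}
	}
	printf("PASS\n");
	return 0;
}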

> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -21,7 +21,10 @@
>  #include <linux/swap.h>
>  #include <linux/swapops.h>
>  #include <linux/mmu_notifier.h>
> -
> +#include <linux/uio.h>
> +#ifdef CONFIG_COMPAT
> +#include <linux/compat.h>
> +#endif

I'll nuke the ifdefs - compat.h already does that.


It would be good for us to have a look at the manpage before going too
far with the patch - this helps reviewers to think about the proposed
interface and behaviour.
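
For reference, a rough sketch of what the synopsis could look like, inferred
from the patch itself rather than from an existing manpage (there is no glibc
wrapper, so the prototype below is hypothetical):

#include <sys/mman.h>
#include <sys/uio.h>

int madvisev(const struct iovec *iov, unsigned long nr_segs, int behavior);

/*
 * iov      - page-aligned, non-overlapping ranges, sorted by address
 * nr_segs  - number of entries in iov (bounded by UIO_MAXIOV, as for readv)
 * behavior - any madvise(2) advice, e.g. MADV_DONTNEED or MADV_FREE
 *
 * Returns 0 on success.  Unlike madvise(2), a zero-length range makes the
 * whole call fail with EINVAL instead of being treated as a no-op, since
 * other ranges are present in the vector.
 */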

I'll queue this up for a bit of testing, although it won't get tested
much.  The syscall fuzzers will presumably hit on it.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V4][for-next]mm: add a new vector based madvise syscall
  2016-02-17  0:08     ` Andrew Morton
  (?)
@ 2016-02-17 17:47     ` Shaohua Li
       [not found]       ` <20160217174654.GA3505386-tb7CFzD8y5b7E6g3fPdp/g2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org>
  -1 siblings, 1 reply; 13+ messages in thread
From: Shaohua Li @ 2016-02-17 17:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-api, Kernel-team, Rik van Riel, Mel Gorman,
	Hugh Dickins, Johannes Weiner, Andrea Arcangeli, Andi Kleen,
	Minchan Kim, Arnd Bergmann, Michael Kerrisk, Jason Evans

[-- Attachment #1: Type: text/plain, Size: 3969 bytes --]

On Tue, Feb 16, 2016 at 04:08:02PM -0800, Andrew Morton wrote:
> On Thu, 10 Dec 2015 16:03:37 -0800 Shaohua Li <shli@fb.com> wrote:
> 
> > In jemalloc, a free(3) doesn't immediately free the memory to OS even
> > the memory is page aligned/size, and hope the memory can be reused soon.
> > Later the virtual address becomes fragmented, and more and more free
> > memory are aggregated. If the free memory size is large, jemalloc uses
> > madvise(DONT_NEED) to actually free the memory back to OS.
> > 
> > The madvise has significantly overhead paritcularly because of TLB
> > flush. jemalloc does madvise for several virtual address space ranges
> > one time. Instead of calling madvise for each of the ranges, we
> > introduce a new syscall to purge memory for several ranges one time. In
> > this way, we can merge several TLB flush for the ranges to one big TLB
> > flush. This also reduce mmap_sem locking and kernel/userspace switching.
> > 
> > I'm running a simple memory allocation benchmark. 32 threads do random
> > malloc/free/realloc.
> 
> CPU count?  (Does that matter much?)

32. It does; the TLB flush overhead depends on the CPU count, since each
flush sends shootdown IPIs to the other CPUs running the process.
> > Corresponding jemalloc patch to utilize this API is
> > attached.
> 
> No it isn't ;)

Sorry, I attached it in the first post but not this one. Attached is the
one I tested against this patch.

> Who maintains jemalloc?  Are they signed up to actually apply the
> patch?  It would be bad to add the patch to the kernel and then find
> that the jemalloc maintainers choose not to use it!

Jason Evans (CCed) is the author of jemalloc. I talked to him before; he
is very positive about this new syscall.

> > Without patch:
> > real    0m18.923s
> > user    1m11.819s
> > sys     7m44.626s
> > each cpu gets around 3000K/s TLB flush interrupt. Perf shows TLB flush
> > is hotest functions. mmap_sem read locking (because of page fault) is
> > also heavy.
> > 
> > with patch:
> > real    0m15.026s
> > user    0m48.548s
> > sys     6m41.153s
> > each cpu gets around 140k/s TLB flush interrupt. TLB flush isn't hot at
> > all. mmap_sem read locking (still because of page fault) becomes the
> > sole hot spot.
> 
> This is a somewhat underwhelming improvement, given that it's a
> synthetic microbenchmark.

Yes, this test does malloc, free, calloc, and realloc, so it doesn't
only benchmark madvisev.
> > Another test malloc a bunch of memory in 48 threads, then all threads
> > free the memory. I measure the time of the memory free.
> > Without patch: 34.332s
> > With patch:    17.429s
> 
> This is more whelming.
> 
> Do we have a feel for how much benefit this patch will have for
> real-world workloads?  That's pretty important.

Sure, we'll post some real-world data.
> > MADV_FREE does the same TLB flush as MADV_NEED, this also applies to
> 
> I'll do s/MADV_NEED/MADV_DONTNEED/
> 
> > MADV_FREE. Other madvise type can have small benefits too, like reduce
> > syscalls/mmap_sem locking.
> 
> Could we please get a testcase for the syscall(s) into
> tools/testing/selftests/vm?  For long-term maintenance reasons and as a
> service to arch maintainers - make it easy for them to check the
> functionality without having to roll their own (possibly incomplete)
> test app.
> 
> I'm not sure *how* we'd develop a test case.  Use mincore()?

Ok, I'll add this later.
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -21,7 +21,10 @@
> >  #include <linux/swap.h>
> >  #include <linux/swapops.h>
> >  #include <linux/mmu_notifier.h>
> > -
> > +#include <linux/uio.h>
> > +#ifdef CONFIG_COMPAT
> > +#include <linux/compat.h>
> > +#endif
> 
> I'll nuke the ifdefs - compat.h already does that.
> 
> 
> It would be good for us to have a look at the manpage before going too
> far with the patch - this helps reviewers to think about the proposed
> interface and behaviour.
> 
> I'll queue this up for a bit of testing, although it won't get tested
> much.  The syscall fuzzers will presumably hit on it.

Thanks!

[-- Attachment #2: je.patch --]
[-- Type: text/plain, Size: 2618 bytes --]

diff --git a/src/arena.c b/src/arena.c
index 43733cc..5c1a3b3 100644
--- a/src/arena.c
+++ b/src/arena.c
@@ -1266,6 +1266,7 @@ arena_dirty_count(arena_t *arena)
 	return (ndirty);
 }
 
+#define PURGE_VEC 1
 static size_t
 arena_compute_npurge(arena_t *arena, bool all)
 {
@@ -1280,6 +1281,10 @@ arena_compute_npurge(arena_t *arena, bool all)
 		threshold = threshold < chunk_npages ? chunk_npages : threshold;
 
 		npurge = arena->ndirty - threshold;
+#if PURGE_VEC
+		if (npurge < arena->ndirty / 2)
+			npurge = arena->ndirty / 2;
+#endif
 	} else
 		npurge = arena->ndirty;
 
@@ -1366,6 +1371,31 @@ arena_stash_dirty(arena_t *arena, chunk_hooks_t *chunk_hooks, bool all,
 	return (nstashed);
 }
 
+#if PURGE_VEC
+int mycomp(const void *a, const void *b)
+{
+	unsigned long a_b = (unsigned long)((struct iovec *)a)->iov_base;
+	unsigned long b_b = (unsigned long)((struct iovec *)b)->iov_base;
+
+	if (a_b < b_b)
+		return -1;
+	if (a_b > b_b)
+		return 1;
+	return 0;
+}
+
+#define MAX_IOVEC 32
+bool pages_purge_vec(struct iovec *iov, unsigned long nr_segs)
+{
+	int ret;
+
+	qsort(iov, nr_segs, sizeof(struct iovec), mycomp);
+	ret = syscall(327, iov, nr_segs, MADV_DONTNEED);
+
+	return !!ret;
+}
+#endif
+
 static size_t
 arena_purge_stashed(arena_t *arena, chunk_hooks_t *chunk_hooks,
     arena_runs_dirty_link_t *purge_runs_sentinel,
@@ -1374,6 +1404,10 @@ arena_purge_stashed(arena_t *arena, chunk_hooks_t *chunk_hooks,
 	size_t npurged, nmadvise;
 	arena_runs_dirty_link_t *rdelm;
 	extent_node_t *chunkselm;
+#if PURGE_VEC
+	struct iovec iovec[MAX_IOVEC];
+	int vec_index = 0;
+#endif
 
 	if (config_stats)
 		nmadvise = 0;
@@ -1418,9 +1452,21 @@ arena_purge_stashed(arena_t *arena, chunk_hooks_t *chunk_hooks,
 				flag_unzeroed = 0;
 				flags = CHUNK_MAP_DECOMMITTED;
 			} else {
+#if !PURGE_VEC
 				flag_unzeroed = chunk_purge_wrapper(arena,
 				    chunk_hooks, chunk, chunksize, pageind <<
 				    LG_PAGE, run_size) ? CHUNK_MAP_UNZEROED : 0;
+#else
+				flag_unzeroed = 0;
+				iovec[vec_index].iov_base = (void *)((uintptr_t)chunk +
+					(pageind << LG_PAGE));
+				iovec[vec_index].iov_len = run_size;
+				vec_index++;
+				if (vec_index >= MAX_IOVEC) {
+					pages_purge_vec(iovec, vec_index);
+					vec_index = 0;
+				}
+#endif
 				flags = flag_unzeroed;
 			}
 			arena_mapbits_large_set(chunk, pageind+npages-1, 0,
@@ -1449,6 +1495,10 @@ arena_purge_stashed(arena_t *arena, chunk_hooks_t *chunk_hooks,
 		if (config_stats)
 			nmadvise++;
 	}
+#if PURGE_VEC
+	if (vec_index > 0)
+		pages_purge_vec(iovec, vec_index);
+#endif
 	malloc_mutex_lock(&arena->lock);
 
 	if (config_stats) {

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH V4][for-next]mm: add a new vector based madvise syscall
@ 2016-02-18 12:38           ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 13+ messages in thread
From: Michael Kerrisk (man-pages) @ 2016-02-18 12:38 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Andrew Morton, linux-mm, Linux API, Kernel-team, Rik van Riel,
	Mel Gorman, Hugh Dickins, Johannes Weiner, Andrea Arcangeli,
	Andi Kleen, Minchan Kim, Arnd Bergmann, Jason Evans

Hello Shaohua Li,

On 17 February 2016 at 18:47, Shaohua Li <shli@fb.com> wrote:
> On Tue, Feb 16, 2016 at 04:08:02PM -0800, Andrew Morton wrote:
>> On Thu, 10 Dec 2015 16:03:37 -0800 Shaohua Li <shli@fb.com> wrote:

[...]

>> It would be good for us to have a look at the manpage before going too
>> far with the patch - this helps reviewers to think about the proposed
>> interface and behaviour.
>>
>> I'll queue this up for a bit of testing, although it won't get tested
>> much.  The syscall fuzzers will presumably hit on it.

Please don't forget the other piece. A patch, or a suitable piece of
text for the madvise.2 man page please!

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V4][for-next]mm: add a new vector based madvise syscall
@ 2016-03-08 22:37           ` Shaohua Li
  0 siblings, 0 replies; 13+ messages in thread
From: Shaohua Li @ 2016-03-08 22:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-api, Kernel-team, Rik van Riel, Mel Gorman,
	Hugh Dickins, Johannes Weiner, Andrea Arcangeli, Andi Kleen,
	Minchan Kim, Arnd Bergmann, Michael Kerrisk, Jason Evans,
	Dave Watson

On Wed, Feb 17, 2016 at 09:47:06AM -0800, Shaohua Li wrote:
> On Tue, Feb 16, 2016 at 04:08:02PM -0800, Andrew Morton wrote:
> > On Thu, 10 Dec 2015 16:03:37 -0800 Shaohua Li <shli@fb.com> wrote:
> > 
> > > In jemalloc, a free(3) doesn't immediately free the memory to OS even
> > > the memory is page aligned/size, and hope the memory can be reused soon.
> > > Later the virtual address becomes fragmented, and more and more free
> > > memory are aggregated. If the free memory size is large, jemalloc uses
> > > madvise(DONT_NEED) to actually free the memory back to OS.
> > > 
> > > The madvise has significantly overhead paritcularly because of TLB
> > > flush. jemalloc does madvise for several virtual address space ranges
> > > one time. Instead of calling madvise for each of the ranges, we
> > > introduce a new syscall to purge memory for several ranges one time. In
> > > this way, we can merge several TLB flush for the ranges to one big TLB
> > > flush. This also reduce mmap_sem locking and kernel/userspace switching.
> > > 
> > > I'm running a simple memory allocation benchmark. 32 threads do random
> > > malloc/free/realloc.
> > 
> > CPU count?  (Does that matter much?)
> 
> 32. It does. the tlb flush overhead depends on the cpu count. 
> > > Corresponding jemalloc patch to utilize this API is
> > > attached.
> > 
> > No it isn't ;)
> 
> Sorry, I attached it in first post, but not this one. Attached is the
> one I tested against this patch.
> 
> > Who maintains jemalloc?  Are they signed up to actually apply the
> > patch?  It would be bad to add the patch to the kernel and then find
> > that the jemalloc maintainers choose not to use it!
> 
> Jason Evans (cced) is the author of jemalloc. I talked to him before, he
> is very positive to this new syscall.
> 
> > > Without patch:
> > > real    0m18.923s
> > > user    1m11.819s
> > > sys     7m44.626s
> > > each cpu gets around 3000K/s TLB flush interrupt. Perf shows TLB flush
> > > is hotest functions. mmap_sem read locking (because of page fault) is
> > > also heavy.
> > > 
> > > with patch:
> > > real    0m15.026s
> > > user    0m48.548s
> > > sys     6m41.153s
> > > each cpu gets around 140k/s TLB flush interrupt. TLB flush isn't hot at
> > > all. mmap_sem read locking (still because of page fault) becomes the
> > > sole hot spot.
> > 
> > This is a somewhat underwhelming improvement, given that it's a
> > synthetic microbenchmark.
> 
> Yes, this test does malloc, free, calloc, realloc, so it doesn't only
> benchmark the madvisev.
> > > Another test malloc a bunch of memory in 48 threads, then all threads
> > > free the memory. I measure the time of the memory free.
> > > Without patch: 34.332s
> > > With patch:    17.429s
> > 
> > This is more whelming.
> > 
> > Do we have a feel for how much benefit this patch will have for
> > real-world workloads?  That's pretty important.
> 
> Sure, we'll post some real-world data.

Hi Andrew,

Sorry, I can't post real-world data. Our workloads used to suffer badly
from TLB flush overhead, but it now looks like something has changed and
TLB flush overhead is no longer significant in those workloads.

The jemalloc developers (Dave, CCed) have also made progress improving
jemalloc; they can reduce TLB flushes without kernel changes.

In summary, the patch doesn't deliver the expected benefit in our real
workloads now. Unless somebody has other use cases, I'd drop this
patch.

Thanks,
Shaohua


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2016-03-08 22:38 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-12-11  0:03 [PATCH V4][for-next]mm: add a new vector based madvise syscall Shaohua Li
2015-12-11  0:03 ` Shaohua Li
2015-12-11 18:31 ` Andi Kleen
2016-01-11 19:00 ` Shaohua Li
     [not found]   ` <20160111190021.GA3589410-tb7CFzD8y5b7E6g3fPdp/g2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org>
2016-01-11 19:06     ` Andrew Morton
2016-01-11 19:06       ` Andrew Morton
     [not found] ` <d01698140a51cf9b2ce233c7574c2ece9f6fa241.1449791762.git.shli-b10kYP2dOMg@public.gmane.org>
2016-02-17  0:08   ` Andrew Morton
2016-02-17  0:08     ` Andrew Morton
2016-02-17 17:47     ` Shaohua Li
     [not found]       ` <20160217174654.GA3505386-tb7CFzD8y5b7E6g3fPdp/g2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org>
2016-02-18 12:38         ` Michael Kerrisk (man-pages)
2016-02-18 12:38           ` Michael Kerrisk (man-pages)
2016-03-08 22:37         ` Shaohua Li
2016-03-08 22:37           ` Shaohua Li
