* [PATCH 1/2] fs: use current->mm for io_uring
@ 2020-04-23 14:52 Minchan Kim
2020-04-23 14:52 ` [PATCH 2/2] mm: support vector address ranges for process_madvise Minchan Kim
0 siblings, 1 reply; 4+ messages in thread
From: Minchan Kim @ 2020-04-23 14:52 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm, Minchan Kim, Jens Axboe, Jann Horn
As requested by Jens:
https://lore.kernel.org/io-uring/46e5b8bf-0f14-caff-f706-91794191e730@kernel.dk/
Andrew, please fold this into "mm/madvise: pass task and mm to do_madvise"
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
fs/io_uring.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 54b7c4818289..94af6cd05fb5 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -3267,7 +3267,7 @@ static int io_madvise(struct io_kiocb *req, bool force_nonblock)
if (force_nonblock)
return -EAGAIN;
- ret = do_madvise(NULL, req->work.mm, ma->addr, ma->len, ma->advice);
+ ret = do_madvise(NULL, current->mm, ma->addr, ma->len, ma->advice);
if (ret < 0)
req_set_fail_links(req);
io_cqring_add_event(req, ret);
--
2.26.1.301.g55bc3eb7cb9-goog
^ permalink raw reply related [flat|nested] 4+ messages in thread
* [PATCH 2/2] mm: support vector address ranges for process_madvise
2020-04-23 14:52 [PATCH 1/2] fs: use current->mm for io_uring Minchan Kim
@ 2020-04-23 14:52 ` Minchan Kim
2020-04-23 15:14 ` Matthew Wilcox
0 siblings, 1 reply; 4+ messages in thread
From: Minchan Kim @ 2020-04-23 14:52 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, Minchan Kim, Jens Axboe, Jann Horn, David Rientjes,
Arjun Roy, Tim Murray, Daniel Colascione, Sonny Rao,
Brian Geffon, Shakeel Butt, John Dias, Joel Fernandes,
SeongJae Park, Oleksandr Natalenko, Suren Baghdasaryan,
Sandeep Patil, Michal Hocko, Johannes Weiner, Vlastimil Babka,
linux-man
This patch extends process_madvise(2) to a) accept a vector of address
ranges in a single system call and b) apply those ranges to the local
process as well as to an external process.
An Android app has thousands of vmas because of zygote, so calling the
syscall once per vma is a waste of CPU and power. (In a test comparing
2000 single-vma syscalls against one vector syscall, the vector version
showed a 15% performance improvement. I think the gain would be bigger
in practice because the test ran in a very cache-friendly environment.)
Another potential use case for the vector range is to amortize the
cost of TLB shootdowns for multiple ranges when using MADV_DONTNEED;
this could benefit users like TCP receive zerocopy and malloc
implementations. In the future, we could find more use cases for other
advice values, so let's make this part of the API now, while we are
introducing a new syscall anyway. With that, existing madvise(2) users
could switch to process_madvise(2) with their own pid if they want
support for batched address ranges.
So finally, the API is as follows,
ssize_t process_madvise(idtype_t idtype, id_t id,
const struct iovec *iovec, unsigned long vlen,
int advice, unsigned long flags);
DESCRIPTION
The process_madvise() system call gives advice or directions to the
kernel about address ranges of an external process as well as of the
local process. It applies the advice to the address ranges of the
target process described by iovec and vlen. The goal of such advice is
to improve system or application performance.
The idtype and id arguments select the target process to be advised as
follows:
idtype == P_PID
select the process whose process ID matches id
idtype == P_PIDFD
select the process referred to by the PID file descriptor
specified in id. (See pidfd_open(2) for further information.)
The pointer iovec points to an array of iovec structures, defined in
<sys/uio.h> as:
struct iovec {
void *iov_base; /* starting address */
size_t iov_len; /* number of bytes to be advised */
};
Each element of iovec describes an address range that begins at
iov_base and spans iov_len bytes.
The vlen argument is the number of elements in iovec.
The advice is given in the advice argument. If the target process
specified by idtype and id is external, it must currently be one of the
following:
MADV_COLD
MADV_PAGEOUT
MADV_MERGEABLE
MADV_UNMERGEABLE
Permission to provide a hint to an external process is governed by a
ptrace access mode PTRACE_MODE_ATTACH_FSCREDS check; see ptrace(2).
process_madvise() supports every advice value that madvise(2) supports
if the target process is in the same thread group as the calling
process, so a user can use process_madvise(2) as an extension of
madvise(2) that supports vector address ranges.
RETURN VALUE
On success, process_madvise() returns the number of bytes advised.
This value may be less than the total number of requested bytes if an
error occurred. The caller should check the return value to determine
whether the advice was applied only partially.
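A caller that receives a short byte count has to map it back onto its
iovec array to see where the advice stopped. The following is only a
userspace sketch of that bookkeeping, assuming the return value
semantics described above; iov_total_len() and iov_resume_point() are
hypothetical helper names, not part of this patch or of any library.

```c
#include <stddef.h>
#include <sys/uio.h>

/* Sum of iov_len over all elements: the number of bytes requested. */
static size_t iov_total_len(const struct iovec *iov, size_t vlen)
{
	size_t total = 0;

	for (size_t i = 0; i < vlen; i++)
		total += iov[i].iov_len;
	return total;
}

/*
 * Hypothetical helper: given the byte count returned by a vectored
 * call, find where the advice stopped. Returns the index of the first
 * element that was not fully advised and stores the number of bytes
 * already advised within that element in *offset. Returns vlen (and
 * sets *offset to 0) if every element was fully advised.
 */
static size_t iov_resume_point(const struct iovec *iov, size_t vlen,
			       size_t advised, size_t *offset)
{
	for (size_t i = 0; i < vlen; i++) {
		if (advised < iov[i].iov_len) {
			*offset = advised;
			return i;
		}
		advised -= iov[i].iov_len;
	}
	*offset = 0;
	return vlen;
}
```

If the returned count equals iov_total_len(), nothing is left to do;
otherwise the caller could retry or report from the element and offset
that iov_resume_point() yields.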
Cc: David Rientjes <rientjes@google.com>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Daniel Colascione <dancol@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: John Dias <joaodias@google.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: SeongJae Park <sj38.park@gmail.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <linux-man@vger.kernel.org>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
mm/madvise.c | 47 ++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 40 insertions(+), 7 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 097506466fdc..3082d7fa64ee 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1195,20 +1195,39 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
return do_madvise(current, current->mm, start, len_in, behavior);
}
-SYSCALL_DEFINE6(process_madvise, int, which, pid_t, upid, unsigned long, start,
- size_t, len_in, int, behavior, unsigned long, flags)
+static int do_process_madvise(struct task_struct *target_task,
+ struct mm_struct *mm, struct iov_iter *iter, int behavior)
{
- int ret;
+ struct iovec iovec;
+ int ret = 0;
+
+ while (iov_iter_count(iter)) {
+ iovec = iov_iter_iovec(iter);
+ ret = do_madvise(target_task, mm, (unsigned long)iovec.iov_base,
+ iovec.iov_len, behavior);
+ if (ret < 0)
+ break;
+ iov_iter_advance(iter, iovec.iov_len);
+ }
+
+ return ret;
+}
+
+SYSCALL_DEFINE6(process_madvise, int, which, pid_t, upid,
+ const struct iovec __user *, vec, unsigned long, vlen,
+ int, behavior, unsigned long, flags)
+{
+ ssize_t ret;
struct pid *pid;
struct task_struct *task;
struct mm_struct *mm;
+ struct iovec iovstack[UIO_FASTIOV];
+ struct iovec *iov = iovstack;
+ struct iov_iter iter;
if (flags != 0)
return -EINVAL;
- if (!process_madvise_behavior_valid(behavior))
- return -EINVAL;
-
switch (which) {
case P_PID:
if (upid <= 0)
@@ -1236,13 +1255,27 @@ SYSCALL_DEFINE6(process_madvise, int, which, pid_t, upid, unsigned long, start,
goto put_pid;
}
+ if (task->mm != current->mm &&
+ !process_madvise_behavior_valid(behavior)) {
+ ret = -EINVAL;
+ goto release_task;
+ }
+
mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
if (IS_ERR_OR_NULL(mm)) {
ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
goto release_task;
}
- ret = do_madvise(task, mm, start, len_in, behavior);
+ ret = import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack), &iov, &iter);
+ if (ret >= 0) {
+ size_t total_len = iov_iter_count(&iter);
+
+ ret = do_process_madvise(task, mm, &iter, behavior);
+ if (ret >= 0)
+ ret = total_len - iov_iter_count(&iter);
+ kfree(iov);
+ }
mmput(mm);
release_task:
put_task_struct(task);
--
2.26.1.301.g55bc3eb7cb9-goog
^ permalink raw reply related [flat|nested] 4+ messages in thread
* Re: [PATCH 2/2] mm: support vector address ranges for process_madvise
2020-04-23 14:52 ` [PATCH 2/2] mm: support vector address ranges for process_madvise Minchan Kim
@ 2020-04-23 15:14 ` Matthew Wilcox
2020-04-23 19:58 ` Minchan Kim
0 siblings, 1 reply; 4+ messages in thread
From: Matthew Wilcox @ 2020-04-23 15:14 UTC (permalink / raw)
To: Minchan Kim
Cc: Andrew Morton, linux-mm, Jens Axboe, Jann Horn, David Rientjes,
Arjun Roy, Tim Murray, Daniel Colascione, Sonny Rao,
Brian Geffon, Shakeel Butt, John Dias, Joel Fernandes,
SeongJae Park, Oleksandr Natalenko, Suren Baghdasaryan,
Sandeep Patil, Michal Hocko, Johannes Weiner, Vlastimil Babka,
linux-man
On Thu, Apr 23, 2020 at 07:52:15AM -0700, Minchan Kim wrote:
> +SYSCALL_DEFINE6(process_madvise, int, which, pid_t, upid,
> + const struct iovec __user *, vec, unsigned long, vlen,
> + int, behavior, unsigned long, flags)
> +{
Don't we now need a compat version of this that calls compat_import_iovec()?
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH 2/2] mm: support vector address ranges for process_madvise
2020-04-23 15:14 ` Matthew Wilcox
@ 2020-04-23 19:58 ` Minchan Kim
0 siblings, 0 replies; 4+ messages in thread
From: Minchan Kim @ 2020-04-23 19:58 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Andrew Morton, linux-mm, Jens Axboe, Jann Horn, David Rientjes,
Arjun Roy, Tim Murray, Daniel Colascione, Sonny Rao,
Brian Geffon, Shakeel Butt, John Dias, Joel Fernandes,
SeongJae Park, Oleksandr Natalenko, Suren Baghdasaryan,
Sandeep Patil, Michal Hocko, Johannes Weiner, Vlastimil Babka,
linux-man
On Thu, Apr 23, 2020 at 08:14:10AM -0700, Matthew Wilcox wrote:
> On Thu, Apr 23, 2020 at 07:52:15AM -0700, Minchan Kim wrote:
> > +SYSCALL_DEFINE6(process_madvise, int, which, pid_t, upid,
> > + const struct iovec __user *, vec, unsigned long, vlen,
> > + int, behavior, unsigned long, flags)
> > +{
>
> Don't we now need a compat version of this that calls compat_import_iovec()?
Yup, thanks.
From 72fd6dbbab734803110817ef180ac2c6e3c2ca2a Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan@kernel.org>
Date: Thu, 23 Apr 2020 12:53:53 -0700
Subject: [PATCH] mm: support compat_sys_process_madvise
This patch adds a compat syscall for process_madvise.
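The compat entry point is needed because a 32-bit task passes iovec
elements with 32-bit base and length fields, which a 64-bit kernel
cannot read as native struct iovec. The following is only a userspace
model of that widening step, roughly the job compat_import_iovec()
does in the kernel; struct compat_iovec32 and widen_iovec() are
hypothetical names used for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Userspace model of the 32-bit element layout (hypothetical name). */
struct compat_iovec32 {
	uint32_t iov_base;	/* 32-bit user pointer */
	uint32_t iov_len;	/* 32-bit length in bytes */
};

/* Widen each 32-bit element into a native struct iovec. */
static void widen_iovec(const struct compat_iovec32 *src,
			struct iovec *dst, size_t vlen)
{
	for (size_t i = 0; i < vlen; i++) {
		dst[i].iov_base = (void *)(uintptr_t)src[i].iov_base;
		dst[i].iov_len = src[i].iov_len;
	}
}
```

Each 32-bit element occupies 8 bytes, while a native element on a
64-bit kernel occupies 16, so indexing the user array with the native
layout would misread every element.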
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
arch/arm64/include/asm/unistd32.h | 2 +-
arch/mips/kernel/syscalls/syscall_n32.tbl | 2 +-
arch/parisc/kernel/syscalls/syscall.tbl | 2 +-
arch/powerpc/kernel/syscalls/syscall.tbl | 2 +-
arch/s390/kernel/syscalls/syscall.tbl | 2 +-
arch/sparc/kernel/syscalls/syscall.tbl | 2 +-
arch/x86/entry/syscalls/syscall_32.tbl | 2 +-
arch/x86/entry/syscalls/syscall_64.tbl | 3 +-
include/linux/compat.h | 3 ++
include/uapi/asm-generic/unistd.h | 3 +-
kernel/sys_ni.c | 1 +
mm/madvise.c | 60 +++++++++++++++++------
12 files changed, 60 insertions(+), 24 deletions(-)
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index a1eec8d879d4..4958633ea7c2 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -884,7 +884,7 @@ __SYSCALL(__NR_openat2, sys_openat2)
#define __NR_pidfd_getfd 438
__SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
#define __NR_process_madvise 439
-__SYSCALL(__NR_process_madvise, sys_process_madvise)
+__SYSCALL(__NR_process_madvise, compat_sys_process_madvise)
/*
* Please add new compat syscalls above this comment and update
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 8ec8c558aa9c..1a20c9fcdcdb 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -376,4 +376,4 @@
435 n32 clone3 __sys_clone3
437 n32 openat2 sys_openat2
438 n32 pidfd_getfd sys_pidfd_getfd
-439 n32 process_madvise sys_process_madvise
+439 n32 process_madvise compat_sys_process_madvise
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index 09c3b5dc6855..3cbf5f5edab5 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -435,4 +435,4 @@
435 common clone3 sys_clone3_wrapper
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
-439 common process_madvise sys_process_madvise
+439 common process_madvise sys_process_madvise compat_sys_process_madvise
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 8a0c7239a6f3..03b8bc7707db 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -527,4 +527,4 @@
435 spu clone3 sys_ni_syscall
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
-439 common process_madvise sys_process_madvise
+439 common process_madvise sys_process_madvise compat_sys_process_madvise
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 8dc8bfd958ea..69f4a5459f0e 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -440,4 +440,4 @@
435 common clone3 sys_clone3 sys_clone3
437 common openat2 sys_openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd sys_pidfd_getfd
-439 common process_madvise sys_process_madvise sys_process_madvise
+439 common process_madvise sys_process_madvise compat_sys_process_madvise
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 6f6e66dd51f9..bff349fcba0d 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -483,4 +483,4 @@
# 435 reserved for clone3
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
-439 common process_madvise sys_process_madvise
+439 common process_madvise sys_process_madvise compat_sys_process_madvise
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 90950255ae5c..fd8032ddffb2 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -442,4 +442,4 @@
435 i386 clone3 sys_clone3
437 i386 openat2 sys_openat2
438 i386 pidfd_getfd sys_pidfd_getfd
-439 i386 process_madvise sys_process_madvise
+439 i386 process_madvise sys_process_madvise compat_sys_process_madvise
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index bcf0d6d0dffe..3ff4be446922 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -359,7 +359,7 @@
435 common clone3 sys_clone3
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
-439 common process_madvise sys_process_madvise
+439 64 process_madvise sys_process_madvise
#
# x32-specific system call numbers start at 512 to avoid cache impact
@@ -403,3 +403,4 @@
545 x32 execveat compat_sys_execveat
546 x32 preadv2 compat_sys_preadv64v2
547 x32 pwritev2 compat_sys_pwritev64v2
+548 x32 process_madvise compat_sys_process_madvise
diff --git a/include/linux/compat.h b/include/linux/compat.h
index 0480ba4db592..1134ba3e61d0 100644
--- a/include/linux/compat.h
+++ b/include/linux/compat.h
@@ -820,6 +820,9 @@ asmlinkage long compat_sys_pwritev64v2(unsigned long fd,
unsigned long vlen, loff_t pos, rwf_t flags);
#endif
+asmlinkage ssize_t compat_sys_process_madvise(int which,
+ compat_pid_t upid, const struct compat_iovec __user *vec,
+ unsigned long vlen, int behavior, unsigned long flags);
/*
* Deprecated system calls which are still defined in
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index fa289b91410e..7fde54d6a8ac 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -856,7 +856,8 @@ __SYSCALL(__NR_openat2, sys_openat2)
#define __NR_pidfd_getfd 438
__SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
#define __NR_process_madvise 439
-__SYSCALL(__NR_process_madvise, sys_process_madvise)
+__SC_COMP(__NR_process_madvise, sys_process_madvise, \
+ compat_sys_process_madvise)
#undef __NR_syscalls
#define __NR_syscalls 440
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 6c7332776e8e..16fca3a43633 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -281,6 +281,7 @@ COND_SYSCALL(munlockall);
COND_SYSCALL(mincore);
COND_SYSCALL(madvise);
COND_SYSCALL(process_madvise);
+COND_SYSCALL_COMPAT(process_madvise);
COND_SYSCALL(remap_file_pages);
COND_SYSCALL(mbind);
COND_SYSCALL_COMPAT(mbind);
diff --git a/mm/madvise.c b/mm/madvise.c
index 3082d7fa64ee..29bf2535624a 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1195,7 +1195,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
return do_madvise(current, current->mm, start, len_in, behavior);
}
-static int do_process_madvise(struct task_struct *target_task,
+static int process_madvise_vec(struct task_struct *target_task,
struct mm_struct *mm, struct iov_iter *iter, int behavior)
{
struct iovec iovec;
@@ -1213,17 +1213,14 @@ static int do_process_madvise(struct task_struct *target_task,
return ret;
}
-SYSCALL_DEFINE6(process_madvise, int, which, pid_t, upid,
- const struct iovec __user *, vec, unsigned long, vlen,
- int, behavior, unsigned long, flags)
+ssize_t do_process_madvise(int which, pid_t upid, struct iov_iter *iter,
+ int behavior, unsigned long flags)
{
ssize_t ret;
struct pid *pid;
struct task_struct *task;
struct mm_struct *mm;
- struct iovec iovstack[UIO_FASTIOV];
- struct iovec *iov = iovstack;
- struct iov_iter iter;
+ size_t total_len = iov_iter_count(iter);
if (flags != 0)
return -EINVAL;
@@ -1267,15 +1264,10 @@ SYSCALL_DEFINE6(process_madvise, int, which, pid_t, upid,
goto release_task;
}
- ret = import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack), &iov, &iter);
- if (ret >= 0) {
- size_t total_len = iov_iter_count(&iter);
+ ret = process_madvise_vec(task, mm, iter, behavior);
+ if (ret >= 0)
+ ret = total_len - iov_iter_count(iter);
- ret = do_process_madvise(task, mm, &iter, behavior);
- if (ret >= 0)
- ret = total_len - iov_iter_count(&iter);
- kfree(iov);
- }
mmput(mm);
release_task:
put_task_struct(task);
@@ -1283,3 +1275,41 @@ SYSCALL_DEFINE6(process_madvise, int, which, pid_t, upid,
put_pid(pid);
return ret;
}
+
+SYSCALL_DEFINE6(process_madvise, int, which, pid_t, upid,
+ const struct iovec __user *, vec, unsigned long, vlen,
+ int, behavior, unsigned long, flags)
+{
+ ssize_t ret;
+ struct iovec iovstack[UIO_FASTIOV];
+ struct iovec *iov = iovstack;
+ struct iov_iter iter;
+
+ ret = import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack), &iov, &iter);
+ if (ret >= 0) {
+ ret = do_process_madvise(which, upid, &iter, behavior, flags);
+ kfree(iov);
+ }
+ return ret;
+}
+
+#ifdef CONFIG_COMPAT
+COMPAT_SYSCALL_DEFINE6(process_madvise, int, which, compat_pid_t, upid,
+ const struct compat_iovec __user *, vec, unsigned long, vlen,
+ int, behavior, unsigned long, flags)
+
+{
+ ssize_t ret;
+ struct iovec iovstack[UIO_FASTIOV];
+ struct iovec *iov = iovstack;
+ struct iov_iter iter;
+
+ ret = compat_import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack),
+ &iov, &iter);
+ if (ret >= 0) {
+ ret = do_process_madvise(which, upid, &iter, behavior, flags);
+ kfree(iov);
+ }
+ return ret;
+}
+#endif
--
2.26.1.301.g55bc3eb7cb9-goog
^ permalink raw reply related [flat|nested] 4+ messages in thread
end of thread, other threads:[~2020-04-23 19:58 UTC | newest]
2020-04-23 14:52 [PATCH 1/2] fs: use current->mm for io_uring Minchan Kim
2020-04-23 14:52 ` [PATCH 2/2] mm: support vector address ranges for process_madvise Minchan Kim
2020-04-23 15:14 ` Matthew Wilcox
2020-04-23 19:58 ` Minchan Kim