From: Minchan Kim <minchan@kernel.org>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm <linux-mm@kvack.org>, Minchan Kim <minchan@kernel.org>,
Jens Axboe <axboe@kernel.dk>, Jann Horn <jannh@google.com>,
David Rientjes <rientjes@google.com>,
Arjun Roy <arjunroy@google.com>,
Tim Murray <timmurray@google.com>,
Daniel Colascione <dancol@google.com>,
Sonny Rao <sonnyrao@google.com>,
Brian Geffon <bgeffon@google.com>,
Shakeel Butt <shakeelb@google.com>,
John Dias <joaodias@google.com>,
Joel Fernandes <joel@joelfernandes.org>,
SeongJae Park <sj38.park@gmail.com>,
Oleksandr Natalenko <oleksandr@redhat.com>,
Suren Baghdasaryan <surenb@google.com>,
Sandeep Patil <sspatil@google.com>,
Michal Hocko <mhocko@suse.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Vlastimil Babka <vbabka@suse.cz>,
linux-man@vger.kernel.org
Subject: [PATCH 2/2] mm: support vector address ranges for process_madvise
Date: Thu, 23 Apr 2020 07:52:15 -0700
Message-ID: <20200423145215.72666-2-minchan@kernel.org>
In-Reply-To: <20200423145215.72666-1-minchan@kernel.org>
This patch a) extends process_madvise(2) to accept a vector of address
ranges in a single system call, and b) allows that vector to target the
calling process as well as an external process.

An Android app has thousands of VMAs due to zygote, so calling the
syscall once per VMA is a waste of CPU and power. (In a test comparing
per-VMA syscalls over 2000 VMAs against a single vectored syscall, the
vectored call showed a 15% performance improvement. I expect the gain
to be bigger in practice because the test ran in a very cache-friendly
environment.)

Another potential use case for the vector range is to amortize the
cost of TLB shootdowns for multiple ranges when using MADV_DONTNEED;
this could benefit users like TCP receive zerocopy and malloc
implementations. In the future, we may find more use cases for other
advice values, so let's make this part of the API now, while we are
introducing a new syscall. With that, existing madvise(2) users could
switch to process_madvise(2) with their own pid if they want batched
address range support.
The resulting API is as follows:

    ssize_t process_madvise(idtype_t idtype, id_t id,
                            const struct iovec *iovec, unsigned long vlen,
                            int advice, unsigned long flags);
DESCRIPTION
The process_madvise() system call is used to give advice or directions
to the kernel about address ranges of an external process as well as of
the calling process. It applies the advice to the address ranges of the
target process described by iovec and vlen. The goal of such advice is
to improve system or application performance.
The idtype and id arguments select the target process to be advised as
follows:
idtype == P_PID
    select the process whose process ID matches id
idtype == P_PIDFD
    select the process referred to by the PID file descriptor
    specified in id. (See pidfd_open(2) for further information.)
The pointer iovec points to an array of iovec structures, defined in
<sys/uio.h> as:
    struct iovec {
        void  *iov_base;    /* starting address */
        size_t iov_len;     /* number of bytes to be advised */
    };
Each iovec describes an address range that begins at iov_base and is
iov_len bytes long.
The vlen argument is the number of elements in the iovec array.
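As a rough illustration of how a caller might prepare the iovec and vlen
arguments described above, the sketch below fills an iovec array from a
list of ranges and returns the total byte count, which is the value a
fully successful call would be expected to return. The advise_range
struct and fill_iovec() helper are hypothetical names used only for this
example; they are not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* Hypothetical helper type: one address range to be advised. */
struct advise_range {
	void *start;
	size_t len;
};

/*
 * Fill vec[0..n-1] from ranges[0..n-1] and return the total number of
 * bytes described by the vector.
 */
static size_t fill_iovec(struct iovec *vec, const struct advise_range *ranges,
			 size_t n)
{
	size_t total = 0;
	size_t i;

	assert(vec != NULL && ranges != NULL);
	for (i = 0; i < n; i++) {
		vec[i].iov_base = ranges[i].start;	/* starting address */
		vec[i].iov_len = ranges[i].len;		/* bytes to advise */
		total += ranges[i].len;
	}
	return total;
}
```

The filled array and n would then be passed as the iovec and vlen
arguments of process_madvise().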
The advice is indicated in the advice argument. If the target process
specified by idtype and id is external, it must currently be one of the
following:
MADV_COLD
MADV_PAGEOUT
MADV_MERGEABLE
MADV_UNMERGEABLE
Permission to provide a hint to an external process is governed by a
ptrace access mode PTRACE_MODE_ATTACH_FSCREDS check; see ptrace(2).
If the target process is in the same thread group as the calling
process, process_madvise() supports every advice value that madvise(2)
does, so users can use process_madvise(2) as an extension of madvise(2)
that supports vector address ranges.
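The rule above can be modelled in user space roughly as follows. This is
only a sketch of the policy as described here, not the kernel code:
advice_allowed() and external_advice_allowed() are hypothetical names,
and the fallback MADV_* defines carry the Linux UAPI values in case the
installed libc headers do not provide them. For a local target the model
accepts any value and leaves validation to madvise(2) itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/mman.h>

/* Linux UAPI values, used only if the libc headers lack them. */
#ifndef MADV_DONTNEED
#define MADV_DONTNEED 4
#endif
#ifndef MADV_MERGEABLE
#define MADV_MERGEABLE 12
#endif
#ifndef MADV_UNMERGEABLE
#define MADV_UNMERGEABLE 13
#endif
#ifndef MADV_COLD
#define MADV_COLD 20
#endif
#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21
#endif

/* Advice values listed above as usable on an external process. */
static bool external_advice_allowed(int advice)
{
	switch (advice) {
	case MADV_COLD:
	case MADV_PAGEOUT:
	case MADV_MERGEABLE:
	case MADV_UNMERGEABLE:
		return true;
	default:
		return false;
	}
}

/* A local target is not restricted to the external list. */
static bool advice_allowed(bool target_is_local, int advice)
{
	return target_is_local || external_advice_allowed(advice);
}

/* Small self-check of the rules described above. */
static void check_advice_rules(void)
{
	assert(advice_allowed(false, MADV_PAGEOUT));
	assert(!advice_allowed(false, MADV_DONTNEED));
	assert(advice_allowed(true, MADV_DONTNEED));
}
```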
RETURN VALUE
On success, process_madvise() returns the number of bytes advised.
This value may be less than the total number of requested bytes if an
error occurred. The caller should check the return value to determine
whether only part of the request was advised.
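A caller could interpret a short return value along these lines. The
sketch assumes the semantics stated above (the return value counts bytes
advised from the start of the vector) and uses a hypothetical helper,
first_unadvised(), that is not part of the patch. It returns the index
of the first range that was not fully advised, so the caller knows where
to retry or report an error.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Given the vector passed to the call and the byte count it returned,
 * return the index of the first range that was not fully advised, or
 * vlen if every range was covered. A negative ret is treated as
 * "nothing advised".
 */
static size_t first_unadvised(const struct iovec *vec, size_t vlen,
			      ssize_t ret)
{
	size_t done = ret > 0 ? (size_t)ret : 0;
	size_t i;

	assert(vec != NULL);
	for (i = 0; i < vlen; i++) {
		if (done < vec[i].iov_len)
			break;			/* range i is incomplete */
		done -= vec[i].iov_len;		/* range i fully advised */
	}
	return i;
}
```

Comparing the result with vlen tells the caller whether the whole vector
was advised.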
Cc: David Rientjes <rientjes@google.com>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Daniel Colascione <dancol@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: John Dias <joaodias@google.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: SeongJae Park <sj38.park@gmail.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <linux-man@vger.kernel.org>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
mm/madvise.c | 47 ++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 40 insertions(+), 7 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 097506466fdc..3082d7fa64ee 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1195,20 +1195,39 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 	return do_madvise(current, current->mm, start, len_in, behavior);
 }
 
-SYSCALL_DEFINE6(process_madvise, int, which, pid_t, upid, unsigned long, start,
-		size_t, len_in, int, behavior, unsigned long, flags)
+static int do_process_madvise(struct task_struct *target_task,
+		struct mm_struct *mm, struct iov_iter *iter, int behavior)
 {
-	int ret;
+	struct iovec iovec;
+	int ret = 0;
+
+	while (iov_iter_count(iter)) {
+		iovec = iov_iter_iovec(iter);
+		ret = do_madvise(target_task, mm, (unsigned long)iovec.iov_base,
+					iovec.iov_len, behavior);
+		if (ret < 0)
+			break;
+		iov_iter_advance(iter, iovec.iov_len);
+	}
+
+	return ret;
+}
+
+SYSCALL_DEFINE6(process_madvise, int, which, pid_t, upid,
+		const struct iovec __user *, vec, unsigned long, vlen,
+		int, behavior, unsigned long, flags)
+{
+	ssize_t ret;
 	struct pid *pid;
 	struct task_struct *task;
 	struct mm_struct *mm;
+	struct iovec iovstack[UIO_FASTIOV];
+	struct iovec *iov = iovstack;
+	struct iov_iter iter;
 
 	if (flags != 0)
 		return -EINVAL;
 
-	if (!process_madvise_behavior_valid(behavior))
-		return -EINVAL;
-
 	switch (which) {
 	case P_PID:
 		if (upid <= 0)
@@ -1236,13 +1255,27 @@ SYSCALL_DEFINE6(process_madvise, int, which, pid_t, upid, unsigned long, start,
 		goto put_pid;
 	}
 
+	if (task->mm != current->mm &&
+			!process_madvise_behavior_valid(behavior)) {
+		ret = -EINVAL;
+		goto release_task;
+	}
+
 	mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
 	if (IS_ERR_OR_NULL(mm)) {
 		ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
 		goto release_task;
 	}
 
-	ret = do_madvise(task, mm, start, len_in, behavior);
+	ret = import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack), &iov, &iter);
+	if (ret >= 0) {
+		size_t total_len = iov_iter_count(&iter);
+
+		ret = do_process_madvise(task, mm, &iter, behavior);
+		if (ret >= 0)
+			ret = total_len - iov_iter_count(&iter);
+		kfree(iov);
+	}
 	mmput(mm);
 release_task:
 	put_task_struct(task);
--
2.26.1.301.g55bc3eb7cb9-goog