mm-commits Archive on lore.kernel.org
 help / color / Atom feed
From: Andrew Morton <akpm@linux-foundation.org>
To: akpm@linux-foundation.org, alexander.h.duyck@linux.intel.com,
	axboe@kernel.dk, bgeffon@google.com,
	christian.brauner@ubuntu.com, christian@brauner.io,
	dancol@google.com, fw@deneb.enyo.de, hannes@cmpxchg.org,
	jannh@google.com, joaodias@google.com, joel@joelfernandes.org,
	ktkhai@virtuozzo.com, linux-man@vger.kernel.org,
	linux-mm@kvack.org, mhocko@suse.com, minchan@kernel.org,
	mm-commits@vger.kernel.org, oleksandr@redhat.com,
	rientjes@google.com, shakeelb@google.com, sj38.park@gmail.com,
	sjpark@amazon.de, sonnyrao@google.com, sspatil@google.com,
	surenb@google.com, timmurray@google.com,
	torvalds@linux-foundation.org, vbabka@suse.cz
Subject: [patch 24/40] mm/madvise: pass mm to do_madvise
Date: Sat, 17 Oct 2020 16:14:50 -0700
Message-ID: <20201017231450.ZUTJgOnD5%akpm@linux-foundation.org> (raw)
In-Reply-To: <20201017161314.88890b87fae7446ccc13c902@linux-foundation.org>

From: Minchan Kim <minchan@kernel.org>
Subject: mm/madvise: pass mm to do_madvise

Patch series "introduce memory hinting API for external process", v9.

Now, we have MADV_PAGEOUT and MADV_COLD as madvise hinting API.  With
that, application could give hints to kernel what memory range are
preferred to be reclaimed.  However, in some platform(e.g., Android), the
information required to make the hinting decision is not known to the app.
Instead, it is known to a centralized userspace daemon(e.g.,
ActivityManagerService), and that daemon must be able to initiate reclaim
on its own without any app involvement.

To solve the concern, this patch introduces new syscall -
process_madvise(2).  Bascially, it's same with madvise(2) syscall but it
has some differences.

1. It needs pidfd of target process to provide the hint

2. It supports only MADV_{COLD|PAGEOUT|MERGEABLE|UNMEREABLE} at this
   moment.  Other hints in madvise will be opened when there are explicit
   requests from community to prevent unexpected bugs we couldn't support.

3. Only privileged processes can do something for other process's
   address space.

For more detail of the new API, please see "mm: introduce external memory
hinting API" description in this patchset.

This patch (of 3):

In upcoming patches, do_madvise will be called from external process
context so we shouldn't asssume "current" is always hinted process's
task_struct.

Furthermore, we must not access mm_struct via task->mm, but obtain it via
access_mm() once (in the following patch) and only use that pointer [1],
so pass it to do_madvise() as well.  Note the vma->vm_mm pointers are
safe, so we can use them further down the call stack.

And let's pass current->mm as arguments of do_madvise so it shouldn't
change existing behavior but prepare next patch to make review easy.

[vbabka@suse.cz: changelog tweak]
[minchan@kernel.org: use current->mm for io_uring]
  Link: http://lkml.kernel.org/r/20200423145215.72666-1-minchan@kernel.org
[akpm@linux-foundation.org: fix it for upstream changes]
[akpm@linux-foundation.org: whoops]
[rdunlap@infradead.org: add missing includes]
Link: https://lkml.kernel.org/r/20200901000633.1920247-1-minchan@kernel.org
Link: http://lkml.kernel.org/r/20200622192900.22757-1-minchan@kernel.org
Link: http://lkml.kernel.org/r/20200302193630.68771-2-minchan@kernel.org
Link: http://lkml.kernel.org/r/20200622192900.22757-2-minchan@kernel.org
Link: https://lkml.kernel.org/r/20200901000633.1920247-2-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jann Horn <jannh@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Daniel Colascione <dancol@google.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: John Dias <joaodias@google.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: SeongJae Park <sj38.park@gmail.com>
Cc: Christian Brauner <christian@brauner.io>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: SeongJae Park <sjpark@amazon.de>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Florian Weimer <fw@deneb.enyo.de>
Cc: <linux-man@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/io_uring.c      |    2 +-
 include/linux/mm.h |    2 +-
 mm/madvise.c       |   32 ++++++++++++++++++--------------
 3 files changed, 20 insertions(+), 16 deletions(-)

--- a/fs/io_uring.c~mm-madvise-pass-mm-to-do_madvise
+++ a/fs/io_uring.c
@@ -3989,7 +3989,7 @@ static int io_madvise(struct io_kiocb *r
 	if (force_nonblock)
 		return -EAGAIN;
 
-	ret = do_madvise(ma->addr, ma->len, ma->advice);
+	ret = do_madvise(current->mm, ma->addr, ma->len, ma->advice);
 	if (ret < 0)
 		req_set_fail_links(req);
 	io_req_complete(req, ret);
--- a/include/linux/mm.h~mm-madvise-pass-mm-to-do_madvise
+++ a/include/linux/mm.h
@@ -2579,7 +2579,7 @@ extern int __do_munmap(struct mm_struct
 		       struct list_head *uf, bool downgrade);
 extern int do_munmap(struct mm_struct *, unsigned long, size_t,
 		     struct list_head *uf);
-extern int do_madvise(unsigned long start, size_t len_in, int behavior);
+extern int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior);
 
 #ifdef CONFIG_MMU
 extern int __mm_populate(unsigned long addr, unsigned long len,
--- a/mm/madvise.c~mm-madvise-pass-mm-to-do_madvise
+++ a/mm/madvise.c
@@ -258,6 +258,7 @@ static long madvise_willneed(struct vm_a
 			     struct vm_area_struct **prev,
 			     unsigned long start, unsigned long end)
 {
+	struct mm_struct *mm = vma->vm_mm;
 	struct file *file = vma->vm_file;
 	loff_t offset;
 
@@ -294,10 +295,10 @@ static long madvise_willneed(struct vm_a
 	get_file(file);
 	offset = (loff_t)(start - vma->vm_start)
 			+ ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
-	mmap_read_unlock(current->mm);
+	mmap_read_unlock(mm);
 	vfs_fadvise(file, offset, end - start, POSIX_FADV_WILLNEED);
 	fput(file);
-	mmap_read_lock(current->mm);
+	mmap_read_lock(mm);
 	return 0;
 }
 
@@ -766,6 +767,8 @@ static long madvise_dontneed_free(struct
 				  unsigned long start, unsigned long end,
 				  int behavior)
 {
+	struct mm_struct *mm = vma->vm_mm;
+
 	*prev = vma;
 	if (!can_madv_lru_vma(vma))
 		return -EINVAL;
@@ -773,8 +776,8 @@ static long madvise_dontneed_free(struct
 	if (!userfaultfd_remove(vma, start, end)) {
 		*prev = NULL; /* mmap_lock has been dropped, prev is stale */
 
-		mmap_read_lock(current->mm);
-		vma = find_vma(current->mm, start);
+		mmap_read_lock(mm);
+		vma = find_vma(mm, start);
 		if (!vma)
 			return -ENOMEM;
 		if (start < vma->vm_start) {
@@ -828,6 +831,7 @@ static long madvise_remove(struct vm_are
 	loff_t offset;
 	int error;
 	struct file *f;
+	struct mm_struct *mm = vma->vm_mm;
 
 	*prev = NULL;	/* tell sys_madvise we drop mmap_lock */
 
@@ -855,13 +859,13 @@ static long madvise_remove(struct vm_are
 	get_file(f);
 	if (userfaultfd_remove(vma, start, end)) {
 		/* mmap_lock was not released by userfaultfd_remove() */
-		mmap_read_unlock(current->mm);
+		mmap_read_unlock(mm);
 	}
 	error = vfs_fallocate(f,
 				FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
 				offset, end - start);
 	fput(f);
-	mmap_read_lock(current->mm);
+	mmap_read_lock(mm);
 	return error;
 }
 
@@ -1045,7 +1049,7 @@ madvise_behavior_valid(int behavior)
  *  -EBADF  - map exists, but area maps something that isn't a file.
  *  -EAGAIN - a kernel resource was temporarily unavailable.
  */
-int do_madvise(unsigned long start, size_t len_in, int behavior)
+int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
 {
 	unsigned long end, tmp;
 	struct vm_area_struct *vma, *prev;
@@ -1083,10 +1087,10 @@ int do_madvise(unsigned long start, size
 
 	write = madvise_need_mmap_write(behavior);
 	if (write) {
-		if (mmap_write_lock_killable(current->mm))
+		if (mmap_write_lock_killable(mm))
 			return -EINTR;
 	} else {
-		mmap_read_lock(current->mm);
+		mmap_read_lock(mm);
 	}
 
 	/*
@@ -1094,7 +1098,7 @@ int do_madvise(unsigned long start, size
 	 * ranges, just ignore them, but return -ENOMEM at the end.
 	 * - different from the way of handling in mlock etc.
 	 */
-	vma = find_vma_prev(current->mm, start, &prev);
+	vma = find_vma_prev(mm, start, &prev);
 	if (vma && start > vma->vm_start)
 		prev = vma;
 
@@ -1131,19 +1135,19 @@ int do_madvise(unsigned long start, size
 		if (prev)
 			vma = prev->vm_next;
 		else	/* madvise_remove dropped mmap_lock */
-			vma = find_vma(current->mm, start);
+			vma = find_vma(mm, start);
 	}
 out:
 	blk_finish_plug(&plug);
 	if (write)
-		mmap_write_unlock(current->mm);
+		mmap_write_unlock(mm);
 	else
-		mmap_read_unlock(current->mm);
+		mmap_read_unlock(mm);
 
 	return error;
 }
 
 SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 {
-	return do_madvise(start, len_in, behavior);
+	return do_madvise(current->mm, start, len_in, behavior);
 }
_

  parent reply index

Thread overview: 41+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-10-17 23:13 incoming Andrew Morton
2020-10-17 23:13 ` [patch 01/40] ia64: fix build error with !COREDUMP Andrew Morton
2020-10-17 23:13 ` [patch 02/40] mm, memcg: rework remote charging API to support nesting Andrew Morton
2020-10-17 23:13 ` [patch 03/40] mm: kmem: move memcg_kmem_bypass() calls to get_mem/obj_cgroup_from_current() Andrew Morton
2020-10-17 23:13 ` [patch 04/40] mm: kmem: remove redundant checks from get_obj_cgroup_from_current() Andrew Morton
2020-10-17 23:13 ` [patch 05/40] mm: kmem: prepare remote memcg charging infra for interrupt contexts Andrew Morton
2020-10-17 23:13 ` [patch 06/40] mm: kmem: enable kernel memcg accounting from " Andrew Morton
2020-10-17 23:13 ` [patch 07/40] mm/memory-failure: remove a wrapper for alloc_migration_target() Andrew Morton
2020-10-17 23:14 ` [patch 08/40] mm/memory_hotplug: " Andrew Morton
2020-10-17 23:14 ` [patch 09/40] mm/migrate: avoid possible unnecessary process right check in kernel_move_pages() Andrew Morton
2020-10-17 23:14 ` [patch 10/40] mm/mmap: add inline vma_next() for readability of mmap code Andrew Morton
2020-10-17 23:14 ` [patch 11/40] mm/mmap: add inline munmap_vma_range() for code readability Andrew Morton
2020-10-17 23:14 ` [patch 12/40] mm/gup_benchmark: take the mmap lock around GUP Andrew Morton
2020-10-17 23:14 ` [patch 13/40] binfmt_elf: take the mmap lock around find_extend_vma() Andrew Morton
2020-10-17 23:14 ` [patch 14/40] mm/gup: assert that the mmap lock is held in __get_user_pages() Andrew Morton
2020-10-17 23:14 ` [patch 15/40] mm/gup_benchmark: rename to mm/gup_test Andrew Morton
2020-10-17 23:14 ` [patch 16/40] selftests/vm: use a common gup_test.h Andrew Morton
2020-10-17 23:14 ` [patch 17/40] selftests/vm: rename run_vmtests --> run_vmtests.sh Andrew Morton
2020-10-17 23:14 ` [patch 18/40] selftests/vm: minor cleanup: Makefile and gup_test.c Andrew Morton
2020-10-17 23:14 ` [patch 19/40] selftests/vm: only some gup_test items are really benchmarks Andrew Morton
2020-10-17 23:14 ` [patch 20/40] selftests/vm: gup_test: introduce the dump_pages() sub-test Andrew Morton
2020-10-17 23:14 ` [patch 21/40] selftests/vm: run_vmtests.sh: update and clean up gup_test invocation Andrew Morton
2020-10-17 23:14 ` [patch 22/40] selftests/vm: hmm-tests: remove the libhugetlbfs dependency Andrew Morton
2020-10-17 23:14 ` [patch 23/40] selftests/vm: 10x speedup for hmm-tests Andrew Morton
2020-10-17 23:14 ` Andrew Morton [this message]
2020-10-17 23:14 ` [patch 25/40] pid: move pidfd_get_pid() to pid.c Andrew Morton
2020-10-17 23:14 ` [patch 26/40] mm/madvise: introduce process_madvise() syscall: an external memory hinting API Andrew Morton
2020-10-17 23:15 ` [patch 27/40] mm: update the documentation for vfree Andrew Morton
2020-10-17 23:15 ` [patch 28/40] mm: add a VM_MAP_PUT_PAGES flag for vmap Andrew Morton
2020-10-17 23:15 ` [patch 29/40] mm: add a vmap_pfn function Andrew Morton
2020-10-17 23:15 ` [patch 30/40] mm: allow a NULL fn callback in apply_to_page_range Andrew Morton
2020-10-17 23:15 ` [patch 31/40] zsmalloc: switch from alloc_vm_area to get_vm_area Andrew Morton
2020-10-17 23:15 ` [patch 32/40] drm/i915: use vmap in shmem_pin_map Andrew Morton
2020-10-17 23:15 ` [patch 33/40] drm/i915: stop using kmap in i915_gem_object_map Andrew Morton
2020-10-17 23:15 ` [patch 34/40] drm/i915: use vmap " Andrew Morton
2020-10-17 23:15 ` [patch 35/40] xen/xenbus: use apply_to_page_range directly in xenbus_map_ring_pv Andrew Morton
2020-10-17 23:15 ` [patch 36/40] x86/xen: open code alloc_vm_area in arch_gnttab_valloc Andrew Morton
2020-10-17 23:15 ` [patch 37/40] mm: remove alloc_vm_area Andrew Morton
2020-10-17 23:15 ` [patch 38/40] mm: cleanup the gfp_mask handling in __vmalloc_area_node Andrew Morton
2020-10-17 23:15 ` [patch 39/40] mm: remove the filename in the top of file comment in vmalloc.c Andrew Morton
2020-10-17 23:15 ` [patch 40/40] mm: remove duplicate include statement in mmu.c Andrew Morton

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20201017231450.ZUTJgOnD5%akpm@linux-foundation.org \
    --to=akpm@linux-foundation.org \
    --cc=alexander.h.duyck@linux.intel.com \
    --cc=axboe@kernel.dk \
    --cc=bgeffon@google.com \
    --cc=christian.brauner@ubuntu.com \
    --cc=christian@brauner.io \
    --cc=dancol@google.com \
    --cc=fw@deneb.enyo.de \
    --cc=hannes@cmpxchg.org \
    --cc=jannh@google.com \
    --cc=joaodias@google.com \
    --cc=joel@joelfernandes.org \
    --cc=ktkhai@virtuozzo.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-man@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    --cc=minchan@kernel.org \
    --cc=mm-commits@vger.kernel.org \
    --cc=oleksandr@redhat.com \
    --cc=rientjes@google.com \
    --cc=shakeelb@google.com \
    --cc=sj38.park@gmail.com \
    --cc=sjpark@amazon.de \
    --cc=sonnyrao@google.com \
    --cc=sspatil@google.com \
    --cc=surenb@google.com \
    --cc=timmurray@google.com \
    --cc=torvalds@linux-foundation.org \
    --cc=vbabka@suse.cz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

mm-commits Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/mm-commits/0 mm-commits/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 mm-commits mm-commits/ https://lore.kernel.org/mm-commits \
		mm-commits@vger.kernel.org
	public-inbox-index mm-commits

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.mm-commits


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git