From: Minchan Kim <minchan@kernel.org>
To: kbuild test robot <lkp@intel.com>,
	Andrew Morton <akpm@linux-foundation.org>
Cc: kbuild-all@lists.01.org,
	Andrew Morton <akpm@linux-foundation.org>,
	Linux Memory Management List <linux-mm@kvack.org>
Subject: Re: [PATCH v6 2/7] mm: introduce external memory hinting API
Date: Thu, 20 Feb 2020 13:21:35 -0800
Message-ID: <20200220212135.GB226145@google.com>
In-Reply-To: <20200220211510.GA226145@google.com>

Hi Andrew,

I am sending the fix inline in this thread, but if you would prefer a
new v7 revision that folds in more feedback, please tell me.
I am happy to resend the whole patchset.

Thanks.

On Thu, Feb 20, 2020 at 01:15:10PM -0800, Minchan Kim wrote:
> On Fri, Feb 21, 2020 at 03:13:49AM +0800, kbuild test robot wrote:
> > Hi Minchan,
> > 
> > I love your patch! Perhaps something to improve:
> > 
> > [auto build test WARNING on m68k/for-next]
> > [also build test WARNING on powerpc/next s390/features linus/master v5.6-rc2 next-20200220]
> > [cannot apply to arm64/for-next/core tip/x86/asm arm/for-next hp-parisc/for-next]
> > [if your patch is applied to the wrong git tree, please drop us a note to help
> > improve the system. BTW, we also suggest to use '--base' option to specify the
> > base tree in git format-patch, please see https://stackoverflow.com/a/37406982]
> > 
> > url:    https://github.com/0day-ci/linux/commits/Minchan-Kim/introduce-memory-hinting-API-for-external-process/20200220-225155
> > base:   https://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k.git for-next
> > config: nds32-randconfig-a001-20200220 (attached as .config)
> > compiler: nds32le-linux-gcc (GCC) 9.2.0
> > reproduce:
> >         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
> >         chmod +x ~/bin/make.cross
> >         # save the attached .config to linux build tree
> >         GCC_VERSION=9.2.0 make.cross ARCH=nds32 
> > 
> > If you fix the issue, kindly add following tag
> > Reported-by: kbuild test robot <lkp@intel.com>
> > 
> > All warnings (new ones prefixed by >>):
> > 
> >    In file included from arch/nds32/include/uapi/asm/unistd.h:10,
> >                     from arch/nds32/include/asm/unistd.h:6,
> >                     from arch/nds32/kernel/vdso/sigreturn.S:6:
> > >> include/uapi/asm-generic/unistd.h:858: warning: "__NR_pidfd_getfd" redefined
> >      858 | #define __NR_pidfd_getfd 439
> >          | 
> >    include/uapi/asm-generic/unistd.h:856: note: this is the location of the previous definition
> >      856 | #define __NR_pidfd_getfd 438
> >          | 
> > --
> >    In file included from arch/nds32/include/uapi/asm/unistd.h:10,
> >                     from arch/nds32/include/asm/unistd.h:6,
> >                     from <stdin>:2:
> > >> include/uapi/asm-generic/unistd.h:858: warning: "__NR_pidfd_getfd" redefined
> >      858 | #define __NR_pidfd_getfd 439
> >          | 
> >    include/uapi/asm-generic/unistd.h:856: note: this is the location of the previous definition
> >      856 | #define __NR_pidfd_getfd 438
> >          | 
> >    <stdin>:1511:2: warning: #warning syscall clone3 not implemented [-Wcpp]
> > >> <stdin>:1520:2: warning: #warning syscall process_madvise not implemented [-Wcpp]
> > --
> >    In file included from arch/nds32/include/uapi/asm/unistd.h:10,
> >                     from arch/nds32/include/asm/unistd.h:6,
> >                     from <stdin>:2:
> > >> include/uapi/asm-generic/unistd.h:858: warning: "__NR_pidfd_getfd" redefined
> >      858 | #define __NR_pidfd_getfd 439
> >          | 
> >    include/uapi/asm-generic/unistd.h:856: note: this is the location of the previous definition
> >      856 | #define __NR_pidfd_getfd 438
> >          | 
> >    <stdin>:1511:2: warning: #warning syscall clone3 not implemented [-Wcpp]
> > >> <stdin>:1520:2: warning: #warning syscall process_madvise not implemented [-Wcpp]
> >    16 real  2 user  4 sys  39.01% cpu 	make modules_prepare
> > --
> >    In file included from arch/nds32/include/uapi/asm/unistd.h:10,
> >                     from arch/nds32/include/asm/unistd.h:6,
> >                     from <stdin>:2:
> > >> include/uapi/asm-generic/unistd.h:858: warning: "__NR_pidfd_getfd" redefined
> >      858 | #define __NR_pidfd_getfd 439
> >          | 
> >    include/uapi/asm-generic/unistd.h:856: note: this is the location of the previous definition
> >      856 | #define __NR_pidfd_getfd 438
> >          | 
> >    <stdin>:1511:2: warning: #warning syscall clone3 not implemented [-Wcpp]
> > >> <stdin>:1520:2: warning: #warning syscall process_madvise not implemented [-Wcpp]
> >    In file included from arch/nds32/include/uapi/asm/unistd.h:10,
> >                     from arch/nds32/include/asm/unistd.h:6,
> >                     from arch/nds32/kernel/vdso/sigreturn.S:6:
> > >> include/uapi/asm-generic/unistd.h:858: warning: "__NR_pidfd_getfd" redefined
> >      858 | #define __NR_pidfd_getfd 439
> >          | 
> >    include/uapi/asm-generic/unistd.h:856: note: this is the location of the previous definition
> >      856 | #define __NR_pidfd_getfd 438
> >          | 
> >    In file included from arch/nds32/include/uapi/asm/unistd.h:10,
> >                     from arch/nds32/include/asm/unistd.h:6,
> >                     from arch/nds32/kernel/vdso/gettimeofday.c:11:
> > >> include/uapi/asm-generic/unistd.h:858: warning: "__NR_pidfd_getfd" redefined
> >      858 | #define __NR_pidfd_getfd 439
> >          | 
> >    include/uapi/asm-generic/unistd.h:856: note: this is the location of the previous definition
> >      856 | #define __NR_pidfd_getfd 438
> >          | 
> >    17 real  4 user  6 sys  59.28% cpu 	make prepare
> > 
> > vim +/__NR_pidfd_getfd +858 include/uapi/asm-generic/unistd.h
> > 
> >    853	
> >    854	#define __NR_openat2 437
> >    855	__SYSCALL(__NR_openat2, sys_openat2)
> >    856	#define __NR_pidfd_getfd 438
> >    857	__SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
> >  > 858	#define __NR_pidfd_getfd 439
> >    859	__SYSCALL(__NR_process_madvise, sys_process_madvise)
> >    860	
> > 
> > ---
> > 0-DAY CI Kernel Test Service, Intel Corporation
> > https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
> 
> 
> Hi 0-Day,
> 
> Thanks for catching this. Here the fix goes.
> 
> From ff74a830277d880716fa0f69c80e9ec9337303d5 Mon Sep 17 00:00:00 2001
> From: Minchan Kim <minchan@kernel.org>
> Date: Fri, 14 Feb 2020 07:42:03 -0800
> Subject: [PATCH v6 2/8] mm: introduce external memory hinting API
> 
> There is a use case where System Management Software (SMS) wants to give
> a memory hint like MADV_[COLD|PAGEOUT] to other processes; in the case
> of Android, that software is the ActivityManagerService.
> 
> It's similar in spirit to madvise(MADV_WONTNEED), but the information
> required to make the reclaim decision is not known to the app. Instead,
> it is known to the centralized userspace daemon(ActivityManagerService),
> and that daemon must be able to initiate reclaim on its own without
> any app involvement.
> 
> To solve the issue, this patch introduces a new syscall process_madvise(2).
> It uses pidfd of an external process to give the hint.
> 
>  int process_madvise(int pidfd, void *addr, size_t length, int advise,
> 			unsigned long flag);
> 
> Since it could affect another process's address range, only a privileged
> process (CAP_SYS_PTRACE) or one otherwise permitted to ptrace the target
> (e.g., by running under the same UID) can use it successfully.
> The flag argument is reserved for future use if we need to extend the
> API.
> 
> I think supporting every hint that madvise supports, or will support, in
> process_madvise is rather risky. We are not sure all hints make sense
> when issued by an external process, and the implementation of a hint may
> rely on the caller being in the current context, so it could be error-prone.
> Thus, I limited the hints to MADV_[COLD|PAGEOUT] in this patch.
> 
> If someone wants to add other hints, we can hear the use case and
> review it for each hint. That is safer for maintenance than
> introducing a buggy syscall that is hard to fix later.
> 
> Q.1 - Why does any external entity have better knowledge?
> 
> Quote from Sandeep
> "For Android, every application (including the special SystemServer) is forked
> from Zygote. The reason of course is to share as many libraries and classes between
> the two as possible to benefit from the preloading during boot.
> 
> After applications start, (almost) all of the APIs end up calling into this
> SystemServer process over IPC (binder) and back to the application.
> 
> In a fully running system, the SystemServer monitors every single process
> periodically to calculate their PSS / RSS and also decides which process is
> "important" to the user for interactivity.
> 
> So, because of how these processes start _and_ the fact that the SystemServer
> is looping to monitor each process, it does tend to *know* which address
> range of the application is not used / useful.
> 
> Besides, we can never rely on applications to clean things up themselves.
> We've had the "hey app1, the system is low on memory, please trim your
> memory usage down" notifications for a long time[1]. They rely on
> applications honoring the broadcasts and very few do.
> 
> So, if we want to avoid the inevitable killing of the application and
> restarting it, some way to be able to tell the OS about unimportant memory in
> these applications will be useful.
> 
> - ssp
> 
> Q.2 - How is the race (i.e., object validation) handled between an external
> process giving a hint and the target process receiving it?
> 
> process_madvise operates on the target process's address space as it exists
> at the instant that process_madvise is called. If the target process
> can run between the time the process_madvise caller inspects the target
> process address space and the time that process_madvise is actually called,
> process_madvise may operate on memory regions that the calling process does
> not expect. It's the responsibility of the process calling process_madvise
> to close this race condition. For example, the calling process can suspend
> the target process with ptrace, SIGSTOP, or the freezer cgroup so that it
> doesn't have an opportunity to change its own address space before
> process_madvise is called. Another option is to operate on memory regions
> that the caller knows a priori will be unchanged in the target process.
> Yet another option is to accept the race for certain process_madvise calls
> after reasoning that mistargeting will do no harm. The suggested API itself
> does not provide synchronization. The same applies to other APIs such as
> move_pages and process_vm_writev.
> 
> The race isn't really a problem though. Why is it so wrong to require
> that callers do their own synchronization in some manner? Nobody objects
> to write(2) merely because it's possible for two processes to open the same
> file and clobber each other's writes --- instead, we tell people to use
> flock or something. Think about mmap. It never guarantees that newly allocated
> address space is still valid when the user tries to access it, because other
> threads could unmap the memory right before. That's where we need
> synchronization through another API or through userspace design. It shouldn't
> be part of the API itself. If someone needs synchronization that is more
> fine-grained than process level, two ideas were suggested - cookie[2] and
> anon-fd[3]. Both can be added via the last, reserved argument of the API,
> but I don't think that is necessary right now since we already have ways to
> prevent the race, and I don't want to add complexity with a more
> fine-grained synchronization model.
> 
> To keep the API extensible, it reserves an unsigned long as the last argument
> so we could support this in future if someone really needs it.
> 
> Q.3 - Why doesn't ptrace work?
> 
> Injecting an madvise in the target process using ptrace would not work for us
> because such injected madvise would have to be executed by the target process,
> which means that process would have to be runnable and that creates the risk
> of the abovementioned race and hinting a wrong VMA. Furthermore, we want to
> act on the hint in the caller's context, not the callee's, because the callee
> is usually limited by cpuset/cgroups or is even in a frozen state, so it can't
> act quickly enough by itself, which causes more thrashing/kills. It also
> doesn't work if the target process is already ptraced (e.g., by strace, a
> debugger, or minidump) because a process can have at most one ptracer.
> 
> [1] https://developer.android.com/topic/performance/memory"
> [2] process_getinfo for getting the cookie which is updated whenever
>     the VMA layout of the process address space changes - Daniel Colascione
> - https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224
> [3] anonymous fd which is used for the object(i.e., address range)
>     validation - Michal Hocko
> - https://lore.kernel.org/lkml/20200120112722.GY18451@dhcp22.suse.cz/
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  arch/alpha/kernel/syscalls/syscall.tbl      |  1 +
>  arch/arm/tools/syscall.tbl                  |  1 +
>  arch/arm64/include/asm/unistd.h             |  2 +-
>  arch/arm64/include/asm/unistd32.h           |  2 +
>  arch/ia64/kernel/syscalls/syscall.tbl       |  1 +
>  arch/m68k/kernel/syscalls/syscall.tbl       |  1 +
>  arch/microblaze/kernel/syscalls/syscall.tbl |  1 +
>  arch/mips/kernel/syscalls/syscall_n32.tbl   |  1 +
>  arch/mips/kernel/syscalls/syscall_n64.tbl   |  1 +
>  arch/parisc/kernel/syscalls/syscall.tbl     |  1 +
>  arch/powerpc/kernel/syscalls/syscall.tbl    |  1 +
>  arch/s390/kernel/syscalls/syscall.tbl       |  1 +
>  arch/sh/kernel/syscalls/syscall.tbl         |  1 +
>  arch/sparc/kernel/syscalls/syscall.tbl      |  1 +
>  arch/x86/entry/syscalls/syscall_32.tbl      |  1 +
>  arch/x86/entry/syscalls/syscall_64.tbl      |  1 +
>  arch/xtensa/kernel/syscalls/syscall.tbl     |  1 +
>  include/linux/syscalls.h                    |  2 +
>  include/uapi/asm-generic/unistd.h           |  4 +-
>  kernel/sys_ni.c                             |  1 +
>  mm/madvise.c                                | 64 +++++++++++++++++++++
>  21 files changed, 88 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
> index 36d42da7466a..c82952e6fb80 100644
> --- a/arch/alpha/kernel/syscalls/syscall.tbl
> +++ b/arch/alpha/kernel/syscalls/syscall.tbl
> @@ -477,3 +477,4 @@
>  # 545 reserved for clone3
>  547	common	openat2				sys_openat2
>  548	common	pidfd_getfd			sys_pidfd_getfd
> +549	common	process_madvise			sys_process_madvise
> diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
> index 4d1cf74a2caa..54c2719fec46 100644
> --- a/arch/arm/tools/syscall.tbl
> +++ b/arch/arm/tools/syscall.tbl
> @@ -451,3 +451,4 @@
>  435	common	clone3				sys_clone3
>  437	common	openat2				sys_openat2
>  438	common	pidfd_getfd			sys_pidfd_getfd
> +439	common	process_madvise			sys_process_madvise
> diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
> index 1dd22da1c3a9..75f04a1023be 100644
> --- a/arch/arm64/include/asm/unistd.h
> +++ b/arch/arm64/include/asm/unistd.h
> @@ -38,7 +38,7 @@
>  #define __ARM_NR_compat_set_tls		(__ARM_NR_COMPAT_BASE + 5)
>  #define __ARM_NR_COMPAT_END		(__ARM_NR_COMPAT_BASE + 0x800)
>  
> -#define __NR_compat_syscalls		439
> +#define __NR_compat_syscalls		440
>  #endif
>  
>  #define __ARCH_WANT_SYS_CLONE
> diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
> index c1c61635f89c..2a27be7a1f91 100644
> --- a/arch/arm64/include/asm/unistd32.h
> +++ b/arch/arm64/include/asm/unistd32.h
> @@ -883,6 +883,8 @@ __SYSCALL(__NR_clone3, sys_clone3)
>  __SYSCALL(__NR_openat2, sys_openat2)
>  #define __NR_pidfd_getfd 438
>  __SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
> +#define __NR_process_madvise 439
> +__SYSCALL(__NR_process_madvise, sys_process_madvise)
>  
>  /*
>   * Please add new compat syscalls above this comment and update
> diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
> index 042911e670b8..9524af1c318c 100644
> --- a/arch/ia64/kernel/syscalls/syscall.tbl
> +++ b/arch/ia64/kernel/syscalls/syscall.tbl
> @@ -358,3 +358,4 @@
>  # 435 reserved for clone3
>  437	common	openat2				sys_openat2
>  438	common	pidfd_getfd			sys_pidfd_getfd
> +439	common	process_madvise			sys_process_madvise
> diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
> index f4f49fcb76d0..8197050c097c 100644
> --- a/arch/m68k/kernel/syscalls/syscall.tbl
> +++ b/arch/m68k/kernel/syscalls/syscall.tbl
> @@ -437,3 +437,4 @@
>  435	common	clone3				__sys_clone3
>  437	common	openat2				sys_openat2
>  438	common	pidfd_getfd			sys_pidfd_getfd
> +439	common	process_madvise			sys_process_madvise
> diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
> index 4c67b11f9c9e..c5b6c8afe445 100644
> --- a/arch/microblaze/kernel/syscalls/syscall.tbl
> +++ b/arch/microblaze/kernel/syscalls/syscall.tbl
> @@ -443,3 +443,4 @@
>  435	common	clone3				sys_clone3
>  437	common	openat2				sys_openat2
>  438	common	pidfd_getfd			sys_pidfd_getfd
> +439	common	process_madvise			sys_process_madvise
> diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
> index 1f9e8ad636cc..8ec8c558aa9c 100644
> --- a/arch/mips/kernel/syscalls/syscall_n32.tbl
> +++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
> @@ -376,3 +376,4 @@
>  435	n32	clone3				__sys_clone3
>  437	n32	openat2				sys_openat2
>  438	n32	pidfd_getfd			sys_pidfd_getfd
> +439	n32	process_madvise			sys_process_madvise
> diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
> index c0b9d802dbf6..0078f891bb92 100644
> --- a/arch/mips/kernel/syscalls/syscall_n64.tbl
> +++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
> @@ -352,3 +352,4 @@
>  435	n64	clone3				__sys_clone3
>  437	n64	openat2				sys_openat2
>  438	n64	pidfd_getfd			sys_pidfd_getfd
> +439	n64	process_madvise			sys_process_madvise
> diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
> index 52a15f5cd130..09c3b5dc6855 100644
> --- a/arch/parisc/kernel/syscalls/syscall.tbl
> +++ b/arch/parisc/kernel/syscalls/syscall.tbl
> @@ -435,3 +435,4 @@
>  435	common	clone3				sys_clone3_wrapper
>  437	common	openat2				sys_openat2
>  438	common	pidfd_getfd			sys_pidfd_getfd
> +439	common	process_madvise			sys_process_madvise
> diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
> index 35b61bfc1b1a..97eac48c2937 100644
> --- a/arch/powerpc/kernel/syscalls/syscall.tbl
> +++ b/arch/powerpc/kernel/syscalls/syscall.tbl
> @@ -519,3 +519,4 @@
>  435	nospu	clone3				ppc_clone3
>  437	common	openat2				sys_openat2
>  438	common	pidfd_getfd			sys_pidfd_getfd
> +439	common	process_madvise			sys_process_madvise
> diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
> index bd7bd3581a0f..8dc8bfd958ea 100644
> --- a/arch/s390/kernel/syscalls/syscall.tbl
> +++ b/arch/s390/kernel/syscalls/syscall.tbl
> @@ -440,3 +440,4 @@
>  435  common	clone3			sys_clone3			sys_clone3
>  437  common	openat2			sys_openat2			sys_openat2
>  438  common	pidfd_getfd		sys_pidfd_getfd			sys_pidfd_getfd
> +439  common	process_madvise		sys_process_madvise		sys_process_madvise
> diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
> index c7a30fcd135f..e69d98040777 100644
> --- a/arch/sh/kernel/syscalls/syscall.tbl
> +++ b/arch/sh/kernel/syscalls/syscall.tbl
> @@ -440,3 +440,4 @@
>  # 435 reserved for clone3
>  437	common	openat2				sys_openat2
>  438	common	pidfd_getfd			sys_pidfd_getfd
> +439	common	process_madvise			sys_process_madvise
> diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
> index f13615ecdecc..6f6e66dd51f9 100644
> --- a/arch/sparc/kernel/syscalls/syscall.tbl
> +++ b/arch/sparc/kernel/syscalls/syscall.tbl
> @@ -483,3 +483,4 @@
>  # 435 reserved for clone3
>  437	common	openat2			sys_openat2
>  438	common	pidfd_getfd			sys_pidfd_getfd
> +439	common	process_madvise			sys_process_madvise
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index c17cb77eb150..1b2184549e27 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -442,3 +442,4 @@
>  435	i386	clone3			sys_clone3			__ia32_sys_clone3
>  437	i386	openat2			sys_openat2			__ia32_sys_openat2
>  438	i386	pidfd_getfd		sys_pidfd_getfd			__ia32_sys_pidfd_getfd
> +439	i386	process_madvise		sys_process_madvise		__ia32_sys_process_madvise
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index 44d510bc9b78..82d60eb1e00d 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -359,6 +359,7 @@
>  435	common	clone3			__x64_sys_clone3/ptregs
>  437	common	openat2			__x64_sys_openat2
>  438	common	pidfd_getfd		__x64_sys_pidfd_getfd
> +439	common	process_madvise		__x64_sys_process_madvise
>  
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
> index 85a9ab1bc04d..165cae047770 100644
> --- a/arch/xtensa/kernel/syscalls/syscall.tbl
> +++ b/arch/xtensa/kernel/syscalls/syscall.tbl
> @@ -408,3 +408,4 @@
>  435	common	clone3				sys_clone3
>  437	common	openat2				sys_openat2
>  438	common	pidfd_getfd			sys_pidfd_getfd
> +439	common	process_madvise			sys_process_madvise
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 1815065d52f3..e4cd2c2f8bb4 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -876,6 +876,8 @@ asmlinkage long sys_munlockall(void);
>  asmlinkage long sys_mincore(unsigned long start, size_t len,
>  				unsigned char __user * vec);
>  asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
> +asmlinkage long sys_process_madvise(int pidfd, unsigned long start,
> +			size_t len, int behavior, unsigned long flags);
>  asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
>  			unsigned long prot, unsigned long pgoff,
>  			unsigned long flags);
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index 3a3201e4618e..fa289b91410e 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -855,9 +855,11 @@ __SYSCALL(__NR_clone3, sys_clone3)
>  __SYSCALL(__NR_openat2, sys_openat2)
>  #define __NR_pidfd_getfd 438
>  __SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
> +#define __NR_process_madvise 439
> +__SYSCALL(__NR_process_madvise, sys_process_madvise)
>  
>  #undef __NR_syscalls
> -#define __NR_syscalls 439
> +#define __NR_syscalls 440
>  
>  /*
>   * 32 bit systems traditionally used different
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 3b69a560a7ac..6c7332776e8e 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -280,6 +280,7 @@ COND_SYSCALL(mlockall);
>  COND_SYSCALL(munlockall);
>  COND_SYSCALL(mincore);
>  COND_SYSCALL(madvise);
> +COND_SYSCALL(process_madvise);
>  COND_SYSCALL(remap_file_pages);
>  COND_SYSCALL(mbind);
>  COND_SYSCALL_COMPAT(mbind);
> diff --git a/mm/madvise.c b/mm/madvise.c
> index f75c86b6c463..f29155b8185d 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -17,6 +17,7 @@
>  #include <linux/falloc.h>
>  #include <linux/fadvise.h>
>  #include <linux/sched.h>
> +#include <linux/sched/mm.h>
>  #include <linux/ksm.h>
>  #include <linux/fs.h>
>  #include <linux/file.h>
> @@ -986,6 +987,18 @@ madvise_behavior_valid(int behavior)
>  	}
>  }
>  
> +static bool
> +process_madvise_behavior_valid(int behavior)
> +{
> +	switch (behavior) {
> +	case MADV_COLD:
> +	case MADV_PAGEOUT:
> +		return true;
> +	default:
> +		return false;
> +	}
> +}
> +
>  /*
>   * The madvise(2) system call.
>   *
> @@ -1033,6 +1046,11 @@ madvise_behavior_valid(int behavior)
>   *  MADV_DONTDUMP - the application wants to prevent pages in the given range
>   *		from being included in its core dump.
>   *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> + *  MADV_COLD - the application uses the memory less, so the kernel can
> + *		deactivate the memory to evict it quickly when memory
> + *		pressure happens.
> + *  MADV_PAGEOUT - the application uses the memory very rarely so the kernel
> + *		can page out the memory instantly.
>   *
>   * return values:
>   *  zero    - success
> @@ -1150,3 +1168,49 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
>  {
>  	return do_madvise(current, current->mm, start, len_in, behavior);
>  }
> +
> +SYSCALL_DEFINE5(process_madvise, int, pidfd, unsigned long, start,
> +		size_t, len_in, int, behavior, unsigned long, flags)
> +{
> +	int ret;
> +	struct fd f;
> +	struct pid *pid;
> +	struct task_struct *task;
> +	struct mm_struct *mm;
> +
> +	if (flags != 0)
> +		return -EINVAL;
> +
> +	if (!process_madvise_behavior_valid(behavior))
> +		return -EINVAL;
> +
> +	f = fdget(pidfd);
> +	if (!f.file)
> +		return -EBADF;
> +
> +	pid = pidfd_pid(f.file);
> +	if (IS_ERR(pid)) {
> +		ret = PTR_ERR(pid);
> +		goto fdput;
> +	}
> +
> +	task = get_pid_task(pid, PIDTYPE_PID);
> +	if (!task) {
> +		ret = -ESRCH;
> +		goto fdput;
> +	}
> +
> +	mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
> +	if (IS_ERR_OR_NULL(mm)) {
> +		ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
> +		goto release_task;
> +	}
> +
> +	ret = do_madvise(task, mm, start, len_in, behavior);
> +	mmput(mm);
> +release_task:
> +	put_task_struct(task);
> +fdput:
> +	fdput(f);
> +	return ret;
> +}
> -- 
> 2.25.0.265.gbab2e86ba0-goog
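
For reference, a rough userspace sketch of how a caller might invoke the
syscall as proposed in this revision. Everything here is an assumption
drawn from the patch above, not a released ABI: the number 439, the
five-argument signature, and the names hint_is_allowed() and
process_madvise_v6() are illustrative only. A kernel that does not carry
this exact patch may assign 439 to something else, so the wrapper filters
arguments in userspace the same way process_madvise_behavior_valid() does
before it issues the call.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for older libc headers; values match the Linux UAPI. */
#ifndef MADV_COLD
#define MADV_COLD 20
#endif
#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21
#endif

/*
 * ASSUMPTION: the syscall number proposed by this patch. A kernel without
 * this exact patch may use a different number or a different signature.
 */
#define PROPOSED_NR_PROCESS_MADVISE 439

/* Userspace mirror of process_madvise_behavior_valid() in the patch. */
static bool hint_is_allowed(int advice)
{
	return advice == MADV_COLD || advice == MADV_PAGEOUT;
}

/*
 * Hypothetical wrapper for the v6 signature. Rejects the same inputs the
 * patch rejects with -EINVAL (non-zero flags, unsupported hints) without
 * entering the kernel, then forwards everything else to syscall(2).
 */
static long process_madvise_v6(int pidfd, void *addr, size_t len,
			       int advice, unsigned long flags)
{
	if (flags != 0 || !hint_is_allowed(advice)) {
		errno = EINVAL;
		return -1;
	}
	return syscall(PROPOSED_NR_PROCESS_MADVISE, pidfd, addr, len,
		       advice, flags);
}
```

A caller would obtain pidfd for the target first, then pass an address
range it believes is valid in the target, for example
process_madvise_v6(pidfd, addr, len, MADV_COLD, 0).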
> 
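
The Q.2 answer says the caller can close the race by suspending the
target, for instance with SIGSTOP. A minimal sketch of that pattern
follows. It assumes the target is a child of the caller, so that
waitpid() with WUNTRACED can confirm the stop; a daemon hinting unrelated
processes would need another way to confirm it (or the freezer cgroup).
hint_while_stopped(), hint_fn and count_call() are hypothetical names;
count_call() only stands in for the place where the process_madvise call
would go.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical callback: the work to do while the target is stopped. */
typedef int (*hint_fn)(pid_t target, void *arg);

/* Stand-in callback that only counts how often it was invoked. */
static int count_call(pid_t target, void *arg)
{
	(void)target;
	++*(int *)arg;
	return 0;
}

/*
 * Stop a child, run hint() while it cannot change its address space,
 * then resume it. Returns hint()'s result, or -1 on failure.
 * ASSUMPTION: 'child' is a child of the calling process.
 */
static int hint_while_stopped(pid_t child, hint_fn hint, void *arg)
{
	int status, ret;

	if (kill(child, SIGSTOP) == -1)
		return -1;

	/* Wait until the stop has actually taken effect. */
	do {
		ret = waitpid(child, &status, WUNTRACED);
	} while (ret == -1 && errno == EINTR);

	if (ret == -1) {
		kill(child, SIGCONT);	/* best effort: do not leave it stopped */
		return -1;
	}
	if (!WIFSTOPPED(status))
		return -1;		/* child exited instead of stopping */

	ret = hint(child, arg);

	if (kill(child, SIGCONT) == -1)
		return -1;
	return ret;
}
```

The target stays stopped only for the duration of the callback, which
keeps the window in which it cannot run as short as the hint itself.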

