linux-api.vger.kernel.org archive mirror
* Re: [RFC 1/7] mm: introduce MADV_COOL
       [not found] ` <20190520035254.57579-2-minchan@kernel.org>
@ 2019-05-20  8:16   ` Michal Hocko
  2019-05-20  8:19     ` Michal Hocko
  2019-05-20 22:54     ` Minchan Kim
  0 siblings, 2 replies; 68+ messages in thread
From: Michal Hocko @ 2019-05-20  8:16 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

[CC linux-api]

On Mon 20-05-19 12:52:48, Minchan Kim wrote:
> When a process expects no accesses to a certain memory range
> it could hint the kernel that the pages can be reclaimed
> when memory pressure happens but data should be preserved
> for future use.  This could reduce workingset eviction so it
> ends up increasing performance.
> 
> This patch introduces the new MADV_COOL hint to madvise(2)
> syscall. MADV_COOL can be used by a process to mark a memory range
> as not expected to be used in the near future. The hint can help
> the kernel in deciding which pages to evict early during memory
> pressure.
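
For illustration, a minimal userspace sketch of the proposed hint (this
assumes the MADV_COOL value of 5 added by the patch and a kernel with this
series applied; it is not part of the original submission):

    #define _GNU_SOURCE
    #include <sys/mman.h>

    #ifndef MADV_COOL
    #define MADV_COOL 5             /* value proposed in this patch */
    #endif

    int main(void)
    {
            size_t len = 64 * 4096;
            char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (p == MAP_FAILED)
                    return 1;
            p[0] = 1;               /* populate at least one page */

            /* data stays intact; the pages just become early reclaim
             * candidates once memory pressure happens */
            return madvise(p, len, MADV_COOL);
    }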

I do not want to start a naming fight but MADV_COOL sounds a bit
misleading. Everybody thinks his pages are cool ;). Probably MADV_COLD
or MADV_DONTNEED_PRESERVE.

> Internally, it works by deactivating memory from the active list to
> the head of the inactive list, so when memory pressure happens, those
> pages will be reclaimed earlier than other active pages unless they
> are accessed again in the meantime.

Could you elaborate about the decision to move to the head rather than
tail? What should happen to inactive pages? Should we move them to the
tail? Your implementation seems to ignore those completely. Why?

What should happen for shared pages? In other words do we want to allow
less privileged process to control evicting of shared pages with a more
privileged one? E.g. think of all sorts of side channel attacks. Maybe
we want to do the same thing as for mincore where write access is
required.
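
For reference, the mincore() hardening mentioned above boils down to a
per-VMA check along these lines (a sketch modeled on can_do_mincore();
not code from this series):

    static bool can_hint_vma(struct vm_area_struct *vma)
    {
            /* private anonymous memory is only visible to its owner */
            if (vma_is_anonymous(vma))
                    return true;
            if (!vma->vm_file)
                    return false;
            /* file-backed: require write access to the file so eviction
             * state cannot be probed across privilege boundaries */
            return inode_owner_or_capable(file_inode(vma->vm_file)) ||
                   inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0;
    }
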
-- 
Michal Hocko
SUSE Labs


* Re: [RFC 1/7] mm: introduce MADV_COOL
  2019-05-20  8:16   ` [RFC 1/7] mm: introduce MADV_COOL Michal Hocko
@ 2019-05-20  8:19     ` Michal Hocko
  2019-05-20 15:08       ` Suren Baghdasaryan
  2019-05-20 22:55       ` Minchan Kim
  2019-05-20 22:54     ` Minchan Kim
  1 sibling, 2 replies; 68+ messages in thread
From: Michal Hocko @ 2019-05-20  8:19 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Mon 20-05-19 10:16:21, Michal Hocko wrote:
> [CC linux-api]
> 
> On Mon 20-05-19 12:52:48, Minchan Kim wrote:
> > When a process expects no accesses to a certain memory range
> > it could hint the kernel that the pages can be reclaimed
> > when memory pressure happens but data should be preserved
> > for future use.  This could reduce workingset eviction so it
> > ends up increasing performance.
> > 
> > This patch introduces the new MADV_COOL hint to madvise(2)
> > syscall. MADV_COOL can be used by a process to mark a memory range
> > as not expected to be used in the near future. The hint can help
> > the kernel in deciding which pages to evict early during memory
> > pressure.
> 
> I do not want to start a naming fight but MADV_COOL sounds a bit
> misleading. Everybody thinks his pages are cool ;). Probably MADV_COLD
> or MADV_DONTNEED_PRESERVE.

OK, I can see that you have used MADV_COLD for a different mode.
So this one is effectively a non-destructive MADV_FREE alternative,
so MADV_FREE_PRESERVE would sound like a good fit. Your MADV_COLD
in other patch would then be MADV_DONTNEED_PRESERVE. Right?

-- 
Michal Hocko
SUSE Labs


* Re: [RFC 3/7] mm: introduce MADV_COLD
       [not found] ` <20190520035254.57579-4-minchan@kernel.org>
@ 2019-05-20  8:27   ` Michal Hocko
  2019-05-20 23:00     ` Minchan Kim
  0 siblings, 1 reply; 68+ messages in thread
From: Michal Hocko @ 2019-05-20  8:27 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

[Cc linux-api]

On Mon 20-05-19 12:52:50, Minchan Kim wrote:
> When a process expects no accesses to a certain memory range
> for a long time, it could hint the kernel that the pages can be
> reclaimed instantly but data should be preserved for future use.
> This could reduce workingset eviction so it ends up increasing
> performance.
> 
> This patch introduces the new MADV_COLD hint to madvise(2)
> syscall. MADV_COLD can be used by a process to mark a memory range
> as not expected to be used for a long time. The hint can help
> the kernel in deciding which pages to evict proactively.

As mentioned in other email this looks like a non-destructive
MADV_DONTNEED alternative.

> Internally, it works by reclaiming memory in the context of the process
> that calls the syscall. If a page is dirty and the backing storage is
> not a synchronous device, the written page will be rotated back to the
> LRU's tail once the write is done, so it will be reclaimed easily when
> memory pressure happens. If the backing storage is a synchronous
> device (e.g., zram), the page will be reclaimed instantly.
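
Usage would mirror the MADV_COOL example earlier in the thread: a minimal
sketch (again assuming the value added by this patch, here MADV_COLD == 6,
and a kernel with the series applied):

    #define _GNU_SOURCE
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    #ifndef MADV_COLD
    #define MADV_COLD 6             /* value proposed in this patch */
    #endif

    int main(void)
    {
            size_t len = 1 << 20;
            char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (p == MAP_FAILED)
                    return 1;
            memset(p, 1, len);      /* populate the range */

            system("grep -E 'pswpout|pgsteal' /proc/vmstat");
            madvise(p, len, MADV_COLD);     /* reclaim right away */
            system("grep -E 'pswpout|pgsteal' /proc/vmstat");
            return 0;
    }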

Why do we special case async backing storage? Please always try to
explain _why_ the decision is made.

I haven't checked the implementation yet so I cannot comment on that.

> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  include/linux/swap.h                   |   1 +
>  include/uapi/asm-generic/mman-common.h |   1 +
>  mm/madvise.c                           | 123 +++++++++++++++++++++++++
>  mm/vmscan.c                            |  74 +++++++++++++++
>  4 files changed, 199 insertions(+)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 64795abea003..7f32a948fc6a 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -365,6 +365,7 @@ extern int vm_swappiness;
>  extern int remove_mapping(struct address_space *mapping, struct page *page);
>  extern unsigned long vm_total_pages;
>  
> +extern unsigned long reclaim_pages(struct list_head *page_list);
>  #ifdef CONFIG_NUMA
>  extern int node_reclaim_mode;
>  extern int sysctl_min_unmapped_ratio;
> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index f7a4a5d4b642..b9b51eeb8e1a 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -43,6 +43,7 @@
>  #define MADV_WILLNEED	3		/* will need these pages */
>  #define MADV_DONTNEED	4		/* don't need these pages */
>  #define MADV_COOL	5		/* deactivate these pages */
> +#define MADV_COLD	6		/* reclaim these pages */
>  
>  /* common parameters: try to keep these consistent across architectures */
>  #define MADV_FREE	8		/* free pages only if memory pressure */
> diff --git a/mm/madvise.c b/mm/madvise.c
> index c05817fb570d..9a6698b56845 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -42,6 +42,7 @@ static int madvise_need_mmap_write(int behavior)
>  	case MADV_WILLNEED:
>  	case MADV_DONTNEED:
>  	case MADV_COOL:
> +	case MADV_COLD:
>  	case MADV_FREE:
>  		return 0;
>  	default:
> @@ -416,6 +417,125 @@ static long madvise_cool(struct vm_area_struct *vma,
>  	return 0;
>  }
>  
> +static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
> +				unsigned long end, struct mm_walk *walk)
> +{
> +	pte_t *orig_pte, *pte, ptent;
> +	spinlock_t *ptl;
> +	LIST_HEAD(page_list);
> +	struct page *page;
> +	int isolated = 0;
> +	struct vm_area_struct *vma = walk->vma;
> +	unsigned long next;
> +
> +	next = pmd_addr_end(addr, end);
> +	if (pmd_trans_huge(*pmd)) {
> +		spinlock_t *ptl;
> +
> +		ptl = pmd_trans_huge_lock(pmd, vma);
> +		if (!ptl)
> +			return 0;
> +
> +		if (is_huge_zero_pmd(*pmd))
> +			goto huge_unlock;
> +
> +		page = pmd_page(*pmd);
> +		if (page_mapcount(page) > 1)
> +			goto huge_unlock;
> +
> +		if (next - addr != HPAGE_PMD_SIZE) {
> +			int err;
> +
> +			get_page(page);
> +			spin_unlock(ptl);
> +			lock_page(page);
> +			err = split_huge_page(page);
> +			unlock_page(page);
> +			put_page(page);
> +			if (!err)
> +				goto regular_page;
> +			return 0;
> +		}
> +
> +		if (isolate_lru_page(page))
> +			goto huge_unlock;
> +
> +		list_add(&page->lru, &page_list);
> +huge_unlock:
> +		spin_unlock(ptl);
> +		reclaim_pages(&page_list);
> +		return 0;
> +	}
> +
> +	if (pmd_trans_unstable(pmd))
> +		return 0;
> +regular_page:
> +	orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> +	for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {
> +		ptent = *pte;
> +		if (!pte_present(ptent))
> +			continue;
> +
> +		page = vm_normal_page(vma, addr, ptent);
> +		if (!page)
> +			continue;
> +
> +		if (page_mapcount(page) > 1)
> +			continue;
> +
> +		if (isolate_lru_page(page))
> +			continue;
> +
> +		isolated++;
> +		list_add(&page->lru, &page_list);
> +		if (isolated >= SWAP_CLUSTER_MAX) {
> +			pte_unmap_unlock(orig_pte, ptl);
> +			reclaim_pages(&page_list);
> +			isolated = 0;
> +			pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> +			orig_pte = pte;
> +		}
> +	}
> +
> +	pte_unmap_unlock(orig_pte, ptl);
> +	reclaim_pages(&page_list);
> +	cond_resched();
> +
> +	return 0;
> +}
> +
> +static void madvise_cold_page_range(struct mmu_gather *tlb,
> +			     struct vm_area_struct *vma,
> +			     unsigned long addr, unsigned long end)
> +{
> +	struct mm_walk warm_walk = {
> +		.pmd_entry = madvise_cold_pte_range,
> +		.mm = vma->vm_mm,
> +	};
> +
> +	tlb_start_vma(tlb, vma);
> +	walk_page_range(addr, end, &warm_walk);
> +	tlb_end_vma(tlb, vma);
> +}
> +
> +
> +static long madvise_cold(struct vm_area_struct *vma,
> +			unsigned long start_addr, unsigned long end_addr)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	struct mmu_gather tlb;
> +
> +	if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
> +		return -EINVAL;
> +
> +	lru_add_drain();
> +	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
> +	madvise_cold_page_range(&tlb, vma, start_addr, end_addr);
> +	tlb_finish_mmu(&tlb, start_addr, end_addr);
> +
> +	return 0;
> +}
> +
>  static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  				unsigned long end, struct mm_walk *walk)
>  
> @@ -806,6 +926,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
>  		return madvise_willneed(vma, prev, start, end);
>  	case MADV_COOL:
>  		return madvise_cool(vma, start, end);
> +	case MADV_COLD:
> +		return madvise_cold(vma, start, end);
>  	case MADV_FREE:
>  	case MADV_DONTNEED:
>  		return madvise_dontneed_free(vma, prev, start, end, behavior);
> @@ -828,6 +950,7 @@ madvise_behavior_valid(int behavior)
>  	case MADV_DONTNEED:
>  	case MADV_FREE:
>  	case MADV_COOL:
> +	case MADV_COLD:
>  #ifdef CONFIG_KSM
>  	case MADV_MERGEABLE:
>  	case MADV_UNMERGEABLE:
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a28e5d17b495..1701b31f70a8 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2096,6 +2096,80 @@ static void shrink_active_list(unsigned long nr_to_scan,
>  			nr_deactivate, nr_rotated, sc->priority, file);
>  }
>  
> +unsigned long reclaim_pages(struct list_head *page_list)
> +{
> +	int nid = -1;
> +	unsigned long nr_isolated[2] = {0, };
> +	unsigned long nr_reclaimed = 0;
> +	LIST_HEAD(node_page_list);
> +	struct reclaim_stat dummy_stat;
> +	struct scan_control sc = {
> +		.gfp_mask = GFP_KERNEL,
> +		.priority = DEF_PRIORITY,
> +		.may_writepage = 1,
> +		.may_unmap = 1,
> +		.may_swap = 1,
> +	};
> +
> +	while (!list_empty(page_list)) {
> +		struct page *page;
> +
> +		page = lru_to_page(page_list);
> +		list_del(&page->lru);
> +
> +		if (nid == -1) {
> +			nid = page_to_nid(page);
> +			INIT_LIST_HEAD(&node_page_list);
> +			nr_isolated[0] = nr_isolated[1] = 0;
> +		}
> +
> +		if (nid == page_to_nid(page)) {
> +			list_add(&page->lru, &node_page_list);
> +			nr_isolated[!!page_is_file_cache(page)] +=
> +						hpage_nr_pages(page);
> +			continue;
> +		}
> +
> +		nid = page_to_nid(page);
> +
> +		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
> +					nr_isolated[0]);
> +		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
> +					nr_isolated[1]);
> +		nr_reclaimed += shrink_page_list(&node_page_list,
> +				NODE_DATA(nid), &sc, TTU_IGNORE_ACCESS,
> +				&dummy_stat, true);
> +		while (!list_empty(&node_page_list)) {
> +			struct page *page = lru_to_page(&node_page_list);
> +
> +			list_del(&page->lru);
> +			putback_lru_page(page);
> +		}
> +		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
> +					-nr_isolated[0]);
> +		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
> +					-nr_isolated[1]);
> +		nr_isolated[0] = nr_isolated[1] = 0;
> +		INIT_LIST_HEAD(&node_page_list);
> +	}
> +
> +	if (!list_empty(&node_page_list)) {
> +		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
> +					nr_isolated[0]);
> +		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
> +					nr_isolated[1]);
> +		nr_reclaimed += shrink_page_list(&node_page_list,
> +				NODE_DATA(nid), &sc, TTU_IGNORE_ACCESS,
> +				&dummy_stat, true);
> +		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
> +					-nr_isolated[0]);
> +		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
> +					-nr_isolated[1]);
> +	}
> +
> +	return nr_reclaimed;
> +}
> +
>  /*
>   * The inactive anon list should be small enough that the VM never has
>   * to do too much work.
> -- 
> 2.21.0.1020.gf2820cf01a-goog
> 

-- 
Michal Hocko
SUSE Labs


* Re: [RFC 5/7] mm: introduce external memory hinting API
       [not found] ` <20190520035254.57579-6-minchan@kernel.org>
@ 2019-05-20  9:18   ` Michal Hocko
  2019-05-21  2:41     ` Minchan Kim
  0 siblings, 1 reply; 68+ messages in thread
From: Michal Hocko @ 2019-05-20  9:18 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

[Cc linux-api]

On Mon 20-05-19 12:52:52, Minchan Kim wrote:
> There are use cases where a centralized userspace daemon wants to give
> a memory hint like MADV_[COOL|COLD] to another process. Android's
> ActivityManagerService is one of them.
> 
> It's similar in spirit to madvise(MADV_DONTNEED), but the information
> required to make the reclaim decision is not known to the app. Instead,
> it is known to the centralized userspace daemon(ActivityManagerService),
> and that daemon must be able to initiate reclaim on its own without
> any app involvement.

Could you expand some more about how this all works? How does the
centralized daemon track respective ranges? How does it synchronize
against parallel modification of the address space etc.

> To solve the issue, this patch introduces a new syscall,
> process_madvise(2), which works based on a pidfd so it can give a hint
> to an external process.
> 
> int process_madvise(int pidfd, void *addr, size_t length, int advise);
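
A sketch of how a caller might use it (assuming the x86_64 syscall number
428 added by this patch; in this series the pidfd is an open /proc/<pid>
directory fd, and the target pid, address and length here are placeholders):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    #ifndef __NR_process_madvise
    #define __NR_process_madvise 428        /* x86_64 number in this series */
    #endif

    static long process_madvise(int pidfd, void *addr, size_t len, int advice)
    {
            return syscall(__NR_process_madvise, pidfd, addr, len, advice);
    }

    int main(void)
    {
            /* placeholder target; a real daemon would know the pid and
             * pick ranges e.g. from /proc/<pid>/maps */
            int pidfd = open("/proc/1234", O_DIRECTORY | O_CLOEXEC);

            if (pidfd < 0)
                    return 1;
            return process_madvise(pidfd, (void *)0x7f0000000000UL,
                                   1 << 20, 6 /* MADV_COLD in this series */);
    }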

OK, this makes some sense from the API point of view. When we discussed
that at LSFMM I was contemplating something like that, except the fd
would be a VMA fd rather than the process. We could extend
and reuse /proc/<pid>/map_files interface which doesn't support the
anonymous memory right now. 

I am not saying this would be a better interface but I wanted to mention
it here for a further discussion. One slight advantage would be that
you know the exact object that you are operating on because you have a
fd for the VMA and we would have a more straightforward way to reject
operation if the underlying object has changed (e.g. unmapped and reused
for a different mapping).

> All the advice madvise provides can be supported in process_madvise, too.
> Since it can affect another process's address range, only a process that
> is privileged (CAP_SYS_PTRACE) or otherwise entitled (e.g., same UID) to
> ptrace the target process can use it successfully.

The proc_mem_open model we use for accessing the address space via proc
sounds like a good model. You are doing something similar.

> Please suggest a better idea if you have other thoughts about the permission model.
> 
> * from v1r1
>   * use ptrace capability - surenb, dancol
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  arch/x86/entry/syscalls/syscall_32.tbl |  1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |  1 +
>  include/linux/proc_fs.h                |  1 +
>  include/linux/syscalls.h               |  2 ++
>  include/uapi/asm-generic/unistd.h      |  2 ++
>  kernel/signal.c                        |  2 +-
>  kernel/sys_ni.c                        |  1 +
>  mm/madvise.c                           | 45 ++++++++++++++++++++++++++
>  8 files changed, 54 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index 4cd5f982b1e5..5b9dd55d6b57 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -438,3 +438,4 @@
>  425	i386	io_uring_setup		sys_io_uring_setup		__ia32_sys_io_uring_setup
>  426	i386	io_uring_enter		sys_io_uring_enter		__ia32_sys_io_uring_enter
>  427	i386	io_uring_register	sys_io_uring_register		__ia32_sys_io_uring_register
> +428	i386	process_madvise		sys_process_madvise		__ia32_sys_process_madvise
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index 64ca0d06259a..0e5ee78161c9 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -355,6 +355,7 @@
>  425	common	io_uring_setup		__x64_sys_io_uring_setup
>  426	common	io_uring_enter		__x64_sys_io_uring_enter
>  427	common	io_uring_register	__x64_sys_io_uring_register
> +428	common	process_madvise		__x64_sys_process_madvise
>  
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
> index 52a283ba0465..f8545d7c5218 100644
> --- a/include/linux/proc_fs.h
> +++ b/include/linux/proc_fs.h
> @@ -122,6 +122,7 @@ static inline struct pid *tgid_pidfd_to_pid(const struct file *file)
>  
>  #endif /* CONFIG_PROC_FS */
>  
> +extern struct pid *pidfd_to_pid(const struct file *file);
>  struct net;
>  
>  static inline struct proc_dir_entry *proc_net_mkdir(
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index e2870fe1be5b..21c6c9a62006 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -872,6 +872,8 @@ asmlinkage long sys_munlockall(void);
>  asmlinkage long sys_mincore(unsigned long start, size_t len,
>  				unsigned char __user * vec);
>  asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
> +asmlinkage long sys_process_madvise(int pid_fd, unsigned long start,
> +				size_t len, int behavior);
>  asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
>  			unsigned long prot, unsigned long pgoff,
>  			unsigned long flags);
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index dee7292e1df6..7ee82ce04620 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -832,6 +832,8 @@ __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
>  __SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
>  #define __NR_io_uring_register 427
>  __SYSCALL(__NR_io_uring_register, sys_io_uring_register)
> +#define __NR_process_madvise 428
> +__SYSCALL(__NR_process_madvise, sys_process_madvise)
>  
>  #undef __NR_syscalls
>  #define __NR_syscalls 428
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 1c86b78a7597..04e75daab1f8 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -3620,7 +3620,7 @@ static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo, siginfo_t *info)
>  	return copy_siginfo_from_user(kinfo, info);
>  }
>  
> -static struct pid *pidfd_to_pid(const struct file *file)
> +struct pid *pidfd_to_pid(const struct file *file)
>  {
>  	if (file->f_op == &pidfd_fops)
>  		return file->private_data;
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 4d9ae5ea6caf..5277421795ab 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -278,6 +278,7 @@ COND_SYSCALL(mlockall);
>  COND_SYSCALL(munlockall);
>  COND_SYSCALL(mincore);
>  COND_SYSCALL(madvise);
> +COND_SYSCALL(process_madvise);
>  COND_SYSCALL(remap_file_pages);
>  COND_SYSCALL(mbind);
>  COND_SYSCALL_COMPAT(mbind);
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 119e82e1f065..af02aa17e5c1 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -9,6 +9,7 @@
>  #include <linux/mman.h>
>  #include <linux/pagemap.h>
>  #include <linux/page_idle.h>
> +#include <linux/proc_fs.h>
>  #include <linux/syscalls.h>
>  #include <linux/mempolicy.h>
>  #include <linux/page-isolation.h>
> @@ -16,6 +17,7 @@
>  #include <linux/hugetlb.h>
>  #include <linux/falloc.h>
>  #include <linux/sched.h>
> +#include <linux/sched/mm.h>
>  #include <linux/ksm.h>
>  #include <linux/fs.h>
>  #include <linux/file.h>
> @@ -1140,3 +1142,46 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
>  {
>  	return madvise_core(current, start, len_in, behavior);
>  }
> +
> +SYSCALL_DEFINE4(process_madvise, int, pidfd, unsigned long, start,
> +		size_t, len_in, int, behavior)
> +{
> +	int ret;
> +	struct fd f;
> +	struct pid *pid;
> +	struct task_struct *tsk;
> +	struct mm_struct *mm;
> +
> +	f = fdget(pidfd);
> +	if (!f.file)
> +		return -EBADF;
> +
> +	pid = pidfd_to_pid(f.file);
> +	if (IS_ERR(pid)) {
> +		ret = PTR_ERR(pid);
> +		goto err;
> +	}
> +
> +	ret = -EINVAL;
> +	rcu_read_lock();
> +	tsk = pid_task(pid, PIDTYPE_PID);
> +	if (!tsk) {
> +		rcu_read_unlock();
> +		goto err;
> +	}
> +	get_task_struct(tsk);
> +	rcu_read_unlock();
> +	mm = mm_access(tsk, PTRACE_MODE_ATTACH_REALCREDS);
> +	if (!mm || IS_ERR(mm)) {
> +		ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
> +		if (ret == -EACCES)
> +			ret = -EPERM;
> +		goto err;
> +	}
> +	ret = madvise_core(tsk, start, len_in, behavior);
> +	mmput(mm);
> +	put_task_struct(tsk);
> +err:
> +	fdput(f);
> +	return ret;
> +}
> -- 
> 2.21.0.1020.gf2820cf01a-goog
> 

-- 
Michal Hocko
SUSE Labs


* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector array
       [not found] ` <20190520035254.57579-7-minchan@kernel.org>
@ 2019-05-20  9:22   ` Michal Hocko
  2019-05-21  2:48     ` Minchan Kim
  0 siblings, 1 reply; 68+ messages in thread
From: Michal Hocko @ 2019-05-20  9:22 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

[Cc linux-api]

On Mon 20-05-19 12:52:53, Minchan Kim wrote:
> Currently, the process_madvise syscall works on only one address range,
> so the user has to call the syscall several times to give hints to
> multiple address ranges.

Is that a problem? How big of a problem? Any numbers?

> This patch extends the process_madvise syscall to support multiple
> hints, address ranges and return values so the user can give all
> hints at once.
> 
> struct pr_madvise_param {
>     int size;                       /* the size of this structure */
>     const struct iovec __user *vec; /* address range array */
> }
> 
> int process_madvise(int pidfd, ssize_t nr_elem,
> 		    int *behavior,
> 		    struct pr_madvise_param *results,
> 		    struct pr_madvise_param *ranges,
> 		    unsigned long flags);
> 
> - pidfd
> 
> target process fd
> 
> - nr_elem
> 
> the number of elements in the behavior, results and ranges arrays
> 
> - behavior
> 
> hints for each address range in the remote process, so the user can
> give a different hint for each range.

What is the guarantee of a single call? Do all hints get applied, or does
the first failure back off? What are the atomicity guarantees?

> 
> - results
> 
> array of buffers to get results for associated remote address range
> action.
> 
> - ranges
> 
> array of buffers holding the remote process's address ranges to be
> processed
> 
> - flags
> 
> extra argument for the future. It should be zero for now.
> 
> Example)
> 
> struct pr_madvise_param {
>         int size;
>         const struct iovec *vec;
> };
> 
> int main(int argc, char *argv[])
> {
>         struct pr_madvise_param retp, rangep;
>         struct iovec result_vec[2], range_vec[2];
>         int hints[2];
>         long ret[2];
>         void *addr[2];
> 
>         pid_t pid;
>         char cmd[64] = {0,};
>         addr[0] = mmap(NULL, ALLOC_SIZE, PROT_READ|PROT_WRITE,
>                           MAP_POPULATE|MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
> 
>         if (MAP_FAILED == addr[0])
>                 return 1;
> 
>         addr[1] = mmap(NULL, ALLOC_SIZE, PROT_READ|PROT_WRITE,
>                           MAP_POPULATE|MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
> 
>         if (MAP_FAILED == addr[1])
>                 return 1;
> 
>         hints[0] = MADV_COLD;
>         range_vec[0].iov_base = addr[0];
>         range_vec[0].iov_len = ALLOC_SIZE;
>         result_vec[0].iov_base = &ret[0];
>         result_vec[0].iov_len = sizeof(long);
>         retp.vec = result_vec;
>         retp.size = sizeof(struct pr_madvise_param);
> 
>         hints[1] = MADV_COOL;
>         range_vec[1].iov_base = addr[1];
>         range_vec[1].iov_len = ALLOC_SIZE;
>         result_vec[1].iov_base = &ret[1];
>         result_vec[1].iov_len = sizeof(long);
>         rangep.vec = range_vec;
>         rangep.size = sizeof(struct pr_madvise_param);
> 
>         pid = fork();
>         if (!pid) {
>                 sleep(10);
>         } else {
>                 /* this series takes an open /proc/<pid> dir fd as pidfd */
>                 snprintf(cmd, sizeof(cmd), "/proc/%d", pid);
>                 int pidfd = open(cmd, O_DIRECTORY | O_CLOEXEC);
>                 if (pidfd < 0)
>                         return 1;
> 
>                 /* munmap to make pages private for the child */
>                 munmap(addr[0], ALLOC_SIZE);
>                 munmap(addr[1], ALLOC_SIZE);
>                 system("cat /proc/vmstat | egrep 'pswpout|deactivate'");
>                 if (syscall(__NR_process_madvise, pidfd, 2, hints,
> 						&retp, &rangep, 0))
>                         perror("process_madvise fail\n");
>                 system("cat /proc/vmstat | egrep 'pswpout|deactivate'");
>         }
> 
>         return 0;
> }
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  include/uapi/asm-generic/mman-common.h |   5 +
>  mm/madvise.c                           | 184 +++++++++++++++++++++----
>  2 files changed, 166 insertions(+), 23 deletions(-)
> 
> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index b9b51eeb8e1a..b8e230de84a6 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -74,4 +74,9 @@
>  #define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
>  				 PKEY_DISABLE_WRITE)
>  
> +struct pr_madvise_param {
> +	int size;			/* the size of this structure */
> +	const struct iovec __user *vec;	/* address range array */
> +};
> +
>  #endif /* __ASM_GENERIC_MMAN_COMMON_H */
> diff --git a/mm/madvise.c b/mm/madvise.c
> index af02aa17e5c1..f4f569dac2bd 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -320,6 +320,7 @@ static int madvise_cool_pte_range(pmd_t *pmd, unsigned long addr,
>  	struct page *page;
>  	struct vm_area_struct *vma = walk->vma;
>  	unsigned long next;
> +	long nr_pages = 0;
>  
>  	next = pmd_addr_end(addr, end);
>  	if (pmd_trans_huge(*pmd)) {
> @@ -380,9 +381,12 @@ static int madvise_cool_pte_range(pmd_t *pmd, unsigned long addr,
>  
>  		ptep_test_and_clear_young(vma, addr, pte);
>  		deactivate_page(page);
> +		nr_pages++;
> +
>  	}
>  
>  	pte_unmap_unlock(orig_pte, ptl);
> +	*(long *)walk->private += nr_pages;
>  	cond_resched();
>  
>  	return 0;
> @@ -390,11 +394,13 @@ static int madvise_cool_pte_range(pmd_t *pmd, unsigned long addr,
>  
>  static void madvise_cool_page_range(struct mmu_gather *tlb,
>  			     struct vm_area_struct *vma,
> -			     unsigned long addr, unsigned long end)
> +			     unsigned long addr, unsigned long end,
> +			     long *nr_pages)
>  {
>  	struct mm_walk cool_walk = {
>  		.pmd_entry = madvise_cool_pte_range,
>  		.mm = vma->vm_mm,
> +		.private = nr_pages
>  	};
>  
>  	tlb_start_vma(tlb, vma);
> @@ -403,7 +409,8 @@ static void madvise_cool_page_range(struct mmu_gather *tlb,
>  }
>  
>  static long madvise_cool(struct vm_area_struct *vma,
> -			unsigned long start_addr, unsigned long end_addr)
> +			unsigned long start_addr, unsigned long end_addr,
> +			long *nr_pages)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  	struct mmu_gather tlb;
> @@ -413,7 +420,7 @@ static long madvise_cool(struct vm_area_struct *vma,
>  
>  	lru_add_drain();
>  	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
> -	madvise_cool_page_range(&tlb, vma, start_addr, end_addr);
> +	madvise_cool_page_range(&tlb, vma, start_addr, end_addr, nr_pages);
>  	tlb_finish_mmu(&tlb, start_addr, end_addr);
>  
>  	return 0;
> @@ -429,6 +436,7 @@ static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
>  	int isolated = 0;
>  	struct vm_area_struct *vma = walk->vma;
>  	unsigned long next;
> +	long nr_pages = 0;
>  
>  	next = pmd_addr_end(addr, end);
>  	if (pmd_trans_huge(*pmd)) {
> @@ -492,7 +500,7 @@ static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
>  		list_add(&page->lru, &page_list);
>  		if (isolated >= SWAP_CLUSTER_MAX) {
>  			pte_unmap_unlock(orig_pte, ptl);
> -			reclaim_pages(&page_list);
> +			nr_pages += reclaim_pages(&page_list);
>  			isolated = 0;
>  			pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>  			orig_pte = pte;
> @@ -500,19 +508,22 @@ static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
>  	}
>  
>  	pte_unmap_unlock(orig_pte, ptl);
> -	reclaim_pages(&page_list);
> +	nr_pages += reclaim_pages(&page_list);
>  	cond_resched();
>  
> +	*(long *)walk->private += nr_pages;
>  	return 0;
>  }
>  
>  static void madvise_cold_page_range(struct mmu_gather *tlb,
>  			     struct vm_area_struct *vma,
> -			     unsigned long addr, unsigned long end)
> +			     unsigned long addr, unsigned long end,
> +			     long *nr_pages)
>  {
>  	struct mm_walk warm_walk = {
>  		.pmd_entry = madvise_cold_pte_range,
>  		.mm = vma->vm_mm,
> +		.private = nr_pages,
>  	};
>  
>  	tlb_start_vma(tlb, vma);
> @@ -522,7 +533,8 @@ static void madvise_cold_page_range(struct mmu_gather *tlb,
>  
>  
>  static long madvise_cold(struct vm_area_struct *vma,
> -			unsigned long start_addr, unsigned long end_addr)
> +			unsigned long start_addr, unsigned long end_addr,
> +			long *nr_pages)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  	struct mmu_gather tlb;
> @@ -532,7 +544,7 @@ static long madvise_cold(struct vm_area_struct *vma,
>  
>  	lru_add_drain();
>  	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
> -	madvise_cold_page_range(&tlb, vma, start_addr, end_addr);
> +	madvise_cold_page_range(&tlb, vma, start_addr, end_addr, nr_pages);
>  	tlb_finish_mmu(&tlb, start_addr, end_addr);
>  
>  	return 0;
> @@ -922,7 +934,7 @@ static int madvise_inject_error(int behavior,
>  static long
>  madvise_vma(struct task_struct *tsk, struct vm_area_struct *vma,
>  		struct vm_area_struct **prev, unsigned long start,
> -		unsigned long end, int behavior)
> +		unsigned long end, int behavior, long *nr_pages)
>  {
>  	switch (behavior) {
>  	case MADV_REMOVE:
> @@ -930,9 +942,9 @@ madvise_vma(struct task_struct *tsk, struct vm_area_struct *vma,
>  	case MADV_WILLNEED:
>  		return madvise_willneed(vma, prev, start, end);
>  	case MADV_COOL:
> -		return madvise_cool(vma, start, end);
> +		return madvise_cool(vma, start, end, nr_pages);
>  	case MADV_COLD:
> -		return madvise_cold(vma, start, end);
> +		return madvise_cold(vma, start, end, nr_pages);
>  	case MADV_FREE:
>  	case MADV_DONTNEED:
>  		return madvise_dontneed_free(tsk, vma, prev, start,
> @@ -981,7 +993,7 @@ madvise_behavior_valid(int behavior)
>  }
>  
>  static int madvise_core(struct task_struct *tsk, unsigned long start,
> -			size_t len_in, int behavior)
> +			size_t len_in, int behavior, long *nr_pages)
>  {
>  	unsigned long end, tmp;
>  	struct vm_area_struct *vma, *prev;
> @@ -996,6 +1008,7 @@ static int madvise_core(struct task_struct *tsk, unsigned long start,
>  
>  	if (start & ~PAGE_MASK)
>  		return error;
> +
>  	len = (len_in + ~PAGE_MASK) & PAGE_MASK;
>  
>  	/* Check to see whether len was rounded up from small -ve to zero */
> @@ -1035,6 +1048,8 @@ static int madvise_core(struct task_struct *tsk, unsigned long start,
>  	blk_start_plug(&plug);
>  	for (;;) {
>  		/* Still start < end. */
> +		long pages = 0;
> +
>  		error = -ENOMEM;
>  		if (!vma)
>  			goto out;
> @@ -1053,9 +1068,11 @@ static int madvise_core(struct task_struct *tsk, unsigned long start,
>  			tmp = end;
>  
>  		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
> -		error = madvise_vma(tsk, vma, &prev, start, tmp, behavior);
> +		error = madvise_vma(tsk, vma, &prev, start, tmp,
> +					behavior, &pages);
>  		if (error)
>  			goto out;
> +		*nr_pages += pages;
>  		start = tmp;
>  		if (prev && start < prev->vm_end)
>  			start = prev->vm_end;
> @@ -1140,26 +1157,137 @@ static int madvise_core(struct task_struct *tsk, unsigned long start,
>   */
>  SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
>  {
> -	return madvise_core(current, start, len_in, behavior);
> +	unsigned long dummy;
> +
> +	return madvise_core(current, start, len_in, behavior, &dummy);
>  }
>  
> -SYSCALL_DEFINE4(process_madvise, int, pidfd, unsigned long, start,
> -		size_t, len_in, int, behavior)
> +static int pr_madvise_copy_param(struct pr_madvise_param __user *u_param,
> +		struct pr_madvise_param *param)
> +{
> +	u32 size;
> +	int ret;
> +
> +	memset(param, 0, sizeof(*param));
> +
> +	ret = get_user(size, &u_param->size);
> +	if (ret)
> +		return ret;
> +
> +	if (size > PAGE_SIZE)
> +		return -E2BIG;
> +
> +	if (!size || size > sizeof(struct pr_madvise_param))
> +		return -EINVAL;
> +
> +	ret = copy_from_user(param, u_param, size);
> +	if (ret)
> +		return -EFAULT;
> +
> +	return ret;
> +}
> +
> +static int process_madvise_core(struct task_struct *tsk, int *behaviors,
> +				struct iov_iter *iter,
> +				const struct iovec *range_vec,
> +				unsigned long riovcnt,
> +				unsigned long flags)
> +{
> +	int i;
> +	long err;
> +
> +	for (err = 0, i = 0; i < riovcnt && iov_iter_count(iter); i++) {
> +		long ret = 0;
> +
> +		err = madvise_core(tsk, (unsigned long)range_vec[i].iov_base,
> +				range_vec[i].iov_len, behaviors[i],
> +				&ret);
> +		if (err)
> +			ret = err;
> +
> +		if (copy_to_iter(&ret, sizeof(long), iter) !=
> +				sizeof(long)) {
> +			err = -EFAULT;
> +			break;
> +		}
> +
> +		err = 0;
> +	}
> +
> +	return err;
> +}
> +
> +SYSCALL_DEFINE6(process_madvise, int, pidfd, ssize_t, nr_elem,
> +			const int __user *, hints,
> +			struct pr_madvise_param __user *, results,
> +			struct pr_madvise_param __user *, ranges,
> +			unsigned long, flags)
>  {
>  	int ret;
>  	struct fd f;
>  	struct pid *pid;
>  	struct task_struct *tsk;
>  	struct mm_struct *mm;
> +	struct pr_madvise_param result_p, range_p;
> +	const struct iovec __user *result_vec, __user *range_vec;
> +	int *behaviors;
> +	struct iovec iovstack_result[UIO_FASTIOV];
> +	struct iovec iovstack_r[UIO_FASTIOV];
> +	struct iovec *iov_l = iovstack_result;
> +	struct iovec *iov_r = iovstack_r;
> +	struct iov_iter iter;
> +
> +	if (flags != 0)
> +		return -EINVAL;
> +
> +	ret = pr_madvise_copy_param(results, &result_p);
> +	if (ret)
> +		return ret;
> +
> +	ret = pr_madvise_copy_param(ranges, &range_p);
> +	if (ret)
> +		return ret;
> +
> +	result_vec = result_p.vec;
> +	range_vec = range_p.vec;
> +
> +	if (result_p.size != sizeof(struct pr_madvise_param) ||
> +			range_p.size != sizeof(struct pr_madvise_param))
> +		return -EINVAL;
> +
> +	behaviors = kmalloc_array(nr_elem, sizeof(int), GFP_KERNEL);
> +	if (!behaviors)
> +		return -ENOMEM;
> +
> +	ret = copy_from_user(behaviors, hints, sizeof(int) * nr_elem);
> +	if (ret < 0)
> +		goto free_behavior_vec;
> +
> +	ret = import_iovec(READ, result_vec, nr_elem, UIO_FASTIOV,
> +				&iov_l, &iter);
> +	if (ret < 0)
> +		goto free_behavior_vec;
> +
> +	if (!iov_iter_count(&iter)) {
> +		ret = -EINVAL;
> +		goto free_iovecs;
> +	}
> +
> +	ret = rw_copy_check_uvector(CHECK_IOVEC_ONLY, range_vec, nr_elem,
> +				UIO_FASTIOV, iovstack_r, &iov_r);
> +	if (ret <= 0)
> +		goto free_iovecs;
>  
>  	f = fdget(pidfd);
> -	if (!f.file)
> -		return -EBADF;
> +	if (!f.file) {
> +		ret = -EBADF;
> +		goto free_iovecs;
> +	}
>  
>  	pid = pidfd_to_pid(f.file);
>  	if (IS_ERR(pid)) {
>  		ret = PTR_ERR(pid);
> -		goto err;
> +		goto put_fd;
>  	}
>  
>  	ret = -EINVAL;
> @@ -1167,7 +1295,7 @@ SYSCALL_DEFINE4(process_madvise, int, pidfd, unsigned long, start,
>  	tsk = pid_task(pid, PIDTYPE_PID);
>  	if (!tsk) {
>  		rcu_read_unlock();
> -		goto err;
> +		goto put_fd;
>  	}
>  	get_task_struct(tsk);
>  	rcu_read_unlock();
> @@ -1176,12 +1304,22 @@ SYSCALL_DEFINE4(process_madvise, int, pidfd, unsigned long, start,
>  		ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
>  		if (ret == -EACCES)
>  			ret = -EPERM;
> -		goto err;
> +		goto put_task;
>  	}
> -	ret = madvise_core(tsk, start, len_in, behavior);
> +
> +	ret = process_madvise_core(tsk, behaviors, &iter, iov_r,
> +					nr_elem, flags);
>  	mmput(mm);
> +put_task:
>  	put_task_struct(tsk);
> -err:
> +put_fd:
>  	fdput(f);
> +free_iovecs:
> +	if (iov_r != iovstack_r)
> +		kfree(iov_r);
> +	kfree(iov_l);
> +free_behavior_vec:
> +	kfree(behaviors);
> +
>  	return ret;
>  }
> -- 
> 2.21.0.1020.gf2820cf01a-goog
> 

-- 
Michal Hocko
SUSE Labs


* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
       [not found] ` <20190520035254.57579-8-minchan@kernel.org>
@ 2019-05-20  9:28   ` Michal Hocko
  2019-05-21  2:55     ` Minchan Kim
  0 siblings, 1 reply; 68+ messages in thread
From: Michal Hocko @ 2019-05-20  9:28 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

[cc linux-api]

On Mon 20-05-19 12:52:54, Minchan Kim wrote:
> A system could have a much faster swap device, like zram. In that case,
> swapping is much cheaper than file IO on low-end storage.
> In this configuration, userspace could apply a different strategy to each
> kind of vma. IOW, they want to reclaim anonymous pages by MADV_COLD
> while keeping file-backed pages in the inactive LRU by MADV_COOL, because
> file IO is more expensive in this case, so they want to keep those pages
> in memory until memory pressure happens.
> 
> To make such a strategy easier to implement, this patch introduces
> MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER options for madvise(2), like
> the filters /proc/<pid>/clear_refs already supports.
> The filters can be ORed with other existing hints using the top two bits
> of (int behavior).

madvise operates on top of ranges and it is quite trivial to do the
filtering from userspace, so why do we need any additional filtering?

> Once either of them is set, the hint affects only the VMAs of interest,
> either anonymous or file-backed.
> 
> With that, the user can call the process_madvise syscall simply with the
> entire range (0x0 - 0xFFFFFFFFFFFFFFFF) and either MADV_ANONYMOUS_FILTER
> or MADV_FILE_FILTER, so there is no need to call the syscall range by range.

OK, so here is the reason you want that. The immediate question is why
the monitor cannot do the filtering from userspace. Slightly more
work, all right, but less of an API to expose, and that in itself is a
strong argument against.
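
For comparison, the userspace-side filtering could look roughly like this
(a sketch: it walks /proc/<pid>/maps and applies a hint only to anonymous
mappings via the single-range syscall from patch 5/7, whose x86_64 number
is assumed to be 428; error handling is minimal):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    #ifndef __NR_process_madvise
    #define __NR_process_madvise 428        /* assumed x86_64 number */
    #endif

    /* apply one hint to every anonymous mapping of pid */
    static int hint_anon_ranges(int pidfd, pid_t pid, int advice)
    {
            char path[64], line[512];
            FILE *maps;

            snprintf(path, sizeof(path), "/proc/%d/maps", pid);
            maps = fopen(path, "r");
            if (!maps)
                    return -1;

            while (fgets(line, sizeof(line), maps)) {
                    unsigned long start, end;
                    char file[256] = "";

                    /* line format: start-end perms offset dev inode [path] */
                    if (sscanf(line, "%lx-%lx %*s %*s %*s %*s %255s",
                               &start, &end, file) < 2)
                            continue;
                    if (file[0] != '\0')    /* file-backed: skip it */
                            continue;
                    syscall(__NR_process_madvise, pidfd, (void *)start,
                            end - start, advice);
            }
            fclose(maps);
            return 0;
    }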

> * from v1r2
>   * use consistent check with clear_refs to identify anon/file vma - surenb
> 
> * from v1r1
>   * use naming "filter" for new madvise option - dancol
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  include/uapi/asm-generic/mman-common.h |  5 +++++
>  mm/madvise.c                           | 14 ++++++++++++++
>  2 files changed, 19 insertions(+)
> 
> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index b8e230de84a6..be59a1b90284 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -66,6 +66,11 @@
>  #define MADV_WIPEONFORK 18		/* Zero memory on fork, child only */
>  #define MADV_KEEPONFORK 19		/* Undo MADV_WIPEONFORK */
>  
> +#define MADV_BEHAVIOR_MASK (~(MADV_ANONYMOUS_FILTER|MADV_FILE_FILTER))
> +
> +#define MADV_ANONYMOUS_FILTER	(1<<31)	/* works for only anonymous vma */
> +#define MADV_FILE_FILTER	(1<<30)	/* works for only file-backed vma */
> +
>  /* compatibility flags */
>  #define MAP_FILE	0
>  
> diff --git a/mm/madvise.c b/mm/madvise.c
> index f4f569dac2bd..116131243540 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1002,7 +1002,15 @@ static int madvise_core(struct task_struct *tsk, unsigned long start,
>  	int write;
>  	size_t len;
>  	struct blk_plug plug;
> +	bool anon_only, file_only;
>  
> +	anon_only = behavior & MADV_ANONYMOUS_FILTER;
> +	file_only = behavior & MADV_FILE_FILTER;
> +
> +	if (anon_only && file_only)
> +		return error;
> +
> +	behavior = behavior & MADV_BEHAVIOR_MASK;
>  	if (!madvise_behavior_valid(behavior))
>  		return error;
>  
> @@ -1067,12 +1075,18 @@ static int madvise_core(struct task_struct *tsk, unsigned long start,
>  		if (end < tmp)
>  			tmp = end;
>  
> +		if (anon_only && vma->vm_file)
> +			goto next;
> +		if (file_only && !vma->vm_file)
> +			goto next;
> +
>  		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
>  		error = madvise_vma(tsk, vma, &prev, start, tmp,
>  					behavior, &pages);
>  		if (error)
>  			goto out;
>  		*nr_pages += pages;
> +next:
>  		start = tmp;
>  		if (prev && start < prev->vm_end)
>  			start = prev->vm_end;
> -- 
> 2.21.0.1020.gf2820cf01a-goog
> 

-- 
Michal Hocko
SUSE Labs


* Re: [RFC 0/7] introduce memory hinting API for external process
       [not found] <20190520035254.57579-1-minchan@kernel.org>
                   ` (4 preceding siblings ...)
       [not found] ` <20190520035254.57579-8-minchan@kernel.org>
@ 2019-05-20  9:28 ` Michal Hocko
       [not found] ` <20190520164605.GA11665@cmpxchg.org>
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2019-05-20  9:28 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

[Cc linux-api]

On Mon 20-05-19 12:52:47, Minchan Kim wrote:
> - Background
> 
> The Android terminology used for forking a new process and starting an app
> from scratch is a cold start, while resuming an existing app is a hot start.
> While we continually try to improve the performance of cold starts, hot
> starts will always be significantly less power hungry as well as faster, so
> we are trying to make hot starts more likely than cold starts.
> 
> To increase hot starts, Android userspace manages the order in which apps
> should be killed in a process called ActivityManagerService. ActivityManagerService
> tracks every Android app or service that the user could be interacting with
> at any time and translates that into a ranked list for lmkd(low memory
> killer daemon). They are likely to be killed by lmkd if the system has to
> reclaim memory. In that sense they are similar to entries in any other cache.
> Those apps are kept alive for opportunistic performance improvements but
> those performance improvements will vary based on the memory requirements of
> individual workloads.
> 
> - Problem
> 
> Naturally, cached apps were dominant consumers of memory on the system.
> However, they were not significant consumers of swap even though they are
> good candidates for swap. Under investigation, swapping out only begins
> once the low zone watermark is hit and kswapd wakes up, but the overall
> allocation rate in the system might trip lmkd thresholds and cause a cached
> process to be killed (we measured the performance of swapping out vs.
> zapping the memory by killing a process; unsurprisingly, zapping is 10x
> faster even though we use zram, which is much faster than real storage),
> so a kill from lmkd will often satisfy the high zone watermark, resulting
> in very few pages actually being moved to swap.
> 
> - Approach
> 
> The approach we chose was to use a new interface to allow userspace to
> proactively reclaim entire processes by leveraging platform information.
> This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
> that are known to be cold from userspace and to avoid races with lmkd
> by reclaiming apps as soon as they entered the cached state. Additionally,
> it provides the platform many chances to use the information it has to
> optimize memory efficiency.
> 
> To achieve the goal, the patchset introduces two new options for madvise.
> One is MADV_COOL, which will deactivate activated pages, and the other is
> MADV_COLD, which will reclaim private pages instantly. These new options
> complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to
> gain some free memory space. MADV_COLD is similar to MADV_DONTNEED in
> that it hints the kernel that the memory region is not currently needed
> and should be reclaimed immediately; MADV_COOL is similar to MADV_FREE in
> that it hints the kernel that the memory region is not currently needed
> and should be reclaimed when memory pressure rises.
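
In short, the four hints described above line up as follows:

                             destructive      non-destructive (this series)
    reclaim immediately      MADV_DONTNEED    MADV_COLD
    reclaim under pressure   MADV_FREE        MADV_COOL
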
> 
> This approach is similar in spirit to madvise(MADV_DONTNEED), but the
> information required to make the reclaim decision is not known to the app.
> Instead, it is known to a centralized userspace daemon, and that daemon
> must be able to initiate reclaim on its own without any app involvement.
> To solve the concern, this patch introduces a new syscall -
> 
> 	struct pr_madvise_param {
> 		int size;
> 		const struct iovec *vec;
> 	}
> 
> 	int process_madvise(int pidfd, ssize_t nr_elem, int *behavior,
> 				struct pr_madvise_param *results,
> 				struct pr_madvise_param *ranges,
> 				unsigned long flags);
> 
> The syscall takes a pidfd to give hints to an external process and
> provides a pair of results/ranges vector arguments so that it can give
> several hints, each for its own address range, all at once.
> 
> I guess others have different ideas about the naming of the syscall and
> options, so feel free to suggest better names.
> 
> - Experiment
> 
> We did a bunch of testing with several hundred real users, not an
> artificial benchmark, on Android. We saw about a 17% decrease in cold
> starts without any significant battery or app-startup-latency issues.
> And with an artificial benchmark which launches and switches apps, we
> saw on average a 7% improvement in app launching, 18% fewer lmkd kills
> and good stats from vmstat.
> 
> A is vanilla and B is process_madvise.
> 
> 
>                                        A          B      delta   ratio(%)
>                allocstall_dma          0          0          0       0.00
>            allocstall_movable       1464        457      -1007     -69.00
>             allocstall_normal     263210     190763     -72447     -28.00
>              allocstall_total     264674     191220     -73454     -28.00
>           compact_daemon_wake      26912      25294      -1618      -7.00
>                  compact_fail      17885      14151      -3734     -21.00
>          compact_free_scanned 4204766409 3835994922 -368771487      -9.00
>              compact_isolated    3446484    2967618    -478866     -14.00
>       compact_migrate_scanned 1621336411 1324695710 -296640701     -19.00
>                 compact_stall      19387      15343      -4044     -21.00
>               compact_success       1502       1192       -310     -21.00
> kswapd_high_wmark_hit_quickly        234        184        -50     -22.00
>             kswapd_inodesteal     221635     233093      11458       5.00
>  kswapd_low_wmark_hit_quickly      66065      54009     -12056     -19.00
>                    nr_dirtied     259934     296476      36542      14.00
>   nr_vmscan_immediate_reclaim       2587       2356       -231      -9.00
>               nr_vmscan_write    1274232    2661733    1387501     108.00
>                    nr_written    1514060    2937560    1423500      94.00
>                    pageoutrun      67561      55133     -12428     -19.00
>                    pgactivate    2335060    1984882    -350178     -15.00
>                   pgalloc_dma   13743011   14096463     353452       2.00
>               pgalloc_movable          0          0          0       0.00
>                pgalloc_normal   18742440   16802065   -1940375     -11.00
>                 pgalloc_total   32485451   30898528   -1586923      -5.00
>                  pgdeactivate    4262210    2930670   -1331540     -32.00
>                       pgfault   30812334   31085065     272731       0.00
>                        pgfree   33553970   31765164   -1788806      -6.00
>                  pginodesteal      33411      15084     -18327     -55.00
>                   pglazyfreed          0          0          0       0.00
>                    pgmajfault     551312    1508299     956987     173.00
>                pgmigrate_fail      43927      29330     -14597     -34.00
>             pgmigrate_success    1399851    1203922    -195929     -14.00
>                        pgpgin   24141776   19032156   -5109620     -22.00
>                       pgpgout     959344    1103316     143972      15.00
>                  pgpgoutclean    4639732    3765868    -873864     -19.00
>                      pgrefill    4884560    3006938   -1877622     -39.00
>                     pgrotated      37828      25897     -11931     -32.00
>                 pgscan_direct    1456037     957567    -498470     -35.00
>        pgscan_direct_throttle          0          0          0       0.00
>                 pgscan_kswapd    6667767    5047360   -1620407     -25.00
>                  pgscan_total    8123804    6004927   -2118877     -27.00
>                    pgskip_dma          0          0          0       0.00
>                pgskip_movable          0          0          0       0.00
>                 pgskip_normal      14907      25382      10475      70.00
>                  pgskip_total      14907      25382      10475      70.00
>                pgsteal_direct    1118986     690215    -428771     -39.00
>                pgsteal_kswapd    4750223    3657107   -1093116     -24.00
>                 pgsteal_total    5869209    4347322   -1521887     -26.00
>                        pswpin     417613    1392647     975034     233.00
>                       pswpout    1274224    2661731    1387507     108.00
>                 slabs_scanned   13686905   10807200   -2879705     -22.00
>           workingset_activate     668966     569444     -99522     -15.00
>        workingset_nodereclaim      38957      32621      -6336     -17.00
>            workingset_refault    2816795    2179782    -637013     -23.00
>            workingset_restore     294320     168601    -125719     -43.00
> 
> pgmajfault is increased by 173% because swapin is increased by 200% by
> the process_madvise hint. However, a swap read backed by zram is much
> cheaper than file IO from a performance point of view, and an app hot
> start via swapin is also cheaper than a cold start from the beginning of
> the app, which needs a lot of IO from storage plus initialization steps.
> 
> This patchset is against on next-20190517.
> 
> Minchan Kim (7):
>   mm: introduce MADV_COOL
>   mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM
>   mm: introduce MADV_COLD
>   mm: factor out madvise's core functionality
>   mm: introduce external memory hinting API
>   mm: extend process_madvise syscall to support vector array
>   mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
> 
>  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>  include/linux/page-flags.h             |   1 +
>  include/linux/page_idle.h              |  15 +
>  include/linux/proc_fs.h                |   1 +
>  include/linux/swap.h                   |   2 +
>  include/linux/syscalls.h               |   2 +
>  include/uapi/asm-generic/mman-common.h |  12 +
>  include/uapi/asm-generic/unistd.h      |   2 +
>  kernel/signal.c                        |   2 +-
>  kernel/sys_ni.c                        |   1 +
>  mm/madvise.c                           | 600 +++++++++++++++++++++----
>  mm/swap.c                              |  43 ++
>  mm/vmscan.c                            |  80 +++-
>  14 files changed, 680 insertions(+), 83 deletions(-)
> 
> -- 
> 2.21.0.1020.gf2820cf01a-goog
> 

-- 
Michal Hocko
SUSE Labs


* Re: [RFC 1/7] mm: introduce MADV_COOL
  2019-05-20  8:19     ` Michal Hocko
@ 2019-05-20 15:08       ` Suren Baghdasaryan
  2019-05-20 22:55       ` Minchan Kim
  1 sibling, 0 replies; 68+ messages in thread
From: Suren Baghdasaryan @ 2019-05-20 15:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Daniel Colascione, Shakeel Butt,
	Sonny Rao, Brian Geffon, linux-api

On Mon, May 20, 2019 at 1:19 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Mon 20-05-19 10:16:21, Michal Hocko wrote:
> > [CC linux-api]
> >
> > On Mon 20-05-19 12:52:48, Minchan Kim wrote:
> > > When a process expects no accesses to a certain memory range
> > > it could hint the kernel that the pages can be reclaimed
> > > when memory pressure happens but data should be preserved
> > > for future use.  This could reduce workingset eviction so it
> > > ends up increasing performance.
> > >
> > > This patch introduces the new MADV_COOL hint to madvise(2)
> > > syscall. MADV_COOL can be used by a process to mark a memory range
> > > as not expected to be used in the near future. The hint can help
> > > the kernel in deciding which pages to evict early during memory
> > > pressure.
> >
> > I do not want to start a naming fight but MADV_COOL sounds a bit
> > misleading. Everybody thinks his pages are cool ;). Probably MADV_COLD
> > or MADV_DONTNEED_PRESERVE.
>
> OK, I can see that you have used MADV_COLD for a different mode.
> So this one is effectively a non-destructive MADV_FREE alternative,
> so MADV_FREE_PRESERVE would sound like a good fit. Your MADV_COLD
> in other patch would then be MADV_DONTNEED_PRESERVE. Right?
>

I agree that naming them this way would be more in line with the
existing API. Another good option IMO could be MADV_RECLAIM_NOW /
MADV_RECLAIM_LAZY, which might explain a bit better what they do, but
Michal's proposal is more consistent with the current API.

> --
> Michal Hocko
> SUSE Labs


* Re: [RFC 1/7] mm: introduce MADV_COOL
  2019-05-20  8:16   ` [RFC 1/7] mm: introduce MADV_COOL Michal Hocko
  2019-05-20  8:19     ` Michal Hocko
@ 2019-05-20 22:54     ` Minchan Kim
  2019-05-21  6:04       ` Michal Hocko
  1 sibling, 1 reply; 68+ messages in thread
From: Minchan Kim @ 2019-05-20 22:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Mon, May 20, 2019 at 10:16:21AM +0200, Michal Hocko wrote:
> [CC linux-api]

Thanks, Michal. I forgot to add it.

> 
> On Mon 20-05-19 12:52:48, Minchan Kim wrote:
> > When a process expects no accesses to a certain memory range
> > it could hint the kernel that the pages can be reclaimed
> > when memory pressure happens but data should be preserved
> > for future use.  This could reduce workingset eviction so it
> > ends up increasing performance.
> > 
> > This patch introduces the new MADV_COOL hint to madvise(2)
> > syscall. MADV_COOL can be used by a process to mark a memory range
> > as not expected to be used in the near future. The hint can help
> > kernel in deciding which pages to evict early during memory
> > pressure.
> 
> I do not want to start naming fight but MADV_COOL sounds a bit
> misleading. Everybody thinks his pages are cool ;). Probably MADV_COLD
> or MADV_DONTNEED_PRESERVE.

Thanks for the suggestion. Since I got several suggestions, let's
discuss them all at once in the cover letter.

> 
> > Internally, it works via deactivating memory from active list to
> > inactive's head so when the memory pressure happens, they will be
> > reclaimed earlier than other active pages unless there is no
> > access until the time.
> 
> Could you elaborate about the decision to move to the head rather than
> tail? What should happen to inactive pages? Should we move them to the
> tail? Your implementation seems to ignore those completely. Why?

Normally, the inactive LRU could have used-once pages without any
mapping to a user's address space. Such pages are better candidates
for reclaim when memory pressure happens. By deactivating only the
process's active LRU pages to the head of the inactive LRU, we keep
them in RAM longer than used-once pages, so they have a better chance
of being activated once the process is resumed.
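
As an illustration, here is a minimal userspace sketch of the intended
usage (the MADV_COOL value is taken from this series and is not in any
released uapi header, so treat the define as an assumption):

    #include <stdio.h>
    #include <sys/mman.h>

    #ifndef MADV_COOL
    #define MADV_COOL 5     /* value proposed in this series */
    #endif

    /* Hint that [buf, buf + len) is cold but should be preserved. */
    static void hint_cool(void *buf, size_t len)
    {
            /*
             * Pages move to the head of the inactive LRU: under
             * memory pressure they are reclaimed before active
             * pages, but they stay in RAM longer than used-once
             * cache and are re-activated on the next access.
             */
            if (madvise(buf, len, MADV_COOL) != 0)
                    perror("madvise(MADV_COOL)");
    }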

> 
> What should happen for shared pages? In other words do we want to allow
> less privileged process to control evicting of shared pages with a more
> privileged one? E.g. think of all sorts of side channel attacks. Maybe
> we want to do the same thing as for mincore where write access is
> required.

It doesn't work with shared pages (i.e., page_mapcount > 1). I will
add that to the description.

> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 1/7] mm: introduce MADV_COOL
  2019-05-20  8:19     ` Michal Hocko
  2019-05-20 15:08       ` Suren Baghdasaryan
@ 2019-05-20 22:55       ` Minchan Kim
  1 sibling, 0 replies; 68+ messages in thread
From: Minchan Kim @ 2019-05-20 22:55 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Mon, May 20, 2019 at 10:19:43AM +0200, Michal Hocko wrote:
> On Mon 20-05-19 10:16:21, Michal Hocko wrote:
> > [CC linux-api]
> > 
> > On Mon 20-05-19 12:52:48, Minchan Kim wrote:
> > > When a process expects no accesses to a certain memory range
> > > it could hint kernel that the pages can be reclaimed
> > > when memory pressure happens but data should be preserved
> > > for future use.  This could reduce workingset eviction so it
> > > ends up increasing performance.
> > > 
> > > This patch introduces the new MADV_COOL hint to madvise(2)
> > > syscall. MADV_COOL can be used by a process to mark a memory range
> > > as not expected to be used in the near future. The hint can help
> > > kernel in deciding which pages to evict early during memory
> > > pressure.
> > 
> > I do not want to start naming fight but MADV_COOL sounds a bit
> > misleading. Everybody thinks his pages are cool ;). Probably MADV_COLD
> > or MADV_DONTNEED_PRESERVE.
> 
> OK, I can see that you have used MADV_COLD for a different mode.
> So this one is effectively a non-destructive MADV_FREE alternative,
> so MADV_FREE_PRESERVE would sound like a good fit. Your MADV_COLD
> in the other patch would then be MADV_DONTNEED_PRESERVE. Right?

Correct.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 3/7] mm: introduce MADV_COLD
  2019-05-20  8:27   ` [RFC 3/7] mm: introduce MADV_COLD Michal Hocko
@ 2019-05-20 23:00     ` Minchan Kim
  2019-05-21  6:08       ` Michal Hocko
  0 siblings, 1 reply; 68+ messages in thread
From: Minchan Kim @ 2019-05-20 23:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Mon, May 20, 2019 at 10:27:03AM +0200, Michal Hocko wrote:
> [Cc linux-api]
> 
> On Mon 20-05-19 12:52:50, Minchan Kim wrote:
> > When a process expects no accesses to a certain memory range
> > for a long time, it could hint kernel that the pages can be
> > reclaimed instantly but data should be preserved for future use.
> > This could reduce workingset eviction so it ends up increasing
> > performance.
> > 
> > This patch introduces the new MADV_COLD hint to madvise(2)
> > syscall. MADV_COLD can be used by a process to mark a memory range
> > as not expected to be used for a long time. The hint can help
> > kernel in deciding which pages to evict proactively.
> 
> As mentioned in other email this looks like a non-destructive
> MADV_DONTNEED alternative.
> 
> > Internally, it works by reclaiming memory in the context of the
> > process that called the syscall. If a page is dirty but the backing
> > storage is not a synchronous device, the written page will be
> > rotated back to the LRU's tail once the write is done, so it will
> > be reclaimed easily when memory pressure happens. If the backing
> > storage is a synchronous device (e.g., zram), the page will be
> > reclaimed instantly.
> 
> Why do we special case async backing storage? Please always try to
> explain _why_ the decision is made.

I didn't make any decision. ;-) That's how current reclaim works, to
avoid the latency of freeing pages in interrupt context. I had a
patchset to resolve that concern a few years ago but got distracted.

> 
> I haven't checked the implementation yet so I cannot comment on that.
> 
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> >  include/linux/swap.h                   |   1 +
> >  include/uapi/asm-generic/mman-common.h |   1 +
> >  mm/madvise.c                           | 123 +++++++++++++++++++++++++
> >  mm/vmscan.c                            |  74 +++++++++++++++
> >  4 files changed, 199 insertions(+)
> > 
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 64795abea003..7f32a948fc6a 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -365,6 +365,7 @@ extern int vm_swappiness;
> >  extern int remove_mapping(struct address_space *mapping, struct page *page);
> >  extern unsigned long vm_total_pages;
> >  
> > +extern unsigned long reclaim_pages(struct list_head *page_list);
> >  #ifdef CONFIG_NUMA
> >  extern int node_reclaim_mode;
> >  extern int sysctl_min_unmapped_ratio;
> > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> > index f7a4a5d4b642..b9b51eeb8e1a 100644
> > --- a/include/uapi/asm-generic/mman-common.h
> > +++ b/include/uapi/asm-generic/mman-common.h
> > @@ -43,6 +43,7 @@
> >  #define MADV_WILLNEED	3		/* will need these pages */
> >  #define MADV_DONTNEED	4		/* don't need these pages */
> >  #define MADV_COOL	5		/* deactivate these pages */
> > +#define MADV_COLD	6		/* reclaim these pages */
> >  
> >  /* common parameters: try to keep these consistent across architectures */
> >  #define MADV_FREE	8		/* free pages only if memory pressure */
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index c05817fb570d..9a6698b56845 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -42,6 +42,7 @@ static int madvise_need_mmap_write(int behavior)
> >  	case MADV_WILLNEED:
> >  	case MADV_DONTNEED:
> >  	case MADV_COOL:
> > +	case MADV_COLD:
> >  	case MADV_FREE:
> >  		return 0;
> >  	default:
> > @@ -416,6 +417,125 @@ static long madvise_cool(struct vm_area_struct *vma,
> >  	return 0;
> >  }
> >  
> > +static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
> > +				unsigned long end, struct mm_walk *walk)
> > +{
> > +	pte_t *orig_pte, *pte, ptent;
> > +	spinlock_t *ptl;
> > +	LIST_HEAD(page_list);
> > +	struct page *page;
> > +	int isolated = 0;
> > +	struct vm_area_struct *vma = walk->vma;
> > +	unsigned long next;
> > +
> > +	next = pmd_addr_end(addr, end);
> > +	if (pmd_trans_huge(*pmd)) {
> > +		spinlock_t *ptl;
> > +
> > +		ptl = pmd_trans_huge_lock(pmd, vma);
> > +		if (!ptl)
> > +			return 0;
> > +
> > +		if (is_huge_zero_pmd(*pmd))
> > +			goto huge_unlock;
> > +
> > +		page = pmd_page(*pmd);
> > +		if (page_mapcount(page) > 1)
> > +			goto huge_unlock;
> > +
> > +		if (next - addr != HPAGE_PMD_SIZE) {
> > +			int err;
> > +
> > +			get_page(page);
> > +			spin_unlock(ptl);
> > +			lock_page(page);
> > +			err = split_huge_page(page);
> > +			unlock_page(page);
> > +			put_page(page);
> > +			if (!err)
> > +				goto regular_page;
> > +			return 0;
> > +		}
> > +
> > +		if (isolate_lru_page(page))
> > +			goto huge_unlock;
> > +
> > +		list_add(&page->lru, &page_list);
> > +huge_unlock:
> > +		spin_unlock(ptl);
> > +		reclaim_pages(&page_list);
> > +		return 0;
> > +	}
> > +
> > +	if (pmd_trans_unstable(pmd))
> > +		return 0;
> > +regular_page:
> > +	orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > +	for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {
> > +		ptent = *pte;
> > +		if (!pte_present(ptent))
> > +			continue;
> > +
> > +		page = vm_normal_page(vma, addr, ptent);
> > +		if (!page)
> > +			continue;
> > +
> > +		if (page_mapcount(page) > 1)
> > +			continue;
> > +
> > +		if (isolate_lru_page(page))
> > +			continue;
> > +
> > +		isolated++;
> > +		list_add(&page->lru, &page_list);
> > +		if (isolated >= SWAP_CLUSTER_MAX) {
> > +			pte_unmap_unlock(orig_pte, ptl);
> > +			reclaim_pages(&page_list);
> > +			isolated = 0;
> > +			pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > +			orig_pte = pte;
> > +		}
> > +	}
> > +
> > +	pte_unmap_unlock(orig_pte, ptl);
> > +	reclaim_pages(&page_list);
> > +	cond_resched();
> > +
> > +	return 0;
> > +}
> > +
> > +static void madvise_cold_page_range(struct mmu_gather *tlb,
> > +			     struct vm_area_struct *vma,
> > +			     unsigned long addr, unsigned long end)
> > +{
> > +	struct mm_walk warm_walk = {
> > +		.pmd_entry = madvise_cold_pte_range,
> > +		.mm = vma->vm_mm,
> > +	};
> > +
> > +	tlb_start_vma(tlb, vma);
> > +	walk_page_range(addr, end, &warm_walk);
> > +	tlb_end_vma(tlb, vma);
> > +}
> > +
> > +
> > +static long madvise_cold(struct vm_area_struct *vma,
> > +			unsigned long start_addr, unsigned long end_addr)
> > +{
> > +	struct mm_struct *mm = vma->vm_mm;
> > +	struct mmu_gather tlb;
> > +
> > +	if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
> > +		return -EINVAL;
> > +
> > +	lru_add_drain();
> > +	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
> > +	madvise_cold_page_range(&tlb, vma, start_addr, end_addr);
> > +	tlb_finish_mmu(&tlb, start_addr, end_addr);
> > +
> > +	return 0;
> > +}
> > +
> >  static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> >  				unsigned long end, struct mm_walk *walk)
> >  
> > @@ -806,6 +926,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
> >  		return madvise_willneed(vma, prev, start, end);
> >  	case MADV_COOL:
> >  		return madvise_cool(vma, start, end);
> > +	case MADV_COLD:
> > +		return madvise_cold(vma, start, end);
> >  	case MADV_FREE:
> >  	case MADV_DONTNEED:
> >  		return madvise_dontneed_free(vma, prev, start, end, behavior);
> > @@ -828,6 +950,7 @@ madvise_behavior_valid(int behavior)
> >  	case MADV_DONTNEED:
> >  	case MADV_FREE:
> >  	case MADV_COOL:
> > +	case MADV_COLD:
> >  #ifdef CONFIG_KSM
> >  	case MADV_MERGEABLE:
> >  	case MADV_UNMERGEABLE:
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index a28e5d17b495..1701b31f70a8 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2096,6 +2096,80 @@ static void shrink_active_list(unsigned long nr_to_scan,
> >  			nr_deactivate, nr_rotated, sc->priority, file);
> >  }
> >  
> > +unsigned long reclaim_pages(struct list_head *page_list)
> > +{
> > +	int nid = -1;
> > +	unsigned long nr_isolated[2] = {0, };
> > +	unsigned long nr_reclaimed = 0;
> > +	LIST_HEAD(node_page_list);
> > +	struct reclaim_stat dummy_stat;
> > +	struct scan_control sc = {
> > +		.gfp_mask = GFP_KERNEL,
> > +		.priority = DEF_PRIORITY,
> > +		.may_writepage = 1,
> > +		.may_unmap = 1,
> > +		.may_swap = 1,
> > +	};
> > +
> > +	while (!list_empty(page_list)) {
> > +		struct page *page;
> > +
> > +		page = lru_to_page(page_list);
> > +		list_del(&page->lru);
> > +
> > +		if (nid == -1) {
> > +			nid = page_to_nid(page);
> > +			INIT_LIST_HEAD(&node_page_list);
> > +			nr_isolated[0] = nr_isolated[1] = 0;
> > +		}
> > +
> > +		if (nid == page_to_nid(page)) {
> > +			list_add(&page->lru, &node_page_list);
> > +			nr_isolated[!!page_is_file_cache(page)] +=
> > +						hpage_nr_pages(page);
> > +			continue;
> > +		}
> > +
> > +		nid = page_to_nid(page);
> > +
> > +		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
> > +					nr_isolated[0]);
> > +		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
> > +					nr_isolated[1]);
> > +		nr_reclaimed += shrink_page_list(&node_page_list,
> > +				NODE_DATA(nid), &sc, TTU_IGNORE_ACCESS,
> > +				&dummy_stat, true);
> > +		while (!list_empty(&node_page_list)) {
> > +			struct page *page = lru_to_page(&node_page_list);
> > +
> > +			list_del(&page->lru);
> > +			putback_lru_page(page);
> > +		}
> > +		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
> > +					-nr_isolated[0]);
> > +		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
> > +					-nr_isolated[1]);
> > +		nr_isolated[0] = nr_isolated[1] = 0;
> > +		INIT_LIST_HEAD(&node_page_list);
> > +	}
> > +
> > +	if (!list_empty(&node_page_list)) {
> > +		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
> > +					nr_isolated[0]);
> > +		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
> > +					nr_isolated[1]);
> > +		nr_reclaimed += shrink_page_list(&node_page_list,
> > +				NODE_DATA(nid), &sc, TTU_IGNORE_ACCESS,
> > +				&dummy_stat, true);
> > +		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
> > +					-nr_isolated[0]);
> > +		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
> > +					-nr_isolated[1]);
> > +	}
> > +
> > +	return nr_reclaimed;
> > +}
> > +
> >  /*
> >   * The inactive anon list should be small enough that the VM never has
> >   * to do too much work.
> > -- 
> > 2.21.0.1020.gf2820cf01a-goog
> > 
> 
> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 5/7] mm: introduce external memory hinting API
  2019-05-20  9:18   ` [RFC 5/7] mm: introduce external memory hinting API Michal Hocko
@ 2019-05-21  2:41     ` Minchan Kim
  2019-05-21  6:17       ` Michal Hocko
  0 siblings, 1 reply; 68+ messages in thread
From: Minchan Kim @ 2019-05-21  2:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Mon, May 20, 2019 at 11:18:29AM +0200, Michal Hocko wrote:
> [Cc linux-api]
> 
> On Mon 20-05-19 12:52:52, Minchan Kim wrote:
> > There are use cases where a centralized userspace daemon wants to
> > give a memory hint like MADV_[COOL|COLD] to another process.
> > Android's ActivityManagerService is one of them.
> > 
> > It's similar in spirit to madvise(MADV_DONTNEED), but the
> > information required to make the reclaim decision is not known to
> > the app. Instead, it is known to the centralized userspace daemon
> > (ActivityManagerService), and that daemon must be able to initiate
> > reclaim on its own without any app involvement.
> 
> Could you expand some more about how this all works? How does the
> centralized daemon track respective ranges? How does it synchronize
> against parallel modification of the address space etc.

Currently, we don't track individual address ranges because we have
two policies at the moment:

	deactivate file pages and reclaim anonymous pages of the app.

Since the daemon has the ability to let background apps resume (IOW,
the process will be run by the daemon) and both hints are
non-disruptive from a stability point of view, we are okay with the
race.
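
To make the flow concrete, a rough sketch of the daemon side with the
single-range syscall from this patch (the syscall number and hint
value are taken from this series; the target range and the error
handling are simplified assumptions, and the pidfd is a /proc/<pid>
directory fd as in the changelog example):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>

    #define __NR_process_madvise	428	/* from this series */
    #define MADV_COLD		6	/* from this series */

    /* The daemon has frozen background app 'pid'; reclaim its pages. */
    static long shrink_background_app(pid_t pid, void *start, size_t len)
    {
            char path[64];
            long ret;
            int pidfd;

            snprintf(path, sizeof(path), "/proc/%d", pid);
            pidfd = open(path, O_DIRECTORY | O_CLOEXEC);
            if (pidfd < 0)
                    return -1;

            /* non-disruptive hint: data is preserved, so racing with
             * the app's own address space changes is tolerable */
            ret = syscall(__NR_process_madvise, pidfd, start, len,
                          MADV_COLD);
            close(pidfd);
            return ret;
    }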

> 
> > To solve the issue, this patch introduces a new syscall,
> > process_madvise(2), which works based on a pidfd so it can give a
> > hint to an external process.
> > 
> > int process_madvise(int pidfd, void *addr, size_t length, int advise);
> 
> OK, this makes some sense from the API point of view. When we
> discussed that at LSFMM I was contemplating something like that,
> except the fd would be a VMA fd rather than the process. We could
> extend and reuse the /proc/<pid>/map_files interface, which doesn't
> support anonymous memory right now.
> 
> I am not saying this would be a better interface but I wanted to
> mention it here for further discussion. One slight advantage would be
> that you know the exact object that you are operating on because you
> have a fd for the VMA, and we would have a more straightforward way
> to reject the operation if the underlying object has changed (e.g.
> unmapped and reused for a different mapping).

I agree with your point. Unless I missed something, such VMA-level
modification notifications don't work even for file-mapped VMAs at
the moment. For anonymous VMAs, I think we could potentially use
userfaultfd. It would be great for someone who wants to work with
disruptive hints like MADV_DONTNEED.

I'd like to see that as a further enhancement after landing the
address range based operation, limiting the hints process_madvise
supports to non-disruptive ones only (e.g., MADV_[COOL|COLD]), so we
can catch up with the use cases/workloads when someone wants to
extend the API.

> 
> > All the advice values madvise provides can be supported in
> > process_madvise, too. Since it can affect another process's address
> > range, only a privileged process (CAP_SYS_PTRACE) or something else
> > (e.g., being the same UID) that gives it the right to ptrace the
> > process can use it successfully.
> 
> The proc_mem_open model we use for accessing the address space via
> proc sounds like a good model. You are doing something similar.
> 
> > Please suggest a better idea if you have other ideas about the
> > permission model.
> > 
> > * from v1r1
> >   * use ptrace capability - surenb, dancol
> > 
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> >  arch/x86/entry/syscalls/syscall_32.tbl |  1 +
> >  arch/x86/entry/syscalls/syscall_64.tbl |  1 +
> >  include/linux/proc_fs.h                |  1 +
> >  include/linux/syscalls.h               |  2 ++
> >  include/uapi/asm-generic/unistd.h      |  2 ++
> >  kernel/signal.c                        |  2 +-
> >  kernel/sys_ni.c                        |  1 +
> >  mm/madvise.c                           | 45 ++++++++++++++++++++++++++
> >  8 files changed, 54 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> > index 4cd5f982b1e5..5b9dd55d6b57 100644
> > --- a/arch/x86/entry/syscalls/syscall_32.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> > @@ -438,3 +438,4 @@
> >  425	i386	io_uring_setup		sys_io_uring_setup		__ia32_sys_io_uring_setup
> >  426	i386	io_uring_enter		sys_io_uring_enter		__ia32_sys_io_uring_enter
> >  427	i386	io_uring_register	sys_io_uring_register		__ia32_sys_io_uring_register
> > +428	i386	process_madvise		sys_process_madvise		__ia32_sys_process_madvise
> > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> > index 64ca0d06259a..0e5ee78161c9 100644
> > --- a/arch/x86/entry/syscalls/syscall_64.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> > @@ -355,6 +355,7 @@
> >  425	common	io_uring_setup		__x64_sys_io_uring_setup
> >  426	common	io_uring_enter		__x64_sys_io_uring_enter
> >  427	common	io_uring_register	__x64_sys_io_uring_register
> > +428	common	process_madvise		__x64_sys_process_madvise
> >  
> >  #
> >  # x32-specific system call numbers start at 512 to avoid cache impact
> > diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
> > index 52a283ba0465..f8545d7c5218 100644
> > --- a/include/linux/proc_fs.h
> > +++ b/include/linux/proc_fs.h
> > @@ -122,6 +122,7 @@ static inline struct pid *tgid_pidfd_to_pid(const struct file *file)
> >  
> >  #endif /* CONFIG_PROC_FS */
> >  
> > +extern struct pid *pidfd_to_pid(const struct file *file);
> >  struct net;
> >  
> >  static inline struct proc_dir_entry *proc_net_mkdir(
> > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> > index e2870fe1be5b..21c6c9a62006 100644
> > --- a/include/linux/syscalls.h
> > +++ b/include/linux/syscalls.h
> > @@ -872,6 +872,8 @@ asmlinkage long sys_munlockall(void);
> >  asmlinkage long sys_mincore(unsigned long start, size_t len,
> >  				unsigned char __user * vec);
> >  asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
> > +asmlinkage long sys_process_madvise(int pid_fd, unsigned long start,
> > +				size_t len, int behavior);
> >  asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
> >  			unsigned long prot, unsigned long pgoff,
> >  			unsigned long flags);
> > diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> > index dee7292e1df6..7ee82ce04620 100644
> > --- a/include/uapi/asm-generic/unistd.h
> > +++ b/include/uapi/asm-generic/unistd.h
> > @@ -832,6 +832,8 @@ __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
> >  __SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
> >  #define __NR_io_uring_register 427
> >  __SYSCALL(__NR_io_uring_register, sys_io_uring_register)
> > +#define __NR_process_madvise 428
> > +__SYSCALL(__NR_process_madvise, sys_process_madvise)
> >  
> >  #undef __NR_syscalls
> >  #define __NR_syscalls 428
> > diff --git a/kernel/signal.c b/kernel/signal.c
> > index 1c86b78a7597..04e75daab1f8 100644
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> > @@ -3620,7 +3620,7 @@ static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo, siginfo_t *info)
> >  	return copy_siginfo_from_user(kinfo, info);
> >  }
> >  
> > -static struct pid *pidfd_to_pid(const struct file *file)
> > +struct pid *pidfd_to_pid(const struct file *file)
> >  {
> >  	if (file->f_op == &pidfd_fops)
> >  		return file->private_data;
> > diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> > index 4d9ae5ea6caf..5277421795ab 100644
> > --- a/kernel/sys_ni.c
> > +++ b/kernel/sys_ni.c
> > @@ -278,6 +278,7 @@ COND_SYSCALL(mlockall);
> >  COND_SYSCALL(munlockall);
> >  COND_SYSCALL(mincore);
> >  COND_SYSCALL(madvise);
> > +COND_SYSCALL(process_madvise);
> >  COND_SYSCALL(remap_file_pages);
> >  COND_SYSCALL(mbind);
> >  COND_SYSCALL_COMPAT(mbind);
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 119e82e1f065..af02aa17e5c1 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -9,6 +9,7 @@
> >  #include <linux/mman.h>
> >  #include <linux/pagemap.h>
> >  #include <linux/page_idle.h>
> > +#include <linux/proc_fs.h>
> >  #include <linux/syscalls.h>
> >  #include <linux/mempolicy.h>
> >  #include <linux/page-isolation.h>
> > @@ -16,6 +17,7 @@
> >  #include <linux/hugetlb.h>
> >  #include <linux/falloc.h>
> >  #include <linux/sched.h>
> > +#include <linux/sched/mm.h>
> >  #include <linux/ksm.h>
> >  #include <linux/fs.h>
> >  #include <linux/file.h>
> > @@ -1140,3 +1142,46 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> >  {
> >  	return madvise_core(current, start, len_in, behavior);
> >  }
> > +
> > +SYSCALL_DEFINE4(process_madvise, int, pidfd, unsigned long, start,
> > +		size_t, len_in, int, behavior)
> > +{
> > +	int ret;
> > +	struct fd f;
> > +	struct pid *pid;
> > +	struct task_struct *tsk;
> > +	struct mm_struct *mm;
> > +
> > +	f = fdget(pidfd);
> > +	if (!f.file)
> > +		return -EBADF;
> > +
> > +	pid = pidfd_to_pid(f.file);
> > +	if (IS_ERR(pid)) {
> > +		ret = PTR_ERR(pid);
> > +		goto err;
> > +	}
> > +
> > +	ret = -EINVAL;
> > +	rcu_read_lock();
> > +	tsk = pid_task(pid, PIDTYPE_PID);
> > +	if (!tsk) {
> > +		rcu_read_unlock();
> > +		goto err;
> > +	}
> > +	get_task_struct(tsk);
> > +	rcu_read_unlock();
> > +	mm = mm_access(tsk, PTRACE_MODE_ATTACH_REALCREDS);
> > +	if (!mm || IS_ERR(mm)) {
> > +		ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
> > +		if (ret == -EACCES)
> > +			ret = -EPERM;
> > +		goto err;
> > +	}
> > +	ret = madvise_core(tsk, start, len_in, behavior);
> > +	mmput(mm);
> > +	put_task_struct(tsk);
> > +err:
> > +	fdput(f);
> > +	return ret;
> > +}
> > -- 
> > 2.21.0.1020.gf2820cf01a-goog
> > 
> 
> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector arrary
  2019-05-20  9:22   ` [RFC 6/7] mm: extend process_madvise syscall to support vector arrary Michal Hocko
@ 2019-05-21  2:48     ` Minchan Kim
  2019-05-21  6:24       ` Michal Hocko
  0 siblings, 1 reply; 68+ messages in thread
From: Minchan Kim @ 2019-05-21  2:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Mon, May 20, 2019 at 11:22:58AM +0200, Michal Hocko wrote:
> [Cc linux-api]
> 
> On Mon 20-05-19 12:52:53, Minchan Kim wrote:
> > > Currently, the process_madvise syscall works on only one address
> > > range, so the user has to call the syscall several times to give
> > > hints for multiple address ranges.
> 
> Is that a problem? How big of a problem? Any numbers?

We easily have 2000+ VMAs, so it's not a trivial overhead. I will come
up with numbers in the description at respin.

> 
> > > This patch extends the process_madvise syscall to support multiple
> > > hints, address ranges and return values so the user can give hints
> > > all at once.
> > 
> > struct pr_madvise_param {
> >     int size;                       /* the size of this structure */
> >     const struct iovec __user *vec; /* address range array */
> > }
> > 
> > int process_madvise(int pidfd, ssize_t nr_elem,
> > 		    int *behavior,
> > 		    struct pr_madvise_param *results,
> > 		    struct pr_madvise_param *ranges,
> > 		    unsigned long flags);
> > 
> > - pidfd
> > 
> > target process fd
> > 
> > - nr_elem
> > 
> > the number of elements in the behavior, results, and ranges arrays
> > 
> > - behavior
> > 
> > hints for each address range in the remote process, so the user can
> > give a different hint for each range.
> 
> What is the guarantee of a single call? Do all hints get applied, or
> does the first failure abort the rest? What are the atomicity
> guarantees?

All hints will be tried even if one of them fails. The user will see
the success or failure of each range in the results parameter.
For a single call, there is no guarantee of atomicity.
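
So, from userspace, checking the per-range outcome looks roughly like
this (a sketch continuing the example from the changelog, where ret[]
is the array the result iovecs point at; the negative-errno versus
pages-processed convention is how process_madvise_core fills the
result slots in this series):

    int i;
    long ret[2];	/* filled by the kernel through result_vec */

    /*
     * No atomicity across ranges: the call keeps going after a
     * failed hint, so each slot must be inspected individually.
     */
    for (i = 0; i < 2; i++) {
            if (ret[i] < 0)
                    fprintf(stderr, "range %d: hint failed: %ld\n",
                            i, ret[i]);
            else
                    printf("range %d: %ld pages processed\n",
                           i, ret[i]);
    }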

> 
> > 
> > - results
> > 
> > array of buffers to receive the result of each associated remote
> > address range operation.
> > 
> > - ranges
> > 
> > array of buffers holding the remote process's address ranges to be
> > processed
> > 
> > - flags
> > 
> > extra argument for the future. It must be zero for now.
> > 
> > Example)
> > 
> > struct pr_madvise_param {
> >         int size;
> >         const struct iovec *vec;
> > };
> > 
> > int main(int argc, char *argv[])
> > {
> >         struct pr_madvise_param retp, rangep;
> >         struct iovec result_vec[2], range_vec[2];
> >         int hints[2];
> >         long ret[2];
> >         void *addr[2];
> > 
> >         pid_t pid;
> >         char cmd[64] = {0,};
> >         addr[0] = mmap(NULL, ALLOC_SIZE, PROT_READ|PROT_WRITE,
> >                           MAP_POPULATE|MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
> > 
> >         if (MAP_FAILED == addr[0])
> >                 return 1;
> > 
> >         addr[1] = mmap(NULL, ALLOC_SIZE, PROT_READ|PROT_WRITE,
> >                           MAP_POPULATE|MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
> > 
> >         if (MAP_FAILED == addr[1])
> >                 return 1;
> > 
> >         hints[0] = MADV_COLD;
> > 	range_vec[0].iov_base = addr[0];
> >         range_vec[0].iov_len = ALLOC_SIZE;
> >         result_vec[0].iov_base = &ret[0];
> >         result_vec[0].iov_len = sizeof(long);
> > 	retp.vec = result_vec;
> >         retp.size = sizeof(struct pr_madvise_param);
> > 
> >         hints[1] = MADV_COOL;
> >         range_vec[1].iov_base = addr[1];
> >         range_vec[1].iov_len = ALLOC_SIZE;
> >         result_vec[1].iov_base = &ret[1];
> >         result_vec[1].iov_len = sizeof(long);
> >         rangep.vec = range_vec;
> >         rangep.size = sizeof(struct pr_madvise_param);
> > 
> >         pid = fork();
> >         if (!pid) {
> >                 sleep(10);
> >         } else {
> >                 snprintf(cmd, sizeof(cmd), "/proc/%d", pid);
> >                 int pidfd = open(cmd, O_DIRECTORY | O_CLOEXEC);
> >                 if (pidfd < 0)
> >                         return 1;
> > 
> >                 /* munmap to make pages private for the child */
> >                 munmap(addr[0], ALLOC_SIZE);
> >                 munmap(addr[1], ALLOC_SIZE);
> >                 system("cat /proc/vmstat | egrep 'pswpout|deactivate'");
> >                 if (syscall(__NR_process_madvise, pidfd, 2, hints,
> > 						&retp, &rangep, 0))
> >                         perror("process_madvise fail\n");
> >                 system("cat /proc/vmstat | egrep 'pswpout|deactivate'");
> >         }
> > 
> >         return 0;
> > }
> > 
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> >  include/uapi/asm-generic/mman-common.h |   5 +
> >  mm/madvise.c                           | 184 +++++++++++++++++++++----
> >  2 files changed, 166 insertions(+), 23 deletions(-)
> > 
> > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> > index b9b51eeb8e1a..b8e230de84a6 100644
> > --- a/include/uapi/asm-generic/mman-common.h
> > +++ b/include/uapi/asm-generic/mman-common.h
> > @@ -74,4 +74,9 @@
> >  #define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
> >  				 PKEY_DISABLE_WRITE)
> >  
> > +struct pr_madvise_param {
> > +	int size;			/* the size of this structure */
> > +	const struct iovec __user *vec;	/* address range array */
> > +};
> > +
> >  #endif /* __ASM_GENERIC_MMAN_COMMON_H */
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index af02aa17e5c1..f4f569dac2bd 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -320,6 +320,7 @@ static int madvise_cool_pte_range(pmd_t *pmd, unsigned long addr,
> >  	struct page *page;
> >  	struct vm_area_struct *vma = walk->vma;
> >  	unsigned long next;
> > +	long nr_pages = 0;
> >  
> >  	next = pmd_addr_end(addr, end);
> >  	if (pmd_trans_huge(*pmd)) {
> > @@ -380,9 +381,12 @@ static int madvise_cool_pte_range(pmd_t *pmd, unsigned long addr,
> >  
> >  		ptep_test_and_clear_young(vma, addr, pte);
> >  		deactivate_page(page);
> > +		nr_pages++;
> > +
> >  	}
> >  
> >  	pte_unmap_unlock(orig_pte, ptl);
> > +	*(long *)walk->private += nr_pages;
> >  	cond_resched();
> >  
> >  	return 0;
> > @@ -390,11 +394,13 @@ static int madvise_cool_pte_range(pmd_t *pmd, unsigned long addr,
> >  
> >  static void madvise_cool_page_range(struct mmu_gather *tlb,
> >  			     struct vm_area_struct *vma,
> > -			     unsigned long addr, unsigned long end)
> > +			     unsigned long addr, unsigned long end,
> > +			     long *nr_pages)
> >  {
> >  	struct mm_walk cool_walk = {
> >  		.pmd_entry = madvise_cool_pte_range,
> >  		.mm = vma->vm_mm,
> > +		.private = nr_pages
> >  	};
> >  
> >  	tlb_start_vma(tlb, vma);
> > @@ -403,7 +409,8 @@ static void madvise_cool_page_range(struct mmu_gather *tlb,
> >  }
> >  
> >  static long madvise_cool(struct vm_area_struct *vma,
> > -			unsigned long start_addr, unsigned long end_addr)
> > +			unsigned long start_addr, unsigned long end_addr,
> > +			long *nr_pages)
> >  {
> >  	struct mm_struct *mm = vma->vm_mm;
> >  	struct mmu_gather tlb;
> > @@ -413,7 +420,7 @@ static long madvise_cool(struct vm_area_struct *vma,
> >  
> >  	lru_add_drain();
> >  	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
> > -	madvise_cool_page_range(&tlb, vma, start_addr, end_addr);
> > +	madvise_cool_page_range(&tlb, vma, start_addr, end_addr, nr_pages);
> >  	tlb_finish_mmu(&tlb, start_addr, end_addr);
> >  
> >  	return 0;
> > @@ -429,6 +436,7 @@ static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
> >  	int isolated = 0;
> >  	struct vm_area_struct *vma = walk->vma;
> >  	unsigned long next;
> > +	long nr_pages = 0;
> >  
> >  	next = pmd_addr_end(addr, end);
> >  	if (pmd_trans_huge(*pmd)) {
> > @@ -492,7 +500,7 @@ static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
> >  		list_add(&page->lru, &page_list);
> >  		if (isolated >= SWAP_CLUSTER_MAX) {
> >  			pte_unmap_unlock(orig_pte, ptl);
> > -			reclaim_pages(&page_list);
> > +			nr_pages += reclaim_pages(&page_list);
> >  			isolated = 0;
> >  			pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> >  			orig_pte = pte;
> > @@ -500,19 +508,22 @@ static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
> >  	}
> >  
> >  	pte_unmap_unlock(orig_pte, ptl);
> > -	reclaim_pages(&page_list);
> > +	nr_pages += reclaim_pages(&page_list);
> >  	cond_resched();
> >  
> > +	*(long *)walk->private += nr_pages;
> >  	return 0;
> >  }
> >  
> >  static void madvise_cold_page_range(struct mmu_gather *tlb,
> >  			     struct vm_area_struct *vma,
> > -			     unsigned long addr, unsigned long end)
> > +			     unsigned long addr, unsigned long end,
> > +			     long *nr_pages)
> >  {
> >  	struct mm_walk warm_walk = {
> >  		.pmd_entry = madvise_cold_pte_range,
> >  		.mm = vma->vm_mm,
> > +		.private = nr_pages,
> >  	};
> >  
> >  	tlb_start_vma(tlb, vma);
> > @@ -522,7 +533,8 @@ static void madvise_cold_page_range(struct mmu_gather *tlb,
> >  
> >  
> >  static long madvise_cold(struct vm_area_struct *vma,
> > -			unsigned long start_addr, unsigned long end_addr)
> > +			unsigned long start_addr, unsigned long end_addr,
> > +			long *nr_pages)
> >  {
> >  	struct mm_struct *mm = vma->vm_mm;
> >  	struct mmu_gather tlb;
> > @@ -532,7 +544,7 @@ static long madvise_cold(struct vm_area_struct *vma,
> >  
> >  	lru_add_drain();
> >  	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
> > -	madvise_cold_page_range(&tlb, vma, start_addr, end_addr);
> > +	madvise_cold_page_range(&tlb, vma, start_addr, end_addr, nr_pages);
> >  	tlb_finish_mmu(&tlb, start_addr, end_addr);
> >  
> >  	return 0;
> > @@ -922,7 +934,7 @@ static int madvise_inject_error(int behavior,
> >  static long
> >  madvise_vma(struct task_struct *tsk, struct vm_area_struct *vma,
> >  		struct vm_area_struct **prev, unsigned long start,
> > -		unsigned long end, int behavior)
> > +		unsigned long end, int behavior, long *nr_pages)
> >  {
> >  	switch (behavior) {
> >  	case MADV_REMOVE:
> > @@ -930,9 +942,9 @@ madvise_vma(struct task_struct *tsk, struct vm_area_struct *vma,
> >  	case MADV_WILLNEED:
> >  		return madvise_willneed(vma, prev, start, end);
> >  	case MADV_COOL:
> > -		return madvise_cool(vma, start, end);
> > +		return madvise_cool(vma, start, end, nr_pages);
> >  	case MADV_COLD:
> > -		return madvise_cold(vma, start, end);
> > +		return madvise_cold(vma, start, end, nr_pages);
> >  	case MADV_FREE:
> >  	case MADV_DONTNEED:
> >  		return madvise_dontneed_free(tsk, vma, prev, start,
> > @@ -981,7 +993,7 @@ madvise_behavior_valid(int behavior)
> >  }
> >  
> >  static int madvise_core(struct task_struct *tsk, unsigned long start,
> > -			size_t len_in, int behavior)
> > +			size_t len_in, int behavior, long *nr_pages)
> >  {
> >  	unsigned long end, tmp;
> >  	struct vm_area_struct *vma, *prev;
> > @@ -996,6 +1008,7 @@ static int madvise_core(struct task_struct *tsk, unsigned long start,
> >  
> >  	if (start & ~PAGE_MASK)
> >  		return error;
> > +
> >  	len = (len_in + ~PAGE_MASK) & PAGE_MASK;
> >  
> >  	/* Check to see whether len was rounded up from small -ve to zero */
> > @@ -1035,6 +1048,8 @@ static int madvise_core(struct task_struct *tsk, unsigned long start,
> >  	blk_start_plug(&plug);
> >  	for (;;) {
> >  		/* Still start < end. */
> > +		long pages = 0;
> > +
> >  		error = -ENOMEM;
> >  		if (!vma)
> >  			goto out;
> > @@ -1053,9 +1068,11 @@ static int madvise_core(struct task_struct *tsk, unsigned long start,
> >  			tmp = end;
> >  
> >  		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
> > -		error = madvise_vma(tsk, vma, &prev, start, tmp, behavior);
> > +		error = madvise_vma(tsk, vma, &prev, start, tmp,
> > +					behavior, &pages);
> >  		if (error)
> >  			goto out;
> > +		*nr_pages += pages;
> >  		start = tmp;
> >  		if (prev && start < prev->vm_end)
> >  			start = prev->vm_end;
> > @@ -1140,26 +1157,137 @@ static int madvise_core(struct task_struct *tsk, unsigned long start,
> >   */
> >  SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> >  {
> > -	return madvise_core(current, start, len_in, behavior);
> > +	unsigned long dummy;
> > +
> > +	return madvise_core(current, start, len_in, behavior, &dummy);
> >  }
> >  
> > -SYSCALL_DEFINE4(process_madvise, int, pidfd, unsigned long, start,
> > -		size_t, len_in, int, behavior)
> > +static int pr_madvise_copy_param(struct pr_madvise_param __user *u_param,
> > +		struct pr_madvise_param *param)
> > +{
> > +	u32 size;
> > +	int ret;
> > +
> > +	memset(param, 0, sizeof(*param));
> > +
> > +	ret = get_user(size, &u_param->size);
> > +	if (ret)
> > +		return ret;
> > +
> > +	if (size > PAGE_SIZE)
> > +		return -E2BIG;
> > +
> > +	if (!size || size > sizeof(struct pr_madvise_param))
> > +		return -EINVAL;
> > +
> > +	ret = copy_from_user(param, u_param, size);
> > +	if (ret)
> > +		return -EFAULT;
> > +
> > +	return ret;
> > +}
> > +
> > +static int process_madvise_core(struct task_struct *tsk, int *behaviors,
> > +				struct iov_iter *iter,
> > +				const struct iovec *range_vec,
> > +				unsigned long riovcnt,
> > +				unsigned long flags)
> > +{
> > +	int i;
> > +	long err;
> > +
> > +	for (err = 0, i = 0; i < riovcnt && iov_iter_count(iter); i++) {
> > +		long ret = 0;
> > +
> > +		err = madvise_core(tsk, (unsigned long)range_vec[i].iov_base,
> > +				range_vec[i].iov_len, behaviors[i],
> > +				&ret);
> > +		if (err)
> > +			ret = err;
> > +
> > +		if (copy_to_iter(&ret, sizeof(long), iter) !=
> > +				sizeof(long)) {
> > +			err = -EFAULT;
> > +			break;
> > +		}
> > +
> > +		err = 0;
> > +	}
> > +
> > +	return err;
> > +}
> > +
> > +SYSCALL_DEFINE6(process_madvise, int, pidfd, ssize_t, nr_elem,
> > +			const int __user *, hints,
> > +			struct pr_madvise_param __user *, results,
> > +			struct pr_madvise_param __user *, ranges,
> > +			unsigned long, flags)
> >  {
> >  	int ret;
> >  	struct fd f;
> >  	struct pid *pid;
> >  	struct task_struct *tsk;
> >  	struct mm_struct *mm;
> > +	struct pr_madvise_param result_p, range_p;
> > +	const struct iovec __user *result_vec, __user *range_vec;
> > +	int *behaviors;
> > +	struct iovec iovstack_result[UIO_FASTIOV];
> > +	struct iovec iovstack_r[UIO_FASTIOV];
> > +	struct iovec *iov_l = iovstack_result;
> > +	struct iovec *iov_r = iovstack_r;
> > +	struct iov_iter iter;
> > +
> > +	if (flags != 0)
> > +		return -EINVAL;
> > +
> > +	ret = pr_madvise_copy_param(results, &result_p);
> > +	if (ret)
> > +		return ret;
> > +
> > +	ret = pr_madvise_copy_param(ranges, &range_p);
> > +	if (ret)
> > +		return ret;
> > +
> > +	result_vec = result_p.vec;
> > +	range_vec = range_p.vec;
> > +
> > +	if (result_p.size != sizeof(struct pr_madvise_param) ||
> > +			range_p.size != sizeof(struct pr_madvise_param))
> > +		return -EINVAL;
> > +
> > +	behaviors = kmalloc_array(nr_elem, sizeof(int), GFP_KERNEL);
> > +	if (!behaviors)
> > +		return -ENOMEM;
> > +
> > +	ret = copy_from_user(behaviors, hints, sizeof(int) * nr_elem);
> > +	if (ret < 0)
> > +		goto free_behavior_vec;
> > +
> > +	ret = import_iovec(READ, result_vec, nr_elem, UIO_FASTIOV,
> > +				&iov_l, &iter);
> > +	if (ret < 0)
> > +		goto free_behavior_vec;
> > +
> > +	if (!iov_iter_count(&iter)) {
> > +		ret = -EINVAL;
> > +		goto free_iovecs;
> > +	}
> > +
> > +	ret = rw_copy_check_uvector(CHECK_IOVEC_ONLY, range_vec, nr_elem,
> > +				UIO_FASTIOV, iovstack_r, &iov_r);
> > +	if (ret <= 0)
> > +		goto free_iovecs;
> >  
> >  	f = fdget(pidfd);
> > -	if (!f.file)
> > -		return -EBADF;
> > +	if (!f.file) {
> > +		ret = -EBADF;
> > +		goto free_iovecs;
> > +	}
> >  
> >  	pid = pidfd_to_pid(f.file);
> >  	if (IS_ERR(pid)) {
> >  		ret = PTR_ERR(pid);
> > -		goto err;
> > +		goto put_fd;
> >  	}
> >  
> >  	ret = -EINVAL;
> > @@ -1167,7 +1295,7 @@ SYSCALL_DEFINE4(process_madvise, int, pidfd, unsigned long, start,
> >  	tsk = pid_task(pid, PIDTYPE_PID);
> >  	if (!tsk) {
> >  		rcu_read_unlock();
> > -		goto err;
> > +		goto put_fd;
> >  	}
> >  	get_task_struct(tsk);
> >  	rcu_read_unlock();
> > @@ -1176,12 +1304,22 @@ SYSCALL_DEFINE4(process_madvise, int, pidfd, unsigned long, start,
> >  		ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
> >  		if (ret == -EACCES)
> >  			ret = -EPERM;
> > -		goto err;
> > +		goto put_task;
> >  	}
> > -	ret = madvise_core(tsk, start, len_in, behavior);
> > +
> > +	ret = process_madvise_core(tsk, behaviors, &iter, iov_r,
> > +					nr_elem, flags);
> >  	mmput(mm);
> > +put_task:
> >  	put_task_struct(tsk);
> > -err:
> > +put_fd:
> >  	fdput(f);
> > +free_iovecs:
> > +	if (iov_r != iovstack_r)
> > +		kfree(iov_r);
> > +	kfree(iov_l);
> > +free_behavior_vec:
> > +	kfree(behaviors);
> > +
> >  	return ret;
> >  }
> > -- 
> > 2.21.0.1020.gf2820cf01a-goog
> > 
> 
> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-20  9:28   ` [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER Michal Hocko
@ 2019-05-21  2:55     ` Minchan Kim
  2019-05-21  6:26       ` Michal Hocko
  2019-05-21 15:33       ` Johannes Weiner
  0 siblings, 2 replies; 68+ messages in thread
From: Minchan Kim @ 2019-05-21  2:55 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Mon, May 20, 2019 at 11:28:01AM +0200, Michal Hocko wrote:
> [cc linux-api]
> 
> On Mon 20-05-19 12:52:54, Minchan Kim wrote:
> > A system could have a much faster swap device, like zram. In that
> > case, swapping is much cheaper than file I/O on low-end storage.
> > In this configuration, userspace could apply a different strategy
> > for each kind of VMA. IOW, they want to reclaim anonymous pages
> > with MADV_COLD while keeping file-backed pages on the inactive LRU
> > with MADV_COOL, because file I/O is more expensive in this case, so
> > they want to keep those in memory until memory pressure happens.
> > 
> > To support such a strategy more easily, this patch introduces the
> > MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER options in madvise(2),
> > like the same filters /proc/<pid>/clear_refs already supports.
> > The filters can be ORed with other existing hints using the top two
> > bits of (int behavior).
> 
> madvise operates on top of ranges and it is quite trivial to do the
> filtering from the userspace so why do we need any additional filtering?
> 
> > Once either of them is set, the hint affects only the VMAs of
> > interest, either anonymous or file-backed.
> > 
> > With that, the user can call the process_madvise syscall simply
> > with the entire range (0x0 - 0xFFFFFFFFFFFFFFFF) plus either
> > MADV_ANONYMOUS_FILTER or MADV_FILE_FILTER, so there is no need to
> > call the syscall range by range.
> 
> OK, so here is the reason you want that. The immediate question is
> why the monitor cannot do the filtering from userspace. Slightly more
> work, all right, but less of an API to expose, and that itself is a
> strong argument against.

Without such a filter option, what I would have to do is enumerate all
of the VMAs via /proc/<pid>/maps and then parse every range and inode
out of strings, which would be painful for 2000+ VMAs.
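
For reference, the userspace-only alternative looks roughly like the
sketch below: walk /proc/<pid>/maps, keep the anonymous ranges, and
hint each one separately. This is a rough sketch under assumptions:
__NR_process_madvise is the number proposed in this series, and named
anonymous regions such as [heap] get misclassified as non-anonymous
here, which is part of what makes this parsing annoying:

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>

    #define __NR_process_madvise	428	/* from this series */

    /* Apply 'behavior' to every unnamed anonymous mapping of 'pid'. */
    static void hint_anon_ranges(int pidfd, pid_t pid, int behavior)
    {
            unsigned long start, end;
            char line[512], path[64];
            FILE *maps;

            snprintf(path, sizeof(path), "/proc/%d/maps", pid);
            maps = fopen(path, "r");
            if (!maps)
                    return;

            while (fgets(line, sizeof(line), maps)) {
                    char perms[8], name[256] = "";

                    /* format: start-end perms offset dev inode [path] */
                    if (sscanf(line, "%lx-%lx %7s %*s %*s %*s %255[^\n]",
                               &start, &end, perms, name) < 3)
                            continue;
                    if (name[0] != '\0')	/* file-backed or named */
                            continue;
                    syscall(__NR_process_madvise, pidfd,
                            (void *)start, end - start, behavior);
            }
            fclose(maps);
    }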

> 
> > * from v1r2
> >   * use consistent check with clear_refs to identify anon/file vma - surenb
> > 
> > * from v1r1
> >   * use naming "filter" for new madvise option - dancol
> > 
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> >  include/uapi/asm-generic/mman-common.h |  5 +++++
> >  mm/madvise.c                           | 14 ++++++++++++++
> >  2 files changed, 19 insertions(+)
> > 
> > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> > index b8e230de84a6..be59a1b90284 100644
> > --- a/include/uapi/asm-generic/mman-common.h
> > +++ b/include/uapi/asm-generic/mman-common.h
> > @@ -66,6 +66,11 @@
> >  #define MADV_WIPEONFORK 18		/* Zero memory on fork, child only */
> >  #define MADV_KEEPONFORK 19		/* Undo MADV_WIPEONFORK */
> >  
> > +#define MADV_BEHAVIOR_MASK (~(MADV_ANONYMOUS_FILTER|MADV_FILE_FILTER))
> > +
> > +#define MADV_ANONYMOUS_FILTER	(1<<31)	/* works for only anonymous vma */
> > +#define MADV_FILE_FILTER	(1<<30)	/* works for only file-backed vma */
> > +
> >  /* compatibility flags */
> >  #define MAP_FILE	0
> >  
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index f4f569dac2bd..116131243540 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -1002,7 +1002,15 @@ static int madvise_core(struct task_struct *tsk, unsigned long start,
> >  	int write;
> >  	size_t len;
> >  	struct blk_plug plug;
> > +	bool anon_only, file_only;
> >  
> > +	anon_only = behavior & MADV_ANONYMOUS_FILTER;
> > +	file_only = behavior & MADV_FILE_FILTER;
> > +
> > +	if (anon_only && file_only)
> > +		return error;
> > +
> > +	behavior = behavior & MADV_BEHAVIOR_MASK;
> >  	if (!madvise_behavior_valid(behavior))
> >  		return error;
> >  
> > @@ -1067,12 +1075,18 @@ static int madvise_core(struct task_struct *tsk, unsigned long start,
> >  		if (end < tmp)
> >  			tmp = end;
> >  
> > +		if (anon_only && vma->vm_file)
> > +			goto next;
> > +		if (file_only && !vma->vm_file)
> > +			goto next;
> > +
> >  		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
> >  		error = madvise_vma(tsk, vma, &prev, start, tmp,
> >  					behavior, &pages);
> >  		if (error)
> >  			goto out;
> >  		*nr_pages += pages;
> > +next:
> >  		start = tmp;
> >  		if (prev && start < prev->vm_end)
> >  			start = prev->vm_end;
> > -- 
> > 2.21.0.1020.gf2820cf01a-goog
> > 
> 
> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 1/7] mm: introduce MADV_COOL
  2019-05-20 22:54     ` Minchan Kim
@ 2019-05-21  6:04       ` Michal Hocko
  2019-05-21  9:11         ` Minchan Kim
  0 siblings, 1 reply; 68+ messages in thread
From: Michal Hocko @ 2019-05-21  6:04 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Tue 21-05-19 07:54:19, Minchan Kim wrote:
> On Mon, May 20, 2019 at 10:16:21AM +0200, Michal Hocko wrote:
[...]
> > > Internally, it works via deactivating memory from active list to
> > > inactive's head so when the memory pressure happens, they will be
> > > reclaimed earlier than other active pages unless there is no
> > > access until the time.
> > 
> > Could you elaborate about the decision to move to the head rather than
> > tail? What should happen to inactive pages? Should we move them to the
> > tail? Your implementation seems to ignore those completely. Why?
> 
> Normally, the inactive LRU could have used-once pages without any
> mapping to a user's address space. Such pages are better candidates
> for reclaim when memory pressure happens. By deactivating only the
> process's active LRU pages to the head of the inactive LRU, we keep
> them in RAM longer than used-once pages, so they have a better chance
> of being activated once the process is resumed.

You are making some assumptions here. You have an explicit call for
what is cold; now you are assuming something is even colder. Is this
assumption general enough to make people depend on it? Not that we
wouldn't be able to change the logic later, but that will always be
risky - especially in an area where somebody wants to build user
space driven memory management.
 
> > What should happen for shared pages? In other words do we want to allow
> > less privileged process to control evicting of shared pages with a more
> > privileged one? E.g. think of all sorts of side channel attacks. Maybe
> > we want to do the same thing as for mincore where write access is
> > required.
> 
> It doesn't work with shared pages (i.e., page_mapcount > 1). I will
> add that to the description.

OK, this is good for a start. It makes the implementation simpler
and we can add shared mappings coverage later.

Although I would argue that touching only writeable mappings should be
reasonably safe.
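
Something along the lines of the mincore() side-channel treatment,
i.e. (a hypothetical sketch, not part of this series; can_hint_vma is
an invented name and the exact policy is an assumption):

    /*
     * Allow a remote hint on a mapping only when its residency could
     * legitimately be observed anyway: the mapping is anonymous,
     * mapped writably, or the backing file is writable by the caller.
     */
    static bool can_hint_vma(struct vm_area_struct *vma)
    {
            if (vma_is_anonymous(vma))
                    return true;
            if (vma->vm_flags & VM_WRITE)
                    return true;
            if (!vma->vm_file)
                    return true;
            return inode_permission(file_inode(vma->vm_file),
                                    MAY_WRITE) == 0;
    }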

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 3/7] mm: introduce MADV_COLD
  2019-05-20 23:00     ` Minchan Kim
@ 2019-05-21  6:08       ` Michal Hocko
  2019-05-21  9:13         ` Minchan Kim
  0 siblings, 1 reply; 68+ messages in thread
From: Michal Hocko @ 2019-05-21  6:08 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Tue 21-05-19 08:00:38, Minchan Kim wrote:
> On Mon, May 20, 2019 at 10:27:03AM +0200, Michal Hocko wrote:
> > [Cc linux-api]
> > 
> > On Mon 20-05-19 12:52:50, Minchan Kim wrote:
> > > When a process expects no accesses to a certain memory range
> > > for a long time, it could hint kernel that the pages can be
> > > reclaimed instantly but data should be preserved for future use.
> > > This could reduce workingset eviction so it ends up increasing
> > > performance.
> > > 
> > > This patch introduces the new MADV_COLD hint to madvise(2)
> > > syscall. MADV_COLD can be used by a process to mark a memory range
> > > as not expected to be used for a long time. The hint can help
> > > kernel in deciding which pages to evict proactively.
> > 
> > As mentioned in other email this looks like a non-destructive
> > MADV_DONTNEED alternative.
> > 
> > > Internally, it works by reclaiming memory in the context of the
> > > process that called the syscall. If a page is dirty but the backing
> > > storage is not a synchronous device, the written page will be
> > > rotated back to the LRU's tail once the write is done, so it will
> > > be reclaimed easily when memory pressure happens. If the backing
> > > storage is a synchronous device (e.g., zram), the page will be
> > > reclaimed instantly.
> > 
> > Why do we special case async backing storage? Please always try to
> > explain _why_ the decision is made.
> 
> I didn't make any decision. ;-) That's how current reclaim works, to
> avoid the latency of freeing pages in interrupt context. I had a
> patchset to resolve that concern a few years ago but got distracted.

Please articulate that in the changelog then. Or even do not go into
implementation details and stick with "reuse the current reclaim
implementation". If you call out some of the specific details you are
risking that people will start depending on them. The fact that this
reuses the current reclaim logic is enough from the review point of
view because we know that there is no additional special casing to
worry about.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 5/7] mm: introduce external memory hinting API
  2019-05-21  2:41     ` Minchan Kim
@ 2019-05-21  6:17       ` Michal Hocko
  2019-05-21 10:32         ` Minchan Kim
  0 siblings, 1 reply; 68+ messages in thread
From: Michal Hocko @ 2019-05-21  6:17 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Tue 21-05-19 11:41:07, Minchan Kim wrote:
> On Mon, May 20, 2019 at 11:18:29AM +0200, Michal Hocko wrote:
> > [Cc linux-api]
> > 
> > On Mon 20-05-19 12:52:52, Minchan Kim wrote:
> > > There are use cases where a centralized userspace daemon wants to
> > > give a memory hint like MADV_[COOL|COLD] to another process.
> > > Android's ActivityManagerService is one of them.
> > > 
> > > It's similar in spirit to madvise(MADV_DONTNEED), but the
> > > information required to make the reclaim decision is not known to
> > > the app. Instead, it is known to the centralized userspace daemon
> > > (ActivityManagerService), and that daemon must be able to initiate
> > > reclaim on its own without any app involvement.
> > 
> > Could you expand some more about how this all works? How does the
> > centralized daemon track respective ranges? How does it synchronize
> > against parallel modification of the address space etc.
> 
> Currently, we don't track individual address ranges because we have
> two policies at the moment:
> 
> 	deactivate file pages and reclaim anonymous pages of the app.
> 
> Since the daemon has the ability to let background apps resume (IOW,
> the process will be run by the daemon) and both hints are
> non-disruptive from a stability point of view, we are okay with the
> race.

Fair enough, but the API should consider future use cases where this
might be a problem. So we should really think about those potential
scenarios now. If we are ok with that, fine, but then we should be
explicit and document it that way. Essentially say that any sort of
synchronization is supposed to be done by the monitor. This will make
the API less usable but maybe that is enough.
 
> > > To solve the issue, this patch introduces a new syscall,
> > > process_madvise(2), which works based on a pidfd so it can give a
> > > hint to an external process.
> > > 
> > > int process_madvise(int pidfd, void *addr, size_t length, int advise);
> > 
> > OK, this makes some sense from the API point of view. When we
> > discussed that at LSFMM I was contemplating something like that,
> > except the fd would be a VMA fd rather than the process. We could
> > extend and reuse the /proc/<pid>/map_files interface, which doesn't
> > support anonymous memory right now.
> > 
> > I am not saying this would be a better interface but I wanted to
> > mention it here for further discussion. One slight advantage would
> > be that you know the exact object that you are operating on because
> > you have a fd for the VMA, and we would have a more straightforward
> > way to reject the operation if the underlying object has changed
> > (e.g. unmapped and reused for a different mapping).
> 
> I agree with your point. Unless I missed something, such VMA-level
> modification notifications don't work even for file-mapped VMAs at
> the moment. For anonymous VMAs, I think we could potentially use
> userfaultfd. It would be great for someone who wants to work with
> disruptive hints like MADV_DONTNEED.
> 
> I'd like to see that as a further enhancement after landing the
> address range based operation, limiting the hints process_madvise
> supports to non-disruptive ones only (e.g., MADV_[COOL|COLD]), so we
> can catch up with the use cases/workloads when someone wants to
> extend the API.

So do you think we want both interfaces (process_madvise and madvisefd)?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector arrary
  2019-05-21  2:48     ` Minchan Kim
@ 2019-05-21  6:24       ` Michal Hocko
  2019-05-21 10:26         ` Minchan Kim
  0 siblings, 1 reply; 68+ messages in thread
From: Michal Hocko @ 2019-05-21  6:24 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Tue 21-05-19 11:48:20, Minchan Kim wrote:
> On Mon, May 20, 2019 at 11:22:58AM +0200, Michal Hocko wrote:
> > [Cc linux-api]
> > 
> > On Mon 20-05-19 12:52:53, Minchan Kim wrote:
> > > Currently, the process_madvise syscall works on only one address
> > > range, so the user has to call the syscall several times to give
> > > hints for multiple address ranges.
> > 
> > Is that a problem? How big of a problem? Any numbers?
> 
> We easily have 2000+ VMAs, so it's not a trivial overhead. I will come
> up with numbers in the description at respin.

Does this really have to be a fast operation? I would expect the monitor
is by no means a fast path. The system call overhead is not what it used
to be, sigh, but still for something that is not a hot path it should be
tolerable, especially when the whole operation is quite expensive on its
own (wrt. the syscall entry/exit).

I am not saying we do not need a multiplexing API, I am just not sure
we need it right away. Btw. there was some demand for other MM syscalls
to provide a multiplexing API (e.g. mprotect), maybe it would be better
to handle those in one go?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-21  2:55     ` Minchan Kim
@ 2019-05-21  6:26       ` Michal Hocko
  2019-05-27  7:58         ` Minchan Kim
  2019-05-21 15:33       ` Johannes Weiner
  1 sibling, 1 reply; 68+ messages in thread
From: Michal Hocko @ 2019-05-21  6:26 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Tue 21-05-19 11:55:33, Minchan Kim wrote:
> On Mon, May 20, 2019 at 11:28:01AM +0200, Michal Hocko wrote:
> > [cc linux-api]
> > 
> > On Mon 20-05-19 12:52:54, Minchan Kim wrote:
> > > A system could have a much faster swap device, like zram. In that case,
> > > swapping is much cheaper than file IO on low-end storage.
> > > In this configuration, userspace could apply a different strategy to each
> > > kind of vma. IOW, they want to reclaim anonymous pages via MADV_COLD
> > > while keeping file-backed pages in the inactive LRU via MADV_COOL, because
> > > file IO is more expensive in this case, so they want to keep them in memory
> > > until memory pressure happens.
> > > 
> > > To make such a strategy easier to support, this patch introduces the
> > > MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER options in madvise(2), much as
> > > /proc/<pid>/clear_refs already supports the same filters.
> > > These filters can be ORed with other existing hints using the top two bits
> > > of (int behavior).
> > 
> > madvise operates on top of ranges and it is quite trivial to do the
> > filtering from the userspace so why do we need any additional filtering?
> > 
> > > Once either of them is set, the hint affects only the VMAs of interest,
> > > either anonymous or file-backed.
> > > 
> > > With that, the user can call the process_madvise syscall simply with the
> > > entire range (0x0 - 0xFFFFFFFFFFFFFFFF) plus either MADV_ANONYMOUS_FILTER
> > > or MADV_FILE_FILTER, so there is no need to call the syscall range by range.
> > 
> > OK, so here is the reason you want that. The immediate question is why
> > cannot the monitor do the filtering from the userspace. Slightly more
> > work, all right, but less of an API to expose and that itself is a
> > strong argument against.
> 
> What I would have to do if we don't have such a filter option is enumerate
> all of the VMAs via /proc/<pid>/maps and then parse every range and inode
> from strings, which would be painful for 2000+ VMAs.

Painful is not an argument to add a new user API. If the existing API
suits the purpose then it should be used. If it is not usable, we can
think of a different way.
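
For illustration, a minimal sketch of that userspace filtering, assuming the
standard /proc/<pid>/maps line format (a zero inode is treated as anonymous,
which also catches special mappings like [heap] and [stack]):

#include <stdio.h>

/* Walk a target's mappings and report the anonymous ones; a monitor could
 * then feed each returned range to madvise()/process_madvise() itself. */
static int for_each_anon_range(int pid,
			       void (*cb)(unsigned long start, unsigned long end))
{
	char path[64], line[512];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/maps", pid);
	f = fopen(path, "r");
	if (!f)
		return -1;

	while (fgets(line, sizeof(line), f)) {
		unsigned long start, end, pgoff, inode;
		char perms[8], dev[16];

		if (sscanf(line, "%lx-%lx %7s %lx %15s %lu",
			   &start, &end, perms, &pgoff, dev, &inode) == 6 &&
		    inode == 0)
			cb(start, end);	/* no backing inode: anonymous */
	}
	fclose(f);
	return 0;
}

Whether looping over a few thousand such ranges is an acceptable cost is
exactly what is being debated here.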
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 0/7] introduce memory hinting API for external process
       [not found]   ` <20190521043950.GJ10039@google.com>
@ 2019-05-21  6:32     ` Michal Hocko
  0 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2019-05-21  6:32 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Johannes Weiner, Andrew Morton, LKML, linux-mm, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

[Cc linux-api]

On Tue 21-05-19 13:39:50, Minchan Kim wrote:
> On Mon, May 20, 2019 at 12:46:05PM -0400, Johannes Weiner wrote:
> > On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote:
> > > - Approach
> > > 
> > > The approach we chose was to use a new interface to allow userspace to
> > > proactively reclaim entire processes by leveraging platform information.
> > > This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
> > > that are known to be cold from userspace and to avoid races with lmkd
> > > by reclaiming apps as soon as they entered the cached state. Additionally,
> > > it could give the platform many chances to use a lot of information to
> > > optimize memory efficiency.
> > > 
> > > IMHO we should spell it out that this patchset complements MADV_WONTNEED
> > > and MADV_FREE by adding non-destructive ways to gain some free memory
> > > space. MADV_COLD is similar to MADV_WONTNEED in a way that it hints the
> > > kernel that memory region is not currently needed and should be reclaimed
> > > immediately; MADV_COOL is similar to MADV_FREE in a way that it hints the
> > > kernel that memory region is not currently needed and should be reclaimed
> > > when memory pressure rises.
> > 
> > I agree with this approach and the semantics. But these names are very
> > vague and extremely easy to confuse since they're so similar.
> > 
> > MADV_COLD could be a good name, but for deactivating pages, not
> > reclaiming them - marking memory "cold" on the LRU for later reclaim.
> > 
> > For the immediate reclaim one, I think there is a better option too:
> > In virtual memory speak, putting a page into secondary storage (or
> > ensuring it's already there), and then freeing its in-memory copy, is
> > called "paging out". And that's what this flag is supposed to do. So
> > how about MADV_PAGEOUT?
> > 
> > With that, we'd have:
> > 
> > MADV_FREE: Mark data invalid, free memory when needed
> > MADV_DONTNEED: Mark data invalid, free memory immediately
> > 
> > MADV_COLD: Data is not used for a while, free memory when needed
> > MADV_PAGEOUT: Data is not used for a while, free memory immediately
> > 
> > What do you think?
> 
> There have been several suggestions so far. Thanks, folks!
> 
> For deactivating:
> 
> - MADV_COOL
> - MADV_RECLAIM_LAZY
> - MADV_DEACTIVATE
> - MADV_COLD
> - MADV_FREE_PRESERVE
> 
> 
> For reclaiming:
> 
> - MADV_COLD
> - MADV_RECLAIM_NOW
> - MADV_RECLAIMING
> - MADV_PAGEOUT
> - MADV_DONTNEED_PRESERVE
> 
> It seems everybody dislikes the MADV_COOL/MADV_COLD naming, so we want to
> go with something else.
> For consistency with the other existing madvise hints, a -PRESERVE
> postfix suits well. However, I never liked the naming of FREE
> vs. DONTNEED from the beginning; they are easily confused.
> I prefer PAGEOUT to RECLAIM since it better carries the nuance of
> reclaiming under memory pressure, with the data paged back in
> if someone needs it later. So it implies PRESERVE.
> If there is nothing strongly against it, I want to go with MADV_COLD and
> MADV_PAGEOUT.
> 
> Other opinions?

I do not really care strongly. I am pretty sure we will have a lot of
suggestions, because people tend to be good at arguing about that...
Anyway, unlike DONTNEED/FREE, we do not have any other OS implementing
these features, right? So we shouldn't be tied to existing names.
On the other hand, I kind of like the reference to the existing names, but
DEACTIVATE/PAGEOUT seem a good fit to me as well. Unless a much better
name is suggested I would go with one of those. Up to you.
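
For reference, a minimal sketch of the resulting semantics from the calling
process's point of view, using the MADV_COLD/MADV_PAGEOUT names (these are
what eventually shipped in Linux 5.4; the fallback values below match the
merged uapi):

#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_COLD
#define MADV_COLD	20	/* deactivate: reclaim when pressure rises */
#endif
#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT	21	/* page out: reclaim now, preserve contents */
#endif

/* Hint that a range is idle: the data stays valid either way and is
 * faulted back in from swap or the page cache on the next access. */
static void hint_idle_range(void *addr, size_t len, int reclaim_now)
{
	madvise(addr, len, reclaim_now ? MADV_PAGEOUT : MADV_COLD);
}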
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 0/7] introduce memory hinting API for external process
       [not found] ` <20190521014452.GA6738@bombadil.infradead.org>
@ 2019-05-21  6:34   ` Michal Hocko
  0 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2019-05-21  6:34 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon,
	linux-api

[linux-api]

On Mon 20-05-19 18:44:52, Matthew Wilcox wrote:
> On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote:
> > IMHO we should spell it out that this patchset complements MADV_WONTNEED
> > and MADV_FREE by adding non-destructive ways to gain some free memory
> > space. MADV_COLD is similar to MADV_WONTNEED in a way that it hints the
> > kernel that memory region is not currently needed and should be reclaimed
> > immediately; MADV_COOL is similar to MADV_FREE in a way that it hints the
> > kernel that memory region is not currently needed and should be reclaimed
> > when memory pressure rises.
> 
> Do we tear down page tables for these ranges?  That seems like a good
> way of reclaiming potentially a substantial amount of memory.

I do not think we can in general because this is a non-destructive
operation. So at least we cannot tear down anonymous ptes (they will
turn into swap entries).

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 1/7] mm: introduce MADV_COOL
  2019-05-21  6:04       ` Michal Hocko
@ 2019-05-21  9:11         ` Minchan Kim
  2019-05-21 10:05           ` Michal Hocko
  0 siblings, 1 reply; 68+ messages in thread
From: Minchan Kim @ 2019-05-21  9:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Tue, May 21, 2019 at 08:04:43AM +0200, Michal Hocko wrote:
> On Tue 21-05-19 07:54:19, Minchan Kim wrote:
> > On Mon, May 20, 2019 at 10:16:21AM +0200, Michal Hocko wrote:
> [...]
> > > > Internally, it works via deactivating memory from active list to
> > > > inactive's head so when the memory pressure happens, they will be
> > > > reclaimed earlier than other active pages unless there is no
> > > > access until the time.
> > > 
> > > Could you elaborate about the decision to move to the head rather than
> > > tail? What should happen to inactive pages? Should we move them to the
> > > tail? Your implementation seems to ignore those completely. Why?
> > 
> > Normally, the inactive LRU could have used-once pages without any mapping
> > to the user's address space. Such pages are better candidates for
> > reclaim when memory pressure happens. By deactivating only the process's
> > active LRU pages to the head of the inactive LRU, we will
> > keep them in RAM longer than used-once pages, and they have more chance
> > to be activated once the process is resumed.
> 
> You are making some assumptions here. You have an explicit call for what is
> cold; now you are assuming something is even colder. Is this assumption
> general enough to make people depend on it? Not that we wouldn't be able
> to change the logic later, but that will always be risky - especially in
> an area where somebody wants to do user-space-driven memory
> management.

Think about MADV_FREE. It moves those pages to the head of the inactive
file LRU. See get_scan_count, which forces scanning of the inactive file
LRU if it is large enough, based on the memory pressure.
The reason is that the inactive file LRU is generally likely to hold
used-once pages. Those pages have been the top-priority candidates for
reclaim for a long time.

The only places I am aware of that move pages to the tail of the inactive
LRU are where writeback is done for pages the VM has already decided to
reclaim via LRU aging, or a destructive operation like invalidation that
couldn't complete. Those are really strong hints, no doubt.

>  
> > > What should happen for shared pages? In other words do we want to allow
> > > less privileged process to control evicting of shared pages with a more
> > > privileged one? E.g. think of all sorts of side channel attacks. Maybe
> > > we want to do the same thing as for mincore where write access is
> > > required.
> > 
> > It doesn't work with shared pages (i.e., page_mapcount > 1). I will add
> > that to the description.
> 
> OK, this is good for a start. It makes the implementation simpler
> and we can add shared mappings coverage later.
> 
> Although I would argue that touching only writeable mappings should be
> reasonably safe.
> 
> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 3/7] mm: introduce MADV_COLD
  2019-05-21  6:08       ` Michal Hocko
@ 2019-05-21  9:13         ` Minchan Kim
  0 siblings, 0 replies; 68+ messages in thread
From: Minchan Kim @ 2019-05-21  9:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Tue, May 21, 2019 at 08:08:20AM +0200, Michal Hocko wrote:
> On Tue 21-05-19 08:00:38, Minchan Kim wrote:
> > On Mon, May 20, 2019 at 10:27:03AM +0200, Michal Hocko wrote:
> > > [Cc linux-api]
> > > 
> > > On Mon 20-05-19 12:52:50, Minchan Kim wrote:
> > > > When a process expects no accesses to a certain memory range
> > > > for a long time, it can hint to the kernel that the pages can be
> > > > reclaimed instantly but the data should be preserved for future use.
> > > > This could reduce workingset eviction, so it ends up increasing
> > > > performance.
> > > > 
> > > > This patch introduces the new MADV_COLD hint to the madvise(2)
> > > > syscall. MADV_COLD can be used by a process to mark a memory range
> > > > as not expected to be used for a long time. The hint can help the
> > > > kernel in deciding which pages to evict proactively.
> > > 
> > > As mentioned in another email, this looks like a non-destructive
> > > MADV_DONTNEED alternative.
> > > 
> > > > Internally, it works by reclaiming memory in the context of the
> > > > process calling the syscall. If a page is dirty but the backing storage
> > > > is not a synchronous device, the written page will be rotated back
> > > > to the LRU's tail once the write is done, so it will be reclaimed easily
> > > > when memory pressure happens. If the backing storage is a
> > > > synchronous device (e.g., zram), the page will be reclaimed instantly.
> > > 
> > > Why do we special case async backing storage? Please always try to
> > > explain _why_ the decision is made.
> > 
> > I didn't make any decision. ;-) That's how current reclaim works, to
> > avoid the latency of freeing pages in interrupt context. I had a patchset
> > to resolve the concern a few years ago but got distracted.
> 
> Please articulate that in the changelog then. Or even do not go into
> implementation details and stick with - reuse the current reclaim
> implementation. If you call out some of the specific details you are
> risking people will start depending on them. The fact that this reuses
the current reclaim logic is enough from the review point of view
> because we know that there is no additional special casing to worry
> about.

I should have clarified. I will remove those lines in the respin.

> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 1/7] mm: introduce MADV_COOL
  2019-05-21  9:11         ` Minchan Kim
@ 2019-05-21 10:05           ` Michal Hocko
  0 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2019-05-21 10:05 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Tue 21-05-19 18:11:34, Minchan Kim wrote:
> On Tue, May 21, 2019 at 08:04:43AM +0200, Michal Hocko wrote:
> > On Tue 21-05-19 07:54:19, Minchan Kim wrote:
> > > On Mon, May 20, 2019 at 10:16:21AM +0200, Michal Hocko wrote:
> > [...]
> > > > > Internally, it works via deactivating memory from active list to
> > > > > inactive's head so when the memory pressure happens, they will be
> > > > > reclaimed earlier than other active pages unless there is no
> > > > > access until the time.
> > > > 
> > > > Could you elaborate about the decision to move to the head rather than
> > > > tail? What should happen to inactive pages? Should we move them to the
> > > > tail? Your implementation seems to ignore those completely. Why?
> > > 
> > > Normally, the inactive LRU could have used-once pages without any mapping
> > > to the user's address space. Such pages are better candidates for
> > > reclaim when memory pressure happens. By deactivating only the process's
> > > active LRU pages to the head of the inactive LRU, we will
> > > keep them in RAM longer than used-once pages, and they have more chance
> > > to be activated once the process is resumed.
> > 
> > You are making some assumptions here. You have an explicit call for what is
> > cold; now you are assuming something is even colder. Is this assumption
> > general enough to make people depend on it? Not that we wouldn't be able
> > to change the logic later, but that will always be risky - especially in
> > an area where somebody wants to do user-space-driven memory
> > management.
> 
> Think about MADV_FREE. It moves those pages to the head of the inactive
> file LRU. See get_scan_count, which forces scanning of the inactive file
> LRU if it is large enough, based on the memory pressure.
> The reason is that the inactive file LRU is generally likely to hold
> used-once pages. Those pages have been the top-priority candidates for
> reclaim for a long time.

OK, fair enough. Being consistent with MADV_FREE is reasonable. I just
forgot we do rotate like this there.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector array
  2019-05-21  6:24       ` Michal Hocko
@ 2019-05-21 10:26         ` Minchan Kim
  2019-05-21 10:37           ` Michal Hocko
  0 siblings, 1 reply; 68+ messages in thread
From: Minchan Kim @ 2019-05-21 10:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Tue, May 21, 2019 at 08:24:21AM +0200, Michal Hocko wrote:
> On Tue 21-05-19 11:48:20, Minchan Kim wrote:
> > On Mon, May 20, 2019 at 11:22:58AM +0200, Michal Hocko wrote:
> > > [Cc linux-api]
> > > 
> > > On Mon 20-05-19 12:52:53, Minchan Kim wrote:
> > > > Currently, the process_madvise syscall works on only one address range,
> > > > so the user has to call the syscall several times to give hints to
> > > > multiple address ranges.
> > > 
> > > Is that a problem? How big of a problem? Any numbers?
> > 
> > We easily have 2000+ VMAs, so it's not a trivial overhead. I will come up
> > with numbers in the description at respin.
> 
> Does this really have to be a fast operation? I would expect the monitor
> is by no means a fast path. The system call overhead is not what it used
> to be, sigh, but still for something that is not a hot path it should be
> tolerable, especially when the whole operation is quite expensive on its
> own (wrt. the syscall entry/exit).

What's different with process_vm_[readv|writev] and vmsplice?
If there are a lot of ranges to be covered, a vector operation makes sense
to me.
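
For comparison, the existing vectorized cross-process calls take iovec arrays
on both sides; e.g. process_vm_readv(2):

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>

/* Read 'n' remote ranges from 'pid' into 'n' local buffers in one call -
 * the same vector-of-ranges shape proposed for process_madvise here. */
static ssize_t read_remote_ranges(pid_t pid, struct iovec *local,
				  struct iovec *remote, unsigned long n)
{
	return process_vm_readv(pid, local, n, remote, n, 0);
}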

> 
> I am not saying we do not need a multiplexing API, I am just not sure
> we need it right away. Btw. there was some demand for other MM syscalls
> to provide a multiplexing API (e.g. mprotect), maybe it would be better
> to handle those in one go?

That's exactly what Daniel Colascione suggested in internal
review. That would be an interesting approach if we could aggregate
all of the system calls in one go.

> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 5/7] mm: introduce external memory hinting API
  2019-05-21  6:17       ` Michal Hocko
@ 2019-05-21 10:32         ` Minchan Kim
  0 siblings, 0 replies; 68+ messages in thread
From: Minchan Kim @ 2019-05-21 10:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Tue, May 21, 2019 at 08:17:43AM +0200, Michal Hocko wrote:
> On Tue 21-05-19 11:41:07, Minchan Kim wrote:
> > On Mon, May 20, 2019 at 11:18:29AM +0200, Michal Hocko wrote:
> > > [Cc linux-api]
> > > 
> > > On Mon 20-05-19 12:52:52, Minchan Kim wrote:
> > > > There are use cases where a centralized userspace daemon wants to give
> > > > a memory hint like MADV_[COOL|COLD] to another process. Android's
> > > > ActivityManagerService is one of them.
> > > > 
> > > > It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > > > required to make the reclaim decision is not known to the app. Instead,
> > > > it is known to the centralized userspace daemon (ActivityManagerService),
> > > > and that daemon must be able to initiate reclaim on its own without
> > > > any app involvement.
> > > 
> > > Could you expand some more about how this all works? How does the
> > > centralized daemon track respective ranges? How does it synchronize
> > > against parallel modification of the address space etc.
> > 
> > Currently, we don't track individual address ranges because we have two
> > policies at the moment:
> > 
> > 	deactivate file pages and reclaim anonymous pages of the app.
> > 
> > Since the daemon has the ability to let background apps resume (IOW, the
> > process will be run by the daemon) and both hints are non-disruptive from
> > a stability point of view, we are okay with the race.
> 
Fair enough, but the API should consider future usecases where this might
be a problem. So we should really think about those potential scenarios
now. If we are ok with that, fine, but then we should be explicit and
document it that way. Essentially say that any sort of synchronization
is supposed to be done by the monitor. This will make the API less usable,
but maybe that is enough.

Okay, I will add more about that in the description.

>  
> > > > To solve the issue, this patch introduces a new syscall,
> > > > process_madvise(2), which works based on pidfd so it can give
> > > > a hint to an external process.
> > > > 
> > > > int process_madvise(int pidfd, void *addr, size_t length, int advise);
> > > 
> > > OK, this makes some sense from the API point of view. When we
> > > discussed that at LSFMM, I was contemplating something like that,
> > > except the fd would be a VMA fd rather than the process. We could extend
> > > and reuse the /proc/<pid>/map_files interface, which doesn't support
> > > anonymous memory right now.
> > > 
> > > I am not saying this would be a better interface but I wanted to mention
> > > it here for a further discussion. One slight advantage would be that
> > > you know the exact object that you are operating on because you have a
> > > fd for the VMA and we would have a more straightforward way to reject
> > > operation if the underlying object has changed (e.g. unmapped and reused
> > > for a different mapping).
> > 
> > I agree with your point. If I didn't miss something, such VMA-level
> > modification notifications don't work even for file-mapped VMAs at the
> > moment. For anonymous VMAs, I think we could potentially use userfaultfd.
> > It would be great if someone wanted to do that with disruptive hints like
> > MADV_DONTNEED.
> > 
> > I'd like to see that as a further enhancement after landing the address
> > range based operation, limiting the hints process_madvise supports to
> > non-disruptive ones (e.g., MADV_[COOL|COLD]), so that we can catch up with
> > the usecase/workload when someone wants to extend the API.
> 
> So do you think we want both interfaces (process_madvise and madvisefd)?

What I have in mind is to extend process_madvise later like this:

struct pr_madvise_param {
	int size;			/* the size of this structure */
	union {
		const struct iovec __user *vec;	/* address range array */
		int fd;				/* supported from 6.0 */
	};
};

introducing a new OR-able hint, PR_MADV_RANGE_FD, so that process_madvise
can work with an fd instead of an address range.
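
As a rough sketch of how a monitor might call the proposed vectorized form
(following the prototype from the cover letter, without the fd extension;
the syscall number is a placeholder and nothing here is a merged ABI):

#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#define __NR_process_madvise_rfc 435	/* placeholder, not an allocated number */

struct pr_madvise_param {
	int size;			/* sizeof(struct pr_madvise_param) */
	const struct iovec *vec;	/* address range array */
};

/* Mirrors the proposed prototype:
 * process_madvise(pidfd, nr_elem, &behavior, results, ranges, flags) */
static long process_madvise_ranges(int pidfd, int behavior,
				   struct pr_madvise_param *results,
				   struct pr_madvise_param *ranges,
				   ssize_t nr_elem)
{
	return syscall(__NR_process_madvise_rfc, pidfd, nr_elem, &behavior,
		       results, ranges, 0UL);
}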

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector array
  2019-05-21 10:26         ` Minchan Kim
@ 2019-05-21 10:37           ` Michal Hocko
  2019-05-27  7:49             ` Minchan Kim
  0 siblings, 1 reply; 68+ messages in thread
From: Michal Hocko @ 2019-05-21 10:37 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Tue 21-05-19 19:26:13, Minchan Kim wrote:
> On Tue, May 21, 2019 at 08:24:21AM +0200, Michal Hocko wrote:
> > On Tue 21-05-19 11:48:20, Minchan Kim wrote:
> > > On Mon, May 20, 2019 at 11:22:58AM +0200, Michal Hocko wrote:
> > > > [Cc linux-api]
> > > > 
> > > > On Mon 20-05-19 12:52:53, Minchan Kim wrote:
> > > > > Currently, the process_madvise syscall works on only one address range,
> > > > > so the user has to call the syscall several times to give hints to
> > > > > multiple address ranges.
> > > > 
> > > > Is that a problem? How big of a problem? Any numbers?
> > > 
> > > We easily have 2000+ VMAs, so it's not a trivial overhead. I will come up
> > > with numbers in the description at respin.
> > 
> > Does this really have to be a fast operation? I would expect the monitor
> > is by no means a fast path. The system call overhead is not what it used
> > to be, sigh, but still for something that is not a hot path it should be
> > tolerable, especially when the whole operation is quite expensive on its
> > own (wrt. the syscall entry/exit).
> 
> What's different with process_vm_[readv|writev] and vmsplice?
> If there are a lot of ranges to be covered, a vector operation makes sense
> to me.

I am not saying that the vector API is wrong. All I am trying to say is
that the benefit is not really clear so far. If you want to push it
through then you had better get some supporting data.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 0/7] introduce memory hinting API for external process
       [not found] <20190520035254.57579-1-minchan@kernel.org>
                   ` (7 preceding siblings ...)
       [not found] ` <20190521014452.GA6738@bombadil.infradead.org>
@ 2019-05-21 12:53 ` Shakeel Butt
       [not found] ` <dbe801f0-4bbe-5f6e-9053-4b7deb38e235@arm.com>
  9 siblings, 0 replies; 68+ messages in thread
From: Shakeel Butt @ 2019-05-21 12:53 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Michal Hocko, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Sonny Rao, Brian Geffon, linux-api

On Sun, May 19, 2019 at 8:53 PM Minchan Kim <minchan@kernel.org> wrote:
>
> - Background
>
> The Android terminology used for forking a new process and starting an app
> from scratch is a cold start, while resuming an existing app is a hot start.
> While we continually try to improve the performance of cold starts, hot
> starts will always be significantly less power hungry as well as faster, so
> we are trying to make hot starts more likely than cold starts.
>
> To increase hot starts, Android userspace manages the order in which apps should
> be killed in a process called ActivityManagerService. ActivityManagerService
> tracks every Android app or service that the user could be interacting with
> at any time and translates that into a ranked list for lmkd (low memory
> killer daemon). They are likely to be killed by lmkd if the system has to
> reclaim memory. In that sense they are similar to entries in any other cache.
> Those apps are kept alive for opportunistic performance improvements but
> those performance improvements will vary based on the memory requirements of
> individual workloads.
>
> - Problem
>
> Naturally, cached apps were dominant consumers of memory on the system.
> However, they were not significant consumers of swap even though they are
> good candidates for swap. Upon investigation, swapping out only begins
> once the low zone watermark is hit and kswapd wakes up, but the overall
> allocation rate in the system might trip lmkd thresholds and cause a cached
> process to be killed (we measured the performance of swapping out vs. zapping
> the memory by killing a process; unsurprisingly, zapping is 10x faster
> even though we use zram, which is much faster than real storage), so a kill
> from lmkd will often satisfy the high zone watermark, resulting in very
> few pages actually being moved to swap.

It is not clear from the above para what exactly the problem is. IMO
low usage of swap is not the problem; rather, global memory pressure
and the reactive response to it are the problem. Killing apps over
swapping is preferred since, as you have noted, zapping frees memory
faster, but it indirectly increases cold starts. Also, swapping on
allocation causes latency issues for the app. So, a proactive mechanism
is needed to keep global pressure away and indirectly reduce cold starts
and alloc stalls.

>
> - Approach
>
> The approach we chose was to use a new interface to allow userspace to
> proactively reclaim entire processes by leveraging platform information.
> This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
> that are known to be cold from userspace and to avoid races with lmkd
> by reclaiming apps as soon as they entered the cached state. Additionally,
> it could give the platform many chances to use a lot of information to
> optimize memory efficiency.

I think it would be good to have clear reasoning on why the "reclaim
from userspace" approach is taken. The Android runtime clearly has more
accurate stale/cold information at the app/process level and can
positively influence the kernel's reclaim decisions. So, the "reclaim
from userspace" approach makes total sense for Android. I envision that
Chrome OS would be another very obvious user of this approach. There
can be tens of tabs which the user has not touched for some time.
Chrome OS can proactively reclaim memory from such tabs.

>
> IMHO we should spell it out that this patchset complements MADV_WONTNEED

MADV_DONTNEED? Same at a couple of places below.

> and MADV_FREE by adding non-destructive ways to gain some free memory
> space. MADV_COLD is similar to MADV_WONTNEED in a way that it hints the
> kernel that memory region is not currently needed and should be reclaimed
> immediately; MADV_COOL is similar to MADV_FREE in a way that it hints the
> kernel that memory region is not currently needed and should be reclaimed
> when memory pressure rises.
>
> To achieve the goal, the patchset introduces two new options for madvise.
> One is MADV_COOL, which will deactivate active pages, and the other is
> MADV_COLD, which will reclaim private pages instantly. These new options
> complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to
> gain some free memory space. MADV_COLD is similar to MADV_DONTNEED in a way
> that it hints the kernel that memory region is not currently needed and
> should be reclaimed immediately; MADV_COOL is similar to MADV_FREE in a way
> that it hints the kernel that memory region is not currently needed and
> should be reclaimed when memory pressure rises.
>
> This approach is similar in spirit to madvise(MADV_WONTNEED), but the
> information required to make the reclaim decision is not known to the app.
> Instead, it is known to a centralized userspace daemon, and that daemon
> must be able to initiate reclaim on its own without any app involvement.
> To solve the concern, this patch introduces a new syscall -
>
>         struct pr_madvise_param {
>                 int size;
>                 const struct iovec *vec;
>         };
>
>         int process_madvise(int pidfd, ssize_t nr_elem, int *behavior,
>                                 struct pr_madvise_param *results,
>                                 struct pr_madvise_param *ranges,
>                                 unsigned long flags);
>
> The syscall takes a pidfd to give hints to an external process and provides
> a pair of results/ranges vector arguments so that it can give several
> hints, one per address range, all at once.
>
> I guess others have different ideas about the naming of the syscall and
> options, so feel free to suggest better names.
>
> - Experiment
>
> We did a bunch of testing with several hundred real users on Android, not an
> artificial benchmark. We saw about a 17% decrease in cold starts without any
> significant battery/app startup latency issues. And with an artificial
> benchmark which launches and switches apps, we saw an average 7% app launching
> improvement, 18% fewer lmkd kills and good stats from vmstat.
>
> A is vanilla and B is process_madvise.
>
>
>                                        A          B      delta   ratio(%)
>                allocstall_dma          0          0          0       0.00
>            allocstall_movable       1464        457      -1007     -69.00
>             allocstall_normal     263210     190763     -72447     -28.00
>              allocstall_total     264674     191220     -73454     -28.00
>           compact_daemon_wake      26912      25294      -1618      -7.00
>                  compact_fail      17885      14151      -3734     -21.00
>          compact_free_scanned 4204766409 3835994922 -368771487      -9.00
>              compact_isolated    3446484    2967618    -478866     -14.00
>       compact_migrate_scanned 1621336411 1324695710 -296640701     -19.00
>                 compact_stall      19387      15343      -4044     -21.00
>               compact_success       1502       1192       -310     -21.00
> kswapd_high_wmark_hit_quickly        234        184        -50     -22.00
>             kswapd_inodesteal     221635     233093      11458       5.00
>  kswapd_low_wmark_hit_quickly      66065      54009     -12056     -19.00
>                    nr_dirtied     259934     296476      36542      14.00
>   nr_vmscan_immediate_reclaim       2587       2356       -231      -9.00
>               nr_vmscan_write    1274232    2661733    1387501     108.00
>                    nr_written    1514060    2937560    1423500      94.00
>                    pageoutrun      67561      55133     -12428     -19.00
>                    pgactivate    2335060    1984882    -350178     -15.00
>                   pgalloc_dma   13743011   14096463     353452       2.00
>               pgalloc_movable          0          0          0       0.00
>                pgalloc_normal   18742440   16802065   -1940375     -11.00
>                 pgalloc_total   32485451   30898528   -1586923      -5.00
>                  pgdeactivate    4262210    2930670   -1331540     -32.00
>                       pgfault   30812334   31085065     272731       0.00
>                        pgfree   33553970   31765164   -1788806      -6.00
>                  pginodesteal      33411      15084     -18327     -55.00
>                   pglazyfreed          0          0          0       0.00
>                    pgmajfault     551312    1508299     956987     173.00
>                pgmigrate_fail      43927      29330     -14597     -34.00
>             pgmigrate_success    1399851    1203922    -195929     -14.00
>                        pgpgin   24141776   19032156   -5109620     -22.00
>                       pgpgout     959344    1103316     143972      15.00
>                  pgpgoutclean    4639732    3765868    -873864     -19.00
>                      pgrefill    4884560    3006938   -1877622     -39.00
>                     pgrotated      37828      25897     -11931     -32.00
>                 pgscan_direct    1456037     957567    -498470     -35.00
>        pgscan_direct_throttle          0          0          0       0.00
>                 pgscan_kswapd    6667767    5047360   -1620407     -25.00
>                  pgscan_total    8123804    6004927   -2118877     -27.00
>                    pgskip_dma          0          0          0       0.00
>                pgskip_movable          0          0          0       0.00
>                 pgskip_normal      14907      25382      10475      70.00
>                  pgskip_total      14907      25382      10475      70.00
>                pgsteal_direct    1118986     690215    -428771     -39.00
>                pgsteal_kswapd    4750223    3657107   -1093116     -24.00
>                 pgsteal_total    5869209    4347322   -1521887     -26.00
>                        pswpin     417613    1392647     975034     233.00
>                       pswpout    1274224    2661731    1387507     108.00
>                 slabs_scanned   13686905   10807200   -2879705     -22.00
>           workingset_activate     668966     569444     -99522     -15.00
>        workingset_nodereclaim      38957      32621      -6336     -17.00
>            workingset_refault    2816795    2179782    -637013     -23.00
>            workingset_restore     294320     168601    -125719     -43.00
>
> pgmajfault increased by 173% because swapin increased by 200% due to the
> process_madvise hint. However, a swap read from zram is much cheaper than
> file IO from a performance point of view, and an app hot start via swapin is
> also cheaper than a cold start from the beginning of the app, which needs a
> lot of IO from storage plus initialization steps.
>
> This patchset is against next-20190517.
>
> Minchan Kim (7):
>   mm: introduce MADV_COOL
>   mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM
>   mm: introduce MADV_COLD
>   mm: factor out madvise's core functionality
>   mm: introduce external memory hinting API
>   mm: extend process_madvise syscall to support vector array
>   mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
>
>  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>  include/linux/page-flags.h             |   1 +
>  include/linux/page_idle.h              |  15 +
>  include/linux/proc_fs.h                |   1 +
>  include/linux/swap.h                   |   2 +
>  include/linux/syscalls.h               |   2 +
>  include/uapi/asm-generic/mman-common.h |  12 +
>  include/uapi/asm-generic/unistd.h      |   2 +
>  kernel/signal.c                        |   2 +-
>  kernel/sys_ni.c                        |   1 +
>  mm/madvise.c                           | 600 +++++++++++++++++++++----
>  mm/swap.c                              |  43 ++
>  mm/vmscan.c                            |  80 +++-
>  14 files changed, 680 insertions(+), 83 deletions(-)
>
> --
> 2.21.0.1020.gf2820cf01a-goog
>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 0/7] introduce memory hinting API for external process
       [not found]     ` <1754d0ef-6756-d88b-f728-17b1fe5d5b07@arm.com>
@ 2019-05-21 12:56       ` Shakeel Butt
  2019-05-22  4:23         ` Brian Geffon
  0 siblings, 1 reply; 68+ messages in thread
From: Shakeel Butt @ 2019-05-21 12:56 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Tim Murray, Minchan Kim, Andrew Morton, LKML, linux-mm,
	Michal Hocko, Johannes Weiner, Joel Fernandes,
	Suren Baghdasaryan, Daniel Colascione, Sonny Rao, Brian Geffon,
	linux-api

On Mon, May 20, 2019 at 7:55 PM Anshuman Khandual
<anshuman.khandual@arm.com> wrote:
>
>
>
> On 05/20/2019 10:29 PM, Tim Murray wrote:
> > On Sun, May 19, 2019 at 11:37 PM Anshuman Khandual
> > <anshuman.khandual@arm.com> wrote:
> >>
> >> Or is the objective here to reduce the number of processes which get killed by
> >> lmkd, by triggering swapping of the unused (user-hinted) memory sooner so that
> >> they don't get picked by lmkd? Is under-utilization of the zram hardware a
> >> concern here as well?
> >
> > The objective is to avoid some instances of memory pressure by
> > proactively swapping pages that userspace knows to be cold before
> > those pages reach the end of the LRUs, which in turn can prevent some
> > apps from being killed by lmk/lmkd. As soon as Android userspace knows
> > that an application is not being used and is only resident to improve
> > performance if the user returns to that app, we can kick off
> > process_madvise on that process's pages (or some portion of those
> > pages) in a power-efficient way to reduce memory pressure long before
> > the system hits the free page watermark. This allows the system more
> > time to put pages into zram versus waiting for the watermark to
> > trigger kswapd, which decreases the likelihood that later memory
> > allocations will cause enough pressure to trigger a kill of one of
> > these apps.
>
> So this opens up a bit of LRU management to user space hints. Also, because the
> app itself won't know about the memory situation of the entire system, the new
> system call needs to be called from an external process.
>
> >
> >> Swapping out memory into zram won't increase the latency of a hot start? Or
> >> is it because it will prevent a fresh cold start, which anyway will be slower
> >> than a slow hot start? Just being curious.
> >
> > First, not all swapped pages will be reloaded immediately once an app
> > is resumed. We've found that an app's working set post-process_madvise
> > is significantly smaller than what an app allocates when it first
> > launches (see the delta between pswpin and pswpout in Minchan's
> > results). Presumably because of this, faulting to fetch from zram does
>
> pswpin      417613    1392647     975034     233.00
> pswpout    1274224    2661731    1387507     108.00
>
> IIUC the swap-in ratio is way higher in comparison to that of swap-out. Is that
> always the case? Or does it tend to swap out from an active area of the working
> set, which then faults back in again?
>
> > not seem to introduce a noticeable hot start penalty, nor does it
> > cause an increase in performance problems later in the app's
> > lifecycle. I've measured with and without process_madvise, and the
> > differences are within our noise bounds. Second, because we're not
>
> That is assuming that the post-process_madvise() working set of the application
> is always smaller. There is another challenge: the external process should
> ideally know the active areas of the application's working set in order to
> invoke process_madvise() correctly and prevent such scenarios.
>
> > preemptively evicting file pages and only making them more likely to
> > be evicted when there's already memory pressure, we avoid the case
> > where we process_madvise an app then immediately return to the app and
> > reload all file pages in the working set even though there was no
> > intervening memory pressure. Our initial version of this work evicted
>
> That would be the worst case scenario, which should be avoided. Memory pressure
> must be a parameter before actually doing the swap-out. But pages, if known to
> be inactive/cold, can be marked high priority to be swapped out.
>
> > file pages preemptively and did cause a noticeable slowdown (~15%) for
> > that case; this patch set avoids that slowdown. Finally, the benefit
> > from avoiding cold starts is huge. The performance improvement from
> > having a hot start instead of a cold start ranges from 3x for very
> > small apps to 50x+ for larger apps like high-fidelity games.
>
> Is there any other real world scenario, apart from this app-based ecosystem,
> where user-hinted LRU management might be helpful? Just being curious. Thanks
> for the detailed explanation. I will continue looking into this series.

Chrome OS is another real world use case for this user-hinted LRU
management approach, proactively reclaiming memory from tabs not
accessed by the user for some time.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-21  2:55     ` Minchan Kim
  2019-05-21  6:26       ` Michal Hocko
@ 2019-05-21 15:33       ` Johannes Weiner
  2019-05-22  1:50         ` Minchan Kim
  1 sibling, 1 reply; 68+ messages in thread
From: Johannes Weiner @ 2019-05-21 15:33 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Michal Hocko, Andrew Morton, LKML, linux-mm, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Tue, May 21, 2019 at 11:55:33AM +0900, Minchan Kim wrote:
> On Mon, May 20, 2019 at 11:28:01AM +0200, Michal Hocko wrote:
> > [cc linux-api]
> > 
> > On Mon 20-05-19 12:52:54, Minchan Kim wrote:
> > > A system could have a much faster swap device, like zram. In that case,
> > > swapping is much cheaper than file IO on low-end storage.
> > > In this configuration, userspace could apply a different strategy to each
> > > kind of vma. IOW, they want to reclaim anonymous pages via MADV_COLD
> > > while keeping file-backed pages in the inactive LRU via MADV_COOL, because
> > > file IO is more expensive in this case, so they want to keep them in memory
> > > until memory pressure happens.
> > > 
> > > To make such a strategy easier to support, this patch introduces the
> > > MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER options in madvise(2), much as
> > > /proc/<pid>/clear_refs already supports the same filters.
> > > These filters can be ORed with other existing hints using the top two bits
> > > of (int behavior).
> > 
> > madvise operates on top of ranges and it is quite trivial to do the
> > filtering from the userspace so why do we need any additional filtering?
> > 
> > > Once either of them is set, the hint affects only the VMAs of interest,
> > > either anonymous or file-backed.
> > > 
> > > With that, the user can call the process_madvise syscall simply with the
> > > entire range (0x0 - 0xFFFFFFFFFFFFFFFF) plus either MADV_ANONYMOUS_FILTER
> > > or MADV_FILE_FILTER, so there is no need to call the syscall range by range.
> > 
> > OK, so here is the reason you want that. The immediate question is why
> > cannot the monitor do the filtering from the userspace. Slightly more
> > work, all right, but less of an API to expose and that itself is a
> > strong argument against.
> 
> What I would have to do if we don't have such a filter option is enumerate
> all of the VMAs via /proc/<pid>/maps and then parse every range and inode
> from strings, which would be painful for 2000+ VMAs.

Just out of curiosity, how do you get to 2000+ distinct memory regions
in the address space of a mobile app? I'm assuming these aren't files,
but rather anon objects with poor grouping. Is that from guard pages
between individual heap allocations or something?

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-21 15:33       ` Johannes Weiner
@ 2019-05-22  1:50         ` Minchan Kim
  0 siblings, 0 replies; 68+ messages in thread
From: Minchan Kim @ 2019-05-22  1:50 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Andrew Morton, LKML, linux-mm, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Tue, May 21, 2019 at 11:33:10AM -0400, Johannes Weiner wrote:
> On Tue, May 21, 2019 at 11:55:33AM +0900, Minchan Kim wrote:
> > On Mon, May 20, 2019 at 11:28:01AM +0200, Michal Hocko wrote:
> > > [cc linux-api]
> > > 
> > > On Mon 20-05-19 12:52:54, Minchan Kim wrote:
> > > > A system could have a much faster swap device, like zram. In that case,
> > > > swapping is much cheaper than file IO on low-end storage.
> > > > In this configuration, userspace could apply a different strategy to each
> > > > kind of vma. IOW, they want to reclaim anonymous pages via MADV_COLD
> > > > while keeping file-backed pages in the inactive LRU via MADV_COOL, because
> > > > file IO is more expensive in this case, so they want to keep them in memory
> > > > until memory pressure happens.
> > > > 
> > > > To make such a strategy easier to support, this patch introduces the
> > > > MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER options in madvise(2), much as
> > > > /proc/<pid>/clear_refs already supports the same filters.
> > > > These filters can be ORed with other existing hints using the top two bits
> > > > of (int behavior).
> > > 
> > > madvise operates on top of ranges and it is quite trivial to do the
> > > filtering from the userspace so why do we need any additional filtering?
> > > 
> > > > Once either of them is set, the hint affects only the VMAs of interest,
> > > > either anonymous or file-backed.
> > > > 
> > > > With that, the user can call the process_madvise syscall simply with the
> > > > entire range (0x0 - 0xFFFFFFFFFFFFFFFF) plus either MADV_ANONYMOUS_FILTER
> > > > or MADV_FILE_FILTER, so there is no need to call the syscall range by range.
> > > 
> > > OK, so here is the reason you want that. The immediate question is why
> > > cannot the monitor do the filtering from the userspace. Slightly more
> > > work, all right, but less of an API to expose and that itself is a
> > > strong argument against.
> > 
> > What I would have to do if we don't have such a filter option is enumerate
> > all of the VMAs via /proc/<pid>/maps and then parse every range and inode
> > from strings, which would be painful for 2000+ VMAs.
> 
> Just out of curiosity, how do you get to 2000+ distinct memory regions
> in the address space of a mobile app? I'm assuming these aren't files,
> but rather anon objects with poor grouping. Is that from guard pages
> between individual heap allocations or something?

Android uses a preload library model to speed up app launch, so it loads
all of the libraries in advance in zygote and forks new apps from it.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 0/7] introduce memory hinting API for external process
  2019-05-21 12:56       ` Shakeel Butt
@ 2019-05-22  4:23         ` Brian Geffon
  0 siblings, 0 replies; 68+ messages in thread
From: Brian Geffon @ 2019-05-22  4:23 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Anshuman Khandual, Tim Murray, Minchan Kim, Andrew Morton, LKML,
	linux-mm, Michal Hocko, Johannes Weiner, Joel Fernandes,
	Suren Baghdasaryan, Daniel Colascione, Sonny Rao, linux-api

To expand on the ChromeOS use case: we're in a very similar situation
to Android. For example, the Chrome browser uses a separate process
for each individual tab (with some exceptions), and over time many tabs
remain open in a backgrounded or idle state. Given that we have a lot
of information about the weight of a tab, when it was last active,
etc, we can benefit tremendously from per-process reclaim. We're
working on getting real world numbers but all of our initial testing
shows very promising results.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector array
  2019-05-21 10:37           ` Michal Hocko
@ 2019-05-27  7:49             ` Minchan Kim
  2019-05-29 10:08               ` Daniel Colascione
  0 siblings, 1 reply; 68+ messages in thread
From: Minchan Kim @ 2019-05-27  7:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Tue, May 21, 2019 at 12:37:26PM +0200, Michal Hocko wrote:
> On Tue 21-05-19 19:26:13, Minchan Kim wrote:
> > On Tue, May 21, 2019 at 08:24:21AM +0200, Michal Hocko wrote:
> > > On Tue 21-05-19 11:48:20, Minchan Kim wrote:
> > > > On Mon, May 20, 2019 at 11:22:58AM +0200, Michal Hocko wrote:
> > > > > [Cc linux-api]
> > > > > 
> > > > > On Mon 20-05-19 12:52:53, Minchan Kim wrote:
> > > > > > Currently, the process_madvise syscall works on only one address range,
> > > > > > so the user has to call the syscall several times to give hints to
> > > > > > multiple address ranges.
> > > > > 
> > > > > Is that a problem? How big of a problem? Any numbers?
> > > > 
> > > > We easily have 2000+ VMAs, so it's not a trivial overhead. I will come up
> > > > with numbers in the description at respin.
> > > 
> > > Does this really have to be a fast operation? I would expect the monitor
> > > is by no means a fast path. The system call overhead is not what it used
> > > to be, sigh, but still for something that is not a hot path it should be
> > > tolerable, especially when the whole operation is quite expensive on its
> > > own (wrt. the syscall entry/exit).
> > 
> > What's different with process_vm_[readv|writev] and vmsplice?
> > If there are a lot of ranges to be covered, a vector operation makes sense
> > to me.
> 
> I am not saying that the vector API is wrong. All I am trying to say is
> that the benefit is not really clear so far. If you want to push it
> through then you had better get some supporting data.

I measured 1000 madvise syscalls vs. a single vectored syscall with 1000
ranges on a modern ARM64 device. Even though I saw a 15% improvement, the
absolute gain is just 1ms, so I don't think it's worth supporting.
I will drop the vector support in the next revision.
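
For context, the two call patterns being compared look roughly like this
(a minimal userspace sketch; process_madvise1() and process_madvise_vec()
are hypothetical stand-ins for the RFC's single-range and vectored forms,
not the actual proposed signatures):

#include <stddef.h>
#include <sys/mman.h>
#include <sys/uio.h>

#ifndef MADV_COLD               /* from this patch series; not yet in headers */
#define MADV_COLD 20
#endif

/* Hypothetical prototypes, for illustration only. */
long process_madvise1(int pidfd, void *addr, size_t len, int behavior);
long process_madvise_vec(int pidfd, const struct iovec *vec, size_t vlen,
                         int behavior);

static void hint_cold(int pidfd, struct iovec *ranges, size_t n)
{
        size_t i;

        /* One syscall per range: n syscall entries/exits. */
        for (i = 0; i < n; i++)
                process_madvise1(pidfd, ranges[i].iov_base,
                                 ranges[i].iov_len, MADV_COLD);

        /* Vectored form: a single syscall covering all n ranges. */
        process_madvise_vec(pidfd, ranges, n, MADV_COLD);
}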

Thanks for the review, Michal!

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-21  6:26       ` Michal Hocko
@ 2019-05-27  7:58         ` Minchan Kim
  2019-05-27 12:44           ` Michal Hocko
  0 siblings, 1 reply; 68+ messages in thread
From: Minchan Kim @ 2019-05-27  7:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Tue, May 21, 2019 at 08:26:28AM +0200, Michal Hocko wrote:
> On Tue 21-05-19 11:55:33, Minchan Kim wrote:
> > On Mon, May 20, 2019 at 11:28:01AM +0200, Michal Hocko wrote:
> > > [cc linux-api]
> > > 
> > > On Mon 20-05-19 12:52:54, Minchan Kim wrote:
> > > > A system could have a much faster swap device, like zRAM. In that case,
> > > > swapping is far cheaper than file IO on the low-end storage.
> > > > In this configuration, userspace could apply a different strategy to
> > > > each kind of vma. IOW, it may want to reclaim anonymous pages with
> > > > MADV_COLD while keeping file-backed pages on the inactive LRU with
> > > > MADV_COOL, because file IO is more expensive in this case, so it wants
> > > > to keep them in memory until memory pressure happens.
> > > > 
> > > > To make such a strategy easier to implement, this patch introduces the
> > > > MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER options in madvise(2), similar
> > > > to the filters /proc/<pid>/clear_refs already supports.
> > > > These filters can be ORed with other existing hints using the top two
> > > > bits of (int behavior).
> > > 
> > > madvise operates on top of ranges, and it is quite trivial to do the
> > > filtering from userspace, so why do we need any additional filtering?
> > > 
> > > > Once either of them is set, the hint affects only the vmas of interest,
> > > > either anonymous or file-backed.
> > > > 
> > > > With that, the user can call process_madvise once with the entire
> > > > range (0x0 - 0xFFFFFFFFFFFFFFFF) plus either MADV_ANONYMOUS_FILTER or
> > > > MADV_FILE_FILTER, so there is no need to call the syscall range by range.
> > > 
> > > OK, so here is the reason you want that. The immediate question is why
> > > the monitor cannot do the filtering from userspace. Slightly more
> > > work, all right, but less of an API to expose, and that itself is a
> > > strong argument against.
> > 
> > What I would have to do without such a filter option is enumerate all of the
> > vmas via /proc/<pid>/maps and then parse every range and inode from strings,
> > which would be painful for 2000+ vmas.
> 
> Painful is not an argument to add a new user API. If the existing API
> suits the purpose then it should be used. If it is not usable, we can
> think of a different way.

I measured the overhead of parsing 1568 vmas from /proc/<pid>/maps on a
modern ARM64 mobile CPU. It takes 60ms to 185ms on big cores, depending on
the cpu governor. That is far from trivial.
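
The cost is in text parsing along these lines (a minimal sketch of what a
monitor has to do today; error handling trimmed):

#include <stdio.h>

/* Walk /proc/<pid>/maps and extract each vma's address range. Every
 * line must be read and parsed as text, which is where the 60-185ms
 * for ~1500 vmas goes. */
static void dump_ranges(int pid)
{
        char path[64], line[512];
        unsigned long start, end;
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%d/maps", pid);
        f = fopen(path, "r");
        if (!f)
                return;
        while (fgets(line, sizeof(line), f)) {
                if (sscanf(line, "%lx-%lx", &start, &end) == 2)
                        ;       /* record [start, end) for a later hint */
        }
        fclose(f);
}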

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-27  7:58         ` Minchan Kim
@ 2019-05-27 12:44           ` Michal Hocko
  2019-05-28  3:26             ` Minchan Kim
  0 siblings, 1 reply; 68+ messages in thread
From: Michal Hocko @ 2019-05-27 12:44 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Mon 27-05-19 16:58:11, Minchan Kim wrote:
[...]
> I measured the overhead of parsing 1568 vmas from /proc/<pid>/maps on a
> modern ARM64 mobile CPU. It takes 60ms to 185ms on big cores, depending on
> the cpu governor. That is far from trivial.

This is not the only option. Have you tried simply using the
/proc/<pid>/map_files interface? It will provide you with all the
file-backed mappings.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-27 12:44           ` Michal Hocko
@ 2019-05-28  3:26             ` Minchan Kim
  2019-05-28  6:29               ` Michal Hocko
  0 siblings, 1 reply; 68+ messages in thread
From: Minchan Kim @ 2019-05-28  3:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Mon, May 27, 2019 at 02:44:11PM +0200, Michal Hocko wrote:
[...]
> This is not the only option. Have you tried simply using the
> /proc/<pid>/map_files interface? It will provide you with all the
> file-backed mappings.

I compared maps vs. map_files with 3036 file-backed vmas.
The test scenario is to dump all of the vmas of the process and parse the
address ranges.
For map_files, it's easy to parse each address range because the directory
name itself is the range. However, in the case of maps, I need to parse each
range line by line, so all of the lines have to be scanned.

(maps covers additional non-file-backed vmas, so nr_vma is a little bigger)

performance mode:
map_files: nr_vma 3036 usec 13387
maps     : nr_vma 3078 usec 12923

powersave mode:

map_files: nr_vma 3036 usec 52614
maps     : nr_vma 3078 usec 41089

map_files is slower than maps if we dump all of the vmas. I guess the
directory operations involve much more work (e.g., dentry lookup,
instantiation) compared to maps.
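
For reference, the map_files variant being timed pulls the range straight
from each directory entry name (a minimal sketch; error handling trimmed):

#include <dirent.h>
#include <stdio.h>

/* Each entry in /proc/<pid>/map_files is named "<start>-<end>", so no
 * per-line text scanning of maps is needed -- but every entry still
 * costs the kernel a dentry lookup/instantiation, which may be where
 * the extra time above goes. */
static void dump_file_ranges(int pid)
{
        char path[64];
        unsigned long start, end;
        struct dirent *de;
        DIR *dir;

        snprintf(path, sizeof(path), "/proc/%d/map_files", pid);
        dir = opendir(path);
        if (!dir)
                return;
        while ((de = readdir(dir))) {
                if (sscanf(de->d_name, "%lx-%lx", &start, &end) == 2)
                        ;       /* record file-backed range [start, end) */
        }
        closedir(dir);
}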

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28  3:26             ` Minchan Kim
@ 2019-05-28  6:29               ` Michal Hocko
  2019-05-28  8:13                 ` Minchan Kim
  0 siblings, 1 reply; 68+ messages in thread
From: Michal Hocko @ 2019-05-28  6:29 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Tue 28-05-19 12:26:32, Minchan Kim wrote:
[...]
> I compared maps vs. map_files with 3036 file-backed vmas.
> The test scenario is to dump all of the vmas of the process and parse the
> address ranges.
> For map_files, it's easy to parse each address range because the directory
> name itself is the range. However, in the case of maps, I need to parse each
> range line by line, so all of the lines have to be scanned.
> 
> (maps covers additional non-file-backed vmas, so nr_vma is a little bigger)
> 
> performance mode:
> map_files: nr_vma 3036 usec 13387
> maps     : nr_vma 3078 usec 12923
> 
> powersave mode:
> 
> map_files: nr_vma 3036 usec 52614
> maps     : nr_vma 3078 usec 41089
> 
> map_files is slower than maps if we dump all of the vmas. I guess the
> directory operations involve much more work (e.g., dentry lookup,
> instantiation) compared to maps.

OK, that is somewhat surprising. I am still not convinced the filter is a
good idea though. The primary reason is that it encourages using madvise
on a wide range without having a clue what the range contains, e.g.
passing the full address range and relying on the right thing happening.
Do we really want madvise to operate in that mode?

Btw. if we went with the per vma fd approach then you would get this
feature automatically because map_files would refer to file backed
mappings while map_anon could refer only to anonymous mappings.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28  6:29               ` Michal Hocko
@ 2019-05-28  8:13                 ` Minchan Kim
  2019-05-28  8:31                   ` Daniel Colascione
  0 siblings, 1 reply; 68+ messages in thread
From: Minchan Kim @ 2019-05-28  8:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Tue, May 28, 2019 at 08:29:47AM +0200, Michal Hocko wrote:
[...]
> OK, that is somewhat surprising. I am still not convinced the filter is a
> good idea though. The primary reason is that it encourages using madvise
> on a wide range without having a clue what the range contains, e.g.
> passing the full address range and relying on the right thing happening.
> Do we really want madvise to operate in that mode?

If a userspace daemon (e.g., the activity manager service) knows that a
certain process has been in the background and idle for a while, then yes,
that would be a good option.
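
Concretely, that wide-range call would look something like this (a sketch
only; the pid-based process_madvise prototype and the filter bit values
are illustrative, not the RFC's exact definitions):

#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_COLD               /* from this patch series */
#define MADV_COLD 20
#endif

/* Illustrative filter bits: the series describes them as the top two
 * bits of (int behavior); the exact values may differ from the RFC. */
#define MADV_ANONYMOUS_FILTER   (1U << 31)
#define MADV_FILE_FILTER        (1U << 30)

/* Hypothetical pid-based prototype standing in for process_madvise. */
long process_madvise(int pidfd, void *addr, size_t len, int behavior);

/* One call over the entire address space; the filter restricts the
 * effect to anonymous vmas, so no /proc/<pid>/maps parsing is needed. */
static void reclaim_anon(int pidfd)
{
        process_madvise(pidfd, 0, ~0UL, MADV_COLD | MADV_ANONYMOUS_FILTER);
}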

> 
> Btw. if we went with the per vma fd approach then you would get this
> feature automatically because map_files would refer to file backed
> mappings while map_anon could refer only to anonymous mappings.

The reason to add such a filter option is to avoid the parsing overhead,
so map_anon wouldn't be helpful.


> 
> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28  8:13                 ` Minchan Kim
@ 2019-05-28  8:31                   ` Daniel Colascione
  2019-05-28  8:49                     ` Minchan Kim
  0 siblings, 1 reply; 68+ messages in thread
From: Daniel Colascione @ 2019-05-28  8:31 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Michal Hocko, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
	Sonny Rao, Brian Geffon, Linux API

On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote:
> > if we went with the per vma fd approach then you would get this
> > feature automatically because map_files would refer to file backed
> > mappings while map_anon could refer only to anonymous mappings.
>
> The reason to add such a filter option is to avoid the parsing overhead,
> so map_anon wouldn't be helpful.

Without chiming in on whether the filter option is a good idea, I'd like
to suggest providing an efficient binary interface for pulling memory map
information out of processes. Some single-system-call method for
retrieving a binary snapshot of a process's address space, complete with
attributes (selectable, like statx?) for each VMA, would reduce
complexity and increase performance in a variety of areas, e.g., Android
memory map debugging commands.
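
A sketch of what such a binary, statx-like snapshot interface could look
like (entirely hypothetical; every name and field below is illustrative):

#include <stddef.h>
#include <stdint.h>

/* Hypothetical selectable attributes, statx-style. */
#define VMA_ATTR_RANGE  (1u << 0)
#define VMA_ATTR_PROT   (1u << 1)
#define VMA_ATTR_FILE   (1u << 2)

/* Hypothetical fixed-size VMA descriptor. */
struct vma_info {
        uint64_t start;
        uint64_t end;
        uint32_t prot;          /* PROT_* bits */
        uint32_t flags;         /* anon/file, shared/private, ... */
        uint64_t file_ino;      /* valid if VMA_ATTR_FILE was requested */
};

/* Hypothetical: fills 'buf' with up to 'count' descriptors in one call
 * and returns the number of vmas in the target process. */
long process_vm_getinfo(int pidfd, uint32_t attr_mask,
                        struct vma_info *buf, size_t count);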

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28  8:31                   ` Daniel Colascione
@ 2019-05-28  8:49                     ` Minchan Kim
  2019-05-28  9:08                       ` Michal Hocko
  0 siblings, 1 reply; 68+ messages in thread
From: Minchan Kim @ 2019-05-28  8:49 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Michal Hocko, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
	Sonny Rao, Brian Geffon, Linux API

On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote:
[...]
> Without chiming in on whether the filter option is a good idea, I'd like
> to suggest providing an efficient binary interface for pulling memory map
> information out of processes. Some single-system-call method for
> retrieving a binary snapshot of a process's address space, complete with
> attributes (selectable, like statx?) for each VMA, would reduce
> complexity and increase performance in a variety of areas, e.g., Android
> memory map debugging commands.

I agree it's the best we can get *generally*.
Michal, any opinion?

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28  8:49                     ` Minchan Kim
@ 2019-05-28  9:08                       ` Michal Hocko
  2019-05-28  9:39                         ` Daniel Colascione
  2019-05-28 10:32                         ` Minchan Kim
  0 siblings, 2 replies; 68+ messages in thread
From: Michal Hocko @ 2019-05-28  9:08 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Daniel Colascione, Andrew Morton, LKML, linux-mm,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Shakeel Butt, Sonny Rao, Brian Geffon, Linux API

On Tue 28-05-19 17:49:27, Minchan Kim wrote:
> On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote:
[...]
> > Without chiming in on whether the filter option is a good idea, I'd like
> > to suggest providing an efficient binary interface for pulling memory map
> > information out of processes. Some single-system-call method for
> > retrieving a binary snapshot of a process's address space, complete with
> > attributes (selectable, like statx?) for each VMA, would reduce
> > complexity and increase performance in a variety of areas, e.g., Android
> > memory map debugging commands.
> 
> I agree it's the best we can get *generally*.
> Michal, any opinion?

I am not really sure this is directly related. I think the primary
question that we have to sort out first is whether we want the remote
madvise call to be process based or vma fd based. This is an important
distinction wrt. usability. I have only seen pid vs. pidfd discussions
so far, unfortunately.

An interface to query address range information is a separate, although
related, topic. We have /proc/<pid>/[s]maps for that right now and I
understand it is not a general win for all usecases because it tends to
be slow for some. I can see how /proc/<pid>/map_anons could provide
per-vma information in a binary form via a fd based interface. But I
would rather not conflate those two discussions much - well, except if
it could give one of the approaches more justification - but let's focus
on the madvise part first.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28  9:08                       ` Michal Hocko
@ 2019-05-28  9:39                         ` Daniel Colascione
  2019-05-28 10:33                           ` Michal Hocko
  2019-05-28 10:32                         ` Minchan Kim
  1 sibling, 1 reply; 68+ messages in thread
From: Daniel Colascione @ 2019-05-28  9:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
	Sonny Rao, Brian Geffon, Linux API

On Tue, May 28, 2019 at 2:08 AM Michal Hocko <mhocko@kernel.org> wrote:
[...]
> I am not really sure this is directly related. I think the primary
> question that we have to sort out first is whether we want the remote
> madvise call to be process based or vma fd based. This is an important
> distinction wrt. usability. I have only seen pid vs. pidfd discussions
> so far, unfortunately.

I don't think the vma fd approach is viable. We have some processes
with a *lot* of VMAs --- system_server had 4204 when I checked just
now (and that's typical) --- and an FD operation per VMA would be
excessive. VMAs also come and go pretty easily depending on changes in
protections and various faults. It's also not entirely clear what the
semantics of vma FDs should be over address space mutations, while the
semantics of address ranges are well-understood. I would much prefer
an interface operating on address ranges to one operating on VMA FDs,
both for efficiency and for consistency with other memory management
APIs.

> An interface to query address range information is a separate, although
> related, topic. We have /proc/<pid>/[s]maps for that right now and I
> understand it is not a general win for all usecases because it tends to
> be slow for some. I can see how /proc/<pid>/map_anons could provide
> per-vma information in a binary form via a fd based interface. But I
> would rather not conflate those two discussions much - well, except if
> it could give one of the approaches more justification - but let's focus
> on the madvise part first.

I don't think it's a good idea to focus on one feature in a
multi-feature change when the interactions between features can be
very important for overall design of the multi-feature system and the
design of each feature.

Here's my thinking on the high-level design:

I'm imagining an address-range system that would work like this: we'd
create some kind of process_vm_getinfo(2) system call [1] that would
accept a statx-like attribute map and a pid/fd parameter as input and
return, on output, two things: 1) an array [2] of VMA descriptors
containing the requested information, and 2) a VMA configuration
sequence number. We'd then have process_madvise() and other
cross-process VM interfaces accept both address ranges and this
sequence number; they'd succeed only if the VMA configuration sequence
number is still current, i.e., the target process hasn't changed its
VMA configuration (implicitly or explicitly) since the call to
process_vm_getinfo().

This way, a process A that wants to perform some VM operation on
process B can slurp B's VMA configuration using process_vm_getinfo(),
figure out what it wants to do, and attempt to do it. If B modifies
its memory map in the meantime, so that A finds its local knowledge of
B's memory map has become invalid between the process_vm_getinfo() and
A taking some action based on the result, A can retry [3]. While A
could instead ptrace or otherwise suspend B, *then* read B's memory
map (knowing B is quiescent), *then* operate on B, the optimistic
approach I'm describing would be much lighter-weight in the typical
case. It's also pretty simple, IMHO. If the "operate on B" step is
some kind of vectorized operation over multiple address ranges, this
approach also gets us all-or-nothing semantics.

Or maybe the whole sequence number thing is overkill and we don't need
atomicity? But if there's a concern that A shouldn't operate on B's
memory without knowing what it's operating on, then the scheme I've
proposed above solves this knowledge problem in a pretty lightweight
way.

[1] or some other interface
[2] or something more complicated if we want the descriptors to
contain variable-length elements, e.g., strings
[3] or override the sequence number check if it's feeling bold?
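
A sketch of the optimistic retry pattern this design would enable (all
names and signatures here are hypothetical, following the proposal above;
error handling trimmed):

#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Trimmed-down version of the hypothetical descriptor sketched earlier. */
struct vma_info { uint64_t start, end; };

/* Hypothetical: snapshots up to 'count' descriptors, stores a sequence
 * number that changes whenever the target's map changes, and returns
 * how many vmas were filled in. */
long process_vm_getinfo(int pidfd, struct vma_info *buf, size_t count,
                        uint64_t *seq);

/* Hypothetical: fails with errno == EAGAIN if 'seq' is stale, i.e. the
 * target changed its map after the snapshot was taken. */
long process_madvise_seq(int pidfd, uint64_t seq, const struct iovec *vec,
                         size_t vlen, int behavior);

static void hint_remote(int pidfd, int behavior)
{
        struct vma_info vmas[1024];
        struct iovec vec[1024];
        uint64_t seq;
        long i, n;

        do {
                /* Snapshot B's map and remember its sequence number. */
                n = process_vm_getinfo(pidfd, vmas, 1024, &seq);
                for (i = 0; i < n; i++) {
                        vec[i].iov_base = (void *)(uintptr_t)vmas[i].start;
                        vec[i].iov_len = vmas[i].end - vmas[i].start;
                }
                /* All-or-nothing: succeeds only if the snapshot is still
                 * current; on EAGAIN, B changed its map, so retry. */
        } while (process_madvise_seq(pidfd, seq, vec, n, behavior) < 0 &&
                 errno == EAGAIN);
}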

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28  9:08                       ` Michal Hocko
  2019-05-28  9:39                         ` Daniel Colascione
@ 2019-05-28 10:32                         ` Minchan Kim
  2019-05-28 10:41                           ` Michal Hocko
  1 sibling, 1 reply; 68+ messages in thread
From: Minchan Kim @ 2019-05-28 10:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Daniel Colascione, Andrew Morton, LKML, linux-mm,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Shakeel Butt, Sonny Rao, Brian Geffon, Linux API

On Tue, May 28, 2019 at 11:08:21AM +0200, Michal Hocko wrote:
[...]
> I am not really sure this is directly related. I think the primary
> question that we have to sort out first is whether we want the remote
> madvise call to be process based or vma fd based. This is an important
> distinction wrt. usability. I have only seen pid vs. pidfd discussions
> so far, unfortunately.

With the current usecase, it's a per-process API with distinguishable
anon/file filters, but I thought it could easily be extended later for
per-address-range operation as userspace gets smarter with more information.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28  9:39                         ` Daniel Colascione
@ 2019-05-28 10:33                           ` Michal Hocko
  2019-05-28 11:21                             ` Daniel Colascione
  0 siblings, 1 reply; 68+ messages in thread
From: Michal Hocko @ 2019-05-28 10:33 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
	Sonny Rao, Brian Geffon, Linux API

On Tue 28-05-19 02:39:03, Daniel Colascione wrote:
[...]
> 
> I don't think the vma fd approach is viable. We have some processes
> with a *lot* of VMAs --- system_server had 4204 when I checked just
> now (and that's typical) --- and an FD operation per VMA would be
> excessive.

What do you mean by excessive here? Do you expect the process to have
them open all at once?

> VMAs also come and go pretty easily depending on changes in
> protections and various faults.

Is this really too much different from /proc/<pid>/map_files?

[...]

> > An interface to query address range information is a separate, although
> > related, topic. We have /proc/<pid>/[s]maps for that right now and I
> > understand it is not a general win for all usecases because it tends to
> > be slow for some. I can see how /proc/<pid>/map_anons could provide
> > per-vma information in a binary form via a fd based interface. But I
> > would rather not conflate those two discussions much - well, except if
> > it could give one of the approaches more justification - but let's focus
> > on the madvise part first.
> 
> I don't think it's a good idea to focus on one feature in a
> multi-feature change when the interactions between features can be
> very important for overall design of the multi-feature system and the
> design of each feature.
> 
> Here's my thinking on the high-level design:
> 
> I'm imagining an address-range system that would work like this: we'd
> create some kind of process_vm_getinfo(2) system call [1] that would
> accept a statx-like attribute map and a pid/fd parameter as input and
> return, on output, two things: 1) an array [2] of VMA descriptors
> containing the requested information, and 2) a VMA configuration
> sequence number. We'd then have process_madvise() and other
> cross-process VM interfaces accept both address ranges and this
> sequence number; they'd succeed only if the VMA configuration sequence
> number is still current, i.e., the target process hasn't changed its
> VMA configuration (implicitly or explicitly) since the call to
> process_vm_getinfo().

The sequence number is essentially a cookie that is transparent to
userspace, right? If yes, how does it differ from a fd (returned from
/proc/<pid>/map_{anons,files}/range), which is a cookie itself and can
be used to revalidate when the operation is requested and fail if
something has changed? Moreover, we already have a fd based madvise
syscall, so there shouldn't really be a large need to add a new set of
syscalls.

[...]

> Or maybe the whole sequence number thing is overkill and we don't need
> atomicity? But if there's a concern that A shouldn't operate on B's
> memory without knowing what it's operating on, then the scheme I've
> proposed above solves this knowledge problem in a pretty lightweight
> way.

This is the main question here. Do we really want to enforce external
synchronization between the two processes to make sure that they are
both operating on the same range - aka protect from the range going away
and being reused for a different purpose? Right now it wouldn't be fatal,
because both operations are non-destructive, but I can imagine that there
will be more madvise operations to follow (including destructive ones)
because people will simply find usecases for that. This should be
reflected in the proposed API.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28 10:32                         ` Minchan Kim
@ 2019-05-28 10:41                           ` Michal Hocko
  2019-05-28 11:12                             ` Minchan Kim
  2019-05-28 11:28                             ` Daniel Colascione
  0 siblings, 2 replies; 68+ messages in thread
From: Michal Hocko @ 2019-05-28 10:41 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Daniel Colascione, Andrew Morton, LKML, linux-mm,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Shakeel Butt, Sonny Rao, Brian Geffon, Linux API

On Tue 28-05-19 19:32:56, Minchan Kim wrote:
[...]
> With the current usecase, it's a per-process API with distinguishable
> anon/file filters, but I thought it could easily be extended later for
> per-address-range operation as userspace gets smarter with more information.

Never design a user API based on a single usecase, please. The "easily
extended" part is by far not clear to me, TBH. As I've already mentioned
several times, the synchronization model has to be thought through
carefully before a remote process address range operation can be
implemented.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28 10:41                           ` Michal Hocko
@ 2019-05-28 11:12                             ` Minchan Kim
  2019-05-28 11:28                               ` Michal Hocko
  2019-05-28 11:28                             ` Daniel Colascione
  1 sibling, 1 reply; 68+ messages in thread
From: Minchan Kim @ 2019-05-28 11:12 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Daniel Colascione, Andrew Morton, LKML, linux-mm,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Shakeel Butt, Sonny Rao, Brian Geffon, Linux API

On Tue, May 28, 2019 at 12:41:17PM +0200, Michal Hocko wrote:
[...]
> Never design a user API based on a single usecase, please. The "easily
> extended" part is by far not clear to me, TBH. As I've already mentioned
> several times, the synchronization model has to be thought through
> carefully before a remote process address range operation can be
> implemented.

I agree with you that we shouldn't design an API around a single usecase,
but what you are concerned about is actually not our usecase, because we
are resilient to the race since MADV_COLD|PAGEOUT is not destructive.
Actually, many hints are already racy in that the upcoming access pattern
can differ from the behavior you expected at the moment of the call.

If you are still concerned about address range synchronization, how about
moving such hints to the per-process level, like prctl?
Does that make sense to you?

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28 10:33                           ` Michal Hocko
@ 2019-05-28 11:21                             ` Daniel Colascione
  2019-05-28 11:49                               ` Michal Hocko
  0 siblings, 1 reply; 68+ messages in thread
From: Daniel Colascione @ 2019-05-28 11:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
	Sonny Rao, Brian Geffon, Linux API

On Tue, May 28, 2019 at 3:33 AM Michal Hocko <mhocko@kernel.org> wrote:
[...]
> >
> > I don't think the vma fd approach is viable. We have some processes
> > with a *lot* of VMAs --- system_server had 4204 when I checked just
> > now (and that's typical) --- and an FD operation per VMA would be
> > excessive.
>
> What do you mean by excessive here? Do you expect the process to have
> them open all at once?

Minchan's already done timing. More broadly, in an era with various
speculative execution mitigations, making a system call is pretty
expensive. If we have two options for remote VMA manipulation, one
that requires thousands of system calls (with the count proportional
to the address space size of the process) and one that requires only a
few system calls no matter how large the target process is, the latter
ought to start off with more points than the former under any kind of
design scoring.

> > VMAs also come and go pretty easily depending on changes in
> > protections and various faults.
>
> Is this really too much different from /proc/<pid>/map_files?

It's very different. See below.

> [...]
>
> The sequence number is essentially a cookie that is transparent to
> userspace, right? If yes, how does it differ from a fd (returned from
> /proc/<pid>/map_{anons,files}/range), which is a cookie itself and can

If you want to operate on N VMAs simultaneously under an FD-per-VMA
model, you'd need to have those N FDs all open at the same time *and*
add some kind of system call that accepted those N FDs and an
operation to perform. The sequence number I'm proposing also applies
to the whole address space, not just one VMA. Even if you did have
these N FDs open all at once and supplied them all to some batch
operation, you couldn't guarantee via the FD mechanism that some *new*
VMA didn't appear in the address range you want to manipulate. A
global sequence number would catch this case. I still think supplying
a list of address ranges (like we already do for scatter-gather IO) is
less error-prone, less resource-intensive, more consistent with
existing practice, and equally flexible, especially if we start
supporting destructive cross-process memory operations, which may be
useful for things like checkpointing and optimizing process startup.

Besides: process_vm_readv and process_vm_writev already work on
address ranges. Why should other cross-process memory APIs use a very
different model for naming memory regions?
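
For reference, that existing precedent looks like this in practice
(process_vm_readv is a real API; the wrapper is a minimal example):

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>

/* Existing practice: remote memory is named by (address, length)
 * pairs, not by per-vma file descriptors. */
static ssize_t peek_remote(pid_t pid, void *remote_addr, void *buf,
                           size_t len)
{
        struct iovec local = { .iov_base = buf, .iov_len = len };
        struct iovec remote = { .iov_base = remote_addr, .iov_len = len };

        return process_vm_readv(pid, &local, 1, &remote, 1, 0);
}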

> be used to revalidate when the operation is requested and fail if
> something has changed? Moreover, we already have a fd based madvise
> syscall, so there shouldn't really be a large need to add a new set of
> syscalls.

We have various system calls that provide hints for open files, but
the memory operations are distinct. Modeling anonymous memory as a
kind of file-backed memory for purposes of VMA manipulation would also
be a departure from existing practice. Can you help me understand why
you seem to favor the FD-per-VMA approach so heavily? I don't see any
arguments *for* an FD-per-VMA model for remote memory manipulation and
I see a lot of arguments against it. Is there some compelling
advantage I'm missing?

> > Or maybe the whole sequence number thing is overkill and we don't need
> > atomicity? But if there's a concern that A shouldn't operate on B's
> > memory without knowing what it's operating on, then the scheme I've
> > proposed above solves this knowledge problem in a pretty lightweight
> > way.
>
> This is the main question here. Do we really want to enforce external
> synchronization between the two processes to make sure that they are
> both operating on the same range - aka protect from the range going away
> and being reused for a different purpose? Right now it wouldn't be fatal,
> because both operations are non-destructive, but I can imagine that there
> will be more madvise operations to follow (including destructive ones)
> because people will simply find usecases for that. This should be
> reflected in the proposed API.

A sequence number gives us this synchronization at very low cost and
adds safety. It's also a general-purpose mechanism that would
safeguard *any* cross-process VM operation, not just the VM operations
we're discussing right now.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28 11:12                             ` Minchan Kim
@ 2019-05-28 11:28                               ` Michal Hocko
  2019-05-28 11:42                                 ` Daniel Colascione
  2019-05-28 11:44                                 ` Minchan Kim
  0 siblings, 2 replies; 68+ messages in thread
From: Michal Hocko @ 2019-05-28 11:28 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Daniel Colascione, Andrew Morton, LKML, linux-mm,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Shakeel Butt, Sonny Rao, Brian Geffon, Linux API

On Tue 28-05-19 20:12:08, Minchan Kim wrote:
[...]
> I agree with you that we shouldn't design API on single usecase but what
> you are concerning is actually not our usecase because we are resilient
> with the race since MADV_COLD|PAGEOUT is not destruptive.
> Actually, many hints are already racy in that the upcoming pattern would
> be different with the behavior you thought at the moment.

How come they are racy wrt address ranges? You would have to be in a
multithreaded environment, and then the onus of synchronization is on
the threads. That model is quite clear. But we are talking about
separate processes, and some of them might not even be aware of an
external entity tweaking their address space.

> If you are still concerning of address range synchronization, how about
> moving such hints to per-process level like prctl?
> Does it make sense to you?

No, it doesn't. How is prctl relevant to any address range
operations?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28 10:41                           ` Michal Hocko
  2019-05-28 11:12                             ` Minchan Kim
@ 2019-05-28 11:28                             ` Daniel Colascione
  1 sibling, 0 replies; 68+ messages in thread
From: Daniel Colascione @ 2019-05-28 11:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
	Sonny Rao, Brian Geffon, Linux API

On Tue, May 28, 2019 at 3:41 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 28-05-19 19:32:56, Minchan Kim wrote:
> > On Tue, May 28, 2019 at 11:08:21AM +0200, Michal Hocko wrote:
> > > On Tue 28-05-19 17:49:27, Minchan Kim wrote:
> > > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote:
> > > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote:
> > > > > > if we went with the per vma fd approach then you would get this
> > > > > > > feature automatically because map_files would refer to file backed
> > > > > > > mappings while map_anon could refer only to anonymous mappings.
> > > > > >
> > > > > > The reason to add such filter option is to avoid the parsing overhead
> > > > > > so map_anon wouldn't be helpful.
> > > > >
> > > > > Without chiming on whether the filter option is a good idea, I'd like
> > > > > to suggest that providing an efficient binary interfaces for pulling
> > > > > memory map information out of processes.  Some single-system-call
> > > > > method for retrieving a binary snapshot of a process's address space
> > > > > complete with attributes (selectable, like statx?) for each VMA would
> > > > > reduce complexity and increase performance in a variety of areas,
> > > > > e.g., Android memory map debugging commands.
> > > >
> > > > I agree it's the best we can get *generally*.
> > > > Michal, any opinion?
> > >
> > > I am not really sure this is directly related. I think the primary
> > > question that we have to sort out first is whether we want to have
> > > the remote madvise call process or vma fd based. This is an important
> > > distinction wrt. usability. I have only seen pid vs. pidfd discussions
> > > so far unfortunately.
> >
> > With current usecase, it's per-process API with distinguishable anon/file
> > but thought it could be easily extended later for each address range
> > operation as userspace getting smarter with more information.
>
> Never design user API based on a single usecase, please. The "easily
> extended" part is by far not clear to me TBH. As I've already mentioned
> several times, the synchronization model has to be thought through
> carefuly before a remote process address range operation can be
> implemented.

I don't think anyone is overfitting for a specific use case. When some
process A wants to manipulate process B's memory, it's fair for A to
want to know what memory it's manipulating. That's a general concern
that applies to a large family of cross-process memory operations.
It's less important for non-destructive hints than for some kind of
destructive operation, but the same idea applies. If there's a simple
way to solve this A-B information problem in a general way, it seems
to me that we should apply that general solution. Likewise, an API to
get an efficiently-collected snapshot of a process's address space
would be immediately useful in several very different use cases,
including debuggers, Android memory use reporting tools, and various
kinds of metric collection. Because we're talking about mechanisms
that solve several independent problems at the same time and in a
general way, it doesn't sound to me like overfitting for a particular
use case.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28 11:28                               ` Michal Hocko
@ 2019-05-28 11:42                                 ` Daniel Colascione
  2019-05-28 11:56                                   ` Michal Hocko
  2019-05-28 12:10                                   ` Minchan Kim
  2019-05-28 11:44                                 ` Minchan Kim
  1 sibling, 2 replies; 68+ messages in thread
From: Daniel Colascione @ 2019-05-28 11:42 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
	Sonny Rao, Brian Geffon, Linux API

On Tue, May 28, 2019 at 4:28 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 28-05-19 20:12:08, Minchan Kim wrote:
> > On Tue, May 28, 2019 at 12:41:17PM +0200, Michal Hocko wrote:
> > > On Tue 28-05-19 19:32:56, Minchan Kim wrote:
> > > > On Tue, May 28, 2019 at 11:08:21AM +0200, Michal Hocko wrote:
> > > > > On Tue 28-05-19 17:49:27, Minchan Kim wrote:
> > > > > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote:
> > > > > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote:
> > > > > > > > if we went with the per vma fd approach then you would get this
> > > > > > > > > feature automatically because map_files would refer to file backed
> > > > > > > > > mappings while map_anon could refer only to anonymous mappings.
> > > > > > > >
> > > > > > > > The reason to add such filter option is to avoid the parsing overhead
> > > > > > > > so map_anon wouldn't be helpful.
> > > > > > >
> > > > > > > Without chiming on whether the filter option is a good idea, I'd like
> > > > > > > to suggest that providing an efficient binary interfaces for pulling
> > > > > > > memory map information out of processes.  Some single-system-call
> > > > > > > method for retrieving a binary snapshot of a process's address space
> > > > > > > complete with attributes (selectable, like statx?) for each VMA would
> > > > > > > reduce complexity and increase performance in a variety of areas,
> > > > > > > e.g., Android memory map debugging commands.
> > > > > >
> > > > > > I agree it's the best we can get *generally*.
> > > > > > Michal, any opinion?
> > > > >
> > > > > I am not really sure this is directly related. I think the primary
> > > > > question that we have to sort out first is whether we want to have
> > > > > the remote madvise call process or vma fd based. This is an important
> > > > > distinction wrt. usability. I have only seen pid vs. pidfd discussions
> > > > > so far unfortunately.
> > > >
> > > > With current usecase, it's per-process API with distinguishable anon/file
> > > > but thought it could be easily extended later for each address range
> > > > operation as userspace getting smarter with more information.
> > >
> > > Never design user API based on a single usecase, please. The "easily
> > > extended" part is by far not clear to me TBH. As I've already mentioned
> > > several times, the synchronization model has to be thought through
> > > carefuly before a remote process address range operation can be
> > > implemented.
> >
> > I agree with you that we shouldn't design API on single usecase but what
> > you are concerning is actually not our usecase because we are resilient
> > with the race since MADV_COLD|PAGEOUT is not destruptive.
> > Actually, many hints are already racy in that the upcoming pattern would
> > be different with the behavior you thought at the moment.
>
> How come they are racy wrt address ranges? You would have to be in
> multithreaded environment and then the onus of synchronization is on
> threads. That model is quite clear. But we are talking about separate
> processes and some of them might be even not aware of an external entity
> tweaking their address space.

I don't think the difference between a thread and a process matters in
this context. Threads race on address space operations all the time
--- in the sense that multiple threads modify a process's address
space without synchronization. The main reasons that these races
haven't been a problem are: 1) threads mostly "mind their own business"
and modify different parts of the address space or use locks to ensure
that they don't stomp on each other (e.g., the malloc heap lock), and
2) POSIX mmap atomic-replacement semantics make certain classes of
operation (like "magic ring buffer" setup) safe even in the presence
of other threads stomping over an address space.
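For reference, the "magic ring buffer" setup mentioned above relies on
exactly those atomic-replacement semantics. A minimal sketch, with
error handling elided:

    size_t size = 1 << 20;            /* must be page-aligned */
    int fd = memfd_create("ring", 0);
    ftruncate(fd, size);

    /* Reserve 2*size of contiguous address space... */
    char *base = mmap(NULL, 2 * size, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    /* ...then map the same fd twice, back to back. MAP_FIXED
     * atomically replaces the reservation, so no other thread's
     * mmap() can race into the range. */
    mmap(base, size, PROT_READ | PROT_WRITE,
         MAP_SHARED | MAP_FIXED, fd, 0);
    mmap(base + size, size, PROT_READ | PROT_WRITE,
         MAP_SHARED | MAP_FIXED, fd, 0);
    /* Now base[i] and base[i + size] alias the same byte. */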

The thing that's new in this discussion from a synchronization point
of view isn't that the VM operation we're talking about is coming from
outside the process, but that we want to do a read-decide-modify-ish
thing. We want to affect (using various hints) classes of pages like
"all file pages" or "all anonymous pages" or "some pages referring to
graphics buffers up to 100MB" (to pick an example off the top of my
head of a policy that might make sense). From a synchronization point
of view, it doesn't really matter whether it's a thread within the
target process or a thread outside the target process that does the
address space manipulation. What's new is the inspection of the
address space before performing an operation.

Minchan started this thread by proposing some flags that would
implement a few of the filtering policies I used as examples above.
Personally, instead of providing a few pre-built policies as flags,
I'd rather push the page manipulation policy to userspace as much as
possible and just have the kernel provide a mechanism that *in
general* makes these read-decide-modify operations efficient and
robust. I still think there's a way to achieve this goal very
inexpensively without compromising on flexibility.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28 11:28                               ` Michal Hocko
  2019-05-28 11:42                                 ` Daniel Colascione
@ 2019-05-28 11:44                                 ` Minchan Kim
  2019-05-28 11:51                                   ` Daniel Colascione
  2019-05-28 12:06                                   ` Michal Hocko
  1 sibling, 2 replies; 68+ messages in thread
From: Minchan Kim @ 2019-05-28 11:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Daniel Colascione, Andrew Morton, LKML, linux-mm,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Shakeel Butt, Sonny Rao, Brian Geffon, Linux API

On Tue, May 28, 2019 at 01:28:40PM +0200, Michal Hocko wrote:
> On Tue 28-05-19 20:12:08, Minchan Kim wrote:
> > On Tue, May 28, 2019 at 12:41:17PM +0200, Michal Hocko wrote:
> > > On Tue 28-05-19 19:32:56, Minchan Kim wrote:
> > > > On Tue, May 28, 2019 at 11:08:21AM +0200, Michal Hocko wrote:
> > > > > On Tue 28-05-19 17:49:27, Minchan Kim wrote:
> > > > > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote:
> > > > > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote:
> > > > > > > > if we went with the per vma fd approach then you would get this
> > > > > > > > > feature automatically because map_files would refer to file backed
> > > > > > > > > mappings while map_anon could refer only to anonymous mappings.
> > > > > > > >
> > > > > > > > The reason to add such filter option is to avoid the parsing overhead
> > > > > > > > so map_anon wouldn't be helpful.
> > > > > > > 
> > > > > > > Without chiming on whether the filter option is a good idea, I'd like
> > > > > > > to suggest that providing an efficient binary interfaces for pulling
> > > > > > > memory map information out of processes.  Some single-system-call
> > > > > > > method for retrieving a binary snapshot of a process's address space
> > > > > > > complete with attributes (selectable, like statx?) for each VMA would
> > > > > > > reduce complexity and increase performance in a variety of areas,
> > > > > > > e.g., Android memory map debugging commands.
> > > > > > 
> > > > > > I agree it's the best we can get *generally*.
> > > > > > Michal, any opinion?
> > > > > 
> > > > > I am not really sure this is directly related. I think the primary
> > > > > question that we have to sort out first is whether we want to have
> > > > > the remote madvise call process or vma fd based. This is an important
> > > > > distinction wrt. usability. I have only seen pid vs. pidfd discussions
> > > > > so far unfortunately.
> > > > 
> > > > With current usecase, it's per-process API with distinguishable anon/file
> > > > but thought it could be easily extended later for each address range
> > > > operation as userspace getting smarter with more information.
> > > 
> > > Never design user API based on a single usecase, please. The "easily
> > > extended" part is by far not clear to me TBH. As I've already mentioned
> > > several times, the synchronization model has to be thought through
> > > carefuly before a remote process address range operation can be
> > > implemented.
> > 
> > I agree with you that we shouldn't design API on single usecase but what
> > you are concerning is actually not our usecase because we are resilient
> > with the race since MADV_COLD|PAGEOUT is not destruptive.
> > Actually, many hints are already racy in that the upcoming pattern would
> > be different with the behavior you thought at the moment.
> 
> How come they are racy wrt address ranges? You would have to be in
> multithreaded environment and then the onus of synchronization is on
> threads. That model is quite clear. But we are talking about separate

Think about MADV_FREE. The allocator decides a chunk is worth marking
"freeable", but soon after, a user of the allocator asks for the
chunk - i.e., it is no longer freeable once the user starts to use it.

My point is that these kinds of *hints* are always racy, so
synchronization couldn't help a lot. That's why I want to restrict the
hints process_madvise supports to such non-destructive ones at the
next respin.

> processes and some of them might be even not aware of an external entity
> tweaking their address space.
> 
> > If you are still concerning of address range synchronization, how about
> > moving such hints to per-process level like prctl?
> > Does it make sense to you?
> 
> No it doesn't. How is prctl any relevant to any address range
> operations.

"whether we want to have the remote madvise call process or vma fd based."

You asked the above question, and I answered that we are using
process-level hints with an anon/file filter at this moment. That's
why I suggested prctl, to make forward progress in the discussion.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28 11:21                             ` Daniel Colascione
@ 2019-05-28 11:49                               ` Michal Hocko
  2019-05-28 12:11                                 ` Daniel Colascione
  0 siblings, 1 reply; 68+ messages in thread
From: Michal Hocko @ 2019-05-28 11:49 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
	Sonny Rao, Brian Geffon, Linux API

On Tue 28-05-19 04:21:44, Daniel Colascione wrote:
> On Tue, May 28, 2019 at 3:33 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Tue 28-05-19 02:39:03, Daniel Colascione wrote:
> > > On Tue, May 28, 2019 at 2:08 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > >
> > > > On Tue 28-05-19 17:49:27, Minchan Kim wrote:
> > > > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote:
> > > > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote:
> > > > > > > if we went with the per vma fd approach then you would get this
> > > > > > > > feature automatically because map_files would refer to file backed
> > > > > > > > mappings while map_anon could refer only to anonymous mappings.
> > > > > > >
> > > > > > > The reason to add such filter option is to avoid the parsing overhead
> > > > > > > so map_anon wouldn't be helpful.
> > > > > >
> > > > > > Without chiming on whether the filter option is a good idea, I'd like
> > > > > > to suggest that providing an efficient binary interfaces for pulling
> > > > > > memory map information out of processes.  Some single-system-call
> > > > > > method for retrieving a binary snapshot of a process's address space
> > > > > > complete with attributes (selectable, like statx?) for each VMA would
> > > > > > reduce complexity and increase performance in a variety of areas,
> > > > > > e.g., Android memory map debugging commands.
> > > > >
> > > > > I agree it's the best we can get *generally*.
> > > > > Michal, any opinion?
> > > >
> > > > I am not really sure this is directly related. I think the primary
> > > > question that we have to sort out first is whether we want to have
> > > > the remote madvise call process or vma fd based. This is an important
> > > > distinction wrt. usability. I have only seen pid vs. pidfd discussions
> > > > so far unfortunately.
> > >
> > > I don't think the vma fd approach is viable. We have some processes
> > > with a *lot* of VMAs --- system_server had 4204 when I checked just
> > > now (and that's typical) --- and an FD operation per VMA would be
> > > excessive.
> >
> > What do you mean by excessive here? Do you expect the process to have
> > them open all at once?
> 
> Minchan's already done timing. More broadly, in an era with various
> speculative execution mitigations, making a system call is pretty
> expensive.

This is a completely separate discussion. This could be argued about
many other syscalls. Let's make the semantics correct first before we
even start thinking about multiplexing. It is easier to multiplex on an
existing and sane interface.

Btw. Minchan concluded that multiplexing is not really all that
important based on his numbers http://lkml.kernel.org/r/20190527074940.GB6879@google.com

[...]

> > Is this really too much different from /proc/<pid>/map_files?
> 
> It's very different. See below.
> 
> > > > An interface to query address range information is a separate but
> > > > although a related topic. We have /proc/<pid>/[s]maps for that right
> > > > now and I understand it is not a general win for all usecases because
> > > > it tends to be slow for some. I can see how /proc/<pid>/map_anons could
> > > > provide per vma information in a binary form via a fd based interface.
> > > > But I would rather not conflate those two discussions much - well except
> > > > if it could give one of the approaches more justification but let's
> > > > focus on the madvise part first.
> > >
> > > I don't think it's a good idea to focus on one feature in a
> > > multi-feature change when the interactions between features can be
> > > very important for overall design of the multi-feature system and the
> > > design of each feature.
> > >
> > > Here's my thinking on the high-level design:
> > >
> > > I'm imagining an address-range system that would work like this: we'd
> > > create some kind of process_vm_getinfo(2) system call [1] that would
> > > accept a statx-like attribute map and a pid/fd parameter as input and
> > > return, on output, two things: 1) an array [2] of VMA descriptors
> > > containing the requested information, and 2) a VMA configuration
> > > sequence number. We'd then have process_madvise() and other
> > > cross-process VM interfaces accept both address ranges and this
> > > sequence number; they'd succeed only if the VMA configuration sequence
> > > number is still current, i.e., the target process hasn't changed its
> > > VMA configuration (implicitly or explicitly) since the call to
> > > process_vm_getinfo().
> >
> > The sequence number is essentially a cookie that is transparent to the
> > userspace right? If yes, how does it differ from a fd (returned from
> > /proc/<pid>/map_{anons,files}/range) which is a cookie itself and it can
> 
> If you want to operate on N VMAs simultaneously under an FD-per-VMA
> model, you'd need to have those N FDs all open at the same time *and*
> add some kind of system call that accepted those N FDs and an
> operation to perform. The sequence number I'm proposing also applies
> to the whole address space, not just one VMA. Even if you did have
> these N FDs open all at once and supplied them all to some batch
> operation, you couldn't guarantee via the FD mechanism that some *new*
> VMA didn't appear in the address range you want to manipulate. A
> global sequence number would catch this case. I still think supplying
> a list of address ranges (like we already do for scatter-gather IO) is
> less error-prone, less resource-intensive, more consistent with
> existing practice, and equally flexible, especially if we start
> supporting destructive cross-process memory operations, which may be
> useful for things like checkpointing and optimizing process startup.

I have a strong feeling you are over-optimizing here. We are talking
about pro-active memory management and so far I haven't heard any
usecase where all this would happen in the fast path. There are
essentially two usecases I have heard so far. Age/Reclaim the whole
process (with an anon/fs preference) and do the same on a particular
and well-specified range (e.g. a garbage collector or an inactive large
image in a browser etc...). The former doesn't really care about
parallel address range manipulations because it can tolerate them. The
latter is a completely different story.

Are there any others where saving a few ms matters so much?

> Besides: process_vm_readv and process_vm_writev already work on
> address ranges. Why should other cross-process memory APIs use a very
> different model for naming memory regions?

I would consider those APIs not a great example. They are racy on
more levels (pid reuse and address space modification), and require
non-trivial synchronization. Do you want something similar for madvise
on a non-cooperating remote application?
 
> > be used to revalidate when the operation is requested and fail if
> > something has changed. Moreover we already do have a fd based madvise
> > syscall so there shouldn't be really a large need to add a new set of
> > syscalls.
> 
> We have various system calls that provide hints for open files, but
> the memory operations are distinct. Modeling anonymous memory as a
> kind of file-backed memory for purposes of VMA manipulation would also
> be a departure from existing practice. Can you help me understand why
> you seem to favor the FD-per-VMA approach so heavily? I don't see any
> arguments *for* an FD-per-VMA model for remove memory manipulation and
> I see a lot of arguments against it. Is there some compelling
> advantage I'm missing?

First and foremost, it provides an easy cookie to userspace to
guarantee time-of-check-to-time-of-use consistency. It also naturally
extends the existing fadvise interface that achieves madvise semantics on
files. I am not really pushing hard for this particular API but I really
do care about a programming model that would be sane. If we have a
different means to achieve the same then all fine by me but so far I
haven't heard any sound arguments to invent something completely new
when we have established APIs to use. Exporting anonymous mappings via
proc the same way we do for file mappings doesn't seem to be stepping
outside of the current practice way too much.

All I am trying to say here is that process_madvise(fd, start, len) is
an inherently racy API and we should focus on discussing whether this is a
sane model. And I think it would be much better to discuss that under
the respective patch which introduces that API rather than here.
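For illustration, a manager under that model might do something like
the following. The map_anon path and the FADV_COLD hint are
hypothetical; only file-backed mappings are exported via map_files
today:

    /* Hypothetical sketch of the fd-per-mapping model. Opening the
     * range pins a cookie: if the target unmaps or reuses the range,
     * later operations on the fd can fail instead of silently hitting
     * an unrelated mapping. */
    int fd = open("/proc/1234/map_anon/7f0000000000-7f0000200000",
                  O_RDONLY);
    posix_fadvise(fd, 0, 0, FADV_COLD);   /* hypothetical hint */
    close(fd);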
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28 11:44                                 ` Minchan Kim
@ 2019-05-28 11:51                                   ` Daniel Colascione
  2019-05-28 12:06                                   ` Michal Hocko
  1 sibling, 0 replies; 68+ messages in thread
From: Daniel Colascione @ 2019-05-28 11:51 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Michal Hocko, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
	Sonny Rao, Brian Geffon, Linux API

On Tue, May 28, 2019 at 4:44 AM Minchan Kim <minchan@kernel.org> wrote:
>
> On Tue, May 28, 2019 at 01:28:40PM +0200, Michal Hocko wrote:
> > On Tue 28-05-19 20:12:08, Minchan Kim wrote:
> > > On Tue, May 28, 2019 at 12:41:17PM +0200, Michal Hocko wrote:
> > > > On Tue 28-05-19 19:32:56, Minchan Kim wrote:
> > > > > On Tue, May 28, 2019 at 11:08:21AM +0200, Michal Hocko wrote:
> > > > > > On Tue 28-05-19 17:49:27, Minchan Kim wrote:
> > > > > > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote:
> > > > > > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote:
> > > > > > > > > if we went with the per vma fd approach then you would get this
> > > > > > > > > > feature automatically because map_files would refer to file backed
> > > > > > > > > > mappings while map_anon could refer only to anonymous mappings.
> > > > > > > > >
> > > > > > > > > The reason to add such filter option is to avoid the parsing overhead
> > > > > > > > > so map_anon wouldn't be helpful.
> > > > > > > >
> > > > > > > > Without chiming on whether the filter option is a good idea, I'd like
> > > > > > > > to suggest that providing an efficient binary interfaces for pulling
> > > > > > > > memory map information out of processes.  Some single-system-call
> > > > > > > > method for retrieving a binary snapshot of a process's address space
> > > > > > > > complete with attributes (selectable, like statx?) for each VMA would
> > > > > > > > reduce complexity and increase performance in a variety of areas,
> > > > > > > > e.g., Android memory map debugging commands.
> > > > > > >
> > > > > > > I agree it's the best we can get *generally*.
> > > > > > > Michal, any opinion?
> > > > > >
> > > > > > I am not really sure this is directly related. I think the primary
> > > > > > question that we have to sort out first is whether we want to have
> > > > > > the remote madvise call process or vma fd based. This is an important
> > > > > > distinction wrt. usability. I have only seen pid vs. pidfd discussions
> > > > > > so far unfortunately.
> > > > >
> > > > > With current usecase, it's per-process API with distinguishable anon/file
> > > > > but thought it could be easily extended later for each address range
> > > > > operation as userspace getting smarter with more information.
> > > >
> > > > Never design user API based on a single usecase, please. The "easily
> > > > extended" part is by far not clear to me TBH. As I've already mentioned
> > > > several times, the synchronization model has to be thought through
> > > > carefuly before a remote process address range operation can be
> > > > implemented.
> > >
> > > I agree with you that we shouldn't design API on single usecase but what
> > > you are concerning is actually not our usecase because we are resilient
> > > with the race since MADV_COLD|PAGEOUT is not destruptive.
> > > Actually, many hints are already racy in that the upcoming pattern would
> > > be different with the behavior you thought at the moment.
> >
> > How come they are racy wrt address ranges? You would have to be in
> > multithreaded environment and then the onus of synchronization is on
> > threads. That model is quite clear. But we are talking about separate
>
> Think about MADV_FREE. Allocator would think the chunk is worth to mark
> "freeable" but soon, user of the allocator asked the chunk - ie, it's not
> freeable any longer once user start to use it.
>
> My point is that kinds of *hints* are always racy so any synchronization
> couldn't help a lot. That's why I want to restrict hints process_madvise
> supports as such kinds of non-destruptive one at next respin.

I think it's more natural for process_madvise to be a superset of
regular madvise. What's the harm? There are no security implications,
since anyone who could process_madvise could just ptrace anyway. I
also don't think limiting the hinting to non-destructive operations
guarantees safety (in a broad sense) either, since operating on the
wrong memory range can still cause unexpected system performance
issues even if there's no data loss.
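For reference, "could just ptrace anyway" is the same gate the
existing remote-VM calls already use; mm/process_vm_access.c does
roughly this:

    /* The caller must pass the same check a ptrace attach would. */
    mm = mm_access(task, PTRACE_MODE_ATTACH_REALCREDS);
    if (!mm || IS_ERR(mm)) {
            rc = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
            goto put_task_struct;
    }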

More broadly, what I want to see is a family of process_* APIs that
provide a superset of the functionality that the existing intraprocess
APIs provide. I think this approach is elegant and generalizes easily.
I'm worried about prematurely limiting the interprocess memory APIs
and creating limitations that will last a long time in order to avoid
having to consider issues like VMA synchronization.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28 11:42                                 ` Daniel Colascione
@ 2019-05-28 11:56                                   ` Michal Hocko
  2019-05-28 12:18                                     ` Daniel Colascione
  2019-05-28 12:10                                   ` Minchan Kim
  1 sibling, 1 reply; 68+ messages in thread
From: Michal Hocko @ 2019-05-28 11:56 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
	Sonny Rao, Brian Geffon, Linux API

On Tue 28-05-19 04:42:47, Daniel Colascione wrote:
> On Tue, May 28, 2019 at 4:28 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Tue 28-05-19 20:12:08, Minchan Kim wrote:
> > > On Tue, May 28, 2019 at 12:41:17PM +0200, Michal Hocko wrote:
> > > > On Tue 28-05-19 19:32:56, Minchan Kim wrote:
> > > > > On Tue, May 28, 2019 at 11:08:21AM +0200, Michal Hocko wrote:
> > > > > > On Tue 28-05-19 17:49:27, Minchan Kim wrote:
> > > > > > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote:
> > > > > > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote:
> > > > > > > > > if we went with the per vma fd approach then you would get this
> > > > > > > > > > feature automatically because map_files would refer to file backed
> > > > > > > > > > mappings while map_anon could refer only to anonymous mappings.
> > > > > > > > >
> > > > > > > > > The reason to add such filter option is to avoid the parsing overhead
> > > > > > > > > so map_anon wouldn't be helpful.
> > > > > > > >
> > > > > > > > Without chiming on whether the filter option is a good idea, I'd like
> > > > > > > > to suggest that providing an efficient binary interfaces for pulling
> > > > > > > > memory map information out of processes.  Some single-system-call
> > > > > > > > method for retrieving a binary snapshot of a process's address space
> > > > > > > > complete with attributes (selectable, like statx?) for each VMA would
> > > > > > > > reduce complexity and increase performance in a variety of areas,
> > > > > > > > e.g., Android memory map debugging commands.
> > > > > > >
> > > > > > > I agree it's the best we can get *generally*.
> > > > > > > Michal, any opinion?
> > > > > >
> > > > > > I am not really sure this is directly related. I think the primary
> > > > > > question that we have to sort out first is whether we want to have
> > > > > > the remote madvise call process or vma fd based. This is an important
> > > > > > distinction wrt. usability. I have only seen pid vs. pidfd discussions
> > > > > > so far unfortunately.
> > > > >
> > > > > With current usecase, it's per-process API with distinguishable anon/file
> > > > > but thought it could be easily extended later for each address range
> > > > > operation as userspace getting smarter with more information.
> > > >
> > > > Never design user API based on a single usecase, please. The "easily
> > > > extended" part is by far not clear to me TBH. As I've already mentioned
> > > > several times, the synchronization model has to be thought through
> > > > carefuly before a remote process address range operation can be
> > > > implemented.
> > >
> > > I agree with you that we shouldn't design API on single usecase but what
> > > you are concerning is actually not our usecase because we are resilient
> > > with the race since MADV_COLD|PAGEOUT is not destruptive.
> > > Actually, many hints are already racy in that the upcoming pattern would
> > > be different with the behavior you thought at the moment.
> >
> > How come they are racy wrt address ranges? You would have to be in
> > multithreaded environment and then the onus of synchronization is on
> > threads. That model is quite clear. But we are talking about separate
> > processes and some of them might be even not aware of an external entity
> > tweaking their address space.
> 
> I don't think the difference between a thread and a process matters in
> this context. Threads race on address space operations all the time
> --- in the sense that multiple threads modify a process's address
> space without synchronization.

I would disagree. They do have in-kernel synchronization as long as they
do not use MAP_FIXED. If they do want to use MAP_FIXED then they better
synchronize or the result is undefined.

> The main reasons that these races
> hasn't been a problem are: 1) threads mostly "mind their own business"
> and modify different parts of the address space or use locks to ensure
> that they don't stop on each other (e.g., the malloc heap lock), and
> 2) POSIX mmap atomic-replacement semantics make certain classes of
> operation (like "magic ring buffer" setup) safe even in the presence
> of other threads stomping over an address space.

Agreed here.

[...]

> From a synchronization point
> of view, it doesn't really matter whether it's a thread within the
> target process or a thread outside the target process that does the
> address space manipulation. What's new is the inspection of the
> address space before performing an operation.

The fundamental difference is that if you want to achieve the same
inside the process then your application is inherently aware of the
operation and uses whatever synchronization is needed to achieve
consistency. As soon as you allow the same from outside you either
have to have an aware target application as well or you need a mechanism
to find out that your decision has been invalidated by a later
unsynchronized action.

> Minchan started this thread by proposing some flags that would
> implement a few of the filtering policies I used as examples above.
> Personally, instead of providing a few pre-built policies as flags,
> I'd rather push the page manipulation policy to userspace as much as
> possible and just have the kernel provide a mechanism that *in
> general* makes these read-decide-modify operations efficient and
> robust. I still think there's way to achieve this goal very
> inexpensively without compromising on flexibility.

Agreed here.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28 11:44                                 ` Minchan Kim
  2019-05-28 11:51                                   ` Daniel Colascione
@ 2019-05-28 12:06                                   ` Michal Hocko
  2019-05-28 12:22                                     ` Minchan Kim
  1 sibling, 1 reply; 68+ messages in thread
From: Michal Hocko @ 2019-05-28 12:06 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Daniel Colascione, Andrew Morton, LKML, linux-mm,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Shakeel Butt, Sonny Rao, Brian Geffon, Linux API

On Tue 28-05-19 20:44:36, Minchan Kim wrote:
> On Tue, May 28, 2019 at 01:28:40PM +0200, Michal Hocko wrote:
> > On Tue 28-05-19 20:12:08, Minchan Kim wrote:
> > > On Tue, May 28, 2019 at 12:41:17PM +0200, Michal Hocko wrote:
> > > > On Tue 28-05-19 19:32:56, Minchan Kim wrote:
> > > > > On Tue, May 28, 2019 at 11:08:21AM +0200, Michal Hocko wrote:
> > > > > > On Tue 28-05-19 17:49:27, Minchan Kim wrote:
> > > > > > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote:
> > > > > > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote:
> > > > > > > > > if we went with the per vma fd approach then you would get this
> > > > > > > > > > feature automatically because map_files would refer to file backed
> > > > > > > > > > mappings while map_anon could refer only to anonymous mappings.
> > > > > > > > >
> > > > > > > > > The reason to add such filter option is to avoid the parsing overhead
> > > > > > > > > so map_anon wouldn't be helpful.
> > > > > > > > 
> > > > > > > > Without chiming on whether the filter option is a good idea, I'd like
> > > > > > > > to suggest that providing an efficient binary interfaces for pulling
> > > > > > > > memory map information out of processes.  Some single-system-call
> > > > > > > > method for retrieving a binary snapshot of a process's address space
> > > > > > > > complete with attributes (selectable, like statx?) for each VMA would
> > > > > > > > reduce complexity and increase performance in a variety of areas,
> > > > > > > > e.g., Android memory map debugging commands.
> > > > > > > 
> > > > > > > I agree it's the best we can get *generally*.
> > > > > > > Michal, any opinion?
> > > > > > 
> > > > > > I am not really sure this is directly related. I think the primary
> > > > > > question that we have to sort out first is whether we want to have
> > > > > > the remote madvise call process or vma fd based. This is an important
> > > > > > distinction wrt. usability. I have only seen pid vs. pidfd discussions
> > > > > > so far unfortunately.
> > > > > 
> > > > > With current usecase, it's per-process API with distinguishable anon/file
> > > > > but thought it could be easily extended later for each address range
> > > > > operation as userspace getting smarter with more information.
> > > > 
> > > > Never design user API based on a single usecase, please. The "easily
> > > > extended" part is by far not clear to me TBH. As I've already mentioned
> > > > several times, the synchronization model has to be thought through
> > > > carefuly before a remote process address range operation can be
> > > > implemented.
> > > 
> > > I agree with you that we shouldn't design API on single usecase but what
> > > you are concerning is actually not our usecase because we are resilient
> > > with the race since MADV_COLD|PAGEOUT is not destruptive.
> > > Actually, many hints are already racy in that the upcoming pattern would
> > > be different with the behavior you thought at the moment.
> > 
> > How come they are racy wrt address ranges? You would have to be in
> > multithreaded environment and then the onus of synchronization is on
> > threads. That model is quite clear. But we are talking about separate
> 
> Think about MADV_FREE. Allocator would think the chunk is worth to mark
> "freeable" but soon, user of the allocator asked the chunk - ie, it's not
> freeable any longer once user start to use it.

That is not a race in the address space, right. The underlying object
hasn't changed. It has been declared as freeable and since that moment
nobody can rely on the content because it might have been discarded.
Or put simply, the content is undefined. It is the responsibility of
the madvise caller to make sure that the object is not in active use
while marking it.
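To illustrate those semantics with the existing intra-process call:

    size_t len = 1 << 20;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    memset(p, 0xaa, len);          /* dirty the pages */

    madvise(p, len, MADV_FREE);    /* contents now undefined: the
                                    * kernel may lazily discard any of
                                    * these pages under pressure */

    p[0] = 1;                      /* a write cancels the hint for the
                                    * touched page; a read before any
                                    * write may return old data or
                                    * zeroes */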

> My point is that kinds of *hints* are always racy so any synchronization
> couldn't help a lot. That's why I want to restrict hints process_madvise
> supports as such kinds of non-destruptive one at next respin.

I agree that non-destructive operations are safer against parallel
modifications because you just get annoying and unexpected latency in
the worst case. But we should discuss whether this assumption is
sufficient for further development. I am pretty sure that once we open
up remote madvise, people will find usecases for destructive operations
or even new madvise modes we haven't heard of. What then?

> > processes and some of them might be even not aware of an external entity
> > tweaking their address space.
> > 
> > > If you are still concerning of address range synchronization, how about
> > > moving such hints to per-process level like prctl?
> > > Does it make sense to you?
> > 
> > No it doesn't. How is prctl any relevant to any address range
> > operations.
> 
> "whether we want to have the remote madvise call process or vma fd based."

Still not following. So you want to have a prctl (one of the worst
APIs we have, along with ioctl) to set the semantics? This sounds like
a terrible idea to me.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28 11:42                                 ` Daniel Colascione
  2019-05-28 11:56                                   ` Michal Hocko
@ 2019-05-28 12:10                                   ` Minchan Kim
  1 sibling, 0 replies; 68+ messages in thread
From: Minchan Kim @ 2019-05-28 12:10 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Michal Hocko, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
	Sonny Rao, Brian Geffon, Linux API

On Tue, May 28, 2019 at 04:42:47AM -0700, Daniel Colascione wrote:
> On Tue, May 28, 2019 at 4:28 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Tue 28-05-19 20:12:08, Minchan Kim wrote:
> > > On Tue, May 28, 2019 at 12:41:17PM +0200, Michal Hocko wrote:
> > > > On Tue 28-05-19 19:32:56, Minchan Kim wrote:
> > > > > On Tue, May 28, 2019 at 11:08:21AM +0200, Michal Hocko wrote:
> > > > > > On Tue 28-05-19 17:49:27, Minchan Kim wrote:
> > > > > > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote:
> > > > > > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote:
> > > > > > > > > if we went with the per vma fd approach then you would get this
> > > > > > > > > > feature automatically because map_files would refer to file backed
> > > > > > > > > > mappings while map_anon could refer only to anonymous mappings.
> > > > > > > > >
> > > > > > > > > The reason to add such filter option is to avoid the parsing overhead
> > > > > > > > > so map_anon wouldn't be helpful.
> > > > > > > >
> > > > > > > > Without chiming on whether the filter option is a good idea, I'd like
> > > > > > > > to suggest that providing an efficient binary interfaces for pulling
> > > > > > > > memory map information out of processes.  Some single-system-call
> > > > > > > > method for retrieving a binary snapshot of a process's address space
> > > > > > > > complete with attributes (selectable, like statx?) for each VMA would
> > > > > > > > reduce complexity and increase performance in a variety of areas,
> > > > > > > > e.g., Android memory map debugging commands.
> > > > > > >
> > > > > > > I agree it's the best we can get *generally*.
> > > > > > > Michal, any opinion?
> > > > > >
> > > > > > I am not really sure this is directly related. I think the primary
> > > > > > question that we have to sort out first is whether we want to have
> > > > > > the remote madvise call process or vma fd based. This is an important
> > > > > > distinction wrt. usability. I have only seen pid vs. pidfd discussions
> > > > > > so far unfortunately.
> > > > >
> > > > > With current usecase, it's per-process API with distinguishable anon/file
> > > > > but thought it could be easily extended later for each address range
> > > > > operation as userspace getting smarter with more information.
> > > >
> > > > Never design user API based on a single usecase, please. The "easily
> > > > extended" part is by far not clear to me TBH. As I've already mentioned
> > > > several times, the synchronization model has to be thought through
> > > > carefuly before a remote process address range operation can be
> > > > implemented.
> > >
> > > I agree with you that we shouldn't design API on single usecase but what
> > > you are concerning is actually not our usecase because we are resilient
> > > with the race since MADV_COLD|PAGEOUT is not destruptive.
> > > Actually, many hints are already racy in that the upcoming pattern would
> > > be different with the behavior you thought at the moment.
> >
> > How come they are racy wrt address ranges? You would have to be in
> > multithreaded environment and then the onus of synchronization is on
> > threads. That model is quite clear. But we are talking about separate
> > processes and some of them might be even not aware of an external entity
> > tweaking their address space.
> 
> I don't think the difference between a thread and a process matters in
> this context. Threads race on address space operations all the time
> --- in the sense that multiple threads modify a process's address
> space without synchronization. The main reasons that these races
> hasn't been a problem are: 1) threads mostly "mind their own business"
> and modify different parts of the address space or use locks to ensure
> that they don't stop on each other (e.g., the malloc heap lock), and
> 2) POSIX mmap atomic-replacement semantics make certain classes of
> operation (like "magic ring buffer" setup) safe even in the presence
> of other threads stomping over an address space.
> 
> The thing that's new in this discussion from a synchronization point
> of view isn't that the VM operation we're talking about is coming from
> outside the process, but that we want to do a read-decide-modify-ish
> thing. We want to affect (using various hints) classes of pages like
> "all file pages" or "all anonymous pages" or "some pages referring to
> graphics buffers up to 100MB" (to pick an example off the top of my
> head of a policy that might make sense). From a synchronization point
> of view, it doesn't really matter whether it's a thread within the
> target process or a thread outside the target process that does the
> address space manipulation. What's new is the inspection of the
> address space before performing an operation.
> 
> Minchan started this thread by proposing some flags that would
> implement a few of the filtering policies I used as examples above.
> Personally, instead of providing a few pre-built policies as flags,
> I'd rather push the page manipulation policy to userspace as much as
> possible and just have the kernel provide a mechanism that *in
> general* makes these read-decide-modify operations efficient and
> robust. I still think there's way to achieve this goal very
> inexpensively without compromising on flexibility.

I'm looking forward to seeing the way. ;-)

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28 11:49                               ` Michal Hocko
@ 2019-05-28 12:11                                 ` Daniel Colascione
  2019-05-28 12:32                                   ` Michal Hocko
  0 siblings, 1 reply; 68+ messages in thread
From: Daniel Colascione @ 2019-05-28 12:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
	Sonny Rao, Brian Geffon, Linux API

On Tue, May 28, 2019 at 4:49 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 28-05-19 04:21:44, Daniel Colascione wrote:
> > On Tue, May 28, 2019 at 3:33 AM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Tue 28-05-19 02:39:03, Daniel Colascione wrote:
> > > > On Tue, May 28, 2019 at 2:08 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > > >
> > > > > On Tue 28-05-19 17:49:27, Minchan Kim wrote:
> > > > > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote:
> > > > > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote:
> > > > > > > > if we went with the per vma fd approach then you would get this
> > > > > > > > > feature automatically because map_files would refer to file backed
> > > > > > > > > mappings while map_anon could refer only to anonymous mappings.
> > > > > > > >
> > > > > > > > The reason to add such filter option is to avoid the parsing overhead
> > > > > > > > so map_anon wouldn't be helpful.
> > > > > > >
> > > > > > > Without chiming on whether the filter option is a good idea, I'd like
> > > > > > > to suggest that providing an efficient binary interfaces for pulling
> > > > > > > memory map information out of processes.  Some single-system-call
> > > > > > > method for retrieving a binary snapshot of a process's address space
> > > > > > > complete with attributes (selectable, like statx?) for each VMA would
> > > > > > > reduce complexity and increase performance in a variety of areas,
> > > > > > > e.g., Android memory map debugging commands.
> > > > > >
> > > > > > I agree it's the best we can get *generally*.
> > > > > > Michal, any opinion?
> > > > >
> > > > > I am not really sure this is directly related. I think the primary
> > > > > question that we have to sort out first is whether we want to have
> > > > > the remote madvise call process or vma fd based. This is an important
> > > > > distinction wrt. usability. I have only seen pid vs. pidfd discussions
> > > > > so far unfortunately.
> > > >
> > > > I don't think the vma fd approach is viable. We have some processes
> > > > with a *lot* of VMAs --- system_server had 4204 when I checked just
> > > > now (and that's typical) --- and an FD operation per VMA would be
> > > > excessive.
> > >
> > > What do you mean by excessive here? Do you expect the process to have
> > > them open all at once?
> >
> > Minchan's already done timing. More broadly, in an era with various
> > speculative execution mitigations, making a system call is pretty
> > expensive.
>
> This is a completely separate discussion. This could be argued about
> many other syscalls.

Yes, it can be. That's why we have scatter-gather IO system calls in
the first place.

> Let's make the semantic correct first before we
> even start thinking about mutliplexing. It is easier to multiplex on an
> existing and sane interface.

I don't think of it as "multiplexing", not when the fundamental unit
of operation is the address range.
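Something shaped like the existing scatter-gather calls would fit; a
hypothetical sketch (this signature is made up, modeled on
process_vm_readv(), and the addresses are illustrative):

    struct iovec ranges[] = {
            { .iov_base = (void *)0x7f00a0000000, .iov_len = 1 << 20 },
            { .iov_base = (void *)0x7f00c0000000, .iov_len = 4 << 20 },
    };
    /* One call, many ranges; seq would be the address-space cookie
     * discussed earlier. */
    process_madvise(pidfd, ranges, 2, MADV_COLD, seq);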

> Btw. Minchan concluded that multiplexing is not really all that
> important based on his numbers http://lkml.kernel.org/r/20190527074940.GB6879@google.com
>
> [...]
>
> > > Is this really too much different from /proc/<pid>/map_files?
> >
> > It's very different. See below.
> >
> > > > > An interface to query address range information is a separate but
> > > > > although a related topic. We have /proc/<pid>/[s]maps for that right
> > > > > now and I understand it is not a general win for all usecases because
> > > > > it tends to be slow for some. I can see how /proc/<pid>/map_anons could
> > > > > provide per vma information in a binary form via a fd based interface.
> > > > > But I would rather not conflate those two discussions much - well except
> > > > > if it could give one of the approaches more justification but let's
> > > > > focus on the madvise part first.
> > > >
> > > > I don't think it's a good idea to focus on one feature in a
> > > > multi-feature change when the interactions between features can be
> > > > very important for overall design of the multi-feature system and the
> > > > design of each feature.
> > > >
> > > > Here's my thinking on the high-level design:
> > > >
> > > > I'm imagining an address-range system that would work like this: we'd
> > > > create some kind of process_vm_getinfo(2) system call [1] that would
> > > > accept a statx-like attribute map and a pid/fd parameter as input and
> > > > return, on output, two things: 1) an array [2] of VMA descriptors
> > > > containing the requested information, and 2) a VMA configuration
> > > > sequence number. We'd then have process_madvise() and other
> > > > cross-process VM interfaces accept both address ranges and this
> > > > sequence number; they'd succeed only if the VMA configuration sequence
> > > > number is still current, i.e., the target process hasn't changed its
> > > > VMA configuration (implicitly or explicitly) since the call to
> > > > process_vm_getinfo().
> > >
> > > The sequence number is essentially a cookie that is transparent to the
> > > userspace right? If yes, how does it differ from a fd (returned from
> > > /proc/<pid>/map_{anons,files}/range) which is a cookie itself and it can
> >
> > If you want to operate on N VMAs simultaneously under an FD-per-VMA
> > model, you'd need to have those N FDs all open at the same time *and*
> > add some kind of system call that accepted those N FDs and an
> > operation to perform. The sequence number I'm proposing also applies
> > to the whole address space, not just one VMA. Even if you did have
> > these N FDs open all at once and supplied them all to some batch
> > operation, you couldn't guarantee via the FD mechanism that some *new*
> > VMA didn't appear in the address range you want to manipulate. A
> > global sequence number would catch this case. I still think supplying
> > a list of address ranges (like we already do for scatter-gather IO) is
> > less error-prone, less resource-intensive, more consistent with
> > existing practice, and equally flexible, especially if we start
> > supporting destructive cross-process memory operations, which may be
> > useful for things like checkpointing and optimizing process startup.
>
> I have a strong feeling you are over optimizing here. We are talking
> about a pro-active memory management and so far I haven't heard any
> usecase where all this would happen in the fast path. There are
> essentially two usecases I have heard so far. Age/Reclaim the whole
> process (with anon/fs preferency) and do the same on a particular
> and well specified range (e.g. a garbage collector or an inactive large
> image in browsert etc...). The former doesn't really care about parallel
> address range manipulations because it can tolerate them. The later is a
> completely different story.
>
> Are there any others where saving few ms matter so much?

Saving ms matters quite a bit. We may want to perform some of this
eager memory management in response to user activity, e.g.,
application switch, and even if that work isn't quite synchronous,
every cycle the system spends on management overhead is a cycle it
can't spend on rendering frames. Overhead means jank. Additionally,
we're on battery-operated devices. Milliseconds of CPU overhead
accumulated over a long time is a real energy sink.

> > Besides: process_vm_readv and process_vm_writev already work on
> > address ranges. Why should other cross-process memory APIs use a very
> > different model for naming memory regions?
>
> I would consider those APIs not a great example. They are racy on
> multiple levels (pid reuse and address space modification) and require
> non-trivial synchronization. Do you want something similar for madvise
> on a non-cooperating remote application?
>
> > > be used to revalidate when the operation is requested and fail if
> > > something has changed. Moreover, we already have an fd-based madvise
> > > syscall, so there shouldn't really be a large need to add a new set of
> > > syscalls.
> >
> > We have various system calls that provide hints for open files, but
> > the memory operations are distinct. Modeling anonymous memory as a
> > kind of file-backed memory for purposes of VMA manipulation would also
> > be a departure from existing practice. Can you help me understand why
> > you seem to favor the FD-per-VMA approach so heavily? I don't see any
> > arguments *for* an FD-per-VMA model for remote memory manipulation and
> > I see a lot of arguments against it. Is there some compelling
> > advantage I'm missing?
>
> First and foremost, it provides an easy cookie to userspace to
> guarantee time-of-check-to-time-of-use consistency.

But only for one VMA at a time.

> It also naturally
> extends the existing fadvise interface, which achieves madvise
> semantics on files.

There are lots of things that madvise can do that fadvise can't and
that don't even really make sense for fadvise, e.g., MADV_FREE. It
seems odd to me to duplicate much of the madvise interface into
fadvise so that we can use file APIs to give madvise hints. It seems
simpler to me to just provide a mechanism to put the madvise hints
where they're needed.

> I am not really pushing hard for this particular API but I really
> do care about a programming model that would be sane.

You've used "sane" twice so far in this message. Can you specify more
precisely what you mean by that word? I agree that there needs to be
some defense against TOCTOU races when doing remote memory management,
but I don't think providing this robustness via a file descriptor is
any more sane than alternative approaches. A file descriptor comes
with a lot of other features --- e.g., SCM_RIGHTS, fstat, and a
concept of owning a resource --- that aren't needed to achieve
robustness.

Normally, a file descriptor refers to some resource that the kernel
holds as long as the file descriptor (well, the open file description
or struct file) lives -- things like graphics buffers, files, and
sockets. If we're using an FD *just* as a cookie and not a resource,
I'd rather just expose the cookie directly.

> If we have a
> different means to achieve the same, then that's all fine by me, but so
> far I haven't heard any sound argument for inventing something
> completely new when we have established APIs to use.

Doesn't the next sentence describe something profoundly new? :-)

> Exporting anonymous mappings via
> proc the same way we do for file mappings doesn't seem to step too far
> outside of current practice.

It seems like a radical departure from existing practice to provide
filesystem interfaces to anonymous memory regions, e.g., anon_vma.
You've never been able to refer to those memory regions with file
descriptors.

All I'm suggesting is that we take the existing madvise mechanism,
make it work cross-process, and make it robust against TOCTOU
problems, all one step at a time. Maybe my sense of API "size" is
miscalibrated, but adding a new type of FD to refer to anonymous VMA
regions feels like a bigger departure and so requires stronger
justification, especially since the FD approach would likely end up
less efficient than a cookie-based one.
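
To make that concrete, here is a minimal sketch of the cookie-based
flow. Everything in it is an assumption for illustration
(process_vm_getinfo, the seq argument to process_madvise, the ESTALE
convention), not an existing or agreed-upon interface; MADV_COLD is
the hint proposed in this series.

/* Hypothetical sketch only: process_vm_getinfo, the seq argument, and
 * the ESTALE convention are assumptions, not existing kernel APIs. */
#include <errno.h>
#include <stdint.h>
#include <sys/mman.h>   /* MADV_COLD, as proposed in this series */
#include <sys/uio.h>

extern int process_vm_getinfo(int pidfd, unsigned long attr_mask,
                              void *buf, uint64_t *seq);
extern int process_madvise(int pidfd, const struct iovec *vec, int vlen,
                           int advice, uint64_t seq);

int hint_cold_revalidated(int pidfd, struct iovec *ranges, int n)
{
        uint64_t seq;

        for (;;) {
                /* Snapshot the target's VMAs and learn the current
                 * address-space sequence number. */
                if (process_vm_getinfo(pidfd, 0, NULL, &seq) < 0)
                        return -1;
                /* ... re-derive or revalidate RANGES from the snapshot ... */

                /* The kernel rejects the hint if the target's VMA
                 * configuration changed since the snapshot. */
                if (process_madvise(pidfd, ranges, n, MADV_COLD, seq) == 0)
                        return 0;
                if (errno != ESTALE)
                        return -1;
                /* Layout changed under us: re-inspect and retry. */
        }
}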

> and we should focus on discussing whether this is a
> sane model. And I think it would be much better to discuss that under
> the respective patch which introduces that API rather than here.

I think it's important to discuss what that API should look like. :-)

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28 11:56                                   ` Michal Hocko
@ 2019-05-28 12:18                                     ` Daniel Colascione
  2019-05-28 12:38                                       ` Michal Hocko
  0 siblings, 1 reply; 68+ messages in thread
From: Daniel Colascione @ 2019-05-28 12:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
	Sonny Rao, Brian Geffon, Linux API

On Tue, May 28, 2019 at 4:56 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 28-05-19 04:42:47, Daniel Colascione wrote:
> > On Tue, May 28, 2019 at 4:28 AM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Tue 28-05-19 20:12:08, Minchan Kim wrote:
> > > > On Tue, May 28, 2019 at 12:41:17PM +0200, Michal Hocko wrote:
> > > > > On Tue 28-05-19 19:32:56, Minchan Kim wrote:
> > > > > > On Tue, May 28, 2019 at 11:08:21AM +0200, Michal Hocko wrote:
> > > > > > > On Tue 28-05-19 17:49:27, Minchan Kim wrote:
> > > > > > > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote:
> > > > > > > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote:
> > > > > > > > > > > if we went with the per vma fd approach then you would get this
> > > > > > > > > > > feature automatically because map_files would refer to file backed
> > > > > > > > > > > mappings while map_anon could refer only to anonymous mappings.
> > > > > > > > > >
> > > > > > > > > > The reason to add such a filter option is to avoid the parsing
> > > > > > > > > > overhead, so map_anon wouldn't be helpful.
> > > > > > > > >
> > > > > > > > > Without chiming in on whether the filter option is a good idea, I'd like
> > > > > > > > > to suggest providing an efficient binary interface for pulling
> > > > > > > > > memory map information out of processes.  Some single-system-call
> > > > > > > > > method for retrieving a binary snapshot of a process's address space,
> > > > > > > > > complete with attributes (selectable, like statx?) for each VMA, would
> > > > > > > > > reduce complexity and increase performance in a variety of areas,
> > > > > > > > > e.g., Android memory map debugging commands.
> > > > > > > >
> > > > > > > > I agree it's the best we can get *generally*.
> > > > > > > > Michal, any opinion?
> > > > > > >
> > > > > > > I am not really sure this is directly related. I think the primary
> > > > > > > question that we have to sort out first is whether we want to have
> > > > > > > the remote madvise call be process or vma fd based. This is an important
> > > > > > > distinction wrt. usability. I have only seen pid vs. pidfd discussions
> > > > > > > so far, unfortunately.
> > > > > >
> > > > > > With the current usecase, it's a per-process API with distinguishable
> > > > > > anon/file, but I thought it could easily be extended later for each
> > > > > > address range operation as userspace gets smarter with more information.
> > > > >
> > > > > Never design a user API based on a single usecase, please. The "easily
> > > > > extended" part is by far not clear to me, TBH. As I've already mentioned
> > > > > several times, the synchronization model has to be thought through
> > > > > carefully before a remote process address range operation can be
> > > > > implemented.
> > > >
> > > > I agree with you that we shouldn't design an API on a single usecase, but
> > > > what you are concerned about is actually not our usecase, because we are
> > > > resilient to the race since MADV_COLD|PAGEOUT is not destructive.
> > > > Actually, many hints are already racy in that the upcoming access pattern
> > > > may differ from the behavior you expected at the moment of the call.
> > >
> > > How come they are racy wrt address ranges? You would have to be in
> > > a multithreaded environment, and then the onus of synchronization is on
> > > the threads. That model is quite clear. But we are talking about separate
> > > processes, and some of them might not even be aware of an external entity
> > > tweaking their address space.
> >
> > I don't think the difference between a thread and a process matters in
> > this context. Threads race on address space operations all the time
> > --- in the sense that multiple threads modify a process's address
> > space without synchronization.
>
> I would disagree. They do have in-kernel synchronization as long as they
> do not use MAP_FIXED. If they do want to use MAP_FIXED then they better
> synchronize or the result is undefined.

Right. It's because the kernel hands off different regions to
different non-MAP_FIXED mmap callers that it's pretty easy for threads
to mind their own business, but they're all still using the same
address space.

> > From a synchronization point
> > of view, it doesn't really matter whether it's a thread within the
> > target process or a thread outside the target process that does the
> > address space manipulation. What's new is the inspection of the
> > address space before performing an operation.
>
> The fundamental difference is that if you want to achieve the same
> inside the process, then your application is inherently aware of the
> operation and uses whatever synchronization is needed to achieve
> consistency. As soon as you allow the same from outside, you either
> have to have an aware target application as well, or you need a mechanism
> to find out that your decision has been invalidated by a later
> unsynchronized action.

I thought of this objection immediately after I hit send. :-)

I still don't think the intra- vs inter-process difference matters.
It's true that threads can synchronize with each other, but different
processes can synchronize with each other too. I mean, you *could* use
sem_open(3) for your heap lock and open the semaphore from two
different processes. That's silly, but it'd work.
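
A minimal sketch of that silly-but-working arrangement, with the
semaphore name made up (on Linux, link with -pthread):

#include <fcntl.h>
#include <semaphore.h>
#include <stdio.h>

int main(void)
{
        /* Any process that opens "/shared-heap-lock" shares this
         * semaphore, so it works as a cross-process "heap lock". */
        sem_t *lock = sem_open("/shared-heap-lock", O_CREAT, 0600, 1);

        if (lock == SEM_FAILED) {
                perror("sem_open");
                return 1;
        }
        sem_wait(lock);         /* acquire, even across processes */
        /* ... operate on the shared state ... */
        sem_post(lock);         /* release */
        sem_close(lock);
        return 0;
}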

The important requirement, I think, is that we need to support
managing "memory-naive" uncooperative tasks (perhaps legacy ones
written before cross-process memory management even became possible),
and I think that the cooperative-vs-uncooperative distinction matters
a lot more than the tgid of the thread doing the memory manipulation.
(Although in our case, we really do need a separate tgid. :-))

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28 12:06                                   ` Michal Hocko
@ 2019-05-28 12:22                                     ` Minchan Kim
  0 siblings, 0 replies; 68+ messages in thread
From: Minchan Kim @ 2019-05-28 12:22 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Daniel Colascione, Andrew Morton, LKML, linux-mm,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Shakeel Butt, Sonny Rao, Brian Geffon, Linux API

On Tue, May 28, 2019 at 02:06:14PM +0200, Michal Hocko wrote:
> On Tue 28-05-19 20:44:36, Minchan Kim wrote:
> > On Tue, May 28, 2019 at 01:28:40PM +0200, Michal Hocko wrote:
> > > On Tue 28-05-19 20:12:08, Minchan Kim wrote:
> > > > On Tue, May 28, 2019 at 12:41:17PM +0200, Michal Hocko wrote:
> > > > > On Tue 28-05-19 19:32:56, Minchan Kim wrote:
> > > > > > On Tue, May 28, 2019 at 11:08:21AM +0200, Michal Hocko wrote:
> > > > > > > On Tue 28-05-19 17:49:27, Minchan Kim wrote:
> > > > > > > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote:
> > > > > > > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote:
> > > > > > > > > > > if we went with the per vma fd approach then you would get this
> > > > > > > > > > > feature automatically because map_files would refer to file backed
> > > > > > > > > > > mappings while map_anon could refer only to anonymous mappings.
> > > > > > > > > >
> > > > > > > > > > The reason to add such a filter option is to avoid the parsing
> > > > > > > > > > overhead, so map_anon wouldn't be helpful.
> > > > > > > > > 
> > > > > > > > > Without chiming in on whether the filter option is a good idea, I'd like
> > > > > > > > > to suggest providing an efficient binary interface for pulling
> > > > > > > > > memory map information out of processes.  Some single-system-call
> > > > > > > > > method for retrieving a binary snapshot of a process's address space,
> > > > > > > > > complete with attributes (selectable, like statx?) for each VMA, would
> > > > > > > > > reduce complexity and increase performance in a variety of areas,
> > > > > > > > > e.g., Android memory map debugging commands.
> > > > > > > > 
> > > > > > > > I agree it's the best we can get *generally*.
> > > > > > > > Michal, any opinion?
> > > > > > > 
> > > > > > > I am not really sure this is directly related. I think the primary
> > > > > > > question that we have to sort out first is whether we want to have
> > > > > > > the remote madvise call be process or vma fd based. This is an important
> > > > > > > distinction wrt. usability. I have only seen pid vs. pidfd discussions
> > > > > > > so far, unfortunately.
> > > > > > 
> > > > > > With the current usecase, it's a per-process API with distinguishable
> > > > > > anon/file, but I thought it could easily be extended later for each
> > > > > > address range operation as userspace gets smarter with more information.
> > > > > 
> > > > > Never design a user API based on a single usecase, please. The "easily
> > > > > extended" part is by far not clear to me, TBH. As I've already mentioned
> > > > > several times, the synchronization model has to be thought through
> > > > > carefully before a remote process address range operation can be
> > > > > implemented.
> > > > 
> > > > I agree with you that we shouldn't design an API on a single usecase, but
> > > > what you are concerned about is actually not our usecase, because we are
> > > > resilient to the race since MADV_COLD|PAGEOUT is not destructive.
> > > > Actually, many hints are already racy in that the upcoming access pattern
> > > > may differ from the behavior you expected at the moment of the call.
> > > 
> > > How come they are racy wrt address ranges? You would have to be in
> > > a multithreaded environment, and then the onus of synchronization is on
> > > the threads. That model is quite clear. But we are talking about separate
> > 
> > Think about MADV_FREE. An allocator might think a chunk is worth marking
> > "freeable", but soon a user of the allocator asks for the chunk - i.e.,
> > it's no longer freeable once the user starts to use it.
> 
> That is not a race in the address space, right. The underlying object
> hasn't changed. It has been declared as freeable and since that moment
> nobody can rely on the content because it might have been discarded.
> Or put simply, the content is undefined. It is the responsibility of the
> madvise caller to make sure that the object is not in active use while
> it is marking it.
> 
> > My point is that these kinds of *hints* are always racy, so synchronization
> > wouldn't help much. That's why I want to restrict the hints process_madvise
> > supports to such non-destructive ones in the next respin.
> 
> I agree that non-destructive operations are safer against parallel
> modifications, because at worst you just get annoying and unexpected
> latency. But we should discuss whether this assumption is sufficient
> for further development. I am pretty sure that once we open up remote
> madvise, people will find usecases for destructive operations or even
> new madvise modes we haven't heard of. What then?

I support Daniel's vma seq number approach for the future plan.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28 12:11                                 ` Daniel Colascione
@ 2019-05-28 12:32                                   ` Michal Hocko
  0 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2019-05-28 12:32 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
	Sonny Rao, Brian Geffon, Linux API

On Tue 28-05-19 05:11:16, Daniel Colascione wrote:
> On Tue, May 28, 2019 at 4:49 AM Michal Hocko <mhocko@kernel.org> wrote:
[...]
> > > We have various system calls that provide hints for open files, but
> > > the memory operations are distinct. Modeling anonymous memory as a
> > > kind of file-backed memory for purposes of VMA manipulation would also
> > > be a departure from existing practice. Can you help me understand why
> > > you seem to favor the FD-per-VMA approach so heavily? I don't see any
> > > arguments *for* an FD-per-VMA model for remote memory manipulation and
> > > I see a lot of arguments against it. Is there some compelling
> > > advantage I'm missing?
> >
> > First and foremost, it provides an easy cookie to userspace to
> > guarantee time-of-check-to-time-of-use consistency.
> 
> But only for one VMA at a time.

Which is the unit we operate on, right?

> > It also naturally
> > extends the existing fadvise interface, which achieves madvise
> > semantics on files.
> 
> There are lots of things that madvise can do that fadvise can't and
> that don't even really make sense for fadvise, e.g., MADV_FREE. It
> seems odd to me to duplicate much of the madvise interface into
> fadvise so that we can use file APIs to give madvise hints. It seems
> simpler to me to just provide a mechanism to put the madvise hints
> where they're needed.

I do not see why we would duplicate anything. I confess I haven't tried
to implement this, so I might be overlooking something, but it seems to
me that we could simply reuse the same functionality from both APIs.

> > I am not really pushing hard for this particular API but I really
> > do care about a programming model that would be sane.
> 
> You've used "sane" twice so far in this message. Can you specify more
> precisely what you mean by that word?

Well, I would consider sane a model that prevents unintended side
effects (e.g. working on a completely different object) without tricky
synchronization.

> I agree that there needs to be
> some defense against TOCTOU races when doing remote memory management,
> but I don't think providing this robustness via a file descriptor is
> any more sane than alternative approaches. A file descriptor comes
> with a lot of other features --- e.g., SCM_RIGHTS, fstat, and a
> concept of owning a resource --- that aren't needed to achieve
> robustness.
> 
> Normally, a file descriptor refers to some resource that the kernel
> holds as long as the file descriptor (well, the open file description
> or struct file) lives -- things like graphics buffers, files, and
> sockets. If we're using an FD *just* as a cookie and not a resource,
> I'd rather just expose the cookie directly.

You are absolutely right. But doesn't that apply to any other
revalidation method that tracks VMA status as well? As I've said, I am
not married to this approach if there are better alternatives. So far we
are discussing what the actual semantics of the operation should be and
how much we want to tolerate races. And it seems that we are diving into
implementation details rather than landing on a firm decision about
whether the currently proposed API is suitable.

> > If we have a
> > different means to achieve the same, then that's all fine by me, but so
> > far I haven't heard any sound argument for inventing something
> > completely new when we have established APIs to use.
> 
> Doesn't the next sentence describe something profoundly new? :-)
> 
> > Exporting anonymous mappings via
> > proc the same way we do for file mappings doesn't seem to step too far
> > outside of current practice.
> 
> It seems like a radical departure from existing practice to provide
> filesystem interfaces to anonymous memory regions, e.g., anon_vma.
> You've never been able to refer to those memory regions with file
> descriptors.
> 
> All I'm suggesting is that we take the existing madvise mechanism,
> make it work cross-process, and make it robust against TOCTOU
> problems, all one step at a time. Maybe my sense of API "size" is
> miscalibrated, but adding a new type of FD to refer to anonymous VMA
> regions feels like a bigger departure and so requires stronger
> > justification, especially since the FD approach would likely end up
> > less efficient than a cookie-based one.

Feel free to propose a way to achieve that in the respective email
thread.
 
> > and we should focus on discussing whether this is a
> > sane model. And I think it would be much better to discuss that under
> > the respective patch which introduces that API rather than here.
> 
> I think it's important to discuss what that API should look like. :-)

It will be fun to follow this discussion and make some sense of
different parallel threads.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28 12:18                                     ` Daniel Colascione
@ 2019-05-28 12:38                                       ` Michal Hocko
  0 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2019-05-28 12:38 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
	Sonny Rao, Brian Geffon, Linux API

On Tue 28-05-19 05:18:48, Daniel Colascione wrote:
[...]
> The important requirement, I think, is that we need to support
> managing "memory-naive" uncooperative tasks (perhaps legacy ones
> written before cross-process memory management even became possible),
> and I think that the cooperative-vs-uncooperative distinction matters
> a lot more than the tgid of the thread doing the memory manipulation.
> (Although in our case, we really do need a separate tgid. :-))

Agreed, and that requires some sort of revalidation and an "object has
changed" failure in one form or another, IMHO.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector arrary
  2019-05-27  7:49             ` Minchan Kim
@ 2019-05-29 10:08               ` Daniel Colascione
  2019-05-29 10:33                 ` Michal Hocko
  0 siblings, 1 reply; 68+ messages in thread
From: Daniel Colascione @ 2019-05-29 10:08 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Michal Hocko, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
	Sonny Rao, Brian Geffon, Linux API

On Mon, May 27, 2019 at 12:49 AM Minchan Kim <minchan@kernel.org> wrote:
>
> On Tue, May 21, 2019 at 12:37:26PM +0200, Michal Hocko wrote:
> > On Tue 21-05-19 19:26:13, Minchan Kim wrote:
> > > On Tue, May 21, 2019 at 08:24:21AM +0200, Michal Hocko wrote:
> > > > On Tue 21-05-19 11:48:20, Minchan Kim wrote:
> > > > > On Mon, May 20, 2019 at 11:22:58AM +0200, Michal Hocko wrote:
> > > > > > [Cc linux-api]
> > > > > >
> > > > > > On Mon 20-05-19 12:52:53, Minchan Kim wrote:
> > > > > > > Currently, the process_madvise syscall works on only one address range,
> > > > > > > so the user should call the syscall several times to give hints to
> > > > > > > multiple address ranges.
> > > > > >
> > > > > > Is that a problem? How big of a problem? Any numbers?
> > > > >
> > > > > We easily have 2000+ VMAs, so it's not trivial overhead. I will come up
> > > > > with numbers in the description at the next respin.
> > > >
> > > > Does this really have to be a fast operation? I would expect the monitor
> > > > is by no means a fast path. The system call overhead is not what it used
> > > > to be, sigh, but still for something that is not a hot path it should be
> > > > tolerable, especially when the whole operation is quite expensive on its
> > > > own (wrt. the syscall entry/exit).
> > >
> > > What's different with process_vm_[readv|writev] and vmsplice?
> > > If the range needed to be covered is large, a vector operation makes sense
> > > to me.
> >
> > I am not saying that the vector API is wrong. All I am trying to say is
> > that the benefit is not really clear so far. If you want to push it
> > through, then you had better get some supporting data.
>
> I measured 1000 madvise syscalls vs. a vector range syscall with 1000
> ranges on a modern ARM64 device. Even though I saw a 15% improvement, the
> absolute gain is just 1ms, so I don't think it's worth supporting.
> I will drop vector support in the next revision.

Please do keep the vector support. Absolute timing is misleading,
since in a tight loop, you're not going to contend on mmap_sem. We've
seen tons of improvements in things like camera start come from
coalescing mprotect calls, with the gains coming from taking and
releasing various locks a lot less often and bouncing around less on
the contended lock paths. Raw throughput doesn't tell the whole story,
especially on mobile.
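
As a sketch of what the vector flavor could look like, borrowing the
iovec style of process_vm_readv(2); the signature below is an
assumption, not a settled interface:

#include <sys/uio.h>

/* Assumed vectored form: one syscall (and one round of lock
 * acquisition) covers N ranges instead of N syscall entries. */
extern ssize_t process_madvise(int pidfd, const struct iovec *vec,
                               unsigned long vlen, int advice,
                               unsigned long flags);

static ssize_t hint_ranges(int pidfd, const struct iovec *ranges,
                           unsigned long n, int advice)
{
        /* E.g. advice = MADV_COLD over every range in one call. */
        return process_madvise(pidfd, ranges, n, advice, 0);
}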

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector arrary
  2019-05-29 10:08               ` Daniel Colascione
@ 2019-05-29 10:33                 ` Michal Hocko
  2019-05-30  2:17                   ` Minchan Kim
  0 siblings, 1 reply; 68+ messages in thread
From: Michal Hocko @ 2019-05-29 10:33 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
	Sonny Rao, Brian Geffon, Linux API

On Wed 29-05-19 03:08:32, Daniel Colascione wrote:
> On Mon, May 27, 2019 at 12:49 AM Minchan Kim <minchan@kernel.org> wrote:
> >
> > On Tue, May 21, 2019 at 12:37:26PM +0200, Michal Hocko wrote:
> > > On Tue 21-05-19 19:26:13, Minchan Kim wrote:
> > > > On Tue, May 21, 2019 at 08:24:21AM +0200, Michal Hocko wrote:
> > > > > On Tue 21-05-19 11:48:20, Minchan Kim wrote:
> > > > > > On Mon, May 20, 2019 at 11:22:58AM +0200, Michal Hocko wrote:
> > > > > > > [Cc linux-api]
> > > > > > >
> > > > > > > On Mon 20-05-19 12:52:53, Minchan Kim wrote:
> > > > > > > > Currently, the process_madvise syscall works on only one address range,
> > > > > > > > so the user should call the syscall several times to give hints to
> > > > > > > > multiple address ranges.
> > > > > > >
> > > > > > > Is that a problem? How big of a problem? Any numbers?
> > > > > >
> > > > > > We easily have 2000+ VMAs, so it's not trivial overhead. I will come up
> > > > > > with numbers in the description at the next respin.
> > > > >
> > > > > Does this really have to be a fast operation? I would expect the monitor
> > > > > is by no means a fast path. The system call overhead is not what it used
> > > > > to be, sigh, but still for something that is not a hot path it should be
> > > > > tolerable, especially when the whole operation is quite expensive on its
> > > > > own (wrt. the syscall entry/exit).
> > > >
> > > > What's different with process_vm_[readv|writev] and vmsplice?
> > > > If the range needed to be covered is large, a vector operation makes sense
> > > > to me.
> > >
> > > I am not saying that the vector API is wrong. All I am trying to say is
> > > that the benefit is not really clear so far. If you want to push it
> > > through, then you had better get some supporting data.
> >
> > I measured 1000 madvise syscalls vs. a vector range syscall with 1000
> > ranges on a modern ARM64 device. Even though I saw a 15% improvement, the
> > absolute gain is just 1ms, so I don't think it's worth supporting.
> > I will drop vector support in the next revision.
> 
> Please do keep the vector support. Absolute timing is misleading,
> since in a tight loop, you're not going to contend on mmap_sem. We've
> seen tons of improvements in things like camera start come from
> coalescing mprotect calls, with the gains coming from taking and
> releasing various locks a lot less often and bouncing around less on
> the contended lock paths. Raw throughput doesn't tell the whole story,
> especially on mobile.

This will always be a double-edged sword. Taking a lock for longer can
improve the throughput of a single call, but it would make the latency
for anybody contending on the lock much worse.

Besides that, please do not overcomplicate the thing from the very
beginning. Let's start with a simple and well-defined remote madvise
alternative first and build a vector API on top, with some numbers
based on _real_ workloads.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector arrary
  2019-05-29 10:33                 ` Michal Hocko
@ 2019-05-30  2:17                   ` Minchan Kim
  2019-05-30  6:57                     ` Michal Hocko
  0 siblings, 1 reply; 68+ messages in thread
From: Minchan Kim @ 2019-05-30  2:17 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Daniel Colascione, Andrew Morton, LKML, linux-mm,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Shakeel Butt, Sonny Rao, Brian Geffon, Linux API

On Wed, May 29, 2019 at 12:33:52PM +0200, Michal Hocko wrote:
> On Wed 29-05-19 03:08:32, Daniel Colascione wrote:
> > On Mon, May 27, 2019 at 12:49 AM Minchan Kim <minchan@kernel.org> wrote:
> > >
> > > On Tue, May 21, 2019 at 12:37:26PM +0200, Michal Hocko wrote:
> > > > On Tue 21-05-19 19:26:13, Minchan Kim wrote:
> > > > > On Tue, May 21, 2019 at 08:24:21AM +0200, Michal Hocko wrote:
> > > > > > On Tue 21-05-19 11:48:20, Minchan Kim wrote:
> > > > > > > On Mon, May 20, 2019 at 11:22:58AM +0200, Michal Hocko wrote:
> > > > > > > > [Cc linux-api]
> > > > > > > >
> > > > > > > > On Mon 20-05-19 12:52:53, Minchan Kim wrote:
> > > > > > > > > Currently, the process_madvise syscall works on only one address range,
> > > > > > > > > so the user should call the syscall several times to give hints to
> > > > > > > > > multiple address ranges.
> > > > > > > >
> > > > > > > > Is that a problem? How big of a problem? Any numbers?
> > > > > > >
> > > > > > > We easily have 2000+ VMAs, so it's not trivial overhead. I will come up
> > > > > > > with numbers in the description at the next respin.
> > > > > >
> > > > > > Does this really have to be a fast operation? I would expect the monitor
> > > > > > is by no means a fast path. The system call overhead is not what it used
> > > > > > to be, sigh, but still for something that is not a hot path it should be
> > > > > > tolerable, especially when the whole operation is quite expensive on its
> > > > > > own (wrt. the syscall entry/exit).
> > > > >
> > > > > What's different with process_vm_[readv|writev] and vmsplice?
> > > > > If the range needed to be covered is large, a vector operation makes sense
> > > > > to me.
> > > >
> > > > I am not saying that the vector API is wrong. All I am trying to say is
> > > > that the benefit is not really clear so far. If you want to push it
> > > > through, then you had better get some supporting data.
> > >
> > > I measured 1000 madvise syscalls vs. a vector range syscall with 1000
> > > ranges on a modern ARM64 device. Even though I saw a 15% improvement, the
> > > absolute gain is just 1ms, so I don't think it's worth supporting.
> > > I will drop vector support in the next revision.
> > 
> > Please do keep the vector support. Absolute timing is misleading,
> > since in a tight loop, you're not going to contend on mmap_sem. We've
> > seen tons of improvements in things like camera start come from
> > coalescing mprotect calls, with the gains coming from taking and
> > releasing various locks a lot less often and bouncing around less on
> > the contended lock paths. Raw throughput doesn't tell the whole story,
> > especially on mobile.
> 
> This will always be a double-edged sword. Taking a lock for longer can
> improve the throughput of a single call, but it would make the latency
> for anybody contending on the lock much worse.
> 
> Besides that, please do not overcomplicate the thing from the very
> beginning. Let's start with a simple and well-defined remote madvise
> alternative first and build a vector API on top, with some numbers
> based on _real_ workloads.

At first, I didn't think about the atomicity of the address range race
because MADV_COLD/PAGEOUT is not critical with respect to the race.
However, you raised the atomicity issue because people could easily
extend the hints to destructive ones. I agree with that, and that's why
we discussed how to handle the race, and Daniel came up with a good idea.

  - vma configuration seq number via process_getinfo(2).

We discussed the race issue without _real_ workloads/requests because
it's common sense that people might extend the syscall later.

The same applies here. For the current workload, we don't need to
support vectors from a performance point of view, based on my
experiment. However, it's a rather limited experiment. Some
configurations might have 10000+ VMAs or a really slow CPU.

Furthermore, I want to have vector support due to the atomicity issue,
if that's really something we should consider.
With vector support in the API and the vma configuration sequence number
from Daniel, we could make operations on multiple address ranges atomic.
However, since we don't introduce vectors at this moment, we would need
to introduce *another syscall* later to be able to handle multiple
ranges all at once atomically, if that's okay.

Other thought:
Maybe we could extend the address range batch syscall to cover other MM
syscalls like mmap/munmap/madvise/mprotect and so on, because there
are multiple users that would benefit from this general batching
mechanism.
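
Purely as an illustration of that batching idea (mm_batch, MM_OP_*,
and struct mm_op are all made-up names):

/* Made-up sketch of a generic MM batching interface. */
struct mm_op {
        int           op;     /* MM_OP_MADVISE, MM_OP_MPROTECT, ... */
        unsigned long start;  /* range start */
        unsigned long len;    /* range length */
        unsigned long arg;    /* advice or prot bits, depending on op */
};

extern int mm_batch(int pidfd, const struct mm_op *ops,
                    unsigned long nr_ops, unsigned long flags);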

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector arrary
  2019-05-30  2:17                   ` Minchan Kim
@ 2019-05-30  6:57                     ` Michal Hocko
  2019-05-30  8:02                       ` Minchan Kim
  0 siblings, 1 reply; 68+ messages in thread
From: Michal Hocko @ 2019-05-30  6:57 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Daniel Colascione, Andrew Morton, LKML, linux-mm,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Shakeel Butt, Sonny Rao, Brian Geffon, Linux API

On Thu 30-05-19 11:17:48, Minchan Kim wrote:
> On Wed, May 29, 2019 at 12:33:52PM +0200, Michal Hocko wrote:
> > On Wed 29-05-19 03:08:32, Daniel Colascione wrote:
> > > On Mon, May 27, 2019 at 12:49 AM Minchan Kim <minchan@kernel.org> wrote:
> > > >
> > > > On Tue, May 21, 2019 at 12:37:26PM +0200, Michal Hocko wrote:
> > > > > On Tue 21-05-19 19:26:13, Minchan Kim wrote:
> > > > > > On Tue, May 21, 2019 at 08:24:21AM +0200, Michal Hocko wrote:
> > > > > > > On Tue 21-05-19 11:48:20, Minchan Kim wrote:
> > > > > > > > On Mon, May 20, 2019 at 11:22:58AM +0200, Michal Hocko wrote:
> > > > > > > > > [Cc linux-api]
> > > > > > > > >
> > > > > > > > > On Mon 20-05-19 12:52:53, Minchan Kim wrote:
> > > > > > > > > > Currently, the process_madvise syscall works on only one address range,
> > > > > > > > > > so the user should call the syscall several times to give hints to
> > > > > > > > > > multiple address ranges.
> > > > > > > > >
> > > > > > > > > Is that a problem? How big of a problem? Any numbers?
> > > > > > > >
> > > > > > > > We easily have 2000+ VMAs, so it's not trivial overhead. I will come up
> > > > > > > > with numbers in the description at the next respin.
> > > > > > >
> > > > > > > Does this really have to be a fast operation? I would expect the monitor
> > > > > > > is by no means a fast path. The system call overhead is not what it used
> > > > > > > to be, sigh, but still for something that is not a hot path it should be
> > > > > > > tolerable, especially when the whole operation is quite expensive on its
> > > > > > > own (wrt. the syscall entry/exit).
> > > > > >
> > > > > > What's different with process_vm_[readv|writev] and vmsplice?
> > > > > > If the range needed to be covered is large, a vector operation makes sense
> > > > > > to me.
> > > > >
> > > > > I am not saying that the vector API is wrong. All I am trying to say is
> > > > > that the benefit is not really clear so far. If you want to push it
> > > > > through, then you had better get some supporting data.
> > > >
> > > > I measured 1000 madvise syscalls vs. a vector range syscall with 1000
> > > > ranges on a modern ARM64 device. Even though I saw a 15% improvement, the
> > > > absolute gain is just 1ms, so I don't think it's worth supporting.
> > > > I will drop vector support in the next revision.
> > > 
> > > Please do keep the vector support. Absolute timing is misleading,
> > > since in a tight loop, you're not going to contend on mmap_sem. We've
> > > seen tons of improvements in things like camera start come from
> > > coalescing mprotect calls, with the gains coming from taking and
> > > releasing various locks a lot less often and bouncing around less on
> > > the contended lock paths. Raw throughput doesn't tell the whole story,
> > > especially on mobile.
> > 
> > This will always be a double-edged sword. Taking a lock for longer can
> > improve the throughput of a single call, but it would make the latency
> > for anybody contending on the lock much worse.
> > 
> > Besides that, please do not overcomplicate the thing from the very
> > beginning. Let's start with a simple and well-defined remote madvise
> > alternative first and build a vector API on top, with some numbers
> > based on _real_ workloads.
> 
> At first, I didn't think about the atomicity of the address range race
> because MADV_COLD/PAGEOUT is not critical with respect to the race.
> However, you raised the atomicity issue because people could easily
> extend the hints to destructive ones. I agree with that, and that's why
> we discussed how to handle the race, and Daniel came up with a good idea.

Just for clarification, I didn't really mean atomicity but rather
_consistency_ (essentially time-of-check-to-time-of-use consistency).
 
>   - vma configuration seq number via process_getinfo(2).
> 
> We discussed the race issue without _real_ workloads/requests because
> it's common sense that people might extend the syscall later.
> 
> The same applies here. For the current workload, we don't need to
> support vectors from a performance point of view, based on my
> experiment. However, it's a rather limited experiment. Some
> configurations might have 10000+ VMAs or a really slow CPU.
> 
> Furthermore, I want to have vector support due to the atomicity issue,
> if that's really something we should consider.
> With vector support in the API and the vma configuration sequence number
> from Daniel, we could make operations on multiple address ranges atomic.

I am not sure what you mean here. Perform all ranges atomically wrt.
other address space modifications? If yes, I am not sure we want that
semantic because it can cause really long stalls for other operations,
but that is a discussion on its own and I would rather focus on a simple
interface first.

> However, since we don't introduce vectors at this moment, we would need
> to introduce *another syscall* later to be able to handle multiple
> ranges all at once atomically, if that's okay.

Agreed.

> Other thought:
> Maybe we could extend the address range batch syscall to cover other MM
> syscalls like mmap/munmap/madvise/mprotect and so on, because there
> are multiple users that would benefit from this general batching
> mechanism.

Again a discussion on its own ;)

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector arrary
  2019-05-30  6:57                     ` Michal Hocko
@ 2019-05-30  8:02                       ` Minchan Kim
  2019-05-30 16:19                         ` Daniel Colascione
  2019-05-30 18:47                         ` Michal Hocko
  0 siblings, 2 replies; 68+ messages in thread
From: Minchan Kim @ 2019-05-30  8:02 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Daniel Colascione, Andrew Morton, LKML, linux-mm,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Shakeel Butt, Sonny Rao, Brian Geffon, Linux API

On Thu, May 30, 2019 at 08:57:55AM +0200, Michal Hocko wrote:
> On Thu 30-05-19 11:17:48, Minchan Kim wrote:
> > On Wed, May 29, 2019 at 12:33:52PM +0200, Michal Hocko wrote:
> > > On Wed 29-05-19 03:08:32, Daniel Colascione wrote:
> > > > On Mon, May 27, 2019 at 12:49 AM Minchan Kim <minchan@kernel.org> wrote:
> > > > >
> > > > > On Tue, May 21, 2019 at 12:37:26PM +0200, Michal Hocko wrote:
> > > > > > On Tue 21-05-19 19:26:13, Minchan Kim wrote:
> > > > > > > On Tue, May 21, 2019 at 08:24:21AM +0200, Michal Hocko wrote:
> > > > > > > > On Tue 21-05-19 11:48:20, Minchan Kim wrote:
> > > > > > > > > On Mon, May 20, 2019 at 11:22:58AM +0200, Michal Hocko wrote:
> > > > > > > > > > [Cc linux-api]
> > > > > > > > > >
> > > > > > > > > > On Mon 20-05-19 12:52:53, Minchan Kim wrote:
> > > > > > > > > > > Currently, the process_madvise syscall works on only one address range,
> > > > > > > > > > > so the user should call the syscall several times to give hints to
> > > > > > > > > > > multiple address ranges.
> > > > > > > > > >
> > > > > > > > > > Is that a problem? How big of a problem? Any numbers?
> > > > > > > > >
> > > > > > > > > We easily have 2000+ VMAs, so it's not trivial overhead. I will come up
> > > > > > > > > with numbers in the description at the next respin.
> > > > > > > >
> > > > > > > > Does this really have to be a fast operation? I would expect the monitor
> > > > > > > > is by no means a fast path. The system call overhead is not what it used
> > > > > > > > to be, sigh, but still for something that is not a hot path it should be
> > > > > > > > tolerable, especially when the whole operation is quite expensive on its
> > > > > > > > own (wrt. the syscall entry/exit).
> > > > > > >
> > > > > > > What's different with process_vm_[readv|writev] and vmsplice?
> > > > > > > If the range needed to be covered is large, a vector operation makes sense
> > > > > > > to me.
> > > > > >
> > > > > > I am not saying that the vector API is wrong. All I am trying to say is
> > > > > > that the benefit is not really clear so far. If you want to push it
> > > > > > through, then you had better get some supporting data.
> > > > >
> > > > > I measured 1000 madvise syscalls vs. a vector range syscall with 1000
> > > > > ranges on a modern ARM64 device. Even though I saw a 15% improvement, the
> > > > > absolute gain is just 1ms, so I don't think it's worth supporting.
> > > > > I will drop vector support in the next revision.
> > > > 
> > > > Please do keep the vector support. Absolute timing is misleading,
> > > > since in a tight loop, you're not going to contend on mmap_sem. We've
> > > > seen tons of improvements in things like camera start come from
> > > > coalescing mprotect calls, with the gains coming from taking and
> > > > releasing various locks a lot less often and bouncing around less on
> > > > the contended lock paths. Raw throughput doesn't tell the whole story,
> > > > especially on mobile.
> > > 
> > > This will always be a double-edged sword. Taking a lock for longer can
> > > improve the throughput of a single call, but it would make the latency
> > > for anybody contending on the lock much worse.
> > > 
> > > Besides that, please do not overcomplicate the thing from the very
> > > beginning. Let's start with a simple and well-defined remote madvise
> > > alternative first and build a vector API on top, with some numbers
> > > based on _real_ workloads.
> > 
> > At first, I didn't think about the atomicity of the address range race
> > because MADV_COLD/PAGEOUT is not critical with respect to the race.
> > However, you raised the atomicity issue because people could easily
> > extend the hints to destructive ones. I agree with that, and that's why
> > we discussed how to handle the race, and Daniel came up with a good idea.
> 
> Just for clarification, I didn't really mean atomicity but rather
> _consistency_ (essentially time-of-check-to-time-of-use consistency).

What do you mean by *consistency*? Could you elaborate a bit more?

>  
> >   - vma configuration seq number via process_getinfo(2).
> > 
> > We discussed the race issue without _real_ workloads/requests because
> > it's common sense that people might extend the syscall later.
> > 
> > The same applies here. For the current workload, we don't need to
> > support vectors from a performance point of view, based on my
> > experiment. However, it's a rather limited experiment. Some
> > configurations might have 10000+ VMAs or a really slow CPU.
> > 
> > Furthermore, I want to have vector support due to the atomicity issue,
> > if that's really something we should consider.
> > With vector support in the API and the vma configuration sequence number
> > from Daniel, we could make operations on multiple address ranges atomic.
> 
> I am not sure what you mean here. Perform all ranges atomically wrt.
> other address space modifications? If yes, I am not sure we want that

Yup, I think it's *necessary* if we want to support destructive hints
via process_madvise.

> semantic because it can cause really long stalls for other operations,

It might or it might not.

For example, if we could multiplex several operations that would each
otherwise walk the page tables, it could be more efficient than doing
a separate page table walk for each syscall.

> but that is a discussion on its own and I would rather focus on a simple
> interface first.

It seems it's time to send RFCv2, since we have discussed a lot although
we don't have a clear conclusion yet. But I still want to understand
what you meant by _consistency_.

Thanks for the review, Michal! It's very helpful.

> 
> > However, since we don't introduce vectors at this moment, we would need
> > to introduce *another syscall* later to be able to handle multiple
> > ranges all at once atomically, if that's okay.
> 
> Agreed.
> 
> > Other thought:
> > Maybe we could extend the address range batch syscall to cover other MM
> > syscalls like mmap/munmap/madvise/mprotect and so on, because there
> > are multiple users that would benefit from this general batching
> > mechanism.
> 
> Again a discussion on its own ;)
> 
> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector arrary
  2019-05-30  8:02                       ` Minchan Kim
@ 2019-05-30 16:19                         ` Daniel Colascione
  2019-05-30 18:47                         ` Michal Hocko
  1 sibling, 0 replies; 68+ messages in thread
From: Daniel Colascione @ 2019-05-30 16:19 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Michal Hocko, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
	Sonny Rao, Brian Geffon, Linux API

On Thu, May 30, 2019 at 1:02 AM Minchan Kim <minchan@kernel.org> wrote:
>
> On Thu, May 30, 2019 at 08:57:55AM +0200, Michal Hocko wrote:
> > On Thu 30-05-19 11:17:48, Minchan Kim wrote:
> > > On Wed, May 29, 2019 at 12:33:52PM +0200, Michal Hocko wrote:
> > > > On Wed 29-05-19 03:08:32, Daniel Colascione wrote:
> > > > > On Mon, May 27, 2019 at 12:49 AM Minchan Kim <minchan@kernel.org> wrote:
> > > > > >
> > > > > > On Tue, May 21, 2019 at 12:37:26PM +0200, Michal Hocko wrote:
> > > > > > > On Tue 21-05-19 19:26:13, Minchan Kim wrote:
> > > > > > > > On Tue, May 21, 2019 at 08:24:21AM +0200, Michal Hocko wrote:
> > > > > > > > > On Tue 21-05-19 11:48:20, Minchan Kim wrote:
> > > > > > > > > > On Mon, May 20, 2019 at 11:22:58AM +0200, Michal Hocko wrote:
> > > > > > > > > > > [Cc linux-api]
> > > > > > > > > > >
> > > > > > > > > > > On Mon 20-05-19 12:52:53, Minchan Kim wrote:
> > > > > > > > > > > > Currently, the process_madvise syscall works on only one address range,
> > > > > > > > > > > > so the user should call the syscall several times to give hints to
> > > > > > > > > > > > multiple address ranges.
> > > > > > > > > > >
> > > > > > > > > > > Is that a problem? How big of a problem? Any numbers?
> > > > > > > > > >
> > > > > > > > > > We easily have 2000+ VMAs, so it's not trivial overhead. I will come up
> > > > > > > > > > with numbers in the description at the next respin.
> > > > > > > > >
> > > > > > > > > Does this really have to be a fast operation? I would expect the monitor
> > > > > > > > > is by no means a fast path. The system call overhead is not what it used
> > > > > > > > > to be, sigh, but still for something that is not a hot path it should be
> > > > > > > > > tolerable, especially when the whole operation is quite expensive on its
> > > > > > > > > own (wrt. the syscall entry/exit).
> > > > > > > >
> > > > > > > > What's different with process_vm_[readv|writev] and vmsplice?
> > > > > > > > If the range needed to be covered is large, a vector operation makes sense
> > > > > > > > to me.
> > > > > > >
> > > > > > > I am not saying that the vector API is wrong. All I am trying to say is
> > > > > > > that the benefit is not really clear so far. If you want to push it
> > > > > > > through, then you had better get some supporting data.
> > > > > >
> > > > > > I measured 1000 madvise syscalls vs. a vector range syscall with 1000
> > > > > > ranges on a modern ARM64 device. Even though I saw a 15% improvement, the
> > > > > > absolute gain is just 1ms, so I don't think it's worth supporting.
> > > > > > I will drop vector support in the next revision.
> > > > >
> > > > > Please do keep the vector support. Absolute timing is misleading,
> > > > > since in a tight loop, you're not going to contend on mmap_sem. We've
> > > > > seen tons of improvements in things like camera start come from
> > > > > coalescing mprotect calls, with the gains coming from taking and
> > > > > releasing various locks a lot less often and bouncing around less on
> > > > > the contended lock paths. Raw throughput doesn't tell the whole story,
> > > > > especially on mobile.
> > > >
> > > > This will always be a double-edged sword. Taking a lock for longer can
> > > > improve the throughput of a single call, but it would make the latency
> > > > for anybody contending on the lock much worse.
> > > >
> > > > Besides that, please do not overcomplicate the thing from the very
> > > > beginning. Let's start with a simple and well-defined remote madvise
> > > > alternative first and build a vector API on top, with some numbers
> > > > based on _real_ workloads.
> > >
> > > At first, I didn't think about the atomicity of the address range race
> > > because MADV_COLD/PAGEOUT is not critical with respect to the race.
> > > However, you raised the atomicity issue because people could easily
> > > extend the hints to destructive ones. I agree with that, and that's why
> > > we discussed how to handle the race, and Daniel came up with a good idea.
> >
> > Just for clarification, I didn't really mean atomicity but rather
> > _consistency_ (essentially time-of-check-to-time-of-use consistency).
>
> What do you mean by *consistency*? Could you elaborate a bit more?
>
> >
> > >   - vma configuration seq number via process_getinfo(2).
> > >
> > > We discussed the race issue without _real_ workloads/requests because
> > > it's common sense that people might extend the syscall later.
> > >
> > > The same applies here. For the current workload, we don't need to
> > > support vectors from a performance point of view, based on my
> > > experiment. However, it's a rather limited experiment. Some
> > > configurations might have 10000+ VMAs or a really slow CPU.
> > >
> > > Furthermore, I want to have vector support due to the atomicity issue,
> > > if that's really something we should consider.
> > > With vector support in the API and the vma configuration sequence number
> > > from Daniel, we could make operations on multiple address ranges atomic.
> >
> > I am not sure what you mean here. Perform all ranges atomically wrt.
> > other address space modifications? If yes, I am not sure we want that
>
> Yup, I think it's *necessary* if we want to support destructive hints
> via process_madvise.

[Puts on flame-proof suit]

Here's a quick sketch of what I have in mind for process_getinfo(2).
Keep in mind that it's still just a rough idea.

We've had trouble efficiently learning about the process and memory
state of the system via procfs. Android background memory-use scans
(the android.bg daemon) consume quite a bit of CPU time computing PSS;
Minchan's done a lot of thinking about how we can specify desired page
sets for compaction as part of this patch set; and the full procfs
walks that some trace collection tools need to undertake take more than
200ms (sometimes much more), due mostly to procfs iteration. ISTM we
can do better.

While procfs *works* on a functional level, it's inefficient due to
the splatting of information we want across several different files
(which need to be independently opened --- e.g.,
/proc/pid/oom_score_adj *and* /proc/pid/status), inefficient due to
the ad-hoc text formatting, inefficient due to information over-fetch,
and cumbersome due to the fundamental impedance mismatch between
filesystem APIs and process lifetimes. procfs itself is also optional,
which has caused various bits of awkwardness that you'll recall from
the pidfd discussions.

How about we solve the problem once and for all? I'm imagining a new
process_getinfo(2) that solves all of these problems at the same time.
I want something with a few properties:

1) if we want to learn M facts about N things, we enter the kernel
once and learn all M*N things,
2) the information we collect is self-consistent (which implies
atomicity most of the time),
3) we retrieve the information we want in an efficient binary format, and
4) we don't pay to learn anything not in M.

I've jotted down a quick sketch of the API below; I'm curious what
everyone else thinks. It'd basically look like this:

int process_getinfo(int nr_proc, int* proc, int flags, unsigned long
mask, void* out_buf, size_t* inout_sz)

We wouldn't use the return value for much: 0 on success and -1 on
error with errno set. NR_PROC and PROC together would specify the
objects we want to learn about, which would be either PIDs or PIDFDs
or maybe nothing at all if FLAGS tells us to inspect every process on
the system. MASK is a bitset of facts we want to learn, described
below. OUT_BUF and INOUT_SZ are for actually communicating the result.
On input, the caller would fill *INOUT_SZ with the size of the buffer
to which OUT_BUF points; on success, we'd fill *INOUT_SZ with the
number of bytes we actually used. If the output buffer is too small,
we'll truncate the result, fail the system call with E2BIG, and fill
*INOUT_SZ with the number of needed bytes, inviting the caller to try
again. (If a caller supplies something huge like a reusable 512KB
buffer on the first call, no reallocation or retries will be
necessary in practice on a typically-sized system.)
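
In caller terms, the truncate-and-retry protocol would look roughly
like this sketch (process_getinfo itself being, of course, still
hypothetical):

#include <errno.h>
#include <stdlib.h>

extern int process_getinfo(int nr_proc, int *proc, int flags,
                           unsigned long mask, void *out_buf,
                           size_t *inout_sz);

/* Returns a buffer the caller casts to struct process_info, or NULL. */
void *fetch_info(int nr_proc, int *proc, int flags, unsigned long mask)
{
        size_t sz = 512 * 1024;         /* generous first guess */
        void *buf = malloc(sz);

        while (buf) {
                if (process_getinfo(nr_proc, proc, flags, mask,
                                    buf, &sz) == 0)
                        return buf;     /* sz now holds bytes used */
                if (errno != E2BIG)
                        break;          /* real failure */
                void *bigger = realloc(buf, sz);  /* sz = bytes needed */
                if (!bigger)
                        break;
                buf = bigger;           /* grow and retry */
        }
        free(buf);
        return NULL;
}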

The actual returned buffer is a collection of structures and data
blocks starting with a struct process_info. The structures in the
returned buffer sometimes contain "pointers" to other structures
encoded as byte offsets from the start of the information buffer.
Using offsets instead of actual pointers keeps the format the same
across 32- and 64-bit versions of process_getinfo and makes it
possible to relocate the returned information buffer with memcpy.

struct process_info {
  int first_procrec_offset;  // struct procrec*
  //  Examples of system-wide things we could ask for
  int mem_total_kb;
  int mem_free_kb;
  int mem_available_kb;
  char reserved[];
};

struct procrec {
  int next_procrec_offset;  // struct procrec*
  // Following fields are self-explanatory and are only examples of the
  // kind of information we could provide.
  int tid;
  int tgid;
  char status;
  int oom_score_adj;
  struct { int real, effective, saved, fs; } uids;
  int prio;
  int comm_offset;  // char*
  uint64_t rss_file_kb;
  uint64_t rss_anon_kb;
  uint64_t vm_seq;
  int first_procrec_vma_offset;  // struct procrec_vma*
  char reserved[];
};

struct procrec_vma {
  int next_procrec_vma_offset;  // struct procrec_vma*
  unsigned long start;
  unsigned long end;
  int backing_file_name_offset;  // char*
  int prot;
  char reserved[];
};

Callers would use the returned buffer by casting it to a struct
process_info and following the internal "pointers".
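
Concretely, consuming the offset-encoded "pointers" would look
something like this sketch against the structures above:

/* Turn an offset into a pointer; offset zero encodes NULL. */
#define INFO_PTR(info, type, off) \
        ((off) ? (type *)((char *)(info) + (off)) : (type *)0)

static void walk_snapshot(struct process_info *info)
{
        struct procrec *p = INFO_PTR(info, struct procrec,
                                     info->first_procrec_offset);

        for (; p; p = INFO_PTR(info, struct procrec,
                               p->next_procrec_offset)) {
                struct procrec_vma *v =
                        INFO_PTR(info, struct procrec_vma,
                                 p->first_procrec_vma_offset);

                for (; v; v = INFO_PTR(info, struct procrec_vma,
                                       v->next_procrec_vma_offset)) {
                        /* e.g. tally v->end - v->start per process */
                }
        }
}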

MASK would specify which bits of information we wanted: for example,
if we asked for PROCESS_VM_MEMORY_MAP, the kernel would fill in each
struct procrec's memory map pointer (first_procrec_vma_offset above)
and have it point to a struct procrec_vma in the returned output
buffer. If we omitted PROCESS_VM_MEMORY_MAP, we'd leave that field as
NULL (encoded as offset zero). The kernel would embed any strings (like
comm and VMA names) into the output buffer; the precise locations
would be unspecified so long as callers could find these fields via
output-buffer pointers.

Because all the structures are variable-length and are chained
together with explicit pointers (or offsets) instead of being stuffed
into a literal array, we can add additional fields to the output
structures any time we want without breaking binary compatibility.
Callers would tell the kernel that they're interested in the added
struct fields by asking for them via bits in MASK, and kernels that
don't understand those fields would just fail the system call with
EINVAL or something.
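
To make the extension story concrete, the mask could be laid out
something like the sketch below; only PROCESS_VM_MEMORY_MAP appears in
the text above, and the other names are invented for illustration:

/* Hypothetical mask bits; actual values would be fixed by the ABI. */
#define PROCESS_IDS            (1UL << 0)  /* tid, tgid, status          */
#define PROCESS_CREDS          (1UL << 1)  /* uids, prio, oom_score_adj  */
#define PROCESS_VM_COUNTERS    (1UL << 2)  /* rss_anon_kb, rss_file_kb   */
#define PROCESS_VM_MEMORY_MAP  (1UL << 3)  /* struct procrec_vma chain   */
/* A later kernel adds a field and a bit; older kernels reject the bit: */
#define PROCESS_CGROUP_PATH    (1UL << 4)  /* unknown bit -> EINVAL      */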

Depending on how we call it, we can use this API for a number of different things:

1) Quick retrieval of system-wide memory counters, like /proc/meminfo
2) Quick snapshot of all process-thread identities (asking for the
wildcard TID match via FLAGS)
3) Fast enumeration of one process's address space
4) Collecting process-summary VM counters (e.g., rss_anon and
rss_file) for a set of processes (see the sketch after this list)
5) Retrieval of every VMA of every process on the system for debugging
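
As a rough sketch of case 4, reusing the hypothetical helpers and mask
bits from the earlier snippets (PROCESS_FLAG_PIDS, meaning "treat
proc[] as PIDs", is likewise invented):

void print_rss(int* pids, int nr_pids)
{
  struct process_info* info;
  struct procrec* p;
  void* buf;
  size_t sz;

  if (get_info_buf(nr_pids, pids, PROCESS_FLAG_PIDS,
                   PROCESS_VM_COUNTERS, &buf, &sz))
    return;
  info = buf;
  for (p = pgi_deref(buf, info->first_procrec_offset); p;
       p = pgi_deref(buf, p->next_procrec_offset))
    printf("%d: anon %llu kB, file %llu kB\n", p->tgid,
           (unsigned long long)p->rss_anon_kb,
           (unsigned long long)p->rss_file_kb);
  free(buf);
}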

We can do all of this with one entry into the kernel and without
opening any new file descriptors (unless we want to use pidfds as
inputs). We can also make this operation as atomic as we want, e.g.,
taking mmap_sem while looking at each process and taking tasklist_lock
so all the thread IDs line up with their processes. We don't
necessarily need to take the mmap_sems of all processes we care about
at the same time.

Since this isn't a filesystem-based API, we don't have to deal with
seq_file or with consistency issues arising from userspace
programs doing strange things, like reading procfs files very slowly in
small chunks. Security-wise, we'd just use different access checks for
different requested information bits in MASK, maybe supplying "no
access" struct procrec entries if a caller doesn't happen to have
access to a particular process. I suppose we can talk about whether
access check failures should result in dummy values or syscall
failure: maybe callers should select which behavior they want.
Format-wise, we could also just return flatbuffer messages from the
kernel, but I suspect that we don't want flatbuffer in the kernel
right now. :-)

The API I'm proposing accepts an array of processes to inspect. We
could simplify it by accepting just one process and making the caller
enter the kernel once per process it wants to learn about, but this
simplification would make the API less useful for answering questions
like "What's the RSS of every process on the system right now?".

What do you think?

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector array
  2019-05-30  8:02                       ` Minchan Kim
  2019-05-30 16:19                         ` Daniel Colascione
@ 2019-05-30 18:47                         ` Michal Hocko
  1 sibling, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2019-05-30 18:47 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Daniel Colascione, Andrew Morton, LKML, linux-mm,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Shakeel Butt, Sonny Rao, Brian Geffon, Linux API

On Thu 30-05-19 17:02:14, Minchan Kim wrote:
> On Thu, May 30, 2019 at 08:57:55AM +0200, Michal Hocko wrote:
> > On Thu 30-05-19 11:17:48, Minchan Kim wrote:
[...]
> > > At first, I didn't think about atomicity of the address range race
> > > because MADV_COLD/PAGEOUT is not critical with respect to the race.
> > > However, you raised the atomicity issue because people could extend
> > > the hints to destructive ones easily. I agree with that, and that's why
> > > we discussed how to handle the race, and Daniel came up with a good idea.
> > 
> > Just for clarification, I didn't really mean atomicity but rather
> > _consistency_ (essentially time-of-check-to-time-of-use consistency).
> 
> What do you mean by *consistency*? Could you elaborate on it a bit more?

That you operate on the object you obtained by some means. In other
words, that the range you want to call madvise on hasn't been
remapped or replaced by a different mmap operation.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

end of thread, other threads:[~2019-05-30 18:47 UTC | newest]

Thread overview: 68+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20190520035254.57579-1-minchan@kernel.org>
     [not found] ` <20190520035254.57579-2-minchan@kernel.org>
2019-05-20  8:16   ` [RFC 1/7] mm: introduce MADV_COOL Michal Hocko
2019-05-20  8:19     ` Michal Hocko
2019-05-20 15:08       ` Suren Baghdasaryan
2019-05-20 22:55       ` Minchan Kim
2019-05-20 22:54     ` Minchan Kim
2019-05-21  6:04       ` Michal Hocko
2019-05-21  9:11         ` Minchan Kim
2019-05-21 10:05           ` Michal Hocko
     [not found] ` <20190520035254.57579-4-minchan@kernel.org>
2019-05-20  8:27   ` [RFC 3/7] mm: introduce MADV_COLD Michal Hocko
2019-05-20 23:00     ` Minchan Kim
2019-05-21  6:08       ` Michal Hocko
2019-05-21  9:13         ` Minchan Kim
     [not found] ` <20190520035254.57579-6-minchan@kernel.org>
2019-05-20  9:18   ` [RFC 5/7] mm: introduce external memory hinting API Michal Hocko
2019-05-21  2:41     ` Minchan Kim
2019-05-21  6:17       ` Michal Hocko
2019-05-21 10:32         ` Minchan Kim
     [not found] ` <20190520035254.57579-7-minchan@kernel.org>
2019-05-20  9:22   ` [RFC 6/7] mm: extend process_madvise syscall to support vector array Michal Hocko
2019-05-21  2:48     ` Minchan Kim
2019-05-21  6:24       ` Michal Hocko
2019-05-21 10:26         ` Minchan Kim
2019-05-21 10:37           ` Michal Hocko
2019-05-27  7:49             ` Minchan Kim
2019-05-29 10:08               ` Daniel Colascione
2019-05-29 10:33                 ` Michal Hocko
2019-05-30  2:17                   ` Minchan Kim
2019-05-30  6:57                     ` Michal Hocko
2019-05-30  8:02                       ` Minchan Kim
2019-05-30 16:19                         ` Daniel Colascione
2019-05-30 18:47                         ` Michal Hocko
     [not found] ` <20190520035254.57579-8-minchan@kernel.org>
2019-05-20  9:28   ` [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER Michal Hocko
2019-05-21  2:55     ` Minchan Kim
2019-05-21  6:26       ` Michal Hocko
2019-05-27  7:58         ` Minchan Kim
2019-05-27 12:44           ` Michal Hocko
2019-05-28  3:26             ` Minchan Kim
2019-05-28  6:29               ` Michal Hocko
2019-05-28  8:13                 ` Minchan Kim
2019-05-28  8:31                   ` Daniel Colascione
2019-05-28  8:49                     ` Minchan Kim
2019-05-28  9:08                       ` Michal Hocko
2019-05-28  9:39                         ` Daniel Colascione
2019-05-28 10:33                           ` Michal Hocko
2019-05-28 11:21                             ` Daniel Colascione
2019-05-28 11:49                               ` Michal Hocko
2019-05-28 12:11                                 ` Daniel Colascione
2019-05-28 12:32                                   ` Michal Hocko
2019-05-28 10:32                         ` Minchan Kim
2019-05-28 10:41                           ` Michal Hocko
2019-05-28 11:12                             ` Minchan Kim
2019-05-28 11:28                               ` Michal Hocko
2019-05-28 11:42                                 ` Daniel Colascione
2019-05-28 11:56                                   ` Michal Hocko
2019-05-28 12:18                                     ` Daniel Colascione
2019-05-28 12:38                                       ` Michal Hocko
2019-05-28 12:10                                   ` Minchan Kim
2019-05-28 11:44                                 ` Minchan Kim
2019-05-28 11:51                                   ` Daniel Colascione
2019-05-28 12:06                                   ` Michal Hocko
2019-05-28 12:22                                     ` Minchan Kim
2019-05-28 11:28                             ` Daniel Colascione
2019-05-21 15:33       ` Johannes Weiner
2019-05-22  1:50         ` Minchan Kim
2019-05-20  9:28 ` [RFC 0/7] introduce memory hinting API for external process Michal Hocko
     [not found] ` <20190520164605.GA11665@cmpxchg.org>
     [not found]   ` <20190521043950.GJ10039@google.com>
2019-05-21  6:32     ` Michal Hocko
     [not found] ` <20190521014452.GA6738@bombadil.infradead.org>
2019-05-21  6:34   ` Michal Hocko
2019-05-21 12:53 ` Shakeel Butt
     [not found] ` <dbe801f0-4bbe-5f6e-9053-4b7deb38e235@arm.com>
     [not found]   ` <CAEe=Sxka3Q3vX+7aWUJGKicM+a9Px0rrusyL+5bB1w4ywF6N4Q@mail.gmail.com>
     [not found]     ` <1754d0ef-6756-d88b-f728-17b1fe5d5b07@arm.com>
2019-05-21 12:56       ` Shakeel Butt
2019-05-22  4:23         ` Brian Geffon
