linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [RFC 0/7] introduce memory hinting API for external process
@ 2019-05-20  3:52 Minchan Kim
  2019-05-20  3:52 ` [RFC 1/7] mm: introduce MADV_COOL Minchan Kim
                   ` (13 more replies)
  0 siblings, 14 replies; 138+ messages in thread
From: Minchan Kim @ 2019-05-20  3:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, linux-mm, Michal Hocko, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, Minchan Kim

- Background

The Android terminology used for forking a new process and starting an app
from scratch is a cold start, while resuming an existing app is a hot start.
While we continually try to improve the performance of cold starts, hot
starts will always be significantly less power hungry as well as faster so
we are trying to make hot start more likely than cold start.

To increase hot start, Android userspace manages the order that apps should
be killed in a process called ActivityManagerService. ActivityManagerService
tracks every Android app or service that the user could be interacting with
at any time and translates that into a ranked list for lmkd(low memory
killer daemon). They are likely to be killed by lmkd if the system has to
reclaim memory. In that sense they are similar to entries in any other cache.
Those apps are kept alive for opportunistic performance improvements but
those performance improvements will vary based on the memory requirements of
individual workloads.

- Problem

Naturally, cached apps were dominant consumers of memory on the system.
However, they were not significant consumers of swap even though they are
good candidate for swap. Under investigation, swapping out only begins
once the low zone watermark is hit and kswapd wakes up, but the overall
allocation rate in the system might trip lmkd thresholds and cause a cached
process to be killed(we measured performance swapping out vs. zapping the
memory by killing a process. Unsurprisingly, zapping is 10x times faster
even though we use zram which is much faster than real storage) so kill
from lmkd will often satisfy the high zone watermark, resulting in very
few pages actually being moved to swap.

- Approach

The approach we chose was to use a new interface to allow userspace to
proactively reclaim entire processes by leveraging platform information.
This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
that are known to be cold from userspace and to avoid races with lmkd
by reclaiming apps as soon as they entered the cached state. Additionally,
it could provide many chances for platform to use much information to
optimize memory efficiency.

IMHO we should spell it out that this patchset complements MADV_WONTNEED
and MADV_FREE by adding non-destructive ways to gain some free memory
space. MADV_COLD is similar to MADV_WONTNEED in a way that it hints the
kernel that memory region is not currently needed and should be reclaimed
immediately; MADV_COOL is similar to MADV_FREE in a way that it hints the
kernel that memory region is not currently needed and should be reclaimed
when memory pressure rises.

To achieve the goal, the patchset introduce two new options for madvise.
One is MADV_COOL which will deactive activated pages and the other is
MADV_COLD which will reclaim private pages instantly. These new options
complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to
gain some free memory space. MADV_COLD is similar to MADV_DONTNEED in a way
that it hints the kernel that memory region is not currently needed and
should be reclaimed immediately; MADV_COOL is similar to MADV_FREE in a way
that it hints the kernel that memory region is not currently needed and
should be reclaimed when memory pressure rises.

This approach is similar in spirit to madvise(MADV_WONTNEED), but the
information required to make the reclaim decision is not known to the app.
Instead, it is known to a centralized userspace daemon, and that daemon
must be able to initiate reclaim on its own without any app involvement.
To solve the concern, this patch introduces new syscall -

	struct pr_madvise_param {
		int size;
		const struct iovec *vec;
	}

	int process_madvise(int pidfd, ssize_t nr_elem, int *behavior,
				struct pr_madvise_param *restuls,
				struct pr_madvise_param *ranges,
				unsigned long flags);

The syscall get pidfd to give hints to external process and provides
pair of result/ranges vector arguments so that it could give several
hints to each address range all at once.

I guess others have different ideas about the naming of syscall and options
so feel free to suggest better naming.

- Experiment

We did bunch of testing with several hundreds of real users, not artificial
benchmark on android. We saw about 17% cold start decreasement without any
significant battery/app startup latency issues. And with artificial benchmark
which launches and switching apps, we saw average 7% app launching improvement,
18% less lmkd kill and good stat from vmstat.

A is vanilla and B is process_madvise.


                                       A          B      delta   ratio(%)
               allocstall_dma          0          0          0       0.00
           allocstall_movable       1464        457      -1007     -69.00
            allocstall_normal     263210     190763     -72447     -28.00
             allocstall_total     264674     191220     -73454     -28.00
          compact_daemon_wake      26912      25294      -1618      -7.00
                 compact_fail      17885      14151      -3734     -21.00
         compact_free_scanned 4204766409 3835994922 -368771487      -9.00
             compact_isolated    3446484    2967618    -478866     -14.00
      compact_migrate_scanned 1621336411 1324695710 -296640701     -19.00
                compact_stall      19387      15343      -4044     -21.00
              compact_success       1502       1192       -310     -21.00
kswapd_high_wmark_hit_quickly        234        184        -50     -22.00
            kswapd_inodesteal     221635     233093      11458       5.00
 kswapd_low_wmark_hit_quickly      66065      54009     -12056     -19.00
                   nr_dirtied     259934     296476      36542      14.00
  nr_vmscan_immediate_reclaim       2587       2356       -231      -9.00
              nr_vmscan_write    1274232    2661733    1387501     108.00
                   nr_written    1514060    2937560    1423500      94.00
                   pageoutrun      67561      55133     -12428     -19.00
                   pgactivate    2335060    1984882    -350178     -15.00
                  pgalloc_dma   13743011   14096463     353452       2.00
              pgalloc_movable          0          0          0       0.00
               pgalloc_normal   18742440   16802065   -1940375     -11.00
                pgalloc_total   32485451   30898528   -1586923      -5.00
                 pgdeactivate    4262210    2930670   -1331540     -32.00
                      pgfault   30812334   31085065     272731       0.00
                       pgfree   33553970   31765164   -1788806      -6.00
                 pginodesteal      33411      15084     -18327     -55.00
                  pglazyfreed          0          0          0       0.00
                   pgmajfault     551312    1508299     956987     173.00
               pgmigrate_fail      43927      29330     -14597     -34.00
            pgmigrate_success    1399851    1203922    -195929     -14.00
                       pgpgin   24141776   19032156   -5109620     -22.00
                      pgpgout     959344    1103316     143972      15.00
                 pgpgoutclean    4639732    3765868    -873864     -19.00
                     pgrefill    4884560    3006938   -1877622     -39.00
                    pgrotated      37828      25897     -11931     -32.00
                pgscan_direct    1456037     957567    -498470     -35.00
       pgscan_direct_throttle          0          0          0       0.00
                pgscan_kswapd    6667767    5047360   -1620407     -25.00
                 pgscan_total    8123804    6004927   -2118877     -27.00
                   pgskip_dma          0          0          0       0.00
               pgskip_movable          0          0          0       0.00
                pgskip_normal      14907      25382      10475      70.00
                 pgskip_total      14907      25382      10475      70.00
               pgsteal_direct    1118986     690215    -428771     -39.00
               pgsteal_kswapd    4750223    3657107   -1093116     -24.00
                pgsteal_total    5869209    4347322   -1521887     -26.00
                       pswpin     417613    1392647     975034     233.00
                      pswpout    1274224    2661731    1387507     108.00
                slabs_scanned   13686905   10807200   -2879705     -22.00
          workingset_activate     668966     569444     -99522     -15.00
       workingset_nodereclaim      38957      32621      -6336     -17.00
           workingset_refault    2816795    2179782    -637013     -23.00
           workingset_restore     294320     168601    -125719     -43.00

pgmajfault is increased by 173% because swapin is increased by 200% by
process_madvise hint. However, swap read based on zram is much cheaper
than file IO in performance point of view and app hot start by swapin is
also cheaper than cold start from the beginning of app which needs many IO
from storage and initialization steps.

This patchset is against on next-20190517.

Minchan Kim (7):
  mm: introduce MADV_COOL
  mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM
  mm: introduce MADV_COLD
  mm: factor out madvise's core functionality
  mm: introduce external memory hinting API
  mm: extend process_madvise syscall to support vector arrary
  mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER

 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 include/linux/page-flags.h             |   1 +
 include/linux/page_idle.h              |  15 +
 include/linux/proc_fs.h                |   1 +
 include/linux/swap.h                   |   2 +
 include/linux/syscalls.h               |   2 +
 include/uapi/asm-generic/mman-common.h |  12 +
 include/uapi/asm-generic/unistd.h      |   2 +
 kernel/signal.c                        |   2 +-
 kernel/sys_ni.c                        |   1 +
 mm/madvise.c                           | 600 +++++++++++++++++++++----
 mm/swap.c                              |  43 ++
 mm/vmscan.c                            |  80 +++-
 14 files changed, 680 insertions(+), 83 deletions(-)

-- 
2.21.0.1020.gf2820cf01a-goog


^ permalink raw reply	[flat|nested] 138+ messages in thread

* [RFC 1/7] mm: introduce MADV_COOL
  2019-05-20  3:52 [RFC 0/7] introduce memory hinting API for external process Minchan Kim
@ 2019-05-20  3:52 ` Minchan Kim
  2019-05-20  8:16   ` Michal Hocko
                     ` (2 more replies)
  2019-05-20  3:52 ` [RFC 2/7] mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM Minchan Kim
                   ` (12 subsequent siblings)
  13 siblings, 3 replies; 138+ messages in thread
From: Minchan Kim @ 2019-05-20  3:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, linux-mm, Michal Hocko, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, Minchan Kim

When a process expects no accesses to a certain memory range
it could hint kernel that the pages can be reclaimed
when memory pressure happens but data should be preserved
for future use.  This could reduce workingset eviction so it
ends up increasing performance.

This patch introduces the new MADV_COOL hint to madvise(2)
syscall. MADV_COOL can be used by a process to mark a memory range
as not expected to be used in the near future. The hint can help
kernel in deciding which pages to evict early during memory
pressure.

Internally, it works via deactivating memory from active list to
inactive's head so when the memory pressure happens, they will be
reclaimed earlier than other active pages unless there is no
access until the time.

* v1r2
 * use clear_page_young in deactivate_page - joelaf

* v1r1
 * Revise the description - surenb
 * Renaming from MADV_WARM to MADV_COOL - surenb

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/page-flags.h             |   1 +
 include/linux/page_idle.h              |  15 ++++
 include/linux/swap.h                   |   1 +
 include/uapi/asm-generic/mman-common.h |   1 +
 mm/madvise.c                           | 112 +++++++++++++++++++++++++
 mm/swap.c                              |  43 ++++++++++
 6 files changed, 173 insertions(+)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 9f8712a4b1a5..58b06654c8dd 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -424,6 +424,7 @@ static inline bool set_hwpoison_free_buddy_page(struct page *page)
 TESTPAGEFLAG(Young, young, PF_ANY)
 SETPAGEFLAG(Young, young, PF_ANY)
 TESTCLEARFLAG(Young, young, PF_ANY)
+CLEARPAGEFLAG(Young, young, PF_ANY)
 PAGEFLAG(Idle, idle, PF_ANY)
 #endif
 
diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
index 1e894d34bdce..f3f43b317150 100644
--- a/include/linux/page_idle.h
+++ b/include/linux/page_idle.h
@@ -19,6 +19,11 @@ static inline void set_page_young(struct page *page)
 	SetPageYoung(page);
 }
 
+static inline void clear_page_young(struct page *page)
+{
+	ClearPageYoung(page);
+}
+
 static inline bool test_and_clear_page_young(struct page *page)
 {
 	return TestClearPageYoung(page);
@@ -65,6 +70,16 @@ static inline void set_page_young(struct page *page)
 	set_bit(PAGE_EXT_YOUNG, &page_ext->flags);
 }
 
+static void clear_page_young(struct page *page)
+{
+	struct page_ext *page_ext = lookup_page_ext(page);
+
+	if (unlikely(!page_ext))
+		return;
+
+	clear_bit(PAGE_EXT_YOUNG, &page_ext->flags);
+}
+
 static inline bool test_and_clear_page_young(struct page *page)
 {
 	struct page_ext *page_ext = lookup_page_ext(page);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4bfb5c4ac108..64795abea003 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -340,6 +340,7 @@ extern void lru_add_drain_cpu(int cpu);
 extern void lru_add_drain_all(void);
 extern void rotate_reclaimable_page(struct page *page);
 extern void deactivate_file_page(struct page *page);
+extern void deactivate_page(struct page *page);
 extern void mark_page_lazyfree(struct page *page);
 extern void swap_setup(void);
 
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index abd238d0f7a4..f7a4a5d4b642 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -42,6 +42,7 @@
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+#define MADV_COOL	5		/* deactivatie these pages */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_FREE	8		/* free pages only if memory pressure */
diff --git a/mm/madvise.c b/mm/madvise.c
index 628022e674a7..c05817fb570d 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -8,6 +8,7 @@
 
 #include <linux/mman.h>
 #include <linux/pagemap.h>
+#include <linux/page_idle.h>
 #include <linux/syscalls.h>
 #include <linux/mempolicy.h>
 #include <linux/page-isolation.h>
@@ -40,6 +41,7 @@ static int madvise_need_mmap_write(int behavior)
 	case MADV_REMOVE:
 	case MADV_WILLNEED:
 	case MADV_DONTNEED:
+	case MADV_COOL:
 	case MADV_FREE:
 		return 0;
 	default:
@@ -307,6 +309,113 @@ static long madvise_willneed(struct vm_area_struct *vma,
 	return 0;
 }
 
+static int madvise_cool_pte_range(pmd_t *pmd, unsigned long addr,
+				unsigned long end, struct mm_walk *walk)
+{
+	pte_t *orig_pte, *pte, ptent;
+	spinlock_t *ptl;
+	struct page *page;
+	struct vm_area_struct *vma = walk->vma;
+	unsigned long next;
+
+	next = pmd_addr_end(addr, end);
+	if (pmd_trans_huge(*pmd)) {
+		spinlock_t *ptl;
+
+		ptl = pmd_trans_huge_lock(pmd, vma);
+		if (!ptl)
+			return 0;
+
+		if (is_huge_zero_pmd(*pmd))
+			goto huge_unlock;
+
+		page = pmd_page(*pmd);
+		if (page_mapcount(page) > 1)
+			goto huge_unlock;
+
+		if (next - addr != HPAGE_PMD_SIZE) {
+			int err;
+
+			get_page(page);
+			spin_unlock(ptl);
+			lock_page(page);
+			err = split_huge_page(page);
+			unlock_page(page);
+			put_page(page);
+			if (!err)
+				goto regular_page;
+			return 0;
+		}
+
+		pmdp_test_and_clear_young(vma, addr, pmd);
+		deactivate_page(page);
+huge_unlock:
+		spin_unlock(ptl);
+		return 0;
+	}
+
+	if (pmd_trans_unstable(pmd))
+		return 0;
+
+regular_page:
+	orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {
+		ptent = *pte;
+
+		if (pte_none(ptent))
+			continue;
+
+		if (!pte_present(ptent))
+			continue;
+
+		page = vm_normal_page(vma, addr, ptent);
+		if (!page)
+			continue;
+
+		if (page_mapcount(page) > 1)
+			continue;
+
+		ptep_test_and_clear_young(vma, addr, pte);
+		deactivate_page(page);
+	}
+
+	pte_unmap_unlock(orig_pte, ptl);
+	cond_resched();
+
+	return 0;
+}
+
+static void madvise_cool_page_range(struct mmu_gather *tlb,
+			     struct vm_area_struct *vma,
+			     unsigned long addr, unsigned long end)
+{
+	struct mm_walk cool_walk = {
+		.pmd_entry = madvise_cool_pte_range,
+		.mm = vma->vm_mm,
+	};
+
+	tlb_start_vma(tlb, vma);
+	walk_page_range(addr, end, &cool_walk);
+	tlb_end_vma(tlb, vma);
+}
+
+static long madvise_cool(struct vm_area_struct *vma,
+			unsigned long start_addr, unsigned long end_addr)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct mmu_gather tlb;
+
+	if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
+		return -EINVAL;
+
+	lru_add_drain();
+	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
+	madvise_cool_page_range(&tlb, vma, start_addr, end_addr);
+	tlb_finish_mmu(&tlb, start_addr, end_addr);
+
+	return 0;
+}
+
 static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 				unsigned long end, struct mm_walk *walk)
 
@@ -695,6 +804,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		return madvise_remove(vma, prev, start, end);
 	case MADV_WILLNEED:
 		return madvise_willneed(vma, prev, start, end);
+	case MADV_COOL:
+		return madvise_cool(vma, start, end);
 	case MADV_FREE:
 	case MADV_DONTNEED:
 		return madvise_dontneed_free(vma, prev, start, end, behavior);
@@ -716,6 +827,7 @@ madvise_behavior_valid(int behavior)
 	case MADV_WILLNEED:
 	case MADV_DONTNEED:
 	case MADV_FREE:
+	case MADV_COOL:
 #ifdef CONFIG_KSM
 	case MADV_MERGEABLE:
 	case MADV_UNMERGEABLE:
diff --git a/mm/swap.c b/mm/swap.c
index 3a75722e68a9..0f94c3b5397d 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -46,6 +46,7 @@ int page_cluster;
 static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
 static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
 static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
+static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
 static DEFINE_PER_CPU(struct pagevec, lru_lazyfree_pvecs);
 #ifdef CONFIG_SMP
 static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);
@@ -537,6 +538,23 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
 	update_page_reclaim_stat(lruvec, file, 0);
 }
 
+static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
+			    void *arg)
+{
+	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+		int file = page_is_file_cache(page);
+		int lru = page_lru_base_type(page);
+
+		del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
+		ClearPageActive(page);
+		ClearPageReferenced(page);
+		clear_page_young(page);
+		add_page_to_lru_list(page, lruvec, lru);
+
+		__count_vm_events(PGDEACTIVATE, hpage_nr_pages(page));
+		update_page_reclaim_stat(lruvec, file, 0);
+	}
+}
 
 static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec,
 			    void *arg)
@@ -589,6 +607,10 @@ void lru_add_drain_cpu(int cpu)
 	if (pagevec_count(pvec))
 		pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
 
+	pvec = &per_cpu(lru_deactivate_pvecs, cpu);
+	if (pagevec_count(pvec))
+		pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+
 	pvec = &per_cpu(lru_lazyfree_pvecs, cpu);
 	if (pagevec_count(pvec))
 		pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
@@ -622,6 +644,26 @@ void deactivate_file_page(struct page *page)
 	}
 }
 
+/*
+ * deactivate_page - deactivate a page
+ * @page: page to deactivate
+ *
+ * deactivate_page() moves @page to the inactive list if @page was on the active
+ * list and was not an unevictable page.  This is done to accelerate the reclaim
+ * of @page.
+ */
+void deactivate_page(struct page *page)
+{
+	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+		struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
+
+		get_page(page);
+		if (!pagevec_add(pvec, page) || PageCompound(page))
+			pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+		put_cpu_var(lru_deactivate_pvecs);
+	}
+}
+
 /**
  * mark_page_lazyfree - make an anon page lazyfree
  * @page: page to deactivate
@@ -686,6 +728,7 @@ void lru_add_drain_all(void)
 		if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
 		    pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
 		    pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
+		    pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
 		    pagevec_count(&per_cpu(lru_lazyfree_pvecs, cpu)) ||
 		    need_activate_page_drain(cpu)) {
 			INIT_WORK(work, lru_add_drain_per_cpu);
-- 
2.21.0.1020.gf2820cf01a-goog


^ permalink raw reply related	[flat|nested] 138+ messages in thread

* [RFC 2/7] mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM
  2019-05-20  3:52 [RFC 0/7] introduce memory hinting API for external process Minchan Kim
  2019-05-20  3:52 ` [RFC 1/7] mm: introduce MADV_COOL Minchan Kim
@ 2019-05-20  3:52 ` Minchan Kim
  2019-05-20 16:50   ` Johannes Weiner
  2019-05-20  3:52 ` [RFC 3/7] mm: introduce MADV_COLD Minchan Kim
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 138+ messages in thread
From: Minchan Kim @ 2019-05-20  3:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, linux-mm, Michal Hocko, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, Minchan Kim

The local variable references in shrink_page_list is PAGEREF_RECLAIM_CLEAN
as default. It is for preventing to reclaim dirty pages when CMA try to
migrate pages. Strictly speaking, we don't need it because CMA didn't allow
to write out by .may_writepage = 0 in reclaim_clean_pages_from_list.

Moreover, it has a problem to prevent anonymous pages's swap out even
though force_reclaim = true in shrink_page_list on upcoming patch.
So this patch makes references's default value to PAGEREF_RECLAIM and
rename force_reclaim with skip_reference_check to make it more clear.

This is a preparatory work for next patch.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/vmscan.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index d9c3e873eca6..a28e5d17b495 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1102,7 +1102,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 				      struct scan_control *sc,
 				      enum ttu_flags ttu_flags,
 				      struct reclaim_stat *stat,
-				      bool force_reclaim)
+				      bool skip_reference_check)
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
@@ -1116,7 +1116,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		struct address_space *mapping;
 		struct page *page;
 		int may_enter_fs;
-		enum page_references references = PAGEREF_RECLAIM_CLEAN;
+		enum page_references references = PAGEREF_RECLAIM;
 		bool dirty, writeback;
 
 		cond_resched();
@@ -1248,7 +1248,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			}
 		}
 
-		if (!force_reclaim)
+		if (!skip_reference_check)
 			references = page_check_references(page, sc);
 
 		switch (references) {
-- 
2.21.0.1020.gf2820cf01a-goog


^ permalink raw reply related	[flat|nested] 138+ messages in thread

* [RFC 3/7] mm: introduce MADV_COLD
  2019-05-20  3:52 [RFC 0/7] introduce memory hinting API for external process Minchan Kim
  2019-05-20  3:52 ` [RFC 1/7] mm: introduce MADV_COOL Minchan Kim
  2019-05-20  3:52 ` [RFC 2/7] mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM Minchan Kim
@ 2019-05-20  3:52 ` Minchan Kim
  2019-05-20  8:27   ` Michal Hocko
                     ` (2 more replies)
  2019-05-20  3:52 ` [RFC 4/7] mm: factor out madvise's core functionality Minchan Kim
                   ` (10 subsequent siblings)
  13 siblings, 3 replies; 138+ messages in thread
From: Minchan Kim @ 2019-05-20  3:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, linux-mm, Michal Hocko, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, Minchan Kim

When a process expects no accesses to a certain memory range
for a long time, it could hint kernel that the pages can be
reclaimed instantly but data should be preserved for future use.
This could reduce workingset eviction so it ends up increasing
performance.

This patch introduces the new MADV_COLD hint to madvise(2)
syscall. MADV_COLD can be used by a process to mark a memory range
as not expected to be used for a long time. The hint can help
kernel in deciding which pages to evict proactively.

Internally, it works via reclaiming memory in process context
the syscall is called. If the page is dirty but backing storage
is not synchronous device, the written page will be rotate back
into LRU's tail once the write is done so they will reclaim easily
when memory pressure happens. If backing storage is
synchrnous device(e.g., zram), hte page will be reclaimed instantly.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/swap.h                   |   1 +
 include/uapi/asm-generic/mman-common.h |   1 +
 mm/madvise.c                           | 123 +++++++++++++++++++++++++
 mm/vmscan.c                            |  74 +++++++++++++++
 4 files changed, 199 insertions(+)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 64795abea003..7f32a948fc6a 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -365,6 +365,7 @@ extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
 extern unsigned long vm_total_pages;
 
+extern unsigned long reclaim_pages(struct list_head *page_list);
 #ifdef CONFIG_NUMA
 extern int node_reclaim_mode;
 extern int sysctl_min_unmapped_ratio;
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index f7a4a5d4b642..b9b51eeb8e1a 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -43,6 +43,7 @@
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
 #define MADV_COOL	5		/* deactivatie these pages */
+#define MADV_COLD	6		/* reclaim these pages */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_FREE	8		/* free pages only if memory pressure */
diff --git a/mm/madvise.c b/mm/madvise.c
index c05817fb570d..9a6698b56845 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -42,6 +42,7 @@ static int madvise_need_mmap_write(int behavior)
 	case MADV_WILLNEED:
 	case MADV_DONTNEED:
 	case MADV_COOL:
+	case MADV_COLD:
 	case MADV_FREE:
 		return 0;
 	default:
@@ -416,6 +417,125 @@ static long madvise_cool(struct vm_area_struct *vma,
 	return 0;
 }
 
+static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
+				unsigned long end, struct mm_walk *walk)
+{
+	pte_t *orig_pte, *pte, ptent;
+	spinlock_t *ptl;
+	LIST_HEAD(page_list);
+	struct page *page;
+	int isolated = 0;
+	struct vm_area_struct *vma = walk->vma;
+	unsigned long next;
+
+	next = pmd_addr_end(addr, end);
+	if (pmd_trans_huge(*pmd)) {
+		spinlock_t *ptl;
+
+		ptl = pmd_trans_huge_lock(pmd, vma);
+		if (!ptl)
+			return 0;
+
+		if (is_huge_zero_pmd(*pmd))
+			goto huge_unlock;
+
+		page = pmd_page(*pmd);
+		if (page_mapcount(page) > 1)
+			goto huge_unlock;
+
+		if (next - addr != HPAGE_PMD_SIZE) {
+			int err;
+
+			get_page(page);
+			spin_unlock(ptl);
+			lock_page(page);
+			err = split_huge_page(page);
+			unlock_page(page);
+			put_page(page);
+			if (!err)
+				goto regular_page;
+			return 0;
+		}
+
+		if (isolate_lru_page(page))
+			goto huge_unlock;
+
+		list_add(&page->lru, &page_list);
+huge_unlock:
+		spin_unlock(ptl);
+		reclaim_pages(&page_list);
+		return 0;
+	}
+
+	if (pmd_trans_unstable(pmd))
+		return 0;
+regular_page:
+	orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {
+		ptent = *pte;
+		if (!pte_present(ptent))
+			continue;
+
+		page = vm_normal_page(vma, addr, ptent);
+		if (!page)
+			continue;
+
+		if (page_mapcount(page) > 1)
+			continue;
+
+		if (isolate_lru_page(page))
+			continue;
+
+		isolated++;
+		list_add(&page->lru, &page_list);
+		if (isolated >= SWAP_CLUSTER_MAX) {
+			pte_unmap_unlock(orig_pte, ptl);
+			reclaim_pages(&page_list);
+			isolated = 0;
+			pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+			orig_pte = pte;
+		}
+	}
+
+	pte_unmap_unlock(orig_pte, ptl);
+	reclaim_pages(&page_list);
+	cond_resched();
+
+	return 0;
+}
+
+static void madvise_cold_page_range(struct mmu_gather *tlb,
+			     struct vm_area_struct *vma,
+			     unsigned long addr, unsigned long end)
+{
+	struct mm_walk warm_walk = {
+		.pmd_entry = madvise_cold_pte_range,
+		.mm = vma->vm_mm,
+	};
+
+	tlb_start_vma(tlb, vma);
+	walk_page_range(addr, end, &warm_walk);
+	tlb_end_vma(tlb, vma);
+}
+
+
+static long madvise_cold(struct vm_area_struct *vma,
+			unsigned long start_addr, unsigned long end_addr)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct mmu_gather tlb;
+
+	if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
+		return -EINVAL;
+
+	lru_add_drain();
+	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
+	madvise_cold_page_range(&tlb, vma, start_addr, end_addr);
+	tlb_finish_mmu(&tlb, start_addr, end_addr);
+
+	return 0;
+}
+
 static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 				unsigned long end, struct mm_walk *walk)
 
@@ -806,6 +926,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		return madvise_willneed(vma, prev, start, end);
 	case MADV_COOL:
 		return madvise_cool(vma, start, end);
+	case MADV_COLD:
+		return madvise_cold(vma, start, end);
 	case MADV_FREE:
 	case MADV_DONTNEED:
 		return madvise_dontneed_free(vma, prev, start, end, behavior);
@@ -828,6 +950,7 @@ madvise_behavior_valid(int behavior)
 	case MADV_DONTNEED:
 	case MADV_FREE:
 	case MADV_COOL:
+	case MADV_COLD:
 #ifdef CONFIG_KSM
 	case MADV_MERGEABLE:
 	case MADV_UNMERGEABLE:
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a28e5d17b495..1701b31f70a8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2096,6 +2096,80 @@ static void shrink_active_list(unsigned long nr_to_scan,
 			nr_deactivate, nr_rotated, sc->priority, file);
 }
 
+unsigned long reclaim_pages(struct list_head *page_list)
+{
+	int nid = -1;
+	unsigned long nr_isolated[2] = {0, };
+	unsigned long nr_reclaimed = 0;
+	LIST_HEAD(node_page_list);
+	struct reclaim_stat dummy_stat;
+	struct scan_control sc = {
+		.gfp_mask = GFP_KERNEL,
+		.priority = DEF_PRIORITY,
+		.may_writepage = 1,
+		.may_unmap = 1,
+		.may_swap = 1,
+	};
+
+	while (!list_empty(page_list)) {
+		struct page *page;
+
+		page = lru_to_page(page_list);
+		list_del(&page->lru);
+
+		if (nid == -1) {
+			nid = page_to_nid(page);
+			INIT_LIST_HEAD(&node_page_list);
+			nr_isolated[0] = nr_isolated[1] = 0;
+		}
+
+		if (nid == page_to_nid(page)) {
+			list_add(&page->lru, &node_page_list);
+			nr_isolated[!!page_is_file_cache(page)] +=
+						hpage_nr_pages(page);
+			continue;
+		}
+
+		nid = page_to_nid(page);
+
+		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
+					nr_isolated[0]);
+		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
+					nr_isolated[1]);
+		nr_reclaimed += shrink_page_list(&node_page_list,
+				NODE_DATA(nid), &sc, TTU_IGNORE_ACCESS,
+				&dummy_stat, true);
+		while (!list_empty(&node_page_list)) {
+			struct page *page = lru_to_page(page_list);
+
+			list_del(&page->lru);
+			putback_lru_page(page);
+		}
+		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
+					-nr_isolated[0]);
+		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
+					-nr_isolated[1]);
+		nr_isolated[0] = nr_isolated[1] = 0;
+		INIT_LIST_HEAD(&node_page_list);
+	}
+
+	if (!list_empty(&node_page_list)) {
+		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
+					nr_isolated[0]);
+		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
+					nr_isolated[1]);
+		nr_reclaimed += shrink_page_list(&node_page_list,
+				NODE_DATA(nid), &sc, TTU_IGNORE_ACCESS,
+				&dummy_stat, true);
+		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
+					-nr_isolated[0]);
+		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
+					-nr_isolated[1]);
+	}
+
+	return nr_reclaimed;
+}
+
 /*
  * The inactive anon list should be small enough that the VM never has
  * to do too much work.
-- 
2.21.0.1020.gf2820cf01a-goog


^ permalink raw reply related	[flat|nested] 138+ messages in thread

* [RFC 4/7] mm: factor out madvise's core functionality
  2019-05-20  3:52 [RFC 0/7] introduce memory hinting API for external process Minchan Kim
                   ` (2 preceding siblings ...)
  2019-05-20  3:52 ` [RFC 3/7] mm: introduce MADV_COLD Minchan Kim
@ 2019-05-20  3:52 ` Minchan Kim
  2019-05-20 14:26   ` Oleksandr Natalenko
  2019-05-20  3:52 ` [RFC 5/7] mm: introduce external memory hinting API Minchan Kim
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 138+ messages in thread
From: Minchan Kim @ 2019-05-20  3:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, linux-mm, Michal Hocko, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, Minchan Kim

This patch factor out madvise's core functionality so that upcoming
patch can reuse it without duplication.

It shouldn't change any behavior.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/madvise.c | 168 +++++++++++++++++++++++++++------------------------
 1 file changed, 89 insertions(+), 79 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 9a6698b56845..119e82e1f065 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -742,7 +742,8 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
 	return 0;
 }
 
-static long madvise_dontneed_free(struct vm_area_struct *vma,
+static long madvise_dontneed_free(struct task_struct *tsk,
+				  struct vm_area_struct *vma,
 				  struct vm_area_struct **prev,
 				  unsigned long start, unsigned long end,
 				  int behavior)
@@ -754,8 +755,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
 	if (!userfaultfd_remove(vma, start, end)) {
 		*prev = NULL; /* mmap_sem has been dropped, prev is stale */
 
-		down_read(&current->mm->mmap_sem);
-		vma = find_vma(current->mm, start);
+		down_read(&tsk->mm->mmap_sem);
+		vma = find_vma(tsk->mm, start);
 		if (!vma)
 			return -ENOMEM;
 		if (start < vma->vm_start) {
@@ -802,7 +803,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
  * Application wants to free up the pages and associated backing store.
  * This is effectively punching a hole into the middle of a file.
  */
-static long madvise_remove(struct vm_area_struct *vma,
+static long madvise_remove(struct task_struct *tsk,
+				struct vm_area_struct *vma,
 				struct vm_area_struct **prev,
 				unsigned long start, unsigned long end)
 {
@@ -836,13 +838,13 @@ static long madvise_remove(struct vm_area_struct *vma,
 	get_file(f);
 	if (userfaultfd_remove(vma, start, end)) {
 		/* mmap_sem was not released by userfaultfd_remove() */
-		up_read(&current->mm->mmap_sem);
+		up_read(&tsk->mm->mmap_sem);
 	}
 	error = vfs_fallocate(f,
 				FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
 				offset, end - start);
 	fput(f);
-	down_read(&current->mm->mmap_sem);
+	down_read(&tsk->mm->mmap_sem);
 	return error;
 }
 
@@ -916,12 +918,13 @@ static int madvise_inject_error(int behavior,
 #endif
 
 static long
-madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
-		unsigned long start, unsigned long end, int behavior)
+madvise_vma(struct task_struct *tsk, struct vm_area_struct *vma,
+		struct vm_area_struct **prev, unsigned long start,
+		unsigned long end, int behavior)
 {
 	switch (behavior) {
 	case MADV_REMOVE:
-		return madvise_remove(vma, prev, start, end);
+		return madvise_remove(tsk, vma, prev, start, end);
 	case MADV_WILLNEED:
 		return madvise_willneed(vma, prev, start, end);
 	case MADV_COOL:
@@ -930,7 +933,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		return madvise_cold(vma, start, end);
 	case MADV_FREE:
 	case MADV_DONTNEED:
-		return madvise_dontneed_free(vma, prev, start, end, behavior);
+		return madvise_dontneed_free(tsk, vma, prev, start,
+						end, behavior);
 	default:
 		return madvise_behavior(vma, prev, start, end, behavior);
 	}
@@ -974,68 +978,8 @@ madvise_behavior_valid(int behavior)
 	}
 }
 
-/*
- * The madvise(2) system call.
- *
- * Applications can use madvise() to advise the kernel how it should
- * handle paging I/O in this VM area.  The idea is to help the kernel
- * use appropriate read-ahead and caching techniques.  The information
- * provided is advisory only, and can be safely disregarded by the
- * kernel without affecting the correct operation of the application.
- *
- * behavior values:
- *  MADV_NORMAL - the default behavior is to read clusters.  This
- *		results in some read-ahead and read-behind.
- *  MADV_RANDOM - the system should read the minimum amount of data
- *		on any access, since it is unlikely that the appli-
- *		cation will need more than what it asks for.
- *  MADV_SEQUENTIAL - pages in the given range will probably be accessed
- *		once, so they can be aggressively read ahead, and
- *		can be freed soon after they are accessed.
- *  MADV_WILLNEED - the application is notifying the system to read
- *		some pages ahead.
- *  MADV_DONTNEED - the application is finished with the given range,
- *		so the kernel can free resources associated with it.
- *  MADV_FREE - the application marks pages in the given range as lazy free,
- *		where actual purges are postponed until memory pressure happens.
- *  MADV_REMOVE - the application wants to free up the given range of
- *		pages and associated backing store.
- *  MADV_DONTFORK - omit this area from child's address space when forking:
- *		typically, to avoid COWing pages pinned by get_user_pages().
- *  MADV_DOFORK - cancel MADV_DONTFORK: no longer omit this area when forking.
- *  MADV_WIPEONFORK - present the child process with zero-filled memory in this
- *              range after a fork.
- *  MADV_KEEPONFORK - undo the effect of MADV_WIPEONFORK
- *  MADV_HWPOISON - trigger memory error handler as if the given memory range
- *		were corrupted by unrecoverable hardware memory failure.
- *  MADV_SOFT_OFFLINE - try to soft-offline the given range of memory.
- *  MADV_MERGEABLE - the application recommends that KSM try to merge pages in
- *		this area with pages of identical content from other such areas.
- *  MADV_UNMERGEABLE- cancel MADV_MERGEABLE: no longer merge pages with others.
- *  MADV_HUGEPAGE - the application wants to back the given range by transparent
- *		huge pages in the future. Existing pages might be coalesced and
- *		new pages might be allocated as THP.
- *  MADV_NOHUGEPAGE - mark the given range as not worth being backed by
- *		transparent huge pages so the existing pages will not be
- *		coalesced into THP and new pages will not be allocated as THP.
- *  MADV_DONTDUMP - the application wants to prevent pages in the given range
- *		from being included in its core dump.
- *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
- *
- * return values:
- *  zero    - success
- *  -EINVAL - start + len < 0, start is not page-aligned,
- *		"behavior" is not a valid value, or application
- *		is attempting to release locked or shared pages,
- *		or the specified address range includes file, Huge TLB,
- *		MAP_SHARED or VMPFNMAP range.
- *  -ENOMEM - addresses in the specified range are not currently
- *		mapped, or are outside the AS of the process.
- *  -EIO    - an I/O error occurred while paging in data.
- *  -EBADF  - map exists, but area maps something that isn't a file.
- *  -EAGAIN - a kernel resource was temporarily unavailable.
- */
-SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
+static int madvise_core(struct task_struct *tsk, unsigned long start,
+			size_t len_in, int behavior)
 {
 	unsigned long end, tmp;
 	struct vm_area_struct *vma, *prev;
@@ -1071,10 +1015,10 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 
 	write = madvise_need_mmap_write(behavior);
 	if (write) {
-		if (down_write_killable(&current->mm->mmap_sem))
+		if (down_write_killable(&tsk->mm->mmap_sem))
 			return -EINTR;
 	} else {
-		down_read(&current->mm->mmap_sem);
+		down_read(&tsk->mm->mmap_sem);
 	}
 
 	/*
@@ -1082,7 +1026,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 	 * ranges, just ignore them, but return -ENOMEM at the end.
 	 * - different from the way of handling in mlock etc.
 	 */
-	vma = find_vma_prev(current->mm, start, &prev);
+	vma = find_vma_prev(tsk->mm, start, &prev);
 	if (vma && start > vma->vm_start)
 		prev = vma;
 
@@ -1107,7 +1051,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 			tmp = end;
 
 		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
-		error = madvise_vma(vma, &prev, start, tmp, behavior);
+		error = madvise_vma(tsk, vma, &prev, start, tmp, behavior);
 		if (error)
 			goto out;
 		start = tmp;
@@ -1119,14 +1063,80 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 		if (prev)
 			vma = prev->vm_next;
 		else	/* madvise_remove dropped mmap_sem */
-			vma = find_vma(current->mm, start);
+			vma = find_vma(tsk->mm, start);
 	}
 out:
 	blk_finish_plug(&plug);
 	if (write)
-		up_write(&current->mm->mmap_sem);
+		up_write(&tsk->mm->mmap_sem);
 	else
-		up_read(&current->mm->mmap_sem);
+		up_read(&tsk->mm->mmap_sem);
 
 	return error;
 }
+
+/*
+ * The madvise(2) system call.
+ *
+ * Applications can use madvise() to advise the kernel how it should
+ * handle paging I/O in this VM area.  The idea is to help the kernel
+ * use appropriate read-ahead and caching techniques.  The information
+ * provided is advisory only, and can be safely disregarded by the
+ * kernel without affecting the correct operation of the application.
+ *
+ * behavior values:
+ *  MADV_NORMAL - the default behavior is to read clusters.  This
+ *		results in some read-ahead and read-behind.
+ *  MADV_RANDOM - the system should read the minimum amount of data
+ *		on any access, since it is unlikely that the appli-
+ *		cation will need more than what it asks for.
+ *  MADV_SEQUENTIAL - pages in the given range will probably be accessed
+ *		once, so they can be aggressively read ahead, and
+ *		can be freed soon after they are accessed.
+ *  MADV_WILLNEED - the application is notifying the system to read
+ *		some pages ahead.
+ *  MADV_DONTNEED - the application is finished with the given range,
+ *		so the kernel can free resources associated with it.
+ *  MADV_FREE - the application marks pages in the given range as lazy free,
+ *		where actual purges are postponed until memory pressure happens.
+ *  MADV_REMOVE - the application wants to free up the given range of
+ *		pages and associated backing store.
+ *  MADV_DONTFORK - omit this area from child's address space when forking:
+ *		typically, to avoid COWing pages pinned by get_user_pages().
+ *  MADV_DOFORK - cancel MADV_DONTFORK: no longer omit this area when forking.
+ *  MADV_WIPEONFORK - present the child process with zero-filled memory in this
+ *              range after a fork.
+ *  MADV_KEEPONFORK - undo the effect of MADV_WIPEONFORK
+ *  MADV_HWPOISON - trigger memory error handler as if the given memory range
+ *		were corrupted by unrecoverable hardware memory failure.
+ *  MADV_SOFT_OFFLINE - try to soft-offline the given range of memory.
+ *  MADV_MERGEABLE - the application recommends that KSM try to merge pages in
+ *		this area with pages of identical content from other such areas.
+ *  MADV_UNMERGEABLE- cancel MADV_MERGEABLE: no longer merge pages with others.
+ *  MADV_HUGEPAGE - the application wants to back the given range by transparent
+ *		huge pages in the future. Existing pages might be coalesced and
+ *		new pages might be allocated as THP.
+ *  MADV_NOHUGEPAGE - mark the given range as not worth being backed by
+ *		transparent huge pages so the existing pages will not be
+ *		coalesced into THP and new pages will not be allocated as THP.
+ *  MADV_DONTDUMP - the application wants to prevent pages in the given range
+ *		from being included in its core dump.
+ *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
+ *
+ * return values:
+ *  zero    - success
+ *  -EINVAL - start + len < 0, start is not page-aligned,
+ *		"behavior" is not a valid value, or application
+ *		is attempting to release locked or shared pages,
+ *		or the specified address range includes file, Huge TLB,
+ *		MAP_SHARED or VMPFNMAP range.
+ *  -ENOMEM - addresses in the specified range are not currently
+ *		mapped, or are outside the AS of the process.
+ *  -EIO    - an I/O error occurred while paging in data.
+ *  -EBADF  - map exists, but area maps something that isn't a file.
+ *  -EAGAIN - a kernel resource was temporarily unavailable.
+ */
+SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
+{
+	return madvise_core(current, start, len_in, behavior);
+}
-- 
2.21.0.1020.gf2820cf01a-goog


^ permalink raw reply related	[flat|nested] 138+ messages in thread

* [RFC 5/7] mm: introduce external memory hinting API
  2019-05-20  3:52 [RFC 0/7] introduce memory hinting API for external process Minchan Kim
                   ` (3 preceding siblings ...)
  2019-05-20  3:52 ` [RFC 4/7] mm: factor out madvise's core functionality Minchan Kim
@ 2019-05-20  3:52 ` Minchan Kim
  2019-05-20  9:18   ` Michal Hocko
                     ` (4 more replies)
  2019-05-20  3:52 ` [RFC 6/7] mm: extend process_madvise syscall to support vector arrary Minchan Kim
                   ` (8 subsequent siblings)
  13 siblings, 5 replies; 138+ messages in thread
From: Minchan Kim @ 2019-05-20  3:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, linux-mm, Michal Hocko, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, Minchan Kim

There is some usecase that centralized userspace daemon want to give
a memory hint like MADV_[COOL|COLD] to other process. Android's
ActivityManagerService is one of them.

It's similar in spirit to madvise(MADV_WONTNEED), but the information
required to make the reclaim decision is not known to the app. Instead,
it is known to the centralized userspace daemon(ActivityManagerService),
and that daemon must be able to initiate reclaim on its own without
any app involvement.

To solve the issue, this patch introduces new syscall process_madvise(2)
which works based on pidfd so it could give a hint to the exeternal
process.

int process_madvise(int pidfd, void *addr, size_t length, int advise);

All advises madvise provides can be supported in process_madvise, too.
Since it could affect other process's address range, only privileged
process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
gives it the right to ptrrace the process could use it successfully.

Please suggest better idea if you have other idea about the permission.

* from v1r1
  * use ptrace capability - surenb, dancol

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 arch/x86/entry/syscalls/syscall_32.tbl |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl |  1 +
 include/linux/proc_fs.h                |  1 +
 include/linux/syscalls.h               |  2 ++
 include/uapi/asm-generic/unistd.h      |  2 ++
 kernel/signal.c                        |  2 +-
 kernel/sys_ni.c                        |  1 +
 mm/madvise.c                           | 45 ++++++++++++++++++++++++++
 8 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 4cd5f982b1e5..5b9dd55d6b57 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -438,3 +438,4 @@
 425	i386	io_uring_setup		sys_io_uring_setup		__ia32_sys_io_uring_setup
 426	i386	io_uring_enter		sys_io_uring_enter		__ia32_sys_io_uring_enter
 427	i386	io_uring_register	sys_io_uring_register		__ia32_sys_io_uring_register
+428	i386	process_madvise		sys_process_madvise		__ia32_sys_process_madvise
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 64ca0d06259a..0e5ee78161c9 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -355,6 +355,7 @@
 425	common	io_uring_setup		__x64_sys_io_uring_setup
 426	common	io_uring_enter		__x64_sys_io_uring_enter
 427	common	io_uring_register	__x64_sys_io_uring_register
+428	common	process_madvise		__x64_sys_process_madvise
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 52a283ba0465..f8545d7c5218 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -122,6 +122,7 @@ static inline struct pid *tgid_pidfd_to_pid(const struct file *file)
 
 #endif /* CONFIG_PROC_FS */
 
+extern struct pid *pidfd_to_pid(const struct file *file);
 struct net;
 
 static inline struct proc_dir_entry *proc_net_mkdir(
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e2870fe1be5b..21c6c9a62006 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -872,6 +872,8 @@ asmlinkage long sys_munlockall(void);
 asmlinkage long sys_mincore(unsigned long start, size_t len,
 				unsigned char __user * vec);
 asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
+asmlinkage long sys_process_madvise(int pid_fd, unsigned long start,
+				size_t len, int behavior);
 asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
 			unsigned long prot, unsigned long pgoff,
 			unsigned long flags);
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index dee7292e1df6..7ee82ce04620 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -832,6 +832,8 @@ __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
 __SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
 #define __NR_io_uring_register 427
 __SYSCALL(__NR_io_uring_register, sys_io_uring_register)
+#define __NR_process_madvise 428
+__SYSCALL(__NR_process_madvise, sys_process_madvise)
 
 #undef __NR_syscalls
 #define __NR_syscalls 428
diff --git a/kernel/signal.c b/kernel/signal.c
index 1c86b78a7597..04e75daab1f8 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -3620,7 +3620,7 @@ static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo, siginfo_t *info)
 	return copy_siginfo_from_user(kinfo, info);
 }
 
-static struct pid *pidfd_to_pid(const struct file *file)
+struct pid *pidfd_to_pid(const struct file *file)
 {
 	if (file->f_op == &pidfd_fops)
 		return file->private_data;
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 4d9ae5ea6caf..5277421795ab 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -278,6 +278,7 @@ COND_SYSCALL(mlockall);
 COND_SYSCALL(munlockall);
 COND_SYSCALL(mincore);
 COND_SYSCALL(madvise);
+COND_SYSCALL(process_madvise);
 COND_SYSCALL(remap_file_pages);
 COND_SYSCALL(mbind);
 COND_SYSCALL_COMPAT(mbind);
diff --git a/mm/madvise.c b/mm/madvise.c
index 119e82e1f065..af02aa17e5c1 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -9,6 +9,7 @@
 #include <linux/mman.h>
 #include <linux/pagemap.h>
 #include <linux/page_idle.h>
+#include <linux/proc_fs.h>
 #include <linux/syscalls.h>
 #include <linux/mempolicy.h>
 #include <linux/page-isolation.h>
@@ -16,6 +17,7 @@
 #include <linux/hugetlb.h>
 #include <linux/falloc.h>
 #include <linux/sched.h>
+#include <linux/sched/mm.h>
 #include <linux/ksm.h>
 #include <linux/fs.h>
 #include <linux/file.h>
@@ -1140,3 +1142,46 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 {
 	return madvise_core(current, start, len_in, behavior);
 }
+
+SYSCALL_DEFINE4(process_madvise, int, pidfd, unsigned long, start,
+		size_t, len_in, int, behavior)
+{
+	int ret;
+	struct fd f;
+	struct pid *pid;
+	struct task_struct *tsk;
+	struct mm_struct *mm;
+
+	f = fdget(pidfd);
+	if (!f.file)
+		return -EBADF;
+
+	pid = pidfd_to_pid(f.file);
+	if (IS_ERR(pid)) {
+		ret = PTR_ERR(pid);
+		goto err;
+	}
+
+	ret = -EINVAL;
+	rcu_read_lock();
+	tsk = pid_task(pid, PIDTYPE_PID);
+	if (!tsk) {
+		rcu_read_unlock();
+		goto err;
+	}
+	get_task_struct(tsk);
+	rcu_read_unlock();
+	mm = mm_access(tsk, PTRACE_MODE_ATTACH_REALCREDS);
+	if (!mm || IS_ERR(mm)) {
+		ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
+		if (ret == -EACCES)
+			ret = -EPERM;
+		goto err;
+	}
+	ret = madvise_core(tsk, start, len_in, behavior);
+	mmput(mm);
+	put_task_struct(tsk);
+err:
+	fdput(f);
+	return ret;
+}
-- 
2.21.0.1020.gf2820cf01a-goog


^ permalink raw reply related	[flat|nested] 138+ messages in thread

* [RFC 6/7] mm: extend process_madvise syscall to support vector arrary
  2019-05-20  3:52 [RFC 0/7] introduce memory hinting API for external process Minchan Kim
                   ` (4 preceding siblings ...)
  2019-05-20  3:52 ` [RFC 5/7] mm: introduce external memory hinting API Minchan Kim
@ 2019-05-20  3:52 ` Minchan Kim
  2019-05-20  9:22   ` Michal Hocko
                     ` (2 more replies)
  2019-05-20  3:52 ` [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER Minchan Kim
                   ` (7 subsequent siblings)
  13 siblings, 3 replies; 138+ messages in thread
From: Minchan Kim @ 2019-05-20  3:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, linux-mm, Michal Hocko, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, Minchan Kim

Currently, process_madvise syscall works for only one address range
so user should call the syscall several times to give hints to
multiple address range.

This patch extends process_madvise syscall to support multiple
hints, address ranges and return vaules so user could give hints
all at once.

struct pr_madvise_param {
    int size;                       /* the size of this structure */
    const struct iovec __user *vec; /* address range array */
}

int process_madvise(int pidfd, ssize_t nr_elem,
		    int *behavior,
		    struct pr_madvise_param *results,
		    struct pr_madvise_param *ranges,
		    unsigned long flags);

- pidfd

target process fd

- nr_elem

the number of elemenent of array behavior, results, ranges

- behavior

hints for each address range in remote process so that user could
give different hints for each range.

- results

array of buffers to get results for associated remote address range
action.

- ranges

array to buffers to have remote process's address ranges to be
processed

- flags

extra argument for the future. It should be zero this moment.

Example)

struct pr_madvise_param {
        int size;
        const struct iovec *vec;
};

int main(int argc, char *argv[])
{
        struct pr_madvise_param retp, rangep;
        struct iovec result_vec[2], range_vec[2];
        int hints[2];
        long ret[2];
        void *addr[2];

        pid_t pid;
        char cmd[64] = {0,};
        addr[0] = mmap(NULL, ALLOC_SIZE, PROT_READ|PROT_WRITE,
                          MAP_POPULATE|MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);

        if (MAP_FAILED == addr[0])
                return 1;

        addr[1] = mmap(NULL, ALLOC_SIZE, PROT_READ|PROT_WRITE,
                          MAP_POPULATE|MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);

        if (MAP_FAILED == addr[1])
                return 1;

        hints[0] = MADV_COLD;
	range_vec[0].iov_base = addr[0];
        range_vec[0].iov_len = ALLOC_SIZE;
        result_vec[0].iov_base = &ret[0];
        result_vec[0].iov_len = sizeof(long);
	retp.vec = result_vec;
        retp.size = sizeof(struct pr_madvise_param);

        hints[1] = MADV_COOL;
        range_vec[1].iov_base = addr[1];
        range_vec[1].iov_len = ALLOC_SIZE;
        result_vec[1].iov_base = &ret[1];
        result_vec[1].iov_len = sizeof(long);
        rangep.vec = range_vec;
        rangep.size = sizeof(struct pr_madvise_param);

        pid = fork();
        if (!pid) {
                sleep(10);
        } else {
                int pidfd = open(cmd,  O_DIRECTORY | O_CLOEXEC);
                if (pidfd < 0)
                        return 1;

                /* munmap to make pages private for the child */
                munmap(addr[0], ALLOC_SIZE);
                munmap(addr[1], ALLOC_SIZE);
                system("cat /proc/vmstat | egrep 'pswpout|deactivate'");
                if (syscall(__NR_process_madvise, pidfd, 2, behaviors,
						&retp, &rangep, 0))
                        perror("process_madvise fail\n");
                system("cat /proc/vmstat | egrep 'pswpout|deactivate'");
        }

        return 0;
}

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/uapi/asm-generic/mman-common.h |   5 +
 mm/madvise.c                           | 184 +++++++++++++++++++++----
 2 files changed, 166 insertions(+), 23 deletions(-)

diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index b9b51eeb8e1a..b8e230de84a6 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -74,4 +74,9 @@
 #define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
 				 PKEY_DISABLE_WRITE)
 
+struct pr_madvise_param {
+	int size;			/* the size of this structure */
+	const struct iovec __user *vec;	/* address range array */
+};
+
 #endif /* __ASM_GENERIC_MMAN_COMMON_H */
diff --git a/mm/madvise.c b/mm/madvise.c
index af02aa17e5c1..f4f569dac2bd 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -320,6 +320,7 @@ static int madvise_cool_pte_range(pmd_t *pmd, unsigned long addr,
 	struct page *page;
 	struct vm_area_struct *vma = walk->vma;
 	unsigned long next;
+	long nr_pages = 0;
 
 	next = pmd_addr_end(addr, end);
 	if (pmd_trans_huge(*pmd)) {
@@ -380,9 +381,12 @@ static int madvise_cool_pte_range(pmd_t *pmd, unsigned long addr,
 
 		ptep_test_and_clear_young(vma, addr, pte);
 		deactivate_page(page);
+		nr_pages++;
+
 	}
 
 	pte_unmap_unlock(orig_pte, ptl);
+	*(long *)walk->private += nr_pages;
 	cond_resched();
 
 	return 0;
@@ -390,11 +394,13 @@ static int madvise_cool_pte_range(pmd_t *pmd, unsigned long addr,
 
 static void madvise_cool_page_range(struct mmu_gather *tlb,
 			     struct vm_area_struct *vma,
-			     unsigned long addr, unsigned long end)
+			     unsigned long addr, unsigned long end,
+			     long *nr_pages)
 {
 	struct mm_walk cool_walk = {
 		.pmd_entry = madvise_cool_pte_range,
 		.mm = vma->vm_mm,
+		.private = nr_pages
 	};
 
 	tlb_start_vma(tlb, vma);
@@ -403,7 +409,8 @@ static void madvise_cool_page_range(struct mmu_gather *tlb,
 }
 
 static long madvise_cool(struct vm_area_struct *vma,
-			unsigned long start_addr, unsigned long end_addr)
+			unsigned long start_addr, unsigned long end_addr,
+			long *nr_pages)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct mmu_gather tlb;
@@ -413,7 +420,7 @@ static long madvise_cool(struct vm_area_struct *vma,
 
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
-	madvise_cool_page_range(&tlb, vma, start_addr, end_addr);
+	madvise_cool_page_range(&tlb, vma, start_addr, end_addr, nr_pages);
 	tlb_finish_mmu(&tlb, start_addr, end_addr);
 
 	return 0;
@@ -429,6 +436,7 @@ static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
 	int isolated = 0;
 	struct vm_area_struct *vma = walk->vma;
 	unsigned long next;
+	long nr_pages = 0;
 
 	next = pmd_addr_end(addr, end);
 	if (pmd_trans_huge(*pmd)) {
@@ -492,7 +500,7 @@ static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
 		list_add(&page->lru, &page_list);
 		if (isolated >= SWAP_CLUSTER_MAX) {
 			pte_unmap_unlock(orig_pte, ptl);
-			reclaim_pages(&page_list);
+			nr_pages += reclaim_pages(&page_list);
 			isolated = 0;
 			pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 			orig_pte = pte;
@@ -500,19 +508,22 @@ static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
 	}
 
 	pte_unmap_unlock(orig_pte, ptl);
-	reclaim_pages(&page_list);
+	nr_pages += reclaim_pages(&page_list);
 	cond_resched();
 
+	*(long *)walk->private += nr_pages;
 	return 0;
 }
 
 static void madvise_cold_page_range(struct mmu_gather *tlb,
 			     struct vm_area_struct *vma,
-			     unsigned long addr, unsigned long end)
+			     unsigned long addr, unsigned long end,
+			     long *nr_pages)
 {
 	struct mm_walk warm_walk = {
 		.pmd_entry = madvise_cold_pte_range,
 		.mm = vma->vm_mm,
+		.private = nr_pages,
 	};
 
 	tlb_start_vma(tlb, vma);
@@ -522,7 +533,8 @@ static void madvise_cold_page_range(struct mmu_gather *tlb,
 
 
 static long madvise_cold(struct vm_area_struct *vma,
-			unsigned long start_addr, unsigned long end_addr)
+			unsigned long start_addr, unsigned long end_addr,
+			long *nr_pages)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct mmu_gather tlb;
@@ -532,7 +544,7 @@ static long madvise_cold(struct vm_area_struct *vma,
 
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
-	madvise_cold_page_range(&tlb, vma, start_addr, end_addr);
+	madvise_cold_page_range(&tlb, vma, start_addr, end_addr, nr_pages);
 	tlb_finish_mmu(&tlb, start_addr, end_addr);
 
 	return 0;
@@ -922,7 +934,7 @@ static int madvise_inject_error(int behavior,
 static long
 madvise_vma(struct task_struct *tsk, struct vm_area_struct *vma,
 		struct vm_area_struct **prev, unsigned long start,
-		unsigned long end, int behavior)
+		unsigned long end, int behavior, long *nr_pages)
 {
 	switch (behavior) {
 	case MADV_REMOVE:
@@ -930,9 +942,9 @@ madvise_vma(struct task_struct *tsk, struct vm_area_struct *vma,
 	case MADV_WILLNEED:
 		return madvise_willneed(vma, prev, start, end);
 	case MADV_COOL:
-		return madvise_cool(vma, start, end);
+		return madvise_cool(vma, start, end, nr_pages);
 	case MADV_COLD:
-		return madvise_cold(vma, start, end);
+		return madvise_cold(vma, start, end, nr_pages);
 	case MADV_FREE:
 	case MADV_DONTNEED:
 		return madvise_dontneed_free(tsk, vma, prev, start,
@@ -981,7 +993,7 @@ madvise_behavior_valid(int behavior)
 }
 
 static int madvise_core(struct task_struct *tsk, unsigned long start,
-			size_t len_in, int behavior)
+			size_t len_in, int behavior, long *nr_pages)
 {
 	unsigned long end, tmp;
 	struct vm_area_struct *vma, *prev;
@@ -996,6 +1008,7 @@ static int madvise_core(struct task_struct *tsk, unsigned long start,
 
 	if (start & ~PAGE_MASK)
 		return error;
+
 	len = (len_in + ~PAGE_MASK) & PAGE_MASK;
 
 	/* Check to see whether len was rounded up from small -ve to zero */
@@ -1035,6 +1048,8 @@ static int madvise_core(struct task_struct *tsk, unsigned long start,
 	blk_start_plug(&plug);
 	for (;;) {
 		/* Still start < end. */
+		long pages = 0;
+
 		error = -ENOMEM;
 		if (!vma)
 			goto out;
@@ -1053,9 +1068,11 @@ static int madvise_core(struct task_struct *tsk, unsigned long start,
 			tmp = end;
 
 		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
-		error = madvise_vma(tsk, vma, &prev, start, tmp, behavior);
+		error = madvise_vma(tsk, vma, &prev, start, tmp,
+					behavior, &pages);
 		if (error)
 			goto out;
+		*nr_pages += pages;
 		start = tmp;
 		if (prev && start < prev->vm_end)
 			start = prev->vm_end;
@@ -1140,26 +1157,137 @@ static int madvise_core(struct task_struct *tsk, unsigned long start,
  */
 SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 {
-	return madvise_core(current, start, len_in, behavior);
+	unsigned long dummy;
+
+	return madvise_core(current, start, len_in, behavior, &dummy);
 }
 
-SYSCALL_DEFINE4(process_madvise, int, pidfd, unsigned long, start,
-		size_t, len_in, int, behavior)
+static int pr_madvise_copy_param(struct pr_madvise_param __user *u_param,
+		struct pr_madvise_param *param)
+{
+	u32 size;
+	int ret;
+
+	memset(param, 0, sizeof(*param));
+
+	ret = get_user(size, &u_param->size);
+	if (ret)
+		return ret;
+
+	if (size > PAGE_SIZE)
+		return -E2BIG;
+
+	if (!size || size > sizeof(struct pr_madvise_param))
+		return -EINVAL;
+
+	ret = copy_from_user(param, u_param, size);
+	if (ret)
+		return -EFAULT;
+
+	return ret;
+}
+
+static int process_madvise_core(struct task_struct *tsk, int *behaviors,
+				struct iov_iter *iter,
+				const struct iovec *range_vec,
+				unsigned long riovcnt,
+				unsigned long flags)
+{
+	int i;
+	long err;
+
+	for (err = 0, i = 0; i < riovcnt && iov_iter_count(iter); i++) {
+		long ret = 0;
+
+		err = madvise_core(tsk, (unsigned long)range_vec[i].iov_base,
+				range_vec[i].iov_len, behaviors[i],
+				&ret);
+		if (err)
+			ret = err;
+
+		if (copy_to_iter(&ret, sizeof(long), iter) !=
+				sizeof(long)) {
+			err = -EFAULT;
+			break;
+		}
+
+		err = 0;
+	}
+
+	return err;
+}
+
+SYSCALL_DEFINE6(process_madvise, int, pidfd, ssize_t, nr_elem,
+			const int __user *, hints,
+			struct pr_madvise_param __user *, results,
+			struct pr_madvise_param __user *, ranges,
+			unsigned long, flags)
 {
 	int ret;
 	struct fd f;
 	struct pid *pid;
 	struct task_struct *tsk;
 	struct mm_struct *mm;
+	struct pr_madvise_param result_p, range_p;
+	const struct iovec __user *result_vec, __user *range_vec;
+	int *behaviors;
+	struct iovec iovstack_result[UIO_FASTIOV];
+	struct iovec iovstack_r[UIO_FASTIOV];
+	struct iovec *iov_l = iovstack_result;
+	struct iovec *iov_r = iovstack_r;
+	struct iov_iter iter;
+
+	if (flags != 0)
+		return -EINVAL;
+
+	ret = pr_madvise_copy_param(results, &result_p);
+	if (ret)
+		return ret;
+
+	ret = pr_madvise_copy_param(ranges, &range_p);
+	if (ret)
+		return ret;
+
+	result_vec = result_p.vec;
+	range_vec = range_p.vec;
+
+	if (result_p.size != sizeof(struct pr_madvise_param) ||
+			range_p.size != sizeof(struct pr_madvise_param))
+		return -EINVAL;
+
+	behaviors = kmalloc_array(nr_elem, sizeof(int), GFP_KERNEL);
+	if (!behaviors)
+		return -ENOMEM;
+
+	ret = copy_from_user(behaviors, hints, sizeof(int) * nr_elem);
+	if (ret < 0)
+		goto free_behavior_vec;
+
+	ret = import_iovec(READ, result_vec, nr_elem, UIO_FASTIOV,
+				&iov_l, &iter);
+	if (ret < 0)
+		goto free_behavior_vec;
+
+	if (!iov_iter_count(&iter)) {
+		ret = -EINVAL;
+		goto free_iovecs;
+	}
+
+	ret = rw_copy_check_uvector(CHECK_IOVEC_ONLY, range_vec, nr_elem,
+				UIO_FASTIOV, iovstack_r, &iov_r);
+	if (ret <= 0)
+		goto free_iovecs;
 
 	f = fdget(pidfd);
-	if (!f.file)
-		return -EBADF;
+	if (!f.file) {
+		ret = -EBADF;
+		goto free_iovecs;
+	}
 
 	pid = pidfd_to_pid(f.file);
 	if (IS_ERR(pid)) {
 		ret = PTR_ERR(pid);
-		goto err;
+		goto put_fd;
 	}
 
 	ret = -EINVAL;
@@ -1167,7 +1295,7 @@ SYSCALL_DEFINE4(process_madvise, int, pidfd, unsigned long, start,
 	tsk = pid_task(pid, PIDTYPE_PID);
 	if (!tsk) {
 		rcu_read_unlock();
-		goto err;
+		goto put_fd;
 	}
 	get_task_struct(tsk);
 	rcu_read_unlock();
@@ -1176,12 +1304,22 @@ SYSCALL_DEFINE4(process_madvise, int, pidfd, unsigned long, start,
 		ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
 		if (ret == -EACCES)
 			ret = -EPERM;
-		goto err;
+		goto put_task;
 	}
-	ret = madvise_core(tsk, start, len_in, behavior);
+
+	ret = process_madvise_core(tsk, behaviors, &iter, iov_r,
+					nr_elem, flags);
 	mmput(mm);
+put_task:
 	put_task_struct(tsk);
-err:
+put_fd:
 	fdput(f);
+free_iovecs:
+	if (iov_r != iovstack_r)
+		kfree(iov_r);
+	kfree(iov_l);
+free_behavior_vec:
+	kfree(behaviors);
+
 	return ret;
 }
-- 
2.21.0.1020.gf2820cf01a-goog


^ permalink raw reply related	[flat|nested] 138+ messages in thread

* [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-20  3:52 [RFC 0/7] introduce memory hinting API for external process Minchan Kim
                   ` (5 preceding siblings ...)
  2019-05-20  3:52 ` [RFC 6/7] mm: extend process_madvise syscall to support vector arrary Minchan Kim
@ 2019-05-20  3:52 ` Minchan Kim
  2019-05-20  9:28   ` Michal Hocko
                     ` (2 more replies)
  2019-05-20  6:37 ` [RFC 0/7] introduce memory hinting API for external process Anshuman Khandual
                   ` (6 subsequent siblings)
  13 siblings, 3 replies; 138+ messages in thread
From: Minchan Kim @ 2019-05-20  3:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, linux-mm, Michal Hocko, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, Minchan Kim

System could have much faster swap device like zRAM. In that case, swapping
is extremely cheaper than file-IO on the low-end storage.
In this configuration, userspace could handle different strategy for each
kinds of vma. IOW, they want to reclaim anonymous pages by MADV_COLD
while it keeps file-backed pages in inactive LRU by MADV_COOL because
file IO is more expensive in this case so want to keep them in memory
until memory pressure happens.

To support such strategy easier, this patch introduces
MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER options in madvise(2) like
that /proc/<pid>/clear_refs already has supported same filters.
They are filters could be Ored with other existing hints using top two bits
of (int behavior).

Once either of them is set, the hint could affect only the interested vma
either anonymous or file-backed.

With that, user could call a process_madvise syscall simply with a entire
range(0x0 - 0xFFFFFFFFFFFFFFFF) but either of MADV_ANONYMOUS_FILTER and
MADV_FILE_FILTER so there is no need to call the syscall range by range.

* from v1r2
  * use consistent check with clear_refs to identify anon/file vma - surenb

* from v1r1
  * use naming "filter" for new madvise option - dancol

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/uapi/asm-generic/mman-common.h |  5 +++++
 mm/madvise.c                           | 14 ++++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index b8e230de84a6..be59a1b90284 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -66,6 +66,11 @@
 #define MADV_WIPEONFORK 18		/* Zero memory on fork, child only */
 #define MADV_KEEPONFORK 19		/* Undo MADV_WIPEONFORK */
 
+#define MADV_BEHAVIOR_MASK (~(MADV_ANONYMOUS_FILTER|MADV_FILE_FILTER))
+
+#define MADV_ANONYMOUS_FILTER	(1<<31)	/* works for only anonymous vma */
+#define MADV_FILE_FILTER	(1<<30)	/* works for only file-backed vma */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/mm/madvise.c b/mm/madvise.c
index f4f569dac2bd..116131243540 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1002,7 +1002,15 @@ static int madvise_core(struct task_struct *tsk, unsigned long start,
 	int write;
 	size_t len;
 	struct blk_plug plug;
+	bool anon_only, file_only;
 
+	anon_only = behavior & MADV_ANONYMOUS_FILTER;
+	file_only = behavior & MADV_FILE_FILTER;
+
+	if (anon_only && file_only)
+		return error;
+
+	behavior = behavior & MADV_BEHAVIOR_MASK;
 	if (!madvise_behavior_valid(behavior))
 		return error;
 
@@ -1067,12 +1075,18 @@ static int madvise_core(struct task_struct *tsk, unsigned long start,
 		if (end < tmp)
 			tmp = end;
 
+		if (anon_only && vma->vm_file)
+			goto next;
+		if (file_only && !vma->vm_file)
+			goto next;
+
 		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
 		error = madvise_vma(tsk, vma, &prev, start, tmp,
 					behavior, &pages);
 		if (error)
 			goto out;
 		*nr_pages += pages;
+next:
 		start = tmp;
 		if (prev && start < prev->vm_end)
 			start = prev->vm_end;
-- 
2.21.0.1020.gf2820cf01a-goog


^ permalink raw reply related	[flat|nested] 138+ messages in thread

* Re: [RFC 0/7] introduce memory hinting API for external process
  2019-05-20  3:52 [RFC 0/7] introduce memory hinting API for external process Minchan Kim
                   ` (6 preceding siblings ...)
  2019-05-20  3:52 ` [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER Minchan Kim
@ 2019-05-20  6:37 ` Anshuman Khandual
  2019-05-20 16:59   ` Tim Murray
  2019-05-20  9:28 ` Michal Hocko
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 138+ messages in thread
From: Anshuman Khandual @ 2019-05-20  6:37 UTC (permalink / raw)
  To: Minchan Kim, Andrew Morton
  Cc: LKML, linux-mm, Michal Hocko, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon



On 05/20/2019 09:22 AM, Minchan Kim wrote:
> - Problem
> 
> Naturally, cached apps were dominant consumers of memory on the system.
> However, they were not significant consumers of swap even though they are
> good candidate for swap. Under investigation, swapping out only begins
> once the low zone watermark is hit and kswapd wakes up, but the overall
> allocation rate in the system might trip lmkd thresholds and cause a cached
> process to be killed(we measured performance swapping out vs. zapping the
> memory by killing a process. Unsurprisingly, zapping is 10x times faster
> even though we use zram which is much faster than real storage) so kill
> from lmkd will often satisfy the high zone watermark, resulting in very
> few pages actually being moved to swap.

Getting killed by lmkd which is triggered by custom system memory allocation
parameters and hence not being able to swap out is a problem ? But is not the
problem created by lmkd itself.

Or Is the objective here is reduce the number of processes which get killed by
lmkd by triggering swapping for the unused memory (user hinted) sooner so that
they dont get picked by lmkd. Under utilization for zram hardware is a concern
here as well ?

Swapping out memory into zram wont increase the latency for a hot start ? Or
is it because as it will prevent a fresh cold start which anyway will be slower
than a slow hot start. Just being curious.


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 1/7] mm: introduce MADV_COOL
  2019-05-20  3:52 ` [RFC 1/7] mm: introduce MADV_COOL Minchan Kim
@ 2019-05-20  8:16   ` Michal Hocko
  2019-05-20  8:19     ` Michal Hocko
  2019-05-20 22:54     ` Minchan Kim
  2019-05-28  8:53   ` Hillf Danton
  2019-05-28 10:58   ` Minchan Kim
  2 siblings, 2 replies; 138+ messages in thread
From: Michal Hocko @ 2019-05-20  8:16 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

[CC linux-api]

On Mon 20-05-19 12:52:48, Minchan Kim wrote:
> When a process expects no accesses to a certain memory range
> it could hint kernel that the pages can be reclaimed
> when memory pressure happens but data should be preserved
> for future use.  This could reduce workingset eviction so it
> ends up increasing performance.
> 
> This patch introduces the new MADV_COOL hint to madvise(2)
> syscall. MADV_COOL can be used by a process to mark a memory range
> as not expected to be used in the near future. The hint can help
> kernel in deciding which pages to evict early during memory
> pressure.

I do not want to start naming fight but MADV_COOL sounds a bit
misleading. Everybody thinks his pages are cool ;). Probably MADV_COLD
or MADV_DONTNEED_PRESERVE.

> Internally, it works via deactivating memory from active list to
> inactive's head so when the memory pressure happens, they will be
> reclaimed earlier than other active pages unless there is no
> access until the time.

Could you elaborate about the decision to move to the head rather than
tail? What should happen to inactive pages? Should we move them to the
tail? Your implementation seems to ignore those completely. Why?

What should happen for shared pages? In other words do we want to allow
less privileged process to control evicting of shared pages with a more
privileged one? E.g. think of all sorts of side channel attacks. Maybe
we want to do the same thing as for mincore where write access is
required.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 1/7] mm: introduce MADV_COOL
  2019-05-20  8:16   ` Michal Hocko
@ 2019-05-20  8:19     ` Michal Hocko
  2019-05-20 15:08       ` Suren Baghdasaryan
  2019-05-20 22:55       ` Minchan Kim
  2019-05-20 22:54     ` Minchan Kim
  1 sibling, 2 replies; 138+ messages in thread
From: Michal Hocko @ 2019-05-20  8:19 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Mon 20-05-19 10:16:21, Michal Hocko wrote:
> [CC linux-api]
> 
> On Mon 20-05-19 12:52:48, Minchan Kim wrote:
> > When a process expects no accesses to a certain memory range
> > it could hint kernel that the pages can be reclaimed
> > when memory pressure happens but data should be preserved
> > for future use.  This could reduce workingset eviction so it
> > ends up increasing performance.
> > 
> > This patch introduces the new MADV_COOL hint to madvise(2)
> > syscall. MADV_COOL can be used by a process to mark a memory range
> > as not expected to be used in the near future. The hint can help
> > kernel in deciding which pages to evict early during memory
> > pressure.
> 
> I do not want to start naming fight but MADV_COOL sounds a bit
> misleading. Everybody thinks his pages are cool ;). Probably MADV_COLD
> or MADV_DONTNEED_PRESERVE.

OK, I can see that you have used MADV_COLD for a different mode.
So this one is effectively a non destructive MADV_FREE alternative
so MADV_FREE_PRESERVE would sound like a good fit. Your MADV_COLD
in other patch would then be MADV_DONTNEED_PRESERVE. Right?

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 3/7] mm: introduce MADV_COLD
  2019-05-20  3:52 ` [RFC 3/7] mm: introduce MADV_COLD Minchan Kim
@ 2019-05-20  8:27   ` Michal Hocko
  2019-05-20 23:00     ` Minchan Kim
  2019-05-28 14:54   ` Hillf Danton
  2019-05-30  0:45   ` Minchan Kim
  2 siblings, 1 reply; 138+ messages in thread
From: Michal Hocko @ 2019-05-20  8:27 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

[Cc linux-api]

On Mon 20-05-19 12:52:50, Minchan Kim wrote:
> When a process expects no accesses to a certain memory range
> for a long time, it could hint kernel that the pages can be
> reclaimed instantly but data should be preserved for future use.
> This could reduce workingset eviction so it ends up increasing
> performance.
> 
> This patch introduces the new MADV_COLD hint to madvise(2)
> syscall. MADV_COLD can be used by a process to mark a memory range
> as not expected to be used for a long time. The hint can help
> kernel in deciding which pages to evict proactively.

As mentioned in other email this looks like a non-destructive
MADV_DONTNEED alternative.

> Internally, it works via reclaiming memory in process context
> the syscall is called. If the page is dirty but backing storage
> is not synchronous device, the written page will be rotate back
> into LRU's tail once the write is done so they will reclaim easily
> when memory pressure happens. If backing storage is
> synchrnous device(e.g., zram), hte page will be reclaimed instantly.

Why do we special case async backing storage? Please always try to
explain _why_ the decision is made.

I haven't checked the implementation yet so I cannot comment on that.

> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  include/linux/swap.h                   |   1 +
>  include/uapi/asm-generic/mman-common.h |   1 +
>  mm/madvise.c                           | 123 +++++++++++++++++++++++++
>  mm/vmscan.c                            |  74 +++++++++++++++
>  4 files changed, 199 insertions(+)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 64795abea003..7f32a948fc6a 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -365,6 +365,7 @@ extern int vm_swappiness;
>  extern int remove_mapping(struct address_space *mapping, struct page *page);
>  extern unsigned long vm_total_pages;
>  
> +extern unsigned long reclaim_pages(struct list_head *page_list);
>  #ifdef CONFIG_NUMA
>  extern int node_reclaim_mode;
>  extern int sysctl_min_unmapped_ratio;
> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index f7a4a5d4b642..b9b51eeb8e1a 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -43,6 +43,7 @@
>  #define MADV_WILLNEED	3		/* will need these pages */
>  #define MADV_DONTNEED	4		/* don't need these pages */
>  #define MADV_COOL	5		/* deactivatie these pages */
> +#define MADV_COLD	6		/* reclaim these pages */
>  
>  /* common parameters: try to keep these consistent across architectures */
>  #define MADV_FREE	8		/* free pages only if memory pressure */
> diff --git a/mm/madvise.c b/mm/madvise.c
> index c05817fb570d..9a6698b56845 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -42,6 +42,7 @@ static int madvise_need_mmap_write(int behavior)
>  	case MADV_WILLNEED:
>  	case MADV_DONTNEED:
>  	case MADV_COOL:
> +	case MADV_COLD:
>  	case MADV_FREE:
>  		return 0;
>  	default:
> @@ -416,6 +417,125 @@ static long madvise_cool(struct vm_area_struct *vma,
>  	return 0;
>  }
>  
> +static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
> +				unsigned long end, struct mm_walk *walk)
> +{
> +	pte_t *orig_pte, *pte, ptent;
> +	spinlock_t *ptl;
> +	LIST_HEAD(page_list);
> +	struct page *page;
> +	int isolated = 0;
> +	struct vm_area_struct *vma = walk->vma;
> +	unsigned long next;
> +
> +	next = pmd_addr_end(addr, end);
> +	if (pmd_trans_huge(*pmd)) {
> +		spinlock_t *ptl;
> +
> +		ptl = pmd_trans_huge_lock(pmd, vma);
> +		if (!ptl)
> +			return 0;
> +
> +		if (is_huge_zero_pmd(*pmd))
> +			goto huge_unlock;
> +
> +		page = pmd_page(*pmd);
> +		if (page_mapcount(page) > 1)
> +			goto huge_unlock;
> +
> +		if (next - addr != HPAGE_PMD_SIZE) {
> +			int err;
> +
> +			get_page(page);
> +			spin_unlock(ptl);
> +			lock_page(page);
> +			err = split_huge_page(page);
> +			unlock_page(page);
> +			put_page(page);
> +			if (!err)
> +				goto regular_page;
> +			return 0;
> +		}
> +
> +		if (isolate_lru_page(page))
> +			goto huge_unlock;
> +
> +		list_add(&page->lru, &page_list);
> +huge_unlock:
> +		spin_unlock(ptl);
> +		reclaim_pages(&page_list);
> +		return 0;
> +	}
> +
> +	if (pmd_trans_unstable(pmd))
> +		return 0;
> +regular_page:
> +	orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> +	for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {
> +		ptent = *pte;
> +		if (!pte_present(ptent))
> +			continue;
> +
> +		page = vm_normal_page(vma, addr, ptent);
> +		if (!page)
> +			continue;
> +
> +		if (page_mapcount(page) > 1)
> +			continue;
> +
> +		if (isolate_lru_page(page))
> +			continue;
> +
> +		isolated++;
> +		list_add(&page->lru, &page_list);
> +		if (isolated >= SWAP_CLUSTER_MAX) {
> +			pte_unmap_unlock(orig_pte, ptl);
> +			reclaim_pages(&page_list);
> +			isolated = 0;
> +			pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> +			orig_pte = pte;
> +		}
> +	}
> +
> +	pte_unmap_unlock(orig_pte, ptl);
> +	reclaim_pages(&page_list);
> +	cond_resched();
> +
> +	return 0;
> +}
> +
> +static void madvise_cold_page_range(struct mmu_gather *tlb,
> +			     struct vm_area_struct *vma,
> +			     unsigned long addr, unsigned long end)
> +{
> +	struct mm_walk warm_walk = {
> +		.pmd_entry = madvise_cold_pte_range,
> +		.mm = vma->vm_mm,
> +	};
> +
> +	tlb_start_vma(tlb, vma);
> +	walk_page_range(addr, end, &warm_walk);
> +	tlb_end_vma(tlb, vma);
> +}
> +
> +
> +static long madvise_cold(struct vm_area_struct *vma,
> +			unsigned long start_addr, unsigned long end_addr)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	struct mmu_gather tlb;
> +
> +	if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
> +		return -EINVAL;
> +
> +	lru_add_drain();
> +	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
> +	madvise_cold_page_range(&tlb, vma, start_addr, end_addr);
> +	tlb_finish_mmu(&tlb, start_addr, end_addr);
> +
> +	return 0;
> +}
> +
>  static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  				unsigned long end, struct mm_walk *walk)
>  
> @@ -806,6 +926,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
>  		return madvise_willneed(vma, prev, start, end);
>  	case MADV_COOL:
>  		return madvise_cool(vma, start, end);
> +	case MADV_COLD:
> +		return madvise_cold(vma, start, end);
>  	case MADV_FREE:
>  	case MADV_DONTNEED:
>  		return madvise_dontneed_free(vma, prev, start, end, behavior);
> @@ -828,6 +950,7 @@ madvise_behavior_valid(int behavior)
>  	case MADV_DONTNEED:
>  	case MADV_FREE:
>  	case MADV_COOL:
> +	case MADV_COLD:
>  #ifdef CONFIG_KSM
>  	case MADV_MERGEABLE:
>  	case MADV_UNMERGEABLE:
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a28e5d17b495..1701b31f70a8 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2096,6 +2096,80 @@ static void shrink_active_list(unsigned long nr_to_scan,
>  			nr_deactivate, nr_rotated, sc->priority, file);
>  }
>  
> +unsigned long reclaim_pages(struct list_head *page_list)
> +{
> +	int nid = -1;
> +	unsigned long nr_isolated[2] = {0, };
> +	unsigned long nr_reclaimed = 0;
> +	LIST_HEAD(node_page_list);
> +	struct reclaim_stat dummy_stat;
> +	struct scan_control sc = {
> +		.gfp_mask = GFP_KERNEL,
> +		.priority = DEF_PRIORITY,
> +		.may_writepage = 1,
> +		.may_unmap = 1,
> +		.may_swap = 1,
> +	};
> +
> +	while (!list_empty(page_list)) {
> +		struct page *page;
> +
> +		page = lru_to_page(page_list);
> +		list_del(&page->lru);
> +
> +		if (nid == -1) {
> +			nid = page_to_nid(page);
> +			INIT_LIST_HEAD(&node_page_list);
> +			nr_isolated[0] = nr_isolated[1] = 0;
> +		}
> +
> +		if (nid == page_to_nid(page)) {
> +			list_add(&page->lru, &node_page_list);
> +			nr_isolated[!!page_is_file_cache(page)] +=
> +						hpage_nr_pages(page);
> +			continue;
> +		}
> +
> +		nid = page_to_nid(page);
> +
> +		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
> +					nr_isolated[0]);
> +		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
> +					nr_isolated[1]);
> +		nr_reclaimed += shrink_page_list(&node_page_list,
> +				NODE_DATA(nid), &sc, TTU_IGNORE_ACCESS,
> +				&dummy_stat, true);
> +		while (!list_empty(&node_page_list)) {
> +			struct page *page = lru_to_page(page_list);
> +
> +			list_del(&page->lru);
> +			putback_lru_page(page);
> +		}
> +		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
> +					-nr_isolated[0]);
> +		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
> +					-nr_isolated[1]);
> +		nr_isolated[0] = nr_isolated[1] = 0;
> +		INIT_LIST_HEAD(&node_page_list);
> +	}
> +
> +	if (!list_empty(&node_page_list)) {
> +		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
> +					nr_isolated[0]);
> +		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
> +					nr_isolated[1]);
> +		nr_reclaimed += shrink_page_list(&node_page_list,
> +				NODE_DATA(nid), &sc, TTU_IGNORE_ACCESS,
> +				&dummy_stat, true);
> +		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
> +					-nr_isolated[0]);
> +		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
> +					-nr_isolated[1]);
> +	}
> +
> +	return nr_reclaimed;
> +}
> +
>  /*
>   * The inactive anon list should be small enough that the VM never has
>   * to do too much work.
> -- 
> 2.21.0.1020.gf2820cf01a-goog
> 

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 5/7] mm: introduce external memory hinting API
  2019-05-20  3:52 ` [RFC 5/7] mm: introduce external memory hinting API Minchan Kim
@ 2019-05-20  9:18   ` Michal Hocko
  2019-05-21  2:41     ` Minchan Kim
  2019-05-21  9:01   ` Christian Brauner
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 138+ messages in thread
From: Michal Hocko @ 2019-05-20  9:18 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

[Cc linux-api]

On Mon 20-05-19 12:52:52, Minchan Kim wrote:
> There is some usecase that centralized userspace daemon want to give
> a memory hint like MADV_[COOL|COLD] to other process. Android's
> ActivityManagerService is one of them.
> 
> It's similar in spirit to madvise(MADV_WONTNEED), but the information
> required to make the reclaim decision is not known to the app. Instead,
> it is known to the centralized userspace daemon(ActivityManagerService),
> and that daemon must be able to initiate reclaim on its own without
> any app involvement.

Could you expand some more about how this all works? How does the
centralized daemon track respective ranges? How does it synchronize
against parallel modification of the address space etc.

> To solve the issue, this patch introduces new syscall process_madvise(2)
> which works based on pidfd so it could give a hint to the exeternal
> process.
> 
> int process_madvise(int pidfd, void *addr, size_t length, int advise);

OK, this makes some sense from the API point of view. When we have
discussed that at LSFMM I was contemplating about something like that
except the fd would be a VMA fd rather than the process. We could extend
and reuse /proc/<pid>/map_files interface which doesn't support the
anonymous memory right now. 

I am not saying this would be a better interface but I wanted to mention
it here for a further discussion. One slight advantage would be that
you know the exact object that you are operating on because you have a
fd for the VMA and we would have a more straightforward way to reject
operation if the underlying object has changed (e.g. unmapped and reused
for a different mapping).

> All advises madvise provides can be supported in process_madvise, too.
> Since it could affect other process's address range, only privileged
> process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> gives it the right to ptrrace the process could use it successfully.

proc_mem_open model we use for accessing address space via proc sounds
like a good mode. You are doing something similar.

> Please suggest better idea if you have other idea about the permission.
> 
> * from v1r1
>   * use ptrace capability - surenb, dancol
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  arch/x86/entry/syscalls/syscall_32.tbl |  1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |  1 +
>  include/linux/proc_fs.h                |  1 +
>  include/linux/syscalls.h               |  2 ++
>  include/uapi/asm-generic/unistd.h      |  2 ++
>  kernel/signal.c                        |  2 +-
>  kernel/sys_ni.c                        |  1 +
>  mm/madvise.c                           | 45 ++++++++++++++++++++++++++
>  8 files changed, 54 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index 4cd5f982b1e5..5b9dd55d6b57 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -438,3 +438,4 @@
>  425	i386	io_uring_setup		sys_io_uring_setup		__ia32_sys_io_uring_setup
>  426	i386	io_uring_enter		sys_io_uring_enter		__ia32_sys_io_uring_enter
>  427	i386	io_uring_register	sys_io_uring_register		__ia32_sys_io_uring_register
> +428	i386	process_madvise		sys_process_madvise		__ia32_sys_process_madvise
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index 64ca0d06259a..0e5ee78161c9 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -355,6 +355,7 @@
>  425	common	io_uring_setup		__x64_sys_io_uring_setup
>  426	common	io_uring_enter		__x64_sys_io_uring_enter
>  427	common	io_uring_register	__x64_sys_io_uring_register
> +428	common	process_madvise		__x64_sys_process_madvise
>  
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
> index 52a283ba0465..f8545d7c5218 100644
> --- a/include/linux/proc_fs.h
> +++ b/include/linux/proc_fs.h
> @@ -122,6 +122,7 @@ static inline struct pid *tgid_pidfd_to_pid(const struct file *file)
>  
>  #endif /* CONFIG_PROC_FS */
>  
> +extern struct pid *pidfd_to_pid(const struct file *file);
>  struct net;
>  
>  static inline struct proc_dir_entry *proc_net_mkdir(
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index e2870fe1be5b..21c6c9a62006 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -872,6 +872,8 @@ asmlinkage long sys_munlockall(void);
>  asmlinkage long sys_mincore(unsigned long start, size_t len,
>  				unsigned char __user * vec);
>  asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
> +asmlinkage long sys_process_madvise(int pid_fd, unsigned long start,
> +				size_t len, int behavior);
>  asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
>  			unsigned long prot, unsigned long pgoff,
>  			unsigned long flags);
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index dee7292e1df6..7ee82ce04620 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -832,6 +832,8 @@ __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
>  __SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
>  #define __NR_io_uring_register 427
>  __SYSCALL(__NR_io_uring_register, sys_io_uring_register)
> +#define __NR_process_madvise 428
> +__SYSCALL(__NR_process_madvise, sys_process_madvise)
>  
>  #undef __NR_syscalls
>  #define __NR_syscalls 428
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 1c86b78a7597..04e75daab1f8 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -3620,7 +3620,7 @@ static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo, siginfo_t *info)
>  	return copy_siginfo_from_user(kinfo, info);
>  }
>  
> -static struct pid *pidfd_to_pid(const struct file *file)
> +struct pid *pidfd_to_pid(const struct file *file)
>  {
>  	if (file->f_op == &pidfd_fops)
>  		return file->private_data;
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 4d9ae5ea6caf..5277421795ab 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -278,6 +278,7 @@ COND_SYSCALL(mlockall);
>  COND_SYSCALL(munlockall);
>  COND_SYSCALL(mincore);
>  COND_SYSCALL(madvise);
> +COND_SYSCALL(process_madvise);
>  COND_SYSCALL(remap_file_pages);
>  COND_SYSCALL(mbind);
>  COND_SYSCALL_COMPAT(mbind);
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 119e82e1f065..af02aa17e5c1 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -9,6 +9,7 @@
>  #include <linux/mman.h>
>  #include <linux/pagemap.h>
>  #include <linux/page_idle.h>
> +#include <linux/proc_fs.h>
>  #include <linux/syscalls.h>
>  #include <linux/mempolicy.h>
>  #include <linux/page-isolation.h>
> @@ -16,6 +17,7 @@
>  #include <linux/hugetlb.h>
>  #include <linux/falloc.h>
>  #include <linux/sched.h>
> +#include <linux/sched/mm.h>
>  #include <linux/ksm.h>
>  #include <linux/fs.h>
>  #include <linux/file.h>
> @@ -1140,3 +1142,46 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
>  {
>  	return madvise_core(current, start, len_in, behavior);
>  }
> +
> +SYSCALL_DEFINE4(process_madvise, int, pidfd, unsigned long, start,
> +		size_t, len_in, int, behavior)
> +{
> +	int ret;
> +	struct fd f;
> +	struct pid *pid;
> +	struct task_struct *tsk;
> +	struct mm_struct *mm;
> +
> +	f = fdget(pidfd);
> +	if (!f.file)
> +		return -EBADF;
> +
> +	pid = pidfd_to_pid(f.file);
> +	if (IS_ERR(pid)) {
> +		ret = PTR_ERR(pid);
> +		goto err;
> +	}
> +
> +	ret = -EINVAL;
> +	rcu_read_lock();
> +	tsk = pid_task(pid, PIDTYPE_PID);
> +	if (!tsk) {
> +		rcu_read_unlock();
> +		goto err;
> +	}
> +	get_task_struct(tsk);
> +	rcu_read_unlock();
> +	mm = mm_access(tsk, PTRACE_MODE_ATTACH_REALCREDS);
> +	if (!mm || IS_ERR(mm)) {
> +		ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
> +		if (ret == -EACCES)
> +			ret = -EPERM;
> +		goto err;
> +	}
> +	ret = madvise_core(tsk, start, len_in, behavior);
> +	mmput(mm);
> +	put_task_struct(tsk);
> +err:
> +	fdput(f);
> +	return ret;
> +}
> -- 
> 2.21.0.1020.gf2820cf01a-goog
> 

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector arrary
  2019-05-20  3:52 ` [RFC 6/7] mm: extend process_madvise syscall to support vector arrary Minchan Kim
@ 2019-05-20  9:22   ` Michal Hocko
  2019-05-21  2:48     ` Minchan Kim
  2019-05-29  4:14   ` Hillf Danton
  2019-05-30  0:35   ` Minchan Kim
  2 siblings, 1 reply; 138+ messages in thread
From: Michal Hocko @ 2019-05-20  9:22 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

[Cc linux-api]

On Mon 20-05-19 12:52:53, Minchan Kim wrote:
> Currently, process_madvise syscall works for only one address range
> so user should call the syscall several times to give hints to
> multiple address range.

Is that a problem? How big of a problem? Any numbers?

> This patch extends process_madvise syscall to support multiple
> hints, address ranges and return vaules so user could give hints
> all at once.
> 
> struct pr_madvise_param {
>     int size;                       /* the size of this structure */
>     const struct iovec __user *vec; /* address range array */
> }
> 
> int process_madvise(int pidfd, ssize_t nr_elem,
> 		    int *behavior,
> 		    struct pr_madvise_param *results,
> 		    struct pr_madvise_param *ranges,
> 		    unsigned long flags);
> 
> - pidfd
> 
> target process fd
> 
> - nr_elem
> 
> the number of elemenent of array behavior, results, ranges
> 
> - behavior
> 
> hints for each address range in remote process so that user could
> give different hints for each range.

What is the guarantee of a single call? Do all hints get applied or the
first failure backs of? What are the atomicity guarantees?

> 
> - results
> 
> array of buffers to get results for associated remote address range
> action.
> 
> - ranges
> 
> array to buffers to have remote process's address ranges to be
> processed
> 
> - flags
> 
> extra argument for the future. It should be zero this moment.
> 
> Example)
> 
> struct pr_madvise_param {
>         int size;
>         const struct iovec *vec;
> };
> 
> int main(int argc, char *argv[])
> {
>         struct pr_madvise_param retp, rangep;
>         struct iovec result_vec[2], range_vec[2];
>         int hints[2];
>         long ret[2];
>         void *addr[2];
> 
>         pid_t pid;
>         char cmd[64] = {0,};
>         addr[0] = mmap(NULL, ALLOC_SIZE, PROT_READ|PROT_WRITE,
>                           MAP_POPULATE|MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
> 
>         if (MAP_FAILED == addr[0])
>                 return 1;
> 
>         addr[1] = mmap(NULL, ALLOC_SIZE, PROT_READ|PROT_WRITE,
>                           MAP_POPULATE|MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
> 
>         if (MAP_FAILED == addr[1])
>                 return 1;
> 
>         hints[0] = MADV_COLD;
> 	range_vec[0].iov_base = addr[0];
>         range_vec[0].iov_len = ALLOC_SIZE;
>         result_vec[0].iov_base = &ret[0];
>         result_vec[0].iov_len = sizeof(long);
> 	retp.vec = result_vec;
>         retp.size = sizeof(struct pr_madvise_param);
> 
>         hints[1] = MADV_COOL;
>         range_vec[1].iov_base = addr[1];
>         range_vec[1].iov_len = ALLOC_SIZE;
>         result_vec[1].iov_base = &ret[1];
>         result_vec[1].iov_len = sizeof(long);
>         rangep.vec = range_vec;
>         rangep.size = sizeof(struct pr_madvise_param);
> 
>         pid = fork();
>         if (!pid) {
>                 sleep(10);
>         } else {
>                 int pidfd = open(cmd,  O_DIRECTORY | O_CLOEXEC);
>                 if (pidfd < 0)
>                         return 1;
> 
>                 /* munmap to make pages private for the child */
>                 munmap(addr[0], ALLOC_SIZE);
>                 munmap(addr[1], ALLOC_SIZE);
>                 system("cat /proc/vmstat | egrep 'pswpout|deactivate'");
>                 if (syscall(__NR_process_madvise, pidfd, 2, behaviors,
> 						&retp, &rangep, 0))
>                         perror("process_madvise fail\n");
>                 system("cat /proc/vmstat | egrep 'pswpout|deactivate'");
>         }
> 
>         return 0;
> }
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  include/uapi/asm-generic/mman-common.h |   5 +
>  mm/madvise.c                           | 184 +++++++++++++++++++++----
>  2 files changed, 166 insertions(+), 23 deletions(-)
> 
> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index b9b51eeb8e1a..b8e230de84a6 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -74,4 +74,9 @@
>  #define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
>  				 PKEY_DISABLE_WRITE)
>  
> +struct pr_madvise_param {
> +	int size;			/* the size of this structure */
> +	const struct iovec __user *vec;	/* address range array */
> +};
> +
>  #endif /* __ASM_GENERIC_MMAN_COMMON_H */
> diff --git a/mm/madvise.c b/mm/madvise.c
> index af02aa17e5c1..f4f569dac2bd 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -320,6 +320,7 @@ static int madvise_cool_pte_range(pmd_t *pmd, unsigned long addr,
>  	struct page *page;
>  	struct vm_area_struct *vma = walk->vma;
>  	unsigned long next;
> +	long nr_pages = 0;
>  
>  	next = pmd_addr_end(addr, end);
>  	if (pmd_trans_huge(*pmd)) {
> @@ -380,9 +381,12 @@ static int madvise_cool_pte_range(pmd_t *pmd, unsigned long addr,
>  
>  		ptep_test_and_clear_young(vma, addr, pte);
>  		deactivate_page(page);
> +		nr_pages++;
> +
>  	}
>  
>  	pte_unmap_unlock(orig_pte, ptl);
> +	*(long *)walk->private += nr_pages;
>  	cond_resched();
>  
>  	return 0;
> @@ -390,11 +394,13 @@ static int madvise_cool_pte_range(pmd_t *pmd, unsigned long addr,
>  
>  static void madvise_cool_page_range(struct mmu_gather *tlb,
>  			     struct vm_area_struct *vma,
> -			     unsigned long addr, unsigned long end)
> +			     unsigned long addr, unsigned long end,
> +			     long *nr_pages)
>  {
>  	struct mm_walk cool_walk = {
>  		.pmd_entry = madvise_cool_pte_range,
>  		.mm = vma->vm_mm,
> +		.private = nr_pages
>  	};
>  
>  	tlb_start_vma(tlb, vma);
> @@ -403,7 +409,8 @@ static void madvise_cool_page_range(struct mmu_gather *tlb,
>  }
>  
>  static long madvise_cool(struct vm_area_struct *vma,
> -			unsigned long start_addr, unsigned long end_addr)
> +			unsigned long start_addr, unsigned long end_addr,
> +			long *nr_pages)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  	struct mmu_gather tlb;
> @@ -413,7 +420,7 @@ static long madvise_cool(struct vm_area_struct *vma,
>  
>  	lru_add_drain();
>  	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
> -	madvise_cool_page_range(&tlb, vma, start_addr, end_addr);
> +	madvise_cool_page_range(&tlb, vma, start_addr, end_addr, nr_pages);
>  	tlb_finish_mmu(&tlb, start_addr, end_addr);
>  
>  	return 0;
> @@ -429,6 +436,7 @@ static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
>  	int isolated = 0;
>  	struct vm_area_struct *vma = walk->vma;
>  	unsigned long next;
> +	long nr_pages = 0;
>  
>  	next = pmd_addr_end(addr, end);
>  	if (pmd_trans_huge(*pmd)) {
> @@ -492,7 +500,7 @@ static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
>  		list_add(&page->lru, &page_list);
>  		if (isolated >= SWAP_CLUSTER_MAX) {
>  			pte_unmap_unlock(orig_pte, ptl);
> -			reclaim_pages(&page_list);
> +			nr_pages += reclaim_pages(&page_list);
>  			isolated = 0;
>  			pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>  			orig_pte = pte;
> @@ -500,19 +508,22 @@ static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
>  	}
>  
>  	pte_unmap_unlock(orig_pte, ptl);
> -	reclaim_pages(&page_list);
> +	nr_pages += reclaim_pages(&page_list);
>  	cond_resched();
>  
> +	*(long *)walk->private += nr_pages;
>  	return 0;
>  }
>  
>  static void madvise_cold_page_range(struct mmu_gather *tlb,
>  			     struct vm_area_struct *vma,
> -			     unsigned long addr, unsigned long end)
> +			     unsigned long addr, unsigned long end,
> +			     long *nr_pages)
>  {
>  	struct mm_walk warm_walk = {
>  		.pmd_entry = madvise_cold_pte_range,
>  		.mm = vma->vm_mm,
> +		.private = nr_pages,
>  	};
>  
>  	tlb_start_vma(tlb, vma);
> @@ -522,7 +533,8 @@ static void madvise_cold_page_range(struct mmu_gather *tlb,
>  
>  
>  static long madvise_cold(struct vm_area_struct *vma,
> -			unsigned long start_addr, unsigned long end_addr)
> +			unsigned long start_addr, unsigned long end_addr,
> +			long *nr_pages)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  	struct mmu_gather tlb;
> @@ -532,7 +544,7 @@ static long madvise_cold(struct vm_area_struct *vma,
>  
>  	lru_add_drain();
>  	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
> -	madvise_cold_page_range(&tlb, vma, start_addr, end_addr);
> +	madvise_cold_page_range(&tlb, vma, start_addr, end_addr, nr_pages);
>  	tlb_finish_mmu(&tlb, start_addr, end_addr);
>  
>  	return 0;
> @@ -922,7 +934,7 @@ static int madvise_inject_error(int behavior,
>  static long
>  madvise_vma(struct task_struct *tsk, struct vm_area_struct *vma,
>  		struct vm_area_struct **prev, unsigned long start,
> -		unsigned long end, int behavior)
> +		unsigned long end, int behavior, long *nr_pages)
>  {
>  	switch (behavior) {
>  	case MADV_REMOVE:
> @@ -930,9 +942,9 @@ madvise_vma(struct task_struct *tsk, struct vm_area_struct *vma,
>  	case MADV_WILLNEED:
>  		return madvise_willneed(vma, prev, start, end);
>  	case MADV_COOL:
> -		return madvise_cool(vma, start, end);
> +		return madvise_cool(vma, start, end, nr_pages);
>  	case MADV_COLD:
> -		return madvise_cold(vma, start, end);
> +		return madvise_cold(vma, start, end, nr_pages);
>  	case MADV_FREE:
>  	case MADV_DONTNEED:
>  		return madvise_dontneed_free(tsk, vma, prev, start,
> @@ -981,7 +993,7 @@ madvise_behavior_valid(int behavior)
>  }
>  
>  static int madvise_core(struct task_struct *tsk, unsigned long start,
> -			size_t len_in, int behavior)
> +			size_t len_in, int behavior, long *nr_pages)
>  {
>  	unsigned long end, tmp;
>  	struct vm_area_struct *vma, *prev;
> @@ -996,6 +1008,7 @@ static int madvise_core(struct task_struct *tsk, unsigned long start,
>  
>  	if (start & ~PAGE_MASK)
>  		return error;
> +
>  	len = (len_in + ~PAGE_MASK) & PAGE_MASK;
>  
>  	/* Check to see whether len was rounded up from small -ve to zero */
> @@ -1035,6 +1048,8 @@ static int madvise_core(struct task_struct *tsk, unsigned long start,
>  	blk_start_plug(&plug);
>  	for (;;) {
>  		/* Still start < end. */
> +		long pages = 0;
> +
>  		error = -ENOMEM;
>  		if (!vma)
>  			goto out;
> @@ -1053,9 +1068,11 @@ static int madvise_core(struct task_struct *tsk, unsigned long start,
>  			tmp = end;
>  
>  		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
> -		error = madvise_vma(tsk, vma, &prev, start, tmp, behavior);
> +		error = madvise_vma(tsk, vma, &prev, start, tmp,
> +					behavior, &pages);
>  		if (error)
>  			goto out;
> +		*nr_pages += pages;
>  		start = tmp;
>  		if (prev && start < prev->vm_end)
>  			start = prev->vm_end;
> @@ -1140,26 +1157,137 @@ static int madvise_core(struct task_struct *tsk, unsigned long start,
>   */
>  SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
>  {
> -	return madvise_core(current, start, len_in, behavior);
> +	unsigned long dummy;
> +
> +	return madvise_core(current, start, len_in, behavior, &dummy);
>  }
>  
> -SYSCALL_DEFINE4(process_madvise, int, pidfd, unsigned long, start,
> -		size_t, len_in, int, behavior)
> +static int pr_madvise_copy_param(struct pr_madvise_param __user *u_param,
> +		struct pr_madvise_param *param)
> +{
> +	u32 size;
> +	int ret;
> +
> +	memset(param, 0, sizeof(*param));
> +
> +	ret = get_user(size, &u_param->size);
> +	if (ret)
> +		return ret;
> +
> +	if (size > PAGE_SIZE)
> +		return -E2BIG;
> +
> +	if (!size || size > sizeof(struct pr_madvise_param))
> +		return -EINVAL;
> +
> +	ret = copy_from_user(param, u_param, size);
> +	if (ret)
> +		return -EFAULT;
> +
> +	return ret;
> +}
> +
> +static int process_madvise_core(struct task_struct *tsk, int *behaviors,
> +				struct iov_iter *iter,
> +				const struct iovec *range_vec,
> +				unsigned long riovcnt,
> +				unsigned long flags)
> +{
> +	int i;
> +	long err;
> +
> +	for (err = 0, i = 0; i < riovcnt && iov_iter_count(iter); i++) {
> +		long ret = 0;
> +
> +		err = madvise_core(tsk, (unsigned long)range_vec[i].iov_base,
> +				range_vec[i].iov_len, behaviors[i],
> +				&ret);
> +		if (err)
> +			ret = err;
> +
> +		if (copy_to_iter(&ret, sizeof(long), iter) !=
> +				sizeof(long)) {
> +			err = -EFAULT;
> +			break;
> +		}
> +
> +		err = 0;
> +	}
> +
> +	return err;
> +}
> +
> +SYSCALL_DEFINE6(process_madvise, int, pidfd, ssize_t, nr_elem,
> +			const int __user *, hints,
> +			struct pr_madvise_param __user *, results,
> +			struct pr_madvise_param __user *, ranges,
> +			unsigned long, flags)
>  {
>  	int ret;
>  	struct fd f;
>  	struct pid *pid;
>  	struct task_struct *tsk;
>  	struct mm_struct *mm;
> +	struct pr_madvise_param result_p, range_p;
> +	const struct iovec __user *result_vec, __user *range_vec;
> +	int *behaviors;
> +	struct iovec iovstack_result[UIO_FASTIOV];
> +	struct iovec iovstack_r[UIO_FASTIOV];
> +	struct iovec *iov_l = iovstack_result;
> +	struct iovec *iov_r = iovstack_r;
> +	struct iov_iter iter;
> +
> +	if (flags != 0)
> +		return -EINVAL;
> +
> +	ret = pr_madvise_copy_param(results, &result_p);
> +	if (ret)
> +		return ret;
> +
> +	ret = pr_madvise_copy_param(ranges, &range_p);
> +	if (ret)
> +		return ret;
> +
> +	result_vec = result_p.vec;
> +	range_vec = range_p.vec;
> +
> +	if (result_p.size != sizeof(struct pr_madvise_param) ||
> +			range_p.size != sizeof(struct pr_madvise_param))
> +		return -EINVAL;
> +
> +	behaviors = kmalloc_array(nr_elem, sizeof(int), GFP_KERNEL);
> +	if (!behaviors)
> +		return -ENOMEM;
> +
> +	ret = copy_from_user(behaviors, hints, sizeof(int) * nr_elem);
> +	if (ret < 0)
> +		goto free_behavior_vec;
> +
> +	ret = import_iovec(READ, result_vec, nr_elem, UIO_FASTIOV,
> +				&iov_l, &iter);
> +	if (ret < 0)
> +		goto free_behavior_vec;
> +
> +	if (!iov_iter_count(&iter)) {
> +		ret = -EINVAL;
> +		goto free_iovecs;
> +	}
> +
> +	ret = rw_copy_check_uvector(CHECK_IOVEC_ONLY, range_vec, nr_elem,
> +				UIO_FASTIOV, iovstack_r, &iov_r);
> +	if (ret <= 0)
> +		goto free_iovecs;
>  
>  	f = fdget(pidfd);
> -	if (!f.file)
> -		return -EBADF;
> +	if (!f.file) {
> +		ret = -EBADF;
> +		goto free_iovecs;
> +	}
>  
>  	pid = pidfd_to_pid(f.file);
>  	if (IS_ERR(pid)) {
>  		ret = PTR_ERR(pid);
> -		goto err;
> +		goto put_fd;
>  	}
>  
>  	ret = -EINVAL;
> @@ -1167,7 +1295,7 @@ SYSCALL_DEFINE4(process_madvise, int, pidfd, unsigned long, start,
>  	tsk = pid_task(pid, PIDTYPE_PID);
>  	if (!tsk) {
>  		rcu_read_unlock();
> -		goto err;
> +		goto put_fd;
>  	}
>  	get_task_struct(tsk);
>  	rcu_read_unlock();
> @@ -1176,12 +1304,22 @@ SYSCALL_DEFINE4(process_madvise, int, pidfd, unsigned long, start,
>  		ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
>  		if (ret == -EACCES)
>  			ret = -EPERM;
> -		goto err;
> +		goto put_task;
>  	}
> -	ret = madvise_core(tsk, start, len_in, behavior);
> +
> +	ret = process_madvise_core(tsk, behaviors, &iter, iov_r,
> +					nr_elem, flags);
>  	mmput(mm);
> +put_task:
>  	put_task_struct(tsk);
> -err:
> +put_fd:
>  	fdput(f);
> +free_iovecs:
> +	if (iov_r != iovstack_r)
> +		kfree(iov_r);
> +	kfree(iov_l);
> +free_behavior_vec:
> +	kfree(behaviors);
> +
>  	return ret;
>  }
> -- 
> 2.21.0.1020.gf2820cf01a-goog
> 

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-20  3:52 ` [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER Minchan Kim
@ 2019-05-20  9:28   ` Michal Hocko
  2019-05-21  2:55     ` Minchan Kim
  2019-05-29  4:36   ` Hillf Danton
  2019-05-30  1:00   ` Minchan Kim
  2 siblings, 1 reply; 138+ messages in thread
From: Michal Hocko @ 2019-05-20  9:28 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

[cc linux-api]

On Mon 20-05-19 12:52:54, Minchan Kim wrote:
> System could have much faster swap device like zRAM. In that case, swapping
> is extremely cheaper than file-IO on the low-end storage.
> In this configuration, userspace could handle different strategy for each
> kinds of vma. IOW, they want to reclaim anonymous pages by MADV_COLD
> while it keeps file-backed pages in inactive LRU by MADV_COOL because
> file IO is more expensive in this case so want to keep them in memory
> until memory pressure happens.
> 
> To support such strategy easier, this patch introduces
> MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER options in madvise(2) like
> that /proc/<pid>/clear_refs already has supported same filters.
> They are filters could be Ored with other existing hints using top two bits
> of (int behavior).

madvise operates on top of ranges and it is quite trivial to do the
filtering from the userspace so why do we need any additional filtering?

> Once either of them is set, the hint could affect only the interested vma
> either anonymous or file-backed.
> 
> With that, user could call a process_madvise syscall simply with a entire
> range(0x0 - 0xFFFFFFFFFFFFFFFF) but either of MADV_ANONYMOUS_FILTER and
> MADV_FILE_FILTER so there is no need to call the syscall range by range.

OK, so here is the reason you want that. The immediate question is why
cannot the monitor do the filtering from the userspace. Slightly more
work, all right, but less of an API to expose and that itself is a
strong argument against.

> * from v1r2
>   * use consistent check with clear_refs to identify anon/file vma - surenb
> 
> * from v1r1
>   * use naming "filter" for new madvise option - dancol
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  include/uapi/asm-generic/mman-common.h |  5 +++++
>  mm/madvise.c                           | 14 ++++++++++++++
>  2 files changed, 19 insertions(+)
> 
> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index b8e230de84a6..be59a1b90284 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -66,6 +66,11 @@
>  #define MADV_WIPEONFORK 18		/* Zero memory on fork, child only */
>  #define MADV_KEEPONFORK 19		/* Undo MADV_WIPEONFORK */
>  
> +#define MADV_BEHAVIOR_MASK (~(MADV_ANONYMOUS_FILTER|MADV_FILE_FILTER))
> +
> +#define MADV_ANONYMOUS_FILTER	(1<<31)	/* works for only anonymous vma */
> +#define MADV_FILE_FILTER	(1<<30)	/* works for only file-backed vma */
> +
>  /* compatibility flags */
>  #define MAP_FILE	0
>  
> diff --git a/mm/madvise.c b/mm/madvise.c
> index f4f569dac2bd..116131243540 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1002,7 +1002,15 @@ static int madvise_core(struct task_struct *tsk, unsigned long start,
>  	int write;
>  	size_t len;
>  	struct blk_plug plug;
> +	bool anon_only, file_only;
>  
> +	anon_only = behavior & MADV_ANONYMOUS_FILTER;
> +	file_only = behavior & MADV_FILE_FILTER;
> +
> +	if (anon_only && file_only)
> +		return error;
> +
> +	behavior = behavior & MADV_BEHAVIOR_MASK;
>  	if (!madvise_behavior_valid(behavior))
>  		return error;
>  
> @@ -1067,12 +1075,18 @@ static int madvise_core(struct task_struct *tsk, unsigned long start,
>  		if (end < tmp)
>  			tmp = end;
>  
> +		if (anon_only && vma->vm_file)
> +			goto next;
> +		if (file_only && !vma->vm_file)
> +			goto next;
> +
>  		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
>  		error = madvise_vma(tsk, vma, &prev, start, tmp,
>  					behavior, &pages);
>  		if (error)
>  			goto out;
>  		*nr_pages += pages;
> +next:
>  		start = tmp;
>  		if (prev && start < prev->vm_end)
>  			start = prev->vm_end;
> -- 
> 2.21.0.1020.gf2820cf01a-goog
> 

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 0/7] introduce memory hinting API for external process
  2019-05-20  3:52 [RFC 0/7] introduce memory hinting API for external process Minchan Kim
                   ` (7 preceding siblings ...)
  2019-05-20  6:37 ` [RFC 0/7] introduce memory hinting API for external process Anshuman Khandual
@ 2019-05-20  9:28 ` Michal Hocko
  2019-05-20 14:42 ` Oleksandr Natalenko
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 138+ messages in thread
From: Michal Hocko @ 2019-05-20  9:28 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

[Cc linux-api]

On Mon 20-05-19 12:52:47, Minchan Kim wrote:
> - Background
> 
> The Android terminology used for forking a new process and starting an app
> from scratch is a cold start, while resuming an existing app is a hot start.
> While we continually try to improve the performance of cold starts, hot
> starts will always be significantly less power hungry as well as faster so
> we are trying to make hot start more likely than cold start.
> 
> To increase hot start, Android userspace manages the order that apps should
> be killed in a process called ActivityManagerService. ActivityManagerService
> tracks every Android app or service that the user could be interacting with
> at any time and translates that into a ranked list for lmkd(low memory
> killer daemon). They are likely to be killed by lmkd if the system has to
> reclaim memory. In that sense they are similar to entries in any other cache.
> Those apps are kept alive for opportunistic performance improvements but
> those performance improvements will vary based on the memory requirements of
> individual workloads.
> 
> - Problem
> 
> Naturally, cached apps were dominant consumers of memory on the system.
> However, they were not significant consumers of swap even though they are
> good candidate for swap. Under investigation, swapping out only begins
> once the low zone watermark is hit and kswapd wakes up, but the overall
> allocation rate in the system might trip lmkd thresholds and cause a cached
> process to be killed(we measured performance swapping out vs. zapping the
> memory by killing a process. Unsurprisingly, zapping is 10x times faster
> even though we use zram which is much faster than real storage) so kill
> from lmkd will often satisfy the high zone watermark, resulting in very
> few pages actually being moved to swap.
> 
> - Approach
> 
> The approach we chose was to use a new interface to allow userspace to
> proactively reclaim entire processes by leveraging platform information.
> This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
> that are known to be cold from userspace and to avoid races with lmkd
> by reclaiming apps as soon as they entered the cached state. Additionally,
> it could provide many chances for platform to use much information to
> optimize memory efficiency.
> 
> IMHO we should spell it out that this patchset complements MADV_WONTNEED
> and MADV_FREE by adding non-destructive ways to gain some free memory
> space. MADV_COLD is similar to MADV_WONTNEED in a way that it hints the
> kernel that memory region is not currently needed and should be reclaimed
> immediately; MADV_COOL is similar to MADV_FREE in a way that it hints the
> kernel that memory region is not currently needed and should be reclaimed
> when memory pressure rises.
> 
> To achieve the goal, the patchset introduce two new options for madvise.
> One is MADV_COOL which will deactive activated pages and the other is
> MADV_COLD which will reclaim private pages instantly. These new options
> complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to
> gain some free memory space. MADV_COLD is similar to MADV_DONTNEED in a way
> that it hints the kernel that memory region is not currently needed and
> should be reclaimed immediately; MADV_COOL is similar to MADV_FREE in a way
> that it hints the kernel that memory region is not currently needed and
> should be reclaimed when memory pressure rises.
> 
> This approach is similar in spirit to madvise(MADV_WONTNEED), but the
> information required to make the reclaim decision is not known to the app.
> Instead, it is known to a centralized userspace daemon, and that daemon
> must be able to initiate reclaim on its own without any app involvement.
> To solve the concern, this patch introduces new syscall -
> 
> 	struct pr_madvise_param {
> 		int size;
> 		const struct iovec *vec;
> 	}
> 
> 	int process_madvise(int pidfd, ssize_t nr_elem, int *behavior,
> 				struct pr_madvise_param *restuls,
> 				struct pr_madvise_param *ranges,
> 				unsigned long flags);
> 
> The syscall get pidfd to give hints to external process and provides
> pair of result/ranges vector arguments so that it could give several
> hints to each address range all at once.
> 
> I guess others have different ideas about the naming of syscall and options
> so feel free to suggest better naming.
> 
> - Experiment
> 
> We did bunch of testing with several hundreds of real users, not artificial
> benchmark on android. We saw about 17% cold start decreasement without any
> significant battery/app startup latency issues. And with artificial benchmark
> which launches and switching apps, we saw average 7% app launching improvement,
> 18% less lmkd kill and good stat from vmstat.
> 
> A is vanilla and B is process_madvise.
> 
> 
>                                        A          B      delta   ratio(%)
>                allocstall_dma          0          0          0       0.00
>            allocstall_movable       1464        457      -1007     -69.00
>             allocstall_normal     263210     190763     -72447     -28.00
>              allocstall_total     264674     191220     -73454     -28.00
>           compact_daemon_wake      26912      25294      -1618      -7.00
>                  compact_fail      17885      14151      -3734     -21.00
>          compact_free_scanned 4204766409 3835994922 -368771487      -9.00
>              compact_isolated    3446484    2967618    -478866     -14.00
>       compact_migrate_scanned 1621336411 1324695710 -296640701     -19.00
>                 compact_stall      19387      15343      -4044     -21.00
>               compact_success       1502       1192       -310     -21.00
> kswapd_high_wmark_hit_quickly        234        184        -50     -22.00
>             kswapd_inodesteal     221635     233093      11458       5.00
>  kswapd_low_wmark_hit_quickly      66065      54009     -12056     -19.00
>                    nr_dirtied     259934     296476      36542      14.00
>   nr_vmscan_immediate_reclaim       2587       2356       -231      -9.00
>               nr_vmscan_write    1274232    2661733    1387501     108.00
>                    nr_written    1514060    2937560    1423500      94.00
>                    pageoutrun      67561      55133     -12428     -19.00
>                    pgactivate    2335060    1984882    -350178     -15.00
>                   pgalloc_dma   13743011   14096463     353452       2.00
>               pgalloc_movable          0          0          0       0.00
>                pgalloc_normal   18742440   16802065   -1940375     -11.00
>                 pgalloc_total   32485451   30898528   -1586923      -5.00
>                  pgdeactivate    4262210    2930670   -1331540     -32.00
>                       pgfault   30812334   31085065     272731       0.00
>                        pgfree   33553970   31765164   -1788806      -6.00
>                  pginodesteal      33411      15084     -18327     -55.00
>                   pglazyfreed          0          0          0       0.00
>                    pgmajfault     551312    1508299     956987     173.00
>                pgmigrate_fail      43927      29330     -14597     -34.00
>             pgmigrate_success    1399851    1203922    -195929     -14.00
>                        pgpgin   24141776   19032156   -5109620     -22.00
>                       pgpgout     959344    1103316     143972      15.00
>                  pgpgoutclean    4639732    3765868    -873864     -19.00
>                      pgrefill    4884560    3006938   -1877622     -39.00
>                     pgrotated      37828      25897     -11931     -32.00
>                 pgscan_direct    1456037     957567    -498470     -35.00
>        pgscan_direct_throttle          0          0          0       0.00
>                 pgscan_kswapd    6667767    5047360   -1620407     -25.00
>                  pgscan_total    8123804    6004927   -2118877     -27.00
>                    pgskip_dma          0          0          0       0.00
>                pgskip_movable          0          0          0       0.00
>                 pgskip_normal      14907      25382      10475      70.00
>                  pgskip_total      14907      25382      10475      70.00
>                pgsteal_direct    1118986     690215    -428771     -39.00
>                pgsteal_kswapd    4750223    3657107   -1093116     -24.00
>                 pgsteal_total    5869209    4347322   -1521887     -26.00
>                        pswpin     417613    1392647     975034     233.00
>                       pswpout    1274224    2661731    1387507     108.00
>                 slabs_scanned   13686905   10807200   -2879705     -22.00
>           workingset_activate     668966     569444     -99522     -15.00
>        workingset_nodereclaim      38957      32621      -6336     -17.00
>            workingset_refault    2816795    2179782    -637013     -23.00
>            workingset_restore     294320     168601    -125719     -43.00
> 
> pgmajfault is increased by 173% because swapin is increased by 200% by
> process_madvise hint. However, swap read based on zram is much cheaper
> than file IO in performance point of view and app hot start by swapin is
> also cheaper than cold start from the beginning of app which needs many IO
> from storage and initialization steps.
> 
> This patchset is against on next-20190517.
> 
> Minchan Kim (7):
>   mm: introduce MADV_COOL
>   mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM
>   mm: introduce MADV_COLD
>   mm: factor out madvise's core functionality
>   mm: introduce external memory hinting API
>   mm: extend process_madvise syscall to support vector arrary
>   mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
> 
>  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>  include/linux/page-flags.h             |   1 +
>  include/linux/page_idle.h              |  15 +
>  include/linux/proc_fs.h                |   1 +
>  include/linux/swap.h                   |   2 +
>  include/linux/syscalls.h               |   2 +
>  include/uapi/asm-generic/mman-common.h |  12 +
>  include/uapi/asm-generic/unistd.h      |   2 +
>  kernel/signal.c                        |   2 +-
>  kernel/sys_ni.c                        |   1 +
>  mm/madvise.c                           | 600 +++++++++++++++++++++----
>  mm/swap.c                              |  43 ++
>  mm/vmscan.c                            |  80 +++-
>  14 files changed, 680 insertions(+), 83 deletions(-)
> 
> -- 
> 2.21.0.1020.gf2820cf01a-goog
> 

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 4/7] mm: factor out madvise's core functionality
  2019-05-20  3:52 ` [RFC 4/7] mm: factor out madvise's core functionality Minchan Kim
@ 2019-05-20 14:26   ` Oleksandr Natalenko
  2019-05-21  1:26     ` Minchan Kim
  0 siblings, 1 reply; 138+ messages in thread
From: Oleksandr Natalenko @ 2019-05-20 14:26 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Michal Hocko, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon

Hi.

On Mon, May 20, 2019 at 12:52:51PM +0900, Minchan Kim wrote:
> This patch factor out madvise's core functionality so that upcoming
> patch can reuse it without duplication.
> 
> It shouldn't change any behavior.
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  mm/madvise.c | 168 +++++++++++++++++++++++++++------------------------
>  1 file changed, 89 insertions(+), 79 deletions(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 9a6698b56845..119e82e1f065 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -742,7 +742,8 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
>  	return 0;
>  }
>  
> -static long madvise_dontneed_free(struct vm_area_struct *vma,
> +static long madvise_dontneed_free(struct task_struct *tsk,
> +				  struct vm_area_struct *vma,
>  				  struct vm_area_struct **prev,
>  				  unsigned long start, unsigned long end,
>  				  int behavior)
> @@ -754,8 +755,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
>  	if (!userfaultfd_remove(vma, start, end)) {
>  		*prev = NULL; /* mmap_sem has been dropped, prev is stale */
>  
> -		down_read(&current->mm->mmap_sem);
> -		vma = find_vma(current->mm, start);
> +		down_read(&tsk->mm->mmap_sem);
> +		vma = find_vma(tsk->mm, start);
>  		if (!vma)
>  			return -ENOMEM;
>  		if (start < vma->vm_start) {
> @@ -802,7 +803,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
>   * Application wants to free up the pages and associated backing store.
>   * This is effectively punching a hole into the middle of a file.
>   */
> -static long madvise_remove(struct vm_area_struct *vma,
> +static long madvise_remove(struct task_struct *tsk,
> +				struct vm_area_struct *vma,
>  				struct vm_area_struct **prev,
>  				unsigned long start, unsigned long end)
>  {
> @@ -836,13 +838,13 @@ static long madvise_remove(struct vm_area_struct *vma,
>  	get_file(f);
>  	if (userfaultfd_remove(vma, start, end)) {
>  		/* mmap_sem was not released by userfaultfd_remove() */
> -		up_read(&current->mm->mmap_sem);
> +		up_read(&tsk->mm->mmap_sem);
>  	}
>  	error = vfs_fallocate(f,
>  				FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
>  				offset, end - start);
>  	fput(f);
> -	down_read(&current->mm->mmap_sem);
> +	down_read(&tsk->mm->mmap_sem);
>  	return error;
>  }
>  
> @@ -916,12 +918,13 @@ static int madvise_inject_error(int behavior,
>  #endif

What about madvise_inject_error() and get_user_pages_fast() in it
please?

>  
>  static long
> -madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
> -		unsigned long start, unsigned long end, int behavior)
> +madvise_vma(struct task_struct *tsk, struct vm_area_struct *vma,
> +		struct vm_area_struct **prev, unsigned long start,
> +		unsigned long end, int behavior)
>  {
>  	switch (behavior) {
>  	case MADV_REMOVE:
> -		return madvise_remove(vma, prev, start, end);
> +		return madvise_remove(tsk, vma, prev, start, end);
>  	case MADV_WILLNEED:
>  		return madvise_willneed(vma, prev, start, end);
>  	case MADV_COOL:
> @@ -930,7 +933,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
>  		return madvise_cold(vma, start, end);
>  	case MADV_FREE:
>  	case MADV_DONTNEED:
> -		return madvise_dontneed_free(vma, prev, start, end, behavior);
> +		return madvise_dontneed_free(tsk, vma, prev, start,
> +						end, behavior);
>  	default:
>  		return madvise_behavior(vma, prev, start, end, behavior);
>  	}
> @@ -974,68 +978,8 @@ madvise_behavior_valid(int behavior)
>  	}
>  }
>  
> -/*
> - * The madvise(2) system call.
> - *
> - * Applications can use madvise() to advise the kernel how it should
> - * handle paging I/O in this VM area.  The idea is to help the kernel
> - * use appropriate read-ahead and caching techniques.  The information
> - * provided is advisory only, and can be safely disregarded by the
> - * kernel without affecting the correct operation of the application.
> - *
> - * behavior values:
> - *  MADV_NORMAL - the default behavior is to read clusters.  This
> - *		results in some read-ahead and read-behind.
> - *  MADV_RANDOM - the system should read the minimum amount of data
> - *		on any access, since it is unlikely that the appli-
> - *		cation will need more than what it asks for.
> - *  MADV_SEQUENTIAL - pages in the given range will probably be accessed
> - *		once, so they can be aggressively read ahead, and
> - *		can be freed soon after they are accessed.
> - *  MADV_WILLNEED - the application is notifying the system to read
> - *		some pages ahead.
> - *  MADV_DONTNEED - the application is finished with the given range,
> - *		so the kernel can free resources associated with it.
> - *  MADV_FREE - the application marks pages in the given range as lazy free,
> - *		where actual purges are postponed until memory pressure happens.
> - *  MADV_REMOVE - the application wants to free up the given range of
> - *		pages and associated backing store.
> - *  MADV_DONTFORK - omit this area from child's address space when forking:
> - *		typically, to avoid COWing pages pinned by get_user_pages().
> - *  MADV_DOFORK - cancel MADV_DONTFORK: no longer omit this area when forking.
> - *  MADV_WIPEONFORK - present the child process with zero-filled memory in this
> - *              range after a fork.
> - *  MADV_KEEPONFORK - undo the effect of MADV_WIPEONFORK
> - *  MADV_HWPOISON - trigger memory error handler as if the given memory range
> - *		were corrupted by unrecoverable hardware memory failure.
> - *  MADV_SOFT_OFFLINE - try to soft-offline the given range of memory.
> - *  MADV_MERGEABLE - the application recommends that KSM try to merge pages in
> - *		this area with pages of identical content from other such areas.
> - *  MADV_UNMERGEABLE- cancel MADV_MERGEABLE: no longer merge pages with others.
> - *  MADV_HUGEPAGE - the application wants to back the given range by transparent
> - *		huge pages in the future. Existing pages might be coalesced and
> - *		new pages might be allocated as THP.
> - *  MADV_NOHUGEPAGE - mark the given range as not worth being backed by
> - *		transparent huge pages so the existing pages will not be
> - *		coalesced into THP and new pages will not be allocated as THP.
> - *  MADV_DONTDUMP - the application wants to prevent pages in the given range
> - *		from being included in its core dump.
> - *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> - *
> - * return values:
> - *  zero    - success
> - *  -EINVAL - start + len < 0, start is not page-aligned,
> - *		"behavior" is not a valid value, or application
> - *		is attempting to release locked or shared pages,
> - *		or the specified address range includes file, Huge TLB,
> - *		MAP_SHARED or VMPFNMAP range.
> - *  -ENOMEM - addresses in the specified range are not currently
> - *		mapped, or are outside the AS of the process.
> - *  -EIO    - an I/O error occurred while paging in data.
> - *  -EBADF  - map exists, but area maps something that isn't a file.
> - *  -EAGAIN - a kernel resource was temporarily unavailable.
> - */
> -SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> +static int madvise_core(struct task_struct *tsk, unsigned long start,
> +			size_t len_in, int behavior)
>  {
>  	unsigned long end, tmp;
>  	struct vm_area_struct *vma, *prev;
> @@ -1071,10 +1015,10 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
>  
>  	write = madvise_need_mmap_write(behavior);
>  	if (write) {
> -		if (down_write_killable(&current->mm->mmap_sem))
> +		if (down_write_killable(&tsk->mm->mmap_sem))
>  			return -EINTR;
>  	} else {
> -		down_read(&current->mm->mmap_sem);
> +		down_read(&tsk->mm->mmap_sem);
>  	}
>  
>  	/*
> @@ -1082,7 +1026,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
>  	 * ranges, just ignore them, but return -ENOMEM at the end.
>  	 * - different from the way of handling in mlock etc.
>  	 */
> -	vma = find_vma_prev(current->mm, start, &prev);
> +	vma = find_vma_prev(tsk->mm, start, &prev);
>  	if (vma && start > vma->vm_start)
>  		prev = vma;
>  
> @@ -1107,7 +1051,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
>  			tmp = end;
>  
>  		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
> -		error = madvise_vma(vma, &prev, start, tmp, behavior);
> +		error = madvise_vma(tsk, vma, &prev, start, tmp, behavior);
>  		if (error)
>  			goto out;
>  		start = tmp;
> @@ -1119,14 +1063,80 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
>  		if (prev)
>  			vma = prev->vm_next;
>  		else	/* madvise_remove dropped mmap_sem */
> -			vma = find_vma(current->mm, start);
> +			vma = find_vma(tsk->mm, start);
>  	}
>  out:
>  	blk_finish_plug(&plug);
>  	if (write)
> -		up_write(&current->mm->mmap_sem);
> +		up_write(&tsk->mm->mmap_sem);
>  	else
> -		up_read(&current->mm->mmap_sem);
> +		up_read(&tsk->mm->mmap_sem);
>  
>  	return error;
>  }
> +
> +/*
> + * The madvise(2) system call.
> + *
> + * Applications can use madvise() to advise the kernel how it should
> + * handle paging I/O in this VM area.  The idea is to help the kernel
> + * use appropriate read-ahead and caching techniques.  The information
> + * provided is advisory only, and can be safely disregarded by the
> + * kernel without affecting the correct operation of the application.
> + *
> + * behavior values:
> + *  MADV_NORMAL - the default behavior is to read clusters.  This
> + *		results in some read-ahead and read-behind.
> + *  MADV_RANDOM - the system should read the minimum amount of data
> + *		on any access, since it is unlikely that the appli-
> + *		cation will need more than what it asks for.
> + *  MADV_SEQUENTIAL - pages in the given range will probably be accessed
> + *		once, so they can be aggressively read ahead, and
> + *		can be freed soon after they are accessed.
> + *  MADV_WILLNEED - the application is notifying the system to read
> + *		some pages ahead.
> + *  MADV_DONTNEED - the application is finished with the given range,
> + *		so the kernel can free resources associated with it.
> + *  MADV_FREE - the application marks pages in the given range as lazy free,
> + *		where actual purges are postponed until memory pressure happens.
> + *  MADV_REMOVE - the application wants to free up the given range of
> + *		pages and associated backing store.
> + *  MADV_DONTFORK - omit this area from child's address space when forking:
> + *		typically, to avoid COWing pages pinned by get_user_pages().
> + *  MADV_DOFORK - cancel MADV_DONTFORK: no longer omit this area when forking.
> + *  MADV_WIPEONFORK - present the child process with zero-filled memory in this
> + *              range after a fork.
> + *  MADV_KEEPONFORK - undo the effect of MADV_WIPEONFORK
> + *  MADV_HWPOISON - trigger memory error handler as if the given memory range
> + *		were corrupted by unrecoverable hardware memory failure.
> + *  MADV_SOFT_OFFLINE - try to soft-offline the given range of memory.
> + *  MADV_MERGEABLE - the application recommends that KSM try to merge pages in
> + *		this area with pages of identical content from other such areas.
> + *  MADV_UNMERGEABLE- cancel MADV_MERGEABLE: no longer merge pages with others.
> + *  MADV_HUGEPAGE - the application wants to back the given range by transparent
> + *		huge pages in the future. Existing pages might be coalesced and
> + *		new pages might be allocated as THP.
> + *  MADV_NOHUGEPAGE - mark the given range as not worth being backed by
> + *		transparent huge pages so the existing pages will not be
> + *		coalesced into THP and new pages will not be allocated as THP.
> + *  MADV_DONTDUMP - the application wants to prevent pages in the given range
> + *		from being included in its core dump.
> + *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> + *
> + * return values:
> + *  zero    - success
> + *  -EINVAL - start + len < 0, start is not page-aligned,
> + *		"behavior" is not a valid value, or application
> + *		is attempting to release locked or shared pages,
> + *		or the specified address range includes file, Huge TLB,
> + *		MAP_SHARED or VMPFNMAP range.
> + *  -ENOMEM - addresses in the specified range are not currently
> + *		mapped, or are outside the AS of the process.
> + *  -EIO    - an I/O error occurred while paging in data.
> + *  -EBADF  - map exists, but area maps something that isn't a file.
> + *  -EAGAIN - a kernel resource was temporarily unavailable.
> + */
> +SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> +{
> +	return madvise_core(current, start, len_in, behavior);
> +}
> -- 
> 2.21.0.1020.gf2820cf01a-goog
> 

-- 
  Best regards,
    Oleksandr Natalenko (post-factum)
    Senior Software Maintenance Engineer


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 0/7] introduce memory hinting API for external process
  2019-05-20  3:52 [RFC 0/7] introduce memory hinting API for external process Minchan Kim
                   ` (8 preceding siblings ...)
  2019-05-20  9:28 ` Michal Hocko
@ 2019-05-20 14:42 ` Oleksandr Natalenko
  2019-05-21  2:56   ` Minchan Kim
  2019-05-20 16:46 ` Johannes Weiner
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 138+ messages in thread
From: Oleksandr Natalenko @ 2019-05-20 14:42 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Michal Hocko, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon

Hi.

On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote:
> - Background
> 
> The Android terminology used for forking a new process and starting an app
> from scratch is a cold start, while resuming an existing app is a hot start.
> While we continually try to improve the performance of cold starts, hot
> starts will always be significantly less power hungry as well as faster so
> we are trying to make hot start more likely than cold start.
> 
> To increase hot start, Android userspace manages the order that apps should
> be killed in a process called ActivityManagerService. ActivityManagerService
> tracks every Android app or service that the user could be interacting with
> at any time and translates that into a ranked list for lmkd(low memory
> killer daemon). They are likely to be killed by lmkd if the system has to
> reclaim memory. In that sense they are similar to entries in any other cache.
> Those apps are kept alive for opportunistic performance improvements but
> those performance improvements will vary based on the memory requirements of
> individual workloads.
> 
> - Problem
> 
> Naturally, cached apps were dominant consumers of memory on the system.
> However, they were not significant consumers of swap even though they are
> good candidate for swap. Under investigation, swapping out only begins
> once the low zone watermark is hit and kswapd wakes up, but the overall
> allocation rate in the system might trip lmkd thresholds and cause a cached
> process to be killed(we measured performance swapping out vs. zapping the
> memory by killing a process. Unsurprisingly, zapping is 10x times faster
> even though we use zram which is much faster than real storage) so kill
> from lmkd will often satisfy the high zone watermark, resulting in very
> few pages actually being moved to swap.
> 
> - Approach
> 
> The approach we chose was to use a new interface to allow userspace to
> proactively reclaim entire processes by leveraging platform information.
> This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
> that are known to be cold from userspace and to avoid races with lmkd
> by reclaiming apps as soon as they entered the cached state. Additionally,
> it could provide many chances for platform to use much information to
> optimize memory efficiency.
> 
> IMHO we should spell it out that this patchset complements MADV_WONTNEED
> and MADV_FREE by adding non-destructive ways to gain some free memory
> space. MADV_COLD is similar to MADV_WONTNEED in a way that it hints the
> kernel that memory region is not currently needed and should be reclaimed
> immediately; MADV_COOL is similar to MADV_FREE in a way that it hints the
> kernel that memory region is not currently needed and should be reclaimed
> when memory pressure rises.
> 
> To achieve the goal, the patchset introduce two new options for madvise.
> One is MADV_COOL which will deactive activated pages and the other is
> MADV_COLD which will reclaim private pages instantly. These new options
> complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to
> gain some free memory space. MADV_COLD is similar to MADV_DONTNEED in a way
> that it hints the kernel that memory region is not currently needed and
> should be reclaimed immediately; MADV_COOL is similar to MADV_FREE in a way
> that it hints the kernel that memory region is not currently needed and
> should be reclaimed when memory pressure rises.
> 
> This approach is similar in spirit to madvise(MADV_WONTNEED), but the
> information required to make the reclaim decision is not known to the app.
> Instead, it is known to a centralized userspace daemon, and that daemon
> must be able to initiate reclaim on its own without any app involvement.
> To solve the concern, this patch introduces new syscall -
> 
> 	struct pr_madvise_param {
> 		int size;
> 		const struct iovec *vec;
> 	}
> 
> 	int process_madvise(int pidfd, ssize_t nr_elem, int *behavior,
> 				struct pr_madvise_param *restuls,
> 				struct pr_madvise_param *ranges,
> 				unsigned long flags);
> 
> The syscall get pidfd to give hints to external process and provides
> pair of result/ranges vector arguments so that it could give several
> hints to each address range all at once.
> 
> I guess others have different ideas about the naming of syscall and options
> so feel free to suggest better naming.
> 
> - Experiment
> 
> We did bunch of testing with several hundreds of real users, not artificial
> benchmark on android. We saw about 17% cold start decreasement without any
> significant battery/app startup latency issues. And with artificial benchmark
> which launches and switching apps, we saw average 7% app launching improvement,
> 18% less lmkd kill and good stat from vmstat.
> 
> A is vanilla and B is process_madvise.
> 
> 
>                                        A          B      delta   ratio(%)
>                allocstall_dma          0          0          0       0.00
>            allocstall_movable       1464        457      -1007     -69.00
>             allocstall_normal     263210     190763     -72447     -28.00
>              allocstall_total     264674     191220     -73454     -28.00
>           compact_daemon_wake      26912      25294      -1618      -7.00
>                  compact_fail      17885      14151      -3734     -21.00
>          compact_free_scanned 4204766409 3835994922 -368771487      -9.00
>              compact_isolated    3446484    2967618    -478866     -14.00
>       compact_migrate_scanned 1621336411 1324695710 -296640701     -19.00
>                 compact_stall      19387      15343      -4044     -21.00
>               compact_success       1502       1192       -310     -21.00
> kswapd_high_wmark_hit_quickly        234        184        -50     -22.00
>             kswapd_inodesteal     221635     233093      11458       5.00
>  kswapd_low_wmark_hit_quickly      66065      54009     -12056     -19.00
>                    nr_dirtied     259934     296476      36542      14.00
>   nr_vmscan_immediate_reclaim       2587       2356       -231      -9.00
>               nr_vmscan_write    1274232    2661733    1387501     108.00
>                    nr_written    1514060    2937560    1423500      94.00
>                    pageoutrun      67561      55133     -12428     -19.00
>                    pgactivate    2335060    1984882    -350178     -15.00
>                   pgalloc_dma   13743011   14096463     353452       2.00
>               pgalloc_movable          0          0          0       0.00
>                pgalloc_normal   18742440   16802065   -1940375     -11.00
>                 pgalloc_total   32485451   30898528   -1586923      -5.00
>                  pgdeactivate    4262210    2930670   -1331540     -32.00
>                       pgfault   30812334   31085065     272731       0.00
>                        pgfree   33553970   31765164   -1788806      -6.00
>                  pginodesteal      33411      15084     -18327     -55.00
>                   pglazyfreed          0          0          0       0.00
>                    pgmajfault     551312    1508299     956987     173.00
>                pgmigrate_fail      43927      29330     -14597     -34.00
>             pgmigrate_success    1399851    1203922    -195929     -14.00
>                        pgpgin   24141776   19032156   -5109620     -22.00
>                       pgpgout     959344    1103316     143972      15.00
>                  pgpgoutclean    4639732    3765868    -873864     -19.00
>                      pgrefill    4884560    3006938   -1877622     -39.00
>                     pgrotated      37828      25897     -11931     -32.00
>                 pgscan_direct    1456037     957567    -498470     -35.00
>        pgscan_direct_throttle          0          0          0       0.00
>                 pgscan_kswapd    6667767    5047360   -1620407     -25.00
>                  pgscan_total    8123804    6004927   -2118877     -27.00
>                    pgskip_dma          0          0          0       0.00
>                pgskip_movable          0          0          0       0.00
>                 pgskip_normal      14907      25382      10475      70.00
>                  pgskip_total      14907      25382      10475      70.00
>                pgsteal_direct    1118986     690215    -428771     -39.00
>                pgsteal_kswapd    4750223    3657107   -1093116     -24.00
>                 pgsteal_total    5869209    4347322   -1521887     -26.00
>                        pswpin     417613    1392647     975034     233.00
>                       pswpout    1274224    2661731    1387507     108.00
>                 slabs_scanned   13686905   10807200   -2879705     -22.00
>           workingset_activate     668966     569444     -99522     -15.00
>        workingset_nodereclaim      38957      32621      -6336     -17.00
>            workingset_refault    2816795    2179782    -637013     -23.00
>            workingset_restore     294320     168601    -125719     -43.00
> 
> pgmajfault is increased by 173% because swapin is increased by 200% by
> process_madvise hint. However, swap read based on zram is much cheaper
> than file IO in performance point of view and app hot start by swapin is
> also cheaper than cold start from the beginning of app which needs many IO
> from storage and initialization steps.
> 
> This patchset is against on next-20190517.
> 
> Minchan Kim (7):
>   mm: introduce MADV_COOL
>   mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM
>   mm: introduce MADV_COLD
>   mm: factor out madvise's core functionality
>   mm: introduce external memory hinting API
>   mm: extend process_madvise syscall to support vector arrary
>   mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
> 
>  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>  include/linux/page-flags.h             |   1 +
>  include/linux/page_idle.h              |  15 +
>  include/linux/proc_fs.h                |   1 +
>  include/linux/swap.h                   |   2 +
>  include/linux/syscalls.h               |   2 +
>  include/uapi/asm-generic/mman-common.h |  12 +
>  include/uapi/asm-generic/unistd.h      |   2 +
>  kernel/signal.c                        |   2 +-
>  kernel/sys_ni.c                        |   1 +
>  mm/madvise.c                           | 600 +++++++++++++++++++++----
>  mm/swap.c                              |  43 ++
>  mm/vmscan.c                            |  80 +++-
>  14 files changed, 680 insertions(+), 83 deletions(-)
> 
> -- 
> 2.21.0.1020.gf2820cf01a-goog
> 

Please Cc me for the next iteration since I was working on the very same
thing recently [1].

Thank you.

[1] https://gitlab.com/post-factum/pf-kernel/commits/remote-madvise-v3

-- 
  Best regards,
    Oleksandr Natalenko (post-factum)
    Senior Software Maintenance Engineer


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 1/7] mm: introduce MADV_COOL
  2019-05-20  8:19     ` Michal Hocko
@ 2019-05-20 15:08       ` Suren Baghdasaryan
  2019-05-20 22:55       ` Minchan Kim
  1 sibling, 0 replies; 138+ messages in thread
From: Suren Baghdasaryan @ 2019-05-20 15:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Daniel Colascione, Shakeel Butt,
	Sonny Rao, Brian Geffon, linux-api

On Mon, May 20, 2019 at 1:19 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Mon 20-05-19 10:16:21, Michal Hocko wrote:
> > [CC linux-api]
> >
> > On Mon 20-05-19 12:52:48, Minchan Kim wrote:
> > > When a process expects no accesses to a certain memory range
> > > it could hint kernel that the pages can be reclaimed
> > > when memory pressure happens but data should be preserved
> > > for future use.  This could reduce workingset eviction so it
> > > ends up increasing performance.
> > >
> > > This patch introduces the new MADV_COOL hint to madvise(2)
> > > syscall. MADV_COOL can be used by a process to mark a memory range
> > > as not expected to be used in the near future. The hint can help
> > > kernel in deciding which pages to evict early during memory
> > > pressure.
> >
> > I do not want to start naming fight but MADV_COOL sounds a bit
> > misleading. Everybody thinks his pages are cool ;). Probably MADV_COLD
> > or MADV_DONTNEED_PRESERVE.
>
> OK, I can see that you have used MADV_COLD for a different mode.
> So this one is effectively a non destructive MADV_FREE alternative
> so MADV_FREE_PRESERVE would sound like a good fit. Your MADV_COLD
> in other patch would then be MADV_DONTNEED_PRESERVE. Right?
>

I agree that naming them this way would be more in-line with the
existing API. Another good option IMO could be MADV_RECLAIM_NOW /
MADV_RECLAIM_LAZY which might explain a bit better what they do but
Michal's proposal is more consistent with the current API.

> --
> Michal Hocko
> SUSE Labs


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 0/7] introduce memory hinting API for external process
  2019-05-20  3:52 [RFC 0/7] introduce memory hinting API for external process Minchan Kim
                   ` (9 preceding siblings ...)
  2019-05-20 14:42 ` Oleksandr Natalenko
@ 2019-05-20 16:46 ` Johannes Weiner
  2019-05-21  4:39   ` Minchan Kim
  2019-05-21  1:44 ` Matthew Wilcox
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 138+ messages in thread
From: Johannes Weiner @ 2019-05-20 16:46 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Michal Hocko, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon

On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote:
> - Approach
> 
> The approach we chose was to use a new interface to allow userspace to
> proactively reclaim entire processes by leveraging platform information.
> This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
> that are known to be cold from userspace and to avoid races with lmkd
> by reclaiming apps as soon as they entered the cached state. Additionally,
> it could provide many chances for platform to use much information to
> optimize memory efficiency.
> 
> IMHO we should spell it out that this patchset complements MADV_WONTNEED
> and MADV_FREE by adding non-destructive ways to gain some free memory
> space. MADV_COLD is similar to MADV_WONTNEED in a way that it hints the
> kernel that memory region is not currently needed and should be reclaimed
> immediately; MADV_COOL is similar to MADV_FREE in a way that it hints the
> kernel that memory region is not currently needed and should be reclaimed
> when memory pressure rises.

I agree with this approach and the semantics. But these names are very
vague and extremely easy to confuse since they're so similar.

MADV_COLD could be a good name, but for deactivating pages, not
reclaiming them - marking memory "cold" on the LRU for later reclaim.

For the immediate reclaim one, I think there is a better option too:
In virtual memory speak, putting a page into secondary storage (or
ensuring it's already there), and then freeing its in-memory copy, is
called "paging out". And that's what this flag is supposed to do. So
how about MADV_PAGEOUT?

With that, we'd have:

MADV_FREE: Mark data invalid, free memory when needed
MADV_DONTNEED: Mark data invalid, free memory immediately

MADV_COLD: Data is not used for a while, free memory when needed
MADV_PAGEOUT: Data is not used for a while, free memory immediately

What do you think?


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 2/7] mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM
  2019-05-20  3:52 ` [RFC 2/7] mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM Minchan Kim
@ 2019-05-20 16:50   ` Johannes Weiner
  2019-05-20 22:57     ` Minchan Kim
  0 siblings, 1 reply; 138+ messages in thread
From: Johannes Weiner @ 2019-05-20 16:50 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Michal Hocko, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon

On Mon, May 20, 2019 at 12:52:49PM +0900, Minchan Kim wrote:
> The local variable references in shrink_page_list is PAGEREF_RECLAIM_CLEAN
> as default. It is for preventing to reclaim dirty pages when CMA try to
> migrate pages. Strictly speaking, we don't need it because CMA didn't allow
> to write out by .may_writepage = 0 in reclaim_clean_pages_from_list.
>
> Moreover, it has a problem to prevent anonymous pages's swap out even
> though force_reclaim = true in shrink_page_list on upcoming patch.
> So this patch makes references's default value to PAGEREF_RECLAIM and
> rename force_reclaim with skip_reference_check to make it more clear.
> 
> This is a preparatory work for next patch.
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>

Looks good to me, just one nit below.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

> ---
>  mm/vmscan.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index d9c3e873eca6..a28e5d17b495 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1102,7 +1102,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  				      struct scan_control *sc,
>  				      enum ttu_flags ttu_flags,
>  				      struct reclaim_stat *stat,
> -				      bool force_reclaim)
> +				      bool skip_reference_check)

"ignore_references" would be better IMO


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 0/7] introduce memory hinting API for external process
  2019-05-20  6:37 ` [RFC 0/7] introduce memory hinting API for external process Anshuman Khandual
@ 2019-05-20 16:59   ` Tim Murray
  2019-05-21  2:55     ` Anshuman Khandual
  0 siblings, 1 reply; 138+ messages in thread
From: Tim Murray @ 2019-05-20 16:59 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Michal Hocko,
	Johannes Weiner, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon

On Sun, May 19, 2019 at 11:37 PM Anshuman Khandual
<anshuman.khandual@arm.com> wrote:
>
> Or Is the objective here is reduce the number of processes which get killed by
> lmkd by triggering swapping for the unused memory (user hinted) sooner so that
> they dont get picked by lmkd. Under utilization for zram hardware is a concern
> here as well ?

The objective is to avoid some instances of memory pressure by
proactively swapping pages that userspace knows to be cold before
those pages reach the end of the LRUs, which in turn can prevent some
apps from being killed by lmk/lmkd. As soon as Android userspace knows
that an application is not being used and is only resident to improve
performance if the user returns to that app, we can kick off
process_madvise on that process's pages (or some portion of those
pages) in a power-efficient way to reduce memory pressure long before
the system hits the free page watermark. This allows the system more
time to put pages into zram versus waiting for the watermark to
trigger kswapd, which decreases the likelihood that later memory
allocations will cause enough pressure to trigger a kill of one of
these apps.

> Swapping out memory into zram wont increase the latency for a hot start ? Or
> is it because as it will prevent a fresh cold start which anyway will be slower
> than a slow hot start. Just being curious.

First, not all swapped pages will be reloaded immediately once an app
is resumed. We've found that an app's working set post-process_madvise
is significantly smaller than what an app allocates when it first
launches (see the delta between pswpin and pswpout in Minchan's
results). Presumably because of this, faulting to fetch from zram does
not seem to introduce a noticeable hot start penalty, not does it
cause an increase in performance problems later in the app's
lifecycle. I've measured with and without process_madvise, and the
differences are within our noise bounds. Second, because we're not
preemptively evicting file pages and only making them more likely to
be evicted when there's already memory pressure, we avoid the case
where we process_madvise an app then immediately return to the app and
reload all file pages in the working set even though there was no
intervening memory pressure. Our initial version of this work evicted
file pages preemptively and did cause a noticeable slowdown (~15%) for
that case; this patch set avoids that slowdown. Finally, the benefit
from avoiding cold starts is huge. The performance improvement from
having a hot start instead of a cold start ranges from 3x for very
small apps to 50x+ for larger apps like high-fidelity games.


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 1/7] mm: introduce MADV_COOL
  2019-05-20  8:16   ` Michal Hocko
  2019-05-20  8:19     ` Michal Hocko
@ 2019-05-20 22:54     ` Minchan Kim
  2019-05-21  6:04       ` Michal Hocko
  1 sibling, 1 reply; 138+ messages in thread
From: Minchan Kim @ 2019-05-20 22:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Mon, May 20, 2019 at 10:16:21AM +0200, Michal Hocko wrote:
> [CC linux-api]

Thanks, Michal. I forgot to add it.

> 
> On Mon 20-05-19 12:52:48, Minchan Kim wrote:
> > When a process expects no accesses to a certain memory range
> > it could hint kernel that the pages can be reclaimed
> > when memory pressure happens but data should be preserved
> > for future use.  This could reduce workingset eviction so it
> > ends up increasing performance.
> > 
> > This patch introduces the new MADV_COOL hint to madvise(2)
> > syscall. MADV_COOL can be used by a process to mark a memory range
> > as not expected to be used in the near future. The hint can help
> > kernel in deciding which pages to evict early during memory
> > pressure.
> 
> I do not want to start naming fight but MADV_COOL sounds a bit
> misleading. Everybody thinks his pages are cool ;). Probably MADV_COLD
> or MADV_DONTNEED_PRESERVE.

Thanks for the suggestion. Since I got several suggestions, Let's discuss
them all at once in cover-letter.

> 
> > Internally, it works via deactivating memory from active list to
> > inactive's head so when the memory pressure happens, they will be
> > reclaimed earlier than other active pages unless there is no
> > access until the time.
> 
> Could you elaborate about the decision to move to the head rather than
> tail? What should happen to inactive pages? Should we move them to the
> tail? Your implementation seems to ignore those completely. Why?

Normally, inactive LRU could have used-once pages without any mapping
to user's address space. Such pages would be better candicate to
reclaim when the memory pressure happens. With deactivating only
active LRU pages of the process to the head of inactive LRU, we will
keep them in RAM longer than used-once pages and could have more chance
to be activated once the process is resumed.

> 
> What should happen for shared pages? In other words do we want to allow
> less privileged process to control evicting of shared pages with a more
> privileged one? E.g. think of all sorts of side channel attacks. Maybe
> we want to do the same thing as for mincore where write access is
> required.

It doesn't work with shared pages(ie, page_mapcount > 1). I will add it
in the description.

> -- 
> Michal Hocko
> SUSE Labs


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 1/7] mm: introduce MADV_COOL
  2019-05-20  8:19     ` Michal Hocko
  2019-05-20 15:08       ` Suren Baghdasaryan
@ 2019-05-20 22:55       ` Minchan Kim
  1 sibling, 0 replies; 138+ messages in thread
From: Minchan Kim @ 2019-05-20 22:55 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Mon, May 20, 2019 at 10:19:43AM +0200, Michal Hocko wrote:
> On Mon 20-05-19 10:16:21, Michal Hocko wrote:
> > [CC linux-api]
> > 
> > On Mon 20-05-19 12:52:48, Minchan Kim wrote:
> > > When a process expects no accesses to a certain memory range
> > > it could hint kernel that the pages can be reclaimed
> > > when memory pressure happens but data should be preserved
> > > for future use.  This could reduce workingset eviction so it
> > > ends up increasing performance.
> > > 
> > > This patch introduces the new MADV_COOL hint to madvise(2)
> > > syscall. MADV_COOL can be used by a process to mark a memory range
> > > as not expected to be used in the near future. The hint can help
> > > kernel in deciding which pages to evict early during memory
> > > pressure.
> > 
> > I do not want to start naming fight but MADV_COOL sounds a bit
> > misleading. Everybody thinks his pages are cool ;). Probably MADV_COLD
> > or MADV_DONTNEED_PRESERVE.
> 
> OK, I can see that you have used MADV_COLD for a different mode.
> So this one is effectively a non destructive MADV_FREE alternative
> so MADV_FREE_PRESERVE would sound like a good fit. Your MADV_COLD
> in other patch would then be MADV_DONTNEED_PRESERVE. Right?

Correct.


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 2/7] mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM
  2019-05-20 16:50   ` Johannes Weiner
@ 2019-05-20 22:57     ` Minchan Kim
  0 siblings, 0 replies; 138+ messages in thread
From: Minchan Kim @ 2019-05-20 22:57 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, LKML, linux-mm, Michal Hocko, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon

On Mon, May 20, 2019 at 12:50:13PM -0400, Johannes Weiner wrote:
> On Mon, May 20, 2019 at 12:52:49PM +0900, Minchan Kim wrote:
> > The local variable references in shrink_page_list is PAGEREF_RECLAIM_CLEAN
> > as default. It is for preventing to reclaim dirty pages when CMA try to
> > migrate pages. Strictly speaking, we don't need it because CMA didn't allow
> > to write out by .may_writepage = 0 in reclaim_clean_pages_from_list.
> >
> > Moreover, it has a problem to prevent anonymous pages's swap out even
> > though force_reclaim = true in shrink_page_list on upcoming patch.
> > So this patch makes references's default value to PAGEREF_RECLAIM and
> > rename force_reclaim with skip_reference_check to make it more clear.
> > 
> > This is a preparatory work for next patch.
> > 
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> 
> Looks good to me, just one nit below.
> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Thanks, Johannes.

> 
> > ---
> >  mm/vmscan.c | 6 +++---
> >  1 file changed, 3 insertions(+), 3 deletions(-)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index d9c3e873eca6..a28e5d17b495 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1102,7 +1102,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> >  				      struct scan_control *sc,
> >  				      enum ttu_flags ttu_flags,
> >  				      struct reclaim_stat *stat,
> > -				      bool force_reclaim)
> > +				      bool skip_reference_check)
> 
> "ignore_references" would be better IMO

Sure.


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 3/7] mm: introduce MADV_COLD
  2019-05-20  8:27   ` Michal Hocko
@ 2019-05-20 23:00     ` Minchan Kim
  2019-05-21  6:08       ` Michal Hocko
  0 siblings, 1 reply; 138+ messages in thread
From: Minchan Kim @ 2019-05-20 23:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Mon, May 20, 2019 at 10:27:03AM +0200, Michal Hocko wrote:
> [Cc linux-api]
> 
> On Mon 20-05-19 12:52:50, Minchan Kim wrote:
> > When a process expects no accesses to a certain memory range
> > for a long time, it could hint kernel that the pages can be
> > reclaimed instantly but data should be preserved for future use.
> > This could reduce workingset eviction so it ends up increasing
> > performance.
> > 
> > This patch introduces the new MADV_COLD hint to madvise(2)
> > syscall. MADV_COLD can be used by a process to mark a memory range
> > as not expected to be used for a long time. The hint can help
> > kernel in deciding which pages to evict proactively.
> 
> As mentioned in other email this looks like a non-destructive
> MADV_DONTNEED alternative.
> 
> > Internally, it works via reclaiming memory in process context
> > the syscall is called. If the page is dirty but backing storage
> > is not synchronous device, the written page will be rotate back
> > into LRU's tail once the write is done so they will reclaim easily
> > when memory pressure happens. If backing storage is
> > synchrnous device(e.g., zram), hte page will be reclaimed instantly.
> 
> Why do we special case async backing storage? Please always try to
> explain _why_ the decision is made.

I didn't make any decesion. ;-) That's how current reclaim works to
avoid latency of freeing page in interrupt context. I had a patchset
to resolve the concern a few years ago but got distracted.

> 
> I haven't checked the implementation yet so I cannot comment on that.
> 
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> >  include/linux/swap.h                   |   1 +
> >  include/uapi/asm-generic/mman-common.h |   1 +
> >  mm/madvise.c                           | 123 +++++++++++++++++++++++++
> >  mm/vmscan.c                            |  74 +++++++++++++++
> >  4 files changed, 199 insertions(+)
> > 
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 64795abea003..7f32a948fc6a 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -365,6 +365,7 @@ extern int vm_swappiness;
> >  extern int remove_mapping(struct address_space *mapping, struct page *page);
> >  extern unsigned long vm_total_pages;
> >  
> > +extern unsigned long reclaim_pages(struct list_head *page_list);
> >  #ifdef CONFIG_NUMA
> >  extern int node_reclaim_mode;
> >  extern int sysctl_min_unmapped_ratio;
> > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> > index f7a4a5d4b642..b9b51eeb8e1a 100644
> > --- a/include/uapi/asm-generic/mman-common.h
> > +++ b/include/uapi/asm-generic/mman-common.h
> > @@ -43,6 +43,7 @@
> >  #define MADV_WILLNEED	3		/* will need these pages */
> >  #define MADV_DONTNEED	4		/* don't need these pages */
> >  #define MADV_COOL	5		/* deactivatie these pages */
> > +#define MADV_COLD	6		/* reclaim these pages */
> >  
> >  /* common parameters: try to keep these consistent across architectures */
> >  #define MADV_FREE	8		/* free pages only if memory pressure */
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index c05817fb570d..9a6698b56845 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -42,6 +42,7 @@ static int madvise_need_mmap_write(int behavior)
> >  	case MADV_WILLNEED:
> >  	case MADV_DONTNEED:
> >  	case MADV_COOL:
> > +	case MADV_COLD:
> >  	case MADV_FREE:
> >  		return 0;
> >  	default:
> > @@ -416,6 +417,125 @@ static long madvise_cool(struct vm_area_struct *vma,
> >  	return 0;
> >  }
> >  
> > +static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
> > +				unsigned long end, struct mm_walk *walk)
> > +{
> > +	pte_t *orig_pte, *pte, ptent;
> > +	spinlock_t *ptl;
> > +	LIST_HEAD(page_list);
> > +	struct page *page;
> > +	int isolated = 0;
> > +	struct vm_area_struct *vma = walk->vma;
> > +	unsigned long next;
> > +
> > +	next = pmd_addr_end(addr, end);
> > +	if (pmd_trans_huge(*pmd)) {
> > +		spinlock_t *ptl;
> > +
> > +		ptl = pmd_trans_huge_lock(pmd, vma);
> > +		if (!ptl)
> > +			return 0;
> > +
> > +		if (is_huge_zero_pmd(*pmd))
> > +			goto huge_unlock;
> > +
> > +		page = pmd_page(*pmd);
> > +		if (page_mapcount(page) > 1)
> > +			goto huge_unlock;
> > +
> > +		if (next - addr != HPAGE_PMD_SIZE) {
> > +			int err;
> > +
> > +			get_page(page);
> > +			spin_unlock(ptl);
> > +			lock_page(page);
> > +			err = split_huge_page(page);
> > +			unlock_page(page);
> > +			put_page(page);
> > +			if (!err)
> > +				goto regular_page;
> > +			return 0;
> > +		}
> > +
> > +		if (isolate_lru_page(page))
> > +			goto huge_unlock;
> > +
> > +		list_add(&page->lru, &page_list);
> > +huge_unlock:
> > +		spin_unlock(ptl);
> > +		reclaim_pages(&page_list);
> > +		return 0;
> > +	}
> > +
> > +	if (pmd_trans_unstable(pmd))
> > +		return 0;
> > +regular_page:
> > +	orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > +	for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {
> > +		ptent = *pte;
> > +		if (!pte_present(ptent))
> > +			continue;
> > +
> > +		page = vm_normal_page(vma, addr, ptent);
> > +		if (!page)
> > +			continue;
> > +
> > +		if (page_mapcount(page) > 1)
> > +			continue;
> > +
> > +		if (isolate_lru_page(page))
> > +			continue;
> > +
> > +		isolated++;
> > +		list_add(&page->lru, &page_list);
> > +		if (isolated >= SWAP_CLUSTER_MAX) {
> > +			pte_unmap_unlock(orig_pte, ptl);
> > +			reclaim_pages(&page_list);
> > +			isolated = 0;
> > +			pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > +			orig_pte = pte;
> > +		}
> > +	}
> > +
> > +	pte_unmap_unlock(orig_pte, ptl);
> > +	reclaim_pages(&page_list);
> > +	cond_resched();
> > +
> > +	return 0;
> > +}
> > +
> > +static void madvise_cold_page_range(struct mmu_gather *tlb,
> > +			     struct vm_area_struct *vma,
> > +			     unsigned long addr, unsigned long end)
> > +{
> > +	struct mm_walk warm_walk = {
> > +		.pmd_entry = madvise_cold_pte_range,
> > +		.mm = vma->vm_mm,
> > +	};
> > +
> > +	tlb_start_vma(tlb, vma);
> > +	walk_page_range(addr, end, &warm_walk);
> > +	tlb_end_vma(tlb, vma);
> > +}
> > +
> > +
> > +static long madvise_cold(struct vm_area_struct *vma,
> > +			unsigned long start_addr, unsigned long end_addr)
> > +{
> > +	struct mm_struct *mm = vma->vm_mm;
> > +	struct mmu_gather tlb;
> > +
> > +	if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
> > +		return -EINVAL;
> > +
> > +	lru_add_drain();
> > +	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
> > +	madvise_cold_page_range(&tlb, vma, start_addr, end_addr);
> > +	tlb_finish_mmu(&tlb, start_addr, end_addr);
> > +
> > +	return 0;
> > +}
> > +
> >  static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> >  				unsigned long end, struct mm_walk *walk)
> >  
> > @@ -806,6 +926,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
> >  		return madvise_willneed(vma, prev, start, end);
> >  	case MADV_COOL:
> >  		return madvise_cool(vma, start, end);
> > +	case MADV_COLD:
> > +		return madvise_cold(vma, start, end);
> >  	case MADV_FREE:
> >  	case MADV_DONTNEED:
> >  		return madvise_dontneed_free(vma, prev, start, end, behavior);
> > @@ -828,6 +950,7 @@ madvise_behavior_valid(int behavior)
> >  	case MADV_DONTNEED:
> >  	case MADV_FREE:
> >  	case MADV_COOL:
> > +	case MADV_COLD:
> >  #ifdef CONFIG_KSM
> >  	case MADV_MERGEABLE:
> >  	case MADV_UNMERGEABLE:
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index a28e5d17b495..1701b31f70a8 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2096,6 +2096,80 @@ static void shrink_active_list(unsigned long nr_to_scan,
> >  			nr_deactivate, nr_rotated, sc->priority, file);
> >  }
> >  
> > +unsigned long reclaim_pages(struct list_head *page_list)
> > +{
> > +	int nid = -1;
> > +	unsigned long nr_isolated[2] = {0, };
> > +	unsigned long nr_reclaimed = 0;
> > +	LIST_HEAD(node_page_list);
> > +	struct reclaim_stat dummy_stat;
> > +	struct scan_control sc = {
> > +		.gfp_mask = GFP_KERNEL,
> > +		.priority = DEF_PRIORITY,
> > +		.may_writepage = 1,
> > +		.may_unmap = 1,
> > +		.may_swap = 1,
> > +	};
> > +
> > +	while (!list_empty(page_list)) {
> > +		struct page *page;
> > +
> > +		page = lru_to_page(page_list);
> > +		list_del(&page->lru);
> > +
> > +		if (nid == -1) {
> > +			nid = page_to_nid(page);
> > +			INIT_LIST_HEAD(&node_page_list);
> > +			nr_isolated[0] = nr_isolated[1] = 0;
> > +		}
> > +
> > +		if (nid == page_to_nid(page)) {
> > +			list_add(&page->lru, &node_page_list);
> > +			nr_isolated[!!page_is_file_cache(page)] +=
> > +						hpage_nr_pages(page);
> > +			continue;
> > +		}
> > +
> > +		nid = page_to_nid(page);
> > +
> > +		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
> > +					nr_isolated[0]);
> > +		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
> > +					nr_isolated[1]);
> > +		nr_reclaimed += shrink_page_list(&node_page_list,
> > +				NODE_DATA(nid), &sc, TTU_IGNORE_ACCESS,
> > +				&dummy_stat, true);
> > +		while (!list_empty(&node_page_list)) {
> > +			struct page *page = lru_to_page(page_list);
> > +
> > +			list_del(&page->lru);
> > +			putback_lru_page(page);
> > +		}
> > +		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
> > +					-nr_isolated[0]);
> > +		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
> > +					-nr_isolated[1]);
> > +		nr_isolated[0] = nr_isolated[1] = 0;
> > +		INIT_LIST_HEAD(&node_page_list);
> > +	}
> > +
> > +	if (!list_empty(&node_page_list)) {
> > +		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
> > +					nr_isolated[0]);
> > +		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
> > +					nr_isolated[1]);
> > +		nr_reclaimed += shrink_page_list(&node_page_list,
> > +				NODE_DATA(nid), &sc, TTU_IGNORE_ACCESS,
> > +				&dummy_stat, true);
> > +		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
> > +					-nr_isolated[0]);
> > +		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
> > +					-nr_isolated[1]);
> > +	}
> > +
> > +	return nr_reclaimed;
> > +}
> > +
> >  /*
> >   * The inactive anon list should be small enough that the VM never has
> >   * to do too much work.
> > -- 
> > 2.21.0.1020.gf2820cf01a-goog
> > 
> 
> -- 
> Michal Hocko
> SUSE Labs


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 4/7] mm: factor out madvise's core functionality
  2019-05-20 14:26   ` Oleksandr Natalenko
@ 2019-05-21  1:26     ` Minchan Kim
  2019-05-21  6:36       ` Oleksandr Natalenko
  0 siblings, 1 reply; 138+ messages in thread
From: Minchan Kim @ 2019-05-21  1:26 UTC (permalink / raw)
  To: Oleksandr Natalenko
  Cc: Andrew Morton, LKML, linux-mm, Michal Hocko, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon

Hi Oleksandr,

On Mon, May 20, 2019 at 04:26:33PM +0200, Oleksandr Natalenko wrote:
> Hi.
> 
> On Mon, May 20, 2019 at 12:52:51PM +0900, Minchan Kim wrote:
> > This patch factor out madvise's core functionality so that upcoming
> > patch can reuse it without duplication.
> > 
> > It shouldn't change any behavior.
> > 
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> >  mm/madvise.c | 168 +++++++++++++++++++++++++++------------------------
> >  1 file changed, 89 insertions(+), 79 deletions(-)
> > 
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 9a6698b56845..119e82e1f065 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -742,7 +742,8 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
> >  	return 0;
> >  }
> >  
> > -static long madvise_dontneed_free(struct vm_area_struct *vma,
> > +static long madvise_dontneed_free(struct task_struct *tsk,
> > +				  struct vm_area_struct *vma,
> >  				  struct vm_area_struct **prev,
> >  				  unsigned long start, unsigned long end,
> >  				  int behavior)
> > @@ -754,8 +755,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
> >  	if (!userfaultfd_remove(vma, start, end)) {
> >  		*prev = NULL; /* mmap_sem has been dropped, prev is stale */
> >  
> > -		down_read(&current->mm->mmap_sem);
> > -		vma = find_vma(current->mm, start);
> > +		down_read(&tsk->mm->mmap_sem);
> > +		vma = find_vma(tsk->mm, start);
> >  		if (!vma)
> >  			return -ENOMEM;
> >  		if (start < vma->vm_start) {
> > @@ -802,7 +803,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
> >   * Application wants to free up the pages and associated backing store.
> >   * This is effectively punching a hole into the middle of a file.
> >   */
> > -static long madvise_remove(struct vm_area_struct *vma,
> > +static long madvise_remove(struct task_struct *tsk,
> > +				struct vm_area_struct *vma,
> >  				struct vm_area_struct **prev,
> >  				unsigned long start, unsigned long end)
> >  {
> > @@ -836,13 +838,13 @@ static long madvise_remove(struct vm_area_struct *vma,
> >  	get_file(f);
> >  	if (userfaultfd_remove(vma, start, end)) {
> >  		/* mmap_sem was not released by userfaultfd_remove() */
> > -		up_read(&current->mm->mmap_sem);
> > +		up_read(&tsk->mm->mmap_sem);
> >  	}
> >  	error = vfs_fallocate(f,
> >  				FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> >  				offset, end - start);
> >  	fput(f);
> > -	down_read(&current->mm->mmap_sem);
> > +	down_read(&tsk->mm->mmap_sem);
> >  	return error;
> >  }
> >  
> > @@ -916,12 +918,13 @@ static int madvise_inject_error(int behavior,
> >  #endif
> 
> What about madvise_inject_error() and get_user_pages_fast() in it
> please?

Good point. Maybe, there more places where assume context is "current" so
I'm thinking to limit hints we could allow from external process.
It would be better for maintainance point of view in that we could know
the workload/usecases when someone ask new advises from external process
without making every hints works both contexts.

Thanks.


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 0/7] introduce memory hinting API for external process
  2019-05-20  3:52 [RFC 0/7] introduce memory hinting API for external process Minchan Kim
                   ` (10 preceding siblings ...)
  2019-05-20 16:46 ` Johannes Weiner
@ 2019-05-21  1:44 ` Matthew Wilcox
  2019-05-21  5:01   ` Minchan Kim
  2019-05-21  6:34   ` Michal Hocko
  2019-05-21  8:42 ` Christian Brauner
  2019-05-21 12:53 ` Shakeel Butt
  13 siblings, 2 replies; 138+ messages in thread
From: Matthew Wilcox @ 2019-05-21  1:44 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Michal Hocko, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon

On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote:
> IMHO we should spell it out that this patchset complements MADV_WONTNEED
> and MADV_FREE by adding non-destructive ways to gain some free memory
> space. MADV_COLD is similar to MADV_WONTNEED in a way that it hints the
> kernel that memory region is not currently needed and should be reclaimed
> immediately; MADV_COOL is similar to MADV_FREE in a way that it hints the
> kernel that memory region is not currently needed and should be reclaimed
> when memory pressure rises.

Do we tear down page tables for these ranges?  That seems like a good
way of reclaiming potentially a substantial amount of memory.


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 5/7] mm: introduce external memory hinting API
  2019-05-20  9:18   ` Michal Hocko
@ 2019-05-21  2:41     ` Minchan Kim
  2019-05-21  6:17       ` Michal Hocko
  0 siblings, 1 reply; 138+ messages in thread
From: Minchan Kim @ 2019-05-21  2:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Mon, May 20, 2019 at 11:18:29AM +0200, Michal Hocko wrote:
> [Cc linux-api]
> 
> On Mon 20-05-19 12:52:52, Minchan Kim wrote:
> > There is some usecase that centralized userspace daemon want to give
> > a memory hint like MADV_[COOL|COLD] to other process. Android's
> > ActivityManagerService is one of them.
> > 
> > It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > required to make the reclaim decision is not known to the app. Instead,
> > it is known to the centralized userspace daemon(ActivityManagerService),
> > and that daemon must be able to initiate reclaim on its own without
> > any app involvement.
> 
> Could you expand some more about how this all works? How does the
> centralized daemon track respective ranges? How does it synchronize
> against parallel modification of the address space etc.

Currently, we don't track each address ranges because we have two
policies at this moment:

	deactive file pages and reclaim anonymous pages of the app.

Since the daemon has a ability to let background apps resume(IOW, process
will be run by the daemon) and both hints are non-disruptive stabilty point
of view, we are okay for the race.

> 
> > To solve the issue, this patch introduces new syscall process_madvise(2)
> > which works based on pidfd so it could give a hint to the exeternal
> > process.
> > 
> > int process_madvise(int pidfd, void *addr, size_t length, int advise);
> 
> OK, this makes some sense from the API point of view. When we have
> discussed that at LSFMM I was contemplating about something like that
> except the fd would be a VMA fd rather than the process. We could extend
> and reuse /proc/<pid>/map_files interface which doesn't support the
> anonymous memory right now. 
> 
> I am not saying this would be a better interface but I wanted to mention
> it here for a further discussion. One slight advantage would be that
> you know the exact object that you are operating on because you have a
> fd for the VMA and we would have a more straightforward way to reject
> operation if the underlying object has changed (e.g. unmapped and reused
> for a different mapping).

I agree your point. If I didn't miss something, such kinds of vma level
modify notification doesn't work even file mapped vma at this moment.
For anonymous vma, I think we could use userfaultfd, pontentially.
It would be great if someone want to do with disruptive hints like
MADV_DONTNEED.

I'd like to see it further enhancement after landing address range based
operation via limiting hints process_madvise supports to non-disruptive
only(e.g., MADV_[COOL|COLD]) so that we could catch up the usercase/workload
when someone want to extend the API.

> 
> > All advises madvise provides can be supported in process_madvise, too.
> > Since it could affect other process's address range, only privileged
> > process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> > gives it the right to ptrrace the process could use it successfully.
> 
> proc_mem_open model we use for accessing address space via proc sounds
> like a good mode. You are doing something similar.
> 
> > Please suggest better idea if you have other idea about the permission.
> > 
> > * from v1r1
> >   * use ptrace capability - surenb, dancol
> > 
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> >  arch/x86/entry/syscalls/syscall_32.tbl |  1 +
> >  arch/x86/entry/syscalls/syscall_64.tbl |  1 +
> >  include/linux/proc_fs.h                |  1 +
> >  include/linux/syscalls.h               |  2 ++
> >  include/uapi/asm-generic/unistd.h      |  2 ++
> >  kernel/signal.c                        |  2 +-
> >  kernel/sys_ni.c                        |  1 +
> >  mm/madvise.c                           | 45 ++++++++++++++++++++++++++
> >  8 files changed, 54 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> > index 4cd5f982b1e5..5b9dd55d6b57 100644
> > --- a/arch/x86/entry/syscalls/syscall_32.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> > @@ -438,3 +438,4 @@
> >  425	i386	io_uring_setup		sys_io_uring_setup		__ia32_sys_io_uring_setup
> >  426	i386	io_uring_enter		sys_io_uring_enter		__ia32_sys_io_uring_enter
> >  427	i386	io_uring_register	sys_io_uring_register		__ia32_sys_io_uring_register
> > +428	i386	process_madvise		sys_process_madvise		__ia32_sys_process_madvise
> > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> > index 64ca0d06259a..0e5ee78161c9 100644
> > --- a/arch/x86/entry/syscalls/syscall_64.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> > @@ -355,6 +355,7 @@
> >  425	common	io_uring_setup		__x64_sys_io_uring_setup
> >  426	common	io_uring_enter		__x64_sys_io_uring_enter
> >  427	common	io_uring_register	__x64_sys_io_uring_register
> > +428	common	process_madvise		__x64_sys_process_madvise
> >  
> >  #
> >  # x32-specific system call numbers start at 512 to avoid cache impact
> > diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
> > index 52a283ba0465..f8545d7c5218 100644
> > --- a/include/linux/proc_fs.h
> > +++ b/include/linux/proc_fs.h
> > @@ -122,6 +122,7 @@ static inline struct pid *tgid_pidfd_to_pid(const struct file *file)
> >  
> >  #endif /* CONFIG_PROC_FS */
> >  
> > +extern struct pid *pidfd_to_pid(const struct file *file);
> >  struct net;
> >  
> >  static inline struct proc_dir_entry *proc_net_mkdir(
> > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> > index e2870fe1be5b..21c6c9a62006 100644
> > --- a/include/linux/syscalls.h
> > +++ b/include/linux/syscalls.h
> > @@ -872,6 +872,8 @@ asmlinkage long sys_munlockall(void);
> >  asmlinkage long sys_mincore(unsigned long start, size_t len,
> >  				unsigned char __user * vec);
> >  asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
> > +asmlinkage long sys_process_madvise(int pid_fd, unsigned long start,
> > +				size_t len, int behavior);
> >  asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
> >  			unsigned long prot, unsigned long pgoff,
> >  			unsigned long flags);
> > diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> > index dee7292e1df6..7ee82ce04620 100644
> > --- a/include/uapi/asm-generic/unistd.h
> > +++ b/include/uapi/asm-generic/unistd.h
> > @@ -832,6 +832,8 @@ __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
> >  __SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
> >  #define __NR_io_uring_register 427
> >  __SYSCALL(__NR_io_uring_register, sys_io_uring_register)
> > +#define __NR_process_madvise 428
> > +__SYSCALL(__NR_process_madvise, sys_process_madvise)
> >  
> >  #undef __NR_syscalls
> >  #define __NR_syscalls 428
> > diff --git a/kernel/signal.c b/kernel/signal.c
> > index 1c86b78a7597..04e75daab1f8 100644
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> > @@ -3620,7 +3620,7 @@ static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo, siginfo_t *info)
> >  	return copy_siginfo_from_user(kinfo, info);
> >  }
> >  
> > -static struct pid *pidfd_to_pid(const struct file *file)
> > +struct pid *pidfd_to_pid(const struct file *file)
> >  {
> >  	if (file->f_op == &pidfd_fops)
> >  		return file->private_data;
> > diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> > index 4d9ae5ea6caf..5277421795ab 100644
> > --- a/kernel/sys_ni.c
> > +++ b/kernel/sys_ni.c
> > @@ -278,6 +278,7 @@ COND_SYSCALL(mlockall);
> >  COND_SYSCALL(munlockall);
> >  COND_SYSCALL(mincore);
> >  COND_SYSCALL(madvise);
> > +COND_SYSCALL(process_madvise);
> >  COND_SYSCALL(remap_file_pages);
> >  COND_SYSCALL(mbind);
> >  COND_SYSCALL_COMPAT(mbind);
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 119e82e1f065..af02aa17e5c1 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -9,6 +9,7 @@
> >  #include <linux/mman.h>
> >  #include <linux/pagemap.h>
> >  #include <linux/page_idle.h>
> > +#include <linux/proc_fs.h>
> >  #include <linux/syscalls.h>
> >  #include <linux/mempolicy.h>
> >  #include <linux/page-isolation.h>
> > @@ -16,6 +17,7 @@
> >  #include <linux/hugetlb.h>
> >  #include <linux/falloc.h>
> >  #include <linux/sched.h>
> > +#include <linux/sched/mm.h>
> >  #include <linux/ksm.h>
> >  #include <linux/fs.h>
> >  #include <linux/file.h>
> > @@ -1140,3 +1142,46 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> >  {
> >  	return madvise_core(current, start, len_in, behavior);
> >  }
> > +
> > +SYSCALL_DEFINE4(process_madvise, int, pidfd, unsigned long, start,
> > +		size_t, len_in, int, behavior)
> > +{
> > +	int ret;
> > +	struct fd f;
> > +	struct pid *pid;
> > +	struct task_struct *tsk;
> > +	struct mm_struct *mm;
> > +
> > +	f = fdget(pidfd);
> > +	if (!f.file)
> > +		return -EBADF;
> > +
> > +	pid = pidfd_to_pid(f.file);
> > +	if (IS_ERR(pid)) {
> > +		ret = PTR_ERR(pid);
> > +		goto err;
> > +	}
> > +
> > +	ret = -EINVAL;
> > +	rcu_read_lock();
> > +	tsk = pid_task(pid, PIDTYPE_PID);
> > +	if (!tsk) {
> > +		rcu_read_unlock();
> > +		goto err;
> > +	}
> > +	get_task_struct(tsk);
> > +	rcu_read_unlock();
> > +	mm = mm_access(tsk, PTRACE_MODE_ATTACH_REALCREDS);
> > +	if (!mm || IS_ERR(mm)) {
> > +		ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
> > +		if (ret == -EACCES)
> > +			ret = -EPERM;
> > +		goto err;
> > +	}
> > +	ret = madvise_core(tsk, start, len_in, behavior);
> > +	mmput(mm);
> > +	put_task_struct(tsk);
> > +err:
> > +	fdput(f);
> > +	return ret;
> > +}
> > -- 
> > 2.21.0.1020.gf2820cf01a-goog
> > 
> 
> -- 
> Michal Hocko
> SUSE Labs


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector arrary
  2019-05-20  9:22   ` Michal Hocko
@ 2019-05-21  2:48     ` Minchan Kim
  2019-05-21  6:24       ` Michal Hocko
  0 siblings, 1 reply; 138+ messages in thread
From: Minchan Kim @ 2019-05-21  2:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Mon, May 20, 2019 at 11:22:58AM +0200, Michal Hocko wrote:
> [Cc linux-api]
> 
> On Mon 20-05-19 12:52:53, Minchan Kim wrote:
> > Currently, process_madvise syscall works for only one address range
> > so user should call the syscall several times to give hints to
> > multiple address range.
> 
> Is that a problem? How big of a problem? Any numbers?

We easily have 2000+ vma so it's not trivial overhead. I will come up
with number in the description at respin.

> 
> > This patch extends process_madvise syscall to support multiple
> > hints, address ranges and return vaules so user could give hints
> > all at once.
> > 
> > struct pr_madvise_param {
> >     int size;                       /* the size of this structure */
> >     const struct iovec __user *vec; /* address range array */
> > }
> > 
> > int process_madvise(int pidfd, ssize_t nr_elem,
> > 		    int *behavior,
> > 		    struct pr_madvise_param *results,
> > 		    struct pr_madvise_param *ranges,
> > 		    unsigned long flags);
> > 
> > - pidfd
> > 
> > target process fd
> > 
> > - nr_elem
> > 
> > the number of elemenent of array behavior, results, ranges
> > 
> > - behavior
> > 
> > hints for each address range in remote process so that user could
> > give different hints for each range.
> 
> What is the guarantee of a single call? Do all hints get applied or the
> first failure backs of? What are the atomicity guarantees?

All hints will be tried even though one of them is failed. User will
see the success or failure from the restuls parameter.
For the single call, there is no guarantee of atomicity.

> 
> > 
> > - results
> > 
> > array of buffers to get results for associated remote address range
> > action.
> > 
> > - ranges
> > 
> > array to buffers to have remote process's address ranges to be
> > processed
> > 
> > - flags
> > 
> > extra argument for the future. It should be zero this moment.
> > 
> > Example)
> > 
> > struct pr_madvise_param {
> >         int size;
> >         const struct iovec *vec;
> > };
> > 
> > int main(int argc, char *argv[])
> > {
> >         struct pr_madvise_param retp, rangep;
> >         struct iovec result_vec[2], range_vec[2];
> >         int hints[2];
> >         long ret[2];
> >         void *addr[2];
> > 
> >         pid_t pid;
> >         char cmd[64] = {0,};
> >         addr[0] = mmap(NULL, ALLOC_SIZE, PROT_READ|PROT_WRITE,
> >                           MAP_POPULATE|MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
> > 
> >         if (MAP_FAILED == addr[0])
> >                 return 1;
> > 
> >         addr[1] = mmap(NULL, ALLOC_SIZE, PROT_READ|PROT_WRITE,
> >                           MAP_POPULATE|MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
> > 
> >         if (MAP_FAILED == addr[1])
> >                 return 1;
> > 
> >         hints[0] = MADV_COLD;
> > 	range_vec[0].iov_base = addr[0];
> >         range_vec[0].iov_len = ALLOC_SIZE;
> >         result_vec[0].iov_base = &ret[0];
> >         result_vec[0].iov_len = sizeof(long);
> > 	retp.vec = result_vec;
> >         retp.size = sizeof(struct pr_madvise_param);
> > 
> >         hints[1] = MADV_COOL;
> >         range_vec[1].iov_base = addr[1];
> >         range_vec[1].iov_len = ALLOC_SIZE;
> >         result_vec[1].iov_base = &ret[1];
> >         result_vec[1].iov_len = sizeof(long);
> >         rangep.vec = range_vec;
> >         rangep.size = sizeof(struct pr_madvise_param);
> > 
> >         pid = fork();
> >         if (!pid) {
> >                 sleep(10);
> >         } else {
> >                 int pidfd = open(cmd,  O_DIRECTORY | O_CLOEXEC);
> >                 if (pidfd < 0)
> >                         return 1;
> > 
> >                 /* munmap to make pages private for the child */
> >                 munmap(addr[0], ALLOC_SIZE);
> >                 munmap(addr[1], ALLOC_SIZE);
> >                 system("cat /proc/vmstat | egrep 'pswpout|deactivate'");
> >                 if (syscall(__NR_process_madvise, pidfd, 2, behaviors,
> > 						&retp, &rangep, 0))
> >                         perror("process_madvise fail\n");
> >                 system("cat /proc/vmstat | egrep 'pswpout|deactivate'");
> >         }
> > 
> >         return 0;
> > }
> > 
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> >  include/uapi/asm-generic/mman-common.h |   5 +
> >  mm/madvise.c                           | 184 +++++++++++++++++++++----
> >  2 files changed, 166 insertions(+), 23 deletions(-)
> > 
> > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> > index b9b51eeb8e1a..b8e230de84a6 100644
> > --- a/include/uapi/asm-generic/mman-common.h
> > +++ b/include/uapi/asm-generic/mman-common.h
> > @@ -74,4 +74,9 @@
> >  #define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
> >  				 PKEY_DISABLE_WRITE)
> >  
> > +struct pr_madvise_param {
> > +	int size;			/* the size of this structure */
> > +	const struct iovec __user *vec;	/* address range array */
> > +};
> > +
> >  #endif /* __ASM_GENERIC_MMAN_COMMON_H */
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index af02aa17e5c1..f4f569dac2bd 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -320,6 +320,7 @@ static int madvise_cool_pte_range(pmd_t *pmd, unsigned long addr,
> >  	struct page *page;
> >  	struct vm_area_struct *vma = walk->vma;
> >  	unsigned long next;
> > +	long nr_pages = 0;
> >  
> >  	next = pmd_addr_end(addr, end);
> >  	if (pmd_trans_huge(*pmd)) {
> > @@ -380,9 +381,12 @@ static int madvise_cool_pte_range(pmd_t *pmd, unsigned long addr,
> >  
> >  		ptep_test_and_clear_young(vma, addr, pte);
> >  		deactivate_page(page);
> > +		nr_pages++;
> > +
> >  	}
> >  
> >  	pte_unmap_unlock(orig_pte, ptl);
> > +	*(long *)walk->private += nr_pages;
> >  	cond_resched();
> >  
> >  	return 0;
> > @@ -390,11 +394,13 @@ static int madvise_cool_pte_range(pmd_t *pmd, unsigned long addr,
> >  
> >  static void madvise_cool_page_range(struct mmu_gather *tlb,
> >  			     struct vm_area_struct *vma,
> > -			     unsigned long addr, unsigned long end)
> > +			     unsigned long addr, unsigned long end,
> > +			     long *nr_pages)
> >  {
> >  	struct mm_walk cool_walk = {
> >  		.pmd_entry = madvise_cool_pte_range,
> >  		.mm = vma->vm_mm,
> > +		.private = nr_pages
> >  	};
> >  
> >  	tlb_start_vma(tlb, vma);
> > @@ -403,7 +409,8 @@ static void madvise_cool_page_range(struct mmu_gather *tlb,
> >  }
> >  
> >  static long madvise_cool(struct vm_area_struct *vma,
> > -			unsigned long start_addr, unsigned long end_addr)
> > +			unsigned long start_addr, unsigned long end_addr,
> > +			long *nr_pages)
> >  {
> >  	struct mm_struct *mm = vma->vm_mm;
> >  	struct mmu_gather tlb;
> > @@ -413,7 +420,7 @@ static long madvise_cool(struct vm_area_struct *vma,
> >  
> >  	lru_add_drain();
> >  	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
> > -	madvise_cool_page_range(&tlb, vma, start_addr, end_addr);
> > +	madvise_cool_page_range(&tlb, vma, start_addr, end_addr, nr_pages);
> >  	tlb_finish_mmu(&tlb, start_addr, end_addr);
> >  
> >  	return 0;
> > @@ -429,6 +436,7 @@ static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
> >  	int isolated = 0;
> >  	struct vm_area_struct *vma = walk->vma;
> >  	unsigned long next;
> > +	long nr_pages = 0;
> >  
> >  	next = pmd_addr_end(addr, end);
> >  	if (pmd_trans_huge(*pmd)) {
> > @@ -492,7 +500,7 @@ static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
> >  		list_add(&page->lru, &page_list);
> >  		if (isolated >= SWAP_CLUSTER_MAX) {
> >  			pte_unmap_unlock(orig_pte, ptl);
> > -			reclaim_pages(&page_list);
> > +			nr_pages += reclaim_pages(&page_list);
> >  			isolated = 0;
> >  			pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> >  			orig_pte = pte;
> > @@ -500,19 +508,22 @@ static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
> >  	}
> >  
> >  	pte_unmap_unlock(orig_pte, ptl);
> > -	reclaim_pages(&page_list);
> > +	nr_pages += reclaim_pages(&page_list);
> >  	cond_resched();
> >  
> > +	*(long *)walk->private += nr_pages;
> >  	return 0;
> >  }
> >  
> >  static void madvise_cold_page_range(struct mmu_gather *tlb,
> >  			     struct vm_area_struct *vma,
> > -			     unsigned long addr, unsigned long end)
> > +			     unsigned long addr, unsigned long end,
> > +			     long *nr_pages)
> >  {
> >  	struct mm_walk warm_walk = {
> >  		.pmd_entry = madvise_cold_pte_range,
> >  		.mm = vma->vm_mm,
> > +		.private = nr_pages,
> >  	};
> >  
> >  	tlb_start_vma(tlb, vma);
> > @@ -522,7 +533,8 @@ static void madvise_cold_page_range(struct mmu_gather *tlb,
> >  
> >  
> >  static long madvise_cold(struct vm_area_struct *vma,
> > -			unsigned long start_addr, unsigned long end_addr)
> > +			unsigned long start_addr, unsigned long end_addr,
> > +			long *nr_pages)
> >  {
> >  	struct mm_struct *mm = vma->vm_mm;
> >  	struct mmu_gather tlb;
> > @@ -532,7 +544,7 @@ static long madvise_cold(struct vm_area_struct *vma,
> >  
> >  	lru_add_drain();
> >  	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
> > -	madvise_cold_page_range(&tlb, vma, start_addr, end_addr);
> > +	madvise_cold_page_range(&tlb, vma, start_addr, end_addr, nr_pages);
> >  	tlb_finish_mmu(&tlb, start_addr, end_addr);
> >  
> >  	return 0;
> > @@ -922,7 +934,7 @@ static int madvise_inject_error(int behavior,
> >  static long
> >  madvise_vma(struct task_struct *tsk, struct vm_area_struct *vma,
> >  		struct vm_area_struct **prev, unsigned long start,
> > -		unsigned long end, int behavior)
> > +		unsigned long end, int behavior, long *nr_pages)
> >  {
> >  	switch (behavior) {
> >  	case MADV_REMOVE:
> > @@ -930,9 +942,9 @@ madvise_vma(struct task_struct *tsk, struct vm_area_struct *vma,
> >  	case MADV_WILLNEED:
> >  		return madvise_willneed(vma, prev, start, end);
> >  	case MADV_COOL:
> > -		return madvise_cool(vma, start, end);
> > +		return madvise_cool(vma, start, end, nr_pages);
> >  	case MADV_COLD:
> > -		return madvise_cold(vma, start, end);
> > +		return madvise_cold(vma, start, end, nr_pages);
> >  	case MADV_FREE:
> >  	case MADV_DONTNEED:
> >  		return madvise_dontneed_free(tsk, vma, prev, start,
> > @@ -981,7 +993,7 @@ madvise_behavior_valid(int behavior)
> >  }
> >  
> >  static int madvise_core(struct task_struct *tsk, unsigned long start,
> > -			size_t len_in, int behavior)
> > +			size_t len_in, int behavior, long *nr_pages)
> >  {
> >  	unsigned long end, tmp;
> >  	struct vm_area_struct *vma, *prev;
> > @@ -996,6 +1008,7 @@ static int madvise_core(struct task_struct *tsk, unsigned long start,
> >  
> >  	if (start & ~PAGE_MASK)
> >  		return error;
> > +
> >  	len = (len_in + ~PAGE_MASK) & PAGE_MASK;
> >  
> >  	/* Check to see whether len was rounded up from small -ve to zero */
> > @@ -1035,6 +1048,8 @@ static int madvise_core(struct task_struct *tsk, unsigned long start,
> >  	blk_start_plug(&plug);
> >  	for (;;) {
> >  		/* Still start < end. */
> > +		long pages = 0;
> > +
> >  		error = -ENOMEM;
> >  		if (!vma)
> >  			goto out;
> > @@ -1053,9 +1068,11 @@ static int madvise_core(struct task_struct *tsk, unsigned long start,
> >  			tmp = end;
> >  
> >  		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
> > -		error = madvise_vma(tsk, vma, &prev, start, tmp, behavior);
> > +		error = madvise_vma(tsk, vma, &prev, start, tmp,
> > +					behavior, &pages);
> >  		if (error)
> >  			goto out;
> > +		*nr_pages += pages;
> >  		start = tmp;
> >  		if (prev && start < prev->vm_end)
> >  			start = prev->vm_end;
> > @@ -1140,26 +1157,137 @@ static int madvise_core(struct task_struct *tsk, unsigned long start,
> >   */
> >  SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> >  {
> > -	return madvise_core(current, start, len_in, behavior);
> > +	unsigned long dummy;
> > +
> > +	return madvise_core(current, start, len_in, behavior, &dummy);
> >  }
> >  
> > -SYSCALL_DEFINE4(process_madvise, int, pidfd, unsigned long, start,
> > -		size_t, len_in, int, behavior)
> > +static int pr_madvise_copy_param(struct pr_madvise_param __user *u_param,
> > +		struct pr_madvise_param *param)
> > +{
> > +	u32 size;
> > +	int ret;
> > +
> > +	memset(param, 0, sizeof(*param));
> > +
> > +	ret = get_user(size, &u_param->size);
> > +	if (ret)
> > +		return ret;
> > +
> > +	if (size > PAGE_SIZE)
> > +		return -E2BIG;
> > +
> > +	if (!size || size > sizeof(struct pr_madvise_param))
> > +		return -EINVAL;
> > +
> > +	ret = copy_from_user(param, u_param, size);
> > +	if (ret)
> > +		return -EFAULT;
> > +
> > +	return ret;
> > +}
> > +
> > +static int process_madvise_core(struct task_struct *tsk, int *behaviors,
> > +				struct iov_iter *iter,
> > +				const struct iovec *range_vec,
> > +				unsigned long riovcnt,
> > +				unsigned long flags)
> > +{
> > +	int i;
> > +	long err;
> > +
> > +	for (err = 0, i = 0; i < riovcnt && iov_iter_count(iter); i++) {
> > +		long ret = 0;
> > +
> > +		err = madvise_core(tsk, (unsigned long)range_vec[i].iov_base,
> > +				range_vec[i].iov_len, behaviors[i],
> > +				&ret);
> > +		if (err)
> > +			ret = err;
> > +
> > +		if (copy_to_iter(&ret, sizeof(long), iter) !=
> > +				sizeof(long)) {
> > +			err = -EFAULT;
> > +			break;
> > +		}
> > +
> > +		err = 0;
> > +	}
> > +
> > +	return err;
> > +}
> > +
> > +SYSCALL_DEFINE6(process_madvise, int, pidfd, ssize_t, nr_elem,
> > +			const int __user *, hints,
> > +			struct pr_madvise_param __user *, results,
> > +			struct pr_madvise_param __user *, ranges,
> > +			unsigned long, flags)
> >  {
> >  	int ret;
> >  	struct fd f;
> >  	struct pid *pid;
> >  	struct task_struct *tsk;
> >  	struct mm_struct *mm;
> > +	struct pr_madvise_param result_p, range_p;
> > +	const struct iovec __user *result_vec, __user *range_vec;
> > +	int *behaviors;
> > +	struct iovec iovstack_result[UIO_FASTIOV];
> > +	struct iovec iovstack_r[UIO_FASTIOV];
> > +	struct iovec *iov_l = iovstack_result;
> > +	struct iovec *iov_r = iovstack_r;
> > +	struct iov_iter iter;
> > +
> > +	if (flags != 0)
> > +		return -EINVAL;
> > +
> > +	ret = pr_madvise_copy_param(results, &result_p);
> > +	if (ret)
> > +		return ret;
> > +
> > +	ret = pr_madvise_copy_param(ranges, &range_p);
> > +	if (ret)
> > +		return ret;
> > +
> > +	result_vec = result_p.vec;
> > +	range_vec = range_p.vec;
> > +
> > +	if (result_p.size != sizeof(struct pr_madvise_param) ||
> > +			range_p.size != sizeof(struct pr_madvise_param))
> > +		return -EINVAL;
> > +
> > +	behaviors = kmalloc_array(nr_elem, sizeof(int), GFP_KERNEL);
> > +	if (!behaviors)
> > +		return -ENOMEM;
> > +
> > +	ret = copy_from_user(behaviors, hints, sizeof(int) * nr_elem);
> > +	if (ret < 0)
> > +		goto free_behavior_vec;
> > +
> > +	ret = import_iovec(READ, result_vec, nr_elem, UIO_FASTIOV,
> > +				&iov_l, &iter);
> > +	if (ret < 0)
> > +		goto free_behavior_vec;
> > +
> > +	if (!iov_iter_count(&iter)) {
> > +		ret = -EINVAL;
> > +		goto free_iovecs;
> > +	}
> > +
> > +	ret = rw_copy_check_uvector(CHECK_IOVEC_ONLY, range_vec, nr_elem,
> > +				UIO_FASTIOV, iovstack_r, &iov_r);
> > +	if (ret <= 0)
> > +		goto free_iovecs;
> >  
> >  	f = fdget(pidfd);
> > -	if (!f.file)
> > -		return -EBADF;
> > +	if (!f.file) {
> > +		ret = -EBADF;
> > +		goto free_iovecs;
> > +	}
> >  
> >  	pid = pidfd_to_pid(f.file);
> >  	if (IS_ERR(pid)) {
> >  		ret = PTR_ERR(pid);
> > -		goto err;
> > +		goto put_fd;
> >  	}
> >  
> >  	ret = -EINVAL;
> > @@ -1167,7 +1295,7 @@ SYSCALL_DEFINE4(process_madvise, int, pidfd, unsigned long, start,
> >  	tsk = pid_task(pid, PIDTYPE_PID);
> >  	if (!tsk) {
> >  		rcu_read_unlock();
> > -		goto err;
> > +		goto put_fd;
> >  	}
> >  	get_task_struct(tsk);
> >  	rcu_read_unlock();
> > @@ -1176,12 +1304,22 @@ SYSCALL_DEFINE4(process_madvise, int, pidfd, unsigned long, start,
> >  		ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
> >  		if (ret == -EACCES)
> >  			ret = -EPERM;
> > -		goto err;
> > +		goto put_task;
> >  	}
> > -	ret = madvise_core(tsk, start, len_in, behavior);
> > +
> > +	ret = process_madvise_core(tsk, behaviors, &iter, iov_r,
> > +					nr_elem, flags);
> >  	mmput(mm);
> > +put_task:
> >  	put_task_struct(tsk);
> > -err:
> > +put_fd:
> >  	fdput(f);
> > +free_iovecs:
> > +	if (iov_r != iovstack_r)
> > +		kfree(iov_r);
> > +	kfree(iov_l);
> > +free_behavior_vec:
> > +	kfree(behaviors);
> > +
> >  	return ret;
> >  }
> > -- 
> > 2.21.0.1020.gf2820cf01a-goog
> > 
> 
> -- 
> Michal Hocko
> SUSE Labs


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-20  9:28   ` Michal Hocko
@ 2019-05-21  2:55     ` Minchan Kim
  2019-05-21  6:26       ` Michal Hocko
  2019-05-21 15:33       ` Johannes Weiner
  0 siblings, 2 replies; 138+ messages in thread
From: Minchan Kim @ 2019-05-21  2:55 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Mon, May 20, 2019 at 11:28:01AM +0200, Michal Hocko wrote:
> [cc linux-api]
> 
> On Mon 20-05-19 12:52:54, Minchan Kim wrote:
> > System could have much faster swap device like zRAM. In that case, swapping
> > is extremely cheaper than file-IO on the low-end storage.
> > In this configuration, userspace could handle different strategy for each
> > kinds of vma. IOW, they want to reclaim anonymous pages by MADV_COLD
> > while it keeps file-backed pages in inactive LRU by MADV_COOL because
> > file IO is more expensive in this case so want to keep them in memory
> > until memory pressure happens.
> > 
> > To support such strategy easier, this patch introduces
> > MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER options in madvise(2) like
> > that /proc/<pid>/clear_refs already has supported same filters.
> > They are filters could be Ored with other existing hints using top two bits
> > of (int behavior).
> 
> madvise operates on top of ranges and it is quite trivial to do the
> filtering from the userspace so why do we need any additional filtering?
> 
> > Once either of them is set, the hint could affect only the interested vma
> > either anonymous or file-backed.
> > 
> > With that, user could call a process_madvise syscall simply with a entire
> > range(0x0 - 0xFFFFFFFFFFFFFFFF) but either of MADV_ANONYMOUS_FILTER and
> > MADV_FILE_FILTER so there is no need to call the syscall range by range.
> 
> OK, so here is the reason you want that. The immediate question is why
> cannot the monitor do the filtering from the userspace. Slightly more
> work, all right, but less of an API to expose and that itself is a
> strong argument against.

What I should do if we don't have such filter option is to enumerate all of
vma via /proc/<pid>/maps and then parse every ranges and inode from string,
which would be painful for 2000+ vmas.

> 
> > * from v1r2
> >   * use consistent check with clear_refs to identify anon/file vma - surenb
> > 
> > * from v1r1
> >   * use naming "filter" for new madvise option - dancol
> > 
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> >  include/uapi/asm-generic/mman-common.h |  5 +++++
> >  mm/madvise.c                           | 14 ++++++++++++++
> >  2 files changed, 19 insertions(+)
> > 
> > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> > index b8e230de84a6..be59a1b90284 100644
> > --- a/include/uapi/asm-generic/mman-common.h
> > +++ b/include/uapi/asm-generic/mman-common.h
> > @@ -66,6 +66,11 @@
> >  #define MADV_WIPEONFORK 18		/* Zero memory on fork, child only */
> >  #define MADV_KEEPONFORK 19		/* Undo MADV_WIPEONFORK */
> >  
> > +#define MADV_BEHAVIOR_MASK (~(MADV_ANONYMOUS_FILTER|MADV_FILE_FILTER))
> > +
> > +#define MADV_ANONYMOUS_FILTER	(1<<31)	/* works for only anonymous vma */
> > +#define MADV_FILE_FILTER	(1<<30)	/* works for only file-backed vma */
> > +
> >  /* compatibility flags */
> >  #define MAP_FILE	0
> >  
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index f4f569dac2bd..116131243540 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -1002,7 +1002,15 @@ static int madvise_core(struct task_struct *tsk, unsigned long start,
> >  	int write;
> >  	size_t len;
> >  	struct blk_plug plug;
> > +	bool anon_only, file_only;
> >  
> > +	anon_only = behavior & MADV_ANONYMOUS_FILTER;
> > +	file_only = behavior & MADV_FILE_FILTER;
> > +
> > +	if (anon_only && file_only)
> > +		return error;
> > +
> > +	behavior = behavior & MADV_BEHAVIOR_MASK;
> >  	if (!madvise_behavior_valid(behavior))
> >  		return error;
> >  
> > @@ -1067,12 +1075,18 @@ static int madvise_core(struct task_struct *tsk, unsigned long start,
> >  		if (end < tmp)
> >  			tmp = end;
> >  
> > +		if (anon_only && vma->vm_file)
> > +			goto next;
> > +		if (file_only && !vma->vm_file)
> > +			goto next;
> > +
> >  		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
> >  		error = madvise_vma(tsk, vma, &prev, start, tmp,
> >  					behavior, &pages);
> >  		if (error)
> >  			goto out;
> >  		*nr_pages += pages;
> > +next:
> >  		start = tmp;
> >  		if (prev && start < prev->vm_end)
> >  			start = prev->vm_end;
> > -- 
> > 2.21.0.1020.gf2820cf01a-goog
> > 
> 
> -- 
> Michal Hocko
> SUSE Labs


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 0/7] introduce memory hinting API for external process
  2019-05-20 16:59   ` Tim Murray
@ 2019-05-21  2:55     ` Anshuman Khandual
  2019-05-21  5:14       ` Minchan Kim
                         ` (2 more replies)
  0 siblings, 3 replies; 138+ messages in thread
From: Anshuman Khandual @ 2019-05-21  2:55 UTC (permalink / raw)
  To: Tim Murray
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Michal Hocko,
	Johannes Weiner, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon



On 05/20/2019 10:29 PM, Tim Murray wrote:
> On Sun, May 19, 2019 at 11:37 PM Anshuman Khandual
> <anshuman.khandual@arm.com> wrote:
>>
>> Or Is the objective here is reduce the number of processes which get killed by
>> lmkd by triggering swapping for the unused memory (user hinted) sooner so that
>> they dont get picked by lmkd. Under utilization for zram hardware is a concern
>> here as well ?
> 
> The objective is to avoid some instances of memory pressure by
> proactively swapping pages that userspace knows to be cold before
> those pages reach the end of the LRUs, which in turn can prevent some
> apps from being killed by lmk/lmkd. As soon as Android userspace knows
> that an application is not being used and is only resident to improve
> performance if the user returns to that app, we can kick off
> process_madvise on that process's pages (or some portion of those
> pages) in a power-efficient way to reduce memory pressure long before
> the system hits the free page watermark. This allows the system more
> time to put pages into zram versus waiting for the watermark to
> trigger kswapd, which decreases the likelihood that later memory
> allocations will cause enough pressure to trigger a kill of one of
> these apps.

So this opens up bit of LRU management to user space hints. Also because the app
in itself wont know about the memory situation of the entire system, new system
call needs to be called from an external process.

> 
>> Swapping out memory into zram wont increase the latency for a hot start ? Or
>> is it because as it will prevent a fresh cold start which anyway will be slower
>> than a slow hot start. Just being curious.
> 
> First, not all swapped pages will be reloaded immediately once an app
> is resumed. We've found that an app's working set post-process_madvise
> is significantly smaller than what an app allocates when it first
> launches (see the delta between pswpin and pswpout in Minchan's
> results). Presumably because of this, faulting to fetch from zram does

pswpin      417613    1392647     975034     233.00
pswpout    1274224    2661731    1387507     108.00

IIUC the swap-in ratio is way higher in comparison to that of swap out. Is that
always the case ? Or it tend to swap out from an active area of the working set
which faulted back again.

> not seem to introduce a noticeable hot start penalty, not does it
> cause an increase in performance problems later in the app's
> lifecycle. I've measured with and without process_madvise, and the
> differences are within our noise bounds. Second, because we're not

That is assuming that post process_madvise() working set for the application is
always smaller. There is another challenge. The external process should ideally
have the knowledge of active areas of the working set for an application in
question for it to invoke process_madvise() correctly to prevent such scenarios.

> preemptively evicting file pages and only making them more likely to
> be evicted when there's already memory pressure, we avoid the case
> where we process_madvise an app then immediately return to the app and
> reload all file pages in the working set even though there was no
> intervening memory pressure. Our initial version of this work evicted

That would be the worst case scenario which should be avoided. Memory pressure
must be a parameter before actually doing the swap out. But pages if know to be
inactive/cold can be marked high priority to be swapped out.

> file pages preemptively and did cause a noticeable slowdown (~15%) for
> that case; this patch set avoids that slowdown. Finally, the benefit
> from avoiding cold starts is huge. The performance improvement from
> having a hot start instead of a cold start ranges from 3x for very
> small apps to 50x+ for larger apps like high-fidelity games.

Is there any other real world scenario apart from this app based ecosystem where
user hinted LRU management might be helpful ? Just being curious. Thanks for the
detailed explanation. I will continue looking into this series.


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 0/7] introduce memory hinting API for external process
  2019-05-20 14:42 ` Oleksandr Natalenko
@ 2019-05-21  2:56   ` Minchan Kim
  0 siblings, 0 replies; 138+ messages in thread
From: Minchan Kim @ 2019-05-21  2:56 UTC (permalink / raw)
  To: Oleksandr Natalenko
  Cc: Andrew Morton, LKML, linux-mm, Michal Hocko, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon

On Mon, May 20, 2019 at 04:42:00PM +0200, Oleksandr Natalenko wrote:
> Hi.
> 
> On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote:
> > - Background
> > 
> > The Android terminology used for forking a new process and starting an app
> > from scratch is a cold start, while resuming an existing app is a hot start.
> > While we continually try to improve the performance of cold starts, hot
> > starts will always be significantly less power hungry as well as faster so
> > we are trying to make hot start more likely than cold start.
> > 
> > To increase hot start, Android userspace manages the order that apps should
> > be killed in a process called ActivityManagerService. ActivityManagerService
> > tracks every Android app or service that the user could be interacting with
> > at any time and translates that into a ranked list for lmkd(low memory
> > killer daemon). They are likely to be killed by lmkd if the system has to
> > reclaim memory. In that sense they are similar to entries in any other cache.
> > Those apps are kept alive for opportunistic performance improvements but
> > those performance improvements will vary based on the memory requirements of
> > individual workloads.
> > 
> > - Problem
> > 
> > Naturally, cached apps were dominant consumers of memory on the system.
> > However, they were not significant consumers of swap even though they are
> > good candidate for swap. Under investigation, swapping out only begins
> > once the low zone watermark is hit and kswapd wakes up, but the overall
> > allocation rate in the system might trip lmkd thresholds and cause a cached
> > process to be killed(we measured performance swapping out vs. zapping the
> > memory by killing a process. Unsurprisingly, zapping is 10x times faster
> > even though we use zram which is much faster than real storage) so kill
> > from lmkd will often satisfy the high zone watermark, resulting in very
> > few pages actually being moved to swap.
> > 
> > - Approach
> > 
> > The approach we chose was to use a new interface to allow userspace to
> > proactively reclaim entire processes by leveraging platform information.
> > This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
> > that are known to be cold from userspace and to avoid races with lmkd
> > by reclaiming apps as soon as they entered the cached state. Additionally,
> > it could provide many chances for platform to use much information to
> > optimize memory efficiency.
> > 
> > IMHO we should spell it out that this patchset complements MADV_WONTNEED
> > and MADV_FREE by adding non-destructive ways to gain some free memory
> > space. MADV_COLD is similar to MADV_WONTNEED in a way that it hints the
> > kernel that memory region is not currently needed and should be reclaimed
> > immediately; MADV_COOL is similar to MADV_FREE in a way that it hints the
> > kernel that memory region is not currently needed and should be reclaimed
> > when memory pressure rises.
> > 
> > To achieve the goal, the patchset introduce two new options for madvise.
> > One is MADV_COOL which will deactive activated pages and the other is
> > MADV_COLD which will reclaim private pages instantly. These new options
> > complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to
> > gain some free memory space. MADV_COLD is similar to MADV_DONTNEED in a way
> > that it hints the kernel that memory region is not currently needed and
> > should be reclaimed immediately; MADV_COOL is similar to MADV_FREE in a way
> > that it hints the kernel that memory region is not currently needed and
> > should be reclaimed when memory pressure rises.
> > 
> > This approach is similar in spirit to madvise(MADV_WONTNEED), but the
> > information required to make the reclaim decision is not known to the app.
> > Instead, it is known to a centralized userspace daemon, and that daemon
> > must be able to initiate reclaim on its own without any app involvement.
> > To solve the concern, this patch introduces new syscall -
> > 
> > 	struct pr_madvise_param {
> > 		int size;
> > 		const struct iovec *vec;
> > 	}
> > 
> > 	int process_madvise(int pidfd, ssize_t nr_elem, int *behavior,
> > 				struct pr_madvise_param *restuls,
> > 				struct pr_madvise_param *ranges,
> > 				unsigned long flags);
> > 
> > The syscall get pidfd to give hints to external process and provides
> > pair of result/ranges vector arguments so that it could give several
> > hints to each address range all at once.
> > 
> > I guess others have different ideas about the naming of syscall and options
> > so feel free to suggest better naming.
> > 
> > - Experiment
> > 
> > We did bunch of testing with several hundreds of real users, not artificial
> > benchmark on android. We saw about 17% cold start decreasement without any
> > significant battery/app startup latency issues. And with artificial benchmark
> > which launches and switching apps, we saw average 7% app launching improvement,
> > 18% less lmkd kill and good stat from vmstat.
> > 
> > A is vanilla and B is process_madvise.
> > 
> > 
> >                                        A          B      delta   ratio(%)
> >                allocstall_dma          0          0          0       0.00
> >            allocstall_movable       1464        457      -1007     -69.00
> >             allocstall_normal     263210     190763     -72447     -28.00
> >              allocstall_total     264674     191220     -73454     -28.00
> >           compact_daemon_wake      26912      25294      -1618      -7.00
> >                  compact_fail      17885      14151      -3734     -21.00
> >          compact_free_scanned 4204766409 3835994922 -368771487      -9.00
> >              compact_isolated    3446484    2967618    -478866     -14.00
> >       compact_migrate_scanned 1621336411 1324695710 -296640701     -19.00
> >                 compact_stall      19387      15343      -4044     -21.00
> >               compact_success       1502       1192       -310     -21.00
> > kswapd_high_wmark_hit_quickly        234        184        -50     -22.00
> >             kswapd_inodesteal     221635     233093      11458       5.00
> >  kswapd_low_wmark_hit_quickly      66065      54009     -12056     -19.00
> >                    nr_dirtied     259934     296476      36542      14.00
> >   nr_vmscan_immediate_reclaim       2587       2356       -231      -9.00
> >               nr_vmscan_write    1274232    2661733    1387501     108.00
> >                    nr_written    1514060    2937560    1423500      94.00
> >                    pageoutrun      67561      55133     -12428     -19.00
> >                    pgactivate    2335060    1984882    -350178     -15.00
> >                   pgalloc_dma   13743011   14096463     353452       2.00
> >               pgalloc_movable          0          0          0       0.00
> >                pgalloc_normal   18742440   16802065   -1940375     -11.00
> >                 pgalloc_total   32485451   30898528   -1586923      -5.00
> >                  pgdeactivate    4262210    2930670   -1331540     -32.00
> >                       pgfault   30812334   31085065     272731       0.00
> >                        pgfree   33553970   31765164   -1788806      -6.00
> >                  pginodesteal      33411      15084     -18327     -55.00
> >                   pglazyfreed          0          0          0       0.00
> >                    pgmajfault     551312    1508299     956987     173.00
> >                pgmigrate_fail      43927      29330     -14597     -34.00
> >             pgmigrate_success    1399851    1203922    -195929     -14.00
> >                        pgpgin   24141776   19032156   -5109620     -22.00
> >                       pgpgout     959344    1103316     143972      15.00
> >                  pgpgoutclean    4639732    3765868    -873864     -19.00
> >                      pgrefill    4884560    3006938   -1877622     -39.00
> >                     pgrotated      37828      25897     -11931     -32.00
> >                 pgscan_direct    1456037     957567    -498470     -35.00
> >        pgscan_direct_throttle          0          0          0       0.00
> >                 pgscan_kswapd    6667767    5047360   -1620407     -25.00
> >                  pgscan_total    8123804    6004927   -2118877     -27.00
> >                    pgskip_dma          0          0          0       0.00
> >                pgskip_movable          0          0          0       0.00
> >                 pgskip_normal      14907      25382      10475      70.00
> >                  pgskip_total      14907      25382      10475      70.00
> >                pgsteal_direct    1118986     690215    -428771     -39.00
> >                pgsteal_kswapd    4750223    3657107   -1093116     -24.00
> >                 pgsteal_total    5869209    4347322   -1521887     -26.00
> >                        pswpin     417613    1392647     975034     233.00
> >                       pswpout    1274224    2661731    1387507     108.00
> >                 slabs_scanned   13686905   10807200   -2879705     -22.00
> >           workingset_activate     668966     569444     -99522     -15.00
> >        workingset_nodereclaim      38957      32621      -6336     -17.00
> >            workingset_refault    2816795    2179782    -637013     -23.00
> >            workingset_restore     294320     168601    -125719     -43.00
> > 
> > pgmajfault is increased by 173% because swapin is increased by 200% by
> > process_madvise hint. However, swap read based on zram is much cheaper
> > than file IO in performance point of view and app hot start by swapin is
> > also cheaper than cold start from the beginning of app which needs many IO
> > from storage and initialization steps.
> > 
> > This patchset is against on next-20190517.
> > 
> > Minchan Kim (7):
> >   mm: introduce MADV_COOL
> >   mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM
> >   mm: introduce MADV_COLD
> >   mm: factor out madvise's core functionality
> >   mm: introduce external memory hinting API
> >   mm: extend process_madvise syscall to support vector arrary
> >   mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
> > 
> >  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
> >  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
> >  include/linux/page-flags.h             |   1 +
> >  include/linux/page_idle.h              |  15 +
> >  include/linux/proc_fs.h                |   1 +
> >  include/linux/swap.h                   |   2 +
> >  include/linux/syscalls.h               |   2 +
> >  include/uapi/asm-generic/mman-common.h |  12 +
> >  include/uapi/asm-generic/unistd.h      |   2 +
> >  kernel/signal.c                        |   2 +-
> >  kernel/sys_ni.c                        |   1 +
> >  mm/madvise.c                           | 600 +++++++++++++++++++++----
> >  mm/swap.c                              |  43 ++
> >  mm/vmscan.c                            |  80 +++-
> >  14 files changed, 680 insertions(+), 83 deletions(-)
> > 
> > -- 
> > 2.21.0.1020.gf2820cf01a-goog
> > 
> 
> Please Cc me for the next iteration since I was working on the very same
> thing recently [1].
> 
> Thank you.
> 
> [1] https://gitlab.com/post-factum/pf-kernel/commits/remote-madvise-v3

Sure, I'm happy to see others have similar requirement.


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 0/7] introduce memory hinting API for external process
  2019-05-20 16:46 ` Johannes Weiner
@ 2019-05-21  4:39   ` Minchan Kim
  2019-05-21  6:32     ` Michal Hocko
  0 siblings, 1 reply; 138+ messages in thread
From: Minchan Kim @ 2019-05-21  4:39 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, LKML, linux-mm, Michal Hocko, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon

On Mon, May 20, 2019 at 12:46:05PM -0400, Johannes Weiner wrote:
> On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote:
> > - Approach
> > 
> > The approach we chose was to use a new interface to allow userspace to
> > proactively reclaim entire processes by leveraging platform information.
> > This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
> > that are known to be cold from userspace and to avoid races with lmkd
> > by reclaiming apps as soon as they entered the cached state. Additionally,
> > it could provide many chances for platform to use much information to
> > optimize memory efficiency.
> > 
> > IMHO we should spell it out that this patchset complements MADV_WONTNEED
> > and MADV_FREE by adding non-destructive ways to gain some free memory
> > space. MADV_COLD is similar to MADV_WONTNEED in a way that it hints the
> > kernel that memory region is not currently needed and should be reclaimed
> > immediately; MADV_COOL is similar to MADV_FREE in a way that it hints the
> > kernel that memory region is not currently needed and should be reclaimed
> > when memory pressure rises.
> 
> I agree with this approach and the semantics. But these names are very
> vague and extremely easy to confuse since they're so similar.
> 
> MADV_COLD could be a good name, but for deactivating pages, not
> reclaiming them - marking memory "cold" on the LRU for later reclaim.
> 
> For the immediate reclaim one, I think there is a better option too:
> In virtual memory speak, putting a page into secondary storage (or
> ensuring it's already there), and then freeing its in-memory copy, is
> called "paging out". And that's what this flag is supposed to do. So
> how about MADV_PAGEOUT?
> 
> With that, we'd have:
> 
> MADV_FREE: Mark data invalid, free memory when needed
> MADV_DONTNEED: Mark data invalid, free memory immediately
> 
> MADV_COLD: Data is not used for a while, free memory when needed
> MADV_PAGEOUT: Data is not used for a while, free memory immediately
> 
> What do you think?

There are several suggestions until now. Thanks, Folks!

For deactivating:

- MADV_COOL
- MADV_RECLAIM_LAZY
- MADV_DEACTIVATE
- MADV_COLD
- MADV_FREE_PRESERVE


For reclaiming:

- MADV_COLD
- MADV_RECLAIM_NOW
- MADV_RECLAIMING
- MADV_PAGEOUT
- MADV_DONTNEED_PRESERVE

It seems everybody doesn't like MADV_COLD so want to go with other.
For consisteny of view with other existing hints of madvise, -preserve
postfix suits well. However, originally, I don't like the naming FREE
vs DONTNEED from the beginning. They were easily confused.
I prefer PAGEOUT to RECLAIM since it's more likely to be nuance to
represent reclaim with memory pressure and is supposed to paged-in
if someone need it later. So, it imply PRESERVE.
If there is not strong against it, I want to go with MADV_COLD and
MADV_PAGEOUT.

Other opinion?


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 0/7] introduce memory hinting API for external process
  2019-05-21  1:44 ` Matthew Wilcox
@ 2019-05-21  5:01   ` Minchan Kim
  2019-05-21  6:34   ` Michal Hocko
  1 sibling, 0 replies; 138+ messages in thread
From: Minchan Kim @ 2019-05-21  5:01 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, LKML, linux-mm, Michal Hocko, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon

On Mon, May 20, 2019 at 06:44:52PM -0700, Matthew Wilcox wrote:
> On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote:
> > IMHO we should spell it out that this patchset complements MADV_WONTNEED
> > and MADV_FREE by adding non-destructive ways to gain some free memory
> > space. MADV_COLD is similar to MADV_WONTNEED in a way that it hints the
> > kernel that memory region is not currently needed and should be reclaimed
> > immediately; MADV_COOL is similar to MADV_FREE in a way that it hints the
> > kernel that memory region is not currently needed and should be reclaimed
> > when memory pressure rises.
> 
> Do we tear down page tables for these ranges?  That seems like a good

True for MADV_COLD(reclaiming) but false for MADV_COOL(deactivating) at
this implementation.

> way of reclaiming potentially a substantial amount of memory.

Given that consider refauting are spread out over time and reclaim occurs
in burst, that does make sense to speed up the reclaiming. However, a
concern to me is anonymous pages since they need swap cache insertion,
which would be wasteful if they are not reclaimed, finally.


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 0/7] introduce memory hinting API for external process
  2019-05-21  2:55     ` Anshuman Khandual
@ 2019-05-21  5:14       ` Minchan Kim
  2019-05-21 10:34       ` Michal Hocko
  2019-05-21 12:56       ` Shakeel Butt
  2 siblings, 0 replies; 138+ messages in thread
From: Minchan Kim @ 2019-05-21  5:14 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Tim Murray, Andrew Morton, LKML, linux-mm, Michal Hocko,
	Johannes Weiner, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon

On Tue, May 21, 2019 at 08:25:55AM +0530, Anshuman Khandual wrote:
> 
> 
> On 05/20/2019 10:29 PM, Tim Murray wrote:
> > On Sun, May 19, 2019 at 11:37 PM Anshuman Khandual
> > <anshuman.khandual@arm.com> wrote:
> >>
> >> Or Is the objective here is reduce the number of processes which get killed by
> >> lmkd by triggering swapping for the unused memory (user hinted) sooner so that
> >> they dont get picked by lmkd. Under utilization for zram hardware is a concern
> >> here as well ?
> > 
> > The objective is to avoid some instances of memory pressure by
> > proactively swapping pages that userspace knows to be cold before
> > those pages reach the end of the LRUs, which in turn can prevent some
> > apps from being killed by lmk/lmkd. As soon as Android userspace knows
> > that an application is not being used and is only resident to improve
> > performance if the user returns to that app, we can kick off
> > process_madvise on that process's pages (or some portion of those
> > pages) in a power-efficient way to reduce memory pressure long before
> > the system hits the free page watermark. This allows the system more
> > time to put pages into zram versus waiting for the watermark to
> > trigger kswapd, which decreases the likelihood that later memory
> > allocations will cause enough pressure to trigger a kill of one of
> > these apps.
> 
> So this opens up bit of LRU management to user space hints. Also because the app
> in itself wont know about the memory situation of the entire system, new system
> call needs to be called from an external process.

That's why process_madvise is introduced here.

> 
> > 
> >> Swapping out memory into zram wont increase the latency for a hot start ? Or
> >> is it because as it will prevent a fresh cold start which anyway will be slower
> >> than a slow hot start. Just being curious.
> > 
> > First, not all swapped pages will be reloaded immediately once an app
> > is resumed. We've found that an app's working set post-process_madvise
> > is significantly smaller than what an app allocates when it first
> > launches (see the delta between pswpin and pswpout in Minchan's
> > results). Presumably because of this, faulting to fetch from zram does
> 
> pswpin      417613    1392647     975034     233.00
> pswpout    1274224    2661731    1387507     108.00
> 
> IIUC the swap-in ratio is way higher in comparison to that of swap out. Is that
> always the case ? Or it tend to swap out from an active area of the working set
> which faulted back again.

I think it's because apps are alive longer via reducing being killed
so turn into from pgpgin to swapin.

> 
> > not seem to introduce a noticeable hot start penalty, not does it
> > cause an increase in performance problems later in the app's
> > lifecycle. I've measured with and without process_madvise, and the
> > differences are within our noise bounds. Second, because we're not
> 
> That is assuming that post process_madvise() working set for the application is
> always smaller. There is another challenge. The external process should ideally
> have the knowledge of active areas of the working set for an application in
> question for it to invoke process_madvise() correctly to prevent such scenarios.

There are several ways to detect workingset more accurately at the cost
of runtime. For example, with idle page tracking or clear_refs. Accuracy
is always trade-off of overhead for LRU aging.

> 
> > preemptively evicting file pages and only making them more likely to
> > be evicted when there's already memory pressure, we avoid the case
> > where we process_madvise an app then immediately return to the app and
> > reload all file pages in the working set even though there was no
> > intervening memory pressure. Our initial version of this work evicted
> 
> That would be the worst case scenario which should be avoided. Memory pressure
> must be a parameter before actually doing the swap out. But pages if know to be
> inactive/cold can be marked high priority to be swapped out.
> 
> > file pages preemptively and did cause a noticeable slowdown (~15%) for
> > that case; this patch set avoids that slowdown. Finally, the benefit
> > from avoiding cold starts is huge. The performance improvement from
> > having a hot start instead of a cold start ranges from 3x for very
> > small apps to 50x+ for larger apps like high-fidelity games.
> 
> Is there any other real world scenario apart from this app based ecosystem where
> user hinted LRU management might be helpful ? Just being curious. Thanks for the
> detailed explanation. I will continue looking into this series.


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 1/7] mm: introduce MADV_COOL
  2019-05-20 22:54     ` Minchan Kim
@ 2019-05-21  6:04       ` Michal Hocko
  2019-05-21  9:11         ` Minchan Kim
  0 siblings, 1 reply; 138+ messages in thread
From: Michal Hocko @ 2019-05-21  6:04 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Tue 21-05-19 07:54:19, Minchan Kim wrote:
> On Mon, May 20, 2019 at 10:16:21AM +0200, Michal Hocko wrote:
[...]
> > > Internally, it works via deactivating memory from active list to
> > > inactive's head so when the memory pressure happens, they will be
> > > reclaimed earlier than other active pages unless there is no
> > > access until the time.
> > 
> > Could you elaborate about the decision to move to the head rather than
> > tail? What should happen to inactive pages? Should we move them to the
> > tail? Your implementation seems to ignore those completely. Why?
> 
> Normally, inactive LRU could have used-once pages without any mapping
> to user's address space. Such pages would be better candicate to
> reclaim when the memory pressure happens. With deactivating only
> active LRU pages of the process to the head of inactive LRU, we will
> keep them in RAM longer than used-once pages and could have more chance
> to be activated once the process is resumed.

You are making some assumptions here. You have an explicit call what is
cold now you are assuming something is even colder. Is this assumption a
general enough to make people depend on it? Not that we wouldn't be able
to change to logic later but that will always be risky - especially in
the area when somebody want to make a user space driven memory
management.
 
> > What should happen for shared pages? In other words do we want to allow
> > less privileged process to control evicting of shared pages with a more
> > privileged one? E.g. think of all sorts of side channel attacks. Maybe
> > we want to do the same thing as for mincore where write access is
> > required.
> 
> It doesn't work with shared pages(ie, page_mapcount > 1). I will add it
> in the description.

OK, this is good for the starter. It makes the implementation simpler
and we can add shared mappings coverage later.

Although I would argue that touching only writeable mappings should be
reasonably safe.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 3/7] mm: introduce MADV_COLD
  2019-05-20 23:00     ` Minchan Kim
@ 2019-05-21  6:08       ` Michal Hocko
  2019-05-21  9:13         ` Minchan Kim
  0 siblings, 1 reply; 138+ messages in thread
From: Michal Hocko @ 2019-05-21  6:08 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Tue 21-05-19 08:00:38, Minchan Kim wrote:
> On Mon, May 20, 2019 at 10:27:03AM +0200, Michal Hocko wrote:
> > [Cc linux-api]
> > 
> > On Mon 20-05-19 12:52:50, Minchan Kim wrote:
> > > When a process expects no accesses to a certain memory range
> > > for a long time, it could hint kernel that the pages can be
> > > reclaimed instantly but data should be preserved for future use.
> > > This could reduce workingset eviction so it ends up increasing
> > > performance.
> > > 
> > > This patch introduces the new MADV_COLD hint to madvise(2)
> > > syscall. MADV_COLD can be used by a process to mark a memory range
> > > as not expected to be used for a long time. The hint can help
> > > kernel in deciding which pages to evict proactively.
> > 
> > As mentioned in other email this looks like a non-destructive
> > MADV_DONTNEED alternative.
> > 
> > > Internally, it works via reclaiming memory in process context
> > > the syscall is called. If the page is dirty but backing storage
> > > is not synchronous device, the written page will be rotate back
> > > into LRU's tail once the write is done so they will reclaim easily
> > > when memory pressure happens. If backing storage is
> > > synchrnous device(e.g., zram), hte page will be reclaimed instantly.
> > 
> > Why do we special case async backing storage? Please always try to
> > explain _why_ the decision is made.
> 
> I didn't make any decesion. ;-) That's how current reclaim works to
> avoid latency of freeing page in interrupt context. I had a patchset
> to resolve the concern a few years ago but got distracted.

Please articulate that in the changelog then. Or even do not go into
implementation details and stick with - reuse the current reclaim
implementation. If you call out some of the specific details you are
risking people will start depending on them. The fact that this reuses
the currect reclaim logic is enough from the review point of view
because we know that there is no additional special casing to worry
about.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 5/7] mm: introduce external memory hinting API
  2019-05-21  2:41     ` Minchan Kim
@ 2019-05-21  6:17       ` Michal Hocko
  2019-05-21 10:32         ` Minchan Kim
  0 siblings, 1 reply; 138+ messages in thread
From: Michal Hocko @ 2019-05-21  6:17 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Tue 21-05-19 11:41:07, Minchan Kim wrote:
> On Mon, May 20, 2019 at 11:18:29AM +0200, Michal Hocko wrote:
> > [Cc linux-api]
> > 
> > On Mon 20-05-19 12:52:52, Minchan Kim wrote:
> > > There is some usecase that centralized userspace daemon want to give
> > > a memory hint like MADV_[COOL|COLD] to other process. Android's
> > > ActivityManagerService is one of them.
> > > 
> > > It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > > required to make the reclaim decision is not known to the app. Instead,
> > > it is known to the centralized userspace daemon(ActivityManagerService),
> > > and that daemon must be able to initiate reclaim on its own without
> > > any app involvement.
> > 
> > Could you expand some more about how this all works? How does the
> > centralized daemon track respective ranges? How does it synchronize
> > against parallel modification of the address space etc.
> 
> Currently, we don't track each address ranges because we have two
> policies at this moment:
> 
> 	deactive file pages and reclaim anonymous pages of the app.
> 
> Since the daemon has a ability to let background apps resume(IOW, process
> will be run by the daemon) and both hints are non-disruptive stabilty point
> of view, we are okay for the race.

Fair enough but the API should consider future usecases where this might
be a problem. So we should really think about those potential scenarios
now. If we are ok with that, fine, but then we should be explicit and
document it that way. Essentially say that any sort of synchronization
is supposed to be done by monitor. This will make the API less usable
but maybe that is enough.
 
> > > To solve the issue, this patch introduces new syscall process_madvise(2)
> > > which works based on pidfd so it could give a hint to the exeternal
> > > process.
> > > 
> > > int process_madvise(int pidfd, void *addr, size_t length, int advise);
> > 
> > OK, this makes some sense from the API point of view. When we have
> > discussed that at LSFMM I was contemplating about something like that
> > except the fd would be a VMA fd rather than the process. We could extend
> > and reuse /proc/<pid>/map_files interface which doesn't support the
> > anonymous memory right now. 
> > 
> > I am not saying this would be a better interface but I wanted to mention
> > it here for a further discussion. One slight advantage would be that
> > you know the exact object that you are operating on because you have a
> > fd for the VMA and we would have a more straightforward way to reject
> > operation if the underlying object has changed (e.g. unmapped and reused
> > for a different mapping).
> 
> I agree your point. If I didn't miss something, such kinds of vma level
> modify notification doesn't work even file mapped vma at this moment.
> For anonymous vma, I think we could use userfaultfd, pontentially.
> It would be great if someone want to do with disruptive hints like
> MADV_DONTNEED.
> 
> I'd like to see it further enhancement after landing address range based
> operation via limiting hints process_madvise supports to non-disruptive
> only(e.g., MADV_[COOL|COLD]) so that we could catch up the usercase/workload
> when someone want to extend the API.

So do you think we want both interfaces (process_madvise and madvisefd)?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector arrary
  2019-05-21  2:48     ` Minchan Kim
@ 2019-05-21  6:24       ` Michal Hocko
  2019-05-21 10:26         ` Minchan Kim
  0 siblings, 1 reply; 138+ messages in thread
From: Michal Hocko @ 2019-05-21  6:24 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Tue 21-05-19 11:48:20, Minchan Kim wrote:
> On Mon, May 20, 2019 at 11:22:58AM +0200, Michal Hocko wrote:
> > [Cc linux-api]
> > 
> > On Mon 20-05-19 12:52:53, Minchan Kim wrote:
> > > Currently, process_madvise syscall works for only one address range
> > > so user should call the syscall several times to give hints to
> > > multiple address range.
> > 
> > Is that a problem? How big of a problem? Any numbers?
> 
> We easily have 2000+ vma so it's not trivial overhead. I will come up
> with number in the description at respin.

Does this really have to be a fast operation? I would expect the monitor
is by no means a fast path. The system call overhead is not what it used
to be, sigh, but still for something that is not a hot path it should be
tolerable, especially when the whole operation is quite expensive on its
own (wrt. the syscall entry/exit).

I am not saying we do not need a multiplexing API, I am just not sure
we need it right away. Btw. there was some demand for other MM syscalls
to provide a multiplexing API (e.g. mprotect), maybe it would be better
to handle those in one go?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-21  2:55     ` Minchan Kim
@ 2019-05-21  6:26       ` Michal Hocko
  2019-05-27  7:58         ` Minchan Kim
  2019-05-21 15:33       ` Johannes Weiner
  1 sibling, 1 reply; 138+ messages in thread
From: Michal Hocko @ 2019-05-21  6:26 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Tue 21-05-19 11:55:33, Minchan Kim wrote:
> On Mon, May 20, 2019 at 11:28:01AM +0200, Michal Hocko wrote:
> > [cc linux-api]
> > 
> > On Mon 20-05-19 12:52:54, Minchan Kim wrote:
> > > System could have much faster swap device like zRAM. In that case, swapping
> > > is extremely cheaper than file-IO on the low-end storage.
> > > In this configuration, userspace could handle different strategy for each
> > > kinds of vma. IOW, they want to reclaim anonymous pages by MADV_COLD
> > > while it keeps file-backed pages in inactive LRU by MADV_COOL because
> > > file IO is more expensive in this case so want to keep them in memory
> > > until memory pressure happens.
> > > 
> > > To support such strategy easier, this patch introduces
> > > MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER options in madvise(2) like
> > > that /proc/<pid>/clear_refs already has supported same filters.
> > > They are filters could be Ored with other existing hints using top two bits
> > > of (int behavior).
> > 
> > madvise operates on top of ranges and it is quite trivial to do the
> > filtering from the userspace so why do we need any additional filtering?
> > 
> > > Once either of them is set, the hint could affect only the interested vma
> > > either anonymous or file-backed.
> > > 
> > > With that, user could call a process_madvise syscall simply with a entire
> > > range(0x0 - 0xFFFFFFFFFFFFFFFF) but either of MADV_ANONYMOUS_FILTER and
> > > MADV_FILE_FILTER so there is no need to call the syscall range by range.
> > 
> > OK, so here is the reason you want that. The immediate question is why
> > cannot the monitor do the filtering from the userspace. Slightly more
> > work, all right, but less of an API to expose and that itself is a
> > strong argument against.
> 
> What I should do if we don't have such filter option is to enumerate all of
> vma via /proc/<pid>/maps and then parse every ranges and inode from string,
> which would be painful for 2000+ vmas.

Painful is not an argument to add a new user API. If the existing API
suits the purpose then it should be used. If it is not usable, we can
think of a different way.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 0/7] introduce memory hinting API for external process
  2019-05-21  4:39   ` Minchan Kim
@ 2019-05-21  6:32     ` Michal Hocko
  0 siblings, 0 replies; 138+ messages in thread
From: Michal Hocko @ 2019-05-21  6:32 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Johannes Weiner, Andrew Morton, LKML, linux-mm, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

[Cc linux-api]

On Tue 21-05-19 13:39:50, Minchan Kim wrote:
> On Mon, May 20, 2019 at 12:46:05PM -0400, Johannes Weiner wrote:
> > On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote:
> > > - Approach
> > > 
> > > The approach we chose was to use a new interface to allow userspace to
> > > proactively reclaim entire processes by leveraging platform information.
> > > This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
> > > that are known to be cold from userspace and to avoid races with lmkd
> > > by reclaiming apps as soon as they entered the cached state. Additionally,
> > > it could provide many chances for platform to use much information to
> > > optimize memory efficiency.
> > > 
> > > IMHO we should spell it out that this patchset complements MADV_WONTNEED
> > > and MADV_FREE by adding non-destructive ways to gain some free memory
> > > space. MADV_COLD is similar to MADV_WONTNEED in a way that it hints the
> > > kernel that memory region is not currently needed and should be reclaimed
> > > immediately; MADV_COOL is similar to MADV_FREE in a way that it hints the
> > > kernel that memory region is not currently needed and should be reclaimed
> > > when memory pressure rises.
> > 
> > I agree with this approach and the semantics. But these names are very
> > vague and extremely easy to confuse since they're so similar.
> > 
> > MADV_COLD could be a good name, but for deactivating pages, not
> > reclaiming them - marking memory "cold" on the LRU for later reclaim.
> > 
> > For the immediate reclaim one, I think there is a better option too:
> > In virtual memory speak, putting a page into secondary storage (or
> > ensuring it's already there), and then freeing its in-memory copy, is
> > called "paging out". And that's what this flag is supposed to do. So
> > how about MADV_PAGEOUT?
> > 
> > With that, we'd have:
> > 
> > MADV_FREE: Mark data invalid, free memory when needed
> > MADV_DONTNEED: Mark data invalid, free memory immediately
> > 
> > MADV_COLD: Data is not used for a while, free memory when needed
> > MADV_PAGEOUT: Data is not used for a while, free memory immediately
> > 
> > What do you think?
> 
> There are several suggestions until now. Thanks, Folks!
> 
> For deactivating:
> 
> - MADV_COOL
> - MADV_RECLAIM_LAZY
> - MADV_DEACTIVATE
> - MADV_COLD
> - MADV_FREE_PRESERVE
> 
> 
> For reclaiming:
> 
> - MADV_COLD
> - MADV_RECLAIM_NOW
> - MADV_RECLAIMING
> - MADV_PAGEOUT
> - MADV_DONTNEED_PRESERVE
> 
> It seems everybody doesn't like MADV_COLD so want to go with other.
> For consisteny of view with other existing hints of madvise, -preserve
> postfix suits well. However, originally, I don't like the naming FREE
> vs DONTNEED from the beginning. They were easily confused.
> I prefer PAGEOUT to RECLAIM since it's more likely to be nuance to
> represent reclaim with memory pressure and is supposed to paged-in
> if someone need it later. So, it imply PRESERVE.
> If there is not strong against it, I want to go with MADV_COLD and
> MADV_PAGEOUT.
> 
> Other opinion?

I do not really care strongly. I am pretty sure we will have a lot of
suggestions because people tend to be good at arguing about that...
Anyway, unlike DONTNEED/FREE we do not have any other OS to implement
these features, right? So we shouldn't be tight to existing names.
On the other hand I kinda like the reference to the existing names but
DEACTIVATE/PAGEOUT seem a good fit to me as well. Unless there is way
much better name suggested I would go with one of those. Up to you.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 0/7] introduce memory hinting API for external process
  2019-05-21  1:44 ` Matthew Wilcox
  2019-05-21  5:01   ` Minchan Kim
@ 2019-05-21  6:34   ` Michal Hocko
  1 sibling, 0 replies; 138+ messages in thread
From: Michal Hocko @ 2019-05-21  6:34 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon,
	linux-api

[linux-api]

On Mon 20-05-19 18:44:52, Matthew Wilcox wrote:
> On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote:
> > IMHO we should spell it out that this patchset complements MADV_WONTNEED
> > and MADV_FREE by adding non-destructive ways to gain some free memory
> > space. MADV_COLD is similar to MADV_WONTNEED in a way that it hints the
> > kernel that memory region is not currently needed and should be reclaimed
> > immediately; MADV_COOL is similar to MADV_FREE in a way that it hints the
> > kernel that memory region is not currently needed and should be reclaimed
> > when memory pressure rises.
> 
> Do we tear down page tables for these ranges?  That seems like a good
> way of reclaiming potentially a substantial amount of memory.

I do not think we can in general because this is a non-destructive
operation. So at least we cannot tear down anonymous ptes (they will
turn into swap entries).

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 4/7] mm: factor out madvise's core functionality
  2019-05-21  1:26     ` Minchan Kim
@ 2019-05-21  6:36       ` Oleksandr Natalenko
  2019-05-21  6:50         ` Michal Hocko
  2019-05-21 10:49         ` Minchan Kim
  0 siblings, 2 replies; 138+ messages in thread
From: Oleksandr Natalenko @ 2019-05-21  6:36 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Michal Hocko, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon

Hi.

On Tue, May 21, 2019 at 10:26:49AM +0900, Minchan Kim wrote:
> On Mon, May 20, 2019 at 04:26:33PM +0200, Oleksandr Natalenko wrote:
> > Hi.
> > 
> > On Mon, May 20, 2019 at 12:52:51PM +0900, Minchan Kim wrote:
> > > This patch factor out madvise's core functionality so that upcoming
> > > patch can reuse it without duplication.
> > > 
> > > It shouldn't change any behavior.
> > > 
> > > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > > ---
> > >  mm/madvise.c | 168 +++++++++++++++++++++++++++------------------------
> > >  1 file changed, 89 insertions(+), 79 deletions(-)
> > > 
> > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > index 9a6698b56845..119e82e1f065 100644
> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -742,7 +742,8 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
> > >  	return 0;
> > >  }
> > >  
> > > -static long madvise_dontneed_free(struct vm_area_struct *vma,
> > > +static long madvise_dontneed_free(struct task_struct *tsk,
> > > +				  struct vm_area_struct *vma,
> > >  				  struct vm_area_struct **prev,
> > >  				  unsigned long start, unsigned long end,
> > >  				  int behavior)
> > > @@ -754,8 +755,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
> > >  	if (!userfaultfd_remove(vma, start, end)) {
> > >  		*prev = NULL; /* mmap_sem has been dropped, prev is stale */
> > >  
> > > -		down_read(&current->mm->mmap_sem);
> > > -		vma = find_vma(current->mm, start);
> > > +		down_read(&tsk->mm->mmap_sem);
> > > +		vma = find_vma(tsk->mm, start);
> > >  		if (!vma)
> > >  			return -ENOMEM;
> > >  		if (start < vma->vm_start) {
> > > @@ -802,7 +803,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
> > >   * Application wants to free up the pages and associated backing store.
> > >   * This is effectively punching a hole into the middle of a file.
> > >   */
> > > -static long madvise_remove(struct vm_area_struct *vma,
> > > +static long madvise_remove(struct task_struct *tsk,
> > > +				struct vm_area_struct *vma,
> > >  				struct vm_area_struct **prev,
> > >  				unsigned long start, unsigned long end)
> > >  {
> > > @@ -836,13 +838,13 @@ static long madvise_remove(struct vm_area_struct *vma,
> > >  	get_file(f);
> > >  	if (userfaultfd_remove(vma, start, end)) {
> > >  		/* mmap_sem was not released by userfaultfd_remove() */
> > > -		up_read(&current->mm->mmap_sem);
> > > +		up_read(&tsk->mm->mmap_sem);
> > >  	}
> > >  	error = vfs_fallocate(f,
> > >  				FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> > >  				offset, end - start);
> > >  	fput(f);
> > > -	down_read(&current->mm->mmap_sem);
> > > +	down_read(&tsk->mm->mmap_sem);
> > >  	return error;
> > >  }
> > >  
> > > @@ -916,12 +918,13 @@ static int madvise_inject_error(int behavior,
> > >  #endif
> > 
> > What about madvise_inject_error() and get_user_pages_fast() in it
> > please?
> 
> Good point. Maybe, there more places where assume context is "current" so
> I'm thinking to limit hints we could allow from external process.
> It would be better for maintainance point of view in that we could know
> the workload/usecases when someone ask new advises from external process
> without making every hints works both contexts.

Well, for madvise_inject_error() we still have a remote variant of
get_user_pages(), and that should work, no?

Regarding restricting the hints, I'm definitely interested in having
remote MADV_MERGEABLE/MADV_UNMERGEABLE. But, OTOH, doing it via remote
madvise() introduces another issue with traversing remote VMAs reliably.
IIUC, one can do this via userspace by parsing [s]maps file only, which
is not very consistent, and once some range is parsed, and then it is
immediately gone, a wrong hint will be sent.

Isn't this a problem we should worry about?

> 
> Thanks.

-- 
  Best regards,
    Oleksandr Natalenko (post-factum)
    Senior Software Maintenance Engineer


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 4/7] mm: factor out madvise's core functionality
  2019-05-21  6:36       ` Oleksandr Natalenko
@ 2019-05-21  6:50         ` Michal Hocko
  2019-05-21  7:06           ` Oleksandr Natalenko
  2019-05-21 10:49         ` Minchan Kim
  1 sibling, 1 reply; 138+ messages in thread
From: Michal Hocko @ 2019-05-21  6:50 UTC (permalink / raw)
  To: Oleksandr Natalenko
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon

On Tue 21-05-19 08:36:28, Oleksandr Natalenko wrote:
[...]
> Regarding restricting the hints, I'm definitely interested in having
> remote MADV_MERGEABLE/MADV_UNMERGEABLE. But, OTOH, doing it via remote
> madvise() introduces another issue with traversing remote VMAs reliably.
> IIUC, one can do this via userspace by parsing [s]maps file only, which
> is not very consistent, and once some range is parsed, and then it is
> immediately gone, a wrong hint will be sent.
> 
> Isn't this a problem we should worry about?

See http://lkml.kernel.org/r/20190520091829.GY6836@dhcp22.suse.cz

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 4/7] mm: factor out madvise's core functionality
  2019-05-21  6:50         ` Michal Hocko
@ 2019-05-21  7:06           ` Oleksandr Natalenko
  2019-05-21 10:52             ` Minchan Kim
  0 siblings, 1 reply; 138+ messages in thread
From: Oleksandr Natalenko @ 2019-05-21  7:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon

Hi.

On Tue, May 21, 2019 at 08:50:00AM +0200, Michal Hocko wrote:
> On Tue 21-05-19 08:36:28, Oleksandr Natalenko wrote:
> [...]
> > Regarding restricting the hints, I'm definitely interested in having
> > remote MADV_MERGEABLE/MADV_UNMERGEABLE. But, OTOH, doing it via remote
> > madvise() introduces another issue with traversing remote VMAs reliably.
> > IIUC, one can do this via userspace by parsing [s]maps file only, which
> > is not very consistent, and once some range is parsed, and then it is
> > immediately gone, a wrong hint will be sent.
> > 
> > Isn't this a problem we should worry about?
> 
> See http://lkml.kernel.org/r/20190520091829.GY6836@dhcp22.suse.cz

Oh, thanks for the pointer.

Indeed, for my specific task with remote KSM I'd go with map_files
instead. This doesn't solve the task completely in case of traversal
through all the VMAs in one pass, but makes it easier comparing to a
remote syscall.

-- 
  Best regards,
    Oleksandr Natalenko (post-factum)
    Senior Software Maintenance Engineer


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 0/7] introduce memory hinting API for external process
  2019-05-20  3:52 [RFC 0/7] introduce memory hinting API for external process Minchan Kim
                   ` (11 preceding siblings ...)
  2019-05-21  1:44 ` Matthew Wilcox
@ 2019-05-21  8:42 ` Christian Brauner
  2019-05-21 11:05   ` Minchan Kim
  2019-05-21 12:53 ` Shakeel Butt
  13 siblings, 1 reply; 138+ messages in thread
From: Christian Brauner @ 2019-05-21  8:42 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Michal Hocko, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh

On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote:
> - Background
> 
> The Android terminology used for forking a new process and starting an app
> from scratch is a cold start, while resuming an existing app is a hot start.
> While we continually try to improve the performance of cold starts, hot
> starts will always be significantly less power hungry as well as faster so
> we are trying to make hot start more likely than cold start.
> 
> To increase hot start, Android userspace manages the order that apps should
> be killed in a process called ActivityManagerService. ActivityManagerService
> tracks every Android app or service that the user could be interacting with
> at any time and translates that into a ranked list for lmkd(low memory
> killer daemon). They are likely to be killed by lmkd if the system has to
> reclaim memory. In that sense they are similar to entries in any other cache.
> Those apps are kept alive for opportunistic performance improvements but
> those performance improvements will vary based on the memory requirements of
> individual workloads.
> 
> - Problem
> 
> Naturally, cached apps were dominant consumers of memory on the system.
> However, they were not significant consumers of swap even though they are
> good candidate for swap. Under investigation, swapping out only begins
> once the low zone watermark is hit and kswapd wakes up, but the overall
> allocation rate in the system might trip lmkd thresholds and cause a cached
> process to be killed(we measured performance swapping out vs. zapping the
> memory by killing a process. Unsurprisingly, zapping is 10x times faster
> even though we use zram which is much faster than real storage) so kill
> from lmkd will often satisfy the high zone watermark, resulting in very
> few pages actually being moved to swap.
> 
> - Approach
> 
> The approach we chose was to use a new interface to allow userspace to
> proactively reclaim entire processes by leveraging platform information.
> This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
> that are known to be cold from userspace and to avoid races with lmkd
> by reclaiming apps as soon as they entered the cached state. Additionally,
> it could provide many chances for platform to use much information to
> optimize memory efficiency.
> 
> IMHO we should spell it out that this patchset complements MADV_WONTNEED
> and MADV_FREE by adding non-destructive ways to gain some free memory
> space. MADV_COLD is similar to MADV_WONTNEED in a way that it hints the
> kernel that memory region is not currently needed and should be reclaimed
> immediately; MADV_COOL is similar to MADV_FREE in a way that it hints the
> kernel that memory region is not currently needed and should be reclaimed
> when memory pressure rises.
> 
> To achieve the goal, the patchset introduce two new options for madvise.
> One is MADV_COOL which will deactive activated pages and the other is
> MADV_COLD which will reclaim private pages instantly. These new options
> complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to
> gain some free memory space. MADV_COLD is similar to MADV_DONTNEED in a way
> that it hints the kernel that memory region is not currently needed and
> should be reclaimed immediately; MADV_COOL is similar to MADV_FREE in a way
> that it hints the kernel that memory region is not currently needed and
> should be reclaimed when memory pressure rises.
> 
> This approach is similar in spirit to madvise(MADV_WONTNEED), but the
> information required to make the reclaim decision is not known to the app.
> Instead, it is known to a centralized userspace daemon, and that daemon
> must be able to initiate reclaim on its own without any app involvement.
> To solve the concern, this patch introduces new syscall -
> 
> 	struct pr_madvise_param {
> 		int size;
> 		const struct iovec *vec;
> 	}
> 
> 	int process_madvise(int pidfd, ssize_t nr_elem, int *behavior,
> 				struct pr_madvise_param *restuls,
> 				struct pr_madvise_param *ranges,
> 				unsigned long flags);
> 
> The syscall get pidfd to give hints to external process and provides
> pair of result/ranges vector arguments so that it could give several
> hints to each address range all at once.
> 
> I guess others have different ideas about the naming of syscall and options
> so feel free to suggest better naming.

Yes, all new syscalls making use of pidfds should be named
pidfd_<action>. So please make this pidfd_madvise.

Please make sure to Cc me on this in the future as I'm maintaining
pidfds. Would be great to have Jann on this too since he's been touching
both mm and parts of the pidfd stuff with me.


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 5/7] mm: introduce external memory hinting API
  2019-05-20  3:52 ` [RFC 5/7] mm: introduce external memory hinting API Minchan Kim
  2019-05-20  9:18   ` Michal Hocko
@ 2019-05-21  9:01   ` Christian Brauner
  2019-05-21 11:35     ` Minchan Kim
  2019-05-21 15:31   ` Oleg Nesterov
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 138+ messages in thread
From: Christian Brauner @ 2019-05-21  9:01 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Michal Hocko, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
	oleg

Cc: Jann and Oleg too

On Mon, May 20, 2019 at 12:52:52PM +0900, Minchan Kim wrote:
> There is some usecase that centralized userspace daemon want to give
> a memory hint like MADV_[COOL|COLD] to other process. Android's
> ActivityManagerService is one of them.
> 
> It's similar in spirit to madvise(MADV_WONTNEED), but the information
> required to make the reclaim decision is not known to the app. Instead,
> it is known to the centralized userspace daemon(ActivityManagerService),
> and that daemon must be able to initiate reclaim on its own without
> any app involvement.
> 
> To solve the issue, this patch introduces new syscall process_madvise(2)
> which works based on pidfd so it could give a hint to the exeternal
> process.
> 
> int process_madvise(int pidfd, void *addr, size_t length, int advise);
> 
> All advises madvise provides can be supported in process_madvise, too.
> Since it could affect other process's address range, only privileged
> process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> gives it the right to ptrrace the process could use it successfully.
> 
> Please suggest better idea if you have other idea about the permission.
> 
> * from v1r1
>   * use ptrace capability - surenb, dancol
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  arch/x86/entry/syscalls/syscall_32.tbl |  1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |  1 +
>  include/linux/proc_fs.h                |  1 +
>  include/linux/syscalls.h               |  2 ++
>  include/uapi/asm-generic/unistd.h      |  2 ++
>  kernel/signal.c                        |  2 +-
>  kernel/sys_ni.c                        |  1 +
>  mm/madvise.c                           | 45 ++++++++++++++++++++++++++
>  8 files changed, 54 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index 4cd5f982b1e5..5b9dd55d6b57 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -438,3 +438,4 @@
>  425	i386	io_uring_setup		sys_io_uring_setup		__ia32_sys_io_uring_setup
>  426	i386	io_uring_enter		sys_io_uring_enter		__ia32_sys_io_uring_enter
>  427	i386	io_uring_register	sys_io_uring_register		__ia32_sys_io_uring_register
> +428	i386	process_madvise		sys_process_madvise		__ia32_sys_process_madvise
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index 64ca0d06259a..0e5ee78161c9 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -355,6 +355,7 @@
>  425	common	io_uring_setup		__x64_sys_io_uring_setup
>  426	common	io_uring_enter		__x64_sys_io_uring_enter
>  427	common	io_uring_register	__x64_sys_io_uring_register
> +428	common	process_madvise		__x64_sys_process_madvise
>  
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
> index 52a283ba0465..f8545d7c5218 100644
> --- a/include/linux/proc_fs.h
> +++ b/include/linux/proc_fs.h
> @@ -122,6 +122,7 @@ static inline struct pid *tgid_pidfd_to_pid(const struct file *file)
>  
>  #endif /* CONFIG_PROC_FS */
>  
> +extern struct pid *pidfd_to_pid(const struct file *file);
>  struct net;
>  
>  static inline struct proc_dir_entry *proc_net_mkdir(
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index e2870fe1be5b..21c6c9a62006 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -872,6 +872,8 @@ asmlinkage long sys_munlockall(void);
>  asmlinkage long sys_mincore(unsigned long start, size_t len,
>  				unsigned char __user * vec);
>  asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
> +asmlinkage long sys_process_madvise(int pid_fd, unsigned long start,
> +				size_t len, int behavior);
>  asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
>  			unsigned long prot, unsigned long pgoff,
>  			unsigned long flags);
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index dee7292e1df6..7ee82ce04620 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -832,6 +832,8 @@ __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
>  __SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
>  #define __NR_io_uring_register 427
>  __SYSCALL(__NR_io_uring_register, sys_io_uring_register)
> +#define __NR_process_madvise 428
> +__SYSCALL(__NR_process_madvise, sys_process_madvise)
>  
>  #undef __NR_syscalls
>  #define __NR_syscalls 428
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 1c86b78a7597..04e75daab1f8 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -3620,7 +3620,7 @@ static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo, siginfo_t *info)
>  	return copy_siginfo_from_user(kinfo, info);
>  }
>  
> -static struct pid *pidfd_to_pid(const struct file *file)
> +struct pid *pidfd_to_pid(const struct file *file)
>  {
>  	if (file->f_op == &pidfd_fops)
>  		return file->private_data;
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 4d9ae5ea6caf..5277421795ab 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -278,6 +278,7 @@ COND_SYSCALL(mlockall);
>  COND_SYSCALL(munlockall);
>  COND_SYSCALL(mincore);
>  COND_SYSCALL(madvise);
> +COND_SYSCALL(process_madvise);
>  COND_SYSCALL(remap_file_pages);
>  COND_SYSCALL(mbind);
>  COND_SYSCALL_COMPAT(mbind);
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 119e82e1f065..af02aa17e5c1 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -9,6 +9,7 @@
>  #include <linux/mman.h>
>  #include <linux/pagemap.h>
>  #include <linux/page_idle.h>
> +#include <linux/proc_fs.h>
>  #include <linux/syscalls.h>
>  #include <linux/mempolicy.h>
>  #include <linux/page-isolation.h>
> @@ -16,6 +17,7 @@
>  #include <linux/hugetlb.h>
>  #include <linux/falloc.h>
>  #include <linux/sched.h>
> +#include <linux/sched/mm.h>
>  #include <linux/ksm.h>
>  #include <linux/fs.h>
>  #include <linux/file.h>
> @@ -1140,3 +1142,46 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
>  {
>  	return madvise_core(current, start, len_in, behavior);
>  }
> +
> +SYSCALL_DEFINE4(process_madvise, int, pidfd, unsigned long, start,
> +		size_t, len_in, int, behavior)
> +{
> +	int ret;
> +	struct fd f;
> +	struct pid *pid;
> +	struct task_struct *tsk;
> +	struct mm_struct *mm;
> +
> +	f = fdget(pidfd);
> +	if (!f.file)
> +		return -EBADF;
> +
> +	pid = pidfd_to_pid(f.file);

pidfd_to_pid() should not be directly exported since this allows
/proc/<pid> fds to be used too. That's something we won't be going
forward with. All new syscalls should only allow to operate on pidfds
created through CLONE_PIDFD or pidfd_open() (cf. [1]).

So e.g. please export a simple helper like

struct pid *pidfd_to_pid(const struct file *file)
{
        if (file->f_op == &pidfd_fops)
                return file->private_data;

        return NULL;
}

turning the old pidfd_to_pid() into something like:

static struct pid *__fd_to_pid(const struct file *file)
{
        struct pid *pid;

        pid = pidfd_to_pid(file);
        if (pid)
                return pid;

        return tgid_pidfd_to_pid(file);
}

All new syscalls should only be using anon inode pidfds since they can
actually have a clean security model built around them in the future.
Note, pidfd_open() will be sent out together with making pidfds pollable
for the 5.3 merge window.

[1]: https://lore.kernel.org/lkml/20190520155630.21684-1-christian@brauner.io/

Thanks!
Christian


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 1/7] mm: introduce MADV_COOL
  2019-05-21  6:04       ` Michal Hocko
@ 2019-05-21  9:11         ` Minchan Kim
  2019-05-21 10:05           ` Michal Hocko
  0 siblings, 1 reply; 138+ messages in thread
From: Minchan Kim @ 2019-05-21  9:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Tue, May 21, 2019 at 08:04:43AM +0200, Michal Hocko wrote:
> On Tue 21-05-19 07:54:19, Minchan Kim wrote:
> > On Mon, May 20, 2019 at 10:16:21AM +0200, Michal Hocko wrote:
> [...]
> > > > Internally, it works via deactivating memory from active list to
> > > > inactive's head so when the memory pressure happens, they will be
> > > > reclaimed earlier than other active pages unless there is no
> > > > access until the time.
> > > 
> > > Could you elaborate about the decision to move to the head rather than
> > > tail? What should happen to inactive pages? Should we move them to the
> > > tail? Your implementation seems to ignore those completely. Why?
> > 
> > Normally, inactive LRU could have used-once pages without any mapping
> > to user's address space. Such pages would be better candicate to
> > reclaim when the memory pressure happens. With deactivating only
> > active LRU pages of the process to the head of inactive LRU, we will
> > keep them in RAM longer than used-once pages and could have more chance
> > to be activated once the process is resumed.
> 
> You are making some assumptions here. You have an explicit call what is
> cold now you are assuming something is even colder. Is this assumption a
> general enough to make people depend on it? Not that we wouldn't be able
> to change to logic later but that will always be risky - especially in
> the area when somebody want to make a user space driven memory
> management.

Think about MADV_FREE. It moves those pages into inactive file LRU's head.
See the get_scan_count which makes forceful scanning of inactive file LRU
if it has enough size based on the memory pressure.
The reason is it's likely to have used-once pages in inactive file LRU,
generally. Those pages has been top-priority candidate to be reclaimed
for a long time.

Only parts I am aware of moving pages into tail of inactive LRU are places
writeback is done for pages VM already decide to reclaim by LRU aging or
destructive operation like invalidating but couldn't completed. It's
really strong hints with no doubt.

>  
> > > What should happen for shared pages? In other words do we want to allow
> > > less privileged process to control evicting of shared pages with a more
> > > privileged one? E.g. think of all sorts of side channel attacks. Maybe
> > > we want to do the same thing as for mincore where write access is
> > > required.
> > 
> > It doesn't work with shared pages(ie, page_mapcount > 1). I will add it
> > in the description.
> 
> OK, this is good for the starter. It makes the implementation simpler
> and we can add shared mappings coverage later.
> 
> Although I would argue that touching only writeable mappings should be
> reasonably safe.
> 
> -- 
> Michal Hocko
> SUSE Labs


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 3/7] mm: introduce MADV_COLD
  2019-05-21  6:08       ` Michal Hocko
@ 2019-05-21  9:13         ` Minchan Kim
  0 siblings, 0 replies; 138+ messages in thread
From: Minchan Kim @ 2019-05-21  9:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Tue, May 21, 2019 at 08:08:20AM +0200, Michal Hocko wrote:
> On Tue 21-05-19 08:00:38, Minchan Kim wrote:
> > On Mon, May 20, 2019 at 10:27:03AM +0200, Michal Hocko wrote:
> > > [Cc linux-api]
> > > 
> > > On Mon 20-05-19 12:52:50, Minchan Kim wrote:
> > > > When a process expects no accesses to a certain memory range
> > > > for a long time, it could hint kernel that the pages can be
> > > > reclaimed instantly but data should be preserved for future use.
> > > > This could reduce workingset eviction so it ends up increasing
> > > > performance.
> > > > 
> > > > This patch introduces the new MADV_COLD hint to madvise(2)
> > > > syscall. MADV_COLD can be used by a process to mark a memory range
> > > > as not expected to be used for a long time. The hint can help
> > > > kernel in deciding which pages to evict proactively.
> > > 
> > > As mentioned in other email this looks like a non-destructive
> > > MADV_DONTNEED alternative.
> > > 
> > > > Internally, it works via reclaiming memory in process context
> > > > the syscall is called. If the page is dirty but backing storage
> > > > is not synchronous device, the written page will be rotate back
> > > > into LRU's tail once the write is done so they will reclaim easily
> > > > when memory pressure happens. If backing storage is
> > > > synchrnous device(e.g., zram), hte page will be reclaimed instantly.
> > > 
> > > Why do we special case async backing storage? Please always try to
> > > explain _why_ the decision is made.
> > 
> > I didn't make any decesion. ;-) That's how current reclaim works to
> > avoid latency of freeing page in interrupt context. I had a patchset
> > to resolve the concern a few years ago but got distracted.
> 
> Please articulate that in the changelog then. Or even do not go into
> implementation details and stick with - reuse the current reclaim
> implementation. If you call out some of the specific details you are
> risking people will start depending on them. The fact that this reuses
> the currect reclaim logic is enough from the review point of view
> because we know that there is no additional special casing to worry
> about.

I should have clarified. I will remove those lines in respin.

> -- 
> Michal Hocko
> SUSE Labs


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 1/7] mm: introduce MADV_COOL
  2019-05-21  9:11         ` Minchan Kim
@ 2019-05-21 10:05           ` Michal Hocko
  0 siblings, 0 replies; 138+ messages in thread
From: Michal Hocko @ 2019-05-21 10:05 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Tue 21-05-19 18:11:34, Minchan Kim wrote:
> On Tue, May 21, 2019 at 08:04:43AM +0200, Michal Hocko wrote:
> > On Tue 21-05-19 07:54:19, Minchan Kim wrote:
> > > On Mon, May 20, 2019 at 10:16:21AM +0200, Michal Hocko wrote:
> > [...]
> > > > > Internally, it works via deactivating memory from active list to
> > > > > inactive's head so when the memory pressure happens, they will be
> > > > > reclaimed earlier than other active pages unless there is no
> > > > > access until the time.
> > > > 
> > > > Could you elaborate about the decision to move to the head rather than
> > > > tail? What should happen to inactive pages? Should we move them to the
> > > > tail? Your implementation seems to ignore those completely. Why?
> > > 
> > > Normally, inactive LRU could have used-once pages without any mapping
> > > to user's address space. Such pages would be better candicate to
> > > reclaim when the memory pressure happens. With deactivating only
> > > active LRU pages of the process to the head of inactive LRU, we will
> > > keep them in RAM longer than used-once pages and could have more chance
> > > to be activated once the process is resumed.
> > 
> > You are making some assumptions here. You have an explicit call what is
> > cold now you are assuming something is even colder. Is this assumption a
> > general enough to make people depend on it? Not that we wouldn't be able
> > to change to logic later but that will always be risky - especially in
> > the area when somebody want to make a user space driven memory
> > management.
> 
> Think about MADV_FREE. It moves those pages into inactive file LRU's head.
> See the get_scan_count which makes forceful scanning of inactive file LRU
> if it has enough size based on the memory pressure.
> The reason is it's likely to have used-once pages in inactive file LRU,
> generally. Those pages has been top-priority candidate to be reclaimed
> for a long time.

OK, fair enough. Being consistent with MADV_FREE is reasonable. I just
forgot we do rotate like this there.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector arrary
  2019-05-21  6:24       ` Michal Hocko
@ 2019-05-21 10:26         ` Minchan Kim
  2019-05-21 10:37           ` Michal Hocko
  0 siblings, 1 reply; 138+ messages in thread
From: Minchan Kim @ 2019-05-21 10:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Tue, May 21, 2019 at 08:24:21AM +0200, Michal Hocko wrote:
> On Tue 21-05-19 11:48:20, Minchan Kim wrote:
> > On Mon, May 20, 2019 at 11:22:58AM +0200, Michal Hocko wrote:
> > > [Cc linux-api]
> > > 
> > > On Mon 20-05-19 12:52:53, Minchan Kim wrote:
> > > > Currently, process_madvise syscall works for only one address range
> > > > so user should call the syscall several times to give hints to
> > > > multiple address range.
> > > 
> > > Is that a problem? How big of a problem? Any numbers?
> > 
> > We easily have 2000+ vma so it's not trivial overhead. I will come up
> > with number in the description at respin.
> 
> Does this really have to be a fast operation? I would expect the monitor
> is by no means a fast path. The system call overhead is not what it used
> to be, sigh, but still for something that is not a hot path it should be
> tolerable, especially when the whole operation is quite expensive on its
> own (wrt. the syscall entry/exit).

What's different with process_vm_[readv|writev] and vmsplice?
If the range needed to be covered is a lot, vector operation makes senese
to me.

> 
> I am not saying we do not need a multiplexing API, I am just not sure
> we need it right away. Btw. there was some demand for other MM syscalls
> to provide a multiplexing API (e.g. mprotect), maybe it would be better
> to handle those in one go?

That's the exactly what Daniel Colascione suggested from internal
review. That would be a interesting approach if we could aggregate
all of system call in one go.

> -- 
> Michal Hocko
> SUSE Labs


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 5/7] mm: introduce external memory hinting API
  2019-05-21  6:17       ` Michal Hocko
@ 2019-05-21 10:32         ` Minchan Kim
  0 siblings, 0 replies; 138+ messages in thread
From: Minchan Kim @ 2019-05-21 10:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Tue, May 21, 2019 at 08:17:43AM +0200, Michal Hocko wrote:
> On Tue 21-05-19 11:41:07, Minchan Kim wrote:
> > On Mon, May 20, 2019 at 11:18:29AM +0200, Michal Hocko wrote:
> > > [Cc linux-api]
> > > 
> > > On Mon 20-05-19 12:52:52, Minchan Kim wrote:
> > > > There is some usecase that centralized userspace daemon want to give
> > > > a memory hint like MADV_[COOL|COLD] to other process. Android's
> > > > ActivityManagerService is one of them.
> > > > 
> > > > It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > > > required to make the reclaim decision is not known to the app. Instead,
> > > > it is known to the centralized userspace daemon(ActivityManagerService),
> > > > and that daemon must be able to initiate reclaim on its own without
> > > > any app involvement.
> > > 
> > > Could you expand some more about how this all works? How does the
> > > centralized daemon track respective ranges? How does it synchronize
> > > against parallel modification of the address space etc.
> > 
> > Currently, we don't track each address ranges because we have two
> > policies at this moment:
> > 
> > 	deactive file pages and reclaim anonymous pages of the app.
> > 
> > Since the daemon has a ability to let background apps resume(IOW, process
> > will be run by the daemon) and both hints are non-disruptive stabilty point
> > of view, we are okay for the race.
> 
> Fair enough but the API should consider future usecases where this might
> be a problem. So we should really think about those potential scenarios
> now. If we are ok with that, fine, but then we should be explicit and
> document it that way. Essentially say that any sort of synchronization
> is supposed to be done by monitor. This will make the API less usable
> but maybe that is enough.

Okay, I will add more about that in the description.

>  
> > > > To solve the issue, this patch introduces new syscall process_madvise(2)
> > > > which works based on pidfd so it could give a hint to the exeternal
> > > > process.
> > > > 
> > > > int process_madvise(int pidfd, void *addr, size_t length, int advise);
> > > 
> > > OK, this makes some sense from the API point of view. When we have
> > > discussed that at LSFMM I was contemplating about something like that
> > > except the fd would be a VMA fd rather than the process. We could extend
> > > and reuse /proc/<pid>/map_files interface which doesn't support the
> > > anonymous memory right now. 
> > > 
> > > I am not saying this would be a better interface but I wanted to mention
> > > it here for a further discussion. One slight advantage would be that
> > > you know the exact object that you are operating on because you have a
> > > fd for the VMA and we would have a more straightforward way to reject
> > > operation if the underlying object has changed (e.g. unmapped and reused
> > > for a different mapping).
> > 
> > I agree your point. If I didn't miss something, such kinds of vma level
> > modify notification doesn't work even file mapped vma at this moment.
> > For anonymous vma, I think we could use userfaultfd, pontentially.
> > It would be great if someone want to do with disruptive hints like
> > MADV_DONTNEED.
> > 
> > I'd like to see it further enhancement after landing address range based
> > operation via limiting hints process_madvise supports to non-disruptive
> > only(e.g., MADV_[COOL|COLD]) so that we could catch up the usercase/workload
> > when someone want to extend the API.
> 
> So do you think we want both interfaces (process_madvise and madvisefd)?

What I have in mind is to extend process_madvise later like this

struct pr_madvise_param {
    int size;                       /* the size of this structure */
    union {
    	const struct iovec __user *vec; /* address range array */
	int fd;				/* supported from 6.0 */
    }
}

with introducing new hint Or-able PR_MADV_RANGE_FD, so that process_madvise
can go with fd instead of address range.


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 0/7] introduce memory hinting API for external process
  2019-05-21  2:55     ` Anshuman Khandual
  2019-05-21  5:14       ` Minchan Kim
@ 2019-05-21 10:34       ` Michal Hocko
  2019-05-28 10:50         ` Anshuman Khandual
  2019-05-21 12:56       ` Shakeel Butt
  2 siblings, 1 reply; 138+ messages in thread
From: Michal Hocko @ 2019-05-21 10:34 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Tim Murray, Minchan Kim, Andrew Morton, LKML, linux-mm,
	Johannes Weiner, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon

On Tue 21-05-19 08:25:55, Anshuman Khandual wrote:
> On 05/20/2019 10:29 PM, Tim Murray wrote:
[...]
> > not seem to introduce a noticeable hot start penalty, not does it
> > cause an increase in performance problems later in the app's
> > lifecycle. I've measured with and without process_madvise, and the
> > differences are within our noise bounds. Second, because we're not
> 
> That is assuming that post process_madvise() working set for the application is
> always smaller. There is another challenge. The external process should ideally
> have the knowledge of active areas of the working set for an application in
> question for it to invoke process_madvise() correctly to prevent such scenarios.

But that doesn't really seem relevant for the API itself, right? The
higher level logic the monitor's business.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector arrary
  2019-05-21 10:26         ` Minchan Kim
@ 2019-05-21 10:37           ` Michal Hocko
  2019-05-27  7:49             ` Minchan Kim
  0 siblings, 1 reply; 138+ messages in thread
From: Michal Hocko @ 2019-05-21 10:37 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Tue 21-05-19 19:26:13, Minchan Kim wrote:
> On Tue, May 21, 2019 at 08:24:21AM +0200, Michal Hocko wrote:
> > On Tue 21-05-19 11:48:20, Minchan Kim wrote:
> > > On Mon, May 20, 2019 at 11:22:58AM +0200, Michal Hocko wrote:
> > > > [Cc linux-api]
> > > > 
> > > > On Mon 20-05-19 12:52:53, Minchan Kim wrote:
> > > > > Currently, process_madvise syscall works for only one address range
> > > > > so user should call the syscall several times to give hints to
> > > > > multiple address range.
> > > > 
> > > > Is that a problem? How big of a problem? Any numbers?
> > > 
> > > We easily have 2000+ vma so it's not trivial overhead. I will come up
> > > with number in the description at respin.
> > 
> > Does this really have to be a fast operation? I would expect the monitor
> > is by no means a fast path. The system call overhead is not what it used
> > to be, sigh, but still for something that is not a hot path it should be
> > tolerable, especially when the whole operation is quite expensive on its
> > own (wrt. the syscall entry/exit).
> 
> What's different with process_vm_[readv|writev] and vmsplice?
> If the range needed to be covered is a lot, vector operation makes senese
> to me.

I am not saying that the vector API is wrong. All I am trying to say is
that the benefit is not really clear so far. If you want to push it
through then you should better get some supporting data.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 4/7] mm: factor out madvise's core functionality
  2019-05-21  6:36       ` Oleksandr Natalenko
  2019-05-21  6:50         ` Michal Hocko
@ 2019-05-21 10:49         ` Minchan Kim
  2019-05-21 10:55           ` Michal Hocko
  1 sibling, 1 reply; 138+ messages in thread
From: Minchan Kim @ 2019-05-21 10:49 UTC (permalink / raw)
  To: Oleksandr Natalenko
  Cc: Andrew Morton, LKML, linux-mm, Michal Hocko, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon

On Tue, May 21, 2019 at 08:36:28AM +0200, Oleksandr Natalenko wrote:
> Hi.
> 
> On Tue, May 21, 2019 at 10:26:49AM +0900, Minchan Kim wrote:
> > On Mon, May 20, 2019 at 04:26:33PM +0200, Oleksandr Natalenko wrote:
> > > Hi.
> > > 
> > > On Mon, May 20, 2019 at 12:52:51PM +0900, Minchan Kim wrote:
> > > > This patch factor out madvise's core functionality so that upcoming
> > > > patch can reuse it without duplication.
> > > > 
> > > > It shouldn't change any behavior.
> > > > 
> > > > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > > > ---
> > > >  mm/madvise.c | 168 +++++++++++++++++++++++++++------------------------
> > > >  1 file changed, 89 insertions(+), 79 deletions(-)
> > > > 
> > > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > > index 9a6698b56845..119e82e1f065 100644
> > > > --- a/mm/madvise.c
> > > > +++ b/mm/madvise.c
> > > > @@ -742,7 +742,8 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
> > > >  	return 0;
> > > >  }
> > > >  
> > > > -static long madvise_dontneed_free(struct vm_area_struct *vma,
> > > > +static long madvise_dontneed_free(struct task_struct *tsk,
> > > > +				  struct vm_area_struct *vma,
> > > >  				  struct vm_area_struct **prev,
> > > >  				  unsigned long start, unsigned long end,
> > > >  				  int behavior)
> > > > @@ -754,8 +755,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
> > > >  	if (!userfaultfd_remove(vma, start, end)) {
> > > >  		*prev = NULL; /* mmap_sem has been dropped, prev is stale */
> > > >  
> > > > -		down_read(&current->mm->mmap_sem);
> > > > -		vma = find_vma(current->mm, start);
> > > > +		down_read(&tsk->mm->mmap_sem);
> > > > +		vma = find_vma(tsk->mm, start);
> > > >  		if (!vma)
> > > >  			return -ENOMEM;
> > > >  		if (start < vma->vm_start) {
> > > > @@ -802,7 +803,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
> > > >   * Application wants to free up the pages and associated backing store.
> > > >   * This is effectively punching a hole into the middle of a file.
> > > >   */
> > > > -static long madvise_remove(struct vm_area_struct *vma,
> > > > +static long madvise_remove(struct task_struct *tsk,
> > > > +				struct vm_area_struct *vma,
> > > >  				struct vm_area_struct **prev,
> > > >  				unsigned long start, unsigned long end)
> > > >  {
> > > > @@ -836,13 +838,13 @@ static long madvise_remove(struct vm_area_struct *vma,
> > > >  	get_file(f);
> > > >  	if (userfaultfd_remove(vma, start, end)) {
> > > >  		/* mmap_sem was not released by userfaultfd_remove() */
> > > > -		up_read(&current->mm->mmap_sem);
> > > > +		up_read(&tsk->mm->mmap_sem);
> > > >  	}
> > > >  	error = vfs_fallocate(f,
> > > >  				FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> > > >  				offset, end - start);
> > > >  	fput(f);
> > > > -	down_read(&current->mm->mmap_sem);
> > > > +	down_read(&tsk->mm->mmap_sem);
> > > >  	return error;
> > > >  }
> > > >  
> > > > @@ -916,12 +918,13 @@ static int madvise_inject_error(int behavior,
> > > >  #endif
> > > 
> > > What about madvise_inject_error() and get_user_pages_fast() in it
> > > please?
> > 
> > Good point. Maybe, there more places where assume context is "current" so
> > I'm thinking to limit hints we could allow from external process.
> > It would be better for maintainance point of view in that we could know
> > the workload/usecases when someone ask new advises from external process
> > without making every hints works both contexts.
> 
> Well, for madvise_inject_error() we still have a remote variant of
> get_user_pages(), and that should work, no?

Regardless of madvise_inject_error, it seems to be risky to expose all
of hints for external process, I think. For example, MADV_DONTNEED with
race, it's critical for stability. So, until we could get the way to
prevent the race, I want to restrict hints.

> 
> Regarding restricting the hints, I'm definitely interested in having
> remote MADV_MERGEABLE/MADV_UNMERGEABLE. But, OTOH, doing it via remote
> madvise() introduces another issue with traversing remote VMAs reliably.

How is it signifiact when the race happens? It could waste CPU cycle
and make unncessary break of that merged pages but expect it should be
rare so such non-desruptive hint could be exposed via process_madvise, I think.

If the hint is critical for the race, yes, as Michal suggested, we need a way
to close it and I guess non-cooperative userfaultfd with synchronous support
would help private anonymous vma.

> IIUC, one can do this via userspace by parsing [s]maps file only, which
> is not very consistent, and once some range is parsed, and then it is
> immediately gone, a wrong hint will be sent.
> 
> Isn't this a problem we should worry about?

I think it depends on the hint and usecase.

> 
> > 
> > Thanks.
> 
> -- 
>   Best regards,
>     Oleksandr Natalenko (post-factum)
>     Senior Software Maintenance Engineer


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 4/7] mm: factor out madvise's core functionality
  2019-05-21  7:06           ` Oleksandr Natalenko
@ 2019-05-21 10:52             ` Minchan Kim
  2019-05-21 11:00               ` Michal Hocko
  0 siblings, 1 reply; 138+ messages in thread
From: Minchan Kim @ 2019-05-21 10:52 UTC (permalink / raw)
  To: Oleksandr Natalenko
  Cc: Michal Hocko, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon

On Tue, May 21, 2019 at 09:06:38AM +0200, Oleksandr Natalenko wrote:
> Hi.
> 
> On Tue, May 21, 2019 at 08:50:00AM +0200, Michal Hocko wrote:
> > On Tue 21-05-19 08:36:28, Oleksandr Natalenko wrote:
> > [...]
> > > Regarding restricting the hints, I'm definitely interested in having
> > > remote MADV_MERGEABLE/MADV_UNMERGEABLE. But, OTOH, doing it via remote
> > > madvise() introduces another issue with traversing remote VMAs reliably.
> > > IIUC, one can do this via userspace by parsing [s]maps file only, which
> > > is not very consistent, and once some range is parsed, and then it is
> > > immediately gone, a wrong hint will be sent.
> > > 
> > > Isn't this a problem we should worry about?
> > 
> > See http://lkml.kernel.org/r/20190520091829.GY6836@dhcp22.suse.cz
> 
> Oh, thanks for the pointer.
> 
> Indeed, for my specific task with remote KSM I'd go with map_files
> instead. This doesn't solve the task completely in case of traversal
> through all the VMAs in one pass, but makes it easier comparing to a
> remote syscall.

I'm wondering how map_files can solve your concern exactly if you have
a concern about the race of vma unmap/remap even there are anonymous
vma which map_files doesn't support.

> 
> -- 
>   Best regards,
>     Oleksandr Natalenko (post-factum)
>     Senior Software Maintenance Engineer


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 4/7] mm: factor out madvise's core functionality
  2019-05-21 10:49         ` Minchan Kim
@ 2019-05-21 10:55           ` Michal Hocko
  0 siblings, 0 replies; 138+ messages in thread
From: Michal Hocko @ 2019-05-21 10:55 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Oleksandr Natalenko, Andrew Morton, LKML, linux-mm,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon

On Tue 21-05-19 19:49:49, Minchan Kim wrote:
> On Tue, May 21, 2019 at 08:36:28AM +0200, Oleksandr Natalenko wrote:
> > Hi.
> > 
> > On Tue, May 21, 2019 at 10:26:49AM +0900, Minchan Kim wrote:
> > > On Mon, May 20, 2019 at 04:26:33PM +0200, Oleksandr Natalenko wrote:
> > > > Hi.
> > > > 
> > > > On Mon, May 20, 2019 at 12:52:51PM +0900, Minchan Kim wrote:
> > > > > This patch factor out madvise's core functionality so that upcoming
> > > > > patch can reuse it without duplication.
> > > > > 
> > > > > It shouldn't change any behavior.
> > > > > 
> > > > > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > > > > ---
> > > > >  mm/madvise.c | 168 +++++++++++++++++++++++++++------------------------
> > > > >  1 file changed, 89 insertions(+), 79 deletions(-)
> > > > > 
> > > > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > > > index 9a6698b56845..119e82e1f065 100644
> > > > > --- a/mm/madvise.c
> > > > > +++ b/mm/madvise.c
> > > > > @@ -742,7 +742,8 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
> > > > >  	return 0;
> > > > >  }
> > > > >  
> > > > > -static long madvise_dontneed_free(struct vm_area_struct *vma,
> > > > > +static long madvise_dontneed_free(struct task_struct *tsk,
> > > > > +				  struct vm_area_struct *vma,
> > > > >  				  struct vm_area_struct **prev,
> > > > >  				  unsigned long start, unsigned long end,
> > > > >  				  int behavior)
> > > > > @@ -754,8 +755,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
> > > > >  	if (!userfaultfd_remove(vma, start, end)) {
> > > > >  		*prev = NULL; /* mmap_sem has been dropped, prev is stale */
> > > > >  
> > > > > -		down_read(&current->mm->mmap_sem);
> > > > > -		vma = find_vma(current->mm, start);
> > > > > +		down_read(&tsk->mm->mmap_sem);
> > > > > +		vma = find_vma(tsk->mm, start);
> > > > >  		if (!vma)
> > > > >  			return -ENOMEM;
> > > > >  		if (start < vma->vm_start) {
> > > > > @@ -802,7 +803,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
> > > > >   * Application wants to free up the pages and associated backing store.
> > > > >   * This is effectively punching a hole into the middle of a file.
> > > > >   */
> > > > > -static long madvise_remove(struct vm_area_struct *vma,
> > > > > +static long madvise_remove(struct task_struct *tsk,
> > > > > +				struct vm_area_struct *vma,
> > > > >  				struct vm_area_struct **prev,
> > > > >  				unsigned long start, unsigned long end)
> > > > >  {
> > > > > @@ -836,13 +838,13 @@ static long madvise_remove(struct vm_area_struct *vma,
> > > > >  	get_file(f);
> > > > >  	if (userfaultfd_remove(vma, start, end)) {
> > > > >  		/* mmap_sem was not released by userfaultfd_remove() */
> > > > > -		up_read(&current->mm->mmap_sem);
> > > > > +		up_read(&tsk->mm->mmap_sem);
> > > > >  	}
> > > > >  	error = vfs_fallocate(f,
> > > > >  				FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> > > > >  				offset, end - start);
> > > > >  	fput(f);
> > > > > -	down_read(&current->mm->mmap_sem);
> > > > > +	down_read(&tsk->mm->mmap_sem);
> > > > >  	return error;
> > > > >  }
> > > > >  
> > > > > @@ -916,12 +918,13 @@ static int madvise_inject_error(int behavior,
> > > > >  #endif
> > > > 
> > > > What about madvise_inject_error() and get_user_pages_fast() in it
> > > > please?
> > > 
> > > Good point. Maybe, there more places where assume context is "current" so
> > > I'm thinking to limit hints we could allow from external process.
> > > It would be better for maintainance point of view in that we could know
> > > the workload/usecases when someone ask new advises from external process
> > > without making every hints works both contexts.
> > 
> > Well, for madvise_inject_error() we still have a remote variant of
> > get_user_pages(), and that should work, no?
> 
> Regardless of madvise_inject_error, it seems to be risky to expose all
> of hints for external process, I think. For example, MADV_DONTNEED with
> race, it's critical for stability. So, until we could get the way to
> prevent the race, I want to restrict hints.

Well, if you allow the full ptrace access then you can shoot the target
whatever you like.

> > Regarding restricting the hints, I'm definitely interested in having
> > remote MADV_MERGEABLE/MADV_UNMERGEABLE. But, OTOH, doing it via remote
> > madvise() introduces another issue with traversing remote VMAs reliably.
> 
> How is it signifiact when the race happens? It could waste CPU cycle
> and make unncessary break of that merged pages but expect it should be
> rare so such non-desruptive hint could be exposed via process_madvise, I think.
> 
> If the hint is critical for the race, yes, as Michal suggested, we need a way
> to close it and I guess non-cooperative userfaultfd with synchronous support
> would help private anonymous vma.

If we have a per vma fd approach then we can revalidate atomically and
make sure the operation is performed on the range that was really
requested. I do not think we want to provide a more specific guarantees.
Monitor process has to be careful same way ptrace doesn't want to harm
the target.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 4/7] mm: factor out madvise's core functionality
  2019-05-21 10:52             ` Minchan Kim
@ 2019-05-21 11:00               ` Michal Hocko
  2019-05-21 11:24                 ` Minchan Kim
  0 siblings, 1 reply; 138+ messages in thread
From: Michal Hocko @ 2019-05-21 11:00 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Oleksandr Natalenko, Andrew Morton, LKML, linux-mm,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon

On Tue 21-05-19 19:52:56, Minchan Kim wrote:
> On Tue, May 21, 2019 at 09:06:38AM +0200, Oleksandr Natalenko wrote:
> > Hi.
> > 
> > On Tue, May 21, 2019 at 08:50:00AM +0200, Michal Hocko wrote:
> > > On Tue 21-05-19 08:36:28, Oleksandr Natalenko wrote:
> > > [...]
> > > > Regarding restricting the hints, I'm definitely interested in having
> > > > remote MADV_MERGEABLE/MADV_UNMERGEABLE. But, OTOH, doing it via remote
> > > > madvise() introduces another issue with traversing remote VMAs reliably.
> > > > IIUC, one can do this via userspace by parsing [s]maps file only, which
> > > > is not very consistent, and once some range is parsed, and then it is
> > > > immediately gone, a wrong hint will be sent.
> > > > 
> > > > Isn't this a problem we should worry about?
> > > 
> > > See http://lkml.kernel.org/r/20190520091829.GY6836@dhcp22.suse.cz
> > 
> > Oh, thanks for the pointer.
> > 
> > Indeed, for my specific task with remote KSM I'd go with map_files
> > instead. This doesn't solve the task completely in case of traversal
> > through all the VMAs in one pass, but makes it easier comparing to a
> > remote syscall.
> 
> I'm wondering how map_files can solve your concern exactly if you have
> a concern about the race of vma unmap/remap even there are anonymous
> vma which map_files doesn't support.

See http://lkml.kernel.org/r/20190521105503.GQ32329@dhcp22.suse.cz

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 0/7] introduce memory hinting API for external process
  2019-05-21  8:42 ` Christian Brauner
@ 2019-05-21 11:05   ` Minchan Kim
  2019-05-21 11:30     ` Christian Brauner
  0 siblings, 1 reply; 138+ messages in thread
From: Minchan Kim @ 2019-05-21 11:05 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Andrew Morton, LKML, linux-mm, Michal Hocko, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh

On Tue, May 21, 2019 at 10:42:00AM +0200, Christian Brauner wrote:
> On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote:
> > - Background
> > 
> > The Android terminology used for forking a new process and starting an app
> > from scratch is a cold start, while resuming an existing app is a hot start.
> > While we continually try to improve the performance of cold starts, hot
> > starts will always be significantly less power hungry as well as faster so
> > we are trying to make hot start more likely than cold start.
> > 
> > To increase hot start, Android userspace manages the order that apps should
> > be killed in a process called ActivityManagerService. ActivityManagerService
> > tracks every Android app or service that the user could be interacting with
> > at any time and translates that into a ranked list for lmkd(low memory
> > killer daemon). They are likely to be killed by lmkd if the system has to
> > reclaim memory. In that sense they are similar to entries in any other cache.
> > Those apps are kept alive for opportunistic performance improvements but
> > those performance improvements will vary based on the memory requirements of
> > individual workloads.
> > 
> > - Problem
> > 
> > Naturally, cached apps were dominant consumers of memory on the system.
> > However, they were not significant consumers of swap even though they are
> > good candidate for swap. Under investigation, swapping out only begins
> > once the low zone watermark is hit and kswapd wakes up, but the overall
> > allocation rate in the system might trip lmkd thresholds and cause a cached
> > process to be killed(we measured performance swapping out vs. zapping the
> > memory by killing a process. Unsurprisingly, zapping is 10x times faster
> > even though we use zram which is much faster than real storage) so kill
> > from lmkd will often satisfy the high zone watermark, resulting in very
> > few pages actually being moved to swap.
> > 
> > - Approach
> > 
> > The approach we chose was to use a new interface to allow userspace to
> > proactively reclaim entire processes by leveraging platform information.
> > This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
> > that are known to be cold from userspace and to avoid races with lmkd
> > by reclaiming apps as soon as they entered the cached state. Additionally,
> > it could provide many chances for platform to use much information to
> > optimize memory efficiency.
> > 
> > IMHO we should spell it out that this patchset complements MADV_WONTNEED
> > and MADV_FREE by adding non-destructive ways to gain some free memory
> > space. MADV_COLD is similar to MADV_WONTNEED in a way that it hints the
> > kernel that memory region is not currently needed and should be reclaimed
> > immediately; MADV_COOL is similar to MADV_FREE in a way that it hints the
> > kernel that memory region is not currently needed and should be reclaimed
> > when memory pressure rises.
> > 
> > To achieve the goal, the patchset introduce two new options for madvise.
> > One is MADV_COOL which will deactive activated pages and the other is
> > MADV_COLD which will reclaim private pages instantly. These new options
> > complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to
> > gain some free memory space. MADV_COLD is similar to MADV_DONTNEED in a way
> > that it hints the kernel that memory region is not currently needed and
> > should be reclaimed immediately; MADV_COOL is similar to MADV_FREE in a way
> > that it hints the kernel that memory region is not currently needed and
> > should be reclaimed when memory pressure rises.
> > 
> > This approach is similar in spirit to madvise(MADV_WONTNEED), but the
> > information required to make the reclaim decision is not known to the app.
> > Instead, it is known to a centralized userspace daemon, and that daemon
> > must be able to initiate reclaim on its own without any app involvement.
> > To solve the concern, this patch introduces new syscall -
> > 
> > 	struct pr_madvise_param {
> > 		int size;
> > 		const struct iovec *vec;
> > 	}
> > 
> > 	int process_madvise(int pidfd, ssize_t nr_elem, int *behavior,
> > 				struct pr_madvise_param *restuls,
> > 				struct pr_madvise_param *ranges,
> > 				unsigned long flags);
> > 
> > The syscall get pidfd to give hints to external process and provides
> > pair of result/ranges vector arguments so that it could give several
> > hints to each address range all at once.
> > 
> > I guess others have different ideas about the naming of syscall and options
> > so feel free to suggest better naming.
> 
> Yes, all new syscalls making use of pidfds should be named
> pidfd_<action>. So please make this pidfd_madvise.

I don't have any particular preference but just wondering why pidfd is
so special to have it as prefix of system call name.

> 
> Please make sure to Cc me on this in the future as I'm maintaining
> pidfds. Would be great to have Jann on this too since he's been touching
> both mm and parts of the pidfd stuff with me.

Sure!


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 4/7] mm: factor out madvise's core functionality
  2019-05-21 11:00               ` Michal Hocko
@ 2019-05-21 11:24                 ` Minchan Kim
  2019-05-21 11:32                   ` Michal Hocko
  0 siblings, 1 reply; 138+ messages in thread
From: Minchan Kim @ 2019-05-21 11:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Oleksandr Natalenko, Andrew Morton, LKML, linux-mm,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon

On Tue, May 21, 2019 at 01:00:30PM +0200, Michal Hocko wrote:
> On Tue 21-05-19 19:52:56, Minchan Kim wrote:
> > On Tue, May 21, 2019 at 09:06:38AM +0200, Oleksandr Natalenko wrote:
> > > Hi.
> > > 
> > > On Tue, May 21, 2019 at 08:50:00AM +0200, Michal Hocko wrote:
> > > > On Tue 21-05-19 08:36:28, Oleksandr Natalenko wrote:
> > > > [...]
> > > > > Regarding restricting the hints, I'm definitely interested in having
> > > > > remote MADV_MERGEABLE/MADV_UNMERGEABLE. But, OTOH, doing it via remote
> > > > > madvise() introduces another issue with traversing remote VMAs reliably.
> > > > > IIUC, one can do this via userspace by parsing [s]maps file only, which
> > > > > is not very consistent, and once some range is parsed, and then it is
> > > > > immediately gone, a wrong hint will be sent.
> > > > > 
> > > > > Isn't this a problem we should worry about?
> > > > 
> > > > See http://lkml.kernel.org/r/20190520091829.GY6836@dhcp22.suse.cz
> > > 
> > > Oh, thanks for the pointer.
> > > 
> > > Indeed, for my specific task with remote KSM I'd go with map_files
> > > instead. This doesn't solve the task completely in case of traversal
> > > through all the VMAs in one pass, but makes it easier comparing to a
> > > remote syscall.
> > 
> > I'm wondering how map_files can solve your concern exactly if you have
> > a concern about the race of vma unmap/remap even there are anonymous
> > vma which map_files doesn't support.
> 
> See http://lkml.kernel.org/r/20190521105503.GQ32329@dhcp22.suse.cz

Question is how it works for anonymous vma which don't have backing
file.

> 
> -- 
> Michal Hocko
> SUSE Labs


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 0/7] introduce memory hinting API for external process
  2019-05-21 11:05   ` Minchan Kim
@ 2019-05-21 11:30     ` Christian Brauner
  2019-05-21 11:39       ` Christian Brauner
  2019-05-21 11:41       ` Minchan Kim
  0 siblings, 2 replies; 138+ messages in thread
From: Christian Brauner @ 2019-05-21 11:30 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Michal Hocko, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh

On Tue, May 21, 2019 at 08:05:52PM +0900, Minchan Kim wrote:
> On Tue, May 21, 2019 at 10:42:00AM +0200, Christian Brauner wrote:
> > On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote:
> > > - Background
> > > 
> > > The Android terminology used for forking a new process and starting an app
> > > from scratch is a cold start, while resuming an existing app is a hot start.
> > > While we continually try to improve the performance of cold starts, hot
> > > starts will always be significantly less power hungry as well as faster so
> > > we are trying to make hot start more likely than cold start.
> > > 
> > > To increase hot start, Android userspace manages the order that apps should
> > > be killed in a process called ActivityManagerService. ActivityManagerService
> > > tracks every Android app or service that the user could be interacting with
> > > at any time and translates that into a ranked list for lmkd(low memory
> > > killer daemon). They are likely to be killed by lmkd if the system has to
> > > reclaim memory. In that sense they are similar to entries in any other cache.
> > > Those apps are kept alive for opportunistic performance improvements but
> > > those performance improvements will vary based on the memory requirements of
> > > individual workloads.
> > > 
> > > - Problem
> > > 
> > > Naturally, cached apps were dominant consumers of memory on the system.
> > > However, they were not significant consumers of swap even though they are
> > > good candidate for swap. Under investigation, swapping out only begins
> > > once the low zone watermark is hit and kswapd wakes up, but the overall
> > > allocation rate in the system might trip lmkd thresholds and cause a cached
> > > process to be killed(we measured performance swapping out vs. zapping the
> > > memory by killing a process. Unsurprisingly, zapping is 10x times faster
> > > even though we use zram which is much faster than real storage) so kill
> > > from lmkd will often satisfy the high zone watermark, resulting in very
> > > few pages actually being moved to swap.
> > > 
> > > - Approach
> > > 
> > > The approach we chose was to use a new interface to allow userspace to
> > > proactively reclaim entire processes by leveraging platform information.
> > > This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
> > > that are known to be cold from userspace and to avoid races with lmkd
> > > by reclaiming apps as soon as they entered the cached state. Additionally,
> > > it could provide many chances for platform to use much information to
> > > optimize memory efficiency.
> > > 
> > > IMHO we should spell it out that this patchset complements MADV_WONTNEED
> > > and MADV_FREE by adding non-destructive ways to gain some free memory
> > > space. MADV_COLD is similar to MADV_WONTNEED in a way that it hints the
> > > kernel that memory region is not currently needed and should be reclaimed
> > > immediately; MADV_COOL is similar to MADV_FREE in a way that it hints the
> > > kernel that memory region is not currently needed and should be reclaimed
> > > when memory pressure rises.
> > > 
> > > To achieve the goal, the patchset introduce two new options for madvise.
> > > One is MADV_COOL which will deactive activated pages and the other is
> > > MADV_COLD which will reclaim private pages instantly. These new options
> > > complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to
> > > gain some free memory space. MADV_COLD is similar to MADV_DONTNEED in a way
> > > that it hints the kernel that memory region is not currently needed and
> > > should be reclaimed immediately; MADV_COOL is similar to MADV_FREE in a way
> > > that it hints the kernel that memory region is not currently needed and
> > > should be reclaimed when memory pressure rises.
> > > 
> > > This approach is similar in spirit to madvise(MADV_WONTNEED), but the
> > > information required to make the reclaim decision is not known to the app.
> > > Instead, it is known to a centralized userspace daemon, and that daemon
> > > must be able to initiate reclaim on its own without any app involvement.
> > > To solve the concern, this patch introduces new syscall -
> > > 
> > > 	struct pr_madvise_param {
> > > 		int size;
> > > 		const struct iovec *vec;
> > > 	}
> > > 
> > > 	int process_madvise(int pidfd, ssize_t nr_elem, int *behavior,
> > > 				struct pr_madvise_param *restuls,
> > > 				struct pr_madvise_param *ranges,
> > > 				unsigned long flags);
> > > 
> > > The syscall get pidfd to give hints to external process and provides
> > > pair of result/ranges vector arguments so that it could give several
> > > hints to each address range all at once.
> > > 
> > > I guess others have different ideas about the naming of syscall and options
> > > so feel free to suggest better naming.
> > 
> > Yes, all new syscalls making use of pidfds should be named
> > pidfd_<action>. So please make this pidfd_madvise.
> 
> I don't have any particular preference but just wondering why pidfd is
> so special to have it as prefix of system call name.

It's a whole new API to address processes. We already have
clone(CLONE_PIDFD) and pidfd_send_signal() as you have seen since you
exported pidfd_to_pid(). And we're going to have pidfd_open(). Your
syscall works only with pidfds so it's tied to this api as well so it
should follow the naming scheme. This also makes life easier for
userspace and is consistent.

> 
> > 
> > Please make sure to Cc me on this in the future as I'm maintaining
> > pidfds. Would be great to have Jann on this too since he's been touching
> > both mm and parts of the pidfd stuff with me.
> 
> Sure!

Thanks!


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 4/7] mm: factor out madvise's core functionality
  2019-05-21 11:24                 ` Minchan Kim
@ 2019-05-21 11:32                   ` Michal Hocko
  0 siblings, 0 replies; 138+ messages in thread
From: Michal Hocko @ 2019-05-21 11:32 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Oleksandr Natalenko, Andrew Morton, LKML, linux-mm,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon

On Tue 21-05-19 20:24:23, Minchan Kim wrote:
> On Tue, May 21, 2019 at 01:00:30PM +0200, Michal Hocko wrote:
> > On Tue 21-05-19 19:52:56, Minchan Kim wrote:
> > > On Tue, May 21, 2019 at 09:06:38AM +0200, Oleksandr Natalenko wrote:
> > > > Hi.
> > > > 
> > > > On Tue, May 21, 2019 at 08:50:00AM +0200, Michal Hocko wrote:
> > > > > On Tue 21-05-19 08:36:28, Oleksandr Natalenko wrote:
> > > > > [...]
> > > > > > Regarding restricting the hints, I'm definitely interested in having
> > > > > > remote MADV_MERGEABLE/MADV_UNMERGEABLE. But, OTOH, doing it via remote
> > > > > > madvise() introduces another issue with traversing remote VMAs reliably.
> > > > > > IIUC, one can do this via userspace by parsing [s]maps file only, which
> > > > > > is not very consistent, and once some range is parsed, and then it is
> > > > > > immediately gone, a wrong hint will be sent.
> > > > > > 
> > > > > > Isn't this a problem we should worry about?
> > > > > 
> > > > > See http://lkml.kernel.org/r/20190520091829.GY6836@dhcp22.suse.cz
> > > > 
> > > > Oh, thanks for the pointer.
> > > > 
> > > > Indeed, for my specific task with remote KSM I'd go with map_files
> > > > instead. This doesn't solve the task completely in case of traversal
> > > > through all the VMAs in one pass, but makes it easier comparing to a
> > > > remote syscall.
> > > 
> > > I'm wondering how map_files can solve your concern exactly if you have
> > > a concern about the race of vma unmap/remap even there are anonymous
> > > vma which map_files doesn't support.
> > 
> > See http://lkml.kernel.org/r/20190521105503.GQ32329@dhcp22.suse.cz
> 
> Question is how it works for anonymous vma which don't have backing
> file.

We would have to export map_files like interface for anonymous vmas
and have a way to invalidate on the VMA removal (reference counting or
something similar), no question this will be some additional work to do.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 5/7] mm: introduce external memory hinting API
  2019-05-21  9:01   ` Christian Brauner
@ 2019-05-21 11:35     ` Minchan Kim
  2019-05-21 11:51       ` Christian Brauner
  0 siblings, 1 reply; 138+ messages in thread
From: Minchan Kim @ 2019-05-21 11:35 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Andrew Morton, LKML, linux-mm, Michal Hocko, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
	oleg

On Tue, May 21, 2019 at 11:01:01AM +0200, Christian Brauner wrote:
> Cc: Jann and Oleg too
> 
> On Mon, May 20, 2019 at 12:52:52PM +0900, Minchan Kim wrote:
> > There is some usecase that centralized userspace daemon want to give
> > a memory hint like MADV_[COOL|COLD] to other process. Android's
> > ActivityManagerService is one of them.
> > 
> > It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > required to make the reclaim decision is not known to the app. Instead,
> > it is known to the centralized userspace daemon(ActivityManagerService),
> > and that daemon must be able to initiate reclaim on its own without
> > any app involvement.
> > 
> > To solve the issue, this patch introduces new syscall process_madvise(2)
> > which works based on pidfd so it could give a hint to the exeternal
> > process.
> > 
> > int process_madvise(int pidfd, void *addr, size_t length, int advise);
> > 
> > All advises madvise provides can be supported in process_madvise, too.
> > Since it could affect other process's address range, only privileged
> > process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> > gives it the right to ptrrace the process could use it successfully.
> > 
> > Please suggest better idea if you have other idea about the permission.
> > 
> > * from v1r1
> >   * use ptrace capability - surenb, dancol
> > 
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> >  arch/x86/entry/syscalls/syscall_32.tbl |  1 +
> >  arch/x86/entry/syscalls/syscall_64.tbl |  1 +
> >  include/linux/proc_fs.h                |  1 +
> >  include/linux/syscalls.h               |  2 ++
> >  include/uapi/asm-generic/unistd.h      |  2 ++
> >  kernel/signal.c                        |  2 +-
> >  kernel/sys_ni.c                        |  1 +
> >  mm/madvise.c                           | 45 ++++++++++++++++++++++++++
> >  8 files changed, 54 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> > index 4cd5f982b1e5..5b9dd55d6b57 100644
> > --- a/arch/x86/entry/syscalls/syscall_32.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> > @@ -438,3 +438,4 @@
> >  425	i386	io_uring_setup		sys_io_uring_setup		__ia32_sys_io_uring_setup
> >  426	i386	io_uring_enter		sys_io_uring_enter		__ia32_sys_io_uring_enter
> >  427	i386	io_uring_register	sys_io_uring_register		__ia32_sys_io_uring_register
> > +428	i386	process_madvise		sys_process_madvise		__ia32_sys_process_madvise
> > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> > index 64ca0d06259a..0e5ee78161c9 100644
> > --- a/arch/x86/entry/syscalls/syscall_64.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> > @@ -355,6 +355,7 @@
> >  425	common	io_uring_setup		__x64_sys_io_uring_setup
> >  426	common	io_uring_enter		__x64_sys_io_uring_enter
> >  427	common	io_uring_register	__x64_sys_io_uring_register
> > +428	common	process_madvise		__x64_sys_process_madvise
> >  
> >  #
> >  # x32-specific system call numbers start at 512 to avoid cache impact
> > diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
> > index 52a283ba0465..f8545d7c5218 100644
> > --- a/include/linux/proc_fs.h
> > +++ b/include/linux/proc_fs.h
> > @@ -122,6 +122,7 @@ static inline struct pid *tgid_pidfd_to_pid(const struct file *file)
> >  
> >  #endif /* CONFIG_PROC_FS */
> >  
> > +extern struct pid *pidfd_to_pid(const struct file *file);
> >  struct net;
> >  
> >  static inline struct proc_dir_entry *proc_net_mkdir(
> > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> > index e2870fe1be5b..21c6c9a62006 100644
> > --- a/include/linux/syscalls.h
> > +++ b/include/linux/syscalls.h
> > @@ -872,6 +872,8 @@ asmlinkage long sys_munlockall(void);
> >  asmlinkage long sys_mincore(unsigned long start, size_t len,
> >  				unsigned char __user * vec);
> >  asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
> > +asmlinkage long sys_process_madvise(int pid_fd, unsigned long start,
> > +				size_t len, int behavior);
> >  asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
> >  			unsigned long prot, unsigned long pgoff,
> >  			unsigned long flags);
> > diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> > index dee7292e1df6..7ee82ce04620 100644
> > --- a/include/uapi/asm-generic/unistd.h
> > +++ b/include/uapi/asm-generic/unistd.h
> > @@ -832,6 +832,8 @@ __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
> >  __SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
> >  #define __NR_io_uring_register 427
> >  __SYSCALL(__NR_io_uring_register, sys_io_uring_register)
> > +#define __NR_process_madvise 428
> > +__SYSCALL(__NR_process_madvise, sys_process_madvise)
> >  
> >  #undef __NR_syscalls
> >  #define __NR_syscalls 428
> > diff --git a/kernel/signal.c b/kernel/signal.c
> > index 1c86b78a7597..04e75daab1f8 100644
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> > @@ -3620,7 +3620,7 @@ static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo, siginfo_t *info)
> >  	return copy_siginfo_from_user(kinfo, info);
> >  }
> >  
> > -static struct pid *pidfd_to_pid(const struct file *file)
> > +struct pid *pidfd_to_pid(const struct file *file)
> >  {
> >  	if (file->f_op == &pidfd_fops)
> >  		return file->private_data;
> > diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> > index 4d9ae5ea6caf..5277421795ab 100644
> > --- a/kernel/sys_ni.c
> > +++ b/kernel/sys_ni.c
> > @@ -278,6 +278,7 @@ COND_SYSCALL(mlockall);
> >  COND_SYSCALL(munlockall);
> >  COND_SYSCALL(mincore);
> >  COND_SYSCALL(madvise);
> > +COND_SYSCALL(process_madvise);
> >  COND_SYSCALL(remap_file_pages);
> >  COND_SYSCALL(mbind);
> >  COND_SYSCALL_COMPAT(mbind);
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 119e82e1f065..af02aa17e5c1 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -9,6 +9,7 @@
> >  #include <linux/mman.h>
> >  #include <linux/pagemap.h>
> >  #include <linux/page_idle.h>
> > +#include <linux/proc_fs.h>
> >  #include <linux/syscalls.h>
> >  #include <linux/mempolicy.h>
> >  #include <linux/page-isolation.h>
> > @@ -16,6 +17,7 @@
> >  #include <linux/hugetlb.h>
> >  #include <linux/falloc.h>
> >  #include <linux/sched.h>
> > +#include <linux/sched/mm.h>
> >  #include <linux/ksm.h>
> >  #include <linux/fs.h>
> >  #include <linux/file.h>
> > @@ -1140,3 +1142,46 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> >  {
> >  	return madvise_core(current, start, len_in, behavior);
> >  }
> > +
> > +SYSCALL_DEFINE4(process_madvise, int, pidfd, unsigned long, start,
> > +		size_t, len_in, int, behavior)
> > +{
> > +	int ret;
> > +	struct fd f;
> > +	struct pid *pid;
> > +	struct task_struct *tsk;
> > +	struct mm_struct *mm;
> > +
> > +	f = fdget(pidfd);
> > +	if (!f.file)
> > +		return -EBADF;
> > +
> > +	pid = pidfd_to_pid(f.file);
> 
> pidfd_to_pid() should not be directly exported since this allows
> /proc/<pid> fds to be used too. That's something we won't be going
> forward with. All new syscalls should only allow to operate on pidfds
> created through CLONE_PIDFD or pidfd_open() (cf. [1]).

Thanks for the information.

> 
> So e.g. please export a simple helper like
> 
> struct pid *pidfd_to_pid(const struct file *file)
> {
>         if (file->f_op == &pidfd_fops)
>                 return file->private_data;
> 
>         return NULL;
> }
> 
> turning the old pidfd_to_pid() into something like:
> 
> static struct pid *__fd_to_pid(const struct file *file)
> {
>         struct pid *pid;
> 
>         pid = pidfd_to_pid(file);
>         if (pid)
>                 return pid;
> 
>         return tgid_pidfd_to_pid(file);
> }

So, I want to clarify what you suggest here.

1. modify pidfd_to_pid as what you described above(ie, return NULL
instead of returning tgid_pidfd_to_pid(file);
2. never export pidfd_to_pid
3. create wrapper __fd_to_pid which calls pidfd_to_pid internally
4. export __fd_to_pid and use it

Correct?

Thanks.

> 
> All new syscalls should only be using anon inode pidfds since they can
> actually have a clean security model built around them in the future.
> Note, pidfd_open() will be sent out together with making pidfds pollable
> for the 5.3 merge window.
> 
> [1]: https://lore.kernel.org/lkml/20190520155630.21684-1-christian@brauner.io/
> 
> Thanks!
> Christian


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 0/7] introduce memory hinting API for external process
  2019-05-21 11:30     ` Christian Brauner
@ 2019-05-21 11:39       ` Christian Brauner
  2019-05-22  5:11         ` Daniel Colascione
  2019-05-21 11:41       ` Minchan Kim
  1 sibling, 1 reply; 138+ messages in thread
From: Christian Brauner @ 2019-05-21 11:39 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Michal Hocko, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh

On Tue, May 21, 2019 at 01:30:29PM +0200, Christian Brauner wrote:
> On Tue, May 21, 2019 at 08:05:52PM +0900, Minchan Kim wrote:
> > On Tue, May 21, 2019 at 10:42:00AM +0200, Christian Brauner wrote:
> > > On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote:
> > > > - Background
> > > > 
> > > > The Android terminology used for forking a new process and starting an app
> > > > from scratch is a cold start, while resuming an existing app is a hot start.
> > > > While we continually try to improve the performance of cold starts, hot
> > > > starts will always be significantly less power hungry as well as faster so
> > > > we are trying to make hot start more likely than cold start.
> > > > 
> > > > To increase hot start, Android userspace manages the order that apps should
> > > > be killed in a process called ActivityManagerService. ActivityManagerService
> > > > tracks every Android app or service that the user could be interacting with
> > > > at any time and translates that into a ranked list for lmkd(low memory
> > > > killer daemon). They are likely to be killed by lmkd if the system has to
> > > > reclaim memory. In that sense they are similar to entries in any other cache.
> > > > Those apps are kept alive for opportunistic performance improvements but
> > > > those performance improvements will vary based on the memory requirements of
> > > > individual workloads.
> > > > 
> > > > - Problem
> > > > 
> > > > Naturally, cached apps were dominant consumers of memory on the system.
> > > > However, they were not significant consumers of swap even though they are
> > > > good candidate for swap. Under investigation, swapping out only begins
> > > > once the low zone watermark is hit and kswapd wakes up, but the overall
> > > > allocation rate in the system might trip lmkd thresholds and cause a cached
> > > > process to be killed(we measured performance swapping out vs. zapping the
> > > > memory by killing a process. Unsurprisingly, zapping is 10x times faster
> > > > even though we use zram which is much faster than real storage) so kill
> > > > from lmkd will often satisfy the high zone watermark, resulting in very
> > > > few pages actually being moved to swap.
> > > > 
> > > > - Approach
> > > > 
> > > > The approach we chose was to use a new interface to allow userspace to
> > > > proactively reclaim entire processes by leveraging platform information.
> > > > This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
> > > > that are known to be cold from userspace and to avoid races with lmkd
> > > > by reclaiming apps as soon as they entered the cached state. Additionally,
> > > > it could provide many chances for platform to use much information to
> > > > optimize memory efficiency.
> > > > 
> > > > IMHO we should spell it out that this patchset complements MADV_WONTNEED
> > > > and MADV_FREE by adding non-destructive ways to gain some free memory
> > > > space. MADV_COLD is similar to MADV_WONTNEED in a way that it hints the
> > > > kernel that memory region is not currently needed and should be reclaimed
> > > > immediately; MADV_COOL is similar to MADV_FREE in a way that it hints the
> > > > kernel that memory region is not currently needed and should be reclaimed
> > > > when memory pressure rises.
> > > > 
> > > > To achieve the goal, the patchset introduce two new options for madvise.
> > > > One is MADV_COOL which will deactive activated pages and the other is
> > > > MADV_COLD which will reclaim private pages instantly. These new options
> > > > complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to
> > > > gain some free memory space. MADV_COLD is similar to MADV_DONTNEED in a way
> > > > that it hints the kernel that memory region is not currently needed and
> > > > should be reclaimed immediately; MADV_COOL is similar to MADV_FREE in a way
> > > > that it hints the kernel that memory region is not currently needed and
> > > > should be reclaimed when memory pressure rises.
> > > > 
> > > > This approach is similar in spirit to madvise(MADV_WONTNEED), but the
> > > > information required to make the reclaim decision is not known to the app.
> > > > Instead, it is known to a centralized userspace daemon, and that daemon
> > > > must be able to initiate reclaim on its own without any app involvement.
> > > > To solve the concern, this patch introduces new syscall -
> > > > 
> > > > 	struct pr_madvise_param {
> > > > 		int size;
> > > > 		const struct iovec *vec;
> > > > 	}
> > > > 
> > > > 	int process_madvise(int pidfd, ssize_t nr_elem, int *behavior,
> > > > 				struct pr_madvise_param *restuls,
> > > > 				struct pr_madvise_param *ranges,
> > > > 				unsigned long flags);
> > > > 
> > > > The syscall get pidfd to give hints to external process and provides
> > > > pair of result/ranges vector arguments so that it could give several
> > > > hints to each address range all at once.
> > > > 
> > > > I guess others have different ideas about the naming of syscall and options
> > > > so feel free to suggest better naming.
> > > 
> > > Yes, all new syscalls making use of pidfds should be named
> > > pidfd_<action>. So please make this pidfd_madvise.
> > 
> > I don't have any particular preference but just wondering why pidfd is
> > so special to have it as prefix of system call name.
> 
> It's a whole new API to address processes. We already have
> clone(CLONE_PIDFD) and pidfd_send_signal() as you have seen since you
> exported pidfd_to_pid(). And we're going to have pidfd_open(). Your
> syscall works only with pidfds so it's tied to this api as well so it
> should follow the naming scheme. This also makes life easier for
> userspace and is consistent.

This is at least my reasoning. I'm not going to make this a whole big
pedantic argument. If people have really strong feelings about not using
this prefix then fine. But if syscalls can be grouped together and have
consistent naming this is always a big plus.


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 0/7] introduce memory hinting API for external process
  2019-05-21 11:30     ` Christian Brauner
  2019-05-21 11:39       ` Christian Brauner
@ 2019-05-21 11:41       ` Minchan Kim
  2019-05-21 12:04         ` Christian Brauner
  1 sibling, 1 reply; 138+ messages in thread
From: Minchan Kim @ 2019-05-21 11:41 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Andrew Morton, LKML, linux-mm, Michal Hocko, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh

On Tue, May 21, 2019 at 01:30:32PM +0200, Christian Brauner wrote:
> On Tue, May 21, 2019 at 08:05:52PM +0900, Minchan Kim wrote:
> > On Tue, May 21, 2019 at 10:42:00AM +0200, Christian Brauner wrote:
> > > On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote:
> > > > - Background
> > > > 
> > > > The Android terminology used for forking a new process and starting an app
> > > > from scratch is a cold start, while resuming an existing app is a hot start.
> > > > While we continually try to improve the performance of cold starts, hot
> > > > starts will always be significantly less power hungry as well as faster so
> > > > we are trying to make hot start more likely than cold start.
> > > > 
> > > > To increase hot start, Android userspace manages the order that apps should
> > > > be killed in a process called ActivityManagerService. ActivityManagerService
> > > > tracks every Android app or service that the user could be interacting with
> > > > at any time and translates that into a ranked list for lmkd(low memory
> > > > killer daemon). They are likely to be killed by lmkd if the system has to
> > > > reclaim memory. In that sense they are similar to entries in any other cache.
> > > > Those apps are kept alive for opportunistic performance improvements but
> > > > those performance improvements will vary based on the memory requirements of
> > > > individual workloads.
> > > > 
> > > > - Problem
> > > > 
> > > > Naturally, cached apps were dominant consumers of memory on the system.
> > > > However, they were not significant consumers of swap even though they are
> > > > good candidate for swap. Under investigation, swapping out only begins
> > > > once the low zone watermark is hit and kswapd wakes up, but the overall
> > > > allocation rate in the system might trip lmkd thresholds and cause a cached
> > > > process to be killed(we measured performance swapping out vs. zapping the
> > > > memory by killing a process. Unsurprisingly, zapping is 10x times faster
> > > > even though we use zram which is much faster than real storage) so kill
> > > > from lmkd will often satisfy the high zone watermark, resulting in very
> > > > few pages actually being moved to swap.
> > > > 
> > > > - Approach
> > > > 
> > > > The approach we chose was to use a new interface to allow userspace to
> > > > proactively reclaim entire processes by leveraging platform information.
> > > > This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
> > > > that are known to be cold from userspace and to avoid races with lmkd
> > > > by reclaiming apps as soon as they entered the cached state. Additionally,
> > > > it could provide many chances for platform to use much information to
> > > > optimize memory efficiency.
> > > > 
> > > > IMHO we should spell it out that this patchset complements MADV_WONTNEED
> > > > and MADV_FREE by adding non-destructive ways to gain some free memory
> > > > space. MADV_COLD is similar to MADV_WONTNEED in a way that it hints the
> > > > kernel that memory region is not currently needed and should be reclaimed
> > > > immediately; MADV_COOL is similar to MADV_FREE in a way that it hints the
> > > > kernel that memory region is not currently needed and should be reclaimed
> > > > when memory pressure rises.
> > > > 
> > > > To achieve the goal, the patchset introduce two new options for madvise.
> > > > One is MADV_COOL which will deactive activated pages and the other is
> > > > MADV_COLD which will reclaim private pages instantly. These new options
> > > > complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to
> > > > gain some free memory space. MADV_COLD is similar to MADV_DONTNEED in a way
> > > > that it hints the kernel that memory region is not currently needed and
> > > > should be reclaimed immediately; MADV_COOL is similar to MADV_FREE in a way
> > > > that it hints the kernel that memory region is not currently needed and
> > > > should be reclaimed when memory pressure rises.
> > > > 
> > > > This approach is similar in spirit to madvise(MADV_WONTNEED), but the
> > > > information required to make the reclaim decision is not known to the app.
> > > > Instead, it is known to a centralized userspace daemon, and that daemon
> > > > must be able to initiate reclaim on its own without any app involvement.
> > > > To solve the concern, this patch introduces new syscall -
> > > > 
> > > > 	struct pr_madvise_param {
> > > > 		int size;
> > > > 		const struct iovec *vec;
> > > > 	}
> > > > 
> > > > 	int process_madvise(int pidfd, ssize_t nr_elem, int *behavior,
> > > > 				struct pr_madvise_param *restuls,
> > > > 				struct pr_madvise_param *ranges,
> > > > 				unsigned long flags);
> > > > 
> > > > The syscall get pidfd to give hints to external process and provides
> > > > pair of result/ranges vector arguments so that it could give several
> > > > hints to each address range all at once.
> > > > 
> > > > I guess others have different ideas about the naming of syscall and options
> > > > so feel free to suggest better naming.
> > > 
> > > Yes, all new syscalls making use of pidfds should be named
> > > pidfd_<action>. So please make this pidfd_madvise.
> > 
> > I don't have any particular preference but just wondering why pidfd is
> > so special to have it as prefix of system call name.
> 
> It's a whole new API to address processes. We already have
> clone(CLONE_PIDFD) and pidfd_send_signal() as you have seen since you
> exported pidfd_to_pid(). And we're going to have pidfd_open(). Your
> syscall works only with pidfds so it's tied to this api as well so it
> should follow the naming scheme. This also makes life easier for
> userspace and is consistent.

Okay. I will change the API name at next revision.
Thanks.


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 5/7] mm: introduce external memory hinting API
  2019-05-21 11:35     ` Minchan Kim
@ 2019-05-21 11:51       ` Christian Brauner
  0 siblings, 0 replies; 138+ messages in thread
From: Christian Brauner @ 2019-05-21 11:51 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Michal Hocko, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
	oleg

On Tue, May 21, 2019 at 08:35:10PM +0900, Minchan Kim wrote:
> On Tue, May 21, 2019 at 11:01:01AM +0200, Christian Brauner wrote:
> > Cc: Jann and Oleg too
> > 
> > On Mon, May 20, 2019 at 12:52:52PM +0900, Minchan Kim wrote:
> > > There is some usecase that centralized userspace daemon want to give
> > > a memory hint like MADV_[COOL|COLD] to other process. Android's
> > > ActivityManagerService is one of them.
> > > 
> > > It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > > required to make the reclaim decision is not known to the app. Instead,
> > > it is known to the centralized userspace daemon(ActivityManagerService),
> > > and that daemon must be able to initiate reclaim on its own without
> > > any app involvement.
> > > 
> > > To solve the issue, this patch introduces new syscall process_madvise(2)
> > > which works based on pidfd so it could give a hint to the exeternal
> > > process.
> > > 
> > > int process_madvise(int pidfd, void *addr, size_t length, int advise);
> > > 
> > > All advises madvise provides can be supported in process_madvise, too.
> > > Since it could affect other process's address range, only privileged
> > > process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> > > gives it the right to ptrrace the process could use it successfully.
> > > 
> > > Please suggest better idea if you have other idea about the permission.
> > > 
> > > * from v1r1
> > >   * use ptrace capability - surenb, dancol
> > > 
> > > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > > ---
> > >  arch/x86/entry/syscalls/syscall_32.tbl |  1 +
> > >  arch/x86/entry/syscalls/syscall_64.tbl |  1 +
> > >  include/linux/proc_fs.h                |  1 +
> > >  include/linux/syscalls.h               |  2 ++
> > >  include/uapi/asm-generic/unistd.h      |  2 ++
> > >  kernel/signal.c                        |  2 +-
> > >  kernel/sys_ni.c                        |  1 +
> > >  mm/madvise.c                           | 45 ++++++++++++++++++++++++++
> > >  8 files changed, 54 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> > > index 4cd5f982b1e5..5b9dd55d6b57 100644
> > > --- a/arch/x86/entry/syscalls/syscall_32.tbl
> > > +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> > > @@ -438,3 +438,4 @@
> > >  425	i386	io_uring_setup		sys_io_uring_setup		__ia32_sys_io_uring_setup
> > >  426	i386	io_uring_enter		sys_io_uring_enter		__ia32_sys_io_uring_enter
> > >  427	i386	io_uring_register	sys_io_uring_register		__ia32_sys_io_uring_register
> > > +428	i386	process_madvise		sys_process_madvise		__ia32_sys_process_madvise
> > > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> > > index 64ca0d06259a..0e5ee78161c9 100644
> > > --- a/arch/x86/entry/syscalls/syscall_64.tbl
> > > +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> > > @@ -355,6 +355,7 @@
> > >  425	common	io_uring_setup		__x64_sys_io_uring_setup
> > >  426	common	io_uring_enter		__x64_sys_io_uring_enter
> > >  427	common	io_uring_register	__x64_sys_io_uring_register
> > > +428	common	process_madvise		__x64_sys_process_madvise
> > >  
> > >  #
> > >  # x32-specific system call numbers start at 512 to avoid cache impact
> > > diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
> > > index 52a283ba0465..f8545d7c5218 100644
> > > --- a/include/linux/proc_fs.h
> > > +++ b/include/linux/proc_fs.h
> > > @@ -122,6 +122,7 @@ static inline struct pid *tgid_pidfd_to_pid(const struct file *file)
> > >  
> > >  #endif /* CONFIG_PROC_FS */
> > >  
> > > +extern struct pid *pidfd_to_pid(const struct file *file);
> > >  struct net;
> > >  
> > >  static inline struct proc_dir_entry *proc_net_mkdir(
> > > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> > > index e2870fe1be5b..21c6c9a62006 100644
> > > --- a/include/linux/syscalls.h
> > > +++ b/include/linux/syscalls.h
> > > @@ -872,6 +872,8 @@ asmlinkage long sys_munlockall(void);
> > >  asmlinkage long sys_mincore(unsigned long start, size_t len,
> > >  				unsigned char __user * vec);
> > >  asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
> > > +asmlinkage long sys_process_madvise(int pid_fd, unsigned long start,
> > > +				size_t len, int behavior);
> > >  asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
> > >  			unsigned long prot, unsigned long pgoff,
> > >  			unsigned long flags);
> > > diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> > > index dee7292e1df6..7ee82ce04620 100644
> > > --- a/include/uapi/asm-generic/unistd.h
> > > +++ b/include/uapi/asm-generic/unistd.h
> > > @@ -832,6 +832,8 @@ __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
> > >  __SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
> > >  #define __NR_io_uring_register 427
> > >  __SYSCALL(__NR_io_uring_register, sys_io_uring_register)
> > > +#define __NR_process_madvise 428
> > > +__SYSCALL(__NR_process_madvise, sys_process_madvise)
> > >  
> > >  #undef __NR_syscalls
> > >  #define __NR_syscalls 428
> > > diff --git a/kernel/signal.c b/kernel/signal.c
> > > index 1c86b78a7597..04e75daab1f8 100644
> > > --- a/kernel/signal.c
> > > +++ b/kernel/signal.c
> > > @@ -3620,7 +3620,7 @@ static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo, siginfo_t *info)
> > >  	return copy_siginfo_from_user(kinfo, info);
> > >  }
> > >  
> > > -static struct pid *pidfd_to_pid(const struct file *file)
> > > +struct pid *pidfd_to_pid(const struct file *file)
> > >  {
> > >  	if (file->f_op == &pidfd_fops)
> > >  		return file->private_data;
> > > diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> > > index 4d9ae5ea6caf..5277421795ab 100644
> > > --- a/kernel/sys_ni.c
> > > +++ b/kernel/sys_ni.c
> > > @@ -278,6 +278,7 @@ COND_SYSCALL(mlockall);
> > >  COND_SYSCALL(munlockall);
> > >  COND_SYSCALL(mincore);
> > >  COND_SYSCALL(madvise);
> > > +COND_SYSCALL(process_madvise);
> > >  COND_SYSCALL(remap_file_pages);
> > >  COND_SYSCALL(mbind);
> > >  COND_SYSCALL_COMPAT(mbind);
> > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > index 119e82e1f065..af02aa17e5c1 100644
> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -9,6 +9,7 @@
> > >  #include <linux/mman.h>
> > >  #include <linux/pagemap.h>
> > >  #include <linux/page_idle.h>
> > > +#include <linux/proc_fs.h>
> > >  #include <linux/syscalls.h>
> > >  #include <linux/mempolicy.h>
> > >  #include <linux/page-isolation.h>
> > > @@ -16,6 +17,7 @@
> > >  #include <linux/hugetlb.h>
> > >  #include <linux/falloc.h>
> > >  #include <linux/sched.h>
> > > +#include <linux/sched/mm.h>
> > >  #include <linux/ksm.h>
> > >  #include <linux/fs.h>
> > >  #include <linux/file.h>
> > > @@ -1140,3 +1142,46 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> > >  {
> > >  	return madvise_core(current, start, len_in, behavior);
> > >  }
> > > +
> > > +SYSCALL_DEFINE4(process_madvise, int, pidfd, unsigned long, start,
> > > +		size_t, len_in, int, behavior)
> > > +{
> > > +	int ret;
> > > +	struct fd f;
> > > +	struct pid *pid;
> > > +	struct task_struct *tsk;
> > > +	struct mm_struct *mm;
> > > +
> > > +	f = fdget(pidfd);
> > > +	if (!f.file)
> > > +		return -EBADF;
> > > +
> > > +	pid = pidfd_to_pid(f.file);
> > 
> > pidfd_to_pid() should not be directly exported since this allows
> > /proc/<pid> fds to be used too. That's something we won't be going
> > forward with. All new syscalls should only allow to operate on pidfds
> > created through CLONE_PIDFD or pidfd_open() (cf. [1]).
> 
> Thanks for the information.
> 
> > 
> > So e.g. please export a simple helper like
> > 
> > struct pid *pidfd_to_pid(const struct file *file)
> > {
> >         if (file->f_op == &pidfd_fops)
> >                 return file->private_data;
> > 
> >         return NULL;
> > }
> > 
> > turning the old pidfd_to_pid() into something like:
> > 
> > static struct pid *__fd_to_pid(const struct file *file)
> > {
> >         struct pid *pid;
> > 
> >         pid = pidfd_to_pid(file);
> >         if (pid)
> >                 return pid;
> > 
> >         return tgid_pidfd_to_pid(file);
> > }
> 
> So, I want to clarify what you suggest here.
> 
> 1. modify pidfd_to_pid as what you described above(ie, return NULL
> instead of returning tgid_pidfd_to_pid(file);
> 2. never export pidfd_to_pid
> 3. create wrapper __fd_to_pid which calls pidfd_to_pid internally
> 4. export __fd_to_pid and use it

Sorry, seems I was not clear enough. :)
What I meant was something along the lines of (not compile tested):

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 3c8ef5a199ca..3ccc07010603 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -68,6 +68,10 @@ extern struct pid init_struct_pid;
 
 extern const struct file_operations pidfd_fops;
 
+struct file;
+
+extern struct pid *pidfd_pid(const struct file *file);
+
 static inline struct pid *get_pid(struct pid *pid)
 {
 	if (pid)
diff --git a/kernel/fork.c b/kernel/fork.c
index b4cba953040a..0fcfffaf5fdc 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1711,6 +1711,14 @@ const struct file_operations pidfd_fops = {
 #endif
 };
 
+struct pid *pidfd_pid(const struct file *file)
+{
+	if (file->f_op == &pidfd_fops)
+		return file->private_data;
+
+	return ERR_PTR(-EBADF);
+}
+
 /**
  * pidfd_create() - Create a new pid file descriptor.
  *
diff --git a/kernel/signal.c b/kernel/signal.c
index a1eb44dc9ff5..7fca9d7e1633 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -3611,8 +3611,11 @@ static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo, siginfo_t *info)
 
 static struct pid *pidfd_to_pid(const struct file *file)
 {
-	if (file->f_op == &pidfd_fops)
-		return file->private_data;
+	struct pid *pid;
+
+	pid = pidfd_pid(file);
+	if (pid)
+		return pid;
 
 	return tgid_pidfd_to_pid(file);
 }


^ permalink raw reply related	[flat|nested] 138+ messages in thread

* Re: [RFC 0/7] introduce memory hinting API for external process
  2019-05-21 11:41       ` Minchan Kim
@ 2019-05-21 12:04         ` Christian Brauner
  2019-05-21 12:15           ` Oleksandr Natalenko
  0 siblings, 1 reply; 138+ messages in thread
From: Christian Brauner @ 2019-05-21 12:04 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Michal Hocko, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
	oleksandr

On May 21, 2019 1:41:20 PM GMT+02:00, Minchan Kim <minchan@kernel.org> wrote:
>On Tue, May 21, 2019 at 01:30:32PM +0200, Christian Brauner wrote:
>> On Tue, May 21, 2019 at 08:05:52PM +0900, Minchan Kim wrote:
>> > On Tue, May 21, 2019 at 10:42:00AM +0200, Christian Brauner wrote:
>> > > On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote:
>> > > > - Background
>> > > > 
>> > > > The Android terminology used for forking a new process and
>starting an app
>> > > > from scratch is a cold start, while resuming an existing app is
>a hot start.
>> > > > While we continually try to improve the performance of cold
>starts, hot
>> > > > starts will always be significantly less power hungry as well
>as faster so
>> > > > we are trying to make hot start more likely than cold start.
>> > > > 
>> > > > To increase hot start, Android userspace manages the order that
>apps should
>> > > > be killed in a process called ActivityManagerService.
>ActivityManagerService
>> > > > tracks every Android app or service that the user could be
>interacting with
>> > > > at any time and translates that into a ranked list for lmkd(low
>memory
>> > > > killer daemon). They are likely to be killed by lmkd if the
>system has to
>> > > > reclaim memory. In that sense they are similar to entries in
>any other cache.
>> > > > Those apps are kept alive for opportunistic performance
>improvements but
>> > > > those performance improvements will vary based on the memory
>requirements of
>> > > > individual workloads.
>> > > > 
>> > > > - Problem
>> > > > 
>> > > > Naturally, cached apps were dominant consumers of memory on the
>system.
>> > > > However, they were not significant consumers of swap even
>though they are
>> > > > good candidate for swap. Under investigation, swapping out only
>begins
>> > > > once the low zone watermark is hit and kswapd wakes up, but the
>overall
>> > > > allocation rate in the system might trip lmkd thresholds and
>cause a cached
>> > > > process to be killed(we measured performance swapping out vs.
>zapping the
>> > > > memory by killing a process. Unsurprisingly, zapping is 10x
>times faster
>> > > > even though we use zram which is much faster than real storage)
>so kill
>> > > > from lmkd will often satisfy the high zone watermark, resulting
>in very
>> > > > few pages actually being moved to swap.
>> > > > 
>> > > > - Approach
>> > > > 
>> > > > The approach we chose was to use a new interface to allow
>userspace to
>> > > > proactively reclaim entire processes by leveraging platform
>information.
>> > > > This allowed us to bypass the inaccuracy of the kernel’s LRUs
>for pages
>> > > > that are known to be cold from userspace and to avoid races
>with lmkd
>> > > > by reclaiming apps as soon as they entered the cached state.
>Additionally,
>> > > > it could provide many chances for platform to use much
>information to
>> > > > optimize memory efficiency.
>> > > > 
>> > > > IMHO we should spell it out that this patchset complements
>MADV_WONTNEED
>> > > > and MADV_FREE by adding non-destructive ways to gain some free
>memory
>> > > > space. MADV_COLD is similar to MADV_WONTNEED in a way that it
>hints the
>> > > > kernel that memory region is not currently needed and should be
>reclaimed
>> > > > immediately; MADV_COOL is similar to MADV_FREE in a way that it
>hints the
>> > > > kernel that memory region is not currently needed and should be
>reclaimed
>> > > > when memory pressure rises.
>> > > > 
>> > > > To achieve the goal, the patchset introduce two new options for
>madvise.
>> > > > One is MADV_COOL which will deactive activated pages and the
>other is
>> > > > MADV_COLD which will reclaim private pages instantly. These new
>options
>> > > > complement MADV_DONTNEED and MADV_FREE by adding
>non-destructive ways to
>> > > > gain some free memory space. MADV_COLD is similar to
>MADV_DONTNEED in a way
>> > > > that it hints the kernel that memory region is not currently
>needed and
>> > > > should be reclaimed immediately; MADV_COOL is similar to
>MADV_FREE in a way
>> > > > that it hints the kernel that memory region is not currently
>needed and
>> > > > should be reclaimed when memory pressure rises.
>> > > > 
>> > > > This approach is similar in spirit to madvise(MADV_WONTNEED),
>but the
>> > > > information required to make the reclaim decision is not known
>to the app.
>> > > > Instead, it is known to a centralized userspace daemon, and
>that daemon
>> > > > must be able to initiate reclaim on its own without any app
>involvement.
>> > > > To solve the concern, this patch introduces new syscall -
>> > > > 
>> > > > 	struct pr_madvise_param {
>> > > > 		int size;
>> > > > 		const struct iovec *vec;
>> > > > 	}
>> > > > 
>> > > > 	int process_madvise(int pidfd, ssize_t nr_elem, int *behavior,
>> > > > 				struct pr_madvise_param *restuls,
>> > > > 				struct pr_madvise_param *ranges,
>> > > > 				unsigned long flags);
>> > > > 
>> > > > The syscall get pidfd to give hints to external process and
>provides
>> > > > pair of result/ranges vector arguments so that it could give
>several
>> > > > hints to each address range all at once.
>> > > > 
>> > > > I guess others have different ideas about the naming of syscall
>and options
>> > > > so feel free to suggest better naming.
>> > > 
>> > > Yes, all new syscalls making use of pidfds should be named
>> > > pidfd_<action>. So please make this pidfd_madvise.
>> > 
>> > I don't have any particular preference but just wondering why pidfd
>is
>> > so special to have it as prefix of system call name.
>> 
>> It's a whole new API to address processes. We already have
>> clone(CLONE_PIDFD) and pidfd_send_signal() as you have seen since you
>> exported pidfd_to_pid(). And we're going to have pidfd_open(). Your
>> syscall works only with pidfds so it's tied to this api as well so it
>> should follow the naming scheme. This also makes life easier for
>> userspace and is consistent.
>
>Okay. I will change the API name at next revision.
>Thanks.

Thanks!
Fwiw, there's been a similar patch by Oleksandr for pidfd_madvise I stumbled upon a few days back:
https://gitlab.com/post-factum/pf-kernel/commit/0595f874a53fa898739ac315ddf208554d9dc897

He wanted to be cc'ed but I forgot.

Christian


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 0/7] introduce memory hinting API for external process
  2019-05-21 12:04         ` Christian Brauner
@ 2019-05-21 12:15           ` Oleksandr Natalenko
  0 siblings, 0 replies; 138+ messages in thread
From: Oleksandr Natalenko @ 2019-05-21 12:15 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Michal Hocko,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh

On Tue, May 21, 2019 at 02:04:00PM +0200, Christian Brauner wrote:
> On May 21, 2019 1:41:20 PM GMT+02:00, Minchan Kim <minchan@kernel.org> wrote:
> >On Tue, May 21, 2019 at 01:30:32PM +0200, Christian Brauner wrote:
> >> On Tue, May 21, 2019 at 08:05:52PM +0900, Minchan Kim wrote:
> >> > On Tue, May 21, 2019 at 10:42:00AM +0200, Christian Brauner wrote:
> >> > > On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote:
> >> > > > - Background
> >> > > > 
> >> > > > The Android terminology used for forking a new process and
> >starting an app
> >> > > > from scratch is a cold start, while resuming an existing app is
> >a hot start.
> >> > > > While we continually try to improve the performance of cold
> >starts, hot
> >> > > > starts will always be significantly less power hungry as well
> >as faster so
> >> > > > we are trying to make hot start more likely than cold start.
> >> > > > 
> >> > > > To increase hot start, Android userspace manages the order that
> >apps should
> >> > > > be killed in a process called ActivityManagerService.
> >ActivityManagerService
> >> > > > tracks every Android app or service that the user could be
> >interacting with
> >> > > > at any time and translates that into a ranked list for lmkd(low
> >memory
> >> > > > killer daemon). They are likely to be killed by lmkd if the
> >system has to
> >> > > > reclaim memory. In that sense they are similar to entries in
> >any other cache.
> >> > > > Those apps are kept alive for opportunistic performance
> >improvements but
> >> > > > those performance improvements will vary based on the memory
> >requirements of
> >> > > > individual workloads.
> >> > > > 
> >> > > > - Problem
> >> > > > 
> >> > > > Naturally, cached apps were dominant consumers of memory on the
> >system.
> >> > > > However, they were not significant consumers of swap even
> >though they are
> >> > > > good candidate for swap. Under investigation, swapping out only
> >begins
> >> > > > once the low zone watermark is hit and kswapd wakes up, but the
> >overall
> >> > > > allocation rate in the system might trip lmkd thresholds and
> >cause a cached
> >> > > > process to be killed(we measured performance swapping out vs.
> >zapping the
> >> > > > memory by killing a process. Unsurprisingly, zapping is 10x
> >times faster
> >> > > > even though we use zram which is much faster than real storage)
> >so kill
> >> > > > from lmkd will often satisfy the high zone watermark, resulting
> >in very
> >> > > > few pages actually being moved to swap.
> >> > > > 
> >> > > > - Approach
> >> > > > 
> >> > > > The approach we chose was to use a new interface to allow
> >userspace to
> >> > > > proactively reclaim entire processes by leveraging platform
> >information.
> >> > > > This allowed us to bypass the inaccuracy of the kernel’s LRUs
> >for pages
> >> > > > that are known to be cold from userspace and to avoid races
> >with lmkd
> >> > > > by reclaiming apps as soon as they entered the cached state.
> >Additionally,
> >> > > > it could provide many chances for platform to use much
> >information to
> >> > > > optimize memory efficiency.
> >> > > > 
> >> > > > IMHO we should spell it out that this patchset complements
> >MADV_WONTNEED
> >> > > > and MADV_FREE by adding non-destructive ways to gain some free
> >memory
> >> > > > space. MADV_COLD is similar to MADV_WONTNEED in a way that it
> >hints the
> >> > > > kernel that memory region is not currently needed and should be
> >reclaimed
> >> > > > immediately; MADV_COOL is similar to MADV_FREE in a way that it
> >hints the
> >> > > > kernel that memory region is not currently needed and should be
> >reclaimed
> >> > > > when memory pressure rises.
> >> > > > 
> >> > > > To achieve the goal, the patchset introduce two new options for
> >madvise.
> >> > > > One is MADV_COOL which will deactive activated pages and the
> >other is
> >> > > > MADV_COLD which will reclaim private pages instantly. These new
> >options
> >> > > > complement MADV_DONTNEED and MADV_FREE by adding
> >non-destructive ways to
> >> > > > gain some free memory space. MADV_COLD is similar to
> >MADV_DONTNEED in a way
> >> > > > that it hints the kernel that memory region is not currently
> >needed and
> >> > > > should be reclaimed immediately; MADV_COOL is similar to
> >MADV_FREE in a way
> >> > > > that it hints the kernel that memory region is not currently
> >needed and
> >> > > > should be reclaimed when memory pressure rises.
> >> > > > 
> >> > > > This approach is similar in spirit to madvise(MADV_WONTNEED),
> >but the
> >> > > > information required to make the reclaim decision is not known
> >to the app.
> >> > > > Instead, it is known to a centralized userspace daemon, and
> >that daemon
> >> > > > must be able to initiate reclaim on its own without any app
> >involvement.
> >> > > > To solve the concern, this patch introduces new syscall -
> >> > > > 
> >> > > > 	struct pr_madvise_param {
> >> > > > 		int size;
> >> > > > 		const struct iovec *vec;
> >> > > > 	}
> >> > > > 
> >> > > > 	int process_madvise(int pidfd, ssize_t nr_elem, int *behavior,
> >> > > > 				struct pr_madvise_param *restuls,
> >> > > > 				struct pr_madvise_param *ranges,
> >> > > > 				unsigned long flags);
> >> > > > 
> >> > > > The syscall get pidfd to give hints to external process and
> >provides
> >> > > > pair of result/ranges vector arguments so that it could give
> >several
> >> > > > hints to each address range all at once.
> >> > > > 
> >> > > > I guess others have different ideas about the naming of syscall
> >and options
> >> > > > so feel free to suggest better naming.
> >> > > 
> >> > > Yes, all new syscalls making use of pidfds should be named
> >> > > pidfd_<action>. So please make this pidfd_madvise.
> >> > 
> >> > I don't have any particular preference but just wondering why pidfd
> >is
> >> > so special to have it as prefix of system call name.
> >> 
> >> It's a whole new API to address processes. We already have
> >> clone(CLONE_PIDFD) and pidfd_send_signal() as you have seen since you
> >> exported pidfd_to_pid(). And we're going to have pidfd_open(). Your
> >> syscall works only with pidfds so it's tied to this api as well so it
> >> should follow the naming scheme. This also makes life easier for
> >> userspace and is consistent.
> >
> >Okay. I will change the API name at next revision.
> >Thanks.
> 
> Thanks!
> Fwiw, there's been a similar patch by Oleksandr for pidfd_madvise I stumbled upon a few days back:
> https://gitlab.com/post-factum/pf-kernel/commit/0595f874a53fa898739ac315ddf208554d9dc897
> 
> He wanted to be cc'ed but I forgot.

Thanks :).

FWIW, since this submission is essentially a continuation of our discussion
involving my earlier KSM submissions here, I won't move my gitlab branch
forward and will be happy to assist with what we have here, be it
pidfd_madvise() or a set or /proc files (or smth else).

> 
> Christian
> 

-- 
  Best regards,
    Oleksandr Natalenko (post-factum)
    Senior Software Maintenance Engineer


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 0/7] introduce memory hinting API for external process
  2019-05-20  3:52 [RFC 0/7] introduce memory hinting API for external process Minchan Kim
                   ` (12 preceding siblings ...)
  2019-05-21  8:42 ` Christian Brauner
@ 2019-05-21 12:53 ` Shakeel Butt
  13 siblings, 0 replies; 138+ messages in thread
From: Shakeel Butt @ 2019-05-21 12:53 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Michal Hocko, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Sonny Rao, Brian Geffon, linux-api

On Sun, May 19, 2019 at 8:53 PM Minchan Kim <minchan@kernel.org> wrote:
>
> - Background
>
> The Android terminology used for forking a new process and starting an app
> from scratch is a cold start, while resuming an existing app is a hot start.
> While we continually try to improve the performance of cold starts, hot
> starts will always be significantly less power hungry as well as faster so
> we are trying to make hot start more likely than cold start.
>
> To increase hot start, Android userspace manages the order that apps should
> be killed in a process called ActivityManagerService. ActivityManagerService
> tracks every Android app or service that the user could be interacting with
> at any time and translates that into a ranked list for lmkd(low memory
> killer daemon). They are likely to be killed by lmkd if the system has to
> reclaim memory. In that sense they are similar to entries in any other cache.
> Those apps are kept alive for opportunistic performance improvements but
> those performance improvements will vary based on the memory requirements of
> individual workloads.
>
> - Problem
>
> Naturally, cached apps were dominant consumers of memory on the system.
> However, they were not significant consumers of swap even though they are
> good candidate for swap. Under investigation, swapping out only begins
> once the low zone watermark is hit and kswapd wakes up, but the overall
> allocation rate in the system might trip lmkd thresholds and cause a cached
> process to be killed(we measured performance swapping out vs. zapping the
> memory by killing a process. Unsurprisingly, zapping is 10x times faster
> even though we use zram which is much faster than real storage) so kill
> from lmkd will often satisfy the high zone watermark, resulting in very
> few pages actually being moved to swap.

It is not clear what exactly is the problem from the above para. IMO
low usage of swap is not the problem but rather global memory pressure
and the reactive response to it is the problem. Killing apps over swap
is preferred as you have noted zapping frees memory faster but it
indirectly increases cold start. Also swapping on allocation causes
latency issues for the app. So, a proactive mechanism is needed to
keep global pressure away and indirectly reduces cold starts and alloc
stalls.

>
> - Approach
>
> The approach we chose was to use a new interface to allow userspace to
> proactively reclaim entire processes by leveraging platform information.
> This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
> that are known to be cold from userspace and to avoid races with lmkd
> by reclaiming apps as soon as they entered the cached state. Additionally,
> it could provide many chances for platform to use much information to
> optimize memory efficiency.

I think it would be good to have clear reasoning on why "reclaim from
userspace" approach is taken. Android runtime clearly has more
accurate stale/cold information at the app/process level and can
positively influence kernel's reclaim decisions. So, "reclaim from
userspace" approach makes total sense for Android. I envision that
Chrome OS would be another very obvious user of this approach. There
can be tens of tabs which the user have not touched for sometime.
Chrome OS can proactively reclaim memory from such tabs.

>
> IMHO we should spell it out that this patchset complements MADV_WONTNEED

MADV_DONTNEED? same at couple of places below.

> and MADV_FREE by adding non-destructive ways to gain some free memory
> space. MADV_COLD is similar to MADV_WONTNEED in a way that it hints the
> kernel that memory region is not currently needed and should be reclaimed
> immediately; MADV_COOL is similar to MADV_FREE in a way that it hints the
> kernel that memory region is not currently needed and should be reclaimed
> when memory pressure rises.
>
> To achieve the goal, the patchset introduce two new options for madvise.
> One is MADV_COOL which will deactive activated pages and the other is
> MADV_COLD which will reclaim private pages instantly. These new options
> complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to
> gain some free memory space. MADV_COLD is similar to MADV_DONTNEED in a way
> that it hints the kernel that memory region is not currently needed and
> should be reclaimed immediately; MADV_COOL is similar to MADV_FREE in a way
> that it hints the kernel that memory region is not currently needed and
> should be reclaimed when memory pressure rises.
>
> This approach is similar in spirit to madvise(MADV_WONTNEED), but the
> information required to make the reclaim decision is not known to the app.
> Instead, it is known to a centralized userspace daemon, and that daemon
> must be able to initiate reclaim on its own without any app involvement.
> To solve the concern, this patch introduces new syscall -
>
>         struct pr_madvise_param {
>                 int size;
>                 const struct iovec *vec;
>         }
>
>         int process_madvise(int pidfd, ssize_t nr_elem, int *behavior,
>                                 struct pr_madvise_param *restuls,
>                                 struct pr_madvise_param *ranges,
>                                 unsigned long flags);
>
> The syscall get pidfd to give hints to external process and provides
> pair of result/ranges vector arguments so that it could give several
> hints to each address range all at once.
>
> I guess others have different ideas about the naming of syscall and options
> so feel free to suggest better naming.
>
> - Experiment
>
> We did bunch of testing with several hundreds of real users, not artificial
> benchmark on android. We saw about 17% cold start decreasement without any
> significant battery/app startup latency issues. And with artificial benchmark
> which launches and switching apps, we saw average 7% app launching improvement,
> 18% less lmkd kill and good stat from vmstat.
>
> A is vanilla and B is process_madvise.
>
>
>                                        A          B      delta   ratio(%)
>                allocstall_dma          0          0          0       0.00
>            allocstall_movable       1464        457      -1007     -69.00
>             allocstall_normal     263210     190763     -72447     -28.00
>              allocstall_total     264674     191220     -73454     -28.00
>           compact_daemon_wake      26912      25294      -1618      -7.00
>                  compact_fail      17885      14151      -3734     -21.00
>          compact_free_scanned 4204766409 3835994922 -368771487      -9.00
>              compact_isolated    3446484    2967618    -478866     -14.00
>       compact_migrate_scanned 1621336411 1324695710 -296640701     -19.00
>                 compact_stall      19387      15343      -4044     -21.00
>               compact_success       1502       1192       -310     -21.00
> kswapd_high_wmark_hit_quickly        234        184        -50     -22.00
>             kswapd_inodesteal     221635     233093      11458       5.00
>  kswapd_low_wmark_hit_quickly      66065      54009     -12056     -19.00
>                    nr_dirtied     259934     296476      36542      14.00
>   nr_vmscan_immediate_reclaim       2587       2356       -231      -9.00
>               nr_vmscan_write    1274232    2661733    1387501     108.00
>                    nr_written    1514060    2937560    1423500      94.00
>                    pageoutrun      67561      55133     -12428     -19.00
>                    pgactivate    2335060    1984882    -350178     -15.00
>                   pgalloc_dma   13743011   14096463     353452       2.00
>               pgalloc_movable          0          0          0       0.00
>                pgalloc_normal   18742440   16802065   -1940375     -11.00
>                 pgalloc_total   32485451   30898528   -1586923      -5.00
>                  pgdeactivate    4262210    2930670   -1331540     -32.00
>                       pgfault   30812334   31085065     272731       0.00
>                        pgfree   33553970   31765164   -1788806      -6.00
>                  pginodesteal      33411      15084     -18327     -55.00
>                   pglazyfreed          0          0          0       0.00
>                    pgmajfault     551312    1508299     956987     173.00
>                pgmigrate_fail      43927      29330     -14597     -34.00
>             pgmigrate_success    1399851    1203922    -195929     -14.00
>                        pgpgin   24141776   19032156   -5109620     -22.00
>                       pgpgout     959344    1103316     143972      15.00
>                  pgpgoutclean    4639732    3765868    -873864     -19.00
>                      pgrefill    4884560    3006938   -1877622     -39.00
>                     pgrotated      37828      25897     -11931     -32.00
>                 pgscan_direct    1456037     957567    -498470     -35.00
>        pgscan_direct_throttle          0          0          0       0.00
>                 pgscan_kswapd    6667767    5047360   -1620407     -25.00
>                  pgscan_total    8123804    6004927   -2118877     -27.00
>                    pgskip_dma          0          0          0       0.00
>                pgskip_movable          0          0          0       0.00
>                 pgskip_normal      14907      25382      10475      70.00
>                  pgskip_total      14907      25382      10475      70.00
>                pgsteal_direct    1118986     690215    -428771     -39.00
>                pgsteal_kswapd    4750223    3657107   -1093116     -24.00
>                 pgsteal_total    5869209    4347322   -1521887     -26.00
>                        pswpin     417613    1392647     975034     233.00
>                       pswpout    1274224    2661731    1387507     108.00
>                 slabs_scanned   13686905   10807200   -2879705     -22.00
>           workingset_activate     668966     569444     -99522     -15.00
>        workingset_nodereclaim      38957      32621      -6336     -17.00
>            workingset_refault    2816795    2179782    -637013     -23.00
>            workingset_restore     294320     168601    -125719     -43.00
>
> pgmajfault is increased by 173% because swapin is increased by 200% by
> process_madvise hint. However, swap read based on zram is much cheaper
> than file IO in performance point of view and app hot start by swapin is
> also cheaper than cold start from the beginning of app which needs many IO
> from storage and initialization steps.
>
> This patchset is against on next-20190517.
>
> Minchan Kim (7):
>   mm: introduce MADV_COOL
>   mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM
>   mm: introduce MADV_COLD
>   mm: factor out madvise's core functionality
>   mm: introduce external memory hinting API
>   mm: extend process_madvise syscall to support vector arrary
>   mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
>
>  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>  include/linux/page-flags.h             |   1 +
>  include/linux/page_idle.h              |  15 +
>  include/linux/proc_fs.h                |   1 +
>  include/linux/swap.h                   |   2 +
>  include/linux/syscalls.h               |   2 +
>  include/uapi/asm-generic/mman-common.h |  12 +
>  include/uapi/asm-generic/unistd.h      |   2 +
>  kernel/signal.c                        |   2 +-
>  kernel/sys_ni.c                        |   1 +
>  mm/madvise.c                           | 600 +++++++++++++++++++++----
>  mm/swap.c                              |  43 ++
>  mm/vmscan.c                            |  80 +++-
>  14 files changed, 680 insertions(+), 83 deletions(-)
>
> --
> 2.21.0.1020.gf2820cf01a-goog
>


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 0/7] introduce memory hinting API for external process
  2019-05-21  2:55     ` Anshuman Khandual
  2019-05-21  5:14       ` Minchan Kim
  2019-05-21 10:34       ` Michal Hocko
@ 2019-05-21 12:56       ` Shakeel Butt
  2019-05-22  4:15         ` Brian Geffon
  2019-05-22  4:23         ` Brian Geffon
  2 siblings, 2 replies; 138+ messages in thread
From: Shakeel Butt @ 2019-05-21 12:56 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Tim Murray, Minchan Kim, Andrew Morton, LKML, linux-mm,
	Michal Hocko, Johannes Weiner, Joel Fernandes,
	Suren Baghdasaryan, Daniel Colascione, Sonny Rao, Brian Geffon,
	linux-api

On Mon, May 20, 2019 at 7:55 PM Anshuman Khandual
<anshuman.khandual@arm.com> wrote:
>
>
>
> On 05/20/2019 10:29 PM, Tim Murray wrote:
> > On Sun, May 19, 2019 at 11:37 PM Anshuman Khandual
> > <anshuman.khandual@arm.com> wrote:
> >>
> >> Or Is the objective here is reduce the number of processes which get killed by
> >> lmkd by triggering swapping for the unused memory (user hinted) sooner so that
> >> they dont get picked by lmkd. Under utilization for zram hardware is a concern
> >> here as well ?
> >
> > The objective is to avoid some instances of memory pressure by
> > proactively swapping pages that userspace knows to be cold before
> > those pages reach the end of the LRUs, which in turn can prevent some
> > apps from being killed by lmk/lmkd. As soon as Android userspace knows
> > that an application is not being used and is only resident to improve
> > performance if the user returns to that app, we can kick off
> > process_madvise on that process's pages (or some portion of those
> > pages) in a power-efficient way to reduce memory pressure long before
> > the system hits the free page watermark. This allows the system more
> > time to put pages into zram versus waiting for the watermark to
> > trigger kswapd, which decreases the likelihood that later memory
> > allocations will cause enough pressure to trigger a kill of one of
> > these apps.
>
> So this opens up bit of LRU management to user space hints. Also because the app
> in itself wont know about the memory situation of the entire system, new system
> call needs to be called from an external process.
>
> >
> >> Swapping out memory into zram wont increase the latency for a hot start ? Or
> >> is it because as it will prevent a fresh cold start which anyway will be slower
> >> than a slow hot start. Just being curious.
> >
> > First, not all swapped pages will be reloaded immediately once an app
> > is resumed. We've found that an app's working set post-process_madvise
> > is significantly smaller than what an app allocates when it first
> > launches (see the delta between pswpin and pswpout in Minchan's
> > results). Presumably because of this, faulting to fetch from zram does
>
> pswpin      417613    1392647     975034     233.00
> pswpout    1274224    2661731    1387507     108.00
>
> IIUC the swap-in ratio is way higher in comparison to that of swap out. Is that
> always the case ? Or it tend to swap out from an active area of the working set
> which faulted back again.
>
> > not seem to introduce a noticeable hot start penalty, not does it
> > cause an increase in performance problems later in the app's
> > lifecycle. I've measured with and without process_madvise, and the
> > differences are within our noise bounds. Second, because we're not
>
> That is assuming that post process_madvise() working set for the application is
> always smaller. There is another challenge. The external process should ideally
> have the knowledge of active areas of the working set for an application in
> question for it to invoke process_madvise() correctly to prevent such scenarios.
>
> > preemptively evicting file pages and only making them more likely to
> > be evicted when there's already memory pressure, we avoid the case
> > where we process_madvise an app then immediately return to the app and
> > reload all file pages in the working set even though there was no
> > intervening memory pressure. Our initial version of this work evicted
>
> That would be the worst case scenario which should be avoided. Memory pressure
> must be a parameter before actually doing the swap out. But pages if know to be
> inactive/cold can be marked high priority to be swapped out.
>
> > file pages preemptively and did cause a noticeable slowdown (~15%) for
> > that case; this patch set avoids that slowdown. Finally, the benefit
> > from avoiding cold starts is huge. The performance improvement from
> > having a hot start instead of a cold start ranges from 3x for very
> > small apps to 50x+ for larger apps like high-fidelity games.
>
> Is there any other real world scenario apart from this app based ecosystem where
> user hinted LRU management might be helpful ? Just being curious. Thanks for the
> detailed explanation. I will continue looking into this series.

Chrome OS is another real world use-case for this user hinted LRU
management approach by proactively reclaiming reclaim from tabs not
accessed by the user for some time.


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 5/7] mm: introduce external memory hinting API
  2019-05-20  3:52 ` [RFC 5/7] mm: introduce external memory hinting API Minchan Kim
  2019-05-20  9:18   ` Michal Hocko
  2019-05-21  9:01   ` Christian Brauner
@ 2019-05-21 15:31   ` Oleg Nesterov
  2019-05-27  7:43     ` Minchan Kim
  2019-05-29  3:41   ` Hillf Danton
  2019-05-30  0:38   ` Minchan Kim
  4 siblings, 1 reply; 138+ messages in thread
From: Oleg Nesterov @ 2019-05-21 15:31 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Michal Hocko, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon

On 05/20, Minchan Kim wrote:
>
> +	rcu_read_lock();
> +	tsk = pid_task(pid, PIDTYPE_PID);
> +	if (!tsk) {
> +		rcu_read_unlock();
> +		goto err;
> +	}
> +	get_task_struct(tsk);
> +	rcu_read_unlock();
> +	mm = mm_access(tsk, PTRACE_MODE_ATTACH_REALCREDS);
> +	if (!mm || IS_ERR(mm)) {
> +		ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
> +		if (ret == -EACCES)
> +			ret = -EPERM;
> +		goto err;
> +	}
> +	ret = madvise_core(tsk, start, len_in, behavior);

IIUC, madvise_core(tsk) plays with tsk->mm->mmap_sem. But this tsk can exit and
nullify its ->mm right after mm_access() succeeds.

another problem is that pid_task(pid) can return a zombie leader, in this case
mm_access() will fail while it shouldn't.

Oleg.


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-21  2:55     ` Minchan Kim
  2019-05-21  6:26       ` Michal Hocko
@ 2019-05-21 15:33       ` Johannes Weiner
  2019-05-22  1:50         ` Minchan Kim
  1 sibling, 1 reply; 138+ messages in thread
From: Johannes Weiner @ 2019-05-21 15:33 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Michal Hocko, Andrew Morton, LKML, linux-mm, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Tue, May 21, 2019 at 11:55:33AM +0900, Minchan Kim wrote:
> On Mon, May 20, 2019 at 11:28:01AM +0200, Michal Hocko wrote:
> > [cc linux-api]
> > 
> > On Mon 20-05-19 12:52:54, Minchan Kim wrote:
> > > System could have much faster swap device like zRAM. In that case, swapping
> > > is extremely cheaper than file-IO on the low-end storage.
> > > In this configuration, userspace could handle different strategy for each
> > > kinds of vma. IOW, they want to reclaim anonymous pages by MADV_COLD
> > > while it keeps file-backed pages in inactive LRU by MADV_COOL because
> > > file IO is more expensive in this case so want to keep them in memory
> > > until memory pressure happens.
> > > 
> > > To support such strategy easier, this patch introduces
> > > MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER options in madvise(2) like
> > > that /proc/<pid>/clear_refs already has supported same filters.
> > > They are filters could be Ored with other existing hints using top two bits
> > > of (int behavior).
> > 
> > madvise operates on top of ranges and it is quite trivial to do the
> > filtering from the userspace so why do we need any additional filtering?
> > 
> > > Once either of them is set, the hint could affect only the interested vma
> > > either anonymous or file-backed.
> > > 
> > > With that, user could call a process_madvise syscall simply with a entire
> > > range(0x0 - 0xFFFFFFFFFFFFFFFF) but either of MADV_ANONYMOUS_FILTER and
> > > MADV_FILE_FILTER so there is no need to call the syscall range by range.
> > 
> > OK, so here is the reason you want that. The immediate question is why
> > cannot the monitor do the filtering from the userspace. Slightly more
> > work, all right, but less of an API to expose and that itself is a
> > strong argument against.
> 
> What I should do if we don't have such filter option is to enumerate all of
> vma via /proc/<pid>/maps and then parse every ranges and inode from string,
> which would be painful for 2000+ vmas.

Just out of curiosity, how do you get to 2000+ distinct memory regions
in the address space of a mobile app? I'm assuming these aren't files,
but rather anon objects with poor grouping. Is that from guard pages
between individual heap allocations or something?


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-21 15:33       ` Johannes Weiner
@ 2019-05-22  1:50         ` Minchan Kim
  0 siblings, 0 replies; 138+ messages in thread
From: Minchan Kim @ 2019-05-22  1:50 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Andrew Morton, LKML, linux-mm, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Tue, May 21, 2019 at 11:33:10AM -0400, Johannes Weiner wrote:
> On Tue, May 21, 2019 at 11:55:33AM +0900, Minchan Kim wrote:
> > On Mon, May 20, 2019 at 11:28:01AM +0200, Michal Hocko wrote:
> > > [cc linux-api]
> > > 
> > > On Mon 20-05-19 12:52:54, Minchan Kim wrote:
> > > > System could have much faster swap device like zRAM. In that case, swapping
> > > > is extremely cheaper than file-IO on the low-end storage.
> > > > In this configuration, userspace could handle different strategy for each
> > > > kinds of vma. IOW, they want to reclaim anonymous pages by MADV_COLD
> > > > while it keeps file-backed pages in inactive LRU by MADV_COOL because
> > > > file IO is more expensive in this case so want to keep them in memory
> > > > until memory pressure happens.
> > > > 
> > > > To support such strategy easier, this patch introduces
> > > > MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER options in madvise(2) like
> > > > that /proc/<pid>/clear_refs already has supported same filters.
> > > > They are filters could be Ored with other existing hints using top two bits
> > > > of (int behavior).
> > > 
> > > madvise operates on top of ranges and it is quite trivial to do the
> > > filtering from the userspace so why do we need any additional filtering?
> > > 
> > > > Once either of them is set, the hint could affect only the interested vma
> > > > either anonymous or file-backed.
> > > > 
> > > > With that, user could call a process_madvise syscall simply with a entire
> > > > range(0x0 - 0xFFFFFFFFFFFFFFFF) but either of MADV_ANONYMOUS_FILTER and
> > > > MADV_FILE_FILTER so there is no need to call the syscall range by range.
> > > 
> > > OK, so here is the reason you want that. The immediate question is why
> > > cannot the monitor do the filtering from the userspace. Slightly more
> > > work, all right, but less of an API to expose and that itself is a
> > > strong argument against.
> > 
> > What I should do if we don't have such filter option is to enumerate all of
> > vma via /proc/<pid>/maps and then parse every ranges and inode from string,
> > which would be painful for 2000+ vmas.
> 
> Just out of curiosity, how do you get to 2000+ distinct memory regions
> in the address space of a mobile app? I'm assuming these aren't files,
> but rather anon objects with poor grouping. Is that from guard pages
> between individual heap allocations or something?

Android uses preload library model to speed up app launch so it loads
all of library in advance on zygote and forks new app based on it.


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 0/7] introduce memory hinting API for external process
  2019-05-21 12:56       ` Shakeel Butt
@ 2019-05-22  4:15         ` Brian Geffon
  2019-05-22  4:23         ` Brian Geffon
  1 sibling, 0 replies; 138+ messages in thread
From: Brian Geffon @ 2019-05-22  4:15 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Anshuman Khandual, Tim Murray, Minchan Kim, Andrew Morton, LKML,
	linux-mm, Michal Hocko, Johannes Weiner, Joel Fernandes,
	Suren Baghdasaryan, Daniel Colascione, Sonny Rao, linux-api

[-- Attachment #1: Type: text/plain, Size: 5235 bytes --]

To expand on the ChromeOS use case we're in a very similar situation to
Android. For example, the Chrome browser uses a separate process for each
individual tab (with some exceptions) and over time many tabs remain open
in a back-grounded or idle state. Given that we have a lot of information
about the weight of a tab, when it was last active, etc, we can benefit
tremendously from per-process reclaim. We're working on getting real world
numbers but all of our initial testing shows very promising results.

On Tue, May 21, 2019 at 5:57 AM Shakeel Butt <shakeelb@google.com> wrote:

> On Mon, May 20, 2019 at 7:55 PM Anshuman Khandual
> <anshuman.khandual@arm.com> wrote:
> >
> >
> >
> > On 05/20/2019 10:29 PM, Tim Murray wrote:
> > > On Sun, May 19, 2019 at 11:37 PM Anshuman Khandual
> > > <anshuman.khandual@arm.com> wrote:
> > >>
> > >> Or Is the objective here is reduce the number of processes which get
> killed by
> > >> lmkd by triggering swapping for the unused memory (user hinted)
> sooner so that
> > >> they dont get picked by lmkd. Under utilization for zram hardware is
> a concern
> > >> here as well ?
> > >
> > > The objective is to avoid some instances of memory pressure by
> > > proactively swapping pages that userspace knows to be cold before
> > > those pages reach the end of the LRUs, which in turn can prevent some
> > > apps from being killed by lmk/lmkd. As soon as Android userspace knows
> > > that an application is not being used and is only resident to improve
> > > performance if the user returns to that app, we can kick off
> > > process_madvise on that process's pages (or some portion of those
> > > pages) in a power-efficient way to reduce memory pressure long before
> > > the system hits the free page watermark. This allows the system more
> > > time to put pages into zram versus waiting for the watermark to
> > > trigger kswapd, which decreases the likelihood that later memory
> > > allocations will cause enough pressure to trigger a kill of one of
> > > these apps.
> >
> > So this opens up bit of LRU management to user space hints. Also because
> the app
> > in itself wont know about the memory situation of the entire system, new
> system
> > call needs to be called from an external process.
> >
> > >
> > >> Swapping out memory into zram wont increase the latency for a hot
> start ? Or
> > >> is it because as it will prevent a fresh cold start which anyway will
> be slower
> > >> than a slow hot start. Just being curious.
> > >
> > > First, not all swapped pages will be reloaded immediately once an app
> > > is resumed. We've found that an app's working set post-process_madvise
> > > is significantly smaller than what an app allocates when it first
> > > launches (see the delta between pswpin and pswpout in Minchan's
> > > results). Presumably because of this, faulting to fetch from zram does
> >
> > pswpin      417613    1392647     975034     233.00
> > pswpout    1274224    2661731    1387507     108.00
> >
> > IIUC the swap-in ratio is way higher in comparison to that of swap out.
> Is that
> > always the case ? Or it tend to swap out from an active area of the
> working set
> > which faulted back again.
> >
> > > not seem to introduce a noticeable hot start penalty, not does it
> > > cause an increase in performance problems later in the app's
> > > lifecycle. I've measured with and without process_madvise, and the
> > > differences are within our noise bounds. Second, because we're not
> >
> > That is assuming that post process_madvise() working set for the
> application is
> > always smaller. There is another challenge. The external process should
> ideally
> > have the knowledge of active areas of the working set for an application
> in
> > question for it to invoke process_madvise() correctly to prevent such
> scenarios.
> >
> > > preemptively evicting file pages and only making them more likely to
> > > be evicted when there's already memory pressure, we avoid the case
> > > where we process_madvise an app then immediately return to the app and
> > > reload all file pages in the working set even though there was no
> > > intervening memory pressure. Our initial version of this work evicted
> >
> > That would be the worst case scenario which should be avoided. Memory
> pressure
> > must be a parameter before actually doing the swap out. But pages if
> know to be
> > inactive/cold can be marked high priority to be swapped out.
> >
> > > file pages preemptively and did cause a noticeable slowdown (~15%) for
> > > that case; this patch set avoids that slowdown. Finally, the benefit
> > > from avoiding cold starts is huge. The performance improvement from
> > > having a hot start instead of a cold start ranges from 3x for very
> > > small apps to 50x+ for larger apps like high-fidelity games.
> >
> > Is there any other real world scenario apart from this app based
> ecosystem where
> > user hinted LRU management might be helpful ? Just being curious. Thanks
> for the
> > detailed explanation. I will continue looking into this series.
>
> Chrome OS is another real world use-case for this user hinted LRU
> management approach by proactively reclaiming reclaim from tabs not
> accessed by the user for some time.
>

[-- Attachment #2: Type: text/html, Size: 6316 bytes --]

^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 0/7] introduce memory hinting API for external process
  2019-05-21 12:56       ` Shakeel Butt
  2019-05-22  4:15         ` Brian Geffon
@ 2019-05-22  4:23         ` Brian Geffon
  1 sibling, 0 replies; 138+ messages in thread
From: Brian Geffon @ 2019-05-22  4:23 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Anshuman Khandual, Tim Murray, Minchan Kim, Andrew Morton, LKML,
	linux-mm, Michal Hocko, Johannes Weiner, Joel Fernandes,
	Suren Baghdasaryan, Daniel Colascione, Sonny Rao, linux-api

To expand on the ChromeOS use case we're in a very similar situation
to Android. For example, the Chrome browser uses a separate process
for each individual tab (with some exceptions) and over time many tabs
remain open in a back-grounded or idle state. Given that we have a lot
of information about the weight of a tab, when it was last active,
etc, we can benefit tremendously from per-process reclaim. We're
working on getting real world numbers but all of our initial testing
shows very promising results.


On Tue, May 21, 2019 at 5:57 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Mon, May 20, 2019 at 7:55 PM Anshuman Khandual
> <anshuman.khandual@arm.com> wrote:
> >
> >
> >
> > On 05/20/2019 10:29 PM, Tim Murray wrote:
> > > On Sun, May 19, 2019 at 11:37 PM Anshuman Khandual
> > > <anshuman.khandual@arm.com> wrote:
> > >>
> > >> Or Is the objective here is reduce the number of processes which get killed by
> > >> lmkd by triggering swapping for the unused memory (user hinted) sooner so that
> > >> they dont get picked by lmkd. Under utilization for zram hardware is a concern
> > >> here as well ?
> > >
> > > The objective is to avoid some instances of memory pressure by
> > > proactively swapping pages that userspace knows to be cold before
> > > those pages reach the end of the LRUs, which in turn can prevent some
> > > apps from being killed by lmk/lmkd. As soon as Android userspace knows
> > > that an application is not being used and is only resident to improve
> > > performance if the user returns to that app, we can kick off
> > > process_madvise on that process's pages (or some portion of those
> > > pages) in a power-efficient way to reduce memory pressure long before
> > > the system hits the free page watermark. This allows the system more
> > > time to put pages into zram versus waiting for the watermark to
> > > trigger kswapd, which decreases the likelihood that later memory
> > > allocations will cause enough pressure to trigger a kill of one of
> > > these apps.
> >
> > So this opens up bit of LRU management to user space hints. Also because the app
> > in itself wont know about the memory situation of the entire system, new system
> > call needs to be called from an external process.
> >
> > >
> > >> Swapping out memory into zram wont increase the latency for a hot start ? Or
> > >> is it because as it will prevent a fresh cold start which anyway will be slower
> > >> than a slow hot start. Just being curious.
> > >
> > > First, not all swapped pages will be reloaded immediately once an app
> > > is resumed. We've found that an app's working set post-process_madvise
> > > is significantly smaller than what an app allocates when it first
> > > launches (see the delta between pswpin and pswpout in Minchan's
> > > results). Presumably because of this, faulting to fetch from zram does
> >
> > pswpin      417613    1392647     975034     233.00
> > pswpout    1274224    2661731    1387507     108.00
> >
> > IIUC the swap-in ratio is way higher in comparison to that of swap out. Is that
> > always the case ? Or it tend to swap out from an active area of the working set
> > which faulted back again.
> >
> > > not seem to introduce a noticeable hot start penalty, not does it
> > > cause an increase in performance problems later in the app's
> > > lifecycle. I've measured with and without process_madvise, and the
> > > differences are within our noise bounds. Second, because we're not
> >
> > That is assuming that post process_madvise() working set for the application is
> > always smaller. There is another challenge. The external process should ideally
> > have the knowledge of active areas of the working set for an application in
> > question for it to invoke process_madvise() correctly to prevent such scenarios.
> >
> > > preemptively evicting file pages and only making them more likely to
> > > be evicted when there's already memory pressure, we avoid the case
> > > where we process_madvise an app then immediately return to the app and
> > > reload all file pages in the working set even though there was no
> > > intervening memory pressure. Our initial version of this work evicted
> >
> > That would be the worst case scenario which should be avoided. Memory pressure
> > must be a parameter before actually doing the swap out. But pages if know to be
> > inactive/cold can be marked high priority to be swapped out.
> >
> > > file pages preemptively and did cause a noticeable slowdown (~15%) for
> > > that case; this patch set avoids that slowdown. Finally, the benefit
> > > from avoiding cold starts is huge. The performance improvement from
> > > having a hot start instead of a cold start ranges from 3x for very
> > > small apps to 50x+ for larger apps like high-fidelity games.
> >
> > Is there any other real world scenario apart from this app based ecosystem where
> > user hinted LRU management might be helpful ? Just being curious. Thanks for the
> > detailed explanation. I will continue looking into this series.
>
> Chrome OS is another real world use-case for this user hinted LRU
> management approach by proactively reclaiming reclaim from tabs not
> accessed by the user for some time.


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 0/7] introduce memory hinting API for external process
  2019-05-21 11:39       ` Christian Brauner
@ 2019-05-22  5:11         ` Daniel Colascione
  2019-05-22  8:22           ` Christian Brauner
  0 siblings, 1 reply; 138+ messages in thread
From: Daniel Colascione @ 2019-05-22  5:11 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Michal Hocko,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Shakeel Butt, Sonny Rao, Brian Geffon, Jann Horn

On Tue, May 21, 2019 at 4:39 AM Christian Brauner <christian@brauner.io> wrote:
>
> On Tue, May 21, 2019 at 01:30:29PM +0200, Christian Brauner wrote:
> > On Tue, May 21, 2019 at 08:05:52PM +0900, Minchan Kim wrote:
> > > On Tue, May 21, 2019 at 10:42:00AM +0200, Christian Brauner wrote:
> > > > On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote:
> > > > > - Background
> > > > >
> > > > > The Android terminology used for forking a new process and starting an app
> > > > > from scratch is a cold start, while resuming an existing app is a hot start.
> > > > > While we continually try to improve the performance of cold starts, hot
> > > > > starts will always be significantly less power hungry as well as faster so
> > > > > we are trying to make hot start more likely than cold start.
> > > > >
> > > > > To increase hot start, Android userspace manages the order that apps should
> > > > > be killed in a process called ActivityManagerService. ActivityManagerService
> > > > > tracks every Android app or service that the user could be interacting with
> > > > > at any time and translates that into a ranked list for lmkd(low memory
> > > > > killer daemon). They are likely to be killed by lmkd if the system has to
> > > > > reclaim memory. In that sense they are similar to entries in any other cache.
> > > > > Those apps are kept alive for opportunistic performance improvements but
> > > > > those performance improvements will vary based on the memory requirements of
> > > > > individual workloads.
> > > > >
> > > > > - Problem
> > > > >
> > > > > Naturally, cached apps were dominant consumers of memory on the system.
> > > > > However, they were not significant consumers of swap even though they are
> > > > > good candidate for swap. Under investigation, swapping out only begins
> > > > > once the low zone watermark is hit and kswapd wakes up, but the overall
> > > > > allocation rate in the system might trip lmkd thresholds and cause a cached
> > > > > process to be killed(we measured performance swapping out vs. zapping the
> > > > > memory by killing a process. Unsurprisingly, zapping is 10x times faster
> > > > > even though we use zram which is much faster than real storage) so kill
> > > > > from lmkd will often satisfy the high zone watermark, resulting in very
> > > > > few pages actually being moved to swap.
> > > > >
> > > > > - Approach
> > > > >
> > > > > The approach we chose was to use a new interface to allow userspace to
> > > > > proactively reclaim entire processes by leveraging platform information.
> > > > > This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
> > > > > that are known to be cold from userspace and to avoid races with lmkd
> > > > > by reclaiming apps as soon as they entered the cached state. Additionally,
> > > > > it could provide many chances for platform to use much information to
> > > > > optimize memory efficiency.
> > > > >
> > > > > IMHO we should spell it out that this patchset complements MADV_WONTNEED
> > > > > and MADV_FREE by adding non-destructive ways to gain some free memory
> > > > > space. MADV_COLD is similar to MADV_WONTNEED in a way that it hints the
> > > > > kernel that memory region is not currently needed and should be reclaimed
> > > > > immediately; MADV_COOL is similar to MADV_FREE in a way that it hints the
> > > > > kernel that memory region is not currently needed and should be reclaimed
> > > > > when memory pressure rises.
> > > > >
> > > > > To achieve the goal, the patchset introduce two new options for madvise.
> > > > > One is MADV_COOL which will deactive activated pages and the other is
> > > > > MADV_COLD which will reclaim private pages instantly. These new options
> > > > > complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to
> > > > > gain some free memory space. MADV_COLD is similar to MADV_DONTNEED in a way
> > > > > that it hints the kernel that memory region is not currently needed and
> > > > > should be reclaimed immediately; MADV_COOL is similar to MADV_FREE in a way
> > > > > that it hints the kernel that memory region is not currently needed and
> > > > > should be reclaimed when memory pressure rises.
> > > > >
> > > > > This approach is similar in spirit to madvise(MADV_WONTNEED), but the
> > > > > information required to make the reclaim decision is not known to the app.
> > > > > Instead, it is known to a centralized userspace daemon, and that daemon
> > > > > must be able to initiate reclaim on its own without any app involvement.
> > > > > To solve the concern, this patch introduces new syscall -
> > > > >
> > > > >         struct pr_madvise_param {
> > > > >                 int size;
> > > > >                 const struct iovec *vec;
> > > > >         }
> > > > >
> > > > >         int process_madvise(int pidfd, ssize_t nr_elem, int *behavior,
> > > > >                                 struct pr_madvise_param *restuls,
> > > > >                                 struct pr_madvise_param *ranges,
> > > > >                                 unsigned long flags);
> > > > >
> > > > > The syscall get pidfd to give hints to external process and provides
> > > > > pair of result/ranges vector arguments so that it could give several
> > > > > hints to each address range all at once.
> > > > >
> > > > > I guess others have different ideas about the naming of syscall and options
> > > > > so feel free to suggest better naming.
> > > >
> > > > Yes, all new syscalls making use of pidfds should be named
> > > > pidfd_<action>. So please make this pidfd_madvise.
> > >
> > > I don't have any particular preference but just wondering why pidfd is
> > > so special to have it as prefix of system call name.
> >
> > It's a whole new API to address processes. We already have
> > clone(CLONE_PIDFD) and pidfd_send_signal() as you have seen since you
> > exported pidfd_to_pid(). And we're going to have pidfd_open(). Your
> > syscall works only with pidfds so it's tied to this api as well so it
> > should follow the naming scheme. This also makes life easier for
> > userspace and is consistent.
>
> This is at least my reasoning. I'm not going to make this a whole big
> pedantic argument. If people have really strong feelings about not using
> this prefix then fine. But if syscalls can be grouped together and have
> consistent naming this is always a big plus.

My hope has been that pidfd use becomes normalized enough that
prefixing "pidfd_" to pidfd-accepting system calls becomes redundant.
We write write(), not fd_write(), right? :-) pidfd_open() makes sense
because the primary purpose of this system call is to operate on a
pidfd, but I think process_madvise() is fine.


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 0/7] introduce memory hinting API for external process
  2019-05-22  5:11         ` Daniel Colascione
@ 2019-05-22  8:22           ` Christian Brauner
  2019-05-22 13:16             ` Daniel Colascione
  0 siblings, 1 reply; 138+ messages in thread
From: Christian Brauner @ 2019-05-22  8:22 UTC (permalink / raw)
  To: Daniel Colascione, Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Michal Hocko, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
	Sonny Rao, Brian Geffon, Jann Horn

On Wed, May 22, 2019 at 7:12 AM Daniel Colascione <dancol@google.com> wrote:
>
> On Tue, May 21, 2019 at 4:39 AM Christian Brauner <christian@brauner.io> wrote:
> >
> > On Tue, May 21, 2019 at 01:30:29PM +0200, Christian Brauner wrote:
> > > On Tue, May 21, 2019 at 08:05:52PM +0900, Minchan Kim wrote:
> > > > On Tue, May 21, 2019 at 10:42:00AM +0200, Christian Brauner wrote:
> > > > > On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote:
> > > > > > - Background
> > > > > >
> > > > > > The Android terminology used for forking a new process and starting an app
> > > > > > from scratch is a cold start, while resuming an existing app is a hot start.
> > > > > > While we continually try to improve the performance of cold starts, hot
> > > > > > starts will always be significantly less power hungry as well as faster so
> > > > > > we are trying to make hot start more likely than cold start.
> > > > > >
> > > > > > To increase hot start, Android userspace manages the order that apps should
> > > > > > be killed in a process called ActivityManagerService. ActivityManagerService
> > > > > > tracks every Android app or service that the user could be interacting with
> > > > > > at any time and translates that into a ranked list for lmkd(low memory
> > > > > > killer daemon). They are likely to be killed by lmkd if the system has to
> > > > > > reclaim memory. In that sense they are similar to entries in any other cache.
> > > > > > Those apps are kept alive for opportunistic performance improvements but
> > > > > > those performance improvements will vary based on the memory requirements of
> > > > > > individual workloads.
> > > > > >
> > > > > > - Problem
> > > > > >
> > > > > > Naturally, cached apps were dominant consumers of memory on the system.
> > > > > > However, they were not significant consumers of swap even though they are
> > > > > > good candidate for swap. Under investigation, swapping out only begins
> > > > > > once the low zone watermark is hit and kswapd wakes up, but the overall
> > > > > > allocation rate in the system might trip lmkd thresholds and cause a cached
> > > > > > process to be killed(we measured performance swapping out vs. zapping the
> > > > > > memory by killing a process. Unsurprisingly, zapping is 10x times faster
> > > > > > even though we use zram which is much faster than real storage) so kill
> > > > > > from lmkd will often satisfy the high zone watermark, resulting in very
> > > > > > few pages actually being moved to swap.
> > > > > >
> > > > > > - Approach
> > > > > >
> > > > > > The approach we chose was to use a new interface to allow userspace to
> > > > > > proactively reclaim entire processes by leveraging platform information.
> > > > > > This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
> > > > > > that are known to be cold from userspace and to avoid races with lmkd
> > > > > > by reclaiming apps as soon as they entered the cached state. Additionally,
> > > > > > it could provide many chances for platform to use much information to
> > > > > > optimize memory efficiency.
> > > > > >
> > > > > > IMHO we should spell it out that this patchset complements MADV_WONTNEED
> > > > > > and MADV_FREE by adding non-destructive ways to gain some free memory
> > > > > > space. MADV_COLD is similar to MADV_WONTNEED in a way that it hints the
> > > > > > kernel that memory region is not currently needed and should be reclaimed
> > > > > > immediately; MADV_COOL is similar to MADV_FREE in a way that it hints the
> > > > > > kernel that memory region is not currently needed and should be reclaimed
> > > > > > when memory pressure rises.
> > > > > >
> > > > > > To achieve the goal, the patchset introduce two new options for madvise.
> > > > > > One is MADV_COOL which will deactive activated pages and the other is
> > > > > > MADV_COLD which will reclaim private pages instantly. These new options
> > > > > > complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to
> > > > > > gain some free memory space. MADV_COLD is similar to MADV_DONTNEED in a way
> > > > > > that it hints the kernel that memory region is not currently needed and
> > > > > > should be reclaimed immediately; MADV_COOL is similar to MADV_FREE in a way
> > > > > > that it hints the kernel that memory region is not currently needed and
> > > > > > should be reclaimed when memory pressure rises.
> > > > > >
> > > > > > This approach is similar in spirit to madvise(MADV_WONTNEED), but the
> > > > > > information required to make the reclaim decision is not known to the app.
> > > > > > Instead, it is known to a centralized userspace daemon, and that daemon
> > > > > > must be able to initiate reclaim on its own without any app involvement.
> > > > > > To solve the concern, this patch introduces new syscall -
> > > > > >
> > > > > >         struct pr_madvise_param {
> > > > > >                 int size;
> > > > > >                 const struct iovec *vec;
> > > > > >         }
> > > > > >
> > > > > >         int process_madvise(int pidfd, ssize_t nr_elem, int *behavior,
> > > > > >                                 struct pr_madvise_param *restuls,
> > > > > >                                 struct pr_madvise_param *ranges,
> > > > > >                                 unsigned long flags);
> > > > > >
> > > > > > The syscall get pidfd to give hints to external process and provides
> > > > > > pair of result/ranges vector arguments so that it could give several
> > > > > > hints to each address range all at once.
> > > > > >
> > > > > > I guess others have different ideas about the naming of syscall and options
> > > > > > so feel free to suggest better naming.
> > > > >
> > > > > Yes, all new syscalls making use of pidfds should be named
> > > > > pidfd_<action>. So please make this pidfd_madvise.
> > > >
> > > > I don't have any particular preference but just wondering why pidfd is
> > > > so special to have it as prefix of system call name.
> > >
> > > It's a whole new API to address processes. We already have
> > > clone(CLONE_PIDFD) and pidfd_send_signal() as you have seen since you
> > > exported pidfd_to_pid(). And we're going to have pidfd_open(). Your
> > > syscall works only with pidfds so it's tied to this api as well so it
> > > should follow the naming scheme. This also makes life easier for
> > > userspace and is consistent.
> >
> > This is at least my reasoning. I'm not going to make this a whole big
> > pedantic argument. If people have really strong feelings about not using
> > this prefix then fine. But if syscalls can be grouped together and have
> > consistent naming this is always a big plus.
>
> My hope has been that pidfd use becomes normalized enough that
> prefixing "pidfd_" to pidfd-accepting system calls becomes redundant.
> We write write(), not fd_write(), right? :-) pidfd_open() makes sense
> because the primary purpose of this system call is to operate on a
> pidfd, but I think process_madvise() is fine.

This madvise syscall just operates on pidfds. It would make sense to
name it process_madvise() if were to operate both on pid_t and int pidfd.
Giving specific names to system calls won't stop it from becoming
normalized. The fact that people built other system calls around it
is enough proof of that. :)
For userspace pidfd_madvise is nicer and it clearly expresses
that it only accepts pidfds.
So please, Minchan make it pidfd_madvise() in the next version. :)

Christian


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 0/7] introduce memory hinting API for external process
  2019-05-22  8:22           ` Christian Brauner
@ 2019-05-22 13:16             ` Daniel Colascione
  2019-05-22 14:52               ` Christian Brauner
  0 siblings, 1 reply; 138+ messages in thread
From: Daniel Colascione @ 2019-05-22 13:16 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Michal Hocko,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Shakeel Butt, Sonny Rao, Brian Geffon, Jann Horn

On Wed, May 22, 2019 at 1:22 AM Christian Brauner <christian@brauner.io> wrote:
>
> On Wed, May 22, 2019 at 7:12 AM Daniel Colascione <dancol@google.com> wrote:
> >
> > On Tue, May 21, 2019 at 4:39 AM Christian Brauner <christian@brauner.io> wrote:
> > >
> > > On Tue, May 21, 2019 at 01:30:29PM +0200, Christian Brauner wrote:
> > > > On Tue, May 21, 2019 at 08:05:52PM +0900, Minchan Kim wrote:
> > > > > On Tue, May 21, 2019 at 10:42:00AM +0200, Christian Brauner wrote:
> > > > > > On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote:
> > > > > > > - Background
> > > > > > >
> > > > > > > The Android terminology used for forking a new process and starting an app
> > > > > > > from scratch is a cold start, while resuming an existing app is a hot start.
> > > > > > > While we continually try to improve the performance of cold starts, hot
> > > > > > > starts will always be significantly less power hungry as well as faster so
> > > > > > > we are trying to make hot start more likely than cold start.
> > > > > > >
> > > > > > > To increase hot start, Android userspace manages the order that apps should
> > > > > > > be killed in a process called ActivityManagerService. ActivityManagerService
> > > > > > > tracks every Android app or service that the user could be interacting with
> > > > > > > at any time and translates that into a ranked list for lmkd(low memory
> > > > > > > killer daemon). They are likely to be killed by lmkd if the system has to
> > > > > > > reclaim memory. In that sense they are similar to entries in any other cache.
> > > > > > > Those apps are kept alive for opportunistic performance improvements but
> > > > > > > those performance improvements will vary based on the memory requirements of
> > > > > > > individual workloads.
> > > > > > >
> > > > > > > - Problem
> > > > > > >
> > > > > > > Naturally, cached apps were dominant consumers of memory on the system.
> > > > > > > However, they were not significant consumers of swap even though they are
> > > > > > > good candidate for swap. Under investigation, swapping out only begins
> > > > > > > once the low zone watermark is hit and kswapd wakes up, but the overall
> > > > > > > allocation rate in the system might trip lmkd thresholds and cause a cached
> > > > > > > process to be killed(we measured performance swapping out vs. zapping the
> > > > > > > memory by killing a process. Unsurprisingly, zapping is 10x times faster
> > > > > > > even though we use zram which is much faster than real storage) so kill
> > > > > > > from lmkd will often satisfy the high zone watermark, resulting in very
> > > > > > > few pages actually being moved to swap.
> > > > > > >
> > > > > > > - Approach
> > > > > > >
> > > > > > > The approach we chose was to use a new interface to allow userspace to
> > > > > > > proactively reclaim entire processes by leveraging platform information.
> > > > > > > This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
> > > > > > > that are known to be cold from userspace and to avoid races with lmkd
> > > > > > > by reclaiming apps as soon as they entered the cached state. Additionally,
> > > > > > > it could provide many chances for platform to use much information to
> > > > > > > optimize memory efficiency.
> > > > > > >
> > > > > > > IMHO we should spell it out that this patchset complements MADV_WONTNEED
> > > > > > > and MADV_FREE by adding non-destructive ways to gain some free memory
> > > > > > > space. MADV_COLD is similar to MADV_WONTNEED in a way that it hints the
> > > > > > > kernel that memory region is not currently needed and should be reclaimed
> > > > > > > immediately; MADV_COOL is similar to MADV_FREE in a way that it hints the
> > > > > > > kernel that memory region is not currently needed and should be reclaimed
> > > > > > > when memory pressure rises.
> > > > > > >
> > > > > > > To achieve the goal, the patchset introduce two new options for madvise.
> > > > > > > One is MADV_COOL which will deactive activated pages and the other is
> > > > > > > MADV_COLD which will reclaim private pages instantly. These new options
> > > > > > > complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to
> > > > > > > gain some free memory space. MADV_COLD is similar to MADV_DONTNEED in a way
> > > > > > > that it hints the kernel that memory region is not currently needed and
> > > > > > > should be reclaimed immediately; MADV_COOL is similar to MADV_FREE in a way
> > > > > > > that it hints the kernel that memory region is not currently needed and
> > > > > > > should be reclaimed when memory pressure rises.
> > > > > > >
> > > > > > > This approach is similar in spirit to madvise(MADV_WONTNEED), but the
> > > > > > > information required to make the reclaim decision is not known to the app.
> > > > > > > Instead, it is known to a centralized userspace daemon, and that daemon
> > > > > > > must be able to initiate reclaim on its own without any app involvement.
> > > > > > > To solve the concern, this patch introduces new syscall -
> > > > > > >
> > > > > > >         struct pr_madvise_param {
> > > > > > >                 int size;
> > > > > > >                 const struct iovec *vec;
> > > > > > >         }
> > > > > > >
> > > > > > >         int process_madvise(int pidfd, ssize_t nr_elem, int *behavior,
> > > > > > >                                 struct pr_madvise_param *restuls,
> > > > > > >                                 struct pr_madvise_param *ranges,
> > > > > > >                                 unsigned long flags);
> > > > > > >
> > > > > > > The syscall get pidfd to give hints to external process and provides
> > > > > > > pair of result/ranges vector arguments so that it could give several
> > > > > > > hints to each address range all at once.
> > > > > > >
> > > > > > > I guess others have different ideas about the naming of syscall and options
> > > > > > > so feel free to suggest better naming.
> > > > > >
> > > > > > Yes, all new syscalls making use of pidfds should be named
> > > > > > pidfd_<action>. So please make this pidfd_madvise.
> > > > >
> > > > > I don't have any particular preference but just wondering why pidfd is
> > > > > so special to have it as prefix of system call name.
> > > >
> > > > It's a whole new API to address processes. We already have
> > > > clone(CLONE_PIDFD) and pidfd_send_signal() as you have seen since you
> > > > exported pidfd_to_pid(). And we're going to have pidfd_open(). Your
> > > > syscall works only with pidfds so it's tied to this api as well so it
> > > > should follow the naming scheme. This also makes life easier for
> > > > userspace and is consistent.
> > >
> > > This is at least my reasoning. I'm not going to make this a whole big
> > > pedantic argument. If people have really strong feelings about not using
> > > this prefix then fine. But if syscalls can be grouped together and have
> > > consistent naming this is always a big plus.
> >
> > My hope has been that pidfd use becomes normalized enough that
> > prefixing "pidfd_" to pidfd-accepting system calls becomes redundant.
> > We write write(), not fd_write(), right? :-) pidfd_open() makes sense
> > because the primary purpose of this system call is to operate on a
> > pidfd, but I think process_madvise() is fine.
>
> This madvise syscall just operates on pidfds. It would make sense to
> name it process_madvise() if were to operate both on pid_t and int pidfd.

The name of the function ought to encode its purpose, not its
signature. The system call under discussion operates on processes and
so should be called "process_madvise". That this system call happens
to accept a pidfd to identify the process on which it operates is not
its most interesting aspect of the system call. The argument type
isn't important enough to spotlight in the permanent name of an API.
Pidfds are novel now, but they won't be novel in the future.

> Giving specific names to system calls won't stop it from becoming
> normalized.

We could name system calls with `cat /dev/urandom | xxd` and they'd
still get used. It doesn't follow that all names are equally good.


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 0/7] introduce memory hinting API for external process
  2019-05-22 13:16             ` Daniel Colascione
@ 2019-05-22 14:52               ` Christian Brauner
  2019-05-22 15:17                 ` Daniel Colascione
  0 siblings, 1 reply; 138+ messages in thread
From: Christian Brauner @ 2019-05-22 14:52 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Michal Hocko,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Shakeel Butt, Sonny Rao, Brian Geffon, Jann Horn

On Wed, May 22, 2019 at 06:16:35AM -0700, Daniel Colascione wrote:
> On Wed, May 22, 2019 at 1:22 AM Christian Brauner <christian@brauner.io> wrote:
> >
> > On Wed, May 22, 2019 at 7:12 AM Daniel Colascione <dancol@google.com> wrote:
> > >
> > > On Tue, May 21, 2019 at 4:39 AM Christian Brauner <christian@brauner.io> wrote:
> > > >
> > > > On Tue, May 21, 2019 at 01:30:29PM +0200, Christian Brauner wrote:
> > > > > On Tue, May 21, 2019 at 08:05:52PM +0900, Minchan Kim wrote:
> > > > > > On Tue, May 21, 2019 at 10:42:00AM +0200, Christian Brauner wrote:
> > > > > > > On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote:
> > > > > > > > - Background
> > > > > > > >
> > > > > > > > The Android terminology used for forking a new process and starting an app
> > > > > > > > from scratch is a cold start, while resuming an existing app is a hot start.
> > > > > > > > While we continually try to improve the performance of cold starts, hot
> > > > > > > > starts will always be significantly less power hungry as well as faster so
> > > > > > > > we are trying to make hot start more likely than cold start.
> > > > > > > >
> > > > > > > > To increase hot start, Android userspace manages the order that apps should
> > > > > > > > be killed in a process called ActivityManagerService. ActivityManagerService
> > > > > > > > tracks every Android app or service that the user could be interacting with
> > > > > > > > at any time and translates that into a ranked list for lmkd(low memory
> > > > > > > > killer daemon). They are likely to be killed by lmkd if the system has to
> > > > > > > > reclaim memory. In that sense they are similar to entries in any other cache.
> > > > > > > > Those apps are kept alive for opportunistic performance improvements but
> > > > > > > > those performance improvements will vary based on the memory requirements of
> > > > > > > > individual workloads.
> > > > > > > >
> > > > > > > > - Problem
> > > > > > > >
> > > > > > > > Naturally, cached apps were dominant consumers of memory on the system.
> > > > > > > > However, they were not significant consumers of swap even though they are
> > > > > > > > good candidate for swap. Under investigation, swapping out only begins
> > > > > > > > once the low zone watermark is hit and kswapd wakes up, but the overall
> > > > > > > > allocation rate in the system might trip lmkd thresholds and cause a cached
> > > > > > > > process to be killed(we measured performance swapping out vs. zapping the
> > > > > > > > memory by killing a process. Unsurprisingly, zapping is 10x times faster
> > > > > > > > even though we use zram which is much faster than real storage) so kill
> > > > > > > > from lmkd will often satisfy the high zone watermark, resulting in very
> > > > > > > > few pages actually being moved to swap.
> > > > > > > >
> > > > > > > > - Approach
> > > > > > > >
> > > > > > > > The approach we chose was to use a new interface to allow userspace to
> > > > > > > > proactively reclaim entire processes by leveraging platform information.
> > > > > > > > This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
> > > > > > > > that are known to be cold from userspace and to avoid races with lmkd
> > > > > > > > by reclaiming apps as soon as they entered the cached state. Additionally,
> > > > > > > > it could provide many chances for platform to use much information to
> > > > > > > > optimize memory efficiency.
> > > > > > > >
> > > > > > > > IMHO we should spell it out that this patchset complements MADV_WONTNEED
> > > > > > > > and MADV_FREE by adding non-destructive ways to gain some free memory
> > > > > > > > space. MADV_COLD is similar to MADV_WONTNEED in a way that it hints the
> > > > > > > > kernel that memory region is not currently needed and should be reclaimed
> > > > > > > > immediately; MADV_COOL is similar to MADV_FREE in a way that it hints the
> > > > > > > > kernel that memory region is not currently needed and should be reclaimed
> > > > > > > > when memory pressure rises.
> > > > > > > >
> > > > > > > > To achieve the goal, the patchset introduce two new options for madvise.
> > > > > > > > One is MADV_COOL which will deactive activated pages and the other is
> > > > > > > > MADV_COLD which will reclaim private pages instantly. These new options
> > > > > > > > complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to
> > > > > > > > gain some free memory space. MADV_COLD is similar to MADV_DONTNEED in a way
> > > > > > > > that it hints the kernel that memory region is not currently needed and
> > > > > > > > should be reclaimed immediately; MADV_COOL is similar to MADV_FREE in a way
> > > > > > > > that it hints the kernel that memory region is not currently needed and
> > > > > > > > should be reclaimed when memory pressure rises.
> > > > > > > >
> > > > > > > > This approach is similar in spirit to madvise(MADV_WONTNEED), but the
> > > > > > > > information required to make the reclaim decision is not known to the app.
> > > > > > > > Instead, it is known to a centralized userspace daemon, and that daemon
> > > > > > > > must be able to initiate reclaim on its own without any app involvement.
> > > > > > > > To solve the concern, this patch introduces new syscall -
> > > > > > > >
> > > > > > > >         struct pr_madvise_param {
> > > > > > > >                 int size;
> > > > > > > >                 const struct iovec *vec;
> > > > > > > >         }
> > > > > > > >
> > > > > > > >         int process_madvise(int pidfd, ssize_t nr_elem, int *behavior,
> > > > > > > >                                 struct pr_madvise_param *restuls,
> > > > > > > >                                 struct pr_madvise_param *ranges,
> > > > > > > >                                 unsigned long flags);
> > > > > > > >
> > > > > > > > The syscall get pidfd to give hints to external process and provides
> > > > > > > > pair of result/ranges vector arguments so that it could give several
> > > > > > > > hints to each address range all at once.
> > > > > > > >
> > > > > > > > I guess others have different ideas about the naming of syscall and options
> > > > > > > > so feel free to suggest better naming.
> > > > > > >
> > > > > > > Yes, all new syscalls making use of pidfds should be named
> > > > > > > pidfd_<action>. So please make this pidfd_madvise.
> > > > > >
> > > > > > I don't have any particular preference but just wondering why pidfd is
> > > > > > so special to have it as prefix of system call name.
> > > > >
> > > > > It's a whole new API to address processes. We already have
> > > > > clone(CLONE_PIDFD) and pidfd_send_signal() as you have seen since you
> > > > > exported pidfd_to_pid(). And we're going to have pidfd_open(). Your
> > > > > syscall works only with pidfds so it's tied to this api as well so it
> > > > > should follow the naming scheme. This also makes life easier for
> > > > > userspace and is consistent.
> > > >
> > > > This is at least my reasoning. I'm not going to make this a whole big
> > > > pedantic argument. If people have really strong feelings about not using
> > > > this prefix then fine. But if syscalls can be grouped together and have
> > > > consistent naming this is always a big plus.
> > >
> > > My hope has been that pidfd use becomes normalized enough that
> > > prefixing "pidfd_" to pidfd-accepting system calls becomes redundant.
> > > We write write(), not fd_write(), right? :-) pidfd_open() makes sense
> > > because the primary purpose of this system call is to operate on a
> > > pidfd, but I think process_madvise() is fine.
> >
> > This madvise syscall just operates on pidfds. It would make sense to
> > name it process_madvise() if were to operate both on pid_t and int pidfd.
> 
> The name of the function ought to encode its purpose, not its
> signature. The system call under discussion operates on processes and
> so should be called "process_madvise". That this system call happens
> to accept a pidfd to identify the process on which it operates is not
> its most interesting aspect of the system call. The argument type
> isn't important enough to spotlight in the permanent name of an API.
> Pidfds are novel now, but they won't be novel in the future.

I'm not going to go into yet another long argument. I prefer pidfd_*.
It's tied to the api, transparent for userspace, and disambiguates it
from process_vm_{read,write}v that both take a pid_t.


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 0/7] introduce memory hinting API for external process
  2019-05-22 14:52               ` Christian Brauner
@ 2019-05-22 15:17                 ` Daniel Colascione
  2019-05-22 15:48                   ` Christian Brauner
  0 siblings, 1 reply; 138+ messages in thread
From: Daniel Colascione @ 2019-05-22 15:17 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Michal Hocko,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Shakeel Butt, Sonny Rao, Brian Geffon, Jann Horn

On Wed, May 22, 2019 at 7:52 AM Christian Brauner <christian@brauner.io> wrote:
> I'm not going to go into yet another long argument. I prefer pidfd_*.

Ok. We're each allowed our opinion.

> It's tied to the api, transparent for userspace, and disambiguates it
> from process_vm_{read,write}v that both take a pid_t.

Speaking of process_vm_readv and process_vm_writev: both have a
currently-unused flags argument. Both should grow a flag that tells
them to interpret the pid argument as a pidfd. Or do you support
adding pidfd_vm_readv and pidfd_vm_writev system calls? If not, why
should process_madvise be called pidfd_madvise while process_vm_readv
isn't called pidfd_vm_readv?


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 0/7] introduce memory hinting API for external process
  2019-05-22 15:17                 ` Daniel Colascione
@ 2019-05-22 15:48                   ` Christian Brauner
  2019-05-22 15:57                     ` Daniel Colascione
  0 siblings, 1 reply; 138+ messages in thread
From: Christian Brauner @ 2019-05-22 15:48 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Michal Hocko,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Shakeel Butt, Sonny Rao, Brian Geffon, Jann Horn

On Wed, May 22, 2019 at 08:17:23AM -0700, Daniel Colascione wrote:
> On Wed, May 22, 2019 at 7:52 AM Christian Brauner <christian@brauner.io> wrote:
> > I'm not going to go into yet another long argument. I prefer pidfd_*.
> 
> Ok. We're each allowed our opinion.
> 
> > It's tied to the api, transparent for userspace, and disambiguates it
> > from process_vm_{read,write}v that both take a pid_t.
> 
> Speaking of process_vm_readv and process_vm_writev: both have a
> currently-unused flags argument. Both should grow a flag that tells
> them to interpret the pid argument as a pidfd. Or do you support
> adding pidfd_vm_readv and pidfd_vm_writev system calls? If not, why
> should process_madvise be called pidfd_madvise while process_vm_readv
> isn't called pidfd_vm_readv?

Actually, you should then do the same with process_madvise() and give it
a flag for that too if that's not too crazy.

Christian


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 0/7] introduce memory hinting API for external process
  2019-05-22 15:48                   ` Christian Brauner
@ 2019-05-22 15:57                     ` Daniel Colascione
  2019-05-22 16:01                       ` Christian Brauner
  0 siblings, 1 reply; 138+ messages in thread
From: Daniel Colascione @ 2019-05-22 15:57 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Michal Hocko,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Shakeel Butt, Sonny Rao, Brian Geffon, Jann Horn

On Wed, May 22, 2019 at 8:48 AM Christian Brauner <christian@brauner.io> wrote:
>
> On Wed, May 22, 2019 at 08:17:23AM -0700, Daniel Colascione wrote:
> > On Wed, May 22, 2019 at 7:52 AM Christian Brauner <christian@brauner.io> wrote:
> > > I'm not going to go into yet another long argument. I prefer pidfd_*.
> >
> > Ok. We're each allowed our opinion.
> >
> > > It's tied to the api, transparent for userspace, and disambiguates it
> > > from process_vm_{read,write}v that both take a pid_t.
> >
> > Speaking of process_vm_readv and process_vm_writev: both have a
> > currently-unused flags argument. Both should grow a flag that tells
> > them to interpret the pid argument as a pidfd. Or do you support
> > adding pidfd_vm_readv and pidfd_vm_writev system calls? If not, why
> > should process_madvise be called pidfd_madvise while process_vm_readv
> > isn't called pidfd_vm_readv?
>
> Actually, you should then do the same with process_madvise() and give it
> a flag for that too if that's not too crazy.

I don't know what you mean. My gut feeling is that for the sake of
consistency, process_madvise, process_vm_readv, and process_vm_writev
should all accept a first argument interpreted as either a numeric PID
or a pidfd depending on a flag --- ideally the same flag. Is that what
you have in mind?


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 0/7] introduce memory hinting API for external process
  2019-05-22 15:57                     ` Daniel Colascione
@ 2019-05-22 16:01                       ` Christian Brauner
  2019-05-22 16:01                         ` Daniel Colascione
  0 siblings, 1 reply; 138+ messages in thread
From: Christian Brauner @ 2019-05-22 16:01 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Michal Hocko,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Shakeel Butt, Sonny Rao, Brian Geffon, Jann Horn

On Wed, May 22, 2019 at 08:57:47AM -0700, Daniel Colascione wrote:
> On Wed, May 22, 2019 at 8:48 AM Christian Brauner <christian@brauner.io> wrote:
> >
> > On Wed, May 22, 2019 at 08:17:23AM -0700, Daniel Colascione wrote:
> > > On Wed, May 22, 2019 at 7:52 AM Christian Brauner <christian@brauner.io> wrote:
> > > > I'm not going to go into yet another long argument. I prefer pidfd_*.
> > >
> > > Ok. We're each allowed our opinion.
> > >
> > > > It's tied to the api, transparent for userspace, and disambiguates it
> > > > from process_vm_{read,write}v that both take a pid_t.
> > >
> > > Speaking of process_vm_readv and process_vm_writev: both have a
> > > currently-unused flags argument. Both should grow a flag that tells
> > > them to interpret the pid argument as a pidfd. Or do you support
> > > adding pidfd_vm_readv and pidfd_vm_writev system calls? If not, why
> > > should process_madvise be called pidfd_madvise while process_vm_readv
> > > isn't called pidfd_vm_readv?
> >
> > Actually, you should then do the same with process_madvise() and give it
> > a flag for that too if that's not too crazy.
> 
> I don't know what you mean. My gut feeling is that for the sake of
> consistency, process_madvise, process_vm_readv, and process_vm_writev
> should all accept a first argument interpreted as either a numeric PID
> or a pidfd depending on a flag --- ideally the same flag. Is that what
> you have in mind?

Yes. For the sake of consistency they should probably all default to
interpret as pid and if say PROCESS_{VM_}PIDFD is passed as flag
interpret as pidfd.


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 0/7] introduce memory hinting API for external process
  2019-05-22 16:01                       ` Christian Brauner
@ 2019-05-22 16:01                         ` Daniel Colascione
  2019-05-23 13:07                           ` Minchan Kim
  0 siblings, 1 reply; 138+ messages in thread
From: Daniel Colascione @ 2019-05-22 16:01 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Michal Hocko,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Shakeel Butt, Sonny Rao, Brian Geffon, Jann Horn

On Wed, May 22, 2019 at 9:01 AM Christian Brauner <christian@brauner.io> wrote:
>
> On Wed, May 22, 2019 at 08:57:47AM -0700, Daniel Colascione wrote:
> > On Wed, May 22, 2019 at 8:48 AM Christian Brauner <christian@brauner.io> wrote:
> > >
> > > On Wed, May 22, 2019 at 08:17:23AM -0700, Daniel Colascione wrote:
> > > > On Wed, May 22, 2019 at 7:52 AM Christian Brauner <christian@brauner.io> wrote:
> > > > > I'm not going to go into yet another long argument. I prefer pidfd_*.
> > > >
> > > > Ok. We're each allowed our opinion.
> > > >
> > > > > It's tied to the api, transparent for userspace, and disambiguates it
> > > > > from process_vm_{read,write}v that both take a pid_t.
> > > >
> > > > Speaking of process_vm_readv and process_vm_writev: both have a
> > > > currently-unused flags argument. Both should grow a flag that tells
> > > > them to interpret the pid argument as a pidfd. Or do you support
> > > > adding pidfd_vm_readv and pidfd_vm_writev system calls? If not, why
> > > > should process_madvise be called pidfd_madvise while process_vm_readv
> > > > isn't called pidfd_vm_readv?
> > >
> > > Actually, you should then do the same with process_madvise() and give it
> > > a flag for that too if that's not too crazy.
> >
> > I don't know what you mean. My gut feeling is that for the sake of
> > consistency, process_madvise, process_vm_readv, and process_vm_writev
> > should all accept a first argument interpreted as either a numeric PID
> > or a pidfd depending on a flag --- ideally the same flag. Is that what
> > you have in mind?
>
> Yes. For the sake of consistency they should probably all default to
> interpret as pid and if say PROCESS_{VM_}PIDFD is passed as flag
> interpret as pidfd.

Sounds good to me!


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 0/7] introduce memory hinting API for external process
  2019-05-22 16:01                         ` Daniel Colascione
@ 2019-05-23 13:07                           ` Minchan Kim
  2019-05-27  8:06                             ` Minchan Kim
  0 siblings, 1 reply; 138+ messages in thread
From: Minchan Kim @ 2019-05-23 13:07 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Christian Brauner, Andrew Morton, LKML, linux-mm, Michal Hocko,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Shakeel Butt, Sonny Rao, Brian Geffon, Jann Horn

On Wed, May 22, 2019 at 09:01:33AM -0700, Daniel Colascione wrote:
> On Wed, May 22, 2019 at 9:01 AM Christian Brauner <christian@brauner.io> wrote:
> >
> > On Wed, May 22, 2019 at 08:57:47AM -0700, Daniel Colascione wrote:
> > > On Wed, May 22, 2019 at 8:48 AM Christian Brauner <christian@brauner.io> wrote:
> > > >
> > > > On Wed, May 22, 2019 at 08:17:23AM -0700, Daniel Colascione wrote:
> > > > > On Wed, May 22, 2019 at 7:52 AM Christian Brauner <christian@brauner.io> wrote:
> > > > > > I'm not going to go into yet another long argument. I prefer pidfd_*.
> > > > >
> > > > > Ok. We're each allowed our opinion.
> > > > >
> > > > > > It's tied to the api, transparent for userspace, and disambiguates it
> > > > > > from process_vm_{read,write}v that both take a pid_t.
> > > > >
> > > > > Speaking of process_vm_readv and process_vm_writev: both have a
> > > > > currently-unused flags argument. Both should grow a flag that tells
> > > > > them to interpret the pid argument as a pidfd. Or do you support
> > > > > adding pidfd_vm_readv and pidfd_vm_writev system calls? If not, why
> > > > > should process_madvise be called pidfd_madvise while process_vm_readv
> > > > > isn't called pidfd_vm_readv?
> > > >
> > > > Actually, you should then do the same with process_madvise() and give it
> > > > a flag for that too if that's not too crazy.
> > >
> > > I don't know what you mean. My gut feeling is that for the sake of
> > > consistency, process_madvise, process_vm_readv, and process_vm_writev
> > > should all accept a first argument interpreted as either a numeric PID
> > > or a pidfd depending on a flag --- ideally the same flag. Is that what
> > > you have in mind?
> >
> > Yes. For the sake of consistency they should probably all default to
> > interpret as pid and if say PROCESS_{VM_}PIDFD is passed as flag
> > interpret as pidfd.
> 
> Sounds good to me!

Then, I want to change from pidfd to pid at next revsion and stick to
process_madvise as naming. Later, you guys could define PROCESS_PIDFD
flag and change all at once every process_xxx syscall friends.

If you are faster so that I see PROCESS_PIDFD earlier, I am happy to
use it.

Thanks.


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 5/7] mm: introduce external memory hinting API
  2019-05-21 15:31   ` Oleg Nesterov
@ 2019-05-27  7:43     ` Minchan Kim
  2019-05-27 15:12       ` Oleg Nesterov
  0 siblings, 1 reply; 138+ messages in thread
From: Minchan Kim @ 2019-05-27  7:43 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, LKML, linux-mm, Michal Hocko, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon

Sorry for the late response. I miseed your comment. :(

On Tue, May 21, 2019 at 05:31:13PM +0200, Oleg Nesterov wrote:
> On 05/20, Minchan Kim wrote:
> >
> > +	rcu_read_lock();
> > +	tsk = pid_task(pid, PIDTYPE_PID);
> > +	if (!tsk) {
> > +		rcu_read_unlock();
> > +		goto err;
> > +	}
> > +	get_task_struct(tsk);
> > +	rcu_read_unlock();
> > +	mm = mm_access(tsk, PTRACE_MODE_ATTACH_REALCREDS);
> > +	if (!mm || IS_ERR(mm)) {
> > +		ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
> > +		if (ret == -EACCES)
> > +			ret = -EPERM;
> > +		goto err;
> > +	}
> > +	ret = madvise_core(tsk, start, len_in, behavior);
> 
> IIUC, madvise_core(tsk) plays with tsk->mm->mmap_sem. But this tsk can exit and
> nullify its ->mm right after mm_access() succeeds.

You're absolutely right. I will fix it via passing mm_struct instead of
task_struct.

Thanks!

> 
> another problem is that pid_task(pid) can return a zombie leader, in this case
> mm_access() will fail while it shouldn't.

I'm sorry. I didn't notice that. However, I couldn't understand your point. 
Why do you think mm_access shouldn't fail even though pid_task returns
a zombie leader? I thought it's okay since the target process is exiting
so hinting operation would be meaniness for the process.


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector arrary
  2019-05-21 10:37           ` Michal Hocko
@ 2019-05-27  7:49             ` Minchan Kim
  2019-05-29 10:08               ` Daniel Colascione
  0 siblings, 1 reply; 138+ messages in thread
From: Minchan Kim @ 2019-05-27  7:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Tue, May 21, 2019 at 12:37:26PM +0200, Michal Hocko wrote:
> On Tue 21-05-19 19:26:13, Minchan Kim wrote:
> > On Tue, May 21, 2019 at 08:24:21AM +0200, Michal Hocko wrote:
> > > On Tue 21-05-19 11:48:20, Minchan Kim wrote:
> > > > On Mon, May 20, 2019 at 11:22:58AM +0200, Michal Hocko wrote:
> > > > > [Cc linux-api]
> > > > > 
> > > > > On Mon 20-05-19 12:52:53, Minchan Kim wrote:
> > > > > > Currently, process_madvise syscall works for only one address range
> > > > > > so user should call the syscall several times to give hints to
> > > > > > multiple address range.
> > > > > 
> > > > > Is that a problem? How big of a problem? Any numbers?
> > > > 
> > > > We easily have 2000+ vma so it's not trivial overhead. I will come up
> > > > with number in the description at respin.
> > > 
> > > Does this really have to be a fast operation? I would expect the monitor
> > > is by no means a fast path. The system call overhead is not what it used
> > > to be, sigh, but still for something that is not a hot path it should be
> > > tolerable, especially when the whole operation is quite expensive on its
> > > own (wrt. the syscall entry/exit).
> > 
> > What's different with process_vm_[readv|writev] and vmsplice?
> > If the range needed to be covered is a lot, vector operation makes senese
> > to me.
> 
> I am not saying that the vector API is wrong. All I am trying to say is
> that the benefit is not really clear so far. If you want to push it
> through then you should better get some supporting data.

I measured 1000 madvise syscall vs. a vector range syscall with 1000
ranges on ARM64 mordern device. Even though I saw 15% improvement but
absoluate gain is just 1ms so I don't think it's worth to support.
I will drop vector support at next revision.

Thanks for the review, Michal!


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-21  6:26       ` Michal Hocko
@ 2019-05-27  7:58         ` Minchan Kim
  2019-05-27 12:44           ` Michal Hocko
  0 siblings, 1 reply; 138+ messages in thread
From: Minchan Kim @ 2019-05-27  7:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Tue, May 21, 2019 at 08:26:28AM +0200, Michal Hocko wrote:
> On Tue 21-05-19 11:55:33, Minchan Kim wrote:
> > On Mon, May 20, 2019 at 11:28:01AM +0200, Michal Hocko wrote:
> > > [cc linux-api]
> > > 
> > > On Mon 20-05-19 12:52:54, Minchan Kim wrote:
> > > > System could have much faster swap device like zRAM. In that case, swapping
> > > > is extremely cheaper than file-IO on the low-end storage.
> > > > In this configuration, userspace could handle different strategy for each
> > > > kinds of vma. IOW, they want to reclaim anonymous pages by MADV_COLD
> > > > while it keeps file-backed pages in inactive LRU by MADV_COOL because
> > > > file IO is more expensive in this case so want to keep them in memory
> > > > until memory pressure happens.
> > > > 
> > > > To support such strategy easier, this patch introduces
> > > > MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER options in madvise(2) like
> > > > that /proc/<pid>/clear_refs already has supported same filters.
> > > > They are filters could be Ored with other existing hints using top two bits
> > > > of (int behavior).
> > > 
> > > madvise operates on top of ranges and it is quite trivial to do the
> > > filtering from the userspace so why do we need any additional filtering?
> > > 
> > > > Once either of them is set, the hint could affect only the interested vma
> > > > either anonymous or file-backed.
> > > > 
> > > > With that, user could call a process_madvise syscall simply with a entire
> > > > range(0x0 - 0xFFFFFFFFFFFFFFFF) but either of MADV_ANONYMOUS_FILTER and
> > > > MADV_FILE_FILTER so there is no need to call the syscall range by range.
> > > 
> > > OK, so here is the reason you want that. The immediate question is why
> > > cannot the monitor do the filtering from the userspace. Slightly more
> > > work, all right, but less of an API to expose and that itself is a
> > > strong argument against.
> > 
> > What I should do if we don't have such filter option is to enumerate all of
> > vma via /proc/<pid>/maps and then parse every ranges and inode from string,
> > which would be painful for 2000+ vmas.
> 
> Painful is not an argument to add a new user API. If the existing API
> suits the purpose then it should be used. If it is not usable, we can
> think of a different way.

I measured 1568 vma parsing overhead of /proc/<pid>/maps in ARM64 modern
mobile CPU. It takes 60ms and 185ms on big cores depending on cpu governor.
It's never trivial.


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 0/7] introduce memory hinting API for external process
  2019-05-23 13:07                           ` Minchan Kim
@ 2019-05-27  8:06                             ` Minchan Kim
  0 siblings, 0 replies; 138+ messages in thread
From: Minchan Kim @ 2019-05-27  8:06 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Christian Brauner, Andrew Morton, LKML, linux-mm, Michal Hocko,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Shakeel Butt, Sonny Rao, Brian Geffon, Jann Horn

On Thu, May 23, 2019 at 10:07:17PM +0900, Minchan Kim wrote:
> On Wed, May 22, 2019 at 09:01:33AM -0700, Daniel Colascione wrote:
> > On Wed, May 22, 2019 at 9:01 AM Christian Brauner <christian@brauner.io> wrote:
> > >
> > > On Wed, May 22, 2019 at 08:57:47AM -0700, Daniel Colascione wrote:
> > > > On Wed, May 22, 2019 at 8:48 AM Christian Brauner <christian@brauner.io> wrote:
> > > > >
> > > > > On Wed, May 22, 2019 at 08:17:23AM -0700, Daniel Colascione wrote:
> > > > > > On Wed, May 22, 2019 at 7:52 AM Christian Brauner <christian@brauner.io> wrote:
> > > > > > > I'm not going to go into yet another long argument. I prefer pidfd_*.
> > > > > >
> > > > > > Ok. We're each allowed our opinion.
> > > > > >
> > > > > > > It's tied to the api, transparent for userspace, and disambiguates it
> > > > > > > from process_vm_{read,write}v that both take a pid_t.
> > > > > >
> > > > > > Speaking of process_vm_readv and process_vm_writev: both have a
> > > > > > currently-unused flags argument. Both should grow a flag that tells
> > > > > > them to interpret the pid argument as a pidfd. Or do you support
> > > > > > adding pidfd_vm_readv and pidfd_vm_writev system calls? If not, why
> > > > > > should process_madvise be called pidfd_madvise while process_vm_readv
> > > > > > isn't called pidfd_vm_readv?
> > > > >
> > > > > Actually, you should then do the same with process_madvise() and give it
> > > > > a flag for that too if that's not too crazy.
> > > >
> > > > I don't know what you mean. My gut feeling is that for the sake of
> > > > consistency, process_madvise, process_vm_readv, and process_vm_writev
> > > > should all accept a first argument interpreted as either a numeric PID
> > > > or a pidfd depending on a flag --- ideally the same flag. Is that what
> > > > you have in mind?
> > >
> > > Yes. For the sake of consistency they should probably all default to
> > > interpret as pid and if say PROCESS_{VM_}PIDFD is passed as flag
> > > interpret as pidfd.
> > 
> > Sounds good to me!
> 
> Then, I want to change from pidfd to pid at next revsion and stick to
> process_madvise as naming. Later, you guys could define PROCESS_PIDFD
> flag and change all at once every process_xxx syscall friends.
> 
> If you are faster so that I see PROCESS_PIDFD earlier, I am happy to
> use it.

Hi Folks,

I don't want to consume a new API argument too early so want to say
I will use process_madvise with pidfs argument because I agree with
Daniel that we don't need to export implmentation on the syscall name.

I hope every upcoming new syscall with process has by default pidfs
so people are familiar with pidfd slowly so finallly they forgot pid
in the long run so naturally replace pid with pidfs.


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-27  7:58         ` Minchan Kim
@ 2019-05-27 12:44           ` Michal Hocko
  2019-05-28  3:26             ` Minchan Kim
  0 siblings, 1 reply; 138+ messages in thread
From: Michal Hocko @ 2019-05-27 12:44 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Mon 27-05-19 16:58:11, Minchan Kim wrote:
> On Tue, May 21, 2019 at 08:26:28AM +0200, Michal Hocko wrote:
> > On Tue 21-05-19 11:55:33, Minchan Kim wrote:
> > > On Mon, May 20, 2019 at 11:28:01AM +0200, Michal Hocko wrote:
> > > > [cc linux-api]
> > > > 
> > > > On Mon 20-05-19 12:52:54, Minchan Kim wrote:
> > > > > System could have much faster swap device like zRAM. In that case, swapping
> > > > > is extremely cheaper than file-IO on the low-end storage.
> > > > > In this configuration, userspace could handle different strategy for each
> > > > > kinds of vma. IOW, they want to reclaim anonymous pages by MADV_COLD
> > > > > while it keeps file-backed pages in inactive LRU by MADV_COOL because
> > > > > file IO is more expensive in this case so want to keep them in memory
> > > > > until memory pressure happens.
> > > > > 
> > > > > To support such strategy easier, this patch introduces
> > > > > MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER options in madvise(2) like
> > > > > that /proc/<pid>/clear_refs already has supported same filters.
> > > > > They are filters could be Ored with other existing hints using top two bits
> > > > > of (int behavior).
> > > > 
> > > > madvise operates on top of ranges and it is quite trivial to do the
> > > > filtering from the userspace so why do we need any additional filtering?
> > > > 
> > > > > Once either of them is set, the hint could affect only the interested vma
> > > > > either anonymous or file-backed.
> > > > > 
> > > > > With that, user could call a process_madvise syscall simply with a entire
> > > > > range(0x0 - 0xFFFFFFFFFFFFFFFF) but either of MADV_ANONYMOUS_FILTER and
> > > > > MADV_FILE_FILTER so there is no need to call the syscall range by range.
> > > > 
> > > > OK, so here is the reason you want that. The immediate question is why
> > > > cannot the monitor do the filtering from the userspace. Slightly more
> > > > work, all right, but less of an API to expose and that itself is a
> > > > strong argument against.
> > > 
> > > What I should do if we don't have such filter option is to enumerate all of
> > > vma via /proc/<pid>/maps and then parse every ranges and inode from string,
> > > which would be painful for 2000+ vmas.
> > 
> > Painful is not an argument to add a new user API. If the existing API
> > suits the purpose then it should be used. If it is not usable, we can
> > think of a different way.
> 
> I measured 1568 vma parsing overhead of /proc/<pid>/maps in ARM64 modern
> mobile CPU. It takes 60ms and 185ms on big cores depending on cpu governor.
> It's never trivial.

This is not the only option. Have you tried to simply use
/proc/<pid>/map_files interface? This will provide you with all the file
backed mappings.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 5/7] mm: introduce external memory hinting API
  2019-05-27  7:43     ` Minchan Kim
@ 2019-05-27 15:12       ` Oleg Nesterov
  2019-05-27 23:33         ` Minchan Kim
  0 siblings, 1 reply; 138+ messages in thread
From: Oleg Nesterov @ 2019-05-27 15:12 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Michal Hocko, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon

On 05/27, Minchan Kim wrote:
>
> > another problem is that pid_task(pid) can return a zombie leader, in this case
> > mm_access() will fail while it shouldn't.
>
> I'm sorry. I didn't notice that. However, I couldn't understand your point.
> Why do you think mm_access shouldn't fail even though pid_task returns
> a zombie leader?

The leader can exit (call sys_exit(), not sys_exit_group()), this won't affect
other threads. In this case the process is still alive even if the leader thread
is zombie. That is why we have find_lock_task_mm().

Oleg.


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 5/7] mm: introduce external memory hinting API
  2019-05-27 15:12       ` Oleg Nesterov
@ 2019-05-27 23:33         ` Minchan Kim
  2019-05-28  7:23           ` Michal Hocko
  0 siblings, 1 reply; 138+ messages in thread
From: Minchan Kim @ 2019-05-27 23:33 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, LKML, linux-mm, Michal Hocko, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon

On Mon, May 27, 2019 at 05:12:02PM +0200, Oleg Nesterov wrote:
> On 05/27, Minchan Kim wrote:
> >
> > > another problem is that pid_task(pid) can return a zombie leader, in this case
> > > mm_access() will fail while it shouldn't.
> >
> > I'm sorry. I didn't notice that. However, I couldn't understand your point.
> > Why do you think mm_access shouldn't fail even though pid_task returns
> > a zombie leader?
> 
> The leader can exit (call sys_exit(), not sys_exit_group()), this won't affect
> other threads. In this case the process is still alive even if the leader thread
> is zombie. That is why we have find_lock_task_mm().

Thanks for clarification, Oleg. Then, Let me have a further question.

It means process_vm_readv, move_pages have same problem too because find_task_by_vpid
can return a zomebie leader and next line checks for mm_struct validation makes a
failure. My understand is correct? If so, we need to fix all places.


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-27 12:44           ` Michal Hocko
@ 2019-05-28  3:26             ` Minchan Kim
  2019-05-28  6:29               ` Michal Hocko
  0 siblings, 1 reply; 138+ messages in thread
From: Minchan Kim @ 2019-05-28  3:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Mon, May 27, 2019 at 02:44:11PM +0200, Michal Hocko wrote:
> On Mon 27-05-19 16:58:11, Minchan Kim wrote:
> > On Tue, May 21, 2019 at 08:26:28AM +0200, Michal Hocko wrote:
> > > On Tue 21-05-19 11:55:33, Minchan Kim wrote:
> > > > On Mon, May 20, 2019 at 11:28:01AM +0200, Michal Hocko wrote:
> > > > > [cc linux-api]
> > > > > 
> > > > > On Mon 20-05-19 12:52:54, Minchan Kim wrote:
> > > > > > System could have much faster swap device like zRAM. In that case, swapping
> > > > > > is extremely cheaper than file-IO on the low-end storage.
> > > > > > In this configuration, userspace could handle different strategy for each
> > > > > > kinds of vma. IOW, they want to reclaim anonymous pages by MADV_COLD
> > > > > > while it keeps file-backed pages in inactive LRU by MADV_COOL because
> > > > > > file IO is more expensive in this case so want to keep them in memory
> > > > > > until memory pressure happens.
> > > > > > 
> > > > > > To support such strategy easier, this patch introduces
> > > > > > MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER options in madvise(2) like
> > > > > > that /proc/<pid>/clear_refs already has supported same filters.
> > > > > > They are filters could be Ored with other existing hints using top two bits
> > > > > > of (int behavior).
> > > > > 
> > > > > madvise operates on top of ranges and it is quite trivial to do the
> > > > > filtering from the userspace so why do we need any additional filtering?
> > > > > 
> > > > > > Once either of them is set, the hint could affect only the interested vma
> > > > > > either anonymous or file-backed.
> > > > > > 
> > > > > > With that, user could call a process_madvise syscall simply with a entire
> > > > > > range(0x0 - 0xFFFFFFFFFFFFFFFF) but either of MADV_ANONYMOUS_FILTER and
> > > > > > MADV_FILE_FILTER so there is no need to call the syscall range by range.
> > > > > 
> > > > > OK, so here is the reason you want that. The immediate question is why
> > > > > cannot the monitor do the filtering from the userspace. Slightly more
> > > > > work, all right, but less of an API to expose and that itself is a
> > > > > strong argument against.
> > > > 
> > > > What I should do if we don't have such filter option is to enumerate all of
> > > > vma via /proc/<pid>/maps and then parse every ranges and inode from string,
> > > > which would be painful for 2000+ vmas.
> > > 
> > > Painful is not an argument to add a new user API. If the existing API
> > > suits the purpose then it should be used. If it is not usable, we can
> > > think of a different way.
> > 
> > I measured 1568 vma parsing overhead of /proc/<pid>/maps in ARM64 modern
> > mobile CPU. It takes 60ms and 185ms on big cores depending on cpu governor.
> > It's never trivial.
> 
> This is not the only option. Have you tried to simply use
> /proc/<pid>/map_files interface? This will provide you with all the file
> backed mappings.

I compared maps vs. map_files with 3036 file-backed vma.
Test scenario is to dump all of vmas of the process and parse address
ranges.
For map_files, it's easy to parse each address range because directory name
itself is range. However, in case of maps, I need to parse each range
line by line so need to scan all of lines.

(maps cover additional non-file-backed vmas so nr_vma is a little bigger)

performance mode:
map_files: nr_vma 3036 usec 13387
maps     : nr_vma 3078 usec 12923

powersave mode:

map_files: nr_vma 3036 usec 52614
maps     : nr_vma 3078 usec 41089

map_files is slower than maps if we dump all of vmas. I guess directory
operation needs much more jobs(e.g., dentry lookup, instantiation)
compared to maps.


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28  3:26             ` Minchan Kim
@ 2019-05-28  6:29               ` Michal Hocko
  2019-05-28  8:13                 ` Minchan Kim
  0 siblings, 1 reply; 138+ messages in thread
From: Michal Hocko @ 2019-05-28  6:29 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Tue 28-05-19 12:26:32, Minchan Kim wrote:
> On Mon, May 27, 2019 at 02:44:11PM +0200, Michal Hocko wrote:
> > On Mon 27-05-19 16:58:11, Minchan Kim wrote:
> > > On Tue, May 21, 2019 at 08:26:28AM +0200, Michal Hocko wrote:
> > > > On Tue 21-05-19 11:55:33, Minchan Kim wrote:
> > > > > On Mon, May 20, 2019 at 11:28:01AM +0200, Michal Hocko wrote:
> > > > > > [cc linux-api]
> > > > > > 
> > > > > > On Mon 20-05-19 12:52:54, Minchan Kim wrote:
> > > > > > > System could have much faster swap device like zRAM. In that case, swapping
> > > > > > > is extremely cheaper than file-IO on the low-end storage.
> > > > > > > In this configuration, userspace could handle different strategy for each
> > > > > > > kinds of vma. IOW, they want to reclaim anonymous pages by MADV_COLD
> > > > > > > while it keeps file-backed pages in inactive LRU by MADV_COOL because
> > > > > > > file IO is more expensive in this case so want to keep them in memory
> > > > > > > until memory pressure happens.
> > > > > > > 
> > > > > > > To support such strategy easier, this patch introduces
> > > > > > > MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER options in madvise(2) like
> > > > > > > that /proc/<pid>/clear_refs already has supported same filters.
> > > > > > > They are filters could be Ored with other existing hints using top two bits
> > > > > > > of (int behavior).
> > > > > > 
> > > > > > madvise operates on top of ranges and it is quite trivial to do the
> > > > > > filtering from the userspace so why do we need any additional filtering?
> > > > > > 
> > > > > > > Once either of them is set, the hint could affect only the interested vma
> > > > > > > either anonymous or file-backed.
> > > > > > > 
> > > > > > > With that, user could call a process_madvise syscall simply with a entire
> > > > > > > range(0x0 - 0xFFFFFFFFFFFFFFFF) but either of MADV_ANONYMOUS_FILTER and
> > > > > > > MADV_FILE_FILTER so there is no need to call the syscall range by range.
> > > > > > 
> > > > > > OK, so here is the reason you want that. The immediate question is why
> > > > > > cannot the monitor do the filtering from the userspace. Slightly more
> > > > > > work, all right, but less of an API to expose and that itself is a
> > > > > > strong argument against.
> > > > > 
> > > > > What I should do if we don't have such filter option is to enumerate all of
> > > > > vma via /proc/<pid>/maps and then parse every ranges and inode from string,
> > > > > which would be painful for 2000+ vmas.
> > > > 
> > > > Painful is not an argument to add a new user API. If the existing API
> > > > suits the purpose then it should be used. If it is not usable, we can
> > > > think of a different way.
> > > 
> > > I measured 1568 vma parsing overhead of /proc/<pid>/maps in ARM64 modern
> > > mobile CPU. It takes 60ms and 185ms on big cores depending on cpu governor.
> > > It's never trivial.
> > 
> > This is not the only option. Have you tried to simply use
> > /proc/<pid>/map_files interface? This will provide you with all the file
> > backed mappings.
> 
> I compared maps vs. map_files with 3036 file-backed vma.
> Test scenario is to dump all of vmas of the process and parse address
> ranges.
> For map_files, it's easy to parse each address range because directory name
> itself is range. However, in case of maps, I need to parse each range
> line by line so need to scan all of lines.
> 
> (maps cover additional non-file-backed vmas so nr_vma is a little bigger)
> 
> performance mode:
> map_files: nr_vma 3036 usec 13387
> maps     : nr_vma 3078 usec 12923
> 
> powersave mode:
> 
> map_files: nr_vma 3036 usec 52614
> maps     : nr_vma 3078 usec 41089
> 
> map_files is slower than maps if we dump all of vmas. I guess directory
> operation needs much more jobs(e.g., dentry lookup, instantiation)
> compared to maps.

OK, that is somehow surprising. I am still not convinced the filter is a
good idea though. The primary reason is that it encourages using madvise
on a wide range without having a clue what the range contains. E.g. the
full address range and rely the right thing will happen. Do we really
want madvise to operate in that mode?

Btw. if we went with the per vma fd approach then you would get this
feature automatically because map_files would refer to file backed
mappings while map_anon could refer only to anonymous mappings.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 5/7] mm: introduce external memory hinting API
  2019-05-27 23:33         ` Minchan Kim
@ 2019-05-28  7:23           ` Michal Hocko
  0 siblings, 0 replies; 138+ messages in thread
From: Michal Hocko @ 2019-05-28  7:23 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Oleg Nesterov, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon

On Tue 28-05-19 08:33:06, Minchan Kim wrote:
> On Mon, May 27, 2019 at 05:12:02PM +0200, Oleg Nesterov wrote:
> > On 05/27, Minchan Kim wrote:
> > >
> > > > another problem is that pid_task(pid) can return a zombie leader, in this case
> > > > mm_access() will fail while it shouldn't.
> > >
> > > I'm sorry. I didn't notice that. However, I couldn't understand your point.
> > > Why do you think mm_access shouldn't fail even though pid_task returns
> > > a zombie leader?
> > 
> > The leader can exit (call sys_exit(), not sys_exit_group()), this won't affect
> > other threads. In this case the process is still alive even if the leader thread
> > is zombie. That is why we have find_lock_task_mm().
> 
> Thanks for clarification, Oleg. Then, Let me have a further question.
> 
> It means process_vm_readv, move_pages have same problem too because find_task_by_vpid
> can return a zomebie leader and next line checks for mm_struct validation makes a
> failure. My understand is correct? If so, we need to fix all places.

Isn't that a problem of most callers of get_task_mm? Shouldn't we fix it
turning it into find_lock_task_mm?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28  6:29               ` Michal Hocko
@ 2019-05-28  8:13                 ` Minchan Kim
  2019-05-28  8:31                   ` Daniel Colascione
  0 siblings, 1 reply; 138+ messages in thread
From: Minchan Kim @ 2019-05-28  8:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

On Tue, May 28, 2019 at 08:29:47AM +0200, Michal Hocko wrote:
> On Tue 28-05-19 12:26:32, Minchan Kim wrote:
> > On Mon, May 27, 2019 at 02:44:11PM +0200, Michal Hocko wrote:
> > > On Mon 27-05-19 16:58:11, Minchan Kim wrote:
> > > > On Tue, May 21, 2019 at 08:26:28AM +0200, Michal Hocko wrote:
> > > > > On Tue 21-05-19 11:55:33, Minchan Kim wrote:
> > > > > > On Mon, May 20, 2019 at 11:28:01AM +0200, Michal Hocko wrote:
> > > > > > > [cc linux-api]
> > > > > > > 
> > > > > > > On Mon 20-05-19 12:52:54, Minchan Kim wrote:
> > > > > > > > System could have much faster swap device like zRAM. In that case, swapping
> > > > > > > > is extremely cheaper than file-IO on the low-end storage.
> > > > > > > > In this configuration, userspace could handle different strategy for each
> > > > > > > > kinds of vma. IOW, they want to reclaim anonymous pages by MADV_COLD
> > > > > > > > while it keeps file-backed pages in inactive LRU by MADV_COOL because
> > > > > > > > file IO is more expensive in this case so want to keep them in memory
> > > > > > > > until memory pressure happens.
> > > > > > > > 
> > > > > > > > To support such strategy easier, this patch introduces
> > > > > > > > MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER options in madvise(2) like
> > > > > > > > that /proc/<pid>/clear_refs already has supported same filters.
> > > > > > > > They are filters could be Ored with other existing hints using top two bits
> > > > > > > > of (int behavior).
> > > > > > > 
> > > > > > > madvise operates on top of ranges and it is quite trivial to do the
> > > > > > > filtering from the userspace so why do we need any additional filtering?
> > > > > > > 
> > > > > > > > Once either of them is set, the hint could affect only the interested vma
> > > > > > > > either anonymous or file-backed.
> > > > > > > > 
> > > > > > > > With that, user could call a process_madvise syscall simply with a entire
> > > > > > > > range(0x0 - 0xFFFFFFFFFFFFFFFF) but either of MADV_ANONYMOUS_FILTER and
> > > > > > > > MADV_FILE_FILTER so there is no need to call the syscall range by range.
> > > > > > > 
> > > > > > > OK, so here is the reason you want that. The immediate question is why
> > > > > > > cannot the monitor do the filtering from the userspace. Slightly more
> > > > > > > work, all right, but less of an API to expose and that itself is a
> > > > > > > strong argument against.
> > > > > > 
> > > > > > What I should do if we don't have such filter option is to enumerate all of
> > > > > > vma via /proc/<pid>/maps and then parse every ranges and inode from string,
> > > > > > which would be painful for 2000+ vmas.
> > > > > 
> > > > > Painful is not an argument to add a new user API. If the existing API
> > > > > suits the purpose then it should be used. If it is not usable, we can
> > > > > think of a different way.
> > > > 
> > > > I measured 1568 vma parsing overhead of /proc/<pid>/maps in ARM64 modern
> > > > mobile CPU. It takes 60ms and 185ms on big cores depending on cpu governor.
> > > > It's never trivial.
> > > 
> > > This is not the only option. Have you tried to simply use
> > > /proc/<pid>/map_files interface? This will provide you with all the file
> > > backed mappings.
> > 
> > I compared maps vs. map_files with 3036 file-backed vma.
> > Test scenario is to dump all of vmas of the process and parse address
> > ranges.
> > For map_files, it's easy to parse each address range because directory name
> > itself is range. However, in case of maps, I need to parse each range
> > line by line so need to scan all of lines.
> > 
> > (maps cover additional non-file-backed vmas so nr_vma is a little bigger)
> > 
> > performance mode:
> > map_files: nr_vma 3036 usec 13387
> > maps     : nr_vma 3078 usec 12923
> > 
> > powersave mode:
> > 
> > map_files: nr_vma 3036 usec 52614
> > maps     : nr_vma 3078 usec 41089
> > 
> > map_files is slower than maps if we dump all of vmas. I guess directory
> > operation needs much more jobs(e.g., dentry lookup, instantiation)
> > compared to maps.
> 
> OK, that is somehow surprising. I am still not convinced the filter is a
> good idea though. The primary reason is that it encourages using madvise
> on a wide range without having a clue what the range contains. E.g. the
> full address range and rely the right thing will happen. Do we really
> want madvise to operate in that mode?

If user space daemon(e.g., activity manager service) could know a certain
process is bakground and idle for a while, yeb, that would be good option.

> 
> Btw. if we went with the per vma fd approach then you would get this
> feature automatically because map_files would refer to file backed
> mappings while map_anon could refer only to anonymous mappings.

The reason to add such filter option is to avoid the parsing overhead
so map_anon wouldn't be helpful.


> 
> -- 
> Michal Hocko
> SUSE Labs


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28  8:13                 ` Minchan Kim
@ 2019-05-28  8:31                   ` Daniel Colascione
  2019-05-28  8:49                     ` Minchan Kim
  0 siblings, 1 reply; 138+ messages in thread
From: Daniel Colascione @ 2019-05-28  8:31 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Michal Hocko, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
	Sonny Rao, Brian Geffon, Linux API

On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote:
> if we went with the per vma fd approach then you would get this
> > feature automatically because map_files would refer to file backed
> > mappings while map_anon could refer only to anonymous mappings.
>
> The reason to add such filter option is to avoid the parsing overhead
> so map_anon wouldn't be helpful.

Without chiming on whether the filter option is a good idea, I'd like
to suggest that providing an efficient binary interfaces for pulling
memory map information out of processes.  Some single-system-call
method for retrieving a binary snapshot of a process's address space
complete with attributes (selectable, like statx?) for each VMA would
reduce complexity and increase performance in a variety of areas,
e.g., Android memory map debugging commands.


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28  8:31                   ` Daniel Colascione
@ 2019-05-28  8:49                     ` Minchan Kim
  2019-05-28  9:08                       ` Michal Hocko
  0 siblings, 1 reply; 138+ messages in thread
From: Minchan Kim @ 2019-05-28  8:49 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Michal Hocko, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
	Sonny Rao, Brian Geffon, Linux API

On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote:
> On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote:
> > if we went with the per vma fd approach then you would get this
> > > feature automatically because map_files would refer to file backed
> > > mappings while map_anon could refer only to anonymous mappings.
> >
> > The reason to add such filter option is to avoid the parsing overhead
> > so map_anon wouldn't be helpful.
> 
> Without chiming on whether the filter option is a good idea, I'd like
> to suggest that providing an efficient binary interfaces for pulling
> memory map information out of processes.  Some single-system-call
> method for retrieving a binary snapshot of a process's address space
> complete with attributes (selectable, like statx?) for each VMA would
> reduce complexity and increase performance in a variety of areas,
> e.g., Android memory map debugging commands.

I agree it's the best we can get *generally*.
Michal, any opinion?


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 1/7] mm: introduce MADV_COOL
  2019-05-20  3:52 ` [RFC 1/7] mm: introduce MADV_COOL Minchan Kim
  2019-05-20  8:16   ` Michal Hocko
@ 2019-05-28  8:53   ` Hillf Danton
  2019-05-28 10:58   ` Minchan Kim
  2 siblings, 0 replies; 138+ messages in thread
From: Hillf Danton @ 2019-05-28  8:53 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Michal Hocko, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon


On Mon, 20 May 2019 12:52:48 +0900 Minchan Kim wrote:
> +static int madvise_cool_pte_range(pmd_t *pmd, unsigned long addr,
> +				unsigned long end, struct mm_walk *walk)
> +{
> +	pte_t *orig_pte, *pte, ptent;
> +	spinlock_t *ptl;
> +	struct page *page;
> +	struct vm_area_struct *vma = walk->vma;
> +	unsigned long next;
> +
> +	next = pmd_addr_end(addr, end);
> +	if (pmd_trans_huge(*pmd)) {
> +		spinlock_t *ptl;

Seems not needed with another ptl declared above.
> +
> +		ptl = pmd_trans_huge_lock(pmd, vma);
> +		if (!ptl)
> +			return 0;
> +
> +		if (is_huge_zero_pmd(*pmd))
> +			goto huge_unlock;
> +
> +		page = pmd_page(*pmd);
> +		if (page_mapcount(page) > 1)
> +			goto huge_unlock;
> +
> +		if (next - addr != HPAGE_PMD_SIZE) {
> +			int err;

Alternately, we deactivate thp only if the address range from userspace
is sane enough, in order to avoid complex works we have to do here.
> +
> +			get_page(page);
> +			spin_unlock(ptl);
> +			lock_page(page);
> +			err = split_huge_page(page);
> +			unlock_page(page);
> +			put_page(page);
> +			if (!err)
> +				goto regular_page;
> +			return 0;
> +		}
> +
> +		pmdp_test_and_clear_young(vma, addr, pmd);
> +		deactivate_page(page);
> +huge_unlock:
> +		spin_unlock(ptl);
> +		return 0;
> +	}
> +
> +	if (pmd_trans_unstable(pmd))
> +		return 0;
> +
> +regular_page:

Take a look at pending signal?

> +	orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> +	for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {

s/end/next/ ?
> +		ptent = *pte;
> +
> +		if (pte_none(ptent))
> +			continue;
> +
> +		if (!pte_present(ptent))
> +			continue;
> +
> +		page = vm_normal_page(vma, addr, ptent);
> +		if (!page)
> +			continue;
> +
> +		if (page_mapcount(page) > 1)
> +			continue;
> +
> +		ptep_test_and_clear_young(vma, addr, pte);
> +		deactivate_page(page);
> +	}
> +
> +	pte_unmap_unlock(orig_pte, ptl);
> +	cond_resched();
> +
> +	return 0;
> +}
> +
> +static long madvise_cool(struct vm_area_struct *vma,
> +			unsigned long start_addr, unsigned long end_addr)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	struct mmu_gather tlb;
> +
> +	if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
> +		return -EINVAL;

No service in case of VM_IO?
> +
> +	lru_add_drain();
> +	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
> +	madvise_cool_page_range(&tlb, vma, start_addr, end_addr);
> +	tlb_finish_mmu(&tlb, start_addr, end_addr);
> +
> +	return 0;
> +}
> +
> +/*
> + * deactivate_page - deactivate a page
> + * @page: page to deactivate
> + *
> + * deactivate_page() moves @page to the inactive list if @page was on the active
> + * list and was not an unevictable page.  This is done to accelerate the reclaim
> + * of @page.
> + */
> +void deactivate_page(struct page *page)
> +{
> +	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
> +		struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
> +
> +		get_page(page);

A line of comment seems needed for pinning the page.

> +		if (!pagevec_add(pvec, page) || PageCompound(page))
> +			pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
> +		put_cpu_var(lru_deactivate_pvecs);
> +	}
> +}
> +

--
Hillf


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28  8:49                     ` Minchan Kim
@ 2019-05-28  9:08                       ` Michal Hocko
  2019-05-28  9:39                         ` Daniel Colascione
  2019-05-28 10:32                         ` Minchan Kim
  0 siblings, 2 replies; 138+ messages in thread
From: Michal Hocko @ 2019-05-28  9:08 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Daniel Colascione, Andrew Morton, LKML, linux-mm,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Shakeel Butt, Sonny Rao, Brian Geffon, Linux API

On Tue 28-05-19 17:49:27, Minchan Kim wrote:
> On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote:
> > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote:
> > > if we went with the per vma fd approach then you would get this
> > > > feature automatically because map_files would refer to file backed
> > > > mappings while map_anon could refer only to anonymous mappings.
> > >
> > > The reason to add such filter option is to avoid the parsing overhead
> > > so map_anon wouldn't be helpful.
> > 
> > Without chiming on whether the filter option is a good idea, I'd like
> > to suggest that providing an efficient binary interfaces for pulling
> > memory map information out of processes.  Some single-system-call
> > method for retrieving a binary snapshot of a process's address space
> > complete with attributes (selectable, like statx?) for each VMA would
> > reduce complexity and increase performance in a variety of areas,
> > e.g., Android memory map debugging commands.
> 
> I agree it's the best we can get *generally*.
> Michal, any opinion?

I am not really sure this is directly related. I think the primary
question that we have to sort out first is whether we want to have
the remote madvise call process or vma fd based. This is an important
distinction wrt. usability. I have only seen pid vs. pidfd discussions
so far unfortunately.

An interface to query address range information is a separate but
although a related topic. We have /proc/<pid>/[s]maps for that right
now and I understand it is not a general win for all usecases because
it tends to be slow for some. I can see how /proc/<pid>/map_anons could
provide per vma information in a binary form via a fd based interface.
But I would rather not conflate those two discussions much - well except
if it could give one of the approaches more justification but let's
focus on the madvise part first.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28  9:08                       ` Michal Hocko
@ 2019-05-28  9:39                         ` Daniel Colascione
  2019-05-28 10:33                           ` Michal Hocko
  2019-05-28 10:32                         ` Minchan Kim
  1 sibling, 1 reply; 138+ messages in thread
From: Daniel Colascione @ 2019-05-28  9:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
	Sonny Rao, Brian Geffon, Linux API

On Tue, May 28, 2019 at 2:08 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 28-05-19 17:49:27, Minchan Kim wrote:
> > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote:
> > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote:
> > > > if we went with the per vma fd approach then you would get this
> > > > > feature automatically because map_files would refer to file backed
> > > > > mappings while map_anon could refer only to anonymous mappings.
> > > >
> > > > The reason to add such filter option is to avoid the parsing overhead
> > > > so map_anon wouldn't be helpful.
> > >
> > > Without chiming on whether the filter option is a good idea, I'd like
> > > to suggest that providing an efficient binary interfaces for pulling
> > > memory map information out of processes.  Some single-system-call
> > > method for retrieving a binary snapshot of a process's address space
> > > complete with attributes (selectable, like statx?) for each VMA would
> > > reduce complexity and increase performance in a variety of areas,
> > > e.g., Android memory map debugging commands.
> >
> > I agree it's the best we can get *generally*.
> > Michal, any opinion?
>
> I am not really sure this is directly related. I think the primary
> question that we have to sort out first is whether we want to have
> the remote madvise call process or vma fd based. This is an important
> distinction wrt. usability. I have only seen pid vs. pidfd discussions
> so far unfortunately.

I don't think the vma fd approach is viable. We have some processes
with a *lot* of VMAs --- system_server had 4204 when I checked just
now (and that's typical) --- and an FD operation per VMA would be
excessive. VMAs also come and go pretty easily depending on changes in
protections and various faults. It's also not entirely clear what the
semantics of vma FDs should be over address space mutations, while the
semantics of address ranges are well-understood. I would much prefer
an interface operating on address ranges to one operating on VMA FDs,
both for efficiency and for consistency with other memory management
APIs.

> An interface to query address range information is a separate but
> although a related topic. We have /proc/<pid>/[s]maps for that right
> now and I understand it is not a general win for all usecases because
> it tends to be slow for some. I can see how /proc/<pid>/map_anons could
> provide per vma information in a binary form via a fd based interface.
> But I would rather not conflate those two discussions much - well except
> if it could give one of the approaches more justification but let's
> focus on the madvise part first.

I don't think it's a good idea to focus on one feature in a
multi-feature change when the interactions between features can be
very important for overall design of the multi-feature system and the
design of each feature.

Here's my thinking on the high-level design:

I'm imagining an address-range system that would work like this: we'd
create some kind of process_vm_getinfo(2) system call [1] that would
accept a statx-like attribute map and a pid/fd parameter as input and
return, on output, two things: 1) an array [2] of VMA descriptors
containing the requested information, and 2) a VMA configuration
sequence number. We'd then have process_madvise() and other
cross-process VM interfaces accept both address ranges and this
sequence number; they'd succeed only if the VMA configuration sequence
number is still current, i.e., the target process hasn't changed its
VMA configuration (implicitly or explicitly) since the call to
process_vm_getinfo().

This way, a process A that wants to perform some VM operation on
process B can slurp B's VMA configuration using process_vm_getinfo(),
figure out what it wants to do, and attempt to do it. If B modifies
its memory map in the meantime, If A finds that its local knowledge of
B's memory map has become invalid between the process_vm_getinfo() and
A taking some action based on the result, A can retry [3]. While A
could instead ptrace or otherwise suspend B, *then* read B's memory
map (knowing B is quiescent), *then* operate on B, the optimistic
approach I'm describing would be much lighter-weight in the typical
case. It's also pretty simple, IMHO. If the "operate on B" step is
some kind of vectorized operation over multiple address ranges, this
approach also gets us all-or-nothing semantics.

Or maybe the whole sequence number thing is overkill and we don't need
atomicity? But if there's a concern  that A shouldn't operate on B's
memory without knowing what it's operating on, then the scheme I've
proposed above solves this knowledge problem in a pretty lightweight
way.

[1] or some other interface
[2] or something more complicated if we want the descriptors to
contain variable-length elements, e.g., strings
[3] or override the sequence number check if it's feeling bold?


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28  9:08                       ` Michal Hocko
  2019-05-28  9:39                         ` Daniel Colascione
@ 2019-05-28 10:32                         ` Minchan Kim
  2019-05-28 10:41                           ` Michal Hocko
  1 sibling, 1 reply; 138+ messages in thread
From: Minchan Kim @ 2019-05-28 10:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Daniel Colascione, Andrew Morton, LKML, linux-mm,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Shakeel Butt, Sonny Rao, Brian Geffon, Linux API

On Tue, May 28, 2019 at 11:08:21AM +0200, Michal Hocko wrote:
> On Tue 28-05-19 17:49:27, Minchan Kim wrote:
> > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote:
> > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote:
> > > > if we went with the per vma fd approach then you would get this
> > > > > feature automatically because map_files would refer to file backed
> > > > > mappings while map_anon could refer only to anonymous mappings.
> > > >
> > > > The reason to add such filter option is to avoid the parsing overhead
> > > > so map_anon wouldn't be helpful.
> > > 
> > > Without chiming on whether the filter option is a good idea, I'd like
> > > to suggest that providing an efficient binary interfaces for pulling
> > > memory map information out of processes.  Some single-system-call
> > > method for retrieving a binary snapshot of a process's address space
> > > complete with attributes (selectable, like statx?) for each VMA would
> > > reduce complexity and increase performance in a variety of areas,
> > > e.g., Android memory map debugging commands.
> > 
> > I agree it's the best we can get *generally*.
> > Michal, any opinion?
> 
> I am not really sure this is directly related. I think the primary
> question that we have to sort out first is whether we want to have
> the remote madvise call process or vma fd based. This is an important
> distinction wrt. usability. I have only seen pid vs. pidfd discussions
> so far unfortunately.

With current usecase, it's per-process API with distinguishable anon/file
but thought it could be easily extended later for each address range
operation as userspace getting smarter with more information.


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28  9:39                         ` Daniel Colascione
@ 2019-05-28 10:33                           ` Michal Hocko
  2019-05-28 11:21                             ` Daniel Colascione
  0 siblings, 1 reply; 138+ messages in thread
From: Michal Hocko @ 2019-05-28 10:33 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
	Sonny Rao, Brian Geffon, Linux API

On Tue 28-05-19 02:39:03, Daniel Colascione wrote:
> On Tue, May 28, 2019 at 2:08 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Tue 28-05-19 17:49:27, Minchan Kim wrote:
> > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote:
> > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote:
> > > > > if we went with the per vma fd approach then you would get this
> > > > > > feature automatically because map_files would refer to file backed
> > > > > > mappings while map_anon could refer only to anonymous mappings.
> > > > >
> > > > > The reason to add such filter option is to avoid the parsing overhead
> > > > > so map_anon wouldn't be helpful.
> > > >
> > > > Without chiming on whether the filter option is a good idea, I'd like
> > > > to suggest that providing an efficient binary interfaces for pulling
> > > > memory map information out of processes.  Some single-system-call
> > > > method for retrieving a binary snapshot of a process's address space
> > > > complete with attributes (selectable, like statx?) for each VMA would
> > > > reduce complexity and increase performance in a variety of areas,
> > > > e.g., Android memory map debugging commands.
> > >
> > > I agree it's the best we can get *generally*.
> > > Michal, any opinion?
> >
> > I am not really sure this is directly related. I think the primary
> > question that we have to sort out first is whether we want to have
> > the remote madvise call process or vma fd based. This is an important
> > distinction wrt. usability. I have only seen pid vs. pidfd discussions
> > so far unfortunately.
> 
> I don't think the vma fd approach is viable. We have some processes
> with a *lot* of VMAs --- system_server had 4204 when I checked just
> now (and that's typical) --- and an FD operation per VMA would be
> excessive.

What do you mean by excessive here? Do you expect the process to have
them open all at once?

> VMAs also come and go pretty easily depending on changes in
> protections and various faults.

Is this really too much different from /proc/<pid>/map_files?

[...]

> > An interface to query address range information is a separate but
> > although a related topic. We have /proc/<pid>/[s]maps for that right
> > now and I understand it is not a general win for all usecases because
> > it tends to be slow for some. I can see how /proc/<pid>/map_anons could
> > provide per vma information in a binary form via a fd based interface.
> > But I would rather not conflate those two discussions much - well except
> > if it could give one of the approaches more justification but let's
> > focus on the madvise part first.
> 
> I don't think it's a good idea to focus on one feature in a
> multi-feature change when the interactions between features can be
> very important for overall design of the multi-feature system and the
> design of each feature.
> 
> Here's my thinking on the high-level design:
> 
> I'm imagining an address-range system that would work like this: we'd
> create some kind of process_vm_getinfo(2) system call [1] that would
> accept a statx-like attribute map and a pid/fd parameter as input and
> return, on output, two things: 1) an array [2] of VMA descriptors
> containing the requested information, and 2) a VMA configuration
> sequence number. We'd then have process_madvise() and other
> cross-process VM interfaces accept both address ranges and this
> sequence number; they'd succeed only if the VMA configuration sequence
> number is still current, i.e., the target process hasn't changed its
> VMA configuration (implicitly or explicitly) since the call to
> process_vm_getinfo().

The sequence number is essentially a cookie that is transparent to the
userspace right? If yes, how does it differ from a fd (returned from
/proc/<pid>/map_{anons,files}/range) which is a cookie itself and it can
be used to revalidate when the operation is requested and fail if
something has changed. Moreover we already do have a fd based madvise
syscall so there shouldn't be really a large need to add a new set of
syscalls.

[...]

> Or maybe the whole sequence number thing is overkill and we don't need
> atomicity? But if there's a concern  that A shouldn't operate on B's
> memory without knowing what it's operating on, then the scheme I've
> proposed above solves this knowledge problem in a pretty lightweight
> way.

This is the main question here. Do we really want to enforce an external
synchronization between the two processes to make sure that they are
both operating on the same range - aka protect from the range going away
and being reused for a different purpose. Right now it wouldn't be fatal
because both operations are non destructive but I can imagine that there
will be more madvise operations to follow (including those that are
destructive) because people will simply find usecases for that. This
should be reflected in the proposed API.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28 10:32                         ` Minchan Kim
@ 2019-05-28 10:41                           ` Michal Hocko
  2019-05-28 11:12                             ` Minchan Kim
  2019-05-28 11:28                             ` Daniel Colascione
  0 siblings, 2 replies; 138+ messages in thread
From: Michal Hocko @ 2019-05-28 10:41 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Daniel Colascione, Andrew Morton, LKML, linux-mm,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Shakeel Butt, Sonny Rao, Brian Geffon, Linux API

On Tue 28-05-19 19:32:56, Minchan Kim wrote:
> On Tue, May 28, 2019 at 11:08:21AM +0200, Michal Hocko wrote:
> > On Tue 28-05-19 17:49:27, Minchan Kim wrote:
> > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote:
> > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote:
> > > > > if we went with the per vma fd approach then you would get this
> > > > > > feature automatically because map_files would refer to file backed
> > > > > > mappings while map_anon could refer only to anonymous mappings.
> > > > >
> > > > > The reason to add such filter option is to avoid the parsing overhead
> > > > > so map_anon wouldn't be helpful.
> > > > 
> > > > Without chiming on whether the filter option is a good idea, I'd like
> > > > to suggest that providing an efficient binary interfaces for pulling
> > > > memory map information out of processes.  Some single-system-call
> > > > method for retrieving a binary snapshot of a process's address space
> > > > complete with attributes (selectable, like statx?) for each VMA would
> > > > reduce complexity and increase performance in a variety of areas,
> > > > e.g., Android memory map debugging commands.
> > > 
> > > I agree it's the best we can get *generally*.
> > > Michal, any opinion?
> > 
> > I am not really sure this is directly related. I think the primary
> > question that we have to sort out first is whether we want to have
> > the remote madvise call process or vma fd based. This is an important
> > distinction wrt. usability. I have only seen pid vs. pidfd discussions
> > so far unfortunately.
> 
> With current usecase, it's per-process API with distinguishable anon/file
> but thought it could be easily extended later for each address range
> operation as userspace getting smarter with more information.

Never design user API based on a single usecase, please. The "easily
extended" part is by far not clear to me TBH. As I've already mentioned
several times, the synchronization model has to be thought through
carefuly before a remote process address range operation can be
implemented.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 0/7] introduce memory hinting API for external process
  2019-05-21 10:34       ` Michal Hocko
@ 2019-05-28 10:50         ` Anshuman Khandual
  0 siblings, 0 replies; 138+ messages in thread
From: Anshuman Khandual @ 2019-05-28 10:50 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tim Murray, Minchan Kim, Andrew Morton, LKML, linux-mm,
	Johannes Weiner, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon



On 05/21/2019 04:04 PM, Michal Hocko wrote:
> On Tue 21-05-19 08:25:55, Anshuman Khandual wrote:
>> On 05/20/2019 10:29 PM, Tim Murray wrote:
> [...]
>>> not seem to introduce a noticeable hot start penalty, not does it
>>> cause an increase in performance problems later in the app's
>>> lifecycle. I've measured with and without process_madvise, and the
>>> differences are within our noise bounds. Second, because we're not
>>
>> That is assuming that post process_madvise() working set for the application is
>> always smaller. There is another challenge. The external process should ideally
>> have the knowledge of active areas of the working set for an application in
>> question for it to invoke process_madvise() correctly to prevent such scenarios.
> 
> But that doesn't really seem relevant for the API itself, right? The
> higher level logic the monitor's business.

Right. I was just wondering how the monitor would even decide what areas of the
target application is active or inactive. The target application is still just an
opaque entity for the monitor unless there is some sort of communication. But you
are right, this not relevant to the API itself.


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 1/7] mm: introduce MADV_COOL
  2019-05-20  3:52 ` [RFC 1/7] mm: introduce MADV_COOL Minchan Kim
  2019-05-20  8:16   ` Michal Hocko
  2019-05-28  8:53   ` Hillf Danton
@ 2019-05-28 10:58   ` Minchan Kim
  2 siblings, 0 replies; 138+ messages in thread
From: Minchan Kim @ 2019-05-28 10:58 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Andrew Morton, LKML, linux-mm, Michal Hocko, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon

On Tue, May 28, 2019 at 04:53:01PM +0800, Hillf Danton wrote:
> 
> On Mon, 20 May 2019 12:52:48 +0900 Minchan Kim wrote:
> > +static int madvise_cool_pte_range(pmd_t *pmd, unsigned long addr,
> > +				unsigned long end, struct mm_walk *walk)
> > +{
> > +	pte_t *orig_pte, *pte, ptent;
> > +	spinlock_t *ptl;
> > +	struct page *page;
> > +	struct vm_area_struct *vma = walk->vma;
> > +	unsigned long next;
> > +
> > +	next = pmd_addr_end(addr, end);
> > +	if (pmd_trans_huge(*pmd)) {
> > +		spinlock_t *ptl;
> 
> Seems not needed with another ptl declared above.

Will remove it.

> > +
> > +		ptl = pmd_trans_huge_lock(pmd, vma);
> > +		if (!ptl)
> > +			return 0;
> > +
> > +		if (is_huge_zero_pmd(*pmd))
> > +			goto huge_unlock;
> > +
> > +		page = pmd_page(*pmd);
> > +		if (page_mapcount(page) > 1)
> > +			goto huge_unlock;
> > +
> > +		if (next - addr != HPAGE_PMD_SIZE) {
> > +			int err;
> 
> Alternately, we deactivate thp only if the address range from userspace
> is sane enough, in order to avoid complex works we have to do here.

Not sure it's a good idea. That's the way we have done in MADV_FREE
so want to be consistent.

> > +
> > +			get_page(page);
> > +			spin_unlock(ptl);
> > +			lock_page(page);
> > +			err = split_huge_page(page);
> > +			unlock_page(page);
> > +			put_page(page);
> > +			if (!err)
> > +				goto regular_page;
> > +			return 0;
> > +		}
> > +
> > +		pmdp_test_and_clear_young(vma, addr, pmd);
> > +		deactivate_page(page);
> > +huge_unlock:
> > +		spin_unlock(ptl);
> > +		return 0;
> > +	}
> > +
> > +	if (pmd_trans_unstable(pmd))
> > +		return 0;
> > +
> > +regular_page:
> 
> Take a look at pending signal?

Do you have any reason to see pending signal here? I want to know what's
your requirement so that what's the better place to handle it.

> 
> > +	orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > +	for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {
> 
> s/end/next/ ?

Why do you think it should be next?

> > +		ptent = *pte;
> > +
> > +		if (pte_none(ptent))
> > +			continue;
> > +
> > +		if (!pte_present(ptent))
> > +			continue;
> > +
> > +		page = vm_normal_page(vma, addr, ptent);
> > +		if (!page)
> > +			continue;
> > +
> > +		if (page_mapcount(page) > 1)
> > +			continue;
> > +
> > +		ptep_test_and_clear_young(vma, addr, pte);
> > +		deactivate_page(page);
> > +	}
> > +
> > +	pte_unmap_unlock(orig_pte, ptl);
> > +	cond_resched();
> > +
> > +	return 0;
> > +}
> > +
> > +static long madvise_cool(struct vm_area_struct *vma,
> > +			unsigned long start_addr, unsigned long end_addr)
> > +{
> > +	struct mm_struct *mm = vma->vm_mm;
> > +	struct mmu_gather tlb;
> > +
> > +	if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
> > +		return -EINVAL;
> 
> No service in case of VM_IO?

I don't know VM_IO would have regular LRU pages but just follow normal
convention for DONTNEED and FREE.
Do you have anything in your mind?


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28 10:41                           ` Michal Hocko
@ 2019-05-28 11:12                             ` Minchan Kim
  2019-05-28 11:28                               ` Michal Hocko
  2019-05-28 11:28                             ` Daniel Colascione
  1 sibling, 1 reply; 138+ messages in thread
From: Minchan Kim @ 2019-05-28 11:12 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Daniel Colascione, Andrew Morton, LKML, linux-mm,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Shakeel Butt, Sonny Rao, Brian Geffon, Linux API

On Tue, May 28, 2019 at 12:41:17PM +0200, Michal Hocko wrote:
> On Tue 28-05-19 19:32:56, Minchan Kim wrote:
> > On Tue, May 28, 2019 at 11:08:21AM +0200, Michal Hocko wrote:
> > > On Tue 28-05-19 17:49:27, Minchan Kim wrote:
> > > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote:
> > > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote:
> > > > > > if we went with the per vma fd approach then you would get this
> > > > > > > feature automatically because map_files would refer to file backed
> > > > > > > mappings while map_anon could refer only to anonymous mappings.
> > > > > >
> > > > > > The reason to add such filter option is to avoid the parsing overhead
> > > > > > so map_anon wouldn't be helpful.
> > > > > 
> > > > > Without chiming on whether the filter option is a good idea, I'd like
> > > > > to suggest that providing an efficient binary interfaces for pulling
> > > > > memory map information out of processes.  Some single-system-call
> > > > > method for retrieving a binary snapshot of a process's address space
> > > > > complete with attributes (selectable, like statx?) for each VMA would
> > > > > reduce complexity and increase performance in a variety of areas,
> > > > > e.g., Android memory map debugging commands.
> > > > 
> > > > I agree it's the best we can get *generally*.
> > > > Michal, any opinion?
> > > 
> > > I am not really sure this is directly related. I think the primary
> > > question that we have to sort out first is whether we want to have
> > > the remote madvise call process or vma fd based. This is an important
> > > distinction wrt. usability. I have only seen pid vs. pidfd discussions
> > > so far unfortunately.
> > 
> > With current usecase, it's per-process API with distinguishable anon/file
> > but thought it could be easily extended later for each address range
> > operation as userspace getting smarter with more information.
> 
> Never design user API based on a single usecase, please. The "easily
> extended" part is by far not clear to me TBH. As I've already mentioned
> several times, the synchronization model has to be thought through
> carefuly before a remote process address range operation can be
> implemented.

I agree with you that we shouldn't design API on single usecase but what
you are concerning is actually not our usecase because we are resilient
with the race since MADV_COLD|PAGEOUT is not destruptive.
Actually, many hints are already racy in that the upcoming pattern would
be different with the behavior you thought at the moment.

If you are still concerning of address range synchronization, how about
moving such hints to per-process level like prctl?
Does it make sense to you?


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28 10:33                           ` Michal Hocko
@ 2019-05-28 11:21                             ` Daniel Colascione
  2019-05-28 11:49                               ` Michal Hocko
  0 siblings, 1 reply; 138+ messages in thread
From: Daniel Colascione @ 2019-05-28 11:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
	Sonny Rao, Brian Geffon, Linux API

On Tue, May 28, 2019 at 3:33 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 28-05-19 02:39:03, Daniel Colascione wrote:
> > On Tue, May 28, 2019 at 2:08 AM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Tue 28-05-19 17:49:27, Minchan Kim wrote:
> > > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote:
> > > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote:
> > > > > > if we went with the per vma fd approach then you would get this
> > > > > > > feature automatically because map_files would refer to file backed
> > > > > > > mappings while map_anon could refer only to anonymous mappings.
> > > > > >
> > > > > > The reason to add such filter option is to avoid the parsing overhead
> > > > > > so map_anon wouldn't be helpful.
> > > > >
> > > > > Without chiming on whether the filter option is a good idea, I'd like
> > > > > to suggest that providing an efficient binary interfaces for pulling
> > > > > memory map information out of processes.  Some single-system-call
> > > > > method for retrieving a binary snapshot of a process's address space
> > > > > complete with attributes (selectable, like statx?) for each VMA would
> > > > > reduce complexity and increase performance in a variety of areas,
> > > > > e.g., Android memory map debugging commands.
> > > >
> > > > I agree it's the best we can get *generally*.
> > > > Michal, any opinion?
> > >
> > > I am not really sure this is directly related. I think the primary
> > > question that we have to sort out first is whether we want to have
> > > the remote madvise call process or vma fd based. This is an important
> > > distinction wrt. usability. I have only seen pid vs. pidfd discussions
> > > so far unfortunately.
> >
> > I don't think the vma fd approach is viable. We have some processes
> > with a *lot* of VMAs --- system_server had 4204 when I checked just
> > now (and that's typical) --- and an FD operation per VMA would be
> > excessive.
>
> What do you mean by excessive here? Do you expect the process to have
> them open all at once?

Minchan's already done timing. More broadly, in an era with various
speculative execution mitigations, making a system call is pretty
expensive. If we have two options for remote VMA manipulation, one
that requires thousands of system calls (with the count proportional
to the address space size of the process) and one that requires only a
few system calls no matter how large the target process is, the latter
ought to start off with more points than the former under any kind of
design scoring.

> > VMAs also come and go pretty easily depending on changes in
> > protections and various faults.
>
> Is this really too much different from /proc/<pid>/map_files?

It's very different. See below.

> > > An interface to query address range information is a separate but
> > > although a related topic. We have /proc/<pid>/[s]maps for that right
> > > now and I understand it is not a general win for all usecases because
> > > it tends to be slow for some. I can see how /proc/<pid>/map_anons could
> > > provide per vma information in a binary form via a fd based interface.
> > > But I would rather not conflate those two discussions much - well except
> > > if it could give one of the approaches more justification but let's
> > > focus on the madvise part first.
> >
> > I don't think it's a good idea to focus on one feature in a
> > multi-feature change when the interactions between features can be
> > very important for overall design of the multi-feature system and the
> > design of each feature.
> >
> > Here's my thinking on the high-level design:
> >
> > I'm imagining an address-range system that would work like this: we'd
> > create some kind of process_vm_getinfo(2) system call [1] that would
> > accept a statx-like attribute map and a pid/fd parameter as input and
> > return, on output, two things: 1) an array [2] of VMA descriptors
> > containing the requested information, and 2) a VMA configuration
> > sequence number. We'd then have process_madvise() and other
> > cross-process VM interfaces accept both address ranges and this
> > sequence number; they'd succeed only if the VMA configuration sequence
> > number is still current, i.e., the target process hasn't changed its
> > VMA configuration (implicitly or explicitly) since the call to
> > process_vm_getinfo().
>
> The sequence number is essentially a cookie that is transparent to the
> userspace right? If yes, how does it differ from a fd (returned from
> /proc/<pid>/map_{anons,files}/range) which is a cookie itself and it can

If you want to operate on N VMAs simultaneously under an FD-per-VMA
model, you'd need to have those N FDs all open at the same time *and*
add some kind of system call that accepted those N FDs and an
operation to perform. The sequence number I'm proposing also applies
to the whole address space, not just one VMA. Even if you did have
these N FDs open all at once and supplied them all to some batch
operation, you couldn't guarantee via the FD mechanism that some *new*
VMA didn't appear in the address range you want to manipulate. A
global sequence number would catch this case. I still think supplying
a list of address ranges (like we already do for scatter-gather IO) is
less error-prone, less resource-intensive, more consistent with
existing practice, and equally flexible, especially if we start
supporting destructive cross-process memory operations, which may be
useful for things like checkpointing and optimizing process startup.

Besides: process_vm_readv and process_vm_writev already work on
address ranges. Why should other cross-process memory APIs use a very
different model for naming memory regions?

> be used to revalidate when the operation is requested and fail if
> something has changed. Moreover we already do have a fd based madvise
> syscall so there shouldn't be really a large need to add a new set of
> syscalls.

We have various system calls that provide hints for open files, but
the memory operations are distinct. Modeling anonymous memory as a
kind of file-backed memory for purposes of VMA manipulation would also
be a departure from existing practice. Can you help me understand why
you seem to favor the FD-per-VMA approach so heavily? I don't see any
arguments *for* an FD-per-VMA model for remove memory manipulation and
I see a lot of arguments against it. Is there some compelling
advantage I'm missing?

> > Or maybe the whole sequence number thing is overkill and we don't need
> > atomicity? But if there's a concern  that A shouldn't operate on B's
> > memory without knowing what it's operating on, then the scheme I've
> > proposed above solves this knowledge problem in a pretty lightweight
> > way.
>
> This is the main question here. Do we really want to enforce an external
> synchronization between the two processes to make sure that they are
> both operating on the same range - aka protect from the range going away
> and being reused for a different purpose. Right now it wouldn't be fatal
> because both operations are non destructive but I can imagine that there
> will be more madvise operations to follow (including those that are
> destructive) because people will simply find usecases for that. This
> should be reflected in the proposed API.

A sequence number gives us this synchronization at very low cost and
adds safety. It's also a general-purpose mechanism that would
safeguard *any* cross-process VM operation, not just the VM operations
we're discussing right now.


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28 11:12                             ` Minchan Kim
@ 2019-05-28 11:28                               ` Michal Hocko
  2019-05-28 11:42                                 ` Daniel Colascione
  2019-05-28 11:44                                 ` Minchan Kim
  0 siblings, 2 replies; 138+ messages in thread
From: Michal Hocko @ 2019-05-28 11:28 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Daniel Colascione, Andrew Morton, LKML, linux-mm,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Shakeel Butt, Sonny Rao, Brian Geffon, Linux API

On Tue 28-05-19 20:12:08, Minchan Kim wrote:
> On Tue, May 28, 2019 at 12:41:17PM +0200, Michal Hocko wrote:
> > On Tue 28-05-19 19:32:56, Minchan Kim wrote:
> > > On Tue, May 28, 2019 at 11:08:21AM +0200, Michal Hocko wrote:
> > > > On Tue 28-05-19 17:49:27, Minchan Kim wrote:
> > > > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote:
> > > > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote:
> > > > > > > if we went with the per vma fd approach then you would get this
> > > > > > > > feature automatically because map_files would refer to file backed
> > > > > > > > mappings while map_anon could refer only to anonymous mappings.
> > > > > > >
> > > > > > > The reason to add such filter option is to avoid the parsing overhead
> > > > > > > so map_anon wouldn't be helpful.
> > > > > > 
> > > > > > Without chiming on whether the filter option is a good idea, I'd like
> > > > > > to suggest that providing an efficient binary interfaces for pulling
> > > > > > memory map information out of processes.  Some single-system-call
> > > > > > method for retrieving a binary snapshot of a process's address space
> > > > > > complete with attributes (selectable, like statx?) for each VMA would
> > > > > > reduce complexity and increase performance in a variety of areas,
> > > > > > e.g., Android memory map debugging commands.
> > > > > 
> > > > > I agree it's the best we can get *generally*.
> > > > > Michal, any opinion?
> > > > 
> > > > I am not really sure this is directly related. I think the primary
> > > > question that we have to sort out first is whether we want to have
> > > > the remote madvise call process or vma fd based. This is an important
> > > > distinction wrt. usability. I have only seen pid vs. pidfd discussions
> > > > so far unfortunately.
> > > 
> > > With current usecase, it's per-process API with distinguishable anon/file
> > > but thought it could be easily extended later for each address range
> > > operation as userspace getting smarter with more information.
> > 
> > Never design user API based on a single usecase, please. The "easily
> > extended" part is by far not clear to me TBH. As I've already mentioned
> > several times, the synchronization model has to be thought through
> > carefuly before a remote process address range operation can be
> > implemented.
> 
> I agree with you that we shouldn't design API on single usecase but what
> you are concerning is actually not our usecase because we are resilient
> with the race since MADV_COLD|PAGEOUT is not destruptive.
> Actually, many hints are already racy in that the upcoming pattern would
> be different with the behavior you thought at the moment.

How come they are racy wrt address ranges? You would have to be in
multithreaded environment and then the onus of synchronization is on
threads. That model is quite clear. But we are talking about separate
processes and some of them might be even not aware of an external entity
tweaking their address space.

> If you are still concerning of address range synchronization, how about
> moving such hints to per-process level like prctl?
> Does it make sense to you?

No it doesn't. How is prctl any relevant to any address range
operations.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28 10:41                           ` Michal Hocko
  2019-05-28 11:12                             ` Minchan Kim
@ 2019-05-28 11:28                             ` Daniel Colascione
  1 sibling, 0 replies; 138+ messages in thread
From: Daniel Colascione @ 2019-05-28 11:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
	Sonny Rao, Brian Geffon, Linux API

On Tue, May 28, 2019 at 3:41 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 28-05-19 19:32:56, Minchan Kim wrote:
> > On Tue, May 28, 2019 at 11:08:21AM +0200, Michal Hocko wrote:
> > > On Tue 28-05-19 17:49:27, Minchan Kim wrote:
> > > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote:
> > > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote:
> > > > > > if we went with the per vma fd approach then you would get this
> > > > > > > feature automatically because map_files would refer to file backed
> > > > > > > mappings while map_anon could refer only to anonymous mappings.
> > > > > >
> > > > > > The reason to add such filter option is to avoid the parsing overhead
> > > > > > so map_anon wouldn't be helpful.
> > > > >
> > > > > Without chiming on whether the filter option is a good idea, I'd like
> > > > > to suggest that providing an efficient binary interfaces for pulling
> > > > > memory map information out of processes.  Some single-system-call
> > > > > method for retrieving a binary snapshot of a process's address space
> > > > > complete with attributes (selectable, like statx?) for each VMA would
> > > > > reduce complexity and increase performance in a variety of areas,
> > > > > e.g., Android memory map debugging commands.
> > > >
> > > > I agree it's the best we can get *generally*.
> > > > Michal, any opinion?
> > >
> > > I am not really sure this is directly related. I think the primary
> > > question that we have to sort out first is whether we want to have
> > > the remote madvise call process or vma fd based. This is an important
> > > distinction wrt. usability. I have only seen pid vs. pidfd discussions
> > > so far unfortunately.
> >
> > With current usecase, it's per-process API with distinguishable anon/file
> > but thought it could be easily extended later for each address range
> > operation as userspace getting smarter with more information.
>
> Never design user API based on a single usecase, please. The "easily
> extended" part is by far not clear to me TBH. As I've already mentioned
> several times, the synchronization model has to be thought through
> carefuly before a remote process address range operation can be
> implemented.

I don't think anyone is overfitting for a specific use case. When some
process A wants to manipulate process B's memory, it's fair for A to
want to know what memory it's manipulating. That's a general concern
that applies to a large family of cross-process memory operations.
It's less important for non-destructive hints than for some kind of
destructive operation, but the same idea applies. If there's a simple
way to solve this A-B information problem in a general way, it seems
to be that we should apply that general solution. Likewise, an API to
get an efficiently-collected snapshot of a process's address space
would be immediately useful in several very different use cases,
including debuggers, Android memory use reporting tools, and various
kinds of metric collection. Because we're talking about mechanisms
that solve several independent problems at the same time and in a
general way, it doesn't sound to me like overfitting for a particular
use case.


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28 11:28                               ` Michal Hocko
@ 2019-05-28 11:42                                 ` Daniel Colascione
  2019-05-28 11:56                                   ` Michal Hocko
  2019-05-28 12:10                                   ` Minchan Kim
  2019-05-28 11:44                                 ` Minchan Kim
  1 sibling, 2 replies; 138+ messages in thread
From: Daniel Colascione @ 2019-05-28 11:42 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
	Sonny Rao, Brian Geffon, Linux API

On Tue, May 28, 2019 at 4:28 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 28-05-19 20:12:08, Minchan Kim wrote:
> > On Tue, May 28, 2019 at 12:41:17PM +0200, Michal Hocko wrote:
> > > On Tue 28-05-19 19:32:56, Minchan Kim wrote:
> > > > On Tue, May 28, 2019 at 11:08:21AM +0200, Michal Hocko wrote:
> > > > > On Tue 28-05-19 17:49:27, Minchan Kim wrote:
> > > > > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote:
> > > > > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote:
> > > > > > > > if we went with the per vma fd approach then you would get this
> > > > > > > > > feature automatically because map_files would refer to file backed
> > > > > > > > > mappings while map_anon could refer only to anonymous mappings.
> > > > > > > >
> > > > > > > > The reason to add such filter option is to avoid the parsing overhead
> > > > > > > > so map_anon wouldn't be helpful.
> > > > > > >
> > > > > > > Without chiming on whether the filter option is a good idea, I'd like
> > > > > > > to suggest that providing an efficient binary interfaces for pulling
> > > > > > > memory map information out of processes.  Some single-system-call
> > > > > > > method for retrieving a binary snapshot of a process's address space
> > > > > > > complete with attributes (selectable, like statx?) for each VMA would
> > > > > > > reduce complexity and increase performance in a variety of areas,
> > > > > > > e.g., Android memory map debugging commands.
> > > > > >
> > > > > > I agree it's the best we can get *generally*.
> > > > > > Michal, any opinion?
> > > > >
> > > > > I am not really sure this is directly related. I think the primary
> > > > > question that we have to sort out first is whether we want to have
> > > > > the remote madvise call process or vma fd based. This is an important
> > > > > distinction wrt. usability. I have only seen pid vs. pidfd discussions
> > > > > so far unfortunately.
> > > >
> > > > With current usecase, it's per-process API with distinguishable anon/file
> > > > but thought it could be easily extended later for each address range
> > > > operation as userspace getting smarter with more information.
> > >
> > > Never design user API based on a single usecase, please. The "easily
> > > extended" part is by far not clear to me TBH. As I've already mentioned
> > > several times, the synchronization model has to be thought through
> > > carefuly before a remote process address range operation can be
> > > implemented.
> >
> > I agree with you that we shouldn't design API on single usecase but what
> > you are concerning is actually not our usecase because we are resilient
> > with the race since MADV_COLD|PAGEOUT is not destruptive.
> > Actually, many hints are already racy in that the upcoming pattern would
> > be different with the behavior you thought at the moment.
>
> How come they are racy wrt address ranges? You would have to be in
> multithreaded environment and then the onus of synchronization is on
> threads. That model is quite clear. But we are talking about separate
> processes and some of them might be even not aware of an external entity
> tweaking their address space.

I don't think the difference between a thread and a process matters in
this context. Threads race on address space operations all the time
--- in the sense that multiple threads modify a process's address
space without synchronization. The main reasons that these races
hasn't been a problem are: 1) threads mostly "mind their own business"
and modify different parts of the address space or use locks to ensure
that they don't stop on each other (e.g., the malloc heap lock), and
2) POSIX mmap atomic-replacement semantics make certain classes of
operation (like "magic ring buffer" setup) safe even in the presence
of other threads stomping over an address space.

The thing that's new in this discussion from a synchronization point
of view isn't that the VM operation we're talking about is coming from
outside the process, but that we want to do a read-decide-modify-ish
thing. We want to affect (using various hints) classes of pages like
"all file pages" or "all anonymous pages" or "some pages referring to
graphics buffers up to 100MB" (to pick an example off the top of my
head of a policy that might make sense). From a synchronization point
of view, it doesn't really matter whether it's a thread within the
target process or a thread outside the target process that does the
address space manipulation. What's new is the inspection of the
address space before performing an operation.

Minchan started this thread by proposing some flags that would
implement a few of the filtering policies I used as examples above.
Personally, instead of providing a few pre-built policies as flags,
I'd rather push the page manipulation policy to userspace as much as
possible and just have the kernel provide a mechanism that *in
general* makes these read-decide-modify operations efficient and
robust. I still think there's way to achieve this goal very
inexpensively without compromising on flexibility.


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28 11:28                               ` Michal Hocko
  2019-05-28 11:42                                 ` Daniel Colascione
@ 2019-05-28 11:44                                 ` Minchan Kim
  2019-05-28 11:51                                   ` Daniel Colascione
  2019-05-28 12:06                                   ` Michal Hocko
  1 sibling, 2 replies; 138+ messages in thread
From: Minchan Kim @ 2019-05-28 11:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Daniel Colascione, Andrew Morton, LKML, linux-mm,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Shakeel Butt, Sonny Rao, Brian Geffon, Linux API

On Tue, May 28, 2019 at 01:28:40PM +0200, Michal Hocko wrote:
> On Tue 28-05-19 20:12:08, Minchan Kim wrote:
> > On Tue, May 28, 2019 at 12:41:17PM +0200, Michal Hocko wrote:
> > > On Tue 28-05-19 19:32:56, Minchan Kim wrote:
> > > > On Tue, May 28, 2019 at 11:08:21AM +0200, Michal Hocko wrote:
> > > > > On Tue 28-05-19 17:49:27, Minchan Kim wrote:
> > > > > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote:
> > > > > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote:
> > > > > > > > if we went with the per vma fd approach then you would get this
> > > > > > > > > feature automatically because map_files would refer to file backed
> > > > > > > > > mappings while map_anon could refer only to anonymous mappings.
> > > > > > > >
> > > > > > > > The reason to add such filter option is to avoid the parsing overhead
> > > > > > > > so map_anon wouldn't be helpful.
> > > > > > > 
> > > > > > > Without chiming on whether the filter option is a good idea, I'd like
> > > > > > > to suggest that providing an efficient binary interfaces for pulling
> > > > > > > memory map information out of processes.  Some single-system-call
> > > > > > > method for retrieving a binary snapshot of a process's address space
> > > > > > > complete with attributes (selectable, like statx?) for each VMA would
> > > > > > > reduce complexity and increase performance in a variety of areas,
> > > > > > > e.g., Android memory map debugging commands.
> > > > > > 
> > > > > > I agree it's the best we can get *generally*.
> > > > > > Michal, any opinion?
> > > > > 
> > > > > I am not really sure this is directly related. I think the primary
> > > > > question that we have to sort out first is whether we want to have
> > > > > the remote madvise call process or vma fd based. This is an important
> > > > > distinction wrt. usability. I have only seen pid vs. pidfd discussions
> > > > > so far unfortunately.
> > > > 
> > > > With current usecase, it's per-process API with distinguishable anon/file
> > > > but thought it could be easily extended later for each address range
> > > > operation as userspace getting smarter with more information.
> > > 
> > > Never design user API based on a single usecase, please. The "easily
> > > extended" part is by far not clear to me TBH. As I've already mentioned
> > > several times, the synchronization model has to be thought through
> > > carefuly before a remote process address range operation can be
> > > implemented.
> > 
> > I agree with you that we shouldn't design API on single usecase but what
> > you are concerning is actually not our usecase because we are resilient
> > with the race since MADV_COLD|PAGEOUT is not destruptive.
> > Actually, many hints are already racy in that the upcoming pattern would
> > be different with the behavior you thought at the moment.
> 
> How come they are racy wrt address ranges? You would have to be in
> multithreaded environment and then the onus of synchronization is on
> threads. That model is quite clear. But we are talking about separate

Think about MADV_FREE. Allocator would think the chunk is worth to mark
"freeable" but soon, user of the allocator asked the chunk - ie, it's not
freeable any longer once user start to use it.

My point is that kinds of *hints* are always racy so any synchronization
couldn't help a lot. That's why I want to restrict hints process_madvise
supports as such kinds of non-destruptive one at next respin.

> processes and some of them might be even not aware of an external entity
> tweaking their address space.
> 
> > If you are still concerning of address range synchronization, how about
> > moving such hints to per-process level like prctl?
> > Does it make sense to you?
> 
> No it doesn't. How is prctl any relevant to any address range
> operations.

"whether we want to have the remote madvise call process or vma fd based."

You asked the above question and I answered we are using process level
hints but anon/vma filter at this moment. That's why I told you prctl to
make forward progress on discussion.


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28 11:21                             ` Daniel Colascione
@ 2019-05-28 11:49                               ` Michal Hocko
  2019-05-28 12:11                                 ` Daniel Colascione
  0 siblings, 1 reply; 138+ messages in thread
From: Michal Hocko @ 2019-05-28 11:49 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
	Sonny Rao, Brian Geffon, Linux API

On Tue 28-05-19 04:21:44, Daniel Colascione wrote:
> On Tue, May 28, 2019 at 3:33 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Tue 28-05-19 02:39:03, Daniel Colascione wrote:
> > > On Tue, May 28, 2019 at 2:08 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > >
> > > > On Tue 28-05-19 17:49:27, Minchan Kim wrote:
> > > > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote:
> > > > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote:
> > > > > > > if we went with the per vma fd approach then you would get this
> > > > > > > > feature automatically because map_files would refer to file backed
> > > > > > > > mappings while map_anon could refer only to anonymous mappings.
> > > > > > >
> > > > > > > The reason to add such filter option is to avoid the parsing overhead
> > > > > > > so map_anon wouldn't be helpful.
> > > > > >
> > > > > > Without chiming on whether the filter option is a good idea, I'd like
> > > > > > to suggest that providing an efficient binary interfaces for pulling
> > > > > > memory map information out of processes.  Some single-system-call
> > > > > > method for retrieving a binary snapshot of a process's address space
> > > > > > complete with attributes (selectable, like statx?) for each VMA would
> > > > > > reduce complexity and increase performance in a variety of areas,
> > > > > > e.g., Android memory map debugging commands.
> > > > >
> > > > > I agree it's the best we can get *generally*.
> > > > > Michal, any opinion?
> > > >
> > > > I am not really sure this is directly related. I think the primary
> > > > question that we have to sort out first is whether we want to have
> > > > the remote madvise call process or vma fd based. This is an important
> > > > distinction wrt. usability. I have only seen pid vs. pidfd discussions
> > > > so far unfortunately.
> > >
> > > I don't think the vma fd approach is viable. We have some processes
> > > with a *lot* of VMAs --- system_server had 4204 when I checked just
> > > now (and that's typical) --- and an FD operation per VMA would be
> > > excessive.
> >
> > What do you mean by excessive here? Do you expect the process to have
> > them open all at once?
> 
> Minchan's already done timing. More broadly, in an era with various
> speculative execution mitigations, making a system call is pretty
> expensive.

This is a completely separate discussion. This could be argued about
many other syscalls. Let's make the semantic correct first before we
even start thinking about mutliplexing. It is easier to multiplex on an
existing and sane interface.

Btw. Minchan concluded that multiplexing is not really all that
important based on his numbers http://lkml.kernel.org/r/20190527074940.GB6879@google.com

[...]

> > Is this really too much different from /proc/<pid>/map_files?
> 
> It's very different. See below.
> 
> > > > An interface to query address range information is a separate but
> > > > although a related topic. We have /proc/<pid>/[s]maps for that right
> > > > now and I understand it is not a general win for all usecases because
> > > > it tends to be slow for some. I can see how /proc/<pid>/map_anons could
> > > > provide per vma information in a binary form via a fd based interface.
> > > > But I would rather not conflate those two discussions much - well except
> > > > if it could give one of the approaches more justification but let's
> > > > focus on the madvise part first.
> > >
> > > I don't think it's a good idea to focus on one feature in a
> > > multi-feature change when the interactions between features can be
> > > very important for overall design of the multi-feature system and the
> > > design of each feature.
> > >
> > > Here's my thinking on the high-level design:
> > >
> > > I'm imagining an address-range system that would work like this: we'd
> > > create some kind of process_vm_getinfo(2) system call [1] that would
> > > accept a statx-like attribute map and a pid/fd parameter as input and
> > > return, on output, two things: 1) an array [2] of VMA descriptors
> > > containing the requested information, and 2) a VMA configuration
> > > sequence number. We'd then have process_madvise() and other
> > > cross-process VM interfaces accept both address ranges and this
> > > sequence number; they'd succeed only if the VMA configuration sequence
> > > number is still current, i.e., the target process hasn't changed its
> > > VMA configuration (implicitly or explicitly) since the call to
> > > process_vm_getinfo().
> >
> > The sequence number is essentially a cookie that is transparent to the
> > userspace right? If yes, how does it differ from a fd (returned from
> > /proc/<pid>/map_{anons,files}/range) which is a cookie itself and it can
> 
> If you want to operate on N VMAs simultaneously under an FD-per-VMA
> model, you'd need to have those N FDs all open at the same time *and*
> add some kind of system call that accepted those N FDs and an
> operation to perform. The sequence number I'm proposing also applies
> to the whole address space, not just one VMA. Even if you did have
> these N FDs open all at once and supplied them all to some batch
> operation, you couldn't guarantee via the FD mechanism that some *new*
> VMA didn't appear in the address range you want to manipulate. A
> global sequence number would catch this case. I still think supplying
> a list of address ranges (like we already do for scatter-gather IO) is
> less error-prone, less resource-intensive, more consistent with
> existing practice, and equally flexible, especially if we start
> supporting destructive cross-process memory operations, which may be
> useful for things like checkpointing and optimizing process startup.

I have a strong feeling you are over optimizing here. We are talking
about a pro-active memory management and so far I haven't heard any
usecase where all this would happen in the fast path. There are
essentially two usecases I have heard so far. Age/Reclaim the whole
process (with anon/fs preferency) and do the same on a particular
and well specified range (e.g. a garbage collector or an inactive large
image in browsert etc...). The former doesn't really care about parallel
address range manipulations because it can tolerate them. The later is a
completely different story.

Are there any others where saving few ms matter so much?

> Besides: process_vm_readv and process_vm_writev already work on
> address ranges. Why should other cross-process memory APIs use a very
> different model for naming memory regions?

I would consider those APIs not a great example. They are racy on
more levels (pid reuse and address space modification), and require a
non-trivial synchronization. Do you want something similar for madvise
on a non-cooperating remote application?
 
> > be used to revalidate when the operation is requested and fail if
> > something has changed. Moreover we already do have a fd based madvise
> > syscall so there shouldn't be really a large need to add a new set of
> > syscalls.
> 
> We have various system calls that provide hints for open files, but
> the memory operations are distinct. Modeling anonymous memory as a
> kind of file-backed memory for purposes of VMA manipulation would also
> be a departure from existing practice. Can you help me understand why
> you seem to favor the FD-per-VMA approach so heavily? I don't see any
> arguments *for* an FD-per-VMA model for remove memory manipulation and
> I see a lot of arguments against it. Is there some compelling
> advantage I'm missing?

First and foremost it provides an easy cookie to the userspace to
guarantee time-to-check-time-to-use consistency. It also naturally
extend an existing fadvise interface that achieves madvise semantic on
files. I am not really pushing hard for this particular API but I really
do care about a programming model that would be sane. If we have a
different means to achieve the same then all fine by me but so far I
haven't heard any sound arguments to invent something completely new
when we have established APIs to use. Exporting anonymous mappings via
proc the same way we do for file mappings doesn't seem to be stepping
outside of the current practice way too much.

All I am trying to say here is that process_madvise(fd, start, len) is
inherently racy API and we should focus on discussing whether this is a
sane model. And I think it would be much better to discuss that under
the respective patch which introduces that API rather than here.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28 11:44                                 ` Minchan Kim
@ 2019-05-28 11:51                                   ` Daniel Colascione
  2019-05-28 12:06                                   ` Michal Hocko
  1 sibling, 0 replies; 138+ messages in thread
From: Daniel Colascione @ 2019-05-28 11:51 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Michal Hocko, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
	Sonny Rao, Brian Geffon, Linux API

On Tue, May 28, 2019 at 4:44 AM Minchan Kim <minchan@kernel.org> wrote:
>
> On Tue, May 28, 2019 at 01:28:40PM +0200, Michal Hocko wrote:
> > On Tue 28-05-19 20:12:08, Minchan Kim wrote:
> > > On Tue, May 28, 2019 at 12:41:17PM +0200, Michal Hocko wrote:
> > > > On Tue 28-05-19 19:32:56, Minchan Kim wrote:
> > > > > On Tue, May 28, 2019 at 11:08:21AM +0200, Michal Hocko wrote:
> > > > > > On Tue 28-05-19 17:49:27, Minchan Kim wrote:
> > > > > > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote:
> > > > > > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote:
> > > > > > > > > if we went with the per vma fd approach then you would get this
> > > > > > > > > > feature automatically because map_files would refer to file backed
> > > > > > > > > > mappings while map_anon could refer only to anonymous mappings.
> > > > > > > > >
> > > > > > > > > The reason to add such filter option is to avoid the parsing overhead
> > > > > > > > > so map_anon wouldn't be helpful.
> > > > > > > >
> > > > > > > > Without chiming on whether the filter option is a good idea, I'd like
> > > > > > > > to suggest that providing an efficient binary interfaces for pulling
> > > > > > > > memory map information out of processes.  Some single-system-call
> > > > > > > > method for retrieving a binary snapshot of a process's address space
> > > > > > > > complete with attributes (selectable, like statx?) for each VMA would
> > > > > > > > reduce complexity and increase performance in a variety of areas,
> > > > > > > > e.g., Android memory map debugging commands.
> > > > > > >
> > > > > > > I agree it's the best we can get *generally*.
> > > > > > > Michal, any opinion?
> > > > > >
> > > > > > I am not really sure this is directly related. I think the primary
> > > > > > question that we have to sort out first is whether we want to have
> > > > > > the remote madvise call process or vma fd based. This is an important
> > > > > > distinction wrt. usability. I have only seen pid vs. pidfd discussions
> > > > > > so far unfortunately.
> > > > >
> > > > > With current usecase, it's per-process API with distinguishable anon/file
> > > > > but thought it could be easily extended later for each address range
> > > > > operation as userspace getting smarter with more information.
> > > >
> > > > Never design user API based on a single usecase, please. The "easily
> > > > extended" part is by far not clear to me TBH. As I've already mentioned
> > > > several times, the synchronization model has to be thought through
> > > > carefuly before a remote process address range operation can be
> > > > implemented.
> > >
> > > I agree with you that we shouldn't design API on single usecase but what
> > > you are concerning is actually not our usecase because we are resilient
> > > with the race since MADV_COLD|PAGEOUT is not destruptive.
> > > Actually, many hints are already racy in that the upcoming pattern would
> > > be different with the behavior you thought at the moment.
> >
> > How come they are racy wrt address ranges? You would have to be in
> > multithreaded environment and then the onus of synchronization is on
> > threads. That model is quite clear. But we are talking about separate
>
> Think about MADV_FREE. Allocator would think the chunk is worth to mark
> "freeable" but soon, user of the allocator asked the chunk - ie, it's not
> freeable any longer once user start to use it.
>
> My point is that kinds of *hints* are always racy so any synchronization
> couldn't help a lot. That's why I want to restrict hints process_madvise
> supports as such kinds of non-destruptive one at next respin.

I think it's more natural for process_madvise to be a superset of
regular madvise. What's the harm? There are no security implications,
since anyone who could process_madvise could just ptrace anyway. I
also don't think limiting the hinting to non-destructive operations
guarantees safety (in a broad sense) either, since operating on the
wrong memory range can still cause unexpected system performance
issues even if there's no data loss.

More broadly, what I want to see is a family of process_* APIs that
provide a superset of the functionality that the existing intraprocess
APIs provide. I think this approach is elegant and generalizes easily.
I'm worried about prematurely limiting the interprocess memory APIs
and creating limitations that will last a long time in order to avoid
having to consider issues like VMA synchronization.


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28 11:42                                 ` Daniel Colascione
@ 2019-05-28 11:56                                   ` Michal Hocko
  2019-05-28 12:18                                     ` Daniel Colascione
  2019-05-28 12:10                                   ` Minchan Kim
  1 sibling, 1 reply; 138+ messages in thread
From: Michal Hocko @ 2019-05-28 11:56 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
	Sonny Rao, Brian Geffon, Linux API

On Tue 28-05-19 04:42:47, Daniel Colascione wrote:
> On Tue, May 28, 2019 at 4:28 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Tue 28-05-19 20:12:08, Minchan Kim wrote:
> > > On Tue, May 28, 2019 at 12:41:17PM +0200, Michal Hocko wrote:
> > > > On Tue 28-05-19 19:32:56, Minchan Kim wrote:
> > > > > On Tue, May 28, 2019 at 11:08:21AM +0200, Michal Hocko wrote:
> > > > > > On Tue 28-05-19 17:49:27, Minchan Kim wrote:
> > > > > > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote:
> > > > > > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote:
> > > > > > > > > if we went with the per vma fd approach then you would get this
> > > > > > > > > > feature automatically because map_files would refer to file backed
> > > > > > > > > > mappings while map_anon could refer only to anonymous mappings.
> > > > > > > > >
> > > > > > > > > The reason to add such filter option is to avoid the parsing overhead
> > > > > > > > > so map_anon wouldn't be helpful.
> > > > > > > >
> > > > > > > > Without chiming on whether the filter option is a good idea, I'd like
> > > > > > > > to suggest that providing an efficient binary interfaces for pulling
> > > > > > > > memory map information out of processes.  Some single-system-call
> > > > > > > > method for retrieving a binary snapshot of a process's address space
> > > > > > > > complete with attributes (selectable, like statx?) for each VMA would
> > > > > > > > reduce complexity and increase performance in a variety of areas,
> > > > > > > > e.g., Android memory map debugging commands.
> > > > > > >
> > > > > > > I agree it's the best we can get *generally*.
> > > > > > > Michal, any opinion?
> > > > > >
> > > > > > I am not really sure this is directly related. I think the primary
> > > > > > question that we have to sort out first is whether we want to have
> > > > > > the remote madvise call process or vma fd based. This is an important
> > > > > > distinction wrt. usability. I have only seen pid vs. pidfd discussions
> > > > > > so far unfortunately.
> > > > >
> > > > > With current usecase, it's per-process API with distinguishable anon/file
> > > > > but thought it could be easily extended later for each address range
> > > > > operation as userspace getting smarter with more information.
> > > >
> > > > Never design user API based on a single usecase, please. The "easily
> > > > extended" part is by far not clear to me TBH. As I've already mentioned
> > > > several times, the synchronization model has to be thought through
> > > > carefuly before a remote process address range operation can be
> > > > implemented.
> > >
> > > I agree with you that we shouldn't design API on single usecase but what
> > > you are concerning is actually not our usecase because we are resilient
> > > with the race since MADV_COLD|PAGEOUT is not destruptive.
> > > Actually, many hints are already racy in that the upcoming pattern would
> > > be different with the behavior you thought at the moment.
> >
> > How come they are racy wrt address ranges? You would have to be in
> > multithreaded environment and then the onus of synchronization is on
> > threads. That model is quite clear. But we are talking about separate
> > processes and some of them might be even not aware of an external entity
> > tweaking their address space.
> 
> I don't think the difference between a thread and a process matters in
> this context. Threads race on address space operations all the time
> --- in the sense that multiple threads modify a process's address
> space without synchronization.

I would disagree. They do have in-kernel synchronization as long as they
do not use MAP_FIXED. If they do want to use MAP_FIXED then they better
synchronize or the result is undefined.

> The main reasons that these races
> hasn't been a problem are: 1) threads mostly "mind their own business"
> and modify different parts of the address space or use locks to ensure
> that they don't stop on each other (e.g., the malloc heap lock), and
> 2) POSIX mmap atomic-replacement semantics make certain classes of
> operation (like "magic ring buffer" setup) safe even in the presence
> of other threads stomping over an address space.

Agreed here.

[...]

> From a synchronization point
> of view, it doesn't really matter whether it's a thread within the
> target process or a thread outside the target process that does the
> address space manipulation. What's new is the inspection of the
> address space before performing an operation.

The fundamental difference is that if you want to achieve the same
inside the process then your application is inherenly aware of the
operation and use whatever synchronization is needed to achieve a
consistency. As soon as you allow the same from outside you either
have to have an aware target application as well or you need a mechanism
to find out that your decision has been invalidated by a later
unsynchronized action.

> Minchan started this thread by proposing some flags that would
> implement a few of the filtering policies I used as examples above.
> Personally, instead of providing a few pre-built policies as flags,
> I'd rather push the page manipulation policy to userspace as much as
> possible and just have the kernel provide a mechanism that *in
> general* makes these read-decide-modify operations efficient and
> robust. I still think there's way to achieve this goal very
> inexpensively without compromising on flexibility.

Agreed here.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28 11:44                                 ` Minchan Kim
  2019-05-28 11:51                                   ` Daniel Colascione
@ 2019-05-28 12:06                                   ` Michal Hocko
  2019-05-28 12:22                                     ` Minchan Kim
  1 sibling, 1 reply; 138+ messages in thread
From: Michal Hocko @ 2019-05-28 12:06 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Daniel Colascione, Andrew Morton, LKML, linux-mm,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Shakeel Butt, Sonny Rao, Brian Geffon, Linux API

On Tue 28-05-19 20:44:36, Minchan Kim wrote:
> On Tue, May 28, 2019 at 01:28:40PM +0200, Michal Hocko wrote:
> > On Tue 28-05-19 20:12:08, Minchan Kim wrote:
> > > On Tue, May 28, 2019 at 12:41:17PM +0200, Michal Hocko wrote:
> > > > On Tue 28-05-19 19:32:56, Minchan Kim wrote:
> > > > > On Tue, May 28, 2019 at 11:08:21AM +0200, Michal Hocko wrote:
> > > > > > On Tue 28-05-19 17:49:27, Minchan Kim wrote:
> > > > > > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote:
> > > > > > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote:
> > > > > > > > > if we went with the per vma fd approach then you would get this
> > > > > > > > > > feature automatically because map_files would refer to file backed
> > > > > > > > > > mappings while map_anon could refer only to anonymous mappings.
> > > > > > > > >
> > > > > > > > > The reason to add such filter option is to avoid the parsing overhead
> > > > > > > > > so map_anon wouldn't be helpful.
> > > > > > > > 
> > > > > > > > Without chiming on whether the filter option is a good idea, I'd like
> > > > > > > > to suggest that providing an efficient binary interfaces for pulling
> > > > > > > > memory map information out of processes.  Some single-system-call
> > > > > > > > method for retrieving a binary snapshot of a process's address space
> > > > > > > > complete with attributes (selectable, like statx?) for each VMA would
> > > > > > > > reduce complexity and increase performance in a variety of areas,
> > > > > > > > e.g., Android memory map debugging commands.
> > > > > > > 
> > > > > > > I agree it's the best we can get *generally*.
> > > > > > > Michal, any opinion?
> > > > > > 
> > > > > > I am not really sure this is directly related. I think the primary
> > > > > > question that we have to sort out first is whether we want to have
> > > > > > the remote madvise call process or vma fd based. This is an important
> > > > > > distinction wrt. usability. I have only seen pid vs. pidfd discussions
> > > > > > so far unfortunately.
> > > > > 
> > > > > With current usecase, it's per-process API with distinguishable anon/file
> > > > > but thought it could be easily extended later for each address range
> > > > > operation as userspace getting smarter with more information.
> > > > 
> > > > Never design user API based on a single usecase, please. The "easily
> > > > extended" part is by far not clear to me TBH. As I've already mentioned
> > > > several times, the synchronization model has to be thought through
> > > > carefuly before a remote process address range operation can be
> > > > implemented.
> > > 
> > > I agree with you that we shouldn't design API on single usecase but what
> > > you are concerning is actually not our usecase because we are resilient
> > > with the race since MADV_COLD|PAGEOUT is not destruptive.
> > > Actually, many hints are already racy in that the upcoming pattern would
> > > be different with the behavior you thought at the moment.
> > 
> > How come they are racy wrt address ranges? You would have to be in
> > multithreaded environment and then the onus of synchronization is on
> > threads. That model is quite clear. But we are talking about separate
> 
> Think about MADV_FREE. Allocator would think the chunk is worth to mark
> "freeable" but soon, user of the allocator asked the chunk - ie, it's not
> freeable any longer once user start to use it.

That is not a race in the address space, right. The underlying object
hasn't changed. It has been declared as freeable and since that moment
nobody can rely on the content because it might have been discarded.
Or put simply, the content is undefined. It is responsibility of the
madvise caller to make sure that the object is not in active use while
it is marking it.

> My point is that kinds of *hints* are always racy so any synchronization
> couldn't help a lot. That's why I want to restrict hints process_madvise
> supports as such kinds of non-destruptive one at next respin.

I agree that a non-destructive operations are safer against paralel
modifications because you just get a annoying and unexpected latency at
worst case. But we should discuss whether this assumption is sufficient
for further development. I am pretty sure once we open remote madvise
people will find usecases for destructive operations or even new madvise
modes we haven't heard of. What then?

> > processes and some of them might be even not aware of an external entity
> > tweaking their address space.
> > 
> > > If you are still concerning of address range synchronization, how about
> > > moving such hints to per-process level like prctl?
> > > Does it make sense to you?
> > 
> > No it doesn't. How is prctl any relevant to any address range
> > operations.
> 
> "whether we want to have the remote madvise call process or vma fd based."

Still not following. So you want to have a prctl (one of the worst API
we have along with ioctl) to tell the semantic? This sounds like a
terrible idea to me.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28 11:42                                 ` Daniel Colascione
  2019-05-28 11:56                                   ` Michal Hocko
@ 2019-05-28 12:10                                   ` Minchan Kim
  1 sibling, 0 replies; 138+ messages in thread
From: Minchan Kim @ 2019-05-28 12:10 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Michal Hocko, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
	Sonny Rao, Brian Geffon, Linux API

On Tue, May 28, 2019 at 04:42:47AM -0700, Daniel Colascione wrote:
> On Tue, May 28, 2019 at 4:28 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Tue 28-05-19 20:12:08, Minchan Kim wrote:
> > > On Tue, May 28, 2019 at 12:41:17PM +0200, Michal Hocko wrote:
> > > > On Tue 28-05-19 19:32:56, Minchan Kim wrote:
> > > > > On Tue, May 28, 2019 at 11:08:21AM +0200, Michal Hocko wrote:
> > > > > > On Tue 28-05-19 17:49:27, Minchan Kim wrote:
> > > > > > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote:
> > > > > > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote:
> > > > > > > > > if we went with the per vma fd approach then you would get this
> > > > > > > > > > feature automatically because map_files would refer to file backed
> > > > > > > > > > mappings while map_anon could refer only to anonymous mappings.
> > > > > > > > >
> > > > > > > > > The reason to add such filter option is to avoid the parsing overhead
> > > > > > > > > so map_anon wouldn't be helpful.
> > > > > > > >
> > > > > > > > Without chiming on whether the filter option is a good idea, I'd like
> > > > > > > > to suggest that providing an efficient binary interfaces for pulling
> > > > > > > > memory map information out of processes.  Some single-system-call
> > > > > > > > method for retrieving a binary snapshot of a process's address space
> > > > > > > > complete with attributes (selectable, like statx?) for each VMA would
> > > > > > > > reduce complexity and increase performance in a variety of areas,
> > > > > > > > e.g., Android memory map debugging commands.
> > > > > > >
> > > > > > > I agree it's the best we can get *generally*.
> > > > > > > Michal, any opinion?
> > > > > >
> > > > > > I am not really sure this is directly related. I think the primary
> > > > > > question that we have to sort out first is whether we want to have
> > > > > > the remote madvise call process or vma fd based. This is an important
> > > > > > distinction wrt. usability. I have only seen pid vs. pidfd discussions
> > > > > > so far unfortunately.
> > > > >
> > > > > With current usecase, it's per-process API with distinguishable anon/file
> > > > > but thought it could be easily extended later for each address range
> > > > > operation as userspace getting smarter with more information.
> > > >
> > > > Never design user API based on a single usecase, please. The "easily
> > > > extended" part is by far not clear to me TBH. As I've already mentioned
> > > > several times, the synchronization model has to be thought through
> > > > carefuly before a remote process address range operation can be
> > > > implemented.
> > >
> > > I agree with you that we shouldn't design API on single usecase but what
> > > you are concerning is actually not our usecase because we are resilient
> > > with the race since MADV_COLD|PAGEOUT is not destruptive.
> > > Actually, many hints are already racy in that the upcoming pattern would
> > > be different with the behavior you thought at the moment.
> >
> > How come they are racy wrt address ranges? You would have to be in
> > multithreaded environment and then the onus of synchronization is on
> > threads. That model is quite clear. But we are talking about separate
> > processes and some of them might be even not aware of an external entity
> > tweaking their address space.
> 
> I don't think the difference between a thread and a process matters in
> this context. Threads race on address space operations all the time
> --- in the sense that multiple threads modify a process's address
> space without synchronization. The main reasons that these races
> hasn't been a problem are: 1) threads mostly "mind their own business"
> and modify different parts of the address space or use locks to ensure
> that they don't stop on each other (e.g., the malloc heap lock), and
> 2) POSIX mmap atomic-replacement semantics make certain classes of
> operation (like "magic ring buffer" setup) safe even in the presence
> of other threads stomping over an address space.
> 
> The thing that's new in this discussion from a synchronization point
> of view isn't that the VM operation we're talking about is coming from
> outside the process, but that we want to do a read-decide-modify-ish
> thing. We want to affect (using various hints) classes of pages like
> "all file pages" or "all anonymous pages" or "some pages referring to
> graphics buffers up to 100MB" (to pick an example off the top of my
> head of a policy that might make sense). From a synchronization point
> of view, it doesn't really matter whether it's a thread within the
> target process or a thread outside the target process that does the
> address space manipulation. What's new is the inspection of the
> address space before performing an operation.
> 
> Minchan started this thread by proposing some flags that would
> implement a few of the filtering policies I used as examples above.
> Personally, instead of providing a few pre-built policies as flags,
> I'd rather push the page manipulation policy to userspace as much as
> possible and just have the kernel provide a mechanism that *in
> general* makes these read-decide-modify operations efficient and
> robust. I still think there's way to achieve this goal very
> inexpensively without compromising on flexibility.

I'm looking forward to seeing the way. ;-)


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28 11:49                               ` Michal Hocko
@ 2019-05-28 12:11                                 ` Daniel Colascione
  2019-05-28 12:32                                   ` Michal Hocko
  0 siblings, 1 reply; 138+ messages in thread
From: Daniel Colascione @ 2019-05-28 12:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
	Sonny Rao, Brian Geffon, Linux API

On Tue, May 28, 2019 at 4:49 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 28-05-19 04:21:44, Daniel Colascione wrote:
> > On Tue, May 28, 2019 at 3:33 AM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Tue 28-05-19 02:39:03, Daniel Colascione wrote:
> > > > On Tue, May 28, 2019 at 2:08 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > > >
> > > > > On Tue 28-05-19 17:49:27, Minchan Kim wrote:
> > > > > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote:
> > > > > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote:
> > > > > > > > if we went with the per vma fd approach then you would get this
> > > > > > > > > feature automatically because map_files would refer to file backed
> > > > > > > > > mappings while map_anon could refer only to anonymous mappings.
> > > > > > > >
> > > > > > > > The reason to add such filter option is to avoid the parsing overhead
> > > > > > > > so map_anon wouldn't be helpful.
> > > > > > >
> > > > > > > Without chiming on whether the filter option is a good idea, I'd like
> > > > > > > to suggest that providing an efficient binary interfaces for pulling
> > > > > > > memory map information out of processes.  Some single-system-call
> > > > > > > method for retrieving a binary snapshot of a process's address space
> > > > > > > complete with attributes (selectable, like statx?) for each VMA would
> > > > > > > reduce complexity and increase performance in a variety of areas,
> > > > > > > e.g., Android memory map debugging commands.
> > > > > >
> > > > > > I agree it's the best we can get *generally*.
> > > > > > Michal, any opinion?
> > > > >
> > > > > I am not really sure this is directly related. I think the primary
> > > > > question that we have to sort out first is whether we want to have
> > > > > the remote madvise call process or vma fd based. This is an important
> > > > > distinction wrt. usability. I have only seen pid vs. pidfd discussions
> > > > > so far unfortunately.
> > > >
> > > > I don't think the vma fd approach is viable. We have some processes
> > > > with a *lot* of VMAs --- system_server had 4204 when I checked just
> > > > now (and that's typical) --- and an FD operation per VMA would be
> > > > excessive.
> > >
> > > What do you mean by excessive here? Do you expect the process to have
> > > them open all at once?
> >
> > Minchan's already done timing. More broadly, in an era with various
> > speculative execution mitigations, making a system call is pretty
> > expensive.
>
> This is a completely separate discussion. This could be argued about
> many other syscalls.

Yes, it can be. That's why we have scatter-gather IO system calls in
the first place.

> Let's make the semantic correct first before we
> even start thinking about mutliplexing. It is easier to multiplex on an
> existing and sane interface.

I don't think of it as "multiplexing" yet, not in the fundamental unit
of operation is the address range.

> Btw. Minchan concluded that multiplexing is not really all that
> important based on his numbers http://lkml.kernel.org/r/20190527074940.GB6879@google.com
>
> [...]
>
> > > Is this really too much different from /proc/<pid>/map_files?
> >
> > It's very different. See below.
> >
> > > > > An interface to query address range information is a separate but
> > > > > although a related topic. We have /proc/<pid>/[s]maps for that right
> > > > > now and I understand it is not a general win for all usecases because
> > > > > it tends to be slow for some. I can see how /proc/<pid>/map_anons could
> > > > > provide per vma information in a binary form via a fd based interface.
> > > > > But I would rather not conflate those two discussions much - well except
> > > > > if it could give one of the approaches more justification but let's
> > > > > focus on the madvise part first.
> > > >
> > > > I don't think it's a good idea to focus on one feature in a
> > > > multi-feature change when the interactions between features can be
> > > > very important for overall design of the multi-feature system and the
> > > > design of each feature.
> > > >
> > > > Here's my thinking on the high-level design:
> > > >
> > > > I'm imagining an address-range system that would work like this: we'd
> > > > create some kind of process_vm_getinfo(2) system call [1] that would
> > > > accept a statx-like attribute map and a pid/fd parameter as input and
> > > > return, on output, two things: 1) an array [2] of VMA descriptors
> > > > containing the requested information, and 2) a VMA configuration
> > > > sequence number. We'd then have process_madvise() and other
> > > > cross-process VM interfaces accept both address ranges and this
> > > > sequence number; they'd succeed only if the VMA configuration sequence
> > > > number is still current, i.e., the target process hasn't changed its
> > > > VMA configuration (implicitly or explicitly) since the call to
> > > > process_vm_getinfo().
> > >
> > > The sequence number is essentially a cookie that is transparent to the
> > > userspace right? If yes, how does it differ from a fd (returned from
> > > /proc/<pid>/map_{anons,files}/range) which is a cookie itself and it can
> >
> > If you want to operate on N VMAs simultaneously under an FD-per-VMA
> > model, you'd need to have those N FDs all open at the same time *and*
> > add some kind of system call that accepted those N FDs and an
> > operation to perform. The sequence number I'm proposing also applies
> > to the whole address space, not just one VMA. Even if you did have
> > these N FDs open all at once and supplied them all to some batch
> > operation, you couldn't guarantee via the FD mechanism that some *new*
> > VMA didn't appear in the address range you want to manipulate. A
> > global sequence number would catch this case. I still think supplying
> > a list of address ranges (like we already do for scatter-gather IO) is
> > less error-prone, less resource-intensive, more consistent with
> > existing practice, and equally flexible, especially if we start
> > supporting destructive cross-process memory operations, which may be
> > useful for things like checkpointing and optimizing process startup.
>
> I have a strong feeling you are over optimizing here. We are talking
> about a pro-active memory management and so far I haven't heard any
> usecase where all this would happen in the fast path. There are
> essentially two usecases I have heard so far. Age/Reclaim the whole
> process (with anon/fs preferency) and do the same on a particular
> and well specified range (e.g. a garbage collector or an inactive large
> image in browsert etc...). The former doesn't really care about parallel
> address range manipulations because it can tolerate them. The later is a
> completely different story.
>
> Are there any others where saving few ms matter so much?

Saving ms matters quite a bit. We may want to perform some of this
eager memory management in response to user activity, e.g.,
application switch, and even if that work isn't quite synchronous,
every cycle the system spends on management overhead is a cycle it
can't spend on rendering frames. Overhead means jank. Additionally,
we're on battery-operated devices. Milliseconds of CPU overhead
accumulated over a long time is a real energy sink.

> > Besides: process_vm_readv and process_vm_writev already work on
> > address ranges. Why should other cross-process memory APIs use a very
> > different model for naming memory regions?
>
> I would consider those APIs not a great example. They are racy on
> more levels (pid reuse and address space modification), and require a
> non-trivial synchronization. Do you want something similar for madvise
> on a non-cooperating remote application?
>
> > > be used to revalidate when the operation is requested and fail if
> > > something has changed. Moreover we already do have a fd based madvise
> > > syscall so there shouldn't be really a large need to add a new set of
> > > syscalls.
> >
> > We have various system calls that provide hints for open files, but
> > the memory operations are distinct. Modeling anonymous memory as a
> > kind of file-backed memory for purposes of VMA manipulation would also
> > be a departure from existing practice. Can you help me understand why
> > you seem to favor the FD-per-VMA approach so heavily? I don't see any
> > arguments *for* an FD-per-VMA model for remove memory manipulation and
> > I see a lot of arguments against it. Is there some compelling
> > advantage I'm missing?
>
> First and foremost it provides an easy cookie to the userspace to
> guarantee time-to-check-time-to-use consistency.

But only for one VMA at a time.

> It also naturally
> extend an existing fadvise interface that achieves madvise semantic on
> files.

There are lots of things that madvise can do that fadvise can't and
that don't even really make sense for fadvise, e.g., MADV_FREE. It
seems odd to me to duplicate much of the madvise interface into
fadvise so that we can use file APIs to give madvise hints. It seems
simpler to me to just provide a mechanism to put the madvise hints
where they're needed.

> I am not really pushing hard for this particular API but I really
> do care about a programming model that would be sane.

You've used "sane" twice so far in this message. Can you specify more
precisely what you mean by that word? I agree that there needs to be
some defense against TOCTOU races when doing remote memory management,
but I don't think providing this robustness via a file descriptor is
any more sane than alternative approaches. A file descriptor comes
with a lot of other features --- e.g., SCM_RIGHTS, fstat, and a
concept of owning a resource --- that aren't needed to achieve
robustness.

Normally, a file descriptor refers to some resource that the kernel
holds as long as the file descriptor (well, the open file description
or struct file) lives -- things like graphics buffers, files, and
sockets. If we're using an FD *just* as a cookie and not a resource,
I'd rather just expose the cookie directly.

> If we have a
> different means to achieve the same then all fine by me but so far I
> haven't heard any sound arguments to invent something completely new
> when we have established APIs to use.

Doesn't the next sentence describe something profoundly new? :-)

> Exporting anonymous mappings via
> proc the same way we do for file mappings doesn't seem to be stepping
> outside of the current practice way too much.

It seems like a radical departure from existing practice to provide
filesystem interfaces to anonymous memory regions, e.g., anon_vma.
You've never been able to refer to those memory regions with file
descriptors.

All I'm suggesting is that we take the existing madvise mechanism,
make it work cross-process, and make it robust against TOCTOU
problems, all one step at a time. Maybe my sense of API "size" is
miscalibrated, but adding a new type of FD to refer to anonymous VMA
regions feels like a bigger departure and so requires stronger
justification, especially if the result of the FD approach is probably
something less efficient than a cookie-based one.

> and we should focus on discussing whether this is a
> sane model. And I think it would be much better to discuss that under
> the respective patch which introduces that API rather than here.

I think it's important to discuss what that API should look like. :-)


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28 11:56                                   ` Michal Hocko
@ 2019-05-28 12:18                                     ` Daniel Colascione
  2019-05-28 12:38                                       ` Michal Hocko
  0 siblings, 1 reply; 138+ messages in thread
From: Daniel Colascione @ 2019-05-28 12:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
	Sonny Rao, Brian Geffon, Linux API

On Tue, May 28, 2019 at 4:56 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 28-05-19 04:42:47, Daniel Colascione wrote:
> > On Tue, May 28, 2019 at 4:28 AM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Tue 28-05-19 20:12:08, Minchan Kim wrote:
> > > > On Tue, May 28, 2019 at 12:41:17PM +0200, Michal Hocko wrote:
> > > > > On Tue 28-05-19 19:32:56, Minchan Kim wrote:
> > > > > > On Tue, May 28, 2019 at 11:08:21AM +0200, Michal Hocko wrote:
> > > > > > > On Tue 28-05-19 17:49:27, Minchan Kim wrote:
> > > > > > > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote:
> > > > > > > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote:
> > > > > > > > > > if we went with the per vma fd approach then you would get this
> > > > > > > > > > > feature automatically because map_files would refer to file backed
> > > > > > > > > > > mappings while map_anon could refer only to anonymous mappings.
> > > > > > > > > >
> > > > > > > > > > The reason to add such filter option is to avoid the parsing overhead
> > > > > > > > > > so map_anon wouldn't be helpful.
> > > > > > > > >
> > > > > > > > > Without chiming on whether the filter option is a good idea, I'd like
> > > > > > > > > to suggest that providing an efficient binary interfaces for pulling
> > > > > > > > > memory map information out of processes.  Some single-system-call
> > > > > > > > > method for retrieving a binary snapshot of a process's address space
> > > > > > > > > complete with attributes (selectable, like statx?) for each VMA would
> > > > > > > > > reduce complexity and increase performance in a variety of areas,
> > > > > > > > > e.g., Android memory map debugging commands.
> > > > > > > >
> > > > > > > > I agree it's the best we can get *generally*.
> > > > > > > > Michal, any opinion?
> > > > > > >
> > > > > > > I am not really sure this is directly related. I think the primary
> > > > > > > question that we have to sort out first is whether we want to have
> > > > > > > the remote madvise call process or vma fd based. This is an important
> > > > > > > distinction wrt. usability. I have only seen pid vs. pidfd discussions
> > > > > > > so far unfortunately.
> > > > > >
> > > > > > With current usecase, it's per-process API with distinguishable anon/file
> > > > > > but thought it could be easily extended later for each address range
> > > > > > operation as userspace getting smarter with more information.
> > > > >
> > > > > Never design user API based on a single usecase, please. The "easily
> > > > > extended" part is by far not clear to me TBH. As I've already mentioned
> > > > > several times, the synchronization model has to be thought through
> > > > > carefuly before a remote process address range operation can be
> > > > > implemented.
> > > >
> > > > I agree with you that we shouldn't design API on single usecase but what
> > > > you are concerning is actually not our usecase because we are resilient
> > > > with the race since MADV_COLD|PAGEOUT is not destruptive.
> > > > Actually, many hints are already racy in that the upcoming pattern would
> > > > be different with the behavior you thought at the moment.
> > >
> > > How come they are racy wrt address ranges? You would have to be in
> > > multithreaded environment and then the onus of synchronization is on
> > > threads. That model is quite clear. But we are talking about separate
> > > processes and some of them might be even not aware of an external entity
> > > tweaking their address space.
> >
> > I don't think the difference between a thread and a process matters in
> > this context. Threads race on address space operations all the time
> > --- in the sense that multiple threads modify a process's address
> > space without synchronization.
>
> I would disagree. They do have in-kernel synchronization as long as they
> do not use MAP_FIXED. If they do want to use MAP_FIXED then they better
> synchronize or the result is undefined.

Right. It's because the kernel hands off different regions to
different non-MAP_FIXED mmap callers that it's pretty easy for threads
to mind their own business, but they're all still using the same
address space.

> > From a synchronization point
> > of view, it doesn't really matter whether it's a thread within the
> > target process or a thread outside the target process that does the
> > address space manipulation. What's new is the inspection of the
> > address space before performing an operation.
>
> The fundamental difference is that if you want to achieve the same
> inside the process then your application is inherenly aware of the
> operation and use whatever synchronization is needed to achieve a
> consistency. As soon as you allow the same from outside you either
> have to have an aware target application as well or you need a mechanism
> to find out that your decision has been invalidated by a later
> unsynchronized action.

I thought of this objection immediately after I hit send. :-)

I still don't think the intra- vs inter-process difference matters.
It's true that threads can synchronize with each other, but different
processes can synchronize with each other too. I mean, you *could* use
sem_open(3) for your heap lock and open the semaphore from two
different processes. That's silly, but it'd work.

The important requirement, I think, is that we need to support
managing "memory-naive" uncooperative tasks (perhaps legacy ones
written before cross-process memory management even became possible),
and I think that the cooperative-vs-uncooperative distinction matters
a lot more than the tgid of the thread doing the memory manipulation.
(Although in our case, we really do need a separate tgid. :-))


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28 12:06                                   ` Michal Hocko
@ 2019-05-28 12:22                                     ` Minchan Kim
  0 siblings, 0 replies; 138+ messages in thread
From: Minchan Kim @ 2019-05-28 12:22 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Daniel Colascione, Andrew Morton, LKML, linux-mm,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Shakeel Butt, Sonny Rao, Brian Geffon, Linux API

On Tue, May 28, 2019 at 02:06:14PM +0200, Michal Hocko wrote:
> On Tue 28-05-19 20:44:36, Minchan Kim wrote:
> > On Tue, May 28, 2019 at 01:28:40PM +0200, Michal Hocko wrote:
> > > On Tue 28-05-19 20:12:08, Minchan Kim wrote:
> > > > On Tue, May 28, 2019 at 12:41:17PM +0200, Michal Hocko wrote:
> > > > > On Tue 28-05-19 19:32:56, Minchan Kim wrote:
> > > > > > On Tue, May 28, 2019 at 11:08:21AM +0200, Michal Hocko wrote:
> > > > > > > On Tue 28-05-19 17:49:27, Minchan Kim wrote:
> > > > > > > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote:
> > > > > > > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote:
> > > > > > > > > > if we went with the per vma fd approach then you would get this
> > > > > > > > > > > feature automatically because map_files would refer to file backed
> > > > > > > > > > > mappings while map_anon could refer only to anonymous mappings.
> > > > > > > > > >
> > > > > > > > > > The reason to add such filter option is to avoid the parsing overhead
> > > > > > > > > > so map_anon wouldn't be helpful.
> > > > > > > > > 
> > > > > > > > > Without chiming on whether the filter option is a good idea, I'd like
> > > > > > > > > to suggest that providing an efficient binary interfaces for pulling
> > > > > > > > > memory map information out of processes.  Some single-system-call
> > > > > > > > > method for retrieving a binary snapshot of a process's address space
> > > > > > > > > complete with attributes (selectable, like statx?) for each VMA would
> > > > > > > > > reduce complexity and increase performance in a variety of areas,
> > > > > > > > > e.g., Android memory map debugging commands.
> > > > > > > > 
> > > > > > > > I agree it's the best we can get *generally*.
> > > > > > > > Michal, any opinion?
> > > > > > > 
> > > > > > > I am not really sure this is directly related. I think the primary
> > > > > > > question that we have to sort out first is whether we want to have
> > > > > > > the remote madvise call process or vma fd based. This is an important
> > > > > > > distinction wrt. usability. I have only seen pid vs. pidfd discussions
> > > > > > > so far unfortunately.
> > > > > > 
> > > > > > With current usecase, it's per-process API with distinguishable anon/file
> > > > > > but thought it could be easily extended later for each address range
> > > > > > operation as userspace getting smarter with more information.
> > > > > 
> > > > > Never design user API based on a single usecase, please. The "easily
> > > > > extended" part is by far not clear to me TBH. As I've already mentioned
> > > > > several times, the synchronization model has to be thought through
> > > > > carefuly before a remote process address range operation can be
> > > > > implemented.
> > > > 
> > > > I agree with you that we shouldn't design API on single usecase but what
> > > > you are concerning is actually not our usecase because we are resilient
> > > > with the race since MADV_COLD|PAGEOUT is not destruptive.
> > > > Actually, many hints are already racy in that the upcoming pattern would
> > > > be different with the behavior you thought at the moment.
> > > 
> > > How come they are racy wrt address ranges? You would have to be in
> > > multithreaded environment and then the onus of synchronization is on
> > > threads. That model is quite clear. But we are talking about separate
> > 
> > Think about MADV_FREE. Allocator would think the chunk is worth to mark
> > "freeable" but soon, user of the allocator asked the chunk - ie, it's not
> > freeable any longer once user start to use it.
> 
> That is not a race in the address space, right. The underlying object
> hasn't changed. It has been declared as freeable and since that moment
> nobody can rely on the content because it might have been discarded.
> Or put simply, the content is undefined. It is responsibility of the
> madvise caller to make sure that the object is not in active use while
> it is marking it.
> 
> > My point is that kinds of *hints* are always racy so any synchronization
> > couldn't help a lot. That's why I want to restrict hints process_madvise
> > supports as such kinds of non-destruptive one at next respin.
> 
> I agree that a non-destructive operations are safer against paralel
> modifications because you just get a annoying and unexpected latency at
> worst case. But we should discuss whether this assumption is sufficient
> for further development. I am pretty sure once we open remote madvise
> people will find usecases for destructive operations or even new madvise
> modes we haven't heard of. What then?

I support Daniel's vma seq number approach for the future plan.


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28 12:11                                 ` Daniel Colascione
@ 2019-05-28 12:32                                   ` Michal Hocko
  0 siblings, 0 replies; 138+ messages in thread
From: Michal Hocko @ 2019-05-28 12:32 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
	Sonny Rao, Brian Geffon, Linux API

On Tue 28-05-19 05:11:16, Daniel Colascione wrote:
> On Tue, May 28, 2019 at 4:49 AM Michal Hocko <mhocko@kernel.org> wrote:
[...]
> > > We have various system calls that provide hints for open files, but
> > > the memory operations are distinct. Modeling anonymous memory as a
> > > kind of file-backed memory for purposes of VMA manipulation would also
> > > be a departure from existing practice. Can you help me understand why
> > > you seem to favor the FD-per-VMA approach so heavily? I don't see any
> > > arguments *for* an FD-per-VMA model for remove memory manipulation and
> > > I see a lot of arguments against it. Is there some compelling
> > > advantage I'm missing?
> >
> > First and foremost it provides an easy cookie to the userspace to
> > guarantee time-to-check-time-to-use consistency.
> 
> But only for one VMA at a time.

Which is the unit we operate on, right?

> > It also naturally
> > extend an existing fadvise interface that achieves madvise semantic on
> > files.
> 
> There are lots of things that madvise can do that fadvise can't and
> that don't even really make sense for fadvise, e.g., MADV_FREE. It
> seems odd to me to duplicate much of the madvise interface into
> fadvise so that we can use file APIs to give madvise hints. It seems
> simpler to me to just provide a mechanism to put the madvise hints
> where they're needed.

I do not see why we would duplicate. I confess I haven't tried to
implement this so I might be overlooking something but it seems to me
that we could simply reuse the same functionality from both APIs.

> > I am not really pushing hard for this particular API but I really
> > do care about a programming model that would be sane.
> 
> You've used "sane" twice so far in this message. Can you specify more
> precisely what you mean by that word?

Well, I would consider a model which would prevent from unintended side
effects (e.g. working on a completely different object) without a tricky
synchronization sane.

> I agree that there needs to be
> some defense against TOCTOU races when doing remote memory management,
> but I don't think providing this robustness via a file descriptor is
> any more sane than alternative approaches. A file descriptor comes
> with a lot of other features --- e.g., SCM_RIGHTS, fstat, and a
> concept of owning a resource --- that aren't needed to achieve
> robustness.
> 
> Normally, a file descriptor refers to some resource that the kernel
> holds as long as the file descriptor (well, the open file description
> or struct file) lives -- things like graphics buffers, files, and
> sockets. If we're using an FD *just* as a cookie and not a resource,
> I'd rather just expose the cookie directly.

You are absolutely right. But doesn't that apply to any other
revalidation method that would be tracking VMA status as well. As I've
said I am not married to this approach as long as there are better
alternatives. So far we are in a discussion what should be the actual
semantic of the operation and how much do we want to tolerate races. And
it seems that we are diving into implementation details rather than
landing with a firm decision that the current proposed API is suitable
or not.

> > If we have a
> > different means to achieve the same then all fine by me but so far I
> > haven't heard any sound arguments to invent something completely new
> > when we have established APIs to use.
> 
> Doesn't the next sentence describe something profoundly new? :-)
> 
> > Exporting anonymous mappings via
> > proc the same way we do for file mappings doesn't seem to be stepping
> > outside of the current practice way too much.
> 
> It seems like a radical departure from existing practice to provide
> filesystem interfaces to anonymous memory regions, e.g., anon_vma.
> You've never been able to refer to those memory regions with file
> descriptors.
> 
> All I'm suggesting is that we take the existing madvise mechanism,
> make it work cross-process, and make it robust against TOCTOU
> problems, all one step at a time. Maybe my sense of API "size" is
> miscalibrated, but adding a new type of FD to refer to anonymous VMA
> regions feels like a bigger departure and so requires stronger
> justification, especially if the result of the FD approach is probably
> something less efficient than a cookie-based one.

Feel free to propose the way to achieve that in the respective email
thread.
 
> > and we should focus on discussing whether this is a
> > sane model. And I think it would be much better to discuss that under
> > the respective patch which introduces that API rather than here.
> 
> I think it's important to discuss what that API should look like. :-)

It will be fun to follow this discussion and make some sense of
different parallel threads.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-28 12:18                                     ` Daniel Colascione
@ 2019-05-28 12:38                                       ` Michal Hocko
  0 siblings, 0 replies; 138+ messages in thread
From: Michal Hocko @ 2019-05-28 12:38 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
	Sonny Rao, Brian Geffon, Linux API

On Tue 28-05-19 05:18:48, Daniel Colascione wrote:
[...]
> The important requirement, I think, is that we need to support
> managing "memory-naive" uncooperative tasks (perhaps legacy ones
> written before cross-process memory management even became possible),
> and I think that the cooperative-vs-uncooperative distinction matters
> a lot more than the tgid of the thread doing the memory manipulation.
> (Although in our case, we really do need a separate tgid. :-))

Agreed here and that requires some sort of revalidation and failure on
"object has changed" in one form or another IMHO.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 3/7] mm: introduce MADV_COLD
  2019-05-20  3:52 ` [RFC 3/7] mm: introduce MADV_COLD Minchan Kim
  2019-05-20  8:27   ` Michal Hocko
@ 2019-05-28 14:54   ` Hillf Danton
  2019-05-30  0:45   ` Minchan Kim
  2 siblings, 0 replies; 138+ messages in thread
From: Hillf Danton @ 2019-05-28 14:54 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Michal Hocko, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon


On Mon, 20 May 2019 12:52:50 +0900 Minchan Kim wrote:
> +unsigned long reclaim_pages(struct list_head *page_list)
> +{
> +	int nid = -1;
> +	unsigned long nr_isolated[2] = {0, };
> +	unsigned long nr_reclaimed = 0;
> +	LIST_HEAD(node_page_list);
> +	struct reclaim_stat dummy_stat;
> +	struct scan_control sc = {
> +		.gfp_mask = GFP_KERNEL,
> +		.priority = DEF_PRIORITY,
> +		.may_writepage = 1,
> +		.may_unmap = 1,
> +		.may_swap = 1,
> +	};
> +
> +	while (!list_empty(page_list)) {
> +		struct page *page;
> +
> +		page = lru_to_page(page_list);
> +		list_del(&page->lru);
> +
> +		if (nid == -1) {
> +			nid = page_to_nid(page);
> +			INIT_LIST_HEAD(&node_page_list);
> +			nr_isolated[0] = nr_isolated[1] = 0;
> +		}
> +
> +		if (nid == page_to_nid(page)) {
> +			list_add(&page->lru, &node_page_list);
> +			nr_isolated[!!page_is_file_cache(page)] +=
> +						hpage_nr_pages(page);
> +			continue;
> +		}
> +
Now, page's node != nid and any page on the node_page_list has
node == nid. 
> +		nid = page_to_nid(page);

After updating nid, we get the node id of the isolated pages lost.

> +
> +		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
> +					nr_isolated[0]);
> +		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
> +					nr_isolated[1]);
> +		nr_reclaimed += shrink_page_list(&node_page_list,
> +				NODE_DATA(nid), &sc, TTU_IGNORE_ACCESS,

And nid no longer matches the node of the pages to be shrunk.

> +				&dummy_stat, true);
> +		while (!list_empty(&node_page_list)) {
> +			struct page *page = lru_to_page(page_list);

Non-empty node_page_list will never become empty if pages are deleted
only from the page_list.
> +
> +			list_del(&page->lru);
> +			putback_lru_page(page);
> +		}
> +		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
> +					-nr_isolated[0]);
> +		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
> +					-nr_isolated[1]);
> +		nr_isolated[0] = nr_isolated[1] = 0;
> +		INIT_LIST_HEAD(&node_page_list);
> +	}
> +

BR
Hillf


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 5/7] mm: introduce external memory hinting API
  2019-05-20  3:52 ` [RFC 5/7] mm: introduce external memory hinting API Minchan Kim
                     ` (2 preceding siblings ...)
  2019-05-21 15:31   ` Oleg Nesterov
@ 2019-05-29  3:41   ` Hillf Danton
  2019-05-30  0:38   ` Minchan Kim
  4 siblings, 0 replies; 138+ messages in thread
From: Hillf Danton @ 2019-05-29  3:41 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Michal Hocko, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon


On Mon, 20 May 2019 12:52:52 +0900 Minchan Kim wrote:
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -355,6 +355,7 @@
>  425	common	io_uring_setup		__x64_sys_io_uring_setup
>  426	common	io_uring_enter		__x64_sys_io_uring_enter
>  427	common	io_uring_register	__x64_sys_io_uring_register
> +428	common	process_madvise		__x64_sys_process_madvise
>  
Much better if something similar is added for arm64.

>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -832,6 +832,8 @@ __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
>  __SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
>  #define __NR_io_uring_register 427
>  __SYSCALL(__NR_io_uring_register, sys_io_uring_register)
> +#define __NR_process_madvise 428
> +__SYSCALL(__NR_process_madvise, sys_process_madvise)
>  
>  #undef __NR_syscalls
>  #define __NR_syscalls 428

Seems __NR_syscalls needs to increment by one.

BR
Hillf


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector arrary
  2019-05-20  3:52 ` [RFC 6/7] mm: extend process_madvise syscall to support vector arrary Minchan Kim
  2019-05-20  9:22   ` Michal Hocko
@ 2019-05-29  4:14   ` Hillf Danton
  2019-05-30  0:35   ` Minchan Kim
  2 siblings, 0 replies; 138+ messages in thread
From: Hillf Danton @ 2019-05-29  4:14 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Michal Hocko, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon


On Mon, 20 May 2019 12:52:53 +0900 Minchan Kim wrote:
> Example)
> 
Better if the following stuff is stored somewhere under the
tools/testing directory.

BR
Hillf

> struct pr_madvise_param {
>         int size;
>         const struct iovec *vec;
> };
> 
> int main(int argc, char *argv[])
> {
>         struct pr_madvise_param retp, rangep;
>         struct iovec result_vec[2], range_vec[2];
>         int hints[2];
>         long ret[2];
>         void *addr[2];
> 
>         pid_t pid;
>         char cmd[64] = {0,};
>         addr[0] = mmap(NULL, ALLOC_SIZE, PROT_READ|PROT_WRITE,
>                           MAP_POPULATE|MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
> 
>         if (MAP_FAILED == addr[0])
>                 return 1;
> 
>         addr[1] = mmap(NULL, ALLOC_SIZE, PROT_READ|PROT_WRITE,
>                           MAP_POPULATE|MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
> 
>         if (MAP_FAILED == addr[1])
>                 return 1;
> 
>         hints[0] = MADV_COLD;
> 	range_vec[0].iov_base = addr[0];
>         range_vec[0].iov_len = ALLOC_SIZE;
>         result_vec[0].iov_base = &ret[0];
>         result_vec[0].iov_len = sizeof(long);
> 	retp.vec = result_vec;
>         retp.size = sizeof(struct pr_madvise_param);
> 
>         hints[1] = MADV_COOL;
>         range_vec[1].iov_base = addr[1];
>         range_vec[1].iov_len = ALLOC_SIZE;
>         result_vec[1].iov_base = &ret[1];
>         result_vec[1].iov_len = sizeof(long);
>         rangep.vec = range_vec;
>         rangep.size = sizeof(struct pr_madvise_param);
> 
>         pid = fork();
>         if (!pid) {
>                 sleep(10);
>         } else {
>                 int pidfd = open(cmd,  O_DIRECTORY | O_CLOEXEC);
>                 if (pidfd < 0)
>                         return 1;
> 
>                 /* munmap to make pages private for the child */
>                 munmap(addr[0], ALLOC_SIZE);
>                 munmap(addr[1], ALLOC_SIZE);
>                 system("cat /proc/vmstat | egrep 'pswpout|deactivate'");
>                 if (syscall(__NR_process_madvise, pidfd, 2, behaviors,
> 						&retp, &rangep, 0))
>                         perror("process_madvise fail\n");
>                 system("cat /proc/vmstat | egrep 'pswpout|deactivate'");
>         }
> 
>         return 0;
> }


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-20  3:52 ` [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER Minchan Kim
  2019-05-20  9:28   ` Michal Hocko
@ 2019-05-29  4:36   ` Hillf Danton
  2019-05-30  1:00   ` Minchan Kim
  2 siblings, 0 replies; 138+ messages in thread
From: Hillf Danton @ 2019-05-29  4:36 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Michal Hocko, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon


On Mon, 20 May 2019 12:52:54 +0900 Minchan Kim wrote:
> 
> With that, user could call a process_madvise syscall simply with a entire
> range(0x0 - 0xFFFFFFFFFFFFFFFF) but either of MADV_ANONYMOUS_FILTER and
> MADV_FILE_FILTER so there is no need to call the syscall range by range.
> 
Cool.

Look forward to seeing the non-RFC delivery.

BR
Hillf


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector arrary
  2019-05-27  7:49             ` Minchan Kim
@ 2019-05-29 10:08               ` Daniel Colascione
  2019-05-29 10:33                 ` Michal Hocko
  0 siblings, 1 reply; 138+ messages in thread
From: Daniel Colascione @ 2019-05-29 10:08 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Michal Hocko, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
	Sonny Rao, Brian Geffon, Linux API

On Mon, May 27, 2019 at 12:49 AM Minchan Kim <minchan@kernel.org> wrote:
>
> On Tue, May 21, 2019 at 12:37:26PM +0200, Michal Hocko wrote:
> > On Tue 21-05-19 19:26:13, Minchan Kim wrote:
> > > On Tue, May 21, 2019 at 08:24:21AM +0200, Michal Hocko wrote:
> > > > On Tue 21-05-19 11:48:20, Minchan Kim wrote:
> > > > > On Mon, May 20, 2019 at 11:22:58AM +0200, Michal Hocko wrote:
> > > > > > [Cc linux-api]
> > > > > >
> > > > > > On Mon 20-05-19 12:52:53, Minchan Kim wrote:
> > > > > > > Currently, process_madvise syscall works for only one address range
> > > > > > > so user should call the syscall several times to give hints to
> > > > > > > multiple address range.
> > > > > >
> > > > > > Is that a problem? How big of a problem? Any numbers?
> > > > >
> > > > > We easily have 2000+ vma so it's not trivial overhead. I will come up
> > > > > with number in the description at respin.
> > > >
> > > > Does this really have to be a fast operation? I would expect the monitor
> > > > is by no means a fast path. The system call overhead is not what it used
> > > > to be, sigh, but still for something that is not a hot path it should be
> > > > tolerable, especially when the whole operation is quite expensive on its
> > > > own (wrt. the syscall entry/exit).
> > >
> > > What's different with process_vm_[readv|writev] and vmsplice?
> > > If the range needed to be covered is a lot, vector operation makes senese
> > > to me.
> >
> > I am not saying that the vector API is wrong. All I am trying to say is
> > that the benefit is not really clear so far. If you want to push it
> > through then you should better get some supporting data.
>
> I measured 1000 madvise syscall vs. a vector range syscall with 1000
> ranges on ARM64 mordern device. Even though I saw 15% improvement but
> absoluate gain is just 1ms so I don't think it's worth to support.
> I will drop vector support at next revision.

Please do keep the vector support. Absolute timing is misleading,
since in a tight loop, you're not going to contend on mmap_sem. We've
seen tons of improvements in things like camera start come from
coalescing mprotect calls, with the gains coming from taking and
releasing various locks a lot less often and bouncing around less on
the contended lock paths. Raw throughput doesn't tell the whole story,
especially on mobile.


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector arrary
  2019-05-29 10:08               ` Daniel Colascione
@ 2019-05-29 10:33                 ` Michal Hocko
  2019-05-30  2:17                   ` Minchan Kim
  0 siblings, 1 reply; 138+ messages in thread
From: Michal Hocko @ 2019-05-29 10:33 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
	Sonny Rao, Brian Geffon, Linux API

On Wed 29-05-19 03:08:32, Daniel Colascione wrote:
> On Mon, May 27, 2019 at 12:49 AM Minchan Kim <minchan@kernel.org> wrote:
> >
> > On Tue, May 21, 2019 at 12:37:26PM +0200, Michal Hocko wrote:
> > > On Tue 21-05-19 19:26:13, Minchan Kim wrote:
> > > > On Tue, May 21, 2019 at 08:24:21AM +0200, Michal Hocko wrote:
> > > > > On Tue 21-05-19 11:48:20, Minchan Kim wrote:
> > > > > > On Mon, May 20, 2019 at 11:22:58AM +0200, Michal Hocko wrote:
> > > > > > > [Cc linux-api]
> > > > > > >
> > > > > > > On Mon 20-05-19 12:52:53, Minchan Kim wrote:
> > > > > > > > Currently, process_madvise syscall works for only one address range
> > > > > > > > so user should call the syscall several times to give hints to
> > > > > > > > multiple address range.
> > > > > > >
> > > > > > > Is that a problem? How big of a problem? Any numbers?
> > > > > >
> > > > > > We easily have 2000+ vma so it's not trivial overhead. I will come up
> > > > > > with number in the description at respin.
> > > > >
> > > > > Does this really have to be a fast operation? I would expect the monitor
> > > > > is by no means a fast path. The system call overhead is not what it used
> > > > > to be, sigh, but still for something that is not a hot path it should be
> > > > > tolerable, especially when the whole operation is quite expensive on its
> > > > > own (wrt. the syscall entry/exit).
> > > >
> > > > What's different with process_vm_[readv|writev] and vmsplice?
> > > > If the range needed to be covered is a lot, vector operation makes senese
> > > > to me.
> > >
> > > I am not saying that the vector API is wrong. All I am trying to say is
> > > that the benefit is not really clear so far. If you want to push it
> > > through then you should better get some supporting data.
> >
> > I measured 1000 madvise syscall vs. a vector range syscall with 1000
> > ranges on ARM64 mordern device. Even though I saw 15% improvement but
> > absoluate gain is just 1ms so I don't think it's worth to support.
> > I will drop vector support at next revision.
> 
> Please do keep the vector support. Absolute timing is misleading,
> since in a tight loop, you're not going to contend on mmap_sem. We've
> seen tons of improvements in things like camera start come from
> coalescing mprotect calls, with the gains coming from taking and
> releasing various locks a lot less often and bouncing around less on
> the contended lock paths. Raw throughput doesn't tell the whole story,
> especially on mobile.

This will always be a double edge sword. Taking a lock for longer can
improve a throughput of a single call but it would make a latency for
anybody contending on the lock much worse.

Besides that, please do not overcomplicate the thing from the early
beginning please. Let's start with a simple and well defined remote
madvise alternative first and build a vector API on top with some
numbers based on _real_ workloads.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector arrary
  2019-05-20  3:52 ` [RFC 6/7] mm: extend process_madvise syscall to support vector arrary Minchan Kim
  2019-05-20  9:22   ` Michal Hocko
  2019-05-29  4:14   ` Hillf Danton
@ 2019-05-30  0:35   ` Minchan Kim
  2 siblings, 0 replies; 138+ messages in thread
From: Minchan Kim @ 2019-05-30  0:35 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Andrew Morton, LKML, linux-mm, Michal Hocko, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon

On Wed, May 29, 2019 at 12:14:47PM +0800, Hillf Danton wrote:
> 
> On Mon, 20 May 2019 12:52:53 +0900 Minchan Kim wrote:
> > Example)
> > 
> Better if the following stuff is stored somewhere under the
> tools/testing directory.

Sure, I will do once we figure out RFC stage.


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 5/7] mm: introduce external memory hinting API
  2019-05-20  3:52 ` [RFC 5/7] mm: introduce external memory hinting API Minchan Kim
                     ` (3 preceding siblings ...)
  2019-05-29  3:41   ` Hillf Danton
@ 2019-05-30  0:38   ` Minchan Kim
  4 siblings, 0 replies; 138+ messages in thread
From: Minchan Kim @ 2019-05-30  0:38 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Andrew Morton, LKML, linux-mm, Michal Hocko, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon

On Wed, May 29, 2019 at 11:41:23AM +0800, Hillf Danton wrote:
> 
> On Mon, 20 May 2019 12:52:52 +0900 Minchan Kim wrote:
> > --- a/arch/x86/entry/syscalls/syscall_64.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> > @@ -355,6 +355,7 @@
> >  425	common	io_uring_setup		__x64_sys_io_uring_setup
> >  426	common	io_uring_enter		__x64_sys_io_uring_enter
> >  427	common	io_uring_register	__x64_sys_io_uring_register
> > +428	common	process_madvise		__x64_sys_process_madvise
> >  
> Much better if something similar is added for arm64.

I will port every architecture once we figure out RFC and reaches the
conclusion for right interface.

> 
> >  #
> >  # x32-specific system call numbers start at 512 to avoid cache impact
> > --- a/include/uapi/asm-generic/unistd.h
> > +++ b/include/uapi/asm-generic/unistd.h
> > @@ -832,6 +832,8 @@ __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
> >  __SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
> >  #define __NR_io_uring_register 427
> >  __SYSCALL(__NR_io_uring_register, sys_io_uring_register)
> > +#define __NR_process_madvise 428
> > +__SYSCALL(__NR_process_madvise, sys_process_madvise)
> >  
> >  #undef __NR_syscalls
> >  #define __NR_syscalls 428
> 
> Seems __NR_syscalls needs to increment by one.

Thanks. I will fix it.


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 3/7] mm: introduce MADV_COLD
  2019-05-20  3:52 ` [RFC 3/7] mm: introduce MADV_COLD Minchan Kim
  2019-05-20  8:27   ` Michal Hocko
  2019-05-28 14:54   ` Hillf Danton
@ 2019-05-30  0:45   ` Minchan Kim
  2 siblings, 0 replies; 138+ messages in thread
From: Minchan Kim @ 2019-05-30  0:45 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Andrew Morton, LKML, linux-mm, Michal Hocko, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon

On Tue, May 28, 2019 at 10:54:32PM +0800, Hillf Danton wrote:
> 
> On Mon, 20 May 2019 12:52:50 +0900 Minchan Kim wrote:
> > +unsigned long reclaim_pages(struct list_head *page_list)
> > +{
> > +	int nid = -1;
> > +	unsigned long nr_isolated[2] = {0, };
> > +	unsigned long nr_reclaimed = 0;
> > +	LIST_HEAD(node_page_list);
> > +	struct reclaim_stat dummy_stat;
> > +	struct scan_control sc = {
> > +		.gfp_mask = GFP_KERNEL,
> > +		.priority = DEF_PRIORITY,
> > +		.may_writepage = 1,
> > +		.may_unmap = 1,
> > +		.may_swap = 1,
> > +	};
> > +
> > +	while (!list_empty(page_list)) {
> > +		struct page *page;
> > +
> > +		page = lru_to_page(page_list);
> > +		list_del(&page->lru);
> > +
> > +		if (nid == -1) {
> > +			nid = page_to_nid(page);
> > +			INIT_LIST_HEAD(&node_page_list);
> > +			nr_isolated[0] = nr_isolated[1] = 0;
> > +		}
> > +
> > +		if (nid == page_to_nid(page)) {
> > +			list_add(&page->lru, &node_page_list);
> > +			nr_isolated[!!page_is_file_cache(page)] +=
> > +						hpage_nr_pages(page);
> > +			continue;
> > +		}
> > +
> Now, page's node != nid and any page on the node_page_list has
> node == nid. 
> > +		nid = page_to_nid(page);
> 
> After updating nid, we get the node id of the isolated pages lost.
> 
> > +
> > +		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
> > +					nr_isolated[0]);
> > +		mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
> > +					nr_isolated[1]);
> > +		nr_reclaimed += shrink_page_list(&node_page_list,
> > +				NODE_DATA(nid), &sc, TTU_IGNORE_ACCESS,
> 
> And nid no longer matches the node of the pages to be shrunk.
> 
> > +				&dummy_stat, true);
> > +		while (!list_empty(&node_page_list)) {
> > +			struct page *page = lru_to_page(page_list);
> 
> Non-empty node_page_list will never become empty if pages are deleted
> only from the page_list.

Sure.
They were last minute change. I will fix it.

Thanks for the review!


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
  2019-05-20  3:52 ` [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER Minchan Kim
  2019-05-20  9:28   ` Michal Hocko
  2019-05-29  4:36   ` Hillf Danton
@ 2019-05-30  1:00   ` Minchan Kim
  2 siblings, 0 replies; 138+ messages in thread
From: Minchan Kim @ 2019-05-30  1:00 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Andrew Morton, LKML, linux-mm, Michal Hocko, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon

On Wed, May 29, 2019 at 12:36:04PM +0800, Hillf Danton wrote:
> 
> On Mon, 20 May 2019 12:52:54 +0900 Minchan Kim wrote:
> > 
> > With that, user could call a process_madvise syscall simply with a entire
> > range(0x0 - 0xFFFFFFFFFFFFFFFF) but either of MADV_ANONYMOUS_FILTER and
> > MADV_FILE_FILTER so there is no need to call the syscall range by range.
> > 
> Cool.
> 
> Look forward to seeing the non-RFC delivery.

I will drop this filter patch if userspace can parse address range fast.
Daniel suggested a new interface which could get necessary information
from the process with binary format so will it will remove endoce/decode
overhead as well as overhead we don't need to get like directory look
up.

Yes, with this filter option, it's best performance for a specific
usecase but sometime we need to give up *best* thing for *general*
stuff. I belive it's one of the category. ;-)


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector arrary
  2019-05-29 10:33                 ` Michal Hocko
@ 2019-05-30  2:17                   ` Minchan Kim
  2019-05-30  6:57                     ` Michal Hocko
  0 siblings, 1 reply; 138+ messages in thread
From: Minchan Kim @ 2019-05-30  2:17 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Daniel Colascione, Andrew Morton, LKML, linux-mm,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Shakeel Butt, Sonny Rao, Brian Geffon, Linux API

On Wed, May 29, 2019 at 12:33:52PM +0200, Michal Hocko wrote:
> On Wed 29-05-19 03:08:32, Daniel Colascione wrote:
> > On Mon, May 27, 2019 at 12:49 AM Minchan Kim <minchan@kernel.org> wrote:
> > >
> > > On Tue, May 21, 2019 at 12:37:26PM +0200, Michal Hocko wrote:
> > > > On Tue 21-05-19 19:26:13, Minchan Kim wrote:
> > > > > On Tue, May 21, 2019 at 08:24:21AM +0200, Michal Hocko wrote:
> > > > > > On Tue 21-05-19 11:48:20, Minchan Kim wrote:
> > > > > > > On Mon, May 20, 2019 at 11:22:58AM +0200, Michal Hocko wrote:
> > > > > > > > [Cc linux-api]
> > > > > > > >
> > > > > > > > On Mon 20-05-19 12:52:53, Minchan Kim wrote:
> > > > > > > > > Currently, process_madvise syscall works for only one address range
> > > > > > > > > so user should call the syscall several times to give hints to
> > > > > > > > > multiple address range.
> > > > > > > >
> > > > > > > > Is that a problem? How big of a problem? Any numbers?
> > > > > > >
> > > > > > > We easily have 2000+ vma so it's not trivial overhead. I will come up
> > > > > > > with number in the description at respin.
> > > > > >
> > > > > > Does this really have to be a fast operation? I would expect the monitor
> > > > > > is by no means a fast path. The system call overhead is not what it used
> > > > > > to be, sigh, but still for something that is not a hot path it should be
> > > > > > tolerable, especially when the whole operation is quite expensive on its
> > > > > > own (wrt. the syscall entry/exit).
> > > > >
> > > > > What's different with process_vm_[readv|writev] and vmsplice?
> > > > > If the range needed to be covered is a lot, vector operation makes senese
> > > > > to me.
> > > >
> > > > I am not saying that the vector API is wrong. All I am trying to say is
> > > > that the benefit is not really clear so far. If you want to push it
> > > > through then you should better get some supporting data.
> > >
> > > I measured 1000 madvise syscall vs. a vector range syscall with 1000
> > > ranges on ARM64 mordern device. Even though I saw 15% improvement but
> > > absoluate gain is just 1ms so I don't think it's worth to support.
> > > I will drop vector support at next revision.
> > 
> > Please do keep the vector support. Absolute timing is misleading,
> > since in a tight loop, you're not going to contend on mmap_sem. We've
> > seen tons of improvements in things like camera start come from
> > coalescing mprotect calls, with the gains coming from taking and
> > releasing various locks a lot less often and bouncing around less on
> > the contended lock paths. Raw throughput doesn't tell the whole story,
> > especially on mobile.
> 
> This will always be a double edge sword. Taking a lock for longer can
> improve a throughput of a single call but it would make a latency for
> anybody contending on the lock much worse.
> 
> Besides that, please do not overcomplicate the thing from the early
> beginning please. Let's start with a simple and well defined remote
> madvise alternative first and build a vector API on top with some
> numbers based on _real_ workloads.

First time, I didn't think about atomicity about address range race
because MADV_COLD/PAGEOUT is not critical for the race.
However you raised the atomicity issue because people would extend
hints to destructive ones easily. I agree with that and that's why
we discussed how to guarantee the race and Daniel comes up with good idea.

  - vma configuration seq number via process_getinfo(2).

We discussed the race issue without _read_ workloads/requests because
it's common sense that people might extend the syscall later.

Here is same. For current workload, we don't need to support vector
for perfomance point of view based on my experiment. However, it's
rather limited experiment. Some configuration might have 10000+ vmas
or really slow CPU. 

Furthermore, I want to have vector support due to atomicity issue
if it's really the one we should consider.
With vector support of the API and vma configuration sequence number
from Daniel, we could support address ranges operations's atomicity.
However, since we don't introduce vector at this moment, we need to
introduce *another syscall* later to be able to handle multile ranges
all at once atomically if it's okay.

Other thought:
Maybe we could extend address range batch syscall covers other MM
syscall like mmap/munmap/madvise/mprotect and so on because there
are multiple users that would benefit from this general batching
mechanism.


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector arrary
  2019-05-30  2:17                   ` Minchan Kim
@ 2019-05-30  6:57                     ` Michal Hocko
  2019-05-30  8:02                       ` Minchan Kim
  0 siblings, 1 reply; 138+ messages in thread
From: Michal Hocko @ 2019-05-30  6:57 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Daniel Colascione, Andrew Morton, LKML, linux-mm,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Shakeel Butt, Sonny Rao, Brian Geffon, Linux API

On Thu 30-05-19 11:17:48, Minchan Kim wrote:
> On Wed, May 29, 2019 at 12:33:52PM +0200, Michal Hocko wrote:
> > On Wed 29-05-19 03:08:32, Daniel Colascione wrote:
> > > On Mon, May 27, 2019 at 12:49 AM Minchan Kim <minchan@kernel.org> wrote:
> > > >
> > > > On Tue, May 21, 2019 at 12:37:26PM +0200, Michal Hocko wrote:
> > > > > On Tue 21-05-19 19:26:13, Minchan Kim wrote:
> > > > > > On Tue, May 21, 2019 at 08:24:21AM +0200, Michal Hocko wrote:
> > > > > > > On Tue 21-05-19 11:48:20, Minchan Kim wrote:
> > > > > > > > On Mon, May 20, 2019 at 11:22:58AM +0200, Michal Hocko wrote:
> > > > > > > > > [Cc linux-api]
> > > > > > > > >
> > > > > > > > > On Mon 20-05-19 12:52:53, Minchan Kim wrote:
> > > > > > > > > > Currently, process_madvise syscall works for only one address range
> > > > > > > > > > so user should call the syscall several times to give hints to
> > > > > > > > > > multiple address range.
> > > > > > > > >
> > > > > > > > > Is that a problem? How big of a problem? Any numbers?
> > > > > > > >
> > > > > > > > We easily have 2000+ vma so it's not trivial overhead. I will come up
> > > > > > > > with number in the description at respin.
> > > > > > >
> > > > > > > Does this really have to be a fast operation? I would expect the monitor
> > > > > > > is by no means a fast path. The system call overhead is not what it used
> > > > > > > to be, sigh, but still for something that is not a hot path it should be
> > > > > > > tolerable, especially when the whole operation is quite expensive on its
> > > > > > > own (wrt. the syscall entry/exit).
> > > > > >
> > > > > > What's different with process_vm_[readv|writev] and vmsplice?
> > > > > > If the range needed to be covered is a lot, vector operation makes senese
> > > > > > to me.
> > > > >
> > > > > I am not saying that the vector API is wrong. All I am trying to say is
> > > > > that the benefit is not really clear so far. If you want to push it
> > > > > through then you should better get some supporting data.
> > > >
> > > > I measured 1000 madvise syscall vs. a vector range syscall with 1000
> > > > ranges on ARM64 mordern device. Even though I saw 15% improvement but
> > > > absoluate gain is just 1ms so I don't think it's worth to support.
> > > > I will drop vector support at next revision.
> > > 
> > > Please do keep the vector support. Absolute timing is misleading,
> > > since in a tight loop, you're not going to contend on mmap_sem. We've
> > > seen tons of improvements in things like camera start come from
> > > coalescing mprotect calls, with the gains coming from taking and
> > > releasing various locks a lot less often and bouncing around less on
> > > the contended lock paths. Raw throughput doesn't tell the whole story,
> > > especially on mobile.
> > 
> > This will always be a double edge sword. Taking a lock for longer can
> > improve a throughput of a single call but it would make a latency for
> > anybody contending on the lock much worse.
> > 
> > Besides that, please do not overcomplicate the thing from the early
> > beginning please. Let's start with a simple and well defined remote
> > madvise alternative first and build a vector API on top with some
> > numbers based on _real_ workloads.
> 
> First time, I didn't think about atomicity about address range race
> because MADV_COLD/PAGEOUT is not critical for the race.
> However you raised the atomicity issue because people would extend
> hints to destructive ones easily. I agree with that and that's why
> we discussed how to guarantee the race and Daniel comes up with good idea.

Just for the clarification, I didn't really mean atomicity but rather a
_consistency_ (essentially time to check to time to use consistency).
 
>   - vma configuration seq number via process_getinfo(2).
> 
> We discussed the race issue without _read_ workloads/requests because
> it's common sense that people might extend the syscall later.
> 
> Here is same. For current workload, we don't need to support vector
> for perfomance point of view based on my experiment. However, it's
> rather limited experiment. Some configuration might have 10000+ vmas
> or really slow CPU. 
> 
> Furthermore, I want to have vector support due to atomicity issue
> if it's really the one we should consider.
> With vector support of the API and vma configuration sequence number
> from Daniel, we could support address ranges operations's atomicity.

I am not sure what do you mean here. Perform all ranges atomicaly wrt.
other address space modifications? If yes I am not sure we want that
semantic because it can cause really long stalls for other operations
but that is a discussion on its own and I would rather focus on a simple
interface first.

> However, since we don't introduce vector at this moment, we need to
> introduce *another syscall* later to be able to handle multile ranges
> all at once atomically if it's okay.

Agreed.

> Other thought:
> Maybe we could extend address range batch syscall covers other MM
> syscall like mmap/munmap/madvise/mprotect and so on because there
> are multiple users that would benefit from this general batching
> mechanism.

Again a discussion on its own ;)

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector arrary
  2019-05-30  6:57                     ` Michal Hocko
@ 2019-05-30  8:02                       ` Minchan Kim
  2019-05-30 16:19                         ` Daniel Colascione
  2019-05-30 18:47                         ` Michal Hocko
  0 siblings, 2 replies; 138+ messages in thread
From: Minchan Kim @ 2019-05-30  8:02 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Daniel Colascione, Andrew Morton, LKML, linux-mm,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Shakeel Butt, Sonny Rao, Brian Geffon, Linux API

On Thu, May 30, 2019 at 08:57:55AM +0200, Michal Hocko wrote:
> On Thu 30-05-19 11:17:48, Minchan Kim wrote:
> > On Wed, May 29, 2019 at 12:33:52PM +0200, Michal Hocko wrote:
> > > On Wed 29-05-19 03:08:32, Daniel Colascione wrote:
> > > > On Mon, May 27, 2019 at 12:49 AM Minchan Kim <minchan@kernel.org> wrote:
> > > > >
> > > > > On Tue, May 21, 2019 at 12:37:26PM +0200, Michal Hocko wrote:
> > > > > > On Tue 21-05-19 19:26:13, Minchan Kim wrote:
> > > > > > > On Tue, May 21, 2019 at 08:24:21AM +0200, Michal Hocko wrote:
> > > > > > > > On Tue 21-05-19 11:48:20, Minchan Kim wrote:
> > > > > > > > > On Mon, May 20, 2019 at 11:22:58AM +0200, Michal Hocko wrote:
> > > > > > > > > > [Cc linux-api]
> > > > > > > > > >
> > > > > > > > > > On Mon 20-05-19 12:52:53, Minchan Kim wrote:
> > > > > > > > > > > Currently, process_madvise syscall works for only one address range
> > > > > > > > > > > so user should call the syscall several times to give hints to
> > > > > > > > > > > multiple address range.
> > > > > > > > > >
> > > > > > > > > > Is that a problem? How big of a problem? Any numbers?
> > > > > > > > >
> > > > > > > > > We easily have 2000+ vma so it's not trivial overhead. I will come up
> > > > > > > > > with number in the description at respin.
> > > > > > > >
> > > > > > > > Does this really have to be a fast operation? I would expect the monitor
> > > > > > > > is by no means a fast path. The system call overhead is not what it used
> > > > > > > > to be, sigh, but still for something that is not a hot path it should be
> > > > > > > > tolerable, especially when the whole operation is quite expensive on its
> > > > > > > > own (wrt. the syscall entry/exit).
> > > > > > >
> > > > > > > What's different with process_vm_[readv|writev] and vmsplice?
> > > > > > > If the range needed to be covered is a lot, vector operation makes senese
> > > > > > > to me.
> > > > > >
> > > > > > I am not saying that the vector API is wrong. All I am trying to say is
> > > > > > that the benefit is not really clear so far. If you want to push it
> > > > > > through then you should better get some supporting data.
> > > > >
> > > > > I measured 1000 madvise syscall vs. a vector range syscall with 1000
> > > > > ranges on ARM64 mordern device. Even though I saw 15% improvement but
> > > > > absoluate gain is just 1ms so I don't think it's worth to support.
> > > > > I will drop vector support at next revision.
> > > > 
> > > > Please do keep the vector support. Absolute timing is misleading,
> > > > since in a tight loop, you're not going to contend on mmap_sem. We've
> > > > seen tons of improvements in things like camera start come from
> > > > coalescing mprotect calls, with the gains coming from taking and
> > > > releasing various locks a lot less often and bouncing around less on
> > > > the contended lock paths. Raw throughput doesn't tell the whole story,
> > > > especially on mobile.
> > > 
> > > This will always be a double edge sword. Taking a lock for longer can
> > > improve a throughput of a single call but it would make a latency for
> > > anybody contending on the lock much worse.
> > > 
> > > Besides that, please do not overcomplicate the thing from the early
> > > beginning please. Let's start with a simple and well defined remote
> > > madvise alternative first and build a vector API on top with some
> > > numbers based on _real_ workloads.
> > 
> > First time, I didn't think about atomicity about address range race
> > because MADV_COLD/PAGEOUT is not critical for the race.
> > However you raised the atomicity issue because people would extend
> > hints to destructive ones easily. I agree with that and that's why
> > we discussed how to guarantee the race and Daniel comes up with good idea.
> 
> Just for the clarification, I didn't really mean atomicity but rather a
> _consistency_ (essentially time to check to time to use consistency).

What do you mean by *consistency*? Could you elaborate it more?

>  
> >   - vma configuration seq number via process_getinfo(2).
> > 
> > We discussed the race issue without _read_ workloads/requests because
> > it's common sense that people might extend the syscall later.
> > 
> > Here is same. For current workload, we don't need to support vector
> > for perfomance point of view based on my experiment. However, it's
> > rather limited experiment. Some configuration might have 10000+ vmas
> > or really slow CPU. 
> > 
> > Furthermore, I want to have vector support due to atomicity issue
> > if it's really the one we should consider.
> > With vector support of the API and vma configuration sequence number
> > from Daniel, we could support address ranges operations's atomicity.
> 
> I am not sure what do you mean here. Perform all ranges atomicaly wrt.
> other address space modifications? If yes I am not sure we want that

Yub, I think it's *necessary* if we want to support destructive hints
via process_madvise.

> semantic because it can cause really long stalls for other operations

It could be or it couldn't be.

For example, if we could multiplex several syscalls which we should
enumerate all of page table lookup, it could be more effective rather
than doing each page table on each syscall.

> but that is a discussion on its own and I would rather focus on a simple
> interface first.

It seems it's time to send RFCv2 since we discussed a lot although we
don't have clear conclution yet. But still want to understand what you
meant _consistency_.

Thanks for the review, Michal! It's very helpful.

> 
> > However, since we don't introduce vector at this moment, we need to
> > introduce *another syscall* later to be able to handle multile ranges
> > all at once atomically if it's okay.
> 
> Agreed.
> 
> > Other thought:
> > Maybe we could extend address range batch syscall covers other MM
> > syscall like mmap/munmap/madvise/mprotect and so on because there
> > are multiple users that would benefit from this general batching
> > mechanism.
> 
> Again a discussion on its own ;)
> 
> -- 
> Michal Hocko
> SUSE Labs


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector arrary
  2019-05-30  8:02                       ` Minchan Kim
@ 2019-05-30 16:19                         ` Daniel Colascione
  2019-05-30 18:47                         ` Michal Hocko
  1 sibling, 0 replies; 138+ messages in thread
From: Daniel Colascione @ 2019-05-30 16:19 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Michal Hocko, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
	Sonny Rao, Brian Geffon, Linux API

On Thu, May 30, 2019 at 1:02 AM Minchan Kim <minchan@kernel.org> wrote:
>
> On Thu, May 30, 2019 at 08:57:55AM +0200, Michal Hocko wrote:
> > On Thu 30-05-19 11:17:48, Minchan Kim wrote:
> > > On Wed, May 29, 2019 at 12:33:52PM +0200, Michal Hocko wrote:
> > > > On Wed 29-05-19 03:08:32, Daniel Colascione wrote:
> > > > > On Mon, May 27, 2019 at 12:49 AM Minchan Kim <minchan@kernel.org> wrote:
> > > > > >
> > > > > > On Tue, May 21, 2019 at 12:37:26PM +0200, Michal Hocko wrote:
> > > > > > > On Tue 21-05-19 19:26:13, Minchan Kim wrote:
> > > > > > > > On Tue, May 21, 2019 at 08:24:21AM +0200, Michal Hocko wrote:
> > > > > > > > > On Tue 21-05-19 11:48:20, Minchan Kim wrote:
> > > > > > > > > > On Mon, May 20, 2019 at 11:22:58AM +0200, Michal Hocko wrote:
> > > > > > > > > > > [Cc linux-api]
> > > > > > > > > > >
> > > > > > > > > > > On Mon 20-05-19 12:52:53, Minchan Kim wrote:
> > > > > > > > > > > > Currently, process_madvise syscall works for only one address range
> > > > > > > > > > > > so user should call the syscall several times to give hints to
> > > > > > > > > > > > multiple address range.
> > > > > > > > > > >
> > > > > > > > > > > Is that a problem? How big of a problem? Any numbers?
> > > > > > > > > >
> > > > > > > > > > We easily have 2000+ vma so it's not trivial overhead. I will come up
> > > > > > > > > > with number in the description at respin.
> > > > > > > > >
> > > > > > > > > Does this really have to be a fast operation? I would expect the monitor
> > > > > > > > > is by no means a fast path. The system call overhead is not what it used
> > > > > > > > > to be, sigh, but still for something that is not a hot path it should be
> > > > > > > > > tolerable, especially when the whole operation is quite expensive on its
> > > > > > > > > own (wrt. the syscall entry/exit).
> > > > > > > >
> > > > > > > > What's different with process_vm_[readv|writev] and vmsplice?
> > > > > > > > If the range needed to be covered is a lot, vector operation makes senese
> > > > > > > > to me.
> > > > > > >
> > > > > > > I am not saying that the vector API is wrong. All I am trying to say is
> > > > > > > that the benefit is not really clear so far. If you want to push it
> > > > > > > through then you should better get some supporting data.
> > > > > >
> > > > > > I measured 1000 madvise syscall vs. a vector range syscall with 1000
> > > > > > ranges on ARM64 mordern device. Even though I saw 15% improvement but
> > > > > > absoluate gain is just 1ms so I don't think it's worth to support.
> > > > > > I will drop vector support at next revision.
> > > > >
> > > > > Please do keep the vector support. Absolute timing is misleading,
> > > > > since in a tight loop, you're not going to contend on mmap_sem. We've
> > > > > seen tons of improvements in things like camera start come from
> > > > > coalescing mprotect calls, with the gains coming from taking and
> > > > > releasing various locks a lot less often and bouncing around less on
> > > > > the contended lock paths. Raw throughput doesn't tell the whole story,
> > > > > especially on mobile.
> > > >
> > > > This will always be a double edge sword. Taking a lock for longer can
> > > > improve a throughput of a single call but it would make a latency for
> > > > anybody contending on the lock much worse.
> > > >
> > > > Besides that, please do not overcomplicate the thing from the early
> > > > beginning please. Let's start with a simple and well defined remote
> > > > madvise alternative first and build a vector API on top with some
> > > > numbers based on _real_ workloads.
> > >
> > > First time, I didn't think about atomicity about address range race
> > > because MADV_COLD/PAGEOUT is not critical for the race.
> > > However you raised the atomicity issue because people would extend
> > > hints to destructive ones easily. I agree with that and that's why
> > > we discussed how to guarantee the race and Daniel comes up with good idea.
> >
> > Just for the clarification, I didn't really mean atomicity but rather a
> > _consistency_ (essentially time to check to time to use consistency).
>
> What do you mean by *consistency*? Could you elaborate it more?
>
> >
> > >   - vma configuration seq number via process_getinfo(2).
> > >
> > > We discussed the race issue without _read_ workloads/requests because
> > > it's common sense that people might extend the syscall later.
> > >
> > > Here is same. For current workload, we don't need to support vector
> > > for perfomance point of view based on my experiment. However, it's
> > > rather limited experiment. Some configuration might have 10000+ vmas
> > > or really slow CPU.
> > >
> > > Furthermore, I want to have vector support due to atomicity issue
> > > if it's really the one we should consider.
> > > With vector support of the API and vma configuration sequence number
> > > from Daniel, we could support address ranges operations's atomicity.
> >
> > I am not sure what do you mean here. Perform all ranges atomicaly wrt.
> > other address space modifications? If yes I am not sure we want that
>
> Yub, I think it's *necessary* if we want to support destructive hints
> via process_madvise.

[Puts on flame-proof suit]

Here's a quick sketch of what I have in mind for process_getinfo(2).
Keep in mind that it's still just a rough idea.

We've had trouble efficiently learning about process and memory state
of the system via procfs. Android background memory-use scans (the
android.bg daemon) consume quite a bit of CPU time for PSS; Minchan's
done a lot of thinking for how we can specify desired page sets for
compaction as part of this patch set; and the full procfs walks that
some trace collection tools need to undertake take more than 200ms to
collect (sometimes much more) due mostly to procfs iteration. ISTM we
can do better.

While procfs *works* on a functional level, it's inefficient due to
the splatting of information we want across several different files
(which need to be independently opened --- e.g.,
/proc/pid/oom_score_adj *and* /proc/pid/status), inefficient due to
the ad-hoc text formatting, inefficient due to information over-fetch,
and cumbersome due to the fundamental impedance mismatch between
filesystem APIs and process lifetimes. procfs itself is also optional,
which has caused various bits of awkwardness that you'll recall from
the pidfd discussions.

How about we solve the problem once and for all? I'm imagining a new
process_getinfo(2) that solves all of these problems at the same time.
I want something with a few properties:

1) if we want to learn M facts about N things, we enter the kernel
once and learn all M*N things,
2) the information we collect is self-consistent (which implies
atomicity most of the time),
3) we retrieve the information we want in an efficient binary format, and
4) we don't pay to learn anything not in M.

I've jotted down a quick sketch of the API below; I'm curious what
everyone else thinks. It'd basically look like this:

int process_getinfo(int nr_proc, int* proc, int flags, unsigned long
mask, void* out_buf, size_t* inout_sz)

We wouldn't use the return value for much: 0 on success and -1 on
error with errno set. NR_PROC and PROC together would specify the
objects we want to learn about, which would be either PIDs or PIDFDs
or maybe nothing at all if FLAGS tells us to inspect every process on
the system. MASK is a bitset of facts we want to learn, described
below. OUT_BUF and INOUT_SZ are for actually communicating the result.
On input, the caller would fill *INOUT_SZ with the size of the buffer
to which OUT_BUF points; on success, we'd fill *INOUT_SZ with the
number of bytes we actually used. If the output buffer is too small,
we'll truncate the result, fail the system call with E2BIG, and fill
*INOUT_SZ with the number of needed bytes, inviting the caller to try
again. (If a caller supplies something huge like a reusable 512KB
buffer on the first call, no reallocation and retries will be
necessary in practice on a typically-sized system.)

The actual returned buffer is a collection of structures and data
blocks starting with a struct process_info. The structures in the
returned buffer sometimes contain "pointers" to other structures
encoded as byte offsets from the start of the information buffer.
Using offsets instead of actual pointers keeps the format the same
across 32- and 64-bit versions of process_getinfo and makes it
possible to relocate the returned information buffer with memcpy.

struct process_info {
  int first_procrec_offset;  // struct procrec*
  //  Examples of system-wide things we could ask for
  int mem_total_kb;
  int mem_free_kb;
  int mem_available_kb;
  char reserved[];
};

struct procrec {
  int next_procrec_offset;  // struct procrec*
  // Following fields are self-explanatory and are only examples of the
  // kind of information we could provide.
  int tid;
  int tgid;
  char status;
  int oom_score_adj;
  struct { int real, effective, saved, fs; } uids;
  int prio;
  int comm_offset;  // char*
  uint64_t rss_file_kb
  uint64_t rss_anon_fb;
  uint64_t vm_seq;
  int first_procrec_vma_offset;  // struct procrec_vma*
  char reserved[];
};

struct procrec_vma {
  int next_procrec_vma_offset;  // struct procrec_vma*
  unsigned long start;
  unsigned long end;
  int backing_file_name_offset;  // char*
  int prot;
  char reserved[];
};

Callers would use the returned buffer by casting it to a struct
process_info and following the internal "pointers".

MASK would specify which bits of information we wanted: for example,
if we asked for PROCESS_VM_MEMORY_MAP, the kernel would fill in each
struct procret's memory_map field and have it point to a struct
procrec_vma in the returned output buffer. If we omitted
PROCESS_VM_MEMORY_MAP, we'd leave the memory_map field as NULL
(encoded as offset zero). The kernel would embed any strings (like
comm and VMA names) into the output buffer; the precise locations
would be unspecified so long as callers could find these fields via
output-buffer pointers.

Because all the structures are variable-length and are chained
together with explicit pointers (or offsets) instead of being stuffed
into a literal array, we can add additional fields to the output
structures any time we want without breaking binary compatibility.
Callers would tell the kernel that they're interested in the added
struct fields by asking for them via bits in MASK, and kernels that
don't understand those fields would just fail the system call with
EINVAL or something.

Depending on how we call it, we can use this API as a bunch of different things:

1) Quick retrieval of system-wide memory counters, like /proc/meminfo
2) Quick snapshot of all process-thread identities (asking for the
wildcard TID match via FLAGS)
3) Fast enumeration of one process's address space
4) Collecting process-summary VM counters (e.g., rss_anon and
rss_file) for a set of processes
5) Retrieval of every VMA of every process on the system for debugging

We can do all of this with one entry into the kernel and without
opening any new file descriptors (unless we want to use pidfds as
inputs). We can also make this operation as atomic as we want, e.g.,
taking mmap_sem while looking at each process and taking tasklist_lock
so all the thread IDs line up with their processes. We don't
necessarily need to take the mmap_sems of all processes we care about
at the same time.

Since this isn't a filesystem-based API, we don't have to deal with
seq_file or deal with consistency issues arising from userspace
programs doing strange things like reading procfs files very slowly in
small chunks. Security-wise, we'd just use different access checks for
different requested information bits in MASK, maybe supplying "no
access" struct procrec entries if a caller doesn't happen to have
access to a particular process. I suppose we can talk about whether
access check failures should result in dummy values or syscall
failure: maybe callers should select which behavior they want.
Format-wise, we could also just return flatbuffer messages from the
kernel, but I suspect that we don't want flatbuffer in the kernel
right now. :-)

The API I'm proposing accepts an array of processes to inspect. We
could simplify it by accepting just one process and making the caller
enter the kernel once per process it wants to learn about, but this
simplification would make the API less useful for answering questions
like "What's the RSS of every process on the system right now?".

What do you think?


^ permalink raw reply	[flat|nested] 138+ messages in thread

* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector arrary
  2019-05-30  8:02                       ` Minchan Kim
  2019-05-30 16:19                         ` Daniel Colascione
@ 2019-05-30 18:47                         ` Michal Hocko
  1 sibling, 0 replies; 138+ messages in thread
From: Michal Hocko @ 2019-05-30 18:47 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Daniel Colascione, Andrew Morton, LKML, linux-mm,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Shakeel Butt, Sonny Rao, Brian Geffon, Linux API

On Thu 30-05-19 17:02:14, Minchan Kim wrote:
> On Thu, May 30, 2019 at 08:57:55AM +0200, Michal Hocko wrote:
> > On Thu 30-05-19 11:17:48, Minchan Kim wrote:
[...]
> > > First time, I didn't think about atomicity about address range race
> > > because MADV_COLD/PAGEOUT is not critical for the race.
> > > However you raised the atomicity issue because people would extend
> > > hints to destructive ones easily. I agree with that and that's why
> > > we discussed how to guarantee the race and Daniel comes up with good idea.
> > 
> > Just for the clarification, I didn't really mean atomicity but rather a
> > _consistency_ (essentially time to check to time to use consistency).
> 
> What do you mean by *consistency*? Could you elaborate it more?

That you operate on the object you have got by some means. In other
words that the range you want to call madvise on hasn't been
remapped/replaced by a different mmap operation.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 138+ messages in thread

end of thread, other threads:[~2019-05-30 18:47 UTC | newest]

Thread overview: 138+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-20  3:52 [RFC 0/7] introduce memory hinting API for external process Minchan Kim
2019-05-20  3:52 ` [RFC 1/7] mm: introduce MADV_COOL Minchan Kim
2019-05-20  8:16   ` Michal Hocko
2019-05-20  8:19     ` Michal Hocko
2019-05-20 15:08       ` Suren Baghdasaryan
2019-05-20 22:55       ` Minchan Kim
2019-05-20 22:54     ` Minchan Kim
2019-05-21  6:04       ` Michal Hocko
2019-05-21  9:11         ` Minchan Kim
2019-05-21 10:05           ` Michal Hocko
2019-05-28  8:53   ` Hillf Danton
2019-05-28 10:58   ` Minchan Kim
2019-05-20  3:52 ` [RFC 2/7] mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM Minchan Kim
2019-05-20 16:50   ` Johannes Weiner
2019-05-20 22:57     ` Minchan Kim
2019-05-20  3:52 ` [RFC 3/7] mm: introduce MADV_COLD Minchan Kim
2019-05-20  8:27   ` Michal Hocko
2019-05-20 23:00     ` Minchan Kim
2019-05-21  6:08       ` Michal Hocko
2019-05-21  9:13         ` Minchan Kim
2019-05-28 14:54   ` Hillf Danton
2019-05-30  0:45   ` Minchan Kim
2019-05-20  3:52 ` [RFC 4/7] mm: factor out madvise's core functionality Minchan Kim
2019-05-20 14:26   ` Oleksandr Natalenko
2019-05-21  1:26     ` Minchan Kim
2019-05-21  6:36       ` Oleksandr Natalenko
2019-05-21  6:50         ` Michal Hocko
2019-05-21  7:06           ` Oleksandr Natalenko
2019-05-21 10:52             ` Minchan Kim
2019-05-21 11:00               ` Michal Hocko
2019-05-21 11:24                 ` Minchan Kim
2019-05-21 11:32                   ` Michal Hocko
2019-05-21 10:49         ` Minchan Kim
2019-05-21 10:55           ` Michal Hocko
2019-05-20  3:52 ` [RFC 5/7] mm: introduce external memory hinting API Minchan Kim
2019-05-20  9:18   ` Michal Hocko
2019-05-21  2:41     ` Minchan Kim
2019-05-21  6:17       ` Michal Hocko
2019-05-21 10:32         ` Minchan Kim
2019-05-21  9:01   ` Christian Brauner
2019-05-21 11:35     ` Minchan Kim
2019-05-21 11:51       ` Christian Brauner
2019-05-21 15:31   ` Oleg Nesterov
2019-05-27  7:43     ` Minchan Kim
2019-05-27 15:12       ` Oleg Nesterov
2019-05-27 23:33         ` Minchan Kim
2019-05-28  7:23           ` Michal Hocko
2019-05-29  3:41   ` Hillf Danton
2019-05-30  0:38   ` Minchan Kim
2019-05-20  3:52 ` [RFC 6/7] mm: extend process_madvise syscall to support vector arrary Minchan Kim
2019-05-20  9:22   ` Michal Hocko
2019-05-21  2:48     ` Minchan Kim
2019-05-21  6:24       ` Michal Hocko
2019-05-21 10:26         ` Minchan Kim
2019-05-21 10:37           ` Michal Hocko
2019-05-27  7:49             ` Minchan Kim
2019-05-29 10:08               ` Daniel Colascione
2019-05-29 10:33                 ` Michal Hocko
2019-05-30  2:17                   ` Minchan Kim
2019-05-30  6:57                     ` Michal Hocko
2019-05-30  8:02                       ` Minchan Kim
2019-05-30 16:19                         ` Daniel Colascione
2019-05-30 18:47                         ` Michal Hocko
2019-05-29  4:14   ` Hillf Danton
2019-05-30  0:35   ` Minchan Kim
2019-05-20  3:52 ` [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER Minchan Kim
2019-05-20  9:28   ` Michal Hocko
2019-05-21  2:55     ` Minchan Kim
2019-05-21  6:26       ` Michal Hocko
2019-05-27  7:58         ` Minchan Kim
2019-05-27 12:44           ` Michal Hocko
2019-05-28  3:26             ` Minchan Kim
2019-05-28  6:29               ` Michal Hocko
2019-05-28  8:13                 ` Minchan Kim
2019-05-28  8:31                   ` Daniel Colascione
2019-05-28  8:49                     ` Minchan Kim
2019-05-28  9:08                       ` Michal Hocko
2019-05-28  9:39                         ` Daniel Colascione
2019-05-28 10:33                           ` Michal Hocko
2019-05-28 11:21                             ` Daniel Colascione
2019-05-28 11:49                               ` Michal Hocko
2019-05-28 12:11                                 ` Daniel Colascione
2019-05-28 12:32                                   ` Michal Hocko
2019-05-28 10:32                         ` Minchan Kim
2019-05-28 10:41                           ` Michal Hocko
2019-05-28 11:12                             ` Minchan Kim
2019-05-28 11:28                               ` Michal Hocko
2019-05-28 11:42                                 ` Daniel Colascione
2019-05-28 11:56                                   ` Michal Hocko
2019-05-28 12:18                                     ` Daniel Colascione
2019-05-28 12:38                                       ` Michal Hocko
2019-05-28 12:10                                   ` Minchan Kim
2019-05-28 11:44                                 ` Minchan Kim
2019-05-28 11:51                                   ` Daniel Colascione
2019-05-28 12:06                                   ` Michal Hocko
2019-05-28 12:22                                     ` Minchan Kim
2019-05-28 11:28                             ` Daniel Colascione
2019-05-21 15:33       ` Johannes Weiner
2019-05-22  1:50         ` Minchan Kim
2019-05-29  4:36   ` Hillf Danton
2019-05-30  1:00   ` Minchan Kim
2019-05-20  6:37 ` [RFC 0/7] introduce memory hinting API for external process Anshuman Khandual
2019-05-20 16:59   ` Tim Murray
2019-05-21  2:55     ` Anshuman Khandual
2019-05-21  5:14       ` Minchan Kim
2019-05-21 10:34       ` Michal Hocko
2019-05-28 10:50         ` Anshuman Khandual
2019-05-21 12:56       ` Shakeel Butt
2019-05-22  4:15         ` Brian Geffon
2019-05-22  4:23         ` Brian Geffon
2019-05-20  9:28 ` Michal Hocko
2019-05-20 14:42 ` Oleksandr Natalenko
2019-05-21  2:56   ` Minchan Kim
2019-05-20 16:46 ` Johannes Weiner
2019-05-21  4:39   ` Minchan Kim
2019-05-21  6:32     ` Michal Hocko
2019-05-21  1:44 ` Matthew Wilcox
2019-05-21  5:01   ` Minchan Kim
2019-05-21  6:34   ` Michal Hocko
2019-05-21  8:42 ` Christian Brauner
2019-05-21 11:05   ` Minchan Kim
2019-05-21 11:30     ` Christian Brauner
2019-05-21 11:39       ` Christian Brauner
2019-05-22  5:11         ` Daniel Colascione
2019-05-22  8:22           ` Christian Brauner
2019-05-22 13:16             ` Daniel Colascione
2019-05-22 14:52               ` Christian Brauner
2019-05-22 15:17                 ` Daniel Colascione
2019-05-22 15:48                   ` Christian Brauner
2019-05-22 15:57                     ` Daniel Colascione
2019-05-22 16:01                       ` Christian Brauner
2019-05-22 16:01                         ` Daniel Colascione
2019-05-23 13:07                           ` Minchan Kim
2019-05-27  8:06                             ` Minchan Kim
2019-05-21 11:41       ` Minchan Kim
2019-05-21 12:04         ` Christian Brauner
2019-05-21 12:15           ` Oleksandr Natalenko
2019-05-21 12:53 ` Shakeel Butt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).