* [RFC v1] mm: add page preemption
@ 2019-10-20 13:43 Hillf Danton
  2019-10-20 15:24 ` Matthew Wilcox
                   ` (4 more replies)
  0 siblings, 5 replies; 11+ messages in thread
From: Hillf Danton @ 2019-10-20 13:43 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Michal Hocko, Johannes Weiner,
	Shakeel Butt, Minchan Kim, Mel Gorman, Vladimir Davydov,
	Jan Kara, Hillf Danton


Unlike cpu preemption, page preemption would have been a two-edged
option for quite a while. It is implemented by preventing tasks from
reclaiming as many lru pages as possible from other tasks of
higher priority.

Who needs pp?
Users who want to manage/control jitters in lru pages under memory
pressure. In a way parallel to scheduling taking a task's memory
footprint into account, pp makes task prio a part of page reclaiming.

[Off topic: prio can also be defined and added in the memory controller
and then play a role in memcg reclaiming, for example by checking prio
when selecting the victim memcg.]

First, on the page side, page->prio is added to mirror the prio of
the page owner task, along with a couple of helpers for setting,
copying and comparing page->prio when adding pages to the lru.
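
To make the encoding concrete, here is a sketch (not part of the patch),
assuming the usual MAX_PRIO of 140 and a CFS task at the default prio
of 120:

	set_page_prio(page, 120);	/* stores 120 + MAX_PRIO + 1 = 261 */
	page_prio_valid(page);		/* true, since 261 > MAX_PRIO */
	page_prio(page);		/* recovers 261 - MAX_PRIO - 1 = 120 */

Because set_page_prio() writes only while page_prio_valid() is false,
the value stored by the first caller sticks and is later only
propagated via copy_page_prio().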

Then, on the reclaimer side, pgdat->kswapd_prio is added to mirror
the prio of tasks that wake up the kthread, and it is updated
every time kswapd raises its reclaiming priority.

Finally, priorities on both sides are compared when deactivating lru
pages, and a page is skipped if it is higher on prio.

V1 is based on next-20191018.

Changes since v0
- s/page->nice/page->prio/
- drop the role of kswapd's reclaiming priority in prio comparison
- add pgdat->kswapd_prio

Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Hillf Danton <hdanton@sina.com>
---

--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -281,6 +281,15 @@ config VIRT_TO_BUS
 	  should probably not select this.
 
 
+config PAGE_PREEMPTION
+	bool
+	help
+	  This makes a task unable to reclaim as many lru pages as
+	  possible from other tasks of higher priorities.
+
+	  Say N if unsure.
+
+
 config MMU_NOTIFIER
 	bool
 	select SRCU
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -14,6 +14,7 @@
 #include <linux/uprobes.h>
 #include <linux/page-flags-layout.h>
 #include <linux/workqueue.h>
+#include <linux/sched/prio.h>
 
 #include <asm/mmu.h>
 
@@ -197,6 +198,10 @@ struct page {
 	/* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
 	atomic_t _refcount;
 
+#ifdef CONFIG_PAGE_PREEMPTION
+	int prio; /* mirror page owner task->prio */
+#endif
+
 #ifdef CONFIG_MEMCG
 	struct mem_cgroup *mem_cgroup;
 #endif
@@ -232,6 +237,54 @@ struct page {
 #define page_private(page)		((page)->private)
 #define set_page_private(page, v)	((page)->private = (v))
 
+#ifdef CONFIG_PAGE_PREEMPTION
+static inline bool page_prio_valid(struct page *p)
+{
+	return p->prio > MAX_PRIO;
+}
+
+static inline void set_page_prio(struct page *p, int task_prio)
+{
+	/* store page prio low enough to help khugepaged add lru pages */
+	if (!page_prio_valid(p))
+		p->prio = task_prio + MAX_PRIO + 1;
+}
+
+static inline void copy_page_prio(struct page *to, struct page *from)
+{
+	to->prio = from->prio;
+}
+
+static inline int page_prio(struct page *p)
+{
+	return p->prio - MAX_PRIO - 1;
+}
+
+static inline bool page_prio_higher(struct page *p, int prio)
+{
+	return page_prio(p) < prio;
+}
+#else
+static inline bool page_prio_valid(struct page *p)
+{
+	return true;
+}
+static inline void set_page_prio(struct page *p, int task_prio)
+{
+}
+static inline void copy_page_prio(struct page *to, struct page *from)
+{
+}
+static inline int page_prio(struct page *p)
+{
+	return MAX_PRIO + 1;
+}
+static inline bool page_prio_higher(struct page *p, int prio)
+{
+	return false;
+}
+#endif
+
 struct page_frag_cache {
 	void * va;
 #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -402,6 +402,7 @@ static void __lru_cache_add(struct page
 	struct pagevec *pvec = &get_cpu_var(lru_add_pvec);
 
 	get_page(page);
+	set_page_prio(page, current->prio);
 	if (!pagevec_add(pvec, page) || PageCompound(page))
 		__pagevec_lru_add(pvec);
 	put_cpu_var(lru_add_pvec);
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -647,6 +647,7 @@ void migrate_page_states(struct page *ne
 		end_page_writeback(newpage);
 
 	copy_page_owner(page, newpage);
+	copy_page_prio(newpage, page);
 
 	mem_cgroup_migrate(page, newpage);
 }
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1575,6 +1575,7 @@ static int shmem_replace_page(struct pag
 
 	get_page(newpage);
 	copy_highpage(newpage, oldpage);
+	copy_page_prio(newpage, oldpage);
 	flush_dcache_page(newpage);
 
 	__SetPageLocked(newpage);
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -671,6 +671,7 @@ static void __collapse_huge_page_copy(pt
 			}
 		} else {
 			src_page = pte_page(pteval);
+			copy_page_prio(page, src_page);
 			copy_user_highpage(page, src_page, address, vma);
 			VM_BUG_ON_PAGE(page_mapcount(src_page) != 1, src_page);
 			release_pte_page(src_page);
@@ -1723,6 +1724,7 @@ xa_unlocked:
 				clear_highpage(new_page + (index % HPAGE_PMD_NR));
 				index++;
 			}
+			copy_page_prio(new_page, page);
 			copy_highpage(new_page + (page->index % HPAGE_PMD_NR),
 					page);
 			list_del(&page->lru);
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -738,6 +738,9 @@ typedef struct pglist_data {
 	int kswapd_order;
 	enum zone_type kswapd_classzone_idx;
 
+#ifdef CONFIG_PAGE_PREEMPTION
+	int kswapd_prio; /* mirror task->prio waking up kswapd */
+#endif
 	int kswapd_failures;		/* Number of 'reclaimed == 0' runs */
 
 #ifdef CONFIG_COMPACTION
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -110,6 +110,10 @@ struct scan_control {
 	/* The highest zone to isolate pages for reclaim from */
 	s8 reclaim_idx;
 
+#ifdef CONFIG_PAGE_PREEMPTION
+	s8 __pad;
+	int reclaimer_prio; /* mirror task->prio */
+#endif
 	/* This context's GFP mask */
 	gfp_t gfp_mask;
 
@@ -1710,11 +1714,18 @@ static unsigned long isolate_lru_pages(u
 		total_scan += nr_pages;
 
 		if (page_zonenum(page) > sc->reclaim_idx) {
+next_page:
 			list_move(&page->lru, &pages_skipped);
 			nr_skipped[page_zonenum(page)] += nr_pages;
 			continue;
 		}
 
+		if (IS_ENABLED(CONFIG_PAGE_PREEMPTION) &&
+		    is_active_lru(lru) &&
+		    global_reclaim(sc) &&
+		    page_prio_higher(page, sc->reclaimer_prio))
+			goto next_page;
+
 		/*
 		 * Do not count skipped pages because that makes the function
 		 * return with no isolated pages if the LRU mostly contains
@@ -3260,6 +3271,9 @@ unsigned long try_to_free_pages(struct z
 	unsigned long nr_reclaimed;
 	struct scan_control sc = {
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
+#ifdef CONFIG_PAGE_PREEMPTION
+		.reclaimer_prio = current->prio,
+#endif
 		.gfp_mask = current_gfp_context(gfp_mask),
 		.reclaim_idx = gfp_zone(gfp_mask),
 		.order = order,
@@ -3586,6 +3600,9 @@ static int balance_pgdat(pg_data_t *pgda
 	bool boosted;
 	struct zone *zone;
 	struct scan_control sc = {
+#ifdef CONFIG_PAGE_PREEMPTION
+		.reclaimer_prio = pgdat->kswapd_prio,
+#endif
 		.gfp_mask = GFP_KERNEL,
 		.order = order,
 		.may_unmap = 1,
@@ -3739,6 +3756,9 @@ restart:
 		if (nr_boost_reclaim && !nr_reclaimed)
 			break;
 
+		if (IS_ENABLED(CONFIG_PAGE_PREEMPTION))
+			sc.reclaimer_prio = pgdat->kswapd_prio;
+
 		if (raise_priority || !nr_reclaimed)
 			sc.priority--;
 	} while (sc.priority >= 1);
@@ -3831,6 +3851,10 @@ static void kswapd_try_to_sleep(pg_data_
 		 */
 		wakeup_kcompactd(pgdat, alloc_order, classzone_idx);
 
+		if (IS_ENABLED(CONFIG_PAGE_PREEMPTION)) {
+			pgdat->kswapd_prio = MAX_PRIO + 1;
+			smp_wmb();
+		}
 		remaining = schedule_timeout(HZ/10);
 
 		/*
@@ -3865,8 +3889,13 @@ static void kswapd_try_to_sleep(pg_data_
 		 */
 		set_pgdat_percpu_threshold(pgdat, calculate_normal_threshold);
 
-		if (!kthread_should_stop())
+		if (!kthread_should_stop()) {
+			if (IS_ENABLED(CONFIG_PAGE_PREEMPTION)) {
+				pgdat->kswapd_prio = MAX_PRIO + 1;
+				smp_wmb();
+			}
 			schedule();
+		}
 
 		set_pgdat_percpu_threshold(pgdat, calculate_pressure_threshold);
 	} else {
@@ -3917,6 +3946,8 @@ static int kswapd(void *p)
 	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
 	set_freezable();
 
+	if (IS_ENABLED(CONFIG_PAGE_PREEMPTION))
+		pgdat->kswapd_prio = MAX_PRIO + 1;
 	pgdat->kswapd_order = 0;
 	pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
 	for ( ; ; ) {
@@ -3985,6 +4016,17 @@ void wakeup_kswapd(struct zone *zone, gf
 		return;
 	pgdat = zone->zone_pgdat;
 
+	if (IS_ENABLED(CONFIG_PAGE_PREEMPTION)) {
+		int prio = current->prio;
+
+		if (pgdat->kswapd_prio < prio) {
+			smp_rmb();
+			return;
+		}
+		pgdat->kswapd_prio = prio;
+		smp_wmb();
+	}
+
 	if (pgdat->kswapd_classzone_idx == MAX_NR_ZONES)
 		pgdat->kswapd_classzone_idx = classzone_idx;
 	else
--




* Re: [RFC v1] mm: add page preemption
  2019-10-20 13:43 [RFC v1] mm: add page preemption Hillf Danton
@ 2019-10-20 15:24 ` Matthew Wilcox
  2019-10-21 12:27 ` Michal Hocko
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 11+ messages in thread
From: Matthew Wilcox @ 2019-10-20 15:24 UTC (permalink / raw)
  To: Hillf Danton
  Cc: linux-mm, Andrew Morton, linux-kernel, Michal Hocko,
	Johannes Weiner, Shakeel Butt, Minchan Kim, Mel Gorman,
	Vladimir Davydov, Jan Kara

On Sun, Oct 20, 2019 at 09:43:04PM +0800, Hillf Danton wrote:
> First on the page side, page->prio that is used to mirror the prio
> of page owner tasks is added, and a couple of helpers for setting,
> copying and comparing page->prio to help to add pages to lru.

Um, no.  struct page is 64 bytes and shall remain so without a very very
good reason.

> @@ -197,6 +198,10 @@ struct page {
>  	/* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
>  	atomic_t _refcount;
>  
> +#ifdef CONFIG_PAGE_PREEMPTION
> +	int prio; /* mirror page owner task->prio */
> +#endif
> +
>  #ifdef CONFIG_MEMCG
>  	struct mem_cgroup *mem_cgroup;
>  #endif


* Re: [RFC v1] mm: add page preemption
  2019-10-20 13:43 [RFC v1] mm: add page preemption Hillf Danton
  2019-10-20 15:24 ` Matthew Wilcox
@ 2019-10-21 12:27 ` Michal Hocko
  2019-10-22  1:41 ` Hillf Danton
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 11+ messages in thread
From: Michal Hocko @ 2019-10-21 12:27 UTC (permalink / raw)
  To: Hillf Danton
  Cc: linux-mm, Andrew Morton, linux-kernel, Johannes Weiner,
	Shakeel Butt, Minchan Kim, Mel Gorman, Vladimir Davydov,
	Jan Kara

On Sun 20-10-19 21:43:04, Hillf Danton wrote:
> 
> Unlike cpu preemption, page preemption would have been a two-edge
> option for quite a while. It is added by preventing tasks from
> reclaiming as many lru pages as possible from other tasks of
> higher priorities.

This really begs for more explanation. I would start with what page
preemption actually is. Why do we care and which workloads would benefit
and how much. And last but not least why the existing infrastructure
doesn't help (e.g. if you have clearly defined workloads with different
memory consumption requirements then why don't you use memory cgroups to
reflect the priority).
 
> Who need pp?
> Users who want to manage/control jitters in lru pages under memory
> pressure. Way in parallel to scheduling with task's memory footprint
> taken into account, pp makes task prio a part of page reclaiming.

How do you assign priority to generally shared pages?

> [Off topic: prio can also be defined and added in memory controller
> and then plays a role in memcg reclaiming, for example check prio
> when selecting victim memcg.]
> 
> First on the page side, page->prio that is used to mirror the prio
> of page owner tasks is added, and a couple of helpers for setting,
> copying and comparing page->prio to help to add pages to lru.
> 
> Then on the reclaimer side, pgdat->kswapd_prio is added to mirror
> the prio of tasks that wake up the kthread, and it is updated
> every time kswapd raises its reclaiming priority.

This sounds like a very bad design to me. You essentially hand over
to a completely detached context while you want to handle priority
inversion problems (or at least that is what I think you want).

> Finally priorities on both sides are compared when deactivating lru
> pages, and skip page if it is higher on prio.
> 
> V1 is based on next-20191018.
> 
> Changes since v0
> - s/page->nice/page->prio/
> - drop the role of kswapd's reclaiming prioirty in prio comparison
> - add pgdat->kswapd_prio
> 
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Shakeel Butt <shakeelb@google.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Jan Kara <jack@suse.cz>
> Signed-off-by: Hillf Danton <hdanton@sina.com>
> ---
> 
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -281,6 +281,15 @@ config VIRT_TO_BUS
>  	  should probably not select this.
>  
>  
> +config PAGE_PREEMPTION
> +	bool
> +	help
> +	  This makes a task unable to reclaim as many lru pages as
> +	  possible from other tasks of higher priorities.
> +
> +	  Say N if unsure.
> +
> +
>  config MMU_NOTIFIER
>  	bool
>  	select SRCU
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -14,6 +14,7 @@
>  #include <linux/uprobes.h>
>  #include <linux/page-flags-layout.h>
>  #include <linux/workqueue.h>
> +#include <linux/sched/prio.h>
>  
>  #include <asm/mmu.h>
>  
> @@ -197,6 +198,10 @@ struct page {
>  	/* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
>  	atomic_t _refcount;
>  
> +#ifdef CONFIG_PAGE_PREEMPTION
> +	int prio; /* mirror page owner task->prio */
> +#endif
> +
>  #ifdef CONFIG_MEMCG
>  	struct mem_cgroup *mem_cgroup;
>  #endif
> @@ -232,6 +237,54 @@ struct page {
>  #define page_private(page)		((page)->private)
>  #define set_page_private(page, v)	((page)->private = (v))
>  
> +#ifdef CONFIG_PAGE_PREEMPTION
> +static inline bool page_prio_valid(struct page *p)
> +{
> +	return p->prio > MAX_PRIO;
> +}
> +
> +static inline void set_page_prio(struct page *p, int task_prio)
> +{
> +	/* store page prio low enough to help khugepaged add lru pages */
> +	if (!page_prio_valid(p))
> +		p->prio = task_prio + MAX_PRIO + 1;
> +}
> +
> +static inline void copy_page_prio(struct page *to, struct page *from)
> +{
> +	to->prio = from->prio;
> +}
> +
> +static inline int page_prio(struct page *p)
> +{
> +	return p->prio - MAX_PRIO - 1;
> +}
> +
> +static inline bool page_prio_higher(struct page *p, int prio)
> +{
> +	return page_prio(p) < prio;
> +}
> +#else
> +static inline bool page_prio_valid(struct page *p)
> +{
> +	return true;
> +}
> +static inline void set_page_prio(struct page *p, int task_prio)
> +{
> +}
> +static inline void copy_page_prio(struct page *to, struct page *from)
> +{
> +}
> +static inline int page_prio(struct page *p)
> +{
> +	return MAX_PRIO + 1;
> +}
> +static inline bool page_prio_higher(struct page *p, int prio)
> +{
> +	return false;
> +}
> +#endif
> +
>  struct page_frag_cache {
>  	void * va;
>  #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -402,6 +402,7 @@ static void __lru_cache_add(struct page
>  	struct pagevec *pvec = &get_cpu_var(lru_add_pvec);
>  
>  	get_page(page);
> +	set_page_prio(page, current->prio);
>  	if (!pagevec_add(pvec, page) || PageCompound(page))
>  		__pagevec_lru_add(pvec);
>  	put_cpu_var(lru_add_pvec);
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -647,6 +647,7 @@ void migrate_page_states(struct page *ne
>  		end_page_writeback(newpage);
>  
>  	copy_page_owner(page, newpage);
> +	copy_page_prio(newpage, page);
>  
>  	mem_cgroup_migrate(page, newpage);
>  }
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1575,6 +1575,7 @@ static int shmem_replace_page(struct pag
>  
>  	get_page(newpage);
>  	copy_highpage(newpage, oldpage);
> +	copy_page_prio(newpage, oldpage);
>  	flush_dcache_page(newpage);
>  
>  	__SetPageLocked(newpage);
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -671,6 +671,7 @@ static void __collapse_huge_page_copy(pt
>  			}
>  		} else {
>  			src_page = pte_page(pteval);
> +			copy_page_prio(page, src_page);
>  			copy_user_highpage(page, src_page, address, vma);
>  			VM_BUG_ON_PAGE(page_mapcount(src_page) != 1, src_page);
>  			release_pte_page(src_page);
> @@ -1723,6 +1724,7 @@ xa_unlocked:
>  				clear_highpage(new_page + (index % HPAGE_PMD_NR));
>  				index++;
>  			}
> +			copy_page_prio(new_page, page);
>  			copy_highpage(new_page + (page->index % HPAGE_PMD_NR),
>  					page);
>  			list_del(&page->lru);
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -738,6 +738,9 @@ typedef struct pglist_data {
>  	int kswapd_order;
>  	enum zone_type kswapd_classzone_idx;
>  
> +#ifdef CONFIG_PAGE_PREEMPTION
> +	int kswapd_prio; /* mirror task->prio waking up kswapd */
> +#endif
>  	int kswapd_failures;		/* Number of 'reclaimed == 0' runs */
>  
>  #ifdef CONFIG_COMPACTION
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -110,6 +110,10 @@ struct scan_control {
>  	/* The highest zone to isolate pages for reclaim from */
>  	s8 reclaim_idx;
>  
> +#ifdef CONFIG_PAGE_PREEMPTION
> +	s8 __pad;
> +	int reclaimer_prio; /* mirror task->prio */
> +#endif
>  	/* This context's GFP mask */
>  	gfp_t gfp_mask;
>  
> @@ -1710,11 +1714,18 @@ static unsigned long isolate_lru_pages(u
>  		total_scan += nr_pages;
>  
>  		if (page_zonenum(page) > sc->reclaim_idx) {
> +next_page:
>  			list_move(&page->lru, &pages_skipped);
>  			nr_skipped[page_zonenum(page)] += nr_pages;
>  			continue;
>  		}
>  
> +		if (IS_ENABLED(CONFIG_PAGE_PREEMPTION) &&
> +		    is_active_lru(lru) &&
> +		    global_reclaim(sc) &&
> +		    page_prio_higher(page, sc->reclaimer_prio))
> +			goto next_page;
> +
>  		/*
>  		 * Do not count skipped pages because that makes the function
>  		 * return with no isolated pages if the LRU mostly contains
> @@ -3260,6 +3271,9 @@ unsigned long try_to_free_pages(struct z
>  	unsigned long nr_reclaimed;
>  	struct scan_control sc = {
>  		.nr_to_reclaim = SWAP_CLUSTER_MAX,
> +#ifdef CONFIG_PAGE_PREEMPTION
> +		.reclaimer_prio = current->prio,
> +#endif
>  		.gfp_mask = current_gfp_context(gfp_mask),
>  		.reclaim_idx = gfp_zone(gfp_mask),
>  		.order = order,
> @@ -3586,6 +3600,9 @@ static int balance_pgdat(pg_data_t *pgda
>  	bool boosted;
>  	struct zone *zone;
>  	struct scan_control sc = {
> +#ifdef CONFIG_PAGE_PREEMPTION
> +		.reclaimer_prio = pgdat->kswapd_prio,
> +#endif
>  		.gfp_mask = GFP_KERNEL,
>  		.order = order,
>  		.may_unmap = 1,
> @@ -3739,6 +3756,9 @@ restart:
>  		if (nr_boost_reclaim && !nr_reclaimed)
>  			break;
>  
> +		if (IS_ENABLED(CONFIG_PAGE_PREEMPTION))
> +			sc.reclaimer_prio = pgdat->kswapd_prio;
> +
>  		if (raise_priority || !nr_reclaimed)
>  			sc.priority--;
>  	} while (sc.priority >= 1);
> @@ -3831,6 +3851,10 @@ static void kswapd_try_to_sleep(pg_data_
>  		 */
>  		wakeup_kcompactd(pgdat, alloc_order, classzone_idx);
>  
> +		if (IS_ENABLED(CONFIG_PAGE_PREEMPTION)) {
> +			pgdat->kswapd_prio = MAX_PRIO + 1;
> +			smp_wmb();
> +		}
>  		remaining = schedule_timeout(HZ/10);
>  
>  		/*
> @@ -3865,8 +3889,13 @@ static void kswapd_try_to_sleep(pg_data_
>  		 */
>  		set_pgdat_percpu_threshold(pgdat, calculate_normal_threshold);
>  
> -		if (!kthread_should_stop())
> +		if (!kthread_should_stop()) {
> +			if (IS_ENABLED(CONFIG_PAGE_PREEMPTION)) {
> +				pgdat->kswapd_prio = MAX_PRIO + 1;
> +				smp_wmb();
> +			}
>  			schedule();
> +		}
>  
>  		set_pgdat_percpu_threshold(pgdat, calculate_pressure_threshold);
>  	} else {
> @@ -3917,6 +3946,8 @@ static int kswapd(void *p)
>  	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
>  	set_freezable();
>  
> +	if (IS_ENABLED(CONFIG_PAGE_PREEMPTION))
> +		pgdat->kswapd_prio = MAX_PRIO + 1;
>  	pgdat->kswapd_order = 0;
>  	pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
>  	for ( ; ; ) {
> @@ -3985,6 +4016,17 @@ void wakeup_kswapd(struct zone *zone, gf
>  		return;
>  	pgdat = zone->zone_pgdat;
>  
> +	if (IS_ENABLED(CONFIG_PAGE_PREEMPTION)) {
> +		int prio = current->prio;
> +
> +		if (pgdat->kswapd_prio < prio) {
> +			smp_rmb();
> +			return;
> +		}
> +		pgdat->kswapd_prio = prio;
> +		smp_wmb();
> +	}
> +
>  	if (pgdat->kswapd_classzone_idx == MAX_NR_ZONES)
>  		pgdat->kswapd_classzone_idx = classzone_idx;
>  	else
> --
> 

-- 
Michal Hocko
SUSE Labs


* Re: [RFC v1] mm: add page preemption
  2019-10-20 13:43 [RFC v1] mm: add page preemption Hillf Danton
  2019-10-20 15:24 ` Matthew Wilcox
  2019-10-21 12:27 ` Michal Hocko
@ 2019-10-22  1:41 ` Hillf Danton
  2019-10-22  5:40 ` kbuild test robot
  2019-10-22 12:14 ` Hillf Danton
  4 siblings, 0 replies; 11+ messages in thread
From: Hillf Danton @ 2019-10-22  1:41 UTC (permalink / raw)
  To: Matthew Wilcox, Hillf Danton
  Cc: linux-mm, Andrew Morton, linux-kernel, Michal Hocko,
	Johannes Weiner, Shakeel Butt, Minchan Kim, Mel Gorman,
	Vladimir Davydov, Jan Kara


On Sun, 20 Oct 2019 08:24:00 -0700 Matthew Wilcox wrote:
> On Sun, Oct 20, 2019 at 09:43:04PM +0800, Hillf Danton wrote:
> > First on the page side, page->prio that is used to mirror the prio
> > of page owner tasks is added, and a couple of helpers for setting,
> > copying and comparing page->prio to help to add pages to lru.
> 
> Um, no.  struct page is 64 bytes and shall remain so without a very very
> good reason.

Queued on todo list. Thanks:)

Hillf




* Re: [RFC v1] mm: add page preemption
  2019-10-20 13:43 [RFC v1] mm: add page preemption Hillf Danton
                   ` (2 preceding siblings ...)
  2019-10-22  1:41 ` Hillf Danton
@ 2019-10-22  5:40 ` kbuild test robot
  2019-10-22 12:14 ` Hillf Danton
  4 siblings, 0 replies; 11+ messages in thread
From: kbuild test robot @ 2019-10-22  5:40 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 10948 bytes --]

Hi Hillf,

[FYI, it's a private test report for your RFC patch.]
[auto build test WARNING on mmotm/master]

url:    https://github.com/0day-ci/linux/commits/Hillf-Danton/mm-add-page-preemption/20191022-083630
base:   git://git.cmpxchg.org/linux-mmotm.git master
config: x86_64-randconfig-f003-201942 (attached as .config)
compiler: gcc-7 (Debian 7.4.0-14) 7.4.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   In file included from include/asm-generic/bug.h:5:0,
                    from arch/x86/include/asm/bug.h:83,
                    from include/linux/bug.h:5,
                    from include/linux/mmdebug.h:5,
                    from include/linux/mm.h:9,
                    from mm/vmscan.c:17:
   mm/vmscan.c: In function 'isolate_lru_pages':
   mm/vmscan.c:1726:34: error: 'struct scan_control' has no member named 'reclaimer_prio'; did you mean 'reclaim_idx'?
          page_prio_higher(page, sc->reclaimer_prio))
                                     ^
   include/linux/compiler.h:58:52: note: in definition of macro '__trace_if_var'
    #define __trace_if_var(cond) (__builtin_constant_p(cond) ? (cond) : __trace_if_value(cond))
                                                       ^~~~
>> mm/vmscan.c:1723:3: note: in expansion of macro 'if'
      if (IS_ENABLED(CONFIG_PAGE_PREEMPTION) &&
      ^~
   mm/vmscan.c:1726:34: error: 'struct scan_control' has no member named 'reclaimer_prio'; did you mean 'reclaim_idx'?
          page_prio_higher(page, sc->reclaimer_prio))
                                     ^
   include/linux/compiler.h:58:61: note: in definition of macro '__trace_if_var'
    #define __trace_if_var(cond) (__builtin_constant_p(cond) ? (cond) : __trace_if_value(cond))
                                                                ^~~~
>> mm/vmscan.c:1723:3: note: in expansion of macro 'if'
      if (IS_ENABLED(CONFIG_PAGE_PREEMPTION) &&
      ^~
   mm/vmscan.c:1726:34: error: 'struct scan_control' has no member named 'reclaimer_prio'; did you mean 'reclaim_idx'?
          page_prio_higher(page, sc->reclaimer_prio))
                                     ^
   include/linux/compiler.h:69:3: note: in definition of macro '__trace_if_value'
     (cond) ?     \
      ^~~~
   include/linux/compiler.h:56:28: note: in expansion of macro '__trace_if_var'
    #define if(cond, ...) if ( __trace_if_var( !!(cond , ## __VA_ARGS__) ) )
                               ^~~~~~~~~~~~~~
>> mm/vmscan.c:1723:3: note: in expansion of macro 'if'
      if (IS_ENABLED(CONFIG_PAGE_PREEMPTION) &&
      ^~
   mm/vmscan.c: In function 'balance_pgdat':
   mm/vmscan.c:3759:7: error: 'struct scan_control' has no member named 'reclaimer_prio'; did you mean 'reclaim_idx'?
       sc.reclaimer_prio = pgdat->kswapd_prio;
          ^~~~~~~~~~~~~~
          reclaim_idx
   mm/vmscan.c:3759:31: error: 'pg_data_t {aka struct pglist_data}' has no member named 'kswapd_prio'; did you mean 'kswapd_wait'?
       sc.reclaimer_prio = pgdat->kswapd_prio;
                                  ^~~~~~~~~~~
                                  kswapd_wait
   mm/vmscan.c: In function 'kswapd_try_to_sleep':
   mm/vmscan.c:3854:11: error: 'pg_data_t {aka struct pglist_data}' has no member named 'kswapd_prio'; did you mean 'kswapd_wait'?
       pgdat->kswapd_prio = MAX_PRIO + 1;
              ^~~~~~~~~~~
              kswapd_wait
   mm/vmscan.c:3893:12: error: 'pg_data_t {aka struct pglist_data}' has no member named 'kswapd_prio'; did you mean 'kswapd_wait'?
        pgdat->kswapd_prio = MAX_PRIO + 1;
               ^~~~~~~~~~~
               kswapd_wait
   mm/vmscan.c: In function 'kswapd':
   mm/vmscan.c:3949:10: error: 'pg_data_t {aka struct pglist_data}' has no member named 'kswapd_prio'; did you mean 'kswapd_wait'?
      pgdat->kswapd_prio = MAX_PRIO + 1;
             ^~~~~~~~~~~
             kswapd_wait
   In file included from include/asm-generic/bug.h:5:0,
                    from arch/x86/include/asm/bug.h:83,
                    from include/linux/bug.h:5,
                    from include/linux/mmdebug.h:5,
                    from include/linux/mm.h:9,
                    from mm/vmscan.c:17:
   mm/vmscan.c: In function 'wakeup_kswapd':
   mm/vmscan.c:4021:14: error: 'pg_data_t {aka struct pglist_data}' has no member named 'kswapd_prio'; did you mean 'kswapd_wait'?
      if (pgdat->kswapd_prio < prio) {
                 ^
   include/linux/compiler.h:58:52: note: in definition of macro '__trace_if_var'
    #define __trace_if_var(cond) (__builtin_constant_p(cond) ? (cond) : __trace_if_value(cond))
                                                       ^~~~
   mm/vmscan.c:4021:3: note: in expansion of macro 'if'
      if (pgdat->kswapd_prio < prio) {
      ^~
   mm/vmscan.c:4021:14: error: 'pg_data_t {aka struct pglist_data}' has no member named 'kswapd_prio'; did you mean 'kswapd_wait'?
      if (pgdat->kswapd_prio < prio) {
                 ^
   include/linux/compiler.h:58:61: note: in definition of macro '__trace_if_var'
    #define __trace_if_var(cond) (__builtin_constant_p(cond) ? (cond) : __trace_if_value(cond))
                                                                ^~~~
   mm/vmscan.c:4021:3: note: in expansion of macro 'if'
      if (pgdat->kswapd_prio < prio) {
      ^~
   mm/vmscan.c:4021:14: error: 'pg_data_t {aka struct pglist_data}' has no member named 'kswapd_prio'; did you mean 'kswapd_wait'?
      if (pgdat->kswapd_prio < prio) {
                 ^
   include/linux/compiler.h:69:3: note: in definition of macro '__trace_if_value'
     (cond) ?     \
      ^~~~
   include/linux/compiler.h:56:28: note: in expansion of macro '__trace_if_var'
    #define if(cond, ...) if ( __trace_if_var( !!(cond , ## __VA_ARGS__) ) )
                               ^~~~~~~~~~~~~~
   mm/vmscan.c:4021:3: note: in expansion of macro 'if'
      if (pgdat->kswapd_prio < prio) {
      ^~
   mm/vmscan.c:4025:10: error: 'pg_data_t {aka struct pglist_data}' has no member named 'kswapd_prio'; did you mean 'kswapd_wait'?
      pgdat->kswapd_prio = prio;
             ^~~~~~~~~~~
             kswapd_wait

vim +/if +1723 mm/vmscan.c

  1668	
  1669	/**
  1670	 * pgdat->lru_lock is heavily contended.  Some of the functions that
  1671	 * shrink the lists perform better by taking out a batch of pages
  1672	 * and working on them outside the LRU lock.
  1673	 *
  1674	 * For pagecache intensive workloads, this function is the hottest
  1675	 * spot in the kernel (apart from copy_*_user functions).
  1676	 *
  1677	 * Appropriate locks must be held before calling this function.
  1678	 *
  1679	 * @nr_to_scan:	The number of eligible pages to look through on the list.
  1680	 * @lruvec:	The LRU vector to pull pages from.
  1681	 * @dst:	The temp list to put pages on to.
  1682	 * @nr_scanned:	The number of pages that were scanned.
  1683	 * @sc:		The scan_control struct for this reclaim session
  1684	 * @mode:	One of the LRU isolation modes
  1685	 * @lru:	LRU list id for isolating
  1686	 *
  1687	 * returns how many pages were moved onto *@dst.
  1688	 */
  1689	static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
  1690			struct lruvec *lruvec, struct list_head *dst,
  1691			unsigned long *nr_scanned, struct scan_control *sc,
  1692			enum lru_list lru)
  1693	{
  1694		struct list_head *src = &lruvec->lists[lru];
  1695		unsigned long nr_taken = 0;
  1696		unsigned long nr_zone_taken[MAX_NR_ZONES] = { 0 };
  1697		unsigned long nr_skipped[MAX_NR_ZONES] = { 0, };
  1698		unsigned long skipped = 0;
  1699		unsigned long scan, total_scan, nr_pages;
  1700		LIST_HEAD(pages_skipped);
  1701		isolate_mode_t mode = (sc->may_unmap ? 0 : ISOLATE_UNMAPPED);
  1702	
  1703		total_scan = 0;
  1704		scan = 0;
  1705		while (scan < nr_to_scan && !list_empty(src)) {
  1706			struct page *page;
  1707	
  1708			page = lru_to_page(src);
  1709			prefetchw_prev_lru_page(page, src, flags);
  1710	
  1711			VM_BUG_ON_PAGE(!PageLRU(page), page);
  1712	
  1713			nr_pages = compound_nr(page);
  1714			total_scan += nr_pages;
  1715	
  1716			if (page_zonenum(page) > sc->reclaim_idx) {
  1717	next_page:
  1718				list_move(&page->lru, &pages_skipped);
  1719				nr_skipped[page_zonenum(page)] += nr_pages;
  1720				continue;
  1721			}
  1722	
> 1723			if (IS_ENABLED(CONFIG_PAGE_PREEMPTION) &&
  1724			    is_active_lru(lru) &&
  1725			    global_reclaim(sc) &&
  1726			    page_prio_higher(page, sc->reclaimer_prio))
  1727				goto next_page;
  1728	
  1729			/*
  1730			 * Do not count skipped pages because that makes the function
  1731			 * return with no isolated pages if the LRU mostly contains
  1732			 * ineligible pages.  This causes the VM to not reclaim any
  1733			 * pages, triggering a premature OOM.
  1734			 *
  1735			 * Account all tail pages of THP.  This would not cause
  1736			 * premature OOM since __isolate_lru_page() returns -EBUSY
  1737			 * only when the page is being freed somewhere else.
  1738			 */
  1739			scan += nr_pages;
  1740			switch (__isolate_lru_page(page, mode)) {
  1741			case 0:
  1742				nr_taken += nr_pages;
  1743				nr_zone_taken[page_zonenum(page)] += nr_pages;
  1744				list_move(&page->lru, dst);
  1745				break;
  1746	
  1747			case -EBUSY:
  1748				/* else it is being freed elsewhere */
  1749				list_move(&page->lru, src);
  1750				continue;
  1751	
  1752			default:
  1753				BUG();
  1754			}
  1755		}
  1756	
  1757		/*
  1758		 * Splice any skipped pages to the start of the LRU list. Note that
  1759		 * this disrupts the LRU order when reclaiming for lower zones but
  1760		 * we cannot splice to the tail. If we did then the SWAP_CLUSTER_MAX
  1761		 * scanning would soon rescan the same pages to skip and put the
  1762		 * system at risk of premature OOM.
  1763		 */
  1764		if (!list_empty(&pages_skipped)) {
  1765			int zid;
  1766	
  1767			list_splice(&pages_skipped, src);
  1768			for (zid = 0; zid < MAX_NR_ZONES; zid++) {
  1769				if (!nr_skipped[zid])
  1770					continue;
  1771	
  1772				__count_zid_vm_events(PGSCAN_SKIP, zid, nr_skipped[zid]);
  1773				skipped += nr_skipped[zid];
  1774			}
  1775		}
  1776		*nr_scanned = total_scan;
  1777		trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan,
  1778					    total_scan, skipped, nr_taken, mode, lru);
  1779		update_lru_sizes(lruvec, lru, nr_zone_taken);
  1780		return nr_taken;
  1781	}
  1782	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 32269 bytes --]


* Re: [RFC v1] mm: add page preemption
  2019-10-20 13:43 [RFC v1] mm: add page preemption Hillf Danton
                   ` (3 preceding siblings ...)
  2019-10-22  5:40 ` kbuild test robot
@ 2019-10-22 12:14 ` Hillf Danton
  2019-10-22 12:42   ` Michal Hocko
  2019-10-22 14:28   ` Hillf Danton
  4 siblings, 2 replies; 11+ messages in thread
From: Hillf Danton @ 2019-10-22 12:14 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hillf Danton, linux-mm, Andrew Morton, linux-kernel,
	Johannes Weiner, Shakeel Butt, Minchan Kim, Mel Gorman,
	Vladimir Davydov, Jan Kara


On Mon, 21 Oct 2019 14:27:28 +0200 Michal Hocko wrote:
> 
> On Sun 20-10-19 21:43:04, Hillf Danton wrote:
> > 
> > Unlike cpu preemption, page preemption would have been a two-edge
> > option for quite a while. It is added by preventing tasks from
> > reclaiming as many lru pages as possible from other tasks of
> > higher priorities.
> 
> This really begs for more explanation.
> I would start with what page preemption actually is.

Page preemption is the system-provided facility that makes a task
able to preempt other tasks of lower priority for pages in the
same manner as for cpu.

> Why do we care and which workloads would benefit and how much.

Page preemption, disabled by default, should be turned on by those
who wish the performance of their workloads to survive memory
pressure to a certain extent.

The number of pp users is supposed to be near the number of people
who change the nice value of their apps either to -1 or higher at
least once a week, fewer than vi users among UK's undergraduates.

> And last but not least why the existing infrastructure doesn't help
> (e.g. if you have clearly defined workloads with different
> memory consumption requirements then why don't you use memory cgroups to
> reflect the priority).

Good question:)

Though pp is implemented by preventing any task from reclaiming as many
pages as possible from other tasks that are higher on priority, it is
trying to introduce prio into page reclaiming, to add a feature.

Page and memcg are different objects after all; pp is being added at
the page granularity. It should be an option available in environments
without memcg enabled.

What is very different from the protections offered by the memory cgroup
is that pages protected by memcg:min/low can't be reclaimed regardless
of memory pressure. Such a guarantee is not available under pp, as it
only suggests an extra factor to consider when deactivating lru pages.

Adding prio in the memory controller is another good topic, already
queued after pp and memcg lru on the todo list.

> > Who need pp?
> > Users who want to manage/control jitters in lru pages under memory
> > pressure. Way in parallel to scheduling with task's memory footprint
> > taken into account, pp makes task prio a part of page reclaiming.
> 
> How do you assign priority to generally shared pages?

It is solved by setting page prio only when pages are added to the lru;
prio will never change thereafter.
There is a helper, copy_page_prio(new_page, old_page), for scenarios like
migrating pages.
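
For a generally shared page, that boils down to the __lru_cache_add()
hunk in the patch, shown here as a sketch: whichever task first puts
the page on an lru stores its prio, and later callers are no-ops since
set_page_prio() writes only while page_prio_valid() is false.

	get_page(page);
	set_page_prio(page, current->prio);	/* first caller wins */
	if (!pagevec_add(pvec, page) || PageCompound(page))
		__pagevec_lru_add(pvec);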

> > [Off topic: prio can also be defined and added in memory controller
> > and then plays a role in memcg reclaiming, for example check prio
> > when selecting victim memcg.]
> > 
> > First on the page side, page->prio that is used to mirror the prio
> > of page owner tasks is added, and a couple of helpers for setting,
> > copying and comparing page->prio to help to add pages to lru.
> > 
> > Then on the reclaimer side, pgdat->kswapd_prio is added to mirror
> > the prio of tasks that wake up the kthread, and it is updated
> > every time kswapd raises its reclaiming priority.
> 
> This sounds like a very bad design to me. You essentially hand over
> to a completely detached context while you want to handle priority
> inversion problems (or at least that is what I think you want).

What was added on the reclaimer side is as follows (a condensed sketch
appears after the list):

1, kswapd sets pgdat->kswapd_prio, the switch between page reclaimer
   and allocator in terms of prio, to the lowest value before taking
   a nap.

2, any allocator is able to wake up the reclaimer because of the
   lowest prio, and it starts reclaiming pages using the waker's prio.

3, if an allocator comes while kswapd is active, its prio is checked:
   it is a no-op if kswapd is higher on prio; otherwise the switch is
   updated with the higher prio.

4, every time kswapd raises sc.priority, which starts at DEF_PRIORITY,
   it checks whether there is a pending update of the switch; kswapd's
   prio steps up if there is one, so its prio never steps down, nor is
   there prio inversion.

5, goto 1 when kswapd finishes its work.
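
Condensed into a sketch (paraphrasing the mm/vmscan.c hunks in the
patch; a lower numerical value means a higher prio):

	/* kswapd side, before napping (kswapd_try_to_sleep()): */
	pgdat->kswapd_prio = MAX_PRIO + 1;	/* lowest possible prio */
	smp_wmb();
	schedule();

	/* allocator side (wakeup_kswapd()): */
	int prio = current->prio;

	if (pgdat->kswapd_prio < prio)
		return;			/* kswapd already runs at a higher prio */
	pgdat->kswapd_prio = prio;
	smp_wmb();

	/* kswapd side, whenever sc.priority is raised (balance_pgdat()): */
	sc.reclaimer_prio = pgdat->kswapd_prio;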

Thanks
Hillf




* Re: [RFC v1] mm: add page preemption
  2019-10-22 12:14 ` Hillf Danton
@ 2019-10-22 12:42   ` Michal Hocko
  2019-10-22 14:28   ` Hillf Danton
  1 sibling, 0 replies; 11+ messages in thread
From: Michal Hocko @ 2019-10-22 12:42 UTC (permalink / raw)
  To: Hillf Danton
  Cc: linux-mm, Andrew Morton, linux-kernel, Johannes Weiner,
	Shakeel Butt, Minchan Kim, Mel Gorman, Vladimir Davydov,
	Jan Kara

On Tue 22-10-19 20:14:39, Hillf Danton wrote:
> 
> On Mon, 21 Oct 2019 14:27:28 +0200 Michal Hocko wrote:
[...]
> > Why do we care and which workloads would benefit and how much.
> 
> Page preemption, disabled by default, should be turned on by those
> who wish that the performance of their workloads can survive memory
> pressure to certain extent.

I am sorry but this doesn't say anything to me. How come not all
workloads would fit that description?
 
> The number of pp users is supposed near the people who change the
> nice value of their apps either to -1 or higher at least once a week,
> less than vi users among UK's undergraduates.
> 
> > And last but not least why the existing infrastructure doesn't help
> > (e.g. if you have clearly defined workloads with different
> > memory consumption requirements then why don't you use memory cgroups to
> > reflect the priority).
> 
> Good question:)
> 
> Though pp is implemented by preventing any task from reclaiming as many
> pages as possible from other tasks that are higher on priority, it is
> trying to introduce prio into page reclaiming, to add a feature.
> 
> Page and memcg are different objects after all; pp is being added at
> the page granularity. It should be an option available in environments
> without memcg enabled.

So do you actually want to establish LRUs per priority? Why is using
memcgs not an option? This is the main facility to partition reclaimable
memory in the first place. You should really focus on explaining much
more thoroughly why a much more fine grained control is needed.

> What is way different from the protections offered by memory cgroup
> is that pages protected by memcg:min/low can't be reclaimed regardless
> of memory pressure. Such guarantee is not available under pp as it only
> suggests an extra factor to consider on deactivating lru pages.

Well, low limit can be breached if there is no eligible memcg to be
reclaimed. That means that you can shape some sort of priority by
setting the low limit already.

[...]

> What was added on the reclaimer side is
> 
> 1, kswapd sets pgdat->kswapd_prio, the switch between page reclaimer
>    and allocator in terms of prio, to the lowest value before taking
>    a nap.
> 
> 2, any allocator is able to wake up the reclaimer because of the
>    lowest prio, and it starts reclaiming pages using the waker's prio.
> 
> 3, allocator comes while kswapd is active, its prio is checked and
>    no-op if kswapd is higher on prio; otherwise switch is updated
>    with the higher prio.
> 
> 4, every time kswapd raises sc.priority that starts with DEF_PRIORITY,
>    it is checked if there is pending update of switch; and kswapd's
>    prio steps up if there is a pending one, thus its prio never steps
>    down. Nor prio inversion. 
> 
> 5, goto 1 when kswapd finishes its work.

What about the direct reclaim? What if pages of a lower priority are
hard to reclaim? Do you want a process of a higher priority to stall
more just because it has to wait for those lower priority pages?
-- 
Michal Hocko
SUSE Labs


* Re: [RFC v1] mm: add page preemption
  2019-10-22 12:14 ` Hillf Danton
  2019-10-22 12:42   ` Michal Hocko
@ 2019-10-22 14:28   ` Hillf Danton
  2019-10-23  8:17     ` Michal Hocko
  2019-10-23 11:53     ` Hillf Danton
  1 sibling, 2 replies; 11+ messages in thread
From: Hillf Danton @ 2019-10-22 14:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hillf Danton, linux-mm, Andrew Morton, linux-kernel,
	Johannes Weiner, Shakeel Butt, Minchan Kim, Mel Gorman,
	Vladimir Davydov, Jan Kara


On Tue, 22 Oct 2019 14:42:41 +0200 Michal Hocko wrote:
> 
> On Tue 22-10-19 20:14:39, Hillf Danton wrote:
> > 
> > On Mon, 21 Oct 2019 14:27:28 +0200 Michal Hocko wrote:
> [...]
> > > Why do we care and which workloads would benefit and how much.
> > 
> > Page preemption, disabled by default, should be turned on by those
> > who wish that the performance of their workloads can survive memory
> > pressure to certain extent.
> 
> I am sorry but this doesn't say anything to me. How come not all
> workloads would fit that description?

That means pp plays a role when kswapd becomes active, and it may
prevent too much jitter in active lru pages.
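
Concretely, the page skipping itself comes down to the guard added to
isolate_lru_pages() in the patch:

	if (IS_ENABLED(CONFIG_PAGE_PREEMPTION) &&
	    is_active_lru(lru) &&	/* only the active lists */
	    global_reclaim(sc) &&	/* not memcg-targeted reclaim */
	    page_prio_higher(page, sc->reclaimer_prio))
		goto next_page;		/* skip, like an ineligible-zone page */

so pages of a higher-prio owner are only ever skipped and spliced back
onto their lru; the scanner otherwise works as before.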

> > The number of pp users is supposed near the people who change the
> > nice value of their apps either to -1 or higher at least once a week,
> > less than vi users among UK's undergraduates.
> > 
> > > And last but not least why the existing infrastructure doesn't help
> > > (e.g. if you have clearly defined workloads with different
> > > memory consumption requirements then why don't you use memory cgroups to
> > > reflect the priority).
> > 
> > Good question:)
> > 
> > Though pp is implemented by preventing any task from reclaiming as many
> > pages as possible from other tasks that are higher on priority, it is
> > trying to introduce prio into page reclaiming, to add a feature.
> > 
> > Page and memcg are different objects after all; pp is being added at
> > the page granularity. It should be an option available in environments
> > without memcg enabled.
> 
> So do you actually want to establish LRUs per priority?

No, nothing was changed other than adding the prio for every lru page. An
LRU per prio is too much to implement.

> Why using memcgs is not an option?

I have a plan to add prio in memcg. As you see, I sent an rfc before v0 with
nice added in memcg, and realised a couple of days ago that its dependence on
soft limit reclaim is not acceptable.

But we can't do that without determining how to define memcg's prio.
What I have in mind now is the highest (or lowest) prio of the tasks in a
memcg, with a knob offered to userspace.

If you like, I want to have a talk about it sometime later.

> This is the main facility to partition reclaimable
> memory in the first place. You should really focus on explaining on why
> a much more fine grained control is needed much more thoroughly.
> 
> > What is way different from the protections offered by memory cgroup
> > is that pages protected by memcg:min/low can't be reclaimed regardless
> > of memory pressure. Such guarantee is not available under pp as it only
> > suggests an extra factor to consider on deactivating lru pages.
> 
> Well, low limit can be breached if there is no eliglible memcg to be
> reclaimed. That means that you can shape some sort of priority by
> setting the low limit already.
> 
> [...]
> 
> > What was added on the reclaimer side is
> > 
> > 1, kswapd sets pgdat->kswapd_prio, the switch between page reclaimer
> >    and allocator in terms of prio, to the lowest value before taking
> >    a nap.
> > 
> > 2, any allocator is able to wake up the reclaimer because of the
> >    lowest prio, and it starts reclaiming pages using the waker's prio.
> > 
> > 3, allocator comes while kswapd is active, its prio is checked and
> >    no-op if kswapd is higher on prio; otherwise switch is updated
> >    with the higher prio.
> > 
> > 4, every time kswapd raises sc.priority that starts with DEF_PRIORITY,
> >    it is checked if there is pending update of switch; and kswapd's
> >    prio steps up if there is a pending one, thus its prio never steps
> >    down. Nor prio inversion. 
> > 
> > 5, goto 1 when kswapd finishes its work.
> 
> What about the direct reclaim?

Their prio will not change before reclaiming finishes, so leave it be.

> What if pages of a lower priority are
> hard to reclaim? Do you want a process of a higher priority stall more
> just because it has to wait for those lower priority pages?

The problems above are not introduced by pp; let Mr. Kswapd take care of
them.

(It is 22:23 local time; let's continue after a 7-hour sleep. Good night.)

Thanks
Hillf




* Re: [RFC v1] mm: add page preemption
  2019-10-22 14:28   ` Hillf Danton
@ 2019-10-23  8:17     ` Michal Hocko
  2019-10-23 11:53     ` Hillf Danton
  1 sibling, 0 replies; 11+ messages in thread
From: Michal Hocko @ 2019-10-23  8:17 UTC (permalink / raw)
  To: Hillf Danton
  Cc: linux-mm, Andrew Morton, linux-kernel, Johannes Weiner,
	Shakeel Butt, Minchan Kim, Mel Gorman, Vladimir Davydov,
	Jan Kara

On Tue 22-10-19 22:28:02, Hillf Danton wrote:
> 
> On Tue, 22 Oct 2019 14:42:41 +0200 Michal Hocko wrote:
> > 
> > On Tue 22-10-19 20:14:39, Hillf Danton wrote:
> > > 
> > > On Mon, 21 Oct 2019 14:27:28 +0200 Michal Hocko wrote:
> > [...]
> > > > Why do we care and which workloads would benefit and how much.
> > > 
> > > Page preemption, disabled by default, should be turned on by those
> > > who wish that the performance of their workloads can survive memory
> > > pressure to certain extent.
> > 
> > I am sorry but this doesn't say anything to me. How come not all
> > workloads would fit that description?
> 
> That means pp plays a role when kswapd becomes active, and it may
> prevent too much jitters in active lru pages.

This is still too vague to be useful in any way.

> > > The number of pp users is supposed near the people who change the
> > > nice value of their apps either to -1 or higher at least once a week,
> > > less than vi users among UK's undergraduates.
> > > 
> > > > And last but not least why the existing infrastructure doesn't help
> > > > (e.g. if you have clearly defined workloads with different
> > > > memory consumption requirements then why don't you use memory cgroups to
> > > > reflect the priority).
> > > 
> > > Good question:)
> > > 
> > > Though pp is implemented by preventing any task from reclaiming as many
> > > pages as possible from other tasks that are higher on priority, it is
> > > trying to introduce prio into page reclaiming, to add a feature.
> > > 
> > > Page and memcg are different objects after all; pp is being added at
> > > the page granularity. It should be an option available in environments
> > > without memcg enabled.
> > 
> > So do you actually want to establish LRUs per priority?
> 
> No, no change other than the prio for every lru page was added. LRU per prio
> is too much to implement.

Well, considering that per-page priority is a no-go, as already pointed
out by Willy, you do not have any other choice, right?

> > Why using memcgs is not an option?
> 
> I have plan to add prio in memcg. As you see, I sent a rfc before v0 with
> nice added in memcg, and realised a couple days ago that its dependence on
> soft limit reclaim is not acceptable.
> 
> But we can't do that without determining how to define memcg's prio.
> What is in mind now is the highest (or lowest) prio of tasks in a memcg
> with a knob offered to userspace.
> 
> If you like, I want to have a talk about it sometime later.

This doesn't really answer my question. Why can't you use memcgs as
they are now? Why exactly do you need a fixed priority?

> > This is the main facility to partition reclaimable
> > memory in the first place. You should really focus on explaining on why
> > a much more fine grained control is needed much more thoroughly.
> > 
> > > What is way different from the protections offered by memory cgroup
> > > is that pages protected by memcg:min/low can't be reclaimed regardless
> > > of memory pressure. Such guarantee is not available under pp as it only
> > > suggests an extra factor to consider on deactivating lru pages.
> > 
> > Well, low limit can be breached if there is no eliglible memcg to be
> > reclaimed. That means that you can shape some sort of priority by
> > setting the low limit already.
> > 
> > [...]
> > 
> > > What was added on the reclaimer side is
> > > 
> > > 1, kswapd sets pgdat->kswapd_prio, the switch between page reclaimer
> > >    and allocator in terms of prio, to the lowest value before taking
> > >    a nap.
> > > 
> > > 2, any allocator is able to wake up the reclaimer because of the
> > >    lowest prio, and it starts reclaiming pages using the waker's prio.
> > > 
> > > 3, allocator comes while kswapd is active, its prio is checked and
> > >    no-op if kswapd is higher on prio; otherwise switch is updated
> > >    with the higher prio.
> > > 
> > > 4, every time kswapd raises sc.priority that starts with DEF_PRIORITY,
> > >    it is checked if there is pending update of switch; and kswapd's
> > >    prio steps up if there is a pending one, thus its prio never steps
> > >    down. Nor prio inversion. 
> > > 
> > > 5, goto 1 when kswapd finishes its work.
> > 
> > What about the direct reclaim?
> 
> Their prio will not change before reclaiming finishes, so leave it be.

This doesn't answer my question.

> > What if pages of a lower priority are
> > hard to reclaim? Do you want a process of a higher priority stall more
> > just because it has to wait for those lower priority pages?
> 
> The problems above are not introduced by pp, let Mr. Kswapd take care of
> them.

No, this is not an answer.

-- 
Michal Hocko
SUSE Labs


* Re: [RFC v1] mm: add page preemption
  2019-10-22 14:28   ` Hillf Danton
  2019-10-23  8:17     ` Michal Hocko
@ 2019-10-23 11:53     ` Hillf Danton
  2019-10-23 12:04       ` Michal Hocko
  1 sibling, 1 reply; 11+ messages in thread
From: Hillf Danton @ 2019-10-23 11:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hillf Danton, linux-mm, Andrew Morton, linux-kernel,
	Johannes Weiner, Shakeel Butt, Minchan Kim, Mel Gorman,
	Vladimir Davydov, Jan Kara


On Wed, 23 Oct 2019 10:17:29 +0200 Michal Hocko wrote:
> 
> On Tue 22-10-19 22:28:02, Hillf Danton wrote:
> > 
> > On Tue, 22 Oct 2019 14:42:41 +0200 Michal Hocko wrote:
> > > 
> > > On Tue 22-10-19 20:14:39, Hillf Danton wrote:
> > > > 
> > > > On Mon, 21 Oct 2019 14:27:28 +0200 Michal Hocko wrote:
> > > [...]
> > > > > Why do we care and which workloads would benefit and how much.
> > > > 
> > > > Page preemption, disabled by default, should be turned on by those
> > > > who wish that the performance of their workloads can survive memory
> > > > pressure to certain extent.
> > > 
> > > I am sorry but this doesn't say anything to me. How come not all
> > > workloads would fit that description?
> > 
> > That means pp plays a role when kswapd becomes active, and it may
> > prevent too much jitters in active lru pages.
> 
> This is still too vague to be useful in any way.

Page preemption is designed to function only under memory pressure, by
suggesting that kswapd skip deactivating some pages based on a prio
comparison. By design, no page will be skipped unless a difference in
prio is found.
That said, no workload can be picked out before prio is updated, so let
users who know that their workloads are sensitive to jitters in lru pages
change the nice.
We are simply adding the pp feature; users are responsible for turning pp
on and changing nice if they feel it necessary.

> > > > The number of pp users is supposed near the people who change the
> > > > nice value of their apps either to -1 or higher at least once a week,
> > > > less than vi users among UK's undergraduates.
> > > > 
> > > > > And last but not least why the existing infrastructure doesn't help
> > > > > (e.g. if you have clearly defined workloads with different
> > > > > memory consumption requirements then why don't you use memory cgroups to
> > > > > reflect the priority).
> > > > 
> > > > Good question:)
> > > > 
> > > > Though pp is implemented by preventing any task from reclaiming as many
> > > > pages as possible from other tasks that are higher on priority, it is
> > > > trying to introduce prio into page reclaiming, to add a feature.
> > > > 
> > > > Page and memcg are different objects after all; pp is being added at
> > > > the page granularity. It should be an option available in environments
> > > > without memcg enabled.
> > > 
> > > So do you actually want to establish LRUs per priority?
> > 
> > No, no change other than the prio for every lru page was added. LRU per prio
> > is too much to implement.
> 
> Well, considering that per page priority is a no go as already pointed
> out by Willy then you do not have other choice right?

No need to seek another choice because of the prio introduced to reclaiming,
as by design no one is hurt unless pp is enabled and prio is updated.

> > > Why using memcgs is not an option?
> > 
> > I have plan to add prio in memcg. As you see, I sent a rfc before v0 with
> > nice added in memcg, and realised a couple days ago that its dependence on
> > soft limit reclaim is not acceptable.
> > 
> > But we can't do that without determining how to define memcg's prio.
> > What is in mind now is the highest (or lowest) prio of tasks in a memcg
> > with a knob offered to userspace.
> > 
> > If you like, I want to have a talk about it sometime later.
> 
> This doesn't really answer my question.
> Why cannot you use memcgs as they are now.

No prio provided.

> Why exactly do you need a fixed priority?

Prio comparison in global reclaim is what was added. The fact that every
task has a prio makes that comparison possible.

> > > This is the main facility to partition reclaimable
> > > memory in the first place.

Is every task (pid != 1) contained in a memcg? And why?

> > > You should really focus on explaining on why
> > > a much more fine grained control is needed much more thoroughly.

Which do you prefer, cello or fiddle? And why?

> > > > What is way different from the protections offered by memory cgroup
> > > > is that pages protected by memcg:min/low can't be reclaimed regardless
> > > > of memory pressure. Such guarantee is not available under pp as it only
> > > > suggests an extra factor to consider on deactivating lru pages.
> > > 
> > > Well, low limit can be breached if there is no eliglible memcg to be
> > > reclaimed. That means that you can shape some sort of priority by
> > > setting the low limit already.
> > > 
> > > [...]
> > > 
> > > > What was added on the reclaimer side is
> > > > 
> > > > 1, kswapd sets pgdat->kswapd_prio, the switch between page reclaimer
> > > >    and allocator in terms of prio, to the lowest value before taking
> > > >    a nap.
> > > > 
> > > > 2, any allocator is able to wake up the reclaimer because of the
> > > >    lowest prio, and it starts reclaiming pages using the waker's prio.
> > > > 
> > > > 3, allocator comes while kswapd is active, its prio is checked and
> > > >    no-op if kswapd is higher on prio; otherwise switch is updated
> > > >    with the higher prio.
> > > > 
> > > > 4, every time kswapd raises sc.priority that starts with DEF_PRIORITY,
> > > >    it is checked if there is pending update of switch; and kswapd's
> > > >    prio steps up if there is a pending one, thus its prio never steps
> > > >    down. Nor prio inversion. 
> > > > 
> > > > 5, goto 1 when kswapd finishes its work.
> > > 
> > > What about the direct reclaim?
> > 
> > Their prio will not change before reclaiming finishes, so leave it be.
> 
> This doesn't answer my question.

No prio inversion in direct reclaim, if that is what you mean.

> > > What if pages of a lower priority are
> > > hard to reclaim? Do you want a process of a higher priority stall more
> > > just because it has to wait for those lower priority pages?
> > 
> > The problems above are not introduced by pp, let Mr. Kswapd take care of
> > them.
> 
> No, this is not an answer.

Is pp making them worse?

Thanks
Hillf




* Re: [RFC v1] mm: add page preemption
  2019-10-23 11:53     ` Hillf Danton
@ 2019-10-23 12:04       ` Michal Hocko
  0 siblings, 0 replies; 11+ messages in thread
From: Michal Hocko @ 2019-10-23 12:04 UTC (permalink / raw)
  To: Hillf Danton
  Cc: linux-mm, Andrew Morton, linux-kernel, Johannes Weiner,
	Shakeel Butt, Minchan Kim, Mel Gorman, Vladimir Davydov,
	Jan Kara

On Wed 23-10-19 19:53:50, Hillf Danton wrote:
> 
> On Wed, 23 Oct 2019 10:17:29 +0200 Michal Hocko wrote:
[...]
> > This doesn't really answer my question.
> > Why cannot you use memcgs as they are now.
> 
> No prio provided.
> 
> > Why exactly do you need a fixed priority?
> 
> Prio comparison in global reclaim is what was added. Because every task has
> prio makes that comparison possible.

That still doesn't answer the question because it doesn't explain why
the priority is really necessary. I am sorry, but I have more important
things to deal with than asking the same question again and again.
-- 
Michal Hocko
SUSE Labs


Thread overview: 11+ messages
2019-10-20 13:43 [RFC v1] mm: add page preemption Hillf Danton
2019-10-20 15:24 ` Matthew Wilcox
2019-10-21 12:27 ` Michal Hocko
2019-10-22  1:41 ` Hillf Danton
2019-10-22  5:40 ` kbuild test robot
2019-10-22 12:14 ` Hillf Danton
2019-10-22 12:42   ` Michal Hocko
2019-10-22 14:28   ` Hillf Danton
2019-10-23  8:17     ` Michal Hocko
2019-10-23 11:53     ` Hillf Danton
2019-10-23 12:04       ` Michal Hocko
