From: Michal Hocko <mhocko@kernel.org>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Tejun Heo <tj@kernel.org>, Roman Gushchin <guro@fb.com>,
	linux-mm@kvack.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH] mm: memcontrol: asynchronous reclaim for memory.high
Date: Wed, 19 Feb 2020 19:37:31 +0100	[thread overview]
Message-ID: <20200219183731.GC11847@dhcp22.suse.cz> (raw)
In-Reply-To: <20200219181219.54356-1-hannes@cmpxchg.org>

On Wed 19-02-20 13:12:19, Johannes Weiner wrote:
> We have received regression reports from users whose workloads moved
> into containers and subsequently encountered new latencies. For some
> users these were a nuisance, but for some it meant missing their SLA
> response times. We tracked those delays down to cgroup limits, which
> inject direct reclaim stalls into the workload where previously all
> reclaim was handled by kswapd.

I am curious why this is unexpected when the high limit is explicitly
documented as a throttling mechanism.

> This patch adds asynchronous reclaim to the memory.high cgroup limit
> while keeping direct reclaim as a fallback. In our testing, this
> eliminated all direct reclaim from the affected workload.

Who gets accounted for all this work? Unless I am missing something, it
just gets hidden in general system activity, and that might hurt
isolation. I do see how moving the work to a different context is
desirable, but the work has to be accounted properly if it is going to
become a normal mode of operation (rather than a rare exception like
the existing irq-context handling).

> memory.high has a grace buffer of about 4% between when it becomes
> exceeded and when allocating threads get throttled. We can use the
> same buffer for the async reclaimer to operate in. If the worker
> cannot keep up and the grace buffer is exceeded, allocating threads
> will fall back to direct reclaim before getting throttled.
> 
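
To make the ordering concrete, here is a minimal sketch of the decision
ladder described above. The ~4% figure and all names below are
assumptions for illustration only; the actual code derives the
throttling threshold from an overage-based penalty rather than an
explicit percentage:

	/* Illustrative only -- not the kernel's implementation. */
	enum over_high_action {
		OH_NONE,		/* usage <= high: nothing to do */
		OH_ASYNC_ONLY,		/* within grace buffer: async worker handles it */
		OH_DIRECT_THEN_THROTTLE,/* buffer exceeded: direct reclaim, then throttle */
	};

	static enum over_high_action classify_over_high(unsigned long usage,
							unsigned long high)
	{
		unsigned long grace = high / 25;	/* roughly 4% of the limit */

		if (usage <= high)
			return OH_NONE;
		if (usage <= high + grace)
			return OH_ASYNC_ONLY;
		return OH_DIRECT_THEN_THROTTLE;
	}
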
> For irq-context, there's already async memory.high enforcement. Re-use
> that work item for all allocating contexts, but switch it to the
> unbound workqueue so reclaim work doesn't compete with the workload.
> The work item is per cgroup, which means the workqueue infrastructure
> will create at most one worker thread per reclaiming cgroup.
> 
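
As an aside for readers, the per-cgroup work pattern being reused here
boils down to something like the sketch below (simplified and renamed;
the real fields live in struct mem_cgroup). queue_work() is a no-op
while the item is still pending, and a single work item never runs on
more than one worker at a time, which is what bounds this to one
reclaimer per cgroup:

	/* Simplified sketch of a per-cgroup async reclaim hook. */
	#include <linux/workqueue.h>

	struct my_memcg {
		struct work_struct high_work;	/* one work item per cgroup */
		/* ... */
	};

	static void my_high_work_func(struct work_struct *work)
	{
		struct my_memcg *memcg = container_of(work, struct my_memcg,
						      high_work);

		(void)memcg;	/* reclaim the group down to memory.high here */
	}

	static void my_memcg_init(struct my_memcg *memcg)
	{
		INIT_WORK(&memcg->high_work, my_high_work_func);
	}

	static void my_memcg_over_high(struct my_memcg *memcg)
	{
		/* unbound workqueue: keep reclaim off the allocating CPU */
		queue_work(system_unbound_wq, &memcg->high_work);
	}
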
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/memcontrol.c | 60 +++++++++++++++++++++++++++++++++++++------------
>  mm/vmscan.c     | 10 +++++++--
>  2 files changed, 54 insertions(+), 16 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index cf02e3ef3ed9..bad838d9c2bb 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1446,6 +1446,10 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
>  	seq_buf_printf(&s, "pgsteal %lu\n",
>  		       memcg_events(memcg, PGSTEAL_KSWAPD) +
>  		       memcg_events(memcg, PGSTEAL_DIRECT));
> +	seq_buf_printf(&s, "pgscan_direct %lu\n",
> +		       memcg_events(memcg, PGSCAN_DIRECT));
> +	seq_buf_printf(&s, "pgsteal_direct %lu\n",
> +		       memcg_events(memcg, PGSTEAL_DIRECT));
>  	seq_buf_printf(&s, "%s %lu\n", vm_event_name(PGACTIVATE),
>  		       memcg_events(memcg, PGACTIVATE));
>  	seq_buf_printf(&s, "%s %lu\n", vm_event_name(PGDEACTIVATE),
> @@ -2235,10 +2239,19 @@ static void reclaim_high(struct mem_cgroup *memcg,
>  
>  static void high_work_func(struct work_struct *work)
>  {
> +	unsigned long high, usage;
>  	struct mem_cgroup *memcg;
>  
>  	memcg = container_of(work, struct mem_cgroup, high_work);
> -	reclaim_high(memcg, MEMCG_CHARGE_BATCH, GFP_KERNEL);
> +
> +	high = READ_ONCE(memcg->high);
> +	usage = page_counter_read(&memcg->memory);
> +
> +	if (usage <= high)
> +		return;
> +
> +	set_worker_desc("cswapd/%llx", cgroup_id(memcg->css.cgroup));
> +	reclaim_high(memcg, usage - high, GFP_KERNEL);
>  }
>  
>  /*
> @@ -2304,15 +2317,22 @@ void mem_cgroup_handle_over_high(void)
>  	unsigned long pflags;
>  	unsigned long penalty_jiffies, overage;
>  	unsigned int nr_pages = current->memcg_nr_pages_over_high;
> +	bool tried_direct_reclaim = false;
>  	struct mem_cgroup *memcg;
>  
>  	if (likely(!nr_pages))
>  		return;
>  
> -	memcg = get_mem_cgroup_from_mm(current->mm);
> -	reclaim_high(memcg, nr_pages, GFP_KERNEL);
>  	current->memcg_nr_pages_over_high = 0;
>  
> +	memcg = get_mem_cgroup_from_mm(current->mm);
> +	high = READ_ONCE(memcg->high);
> +recheck:
> +	usage = page_counter_read(&memcg->memory);
> +
> +	if (usage <= high)
> +		goto out;
> +
>  	/*
>  	 * memory.high is breached and reclaim is unable to keep up. Throttle
>  	 * allocators proactively to slow down excessive growth.
> @@ -2325,12 +2345,6 @@ void mem_cgroup_handle_over_high(void)
>  	 * overage amount.
>  	 */
>  
> -	usage = page_counter_read(&memcg->memory);
> -	high = READ_ONCE(memcg->high);
> -
> -	if (usage <= high)
> -		goto out;
> -
>  	/*
>  	 * Prevent division by 0 in overage calculation by acting as if it was a
>  	 * threshold of 1 page
> @@ -2369,6 +2383,16 @@ void mem_cgroup_handle_over_high(void)
>  	if (penalty_jiffies <= HZ / 100)
>  		goto out;
>  
> +	/*
> +	 * It's possible async reclaim just isn't able to keep
> +	 * up. Before we go to sleep, try direct reclaim.
> +	 */
> +	if (!tried_direct_reclaim) {
> +		reclaim_high(memcg, nr_pages, GFP_KERNEL);
> +		tried_direct_reclaim = true;
> +		goto recheck;
> +	}
> +
>  	/*
>  	 * If we exit early, we're guaranteed to die (since
>  	 * schedule_timeout_killable sets TASK_KILLABLE). This means we don't
> @@ -2544,13 +2568,21 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	 */
>  	do {
>  		if (page_counter_read(&memcg->memory) > memcg->high) {
> +			/*
> +			 * Kick off the async reclaimer, which should
> +			 * be doing most of the work to avoid latency
> +			 * in the workload. But also check in on its
> +			 * progress before resuming to userspace, in
> +			 * case we need to do direct reclaim, or even
> +			 * throttle the allocating thread if reclaim
> +			 * cannot keep up with allocation demand.
> +			 */
> +			queue_work(system_unbound_wq, &memcg->high_work);
>  			/* Don't bother a random interrupted task */
> -			if (in_interrupt()) {
> -				schedule_work(&memcg->high_work);
> -				break;
> +			if (!in_interrupt()) {
> +				current->memcg_nr_pages_over_high += batch;
> +				set_notify_resume(current);
>  			}
> -			current->memcg_nr_pages_over_high += batch;
> -			set_notify_resume(current);
>  			break;
>  		}
>  	} while ((memcg = parent_mem_cgroup(memcg)));
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 74e8edce83ca..d6085115c7f2 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1947,7 +1947,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>  	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
>  	reclaim_stat->recent_scanned[file] += nr_taken;
>  
> -	item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT;
> +	if (current_is_kswapd() || (cgroup_reclaim(sc) && current_work()))
> +		item = PGSCAN_KSWAPD;
> +	else
> +		item = PGSCAN_DIRECT;
>  	if (!cgroup_reclaim(sc))
>  		__count_vm_events(item, nr_scanned);
>  	__count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
> @@ -1961,7 +1964,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>  
>  	spin_lock_irq(&pgdat->lru_lock);
>  
> -	item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
> +	if (current_is_kswapd() || (cgroup_reclaim(sc) && current_work()))
> +		item = PGSTEAL_KSWAPD;
> +	else
> +		item = PGSTEAL_DIRECT;
>  	if (!cgroup_reclaim(sc))
>  		__count_vm_events(item, nr_reclaimed);
>  	__count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
> -- 
> 2.24.1
> 
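
For anyone piecing the hunks back together, the resulting slow path of
mem_cgroup_handle_over_high() reads roughly like this standalone sketch
(the helpers are stubs standing in for the real kernel functions; css
reference handling and psi accounting are omitted):

	#include <stdbool.h>

	extern unsigned long read_usage(void);		/* page_counter_read()    */
	extern unsigned long read_high(void);		/* READ_ONCE(memcg->high) */
	extern unsigned long penalty_for(unsigned long overage);
	extern void direct_reclaim(unsigned long nr_pages);
	extern void throttle(unsigned long penalty_jiffies);

	void handle_over_high(unsigned long nr_pages)
	{
		bool tried_direct_reclaim = false;
		unsigned long high = read_high();
		unsigned long usage, penalty;

	recheck:
		usage = read_usage();
		if (usage <= high)
			return;			/* the async worker kept up */

		penalty = penalty_for(usage - high);
		if (penalty <= 1)		/* stand-in for the HZ / 100 cutoff */
			return;

		if (!tried_direct_reclaim) {	/* async reclaim fell behind */
			direct_reclaim(nr_pages);
			tried_direct_reclaim = true;
			goto recheck;
		}

		throttle(penalty);		/* reclaim cannot keep up */
	}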

-- 
Michal Hocko
SUSE Labs

Thread overview: 56+ messages

2020-02-19 18:12 [PATCH] mm: memcontrol: asynchronous reclaim for memory.high Johannes Weiner
2020-02-19 18:37 ` Michal Hocko [this message]
2020-02-19 19:16   ` Johannes Weiner
2020-02-19 19:53     ` Michal Hocko
2020-02-19 21:17       ` Johannes Weiner
2020-02-20  9:46         ` Michal Hocko
2020-02-20 14:41           ` Johannes Weiner
2020-02-19 21:41       ` Daniel Jordan
2020-02-19 22:08         ` Johannes Weiner
2020-02-20 15:45           ` Daniel Jordan
2020-02-20 15:56             ` Tejun Heo
2020-02-20 18:23               ` Daniel Jordan
2020-02-20 18:45                 ` Tejun Heo
2020-02-20 19:55                   ` Daniel Jordan
2020-02-20 20:54                     ` Tejun Heo
2020-02-19 19:17   ` Chris Down
2020-02-19 19:31   ` Andrew Morton
2020-02-19 21:33     ` Johannes Weiner
2020-02-26 20:25 ` Shakeel Butt
2020-02-26 22:26   ` Johannes Weiner
2020-02-26 23:36     ` Shakeel Butt
2020-02-26 23:46       ` Johannes Weiner
2020-02-27  0:12     ` Yang Shi
2020-02-27  2:42       ` Shakeel Butt
2020-02-27  9:58       ` Michal Hocko
2020-02-27 12:50       ` Johannes Weiner
2020-02-26 23:59   ` Yang Shi
2020-02-27  2:36     ` Shakeel Butt