Date: Mon, 29 Mar 2021 19:21:24 +0000
From: Dennis Zhou
To: Roman Gushchin
Cc: Tejun Heo, Christoph Lameter, Andrew Morton, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH rfc 3/4] percpu: on demand chunk depopulation
References: <20210324190626.564297-1-guro@fb.com>
 <20210324190626.564297-4-guro@fb.com>
In-Reply-To: <20210324190626.564297-4-guro@fb.com>

On Wed, Mar 24, 2021 at 12:06:25PM -0700, Roman Gushchin wrote:
> To return unused memory to the system schedule an async
> depopulation of percpu chunks.
>
> To balance between scanning too much and creating an overhead because
> of the pcpu_lock contention and scanning not enough, let's track an
> amount of chunks to scan and mark chunks which are potentially a good
> target for the depopulation with a new boolean flag. The async
> depopulation work will clear the flag after trying to depopulate a
> chunk (successfully or not).
>
> This commit suggest the following logic: if a chunk
> 1) has more than 1/4 of total pages free and populated
> 2) isn't a reserved chunk
> 3) isn't entirely free
> 4) isn't alone in the corresponding slot

I'm not sure I like the "isn't alone in its slot" check that much. There
is the odd case where every slot holds just a single chunk but every one
of those chunks is populated, and it doesn't really make sense to keep
them all around.

I think there is some decision making we can do here to pack
post-depopulation allocations into a handful of chunks. Depopulated
chunks could be sidelined with, say, a ->depopulated flag so the first
allocation pass doesn't use them, and then we could bring chunks back
one by one to try to satisfy the allocation. I'm not sure this is a
good idea, just a thought (rough sketch below, after the quoted
changelog).

> it's a good target for depopulation.
>
> If there are 2 or more of such chunks, an async depopulation
> is scheduled.
>
> Because chunk population and depopulation are opposite processes
> which make a little sense together, split out the shrinking part of
> pcpu_balance_populated() into pcpu_grow_populated() and make
> pcpu_balance_populated() calling into pcpu_grow_populated() or
> pcpu_shrink_populated() conditionally.
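To make the sidelining idea above a bit more concrete, here is a very
rough, untested sketch. Everything in it is made up for illustration
(the sideline list, both helpers, the ->depopulated field); it is not
existing percpu code, just the shape of what I mean:

	/* hypothetical: holds depopulated chunks the fast path should skip */
	static LIST_HEAD(pcpu_sideline_list);

	/* hypothetical helper, caller holds pcpu_lock */
	static void pcpu_sideline_chunk(struct pcpu_chunk *chunk)
	{
		lockdep_assert_held(&pcpu_lock);

		/* off the normal slot lists, so the first alloc pass won't see it */
		chunk->depopulated = true;
		list_move(&chunk->list, &pcpu_sideline_list);
	}

	/* hypothetical helper: bring one chunk back when an alloc needs it */
	static struct pcpu_chunk *
	pcpu_reclaim_sidelined_chunk(struct list_head *pcpu_slot)
	{
		struct pcpu_chunk *chunk;

		lockdep_assert_held(&pcpu_lock);

		chunk = list_first_entry_or_null(&pcpu_sideline_list,
						 struct pcpu_chunk, list);
		if (chunk) {
			chunk->depopulated = false;
			list_move(&chunk->list,
				  &pcpu_slot[pcpu_chunk_slot(chunk)]);
		}
		return chunk;
	}

The alloc slow path would only call pcpu_reclaim_sidelined_chunk() after
the normal slot scan fails, so depopulated chunks come back one at a time
instead of all being considered up front.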
>
> Signed-off-by: Roman Gushchin
> ---
>  mm/percpu-internal.h |   1 +
>  mm/percpu.c          | 111 ++++++++++++++++++++++++++++++++-----------
>  2 files changed, 85 insertions(+), 27 deletions(-)
>
> diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
> index 18b768ac7dca..1c5b92af02eb 100644
> --- a/mm/percpu-internal.h
> +++ b/mm/percpu-internal.h
> @@ -67,6 +67,7 @@ struct pcpu_chunk {
>
>  	void *data;		/* chunk data */
>  	bool immutable;		/* no [de]population allowed */
> +	bool depopulate;	/* depopulation hint */
>  	int start_offset;	/* the overlap with the previous
>  				   region to have a page aligned
>  				   base_addr */
> diff --git a/mm/percpu.c b/mm/percpu.c
> index 015d076893f5..148137f0fc0b 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -178,6 +178,12 @@ static LIST_HEAD(pcpu_map_extend_chunks);
>   */
>  int pcpu_nr_empty_pop_pages;
>
> +/*
> + * Track the number of chunks with a lot of free memory.
> + * It's used to release unused pages to the system.
> + */
> +static int pcpu_nr_chunks_to_depopulate;
> +
>  /*
>   * The number of populated pages in use by the allocator, protected by
>   * pcpu_lock. This number is kept per a unit per chunk (i.e. when a page gets
> @@ -1955,6 +1961,11 @@ static void pcpu_balance_free(enum pcpu_chunk_type type)
>  		if (chunk == list_first_entry(free_head, struct pcpu_chunk, list))
>  			continue;
>
> +		if (chunk->depopulate) {
> +			chunk->depopulate = false;
> +			pcpu_nr_chunks_to_depopulate--;
> +		}
> +
>  		list_move(&chunk->list, &to_free);
>  	}
>
> @@ -1976,7 +1987,7 @@ static void pcpu_balance_free(enum pcpu_chunk_type type)
>  }
>
>  /**
> - * pcpu_balance_populated - manage the amount of populated pages
> + * pcpu_grow_populated - populate chunk(s) to satisfy atomic allocations
>   * @type: chunk type
>   *
>   * Maintain a certain amount of populated pages to satisfy atomic allocations.
> @@ -1985,35 +1996,15 @@ static void pcpu_balance_free(enum pcpu_chunk_type type)
>   * allocation causes the failure as it is possible that requests can be
>   * serviced from already backed regions.
>   */
> -static void pcpu_balance_populated(enum pcpu_chunk_type type)
> +static void pcpu_grow_populated(enum pcpu_chunk_type type, int nr_to_pop)
>  {
>  	/* gfp flags passed to underlying allocators */
>  	const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
>  	struct list_head *pcpu_slot = pcpu_chunk_list(type);
>  	struct pcpu_chunk *chunk;
> -	int slot, nr_to_pop, ret;
> +	int slot, ret;
>
> -	/*
> -	 * Ensure there are certain number of free populated pages for
> -	 * atomic allocs. Fill up from the most packed so that atomic
> -	 * allocs don't increase fragmentation. If atomic allocation
> -	 * failed previously, always populate the maximum amount. This
> -	 * should prevent atomic allocs larger than PAGE_SIZE from keeping
> -	 * failing indefinitely; however, large atomic allocs are not
> -	 * something we support properly and can be highly unreliable and
> -	 * inefficient.
> -	 */
>  retry_pop:
> -	if (pcpu_atomic_alloc_failed) {
> -		nr_to_pop = PCPU_EMPTY_POP_PAGES_HIGH;
> -		/* best effort anyway, don't worry about synchronization */
> -		pcpu_atomic_alloc_failed = false;
> -	} else {
> -		nr_to_pop = clamp(PCPU_EMPTY_POP_PAGES_HIGH -
> -				  pcpu_nr_empty_pop_pages,
> -				  0, PCPU_EMPTY_POP_PAGES_HIGH);
> -	}
> -
>  	for (slot = pcpu_size_to_slot(PAGE_SIZE); slot < pcpu_nr_slots; slot++) {
>  		unsigned int nr_unpop = 0, rs, re;
>
> @@ -2084,9 +2075,18 @@ static void pcpu_shrink_populated(enum pcpu_chunk_type type)

I missed this in the review of patch 1, but pcpu_shrink_populated() only
needs to iterate over:

	for (slot = pcpu_size_to_slot(PAGE_SIZE); slot < pcpu_nr_slots; slot++) {

>  	list_for_each_entry(chunk, &pcpu_slot[slot], list) {
>  		bool isolated = false;
>
> -		if (pcpu_nr_empty_pop_pages < PCPU_EMPTY_POP_PAGES_HIGH)
> +		if (pcpu_nr_empty_pop_pages < PCPU_EMPTY_POP_PAGES_HIGH ||
> +		    pcpu_nr_chunks_to_depopulate < 1)
>  			break;
>
> +		/*
> +		 * Don't try to depopulate a chunk again and again.
> +		 */
> +		if (!chunk->depopulate)
> +			continue;
> +		chunk->depopulate = false;
> +		pcpu_nr_chunks_to_depopulate--;
> +
>  		for (i = 0, start = -1; i < chunk->nr_pages; i++) {
>  			if (!chunk->nr_empty_pop_pages)
>  				break;
> @@ -2153,6 +2153,41 @@ static void pcpu_shrink_populated(enum pcpu_chunk_type type)
>  	spin_unlock_irq(&pcpu_lock);
>  }
>
> +/**
> + * pcpu_balance_populated - manage the amount of populated pages
> + * @type: chunk type
> + *
> + * Populate or depopulate chunks to maintain a certain amount
> + * of free pages to satisfy atomic allocations, but not waste
> + * large amounts of memory.
> + */
> +static void pcpu_balance_populated(enum pcpu_chunk_type type)
> +{
> +	int nr_to_pop;
> +
> +	/*
> +	 * Ensure there are certain number of free populated pages for
> +	 * atomic allocs. Fill up from the most packed so that atomic
> +	 * allocs don't increase fragmentation. If atomic allocation
> +	 * failed previously, always populate the maximum amount. This
> +	 * should prevent atomic allocs larger than PAGE_SIZE from keeping
> +	 * failing indefinitely; however, large atomic allocs are not
> +	 * something we support properly and can be highly unreliable and
> +	 * inefficient.
> +	 */
> +	if (pcpu_atomic_alloc_failed) {
> +		nr_to_pop = PCPU_EMPTY_POP_PAGES_HIGH;
> +		/* best effort anyway, don't worry about synchronization */
> +		pcpu_atomic_alloc_failed = false;
> +		pcpu_grow_populated(type, nr_to_pop);
> +	} else if (pcpu_nr_empty_pop_pages < PCPU_EMPTY_POP_PAGES_HIGH) {
> +		nr_to_pop = PCPU_EMPTY_POP_PAGES_HIGH - pcpu_nr_empty_pop_pages;
> +		pcpu_grow_populated(type, nr_to_pop);
> +	} else if (pcpu_nr_chunks_to_depopulate > 0) {
> +		pcpu_shrink_populated(type);
> +	}
> +}
> +
>  /**
>   * pcpu_balance_workfn - manage the amount of free chunks and populated pages
>   * @work: unused
> @@ -2188,6 +2223,7 @@ void free_percpu(void __percpu *ptr)
>  	int size, off;
>  	bool need_balance = false;
>  	struct list_head *pcpu_slot;
> +	struct pcpu_chunk *pos;
>
>  	if (!ptr)
>  		return;
> @@ -2207,15 +2243,36 @@ void free_percpu(void __percpu *ptr)
>
>  	pcpu_memcg_free_hook(chunk, off, size);
>
> -	/* if there are more than one fully free chunks, wake up grim reaper */
>  	if (chunk->free_bytes == pcpu_unit_size) {
> -		struct pcpu_chunk *pos;
> -
> +		/*
> +		 * If there are more than one fully free chunks,
> +		 * wake up grim reaper.
> +		 */
>  		list_for_each_entry(pos, &pcpu_slot[pcpu_nr_slots - 1], list)
>  			if (pos != chunk) {
>  				need_balance = true;
>  				break;
>  			}
> +
> +	} else if (chunk->nr_empty_pop_pages > chunk->nr_pages / 4) {

We should have this ignore the first and reserved chunks. While it
shouldn't be possible in theory, it would be nice to just make it
explicit here.

> +		/*
> +		 * If there is more than one chunk in the slot and
> +		 * at least 1/4 of its pages are empty, mark the chunk
> +		 * as a target for the depopulation. If there is more
> +		 * than one chunk like this, schedule an async balancing.
> +		 */
> +		int nslot = pcpu_chunk_slot(chunk);
> +
> +		list_for_each_entry(pos, &pcpu_slot[nslot], list)
> +			if (pos != chunk && !chunk->depopulate &&
> +			    !chunk->immutable) {
> +				chunk->depopulate = true;
> +				pcpu_nr_chunks_to_depopulate++;
> +				break;
> +			}
> +
> +		if (pcpu_nr_chunks_to_depopulate > 1)
> +			need_balance = true;
>  	}
>
>  	trace_percpu_free_percpu(chunk->base_addr, off, ptr);
> --
> 2.30.2
>

Some questions I have:
1. How do we prevent unnecessary scanning for atomic allocations?
2. Even in the normal case, should we try to pack future allocations
   into a smaller # of chunks after depopulation?
3. What is the right frequency for depopulation scanning?

I think of the pcpu work item as a way to defer 2 things: the freeing of
chunks and, more immediately, replenishing free pages. Depopulation isn't
necessarily as high a priority.

Thanks,
Dennis