From: Chuck Lever III <chuck.lever@oracle.com>
To: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Vlastimil Babka <vbabka@suse.cz>,
	Jesper Dangaard Brouer <brouer@redhat.com>,
	Christoph Hellwig <hch@infradead.org>,
	Alexander Duyck <alexander.duyck@gmail.com>,
	Matthew Wilcox <willy@infradead.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Linux-Net <netdev@vger.kernel.org>, Linux-MM <linux-mm@kvack.org>,
	Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Subject: Re: [PATCH 0/3 v5] Introduce a bulk order-0 page allocator
Date: Mon, 22 Mar 2021 20:32:54 +0000
Message-ID: <0E0B33DE-9413-4849-8E78-06B0CDF2D503@oracle.com>
In-Reply-To: <20210322194948.GI3697@techsingularity.net>



> On Mar 22, 2021, at 3:49 PM, Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> On Mon, Mar 22, 2021 at 06:25:03PM +0000, Chuck Lever III wrote:
>> 
>> 
>>> On Mar 22, 2021, at 5:18 AM, Mel Gorman <mgorman@techsingularity.net> wrote:
>>> 
>>> This series is based on top of Matthew Wilcox's series "Rationalise
>>> __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
>>> test and are not using Andrew's tree as a baseline, I suggest using the
>>> following git tree
>>> 
>>> git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v5r9
>>> 
>>> The users of the API have been dropped in this version as the callers
>>> need to check whether they prefer an array or list interface (whether
>>> preference is based on convenience or performance).
>> 
>> I now have a consumer implementation that uses the array
>> API. If I understand the contract correctly, the return
>> value is the last array index that __alloc_pages_bulk()
>> visits. My consumer uses the return value to determine
>> if it needs to call the allocator again.
>> 
> 
> For either arrays or lists, the return value is the number of valid
> pages. For arrays, the pattern is expected to be
> 
> nr_pages = alloc_pages_bulk(gfp, nr_requested, page_array);
> for (i = 0; i < nr_pages; i++) {
> 	do something with page_array[i] 
> }
> 
> There *could* be valid populated elements at and after index nr_pages,
> but the implementation did not visit them. The implementation can abort
> early if the array looks like this
> 
> PPP....PPP
> 
> Where P is a populated element and . is NULL. The implementation would
> skip the first three populated elements, allocate four pages, and then
> abort when the next populated element was encountered. This is an
> implementation detail around how I handled prep_new_page. It could be
> addressed if many callers expect to pass in an array that has holes in
> the middle.
> 
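To make sure I understand: a caller that may pass in arrays with
holes would then need a loop like this (an untested sketch, using
the simplified calling convention from your example above rather
than the full __alloc_pages_bulk() signature):

    /*
     * Sketch only: each call skips leading populated elements,
     * fills the next run of holes, and returns the count of
     * elements known to be populated, so loop until the whole
     * array is populated or no forward progress is made.
     */
    int nr_populated = 0;

    while (nr_populated < nr_requested) {
            int nr = alloc_pages_bulk(gfp, nr_requested, page_array);

            if (nr <= nr_populated)
                    break;  /* no progress; back off and retry later */
            nr_populated = nr;
    }
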
>> It is returning some confusing (to me) results. I'd like
>> to get these resolved before posting any benchmark
>> results.
>> 
>> 1. When it has visited every array element, it returns the
>> same value that was passed in @nr_pages. Interpreted as an
>> index, that is the N + 1th array element, which shouldn't be
>> touched. Should the allocator return nr_pages - 1 in the
>> fully successful case? Or should the documentation describe
>> the return value as "the number of elements visited"?
>> 
> 
> I phrased it as "the known number of populated elements in the
> page_array".

The comment you added states:

+ * For lists, nr_pages is the number of pages that should be allocated.
+ *
+ * For arrays, only NULL elements are populated with pages and nr_pages
+ * is the maximum number of pages that will be stored in the array.
+ *
+ * Returns the number of pages added to the page_list or the index of the
+ * last known populated element of page_array.


> I did not want to write it as "the number of valid elements
> in the array" because that is not necessarily the case if an array is
> passed in with holes in the middle. I'm open to any suggestions on how
> the __alloc_pages_bulk description can be improved.

The comment states that, for the array case, a /count/ of
pages is passed in, and an /index/ is returned. If you want
to return the same type for lists and arrays, it should be
documented as a count in both cases, to match @nr_pages.
Consumers will want to compare @nr_pages with the return
value to see whether they need to call again.

Comparing a count to an index is a notorious source of
off-by-one errors.
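For example (another untested sketch, with the simplified
convention from above), a consumer treating the return value as a
count would naturally write:

    nr = alloc_pages_bulk(gfp, nr_requested, page_array);
    if (nr < nr_requested)          /* correct only if nr is a count */
            wait_and_retry();       /* hypothetical back-off helper */

    /*
     * If nr is instead the last visited index, a fully successful
     * call returns nr_requested - 1 and this test retries
     * spuriously every time.
     */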


> The definition of the return value as-is makes sense for either a list
> or an array. Returning "nr_pages - 1" suits an array because it's the
> last valid index but it makes less sense when returning a list.
> 
>> 2. Frequently the allocator returns a number smaller than
>> the total number of elements. As you may recall, sunrpc
>> will delay a bit (via a call to schedule_timeout) then call
>> again. This is supposed to be a rare event, and the delay
>> is substantial. But with the array-based API, a not-fully-
>> successful allocator call seems to happen more than half
>> the time. Is that expected? I'm calling with GFP_KERNEL;
>> it seems like the allocator should be trying harder.
>> 
> 
> It's not expected that the array implementation would be worse *unless*
> you are passing in arrays with holes in the middle. Otherwise, the success
> rate should be similar.

Essentially, sunrpc will always pass an array with a hole.
Each RPC consumes the first N elements in the rq_pages array.
Sometimes N == ARRAY_SIZE(rq_pages). AFAIK sunrpc will not
pass in an array with more than one hole. Typically:

.....PPPP

My results show that, because svc_alloc_arg() ends up calling
__alloc_pages_bulk() twice in this case, the array API is, on
average, twice as expensive as the list case for the same
workload.
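
Roughly this pattern (a simplified sketch of the cost, not the
actual svc_alloc_arg() code):

    /*
     * Sketch: with a ".....PPPP" array, the first bulk call fills
     * the leading holes, aborts at the first populated element,
     * and returns less than the requested count, so sunrpc calls
     * a second time even though the array is already full.
     */
    int pages = ARRAY_SIZE(rqstp->rq_pages);
    int filled = alloc_pages_bulk(GFP_KERNEL, pages, rqstp->rq_pages);

    if (filled < pages)
            filled = alloc_pages_bulk(GFP_KERNEL, pages, rqstp->rq_pages);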


>> 3. Is the current design intended such that, if the consumer
>> does call again, it should pass in the array address + the
>> returned index (and @nr_pages reduced by the returned
>> index)?
>> 
> 
> The caller does not have to pass in array address + returned index but
> it's more efficient if it does.
> 
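Something like this, presumably (an untested sketch, simplified
convention as above, assuming no holes beyond the returned index):

    /*
     * Sketch: resume a partial fill by advancing the base pointer
     * and shrinking the request, so the allocator does not rescan
     * the elements already known to be populated.
     */
    int nr = alloc_pages_bulk(gfp, nr_requested, page_array);

    while (nr < nr_requested) {
            int more = alloc_pages_bulk(gfp, nr_requested - nr,
                                        page_array + nr);
            if (!more)
                    break;  /* allocation failed; back off and retry */
            nr += more;
    }
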
> If you are passing in arrays with holes in the middle then the following
> might work (not tested)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c83d38dfe936..4dc38516a5bd 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5002,6 +5002,7 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> 	gfp_t alloc_gfp;
> 	unsigned int alloc_flags;
> 	int nr_populated = 0, prep_index = 0;
> +	bool hole = false;
> 
> 	if (WARN_ON_ONCE(nr_pages <= 0))
> 		return 0;
> @@ -5057,6 +5058,7 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> 	if (!zone)
> 		goto failed;
> 
> +retry_hole:
> 	/* Attempt the batch allocation */
> 	local_irq_save(flags);
> 	pcp = &this_cpu_ptr(zone->pageset)->pcp;
> @@ -5069,6 +5071,7 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> 		 * IRQs are enabled.
> 		 */
> 		if (page_array && page_array[nr_populated]) {
> +			hole = true;
> 			nr_populated++;
> 			break;
> 		}
> @@ -5109,6 +5112,9 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> 			prep_new_page(page_array[prep_index++], 0, gfp, 0);
> 	}
> 
> +	if (hole && nr_populated < nr_pages)
> +		goto retry_hole;
> +
> 	return nr_populated;
> 
> failed_irq:
> 
> -- 
> Mel Gorman
> SUSE Labs

If local_irq_save() is done more than once in this case, I don't
expect the result to be much better.

To make the array API as performant as the list API, the sunrpc
consumer will have to check, upon return, whether the N + 1th
element is populated, rather than comparing the return value
against @nr_pages.
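
That is, something like this sketch (simplified convention again):

    /*
     * Sketch: treat the call as successful when the final array
     * element is populated, instead of comparing the return value
     * with the requested count.
     */
    filled = alloc_pages_bulk(GFP_KERNEL, pages, rqstp->rq_pages);
    if (!rqstp->rq_pages[pages - 1]) {
            /* genuinely short; wait and retry */
    }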

--
Chuck Lever



