Subject: Re: [PATCH net-next v2 09/15] mm: page_frag: reuse MSB of 'size' field for pfmemalloc
To: Alexander Duyck
CC: Andrew Morton
References: <20240415131941.51153-1-linyunsheng@huawei.com>
 <20240415131941.51153-10-linyunsheng@huawei.com>
 <37d012438d4850c3d7090e784e09088d02a2780c.camel@gmail.com>
 <8b7361c2-6f45-72e8-5aca-92e8a41a7e5e@huawei.com>
 <17066b6a4f941eea3ef567767450b311096da22b.camel@gmail.com>
From: Yunsheng Lin
Date: Tue, 30 Apr 2024 20:05:58 +0800

On 2024/4/29 22:49, Alexander Duyck wrote:

...

>>>
>>
>> After considering a few different layouts for 'struct page_frag_cache',
>> it seems the below is more optimized:
>>
>> struct page_frag_cache {
>>         /* page address & pfmemalloc & order */
>>         void *va;
>
> I see. So basically just pack the much smaller bitfields in here.
>
>>
>> #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE) && (BITS_PER_LONG <= 32)
>>         u16 pagecnt_bias;
>>         u16 size;
>> #else
>>         u32 pagecnt_bias;
>>         u32 size;
>> #endif
>> };
>>
>> The lower bits of 'va' are or'ed with the page order & pfmemalloc instead
>> of with offset or pagecnt_bias, so that we don't have to add more checking
>> for the problem of not having enough space for offset or pagecnt_bias when
>> PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE on a 32-bit system. And as the page
>> address, pfmemalloc and order stay unchanged for the same page in the same
>> 'page_frag_cache' instance, it makes sense to fit them together.
>>
>> Also, it seems better to replace 'offset' with 'size', which indicates
>> the remaining size of the cache in a 'page_frag_cache' instance, so that
>> we might be able to do a single 'size >= fragsz' check for the case of the
>> cache being big enough, which should be the fast path if we ensure 'size'
>> is zero when 'va' == NULL.
>
> I'm not sure the rename to size is called for as it is going to be
> confusing. Maybe something like "remaining"?

Yes, using 'size' for that is a bit confusing.
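Assuming we go with 'remaining', below is a rough sketch of the encoding and
decoding helpers I have in mind for packing the order and the pfmemalloc flag
into the lower bits of 'va'. It repeats the ORDER_MASK/PFMEMALLOC_BIT defines
from the quoted code further below so the sketch is self-contained, and the
helper names are made up just for illustration, they are not in the patch:

#define PAGE_FRAG_CACHE_ORDER_MASK		GENMASK(1, 0)
#define PAGE_FRAG_CACHE_PFMEMALLOC_BIT		BIT(2)

/* Pack page address, order and pfmemalloc into a single field. The page
 * address is at least PAGE_SIZE aligned, so the low bits are always free.
 */
static inline void *page_frag_cache_encode_va(void *va, unsigned int order,
					       bool pfmemalloc)
{
	return (void *)((unsigned long)va | order |
			(pfmemalloc ? PAGE_FRAG_CACHE_PFMEMALLOC_BIT : 0));
}

/* Masking off the low bits recovers the page-aligned address. */
static inline void *page_frag_cache_va(const struct page_frag_cache *nc)
{
	return (void *)((unsigned long)nc->va & PAGE_MASK);
}

static inline unsigned int page_frag_cache_order(const struct page_frag_cache *nc)
{
	return (unsigned long)nc->va & PAGE_FRAG_CACHE_ORDER_MASK;
}

static inline bool page_frag_cache_pfmemalloc(const struct page_frag_cache *nc)
{
	return (unsigned long)nc->va & PAGE_FRAG_CACHE_PFMEMALLOC_BIT;
}

With something like the above, the refill path only needs a single OR when
setting nc->va, and the alloc path can recover the page address, order and
pfmemalloc flag from 'va' alone.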
Besides the above 'remaining', from some googling it seems we have a few
other options too: 'residual', 'unused', 'surplus'.

>
>> Something like below:
>>
>> #define PAGE_FRAG_CACHE_ORDER_MASK		GENMASK(1, 0)
>> #define PAGE_FRAG_CACHE_PFMEMALLOC_BIT		BIT(2)
>
> The only downside is that it is ossifying things so that we can only
> ever do order 3 as the maximum cache size. It might be better to do a
> full 8 bits as on x86 it would just mean accessing the low end of a
> 16b register. Then you can have pfmemalloc at bit 8.

There are 12 bits that are always usable, as the page address is at least
PAGE_SIZE aligned, and we can always extend ORDER_MASK to more bits if a
larger order number is needed.

I am not sure I understand the above, as it seems we have the below options:
1. Use something like the above ORDER_MASK and PFMEMALLOC_BIT.
2. Use a bitfield, something like below:
        unsigned long va:20;	/* or 52 for a 64-bit system */
        unsigned long pfmemalloc:1;
        unsigned long order:11;

Or is there a better idea in your mind?

>
>> struct page_frag_cache {
>>         /* page address & pfmemalloc & order */
>>         void *va;
>>
>
> When you start combining things like this we normally would convert va
> to an unsigned long.

Ack.

>
>> #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE) && (BITS_PER_LONG <= 32)
>>         u16 pagecnt_bias;
>>         u16 size;
>> #else
>>         u32 pagecnt_bias;
>>         u32 size;
>> #endif
>> };
>>
>>
>> static void *__page_frag_cache_refill(struct page_frag_cache *nc,
>>                                       unsigned int fragsz, gfp_t gfp_mask,
>>                                       unsigned int align_mask)
>> {
>>         gfp_t gfp = gfp_mask;
>>         struct page *page;
>>         void *va;
>>
>> #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
>>         /* Ensure free_unref_page() can be used to free the page fragment */
>>         BUILD_BUG_ON(PAGE_FRAG_CACHE_MAX_ORDER > PAGE_ALLOC_COSTLY_ORDER);
>>
>>         gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP |
>>                    __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
>>         page = alloc_pages_node(NUMA_NO_NODE, gfp_mask,
>>                                 PAGE_FRAG_CACHE_MAX_ORDER);
>>         if (likely(page)) {
>>                 nc->size = PAGE_FRAG_CACHE_MAX_SIZE - fragsz;
>
> I wouldn't pass fragsz here. Ideally we keep this from having to get
> pulled directly into the allocator and can instead treat this as a
> pristine page. We can do the subtraction further down in the actual
> frag alloc call.

Yes, from the maintainability point of view. But from the performance point
of view, doesn't it make sense to do the subtraction here, as doing it in
the actual frag alloc call may involve more load/store operations to do the
subtraction? (See also the sketch at the end of this mail.)

>
>>                 va = page_address(page);
>>                 nc->va = (void *)((unsigned long)va |
>>                                   PAGE_FRAG_CACHE_MAX_ORDER |
>>                                   (page_is_pfmemalloc(page) ?
>>                                    PAGE_FRAG_CACHE_PFMEMALLOC_BIT : 0));
>>                 page_ref_add(page, PAGE_FRAG_CACHE_MAX_SIZE);
>>                 nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE;
>>                 return va;
>>         }
>> #endif
>>         page = alloc_pages_node(NUMA_NO_NODE, gfp, 0);
>>         if (likely(page)) {
>>                 nc->size = PAGE_SIZE - fragsz;
>>                 va = page_address(page);
>>                 nc->va = (void *)((unsigned long)va |
>>                                   (page_is_pfmemalloc(page) ?
>>                                    PAGE_FRAG_CACHE_PFMEMALLOC_BIT : 0));
>>                 page_ref_add(page, PAGE_FRAG_CACHE_MAX_SIZE);
>>                 nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE;
>>                 return va;
>>         }
>>
>>         nc->va = NULL;
>>         nc->size = 0;
>>         return NULL;
>> }
>>
>> void *__page_frag_alloc_va_align(struct page_frag_cache *nc,
>>                                  unsigned int fragsz, gfp_t gfp_mask,
>>                                  unsigned int align_mask)
>> {
>> #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
>>         unsigned long page_order;
>> #endif
>>         unsigned long page_size;
>>         unsigned long size;
>>         struct page *page;
>>         void *va;
>>
>>         size = nc->size & align_mask;
>>         va = nc->va;
>> #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
>>         page_order = (unsigned long)va & PAGE_FRAG_CACHE_ORDER_MASK;
>>         page_size = PAGE_SIZE << page_order;
>> #else
>>         page_size = PAGE_SIZE;
>> #endif
>
> So I notice you got rid of the loops within the function. One of the
> reasons for structuring it the way it was is to enable better code
> caching. By unfolding the loops you are increasing the number of
> instructions that have to be fetched and processed in order to
> allocate the buffers.

I am not sure I understand what 'the loops' means here, as there is no
'while' or 'for' here. I suppose you are referring to the 'goto'?
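For reference, below is a rough sketch of the alloc-side control flow I had
in mind, with a single 'remaining >= fragsz' fast-path check and the fragsz
subtraction done in the alloc call rather than in the refill. It is only a
sketch, not the actual patch: the 'remaining' field, the refill signature
without fragsz, and the fragment layout (handed out from the end of the page
downwards, like the existing code) are assumptions for illustration, and
pagecnt_bias exhaustion / page recycling is left out:

void *__page_frag_alloc_va_align(struct page_frag_cache *nc,
                                 unsigned int fragsz, gfp_t gfp_mask,
                                 unsigned int align_mask)
{
        unsigned long remaining = nc->remaining;
        void *va = nc->va;

        /* Fast path: a single check that the cache still has enough room. */
        if (unlikely(remaining < fragsz)) {
                /* 'remaining' is kept at zero when 'va' is NULL, so the
                 * empty-cache case also lands here and triggers a refill.
                 * The refill does not take fragsz in this sketch.
                 */
                va = __page_frag_cache_refill(nc, gfp_mask);
                if (unlikely(!va))
                        return NULL;

                remaining = nc->remaining;
                if (unlikely(remaining < fragsz))
                        return NULL;
        }

        /* As fragments are handed out from the end of the page downwards,
         * after the subtraction 'remaining' is also the offset of the new
         * fragment; masking with align_mask aligns the fragment start down.
         */
        remaining -= fragsz;
        remaining &= align_mask;

        nc->remaining = remaining;
        nc->pagecnt_bias--;

        return (void *)((unsigned long)va & PAGE_MASK) + remaining;
}

If by 'the loops' you mean the existing refill 'goto', then the above is the
kind of unfolding I was describing, with the extra instructions mostly kept
in the refill slow path.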