Linux-mm Archive on lore.kernel.org
 help / color / Atom feed
From: Zi Yan <ziy@nvidia.com>
To: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Joao Martins <joao.m.martins@oracle.com>, <linux-mm@kvack.org>,
	<linux-kernel@vger.kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [PATCH v2 2/2] mm/hugetlb: refactor subpage recording
Date: Sat, 13 Feb 2021 10:44:01 -0500
Message-ID: <16F7C58B-4D79-41C5-9B64-A1A1628F4AF2@nvidia.com> (raw)
In-Reply-To: <07b6b61a-2a78-4f49-70f8-e387462a95cc@oracle.com>


[-- Attachment #1: Type: text/plain, Size: 4690 bytes --]

On 11 Feb 2021, at 18:44, Mike Kravetz wrote:

> On 2/11/21 12:47 PM, Zi Yan wrote:
>> On 28 Jan 2021, at 16:53, Mike Kravetz wrote:
>>
>>> On 1/28/21 10:26 AM, Joao Martins wrote:
>>>> For a given hugepage backing a VA, there's a rather ineficient
>>>> loop which is solely responsible for storing subpages in GUP
>>>> @pages/@vmas array. For each subpage we check whether it's within
>>>> range or size of @pages and keep increment @pfn_offset and a couple
>>>> other variables per subpage iteration.
>>>>
>>>> Simplify this logic and minimize the cost of each iteration to just
>>>> store the output page/vma. Instead of incrementing number of @refs
>>>> iteratively, we do it through pre-calculation of @refs and only
>>>> with a tight loop for storing pinned subpages/vmas.
>>>>
>>>> Additionally, retain existing behaviour with using mem_map_offset()
>>>> when recording the subpages for configurations that don't have a
>>>> contiguous mem_map.
>>>>
>>>> pinning consequently improves bringing us close to
>>>> {pin,get}_user_pages_fast:
>>>>
>>>>   - 16G with 1G huge page size
>>>>   gup_test -f /mnt/huge/file -m 16384 -r 30 -L -S -n 512 -w
>>>>
>>>> PIN_LONGTERM_BENCHMARK: ~12.8k us -> ~5.8k us
>>>> PIN_FAST_BENCHMARK: ~3.7k us
>>>>
>>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>>> ---
>>>>  mm/hugetlb.c | 49 ++++++++++++++++++++++++++++---------------------
>>>>  1 file changed, 28 insertions(+), 21 deletions(-)
>>>
>>> Thanks for updating this.
>>>
>>> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
>>>
>>> I think there still is an open general question about whether we can always
>>> assume page structs are contiguous for really big pages.  That is outside
>>
>> I do not think page structs need to be contiguous, but PFNs within a big page
>> need to be contiguous, at least based on existing code like mem_map_offset() we have.
>
> Thanks for looking Zi,
> Yes, PFNs need to be contiguous.  Also, as you say page structs do not need
> to be contiguous.  The issue is that there is code that assumes page structs
> are contiguous for gigantic pages.  hugetlb code does not make this assumption
> and does a pfn_to_page() when looping through page structs for gigantic pages.
>
> I do not believe this to be a huge issue.  In most cases CONFIG_VIRTUAL_MEM_MAP
> is defined and struct pages can be accessed contiguously.  I 'think' we could
> run into problems with CONFIG_SPARSEMEM and without CONFIG_VIRTUAL_MEM_MAP
> and doing hotplug operations.  However, I still need to look into more.

Yeah, you are right about this. The combination of CONFIG_SPARSEMEM,
!CONFIG_SPARSEMEM_VMEMMAP and doing hotplug does cause errors, as simple as
dynamically reserving gigantic hugetlb pages then freeing them in a system
with CONFIG_SPARSEMEM_VMEMMAP not set and some hotplug memory.

Here are the steps to reproduce:
0. Configure a kernel with CONFIG_SPARSEMEM_VMEMMAP not set.
1. Create a VM using qemu with “-m size=8g,slots=16,maxmem=16g” to enable hotplug.
2. After boot the machine, add large enough memory using
   “object_add memory-backend-ram,id=mem1,size=7g” and
   “device_add pc-dimm,id=dimm1,memdev=mem1”.
3. In the guest OS, online all hot-plugged memory. My VM has 128MB memory block size.
If you have larger memory block size, I think you will need to plug in more memory.
4. Reserve gigantic hugetlb pages so that hot-plugged memory will be used. I reserved
12GB, like “echo 12 | sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages”.
5. Free all hugetlb gigantic pages,
“echo 0 | sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages”.
6. You will get “BUG: Bad page state in process …” errors.

The patch below can fix the error, but I suspect there might be other places missing
the necessary mem_map_next() too.

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4bdb58ab14cb..aae99c6984f3 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1319,7 +1319,8 @@ static void update_and_free_page(struct hstate *h, struct page *page)
        h->nr_huge_pages--;
        h->nr_huge_pages_node[page_to_nid(page)]--;
        for (i = 0; i < pages_per_huge_page(h); i++) {
-               page[i].flags &= ~(1 << PG_locked | 1 << PG_error |
+               struct page *subpage = mem_map_next(subpage, page, i);
+               subpage->flags &= ~(1 << PG_locked | 1 << PG_error |
                                1 << PG_referenced | 1 << PG_dirty |
                                1 << PG_active | 1 << PG_private |
                                1 << PG_writeback);


—
Best Regards,
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

  reply index

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-01-28 18:26 [PATCH v2 0/2] mm/hugetlb: follow_hugetlb_page() improvements Joao Martins
2021-01-28 18:26 ` [PATCH v2 1/2] mm/hugetlb: grab head page refcount once for group of subpages Joao Martins
2021-01-28 18:26 ` [PATCH v2 2/2] mm/hugetlb: refactor subpage recording Joao Martins
2021-01-28 21:53   ` Mike Kravetz
2021-02-11 20:47     ` Zi Yan
2021-02-11 23:44       ` Mike Kravetz
2021-02-13 15:44         ` Zi Yan [this message]
2021-02-13 21:04           ` Mike Kravetz

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=16F7C58B-4D79-41C5-9B64-A1A1628F4AF2@nvidia.com \
    --to=ziy@nvidia.com \
    --cc=akpm@linux-foundation.org \
    --cc=joao.m.martins@oracle.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mike.kravetz@oracle.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-mm Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-mm/0 linux-mm/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-mm linux-mm/ https://lore.kernel.org/linux-mm \
		linux-mm@kvack.org
	public-inbox-index linux-mm

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kvack.linux-mm


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git