linux-mm.kvack.org archive mirror
From: "Zi Yan" <ziy@nvidia.com>
To: "Roman Gushchin" <guro@fb.com>
Cc: linux-mm@kvack.org, "Matthew Wilcox" <willy@infradead.org>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Yang Shi" <shy828301@gmail.com>,
	"Michal Hocko" <mhocko@suse.com>,
	"John Hubbard" <jhubbard@nvidia.com>,
	"Ralph Campbell" <rcampbell@nvidia.com>,
	"David Nellans" <dnellans@nvidia.com>,
	"Jason Gunthorpe" <jgg@ziepe.ca>,
	"David Rientjes" <rientjes@google.com>,
	"Vlastimil Babka" <vbabka@suse.cz>,
	"David Hildenbrand" <david@redhat.com>,
	"Mike Kravetz" <mike.kravetz@oracle.com>,
	"Song Liu" <songliubraving@fb.com>
Subject: Re: [RFC PATCH v3 00/49] 1GB PUD THP support on x86_64
Date: Tue, 30 Mar 2021 13:24:14 -0400
Message-ID: <06D1034A-DE8B-4970-9056-6CA1C436D2E8@nvidia.com>
In-Reply-To: <YEEOk+tLV4nX7ozv@carbon.dhcp.thefacebook.com>

Hi Roman,


On 4 Mar 2021, at 11:45, Roman Gushchin wrote:

> On Thu, Mar 04, 2021 at 11:26:03AM -0500, Zi Yan wrote:
>> On 1 Mar 2021, at 20:59, Roman Gushchin wrote:
>>
>>> On Wed, Feb 24, 2021 at 05:35:36PM -0500, Zi Yan wrote:
>>>> From: Zi Yan <ziy@nvidia.com>
>>>>
>>>> Hi all,
>>>>
>>>> I have rebased my 1GB PUD THP support patches on v5.11-mmotm-2021-02-18-18-29
>>>> and the code is available at
>>>> https://github.com/x-y-z/linux-1gb-thp/tree/1gb_thp_v5.11-mmotm-2021-02-18-18-29
>>>> if you want to give it a try. The actual 49 patches are not sent out with this
>>>> cover letter. :)
>>>>
>>>> Instead of asking for code review, I would like to discuss the concerns I got
>>>> from previous RFCs. I think there are two major ones:
>>>>
>>>> 1. 1GB page allocation. Current implementation allocates 1GB pages from CMA
>>>>    regions that are reserved at boot time like hugetlbfs. The concern with
>>>>    using CMA is that an educated guess is needed to avoid depleting kernel
>>>>    memory in case CMA regions are set too large. Recently David Rientjes
>>>>    proposed to use process_madvise() for hugepage collapse, which is an
>>>>    alternative [1] but might not work for 1GB pages, since there is no way of
>>>>    _allocating_ a 1GB page into which to collapse pages. I proposed a similar
>>>>    approach at LSF/MM 2019, generating physically contiguous memory after pages
>>>>    are allocated [2], which is usable for 1GB THPs. This approach does in-place
>>>>    huge page promotion thus does not require page allocation.
>>>
>>> Well, I don't think there is an alternative to cma as of now. When the memory is almost
>>> filled at least once, any subsequent activity leading to substantial slab allocations
>>> (e.g. run git gc) will fragment the memory, so that there are virtually no chances
>>> to find a continuous GB.
>>>
>>> It's possible in theory to reduce the fragmentation on 1GB scale by grouping
>>> non-movable pageblocks, but it seems a separate project.
>>
>> My experiments showed that finding continuous GBs is possible, but I agree that
>> CMA is more reliable and 1GB scale defragmentation should be a separate project.
>
> I actually ran a large scale experiment (on tens of thousands of machines) in the last
> several months. It was about hugetlbfs 1GB pages, but the allocation mechanism is the same.

Thanks for the information. I finally have time to come back to this. Do you mind sharing
the total memory of these machines? I want to get a sense of the scale of this issue so that
I can reproduce it on a comparable machine. Are you trying to get <20% of tens of GBs,
hundreds of GBs, or TBs of memory?

>
> My goal was to allocate a relatively small number of 1GB pages (<20% of the total memory).
> Without cma chances are reaching 0% very fast after reboot, and even manual manipulations
> like shutting down all workloads, dropping caches, calling sync, compaction, etc. do not
> help much. Sometimes you can allocate maybe 1-2 pages, but that's about it.

Is there a way of replicating such an environment with publicly available software?
I really want to understand the root cause and am willing to look for a possible solution.
It would be much easier if I could reproduce this locally.
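
For concreteness, one way to measure the success rate described above is sketched
below. It is only a sketch and makes assumptions: that the pages are requested at
runtime through the hugetlb sysfs knob for 1GB pages, that the script runs as root,
and that the helper names (request_hugepages, success_rate) are hypothetical ones
chosen for illustration, not anything from the experiment itself.

```python
# Assumption: 1 GB hugetlb pages are requested at runtime via sysfs (needs root).
NR_1G = "/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages"


def request_hugepages(count, path=NR_1G):
    """Set the persistent huge page pool target to `count` and
    return the pool size the kernel reports afterwards."""
    with open(path, "w") as f:
        f.write(str(count))
    # The kernel may reserve fewer pages than requested; read the value back.
    with open(path) as f:
        return int(f.read().strip())


def success_rate(requested, obtained):
    """Fraction of the requested pages that were actually obtained."""
    return obtained / requested if requested else 1.0
```

On a fragmented machine the value read back would be expected to fall short of the
requested count, and success_rate(requested, obtained) gives the fraction to compare
across machines.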

>
> Even with cma we had to fix a number of additional problems (like sub-optimal placement
> of cma areas, 2MB THP migration, some ext4 and btrfs page migration issues) to have
> a reasonable success rate about ~95-99%. And it's not 100% anyway.
>
> The problem with artificial tests is that you're likely experimenting on a freshly
> rebooted machine which isn't/wasn't doing much. It's a bad model of the real memory
> state of a production server.

Yes, I agree that my experiment is not representative. Can you provide more information
on which application behavior(s) lead to this memory fragmentation? My guess is that
non-movable pages are spread across the entire physical memory space. Is there
a quick reproducer for that?
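
As a rough way to compare the fragmentation state of a test machine with a production
one, the free lists in /proc/buddyinfo can be summarized per zone. The sketch below
assumes 4KB base pages and the usual "Node N, zone NAME count..." line layout; the
function names are hypothetical. Note that the buddy allocator tops out far below 1GB
(order 10, i.e. 4MB, with 4KB pages on v5.11), so this only indicates how much free
memory sits in large blocks, not whether a contiguous 1GB range exists.

```python
PAGE_SIZE = 4096  # assumption: 4 KB base pages (x86_64)


def parse_buddyinfo(text):
    """Return {(node, zone): [free block count for order 0, 1, 2, ...]}."""
    zones = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[0] != "Node":
            continue
        node = int(fields[1].rstrip(","))
        zone = fields[3]
        zones[(node, zone)] = [int(n) for n in fields[4:]]
    return zones


def free_bytes_by_order(counts):
    """Free bytes held at each order: count * (PAGE_SIZE << order)."""
    return [n * (PAGE_SIZE << order) for order, n in enumerate(counts)]


def high_order_fraction(counts, min_order=9):
    """Fraction of free memory sitting in blocks of at least `min_order`
    (order 9 is 2 MB with 4 KB pages)."""
    per_order = free_bytes_by_order(counts)
    total = sum(per_order)
    return sum(per_order[min_order:]) / total if total else 0.0
```

Passing the contents of /proc/buddyinfo to parse_buddyinfo() and then calling
high_order_fraction() on each zone's counts gives one number per zone that can be
tracked while a workload runs.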

Thanks.


--
Best Regards,
Yan Zi