linux-kernel.vger.kernel.org archive mirror
From: Mike Kravetz <mike.kravetz@oracle.com>
To: Michal Hocko <mhocko@suse.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>,
	Muchun Song <songmuchun@bytedance.com>,
	corbet@lwn.net, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, x86@kernel.org, hpa@zytor.com,
	dave.hansen@linux.intel.com, luto@kernel.org,
	peterz@infradead.org, viro@zeniv.linux.org.uk,
	akpm@linux-foundation.org, mchehab+huawei@kernel.org,
	pawan.kumar.gupta@linux.intel.com, rdunlap@infradead.org,
	oneukum@suse.com, anshuman.khandual@arm.com, jroedel@suse.de,
	almasrymina@google.com, rientjes@google.com, willy@infradead.org,
	osalvador@suse.de, song.bao.hua@hisilicon.com, david@redhat.com,
	naoya.horiguchi@nec.com, joao.m.martins@oracle.com,
	duanxiongchun@bytedance.com, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	linux-fsdevel@vger.kernel.org, Chen Huang <chenhuang5@huawei.com>,
	Bodeddula Balasubramaniam <bodeddub@amazon.com>
Subject: Re: [PATCH v18 4/9] mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page
Date: Fri, 12 Mar 2021 09:50:37 -0800
Message-ID: <b6f204af-ff17-3088-c717-d299b33e6fcb@oracle.com>
In-Reply-To: <YEsjGbKtyrpfas4C@dhcp22.suse.cz>

On 3/12/21 12:15 AM, Michal Hocko wrote:
> On Thu 11-03-21 14:53:08, Mike Kravetz wrote:
>> On 3/11/21 9:59 AM, Mike Kravetz wrote:
>>> On 3/11/21 4:17 AM, Michal Hocko wrote:
>>>>> Yeah per cpu preempt counting shouldn't be noticeable but I have to
>>>>> confess I haven't benchmarked it.
>>>>
>>>> But all this seems moot now http://lkml.kernel.org/r/YEoA08n60+jzsnAl@hirez.programming.kicks-ass.net
>>>>
>>>
>>> The proper fix for free_huge_page independent of this series would
>>> involve:
>>>
>>> - Make hugetlb_lock and subpool lock irq safe
>>> - Hand off freeing to a workqueue if the freeing could sleep
>>>
>>> Today, the only time we can sleep in free_huge_page is for gigantic
>>> pages allocated via cma.  I 'think' the concern about undesirable
>>> user visible side effects in this case is minimal as freeing/allocating
>>> 1G pages is not something that is going to happen at a high frequency.
>>> My thinking could be wrong?
>>>
>>> Of more concern, is the introduction of this series.  If this feature
>>> is enabled, then ALL free_huge_page requests must be sent to a workqueue.
>>> Any ideas on how to address this?
>>>
>>
>> Thinking about this more ...
>>
>> A call to free_huge_page has two distinct outcomes
>> 1) Page is freed back to the original allocator: buddy or cma
>> 2) Page is put on hugetlb free list
>>
>> We can only possibly sleep in the first case.  In addition, freeing a
>> page back to the original allocator involves these steps:
>> 1) Removing page from hugetlb lists
>> 2) Updating hugetlb counts: nr_hugepages, surplus
>> 3) Updating page fields
>> 4) Allocate vmemmap pages if needed as in this series
>> 5) Calling free routine of original allocator
>>
>> If hugetlb_lock is irq safe, we can perform the first 3 steps under that
>> lock without issue.  We would then use a workqueue to perform the last
>> two steps.  Since we are updating hugetlb user visible data under the
>> lock, there should be no delays.  Of course, giving those pages back to
>> the original allocator could still be delayed, and a user may notice
>> that.  Not sure if that would be acceptable?
> 
> Well, having many in-flight huge pages can certainly be visible. Say you
> are freeing hundreds of huge pages and your echo n > nr_hugepages will
> return just for you to find out that the memory hasn't been freed and
> therefore cannot be reused for another use - recently there was somebody
> mentioning their usecase to free up huge pages to prevent OOM for
> example. I do expect more people doing something like that.
> 
> Now, nr_hugepages can be handled by blocking on the same WQ until all
> pre-existing items are processed. Maybe we will need to have a more
> generic API to achieve the same for in kernel users but let's wait for
> those requests.
> 
>> I think Muchun had a
>> similar setup just for vmemmap allocation in an early version of this
>> series.
>>
>> This would also require changes to where accounting is done in
>> dissolve_free_huge_page and update_and_free_page as mentioned elsewhere.
> 
> Normalizing dissolve_free_huge_page is definitely a good idea. It is
> really tricky how it sticks out and does half of the job of
> update_and_free_page.
> 
> That being said, if it is possible to have a fully consistent h state
> before handing over to WQ for sleeping operation then we should be all
> fine. I am slightly worried about potential tricky situations where the
> sleeping operation fails because that would require that page to be
> added back to the pool again. As said above we would need some sort of
> sync with in-flight operations before returning to the userspace.

Those sysfs interfaces to allocate/free huge pages will need to be
reworked.  One thing that is totally unacceptable once hugetlb_lock is
irq safe is calling cond_resched_lock(&hugetlb_lock).
We will need to significantly reduce lock hold time in these situations.
I have some ideas on how this might work, but it is going to require
a good deal of code restructuring and will take some time.
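
To make the split discussed above concrete, here is a rough userspace
model, not kernel code.  Every name in it (model_hstate, model_page,
model_free_to_allocator, model_flush_deferred) is made up for
illustration, the "lock" is just a flag, and the "workqueue" is a plain
list.  It only shows the ordering: steps 1-3 happen under the lock so
user visible counts are consistent right away, steps 4-5 are deferred,
and a flush drains everything that is still in flight.

```c
/* Userspace model only: none of these names are kernel APIs. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_page {
	struct model_page *next;	/* link on the deferred list */
	bool on_hugetlb_list;
	bool fields_reset;
	bool returned_to_allocator;
};

struct model_hstate {
	unsigned long nr_huge_pages;	/* user visible count */
	bool lock_held;			/* stand-in for an irq safe hugetlb_lock */
	struct model_page *deferred;	/* stand-in for work queued to a WQ */
};

static void model_lock(struct model_hstate *h)
{
	assert(!h->lock_held);
	h->lock_held = true;
}

static void model_unlock(struct model_hstate *h)
{
	assert(h->lock_held);
	h->lock_held = false;
}

/* Steps 1-3: nothing here sleeps, so it all runs under the lock. */
static void model_free_to_allocator(struct model_hstate *h,
				    struct model_page *p)
{
	model_lock(h);
	p->on_hugetlb_list = false;	/* 1) remove from hugetlb lists */
	h->nr_huge_pages--;		/* 2) update counts */
	p->fields_reset = true;		/* 3) update page fields */
	p->next = h->deferred;		/* hand steps 4-5 to the "WQ" */
	h->deferred = p;
	model_unlock(h);
}

/*
 * Steps 4-5: these may sleep, so each page is taken off the deferred
 * list under the lock and then processed with the lock dropped.
 * Returns the number of pages processed, which is what a sysfs writer
 * would wait on before returning to userspace.
 */
static unsigned long model_flush_deferred(struct model_hstate *h)
{
	unsigned long done = 0;
	struct model_page *p;

	for (;;) {
		model_lock(h);
		p = h->deferred;
		if (p)
			h->deferred = p->next;
		model_unlock(h);
		if (!p)
			break;

		assert(!h->lock_held);	/* sleeping work runs unlocked */
		p->next = NULL;
		/* 4) vmemmap allocation would go here (may sleep) */
		/* 5) hand the page back to buddy or cma */
		p->returned_to_allocator = true;
		done++;
	}
	return done;
}
```

The model deliberately leaves out the hard part Michal raised: if step 4
fails, the page has to go back into the pool and the counts have to be
restored, and nothing above attempts that.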
-- 
Mike Kravetz
