Re: [RFC v2 PATCH 0/4] speed up page allocation for __GFP_ZERO

From: David Hildenbrand <david@redhat.com>
To: Liang Li <liliang324@gmail.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>,
	Mel Gorman <mgorman@techsingularity.net>,
	Andrew Morton <akpm@linux-foundation.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Dan Williams <dan.j.williams@intel.com>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	Jason Wang <jasowang@redhat.com>,
	Dave Hansen <dave.hansen@intel.com>,
	Michal Hocko <mhocko@suse.com>,
	Liang Li <liliangleo@didiglobal.com>,
	linux-mm <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>,
	virtualization@lists.linux-foundation.org
Subject: Re: [RFC v2 PATCH 0/4] speed up page allocation for __GFP_ZERO
Date: Tue, 5 Jan 2021 11:27:10 +0100
Message-ID: <85f16139-b499-dd02-f2bc-c3c42d57ccd8@redhat.com>
In-Reply-To: <CA+2MQi9Qb5srEcx4qKNVWdphBGP0=HHV_h0hWghDMFKFmCOTMg@mail.gmail.com>

On 05.01.21 11:22, Liang Li wrote:
>>>> That's mostly already existing scheduling logic, no? (How many VMs can I put onto a specific machine eventually)
>>>
>>> It depends on how the scheduling component is designed. Yes, you can put
>>> 10 VMs with 4C8G (4 CPUs, 8G RAM) on one host and 20 VMs with 2C4G on
>>> another one. But if one type, e.g. 4C8G, is sold out, customers can't
>>> buy more 4C8G VMs even though some 2C4G VMs are still free; the
>>> resources reserved for those could instead be provided as 4C8G VMs.
>>>
>>
>> 1. You can; the startup time will just be a little slower? E.g., grow a
>> pre-allocated 4G file to 8G.
>>
>> 2. Or let's be creative: teach QEMU to construct a single
>> RAMBlock/MemoryRegion out of multiple tmpfs files. Works as long as you
>> don't go crazy on different VM sizes / size differences.
>>
>> 3. In your example above, you can dynamically rebalance as VMs are
>> getting sold, to make sure you always have "big ones" lying around you
>> can shrink on demand.
>>
> Yes, we can always come up with some ways to make things work.
> It will drive the developer of the upper layer component crazy :)

I'd say that's life in upper layers to optimize special (!) use cases. :)
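
For illustration, option 1 above (growing a pre-allocated 4G file to 8G)
could look roughly like this in an upper-layer management component. This
is a minimal sketch, not anything from the patch series:
grow_backing_file() is a hypothetical helper, Python is chosen only for
brevity, and it assumes Linux, where os.posix_fallocate() is available.
The demo uses MiB instead of GiB so it runs quickly.

```python
import os
import tempfile

MIB = 1024 * 1024


def grow_backing_file(path: str, new_size: int) -> int:
    """Grow a pre-allocated backing file to new_size bytes.

    Hypothetical helper. Returns the size the file had before the call.
    Only the added range is allocated, so the part that was already
    pre-allocated is left untouched.
    """
    fd = os.open(path, os.O_RDWR)
    try:
        old_size = os.fstat(fd).st_size
        if new_size > old_size:
            # Allocate [old_size, new_size); this extends the file size.
            os.posix_fallocate(fd, old_size, new_size - old_size)
        return old_size
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Stand-in for a "small" VM's pre-allocated memory file.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        path = f.name
    fd = os.open(path, os.O_RDWR)
    os.posix_fallocate(fd, 0, 4 * MIB)
    os.close(fd)

    # A "big" VM is requested: grow the existing file instead of
    # allocating a new one from scratch.
    old = grow_backing_file(path, 8 * MIB)
    print(old // MIB, "MiB ->", os.stat(path).st_size // MIB, "MiB")
    os.unlink(path)
```

The extra allocation for the added range is where the "a little slower"
startup would come from; the already allocated first half is reused as is.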

>>>
>>> As you know, a lot of functionality in the kernel could also be
>>> done in userspace, e.g. some of the device emulation such as the APIC,
>>> or the vhost-net backend, which has a userspace implementation.   :)
>>> Whether that is bad or not depends on the benefits the solution brings.
>>> From the viewpoint of a user space application, the kernel should
>>> provide a high-performance memory management service. That's why
>>> I think it should be done in the kernel.
>>
>> As I expressed a couple of times already, I don't see why using
>> hugetlbfs and implementing some sort of pre-zeroing there isn't sufficient.
> 
> Did I miss something before? I thought you doubted the need for
> pre-zeroing free hugetlbfs pages. Hugetlbfs is a good choice and is
> sufficient.

I remember even suggesting to focus on hugetlbfs during your KVM talk
when chatting. Maybe I was not clear before.

> 
>> We really don't *want* complicated things deep down in the mm core if
>> there are reasonable alternatives.
>>
> I understand your concern; we should have sufficient reason to add a new
> feature to the kernel. And for this one, its main value is to make the
> application's life easier. And implementing it in hugetlbfs can avoid
> adding more complexity to core MM.

Exactly, that's my point. Some people might still disagree with the
hugetlbfs approach, but there it's easier to add tunables without
affecting the overall system.

-- 
Thanks,

David / dhildenb