Re: [LSF/MM/BPF TOPIC] VM Memory Overcommit

From: David Hildenbrand <david@redhat.com>
To: David Rientjes <rientjes@google.com>
Cc: SeongJae Park <sj@kernel.org>,
	"T.J. Alumbaugh" <talumbau@google.com>,
	lsf-pc@lists.linux-foundation.org,
	"Sudarshan Rajagopalan (QUIC)" <quic_sudaraja@quicinc.com>,
	hch@lst.de, kai.huang@intel.com, jon@nutanix.com,
	Yuanchu Xie <yuanchu@google.com>, linux-mm <linux-mm@kvack.org>,
	damon@lists.linux.dev
Subject: Re: [LSF/MM/BPF TOPIC] VM Memory Overcommit
Date: Thu, 2 Mar 2023 10:32:18 +0100	[thread overview]
Message-ID: <e660ae94-7b7b-19a4-3748-60432ab389df@redhat.com> (raw)
In-Reply-To: <c57f3f06-5079-6e28-5238-c5731ee02a6e@google.com>

On 02.03.23 04:26, David Rientjes wrote:
> On Tue, 28 Feb 2023, David Hildenbrand wrote:
> 
>> On 28.02.23 23:38, SeongJae Park wrote:
>>> On Tue, 28 Feb 2023 10:20:57 +0100 David Hildenbrand <david@redhat.com>
>>> wrote:
>>>
>>>> On 23.02.23 00:59, T.J. Alumbaugh wrote:
>>>>> Hi,
>>>>>
>>>>> This topic proposal would be to present and discuss multiple MM
>>>>> features to improve host memory overcommit while running VMs. There
>>>>> are two general cases:
>>>>>
>>>>> 1. The host and its guests operate independently,
>>>>>
>>>>> 2. The host and its guests cooperate by techniques like ballooning.
>>>>>
>>>>> In the first case, we would discuss some new techniques, e.g., fast
>>>>> access bit harvesting in the KVM MMU, and some difficulties, e.g.,
>>>>> double zswapping.
>>>>>
>>>>> In the second case, we would like to discuss a novel working set size
>>>>> (WSS) notifier framework and some improvements to the ballooning
>>>>> policy. The WSS notifier, when available, can report WSS to its
>>>>> listeners. VM Memory Overcommit is one of its use cases: the
>>>>> virtio-balloon driver can register for WSS notifications and relay WSS
>>>>> to the host. The host can leverage the WSS notifications and improve
>>>>> the ballooning policy.
>>>>>
>>>>> This topic would be of interest to a wide range of audience, e.g.,
>>>>> phones, laptops and servers.
>>>>> Co-presented with Yuanchu Xie.
>>>>
>>>> In general, having the WSS available to the hypervisor might be
>>>> beneficial. I recall, that there was an idea to leverage MGLRU and to
>>>> communicate MGLRU statistics to the hypervisor, such that the hypervisor
>>>> can make decisions using these statistics.
>>>>
>>>> But note that I don't think that the future will be traditional memory
>>>> balloon inflation/deflation. I think it might be useful in related
>>>> context, though.
>>>>
>>>> What we actually might want is a way to tell the OS ruining inside the
>>>> VM to "please try not using more than XXX MiB of physical memory" but
>>>> treat it as a soft limit. So in case we mess up, or there is a sudden
>>>> peak in memory consumption due to a workload, we won't harm the guest
>>>> OS/workload, and don't have to act immediately to avoid trouble. One can
>>>> think of it like an evolution of memory ballooning: instead of creating
>>>> artificial memory pressure by inflating the balloon that is fairly event
>>>> driven and requires explicit memory deflation, we teach the OS to do it
>>>> natively and pair it with free page reporting.
>>>>
>>>> All free physical memory inside the VM can be reported using free page
>>>> reporting to the hypervisor, and the OS will try sticking to the
>>>> requested "logical" VM size, unless there is real demand for more memory.
>>>
>>> I think use of DAMON_RECLAIM[1] inside VM together with free pages reporting
>>> could be an option.  Some users tried that in a manual way and reported some
>>> positive results.  I'm trying to find a good way to provide some control of
>>> the
>>> in-VM DAMON_RECLAIM utilization to hypervisor.
>>>
>>
>> I think we might want to go one step further and not only reclaim
>> (pro)actively, but also limit e.g., the growth of caches, such as the
>> pagecache, to make them also aware of a soft-limit. Having that said, I still
>> have to learn more about DAMON reclaim :)
>>
> 
> I'm curious, is this limitation possible to impose with memcg today or are
> specifically looking to provide a cap on page cache, dentries, inodes,
> etc, without specifically requiring memcg?

Good question, I remember the last time that topic was raised, the 
common understanding was that existing mechanisms (i.e., memcg) were not 
sufficient. But I am no expert on this, so this sure sounds like a good 
topic to discuss in a bigger group, with hopefully some memcg experts 
around :)

-- 
Thanks,

David / dhildenb