Subject: Re: On guest free page hinting and OOM
From: David Hildenbrand
Organization: Red Hat GmbH
To: "Michael S. Tsirkin", Nitesh Narayan Lal
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, pbonzini@redhat.com, lcapitulino@redhat.com, pagupta@redhat.com, wei.w.wang@intel.com, yang.zhang.wz@gmail.com, riel@surriel.com, dodgen@google.com, konrad.wilk@oracle.com, dhildenb@redhat.com, aarcange@redhat.com, alexander.duyck@gmail.com
Date: Fri, 29 Mar 2019 15:24:24 +0100
In-Reply-To: <20190329084058-mutt-send-email-mst@kernel.org>
References: <20190329084058-mutt-send-email-mst@kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On 29.03.19 14:26, Michael S. Tsirkin wrote:
> On Wed, Mar 06, 2019 at 10:50:42AM -0500, Nitesh Narayan Lal wrote:
>> The following patch-set proposes an efficient mechanism for handing
>> freed memory between the guest and the host. It enables the guests
>> with no page cache to rapidly free and reclaim memory to and from
>> the host respectively.
>
> Sorry about breaking the thread: the original subject was
> KVM: Guest Free Page Hinting
> but the following isn't in response to a specific patch
> so I thought it's reasonable to start a new one.
>
> What bothers me (and others) with both Nitesh's asynchronous approach
> to hinting and the hinting that is already supported in the balloon
> driver right now is that it seems to have the potential to create a fake OOM situation:
> the page that is in the process of being hinted can not be used. How
> likely that is would depend on the workload so is hard to predict.

We had a very simple idea in mind: As long as a hinting request is
pending, don't actually trigger any OOM activity, but wait for it to be
processed. This can be done using a simple atomic variable.

This is a scenario that will only pop up when already pretty low on
memory. And the main difference to ballooning is that we *know* we will
get more memory soon.

>
> Alex's patches do not have this problem as they block the
> VCPUs from attempting to get new pages during hinting. Solves the fake OOM
> issue but adds blocking which most of the time is not necessary.

+ not going via QEMU, which I consider problematic in the future when it
comes to various things:
1) VFIO notifications, if we ever want to support it
2) Verifying that the memory may actually be hinted.

Remember when people started to madvise(DONTNEED) the BIOS and we had
to fix that in QEMU.
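
To make the "simple atomic variable" idea above concrete, here is a minimal user-space sketch, not kernel code and not taken from any posted patch. All names (`hints_pending`, `hint_request_start`, `hint_request_done`, `should_defer_oom`) are hypothetical; it assumes a single counter of in-flight hinting requests that the allocation-failure path can check before starting any OOM activity.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical counter: hinting requests handed to the host that have
 * not been processed yet. */
static atomic_int hints_pending;

/* Called when a hinting request is sent to the host. */
void hint_request_start(void)
{
    atomic_fetch_add(&hints_pending, 1);
}

/* Called once the host has processed the request and the pages are
 * usable by the guest again. */
void hint_request_done(void)
{
    int prev = atomic_fetch_sub(&hints_pending, 1);

    assert(prev > 0); /* a completion without a matching start is a bug */
}

/* Called on the allocation-failure path. Returns true if OOM handling
 * should be deferred because hinted memory is about to come back. */
bool should_defer_oom(void)
{
    return atomic_load(&hints_pending) > 0;
}
```

A caller would retry the allocation (or wait) for as long as `should_defer_oom()` returns true, and only fall through to real OOM handling once no request is pending.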
>
> With both approaches there's a tradeoff: hinting is more efficient if it
> hints about large sized chunks of memory at a time, but as that size
> increases, chances of being able to hold on to that much memory at a
> time decrease. One can claim that this is a regular performance/memory
> tradeoff however there is a difference here: normally
> guest performance is traded off for host memory (which host
> knows how much there is of), this trades guest performance
> for guest memory, but the benefit is on the host, not on
> the guest. Thus this is harder to manage.

One nice thing is that, when only hinting larger chunks, smaller chunks
are more likely to remain available. It would be more of an issue when
hinting at any granularity.

>
> I have an idea: how about allocating extra guest memory on the host? An
> extra hinting buffer would be appended to guest memory, with the
> understanding that it is destined specifically to improve page hinting.
> Balloon device would get an extra parameter specifying the
> hinting buffer size - e.g. in the config space of the driver.
> At driver startup, it would get hold of the amount of
> memory specified by host as the hinting buffer size, and keep it around in a
> buffer list - if no action is taken - forever. Whenever balloon would
> want to get hold of a page of memory and send it to host for hinting, it
> would release a page of the same size from the buffer into the free
> list: a new page swaps places with a page in the buffer.
>
> In this way the amount of useful free memory stays constant.
>
> Once hinting is done page can be swapped back - or just stay
> in the hinting buffer until the next hint.
>
> Clearly this is a memory/performance tradeoff: the more memory host can
> allocate for the hinting buffer, the more batching we'll get so hints
> become cheaper. One notes that:
> - if guest memory isn't pinned, this memory is virtual and can
>   be reclaimed by host. In particular guest can hint about the
>   memory within the hinting buffer at startup.
> - guest performance/host memory tradeoffs are reasonably well understood, and
>   so it's easier to manage: host knows how much memory it can
>   sacrifice to gain the benefit of hinting.
>
> Thoughts?
>

I first want to

a) See it being a real issue. Reproduce it.
b) See that we can't fix it using a simple approach (loop when requests
   not processed yet, always keep X pages ...).
c) See that an easy fix is not sufficient and that this is actually an
   issue.
d) See if we can document it, so that people who care about it can live
   without hinting, like they would live without ballooning.

What you describe sounds interesting, but really involved. And really
problematic. I consider many things about your approach not realistic:

"appended to guest memory", "global list of memory", malicious guests
always using that memory. What about NUMA? What about different page
granularity? What about malicious guests? What about more hinting
requests than the buffer is able to handle?

And much, much, much more.

Honestly, requiring page hinting to make use of actual ballooning or
additional memory makes me shiver. I hope I don't get nightmares ;) In
the long term we might want to get rid of the inflation/deflation side
of virtio-balloon, not require it.

Please don't over-engineer an issue we haven't even seen yet. Especially
not using a mechanism that sounds more involved than actual hinting.

As always, I might be very wrong, but this sounds way too complicated to
me, both on the guest and the hypervisor side.

--

Thanks,

David / dhildenb
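
For reference, the page-swap bookkeeping in the quoted hinting-buffer proposal can be sketched as below. This is only an illustration with page counts instead of real pages. The struct and function names (`hint_state`, `hint_begin`, `hint_end`) are made up for this sketch, and it assumes the "page just stays in the hinting buffer" variant once a hint completes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bookkeeping, counting pages of one fixed size. */
struct hint_state {
    size_t free_pages;   /* pages the guest allocator can hand out */
    size_t buffer_pages; /* reserve pages parked in the hinting buffer */
    size_t hinted_pages; /* pages isolated and currently being hinted */
};

/* Isolate one free page for hinting and release one buffer page into
 * the free list in exchange. Returns false when there is nothing to
 * hint or the buffer is exhausted. */
bool hint_begin(struct hint_state *s)
{
    assert(s != NULL);
    if (s->free_pages == 0 || s->buffer_pages == 0)
        return false;
    s->free_pages--;   /* page leaves the free list ... */
    s->hinted_pages++; /* ... and is sent to the host */
    s->buffer_pages--; /* a buffer page takes its place ... */
    s->free_pages++;   /* ... so usable free memory stays constant */
    return true;
}

/* The host has processed the hint: the page stays in the hinting
 * buffer until the next hint instead of being swapped back. */
void hint_end(struct hint_state *s)
{
    assert(s != NULL);
    if (s->hinted_pages == 0)
        return;
    s->hinted_pages--;
    s->buffer_pages++;
}
```

In this model `free_pages` never changes across a hint, which is the property the proposal relies on. The `false` return from `hint_begin()` is the case of more hinting requests than the buffer can handle.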