LKML Archive on lore.kernel.org
 help / color / Atom feed
From: "Michael S. Tsirkin" <mst@redhat.com>
To: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	kvm@vger.kernel.org, rkrcmar@redhat.com, x86@kernel.org,
	mingo@redhat.com, bp@alien8.de, hpa@zytor.com,
	pbonzini@redhat.com, tglx@linutronix.de,
	akpm@linux-foundation.org
Subject: Re: [RFC PATCH 3/4] kvm: Add guest side support for free memory hints
Date: Mon, 11 Feb 2019 17:52:30 -0500
Message-ID: <20190211174256-mutt-send-email-mst@kernel.org> (raw)
In-Reply-To: <770615ef2db838775fb68130ca60711c6e593f3d.camel@linux.intel.com>

On Mon, Feb 11, 2019 at 01:00:53PM -0800, Alexander Duyck wrote:
> On Mon, 2019-02-11 at 14:54 -0500, Michael S. Tsirkin wrote:
> > On Mon, Feb 11, 2019 at 10:10:06AM -0800, Alexander Duyck wrote:
> > > On Mon, 2019-02-11 at 12:36 -0500, Michael S. Tsirkin wrote:
> > > > On Mon, Feb 11, 2019 at 08:31:34AM -0800, Alexander Duyck wrote:
> > > > > On Sat, 2019-02-09 at 19:49 -0500, Michael S. Tsirkin wrote:
> > > > > > On Mon, Feb 04, 2019 at 10:15:52AM -0800, Alexander Duyck wrote:
> > > > > > > From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> > > > > > > 
> > > > > > > Add guest support for providing free memory hints to the KVM hypervisor for
> > > > > > > freed pages huge TLB size or larger. I am restricting the size to
> > > > > > > huge TLB order and larger because the hypercalls are too expensive to be
> > > > > > > performing one per 4K page.
> > > > > > 
> > > > > > Even 2M pages start to get expensive with a TB guest.
> > > > > 
> > > > > Agreed.
> > > > > 
> > > > > > Really it seems we want a virtio ring so we can pass a batch of these.
> > > > > > E.g. 256 entries, 2M each - that's more like it.
> > > > > 
> > > > > The only issue I see with doing that is that we then have to defer the
> > > > > freeing. Doing that is going to introduce issues in the guest as we are
> > > > > going to have pages going unused for some period of time while we wait
> > > > > for the hint to complete, and we cannot just pull said pages back. I'm
> > > > > not really a fan of the asynchronous nature of Nitesh's patches for
> > > > > this reason.
> > > > 
> > > > Well nothing prevents us from doing an extra exit to the hypervisor if
> > > > we want. The asynchronous nature is there as an optimization
> > > > to allow hypervisor to do its thing on a separate CPU.
> > > > Why not proceed doing other things meanwhile?
> > > > And if the reason is that we are short on memory, then
> > > > maybe we should be less aggressive in hinting?
> > > > 
> > > > E.g. if we just have 2 pages:
> > > > 
> > > > hint page 1
> > > > page 1 hint processed?
> > > > 	yes - proceed to page 2
> > > > 	no - wait for interrupt
> > > > 
> > > > get interrupt that page 1 hint is processed
> > > > hint page 2
> > > > 
> > > > 
> > > > If hypervisor happens to be running on same CPU it
> > > > can process things synchronously and we never enter
> > > > the no branch.
> > > > 
> > > 
> > > Another concern I would have about processing this asynchronously is
> > > that we have the potential for multiple guest CPUs to become
> > > bottlenecked by a single host CPU. I am not sure if that is something
> > > that would be desirable.
> > 
> > Well with a hypercall per page the fix is to block VCPU
> > completely which is also not for everyone.
> > 
> > If you can't push a free page hint to host, then
> > ideally you just won't. That's a nice property of
> > hinting we have upstream right now.
> > Host too busy - hinting is just skipped.
> 
> Right, but if you do that then there is a potential to end up missing
> hints for a large portion of memory. It seems like you would end up
> with even bigger issues since then at that point you have essentially
> leaked memory.
> I would think you would need a way to resync the host and the guest
> after something like that. Otherwise you can have memory that will just
> go unused for an extended period if a guest just goes idle.

Yes and that is my point.  Existing hints code will just take a page off
the free list in that case so it resyncs using the free list.

Something like this could work then: mark up
hinted pages with a flag (its easy to find unused
flags for free pages) then when you get an interrupt
because outstanding hints have been consumed,
get unflagged/unhinted pages from buddy and pass
them to host.


> 
> > > > > > > Using the huge TLB order became the obvious
> > > > > > > choice for the order to use as it allows us to avoid fragmentation of higher
> > > > > > > order memory on the host.
> > > > > > > 
> > > > > > > I have limited the functionality so that it doesn't work when page
> > > > > > > poisoning is enabled. I did this because a write to the page after doing an
> > > > > > > MADV_DONTNEED would effectively negate the hint, so it would be wasting
> > > > > > > cycles to do so.
> > > > > > 
> > > > > > Again that's leaking host implementation detail into guest interface.
> > > > > > 
> > > > > > We are giving guest page hints to host that makes sense,
> > > > > > weird interactions with other features due to host
> > > > > > implementation details should be handled by host.
> > > > > 
> > > > > I don't view this as a host implementation detail, this is guest
> > > > > feature making use of all pages for debugging. If we are placing poison
> > > > > values in the page then I wouldn't consider them an unused page, it is
> > > > > being actively used to store the poison value.
> > > > 
> > > > Well I guess it's a valid point of view for a kernel hacker, but they are
> > > > unused from application's point of view.
> > > > However poisoning is transparent to users and most distro users
> > > > are not aware of it going on. They just know that debug kernels
> > > > are slower.
> > > > User loading a debug kernel and immediately breaking overcommit
> > > > is an unpleasant experience.
> > > 
> > > How would that be any different then a user loading an older kernel
> > > that doesn't have this feature and breaking overcommit as a result?
> > 
> > Well old kernel does not have the feature so nothing to debug.
> > When we have a new feature that goes away in the debug kernel,
> > that's a big support problem since this leads to heisenbugs.
> 
> Trying to debug host features from the guest would be a pain anyway as
> a guest shouldn't even really know what the underlying setup of the
> guest is supposed to be.

I'm talking about debugging the guest though.

> > > I still think it would be better if we left the poisoning enabled in
> > > such a case and just displayed a warning message if nothing else that
> > > hinting is disabled because of page poisoning.
> > > 
> > > One other thought I had on this is that one side effect of page
> > > poisoning is probably that KSM would be able to merge all of the poison
> > > pages together into a single page since they are all set to the same
> > > values. So even with the poisoned pages it would be possible to reduce
> > > total memory overhead.
> > 
> > Right. And BTW one thing that host can do is pass
> > the hinted area to KSM for merging.
> > That requires an alloc hook to free it though.
> > 
> > Or we could add a per-VMA byte with the poison
> > value and use that on host to populate pages on fault.
> > 
> > 
> > > > > If we can achieve this
> > > > > and free the page back to the host then even better, but until the
> > > > > features can coexist we should not use the page hinting while page
> > > > > poisoning is enabled.
> > > > 
> > > > Existing hinting in balloon allows them to coexist so I think we
> > > > need to set the bar just as high for any new variant.
> > > 
> > > That is what I heard. I will have to look into this.
> > 
> > It's not doing anything smart right now, just checks
> > that poison == 0 and skips freeing if not.
> > But it can be enhanced transparently to guests.
> 
> Okay, so it probably should be extended to add something like poison
> page that could replace the zero page for reads to a page that has been
> unmapped.
> 
> > > > > This is one of the reasons why I was opposed to just disabling page
> > > > > poisoning when this feature was enabled in Nitesh's patches. If the
> > > > > guest has page poisoning enabled it is doing something with the page.
> > > > > It shouldn't be prevented from doing that because the host wants to
> > > > > have the option to free the pages.
> > > > 
> > > > I agree but I think the decision belongs on the host. I.e.
> > > > hint the page but tell the host it needs to be careful
> > > > about the poison value. It might also mean we
> > > > need to make sure poisoning happens after the hinting, not before.
> > > 
> > > The only issue with poisoning after instead of before is that the hint
> > > is ignored and we end up triggering a page fault and zero as a result.
> > > It might make more sense to have an architecture specific call that can
> > > be paravirtualized to handle the case of poisoning the page for us if
> > > we have the unused page hint enabled. Otherwise the write to the page
> > > is a given to invalidate the hint.
> > 
> > Sounds interesting. So the arch hook will first poison and
> > then pass the page to the host?
> > 
> > Or we can also ask the host to poison for us, problem is this forces
> > host to either always write into page, or call MADV_DONTNEED,
> > without it could do MADV_FREE. Maybe that is not a big issue.
> 
> I would think we would ask the host to poison for us. If I am not
> mistaken both solutions right now are using MADV_DONTNEED. I would tend
> to lean that way if we are doing page poisoning since the cost for
> zeroing/poisoning the page on the host could be canceled out by
> dropping the page poisoning on the guest.
> 
> Then again since we are doing higher order pages only, and the
> poisoning is supposed to happen before we get into __free_one_page we
> would probably have to do both the poisoning, and the poison on fault.


Oh that's a nice trick. So in fact if we just make sure
we never report PAGE_SIZE pages then poisoning will
automatically happen before reporting?
So we just need to teach host to poison on fault.
Sounds cool and we can always optimize further later.

-- 
MST

  reply index

Thread overview: 55+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-02-04 18:15 [RFC PATCH 0/4] kvm: Report unused guest pages to host Alexander Duyck
2019-02-04 18:15 ` [RFC PATCH 1/4] madvise: Expose ability to set dontneed from kernel Alexander Duyck
2019-02-04 18:15 ` [RFC PATCH 2/4] kvm: Add host side support for free memory hints Alexander Duyck
2019-02-10  0:44   ` Michael S. Tsirkin
2019-02-11 17:34     ` Alexander Duyck
2019-02-11 17:36       ` Michael S. Tsirkin
2019-02-11 17:41     ` Dave Hansen
2019-02-11 17:48       ` Michael S. Tsirkin
2019-02-11 18:30         ` Alexander Duyck
2019-02-11 19:24           ` Michael S. Tsirkin
2019-02-04 18:15 ` [RFC PATCH 3/4] kvm: Add guest " Alexander Duyck
2019-02-04 19:44   ` Dave Hansen
2019-02-04 20:42     ` Alexander Duyck
2019-02-04 23:00   ` Nadav Amit
2019-02-04 23:37     ` Alexander Duyck
2019-02-05  0:03       ` Nadav Amit
2019-02-05  0:16         ` Alexander Duyck
2019-02-05  1:46           ` Nadav Amit
2019-02-05 18:09             ` Alexander Duyck
2019-02-07 18:21   ` Luiz Capitulino
2019-02-07 18:44     ` Alexander Duyck
2019-02-07 20:02       ` Luiz Capitulino
2019-02-08 21:05       ` Nitesh Narayan Lal
2019-02-08 21:31         ` Alexander Duyck
2019-02-10  0:49   ` Michael S. Tsirkin
2019-02-11 16:31     ` Alexander Duyck
2019-02-11 17:36       ` Michael S. Tsirkin
2019-02-11 18:10         ` Alexander Duyck
2019-02-11 19:54           ` Michael S. Tsirkin
2019-02-11 21:00             ` Alexander Duyck
2019-02-11 22:52               ` Michael S. Tsirkin [this message]
     [not found]                 ` <94462313ccd927d25675f69de459456cf066c1a2.camel@linux.intel.com>
2019-02-12  0:34                   ` Michael S. Tsirkin
2019-02-11 17:48     ` Dave Hansen
2019-02-11 17:58       ` Michael S. Tsirkin
2019-02-11 18:19         ` Dave Hansen
2019-02-11 19:56           ` Michael S. Tsirkin
2019-02-04 18:15 ` [RFC PATCH 4/4] mm: Add merge page notifier Alexander Duyck
2019-02-04 19:40   ` Dave Hansen
2019-02-04 19:51     ` Alexander Duyck
2019-02-10  0:57   ` Michael S. Tsirkin
2019-02-11 13:30     ` Nitesh Narayan Lal
2019-02-11 14:17       ` Michael S. Tsirkin
2019-02-11 16:24         ` Nitesh Narayan Lal
2019-02-11 17:41           ` Michael S. Tsirkin
2019-02-11 18:09             ` Nitesh Narayan Lal
2019-02-11  6:40   ` Aaron Lu
2019-02-11 15:58     ` Alexander Duyck
2019-02-12  2:09       ` Aaron Lu
2019-02-12 17:20         ` Alexander Duyck
2019-02-04 18:19 ` [RFC PATCH QEMU] i386/kvm: Enable paravirtual unused page hint mechanism Alexander Duyck
2019-02-05 17:25 ` [RFC PATCH 0/4] kvm: Report unused guest pages to host Nitesh Narayan Lal
2019-02-05 18:43   ` Alexander Duyck
2019-02-07 14:48 ` Nitesh Narayan Lal
2019-02-07 16:56   ` Alexander Duyck
2019-02-10  0:51 ` Michael S. Tsirkin

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190211174256-mutt-send-email-mst@kernel.org \
    --to=mst@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=alexander.duyck@gmail.com \
    --cc=alexander.h.duyck@linux.intel.com \
    --cc=bp@alien8.de \
    --cc=hpa@zytor.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mingo@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=rkrcmar@redhat.com \
    --cc=tglx@linutronix.de \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

LKML Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/lkml/0 lkml/git/0.git
	git clone --mirror https://lore.kernel.org/lkml/1 lkml/git/1.git
	git clone --mirror https://lore.kernel.org/lkml/2 lkml/git/2.git
	git clone --mirror https://lore.kernel.org/lkml/3 lkml/git/3.git
	git clone --mirror https://lore.kernel.org/lkml/4 lkml/git/4.git
	git clone --mirror https://lore.kernel.org/lkml/5 lkml/git/5.git
	git clone --mirror https://lore.kernel.org/lkml/6 lkml/git/6.git
	git clone --mirror https://lore.kernel.org/lkml/7 lkml/git/7.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 lkml lkml/ https://lore.kernel.org/lkml \
		linux-kernel@vger.kernel.org linux-kernel@archiver.kernel.org
	public-inbox-index lkml


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-kernel


AGPL code for this site: git clone https://public-inbox.org/ public-inbox