All of lore.kernel.org
 help / color / mirror / Atom feed
From: Andrea Arcangeli <aarcange@redhat.com>
To: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>,
	"Li, Liang Z" <liang.z.li@intel.com>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	"mhocko@suse.com" <mhocko@suse.com>,
	"mst@redhat.com" <mst@redhat.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"dgilbert@redhat.com" <dgilbert@redhat.com>,
	"pbonzini@redhat.com" <pbonzini@redhat.com>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"virtualization@lists.linux-foundation.org" 
	<virtualization@lists.linux-foundation.org>,
	"kirill.shutemov@linux.intel.com"
	<kirill.shutemov@linux.intel.com>
Subject: Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
Date: Wed, 7 Dec 2016 21:28:24 +0100	[thread overview]
Message-ID: <20161207202824.GH28786@redhat.com> (raw)
In-Reply-To: <b58fd9f6-d9dd-dd56-d476-dd342174dac5@intel.com>

On Wed, Dec 07, 2016 at 11:54:34AM -0800, Dave Hansen wrote:
> We're talking about a bunch of different stuff which is all being
> conflated.  There are 3 issues here that I can see.  I'll attempt to
> summarize what I think is going on:
> 
> 1. Current patches do a hypercall for each order in the allocator.
>    This is inefficient, but independent from the underlying data
>    structure in the ABI, unless bitmaps are in play, which they aren't.
> 2. Should we have bitmaps in the ABI, even if they are not in use by the
>    guest implementation today?  Andrea says they have zero benefits
>    over a pfn/len scheme.  Dave doesn't think they have zero benefits
>    but isn't that attached to them.  QEMU's handling gets more
>    complicated when using a bitmap.
> 3. Should the ABI contain records each with a pfn/len pair or a
>    pfn/order pair?
>    3a. 'len' is more flexible, but will always be a power-of-two anyway
> 	for high-order pages (the common case)

Len wouldn't be a power of two practically only if we detect adjacent
pages of smaller order that may merge into larger orders we already
allocated (or the other way around).

[addr=2M, len=2M] allocated at order 9 pass
[addr=4M, len=1M] allocated at order 8 pass -> merge as [addr=2M, len=3M]

Not sure if it would be worth it, but that unless we do this, page-order or
len won't make much difference.

>    3b. if we decide not to have a bitmap, then we basically have plenty
> 	of space for 'len' and should just do it
>    3c. It's easiest for the hypervisor to turn pfn/len into the
>        madvise() calls that it needs.
> 
> Did I miss anything?

I think you summarized fine all my arguments in your summary.

> FWIW, I don't feel that strongly about the bitmap.  Li had one
> originally, but I think the code thus far has demonstrated a huge
> benefit without even having a bitmap.
> 
> I've got no objections to ripping the bitmap out of the ABI.

I think we need to see a statistic showing the number of bits set in
each bitmap in average, after some uptime and lru churn, like running
stresstest app for a while with I/O and then inflate the balloon and
count:

1) how many bits were set vs total number of bits used in bitmaps

2) how many times bitmaps were used vs bitmap_len = 0 case of single
   page

My guess would be like very low percentage for both points.

> Surely we can think of a few ways...
> 
> A bitmap is 64x more dense if the lists are unordered.  It means being
> able to store ~32k*2M=64G worth of 2M pages in one data page vs. ~1G.
> That's 64x fewer cachelines to touch, 64x fewer pages to move to the
> hypervisor and lets us allocate 1/64th the memory.  Given a maximum
> allocation that we're allowed, it lets us do 64x more per-pass.
> 
> Now, are those benefits worth it?  Maybe not, but let's not pretend they
> don't exist. ;)

In the best case there are benefits obviously, the question is how
common the best case is.

The best case if I understand correctly is all high order not
available, but plenty of order 0 pages available at phys address X,
X+8k, X+16k, X+(8k*nr_bits_in_bitmap). How common is that 0 pages
exist but they're not at an address < X or > X+(8k*nr_bits_in_bitmap)?

> Yes, the current code sends one batch of pages up to the hypervisor per
> order.  But, this has nothing to do with the underlying data structure,
> or the choice to have an order vs. len in the ABI.
> 
> What you describe here is obviously more efficient.

And it isn't possible with the current ABI.

So there is a connection with the MAX_ORDER..0 allocation loop and the
ABI change, but I agree any of the ABI proposed would still allow for
it this logic to be used. Bitmap or not bitmap, the loop would still
work.

WARNING: multiple messages have this Message-ID (diff)
From: Andrea Arcangeli <aarcange@redhat.com>
To: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>,
	"Li, Liang Z" <liang.z.li@intel.com>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	"mhocko@suse.com" <mhocko@suse.com>,
	"mst@redhat.com" <mst@redhat.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"dgilbert@redhat.com" <dgilbert@redhat.com>,
	"pbonzini@redhat.com" <pbonzini@redhat.com>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"virtualization@lists.linux-foundation.org"
	<virtualization@lists.linux-foundation.org>,
	"kirill.shutemov@linux.intel.com"
	<kirill.shutemov@linux.intel.com>
Subject: Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
Date: Wed, 7 Dec 2016 21:28:24 +0100	[thread overview]
Message-ID: <20161207202824.GH28786@redhat.com> (raw)
In-Reply-To: <b58fd9f6-d9dd-dd56-d476-dd342174dac5@intel.com>

On Wed, Dec 07, 2016 at 11:54:34AM -0800, Dave Hansen wrote:
> We're talking about a bunch of different stuff which is all being
> conflated.  There are 3 issues here that I can see.  I'll attempt to
> summarize what I think is going on:
> 
> 1. Current patches do a hypercall for each order in the allocator.
>    This is inefficient, but independent from the underlying data
>    structure in the ABI, unless bitmaps are in play, which they aren't.
> 2. Should we have bitmaps in the ABI, even if they are not in use by the
>    guest implementation today?  Andrea says they have zero benefits
>    over a pfn/len scheme.  Dave doesn't think they have zero benefits
>    but isn't that attached to them.  QEMU's handling gets more
>    complicated when using a bitmap.
> 3. Should the ABI contain records each with a pfn/len pair or a
>    pfn/order pair?
>    3a. 'len' is more flexible, but will always be a power-of-two anyway
> 	for high-order pages (the common case)

Len wouldn't be a power of two practically only if we detect adjacent
pages of smaller order that may merge into larger orders we already
allocated (or the other way around).

[addr=2M, len=2M] allocated at order 9 pass
[addr=4M, len=1M] allocated at order 8 pass -> merge as [addr=2M, len=3M]

Not sure if it would be worth it, but that unless we do this, page-order or
len won't make much difference.

>    3b. if we decide not to have a bitmap, then we basically have plenty
> 	of space for 'len' and should just do it
>    3c. It's easiest for the hypervisor to turn pfn/len into the
>        madvise() calls that it needs.
> 
> Did I miss anything?

I think you summarized fine all my arguments in your summary.

> FWIW, I don't feel that strongly about the bitmap.  Li had one
> originally, but I think the code thus far has demonstrated a huge
> benefit without even having a bitmap.
> 
> I've got no objections to ripping the bitmap out of the ABI.

I think we need to see a statistic showing the number of bits set in
each bitmap in average, after some uptime and lru churn, like running
stresstest app for a while with I/O and then inflate the balloon and
count:

1) how many bits were set vs total number of bits used in bitmaps

2) how many times bitmaps were used vs bitmap_len = 0 case of single
   page

My guess would be like very low percentage for both points.

> Surely we can think of a few ways...
> 
> A bitmap is 64x more dense if the lists are unordered.  It means being
> able to store ~32k*2M=64G worth of 2M pages in one data page vs. ~1G.
> That's 64x fewer cachelines to touch, 64x fewer pages to move to the
> hypervisor and lets us allocate 1/64th the memory.  Given a maximum
> allocation that we're allowed, it lets us do 64x more per-pass.
> 
> Now, are those benefits worth it?  Maybe not, but let's not pretend they
> don't exist. ;)

In the best case there are benefits obviously, the question is how
common the best case is.

The best case if I understand correctly is all high order not
available, but plenty of order 0 pages available at phys address X,
X+8k, X+16k, X+(8k*nr_bits_in_bitmap). How common is that 0 pages
exist but they're not at an address < X or > X+(8k*nr_bits_in_bitmap)?

> Yes, the current code sends one batch of pages up to the hypervisor per
> order.  But, this has nothing to do with the underlying data structure,
> or the choice to have an order vs. len in the ABI.
> 
> What you describe here is obviously more efficient.

And it isn't possible with the current ABI.

So there is a connection with the MAX_ORDER..0 allocation loop and the
ABI change, but I agree any of the ABI proposed would still allow for
it this logic to be used. Bitmap or not bitmap, the loop would still
work.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  parent reply	other threads:[~2016-12-07 20:28 UTC|newest]

Thread overview: 165+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-11-30  8:43 [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration Liang Li
2016-11-30  8:43 ` [Qemu-devel] " Liang Li
2016-11-30  8:43 ` Liang Li
2016-11-30  8:43 ` [PATCH kernel v5 1/5] virtio-balloon: rework deflate to add page to a list Liang Li
2016-11-30  8:43 ` Liang Li
2016-11-30  8:43   ` [Qemu-devel] " Liang Li
2016-11-30  8:43   ` Liang Li
2016-11-30  8:43 ` [PATCH kernel v5 2/5] virtio-balloon: define new feature bit and head struct Liang Li
2016-11-30  8:43 ` Liang Li
2016-11-30  8:43   ` [Qemu-devel] " Liang Li
2016-11-30  8:43   ` Liang Li
2016-11-30  8:43 ` [PATCH kernel v5 3/5] virtio-balloon: speed up inflate/deflate process Liang Li
2016-11-30  8:43   ` [Qemu-devel] " Liang Li
2016-11-30  8:43   ` Liang Li
2016-11-30  8:43 ` Liang Li
2016-11-30  8:43 ` [PATCH kernel v5 4/5] virtio-balloon: define flags and head for host request vq Liang Li
2016-11-30  8:43   ` [Qemu-devel] " Liang Li
2016-11-30  8:43   ` Liang Li
2016-11-30  8:43 ` Liang Li
2016-11-30  8:43 ` [PATCH kernel v5 5/5] virtio-balloon: tell host vm's unused page info Liang Li
2016-11-30  8:43 ` Liang Li
2016-11-30  8:43   ` [Qemu-devel] " Liang Li
2016-11-30  8:43   ` Liang Li
2016-11-30 19:15   ` Dave Hansen
2016-11-30 19:15   ` Dave Hansen
2016-11-30 19:15     ` [Qemu-devel] " Dave Hansen
2016-11-30 19:15     ` Dave Hansen
2016-12-04 13:13     ` Li, Liang Z
2016-12-04 13:13     ` Li, Liang Z
2016-12-04 13:13       ` [Qemu-devel] " Li, Liang Z
2016-12-04 13:13       ` Li, Liang Z
2016-12-05 17:22       ` Dave Hansen
2016-12-05 17:22         ` [Qemu-devel] " Dave Hansen
2016-12-05 17:22         ` Dave Hansen
2016-12-05 17:22         ` Dave Hansen
2016-12-06  4:47         ` Li, Liang Z
2016-12-06  4:47           ` [Qemu-devel] " Li, Liang Z
2016-12-06  4:47           ` Li, Liang Z
2016-12-06  4:47           ` Li, Liang Z
2016-12-06  8:40 ` [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration David Hildenbrand
2016-12-06  8:40   ` [Qemu-devel] " David Hildenbrand
2016-12-06  8:40   ` David Hildenbrand
2016-12-06  8:40   ` David Hildenbrand
2016-12-07 13:35   ` Li, Liang Z
2016-12-07 13:35     ` [Qemu-devel] " Li, Liang Z
2016-12-07 13:35     ` Li, Liang Z
2016-12-07 13:35     ` Li, Liang Z
2016-12-07 15:34     ` Dave Hansen
2016-12-07 15:34       ` [Qemu-devel] " Dave Hansen
2016-12-07 15:34       ` Dave Hansen
2016-12-09  3:09       ` Li, Liang Z
2016-12-09  3:09         ` [Qemu-devel] " Li, Liang Z
2016-12-09  3:09         ` Li, Liang Z
2016-12-09  3:09         ` Li, Liang Z
2016-12-09  3:09       ` Li, Liang Z
2016-12-07 15:34     ` Dave Hansen
2016-12-07 15:42     ` David Hildenbrand
2016-12-07 15:42       ` [Qemu-devel] " David Hildenbrand
2016-12-07 15:42       ` David Hildenbrand
2016-12-07 15:45       ` Dave Hansen
2016-12-07 15:45         ` [Qemu-devel] " Dave Hansen
2016-12-07 15:45         ` Dave Hansen
2016-12-07 15:45         ` Dave Hansen
2016-12-07 16:21         ` David Hildenbrand
2016-12-07 16:21           ` [Qemu-devel] " David Hildenbrand
2016-12-07 16:21           ` David Hildenbrand
2016-12-07 16:21           ` David Hildenbrand
2016-12-07 16:57           ` Dave Hansen
2016-12-07 16:57           ` Dave Hansen
2016-12-07 16:57             ` [Qemu-devel] " Dave Hansen
2016-12-07 16:57             ` Dave Hansen
2016-12-07 18:38             ` [Qemu-devel] " Andrea Arcangeli
2016-12-07 18:38             ` Andrea Arcangeli
2016-12-07 18:38               ` Andrea Arcangeli
2016-12-07 18:38               ` Andrea Arcangeli
2016-12-07 18:44               ` Dave Hansen
2016-12-07 18:44               ` Dave Hansen
2016-12-07 18:44                 ` Dave Hansen
2016-12-07 18:44                 ` Dave Hansen
2016-12-07 18:58                 ` Andrea Arcangeli
2016-12-07 18:58                 ` Andrea Arcangeli
2016-12-07 18:58                   ` Andrea Arcangeli
2016-12-07 18:58                   ` Andrea Arcangeli
2016-12-07 19:54               ` Dave Hansen
2016-12-07 19:54                 ` Dave Hansen
2016-12-07 19:54                 ` Dave Hansen
2016-12-07 19:54                 ` Dave Hansen
2016-12-07 20:28                 ` Andrea Arcangeli
2016-12-07 20:28                 ` Andrea Arcangeli [this message]
2016-12-07 20:28                   ` Andrea Arcangeli
2016-12-07 20:28                   ` Andrea Arcangeli
2016-12-09  4:45                   ` Li, Liang Z
2016-12-09  4:45                   ` Li, Liang Z
2016-12-09  4:45                     ` Li, Liang Z
2016-12-09  4:45                     ` Li, Liang Z
2016-12-09  4:53                     ` Dave Hansen
2016-12-09  4:53                       ` Dave Hansen
2016-12-09  4:53                       ` Dave Hansen
2016-12-09  4:53                       ` Dave Hansen
2016-12-09  5:35                       ` Li, Liang Z
2016-12-09  5:35                         ` Li, Liang Z
2016-12-09  5:35                         ` Li, Liang Z
2016-12-09  5:35                         ` Li, Liang Z
2016-12-09 16:42                         ` Andrea Arcangeli
2016-12-09 16:42                           ` Andrea Arcangeli
2016-12-09 16:42                           ` Andrea Arcangeli
2016-12-09 16:42                           ` Andrea Arcangeli
2016-12-14  8:20                           ` Li, Liang Z
2016-12-14  8:20                             ` Li, Liang Z
2016-12-14  8:20                             ` Li, Liang Z
2016-12-14  8:20                             ` Li, Liang Z
2016-12-14  8:59                       ` Li, Liang Z
2016-12-14  8:59                         ` Li, Liang Z
2016-12-14  8:59                         ` Li, Liang Z
2016-12-14  8:59                         ` Li, Liang Z
2016-12-15 15:34                         ` Dave Hansen
2016-12-15 15:34                         ` Dave Hansen
2016-12-15 15:34                           ` Dave Hansen
2016-12-15 15:34                           ` Dave Hansen
2016-12-15 15:54                           ` Michael S. Tsirkin
2016-12-15 15:54                           ` Michael S. Tsirkin
2016-12-15 15:54                             ` Michael S. Tsirkin
2016-12-15 15:54                             ` Michael S. Tsirkin
2016-12-16  1:12                             ` Li, Liang Z
2016-12-16  1:12                               ` Li, Liang Z
2016-12-16  1:12                               ` Li, Liang Z
2016-12-16  1:12                               ` Li, Liang Z
2016-12-16 15:40                               ` Andrea Arcangeli
2016-12-16 15:40                                 ` Andrea Arcangeli
2016-12-16 15:40                                 ` Andrea Arcangeli
2016-12-17 11:56                                 ` Li, Liang Z
2016-12-17 11:56                                   ` Li, Liang Z
2016-12-17 11:56                                   ` Li, Liang Z
2016-12-17 11:56                                   ` Li, Liang Z
2016-12-16 15:40                               ` Andrea Arcangeli
2016-12-16  0:48                           ` Li, Liang Z
2016-12-16  0:48                             ` Li, Liang Z
2016-12-16  0:48                             ` Li, Liang Z
2016-12-16  0:48                             ` Li, Liang Z
2016-12-16  1:09                             ` Dave Hansen
2016-12-16  1:09                             ` Dave Hansen
2016-12-16  1:09                               ` Dave Hansen
2016-12-16  1:09                               ` Dave Hansen
2016-12-16  1:38                               ` Li, Liang Z
2016-12-16  1:38                                 ` Li, Liang Z
2016-12-16  1:38                                 ` Li, Liang Z
2016-12-16  1:38                                 ` Li, Liang Z
2016-12-16  1:40                                 ` Dave Hansen
2016-12-16  1:40                                   ` Dave Hansen
2016-12-16  1:40                                   ` Dave Hansen
2016-12-16  1:40                                   ` Dave Hansen
2016-12-16  1:43                                   ` Li, Liang Z
2016-12-16  1:43                                   ` Li, Liang Z
2016-12-16  1:43                                     ` Li, Liang Z
2016-12-16  1:43                                     ` Li, Liang Z
2016-12-16 16:01                                   ` Andrea Arcangeli
2016-12-16 16:01                                     ` Andrea Arcangeli
2016-12-16 16:01                                     ` Andrea Arcangeli
2016-12-16 16:01                                     ` Andrea Arcangeli
2016-12-17 12:39                                     ` Li, Liang Z
2016-12-17 12:39                                     ` Li, Liang Z
2016-12-17 12:39                                       ` Li, Liang Z
2016-12-17 12:39                                       ` Li, Liang Z
2016-12-07 15:42     ` David Hildenbrand
2016-12-07 13:35   ` Li, Liang Z

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20161207202824.GH28786@redhat.com \
    --to=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=dave.hansen@intel.com \
    --cc=david@redhat.com \
    --cc=dgilbert@redhat.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=kvm@vger.kernel.org \
    --cc=liang.z.li@intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    --cc=mst@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=virtualization@lists.linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.