KVM Archive on lore.kernel.org
From: Paolo Bonzini <pbonzini@redhat.com>
To: "Michael S. Tsirkin" <mst@redhat.com>, Peter Xu <peterx@redhat.com>
Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	Sean Christopherson <sean.j.christopherson@intel.com>,
	"Dr . David Alan Gilbert" <dgilbert@redhat.com>,
	Vitaly Kuznetsov <vkuznets@redhat.com>
Subject: Re: [PATCH RFC 04/15] KVM: Implement ring-based dirty memory tracking
Date: Thu, 12 Dec 2019 01:08:14 +0100
Message-ID: <46ceb88c-0ddd-0d9a-7128-3aa5a7d9d233@redhat.com> (raw)
In-Reply-To: <20191211172713-mutt-send-email-mst@kernel.org>

On 11/12/19 23:57, Michael S. Tsirkin wrote:
>>> All these seem like arbitrary limitations to me.
>>>
>>> Sizing the ring correctly might prove to be a challenge.
>>>
>>> Thus I think there's value in resizing the rings
>>> without destroying VCPU.
>>
>> Do you have an example on when we could use this feature?
> 
> So e.g. start with a small ring, and if you see stalls too often
> increase it? Otherwise I don't see how does one decide
> on ring size.

If you see stalls often, it means the guest is dirtying memory very
fast.  Harvesting the ring applies back pressure on the guest, so you
may actually prefer a smaller ring size to avoid a bufferbloat-like
situation.

Note that having a larger ring is better, even though it does incur a
memory cost, because it means the migration thread will be able to reap
the ring buffer asynchronously with no vmexits.

With smaller ring sizes the cost of flushing the TLB when resetting the
rings goes up; with larger ones, the initial bulk copy phase _will_ have
vmexits, and reaping the fuller rings then becomes more expensive and
introduces some jitter.  So it will take some experimentation to find
an optimal value.

Anyway, if in the future we go for resizable rings, KVM_ENABLE_CAP can
be passed the largest desired size, and another ioctl can then be
introduced to set the mask for the indices.

>>> Also, power of two just saves a branch here and there,
>>> but wastes lots of memory. Just wrap the index around to
>>> 0 and then users can select any size?
>>
>> Same as above to postpone until we need it?
> 
> It's to save memory, don't we always need to do that?

Does it really save that much memory?  Would it really be so beneficial
to choose 12K entries rather than 8K or 16K in the ring?

>> My understanding of this is that normally we do only want either one
>> of them depending on the major workload and the configuration of the
>> guest.
> 
> And again how does one know which to enable? No one has the
> time to fine-tune gazillion parameters.

Hopefully we can always use just the ring buffer.

> IMHO a huge amount of benchmarking has to happen if you just want to
> set this loose on users as default with these kind of
> limitations. We need to be sure that even though in theory
> it can be very bad, in practice it's actually good.
> If it's auto-tuning then it's a much easier sell to upstream
> even if there's a chance of some regressions.

Auto-tuning is not a silver bullet; it requires just as much
benchmarking to make sure that it doesn't oscillate wildly and that it
actually outperforms a simple fixed size.

>> Yeh kvm versioning could work too.  Here we can also return a zero
>> just like the most of the caps (or KVM_DIRTY_LOG_PAGE_OFFSET as in the
>> original patchset, but it's really helpless either because it's
>> defined in uapi), but I just don't see how it helps...  So I returned
>> a version number just in case we'd like to change the layout some day
>> and when we don't want to bother introducing another cap bit for the
>> same feature (like KVM_CAP_MANUAL_DIRTY_LOG_PROTECT and
>> KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2).
> 
> I guess it's up to Paolo but really I don't see the point.
> You can add a version later when it means something ...

Yeah, we can return the maximum size of the ring buffer, too.

>> I'd say it won't be a big issue on locking 1/2M of host mem for a
>> vm...
>> Also note that if dirty ring is enabled, I plan to evaporate the
>> dirty_bitmap in the next post. The old kvm->dirty_bitmap takes
>> $GUEST_MEM/32K*2 mem.  E.g., for 64G guest it's 64G/32K*2=4M.  If with
>> dirty ring of 8 vcpus, that could be 64K*8=0.5M, which could be even
>> less memory used.
> 
> Right - I think Avi described the bitmap in kernel memory as one of
> design mistakes. Why repeat that with the new design?

Do you have a source for that?  At least the dirty bitmap has to be
accessed from atomic context, so it seems unlikely that it can be moved
to user memory.

The dirty ring could indeed use user memory, but it would be much
harder to set up (multiple ioctls, one for each ring?  what to do if
userspace forgets one?  etc.).  The mmap API is easier to use.

>>>> +	entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
>>>> +	/*
>>>> +	 * The ring buffer is shared with userspace, which might mmap
>>>> +	 * it and concurrently modify slot and offset.  Userspace must
>>>> +	 * not be trusted!  READ_ONCE prevents the compiler from changing
>>>> +	 * the values after they've been range-checked (the checks are
>>>> +	 * in kvm_reset_dirty_gfn).
>>>
>>> What it doesn't is prevent speculative attacks.  That's why things like
>>> copy from user have a speculation barrier.  Instead of worrying about
>>> that, unless it's really critical, I think you'd do well do just use
>>> copy to/from user.

An unconditional speculation barrier (lfence) is also expensive.  We
already have macros to add speculation checks with array_index_nospec at
the right places, for example __kvm_memslots.  We should add an
array_index_nospec to id_to_memslot as well.  I'll send a patch for that.

>>> What depends on what here? Looks suspicious ...
>>
>> Hmm, I think maybe it can be removed because the entry pointer
>> reference below should be an ordering constraint already?

entry->xxx depends on ring->reset_index.

>>> what's the story around locking here? Why is it safe
>>> not to take the lock sometimes?
>>
>> kvm_dirty_ring_push() will be with lock==true only when the per-vm
>> ring is used.  For per-vcpu ring, because that will only happen with
>> the vcpu context, then we don't need locks (so kvm_dirty_ring_push()
>> is called with lock==false).

FWIW this will be done much more nicely in v2.

>>>> +	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
>>>> +	if (!page) {
>>>> +		r = -ENOMEM;
>>>> +		goto out_err_alloc_page;
>>>> +	}
>>>> +	kvm->vm_run = page_address(page);
>>>
>>> So 4K with just 8 bytes used. Not as bad as 1/2Mbyte for the ring but
>>> still. What is wrong with just a pointer and calling put_user?
>>
>> I want to make it the start point for sharing fields between
>> user/kernel per-vm.  Just like kvm_run for per-vcpu.

This page is actually not needed at all.  Userspace can just mmap the
page at KVM_DIRTY_LOG_PAGE_OFFSET; the indices reside there.  You can
drop kvm_vm_run completely.

>>>> +	} else {
>>>> +		/*
>>>> +		 * Put onto per vm ring because no vcpu context.  Kick
>>>> +		 * vcpu0 if ring is full.
>>>
>>> What about tasks on vcpu 0? Do guests realize it's a bad idea to put
>>> critical tasks there, they will be penalized disproportionally?
>>
>> Reasonable question.  So far we can't avoid it because vcpu exit is
>> the event mechanism to say "hey please collect dirty bits".  Maybe
>> someway is better than this, but I'll need to rethink all these
>> over...
> 
> Maybe signal an eventfd, and let userspace worry about deciding what to
> do.

This has to be done synchronously.  But the per-VM ring should be used
very rarely (it's for things like kvmclock updates that write to guest
memory outside a vCPU context), possibly a handful of times over the
whole run of the VM.

>>> KVM_DIRTY_RING_MAX_ENTRIES is not part of UAPI.
>>> So how does userspace know what's legal?
>>> Do you expect it to just try?
>>
>> Yep that's what I thought. :)

We should return it for KVM_CHECK_EXTENSION.

Paolo



