Date: Thu, 9 Jan 2020 14:35:46 -0500
From: "Michael S. Tsirkin"
Tsirkin" To: Peter Xu Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Christophe de Dinechin , Paolo Bonzini , Sean Christopherson , Yan Zhao , Alex Williamson , Jason Wang , Kevin Kevin , Vitaly Kuznetsov , "Dr . David Alan Gilbert" , Lei Cao Subject: Re: [PATCH v3 12/21] KVM: X86: Implement ring-based dirty memory tracking Message-ID: <20200109141634-mutt-send-email-mst@kernel.org> References: <20200109145729.32898-1-peterx@redhat.com> <20200109145729.32898-13-peterx@redhat.com> <20200109110110-mutt-send-email-mst@kernel.org> <20200109191514.GD36997@xz-x1> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20200109191514.GD36997@xz-x1> Sender: kvm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org On Thu, Jan 09, 2020 at 02:15:14PM -0500, Peter Xu wrote: > On Thu, Jan 09, 2020 at 11:29:28AM -0500, Michael S. Tsirkin wrote: > > On Thu, Jan 09, 2020 at 09:57:20AM -0500, Peter Xu wrote: > > > This patch is heavily based on previous work from Lei Cao > > > and Paolo Bonzini . [1] > > > > > > KVM currently uses large bitmaps to track dirty memory. These bitmaps > > > are copied to userspace when userspace queries KVM for its dirty page > > > information. The use of bitmaps is mostly sufficient for live > > > migration, as large parts of memory are be dirtied from one log-dirty > > > pass to another. However, in a checkpointing system, the number of > > > dirty pages is small and in fact it is often bounded---the VM is > > > paused when it has dirtied a pre-defined number of pages. Traversing a > > > large, sparsely populated bitmap to find set bits is time-consuming, > > > as is copying the bitmap to user-space. > > > > > > A similar issue will be there for live migration when the guest memory > > > is huge while the page dirty procedure is trivial. In that case for > > > each dirty sync we need to pull the whole dirty bitmap to userspace > > > and analyse every bit even if it's mostly zeros. > > > > > > The preferred data structure for above scenarios is a dense list of > > > guest frame numbers (GFN). > > > > No longer, this uses an array of structs. > > (IMHO it's more or less a wording thing, because it's still an array > of GFNs behind it...) > > [...] > > > > +Dirty GFNs (Guest Frame Numbers) are stored in the dirty_gfns array. > > > +For each of the dirty entry it's defined as: > > > + > > > +struct kvm_dirty_gfn { > > > + __u32 pad; > > > > How about sticking a length here? > > This way huge pages can be dirtied in one go. > > As we've discussed previously, current KVM tracks dirty in 4K page > only, so it seems to be something that is not easily covered in this > series. > > We probably need to justify on having KVM to track huge pages first, > or at least a trend that we're going to do that, then we can properly > reserve it here. > > > > > > + __u32 slot; /* as_id | slot_id */ > > > + __u64 offset; > > > +}; > > > + > > > +Most of the ring structure is used by KVM internally, while only the > > > +indices are exposed to userspace: > > > + > > > +struct kvm_dirty_ring_indices { > > > + __u32 avail_index; /* set by kernel */ > > > + __u32 fetch_index; /* set by userspace */ > > > +}; > > > + > > > +The two indices in the ring buffer are free running counters. > > > + > > > +Userspace calls KVM_ENABLE_CAP ioctl right after KVM_CREATE_VM ioctl > > > +to enable this capability for the new guest and set the size of the > > > +rings. 
> > I know index design is popular, but testing with virtio showed that
> > it's better to just have a flags field marking an entry as valid.
> > In particular this gets rid of the running counters and power of two
> > limitations.  It also removes the need for a separate index page,
> > which is nice.
> 
> Firstly, note that the separate index page has already been dropped
> since V2, so we don't need to worry about that.

A changelog would be nice.  So now, how does userspace tell kvm it's
done with the ring?

> Regarding dropping the indices: I feel like it can be done, though we
> probably need two extra bits for each GFN entry, for example:
> 
>   - Bit 0 of the GFN address to show whether this is a valid publish
>     of a dirty gfn
> 
>   - Bit 1 of the GFN address to show whether this is collected by the
>     user

I wonder whether you will end up reinventing virtio.  You are already
pretty close with avail/used bits in flags.

> We can also use the padding field, but I just want to show the idea
> first.
> 
> Then for each GFN we can go through state changes like this (things
> like "00b" stand for "bit1 bit0" values):
> 
>   00b (invalid GFN) ->
>   01b (valid gfn published by kernel, which is dirty) ->
>   10b (gfn dirty page collected by userspace) ->
>   00b (gfn reset by kernel, so goes back to invalid gfn)
> 
> And we should always guarantee that both the userspace and KVM walk
> the GFN array in a linear manner, for example, KVM must publish a new
> GFN with bit 0 set right after the previous publish of a GFN.  Vice
> versa for the userspace when it collects a dirty GFN and marks bit 1.
> 
> Michael, do you mean something like this?
> 
> I think it should work logically, however IIUC it can expose more
> security risks, say, the dirty ring is different from virtio in that
> userspace is not trusted,

In what sense?

> while for virtio, both sides (hypervisor, and the guest driver) are
> trusted.

What gave you the impression the guest is trusted in virtio?

> The above means we need to do these to change to the new design:
> 
>   - Allow the GFN array to be mapped as writable by userspace (so
>     that userspace can publish bit 1),
> 
>   - The userspace must be trusted to follow the design (just imagine
>     what if the userspace overwrites a GFN when it publishes bit 1
>     over a valid dirty gfn entry?  KVM could wrongly unprotect a page
>     for the guest...).

You mean protect, right?  So what?

> While if we use the indices, we restrict the userspace to be able to
> write to one index only (which is the reset_index).  That's all it
> can do to mess things up (and it never could, as long as we properly
> validate the reset_index when read, which only happens during
> KVM_RESET_DIRTY_RINGS and is very rare).  From that pov, it seems the
> indices solution still has its benefits.

So if you mess up the index, how is this different?  I agree an RO
page kind of feels safer generally though.

I will have to re-read how the ring works though; my comments were
based on the old assumption of an mmapped page with indices.
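For comparison with the index-based design, the two-bit scheme discussed
above could be carried in a per-entry flags field, roughly as in the
sketch below.  The struct name, flag names and helper are invented for
illustration and are not from the posted series.

#include <linux/types.h>
#include <stdbool.h>

#define DIRTY_GFN_F_DIRTY     (1u << 0)  /* set by KVM when it publishes an entry  */
#define DIRTY_GFN_F_HARVESTED (1u << 1)  /* set by userspace once it has collected */

struct dirty_gfn_flagged {
        __u32 flags;
        __u32 slot;     /* as_id | slot_id */
        __u64 offset;
};

/*
 * Entry lifecycle, with both sides walking the array linearly:
 *
 *   00b  invalid / free slot
 *   01b  published by KVM (DIRTY set), not yet collected
 *   10b  collected by userspace (HARVESTED set, DIRTY cleared)
 *   00b  reset again by KVM_RESET_DIRTY_RINGS
 */

/* Userspace collection of one entry; returns false when it reaches an
 * entry that KVM has not published yet, i.e. the end of the new data. */
static bool collect_one(struct dirty_gfn_flagged *e,
                        void (*record)(__u32 slot, __u64 offset))
{
        if (!(__atomic_load_n(&e->flags, __ATOMIC_ACQUIRE) & DIRTY_GFN_F_DIRTY))
                return false;
        record(e->slot, e->offset);
        /* Hand the slot back only after the payload has been read. */
        __atomic_store_n(&e->flags, DIRTY_GFN_F_HARVESTED, __ATOMIC_RELEASE);
        return true;
}

Note that with flags the kernel must tolerate userspace scribbling over
the shared entries, which is exactly the trust question raised above.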
> > > +The larger the ring buffer, the less likely the ring is full and
> > > +the VM is forced to exit to userspace.  The optimal size depends on
> > > +the workload, but it is recommended that it be at least 64 KiB
> > > +(4096 entries).
> > 
> > Where's this number coming from?  Given you have indices as well, a
> > 4K-sized ring is likely to cause cache contention.
> 
> I think we've had some similar discussion in previous versions on the
> size of the ring.  Again, imho it's really something that may not
> have a clear-cut answer, as long as it's big enough (4K should be).
> 
> Regarding the cache contention: could you explain more?

4K is a whole cache way; 64K is 16 ways.  If anything else is on a hot
path, then you are pushing everything out of the cache.  I need to
re-read how the indices work to see whether an index is on a hot path
or not.  If yes, your structure won't fit in the L1 cache, which is
not great.

> Do you have a suggestion on the size of the ring instead, considering
> the issue?
> 
> [...]

I'll have to re-learn how things work with the indices gone from
shared memory.

> > > +int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
> > > +{
> > > +        u32 cur_slot, next_slot;
> > > +        u64 cur_offset, next_offset;
> > > +        unsigned long mask;
> > > +        u32 fetch;
> > > +        int count = 0;
> > > +        struct kvm_dirty_gfn *entry;
> > > +        struct kvm_dirty_ring_indices *indices = ring->indices;
> > > +        bool first_round = true;
> > > +
> > > +        fetch = READ_ONCE(indices->fetch_index);
> > 
> > So this does not work if the data cache is virtually tagged, which,
> > to the best of my knowledge, isn't the case on any CPU kvm supports.
> > However, it might not stay that way forever.  Worth at least
> > commenting.
> 
> This is the read side.  IIUC even with virtually tagged archs, we
> should do the flushing on the write side rather than the read side,
> and that should be enough?

No.  See e.g. Documentation/core-api/cachetlb.rst

  ``void flush_dcache_page(struct page *page)``

        Any time the kernel writes to a page cache page, _OR_
        the kernel is about to read from a page cache page
        and user space shared/writable mappings of this page
        potentially exist, this routine is called.

> Also, I believe this is a similar question to the one Jason asked in
> V2.  Sorry, I should have mentioned this earlier, but I didn't
> address it in this series because if we need to do so we probably
> need to do it kvm-wise, rather than only in this series.

You need to document these things.

> I feel like it's missing probably only because all existing
> KVM-supported archs do not have virtually tagged caches, as you
> mentioned.

But is that a fact?  ARM has such a variety of CPUs, I can't really
tell.  Did you research this to make sure?

> If so, I would prefer it if you could allow me to ignore that issue
> until KVM starts to support such an arch.

Document limitations pls.  Don't ignore them.

> > > +
> > > +        /*
> > > +         * Note that fetch_index is written by the userspace, which
> > > +         * should not be trusted.  If this happens, then it's probably
> > > +         * that the userspace has written a wrong fetch_index.
> > > +         */
> > > +        if (fetch - ring->reset_index > ring->size)
> > > +                return -EINVAL;
> > > +
> > > +        if (fetch == ring->reset_index)
> > > +                return 0;
> > > +
> > > +        /* This is only needed to make compilers happy */
> > > +        cur_slot = cur_offset = mask = 0;
> > > +        while (ring->reset_index != fetch) {
> > > +                entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
> > > +                next_slot = READ_ONCE(entry->slot);
> > > +                next_offset = READ_ONCE(entry->offset);
> > 
> > What is this READ_ONCE doing?  Entries are only written by kernel
> > and it's under lock.
> 
> The entries are written in kvm_dirty_ring_push() where there is no
> lock (there's one wmb() though to guarantee ordering of these and the
> index update).
> 
> With the wmb(), the write side should guarantee to make it to memory.
> For the read side here, I think it's still good to have it to make
> sure we read from memory?
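As a generic illustration of the ordering question in this exchange (this
is not the patch's code; the names and ring size are made up), a
single-producer/single-consumer pair needs the payload ordered before the
index on the writer side, and the index ordered before the payload on the
reader side:

#include <stdint.h>

#define RING_ENTRIES 4096                /* any power of two */

struct entry {
        uint32_t slot;
        uint64_t offset;
};

struct ring {
        struct entry e[RING_ENTRIES];
        uint32_t avail_index;            /* free running, written by producer */
};

/* Producer side: payload first, then the index.  The release store
 * plays the role of the smp_wmb() in kvm_dirty_ring_push() above. */
static void publish(struct ring *r, uint32_t slot, uint64_t offset)
{
        uint32_t idx = __atomic_load_n(&r->avail_index, __ATOMIC_RELAXED);
        struct entry *e = &r->e[idx & (RING_ENTRIES - 1)];

        e->slot = slot;
        e->offset = offset;
        __atomic_store_n(&r->avail_index, idx + 1, __ATOMIC_RELEASE);
}

/* Consumer side: the acquire load of the index orders the payload
 * reads after it, which is the counterpart the read side must supply. */
static int consume(struct ring *r, uint32_t *fetch, struct entry *out)
{
        if (*fetch == __atomic_load_n(&r->avail_index, __ATOMIC_ACQUIRE))
                return 0;                /* nothing new */
        *out = r->e[*fetch & (RING_ENTRIES - 1)];
        (*fetch)++;
        return 1;
}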
> > > +                ring->reset_index++;
> > > +                count++;
> > > +                /*
> > > +                 * Try to coalesce the reset operations when the guest is
> > > +                 * scanning pages in the same slot.
> > > +                 */
> > > +                if (!first_round && next_slot == cur_slot) {
> > > +                        s64 delta = next_offset - cur_offset;
> > > +
> > > +                        if (delta >= 0 && delta < BITS_PER_LONG) {
> > > +                                mask |= 1ull << delta;
> > > +                                continue;
> > > +                        }
> > > +
> > > +                        /* Backwards visit, careful about overflows! */
> > > +                        if (delta > -BITS_PER_LONG && delta < 0 &&
> > > +                            (mask << -delta >> -delta) == mask) {
> > > +                                cur_offset = next_offset;
> > > +                                mask = (mask << -delta) | 1;
> > > +                                continue;
> > > +                        }
> > > +                }
> > 
> > Well, how important is this logic?  Because it will not be too
> > effective on an SMP system, so don't you need a per-cpu ring?
> 
> It's my fault to have omitted the high-level design in the cover
> letter, but we do have a per-vcpu ring now.  Actually that's all we
> have (we dropped the per-vm ring already), so ring access does not
> need a lock any more.
> 
> This logic is good because kvm_reset_dirty_gfn, and especially
> kvm_arch_mmu_enable_log_dirty_pt_masked() inside it, supports masks,
> so it would be good to do the reset for contiguous pages (or pages
> that are close enough) in a single shot.
> 
> > > +                kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> > > +                cur_slot = next_slot;
> > > +                cur_offset = next_offset;
> > > +                mask = 1;
> > > +                first_round = false;
> > > +        }
> > > +        kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> > > +
> > > +        trace_kvm_dirty_ring_reset(ring);
> > > +
> > > +        return count;
> > > +}
> > > +
> > > +void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
> > > +{
> > > +        struct kvm_dirty_gfn *entry;
> > > +        struct kvm_dirty_ring_indices *indices = ring->indices;
> > > +
> > > +        /* It should never get full */
> > > +        WARN_ON_ONCE(kvm_dirty_ring_full(ring));
> > > +
> > > +        entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
> > > +        entry->slot = slot;
> > > +        entry->offset = offset;
> > > +        /*
> > > +         * Make sure the data is filled in before we publish this to
> > > +         * the userspace program.  There's no paired kernel-side reader.
> > > +         */
> > > +        smp_wmb();
> > > +        ring->dirty_index++;
> > 
> > Do I understand it correctly that the ring is shared between CPUs?
> > If so, I don't understand why it's safe for SMP guests.
> > Don't you need atomics or locking?
> 
> No, it's per-vcpu.
> 
> Thanks,
> 
> -- 
> Peter Xu