linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Marc Zyngier <maz@kernel.org>
To: Peter Xu <peterx@redhat.com>
Cc: Gavin Shan <gshan@redhat.com>,
	kvmarm@lists.cs.columbia.edu,
	linux-arm-kernel@lists.infradead.org, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
	linux-kselftest@vger.kernel.org, pbonzini@redhat.com,
	corbet@lwn.net, james.morse@arm.com, alexandru.elisei@arm.com,
	suzuki.poulose@arm.com, oliver.upton@linux.dev,
	catalin.marinas@arm.com, will@kernel.org, shuah@kernel.org,
	seanjc@google.com, dmatlack@google.com, bgardon@google.com,
	ricarkol@google.com, zhenyzha@redhat.com, shan.gavin@gmail.com
Subject: Re: [PATCH v1 1/5] KVM: arm64: Enable ring-based dirty memory tracking
Date: Wed, 24 Aug 2022 21:57:13 +0100	[thread overview]
Message-ID: <877d2xweae.wl-maz@kernel.org> (raw)
In-Reply-To: <YwZQHqS5DZBloYPZ@xz-m1.local>

On Wed, 24 Aug 2022 17:21:50 +0100,
Peter Xu <peterx@redhat.com> wrote:
> 
> On Wed, Aug 24, 2022 at 03:45:11PM +0100, Marc Zyngier wrote:
> > On Wed, 24 Aug 2022 00:19:04 +0100,
> > Peter Xu <peterx@redhat.com> wrote:
> > > 
> > > On Tue, Aug 23, 2022 at 11:47:03PM +0100, Marc Zyngier wrote:
> > > > Atomicity doesn't guarantee ordering, unfortunately.
> > > 
> > > Right, sorry to be misleading.  The "atomicity" part I was trying to say
> > > the kernel will always see consistent update on the fields.
> > >
> > > The ordering should also be guaranteed, because things must happen with
> > > below sequence:
> > > 
> > >   (1) kernel publish dirty GFN data (slot, offset)
> > >   (2) kernel publish dirty GFN flag (set to DIRTY)
> > >   (3) user sees DIRTY, collects (slots, offset)
> > >   (4) user sets it to RESET
> > >   (5) kernel reads RESET
> > 
> > Maybe. Maybe not. The reset could well be sitting in the CPU write
> > buffer for as long as it wants and not be seen by the kernel if the
> > read occurs on another CPU. And that's the crucial bit: single-CPU is
> > fine, but cross CPU isn't. Unfortunately, the userspace API is per-CPU
> > on collection, and global on reset (this seems like a bad decision,
> > but it is too late to fix this).
> 
> Regarding the last statement, that's something I had question too and
> discussed with Paolo, even though at that time it's not a outcome of
> discussing memory ordering issues.
> 
> IIUC the initial design was trying to avoid tlb flush flood when vcpu
> number is large (each RESET per ring even for one page will need all vcpus
> to flush, so O(N^2) flushing needed). With global RESET it's O(N).  So
> it's kind of a trade-off, and indeed until now I'm not sure which one is
> better.  E.g., with per-ring reset, we can have locality too in userspace,
> e.g. the vcpu thread might be able to recycle without holding global locks.

I don't get that. On x86, each CPU must perform the TLB invalidation
(there is an IPI for that). So whether you do a per-CPU scan of the
ring or a global scan is irrelevant: each entry you find in any of the
rings must result in a global invalidation, since you've updated the
PTE to make the page RO.

The same is true on ARM, except that the broadcast is done in HW
instead of being tracked in SW.

Buy anyway, this is all moot. The API is what it is, and it isn't
going to change any time soon. All we can do is add some
clarifications to the API for the more relaxed architectures, and make
sure the kernel behaves accordingly.

[...]

> > It may be safe, but it isn't what the userspace API promises.
> 
> The document says:
> 
>   After processing one or more entries in the ring buffer, userspace calls
>   the VM ioctl KVM_RESET_DIRTY_RINGS to notify the kernel about it, so that
>   the kernel will reprotect those collected GFNs.  Therefore, the ioctl
>   must be called *before* reading the content of the dirty pages.
> 
> I'd say it's not an explicit promise, but I think I agree with you that at
> least it's unclear on the behavior.

This is the least problematic part of the documentation. The bit I
literally choke on is this:

<quote>
It's not necessary for userspace to harvest the all dirty GFNs at once.
However it must collect the dirty GFNs in sequence, i.e., the userspace
program cannot skip one dirty GFN to collect the one next to it.
</quote>

This is the core of the issue. Without ordering rules, the consumer on
the other side cannot observe the updates correctly, even if userspace
follows the above to the letter. Of course, the kernel itself must do
the right thing (I guess it amounts to the kernel doing a
load-acquire, and userspace doing a store-release -- effectively
emulating x86...).

> Since we have a global recycle mechanism, most likely the app (e.g. current
> qemu impl) will use the same thread to collect/reset dirty GFNs, and
> trigger the RESET ioctl().  In that case it's safe, IIUC, because no
> cross-core ops.
> 
> QEMU even guarantees this by checking it (kvm_dirty_ring_reap_locked):
> 
>     if (total) {
>         ret = kvm_vm_ioctl(s, KVM_RESET_DIRTY_RINGS);
>         assert(ret == total);
>     }
> 
> I think the assert() should never trigger as mentioned above.  But ideally
> maybe it should just be a loop until cleared gfns match total.

Right. If userspace calls the ioctl on every vcpu, then things should
work correctly. It is only that the overhead is higher than what it
should be if multiple vcpus perform a reset at the same time.

>
> > In other words, without further straightening of the API, this doesn't
> > work as expected on relaxed memory architectures. So before this gets
> > enabled on arm64, this whole ordering issue must be addressed.
> 
> How about adding some more documentation for KVM_RESET_DIRTY_RINGS on the
> possibility of recycling partial of the pages, especially when collection
> and the ioctl() aren't done from the same thread?

I'd rather tell people about the ordering rules. That will come at
zero cost on x86.

> Any suggestions will be greatly welcomed.

I'll write a couple of patch when I get the time, most likely next
week. Gavin will hopefully be able to take them as part of his series.

	M.

-- 
Without deviation from the norm, progress is not possible.

  reply	other threads:[~2022-08-24 20:57 UTC|newest]

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-08-19  0:55 [PATCH v1 0/5] KVM: arm64: Enable ring-based dirty memory tracking Gavin Shan
2022-08-19  0:55 ` [PATCH v1 1/5] " Gavin Shan
2022-08-19  8:00   ` Marc Zyngier
2022-08-22  1:58     ` Gavin Shan
2022-08-22 18:55       ` Peter Xu
2022-08-23  3:19         ` Gavin Shan
2022-08-22 21:42       ` Marc Zyngier
2022-08-23  5:22         ` Gavin Shan
2022-08-23 13:58           ` Peter Xu
2022-08-23 19:17             ` Marc Zyngier
2022-08-23 21:20               ` Peter Xu
2022-08-23 22:47                 ` Marc Zyngier
2022-08-23 23:19                   ` Peter Xu
2022-08-24 14:45                     ` Marc Zyngier
2022-08-24 16:21                       ` Peter Xu
2022-08-24 20:57                         ` Marc Zyngier [this message]
2022-08-26  6:05                           ` Gavin Shan
2022-08-26 10:50                   ` Paolo Bonzini
2022-08-26 15:49                     ` Marc Zyngier
2022-08-27  8:27                       ` Paolo Bonzini
2022-08-23 14:44         ` Oliver Upton
2022-08-23 20:35           ` Marc Zyngier
2022-08-26 10:58             ` Paolo Bonzini
2022-08-26 15:28               ` Marc Zyngier
2022-08-30 14:42                 ` Peter Xu
2022-09-02  0:19                   ` Paolo Bonzini
2022-08-19  0:55 ` [PATCH v1 2/5] KVM: selftests: Use host page size to map ring buffer in dirty_log_test Gavin Shan
2022-08-19  0:55 ` [PATCH v1 3/5] KVM: selftests: Dirty host pages " Gavin Shan
2022-08-19  5:28   ` Andrew Jones
2022-08-22  6:29     ` Gavin Shan
2022-08-23  3:09       ` Gavin Shan
2022-08-19  0:56 ` [PATCH v1 4/5] KVM: selftests: Clear dirty ring states between two modes " Gavin Shan
2022-08-19  0:56 ` [PATCH v1 5/5] KVM: selftests: Automate choosing dirty ring size " Gavin Shan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=877d2xweae.wl-maz@kernel.org \
    --to=maz@kernel.org \
    --cc=alexandru.elisei@arm.com \
    --cc=bgardon@google.com \
    --cc=catalin.marinas@arm.com \
    --cc=corbet@lwn.net \
    --cc=dmatlack@google.com \
    --cc=gshan@redhat.com \
    --cc=james.morse@arm.com \
    --cc=kvm@vger.kernel.org \
    --cc=kvmarm@lists.cs.columbia.edu \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=oliver.upton@linux.dev \
    --cc=pbonzini@redhat.com \
    --cc=peterx@redhat.com \
    --cc=ricarkol@google.com \
    --cc=seanjc@google.com \
    --cc=shan.gavin@gmail.com \
    --cc=shuah@kernel.org \
    --cc=suzuki.poulose@arm.com \
    --cc=will@kernel.org \
    --cc=zhenyzha@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).