Date: Sat, 17 Apr 2021 10:09:56 -0400
From: Peter Xu
To: Paolo Bonzini
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Andrew Jones,
    Vitaly Kuznetsov, Sean Christopherson
Subject: Re: [PATCH v2] kvm/selftests: Fix race condition with dirty_log_test
Message-ID: <20210417140956.GV4440@xz-x1>
References: <20210413213641.23742-1-peterx@redhat.com>

Paolo,

Please hold on with this patch: I got another report that it can still
trigger a test failure under an even heavier workload (e.g., "taskset -c 0
./dirty_log_test" plus one or more "taskset -c 0 while :; do :; done").  So I
plan to do it another way.  I ran longer tests yesterday but haven't updated
this patch yet.  More below.

On Sat, Apr 17, 2021 at 02:59:48PM +0200, Paolo Bonzini wrote:
> On 13/04/21 23:36, Peter Xu wrote:
> > This patch closes this race by allowing the main thread to give the vcpu
> > thread a chance to do a VMENTER to complete that write operation.  It's
> > done by adding a vcpu loop counter (which must be defined as volatile, as
> > the main thread will do a read loop); the main thread can then guarantee
> > the vcpu got at least another VMENTER by making sure guest_vcpu_loops
> > increases by 2.
> >
> > Dirty ring does not need this since dirty_ring_last_page would already
> > help avoid this specific race condition.
>
> Just a nit, the comment and commit message should mention KVM_RUN rather
> than vmentry; it's possible to be preempted many times in vcpu_enter_guest
> without making progress, but those wouldn't return to userspace and thus
> would not update guest_vcpu_loops.

But what I really wanted to emphasize is the vmentry point rather than
KVM_RUN: KVM_RUN can return without a vmentry, while the vmentry is exactly
the point at which the data will be flushed.

> Also, volatile is considered harmful even in userspace/test code[1].
> Technically rather than volatile one should use an atomic load (even a
> relaxed one), but in practice it's okay to use volatile too *for this
> specific use* (READ_ONCE/WRITE_ONCE are volatile reads and writes as well).
> If the selftests gained 32-bit support, one should not use volatile because
> neither reads nor writes to uint64_t variables would be guaranteed to be
> atomic.

Indeed!  I'll start to use atomics.

Regarding why this patch won't really solve all the race conditions...

The problem is, I think, that one guest memory write operation (in this
specific test) consists of a few micro-steps while the page is under KVM
dirty tracking (here I'm only considering write-protect rather than PML, but
PML should be similar, at least when the log buffer is full):

  (1) Guest reads the 'iteration' number into a register, prepares to write,
      takes a page fault
  (2) Host sets the dirty bit in either the dirty bitmap or the dirty ring
  (3) Host returns to the guest, data is written

When we verify the data we assumed all these steps are "atomic": when (1) has
happened for a page, we assume (2) & (3) must have happened as well.  We had
some tricks to work around the "un-atomicity" of the above three steps, and
this patch wanted to fix the atomicity of steps (2)+(3) by explicitly letting
the main thread wait for at least one vmenter of the vcpu thread, which
should work.
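Just to illustrate, here is a rough sketch of that idea, also switching from
volatile to relaxed atomics as suggested above.  The names (vcpu_loops,
vcpu_run_done, wait_two_more_kvm_runs) are made up for illustration and are
not the actual selftest code:

  #include <stdint.h>

  /* Bumped by the vcpu thread, read by the main thread. */
  static uint64_t vcpu_loops;

  /* vcpu thread: call once every time KVM_RUN returns. */
  static void vcpu_run_done(void)
  {
          __atomic_fetch_add(&vcpu_loops, 1, __ATOMIC_RELAXED);
  }

  /*
   * Main thread: before verifying the dirty log, wait until the counter has
   * advanced by 2, so that at least one full KVM_RUN (and hence at least one
   * vmentry) has completed since the guest write we might have raced with.
   */
  static void wait_two_more_kvm_runs(void)
  {
          uint64_t start = __atomic_load_n(&vcpu_loops, __ATOMIC_RELAXED);

          while (__atomic_load_n(&vcpu_loops, __ATOMIC_RELAXED) < start + 2)
                  ;  /* busy wait; the real test could yield or sleep here */
  }

Even with something like that in place, though, it only closes the (2)+(3)
window; the race between (1) and (2) described below remains.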
However, what I probably overlooked is that we still have a race when (1) and
(2) can be interrupted.  As an example of how steps (1) and (2) get
interrupted, I simply tried to trace kvm_vcpu_mark_page_dirty() and dump the
stack for vmexit cases; we can then see at least a bunch of cases where the
vcpu can be scheduled out even before setting the dirty bit:

@out[
    __schedule+1742
    __schedule+1742
    __cond_resched+52
    kmem_cache_alloc+583
    kvm_mmu_topup_memory_cache+33
    direct_page_fault+237
    kvm_mmu_page_fault+103
    vmx_handle_exit+288
    vcpu_enter_guest+2460
    kvm_arch_vcpu_ioctl_run+325
    kvm_vcpu_ioctl+526
    __x64_sys_ioctl+131
    do_syscall_64+51
    entry_SYSCALL_64_after_hwframe+68
]: 4
@out[
    __schedule+1742
    __schedule+1742
    __cond_resched+52
    down_read+14
    get_user_pages_unlocked+90
    hva_to_pfn+206
    try_async_pf+132
    direct_page_fault+320
    kvm_mmu_page_fault+103
    vmx_handle_exit+288
    vcpu_enter_guest+2460
    kvm_arch_vcpu_ioctl_run+325
    kvm_vcpu_ioctl+526
    __x64_sys_ioctl+131
    do_syscall_64+51
    entry_SYSCALL_64_after_hwframe+68
]: 23
@out[
    __schedule+1742
    __schedule+1742
    __cond_resched+52
    __alloc_pages+663
    alloc_pages_vma+128
    wp_page_copy+773
    __handle_mm_fault+3155
    handle_mm_fault+151
    __get_user_pages+664
    get_user_pages_unlocked+197
    hva_to_pfn+206
    try_async_pf+132
    direct_page_fault+320
    kvm_mmu_page_fault+103
    vmx_handle_exit+288
    vcpu_enter_guest+2460
    kvm_arch_vcpu_ioctl_run+325
    kvm_vcpu_ioctl+526
    __x64_sys_ioctl+131
    do_syscall_64+51
    entry_SYSCALL_64_after_hwframe+68
]: 1406
@out[
    __schedule+1742
    __schedule+1742
    __cond_resched+52
    hva_to_pfn+157
    try_async_pf+132
    direct_page_fault+320
    kvm_mmu_page_fault+103
    vmx_handle_exit+288
    vcpu_enter_guest+2460
    kvm_arch_vcpu_ioctl_run+325
    kvm_vcpu_ioctl+526
    __x64_sys_ioctl+131
    do_syscall_64+51
    entry_SYSCALL_64_after_hwframe+68
]: 2579
@out[
    __schedule+1742
    __schedule+1742
    __cond_resched+52
    mmu_notifier_invalidate_range_start+9
    wp_page_copy+296
    __handle_mm_fault+3155
    handle_mm_fault+151
    __get_user_pages+664
    get_user_pages_unlocked+197
    hva_to_pfn+206
    try_async_pf+132
    direct_page_fault+320
    kvm_mmu_page_fault+103
    vmx_handle_exit+288
    vcpu_enter_guest+2460
    kvm_arch_vcpu_ioctl_run+325
    kvm_vcpu_ioctl+526
    __x64_sys_ioctl+131
    do_syscall_64+51
    entry_SYSCALL_64_after_hwframe+68
]: 3309
@out[
    __schedule+1742
    __schedule+1742
    __cond_resched+52
    __get_user_pages+530
    get_user_pages_unlocked+197
    hva_to_pfn+206
    try_async_pf+132
    direct_page_fault+320
    kvm_mmu_page_fault+103
    vmx_handle_exit+288
    vcpu_enter_guest+2460
    kvm_arch_vcpu_ioctl_run+325
    kvm_vcpu_ioctl+526
    __x64_sys_ioctl+131
    do_syscall_64+51
    entry_SYSCALL_64_after_hwframe+68
]: 4499

It means it can always happen that the vcpu reads a very old "iteration"
value in step (1) and doesn't set the dirty bit (step 2) or write it to
memory (step 3) until, say, a year later.. :)  Then the verification won't
pass, since the main thread's iteration will be much newer by then, and the
main thread shouts at us.

So far I don't see an easy way to guarantee the atomicity of all of steps 1-3
(as this patch only achieved it for steps 2-3) other than syncing at the
GUEST_SYNC() point of the guest code when we verify the dirty bits.  Drew
mentioned something like this previously in the bugzilla; I wanted to give it
a shot with a lighter sync, but that seems not to work.

Paolo, Drew, Sean - feel free to shoot if any of you have a better idea.  As
I mentioned, I tested v2 of this patch and so far no issue has been found.
I'll post it later today, so maybe we can continue the discussion there too
(btw, I also found another signal race there, so I'll post a series with 2
patches).

Thanks,

-- 
Peter Xu