From: Pavel Butsykin <pbutsykin@virtuozzo.com>
To: Sergio Lopez <slp@redhat.com>, qemu-devel@nongnu.org
Cc: pbonzini@redhat.com, famz@redhat.com, stefanha@redhat.com,
qemu-block@nongnu.org
Subject: Re: [Qemu-devel] [PATCH v3] util/async: use atomic_mb_set in qemu_bh_cancel
Date: Wed, 8 Nov 2017 16:50:01 +0300
Message-ID: <f18d01b4-6ae6-75d9-63ba-83fb22b6a43d@virtuozzo.com>
In-Reply-To: <20171108063447.2842-1-slp@redhat.com>
On 08.11.2017 09:34, Sergio Lopez wrote:
> Commit b7a745d added a qemu_bh_cancel call to the completion function
> as an optimization to prevent it from unnecessarily rescheduling itself.
>
> This completion function is scheduled from worker_thread, after setting
> the state of a ThreadPoolElement to THREAD_DONE.
>
Great! We are seeing the same problem, and I was just writing up my own
fix when I came across your patch :)
> This was considered to be safe, as the completion function restarts the
> loop just after the call to qemu_bh_cancel. But, under certain access
> patterns and scheduling conditions, the loop may wrongly use a
> pre-fetched elem->state value, reading it as THREAD_QUEUED, and ending
> the completion function without having processed a pending TPE linked at
> pool->head:
I'm not quite sure that pre-fetching is involved in this issue, because
a pre-fetched read of an address should be invalidated by a write to the
same address on another core. In our case, the write req->state =
THREAD_DONE should invalidate the read req->state == THREAD_DONE.
I'm inclined to think that a read is being reordered with a write here.
That is a very real case on x86, and I don't see anything that would
prevent it:
.text:000000000060E21E loc_60E21E:                      ; CODE XREF: .text:000000000060E2F4↓j
.text:000000000060E21E                 mov     rbx, [r12+98h]
.text:000000000060E226                 test    rbx, rbx
.text:000000000060E229                 jnz     short loc_60E238
.text:000000000060E22B                 jmp     short exit_0
.text:000000000060E22B ; ---------------------------------------------------------------------------
.text:000000000060E22D                 align 10h
.text:000000000060E230 loc_60E230:                      ; CODE XREF: .text:000000000060E240↓j
.text:000000000060E230                 test    rbp, rbp
.text:000000000060E233                 jz      short exit_0
.text:000000000060E235
.text:000000000060E235 loc_60E235:                      ; CODE XREF: .text:000000000060E289↓j
.text:000000000060E235                 mov     rbx, rbp
.text:000000000060E238
.text:000000000060E238 loc_60E238:                      ; CODE XREF: .text:000000000060E229↑j
.text:000000000060E238                 cmp     [rbx+ThreadPoolElement.state], 2 ; THREAD_DONE
.text:000000000060E23C                 mov     rbp, [rbx+ThreadPoolElement.all.link_next]
.text:000000000060E240                 jnz     short loc_60E230
.text:000000000060E242                 mov     r15d, [rbx+ThreadPoolElement.ret]
.text:000000000060E246                 mov     r13, [rbx+ThreadPoolElement.common.opaque]
.text:000000000060E24A                 nop
.text:000000000060E24B                 lea     rax, trace_events_enabled_count
.text:000000000060E252                 mov     eax, [rax]
.text:000000000060E254                 test    eax, eax
.text:000000000060E256                 mov     rax, rbp
.text:000000000060E259                 jnz     loc_60E2F9
...
.text:000000000060E2BC loc_60E2BC:                      ; CODE XREF: .text:000000000060E27C↑j
.text:000000000060E2BC                 mov     rdi, [r12+8]
.text:000000000060E2C1                 call    qemu_bh_schedule
.text:000000000060E2C6                 mov     rdi, [r12]
.text:000000000060E2CA                 call    aio_context_release
.text:000000000060E2CF                 mov     esi, [rbx+44h]
.text:000000000060E2D2                 mov     rdi, [rbx+18h]
.text:000000000060E2D6                 call    qword ptr [rbx+10h]
.text:000000000060E2D9                 mov     rdi, [r12]
.text:000000000060E2DD                 call    aio_context_acquire
.text:000000000060E2E2                 mov     rdi, [r12+8]
.text:000000000060E2E7                 call    qemu_bh_cancel
.text:000000000060E2EC                 mov     rdi, rbx
.text:000000000060E2EF                 call    qemu_aio_unref
.text:000000000060E2F4                 jmp     loc_60E21E
The read (req->state == THREAD_DONE) can be reordered with the write
inside qemu_bh_cancel(p->completion_bh), and then we get the same
picture:
worker thread | I/O thread
------------------------------------------------------------------------
| reordered read req->state
req->state = THREAD_DONE; |
qemu_bh_schedule(p->completion_bh) |
bh->scheduled = 1; |
| qemu_bh_cancel(p->completion_bh)
| bh->scheduled = 0;
| if (req->state == THREAD_DONE)
| // sees THREAD_QUEUED
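To make the interleaving concrete, here is a minimal C sketch of the
pattern. The types and the function names worker_done()/completion_loop()
are simplified stand-ins of my own, not the actual util/async.c or
util/thread-pool.c code:

enum ThreadState { THREAD_QUEUED, THREAD_ACTIVE, THREAD_DONE };

typedef struct QEMUBH { int scheduled; } QEMUBH;
typedef struct ThreadPoolElement { enum ThreadState state; } ThreadPoolElement;

/* Worker thread: publish the result, then wake the I/O thread. */
static void worker_done(ThreadPoolElement *req, QEMUBH *completion_bh)
{
    req->state = THREAD_DONE;         /* store */
    completion_bh->scheduled = 1;     /* qemu_bh_schedule(), simplified */
}

/* I/O thread: before the patch, qemu_bh_cancel() was a plain store, so
 * the CPU may satisfy the later load of req->state before the store to
 * bh->scheduled becomes globally visible (StoreLoad reordering, which
 * is legal on x86). */
static void completion_loop(ThreadPoolElement *req, QEMUBH *completion_bh)
{
    completion_bh->scheduled = 0;     /* qemu_bh_cancel(), plain store */
    if (req->state == THREAD_DONE) {  /* may still observe THREAD_QUEUED */
        /* ... process req ... */
    }
}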
>
> worker thread | I/O thread
> ------------------------------------------------------------------------
> | speculatively read req->state
> req->state = THREAD_DONE; |
> qemu_bh_schedule(p->completion_bh) |
> bh->scheduled = 1; |
> | qemu_bh_cancel(p->completion_bh)
> | bh->scheduled = 0;
> | if (req->state == THREAD_DONE)
> | // sees THREAD_QUEUED
>
> The source of the misunderstanding was that qemu_bh_cancel is now being
> used by the _consumer_ rather than the producer, and therefore now needs
> to have acquire semantics just like e.g. aio_bh_poll.
>
> In some situations, if there are no other independent requests in the
> same aio context that could eventually trigger the scheduling of the
> completion function, the omitted TPE and all operations pending on it
> will get stuck forever.
>
> Signed-off-by: Sergio Lopez <slp@redhat.com>
> ---
> util/async.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/util/async.c b/util/async.c
> index 355af73ee7..0e1bd8780a 100644
> --- a/util/async.c
> +++ b/util/async.c
> @@ -174,7 +174,7 @@ void qemu_bh_schedule(QEMUBH *bh)
> */
> void qemu_bh_cancel(QEMUBH *bh)
> {
> - bh->scheduled = 0;
> + atomic_mb_set(&bh->scheduled, 0);
But in the end, the patch looks correct: on x86, atomic_mb_set() is an
xchg:
#if defined(__i386__) || defined(__x86_64__) || defined(__s390x__)
#define atomic_mb_set(ptr, i) ((void)atomic_xchg(ptr, i))
Reads and writes cannot be reordered across locked instructions (xchg
implies a lock prefix), so it should prevent this reordering.
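For illustration, the fixed path would look roughly like this, reusing
the simplified types from my sketch above and using the GCC/Clang
__atomic builtin as a stand-in for QEMU's atomic_xchg() (an assumption
about the expansion, not the literal QEMU code):

static void completion_loop_fixed(ThreadPoolElement *req,
                                  QEMUBH *completion_bh)
{
    /* qemu_bh_cancel() after the patch: atomic_mb_set(&bh->scheduled, 0),
     * i.e. an atomic_xchg(), which becomes a locked xchg on x86 and acts
     * as a full memory barrier. */
    (void)__atomic_exchange_n(&completion_bh->scheduled, 0, __ATOMIC_SEQ_CST);

    /* The load below can no longer be hoisted above the xchg, so if the
     * worker stored THREAD_DONE before scheduling the BH, we observe it. */
    if (req->state == THREAD_DONE) {
        /* ... process req ... */
    }
}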
> }
>
> /* This func is async.The bottom half will do the delete action at the finial
>
Thread overview: 9+ messages
2017-11-08 6:34 [Qemu-devel] [PATCH v3] util/async: use atomic_mb_set in qemu_bh_cancel Sergio Lopez
2017-11-08 9:24 ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
2017-11-08 13:50 ` Pavel Butsykin [this message]
2017-11-08 14:10 ` [Qemu-devel] " Sergio Lopez
2017-11-08 14:15 ` Paolo Bonzini
2017-11-08 14:24 ` Sergio Lopez
2017-11-08 14:32 ` Pavel Butsykin
2017-11-08 19:21 ` Stefan Hajnoczi
2017-11-08 16:36 ` Pavel Butsykin