* [Qemu-devel] [PATCH] job: Fix nested aio_poll() hanging in job_txn_apply
@ 2018-08-21 6:45 Fam Zheng
2018-08-21 16:15 ` Eric Blake
0 siblings, 1 reply; 4+ messages in thread
From: Fam Zheng @ 2018-08-21 6:45 UTC (permalink / raw)
To: qemu-devel; +Cc: kwolf, qemu-block, qemu-stable, Jeff Cody, mreitz, pbonzini
All callers have acquired ctx already. Doing that again results in
aio_poll() hang. This fixes the problem that a BDRV_POLL_WHILE() in the
callback cannot make progress because ctx is recursively locked, for
example, when drive-backup finishes.
Cc: qemu-stable@nongnu.org
Reported-by: Gu Nini <ngu@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
---
job.c | 18 +++++-------------
1 file changed, 5 insertions(+), 13 deletions(-)
diff --git a/job.c b/job.c
index fa671b431a..b07c5d0ffc 100644
--- a/job.c
+++ b/job.c
@@ -136,21 +136,13 @@ static void job_txn_del_job(Job *job)
}
}
-static int job_txn_apply(JobTxn *txn, int fn(Job *), bool lock)
+static int job_txn_apply(JobTxn *txn, int fn(Job *))
{
- AioContext *ctx;
Job *job, *next;
int rc = 0;
QLIST_FOREACH_SAFE(job, &txn->jobs, txn_list, next) {
- if (lock) {
- ctx = job->aio_context;
- aio_context_acquire(ctx);
- }
rc = fn(job);
- if (lock) {
- aio_context_release(ctx);
- }
if (rc) {
break;
}
@@ -807,11 +799,11 @@ static void job_do_finalize(Job *job)
assert(job && job->txn);
/* prepare the transaction to complete */
- rc = job_txn_apply(job->txn, job_prepare, true);
+ rc = job_txn_apply(job->txn, job_prepare);
if (rc) {
job_completed_txn_abort(job);
} else {
- job_txn_apply(job->txn, job_finalize_single, true);
+ job_txn_apply(job->txn, job_finalize_single);
}
}
@@ -857,10 +849,10 @@ static void job_completed_txn_success(Job *job)
assert(other_job->ret == 0);
}
- job_txn_apply(txn, job_transition_to_pending, false);
+ job_txn_apply(txn, job_transition_to_pending);
/* If no jobs need manual finalization, automatically do so */
- if (job_txn_apply(txn, job_needs_finalize, false) == 0) {
+ if (job_txn_apply(txn, job_needs_finalize) == 0) {
job_do_finalize(job);
}
}
--
2.17.1
^ permalink raw reply related [flat|nested] 4+ messages in thread
* Re: [Qemu-devel] [PATCH] job: Fix nested aio_poll() hanging in job_txn_apply
2018-08-21 6:45 [Qemu-devel] [PATCH] job: Fix nested aio_poll() hanging in job_txn_apply Fam Zheng
@ 2018-08-21 16:15 ` Eric Blake
2018-08-22 5:48 ` Fam Zheng
0 siblings, 1 reply; 4+ messages in thread
From: Eric Blake @ 2018-08-21 16:15 UTC (permalink / raw)
To: Fam Zheng, qemu-devel
Cc: kwolf, qemu-block, Jeff Cody, qemu-stable, mreitz, pbonzini
On 08/21/2018 01:45 AM, Fam Zheng wrote:
> All callers have acquired ctx already. Doing that again results in
> aio_poll() hang. This fixes the problem that a BDRV_POLL_WHILE() in the
> callback cannot make progress because ctx is recursively locked, for
> example, when drive-backup finishes.
>
> Cc: qemu-stable@nongnu.org
> Reported-by: Gu Nini <ngu@redhat.com>
> Signed-off-by: Fam Zheng <famz@redhat.com>
> ---
> job.c | 18 +++++-------------
> 1 file changed, 5 insertions(+), 13 deletions(-)
>
> diff --git a/job.c b/job.c
> index fa671b431a..b07c5d0ffc 100644
> --- a/job.c
> +++ b/job.c
> @@ -136,21 +136,13 @@ static void job_txn_del_job(Job *job)
> }
> }
>
> -static int job_txn_apply(JobTxn *txn, int fn(Job *), bool lock)
> +static int job_txn_apply(JobTxn *txn, int fn(Job *))
Is it worth adding a comment here that callers must have already
acquired the job context?
However, I was unable to quickly audit whether all callers really did
have the lock (it balloons into whether all callers of job_finalize()
have the lock), so I'm reluctant to give R-b.
> @@ -857,10 +849,10 @@ static void job_completed_txn_success(Job *job)
> assert(other_job->ret == 0);
> }
>
> - job_txn_apply(txn, job_transition_to_pending, false);
> + job_txn_apply(txn, job_transition_to_pending);
>
> /* If no jobs need manual finalization, automatically do so */
> - if (job_txn_apply(txn, job_needs_finalize, false) == 0) {
> + if (job_txn_apply(txn, job_needs_finalize) == 0) {
> job_do_finalize(job);
That said, the change makes sense here: since the first direct call to
job_txn_apply() did not need to lock, why should the second indirect
call through job_do_finalize() need it?
Or, is the fix to have job_do_finalize() gain a bool parameter, where
job_completed_txn_success() would pass false to that parameter, while
job_finalize() would pass true (again, going back to auditing whether
all callers of job_finalize() have already acquired the context).
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [Qemu-devel] [PATCH] job: Fix nested aio_poll() hanging in job_txn_apply
2018-08-21 16:15 ` Eric Blake
@ 2018-08-22 5:48 ` Fam Zheng
2018-08-22 13:42 ` Eric Blake
0 siblings, 1 reply; 4+ messages in thread
From: Fam Zheng @ 2018-08-22 5:48 UTC (permalink / raw)
To: Eric Blake
Cc: qemu-devel, kwolf, qemu-block, Jeff Cody, qemu-stable, mreitz, pbonzini
On Tue, 08/21 11:15, Eric Blake wrote:
> On 08/21/2018 01:45 AM, Fam Zheng wrote:
> > All callers have acquired ctx already. Doing that again results in
> > aio_poll() hang. This fixes the problem that a BDRV_POLL_WHILE() in the
> > callback cannot make progress because ctx is recursively locked, for
> > example, when drive-backup finishes.
> >
> > Cc: qemu-stable@nongnu.org
> > Reported-by: Gu Nini <ngu@redhat.com>
> > Signed-off-by: Fam Zheng <famz@redhat.com>
> > ---
> > job.c | 18 +++++-------------
> > 1 file changed, 5 insertions(+), 13 deletions(-)
> >
> > diff --git a/job.c b/job.c
> > index fa671b431a..b07c5d0ffc 100644
> > --- a/job.c
> > +++ b/job.c
> > @@ -136,21 +136,13 @@ static void job_txn_del_job(Job *job)
> > }
> > }
> > -static int job_txn_apply(JobTxn *txn, int fn(Job *), bool lock)
> > +static int job_txn_apply(JobTxn *txn, int fn(Job *))
>
> Is it worth adding a comment here that callers must have already acquired
> the job context?
I think most job accessing functions require it. If any, the comment should go
to the public functions like job_finalize().
>
> However, I was unable to quickly audit whether all callers really did have
> the lock (it balloons into whether all callers of job_finalize() have the
> lock), so I'm reluctant to give R-b.
>
> > @@ -857,10 +849,10 @@ static void job_completed_txn_success(Job *job)
> > assert(other_job->ret == 0);
> > }
> > - job_txn_apply(txn, job_transition_to_pending, false);
> > + job_txn_apply(txn, job_transition_to_pending);
> > /* If no jobs need manual finalization, automatically do so */
> > - if (job_txn_apply(txn, job_needs_finalize, false) == 0) {
> > + if (job_txn_apply(txn, job_needs_finalize) == 0) {
> > job_do_finalize(job);
>
> That said, the change makes sense here: since the first direct call to
> job_txn_apply() did not need to lock, why should the second indirect call
> through job_do_finalize() need it?
>
> Or, is the fix to have job_do_finalize() gain a bool parameter, where
> job_completed_txn_success() would pass false to that parameter, while
> job_finalize() would pass true (again, going back to auditing whether all
> callers of job_finalize() have already acquired the context).
There are two callers of job_finalize():
fam@lemon:~/work/qemu [master]$ git grep -w -A1 '^\s*job_finalize'
blockdev.c: job_finalize(&job->job, errp);
blockdev.c- aio_context_release(aio_context);
--
job-qmp.c: job_finalize(job, errp);
job-qmp.c- aio_context_release(aio_context);
--
tests/test-blockjob.c: job_finalize(&job->job, &error_abort);
tests/test-blockjob.c- assert(job->job.status == JOB_STATUS_CONCLUDED);
Ignoring the test, it's very easy to see both callers have acquired the context.
Fam
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [Qemu-devel] [PATCH] job: Fix nested aio_poll() hanging in job_txn_apply
2018-08-22 5:48 ` Fam Zheng
@ 2018-08-22 13:42 ` Eric Blake
0 siblings, 0 replies; 4+ messages in thread
From: Eric Blake @ 2018-08-22 13:42 UTC (permalink / raw)
To: Fam Zheng
Cc: qemu-devel, kwolf, qemu-block, Jeff Cody, qemu-stable, mreitz, pbonzini
On 08/22/2018 12:48 AM, Fam Zheng wrote:
>> However, I was unable to quickly audit whether all callers really did have
>> the lock (it balloons into whether all callers of job_finalize() have the
>> lock), so I'm reluctant to give R-b.
>>
>>> @@ -857,10 +849,10 @@ static void job_completed_txn_success(Job *job)
>>> assert(other_job->ret == 0);
>>> }
>>> - job_txn_apply(txn, job_transition_to_pending, false);
>>> + job_txn_apply(txn, job_transition_to_pending);
>>> /* If no jobs need manual finalization, automatically do so */
>>> - if (job_txn_apply(txn, job_needs_finalize, false) == 0) {
>>> + if (job_txn_apply(txn, job_needs_finalize) == 0) {
>>> job_do_finalize(job);
>>
>> That said, the change makes sense here: since the first direct call to
>> job_txn_apply() did not need to lock, why should the second indirect call
>> through job_do_finalize() need it?
>>
>> Or, is the fix to have job_do_finalize() gain a bool parameter, where
>> job_completed_txn_success() would pass false to that parameter, while
>> job_finalize() would pass true (again, going back to auditing whether all
>> callers of job_finalize() have already acquired the context).
>
> There are two callers of job_finalize():
Okay, so the audit would have been easier if I had actually tried
grepping for it. Still, it can't hurt to make the commit message give
more details on what you checked, to make it easier for the reviewers.
>
> fam@lemon:~/work/qemu [master]$ git grep -w -A1 '^\s*job_finalize'
> blockdev.c: job_finalize(&job->job, errp);
> blockdev.c- aio_context_release(aio_context);
> --
> job-qmp.c: job_finalize(job, errp);
> job-qmp.c- aio_context_release(aio_context);
Yep, that's pretty conclusive that the context is already held by all
callers.
Reviewed-by: Eric Blake <eblake@redhat.com>
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2018-08-22 13:42 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-08-21 6:45 [Qemu-devel] [PATCH] job: Fix nested aio_poll() hanging in job_txn_apply Fam Zheng
2018-08-21 16:15 ` Eric Blake
2018-08-22 5:48 ` Fam Zheng
2018-08-22 13:42 ` Eric Blake
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.