* [Qemu-devel] [RFC PATCH 0/2] establish nesting rule of BQL vs cpu-exclusive
From: Roman Kagan @ 2019-05-23 10:54 UTC
To: Paolo Bonzini, qemu-devel

I came across the following AB-BA deadlock:

    vCPU thread                                 main thread
    -----------                                 -----------
    async_safe_run_on_cpu(self,
                          async_synic_update)
    ...                                         [cpu hot-add]
                                                process_queued_cpu_work()
    qemu_mutex_unlock_iothread()
                                                [grab BQL]
    start_exclusive()                           cpu_list_add()
    async_synic_update()                        finish_safe_work()
      qemu_mutex_lock_iothread()                  cpu_exec_start()

ATM async_synic_update seems to be the only async safe work item that
grabs the BQL.  However, it isn't quite obvious that it shouldn't; in
the past there were more examples of this (e.g.
memory_region_do_invalidate_mmio_ptr).

It looks like the problem is generally in the lack of a nesting rule
for cpu-exclusive sections against the BQL, so I thought I would try to
address that.  This patchset is my feeble attempt at it; I'm not sure I
fully comprehend all the consequences (rather, I'm sure I don't), hence
RFC.

Roman Kagan (2):
  cpus-common: nuke finish_safe_work
  cpus-common: assert BQL nesting within cpu-exclusive sections

 cpus-common.c | 12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

-- 
2.21.0
* [Qemu-devel] [RFC PATCH 1/2] cpus-common: nuke finish_safe_work
From: Roman Kagan @ 2019-05-23 10:54 UTC
To: Paolo Bonzini, qemu-devel

It was introduced in commit b129972c8b41e15b0521895a46fd9c752b68a5e,
with the following motivation:

  Because start_exclusive uses CPU_FOREACH, merge exclusive_lock with
  qemu_cpu_list_lock: together with a call to exclusive_idle (via
  cpu_exec_start/end) in cpu_list_add, this protects exclusive work
  against concurrent CPU addition and removal.

However, it seems to be redundant, because the cpu-exclusive
infrastructure provides sufficient protection against the newly added
CPU starting execution while the cpu-exclusive work is running, and the
aforementioned traversal of the cpu list is protected by
qemu_cpu_list_lock.

Besides, this appears to be the only place where a cpu-exclusive
section is entered with the BQL taken, which has been found to trigger
the following AB-BA deadlock:

    vCPU thread                                 main thread
    -----------                                 -----------
    async_safe_run_on_cpu(self,
                          async_synic_update)
    ...                                         [cpu hot-add]
                                                process_queued_cpu_work()
    qemu_mutex_unlock_iothread()
                                                [grab BQL]
    start_exclusive()                           cpu_list_add()
    async_synic_update()                        finish_safe_work()
      qemu_mutex_lock_iothread()                  cpu_exec_start()

So remove it.  This paves the way to establishing a strict nesting rule
of never entering the exclusive section with the BQL taken.

Signed-off-by: Roman Kagan <rkagan@virtuozzo.com>
---
 cpus-common.c | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/cpus-common.c b/cpus-common.c
index 3ca58c64e8..023cfebfa3 100644
--- a/cpus-common.c
+++ b/cpus-common.c
@@ -69,12 +69,6 @@ static int cpu_get_free_index(void)
     return cpu_index;
 }
 
-static void finish_safe_work(CPUState *cpu)
-{
-    cpu_exec_start(cpu);
-    cpu_exec_end(cpu);
-}
-
 void cpu_list_add(CPUState *cpu)
 {
     qemu_mutex_lock(&qemu_cpu_list_lock);
@@ -86,8 +80,6 @@ void cpu_list_add(CPUState *cpu)
     }
     QTAILQ_INSERT_TAIL_RCU(&cpus, cpu, node);
     qemu_mutex_unlock(&qemu_cpu_list_lock);
-
-    finish_safe_work(cpu);
 }
 
 void cpu_list_remove(CPUState *cpu)
-- 
2.21.0
* Re: [Qemu-devel] [RFC PATCH 1/2] cpus-common: nuke finish_safe_work
From: Alex Bennée @ 2019-06-24 10:58 UTC
To: qemu-devel; +Cc: Paolo Bonzini

Roman Kagan <rkagan@virtuozzo.com> writes:

> It was introduced in commit b129972c8b41e15b0521895a46fd9c752b68a5e,
> with the following motivation:

I can't find this commit in my tree.

>
>   Because start_exclusive uses CPU_FOREACH, merge exclusive_lock with
>   qemu_cpu_list_lock: together with a call to exclusive_idle (via
>   cpu_exec_start/end) in cpu_list_add, this protects exclusive work
>   against concurrent CPU addition and removal.
>
> However, it seems to be redundant, because the cpu-exclusive
> infrastructure provides sufficient protection against the newly added
> CPU starting execution while the cpu-exclusive work is running, and the
> aforementioned traversal of the cpu list is protected by
> qemu_cpu_list_lock.
>
> Besides, this appears to be the only place where a cpu-exclusive
> section is entered with the BQL taken, which has been found to trigger
> the following AB-BA deadlock:
>
>     vCPU thread                                 main thread
>     -----------                                 -----------
>     async_safe_run_on_cpu(self,
>                           async_synic_update)
>     ...                                         [cpu hot-add]
>                                                 process_queued_cpu_work()
>     qemu_mutex_unlock_iothread()
>                                                 [grab BQL]
>     start_exclusive()                           cpu_list_add()
>     async_synic_update()                        finish_safe_work()
>       qemu_mutex_lock_iothread()                  cpu_exec_start()
>
> So remove it.  This paves the way to establishing a strict nesting rule
> of never entering the exclusive section with the BQL taken.
>
> Signed-off-by: Roman Kagan <rkagan@virtuozzo.com>
> ---
>  cpus-common.c | 8 --------
>  1 file changed, 8 deletions(-)
>
> diff --git a/cpus-common.c b/cpus-common.c
> index 3ca58c64e8..023cfebfa3 100644
> --- a/cpus-common.c
> +++ b/cpus-common.c
> @@ -69,12 +69,6 @@ static int cpu_get_free_index(void)
>      return cpu_index;
>  }
>  
> -static void finish_safe_work(CPUState *cpu)
> -{
> -    cpu_exec_start(cpu);
> -    cpu_exec_end(cpu);
> -}
> -

This makes sense to me intellectually but I'm worried I've missed the
reason for it being introduced.  Without finish_safe_work we have to
wait for the actual vCPU thread function to acquire and release the BQL
and enter its first cpu_exec_start().

I guess I'd be happier if we had a hotplug test where we could stress
test the operation and be sure we've not just moved the deadlock
somewhere else.

>  void cpu_list_add(CPUState *cpu)
>  {
>      qemu_mutex_lock(&qemu_cpu_list_lock);
> @@ -86,8 +80,6 @@ void cpu_list_add(CPUState *cpu)
>      }
>      QTAILQ_INSERT_TAIL_RCU(&cpus, cpu, node);
>      qemu_mutex_unlock(&qemu_cpu_list_lock);
> -
> -    finish_safe_work(cpu);
>  }
>  
>  void cpu_list_remove(CPUState *cpu)

-- 
Alex Bennée
* Re: [Qemu-devel] [RFC PATCH 1/2] cpus-common: nuke finish_safe_work
From: Roman Kagan @ 2019-06-24 11:50 UTC
To: Alex Bennée; +Cc: Paolo Bonzini, qemu-devel

On Mon, Jun 24, 2019 at 11:58:23AM +0100, Alex Bennée wrote:
> Roman Kagan <rkagan@virtuozzo.com> writes:
>
> > It was introduced in commit b129972c8b41e15b0521895a46fd9c752b68a5e,
> > with the following motivation:
>
> I can't find this commit in my tree.

OOPS, that was supposed to be ab129972c8b41e15b0521895a46fd9c752b68a5e,
sorry.

> >
> >   Because start_exclusive uses CPU_FOREACH, merge exclusive_lock with
> >   qemu_cpu_list_lock: together with a call to exclusive_idle (via
> >   cpu_exec_start/end) in cpu_list_add, this protects exclusive work
> >   against concurrent CPU addition and removal.
> >
> > However, it seems to be redundant, because the cpu-exclusive
> > infrastructure provides sufficient protection against the newly added
> > CPU starting execution while the cpu-exclusive work is running, and the
> > aforementioned traversal of the cpu list is protected by
> > qemu_cpu_list_lock.
> >
> > Besides, this appears to be the only place where a cpu-exclusive
> > section is entered with the BQL taken, which has been found to trigger
> > the following AB-BA deadlock:
> >
> >     vCPU thread                                 main thread
> >     -----------                                 -----------
> >     async_safe_run_on_cpu(self,
> >                           async_synic_update)
> >     ...                                         [cpu hot-add]
> >                                                 process_queued_cpu_work()
> >     qemu_mutex_unlock_iothread()
> >                                                 [grab BQL]
> >     start_exclusive()                           cpu_list_add()
> >     async_synic_update()                        finish_safe_work()
> >       qemu_mutex_lock_iothread()                  cpu_exec_start()
> >
> > So remove it.  This paves the way to establishing a strict nesting rule
> > of never entering the exclusive section with the BQL taken.
> >
> > Signed-off-by: Roman Kagan <rkagan@virtuozzo.com>
> > ---
> >  cpus-common.c | 8 --------
> >  1 file changed, 8 deletions(-)
> >
> > diff --git a/cpus-common.c b/cpus-common.c
> > index 3ca58c64e8..023cfebfa3 100644
> > --- a/cpus-common.c
> > +++ b/cpus-common.c
> > @@ -69,12 +69,6 @@ static int cpu_get_free_index(void)
> >      return cpu_index;
> >  }
> >  
> > -static void finish_safe_work(CPUState *cpu)
> > -{
> > -    cpu_exec_start(cpu);
> > -    cpu_exec_end(cpu);
> > -}
> > -
>
> This makes sense to me intellectually but I'm worried I've missed the
> reason for it being introduced.  Without finish_safe_work we have to
> wait for the actual vCPU thread function to acquire and release the BQL
> and enter its first cpu_exec_start().
>
> I guess I'd be happier if we had a hotplug test where we could stress
> test the operation and be sure we've not just moved the deadlock
> somewhere else.

Me too.  Unfortunately I haven't managed to come up with an idea how to
do this test.  One of the race participants, the safe work in a vCPU
thread, happens in response to an MSR write by the guest.  ATM there's
no way to do it without an actual guest running.  I'll have a look if I
can make a VM test for it, using a Linux guest and its /dev/cpu/*/msr.

Thanks,
Roman.

> >  void cpu_list_add(CPUState *cpu)
> >  {
> >      qemu_mutex_lock(&qemu_cpu_list_lock);
> > @@ -86,8 +80,6 @@ void cpu_list_add(CPUState *cpu)
> >      }
> >      QTAILQ_INSERT_TAIL_RCU(&cpus, cpu, node);
> >      qemu_mutex_unlock(&qemu_cpu_list_lock);
> > -
> > -    finish_safe_work(cpu);
> >  }
> >  
> >  void cpu_list_remove(CPUState *cpu)
* Re: [Qemu-devel] [RFC PATCH 1/2] cpus-common: nuke finish_safe_work
From: Alex Bennée @ 2019-06-24 12:43 UTC
To: Roman Kagan; +Cc: Paolo Bonzini, qemu-devel

Roman Kagan <rkagan@virtuozzo.com> writes:

> On Mon, Jun 24, 2019 at 11:58:23AM +0100, Alex Bennée wrote:
>> Roman Kagan <rkagan@virtuozzo.com> writes:
>>
>> > It was introduced in commit b129972c8b41e15b0521895a46fd9c752b68a5e,
>> > with the following motivation:
>>
>> I can't find this commit in my tree.
>
> OOPS, that was supposed to be ab129972c8b41e15b0521895a46fd9c752b68a5e,
> sorry.
>
>> >
>> >   Because start_exclusive uses CPU_FOREACH, merge exclusive_lock with
>> >   qemu_cpu_list_lock: together with a call to exclusive_idle (via
>> >   cpu_exec_start/end) in cpu_list_add, this protects exclusive work
>> >   against concurrent CPU addition and removal.
>> >
>> > However, it seems to be redundant, because the cpu-exclusive
>> > infrastructure provides sufficient protection against the newly added
>> > CPU starting execution while the cpu-exclusive work is running, and the
>> > aforementioned traversal of the cpu list is protected by
>> > qemu_cpu_list_lock.
>> >
>> > Besides, this appears to be the only place where a cpu-exclusive
>> > section is entered with the BQL taken, which has been found to trigger
>> > the following AB-BA deadlock:
>> >
>> >     vCPU thread                                 main thread
>> >     -----------                                 -----------
>> >     async_safe_run_on_cpu(self,
>> >                           async_synic_update)
>> >     ...                                         [cpu hot-add]
>> >                                                 process_queued_cpu_work()
>> >     qemu_mutex_unlock_iothread()
>> >                                                 [grab BQL]
>> >     start_exclusive()                           cpu_list_add()
>> >     async_synic_update()                        finish_safe_work()
>> >       qemu_mutex_lock_iothread()                  cpu_exec_start()
>> >
>> > So remove it.  This paves the way to establishing a strict nesting rule
>> > of never entering the exclusive section with the BQL taken.
>> >
>> > Signed-off-by: Roman Kagan <rkagan@virtuozzo.com>
>> > ---
>> >  cpus-common.c | 8 --------
>> >  1 file changed, 8 deletions(-)
>> >
>> > diff --git a/cpus-common.c b/cpus-common.c
>> > index 3ca58c64e8..023cfebfa3 100644
>> > --- a/cpus-common.c
>> > +++ b/cpus-common.c
>> > @@ -69,12 +69,6 @@ static int cpu_get_free_index(void)
>> >      return cpu_index;
>> >  }
>> >  
>> > -static void finish_safe_work(CPUState *cpu)
>> > -{
>> > -    cpu_exec_start(cpu);
>> > -    cpu_exec_end(cpu);
>> > -}
>> > -
>>
>> This makes sense to me intellectually but I'm worried I've missed the
>> reason for it being introduced.  Without finish_safe_work we have to
>> wait for the actual vCPU thread function to acquire and release the BQL
>> and enter its first cpu_exec_start().
>>
>> I guess I'd be happier if we had a hotplug test where we could stress
>> test the operation and be sure we've not just moved the deadlock
>> somewhere else.
>
> Me too.  Unfortunately I haven't managed to come up with an idea how to
> do this test.  One of the race participants, the safe work in a vCPU
> thread, happens in response to an MSR write by the guest.  ATM there's
> no way to do it without an actual guest running.  I'll have a look if I
> can make a VM test for it, using a Linux guest and its /dev/cpu/*/msr.

Depending on how much machinery is required to trigger this we could
add a system mode test.  However there isn't much point if it requires
duplicating the entire guest hotplug stack.  It may be easier to trigger
on ARM - the PSCI sequence isn't overly complicated to deal with but I
don't know what the impact of MSIs is.

-- 
Alex Bennée
* [Qemu-devel] [RFC PATCH 2/2] cpus-common: assert BQL nesting within cpu-exclusive sections
From: Roman Kagan @ 2019-05-23 10:54 UTC
To: Paolo Bonzini, qemu-devel

Assert that the cpu-exclusive sections are never entered/left with the
BQL taken.

Signed-off-by: Roman Kagan <rkagan@virtuozzo.com>
---
 cpus-common.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/cpus-common.c b/cpus-common.c
index 023cfebfa3..9aa75fe1ba 100644
--- a/cpus-common.c
+++ b/cpus-common.c
@@ -174,6 +174,7 @@ void start_exclusive(void)
     CPUState *other_cpu;
     int running_cpus;
 
+    assert(!qemu_mutex_iothread_locked());
     qemu_mutex_lock(&qemu_cpu_list_lock);
     exclusive_idle();
 
@@ -205,6 +206,7 @@
 /* Finish an exclusive operation.  */
 void end_exclusive(void)
 {
+    assert(!qemu_mutex_iothread_locked());
     qemu_mutex_lock(&qemu_cpu_list_lock);
     atomic_set(&pending_cpus, 0);
     qemu_cond_broadcast(&exclusive_resume);
@@ -214,6 +216,7 @@
 /* Wait for exclusive ops to finish, and begin cpu execution.  */
 void cpu_exec_start(CPUState *cpu)
 {
+    assert(!qemu_mutex_iothread_locked());
     atomic_set(&cpu->running, true);
 
     /* Write cpu->running before reading pending_cpus.  */
@@ -255,6 +258,7 @@ void cpu_exec_start(CPUState *cpu)
 /* Mark cpu as not executing, and release pending exclusive ops.  */
 void cpu_exec_end(CPUState *cpu)
 {
+    assert(!qemu_mutex_iothread_locked());
     atomic_set(&cpu->running, false);
 
     /* Write cpu->running before reading pending_cpus.  */
-- 
2.21.0
* Re: [Qemu-devel] [RFC PATCH 0/2] establish nesting rule of BQL vs cpu-exclusive
From: Alex Bennée @ 2019-05-23 11:31 UTC
To: qemu-devel; +Cc: Paolo Bonzini, cota, richard.henderson

Roman Kagan <rkagan@virtuozzo.com> writes:

> I came across the following AB-BA deadlock:
>
>     vCPU thread                                 main thread
>     -----------                                 -----------
>     async_safe_run_on_cpu(self,
>                           async_synic_update)
>     ...                                         [cpu hot-add]
>                                                 process_queued_cpu_work()
>     qemu_mutex_unlock_iothread()
>                                                 [grab BQL]
>     start_exclusive()                           cpu_list_add()
>     async_synic_update()                        finish_safe_work()
>       qemu_mutex_lock_iothread()                  cpu_exec_start()
>
> ATM async_synic_update seems to be the only async safe work item that
> grabs the BQL.  However, it isn't quite obvious that it shouldn't; in
> the past there were more examples of this (e.g.
> memory_region_do_invalidate_mmio_ptr).
>
> It looks like the problem is generally in the lack of a nesting rule
> for cpu-exclusive sections against the BQL, so I thought I would try to
> address that.  This patchset is my feeble attempt at it; I'm not sure I
> fully comprehend all the consequences (rather, I'm sure I don't), hence
> RFC.

Hmm I think this is an area touched by:

  Subject: [PATCH v7 00/73] per-CPU locks
  Date: Mon,  4 Mar 2019 13:17:00 -0500
  Message-Id: <20190304181813.8075-1-cota@braap.org>

which has stalled on its path into the tree.  Last time I checked it
explicitly handled the concept of work that needed the BQL and work that
didn't.

How do you trigger your deadlock?  Just hot-plugging CPUs?

> Roman Kagan (2):
>   cpus-common: nuke finish_safe_work
>   cpus-common: assert BQL nesting within cpu-exclusive sections
>
>  cpus-common.c | 12 ++++--------
>  1 file changed, 4 insertions(+), 8 deletions(-)

-- 
Alex Bennée
* Re: [Qemu-devel] [RFC PATCH 0/2] establish nesting rule of BQL vs cpu-exclusive
From: Roman Kagan @ 2019-05-27 11:05 UTC
To: Alex Bennée; +Cc: Paolo Bonzini, cota, richard.henderson, qemu-devel

On Thu, May 23, 2019 at 12:31:16PM +0100, Alex Bennée wrote:
>
> Roman Kagan <rkagan@virtuozzo.com> writes:
>
> > I came across the following AB-BA deadlock:
> >
> >     vCPU thread                                 main thread
> >     -----------                                 -----------
> >     async_safe_run_on_cpu(self,
> >                           async_synic_update)
> >     ...                                         [cpu hot-add]
> >                                                 process_queued_cpu_work()
> >     qemu_mutex_unlock_iothread()
> >                                                 [grab BQL]
> >     start_exclusive()                           cpu_list_add()
> >     async_synic_update()                        finish_safe_work()
> >       qemu_mutex_lock_iothread()                  cpu_exec_start()
> >
> > ATM async_synic_update seems to be the only async safe work item that
> > grabs the BQL.  However, it isn't quite obvious that it shouldn't; in
> > the past there were more examples of this (e.g.
> > memory_region_do_invalidate_mmio_ptr).
> >
> > It looks like the problem is generally in the lack of a nesting rule
> > for cpu-exclusive sections against the BQL, so I thought I would try
> > to address that.  This patchset is my feeble attempt at it; I'm not
> > sure I fully comprehend all the consequences (rather, I'm sure I
> > don't), hence RFC.
>
> Hmm I think this is an area touched by:
>
>   Subject: [PATCH v7 00/73] per-CPU locks
>   Date: Mon,  4 Mar 2019 13:17:00 -0500
>   Message-Id: <20190304181813.8075-1-cota@braap.org>
>
> which has stalled on its path into the tree.  Last time I checked it
> explicitly handled the concept of work that needed the BQL and work
> that didn't.

I'm still trying to get my head around that patchset, but it looks like
it changes nothing in regards to cpu-exclusive sections and safe work,
so it doesn't make the problem go away.

> How do you trigger your deadlock?  Just hot-plugging CPUs?

Yes.  The window is pretty narrow, so I only saw it once, although this
test (where the VMs are started and stopped and the CPUs are plugged in
and out) has been in our test loop for quite a bit (probably 2+ years).

Roman.
* Re: [Qemu-devel] [RFC PATCH 0/2] establish nesting rule of BQL vs cpu-exclusive
From: Roman Kagan @ 2019-06-06 13:22 UTC
To: Alex Bennée; +Cc: Paolo Bonzini, cota, richard.henderson, qemu-devel

On Mon, May 27, 2019 at 11:05:38AM +0000, Roman Kagan wrote:
> On Thu, May 23, 2019 at 12:31:16PM +0100, Alex Bennée wrote:
> >
> > Roman Kagan <rkagan@virtuozzo.com> writes:
> >
> > > I came across the following AB-BA deadlock:
> > >
> > >     vCPU thread                                 main thread
> > >     -----------                                 -----------
> > >     async_safe_run_on_cpu(self,
> > >                           async_synic_update)
> > >     ...                                         [cpu hot-add]
> > >                                                 process_queued_cpu_work()
> > >     qemu_mutex_unlock_iothread()
> > >                                                 [grab BQL]
> > >     start_exclusive()                           cpu_list_add()
> > >     async_synic_update()                        finish_safe_work()
> > >       qemu_mutex_lock_iothread()                  cpu_exec_start()
> > >
> > > ATM async_synic_update seems to be the only async safe work item
> > > that grabs the BQL.  However, it isn't quite obvious that it
> > > shouldn't; in the past there were more examples of this (e.g.
> > > memory_region_do_invalidate_mmio_ptr).
> > >
> > > It looks like the problem is generally in the lack of a nesting
> > > rule for cpu-exclusive sections against the BQL, so I thought I
> > > would try to address that.  This patchset is my feeble attempt at
> > > it; I'm not sure I fully comprehend all the consequences (rather,
> > > I'm sure I don't), hence RFC.
> >
> > Hmm I think this is an area touched by:
> >
> >   Subject: [PATCH v7 00/73] per-CPU locks
> >   Date: Mon,  4 Mar 2019 13:17:00 -0500
> >   Message-Id: <20190304181813.8075-1-cota@braap.org>
> >
> > which has stalled on its path into the tree.  Last time I checked it
> > explicitly handled the concept of work that needed the BQL and work
> > that didn't.
>
> I'm still trying to get my head around that patchset, but it looks like
> it changes nothing in regards to cpu-exclusive sections and safe work,
> so it doesn't make the problem go away.
>
> > How do you trigger your deadlock?  Just hot-plugging CPUs?
>
> Yes.  The window is pretty narrow, so I only saw it once, although this
> test (where the VMs are started and stopped and the CPUs are plugged in
> and out) has been in our test loop for quite a bit (probably 2+ years).
>
> Roman.

ping?
* Re: [Qemu-devel] [RFC PATCH 0/2] establish nesting rule of BQL vs cpu-exclusive
From: Roman Kagan @ 2019-06-21 12:49 UTC
To: Alex Bennée; +Cc: Paolo Bonzini, cota, richard.henderson, qemu-devel

On Thu, Jun 06, 2019 at 01:22:33PM +0000, Roman Kagan wrote:
> On Mon, May 27, 2019 at 11:05:38AM +0000, Roman Kagan wrote:
> > On Thu, May 23, 2019 at 12:31:16PM +0100, Alex Bennée wrote:
> > >
> > > Roman Kagan <rkagan@virtuozzo.com> writes:
> > >
> > > > I came across the following AB-BA deadlock:
> > > >
> > > >     vCPU thread                                 main thread
> > > >     -----------                                 -----------
> > > >     async_safe_run_on_cpu(self,
> > > >                           async_synic_update)
> > > >     ...                                         [cpu hot-add]
> > > >                                                 process_queued_cpu_work()
> > > >     qemu_mutex_unlock_iothread()
> > > >                                                 [grab BQL]
> > > >     start_exclusive()                           cpu_list_add()
> > > >     async_synic_update()                        finish_safe_work()
> > > >       qemu_mutex_lock_iothread()                  cpu_exec_start()
> > > >
> > > > ATM async_synic_update seems to be the only async safe work item
> > > > that grabs the BQL.  However, it isn't quite obvious that it
> > > > shouldn't; in the past there were more examples of this (e.g.
> > > > memory_region_do_invalidate_mmio_ptr).
> > > >
> > > > It looks like the problem is generally in the lack of a nesting
> > > > rule for cpu-exclusive sections against the BQL, so I thought I
> > > > would try to address that.  This patchset is my feeble attempt at
> > > > it; I'm not sure I fully comprehend all the consequences (rather,
> > > > I'm sure I don't), hence RFC.
> > >
> > > Hmm I think this is an area touched by:
> > >
> > >   Subject: [PATCH v7 00/73] per-CPU locks
> > >   Date: Mon,  4 Mar 2019 13:17:00 -0500
> > >   Message-Id: <20190304181813.8075-1-cota@braap.org>
> > >
> > > which has stalled on its path into the tree.  Last time I checked
> > > it explicitly handled the concept of work that needed the BQL and
> > > work that didn't.
> >
> > I'm still trying to get my head around that patchset, but it looks
> > like it changes nothing in regards to cpu-exclusive sections and safe
> > work, so it doesn't make the problem go away.
> >
> > > How do you trigger your deadlock?  Just hot-plugging CPUs?
> >
> > Yes.  The window is pretty narrow, so I only saw it once, although
> > this test (where the VMs are started and stopped and the CPUs are
> > plugged in and out) has been in our test loop for quite a bit
> > (probably 2+ years).
> >
> > Roman.
>
> ping?

ping?
* Re: [Qemu-devel] [RFC PATCH 0/2] establish nesting rule of BQL vs cpu-exclusive
From: Roman Kagan @ 2019-08-05 12:47 UTC
To: Alex Bennée; +Cc: Paolo Bonzini, cota, richard.henderson, qemu-devel

On Fri, Jun 21, 2019 at 12:49:07PM +0000, Roman Kagan wrote:
> On Thu, Jun 06, 2019 at 01:22:33PM +0000, Roman Kagan wrote:
> > On Mon, May 27, 2019 at 11:05:38AM +0000, Roman Kagan wrote:
> > > On Thu, May 23, 2019 at 12:31:16PM +0100, Alex Bennée wrote:
> > > >
> > > > Roman Kagan <rkagan@virtuozzo.com> writes:
> > > >
> > > > > I came across the following AB-BA deadlock:
> > > > >
> > > > >     vCPU thread                                 main thread
> > > > >     -----------                                 -----------
> > > > >     async_safe_run_on_cpu(self,
> > > > >                           async_synic_update)
> > > > >     ...                                         [cpu hot-add]
> > > > >                                                 process_queued_cpu_work()
> > > > >     qemu_mutex_unlock_iothread()
> > > > >                                                 [grab BQL]
> > > > >     start_exclusive()                           cpu_list_add()
> > > > >     async_synic_update()                        finish_safe_work()
> > > > >       qemu_mutex_lock_iothread()                  cpu_exec_start()
> > > > >
> > > > > ATM async_synic_update seems to be the only async safe work
> > > > > item that grabs the BQL.  However, it isn't quite obvious that
> > > > > it shouldn't; in the past there were more examples of this
> > > > > (e.g. memory_region_do_invalidate_mmio_ptr).
> > > > >
> > > > > It looks like the problem is generally in the lack of a nesting
> > > > > rule for cpu-exclusive sections against the BQL, so I thought I
> > > > > would try to address that.  This patchset is my feeble attempt
> > > > > at it; I'm not sure I fully comprehend all the consequences
> > > > > (rather, I'm sure I don't), hence RFC.
> > > >
> > > > Hmm I think this is an area touched by:
> > > >
> > > >   Subject: [PATCH v7 00/73] per-CPU locks
> > > >   Date: Mon,  4 Mar 2019 13:17:00 -0500
> > > >   Message-Id: <20190304181813.8075-1-cota@braap.org>
> > > >
> > > > which has stalled on its path into the tree.  Last time I checked
> > > > it explicitly handled the concept of work that needed the BQL and
> > > > work that didn't.
> > >
> > > I'm still trying to get my head around that patchset, but it looks
> > > like it changes nothing in regards to cpu-exclusive sections and
> > > safe work, so it doesn't make the problem go away.
> > >
> > > > How do you trigger your deadlock?  Just hot-plugging CPUs?
> > >
> > > Yes.  The window is pretty narrow, so I only saw it once, although
> > > this test (where the VMs are started and stopped and the CPUs are
> > > plugged in and out) has been in our test loop for quite a bit
> > > (probably 2+ years).
> > >
> > > Roman.
> >
> > ping?
>
> ping?

ping?
* Re: [Qemu-devel] [RFC PATCH 0/2] establish nesting rule of BQL vs cpu-exclusive
From: Paolo Bonzini @ 2019-08-05 15:56 UTC
To: Roman Kagan, Alex Bennée; +Cc: cota, richard.henderson, qemu-devel

On 05/08/19 14:47, Roman Kagan wrote:
> On Fri, Jun 21, 2019 at 12:49:07PM +0000, Roman Kagan wrote:
>> On Thu, Jun 06, 2019 at 01:22:33PM +0000, Roman Kagan wrote:
>>> On Mon, May 27, 2019 at 11:05:38AM +0000, Roman Kagan wrote:
>>>> On Thu, May 23, 2019 at 12:31:16PM +0100, Alex Bennée wrote:
>>>>>
>>>>> Roman Kagan <rkagan@virtuozzo.com> writes:
>>>>>
>>>>>> I came across the following AB-BA deadlock:
>>>>>>
>>>>>>     vCPU thread                                 main thread
>>>>>>     -----------                                 -----------
>>>>>>     async_safe_run_on_cpu(self,
>>>>>>                           async_synic_update)
>>>>>>     ...                                         [cpu hot-add]
>>>>>>                                                 process_queued_cpu_work()
>>>>>>     qemu_mutex_unlock_iothread()
>>>>>>                                                 [grab BQL]
>>>>>>     start_exclusive()                           cpu_list_add()
>>>>>>     async_synic_update()                        finish_safe_work()
>>>>>>       qemu_mutex_lock_iothread()                  cpu_exec_start()
>>>>>>
>>>>>> ATM async_synic_update seems to be the only async safe work item
>>>>>> that grabs the BQL.  However, it isn't quite obvious that it
>>>>>> shouldn't; in the past there were more examples of this (e.g.
>>>>>> memory_region_do_invalidate_mmio_ptr).
>>>>>>
>>>>>> It looks like the problem is generally in the lack of a nesting
>>>>>> rule for cpu-exclusive sections against the BQL, so I thought I
>>>>>> would try to address that.  This patchset is my feeble attempt at
>>>>>> it; I'm not sure I fully comprehend all the consequences (rather,
>>>>>> I'm sure I don't), hence RFC.
>>>>>
>>>>> Hmm I think this is an area touched by:
>>>>>
>>>>>   Subject: [PATCH v7 00/73] per-CPU locks
>>>>>   Date: Mon,  4 Mar 2019 13:17:00 -0500
>>>>>   Message-Id: <20190304181813.8075-1-cota@braap.org>
>>>>>
>>>>> which has stalled on its path into the tree.  Last time I checked
>>>>> it explicitly handled the concept of work that needed the BQL and
>>>>> work that didn't.
>>>>
>>>> I'm still trying to get my head around that patchset, but it looks
>>>> like it changes nothing in regards to cpu-exclusive sections and
>>>> safe work, so it doesn't make the problem go away.
>>>>
>>>>> How do you trigger your deadlock?  Just hot-plugging CPUs?
>>>>
>>>> Yes.  The window is pretty narrow, so I only saw it once, although
>>>> this test (where the VMs are started and stopped and the CPUs are
>>>> plugged in and out) has been in our test loop for quite a bit
>>>> (probably 2+ years).
>>>>
>>>> Roman.
>>>
>>> ping?
>>
>> ping?
>
> ping?

Queued for 4.2.

Paolo
Thread overview: 12+ messages (end of thread, newest: ~2019-08-05 15:57 UTC)

2019-05-23 10:54 [Qemu-devel] [RFC PATCH 0/2] establish nesting rule of BQL vs cpu-exclusive  Roman Kagan
2019-05-23 10:54 ` [Qemu-devel] [RFC PATCH 1/2] cpus-common: nuke finish_safe_work  Roman Kagan
2019-06-24 10:58   ` Alex Bennée
2019-06-24 11:50     ` Roman Kagan
2019-06-24 12:43       ` Alex Bennée
2019-05-23 10:54 ` [Qemu-devel] [RFC PATCH 2/2] cpus-common: assert BQL nesting within cpu-exclusive sections  Roman Kagan
2019-05-23 11:31 ` [Qemu-devel] [RFC PATCH 0/2] establish nesting rule of BQL vs cpu-exclusive  Alex Bennée
2019-05-27 11:05   ` Roman Kagan
2019-06-06 13:22     ` Roman Kagan
2019-06-21 12:49       ` Roman Kagan
2019-08-05 12:47         ` Roman Kagan
2019-08-05 15:56           ` Paolo Bonzini