Am 15.12.2020 um 14:15 hat Sergio Lopez geschrieben: > On Tue, Dec 15, 2020 at 01:12:33PM +0100, Kevin Wolf wrote: > > Am 14.12.2020 um 18:05 hat Sergio Lopez geschrieben: > > > While processing the parents of a BDS, one of the parents may process > > > the child that's doing the tail recursion, which leads to a BDS being > > > processed twice. This is especially problematic for the aio_notifiers, > > > as they might attempt to work on both the old and the new AIO > > > contexts. > > > > > > To avoid this, add the BDS pointer to the ignore list, and check the > > > child BDS pointer while iterating over the children. > > > > > > Signed-off-by: Sergio Lopez > > > > Ugh, so we get a mixed list of BdrvChild and BlockDriverState? :-/ > > I know, it's effective but quite ugly... > > > What is the specific scenario where you saw this breaking? Did you have > > multiple BdrvChild connections between two nodes so that we would go to > > the parent node through one and then come back to the child node through > > the other? > > I don't think this is a corner case. If the graph is walked top->down, > there's no problem since children are added to the ignore list before > getting processed, and siblings don't process each other. But, if the > graph is walked bottom->up, a BDS will start processing its parents > without adding itself to the ignore list, so there's nothing > preventing them from processing it again. I don't understand. child is added to ignore before calling the parent callback on it, so how can we come back through the same BdrvChild? QLIST_FOREACH(child, &bs->parents, next_parent) { if (g_slist_find(*ignore, child)) { continue; } assert(child->klass->set_aio_ctx); *ignore = g_slist_prepend(*ignore, child); child->klass->set_aio_ctx(child, new_context, ignore); } > I'm pasting here an annotated trace of bdrv_set_aio_context_ignore I > generated while triggering the issue: > > <----- begin ------> > bdrv_set_aio_context_ignore: bs=0x555ee2e48030 enter > bdrv_set_aio_context_ignore: bs=0x555ee2e48030 processing children > bdrv_set_aio_context_ignore: bs=0x555ee2e5d420 enter > bdrv_set_aio_context_ignore: bs=0x555ee2e5d420 processing children > bdrv_set_aio_context_ignore: bs=0x555ee2e52060 enter > bdrv_set_aio_context_ignore: bs=0x555ee2e52060 processing children > bdrv_set_aio_context_ignore: bs=0x555ee2e52060 processing parents > bdrv_set_aio_context_ignore: bs=0x555ee2e52060 processing itself > bdrv_set_aio_context_ignore: bs=0x555ee2e5d420 processing parents > > - We enter b_s_a_c_i with BDS 2fbf660 the first time: > > bdrv_set_aio_context_ignore: bs=0x555ee2fbf660 enter > bdrv_set_aio_context_ignore: bs=0x555ee2fbf660 processing children > > - We enter b_s_a_c_i with BDS 3bc0c00, a child of 2fbf660: > > bdrv_set_aio_context_ignore: bs=0x555ee3bc0c00 enter > bdrv_set_aio_context_ignore: bs=0x555ee3bc0c00 processing children > bdrv_set_aio_context_ignore: bs=0x555ee3bc0c00 processing parents > > - We start processing its parents: > > bdrv_set_aio_context_ignore: bs=0x555ee2fbf660 processing parents > > - We enter b_s_a_c_i with BDS 2e48030, a parent of 2fbf660: > > bdrv_set_aio_context_ignore: bs=0x555ee2e48030 enter > bdrv_set_aio_context_ignore: bs=0x555ee2e48030 processing children > > - We enter b_s_a_c_i with BDS 2fbf660 again, because of parent > 2e48030 didn't found us it in the ignore list: > > bdrv_set_aio_context_ignore: bs=0x555ee2fbf660 enter > bdrv_set_aio_context_ignore: bs=0x555ee2fbf660 processing children > bdrv_set_aio_context_ignore: bs=0x555ee2fbf660 processing parents > bdrv_set_aio_context_ignore: bs=0x555ee2fbf660 processing itself > bdrv_set_aio_context_ignore: bs=0x555ee2e48030 processing parents > bdrv_set_aio_context_ignore: bs=0x555ee2e48030 processing itself > > - BDS 2fbf660 will be processed here a second time, triggering the > issue: > > bdrv_set_aio_context_ignore: bs=0x555ee2fbf660 processing itself > <----- end ------> You didn't dump the BdrvChild here. I think that would add some information on why we re-entered 0x555ee2fbf660. Maybe you can also add bs->drv->format_name for each node to make the scenario less abstract? So far my reconstruction of the graph is something like this: 0x555ee2e48030 --+ | | | | | +-> 0x555ee2e5d420 -> 0x555ee2e52060 v v | 0x555ee2fbf660 --+ | +-------> 0x555ee3bc0c00 It doesn't look quite trivial, but if 0x555ee2e48030 is the filter node of a block job, it's not hard to imagine either. > I suspect this has been happening for a while, and has only surfaced > now due to the need to run an AIO context BH in an aio_notifier > function that the "nbd/server: Quiesce coroutines on context switch" > patch introduces. There the problem is that the first time the > aio_notifier AIO detach function is called, it works on the old > context (as it should be), and the second one works on the new context > (which is wrong). > > > Maybe if what we really need to do is not processing every edge once, > > but processing every node once, the list should be changed to contain > > _only_ BDS objects. But then blk_do_set_aio_context() probably won't > > work any more because it can't have blk->root ignored any more... > > I tried that in my first attempt and it broke badly. I didn't take a > deeper look at the causes. > > > Anyway, if we end up changing what the list contains, the comment needs > > an update, too. Currently it says: > > > > * @ignore will accumulate all visited BdrvChild object. The caller is > > * responsible for freeing the list afterwards. > > > > Another option: Split the parents QLIST_FOREACH loop in two. First add > > all parent BdrvChild objects to the ignore list, remember which of them > > were newly added, and only after adding all of them call > > child->klass->set_aio_ctx() for each parent that was previously not on > > the ignore list. This will avoid that we come back to the same node > > because all of its incoming edges are ignored now. > > I don't think this strategy will fix the issue illustrated in the > trace above, as the BdrvChild pointer of the BDS processing its > parents won't be the on ignore list by the time one of its parents > starts processing its own children. But why? We do append to the ignore list each time before we recurse into a child or parent node. The only way I see is if you have two separate BdrvChild links between the nodes. Kevin