Subject: Re: [PATCH] monitor/qmp: fix race on CHR_EVENT_CLOSED without OOB
To: Wolfgang Bumiller
From: Stefan Reiter
Date: Mon, 22 Mar 2021 16:26:12 +0100
Cc: Kevin Wolf, Paolo Bonzini, qemu-devel@nongnu.org, Markus Armbruster, Thomas Lamprecht
References: <20210318133550.13120-1-s.reiter@proxmox.com> <20210322110847.cdo477ve2gydab64@wobu-vie.proxmox.com>
In-Reply-To: <20210322110847.cdo477ve2gydab64@wobu-vie.proxmox.com>

On 3/22/21 12:08 PM, Wolfgang Bumiller wrote:
> On Thu, Mar 18, 2021 at 02:35:50PM +0100, Stefan Reiter wrote:
>> If OOB is disabled, events received in monitor_qmp_event will be handled
>> in the main context. Thus, we must not acquire a qmp_queue_lock there,
>> as the dispatcher coroutine holds one over a yield point, where it
>> expects to be rescheduled from the main context. If a CHR_EVENT_CLOSED
>> event is received just then, it can race and block the main thread by
>> waiting on the queue lock.
>>
>> Run monitor_qmp_cleanup_queue_and_resume in a BH on the iohandler
>> thread, so the main thread can always make progress during the
>> reschedule.
>>
>> The delaying of the cleanup is safe, since the dispatcher always moves
>> back to the iothread afterward, and thus the cleanup will happen before
>> it gets to its next iteration.
>>
>> Signed-off-by: Stefan Reiter
>> ---
>
> This is a tough one. It *may* be fine, but I wonder if we can approach
> this differently:
>
> From what I can gather we have the following call stacks & contexts:
>
> Guarded lock (lock & release):
> * monitor_qmp_cleanup_queue_and_resume
>     by monitor_qmp_event
>     by file handler (from I/O loop)
>   ^ iohandler_context (assuming that's where the file handling happens...)
>   (after this patch as BH though)
>
> * handle_qmp_command
>   a) by the json parser (which is also re-initialized by
>      monitor_qmp_event btw., haven't checked if that can also
>      "trigger" its methods immediately)
>   b) by monitor_qmp_read
>        by file handler (from I/O loop)
>   ^ iohandler_context
>
> Lock-"returning":
> * monitor_qmp_requests_pop_any_with_lock
>     by coroutine_fn monitor_qmp_dispatcher_co
>   ^ iohandler_context
>
> Lock-releasing:
> * coroutine_fn monitor_qmp_dispatcher_co
>   ^ qemu_aio_context
>
> The only *weird* thing that immediately pops out here is
> `monitor_qmp_requests_pop_any_with_lock()` keeping a lock while
> switching contexts.

monitor_qmp_dispatcher_co? _pop_any_ doesn't switch contexts... But yes,
that is weird, as I mentioned in my original mail too.

> This is done in order to allow `AIO_WAIT_WHILE` to work while making
> progress on the events, but do we actually already need to be in this
> context for the OOB `monitor_resume()` call, or can we defer the context
> switch to after having done that and released the lock?
> `monitor_resume()` itself seems to simply schedule a BH, which should
> work regardless if I'm not mistaken. There's also a
> `readline_show_prompt()` call, but that *looks* harmless?
The BH should indeed be harmless, since we don't schedule on
qemu_get_current_aio_context, and the readline_show_prompt call we can
ignore here, since it's guarded with "!monitor_is_qmp(mon)".

> `monitor_resume()` is also called without the lock later on, so even if
> it needs to be in this context at that point for whatever reason, does
> it need the lock?

It doesn't access the queue, so I don't see why it'd need the lock. And
as you said, it currently works without it too; in fact, before commit
88daf0996c ("qmp: Resume OOB-enabled monitor before processing the
request") it always did.

I'll cobble together a v2 with this in mind.