From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.0 required=3.0 tests=DKIM_INVALID,DKIM_SIGNED, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS, USER_AGENT_SANE_1 autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id C3039C33CB6 for ; Wed, 22 Jan 2020 14:36:46 +0000 (UTC) Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 8C9F52465B for ; Wed, 22 Jan 2020 14:36:46 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="i7geMwkj" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 8C9F52465B Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=redhat.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Received: from localhost ([::1]:43078 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1iuH7d-0002Vo-EW for qemu-devel@archiver.kernel.org; Wed, 22 Jan 2020 09:36:45 -0500 Received: from eggs.gnu.org ([2001:470:142:3::10]:60627) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1iuH6Q-0001ac-1A for qemu-devel@nongnu.org; Wed, 22 Jan 2020 09:35:31 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1iuH6O-00082k-Is for qemu-devel@nongnu.org; Wed, 22 Jan 2020 09:35:29 -0500 Received: from us-smtp-delivery-1.mimecast.com ([207.211.31.120]:36713 helo=us-smtp-1.mimecast.com) by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.71) (envelope-from ) id 1iuH6O-00082D-Ei for qemu-devel@nongnu.org; Wed, 22 Jan 2020 09:35:28 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1579703727; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=aFTw2DiE1LSAyA+PYDOMuZUDFdVcCho7fLoC2xgKH4E=; b=i7geMwkja3TfDsf5l/iC4dDucLhgs+j/gEWmXy0xK1pVamWHrWF117YTJF5ZBN3xveTCse DVqebQIeD0+0eCdS6p0RLgMwYnNWG2HF7uxEVv0hvE/clvBgEeSkfFZ3MHzvWCm9xn+iL7 PDIAbtui6vM6ItvLloS1kRLqy+fl/so= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-245-Tz5HUDmeNZGymbLYVskG1w-1; Wed, 22 Jan 2020 09:35:25 -0500 Received: from smtp.corp.redhat.com (int-mx04.intmail.prod.int.phx2.redhat.com [10.5.11.14]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id DD1861005510; Wed, 22 Jan 2020 14:35:23 +0000 (UTC) Received: from linux.fritz.box (ovpn-117-57.ams2.redhat.com [10.36.117.57]) by smtp.corp.redhat.com (Postfix) with ESMTPS id 50F545DA75; Wed, 22 Jan 2020 14:35:20 +0000 (UTC) Date: Wed, 22 Jan 2020 15:35:18 +0100 From: Kevin Wolf To: Markus Armbruster Subject: Re: [PATCH v4 1/4] qapi: Add a 'coroutine' flag for commands Message-ID: <20200122143518.GA14997@linux.fritz.box> References: <20200121181122.15941-1-kwolf@redhat.com> <20200121181122.15941-2-kwolf@redhat.com> <87lfq0yp9v.fsf@dusky.pond.sub.org> <20200122101021.GB5268@linux.fritz.box> <874kwnvgad.fsf@dusky.pond.sub.org> MIME-Version: 1.0 In-Reply-To: <874kwnvgad.fsf@dusky.pond.sub.org> User-Agent: Mutt/1.12.1 (2019-06-15) X-Scanned-By: MIMEDefang 2.79 on 10.5.11.14 X-MC-Unique: Tz5HUDmeNZGymbLYVskG1w-1 X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: quoted-printable Content-Disposition: inline X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] [fuzzy] X-Received-From: 207.211.31.120 X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: qemu-block@nongnu.org, marcandre.lureau@gmail.com, qemu-devel@nongnu.org, stefanha@redhat.com Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" Am 22.01.2020 um 13:15 hat Markus Armbruster geschrieben: > Kevin Wolf writes: >=20 > > Am 22.01.2020 um 07:32 hat Markus Armbruster geschrieben: > >> Kevin Wolf writes: > >>=20 > >> > This patch adds a new 'coroutine' flag to QMP command definitions th= at > >> > tells the QMP dispatcher that the command handler is safe to be run = in a > >> > coroutine. > >>=20 > >> I'm afraid I missed this question in my review of v3: when is a handle= r > >> *not* safe to be run in a coroutine? > > > > That's a hard one to answer fully. >=20 > You're welcome ;) >=20 > > Basically, I think the biggest problem is with calling functions that > > change their behaviour if run in a coroutine compared to running them > > outside of coroutine context. In most cases the differences like having > > a nested event loop instead of yielding are just fine, but they are > > still subtly different. > > > > I know this is vague, but I can assure you that problematic cases exist= . > > I hit one of them with my initial hack that just moved everything into = a > > coroutine. It was related to graph modifications and bdrv_drain and > > resulted in a hang. For the specifics, I would have to try and reproduc= e > > the problem again. >=20 > Interesting. >=20 > Is coroutine-incompatible command handler code necessary or accidental? >=20 > By "necessary" I mean there are (and likely always will be) commands > that need to do stuff that cannot or should not be done on coroutine > context. >=20 > "Accidental" is the opposite: coroutine-incompatibility can be regarded > as a fixable flaw. I expect it's mostly accidental, but maybe not all of it. bdrv_drain() is an example of a function that has a problem with running inside a coroutine: If you schedule another coroutine to run, it will only run after the currently active coroutine yields. However, bdrv_drain() doesn't yield, but runs a nested event loop, so this can lead to deadlocks. For this reason, bdrv_drain() has code to drop out of coroutine context, so that you can call it from a coroutine anyway: /* Calling bdrv_drain() from a BH ensures the current coroutine yields = and * other coroutines run if they were queued by aio_co_enter(). */ Now, after reading this, I think the problem in the QMP handler I observerd was that it called some parts of the drain code without dropping out of coroutine context first. Well possible that it was the exact deadlock described by this comment. (Even in bdrv_drain() it might not be strictly necessary. Maybe it's possible to redesign bdrv_drain() so that it doesn't have a nested event loop, but can somehow rely on the normal main loop. Touching it is scary, though, so I'd rather not unless we have a good reason.) > How widespread is coroutine-incompatibility? Common, or just a few > commands? Answering this would require a massive code audit. Way out of scope for this series. This is the reason why I decided to just make it optional instead of trying to find and fix the cases that would need a fix if we enabled coroutine handlers unconditionally. > If coroutine-incompatibility is accidental, then your code to drop out > of coroutine context can be regarded as a temporary work-around. Such > work-arounds receive a modest extra ugliness & complexity budget. >=20 > If coroutine-incompatibility is rare, we'll eventually want "mark the > known-bad ones with 'coroutine': false" instead of "mark the known-good > ones with 'coroutine': true". I'm okay with marking the known-good ones > initially, and flipping only later. >=20 > Inability to recognize coroutine-incompatibility by inspection worries > me. Can you think of ways to identify causes other than testing things > to death? I think it is possible to recognise by inspection, it would just be a massive effort. Maybe a way to start it could be adding something like a non_coroutine_fn marker to functions that we know don't want to be called from coroutine context, and propagating it to their callers. I guess I would start by looking for nested event loops (AIO_WAIT_WHILE or BDRV_POLL_WHILE), and then you would have to decide for each whether it could run into problems. I can't promise this will cover everything, but it would at least cover the case known from bdrv_drain. Kevin