From: Ilya Dryomov <idryomov@gmail.com>
Date: Fri, 18 Jun 2021 13:26:31 +0200
Subject: Re: [PATCH V3 4/6] block/rbd: migrate from aio to coroutines
To: Peter Lieven
In-Reply-To: <6878472c-2923-2bd8-c49b-e64c70867cbf@kamp.de>
References: <20210519142359.23083-1-pl@kamp.de> <20210519142359.23083-5-pl@kamp.de> <6878472c-2923-2bd8-c49b-e64c70867cbf@kamp.de>
Cc: kwolf@redhat.com, berrange@redhat.com, qemu-block@nongnu.org,
    qemu-devel@nongnu.org, ct@flyingcircus.io, Paolo Bonzini, mreitz@redhat.com

On Fri, Jun 18, 2021 at 11:07 AM Peter Lieven wrote:
>
> On 17.06.21 16:43, Ilya Dryomov wrote:
> > On Wed, May 19, 2021 at 4:27 PM Peter Lieven wrote:
> >> Signed-off-by: Peter Lieven
> >> ---
> >>  block/rbd.c | 255 ++++++++++++++++++----------------------------------
> >>  1 file changed, 87 insertions(+), 168 deletions(-)
> >>
> >> diff --git a/block/rbd.c b/block/rbd.c
> >> index 97a2ae4c84..0d8612a988 100644
> >> --- a/block/rbd.c
> >> +++ b/block/rbd.c
> >> @@ -66,22 +66,6 @@ typedef enum {
> >>      RBD_AIO_FLUSH
> >>  } RBDAIOCmd;
> >>
> >> -typedef struct RBDAIOCB {
> >> -    BlockAIOCB common;
> >> -    int64_t ret;
> >> -    QEMUIOVector *qiov;
> >> -    RBDAIOCmd cmd;
> >> -    int error;
> >> -    struct BDRVRBDState *s;
> >> -} RBDAIOCB;
> >> -
> >> -typedef struct RADOSCB {
> >> -    RBDAIOCB *acb;
> >> -    struct BDRVRBDState *s;
> >> -    int64_t size;
> >> -    int64_t ret;
> >> -} RADOSCB;
> >> -
> >>  typedef struct BDRVRBDState {
> >>      rados_t cluster;
> >>      rados_ioctx_t io_ctx;
> >> @@ -93,6 +77,13 @@ typedef struct BDRVRBDState {
> >>      uint64_t object_size;
> >>  } BDRVRBDState;
> >>
> >> +typedef struct RBDTask {
> >> +    BlockDriverState *bs;
> >> +    Coroutine *co;
> >> +    bool complete;
> >> +    int64_t ret;
> >> +} RBDTask;
> >> +
> >>  static int qemu_rbd_connect(rados_t *cluster, rados_ioctx_t *io_ctx,
> >>                              BlockdevOptionsRbd *opts, bool cache,
> >>                              const char *keypairs, const char *secretid,
> >> @@ -325,13 +316,6 @@ static int qemu_rbd_set_keypairs(rados_t cluster, const char *keypairs_json,
> >>      return ret;
> >>  }
> >>
> >> -static void qemu_rbd_memset(RADOSCB *rcb, int64_t offs)
> >> -{
> >> -    RBDAIOCB *acb = rcb->acb;
> >> -    iov_memset(acb->qiov->iov, acb->qiov->niov, offs, 0,
> >> -               acb->qiov->size - offs);
> >> -}
> >> -
> >>  /* FIXME Deprecate and remove keypairs or make it available in QMP. */
> >>  static int qemu_rbd_do_create(BlockdevCreateOptions *options,
> >>                                const char *keypairs, const char *password_secret,
> >> @@ -450,46 +434,6 @@ exit:
> >>      return ret;
> >>  }
> >>
> >> -/*
> >> - * This aio completion is being called from rbd_finish_bh() and runs in qemu
> >> - * BH context.
> >> - */
> >> -static void qemu_rbd_complete_aio(RADOSCB *rcb)
> >> -{
> >> -    RBDAIOCB *acb = rcb->acb;
> >> -    int64_t r;
> >> -
> >> -    r = rcb->ret;
> >> -
> >> -    if (acb->cmd != RBD_AIO_READ) {
> >> -        if (r < 0) {
> >> -            acb->ret = r;
> >> -            acb->error = 1;
> >> -        } else if (!acb->error) {
> >> -            acb->ret = rcb->size;
> >> -        }
> >> -    } else {
> >> -        if (r < 0) {
> >> -            qemu_rbd_memset(rcb, 0);
> >> -            acb->ret = r;
> >> -            acb->error = 1;
> >> -        } else if (r < rcb->size) {
> >> -            qemu_rbd_memset(rcb, r);
> >> -            if (!acb->error) {
> >> -                acb->ret = rcb->size;
> >> -            }
> >> -        } else if (!acb->error) {
> >> -            acb->ret = r;
> >> -        }
> >> -    }
> >> -
> >> -    g_free(rcb);
> >> -
> >> -    acb->common.cb(acb->common.opaque, (acb->ret > 0 ? 0 : acb->ret));
> >> -
> >> -    qemu_aio_unref(acb);
> >> -}
> >> -
> >>  static char *qemu_rbd_mon_host(BlockdevOptionsRbd *opts, Error **errp)
> >>  {
> >>      const char **vals;
> >> @@ -826,89 +770,50 @@ static int qemu_rbd_resize(BlockDriverState *bs, uint64_t size)
> >>      return 0;
> >>  }
> >>
> >> -static const AIOCBInfo rbd_aiocb_info = {
> >> -    .aiocb_size = sizeof(RBDAIOCB),
> >> -};
> >> -
> >> -static void rbd_finish_bh(void *opaque)
> >> +static void qemu_rbd_finish_bh(void *opaque)
> >>  {
> >> -    RADOSCB *rcb = opaque;
> >> -    qemu_rbd_complete_aio(rcb);
> >> +    RBDTask *task = opaque;
> >> +    task->complete = 1;
> >> +    aio_co_wake(task->co);
> >>  }
> >>
> >> -/*
> >> - * This is the callback function for rbd_aio_read and _write
> >> - *
> >> - * Note: this function is being called from a non qemu thread so
> >> - * we need to be careful about what we do here. Generally we only
> >> - * schedule a BH, and do the rest of the io completion handling
> >> - * from rbd_finish_bh() which runs in a qemu context.
> >> - */
> > I would adapt this comment instead of removing it.  I mean, it is
> > still true and the reason for going through aio_bh_schedule_oneshot()
> > instead of calling aio_co_wake() directly, right?
>
> Sure, it's still right, but I think rbd is the only driver with this comment.
> I can leave it in, no problem.

Yeah, just massage it a bit to reflect the reality (at least
the function name).
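For example, something along these lines (just a sketch of the wording based
on the old comment and the new function names, adjust as you see fit):

```c
/*
 * This is the completion callback function for all rbd aio calls
 * started from qemu_rbd_start_co().
 *
 * Note: this function may be called from a non-qemu thread so we
 * need to be careful about what we do here.  Generally we only
 * schedule a BH and let qemu_rbd_finish_bh(), which runs in qemu
 * context, wake up the waiting coroutine.
 */
```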
>
> >
> >> -static void rbd_finish_aiocb(rbd_completion_t c, RADOSCB *rcb)
> >> +static void qemu_rbd_completion_cb(rbd_completion_t c, RBDTask *task)
> >>  {
> >> -    RBDAIOCB *acb = rcb->acb;
> >> -
> >> -    rcb->ret = rbd_aio_get_return_value(c);
> >> +    task->ret = rbd_aio_get_return_value(c);
> >>      rbd_aio_release(c);
> >> -
> >> -    replay_bh_schedule_oneshot_event(bdrv_get_aio_context(acb->common.bs),
> >> -                                     rbd_finish_bh, rcb);
> >> +    aio_bh_schedule_oneshot(bdrv_get_aio_context(task->bs),
> >> +                            qemu_rbd_finish_bh, task);
> >>  }
> >>
> >> -static BlockAIOCB *rbd_start_aio(BlockDriverState *bs,
> >> -                                 int64_t off,
> >> -                                 QEMUIOVector *qiov,
> >> -                                 int64_t size,
> >> -                                 BlockCompletionFunc *cb,
> >> -                                 void *opaque,
> >> -                                 RBDAIOCmd cmd)
> >> +static int coroutine_fn qemu_rbd_start_co(BlockDriverState *bs,
> >> +                                          uint64_t offset,
> >> +                                          uint64_t bytes,
> >> +                                          QEMUIOVector *qiov,
> >> +                                          int flags,
> >> +                                          RBDAIOCmd cmd)
> >>  {
> >> -    RBDAIOCB *acb;
> >> -    RADOSCB *rcb = NULL;
> >> +    BDRVRBDState *s = bs->opaque;
> >> +    RBDTask task = { .bs = bs, .co = qemu_coroutine_self() };
> >>      rbd_completion_t c;
> >>      int r;
> >>
> >> -    BDRVRBDState *s = bs->opaque;
> >> +    assert(!qiov || qiov->size == bytes);
> >>
> >> -    acb = qemu_aio_get(&rbd_aiocb_info, bs, cb, opaque);
> >> -    acb->cmd = cmd;
> >> -    acb->qiov = qiov;
> >> -    assert(!qiov || qiov->size == size);
> >> -
> >> -    rcb = g_new(RADOSCB, 1);
> >> -
> >> -    acb->ret = 0;
> >> -    acb->error = 0;
> >> -    acb->s = s;
> >> -
> >> -    rcb->acb = acb;
> >> -    rcb->s = acb->s;
> >> -    rcb->size = size;
> >> -    r = rbd_aio_create_completion(rcb, (rbd_callback_t) rbd_finish_aiocb, &c);
> >> +    r = rbd_aio_create_completion(&task,
> >> +                                  (rbd_callback_t) qemu_rbd_completion_cb, &c);
> >>      if (r < 0) {
> >> -        goto failed;
> >> +        return r;
> >>      }
> >>
> >>      switch (cmd) {
> >> -    case RBD_AIO_WRITE:
> >> -        /*
> >> -         * RBD APIs don't allow us to write more than actual size, so in order
> >> -         * to support growing images, we resize the image before write
> >> -         * operations that exceed the current size.
> >> -         */
> >> -        if (off + size > s->image_size) {
> >> -            r = qemu_rbd_resize(bs, off + size);
> >> -            if (r < 0) {
> >> -                goto failed_completion;
> >> -            }
> >> -        }
> >> -        r = rbd_aio_writev(s->image, qiov->iov, qiov->niov, off, c);
> >> -        break;
> >>      case RBD_AIO_READ:
> >> -        r = rbd_aio_readv(s->image, qiov->iov, qiov->niov, off, c);
> >> +        r = rbd_aio_readv(s->image, qiov->iov, qiov->niov, offset, c);
> >> +        break;
> >> +    case RBD_AIO_WRITE:
> >> +        r = rbd_aio_writev(s->image, qiov->iov, qiov->niov, offset, c);
> >>          break;
> >>      case RBD_AIO_DISCARD:
> >> -        r = rbd_aio_discard(s->image, off, size, c);
> >> +        r = rbd_aio_discard(s->image, offset, bytes, c);
> >>          break;
> >>      case RBD_AIO_FLUSH:
> >>          r = rbd_aio_flush(s->image, c);
> >> @@ -918,44 +823,69 @@ static BlockAIOCB *rbd_start_aio(BlockDriverState *bs,
> >>      }
> >>
> >>      if (r < 0) {
> >> -        goto failed_completion;
> >> +        error_report("rbd request failed early: cmd %d offset %" PRIu64
> >> +                     " bytes %" PRIu64 " flags %d r %d (%s)", cmd, offset,
> >> +                     bytes, flags, r, strerror(-r));
> >> +        rbd_aio_release(c);
> >> +        return r;
> >>      }
> >> -    return &acb->common;
> >>
> >> -failed_completion:
> >> -    rbd_aio_release(c);
> >> -failed:
> >> -    g_free(rcb);
> >> +    while (!task.complete) {
> >> +        qemu_coroutine_yield();
> >> +    }
> >>
> >> -    qemu_aio_unref(acb);
> >> -    return NULL;
> >> +    if (task.ret < 0) {
> >> +        error_report("rbd request failed: cmd %d offset %" PRIu64 " bytes %"
> >> +                     PRIu64 " flags %d task.ret %" PRIi64 " (%s)", cmd, offset,
> >> +                     bytes, flags, task.ret, strerror(-task.ret));
> >> +        return task.ret;
> >> +    }
> >> +
> >> +    /* zero pad short reads */
> >> +    if (cmd == RBD_AIO_READ && task.ret < qiov->size) {
> >> +        qemu_iovec_memset(qiov, task.ret, 0, qiov->size - task.ret);
> > I would use bytes instead of qiov->size here and on the previous
> > line.  In many systems the iovec can be larger than the op and it
> > took me a couple of minutes to spot the "qiov->size == bytes" assert.
>
> qemu_iovec_memset is an operation on the qiov. If one missed
> the assertion one could argue that we memset beyond the size of the qiov
> if we pass bytes - task.ret as length.

Sorry, disregard -- I missed that rbd_aio_readv/writev() are specified
in terms of iovecs (i.e. iovec base + iovec count instead of iovec base
+ number of bytes).

> Is it true that there are systems which pass a bigger iovec than the actual op?

Perhaps not in QEMU but it comes up in many other places.  For example
in the kernel when the block device driver gets a "bio" which it doesn't
own and should act only on a small piece of it, etc.

> In this case the assertion would trigger there.
>
> > Also, previously we zeroed the entire op on read errors.  I don't think
> > it matters but want to make sure this was dropped on purpose.
>
> I dropped it because it's not done in other drivers. If you feel that we
> could reveal sensitive information I can put the wiping back in.

I don't think so and we don't do this in kernel rbd either.
>
> >
> >> +    }
> >> +
> >> +    return 0;
> >> +}
> >> +
> >> +static int
> >> +coroutine_fn qemu_rbd_co_preadv(BlockDriverState *bs, uint64_t offset,
> >> +                                uint64_t bytes, QEMUIOVector *qiov,
> >> +                                int flags)
> >> +{
> >> +    return qemu_rbd_start_co(bs, offset, bytes, qiov, flags, RBD_AIO_READ);
> >>  }
> >>
> >> -static BlockAIOCB *qemu_rbd_aio_preadv(BlockDriverState *bs,
> >> -                                       uint64_t offset, uint64_t bytes,
> >> -                                       QEMUIOVector *qiov, int flags,
> >> -                                       BlockCompletionFunc *cb,
> >> -                                       void *opaque)
> >> +static int
> >> +coroutine_fn qemu_rbd_co_pwritev(BlockDriverState *bs, uint64_t offset,
> >> +                                 uint64_t bytes, QEMUIOVector *qiov,
> >> +                                 int flags)
> >>  {
> >> -    return rbd_start_aio(bs, offset, qiov, bytes, cb, opaque,
> >> -                         RBD_AIO_READ);
> >> +    BDRVRBDState *s = bs->opaque;
> >> +    /*
> >> +     * RBD APIs don't allow us to write more than actual size, so in order
> >> +     * to support growing images, we resize the image before write
> >> +     * operations that exceed the current size.
> >> +     */
> >> +    if (offset + bytes > s->image_size) {
> >> +        int r = qemu_rbd_resize(bs, offset + bytes);
> >> +        if (r < 0) {
> >> +            return r;
> >> +        }
> >> +    }
> > How can this be triggered today?  qemu-io used to be able to, but that
> > support was removed a long time ago:
> >
> >   https://mail.gnu.org/archive/html/qemu-devel/2014-12/msg01592.html
> >
> > Just checking that this piece of code is not vestigial at this point.
>
> We had this discussion before. The way to hit this is if you want to
> create e.g. a qcow2 image on rbd. It does not make that much sense, but
> it was supported in the past.

Ah, so something like

  qemu-img create -f qcow2 rbd:foo/bar 10G

> I would vote for removing it, but it might break something somewhere.

Yeah, it's weird but definitely not up for removal in a patch that
migrates from one internal API to another.

Thanks,

                Ilya