From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-5.6 required=3.0 tests=BAYES_00,DKIM_INVALID, DKIM_SIGNED,HEADER_FROM_DIFFERENT_DOMAINS,HTML_MESSAGE,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 260B0C433ED for ; Mon, 10 May 2021 14:39:16 +0000 (UTC) Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 3629160241 for ; Mon, 10 May 2021 14:39:15 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 3629160241 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=bytedance.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Received: from localhost ([::1]:51036 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1lg73y-0005Tw-AM for qemu-devel@archiver.kernel.org; Mon, 10 May 2021 10:39:14 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:39448) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1lg73C-0004Jz-O6 for qemu-devel@nongnu.org; Mon, 10 May 2021 10:38:26 -0400 Received: from mail-qt1-x833.google.com ([2607:f8b0:4864:20::833]:46743) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.90_1) (envelope-from ) id 1lg735-0008Pw-Kw for qemu-devel@nongnu.org; Mon, 10 May 2021 10:38:23 -0400 Received: by mail-qt1-x833.google.com with SMTP id p6so12036728qtk.13 for ; Mon, 10 May 2021 07:38:17 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance-com.20150623.gappssmtp.com; s=20150623; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=I1Joo4rParNjPzCRgaXVEjG+974jeh84Imc9fO6ZUqY=; b=DFqwvtytg66O9yOA22LitFJb2lxeHRUWK0sconMhixxJjHAza/mdDjVIoEwzeigjK/ qfpQgazkMttM7QVyoqeSEI2sA87K3+KqIE7n+6pqSdCe+N7xJBj5KQrR2pRQHLOiaF4T tM+XYuxcO2kC1h65i8gPNVd4iMsvpHz6L+vTU3QQ14mHhNl6sQ+PcoMTCyCZy7ydodAn 7o0XS33q6PLaVMs0U36rvZb7rHIbA6nDQVbYVHAknfo8Q5xDwe3/vbJKXdGPmIZWu7AV mUwupJ0/48xY70DCDxEQdWEwrKoRLHtaqFmwUzRt3wBDB/LC5UwyQskoPWsuLJFGag1a HQCw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=I1Joo4rParNjPzCRgaXVEjG+974jeh84Imc9fO6ZUqY=; b=sdUE8y6ZR5KrRrBhOqJmsSCJ/sPUk+nKXvNtEeLxy1CIG+VYlLDLpoAu6/2NaWL/zx rmGLVHhMMswZnsTM2zcOebuaHU66e4w3IRrm9SqvNBrQahMsONfTP965laivmC8Z5n1p c7I8wZQXBAF21UixkCfxOZtURpOhPzm02hbP6zoxkGvJj5RmAOoWBZgfN8wSDG/JfbiI RYjW0dH3ckqdGLOcSXYFkh8yQel9mZcyqUtxTnhZpRCDsnVQtAh6wY45wJv6ghKUaQQr GOYgIDmI0/hH9ZB/pHSCTwqAJE8h/ztK5M4TF2uBUlJdmhUjgpYfvAJSnXAq4U9h86JR Fbsw== X-Gm-Message-State: AOAM532JdvCDYYCpaxqL43zG1lMmmc7qOp1oIUONrKtaX3uihuBtuNsk dgX5M1RgvOwZ8iVCUU7EkJ4kAewa+wEN1XBQH3c2kA== X-Google-Smtp-Source: ABdhPJwSiRfJkKSf/MpneXm4t+O/AB/9LElJExtcnwM9Mrfd4gyRsmzwSquAjsatRQsFcswf8HV/oefENVB79YMiBxw= X-Received: by 2002:a05:622a:1044:: with SMTP id f4mr7333952qte.181.1620657496532; Mon, 10 May 2021 07:38:16 -0700 (PDT) MIME-Version: 1.0 References: <20201215162119.27360-1-zhangjiachen.jaycee@bytedance.com> In-Reply-To: <20201215162119.27360-1-zhangjiachen.jaycee@bytedance.com> From: Jiachen Zhang Date: Mon, 10 May 2021 22:38:05 +0800 Message-ID: Subject: Re: [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection To: "Dr . David Alan Gilbert" , "Michael S . Tsirkin" , Stefan Hajnoczi , Xie Yongji Content-Type: multipart/alternative; boundary="0000000000006264f905c1faba43" Received-SPF: pass client-ip=2607:f8b0:4864:20::833; envelope-from=zhangjiachen.jaycee@bytedance.com; helo=mail-qt1-x833.google.com X-Spam_score_int: -18 X-Spam_score: -1.9 X-Spam_bar: - X-Spam_report: (-1.9 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_NONE=-0.0001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: virtio-fs-list , QEMU Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" --0000000000006264f905c1faba43 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Hi all, We are going to develop the v2 patch for virtio-fs crash reconnection. As suggested by Marc-Andr=C3=A9 and Stefan, except for the inflight I/O tracki= ng log area, all the other internal statuses of virtiofsd will be saved to some places other than QEMU. Specifically, the three lo_maps (ino_map, dirp_map, and fd_map) could be saved to several mmapped files, and the opened fds could be saved to systemd. I'd like to get some feedback on our further thoughts before we work on the revision. 1. What about by default save the opened fds as file handles to host kernel, instead of saving them to systemd. After some internal discussion, we think introducing systemd may introduce more uncertainness to the system, as we need to create one service for each daemon, and all the daemons may suffer the single-point failure of the systemd process. 2. Like the btree map implementation (multikey.rs) of virtiofsd-rs, what about splitting the flatten lo_map implementation, which supports to be persisted to files, from passhtrough_ll.c to a new separated source file. This way, maybe we can more easily wrap it with some Rust compatible interfaces, and enable crash recovery for virtiofsd-rs based on it. 3. What about dropping the dirp_map, and integrate the opened directory fds to fd_map. The virtiofsd-rs implementation only has two maps (inodes and handles). In the C version, dirp_map may also unnecessary. All the best, Jiachen On Wed, Dec 16, 2020 at 12:21 AM Jiachen Zhang < zhangjiachen.jaycee@bytedance.com> wrote: > Hi, all > > We implement virtio-fs crash reconnection in this patchset. The crash > reconnection of virtiofsd here is completely transparent to guest, no > remount in guest is needed, even the inflight requests can be handled > normally after reconnection. We are looking forward to any comments. > > Thanks, > Jiachen > > > OVERVIEW: > > To support virtio-fs crash reconnection, we need to support the recovery > of 1) inflight FUSE request, and 2) virtiofsd internal status information= . > > Fortunately, QEMU's vhost-user reconnection framework already supports > inflight I/O tracking by using VHOST_USER_GET_INFLIGHT_FD and > VHOST_USER_SET_INFLIGHT_FD (see 5ad204bf2 and 5f9ff1eff for details). > As the FUSE requests are transferred by virtqueue I/O requests, by using > the vhost-user inflight I/O tracking, we can recover the inflight FUSE > requests. > > To support virtiofsd internal status recovery, we introduce 4 new > vhost-user message types. As shown in the following diagram, two of them > are used to persist shared lo_maps and opened fds to QEMU, the other two > message types are used to restore the status when reconnecting. > > VHOST_USER_SLAVE_SHM > VHOST_USER_SLAVE_FD > +--------------+ Persist +--------------------+ > | <---------------------+ | > | QEMU | | Virtio-fs Daemon | > | +---------------------> | > +--------------+ Restore +--------------------+ > VHOST_USER_SET_SHM > VHOST_USER_SET_FD > > Although the 4 newly added message types are to support virtiofsd > reconnection in this patchset, it might be potential in other situation. > So we keep in mind to make them more general when add them to vhost > related source files. VHOST_USER_SLAVE_SHM and VHOST_USER_SET_SHM can be > used for memory sharing between a vhost-user daemon and QEMU, > VHOST_USER_SLAVE_FD and VHOST_USER_SET_FD would be useful if we want to > shared opened fds between QEMU process and vhost-user daemon process. > > > USAGE and NOTES: > > - The commits are rebased to a recent QEMU master commit b4d939133dca0fa2= b. > > - ",reconnect=3D1" should be added to the "-chardev socket" of > vhost-user-fs-pci > in the QEMU command line, for example: > > qemu-system-x86_64 ... \ > -chardev socket,id=3Dchar0,path=3D/tmp/vhostqemu,reconnect=3D1 \ > -device vhost-user-fs-pci,queue-size=3D1024,chardev=3Dchar0,tag=3Dmyf= s \ > ... > > - We add new options for virtiofsd to enable or disable crash reconnectio= n. > And some options are not supported by crash reconnection. So add followin= g > options to virtiofsd to enable reconnection: > > virtiofsd ... -o reconnect -o no_mount_ns -o no_flock -o no_posix_loc= k > -o no_xattr ... > > - The reasons why virtiofsd-side locking, extended attributes, and mount > namespace are not supported is explained in the commit message of the 6th > patch (virtiofsd: Add two new options for crash reconnection). > > - The 9th patch is a work-around that will not affect the overall > correctness. > We remove the qsort related codes because we found that when resubmit_num > is > larger than 64, seccomp will kill the virtiofsd process. > > - Support for dax version virtiofsd is very possible and requires almost = no > additional change to this patchset. > > > Jiachen Zhang (9): > vhost-user-fs: Add support for reconnection of vhost-user-fs backend > vhost: Add vhost-user message types for sending shared memory and file > fds > vhost-user-fs: Support virtiofsd crash reconnection > libvhost-user: Add vhost-user message types for sending shared memory > and file fds > virtiofsd: Convert the struct lo_map array to a more flatten layout > virtiofsd: Add two new options for crash reconnection > virtiofsd: Persist/restore lo_map and opened fds to/from QEMU > virtiofsd: Ensure crash consistency after reconnection > virtiofsd: (work around) Comment qsort in inflight I/O tracking > > contrib/libvhost-user/libvhost-user.c | 106 +++- > contrib/libvhost-user/libvhost-user.h | 70 +++ > docs/interop/vhost-user.rst | 41 ++ > hw/virtio/vhost-user-fs.c | 334 ++++++++++- > hw/virtio/vhost-user.c | 123 ++++ > hw/virtio/vhost.c | 42 ++ > include/hw/virtio/vhost-backend.h | 6 + > include/hw/virtio/vhost-user-fs.h | 16 +- > include/hw/virtio/vhost.h | 42 ++ > tools/virtiofsd/fuse_lowlevel.c | 24 +- > tools/virtiofsd/fuse_virtio.c | 44 ++ > tools/virtiofsd/fuse_virtio.h | 1 + > tools/virtiofsd/helper.c | 9 + > tools/virtiofsd/passthrough_helpers.h | 2 +- > tools/virtiofsd/passthrough_ll.c | 830 ++++++++++++++++++-------- > tools/virtiofsd/passthrough_seccomp.c | 1 + > 16 files changed, 1413 insertions(+), 278 deletions(-) > > -- > 2.20.1 > > --0000000000006264f905c1faba43 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable

Hi all,


We are going to=C2=A0develop t= he v2 patch for virtio-fs crash reconnection. As suggested by Marc-Andr=C3= =A9 and Stefan, except for the inflight I/O tracking log area, all the othe= r internal statuses of virtiofsd will be saved to some places other than QE= MU. Specifically, the three lo_maps (ino_map, dirp_map, and fd_map) could b= e saved to several mmapped files, and the opened fds could be saved to syst= emd. I'd like to get some feedback on our further thoughts before we wo= rk on the revision.


1. What about by default save = the opened fds as file handles to host kernel, instead of saving them to sy= stemd. After some internal discussion, we think introducing systemd may int= roduce more uncertainness to the system, as we need to create one service f= or each daemon, and all the daemons may suffer the single-point failure of = the systemd process.


2. Like the btree map implemen= tation (multikey.rs) of virtiofsd-rs, wh= at about splitting the flatten lo_map implementation, which supports to be = persisted to files, from passhtrough_ll.c to a new separated=C2=A0source fi= le. This way, maybe we can more=C2=A0easily wrap it=C2=A0with some Rust com= patible interfaces, and enable crash recovery for virtiofsd-rs=C2=A0based o= n it.


3. What about dropping the dir= p_map, and integrate the opened directory fds to fd_map. The virtiofsd-rs= =C2=A0implementation only has two maps (inodes and handles). In the C versi= on, dirp_map may also unnecessary.


All the best,

Jiachen


On Wed, Dec 16, 202= 0 at 12:21 AM Jiachen Zhang <zhangjiachen.jaycee@bytedance.com> wrote:
Hi, all

We implement virtio-fs crash reconnection in this patchset. The crash
reconnection of virtiofsd here is completely transparent to guest, no
remount in guest is needed, even the inflight requests can be handled
normally after reconnection. We are looking forward to any comments.

Thanks,
Jiachen


OVERVIEW:

To support virtio-fs crash reconnection, we need to support the recovery of 1) inflight FUSE request, and 2) virtiofsd internal status information.<= br>
Fortunately, QEMU's vhost-user reconnection framework already supports<= br> inflight I/O tracking by using VHOST_USER_GET_INFLIGHT_FD and
VHOST_USER_SET_INFLIGHT_FD (see 5ad204bf2 and 5f9ff1eff for details).
As the FUSE requests are transferred by virtqueue I/O requests, by using the vhost-user inflight I/O tracking, we can recover the inflight FUSE
requests.

To support virtiofsd internal status recovery, we introduce 4 new
vhost-user message types. As shown in the following diagram, two of them are used to persist shared lo_maps and opened fds to QEMU, the other two message types are used to restore the status when reconnecting.

=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0VHOST_USER_SLAVE_SHM
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0VHOST_USER_SLAVE_FD
=C2=A0 =C2=A0 +--------------+=C2=A0 =C2=A0 =C2=A0 =C2=A0Persist=C2=A0 =C2= =A0 =C2=A0 =C2=A0+--------------------+
=C2=A0 =C2=A0 |=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 <-------= --------------+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 |
=C2=A0 =C2=A0 |=C2=A0 =C2=A0 =C2=A0QEMU=C2=A0 =C2=A0 =C2=A0|=C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0|=C2=A0 Virti= o-fs Daemon=C2=A0 |
=C2=A0 =C2=A0 |=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 +----------= ----------->=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 |
=C2=A0 =C2=A0 +--------------+=C2=A0 =C2=A0 =C2=A0 =C2=A0Restore=C2=A0 =C2= =A0 =C2=A0 =C2=A0+--------------------+
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 VHOST_USER_SET_SHM
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 VHOST_USER_SET_FD

Although the 4 newly added message types are to support virtiofsd
reconnection in this patchset, it might be potential in other situation. So we keep in mind to make them more general when add them to vhost
related source files. VHOST_USER_SLAVE_SHM and VHOST_USER_SET_SHM can be used for memory sharing between a vhost-user daemon and QEMU,
VHOST_USER_SLAVE_FD and VHOST_USER_SET_FD would be useful if we want to
shared opened fds between QEMU process and vhost-user daemon process.


USAGE and NOTES:

- The commits are rebased to a recent QEMU master commit b4d939133dca0fa2b.=

- ",reconnect=3D1" should be added to the "-chardev socket&q= uot; of vhost-user-fs-pci
in the QEMU command line, for example:

=C2=A0 =C2=A0 qemu-system-x86_64 ... \
=C2=A0 =C2=A0 -chardev socket,id=3Dchar0,path=3D/tmp/vhostqemu,reconnect=3D= 1 \
=C2=A0 =C2=A0 -device vhost-user-fs-pci,queue-size=3D1024,chardev=3Dchar0,t= ag=3Dmyfs \
=C2=A0 =C2=A0 ...

- We add new options for virtiofsd to enable or disable crash reconnection.=
And some options are not supported by crash reconnection. So add following<= br> options to virtiofsd to enable reconnection:

=C2=A0 =C2=A0 virtiofsd ... -o reconnect -o no_mount_ns -o no_flock -o no_p= osix_lock
=C2=A0 =C2=A0 -o no_xattr ...

- The reasons why virtiofsd-side locking, extended attributes, and mount namespace are not supported is explained in the commit message of the 6th patch (virtiofsd: Add two new options for crash reconnection).

- The 9th patch is a work-around that will not affect the overall correctne= ss.
We remove the qsort related codes because we found that when resubmit_num i= s
larger than 64, seccomp will kill the virtiofsd process.

- Support for dax version virtiofsd is very possible and requires almost no=
additional change to this patchset.


Jiachen Zhang (9):
=C2=A0 vhost-user-fs: Add support for reconnection of vhost-user-fs backend=
=C2=A0 vhost: Add vhost-user message types for sending shared memory and fi= le
=C2=A0 =C2=A0 fds
=C2=A0 vhost-user-fs: Support virtiofsd crash reconnection
=C2=A0 libvhost-user: Add vhost-user message types for sending shared memor= y
=C2=A0 =C2=A0 and file fds
=C2=A0 virtiofsd: Convert the struct lo_map array to a more flatten layout<= br> =C2=A0 virtiofsd: Add two new options for crash reconnection
=C2=A0 virtiofsd: Persist/restore lo_map and opened fds to/from QEMU
=C2=A0 virtiofsd: Ensure crash consistency after reconnection
=C2=A0 virtiofsd: (work around) Comment qsort in inflight I/O tracking

=C2=A0contrib/libvhost-user/libvhost-user.c | 106 +++-
=C2=A0contrib/libvhost-user/libvhost-user.h |=C2=A0 70 +++
=C2=A0docs/interop/vhost-user.rst=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0|= =C2=A0 41 ++
=C2=A0hw/virtio/vhost-user-fs.c=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0| 334 ++++++++++-
=C2=A0hw/virtio/vhost-user.c=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 | 123 ++++
=C2=A0hw/virtio/vhost.c=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0|=C2=A0 42 ++
=C2=A0include/hw/virtio/vhost-backend.h=C2=A0 =C2=A0 =C2=A0|=C2=A0 =C2=A06 = +
=C2=A0include/hw/virtio/vhost-user-fs.h=C2=A0 =C2=A0 =C2=A0|=C2=A0 16 +- =C2=A0include/hw/virtio/vhost.h=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0|=C2=A0 42 ++
=C2=A0tools/virtiofsd/fuse_lowlevel.c=C2=A0 =C2=A0 =C2=A0 =C2=A0|=C2=A0 24 = +-
=C2=A0tools/virtiofsd/fuse_virtio.c=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0|=C2= =A0 44 ++
=C2=A0tools/virtiofsd/fuse_virtio.h=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0|=C2= =A0 =C2=A01 +
=C2=A0tools/virtiofsd/helper.c=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 |=C2=A0 =C2=A09 +
=C2=A0tools/virtiofsd/passthrough_helpers.h |=C2=A0 =C2=A02 +-
=C2=A0tools/virtiofsd/passthrough_ll.c=C2=A0 =C2=A0 =C2=A0 | 830 ++++++++++= ++++++++--------
=C2=A0tools/virtiofsd/passthrough_seccomp.c |=C2=A0 =C2=A01 +
=C2=A016 files changed, 1413 insertions(+), 278 deletions(-)

--
2.20.1

--0000000000006264f905c1faba43--