From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-14.1 required=3.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI, MENTIONS_GIT_HOSTING,SIGNED_OFF_BY,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id AFD63C43381 for ; Tue, 19 Feb 2019 09:57:14 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 4D6642146F for ; Tue, 19 Feb 2019 09:57:14 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=nexedi.com header.i=kirr@nexedi.com header.b="pjT3PztS"; dkim=pass (1024-bit key) header.d=mandrillapp.com header.i=@mandrillapp.com header.b="X3xh+f7B" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727315AbfBSJ5N (ORCPT ); Tue, 19 Feb 2019 04:57:13 -0500 Received: from mail136-25.atl41.mandrillapp.com ([198.2.136.25]:61633 "EHLO mail136-25.atl41.mandrillapp.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725763AbfBSJ5N (ORCPT ); Tue, 19 Feb 2019 04:57:13 -0500 X-Greylist: delayed 901 seconds by postgrey-1.27 at vger.kernel.org; Tue, 19 Feb 2019 04:57:12 EST DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; s=mandrill; d=nexedi.com; h=From:Subject:To:Cc:Message-Id:Date:MIME-Version:Content-Type:Content-Transfer-Encoding; i=kirr@nexedi.com; bh=SrfAldX/egEUBm1hepTN68TtqRwyuS2H0qH5d3iy04c=; b=pjT3PztSdwtnNeTRaWeXX5tdWNMBdj7ufplZhVLslqQ9GqdRmBpdAWKMsnMO5K1w0Bk1meN1zhWf 473VIAh5XpCx0bHCp1UjYbmkH+1w0FsCRcQVcpf5KQE5ShcB9wjbltYWWA3AHOp0yb+V65WR7wVw AeIO8wXDyEi9RETb/cE= Received: from pmta04.mandrill.prod.atl01.rsglab.com (127.0.0.1) by mail136-25.atl41.mandrillapp.com id hdf9fg1sb1k6 for ; Tue, 19 Feb 2019 09:42:10 +0000 (envelope-from ) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=mandrillapp.com; i=@mandrillapp.com; q=dns/txt; s=mandrill; t=1550569330; h=From : Subject : To : Cc : Message-Id : Date : MIME-Version : Content-Type : Content-Transfer-Encoding : From : Subject : Date : X-Mandrill-User : List-Unsubscribe; bh=SrfAldX/egEUBm1hepTN68TtqRwyuS2H0qH5d3iy04c=; b=X3xh+f7BsHcKSvyd45PTgwyciTx80XtL3P8RUgwPQAFg62b/wbxV5OyoeqZxbK+PgDqnml jW9N23hbi+0h4agINQHcWmI6pZVuK779uJeCcS3NBIYA8Lb4j9Sbnk+BWOL2DJ4UUo56MN81 nbCBG/zoX/PvetNmfT5za9/B1LQs8= From: Kirill Smelkov Subject: [RESEND, PATCH v2] fuse: Don't drop NOTIFY_REPLY if we promised it Received: from [87.98.221.171] by mandrillapp.com id e277703ff0eb4c039786e0a68e048370; Tue, 19 Feb 2019 09:42:10 +0000 X-Mailer: git-send-email 2.21.0.rc0.269.g1a574e7a28 To: Miklos Szeredi , Miklos Szeredi Cc: , , Kirill Smelkov , Han-Wen Nienhuys , Jakob Unterwurzacher , Message-Id: <20190219094147.32734-1-kirr@nexedi.com> X-Report-Abuse: Please forward a copy of this message, including all headers, to abuse@mandrill.com X-Report-Abuse: You can also report abuse here: http://mandrillapp.com/contact/abuse?id=31050260.e277703ff0eb4c039786e0a68e048370 X-Mandrill-User: md_31050260 Date: Tue, 19 Feb 2019 09:42:10 +0000 MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org A successful call to NOTIFY_RETRIEVE by filesystem carries promise from the kernel to send back NOTIFY_REPLY message. However if the filesystem is not reading requests with fuse_conn->max_pages capacity, fuse_dev_do_read might see that the "request is too large" and decide to "reply with an error and restart the read". "Reply with an error" has underlying assumption that there is a "requester thread" that is waiting for request completion, which is true for most requests, but is not true for NOTIFY_REPLY: NOTIFY_RETRIEVE handler completes with OK status right after it could successfully queue NOTIFY_REPLY message without waiting for NOTIFY_REPLY completion. This leads to situation when filesystem requested to retrieve inode data with NOTIFY_RETRIEVE, got err=3DOK for that notification request, but NOTIFY_REPLY is not coming back. More, since there is no "requester thread" to handle the error, the situation shows itself as /sys/fs/fuse/connections/X/waiting=3D1 _and_ /dev/fuse read(s) queued. Which is misleading since NOTIFY_REPLY request was removed from pending queue and abandoned. One way to fix would be to change NOTIFY_RETRIEVE handler to wait until queued NOTIFY_REPLY is actually read back to the server and only then return NOTIFY_RETRIEVE status. However this is change in behaviour and would require filesystems to have at least 2 threads. In particular a single-threaded filesystem that was previously successfully using NOTIFY_RETRIEVE would become stuck after the change. This way of fixing is thus not acceptable. However we can fix it another way - by always returning NOTIFY_REPLY irregardless of its original size - with so much data as provided read buffer could fit. This aligns with the way NOTIFY_RETRIEVE handler works, which already unconditionally caps requested retrieve size to fuse_conn->max_pages. This way it should not hurt NOTIFY_RETRIEVE semantic if we return less data than was originally requested. This fix requires another behaviour change however - to be sure that read buffer has enough capacity to always fit fixed NOTIFY_REPLY part plus at least some (0 or more) data, we have to precheck the buffer before dequeuing and handling a request. And if the buffer is very small - return EINVAL to read in filesystem with semantic that queued read was invalid from the viewpoint of FUSE protocol. Even though this is also behaviour change, this should not practically cause problems: 1d3d752b47 (fuse: clean up request size limit checking), which originally removed such EINVAL return and reworked fuse_dev_do_read to loop and retry, also added FUSE_MIN_READ_BUFFER=3D8K to user-visible fuse.h with comment that "The read buffer is required to be at least 8k ..." Even though FUSE_MIN_READ_BUFFER is not currently checked anywhere in the kernel, libfuse always initializes session with bufsize=3D32=C2=B7pages and, since = its beginning, (at least from 2005) issues a warning should user modify fuse_session->bufsize directly to be sure that queued buffers are at least as large as that sane minimum: =09https://github.com/libfuse/libfuse/blob/fuse-3.3.0-22-g63d53ecc3a/lib/fu= se_lowlevel.c#L2869 =09https://github.com/libfuse/libfuse/blob/fuse-3.3.0-22-g63d53ecc3a/lib/fu= se_lowlevel.c#L1947 =09(semantic added in https://github.com/libfuse/libfuse/commit/044da2e9e0) This way we should be safe to add the check for minimum read buffer size. I've hit this bug for real with my filesystem that is using https://github.com/hanwen/go-fuse: there was no NOTIFY_REPLY after successful NOTIFY_RETRIEVE and the filesystem was stuck waiting, because FUSE protocol (definition scattered through many places) states that NOTIFY_REPLY is guaranteed to come after successful NOTIFY_RETRIEVE (see 2d45ba381a "fuse: add retrieve request"). After inspecting /sys/fs/fuse/connections/X/waiting and seeing it was 1, I was initially suspecting that it was user-space who is not issuing /dev/fuse reads and NOTIFY_REPLY is there but stuck in kernel pending queue. However tracing what is going on in /dev/fuse exchange and in both kernel and userspace (see https://lab.nexedi.com/kirr/wendelin.core/blob/13d2d1f8/wcfs/fusetrace= ) showed that there are correctly queued /dev/fuse reads still pending after NOTIFY_RETRIEVE returns and it is the kernel who is not replying back= : =09... =09P2 2.215710 /dev/fuse <- qread wcfs/11399_4_r: =09 syscall.Syscall+48 =09 syscall.Read+73 =09 github.com/hanwen/go-fuse/fuse.(*Server).readRequest.func1+85 =09 github.com/hanwen/go-fuse/fuse.handleEINTR+39 =09 github.com/hanwen/go-fuse/fuse.(*Server).readRequest+355 =09 github.com/hanwen/go-fuse/fuse.(*Server).loop+107 =09 runtime.goexit+1 =09P2 2.215810 /dev/fuse -> read wcfs/11399_4_r: =09 .56 RELEASE i8 ... (ret=3D64) =09P2 2.215859 /dev/fuse <- write wcfs/11399_5_w: =09 .56 (0) ... =09 syscall.Syscall+48 =09 syscall.Write+73 =09 github.com/hanwen/go-fuse/fuse.(*Server).systemWrite.func1+76 =09 github.com/hanwen/go-fuse/fuse.handleEINTR+39 =09 github.com/hanwen/go-fuse/fuse.(*Server).systemWrite+931 =09 github.com/hanwen/go-fuse/fuse.(*Server).write+194 =09 github.com/hanwen/go-fuse/fuse.(*Server).handleRequest+179 =09 github.com/hanwen/go-fuse/fuse.(*Server).loop+399 =09 runtime.goexit+1 =09P2 2.215871 /dev/fuse -> write_ack wcfs/11399_5_w (ret=3D16) =09P2 2.215876 /dev/fuse <- qread wcfs/11399_5_r: <-- NOTE =09 syscall.Syscall+48 =09 syscall.Read+73 =09 github.com/hanwen/go-fuse/fuse.(*Server).readRequest.func1+85 =09 github.com/hanwen/go-fuse/fuse.handleEINTR+39 =09 github.com/hanwen/go-fuse/fuse.(*Server).readRequest+355 =09 github.com/hanwen/go-fuse/fuse.(*Server).loop+107 =09 runtime.goexit+1 =09P0 2.221527 /dev/fuse <- qread wcfs/11401_1_r: <-- NOTE =09 syscall.Syscall+48 =09 syscall.Read+73 =09 github.com/hanwen/go-fuse/fuse.(*Server).readRequest.func1+85 =09 github.com/hanwen/go-fuse/fuse.handleEINTR+39 =09 github.com/hanwen/go-fuse/fuse.(*Server).readRequest+355 =09 github.com/hanwen/go-fuse/fuse.(*Server).loop+107 =09 runtime.goexit+1 =09P1 2.239384 /dev/fuse -> read wcfs/11398_6_r:=09# woken read that = was queued before "..." =09 .57 READ i5 ... (ret=3D80) =09P0 2.239626 /dev/fuse <- write wcfs/11397_0_w: =09 NOTIFY_RETRIEVE ... =09 syscall.Syscall+48 =09 syscall.Write+73 =09 github.com/hanwen/go-fuse/fuse.(*Server).systemWrite.func1+76 =09 github.com/hanwen/go-fuse/fuse.handleEINTR+39 =09 github.com/hanwen/go-fuse/fuse.(*Server).systemWrite+931 =09 github.com/hanwen/go-fuse/fuse.(*Server).write+194 =09 github.com/hanwen/go-fuse/fuse.(*Server).InodeRetrieveCache+764 =09 github.com/hanwen/go-fuse/fuse/nodefs.(*FileSystemConnector).Fil= eRetrieveCache+157 =09 main.(*BigFile).invalidateBlk+232 =09 main.(*Root).z=CE=B4handle1.func1+72 =09 golang.org/x/sync/errgroup.(*Group).Go.func1+87 =09 runtime.goexit+1 =09P0 2.239660 /dev/fuse -> write_ack wcfs/11397_0_w (ret=3D48) =09# stuck =09# (full trace: https://lab.nexedi.com/kirr/wendelin.core/commit/96416aaa= bd) with queued / served read analysis confirming that two reads were indeed qu= eued and not served: =09grep -w -e '<- qread\>' y.log |awk {'print $6'} |sort >qread.txt =09grep -w -e '-> read\>' y.log |awk {'print $6'} |sort >read.txt =09# xdiff qread.txt read.txt =09diff --git a/qread.txt b/read.txt =09index 4ab50d7..fdd2be1 100644 =09--- a/qread.txt =09+++ b/read.txt =09@@ -53,7 +53,5 @@ wcfs/11399_1_r: =09 wcfs/11399_2_r: =09 wcfs/11399_3_r: =09 wcfs/11399_4_r: =09-wcfs/11399_5_r: =09 wcfs/11400_0_r: =09 wcfs/11401_0_r: =09-wcfs/11401_1_r: The bug was hit because go-fuse by default uses 64K for read buffer size =09https://github.com/hanwen/go-fuse/blob/33711add/fuse/server.go#L142 and the kernel presets fuse_conn->max_pages to be 128K (=3D 32=C2=B74K page= s). Go-fuse will be likely fixed to both use bufsize=3Dkernel's and to correctly handle size > bufsize in InodeRetrieveCache. However we should also fix the kernel to always deliver NOTIFY_REPLY once NOTIFY_RETRIEVE was successful, so that FUSE protocol guarantee always holds irregardless of whether userspace used default or other valid buffer size setting, and so that filesystems can count not to get stuck waiting for kernel who promised a reply. This way this patch is here. Signed-off-by: Kirill Smelkov Cc: Han-Wen Nienhuys Cc: Jakob Unterwurzacher Cc: # v2.6.36+ --- First patch version was sent 1 week ago, but got no response: https://marc.info/?l=3Dlinux-fsdevel&m=3D155000277921155&w=3D2 Changes since v1: don't forget to also update req->misc.retrieve_in.size after truncation. ( This is my first patch to fs/fuse, so please forgive me if I missed anyt= hing. ) fs/fuse/dev.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 65 insertions(+), 6 deletions(-) diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c index 8a63e52785e9..93deb8e54d88 100644 --- a/fs/fuse/dev.c +++ b/fs/fuse/dev.c @@ -381,6 +381,40 @@ static void queue_request(struct fuse_iqueue *fiq, str= uct fuse_req *req) =09kill_fasync(&fiq->fasync, SIGIO, POLL_IN); } +/* + * fuse_req_truncate_data truncates data in request that has paged data + * (req.in.argpages=3D1), so that whole request, when serialized, is <=3D = nbytes. + * + * nbytes must be >=3D size(request without data). + */ +static void fuse_req_truncate_data(struct fuse_req *req, unsigned nbytes) = { +=09unsigned size, n; + +=09BUG_ON(!req->in.argpages); +=09BUG_ON(req->in.numargs < 1); + +=09/* request size without data */ +=09size =3D sizeof(struct fuse_in_header) + +=09=09len_args(req->in.numargs - 1, (struct fuse_arg *) req->in.args); +=09BUG_ON(nbytes < size); + +=09/* truncate paged data */ +=09for (n =3D 0; n < req->num_pages; n++) { +=09=09struct fuse_page_desc *p =3D &req->page_descs[n]; + +=09=09if (size >=3D nbytes) { +=09=09=09p->length =3D 0; +=09=09} else { +=09=09=09p->length =3D min_t(unsigned, p->length, nbytes - size); +=09=09} + +=09=09size +=3D p->length; +=09} + +=09/* update whole request length in the header */ +=09req->in.h.len =3D size; +} + void fuse_queue_forget(struct fuse_conn *fc, struct fuse_forget_link *forg= et, =09=09 u64 nodeid, u64 nlookup) { @@ -1317,6 +1351,15 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud= , struct file *file, =09unsigned reqsize; =09unsigned int hash; +=09/* +=09 * Require sane minimum read buffer - that has capacity for fixed part +=09 * of any request + some room for data. If the requirement is not +=09 * satisfied return EINVAL to the filesystem without dequeueing / +=09 * aborting any request. +=09 */ +=09if (nbytes < FUSE_MIN_READ_BUFFER) +=09=09return -EINVAL; + restart: =09spin_lock(&fiq->waitq.lock); =09err =3D -EAGAIN; @@ -1358,12 +1401,28 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fu= d, struct file *file, =09/* If request is too large, reply with an error and restart the read */ =09if (nbytes < reqsize) { -=09=09req->out.h.error =3D -EIO; -=09=09/* SETXATTR is special, since it may contain too large data */ -=09=09if (in->h.opcode =3D=3D FUSE_SETXATTR) -=09=09=09req->out.h.error =3D -E2BIG; -=09=09request_end(fc, req); -=09=09goto restart; +=09=09switch (in->h.opcode) { +=09=09default: +=09=09=09req->out.h.error =3D -EIO; +=09=09=09/* SETXATTR is special, since it may contain too large data */ +=09=09=09if (in->h.opcode =3D=3D FUSE_SETXATTR) +=09=09=09=09req->out.h.error =3D -E2BIG; +=09=09=09request_end(fc, req); +=09=09=09goto restart; + +=09=09/* +=09=09 * NOTIFY_REPLY is special: if it was queued we already +=09=09 * promised to filesystem to deliver it when handling +=09=09 * NOTIFY_RETRIVE. We know that read buffer has capacity for at +=09=09 * least some data. Truncate retrieved data to read buffer size +=09=09 * and deliver it to stay to the promise. +=09=09 */ +=09=09case FUSE_NOTIFY_REPLY: +=09=09=09fuse_req_truncate_data(req, nbytes); +=09=09=09req->misc.retrieve_in.size -=3D reqsize - in->h.len; +=09=09=09reqsize =3D in->h.len; +=09=09} + =09} =09spin_lock(&fpq->lock); =09list_add(&req->list, &fpq->io); -- 2.21.0.rc0.269.g1a574e7a28 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-14.1 required=3.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI, MENTIONS_GIT_HOSTING,SIGNED_OFF_BY,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 5C574C10F00 for ; Tue, 19 Feb 2019 09:57:15 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 2492E2146F for ; Tue, 19 Feb 2019 09:57:15 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=nexedi.com header.i=kirr@nexedi.com header.b="Cus1q7dl"; dkim=pass (1024-bit key) header.d=mandrillapp.com header.i=@mandrillapp.com header.b="BAhtEzLD" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1725763AbfBSJ5O (ORCPT ); Tue, 19 Feb 2019 04:57:14 -0500 Received: from mail136-25.atl41.mandrillapp.com ([198.2.136.25]:61633 "EHLO mail136-25.atl41.mandrillapp.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726548AbfBSJ5O (ORCPT ); Tue, 19 Feb 2019 04:57:14 -0500 X-Greylist: delayed 902 seconds by postgrey-1.27 at vger.kernel.org; Tue, 19 Feb 2019 04:57:13 EST DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; s=mandrill; d=nexedi.com; h=From:Subject:To:Cc:Message-Id:Date:MIME-Version:Content-Type:Content-Transfer-Encoding; i=kirr@nexedi.com; bh=SrfAldX/egEUBm1hepTN68TtqRwyuS2H0qH5d3iy04c=; b=Cus1q7dleAyVeQ8qPVRUGq2oEgSqkn9g1mxgMsyZO4Cv2ldxpdMJs8kJ06/6NCFL5hujPYPI/cEl BgF4KPhvGVj0R1QMlH4yBFHiQj6dKNSqA6E1CMpHSdgFlaeycfU6Cp7KM6LakCYxfR/pAJcf9i4S hhYOdpuEqZRcqN7ZD+E= Received: from pmta04.mandrill.prod.atl01.rsglab.com (127.0.0.1) by mail136-25.atl41.mandrillapp.com id hdf9fk1sb1k2 for ; Tue, 19 Feb 2019 09:42:11 +0000 (envelope-from ) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=mandrillapp.com; i=@mandrillapp.com; q=dns/txt; s=mandrill; t=1550569331; h=From : Subject : To : Cc : Message-Id : Date : MIME-Version : Content-Type : Content-Transfer-Encoding : From : Subject : Date : X-Mandrill-User : List-Unsubscribe; bh=SrfAldX/egEUBm1hepTN68TtqRwyuS2H0qH5d3iy04c=; b=BAhtEzLD7gZ4ei1xyh0TcpNBVAZQslqbkcVqzu6YNTr5gp3T4I319mz+6OBHpoG6HWWY/z 72y5b31cnJg8jczebsZ/PC0Bi7rte0ueSJXTzfH3sRr6KCC8IJxhH3AvGVusxOeTO5JBLim8 DPUsWvd1SwZ1HJfgmOIy0EINW9XXs= From: Kirill Smelkov Subject: [RESEND, PATCH v2] fuse: Don't drop NOTIFY_REPLY if we promised it Received: from [87.98.221.171] by mandrillapp.com id 143076e1b646429a9c869652e1feb6c5; Tue, 19 Feb 2019 09:42:11 +0000 X-Mailer: git-send-email 2.21.0.rc0.269.g1a574e7a28 To: Miklos Szeredi , Miklos Szeredi Cc: , , Kirill Smelkov , Han-Wen Nienhuys , Jakob Unterwurzacher , Message-Id: <20190219094147.32734-1-kirr@nexedi.com> X-Report-Abuse: Please forward a copy of this message, including all headers, to abuse@mandrill.com X-Report-Abuse: You can also report abuse here: http://mandrillapp.com/contact/abuse?id=31050260.143076e1b646429a9c869652e1feb6c5 X-Mandrill-User: md_31050260 Date: Tue, 19 Feb 2019 09:42:11 +0000 MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Sender: stable-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: stable@vger.kernel.org A successful call to NOTIFY_RETRIEVE by filesystem carries promise from the kernel to send back NOTIFY_REPLY message. However if the filesystem is not reading requests with fuse_conn->max_pages capacity, fuse_dev_do_read might see that the "request is too large" and decide to "reply with an error and restart the read". "Reply with an error" has underlying assumption that there is a "requester thread" that is waiting for request completion, which is true for most requests, but is not true for NOTIFY_REPLY: NOTIFY_RETRIEVE handler completes with OK status right after it could successfully queue NOTIFY_REPLY message without waiting for NOTIFY_REPLY completion. This leads to situation when filesystem requested to retrieve inode data with NOTIFY_RETRIEVE, got err=3DOK for that notification request, but NOTIFY_REPLY is not coming back. More, since there is no "requester thread" to handle the error, the situation shows itself as /sys/fs/fuse/connections/X/waiting=3D1 _and_ /dev/fuse read(s) queued. Which is misleading since NOTIFY_REPLY request was removed from pending queue and abandoned. One way to fix would be to change NOTIFY_RETRIEVE handler to wait until queued NOTIFY_REPLY is actually read back to the server and only then return NOTIFY_RETRIEVE status. However this is change in behaviour and would require filesystems to have at least 2 threads. In particular a single-threaded filesystem that was previously successfully using NOTIFY_RETRIEVE would become stuck after the change. This way of fixing is thus not acceptable. However we can fix it another way - by always returning NOTIFY_REPLY irregardless of its original size - with so much data as provided read buffer could fit. This aligns with the way NOTIFY_RETRIEVE handler works, which already unconditionally caps requested retrieve size to fuse_conn->max_pages. This way it should not hurt NOTIFY_RETRIEVE semantic if we return less data than was originally requested. This fix requires another behaviour change however - to be sure that read buffer has enough capacity to always fit fixed NOTIFY_REPLY part plus at least some (0 or more) data, we have to precheck the buffer before dequeuing and handling a request. And if the buffer is very small - return EINVAL to read in filesystem with semantic that queued read was invalid from the viewpoint of FUSE protocol. Even though this is also behaviour change, this should not practically cause problems: 1d3d752b47 (fuse: clean up request size limit checking), which originally removed such EINVAL return and reworked fuse_dev_do_read to loop and retry, also added FUSE_MIN_READ_BUFFER=3D8K to user-visible fuse.h with comment that "The read buffer is required to be at least 8k ..." Even though FUSE_MIN_READ_BUFFER is not currently checked anywhere in the kernel, libfuse always initializes session with bufsize=3D32=C2=B7pages and, since = its beginning, (at least from 2005) issues a warning should user modify fuse_session->bufsize directly to be sure that queued buffers are at least as large as that sane minimum: =09https://github.com/libfuse/libfuse/blob/fuse-3.3.0-22-g63d53ecc3a/lib/fu= se_lowlevel.c#L2869 =09https://github.com/libfuse/libfuse/blob/fuse-3.3.0-22-g63d53ecc3a/lib/fu= se_lowlevel.c#L1947 =09(semantic added in https://github.com/libfuse/libfuse/commit/044da2e9e0) This way we should be safe to add the check for minimum read buffer size. I've hit this bug for real with my filesystem that is using https://github.com/hanwen/go-fuse: there was no NOTIFY_REPLY after successful NOTIFY_RETRIEVE and the filesystem was stuck waiting, because FUSE protocol (definition scattered through many places) states that NOTIFY_REPLY is guaranteed to come after successful NOTIFY_RETRIEVE (see 2d45ba381a "fuse: add retrieve request"). After inspecting /sys/fs/fuse/connections/X/waiting and seeing it was 1, I was initially suspecting that it was user-space who is not issuing /dev/fuse reads and NOTIFY_REPLY is there but stuck in kernel pending queue. However tracing what is going on in /dev/fuse exchange and in both kernel and userspace (see https://lab.nexedi.com/kirr/wendelin.core/blob/13d2d1f8/wcfs/fusetrace= ) showed that there are correctly queued /dev/fuse reads still pending after NOTIFY_RETRIEVE returns and it is the kernel who is not replying back= : =09... =09P2 2.215710 /dev/fuse <- qread wcfs/11399_4_r: =09 syscall.Syscall+48 =09 syscall.Read+73 =09 github.com/hanwen/go-fuse/fuse.(*Server).readRequest.func1+85 =09 github.com/hanwen/go-fuse/fuse.handleEINTR+39 =09 github.com/hanwen/go-fuse/fuse.(*Server).readRequest+355 =09 github.com/hanwen/go-fuse/fuse.(*Server).loop+107 =09 runtime.goexit+1 =09P2 2.215810 /dev/fuse -> read wcfs/11399_4_r: =09 .56 RELEASE i8 ... (ret=3D64) =09P2 2.215859 /dev/fuse <- write wcfs/11399_5_w: =09 .56 (0) ... =09 syscall.Syscall+48 =09 syscall.Write+73 =09 github.com/hanwen/go-fuse/fuse.(*Server).systemWrite.func1+76 =09 github.com/hanwen/go-fuse/fuse.handleEINTR+39 =09 github.com/hanwen/go-fuse/fuse.(*Server).systemWrite+931 =09 github.com/hanwen/go-fuse/fuse.(*Server).write+194 =09 github.com/hanwen/go-fuse/fuse.(*Server).handleRequest+179 =09 github.com/hanwen/go-fuse/fuse.(*Server).loop+399 =09 runtime.goexit+1 =09P2 2.215871 /dev/fuse -> write_ack wcfs/11399_5_w (ret=3D16) =09P2 2.215876 /dev/fuse <- qread wcfs/11399_5_r: <-- NOTE =09 syscall.Syscall+48 =09 syscall.Read+73 =09 github.com/hanwen/go-fuse/fuse.(*Server).readRequest.func1+85 =09 github.com/hanwen/go-fuse/fuse.handleEINTR+39 =09 github.com/hanwen/go-fuse/fuse.(*Server).readRequest+355 =09 github.com/hanwen/go-fuse/fuse.(*Server).loop+107 =09 runtime.goexit+1 =09P0 2.221527 /dev/fuse <- qread wcfs/11401_1_r: <-- NOTE =09 syscall.Syscall+48 =09 syscall.Read+73 =09 github.com/hanwen/go-fuse/fuse.(*Server).readRequest.func1+85 =09 github.com/hanwen/go-fuse/fuse.handleEINTR+39 =09 github.com/hanwen/go-fuse/fuse.(*Server).readRequest+355 =09 github.com/hanwen/go-fuse/fuse.(*Server).loop+107 =09 runtime.goexit+1 =09P1 2.239384 /dev/fuse -> read wcfs/11398_6_r:=09# woken read that = was queued before "..." =09 .57 READ i5 ... (ret=3D80) =09P0 2.239626 /dev/fuse <- write wcfs/11397_0_w: =09 NOTIFY_RETRIEVE ... =09 syscall.Syscall+48 =09 syscall.Write+73 =09 github.com/hanwen/go-fuse/fuse.(*Server).systemWrite.func1+76 =09 github.com/hanwen/go-fuse/fuse.handleEINTR+39 =09 github.com/hanwen/go-fuse/fuse.(*Server).systemWrite+931 =09 github.com/hanwen/go-fuse/fuse.(*Server).write+194 =09 github.com/hanwen/go-fuse/fuse.(*Server).InodeRetrieveCache+764 =09 github.com/hanwen/go-fuse/fuse/nodefs.(*FileSystemConnector).Fil= eRetrieveCache+157 =09 main.(*BigFile).invalidateBlk+232 =09 main.(*Root).z=CE=B4handle1.func1+72 =09 golang.org/x/sync/errgroup.(*Group).Go.func1+87 =09 runtime.goexit+1 =09P0 2.239660 /dev/fuse -> write_ack wcfs/11397_0_w (ret=3D48) =09# stuck =09# (full trace: https://lab.nexedi.com/kirr/wendelin.core/commit/96416aaa= bd) with queued / served read analysis confirming that two reads were indeed qu= eued and not served: =09grep -w -e '<- qread\>' y.log |awk {'print $6'} |sort >qread.txt =09grep -w -e '-> read\>' y.log |awk {'print $6'} |sort >read.txt =09# xdiff qread.txt read.txt =09diff --git a/qread.txt b/read.txt =09index 4ab50d7..fdd2be1 100644 =09--- a/qread.txt =09+++ b/read.txt =09@@ -53,7 +53,5 @@ wcfs/11399_1_r: =09 wcfs/11399_2_r: =09 wcfs/11399_3_r: =09 wcfs/11399_4_r: =09-wcfs/11399_5_r: =09 wcfs/11400_0_r: =09 wcfs/11401_0_r: =09-wcfs/11401_1_r: The bug was hit because go-fuse by default uses 64K for read buffer size =09https://github.com/hanwen/go-fuse/blob/33711add/fuse/server.go#L142 and the kernel presets fuse_conn->max_pages to be 128K (=3D 32=C2=B74K page= s). Go-fuse will be likely fixed to both use bufsize=3Dkernel's and to correctly handle size > bufsize in InodeRetrieveCache. However we should also fix the kernel to always deliver NOTIFY_REPLY once NOTIFY_RETRIEVE was successful, so that FUSE protocol guarantee always holds irregardless of whether userspace used default or other valid buffer size setting, and so that filesystems can count not to get stuck waiting for kernel who promised a reply. This way this patch is here. Signed-off-by: Kirill Smelkov Cc: Han-Wen Nienhuys Cc: Jakob Unterwurzacher Cc: # v2.6.36+ --- First patch version was sent 1 week ago, but got no response: https://marc.info/?l=3Dlinux-fsdevel&m=3D155000277921155&w=3D2 Changes since v1: don't forget to also update req->misc.retrieve_in.size after truncation. ( This is my first patch to fs/fuse, so please forgive me if I missed anyt= hing. ) fs/fuse/dev.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 65 insertions(+), 6 deletions(-) diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c index 8a63e52785e9..93deb8e54d88 100644 --- a/fs/fuse/dev.c +++ b/fs/fuse/dev.c @@ -381,6 +381,40 @@ static void queue_request(struct fuse_iqueue *fiq, str= uct fuse_req *req) =09kill_fasync(&fiq->fasync, SIGIO, POLL_IN); } +/* + * fuse_req_truncate_data truncates data in request that has paged data + * (req.in.argpages=3D1), so that whole request, when serialized, is <=3D = nbytes. + * + * nbytes must be >=3D size(request without data). + */ +static void fuse_req_truncate_data(struct fuse_req *req, unsigned nbytes) = { +=09unsigned size, n; + +=09BUG_ON(!req->in.argpages); +=09BUG_ON(req->in.numargs < 1); + +=09/* request size without data */ +=09size =3D sizeof(struct fuse_in_header) + +=09=09len_args(req->in.numargs - 1, (struct fuse_arg *) req->in.args); +=09BUG_ON(nbytes < size); + +=09/* truncate paged data */ +=09for (n =3D 0; n < req->num_pages; n++) { +=09=09struct fuse_page_desc *p =3D &req->page_descs[n]; + +=09=09if (size >=3D nbytes) { +=09=09=09p->length =3D 0; +=09=09} else { +=09=09=09p->length =3D min_t(unsigned, p->length, nbytes - size); +=09=09} + +=09=09size +=3D p->length; +=09} + +=09/* update whole request length in the header */ +=09req->in.h.len =3D size; +} + void fuse_queue_forget(struct fuse_conn *fc, struct fuse_forget_link *forg= et, =09=09 u64 nodeid, u64 nlookup) { @@ -1317,6 +1351,15 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud= , struct file *file, =09unsigned reqsize; =09unsigned int hash; +=09/* +=09 * Require sane minimum read buffer - that has capacity for fixed part +=09 * of any request + some room for data. If the requirement is not +=09 * satisfied return EINVAL to the filesystem without dequeueing / +=09 * aborting any request. +=09 */ +=09if (nbytes < FUSE_MIN_READ_BUFFER) +=09=09return -EINVAL; + restart: =09spin_lock(&fiq->waitq.lock); =09err =3D -EAGAIN; @@ -1358,12 +1401,28 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fu= d, struct file *file, =09/* If request is too large, reply with an error and restart the read */ =09if (nbytes < reqsize) { -=09=09req->out.h.error =3D -EIO; -=09=09/* SETXATTR is special, since it may contain too large data */ -=09=09if (in->h.opcode =3D=3D FUSE_SETXATTR) -=09=09=09req->out.h.error =3D -E2BIG; -=09=09request_end(fc, req); -=09=09goto restart; +=09=09switch (in->h.opcode) { +=09=09default: +=09=09=09req->out.h.error =3D -EIO; +=09=09=09/* SETXATTR is special, since it may contain too large data */ +=09=09=09if (in->h.opcode =3D=3D FUSE_SETXATTR) +=09=09=09=09req->out.h.error =3D -E2BIG; +=09=09=09request_end(fc, req); +=09=09=09goto restart; + +=09=09/* +=09=09 * NOTIFY_REPLY is special: if it was queued we already +=09=09 * promised to filesystem to deliver it when handling +=09=09 * NOTIFY_RETRIVE. We know that read buffer has capacity for at +=09=09 * least some data. Truncate retrieved data to read buffer size +=09=09 * and deliver it to stay to the promise. +=09=09 */ +=09=09case FUSE_NOTIFY_REPLY: +=09=09=09fuse_req_truncate_data(req, nbytes); +=09=09=09req->misc.retrieve_in.size -=3D reqsize - in->h.len; +=09=09=09reqsize =3D in->h.len; +=09=09} + =09} =09spin_lock(&fpq->lock); =09list_add(&req->list, &fpq->io); -- 2.21.0.rc0.269.g1a574e7a28