From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=DSUU=RC=vger.kernel.org=linux-fsdevel-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-7.1 required=3.0 tests=DKIMWL_WL_MED,DKIM_SIGNED,
	DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,
	MAILING_LIST_MULTI,SIGNED_OFF_BY,SPF_PASS autolearn=ham autolearn_force=no
	version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 8E504C43381
	for <linux-fsdevel@archiver.kernel.org>; Wed, 27 Feb 2019 20:17:10 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by mail.kernel.org (Postfix) with ESMTP id 1CB2D213A2
	for <linux-fsdevel@archiver.kernel.org>; Wed, 27 Feb 2019 20:17:10 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (1024-bit key) header.d=nexedi.com header.i=kirr@nexedi.com header.b="ANxn/2Et";
	dkim=pass (1024-bit key) header.d=mandrillapp.com header.i=@mandrillapp.com header.b="pJqqYJiI"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1730348AbfB0URJ (ORCPT
        <rfc822;linux-fsdevel@archiver.kernel.org>);
        Wed, 27 Feb 2019 15:17:09 -0500
Received: from mail179-15.suw41.mandrillapp.com ([198.2.179.15]:15621 "EHLO
        mail179-15.suw41.mandrillapp.com" rhost-flags-OK-OK-OK-OK)
        by vger.kernel.org with ESMTP id S1726594AbfB0URI (ORCPT
        <rfc822;linux-fsdevel@vger.kernel.org>);
        Wed, 27 Feb 2019 15:17:08 -0500
X-Greylist: delayed 900 seconds by postgrey-1.27 at vger.kernel.org; Wed, 27 Feb 2019 15:17:06 EST
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; s=mandrill; d=nexedi.com;
 h=From:Subject:To:Cc:Message-Id:References:In-Reply-To:Date:MIME-Version:Content-Type:Content-Transfer-Encoding; i=kirr@nexedi.com;
 bh=6PeEi5lPSjcmnRYE5dBgA+L5cQiB8/zTDhrD854Ct10=;
 b=ANxn/2EtDWD/8P+VuCHstrsqmQldNx9BCMEB4yEpCNocuRirUiL3VQThqomgTwK3u3yDz60YI9N6
   JU5l52UdEmvt1tOmf9o9WMx/nBDZHuhIG6Ob23Nu1pWY6ISoqOn7uOQuEsRIbvO+vRNNdi15Nlgs
   cH+MYCLEdCpGZYB67jM=
Received: from pmta04.mandrill.prod.suw01.rsglab.com (127.0.0.1) by mail179-15.suw41.mandrillapp.com id hero4422s28a for <linux-fsdevel@vger.kernel.org>; Wed, 27 Feb 2019 20:02:05 +0000 (envelope-from <bounce-md_31050260.5c76ecbc.v1-bb71bdbc62a54b59a7f1b2404f089835@mandrillapp.com>)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=mandrillapp.com; 
 i=@mandrillapp.com; q=dns/txt; s=mandrill; t=1551297724; h=From : 
 Subject : To : Cc : Message-Id : References : In-Reply-To : Date : 
 MIME-Version : Content-Type : Content-Transfer-Encoding : From : 
 Subject : Date : X-Mandrill-User : List-Unsubscribe; 
 bh=6PeEi5lPSjcmnRYE5dBgA+L5cQiB8/zTDhrD854Ct10=; 
 b=pJqqYJiII7FV2CNU+qv0FUAh0T/zOD+aUOJp4+BqwqyTby/f62zI4O6tP7iJIHpek8c4p9
 HItIESvwz37j9lZDIY4unbvC8tHYeVp8wkmoLKuoX8JTf6e1N3UkWL9bf2PwehEHc7IpwOcp
 kW+feXhcfjzF5br34Q3tkPnKRIBJQ=
From:   Kirill Smelkov <kirr@nexedi.com>
Subject: Re: [RESEND, PATCH v2] fuse: Don't drop NOTIFY_REPLY if we promised it
Received: from [87.98.221.171] by mandrillapp.com id bb71bdbc62a54b59a7f1b2404f089835; Wed, 27 Feb 2019 20:02:04 +0000
To:     Miklos Szeredi <miklos@szeredi.hu>
Cc:     Miklos Szeredi <mszeredi@redhat.com>,
        <linux-fsdevel@vger.kernel.org>,
        fuse-devel <fuse-devel@lists.sourceforge.net>,
        Han-Wen Nienhuys <hanwen@google.com>,
        Jakob Unterwurzacher <jakobunt@gmail.com>,
        stable <stable@vger.kernel.org>
Message-Id: <20190227200155.GA14682@deco.navytux.spb.ru>
References: <20190219094147.32734-1-kirr@nexedi.com> <CAJfpegtgNb2FpE=gHSVTmX5EWRJtDp-a5HZO2VS2_eHjPpp=fw@mail.gmail.com>
In-Reply-To: <CAJfpegtgNb2FpE=gHSVTmX5EWRJtDp-a5HZO2VS2_eHjPpp=fw@mail.gmail.com>
X-Report-Abuse: Please forward a copy of this message, including all headers, to abuse@mandrill.com
X-Report-Abuse: You can also report abuse here: http://mandrillapp.com/contact/abuse?id=31050260.bb71bdbc62a54b59a7f1b2404f089835
X-Mandrill-User: md_31050260
Date:   Wed, 27 Feb 2019 20:02:04 +0000
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Sender: linux-fsdevel-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-fsdevel.vger.kernel.org>
X-Mailing-List: linux-fsdevel@vger.kernel.org

Miklos, first of all thanks for feedback.

On Tue, Feb 26, 2019 at 04:14:22PM +0100, Miklos Szeredi wrote:
> On Tue, Feb 19, 2019 at 10:42 AM Kirill Smelkov <kirr@nexedi.com> wrote:
> >
> > A successful call to NOTIFY_RETRIEVE by filesystem carries promise from
> > the kernel to send back NOTIFY_REPLY message. However if the filesystem
> > is not reading requests with fuse_conn->max_pages capacity,
> 
> That's a violation of the contract by the fuse server, not the kernel.

Do you mean that even if filesystem server configures via
init_out.max_write that it is accepting e.g. only 32K max writes, it
still has to be issuing sys_read with buffer of 128K (= hardcoded
fuse_conn->max_pages before Linux 4.20, and default since Linux 4.20)?

Also, I could not find any FUSE contract being specified anywhere, so I
used message of the commit that added support for NOTIFY_RETRIEVE to sense its
semantic:

	commit 2d45ba381a74a743eeaa2b06c7c5c0d2bf73ba1a
	Author: Miklos Szeredi <mszeredi@suse.cz>
	Date:   Mon Jul 12 14:41:40 2010 +0200
	
	    fuse: add retrieve request
	    
	    Userspace filesystem can request data to be retrieved from the inode's
	    mapping.  This request is synchronous and the retrieved data is queued
	    as a new request.  If the write to the fuse device returns an error
	    then the retrieve request was not completed and a reply will not be
	    sent.
	    
	    Only present pages are returned in the retrieve reply.  Retrieving
	    stops when it finds a non-present page and only data prior to that is
	    returned.
	    
	    This request doesn't change the dirty state of pages.
	    
	    Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
	
which, even if not explicitly, gives the impression that if NOTIFY_RETRIEVE was
queued successfully, the reply will come.

Also: if it is a violation of contract by filesystem server, the kernel should
return ESOMETHING for violating sys_read, instead of making that read to
be waiting indefinitely, isn't it?

In summary: instead of getting clients stuck silently I still suggest
for NOTIFY_RETRIEVE to come to client, if it can come. And also to
return EINVAL for /dev/fuse sys_read calls that are violating the
server/kernel contract.


> > fuse_dev_do_read might see that the "request is too large" and decide to
> > "reply with an error and restart the read". "Reply with an error" has
> > underlying assumption that there is a "requester thread" that is waiting
> > for request completion, which is true for most requests, but is not true
> > for NOTIFY_REPLY: NOTIFY_RETRIEVE handler completes with OK status right
> > after it could successfully queue NOTIFY_REPLY message without waiting
> > for NOTIFY_REPLY completion. This leads to situation when filesystem
> > requested to retrieve inode data with NOTIFY_RETRIEVE, got err=OK for
> > that notification request, but NOTIFY_REPLY is not coming back.
> >
> > More, since there is no "requester thread" to handle the error, the
> > situation shows itself as /sys/fs/fuse/connections/X/waiting=1 _and_
> > /dev/fuse read(s) queued. Which is misleading since NOTIFY_REPLY request
> > was removed from pending queue and abandoned.
> 
> Now I don't understand how that would happen.   If the request is
> abandoned, its refcount should go down to zero and the num_waiting
> count decremented accordingly.

You are right - it was my mistake. I misinterpreted waiting=1 as a
request not being transferred to filesystem server yet, but in my test
it turned out to be already transferred and sitting on client "processing" list
waiting for corresponding reply (which was not coming because the filesystem in
turn was stuck waiting for NOTIFY_REPLY to come from the kernel):

	root@(none):~/src/wendelin/wendelin.core/wcfs# cat /sys/fs/fuse/connections/25/waiting
	1
	
	root@(none):~/src/wendelin/wendelin.core/wcfs# cat /sys/fs/fuse/connections/25/queue
	#waiting: 1
	(0)Interrupt:
	(0)Forget:
	(0)Request:
	(0)Processing.IO:
	(1)Processing.P:
	#0: .52 R15 i5


So the part of commit message that discussed X/waiting=1 is wrong and should be
dropped - thanks for catching this.


Still the main point is what should be the semantic of NOTIFY_RETRIEVE
vs NOTIFY_REPLY vs INIT.max_write, and that it is better to always send
retrieve data if client promised it, and also to explicitly indicate
with an error if filesystem server is violating FUSE server/client
contract.

Thanks,
Kirill


P.S. I attach the draft patch for /sys/fs/fuse/connections/X/queue in
case someone is interested.

---- 8< ----

diff --git a/fs/fuse/control.c b/fs/fuse/control.c
index fe80bea4ad89..f4e22f5436e2 100644
--- a/fs/fuse/control.c
+++ b/fs/fuse/control.c
@@ -63,6 +63,160 @@ static ssize_t fuse_conn_waiting_read(struct file *file, char __user *buf,
 	return simple_read_from_buffer(buf, len, ppos, tmp, size);
 }
 
+/* fuse_conn_iqueue_print prints input queue into provided buf.
+ *
+ * buf can be NULL in which case only the length of would-be printed text is
+ * returned and nothing is actually printed.
+ *
+ * must be called with fc->iq->waitq locked. */
+static size_t fuse_conn_iqueue_print(char *buf, size_t size, struct fuse_conn *fc)
+{
+	struct fuse_iqueue *fiq = &fc->iq;
+	struct fuse_req *req;
+	struct fuse_forget_link *freq;
+	size_t nreq;
+
+	size_t __n, __total = 0;
+#define emitf(FORMAT, ...) do {	\
+	__n = snprintf(buf, size, FORMAT, __VA_ARGS__);	\
+	__total += __n;					\
+	if (buf) {					\
+		size -= __n;				\
+		buf  += __n;				\
+	}						\
+} while (0)
+
+
+	if (!buf)
+		size = 0;
+
+	// XXX temp
+	emitf("#waiting: %d\n", atomic_read(&fc->num_waiting));
+
+	/* interrupts */
+	nreq = 0;
+	list_for_each_entry(req, &fiq->interrupts, list) {
+		nreq++;
+	}
+
+	emitf("(%lu)Interrupt:\n", nreq);
+
+	nreq = 0;
+	list_for_each_entry(req, &fiq->interrupts, list) {
+		emitf("\t#%lu: R%d\n", nreq, req->in.h.opcode);
+		nreq++;
+	}
+
+	/* forgets */
+	nreq = 0;
+	for (freq = fiq->forget_list_head.next; freq != NULL; freq = freq->next) {
+		nreq++;
+	}
+
+	emitf("(%lu)Forget:\n", nreq);
+
+	nreq = 0;
+	for(freq = fiq->forget_list_head.next; freq != NULL; freq = freq->next) {
+		emitf("\t#%lu: FORGET i%llu -%llu\n", nreq,
+			freq->forget_one.nodeid, freq->forget_one.nlookup);
+		nreq++;
+	}
+
+	/* all other requests */
+	nreq = 0;
+	list_for_each_entry(req, &fiq->pending, list) {
+		nreq++;
+	}
+
+	emitf("(%lu)Request:\n", nreq);
+
+	nreq = 0;
+	list_for_each_entry(req, &fiq->pending, list) {
+		emitf("\t#%lu: R%d\n", nreq, req->in.h.opcode);
+		nreq++;
+	}
+
+	/* processing */
+	// XXX temp? XXX locking
+	{
+	int i;
+	struct fuse_dev *fud;
+	list_for_each_entry(fud, &fc->devices, entry) {
+		struct fuse_pqueue *fpq = &fud->pq;
+
+		nreq = 0;
+		list_for_each_entry(req, &fpq->io, list) {
+			nreq++;
+		}
+		emitf("(%lu)Processing.IO:\n", nreq);
+
+		// XXX print IO elements
+
+		nreq = 0;
+		for (i = 0; i < FUSE_PQ_HASH_SIZE; i++) {
+			list_for_each_entry(req, &fpq->processing[i], list) {
+				nreq++;
+			}
+		}
+		emitf("(%lu)Processing.P:\n", nreq);
+
+		nreq = 0;
+		for (i = 0; i < FUSE_PQ_HASH_SIZE; i++) {
+			list_for_each_entry(req, &fpq->processing[i], list) {
+				struct fuse_in_header *h = &req->in.h;
+				emitf("\t#%lu: .%lld R%d i%llu\n", nreq, h->unique, h->opcode, h->nodeid);
+				nreq++;
+			}
+		}
+	}
+	}
+
+	return __total;
+#undef emitf
+}
+
+static ssize_t fuse_conn_iqueue_read(struct file *file, char __user *buf,
+				      size_t len, loff_t *ppos)
+{
+	char *qdump;
+
+	if (!*ppos) {
+		struct fuse_conn *fc = fuse_ctl_file_conn_get(file);
+		struct fuse_iqueue *fiq;
+		size_t n;
+		char *qdump2;
+		if (!fc)
+			return 0;
+
+		fiq = &fc->iq;
+		spin_lock(&fiq->waitq.lock);
+		n = fuse_conn_iqueue_print(NULL, 0, fc);
+		n += 1; /* trailing 0 */
+
+		qdump = kmalloc(n, GFP_ATOMIC);
+		if (qdump) {
+			fuse_conn_iqueue_print(qdump, n, fc);
+		}
+
+		spin_unlock(&fiq->waitq.lock);
+		fuse_conn_put(fc);
+
+		if (!qdump) {
+			return -ENOMEM;
+		}
+
+		/* release atomic memory, since it is scarce resource */
+		qdump2 = kstrdup(qdump, GFP_KERNEL);
+		kfree(qdump);
+
+		file->private_data = (void *)qdump2;
+		// TODO release qdump on file release
+	}
+
+	qdump = (char *)file->private_data;
+	return simple_read_from_buffer(buf, len, ppos, qdump, strlen(qdump));
+}
+
 static ssize_t fuse_conn_limit_read(struct file *file, char __user *buf,
 				    size_t len, loff_t *ppos, unsigned val)
 {
@@ -202,6 +356,12 @@ static const struct file_operations fuse_ctl_waiting_ops = {
 	.llseek = no_llseek,
 };
 
+static const struct file_operations fuse_ctl_queue_ops = {
+	.open = nonseekable_open,
+	.read = fuse_conn_iqueue_read,
+	.llseek = no_llseek,
+};
+
 static const struct file_operations fuse_conn_max_background_ops = {
 	.open = nonseekable_open,
 	.read = fuse_conn_max_background_read,
@@ -278,6 +438,8 @@ int fuse_ctl_add_conn(struct fuse_conn *fc)
 
 	if (!fuse_ctl_add_dentry(parent, fc, "waiting", S_IFREG | 0400, 1,
 				 NULL, &fuse_ctl_waiting_ops) ||
+	    !fuse_ctl_add_dentry(parent, fc, "queue", S_IFREG | 0400, 1,
+				 NULL, &fuse_ctl_queue_ops) ||
 	    !fuse_ctl_add_dentry(parent, fc, "abort", S_IFREG | 0200, 1,
 				 NULL, &fuse_ctl_abort_ops) ||
 	    !fuse_ctl_add_dentry(parent, fc, "max_background", S_IFREG | 0600,
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 0920c0c032a0..7efef59caaa9 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -41,7 +41,7 @@
 #define FUSE_NAME_MAX 1024
 
 /** Number of dentries for each connection in the control filesystem */
-#define FUSE_CTL_NUM_DENTRIES 5
+#define FUSE_CTL_NUM_DENTRIES 6
 
 /** Number of page pointers embedded in fuse_req */
 #define FUSE_REQ_INLINE_PAGES 1

From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=O+zh=RC=vger.kernel.org=stable-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-7.1 required=3.0 tests=DKIMWL_WL_MED,DKIM_SIGNED,
	DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,
	MAILING_LIST_MULTI,SIGNED_OFF_BY,SPF_PASS autolearn=ham autolearn_force=no
	version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id B8832C43381
	for <stable@archiver.kernel.org>; Wed, 27 Feb 2019 20:17:12 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by mail.kernel.org (Postfix) with ESMTP id 7AB382186A
	for <stable@archiver.kernel.org>; Wed, 27 Feb 2019 20:17:12 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (1024-bit key) header.d=nexedi.com header.i=kirr@nexedi.com header.b="VHYvTvSB";
	dkim=pass (1024-bit key) header.d=mandrillapp.com header.i=@mandrillapp.com header.b="K4zpRMq1"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1730349AbfB0URL (ORCPT <rfc822;stable@archiver.kernel.org>);
        Wed, 27 Feb 2019 15:17:11 -0500
Received: from mail179-15.suw41.mandrillapp.com ([198.2.179.15]:15621 "EHLO
        mail179-15.suw41.mandrillapp.com" rhost-flags-OK-OK-OK-OK)
        by vger.kernel.org with ESMTP id S1730256AbfB0URK (ORCPT
        <rfc822;stable@vger.kernel.org>); Wed, 27 Feb 2019 15:17:10 -0500
X-Greylist: delayed 903 seconds by postgrey-1.27 at vger.kernel.org; Wed, 27 Feb 2019 15:17:08 EST
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; s=mandrill; d=nexedi.com;
 h=From:Subject:To:Cc:Message-Id:References:In-Reply-To:Date:MIME-Version:Content-Type:Content-Transfer-Encoding; i=kirr@nexedi.com;
 bh=6PeEi5lPSjcmnRYE5dBgA+L5cQiB8/zTDhrD854Ct10=;
 b=VHYvTvSBJkjcaZFwi8UHFF87UwsvcC41iZbFJf2Ay9+lP1mDXRIcxmRmFdShqJSFGE05mzogcqAs
   csu5ENAKXuKZ13kaZeX2ICgSq2tim2XIRowXShPfBhtnvADWlmoFIUoPLzubrvcEPs4ZMhejPu6Q
   K6jOoPVOxd1n8BjmJEU=
Received: from pmta04.mandrill.prod.suw01.rsglab.com (127.0.0.1) by mail179-15.suw41.mandrillapp.com id hero4822s286 for <stable@vger.kernel.org>; Wed, 27 Feb 2019 20:02:05 +0000 (envelope-from <bounce-md_31050260.5c76ecbd.v1-26a40bfc8eb7453486e65aa4565309c8@mandrillapp.com>)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=mandrillapp.com; 
 i=@mandrillapp.com; q=dns/txt; s=mandrill; t=1551297725; h=From : 
 Subject : To : Cc : Message-Id : References : In-Reply-To : Date : 
 MIME-Version : Content-Type : Content-Transfer-Encoding : From : 
 Subject : Date : X-Mandrill-User : List-Unsubscribe; 
 bh=6PeEi5lPSjcmnRYE5dBgA+L5cQiB8/zTDhrD854Ct10=; 
 b=K4zpRMq1Iptvo/K+99XEj8BNXlUgaGChnzL/FI/kdphw57s/UShCzn7CZFrDtq1mGeIIvD
 GND6oL6cjI/h/BsuwCmMK+Oy88erp0BQKtxvQyo4LXqHCvtJmtY3Z3LMJLsZi46daxIi5K1a
 BKF3WdoUq+mUtVaeGS012PZuE9QhE=
From:   Kirill Smelkov <kirr@nexedi.com>
Subject: Re: [RESEND, PATCH v2] fuse: Don't drop NOTIFY_REPLY if we promised it
Received: from [87.98.221.171] by mandrillapp.com id 26a40bfc8eb7453486e65aa4565309c8; Wed, 27 Feb 2019 20:02:05 +0000
To:     Miklos Szeredi <miklos@szeredi.hu>
Cc:     Miklos Szeredi <mszeredi@redhat.com>,
        <linux-fsdevel@vger.kernel.org>,
        fuse-devel <fuse-devel@lists.sourceforge.net>,
        Han-Wen Nienhuys <hanwen@google.com>,
        Jakob Unterwurzacher <jakobunt@gmail.com>,
        stable <stable@vger.kernel.org>
Message-Id: <20190227200155.GA14682@deco.navytux.spb.ru>
References: <20190219094147.32734-1-kirr@nexedi.com> <CAJfpegtgNb2FpE=gHSVTmX5EWRJtDp-a5HZO2VS2_eHjPpp=fw@mail.gmail.com>
In-Reply-To: <CAJfpegtgNb2FpE=gHSVTmX5EWRJtDp-a5HZO2VS2_eHjPpp=fw@mail.gmail.com>
X-Report-Abuse: Please forward a copy of this message, including all headers, to abuse@mandrill.com
X-Report-Abuse: You can also report abuse here: http://mandrillapp.com/contact/abuse?id=31050260.26a40bfc8eb7453486e65aa4565309c8
X-Mandrill-User: md_31050260
Date:   Wed, 27 Feb 2019 20:02:05 +0000
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Sender: stable-owner@vger.kernel.org
Precedence: bulk
List-ID: <stable.vger.kernel.org>
X-Mailing-List: stable@vger.kernel.org

Miklos, first of all thanks for feedback.

On Tue, Feb 26, 2019 at 04:14:22PM +0100, Miklos Szeredi wrote:
> On Tue, Feb 19, 2019 at 10:42 AM Kirill Smelkov <kirr@nexedi.com> wrote:
> >
> > A successful call to NOTIFY_RETRIEVE by filesystem carries promise from
> > the kernel to send back NOTIFY_REPLY message. However if the filesystem
> > is not reading requests with fuse_conn->max_pages capacity,
> 
> That's a violation of the contract by the fuse server, not the kernel.

Do you mean that even if filesystem server configures via
init_out.max_write that it is accepting e.g. only 32K max writes, it
still has to be issuing sys_read with buffer of 128K (= hardcoded
fuse_conn->max_pages before Linux 4.20, and default since Linux 4.20)?

Also, I could not find any FUSE contract being specified anywhere, so I
used message of the commit that added support for NOTIFY_RETRIEVE to sense its
semantic:

	commit 2d45ba381a74a743eeaa2b06c7c5c0d2bf73ba1a
	Author: Miklos Szeredi <mszeredi@suse.cz>
	Date:   Mon Jul 12 14:41:40 2010 +0200
	
	    fuse: add retrieve request
	    
	    Userspace filesystem can request data to be retrieved from the inode's
	    mapping.  This request is synchronous and the retrieved data is queued
	    as a new request.  If the write to the fuse device returns an error
	    then the retrieve request was not completed and a reply will not be
	    sent.
	    
	    Only present pages are returned in the retrieve reply.  Retrieving
	    stops when it finds a non-present page and only data prior to that is
	    returned.
	    
	    This request doesn't change the dirty state of pages.
	    
	    Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
	
which, even if not explicitly, gives the impression that if NOTIFY_RETRIEVE was
queued successfully, the reply will come.

Also: if it is a violation of contract by filesystem server, the kernel should
return ESOMETHING for violating sys_read, instead of making that read to
be waiting indefinitely, isn't it?

In summary: instead of getting clients stuck silently I still suggest
for NOTIFY_RETRIEVE to come to client, if it can come. And also to
return EINVAL for /dev/fuse sys_read calls that are violating the
server/kernel contract.


> > fuse_dev_do_read might see that the "request is too large" and decide to
> > "reply with an error and restart the read". "Reply with an error" has
> > underlying assumption that there is a "requester thread" that is waiting
> > for request completion, which is true for most requests, but is not true
> > for NOTIFY_REPLY: NOTIFY_RETRIEVE handler completes with OK status right
> > after it could successfully queue NOTIFY_REPLY message without waiting
> > for NOTIFY_REPLY completion. This leads to situation when filesystem
> > requested to retrieve inode data with NOTIFY_RETRIEVE, got err=OK for
> > that notification request, but NOTIFY_REPLY is not coming back.
> >
> > More, since there is no "requester thread" to handle the error, the
> > situation shows itself as /sys/fs/fuse/connections/X/waiting=1 _and_
> > /dev/fuse read(s) queued. Which is misleading since NOTIFY_REPLY request
> > was removed from pending queue and abandoned.
> 
> Now I don't understand how that would happen.   If the request is
> abandoned, its refcount should go down to zero and the num_waiting
> count decremented accordingly.

You are right - it was my mistake. I misinterpreted waiting=1 as a
request not being transferred to filesystem server yet, but in my test
it turned out to be already transferred and sitting on client "processing" list
waiting for corresponding reply (which was not coming because the filesystem in
turn was stuck waiting for NOTIFY_REPLY to come from the kernel):

	root@(none):~/src/wendelin/wendelin.core/wcfs# cat /sys/fs/fuse/connections/25/waiting
	1
	
	root@(none):~/src/wendelin/wendelin.core/wcfs# cat /sys/fs/fuse/connections/25/queue
	#waiting: 1
	(0)Interrupt:
	(0)Forget:
	(0)Request:
	(0)Processing.IO:
	(1)Processing.P:
	#0: .52 R15 i5


So the part of commit message that discussed X/waiting=1 is wrong and should be
dropped - thanks for catching this.


Still the main point is what should be the semantic of NOTIFY_RETRIEVE
vs NOTIFY_REPLY vs INIT.max_write, and that it is better to always send
retrieve data if client promised it, and also to explicitly indicate
with an error if filesystem server is violating FUSE server/client
contract.

Thanks,
Kirill


P.S. I attach the draft patch for /sys/fs/fuse/connections/X/queue in
case someone is interested.

---- 8< ----

diff --git a/fs/fuse/control.c b/fs/fuse/control.c
index fe80bea4ad89..f4e22f5436e2 100644
--- a/fs/fuse/control.c
+++ b/fs/fuse/control.c
@@ -63,6 +63,160 @@ static ssize_t fuse_conn_waiting_read(struct file *file, char __user *buf,
 	return simple_read_from_buffer(buf, len, ppos, tmp, size);
 }
 
+/* fuse_conn_iqueue_print prints input queue into provided buf.
+ *
+ * buf can be NULL in which case only the length of would-be printed text is
+ * returned and nothing is actually printed.
+ *
+ * must be called with fc->iq->waitq locked. */
+static size_t fuse_conn_iqueue_print(char *buf, size_t size, struct fuse_conn *fc)
+{
+	struct fuse_iqueue *fiq = &fc->iq;
+	struct fuse_req *req;
+	struct fuse_forget_link *freq;
+	size_t nreq;
+
+	size_t __n, __total = 0;
+#define emitf(FORMAT, ...) do {	\
+	__n = snprintf(buf, size, FORMAT, __VA_ARGS__);	\
+	__total += __n;					\
+	if (buf) {					\
+		size -= __n;				\
+		buf  += __n;				\
+	}						\
+} while (0)
+
+
+	if (!buf)
+		size = 0;
+
+	// XXX temp
+	emitf("#waiting: %d\n", atomic_read(&fc->num_waiting));
+
+	/* interrupts */
+	nreq = 0;
+	list_for_each_entry(req, &fiq->interrupts, list) {
+		nreq++;
+	}
+
+	emitf("(%lu)Interrupt:\n", nreq);
+
+	nreq = 0;
+	list_for_each_entry(req, &fiq->interrupts, list) {
+		emitf("\t#%lu: R%d\n", nreq, req->in.h.opcode);
+		nreq++;
+	}
+
+	/* forgets */
+	nreq = 0;
+	for (freq = fiq->forget_list_head.next; freq != NULL; freq = freq->next) {
+		nreq++;
+	}
+
+	emitf("(%lu)Forget:\n", nreq);
+
+	nreq = 0;
+	for(freq = fiq->forget_list_head.next; freq != NULL; freq = freq->next) {
+		emitf("\t#%lu: FORGET i%llu -%llu\n", nreq,
+			freq->forget_one.nodeid, freq->forget_one.nlookup);
+		nreq++;
+	}
+
+	/* all other requests */
+	nreq = 0;
+	list_for_each_entry(req, &fiq->pending, list) {
+		nreq++;
+	}
+
+	emitf("(%lu)Request:\n", nreq);
+
+	nreq = 0;
+	list_for_each_entry(req, &fiq->pending, list) {
+		emitf("\t#%lu: R%d\n", nreq, req->in.h.opcode);
+		nreq++;
+	}
+
+	/* processing */
+	// XXX temp? XXX locking
+	{
+	int i;
+	struct fuse_dev *fud;
+	list_for_each_entry(fud, &fc->devices, entry) {
+		struct fuse_pqueue *fpq = &fud->pq;
+
+		nreq = 0;
+		list_for_each_entry(req, &fpq->io, list) {
+			nreq++;
+		}
+		emitf("(%lu)Processing.IO:\n", nreq);
+
+		// XXX print IO elements
+
+		nreq = 0;
+		for (i = 0; i < FUSE_PQ_HASH_SIZE; i++) {
+			list_for_each_entry(req, &fpq->processing[i], list) {
+				nreq++;
+			}
+		}
+		emitf("(%lu)Processing.P:\n", nreq);
+
+		nreq = 0;
+		for (i = 0; i < FUSE_PQ_HASH_SIZE; i++) {
+			list_for_each_entry(req, &fpq->processing[i], list) {
+				struct fuse_in_header *h = &req->in.h;
+				emitf("\t#%lu: .%lld R%d i%llu\n", nreq, h->unique, h->opcode, h->nodeid);
+				nreq++;
+			}
+		}
+	}
+	}
+
+	return __total;
+#undef emitf
+}
+
+static ssize_t fuse_conn_iqueue_read(struct file *file, char __user *buf,
+				      size_t len, loff_t *ppos)
+{
+	char *qdump;
+
+	if (!*ppos) {
+		struct fuse_conn *fc = fuse_ctl_file_conn_get(file);
+		struct fuse_iqueue *fiq;
+		size_t n;
+		char *qdump2;
+		if (!fc)
+			return 0;
+
+		fiq = &fc->iq;
+		spin_lock(&fiq->waitq.lock);
+		n = fuse_conn_iqueue_print(NULL, 0, fc);
+		n += 1; /* trailing 0 */
+
+		qdump = kmalloc(n, GFP_ATOMIC);
+		if (qdump) {
+			fuse_conn_iqueue_print(qdump, n, fc);
+		}
+
+		spin_unlock(&fiq->waitq.lock);
+		fuse_conn_put(fc);
+
+		if (!qdump) {
+			return -ENOMEM;
+		}
+
+		/* release atomic memory, since it is scarce resource */
+		qdump2 = kstrdup(qdump, GFP_KERNEL);
+		kfree(qdump);
+
+		file->private_data = (void *)qdump2;
+		// TODO release qdump on file release
+	}
+
+	qdump = (char *)file->private_data;
+	return simple_read_from_buffer(buf, len, ppos, qdump, strlen(qdump));
+}
+
 static ssize_t fuse_conn_limit_read(struct file *file, char __user *buf,
 				    size_t len, loff_t *ppos, unsigned val)
 {
@@ -202,6 +356,12 @@ static const struct file_operations fuse_ctl_waiting_ops = {
 	.llseek = no_llseek,
 };
 
+static const struct file_operations fuse_ctl_queue_ops = {
+	.open = nonseekable_open,
+	.read = fuse_conn_iqueue_read,
+	.llseek = no_llseek,
+};
+
 static const struct file_operations fuse_conn_max_background_ops = {
 	.open = nonseekable_open,
 	.read = fuse_conn_max_background_read,
@@ -278,6 +438,8 @@ int fuse_ctl_add_conn(struct fuse_conn *fc)
 
 	if (!fuse_ctl_add_dentry(parent, fc, "waiting", S_IFREG | 0400, 1,
 				 NULL, &fuse_ctl_waiting_ops) ||
+	    !fuse_ctl_add_dentry(parent, fc, "queue", S_IFREG | 0400, 1,
+				 NULL, &fuse_ctl_queue_ops) ||
 	    !fuse_ctl_add_dentry(parent, fc, "abort", S_IFREG | 0200, 1,
 				 NULL, &fuse_ctl_abort_ops) ||
 	    !fuse_ctl_add_dentry(parent, fc, "max_background", S_IFREG | 0600,
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 0920c0c032a0..7efef59caaa9 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -41,7 +41,7 @@
 #define FUSE_NAME_MAX 1024
 
 /** Number of dentries for each connection in the control filesystem */
-#define FUSE_CTL_NUM_DENTRIES 5
+#define FUSE_CTL_NUM_DENTRIES 6
 
 /** Number of page pointers embedded in fuse_req */
 #define FUSE_REQ_INLINE_PAGES 1