All of lore.kernel.org
 help / color / mirror / Atom feed
* [RESEND, PATCH v2] fuse: Don't drop NOTIFY_REPLY if we promised it
@ 2019-02-19  9:42 ` Kirill Smelkov
  0 siblings, 0 replies; 15+ messages in thread
From: Kirill Smelkov @ 2019-02-19  9:42 UTC (permalink / raw)
  To: Miklos Szeredi, Miklos Szeredi
  Cc: linux-fsdevel, fuse-devel, Kirill Smelkov, Han-Wen Nienhuys,
	Jakob Unterwurzacher, stable

A successful call to NOTIFY_RETRIEVE by filesystem carries promise from
the kernel to send back NOTIFY_REPLY message. However if the filesystem
is not reading requests with fuse_conn->max_pages capacity,
fuse_dev_do_read might see that the "request is too large" and decide to
"reply with an error and restart the read". "Reply with an error" has
underlying assumption that there is a "requester thread" that is waiting
for request completion, which is true for most requests, but is not true
for NOTIFY_REPLY: NOTIFY_RETRIEVE handler completes with OK status right
after it could successfully queue NOTIFY_REPLY message without waiting
for NOTIFY_REPLY completion. This leads to situation when filesystem
requested to retrieve inode data with NOTIFY_RETRIEVE, got err=OK for
that notification request, but NOTIFY_REPLY is not coming back.

More, since there is no "requester thread" to handle the error, the
situation shows itself as /sys/fs/fuse/connections/X/waiting=1 _and_
/dev/fuse read(s) queued. Which is misleading since NOTIFY_REPLY request
was removed from pending queue and abandoned.

One way to fix would be to change NOTIFY_RETRIEVE handler to wait until
queued NOTIFY_REPLY is actually read back to the server and only then
return NOTIFY_RETRIEVE status. However this is change in behaviour and
would require filesystems to have at least 2 threads. In particular a
single-threaded filesystem that was previously successfully using
NOTIFY_RETRIEVE would become stuck after the change. This way of fixing
is thus not acceptable.

However we can fix it another way - by always returning NOTIFY_REPLY
irregardless of its original size - with so much data as provided read
buffer could fit. This aligns with the way NOTIFY_RETRIEVE handler
works, which already unconditionally caps requested retrieve size to
fuse_conn->max_pages. This way it should not hurt NOTIFY_RETRIEVE
semantic if we return less data than was originally requested.

This fix requires another behaviour change however - to be sure that
read buffer has enough capacity to always fit fixed NOTIFY_REPLY part
plus at least some (0 or more) data, we have to precheck the buffer
before dequeuing and handling a request. And if the buffer is very small -
return EINVAL to read in filesystem with semantic that queued read was
invalid from the viewpoint of FUSE protocol. Even though this is also
behaviour change, this should not practically cause problems: 1d3d752b47
(fuse: clean up request size limit checking), which originally removed
such EINVAL return and reworked fuse_dev_do_read to loop and retry, also
added FUSE_MIN_READ_BUFFER=8K to user-visible fuse.h with comment that
"The read buffer is required to be at least 8k ..." Even though
FUSE_MIN_READ_BUFFER is not currently checked anywhere in the kernel,
libfuse always initializes session with bufsize=32·pages and, since its
beginning, (at least from 2005) issues a warning should user modify
fuse_session->bufsize directly to be sure that queued buffers are at
least as large as that sane minimum:

	https://github.com/libfuse/libfuse/blob/fuse-3.3.0-22-g63d53ecc3a/lib/fuse_lowlevel.c#L2869
	https://github.com/libfuse/libfuse/blob/fuse-3.3.0-22-g63d53ecc3a/lib/fuse_lowlevel.c#L1947
	(semantic added in https://github.com/libfuse/libfuse/commit/044da2e9e0)

This way we should be safe to add the check for minimum read buffer size.

I've hit this bug for real with my filesystem that is using
https://github.com/hanwen/go-fuse: there was no NOTIFY_REPLY after
successful NOTIFY_RETRIEVE and the filesystem was stuck waiting,
because FUSE protocol (definition scattered through many places) states
that NOTIFY_REPLY is guaranteed to come after successful NOTIFY_RETRIEVE
(see 2d45ba381a "fuse: add retrieve request"). After inspecting
/sys/fs/fuse/connections/X/waiting and seeing it was 1, I was initially
suspecting that it was user-space who is not issuing /dev/fuse reads and
NOTIFY_REPLY is there but stuck in kernel pending queue. However tracing
what is going on in /dev/fuse exchange and in both kernel and userspace
(see https://lab.nexedi.com/kirr/wendelin.core/blob/13d2d1f8/wcfs/fusetrace)
showed that there are correctly queued /dev/fuse reads still pending
after NOTIFY_RETRIEVE returns and it is the kernel who is not replying back:

	...

	P2 2.215710 /dev/fuse <- qread      wcfs/11399_4_r:

	        syscall.Syscall+48
	        syscall.Read+73
	        github.com/hanwen/go-fuse/fuse.(*Server).readRequest.func1+85
	        github.com/hanwen/go-fuse/fuse.handleEINTR+39
	        github.com/hanwen/go-fuse/fuse.(*Server).readRequest+355
	        github.com/hanwen/go-fuse/fuse.(*Server).loop+107
	        runtime.goexit+1

	P2 2.215810 /dev/fuse -> read       wcfs/11399_4_r:
	        .56  RELEASE i8 ...             (ret=64)

	P2 2.215859 /dev/fuse <- write      wcfs/11399_5_w:
	        .56 (0) ...

	        syscall.Syscall+48
	        syscall.Write+73
	        github.com/hanwen/go-fuse/fuse.(*Server).systemWrite.func1+76
	        github.com/hanwen/go-fuse/fuse.handleEINTR+39
	        github.com/hanwen/go-fuse/fuse.(*Server).systemWrite+931
	        github.com/hanwen/go-fuse/fuse.(*Server).write+194
	        github.com/hanwen/go-fuse/fuse.(*Server).handleRequest+179
	        github.com/hanwen/go-fuse/fuse.(*Server).loop+399
	        runtime.goexit+1

	P2 2.215871 /dev/fuse -> write_ack  wcfs/11399_5_w (ret=16)

	P2 2.215876 /dev/fuse <- qread      wcfs/11399_5_r:    <-- NOTE

	        syscall.Syscall+48
	        syscall.Read+73
	        github.com/hanwen/go-fuse/fuse.(*Server).readRequest.func1+85
	        github.com/hanwen/go-fuse/fuse.handleEINTR+39
	        github.com/hanwen/go-fuse/fuse.(*Server).readRequest+355
	        github.com/hanwen/go-fuse/fuse.(*Server).loop+107
	        runtime.goexit+1

	P0 2.221527 /dev/fuse <- qread      wcfs/11401_1_r:    <-- NOTE

	        syscall.Syscall+48
	        syscall.Read+73
	        github.com/hanwen/go-fuse/fuse.(*Server).readRequest.func1+85
	        github.com/hanwen/go-fuse/fuse.handleEINTR+39
	        github.com/hanwen/go-fuse/fuse.(*Server).readRequest+355
	        github.com/hanwen/go-fuse/fuse.(*Server).loop+107
	        runtime.goexit+1

	P1 2.239384 /dev/fuse -> read       wcfs/11398_6_r:	# woken read that was queued before "..."
	        .57  READ i5 ...                (ret=80)

	P0 2.239626 /dev/fuse <- write      wcfs/11397_0_w:
	        NOTIFY_RETRIEVE ...

	        syscall.Syscall+48
	        syscall.Write+73
	        github.com/hanwen/go-fuse/fuse.(*Server).systemWrite.func1+76
	        github.com/hanwen/go-fuse/fuse.handleEINTR+39
	        github.com/hanwen/go-fuse/fuse.(*Server).systemWrite+931
	        github.com/hanwen/go-fuse/fuse.(*Server).write+194
	        github.com/hanwen/go-fuse/fuse.(*Server).InodeRetrieveCache+764
	        github.com/hanwen/go-fuse/fuse/nodefs.(*FileSystemConnector).FileRetrieveCache+157
	        main.(*BigFile).invalidateBlk+232
	        main.(*Root).zδhandle1.func1+72
	        golang.org/x/sync/errgroup.(*Group).Go.func1+87
	        runtime.goexit+1

	P0 2.239660 /dev/fuse -> write_ack  wcfs/11397_0_w (ret=48)

	# stuck
	# (full trace: https://lab.nexedi.com/kirr/wendelin.core/commit/96416aaabd)

with queued / served read analysis confirming that two reads were indeed queued
and not served:

	grep -w -e '<- qread\>' y.log |awk {'print $6'} |sort >qread.txt
	grep -w -e '-> read\>'  y.log |awk {'print $6'} |sort >read.txt

	# xdiff qread.txt read.txt
	diff --git a/qread.txt b/read.txt
	index 4ab50d7..fdd2be1 100644
	--- a/qread.txt
	+++ b/read.txt
	@@ -53,7 +53,5 @@ wcfs/11399_1_r:
	 wcfs/11399_2_r:
	 wcfs/11399_3_r:
	 wcfs/11399_4_r:
	-wcfs/11399_5_r:
	 wcfs/11400_0_r:
	 wcfs/11401_0_r:
	-wcfs/11401_1_r:

The bug was hit because go-fuse by default uses 64K for read buffer size

	https://github.com/hanwen/go-fuse/blob/33711add/fuse/server.go#L142

and the kernel presets fuse_conn->max_pages to be 128K (= 32·4K pages).

Go-fuse will be likely fixed to both use bufsize=kernel's and to
correctly handle size > bufsize in InodeRetrieveCache. However we should
also fix the kernel to always deliver NOTIFY_REPLY once NOTIFY_RETRIEVE
was successful, so that FUSE protocol guarantee always holds
irregardless of whether userspace used default or other valid buffer
size setting, and so that filesystems can count not to get stuck waiting
for kernel who promised a reply.

This way this patch is here.

Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
Cc: Han-Wen Nienhuys <hanwen@google.com>
Cc: Jakob Unterwurzacher <jakobunt@gmail.com>
Cc: <stable@vger.kernel.org> # v2.6.36+
---

 First patch version was sent 1 week ago, but got no response:
 https://marc.info/?l=linux-fsdevel&m=155000277921155&w=2

 Changes since v1: don't forget to also update req->misc.retrieve_in.size
 after truncation.

 ( This is my first patch to fs/fuse, so please forgive me if I missed anything. )

 fs/fuse/dev.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 65 insertions(+), 6 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 8a63e52785e9..93deb8e54d88 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -381,6 +381,40 @@ static void queue_request(struct fuse_iqueue *fiq, struct fuse_req *req)
 	kill_fasync(&fiq->fasync, SIGIO, POLL_IN);
 }

+/*
+ * fuse_req_truncate_data truncates data in request that has paged data
+ * (req.in.argpages=1), so that whole request, when serialized, is <= nbytes.
+ *
+ * nbytes must be >= size(request without data).
+ */
+static void fuse_req_truncate_data(struct fuse_req *req, unsigned nbytes) {
+	unsigned size, n;
+
+	BUG_ON(!req->in.argpages);
+	BUG_ON(req->in.numargs < 1);
+
+	/* request size without data */
+	size = sizeof(struct fuse_in_header) +
+		len_args(req->in.numargs - 1, (struct fuse_arg *) req->in.args);
+	BUG_ON(nbytes < size);
+
+	/* truncate paged data */
+	for (n = 0; n < req->num_pages; n++) {
+		struct fuse_page_desc *p = &req->page_descs[n];
+
+		if (size >= nbytes) {
+			p->length = 0;
+		} else {
+			p->length = min_t(unsigned, p->length, nbytes - size);
+		}
+
+		size += p->length;
+	}
+
+	/* update whole request length in the header */
+	req->in.h.len = size;
+}
+
 void fuse_queue_forget(struct fuse_conn *fc, struct fuse_forget_link *forget,
 		       u64 nodeid, u64 nlookup)
 {
@@ -1317,6 +1351,15 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
 	unsigned reqsize;
 	unsigned int hash;

+	/*
+	 * Require sane minimum read buffer - that has capacity for fixed part
+	 * of any request + some room for data. If the requirement is not
+	 * satisfied return EINVAL to the filesystem without dequeueing /
+	 * aborting any request.
+	 */
+	if (nbytes < FUSE_MIN_READ_BUFFER)
+		return -EINVAL;
+
  restart:
 	spin_lock(&fiq->waitq.lock);
 	err = -EAGAIN;
@@ -1358,12 +1401,28 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,

 	/* If request is too large, reply with an error and restart the read */
 	if (nbytes < reqsize) {
-		req->out.h.error = -EIO;
-		/* SETXATTR is special, since it may contain too large data */
-		if (in->h.opcode == FUSE_SETXATTR)
-			req->out.h.error = -E2BIG;
-		request_end(fc, req);
-		goto restart;
+		switch (in->h.opcode) {
+		default:
+			req->out.h.error = -EIO;
+			/* SETXATTR is special, since it may contain too large data */
+			if (in->h.opcode == FUSE_SETXATTR)
+				req->out.h.error = -E2BIG;
+			request_end(fc, req);
+			goto restart;
+
+		/*
+		 * NOTIFY_REPLY is special: if it was queued we already
+		 * promised to filesystem to deliver it when handling
+		 * NOTIFY_RETRIVE. We know that read buffer has capacity for at
+		 * least some data. Truncate retrieved data to read buffer size
+		 * and deliver it to stay to the promise.
+		 */
+		case FUSE_NOTIFY_REPLY:
+			fuse_req_truncate_data(req, nbytes);
+			req->misc.retrieve_in.size -= reqsize - in->h.len;
+			reqsize = in->h.len;
+		}
+
 	}
 	spin_lock(&fpq->lock);
 	list_add(&req->list, &fpq->io);
--
2.21.0.rc0.269.g1a574e7a28


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RESEND, PATCH v2] fuse: Don't drop NOTIFY_REPLY if we promised it
@ 2019-02-19  9:42 ` Kirill Smelkov
  0 siblings, 0 replies; 15+ messages in thread
From: Kirill Smelkov @ 2019-02-19  9:42 UTC (permalink / raw)
  To: Miklos Szeredi, Miklos Szeredi
  Cc: linux-fsdevel, fuse-devel, Kirill Smelkov, Han-Wen Nienhuys,
	Jakob Unterwurzacher, stable

A successful call to NOTIFY_RETRIEVE by filesystem carries promise from
the kernel to send back NOTIFY_REPLY message. However if the filesystem
is not reading requests with fuse_conn->max_pages capacity,
fuse_dev_do_read might see that the "request is too large" and decide to
"reply with an error and restart the read". "Reply with an error" has
underlying assumption that there is a "requester thread" that is waiting
for request completion, which is true for most requests, but is not true
for NOTIFY_REPLY: NOTIFY_RETRIEVE handler completes with OK status right
after it could successfully queue NOTIFY_REPLY message without waiting
for NOTIFY_REPLY completion. This leads to situation when filesystem
requested to retrieve inode data with NOTIFY_RETRIEVE, got err=OK for
that notification request, but NOTIFY_REPLY is not coming back.

More, since there is no "requester thread" to handle the error, the
situation shows itself as /sys/fs/fuse/connections/X/waiting=1 _and_
/dev/fuse read(s) queued. Which is misleading since NOTIFY_REPLY request
was removed from pending queue and abandoned.

One way to fix would be to change NOTIFY_RETRIEVE handler to wait until
queued NOTIFY_REPLY is actually read back to the server and only then
return NOTIFY_RETRIEVE status. However this is change in behaviour and
would require filesystems to have at least 2 threads. In particular a
single-threaded filesystem that was previously successfully using
NOTIFY_RETRIEVE would become stuck after the change. This way of fixing
is thus not acceptable.

However we can fix it another way - by always returning NOTIFY_REPLY
irregardless of its original size - with so much data as provided read
buffer could fit. This aligns with the way NOTIFY_RETRIEVE handler
works, which already unconditionally caps requested retrieve size to
fuse_conn->max_pages. This way it should not hurt NOTIFY_RETRIEVE
semantic if we return less data than was originally requested.

This fix requires another behaviour change however - to be sure that
read buffer has enough capacity to always fit fixed NOTIFY_REPLY part
plus at least some (0 or more) data, we have to precheck the buffer
before dequeuing and handling a request. And if the buffer is very small -
return EINVAL to read in filesystem with semantic that queued read was
invalid from the viewpoint of FUSE protocol. Even though this is also
behaviour change, this should not practically cause problems: 1d3d752b47
(fuse: clean up request size limit checking), which originally removed
such EINVAL return and reworked fuse_dev_do_read to loop and retry, also
added FUSE_MIN_READ_BUFFER=8K to user-visible fuse.h with comment that
"The read buffer is required to be at least 8k ..." Even though
FUSE_MIN_READ_BUFFER is not currently checked anywhere in the kernel,
libfuse always initializes session with bufsize=32·pages and, since its
beginning, (at least from 2005) issues a warning should user modify
fuse_session->bufsize directly to be sure that queued buffers are at
least as large as that sane minimum:

	https://github.com/libfuse/libfuse/blob/fuse-3.3.0-22-g63d53ecc3a/lib/fuse_lowlevel.c#L2869
	https://github.com/libfuse/libfuse/blob/fuse-3.3.0-22-g63d53ecc3a/lib/fuse_lowlevel.c#L1947
	(semantic added in https://github.com/libfuse/libfuse/commit/044da2e9e0)

This way we should be safe to add the check for minimum read buffer size.

I've hit this bug for real with my filesystem that is using
https://github.com/hanwen/go-fuse: there was no NOTIFY_REPLY after
successful NOTIFY_RETRIEVE and the filesystem was stuck waiting,
because FUSE protocol (definition scattered through many places) states
that NOTIFY_REPLY is guaranteed to come after successful NOTIFY_RETRIEVE
(see 2d45ba381a "fuse: add retrieve request"). After inspecting
/sys/fs/fuse/connections/X/waiting and seeing it was 1, I was initially
suspecting that it was user-space who is not issuing /dev/fuse reads and
NOTIFY_REPLY is there but stuck in kernel pending queue. However tracing
what is going on in /dev/fuse exchange and in both kernel and userspace
(see https://lab.nexedi.com/kirr/wendelin.core/blob/13d2d1f8/wcfs/fusetrace)
showed that there are correctly queued /dev/fuse reads still pending
after NOTIFY_RETRIEVE returns and it is the kernel who is not replying back:

	...

	P2 2.215710 /dev/fuse <- qread      wcfs/11399_4_r:

	        syscall.Syscall+48
	        syscall.Read+73
	        github.com/hanwen/go-fuse/fuse.(*Server).readRequest.func1+85
	        github.com/hanwen/go-fuse/fuse.handleEINTR+39
	        github.com/hanwen/go-fuse/fuse.(*Server).readRequest+355
	        github.com/hanwen/go-fuse/fuse.(*Server).loop+107
	        runtime.goexit+1

	P2 2.215810 /dev/fuse -> read       wcfs/11399_4_r:
	        .56  RELEASE i8 ...             (ret=64)

	P2 2.215859 /dev/fuse <- write      wcfs/11399_5_w:
	        .56 (0) ...

	        syscall.Syscall+48
	        syscall.Write+73
	        github.com/hanwen/go-fuse/fuse.(*Server).systemWrite.func1+76
	        github.com/hanwen/go-fuse/fuse.handleEINTR+39
	        github.com/hanwen/go-fuse/fuse.(*Server).systemWrite+931
	        github.com/hanwen/go-fuse/fuse.(*Server).write+194
	        github.com/hanwen/go-fuse/fuse.(*Server).handleRequest+179
	        github.com/hanwen/go-fuse/fuse.(*Server).loop+399
	        runtime.goexit+1

	P2 2.215871 /dev/fuse -> write_ack  wcfs/11399_5_w (ret=16)

	P2 2.215876 /dev/fuse <- qread      wcfs/11399_5_r:    <-- NOTE

	        syscall.Syscall+48
	        syscall.Read+73
	        github.com/hanwen/go-fuse/fuse.(*Server).readRequest.func1+85
	        github.com/hanwen/go-fuse/fuse.handleEINTR+39
	        github.com/hanwen/go-fuse/fuse.(*Server).readRequest+355
	        github.com/hanwen/go-fuse/fuse.(*Server).loop+107
	        runtime.goexit+1

	P0 2.221527 /dev/fuse <- qread      wcfs/11401_1_r:    <-- NOTE

	        syscall.Syscall+48
	        syscall.Read+73
	        github.com/hanwen/go-fuse/fuse.(*Server).readRequest.func1+85
	        github.com/hanwen/go-fuse/fuse.handleEINTR+39
	        github.com/hanwen/go-fuse/fuse.(*Server).readRequest+355
	        github.com/hanwen/go-fuse/fuse.(*Server).loop+107
	        runtime.goexit+1

	P1 2.239384 /dev/fuse -> read       wcfs/11398_6_r:	# woken read that was queued before "..."
	        .57  READ i5 ...                (ret=80)

	P0 2.239626 /dev/fuse <- write      wcfs/11397_0_w:
	        NOTIFY_RETRIEVE ...

	        syscall.Syscall+48
	        syscall.Write+73
	        github.com/hanwen/go-fuse/fuse.(*Server).systemWrite.func1+76
	        github.com/hanwen/go-fuse/fuse.handleEINTR+39
	        github.com/hanwen/go-fuse/fuse.(*Server).systemWrite+931
	        github.com/hanwen/go-fuse/fuse.(*Server).write+194
	        github.com/hanwen/go-fuse/fuse.(*Server).InodeRetrieveCache+764
	        github.com/hanwen/go-fuse/fuse/nodefs.(*FileSystemConnector).FileRetrieveCache+157
	        main.(*BigFile).invalidateBlk+232
	        main.(*Root).zδhandle1.func1+72
	        golang.org/x/sync/errgroup.(*Group).Go.func1+87
	        runtime.goexit+1

	P0 2.239660 /dev/fuse -> write_ack  wcfs/11397_0_w (ret=48)

	# stuck
	# (full trace: https://lab.nexedi.com/kirr/wendelin.core/commit/96416aaabd)

with queued / served read analysis confirming that two reads were indeed queued
and not served:

	grep -w -e '<- qread\>' y.log |awk {'print $6'} |sort >qread.txt
	grep -w -e '-> read\>'  y.log |awk {'print $6'} |sort >read.txt

	# xdiff qread.txt read.txt
	diff --git a/qread.txt b/read.txt
	index 4ab50d7..fdd2be1 100644
	--- a/qread.txt
	+++ b/read.txt
	@@ -53,7 +53,5 @@ wcfs/11399_1_r:
	 wcfs/11399_2_r:
	 wcfs/11399_3_r:
	 wcfs/11399_4_r:
	-wcfs/11399_5_r:
	 wcfs/11400_0_r:
	 wcfs/11401_0_r:
	-wcfs/11401_1_r:

The bug was hit because go-fuse by default uses 64K for read buffer size

	https://github.com/hanwen/go-fuse/blob/33711add/fuse/server.go#L142

and the kernel presets fuse_conn->max_pages to be 128K (= 32·4K pages).

Go-fuse will be likely fixed to both use bufsize=kernel's and to
correctly handle size > bufsize in InodeRetrieveCache. However we should
also fix the kernel to always deliver NOTIFY_REPLY once NOTIFY_RETRIEVE
was successful, so that FUSE protocol guarantee always holds
irregardless of whether userspace used default or other valid buffer
size setting, and so that filesystems can count not to get stuck waiting
for kernel who promised a reply.

This way this patch is here.

Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
Cc: Han-Wen Nienhuys <hanwen@google.com>
Cc: Jakob Unterwurzacher <jakobunt@gmail.com>
Cc: <stable@vger.kernel.org> # v2.6.36+
---

 First patch version was sent 1 week ago, but got no response:
 https://marc.info/?l=linux-fsdevel&m=155000277921155&w=2

 Changes since v1: don't forget to also update req->misc.retrieve_in.size
 after truncation.

 ( This is my first patch to fs/fuse, so please forgive me if I missed anything. )

 fs/fuse/dev.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 65 insertions(+), 6 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 8a63e52785e9..93deb8e54d88 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -381,6 +381,40 @@ static void queue_request(struct fuse_iqueue *fiq, struct fuse_req *req)
 	kill_fasync(&fiq->fasync, SIGIO, POLL_IN);
 }

+/*
+ * fuse_req_truncate_data truncates data in request that has paged data
+ * (req.in.argpages=1), so that whole request, when serialized, is <= nbytes.
+ *
+ * nbytes must be >= size(request without data).
+ */
+static void fuse_req_truncate_data(struct fuse_req *req, unsigned nbytes) {
+	unsigned size, n;
+
+	BUG_ON(!req->in.argpages);
+	BUG_ON(req->in.numargs < 1);
+
+	/* request size without data */
+	size = sizeof(struct fuse_in_header) +
+		len_args(req->in.numargs - 1, (struct fuse_arg *) req->in.args);
+	BUG_ON(nbytes < size);
+
+	/* truncate paged data */
+	for (n = 0; n < req->num_pages; n++) {
+		struct fuse_page_desc *p = &req->page_descs[n];
+
+		if (size >= nbytes) {
+			p->length = 0;
+		} else {
+			p->length = min_t(unsigned, p->length, nbytes - size);
+		}
+
+		size += p->length;
+	}
+
+	/* update whole request length in the header */
+	req->in.h.len = size;
+}
+
 void fuse_queue_forget(struct fuse_conn *fc, struct fuse_forget_link *forget,
 		       u64 nodeid, u64 nlookup)
 {
@@ -1317,6 +1351,15 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
 	unsigned reqsize;
 	unsigned int hash;

+	/*
+	 * Require sane minimum read buffer - that has capacity for fixed part
+	 * of any request + some room for data. If the requirement is not
+	 * satisfied return EINVAL to the filesystem without dequeueing /
+	 * aborting any request.
+	 */
+	if (nbytes < FUSE_MIN_READ_BUFFER)
+		return -EINVAL;
+
  restart:
 	spin_lock(&fiq->waitq.lock);
 	err = -EAGAIN;
@@ -1358,12 +1401,28 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,

 	/* If request is too large, reply with an error and restart the read */
 	if (nbytes < reqsize) {
-		req->out.h.error = -EIO;
-		/* SETXATTR is special, since it may contain too large data */
-		if (in->h.opcode == FUSE_SETXATTR)
-			req->out.h.error = -E2BIG;
-		request_end(fc, req);
-		goto restart;
+		switch (in->h.opcode) {
+		default:
+			req->out.h.error = -EIO;
+			/* SETXATTR is special, since it may contain too large data */
+			if (in->h.opcode == FUSE_SETXATTR)
+				req->out.h.error = -E2BIG;
+			request_end(fc, req);
+			goto restart;
+
+		/*
+		 * NOTIFY_REPLY is special: if it was queued we already
+		 * promised to filesystem to deliver it when handling
+		 * NOTIFY_RETRIVE. We know that read buffer has capacity for at
+		 * least some data. Truncate retrieved data to read buffer size
+		 * and deliver it to stay to the promise.
+		 */
+		case FUSE_NOTIFY_REPLY:
+			fuse_req_truncate_data(req, nbytes);
+			req->misc.retrieve_in.size -= reqsize - in->h.len;
+			reqsize = in->h.len;
+		}
+
 	}
 	spin_lock(&fpq->lock);
 	list_add(&req->list, &fpq->io);
--
2.21.0.rc0.269.g1a574e7a28


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [RESEND, PATCH v2] fuse: Don't drop NOTIFY_REPLY if we promised it
  2019-02-19  9:42 ` Kirill Smelkov
  (?)
@ 2019-02-26 15:14 ` Miklos Szeredi
  2019-02-27 20:02     ` Kirill Smelkov
  -1 siblings, 1 reply; 15+ messages in thread
From: Miklos Szeredi @ 2019-02-26 15:14 UTC (permalink / raw)
  To: Kirill Smelkov
  Cc: Miklos Szeredi, linux-fsdevel, fuse-devel, Han-Wen Nienhuys,
	Jakob Unterwurzacher, stable

On Tue, Feb 19, 2019 at 10:42 AM Kirill Smelkov <kirr@nexedi.com> wrote:
>
> A successful call to NOTIFY_RETRIEVE by filesystem carries promise from
> the kernel to send back NOTIFY_REPLY message. However if the filesystem
> is not reading requests with fuse_conn->max_pages capacity,

That's a violation of the contract by the fuse server, not the kernel.

> fuse_dev_do_read might see that the "request is too large" and decide to
> "reply with an error and restart the read". "Reply with an error" has
> underlying assumption that there is a "requester thread" that is waiting
> for request completion, which is true for most requests, but is not true
> for NOTIFY_REPLY: NOTIFY_RETRIEVE handler completes with OK status right
> after it could successfully queue NOTIFY_REPLY message without waiting
> for NOTIFY_REPLY completion. This leads to situation when filesystem
> requested to retrieve inode data with NOTIFY_RETRIEVE, got err=OK for
> that notification request, but NOTIFY_REPLY is not coming back.
>
> More, since there is no "requester thread" to handle the error, the
> situation shows itself as /sys/fs/fuse/connections/X/waiting=1 _and_
> /dev/fuse read(s) queued. Which is misleading since NOTIFY_REPLY request
> was removed from pending queue and abandoned.

Now I don't understand how that would happen.   If the request is
abandoned, its refcount should go down to zero and the num_waiting
count decremented accordingly.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RESEND, PATCH v2] fuse: Don't drop NOTIFY_REPLY if we promised it
  2019-02-26 15:14 ` Miklos Szeredi
@ 2019-02-27 20:02     ` Kirill Smelkov
  0 siblings, 0 replies; 15+ messages in thread
From: Kirill Smelkov @ 2019-02-27 20:02 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Miklos Szeredi, linux-fsdevel, fuse-devel, Han-Wen Nienhuys,
	Jakob Unterwurzacher, stable

Miklos, first of all thanks for feedback.

On Tue, Feb 26, 2019 at 04:14:22PM +0100, Miklos Szeredi wrote:
> On Tue, Feb 19, 2019 at 10:42 AM Kirill Smelkov <kirr@nexedi.com> wrote:
> >
> > A successful call to NOTIFY_RETRIEVE by filesystem carries promise from
> > the kernel to send back NOTIFY_REPLY message. However if the filesystem
> > is not reading requests with fuse_conn->max_pages capacity,
> 
> That's a violation of the contract by the fuse server, not the kernel.

Do you mean that even if filesystem server configures via
init_out.max_write that it is accepting e.g. only 32K max writes, it
still has to be issuing sys_read with buffer of 128K (= hardcoded
fuse_conn->max_pages before Linux 4.20, and default since Linux 4.20)?

Also, I could not find any FUSE contract being specified anywhere, so I
used message of the commit that added support for NOTIFY_RETRIEVE to sense its
semantic:

	commit 2d45ba381a74a743eeaa2b06c7c5c0d2bf73ba1a
	Author: Miklos Szeredi <mszeredi@suse.cz>
	Date:   Mon Jul 12 14:41:40 2010 +0200
	
	    fuse: add retrieve request
	    
	    Userspace filesystem can request data to be retrieved from the inode's
	    mapping.  This request is synchronous and the retrieved data is queued
	    as a new request.  If the write to the fuse device returns an error
	    then the retrieve request was not completed and a reply will not be
	    sent.
	    
	    Only present pages are returned in the retrieve reply.  Retrieving
	    stops when it finds a non-present page and only data prior to that is
	    returned.
	    
	    This request doesn't change the dirty state of pages.
	    
	    Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
	
which, even if not explicitly, gives the impression that if NOTIFY_RETRIEVE was
queued successfully, the reply will come.

Also: if it is a violation of contract by filesystem server, the kernel should
return ESOMETHING for violating sys_read, instead of making that read to
be waiting indefinitely, isn't it?

In summary: instead of getting clients stuck silently I still suggest
for NOTIFY_RETRIEVE to come to client, if it can come. And also to
return EINVAL for /dev/fuse sys_read calls that are violating the
server/kernel contract.


> > fuse_dev_do_read might see that the "request is too large" and decide to
> > "reply with an error and restart the read". "Reply with an error" has
> > underlying assumption that there is a "requester thread" that is waiting
> > for request completion, which is true for most requests, but is not true
> > for NOTIFY_REPLY: NOTIFY_RETRIEVE handler completes with OK status right
> > after it could successfully queue NOTIFY_REPLY message without waiting
> > for NOTIFY_REPLY completion. This leads to situation when filesystem
> > requested to retrieve inode data with NOTIFY_RETRIEVE, got err=OK for
> > that notification request, but NOTIFY_REPLY is not coming back.
> >
> > More, since there is no "requester thread" to handle the error, the
> > situation shows itself as /sys/fs/fuse/connections/X/waiting=1 _and_
> > /dev/fuse read(s) queued. Which is misleading since NOTIFY_REPLY request
> > was removed from pending queue and abandoned.
> 
> Now I don't understand how that would happen.   If the request is
> abandoned, its refcount should go down to zero and the num_waiting
> count decremented accordingly.

You are right - it was my mistake. I misinterpreted waiting=1 as a
request not being transferred to filesystem server yet, but in my test
it turned out to be already transferred and sitting on client "processing" list
waiting for corresponding reply (which was not coming because the filesystem in
turn was stuck waiting for NOTIFY_REPLY to come from the kernel):

	root@(none):~/src/wendelin/wendelin.core/wcfs# cat /sys/fs/fuse/connections/25/waiting
	1
	
	root@(none):~/src/wendelin/wendelin.core/wcfs# cat /sys/fs/fuse/connections/25/queue
	#waiting: 1
	(0)Interrupt:
	(0)Forget:
	(0)Request:
	(0)Processing.IO:
	(1)Processing.P:
	#0: .52 R15 i5


So the part of commit message that discussed X/waiting=1 is wrong and should be
dropped - thanks for catching this.


Still the main point is what should be the semantic of NOTIFY_RETRIEVE
vs NOTIFY_REPLY vs INIT.max_write, and that it is better to always send
retrieve data if client promised it, and also to explicitly indicate
with an error if filesystem server is violating FUSE server/client
contract.

Thanks,
Kirill


P.S. I attach the draft patch for /sys/fs/fuse/connections/X/queue in
case someone is interested.

---- 8< ----

diff --git a/fs/fuse/control.c b/fs/fuse/control.c
index fe80bea4ad89..f4e22f5436e2 100644
--- a/fs/fuse/control.c
+++ b/fs/fuse/control.c
@@ -63,6 +63,160 @@ static ssize_t fuse_conn_waiting_read(struct file *file, char __user *buf,
 	return simple_read_from_buffer(buf, len, ppos, tmp, size);
 }
 
+/* fuse_conn_iqueue_print prints input queue into provided buf.
+ *
+ * buf can be NULL in which case only the length of would-be printed text is
+ * returned and nothing is actually printed.
+ *
+ * must be called with fc->iq->waitq locked. */
+static size_t fuse_conn_iqueue_print(char *buf, size_t size, struct fuse_conn *fc)
+{
+	struct fuse_iqueue *fiq = &fc->iq;
+	struct fuse_req *req;
+	struct fuse_forget_link *freq;
+	size_t nreq;
+
+	size_t __n, __total = 0;
+#define emitf(FORMAT, ...) do {	\
+	__n = snprintf(buf, size, FORMAT, __VA_ARGS__);	\
+	__total += __n;					\
+	if (buf) {					\
+		size -= __n;				\
+		buf  += __n;				\
+	}						\
+} while (0)
+
+
+	if (!buf)
+		size = 0;
+
+	// XXX temp
+	emitf("#waiting: %d\n", atomic_read(&fc->num_waiting));
+
+	/* interrupts */
+	nreq = 0;
+	list_for_each_entry(req, &fiq->interrupts, list) {
+		nreq++;
+	}
+
+	emitf("(%lu)Interrupt:\n", nreq);
+
+	nreq = 0;
+	list_for_each_entry(req, &fiq->interrupts, list) {
+		emitf("\t#%lu: R%d\n", nreq, req->in.h.opcode);
+		nreq++;
+	}
+
+	/* forgets */
+	nreq = 0;
+	for (freq = fiq->forget_list_head.next; freq != NULL; freq = freq->next) {
+		nreq++;
+	}
+
+	emitf("(%lu)Forget:\n", nreq);
+
+	nreq = 0;
+	for(freq = fiq->forget_list_head.next; freq != NULL; freq = freq->next) {
+		emitf("\t#%lu: FORGET i%llu -%llu\n", nreq,
+			freq->forget_one.nodeid, freq->forget_one.nlookup);
+		nreq++;
+	}
+
+	/* all other requests */
+	nreq = 0;
+	list_for_each_entry(req, &fiq->pending, list) {
+		nreq++;
+	}
+
+	emitf("(%lu)Request:\n", nreq);
+
+	nreq = 0;
+	list_for_each_entry(req, &fiq->pending, list) {
+		emitf("\t#%lu: R%d\n", nreq, req->in.h.opcode);
+		nreq++;
+	}
+
+	/* processing */
+	// XXX temp? XXX locking
+	{
+	int i;
+	struct fuse_dev *fud;
+	list_for_each_entry(fud, &fc->devices, entry) {
+		struct fuse_pqueue *fpq = &fud->pq;
+
+		nreq = 0;
+		list_for_each_entry(req, &fpq->io, list) {
+			nreq++;
+		}
+		emitf("(%lu)Processing.IO:\n", nreq);
+
+		// XXX print IO elements
+
+		nreq = 0;
+		for (i = 0; i < FUSE_PQ_HASH_SIZE; i++) {
+			list_for_each_entry(req, &fpq->processing[i], list) {
+				nreq++;
+			}
+		}
+		emitf("(%lu)Processing.P:\n", nreq);
+
+		nreq = 0;
+		for (i = 0; i < FUSE_PQ_HASH_SIZE; i++) {
+			list_for_each_entry(req, &fpq->processing[i], list) {
+				struct fuse_in_header *h = &req->in.h;
+				emitf("\t#%lu: .%lld R%d i%llu\n", nreq, h->unique, h->opcode, h->nodeid);
+				nreq++;
+			}
+		}
+	}
+	}
+
+	return __total;
+#undef emitf
+}
+
+static ssize_t fuse_conn_iqueue_read(struct file *file, char __user *buf,
+				      size_t len, loff_t *ppos)
+{
+	char *qdump;
+
+	if (!*ppos) {
+		struct fuse_conn *fc = fuse_ctl_file_conn_get(file);
+		struct fuse_iqueue *fiq;
+		size_t n;
+		char *qdump2;
+		if (!fc)
+			return 0;
+
+		fiq = &fc->iq;
+		spin_lock(&fiq->waitq.lock);
+		n = fuse_conn_iqueue_print(NULL, 0, fc);
+		n += 1; /* trailing 0 */
+
+		qdump = kmalloc(n, GFP_ATOMIC);
+		if (qdump) {
+			fuse_conn_iqueue_print(qdump, n, fc);
+		}
+
+		spin_unlock(&fiq->waitq.lock);
+		fuse_conn_put(fc);
+
+		if (!qdump) {
+			return -ENOMEM;
+		}
+
+		/* release atomic memory, since it is scarce resource */
+		qdump2 = kstrdup(qdump, GFP_KERNEL);
+		kfree(qdump);
+
+		file->private_data = (void *)qdump2;
+		// TODO release qdump on file release
+	}
+
+	qdump = (char *)file->private_data;
+	return simple_read_from_buffer(buf, len, ppos, qdump, strlen(qdump));
+}
+
 static ssize_t fuse_conn_limit_read(struct file *file, char __user *buf,
 				    size_t len, loff_t *ppos, unsigned val)
 {
@@ -202,6 +356,12 @@ static const struct file_operations fuse_ctl_waiting_ops = {
 	.llseek = no_llseek,
 };
 
+static const struct file_operations fuse_ctl_queue_ops = {
+	.open = nonseekable_open,
+	.read = fuse_conn_iqueue_read,
+	.llseek = no_llseek,
+};
+
 static const struct file_operations fuse_conn_max_background_ops = {
 	.open = nonseekable_open,
 	.read = fuse_conn_max_background_read,
@@ -278,6 +438,8 @@ int fuse_ctl_add_conn(struct fuse_conn *fc)
 
 	if (!fuse_ctl_add_dentry(parent, fc, "waiting", S_IFREG | 0400, 1,
 				 NULL, &fuse_ctl_waiting_ops) ||
+	    !fuse_ctl_add_dentry(parent, fc, "queue", S_IFREG | 0400, 1,
+				 NULL, &fuse_ctl_queue_ops) ||
 	    !fuse_ctl_add_dentry(parent, fc, "abort", S_IFREG | 0200, 1,
 				 NULL, &fuse_ctl_abort_ops) ||
 	    !fuse_ctl_add_dentry(parent, fc, "max_background", S_IFREG | 0600,
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 0920c0c032a0..7efef59caaa9 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -41,7 +41,7 @@
 #define FUSE_NAME_MAX 1024
 
 /** Number of dentries for each connection in the control filesystem */
-#define FUSE_CTL_NUM_DENTRIES 5
+#define FUSE_CTL_NUM_DENTRIES 6
 
 /** Number of page pointers embedded in fuse_req */
 #define FUSE_REQ_INLINE_PAGES 1

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [RESEND, PATCH v2] fuse: Don't drop NOTIFY_REPLY if we promised it
@ 2019-02-27 20:02     ` Kirill Smelkov
  0 siblings, 0 replies; 15+ messages in thread
From: Kirill Smelkov @ 2019-02-27 20:02 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Miklos Szeredi, linux-fsdevel, fuse-devel, Han-Wen Nienhuys,
	Jakob Unterwurzacher, stable

Miklos, first of all thanks for feedback.

On Tue, Feb 26, 2019 at 04:14:22PM +0100, Miklos Szeredi wrote:
> On Tue, Feb 19, 2019 at 10:42 AM Kirill Smelkov <kirr@nexedi.com> wrote:
> >
> > A successful call to NOTIFY_RETRIEVE by filesystem carries promise from
> > the kernel to send back NOTIFY_REPLY message. However if the filesystem
> > is not reading requests with fuse_conn->max_pages capacity,
> 
> That's a violation of the contract by the fuse server, not the kernel.

Do you mean that even if filesystem server configures via
init_out.max_write that it is accepting e.g. only 32K max writes, it
still has to be issuing sys_read with buffer of 128K (= hardcoded
fuse_conn->max_pages before Linux 4.20, and default since Linux 4.20)?

Also, I could not find any FUSE contract being specified anywhere, so I
used message of the commit that added support for NOTIFY_RETRIEVE to sense its
semantic:

	commit 2d45ba381a74a743eeaa2b06c7c5c0d2bf73ba1a
	Author: Miklos Szeredi <mszeredi@suse.cz>
	Date:   Mon Jul 12 14:41:40 2010 +0200
	
	    fuse: add retrieve request
	    
	    Userspace filesystem can request data to be retrieved from the inode's
	    mapping.  This request is synchronous and the retrieved data is queued
	    as a new request.  If the write to the fuse device returns an error
	    then the retrieve request was not completed and a reply will not be
	    sent.
	    
	    Only present pages are returned in the retrieve reply.  Retrieving
	    stops when it finds a non-present page and only data prior to that is
	    returned.
	    
	    This request doesn't change the dirty state of pages.
	    
	    Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
	
which, even if not explicitly, gives the impression that if NOTIFY_RETRIEVE was
queued successfully, the reply will come.

Also: if it is a violation of contract by filesystem server, the kernel should
return ESOMETHING for violating sys_read, instead of making that read to
be waiting indefinitely, isn't it?

In summary: instead of getting clients stuck silently I still suggest
for NOTIFY_RETRIEVE to come to client, if it can come. And also to
return EINVAL for /dev/fuse sys_read calls that are violating the
server/kernel contract.


> > fuse_dev_do_read might see that the "request is too large" and decide to
> > "reply with an error and restart the read". "Reply with an error" has
> > underlying assumption that there is a "requester thread" that is waiting
> > for request completion, which is true for most requests, but is not true
> > for NOTIFY_REPLY: NOTIFY_RETRIEVE handler completes with OK status right
> > after it could successfully queue NOTIFY_REPLY message without waiting
> > for NOTIFY_REPLY completion. This leads to situation when filesystem
> > requested to retrieve inode data with NOTIFY_RETRIEVE, got err=OK for
> > that notification request, but NOTIFY_REPLY is not coming back.
> >
> > More, since there is no "requester thread" to handle the error, the
> > situation shows itself as /sys/fs/fuse/connections/X/waiting=1 _and_
> > /dev/fuse read(s) queued. Which is misleading since NOTIFY_REPLY request
> > was removed from pending queue and abandoned.
> 
> Now I don't understand how that would happen.   If the request is
> abandoned, its refcount should go down to zero and the num_waiting
> count decremented accordingly.

You are right - it was my mistake. I misinterpreted waiting=1 as a
request not being transferred to filesystem server yet, but in my test
it turned out to be already transferred and sitting on client "processing" list
waiting for corresponding reply (which was not coming because the filesystem in
turn was stuck waiting for NOTIFY_REPLY to come from the kernel):

	root@(none):~/src/wendelin/wendelin.core/wcfs# cat /sys/fs/fuse/connections/25/waiting
	1
	
	root@(none):~/src/wendelin/wendelin.core/wcfs# cat /sys/fs/fuse/connections/25/queue
	#waiting: 1
	(0)Interrupt:
	(0)Forget:
	(0)Request:
	(0)Processing.IO:
	(1)Processing.P:
	#0: .52 R15 i5


So the part of commit message that discussed X/waiting=1 is wrong and should be
dropped - thanks for catching this.


Still the main point is what should be the semantic of NOTIFY_RETRIEVE
vs NOTIFY_REPLY vs INIT.max_write, and that it is better to always send
retrieve data if client promised it, and also to explicitly indicate
with an error if filesystem server is violating FUSE server/client
contract.

Thanks,
Kirill


P.S. I attach the draft patch for /sys/fs/fuse/connections/X/queue in
case someone is interested.

---- 8< ----

diff --git a/fs/fuse/control.c b/fs/fuse/control.c
index fe80bea4ad89..f4e22f5436e2 100644
--- a/fs/fuse/control.c
+++ b/fs/fuse/control.c
@@ -63,6 +63,160 @@ static ssize_t fuse_conn_waiting_read(struct file *file, char __user *buf,
 	return simple_read_from_buffer(buf, len, ppos, tmp, size);
 }
 
+/* fuse_conn_iqueue_print prints input queue into provided buf.
+ *
+ * buf can be NULL in which case only the length of would-be printed text is
+ * returned and nothing is actually printed.
+ *
+ * must be called with fc->iq->waitq locked. */
+static size_t fuse_conn_iqueue_print(char *buf, size_t size, struct fuse_conn *fc)
+{
+	struct fuse_iqueue *fiq = &fc->iq;
+	struct fuse_req *req;
+	struct fuse_forget_link *freq;
+	size_t nreq;
+
+	size_t __n, __total = 0;
+#define emitf(FORMAT, ...) do {	\
+	__n = snprintf(buf, size, FORMAT, __VA_ARGS__);	\
+	__total += __n;					\
+	if (buf) {					\
+		size -= __n;				\
+		buf  += __n;				\
+	}						\
+} while (0)
+
+
+	if (!buf)
+		size = 0;
+
+	// XXX temp
+	emitf("#waiting: %d\n", atomic_read(&fc->num_waiting));
+
+	/* interrupts */
+	nreq = 0;
+	list_for_each_entry(req, &fiq->interrupts, list) {
+		nreq++;
+	}
+
+	emitf("(%lu)Interrupt:\n", nreq);
+
+	nreq = 0;
+	list_for_each_entry(req, &fiq->interrupts, list) {
+		emitf("\t#%lu: R%d\n", nreq, req->in.h.opcode);
+		nreq++;
+	}
+
+	/* forgets */
+	nreq = 0;
+	for (freq = fiq->forget_list_head.next; freq != NULL; freq = freq->next) {
+		nreq++;
+	}
+
+	emitf("(%lu)Forget:\n", nreq);
+
+	nreq = 0;
+	for(freq = fiq->forget_list_head.next; freq != NULL; freq = freq->next) {
+		emitf("\t#%lu: FORGET i%llu -%llu\n", nreq,
+			freq->forget_one.nodeid, freq->forget_one.nlookup);
+		nreq++;
+	}
+
+	/* all other requests */
+	nreq = 0;
+	list_for_each_entry(req, &fiq->pending, list) {
+		nreq++;
+	}
+
+	emitf("(%lu)Request:\n", nreq);
+
+	nreq = 0;
+	list_for_each_entry(req, &fiq->pending, list) {
+		emitf("\t#%lu: R%d\n", nreq, req->in.h.opcode);
+		nreq++;
+	}
+
+	/* processing */
+	// XXX temp? XXX locking
+	{
+	int i;
+	struct fuse_dev *fud;
+	list_for_each_entry(fud, &fc->devices, entry) {
+		struct fuse_pqueue *fpq = &fud->pq;
+
+		nreq = 0;
+		list_for_each_entry(req, &fpq->io, list) {
+			nreq++;
+		}
+		emitf("(%lu)Processing.IO:\n", nreq);
+
+		// XXX print IO elements
+
+		nreq = 0;
+		for (i = 0; i < FUSE_PQ_HASH_SIZE; i++) {
+			list_for_each_entry(req, &fpq->processing[i], list) {
+				nreq++;
+			}
+		}
+		emitf("(%lu)Processing.P:\n", nreq);
+
+		nreq = 0;
+		for (i = 0; i < FUSE_PQ_HASH_SIZE; i++) {
+			list_for_each_entry(req, &fpq->processing[i], list) {
+				struct fuse_in_header *h = &req->in.h;
+				emitf("\t#%lu: .%lld R%d i%llu\n", nreq, h->unique, h->opcode, h->nodeid);
+				nreq++;
+			}
+		}
+	}
+	}
+
+	return __total;
+#undef emitf
+}
+
+static ssize_t fuse_conn_iqueue_read(struct file *file, char __user *buf,
+				      size_t len, loff_t *ppos)
+{
+	char *qdump;
+
+	if (!*ppos) {
+		struct fuse_conn *fc = fuse_ctl_file_conn_get(file);
+		struct fuse_iqueue *fiq;
+		size_t n;
+		char *qdump2;
+		if (!fc)
+			return 0;
+
+		fiq = &fc->iq;
+		spin_lock(&fiq->waitq.lock);
+		n = fuse_conn_iqueue_print(NULL, 0, fc);
+		n += 1; /* trailing 0 */
+
+		qdump = kmalloc(n, GFP_ATOMIC);
+		if (qdump) {
+			fuse_conn_iqueue_print(qdump, n, fc);
+		}
+
+		spin_unlock(&fiq->waitq.lock);
+		fuse_conn_put(fc);
+
+		if (!qdump) {
+			return -ENOMEM;
+		}
+
+		/* release atomic memory, since it is scarce resource */
+		qdump2 = kstrdup(qdump, GFP_KERNEL);
+		kfree(qdump);
+
+		file->private_data = (void *)qdump2;
+		// TODO release qdump on file release
+	}
+
+	qdump = (char *)file->private_data;
+	return simple_read_from_buffer(buf, len, ppos, qdump, strlen(qdump));
+}
+
 static ssize_t fuse_conn_limit_read(struct file *file, char __user *buf,
 				    size_t len, loff_t *ppos, unsigned val)
 {
@@ -202,6 +356,12 @@ static const struct file_operations fuse_ctl_waiting_ops = {
 	.llseek = no_llseek,
 };
 
+static const struct file_operations fuse_ctl_queue_ops = {
+	.open = nonseekable_open,
+	.read = fuse_conn_iqueue_read,
+	.llseek = no_llseek,
+};
+
 static const struct file_operations fuse_conn_max_background_ops = {
 	.open = nonseekable_open,
 	.read = fuse_conn_max_background_read,
@@ -278,6 +438,8 @@ int fuse_ctl_add_conn(struct fuse_conn *fc)
 
 	if (!fuse_ctl_add_dentry(parent, fc, "waiting", S_IFREG | 0400, 1,
 				 NULL, &fuse_ctl_waiting_ops) ||
+	    !fuse_ctl_add_dentry(parent, fc, "queue", S_IFREG | 0400, 1,
+				 NULL, &fuse_ctl_queue_ops) ||
 	    !fuse_ctl_add_dentry(parent, fc, "abort", S_IFREG | 0200, 1,
 				 NULL, &fuse_ctl_abort_ops) ||
 	    !fuse_ctl_add_dentry(parent, fc, "max_background", S_IFREG | 0600,
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 0920c0c032a0..7efef59caaa9 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -41,7 +41,7 @@
 #define FUSE_NAME_MAX 1024
 
 /** Number of dentries for each connection in the control filesystem */
-#define FUSE_CTL_NUM_DENTRIES 5
+#define FUSE_CTL_NUM_DENTRIES 6
 
 /** Number of page pointers embedded in fuse_req */
 #define FUSE_REQ_INLINE_PAGES 1

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [RESEND, PATCH v2] fuse: Don't drop NOTIFY_REPLY if we promised it
  2019-02-27 20:02     ` Kirill Smelkov
  (?)
@ 2019-02-27 20:26     ` Miklos Szeredi
  2019-02-27 20:39       ` Kirill Smelkov
  -1 siblings, 1 reply; 15+ messages in thread
From: Miklos Szeredi @ 2019-02-27 20:26 UTC (permalink / raw)
  To: Kirill Smelkov
  Cc: Miklos Szeredi, linux-fsdevel, fuse-devel, Han-Wen Nienhuys,
	Jakob Unterwurzacher, stable

On Wed, Feb 27, 2019 at 9:02 PM Kirill Smelkov <kirr@nexedi.com> wrote:
>
> Miklos, first of all thanks for feedback.
>
> On Tue, Feb 26, 2019 at 04:14:22PM +0100, Miklos Szeredi wrote:
> > On Tue, Feb 19, 2019 at 10:42 AM Kirill Smelkov <kirr@nexedi.com> wrote:
> > >
> > > A successful call to NOTIFY_RETRIEVE by filesystem carries promise from
> > > the kernel to send back NOTIFY_REPLY message. However if the filesystem
> > > is not reading requests with fuse_conn->max_pages capacity,
> >
> > That's a violation of the contract by the fuse server, not the kernel.
>
> Do you mean that even if filesystem server configures via
> init_out.max_write that it is accepting e.g. only 32K max writes, it
> still has to be issuing sys_read with buffer of 128K (= hardcoded
> fuse_conn->max_pages before Linux 4.20, and default since Linux 4.20)?

Filesystem is asking for a specific number of bytes to retrieve.  It
does not have to be less than max_writes, but it does have to fit into
the request buffer it is using.  If the filesystem is asking to
retrieve 64k and it is using a 32k request buffer, then that obviously
won't work.   Kernel could limit the retrieve length to max_writes,
that may make sense, but it doesn't fundamentally change the fact that
if the filesystem is not properly sizing the request buffer, it may
result in various forms of breakage.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RESEND, PATCH v2] fuse: Don't drop NOTIFY_REPLY if we promised it
  2019-02-27 20:26     ` Miklos Szeredi
@ 2019-02-27 20:39       ` Kirill Smelkov
  2019-02-28  8:10         ` Miklos Szeredi
  0 siblings, 1 reply; 15+ messages in thread
From: Kirill Smelkov @ 2019-02-27 20:39 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Miklos Szeredi, linux-fsdevel, fuse-devel, Han-Wen Nienhuys,
	Jakob Unterwurzacher, stable

On Wed, Feb 27, 2019 at 09:26:32PM +0100, Miklos Szeredi wrote:
> On Wed, Feb 27, 2019 at 9:02 PM Kirill Smelkov <kirr@nexedi.com> wrote:
> >
> > Miklos, first of all thanks for feedback.
> >
> > On Tue, Feb 26, 2019 at 04:14:22PM +0100, Miklos Szeredi wrote:
> > > On Tue, Feb 19, 2019 at 10:42 AM Kirill Smelkov <kirr@nexedi.com> wrote:
> > > >
> > > > A successful call to NOTIFY_RETRIEVE by filesystem carries promise from
> > > > the kernel to send back NOTIFY_REPLY message. However if the filesystem
> > > > is not reading requests with fuse_conn->max_pages capacity,
> > >
> > > That's a violation of the contract by the fuse server, not the kernel.
> >
> > Do you mean that even if filesystem server configures via
> > init_out.max_write that it is accepting e.g. only 32K max writes, it
> > still has to be issuing sys_read with buffer of 128K (= hardcoded
> > fuse_conn->max_pages before Linux 4.20, and default since Linux 4.20)?
> 
> Filesystem is asking for a specific number of bytes to retrieve.  It
> does not have to be less than max_writes, but it does have to fit into
> the request buffer it is using.  If the filesystem is asking to
> retrieve 64k and it is using a 32k request buffer, then that obviously
> won't work.   Kernel could limit the retrieve length to max_writes,
> that may make sense, but it doesn't fundamentally change the fact that
> if the filesystem is not properly sizing the request buffer, it may
> result in various forms of breakage.

I more or less agree with this statement. However can we please make the
breakage to be explicitly visible with an error instead of exhibiting it
via harder to debug stucks/deadlocks? For example sys_read < max_write
-> error instead of getting stuck. And if notify_retrieve requests
buffer larger than max_write -> error or cut to max_write, but don't
return OK when we know we will never send what was requested to
filesystem even if it uses max_write sized reads. What is the point of
breaking in hard to diagnose way when we can make the breakage showing
itself explicitly? Would a patch for such behaviour accepted?

Thanks,
Kirill

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RESEND, PATCH v2] fuse: Don't drop NOTIFY_REPLY if we promised it
  2019-02-27 20:39       ` Kirill Smelkov
@ 2019-02-28  8:10         ` Miklos Szeredi
  2019-02-28 11:48             ` Kirill Smelkov
  0 siblings, 1 reply; 15+ messages in thread
From: Miklos Szeredi @ 2019-02-28  8:10 UTC (permalink / raw)
  To: Kirill Smelkov
  Cc: Miklos Szeredi, linux-fsdevel, fuse-devel, Han-Wen Nienhuys,
	Jakob Unterwurzacher, stable

On Wed, Feb 27, 2019 at 9:39 PM Kirill Smelkov <kirr@nexedi.com> wrote:

> I more or less agree with this statement. However can we please make the
> breakage to be explicitly visible with an error instead of exhibiting it
> via harder to debug stucks/deadlocks? For example sys_read < max_write
> -> error instead of getting stuck. And if notify_retrieve requests
> buffer larger than max_write -> error or cut to max_write, but don't
> return OK when we know we will never send what was requested to
> filesystem even if it uses max_write sized reads. What is the point of
> breaking in hard to diagnose way when we can make the breakage showing
> itself explicitly? Would a patch for such behaviour accepted?

Sure, if it's only adds a couple of lines.   Adding more than say ten
lines for such a non-bug fix is definitely excessive.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RESEND, PATCH v2] fuse: Don't drop NOTIFY_REPLY if we promised it
  2019-02-28  8:10         ` Miklos Szeredi
@ 2019-02-28 11:48             ` Kirill Smelkov
  0 siblings, 0 replies; 15+ messages in thread
From: Kirill Smelkov @ 2019-02-28 11:48 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Miklos Szeredi, linux-fsdevel, fuse-devel, Han-Wen Nienhuys,
	Jakob Unterwurzacher, stable

On Thu, Feb 28, 2019 at 09:10:15AM +0100, Miklos Szeredi wrote:
> On Wed, Feb 27, 2019 at 9:39 PM Kirill Smelkov <kirr@nexedi.com> wrote:
>
> > I more or less agree with this statement. However can we please make the
> > breakage to be explicitly visible with an error instead of exhibiting it
> > via harder to debug stucks/deadlocks? For example sys_read < max_write
> > -> error instead of getting stuck. And if notify_retrieve requests
> > buffer larger than max_write -> error or cut to max_write, but don't
> > return OK when we know we will never send what was requested to
> > filesystem even if it uses max_write sized reads. What is the point of
> > breaking in hard to diagnose way when we can make the breakage showing
> > itself explicitly? Would a patch for such behaviour accepted?
>
> Sure, if it's only adds a couple of lines.   Adding more than say ten
> lines for such a non-bug fix is definitely excessive.

Ok, thanks. Please consider applying the following patch. (It's a bit
pity to hear the problem is not considered to be a bug, but anyway).

I will also send the second patch as another mail, since I could not
made `git am --scissors` to apply several patched extracted from one
mail successfully.

Thanks,
Kirill

---- 8< ----
From: Kirill Smelkov <kirr@nexedi.com>
Date: Thu, 28 Feb 2019 13:06:18 +0300
Subject: [PATCH 1/2] fuse: retrieve: cap requested size to negotiated
 max_write
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

FUSE filesystem server and kernel client negotiate during initialization
phase, what should be the maximum write size the client will ever issue.
Correspondingly the filesystem server then queues sys_read calls to read
requests with buffer capacity large enough to carry request header
+ that max_write bytes. A filesystem server is free to set its max_write
in anywhere in the range between [1·page, fc->max_pages·page]. In
particular go-fuse[2] sets max_write by default as 64K, wheres default
fc->max_pages corresponds to 128K. Libfuse also allows users to
configure max_write, but by default presets it to possible maximum.

If max_write is < fc->max_pages·page, and in NOTIFY_RETRIEVE handler we
allow to retrieve more than max_write bytes, corresponding prepared
NOTIFY_REPLY will be thrown away by fuse_dev_do_read, because the
filesystem server, in full correspondence with server/client contract,
will be only queuing sys_read with ~max_write buffer capacity, and
fuse_dev_do_read throws away requests that cannot fit into server
request buffer. In turn the filesystem server could get stuck waiting
indefinitely for NOTIFY_REPLY since NOTIFY_RETRIEVE handler returned OK
which is understood by clients as that NOTIFY_REPLY was queued and will
be sent back.

-> Cap requested size to negotiate max_write to avoid the problem.
This aligns with the way NOTIFY_RETRIEVE handler works, which already
unconditionally caps requested retrieve size to fuse_conn->max_pages.
This way it should not hurt NOTIFY_RETRIEVE semantic if we return less
data than was originally requested.

Please see [1] for context where the problem of stuck filesystem was hit
for real, how the situation was traced and for more involving patch that
did not make it into the tree.

[1] https://marc.info/?l=linux-fsdevel&m=155057023600853&w=2
[2] https://github.com/hanwen/go-fuse

Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
Cc: Han-Wen Nienhuys <hanwen@google.com>
Cc: Jakob Unterwurzacher <jakobunt@gmail.com>
Cc: <stable@vger.kernel.org> # v2.6.36+
---
 fs/fuse/dev.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 8a63e52785e9..38e94bc43053 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1749,7 +1749,7 @@ static int fuse_retrieve(struct fuse_conn *fc, struct inode *inode,
 	offset = outarg->offset & ~PAGE_MASK;
 	file_size = i_size_read(inode);

-	num = outarg->size;
+	num = min(outarg->size, fc->max_write);
 	if (outarg->offset > file_size)
 		num = 0;
 	else if (outarg->offset + num > file_size)
--
2.21.0.352.gf09ad66450


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [RESEND, PATCH v2] fuse: Don't drop NOTIFY_REPLY if we promised it
@ 2019-02-28 11:48             ` Kirill Smelkov
  0 siblings, 0 replies; 15+ messages in thread
From: Kirill Smelkov @ 2019-02-28 11:48 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Miklos Szeredi, linux-fsdevel, fuse-devel, Han-Wen Nienhuys,
	Jakob Unterwurzacher, stable

On Thu, Feb 28, 2019 at 09:10:15AM +0100, Miklos Szeredi wrote:
> On Wed, Feb 27, 2019 at 9:39 PM Kirill Smelkov <kirr@nexedi.com> wrote:
>
> > I more or less agree with this statement. However can we please make the
> > breakage to be explicitly visible with an error instead of exhibiting it
> > via harder to debug stucks/deadlocks? For example sys_read < max_write
> > -> error instead of getting stuck. And if notify_retrieve requests
> > buffer larger than max_write -> error or cut to max_write, but don't
> > return OK when we know we will never send what was requested to
> > filesystem even if it uses max_write sized reads. What is the point of
> > breaking in hard to diagnose way when we can make the breakage showing
> > itself explicitly? Would a patch for such behaviour accepted?
>
> Sure, if it's only adds a couple of lines.   Adding more than say ten
> lines for such a non-bug fix is definitely excessive.

Ok, thanks. Please consider applying the following patch. (It's a bit
pity to hear the problem is not considered to be a bug, but anyway).

I will also send the second patch as another mail, since I could not
made `git am --scissors` to apply several patched extracted from one
mail successfully.

Thanks,
Kirill

---- 8< ----
From: Kirill Smelkov <kirr@nexedi.com>
Date: Thu, 28 Feb 2019 13:06:18 +0300
Subject: [PATCH 1/2] fuse: retrieve: cap requested size to negotiated
 max_write
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

FUSE filesystem server and kernel client negotiate during initialization
phase, what should be the maximum write size the client will ever issue.
Correspondingly the filesystem server then queues sys_read calls to read
requests with buffer capacity large enough to carry request header
+ that max_write bytes. A filesystem server is free to set its max_write
in anywhere in the range between [1·page, fc->max_pages·page]. In
particular go-fuse[2] sets max_write by default as 64K, wheres default
fc->max_pages corresponds to 128K. Libfuse also allows users to
configure max_write, but by default presets it to possible maximum.

If max_write is < fc->max_pages·page, and in NOTIFY_RETRIEVE handler we
allow to retrieve more than max_write bytes, corresponding prepared
NOTIFY_REPLY will be thrown away by fuse_dev_do_read, because the
filesystem server, in full correspondence with server/client contract,
will be only queuing sys_read with ~max_write buffer capacity, and
fuse_dev_do_read throws away requests that cannot fit into server
request buffer. In turn the filesystem server could get stuck waiting
indefinitely for NOTIFY_REPLY since NOTIFY_RETRIEVE handler returned OK
which is understood by clients as that NOTIFY_REPLY was queued and will
be sent back.

-> Cap requested size to negotiate max_write to avoid the problem.
This aligns with the way NOTIFY_RETRIEVE handler works, which already
unconditionally caps requested retrieve size to fuse_conn->max_pages.
This way it should not hurt NOTIFY_RETRIEVE semantic if we return less
data than was originally requested.

Please see [1] for context where the problem of stuck filesystem was hit
for real, how the situation was traced and for more involving patch that
did not make it into the tree.

[1] https://marc.info/?l=linux-fsdevel&m=155057023600853&w=2
[2] https://github.com/hanwen/go-fuse

Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
Cc: Han-Wen Nienhuys <hanwen@google.com>
Cc: Jakob Unterwurzacher <jakobunt@gmail.com>
Cc: <stable@vger.kernel.org> # v2.6.36+
---
 fs/fuse/dev.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 8a63e52785e9..38e94bc43053 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1749,7 +1749,7 @@ static int fuse_retrieve(struct fuse_conn *fc, struct inode *inode,
 	offset = outarg->offset & ~PAGE_MASK;
 	file_size = i_size_read(inode);

-	num = outarg->size;
+	num = min(outarg->size, fc->max_write);
 	if (outarg->offset > file_size)
 		num = 0;
 	else if (outarg->offset + num > file_size)
--
2.21.0.352.gf09ad66450


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/2] fuse: require /dev/fuse reads to have enough buffer capacity as negotiated
  2019-02-28 11:48             ` Kirill Smelkov
  (?)
@ 2019-02-28 11:50             ` Kirill Smelkov
  -1 siblings, 0 replies; 15+ messages in thread
From: Kirill Smelkov @ 2019-02-28 11:50 UTC (permalink / raw)
  To: Miklos Szeredi, Miklos Szeredi
  Cc: linux-fsdevel, fuse-devel, Kirill Smelkov, Han-Wen Nienhuys,
	Jakob Unterwurzacher

A FUSE filesystem server queues /dev/fuse sys_read calls to get
filesystem requests to handle. It does not know in advance what would be
that request as it can be anything that client issues - LOOKUP, READ,
WRITE, ... Many requests are short and retrieve data from the
filesystem. However WRITE and NOTIFY_REPLY write data into filesystem.

Before getting into operation phase, FUSE filesystem server and kernel
client negotiate what should be the maximum write size the client will
ever issue. After negotiation the contract in between server/client is
that the filesystem server then should queue /dev/fuse sys_read calls with
enough buffer capacity to receive any client request - WRITE in
particular, while FUSE client should not, in particular, send WRITE
requests with > negotiated max_write payload. FUSE client in kernel and
libfuse historically reserve 4K for request header. This way the
contract is that filesystem server should queue sys_reads with
4K+max_write buffer.

If the filesystem server does not follow this contract, what can happen
is that fuse_dev_do_read will see that request size is > buffer size,
and then it will return EIO to client who issued the request but won't
indicate in any way that there is a problem to filesystem server.
This can be hard to diagnose because for some requests, e.g. for
NOTIFY_REPLY which mimics WRITE, there is no client thread that is
waiting for request completion and that EIO goes nowhere, while on
filesystem server side things look like the kernel is not replying back
after successful NOTIFY_RETRIEVE request made by the server.

-> We can make the problem easy to diagnose if we indicate via error
return to filesystem server when it is violating the contract.
This should not practically cause problems because if a filesystem
server is using shorter buffer, writes to it were already very likely to
cause EIO, and if the filesystem is read-only it should be too following
8K minimum buffer size (= either FUSE_MIN_READ_BUFFER, see 1d3d752b47,
or = 4K + min(max_write)=4k cared to be so by process_init_reply).

Please see [1] for context where the problem of stuck filesystem was hit
for real (because kernel client was incorrectly sending more than
max_write data with NOTIFY_REPLY; see also previous patch), how the
situation was traced and for more involving patch that did not make it
into the tree.

[1] https://marc.info/?l=linux-fsdevel&m=155057023600853&w=2

Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
Cc: Han-Wen Nienhuys <hanwen@google.com>
Cc: Jakob Unterwurzacher <jakobunt@gmail.com>
---
 fs/fuse/dev.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 38e94bc43053..8fdfbafed037 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1317,6 +1317,16 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
 	unsigned reqsize;
 	unsigned int hash;
 
+	/*
+	 * Require sane minimum read buffer - that has capacity for fixed part
+	 * of any request header + negotated max_write room for data. If the
+	 * requirement is not satisfied return EINVAL to the filesystem server
+	 * to indicate that it is not following FUSE server/client contract.
+	 * Don't dequeue / abort any request.
+	 */
+	if (nbytes < (fc->conn_init ? 4096 + fc->max_write : FUSE_MIN_READ_BUFFER))
+		return -EINVAL;
+
  restart:
 	spin_lock(&fiq->waitq.lock);
 	err = -EAGAIN;
-- 
2.21.0.352.gf09ad66450

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [RESEND, PATCH v2] fuse: Don't drop NOTIFY_REPLY if we promised it
  2019-02-28 11:48             ` Kirill Smelkov
  (?)
  (?)
@ 2019-03-07  9:34             ` Kirill Smelkov
  2019-03-14 10:45               ` [RESEND3, PATCH 0/2] fuse: don't stuck clients on retrieve_notify with size > max_write Kirill Smelkov
  -1 siblings, 1 reply; 15+ messages in thread
From: Kirill Smelkov @ 2019-03-07  9:34 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Miklos Szeredi, linux-fsdevel, fuse-devel, Han-Wen Nienhuys,
	Jakob Unterwurzacher, stable

On Thu, Feb 28, 2019 at 02:47:57PM +0300, Kirill Smelkov wrote:
> On Thu, Feb 28, 2019 at 09:10:15AM +0100, Miklos Szeredi wrote:
> > On Wed, Feb 27, 2019 at 9:39 PM Kirill Smelkov <kirr@nexedi.com> wrote:
> >
> > > I more or less agree with this statement. However can we please make the
> > > breakage to be explicitly visible with an error instead of exhibiting it
> > > via harder to debug stucks/deadlocks? For example sys_read < max_write
> > > -> error instead of getting stuck. And if notify_retrieve requests
> > > buffer larger than max_write -> error or cut to max_write, but don't
> > > return OK when we know we will never send what was requested to
> > > filesystem even if it uses max_write sized reads. What is the point of
> > > breaking in hard to diagnose way when we can make the breakage showing
> > > itself explicitly? Would a patch for such behaviour accepted?
> >
> > Sure, if it's only adds a couple of lines.   Adding more than say ten
> > lines for such a non-bug fix is definitely excessive.
>
> Ok, thanks. Please consider applying the following patch. (It's a bit
> pity to hear the problem is not considered to be a bug, but anyway).
>
> I will also send the second patch as another mail, since I could not
> made `git am --scissors` to apply several patched extracted from one
> mail successfully.

Ping. Miklos, is there anything wrong with this patch and its
second counterpart?

Thank beforehand for feedback,
Kirill


> ---- 8< ----
> From: Kirill Smelkov <kirr@nexedi.com>
> Date: Thu, 28 Feb 2019 13:06:18 +0300
> Subject: [PATCH 1/2] fuse: retrieve: cap requested size to negotiated
>  max_write
> MIME-Version: 1.0
> Content-Type: text/plain; charset=utf-8
> Content-Transfer-Encoding: 8bit
>
> FUSE filesystem server and kernel client negotiate during initialization
> phase, what should be the maximum write size the client will ever issue.
> Correspondingly the filesystem server then queues sys_read calls to read
> requests with buffer capacity large enough to carry request header
> + that max_write bytes. A filesystem server is free to set its max_write
> in anywhere in the range between [1·page, fc->max_pages·page]. In
> particular go-fuse[2] sets max_write by default as 64K, wheres default
> fc->max_pages corresponds to 128K. Libfuse also allows users to
> configure max_write, but by default presets it to possible maximum.
>
> If max_write is < fc->max_pages·page, and in NOTIFY_RETRIEVE handler we
> allow to retrieve more than max_write bytes, corresponding prepared
> NOTIFY_REPLY will be thrown away by fuse_dev_do_read, because the
> filesystem server, in full correspondence with server/client contract,
> will be only queuing sys_read with ~max_write buffer capacity, and
> fuse_dev_do_read throws away requests that cannot fit into server
> request buffer. In turn the filesystem server could get stuck waiting
> indefinitely for NOTIFY_REPLY since NOTIFY_RETRIEVE handler returned OK
> which is understood by clients as that NOTIFY_REPLY was queued and will
> be sent back.
>
> -> Cap requested size to negotiate max_write to avoid the problem.
> This aligns with the way NOTIFY_RETRIEVE handler works, which already
> unconditionally caps requested retrieve size to fuse_conn->max_pages.
> This way it should not hurt NOTIFY_RETRIEVE semantic if we return less
> data than was originally requested.
>
> Please see [1] for context where the problem of stuck filesystem was hit
> for real, how the situation was traced and for more involving patch that
> did not make it into the tree.
>
> [1] https://marc.info/?l=linux-fsdevel&m=155057023600853&w=2
> [2] https://github.com/hanwen/go-fuse
>
> Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
> Cc: Han-Wen Nienhuys <hanwen@google.com>
> Cc: Jakob Unterwurzacher <jakobunt@gmail.com>
> Cc: <stable@vger.kernel.org> # v2.6.36+
> ---
>  fs/fuse/dev.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 8a63e52785e9..38e94bc43053 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -1749,7 +1749,7 @@ static int fuse_retrieve(struct fuse_conn *fc, struct inode *inode,
>  	offset = outarg->offset & ~PAGE_MASK;
>  	file_size = i_size_read(inode);
>
> -	num = outarg->size;
> +	num = min(outarg->size, fc->max_write);
>  	if (outarg->offset > file_size)
>  		num = 0;
>  	else if (outarg->offset + num > file_size)
> --
> 2.21.0.352.gf09ad66450


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [RESEND3, PATCH 0/2] fuse: don't stuck clients on retrieve_notify with size > max_write
  2019-03-07  9:34             ` [RESEND, PATCH v2] fuse: Don't drop NOTIFY_REPLY if we promised it Kirill Smelkov
@ 2019-03-14 10:45               ` Kirill Smelkov
  2019-03-14 10:46                 ` [PATCH 1/2] fuse: retrieve: cap requested size to negotiated max_write Kirill Smelkov
  2019-03-14 10:46                 ` [PATCH 2/2] fuse: require /dev/fuse reads to have enough buffer capacity as negotiated Kirill Smelkov
  0 siblings, 2 replies; 15+ messages in thread
From: Kirill Smelkov @ 2019-03-14 10:45 UTC (permalink / raw)
  To: Miklos Szeredi, Miklos Szeredi
  Cc: linux-fsdevel, fuse-devel, Kirill Smelkov, Han-Wen Nienhuys,
	Jakob Unterwurzacher

Miklos,

On Thu, Feb 28, 2019 at 02:47:57PM +0300, Kirill Smelkov wrote:
> On Thu, Feb 28, 2019 at 09:10:15AM +0100, Miklos Szeredi wrote:
> > On Wed, Feb 27, 2019 at 9:39 PM Kirill Smelkov <kirr@nexedi.com> wrote:
> > 
> > > I more or less agree with this statement. However can we please make the
> > > breakage to be explicitly visible with an error instead of exhibiting it
> > > via harder to debug stucks/deadlocks? For example sys_read < max_write
> > > -> error instead of getting stuck. And if notify_retrieve requests
> > > buffer larger than max_write -> error or cut to max_write, but don't
> > > return OK when we know we will never send what was requested to
> > > filesystem even if it uses max_write sized reads. What is the point of
> > > breaking in hard to diagnose way when we can make the breakage showing
> > > itself explicitly? Would a patch for such behaviour accepted?
> > 
> > Sure, if it's only adds a couple of lines.   Adding more than say ten
> > lines for such a non-bug fix is definitely excessive.
>
> Ok, thanks. Please consider applying the following patch. (It's a bit
> pity to hear the problem is not considered to be a bug, but anyway).
>
> I will also send the second patch as another mail, since I could not
> made `git am --scissors` to apply several patched extracted from one
> mail successfully.

[...]

On Thu, Mar 07, 2019 at 12:34:21PM +0300, Kirill Smelkov wrote:
> Ping. Miklos, is there anything wrong with this patch and its
> second counterpart?

As we were talking here are those patches. The first one cuts notify_retrieve
request to max_write and is one line only. The second one returns error to
filesystem server if it is buggy and does sys_read with buffer size <
max_write. It is 2 lines of code and 7 lines of comments.

I still think that the patches fix real bugs. It is a bug if server behaviour
is a bit non-confirming or simply on an edge of being correct or questionable,
and instead of properly getting plain error from kernel, the whole system gets
stuck. It is a bug because bug amplification factor here is at least one order
of magnitude instead of staying ~1x.

I'm sending the patches for the third time already, but did not get any
feedback. Could you please have a look?

Thanks beforehand,
Kirill


Kirill Smelkov (2):
  fuse: retrieve: cap requested size to negotiated max_write
  fuse: require /dev/fuse reads to have enough buffer capacity as negotiated

 fs/fuse/dev.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

-- 
2.21.0.225.g810b269d1a

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/2] fuse: retrieve: cap requested size to negotiated max_write
  2019-03-14 10:45               ` [RESEND3, PATCH 0/2] fuse: don't stuck clients on retrieve_notify with size > max_write Kirill Smelkov
@ 2019-03-14 10:46                 ` Kirill Smelkov
  2019-03-14 10:46                 ` [PATCH 2/2] fuse: require /dev/fuse reads to have enough buffer capacity as negotiated Kirill Smelkov
  1 sibling, 0 replies; 15+ messages in thread
From: Kirill Smelkov @ 2019-03-14 10:46 UTC (permalink / raw)
  To: Miklos Szeredi, Miklos Szeredi
  Cc: linux-fsdevel, fuse-devel, Kirill Smelkov, Han-Wen Nienhuys,
	Jakob Unterwurzacher, stable

FUSE filesystem server and kernel client negotiate during initialization
phase, what should be the maximum write size the client will ever issue.
Correspondingly the filesystem server then queues sys_read calls to read
requests with buffer capacity large enough to carry request header
+ that max_write bytes. A filesystem server is free to set its max_write
in anywhere in the range between [1·page, fc->max_pages·page]. In
particular go-fuse[2] sets max_write by default as 64K, wheres default
fc->max_pages corresponds to 128K. Libfuse also allows users to
configure max_write, but by default presets it to possible maximum.

If max_write is < fc->max_pages·page, and in NOTIFY_RETRIEVE handler we
allow to retrieve more than max_write bytes, corresponding prepared
NOTIFY_REPLY will be thrown away by fuse_dev_do_read, because the
filesystem server, in full correspondence with server/client contract,
will be only queuing sys_read with ~max_write buffer capacity, and
fuse_dev_do_read throws away requests that cannot fit into server
request buffer. In turn the filesystem server could get stuck waiting
indefinitely for NOTIFY_REPLY since NOTIFY_RETRIEVE handler returned OK
which is understood by clients as that NOTIFY_REPLY was queued and will
be sent back.

-> Cap requested size to negotiate max_write to avoid the problem.
This aligns with the way NOTIFY_RETRIEVE handler works, which already
unconditionally caps requested retrieve size to fuse_conn->max_pages.
This way it should not hurt NOTIFY_RETRIEVE semantic if we return less
data than was originally requested.

Please see [1] for context where the problem of stuck filesystem was hit
for real, how the situation was traced and for more involving patch that
did not make it into the tree.

[1] https://marc.info/?l=linux-fsdevel&m=155057023600853&w=2
[2] https://github.com/hanwen/go-fuse

Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
Cc: Han-Wen Nienhuys <hanwen@google.com>
Cc: Jakob Unterwurzacher <jakobunt@gmail.com>
Cc: <stable@vger.kernel.org> # v2.6.36+
---
 fs/fuse/dev.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 8a63e52785e9..38e94bc43053 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1749,7 +1749,7 @@ static int fuse_retrieve(struct fuse_conn *fc, struct inode *inode,
 	offset = outarg->offset & ~PAGE_MASK;
 	file_size = i_size_read(inode);

-	num = outarg->size;
+	num = min(outarg->size, fc->max_write);
 	if (outarg->offset > file_size)
 		num = 0;
 	else if (outarg->offset + num > file_size)
--
2.21.0.225.g810b269d1a


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/2] fuse: require /dev/fuse reads to have enough buffer capacity as negotiated
  2019-03-14 10:45               ` [RESEND3, PATCH 0/2] fuse: don't stuck clients on retrieve_notify with size > max_write Kirill Smelkov
  2019-03-14 10:46                 ` [PATCH 1/2] fuse: retrieve: cap requested size to negotiated max_write Kirill Smelkov
@ 2019-03-14 10:46                 ` Kirill Smelkov
  1 sibling, 0 replies; 15+ messages in thread
From: Kirill Smelkov @ 2019-03-14 10:46 UTC (permalink / raw)
  To: Miklos Szeredi, Miklos Szeredi
  Cc: linux-fsdevel, fuse-devel, Kirill Smelkov, Han-Wen Nienhuys,
	Jakob Unterwurzacher

A FUSE filesystem server queues /dev/fuse sys_read calls to get
filesystem requests to handle. It does not know in advance what would be
that request as it can be anything that client issues - LOOKUP, READ,
WRITE, ... Many requests are short and retrieve data from the
filesystem. However WRITE and NOTIFY_REPLY write data into filesystem.

Before getting into operation phase, FUSE filesystem server and kernel
client negotiate what should be the maximum write size the client will
ever issue. After negotiation the contract in between server/client is
that the filesystem server then should queue /dev/fuse sys_read calls with
enough buffer capacity to receive any client request - WRITE in
particular, while FUSE client should not, in particular, send WRITE
requests with > negotiated max_write payload. FUSE client in kernel and
libfuse historically reserve 4K for request header. This way the
contract is that filesystem server should queue sys_reads with
4K+max_write buffer.

If the filesystem server does not follow this contract, what can happen
is that fuse_dev_do_read will see that request size is > buffer size,
and then it will return EIO to client who issued the request but won't
indicate in any way that there is a problem to filesystem server.
This can be hard to diagnose because for some requests, e.g. for
NOTIFY_REPLY which mimics WRITE, there is no client thread that is
waiting for request completion and that EIO goes nowhere, while on
filesystem server side things look like the kernel is not replying back
after successful NOTIFY_RETRIEVE request made by the server.

-> We can make the problem easy to diagnose if we indicate via error
return to filesystem server when it is violating the contract.
This should not practically cause problems because if a filesystem
server is using shorter buffer, writes to it were already very likely to
cause EIO, and if the filesystem is read-only it should be too following
8K minimum buffer size (= either FUSE_MIN_READ_BUFFER, see 1d3d752b47,
or = 4K + min(max_write)=4k cared to be so by process_init_reply).

Please see [1] for context where the problem of stuck filesystem was hit
for real (because kernel client was incorrectly sending more than
max_write data with NOTIFY_REPLY; see also previous patch), how the
situation was traced and for more involving patch that did not make it
into the tree.

[1] https://marc.info/?l=linux-fsdevel&m=155057023600853&w=2

Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
Cc: Han-Wen Nienhuys <hanwen@google.com>
Cc: Jakob Unterwurzacher <jakobunt@gmail.com>
---
 fs/fuse/dev.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 38e94bc43053..8fdfbafed037 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1317,6 +1317,16 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
 	unsigned reqsize;
 	unsigned int hash;
 
+	/*
+	 * Require sane minimum read buffer - that has capacity for fixed part
+	 * of any request header + negotated max_write room for data. If the
+	 * requirement is not satisfied return EINVAL to the filesystem server
+	 * to indicate that it is not following FUSE server/client contract.
+	 * Don't dequeue / abort any request.
+	 */
+	if (nbytes < (fc->conn_init ? 4096 + fc->max_write : FUSE_MIN_READ_BUFFER))
+		return -EINVAL;
+
  restart:
 	spin_lock(&fiq->waitq.lock);
 	err = -EAGAIN;
-- 
2.21.0.225.g810b269d1a

^ permalink raw reply related	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2019-03-14 11:16 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-02-19  9:42 [RESEND, PATCH v2] fuse: Don't drop NOTIFY_REPLY if we promised it Kirill Smelkov
2019-02-19  9:42 ` Kirill Smelkov
2019-02-26 15:14 ` Miklos Szeredi
2019-02-27 20:02   ` Kirill Smelkov
2019-02-27 20:02     ` Kirill Smelkov
2019-02-27 20:26     ` Miklos Szeredi
2019-02-27 20:39       ` Kirill Smelkov
2019-02-28  8:10         ` Miklos Szeredi
2019-02-28 11:48           ` Kirill Smelkov
2019-02-28 11:48             ` Kirill Smelkov
2019-02-28 11:50             ` [PATCH 2/2] fuse: require /dev/fuse reads to have enough buffer capacity as negotiated Kirill Smelkov
2019-03-07  9:34             ` [RESEND, PATCH v2] fuse: Don't drop NOTIFY_REPLY if we promised it Kirill Smelkov
2019-03-14 10:45               ` [RESEND3, PATCH 0/2] fuse: don't stuck clients on retrieve_notify with size > max_write Kirill Smelkov
2019-03-14 10:46                 ` [PATCH 1/2] fuse: retrieve: cap requested size to negotiated max_write Kirill Smelkov
2019-03-14 10:46                 ` [PATCH 2/2] fuse: require /dev/fuse reads to have enough buffer capacity as negotiated Kirill Smelkov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.