From: Alex Bligh <alex@alex.org.uk> To: Wouter Verhelst <w@uter.be> Cc: Alex Bligh <alex@alex.org.uk>, Christoph Hellwig <hch@infradead.org>, "nbd-general@lists.sourceforge.net" <nbd-general@lists.sourceforge.net>, Josef Bacik <jbacik@fb.com>, "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>, linux-block@vger.kernel.org, mpa@pengutronix.de, kernel-team@fb.com Subject: Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements Date: Thu, 15 Sep 2016 17:08:21 +0100 [thread overview] Message-ID: <ECDC1983-E21B-48D5-892D-63CA3572B2B4@alex.org.uk> (raw) In-Reply-To: <20160915131741.cth6kilmcgnobbuu@grep.be> Wouter, > The server can always refuse to allow multiple connections. Sure, but it would be neater to warn the client of that at negotiation stage (it would only be one flag, e.g. 'multiple connections unsafe'). That way the connection won't fail with a cryptic EBUSY or whatever, but will just negotiate a single connection. > I was thinking of changing the spec as follows: >=20 > diff --git a/doc/proto.md b/doc/proto.md > index 217f57e..cb099e2 100644 > --- a/doc/proto.md > +++ b/doc/proto.md > @@ -308,6 +308,23 @@ specification, the > [kernel = documentation](https://www.kernel.org/doc/Documentation/block/writeback_ca= che_control.txt) > may be useful. >=20 > +For performance reasons, clients MAY open multiple connections to the > +same server. To support such clients, servers SHOULD ensure that at > +least one of the following conditions hold: > + > +* Flush commands are processed for ALL connections. That is, when an > + `NBD_CMD_WRITE` is processed on one connection, and then an > + `NBD_CMD_FLUSH` is processed on another connection, the data of the > + `NBD_CMD_WRITE` on the first connection MUST reach permanent = storage > + before the reply of the `NBD_CMD_FLUSH` is sent. > +* The server allows `NBD_CMD_WRITE` and `NBD_CMD_FLUSH` on at most = one > + connection > +* Multiple connections are not allowed > + > +In addition, clients using multiple connections SHOULD NOT send > +`NBD_CMD_FLUSH` if an `NBD_CMD_WRITE` for which they care in relation = to > +the flush has not been replied to yet. > + I don't think that should be a mandatory behaviour. For once, it would be reasonably easy on gonbdserver but a PITA on the reference server. You'd need to put in IPC between each of the forked processes OR rely on fdatasync() only - I'm not sure that would necessarily work safely with (e.g.) the 'treefiles' / COW options. I think better would be to say that the server MUST either * Not support NBD_CMD_FLUSH at all * Support NBD_CMD_FLUSH across channels (as you set out above), or * Indicate that it does not support multiple channels. > #### Request message >=20 > The request message, sent by the client, looks as follows: >=20 > The latter bit (on the client side) is because even if your backend = has > no cache coherency issues, TCP does not guarantee ordering between > multiple connections. I don't know if the above is in line with what > blk-mq does, but consider the following scenario: >=20 > - A client sends two writes to the server, followed (immediately) by a > flush, where at least the second write and the flush are not sent = over > the same connection. > - The first write is a small one, and it is handled almost = immediately. > - The second write takes a little longer, so the flush is handled > earlier than the second write > - The network packet containing the flush reply gets lost for whatever > reason, so the client doesn't get it, and we fall into TCP > retransmits. > - The second write finishes, and its reply header does not get lost > - After the second write reply reaches the client, the TCP retransmits > for the flush reply are handled. >=20 > In the above scenario, the flush reply arrives on the client side = after > a write reply which it did not cover; so the client will (incorrectly) > assume that the write has reached permanent storage when in fact it = may > not have done so yet. >=20 > If the kernel does not care about the ordering of the two writes = versus > the flush, then there is no problem. I don't know how blk-mq works in > that context, but if the above is a likely scenario, we may have to > reconsider adding blk-mq to nbd. Actually I think this is a problem anyway. A simpler failure case is one where (by chance) one channel gets the writes, and one channel gets the flushes. The flush reply is delayed beyond the replies to unconnected writes (on the other channel) and hence the kernel thinks replied-to writes have been persisted when they have not been. The only way to fix that (as far as I can see) without changing flush semantics is for the block layer to issue flush requests on each channel when flushing on one channel. This, incidentally, would resolve every other issue! --=20 Alex Bligh
WARNING: multiple messages have this Message-ID (diff)
From: Alex Bligh <alex@alex.org.uk> To: Wouter Verhelst <w@uter.be> Cc: Alex Bligh <alex@alex.org.uk>, Christoph Hellwig <hch@infradead.org>, "nbd-general@lists.sourceforge.net" <nbd-general@lists.sourceforge.net>, Josef Bacik <jbacik@fb.com>, "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>, linux-block@vger.kernel.org, mpa@pengutronix.de, kernel-team@fb.com Subject: Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements Date: Thu, 15 Sep 2016 17:08:21 +0100 [thread overview] Message-ID: <ECDC1983-E21B-48D5-892D-63CA3572B2B4@alex.org.uk> (raw) In-Reply-To: <20160915131741.cth6kilmcgnobbuu@grep.be> Wouter, > The server can always refuse to allow multiple connections. Sure, but it would be neater to warn the client of that at negotiation stage (it would only be one flag, e.g. 'multiple connections unsafe'). That way the connection won't fail with a cryptic EBUSY or whatever, but will just negotiate a single connection. > I was thinking of changing the spec as follows: > > diff --git a/doc/proto.md b/doc/proto.md > index 217f57e..cb099e2 100644 > --- a/doc/proto.md > +++ b/doc/proto.md > @@ -308,6 +308,23 @@ specification, the > [kernel documentation](https://www.kernel.org/doc/Documentation/block/writeback_cache_control.txt) > may be useful. > > +For performance reasons, clients MAY open multiple connections to the > +same server. To support such clients, servers SHOULD ensure that at > +least one of the following conditions hold: > + > +* Flush commands are processed for ALL connections. That is, when an > + `NBD_CMD_WRITE` is processed on one connection, and then an > + `NBD_CMD_FLUSH` is processed on another connection, the data of the > + `NBD_CMD_WRITE` on the first connection MUST reach permanent storage > + before the reply of the `NBD_CMD_FLUSH` is sent. > +* The server allows `NBD_CMD_WRITE` and `NBD_CMD_FLUSH` on at most one > + connection > +* Multiple connections are not allowed > + > +In addition, clients using multiple connections SHOULD NOT send > +`NBD_CMD_FLUSH` if an `NBD_CMD_WRITE` for which they care in relation to > +the flush has not been replied to yet. > + I don't think that should be a mandatory behaviour. For once, it would be reasonably easy on gonbdserver but a PITA on the reference server. You'd need to put in IPC between each of the forked processes OR rely on fdatasync() only - I'm not sure that would necessarily work safely with (e.g.) the 'treefiles' / COW options. I think better would be to say that the server MUST either * Not support NBD_CMD_FLUSH at all * Support NBD_CMD_FLUSH across channels (as you set out above), or * Indicate that it does not support multiple channels. > #### Request message > > The request message, sent by the client, looks as follows: > > The latter bit (on the client side) is because even if your backend has > no cache coherency issues, TCP does not guarantee ordering between > multiple connections. I don't know if the above is in line with what > blk-mq does, but consider the following scenario: > > - A client sends two writes to the server, followed (immediately) by a > flush, where at least the second write and the flush are not sent over > the same connection. > - The first write is a small one, and it is handled almost immediately. > - The second write takes a little longer, so the flush is handled > earlier than the second write > - The network packet containing the flush reply gets lost for whatever > reason, so the client doesn't get it, and we fall into TCP > retransmits. > - The second write finishes, and its reply header does not get lost > - After the second write reply reaches the client, the TCP retransmits > for the flush reply are handled. > > In the above scenario, the flush reply arrives on the client side after > a write reply which it did not cover; so the client will (incorrectly) > assume that the write has reached permanent storage when in fact it may > not have done so yet. > > If the kernel does not care about the ordering of the two writes versus > the flush, then there is no problem. I don't know how blk-mq works in > that context, but if the above is a likely scenario, we may have to > reconsider adding blk-mq to nbd. Actually I think this is a problem anyway. A simpler failure case is one where (by chance) one channel gets the writes, and one channel gets the flushes. The flush reply is delayed beyond the replies to unconnected writes (on the other channel) and hence the kernel thinks replied-to writes have been persisted when they have not been. The only way to fix that (as far as I can see) without changing flush semantics is for the block layer to issue flush requests on each channel when flushing on one channel. This, incidentally, would resolve every other issue! -- Alex Bligh
next prev parent reply other threads:[~2016-09-15 16:08 UTC|newest] Thread overview: 95+ messages / expand[flat|nested] mbox.gz Atom feed top 2016-09-08 21:12 [RESEND][PATCH 0/5] nbd improvements Josef Bacik 2016-09-08 21:12 ` [PATCH 1/5] nbd: convert to blkmq Josef Bacik 2016-09-08 21:12 ` [PATCH 2/5] nbd: don't shutdown sock with irq's disabled Josef Bacik 2016-09-08 21:12 ` [PATCH 3/5] nbd: use flags instead of bool Josef Bacik 2016-09-09 1:20 ` Joe Perches 2016-09-09 13:55 ` Jens Axboe 2016-09-09 16:04 ` Joe Perches 2016-09-09 16:04 ` Joe Perches 2016-09-09 16:11 ` Jens Axboe 2016-09-09 16:15 ` Joe Perches 2016-09-09 16:20 ` Jens Axboe 2016-09-09 16:20 ` Jens Axboe 2016-09-08 21:12 ` [PATCH 4/5] nbd: allow block mq to deal with timeouts Josef Bacik 2016-09-08 21:12 ` [PATCH 5/5] nbd: add multi-connection support Josef Bacik 2016-09-10 7:43 ` Christoph Hellwig 2016-09-12 13:11 ` Josef Bacik 2016-09-09 20:02 ` [Nbd] [RESEND][PATCH 0/5] nbd improvements Wouter Verhelst 2016-09-09 20:36 ` Josef Bacik 2016-09-09 20:55 ` Wouter Verhelst 2016-09-09 23:00 ` Josef Bacik 2016-09-09 23:37 ` Jens Axboe 2016-09-15 10:49 ` Wouter Verhelst 2016-09-15 11:09 ` Alex Bligh 2016-09-15 11:09 ` Alex Bligh 2016-09-15 11:29 ` Wouter Verhelst 2016-09-15 11:29 ` Wouter Verhelst 2016-09-15 11:40 ` Christoph Hellwig 2016-09-15 11:40 ` Christoph Hellwig 2016-09-15 11:46 ` Alex Bligh 2016-09-15 11:46 ` Alex Bligh 2016-09-15 11:52 ` Christoph Hellwig 2016-09-15 11:52 ` Christoph Hellwig 2016-09-15 12:01 ` Wouter Verhelst 2016-09-15 12:01 ` Wouter Verhelst 2016-09-15 12:20 ` Christoph Hellwig 2016-09-15 12:20 ` Christoph Hellwig 2016-09-15 12:26 ` Wouter Verhelst 2016-09-15 12:26 ` Wouter Verhelst 2016-09-15 12:27 ` Christoph Hellwig 2016-09-15 12:27 ` Christoph Hellwig 2016-09-15 12:04 ` Alex Bligh 2016-09-15 12:04 ` Alex Bligh 2016-09-15 11:39 ` Christoph Hellwig 2016-09-15 11:39 ` Christoph Hellwig 2016-09-15 13:34 ` Eric Blake 2016-09-15 13:34 ` Eric Blake 2016-09-15 14:07 ` Paolo Bonzini 2016-09-15 14:07 ` Paolo Bonzini 2016-09-15 15:23 ` Alex Bligh 2016-09-15 15:23 ` Alex Bligh 2016-09-15 21:10 ` Paolo Bonzini 2016-09-15 21:10 ` Paolo Bonzini 2016-09-15 15:25 ` Alex Bligh 2016-09-15 15:25 ` Alex Bligh 2016-09-15 11:38 ` Christoph Hellwig 2016-09-15 11:43 ` Alex Bligh 2016-09-15 11:43 ` Alex Bligh 2016-09-15 11:46 ` Christoph Hellwig 2016-09-15 11:46 ` Christoph Hellwig 2016-09-15 11:56 ` Alex Bligh 2016-09-15 11:56 ` Alex Bligh 2016-09-15 11:55 ` Wouter Verhelst 2016-09-15 12:01 ` Christoph Hellwig 2016-09-15 12:11 ` Alex Bligh 2016-09-15 12:11 ` Alex Bligh 2016-09-15 12:18 ` Christoph Hellwig 2016-09-15 12:18 ` Christoph Hellwig 2016-09-15 12:28 ` Alex Bligh 2016-09-15 12:28 ` Alex Bligh 2016-09-15 12:21 ` Wouter Verhelst 2016-09-15 12:23 ` Christoph Hellwig 2016-09-15 12:33 ` Alex Bligh 2016-09-15 12:33 ` Alex Bligh 2016-09-15 12:36 ` Christoph Hellwig 2016-09-15 12:36 ` Christoph Hellwig 2016-09-15 12:39 ` Alex Bligh 2016-09-15 12:39 ` Alex Bligh 2016-09-15 12:41 ` Christoph Hellwig 2016-09-15 12:41 ` Christoph Hellwig 2016-09-15 12:44 ` Alex Bligh 2016-09-15 12:44 ` Alex Bligh 2016-09-15 13:17 ` Wouter Verhelst 2016-09-15 13:17 ` Wouter Verhelst 2016-09-15 13:57 ` Josef Bacik 2016-09-15 13:57 ` Josef Bacik 2016-09-15 15:17 ` Alex Bligh 2016-09-15 15:17 ` Alex Bligh 2016-09-15 16:08 ` Alex Bligh [this message] 2016-09-15 16:08 ` Alex Bligh 2016-09-15 16:27 ` Wouter Verhelst 2016-09-15 16:27 ` Wouter Verhelst 2016-09-15 16:42 ` Alex Bligh 2016-09-15 16:42 ` Alex Bligh 2016-09-15 19:02 ` Eric Blake 2016-09-15 19:02 ` Eric Blake
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=ECDC1983-E21B-48D5-892D-63CA3572B2B4@alex.org.uk \ --to=alex@alex.org.uk \ --cc=hch@infradead.org \ --cc=jbacik@fb.com \ --cc=kernel-team@fb.com \ --cc=linux-block@vger.kernel.org \ --cc=linux-kernel@vger.kernel.org \ --cc=mpa@pengutronix.de \ --cc=nbd-general@lists.sourceforge.net \ --cc=w@uter.be \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: linkBe sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.