linux-block.vger.kernel.org archive mirror
* [PATCH 0/6] Lots of NBD fixes and enhancements
@ 2017-02-28 16:57 Josef Bacik
  2017-02-28 16:57 ` [PATCH 1/6] nbd: handle single path failures gracefully Josef Bacik
                   ` (5 more replies)
  0 siblings, 6 replies; 10+ messages in thread
From: Josef Bacik @ 2017-02-28 16:57 UTC (permalink / raw)
  To: axboe, nbd-general, linux-block, kernel-team

This is kind of a big batch of patches, but they all depend on each other so it
was hard to tease out the fixes from the enhancements without making my life
miserable.

FIXES:
nbd: set queue timeout properly
nbd: handle ERESTARTSYS properly

The ERESTARTSYS one in particular is pretty awful as we can tear down a whole
device if a userspace app has a signal pending while submitting IO.  This is big
and scary but I had a debug patch on top of this to randomly induce ERESTARTSYS
to make sure it was behaving properly.

ENHANCEMENTS:

1) Handle single path failures gracefully.  This is the first step to handling
reconnects gracefully, but for right now we can easily fall back on other
connections if we happen to lose one connection.

2) Ref counting and bdev change.  This is in preparation for the netlink patch
and handling reconnects and such better.

3) Netlink interface.  Trying to add the nbd-control thing was controversial,
and I realized that the more I wanted to do with monitoring, the more I would
need netlink anyway.  With this new interface we can easily configure and
disconnect devices, dynamically add devices if we are past our pre-allocated
limit, and dynamically find free devices to use if we don't want to hunt
for one.  The userspace patch to utilize this is kind of rough but can be
found in my github tree:

https://github.com/josefbacik/nbd

These have been pretty well tested, but I'd like to hear any thoughts on the new
interface.  Thanks,

Josef


* [PATCH 1/6] nbd: handle single path failures gracefully
  2017-02-28 16:57 [PATCH 0/6] Lots of NBD fixes and enhancements Josef Bacik
@ 2017-02-28 16:57 ` Josef Bacik
  2017-02-28 16:57 ` [PATCH 2/6] nbd: ref count the nbd device Josef Bacik
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 10+ messages in thread
From: Josef Bacik @ 2017-02-28 16:57 UTC (permalink / raw)
  To: axboe, nbd-general, linux-block, kernel-team

Currently if we have multiple connections and one of them goes down we will tear
down the whole device.  However there's no reason we need to do this as we
could have other connections that are working fine.  Deal with this by keeping
track of the state of the different connections, and if we lose one we mark it
as dead and send all IO destined for that socket to one of the other healthy
sockets.  Any outstanding requests that were on the dead socket will time out
and be re-submitted properly.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 drivers/block/nbd.c | 165 +++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 124 insertions(+), 41 deletions(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 0bf2b21..7a91c8f 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -47,6 +47,8 @@ static DEFINE_MUTEX(nbd_index_mutex);
 struct nbd_sock {
 	struct socket *sock;
 	struct mutex tx_lock;
+	bool dead;
+	int fallback_index;
 };
 
 #define NBD_TIMEDOUT			0
@@ -80,6 +82,7 @@ struct nbd_device {
 
 struct nbd_cmd {
 	struct nbd_device *nbd;
+	int index;
 	struct completion send_complete;
 };
 
@@ -188,7 +191,32 @@ static enum blk_eh_timer_return nbd_xmit_timeout(struct request *req,
 	struct nbd_cmd *cmd = blk_mq_rq_to_pdu(req);
 	struct nbd_device *nbd = cmd->nbd;
 
-	dev_err(nbd_to_dev(nbd), "Connection timed out, shutting down connection\n");
+	if (nbd->num_connections > 1) {
+		dev_err_ratelimited(nbd_to_dev(nbd),
+				    "Connection timed out, retrying\n");
+		mutex_lock(&nbd->config_lock);
+		/*
+		 * Hooray we have more connections, requeue this IO, the submit
+		 * path will put it on a real connection.
+		 */
+		if (nbd->socks && nbd->num_connections > 1) {
+			if (cmd->index < nbd->num_connections) {
+				struct nbd_sock *nsock =
+					nbd->socks[cmd->index];
+				mutex_lock(&nsock->tx_lock);
+				nsock->dead = true;
+				kernel_sock_shutdown(nsock->sock, SHUT_RDWR);
+				mutex_unlock(&nsock->tx_lock);
+			}
+			mutex_unlock(&nbd->config_lock);
+			blk_mq_requeue_request(req, true);
+			return BLK_EH_RESET_TIMER;
+		}
+		mutex_unlock(&nbd->config_lock);
+	} else {
+		dev_err_ratelimited(nbd_to_dev(nbd),
+				    "Connection timed out\n");
+	}
 	set_bit(NBD_TIMEDOUT, &nbd->runtime_flags);
 	req->errors++;
 
@@ -294,6 +322,7 @@ static int nbd_send_cmd(struct nbd_device *nbd, struct nbd_cmd *cmd, int index)
 		return -EIO;
 	}
 
+	cmd->index = index;
 	memset(&request, 0, sizeof(request));
 	request.magic = htonl(NBD_REQUEST_MAGIC);
 	request.type = htonl(type);
@@ -311,7 +340,7 @@ static int nbd_send_cmd(struct nbd_device *nbd, struct nbd_cmd *cmd, int index)
 	if (result <= 0) {
 		dev_err_ratelimited(disk_to_dev(nbd->disk),
 			"Send control failed (result %d)\n", result);
-		return -EIO;
+		return -EAGAIN;
 	}
 
 	if (type != NBD_CMD_WRITE)
@@ -334,7 +363,7 @@ static int nbd_send_cmd(struct nbd_device *nbd, struct nbd_cmd *cmd, int index)
 				dev_err(disk_to_dev(nbd->disk),
 					"Send data failed (result %d)\n",
 					result);
-				return -EIO;
+				return -EAGAIN;
 			}
 			/*
 			 * The completion might already have come in,
@@ -361,6 +390,12 @@ static inline int sock_recv_bvec(struct nbd_device *nbd, int index,
 	return result;
 }
 
+static int nbd_disconnected(struct nbd_device *nbd)
+{
+	return test_bit(NBD_DISCONNECTED, &nbd->runtime_flags) ||
+		test_bit(NBD_DISCONNECT_REQUESTED, &nbd->runtime_flags);
+}
+
 /* NULL returned = something went wrong, inform userspace */
 static struct nbd_cmd *nbd_read_stat(struct nbd_device *nbd, int index)
 {
@@ -374,8 +409,7 @@ static struct nbd_cmd *nbd_read_stat(struct nbd_device *nbd, int index)
 	reply.magic = 0;
 	result = sock_xmit(nbd, index, 0, &reply, sizeof(reply), MSG_WAITALL);
 	if (result <= 0) {
-		if (!test_bit(NBD_DISCONNECTED, &nbd->runtime_flags) &&
-		    !test_bit(NBD_DISCONNECT_REQUESTED, &nbd->runtime_flags))
+		if (!nbd_disconnected(nbd))
 			dev_err(disk_to_dev(nbd->disk),
 				"Receive control failed (result %d)\n", result);
 		return ERR_PTR(result);
@@ -416,8 +450,19 @@ static struct nbd_cmd *nbd_read_stat(struct nbd_device *nbd, int index)
 			if (result <= 0) {
 				dev_err(disk_to_dev(nbd->disk), "Receive data failed (result %d)\n",
 					result);
-				req->errors++;
-				return cmd;
+				/*
+				 * If we've disconnected or we only have 1
+				 * connection then we need to make sure we
+				 * complete this request, otherwise error out
+				 * and let the timeout stuff handle resubmitting
+				 * this request onto another connection.
+				 */
+				if (nbd_disconnected(nbd) ||
+				    nbd->num_connections <= 1) {
+					req->errors++;
+					return cmd;
+				}
+				return ERR_PTR(-EIO);
 			}
 			dev_dbg(nbd_to_dev(nbd), "request %p: got %d bytes data\n",
 				cmd, bvec.bv_len);
@@ -462,19 +507,13 @@ static void recv_work(struct work_struct *work)
 	while (1) {
 		cmd = nbd_read_stat(nbd, args->index);
 		if (IS_ERR(cmd)) {
+			nbd->socks[args->index]->dead = true;
 			ret = PTR_ERR(cmd);
 			break;
 		}
 
 		nbd_end_request(cmd);
 	}
-
-	/*
-	 * We got an error, shut everybody down if this wasn't the result of a
-	 * disconnect request.
-	 */
-	if (ret && !test_bit(NBD_DISCONNECT_REQUESTED, &nbd->runtime_flags))
-		sock_shutdown(nbd);
 	atomic_dec(&nbd->recv_threads);
 	wake_up(&nbd->recv_wq);
 }
@@ -498,50 +537,89 @@ static void nbd_clear_que(struct nbd_device *nbd)
 	dev_dbg(disk_to_dev(nbd->disk), "queue cleared\n");
 }
 
+static int find_fallback(struct nbd_device *nbd, int index)
+{
+	int new_index = -1;
+	struct nbd_sock *nsock = nbd->socks[index];
+	int fallback = nsock->fallback_index;
+
+	if (test_bit(NBD_DISCONNECTED, &nbd->runtime_flags))
+		return new_index;
+
+	if (nbd->num_connections <= 1) {
+		dev_err_ratelimited(disk_to_dev(nbd->disk),
+				    "Attempted send on invalid socket\n");
+		return new_index;
+	}
+
+	if (fallback >= 0 && fallback < nbd->num_connections &&
+	    !nbd->socks[fallback]->dead)
+		return fallback;
+
+	mutex_lock(&nsock->tx_lock);
+	if (nsock->fallback_index < 0 ||
+	    nsock->fallback_index >= nbd->num_connections ||
+	    nbd->socks[nsock->fallback_index]->dead) {
+		int i;
+		for (i = 0; i < nbd->num_connections; i++) {
+			if (i == index)
+				continue;
+			if (!nbd->socks[i]->dead) {
+				new_index = i;
+				break;
+			}
+		}
+		if (new_index < 0) {
+			mutex_unlock(&nsock->tx_lock);
+			dev_err_ratelimited(disk_to_dev(nbd->disk),
+					    "Dead connection, failed to find a fallback\n");
+			return new_index;
+		}
+		nsock->fallback_index = new_index;
+	}
+	new_index = nsock->fallback_index;
+	mutex_unlock(&nsock->tx_lock);
+	return new_index;
+}
 
-static void nbd_handle_cmd(struct nbd_cmd *cmd, int index)
+static int nbd_handle_cmd(struct nbd_cmd *cmd, int index)
 {
 	struct request *req = blk_mq_rq_from_pdu(cmd);
 	struct nbd_device *nbd = cmd->nbd;
 	struct nbd_sock *nsock;
+	int ret;
 
 	if (index >= nbd->num_connections) {
 		dev_err_ratelimited(disk_to_dev(nbd->disk),
 				    "Attempted send on invalid socket\n");
-		goto error_out;
-	}
-
-	if (test_bit(NBD_DISCONNECTED, &nbd->runtime_flags)) {
-		dev_err_ratelimited(disk_to_dev(nbd->disk),
-				    "Attempted send on closed socket\n");
-		goto error_out;
+		return -EINVAL;
 	}
-
 	req->errors = 0;
-
+again:
 	nsock = nbd->socks[index];
-	mutex_lock(&nsock->tx_lock);
-	if (unlikely(!nsock->sock)) {
-		mutex_unlock(&nsock->tx_lock);
-		dev_err_ratelimited(disk_to_dev(nbd->disk),
-				    "Attempted send on closed socket\n");
-		goto error_out;
+	if (nsock->dead) {
+		index = find_fallback(nbd, index);
+		if (index < 0)
+			return -EIO;
+		nsock = nbd->socks[index];
 	}
 
-	if (nbd_send_cmd(nbd, cmd, index) != 0) {
+	/*
+	 * Some failures are related to the link going down, so anything that
+	 * returns EAGAIN can be retried on a different socket.
+	 */
+	mutex_lock(&nsock->tx_lock);
+	ret = nbd_send_cmd(nbd, cmd, index);
+	if (ret == -EAGAIN) {
 		dev_err_ratelimited(disk_to_dev(nbd->disk),
-				    "Request send failed\n");
-		req->errors++;
-		nbd_end_request(cmd);
+				    "Request send failed trying another connection\n");
+		nsock->dead = true;
+		mutex_unlock(&nsock->tx_lock);
+		goto again;
 	}
-
 	mutex_unlock(&nsock->tx_lock);
 
-	return;
-
-error_out:
-	req->errors++;
-	nbd_end_request(cmd);
+	return ret;
 }
 
 static int nbd_queue_rq(struct blk_mq_hw_ctx *hctx,
@@ -560,7 +638,10 @@ static int nbd_queue_rq(struct blk_mq_hw_ctx *hctx,
 	 */
 	init_completion(&cmd->send_complete);
 	blk_mq_start_request(bd->rq);
-	nbd_handle_cmd(cmd, hctx->queue_num);
+	if (nbd_handle_cmd(cmd, hctx->queue_num) != 0) {
+		bd->rq->errors++;
+		nbd_end_request(cmd);
+	}
 	complete(&cmd->send_complete);
 
 	return BLK_MQ_RQ_QUEUE_OK;
@@ -596,6 +677,8 @@ static int nbd_add_socket(struct nbd_device *nbd, struct block_device *bdev,
 
 	nbd->socks = socks;
 
+	nsock->fallback_index = -1;
+	nsock->dead = false;
 	mutex_init(&nsock->tx_lock);
 	nsock->sock = sock;
 	socks[nbd->num_connections++] = nsock;
-- 
2.7.4


* [PATCH 2/6] nbd: ref count the nbd device
  2017-02-28 16:57 [PATCH 0/6] Lots of NBD fixes and enhancements Josef Bacik
  2017-02-28 16:57 ` [PATCH 1/6] nbd: handle single path failures gracefully Josef Bacik
@ 2017-02-28 16:57 ` Josef Bacik
  2017-02-28 16:57 ` [PATCH 3/6] nbd: stop using the bdev everywhere Josef Bacik
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 10+ messages in thread
From: Josef Bacik @ 2017-02-28 16:57 UTC (permalink / raw)
  To: axboe, nbd-general, linux-block, kernel-team

In preparation for seamless reconnects and the netlink configuration
interface we need a way to make sure our nbd device configuration
doesn't disappear until we are finished with it.  So add a ref counter,
and on the final put we do all of the cleanup work on the nbd device.
At configuration time we only allow a device to be configured if its
ref count is currently 0.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 drivers/block/nbd.c | 210 +++++++++++++++++++++++++++++++++-------------------
 1 file changed, 133 insertions(+), 77 deletions(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 7a91c8f..6760ee5 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -51,22 +51,33 @@ struct nbd_sock {
 	int fallback_index;
 };
 
+struct recv_thread_args {
+	struct work_struct work;
+	struct nbd_device *nbd;
+	int index;
+};
+
 #define NBD_TIMEDOUT			0
 #define NBD_DISCONNECT_REQUESTED	1
 #define NBD_DISCONNECTED		2
-#define NBD_RUNNING			3
+#define NBD_HAS_SOCKS_REF		3
 
 struct nbd_device {
 	u32 flags;
 	unsigned long runtime_flags;
+
 	struct nbd_sock **socks;
+	atomic_t refs;
+	wait_queue_head_t socks_wq;
+	int num_connections;
+
+	struct recv_thread_args *args;
 	int magic;
 
 	struct blk_mq_tag_set tag_set;
 
 	struct mutex config_lock;
 	struct gendisk *disk;
-	int num_connections;
 	atomic_t recv_threads;
 	wait_queue_head_t recv_wq;
 	loff_t blksize;
@@ -101,7 +112,7 @@ static int part_shift;
 
 static int nbd_dev_dbg_init(struct nbd_device *nbd);
 static void nbd_dev_dbg_close(struct nbd_device *nbd);
-
+static void nbd_put(struct nbd_device *nbd);
 
 static inline struct device *nbd_to_dev(struct nbd_device *nbd)
 {
@@ -125,6 +136,25 @@ static const char *nbdcmd_to_ascii(int cmd)
 	return "invalid";
 }
 
+static ssize_t pid_show(struct device *dev,
+			struct device_attribute *attr, char *buf)
+{
+	struct gendisk *disk = dev_to_disk(dev);
+	struct nbd_device *nbd = (struct nbd_device *)disk->private_data;
+
+	return sprintf(buf, "%d\n", task_pid_nr(nbd->task_recv));
+}
+
+static struct device_attribute pid_attr = {
+	.attr = { .name = "pid", .mode = S_IRUGO},
+	.show = pid_show,
+};
+
+static int nbd_get_unless_zero(struct nbd_device *nbd)
+{
+	return atomic_add_unless(&nbd->refs, 1, 0);
+}
+
 static int nbd_size_clear(struct nbd_device *nbd, struct block_device *bdev)
 {
 	bd_set_size(bdev, 0);
@@ -181,6 +211,7 @@ static void sock_shutdown(struct nbd_device *nbd)
 		mutex_lock(&nsock->tx_lock);
 		kernel_sock_shutdown(nsock->sock, SHUT_RDWR);
 		mutex_unlock(&nsock->tx_lock);
+		nsock->dead = true;
 	}
 	dev_warn(disk_to_dev(nbd->disk), "shutting down sockets\n");
 }
@@ -191,10 +222,14 @@ static enum blk_eh_timer_return nbd_xmit_timeout(struct request *req,
 	struct nbd_cmd *cmd = blk_mq_rq_to_pdu(req);
 	struct nbd_device *nbd = cmd->nbd;
 
+	if (!nbd_get_unless_zero(nbd)) {
+		req->errors++;
+		return BLK_EH_HANDLED;
+	}
+
 	if (nbd->num_connections > 1) {
 		dev_err_ratelimited(nbd_to_dev(nbd),
 				    "Connection timed out, retrying\n");
-		mutex_lock(&nbd->config_lock);
 		/*
 		 * Hooray we have more connections, requeue this IO, the submit
 		 * path will put it on a real connection.
@@ -208,21 +243,19 @@ static enum blk_eh_timer_return nbd_xmit_timeout(struct request *req,
 				kernel_sock_shutdown(nsock->sock, SHUT_RDWR);
 				mutex_unlock(&nsock->tx_lock);
 			}
-			mutex_unlock(&nbd->config_lock);
 			blk_mq_requeue_request(req, true);
+			nbd_put(nbd);
 			return BLK_EH_RESET_TIMER;
 		}
-		mutex_unlock(&nbd->config_lock);
 	} else {
 		dev_err_ratelimited(nbd_to_dev(nbd),
 				    "Connection timed out\n");
 	}
 	set_bit(NBD_TIMEDOUT, &nbd->runtime_flags);
 	req->errors++;
-
-	mutex_lock(&nbd->config_lock);
 	sock_shutdown(nbd);
-	mutex_unlock(&nbd->config_lock);
+	nbd_put(nbd);
+
 	return BLK_EH_HANDLED;
 }
 
@@ -474,26 +507,6 @@ static struct nbd_cmd *nbd_read_stat(struct nbd_device *nbd, int index)
 	return cmd;
 }
 
-static ssize_t pid_show(struct device *dev,
-			struct device_attribute *attr, char *buf)
-{
-	struct gendisk *disk = dev_to_disk(dev);
-	struct nbd_device *nbd = (struct nbd_device *)disk->private_data;
-
-	return sprintf(buf, "%d\n", task_pid_nr(nbd->task_recv));
-}
-
-static struct device_attribute pid_attr = {
-	.attr = { .name = "pid", .mode = S_IRUGO},
-	.show = pid_show,
-};
-
-struct recv_thread_args {
-	struct work_struct work;
-	struct nbd_device *nbd;
-	int index;
-};
-
 static void recv_work(struct work_struct *work)
 {
 	struct recv_thread_args *args = container_of(work,
@@ -516,6 +529,7 @@ static void recv_work(struct work_struct *work)
 	}
 	atomic_dec(&nbd->recv_threads);
 	wake_up(&nbd->recv_wq);
+	nbd_put(nbd);
 }
 
 static void nbd_clear_req(struct request *req, void *data, bool reserved)
@@ -589,9 +603,16 @@ static int nbd_handle_cmd(struct nbd_cmd *cmd, int index)
 	struct nbd_sock *nsock;
 	int ret;
 
+	if (!nbd_get_unless_zero(nbd)) {
+		dev_err_ratelimited(disk_to_dev(nbd->disk),
+				    "Socks array is empty\n");
+		return -EINVAL;
+	}
+
 	if (index >= nbd->num_connections) {
 		dev_err_ratelimited(disk_to_dev(nbd->disk),
 				    "Attempted send on invalid socket\n");
+		nbd_put(nbd);
 		return -EINVAL;
 	}
 	req->errors = 0;
@@ -599,8 +620,10 @@ static int nbd_handle_cmd(struct nbd_cmd *cmd, int index)
 	nsock = nbd->socks[index];
 	if (nsock->dead) {
 		index = find_fallback(nbd, index);
-		if (index < 0)
+		if (index < 0) {
+			nbd_put(nbd);
 			return -EIO;
+		}
 		nsock = nbd->socks[index];
 	}
 
@@ -618,7 +641,7 @@ static int nbd_handle_cmd(struct nbd_cmd *cmd, int index)
 		goto again;
 	}
 	mutex_unlock(&nsock->tx_lock);
-
+	nbd_put(nbd);
 	return ret;
 }
 
@@ -659,21 +682,27 @@ static int nbd_add_socket(struct nbd_device *nbd, struct block_device *bdev,
 	if (!sock)
 		return err;
 
-	if (!nbd->task_setup)
+	err = -EINVAL;
+	if (!nbd->task_setup && !atomic_cmpxchg(&nbd->refs, 0, 1)) {
 		nbd->task_setup = current;
+		set_bit(NBD_HAS_SOCKS_REF, &nbd->runtime_flags);
+		try_module_get(THIS_MODULE);
+	}
+
 	if (nbd->task_setup != current) {
 		dev_err(disk_to_dev(nbd->disk),
 			"Device being setup by another task");
-		return -EINVAL;
+		goto out;
 	}
 
+	err = -ENOMEM;
 	socks = krealloc(nbd->socks, (nbd->num_connections + 1) *
 			 sizeof(struct nbd_sock *), GFP_KERNEL);
 	if (!socks)
-		return -ENOMEM;
+		goto out;
 	nsock = kzalloc(sizeof(struct nbd_sock), GFP_KERNEL);
 	if (!nsock)
-		return -ENOMEM;
+		goto out;
 
 	nbd->socks = socks;
 
@@ -685,7 +714,9 @@ static int nbd_add_socket(struct nbd_device *nbd, struct block_device *bdev,
 
 	if (max_part)
 		bdev->bd_invalidated = 1;
-	return 0;
+	err = 0;
+out:
+	return err;
 }
 
 /* Reset all properties of an NBD device */
@@ -697,6 +728,7 @@ static void nbd_reset(struct nbd_device *nbd)
 	set_capacity(nbd->disk, 0);
 	nbd->flags = 0;
 	nbd->tag_set.timeout = 0;
+	nbd->args = NULL;
 	queue_flag_clear_unlocked(QUEUE_FLAG_DISCARD, nbd->disk->queue);
 }
 
@@ -741,20 +773,17 @@ static void send_disconnects(struct nbd_device *nbd)
 static int nbd_disconnect(struct nbd_device *nbd, struct block_device *bdev)
 {
 	dev_info(disk_to_dev(nbd->disk), "NBD_DISCONNECT\n");
-	if (!nbd->socks)
+	if (!nbd_get_unless_zero(nbd))
 		return -EINVAL;
 
 	mutex_unlock(&nbd->config_lock);
 	fsync_bdev(bdev);
 	mutex_lock(&nbd->config_lock);
 
-	/* Check again after getting mutex back.  */
-	if (!nbd->socks)
-		return -EINVAL;
-
 	if (!test_and_set_bit(NBD_DISCONNECT_REQUESTED,
 			      &nbd->runtime_flags))
 		send_disconnects(nbd);
+	nbd_put(nbd);
 	return 0;
 }
 
@@ -764,49 +793,74 @@ static int nbd_clear_sock(struct nbd_device *nbd, struct block_device *bdev)
 	nbd_clear_que(nbd);
 	kill_bdev(bdev);
 	nbd_bdev_reset(bdev);
-	/*
-	 * We want to give the run thread a chance to wait for everybody
-	 * to clean up and then do it's own cleanup.
-	 */
-	if (!test_bit(NBD_RUNNING, &nbd->runtime_flags) &&
-	    nbd->num_connections) {
-		int i;
-
-		for (i = 0; i < nbd->num_connections; i++)
-			kfree(nbd->socks[i]);
-		kfree(nbd->socks);
-		nbd->socks = NULL;
-		nbd->num_connections = 0;
-	}
 	nbd->task_setup = NULL;
-
+	if (test_and_clear_bit(NBD_HAS_SOCKS_REF, &nbd->runtime_flags))
+		nbd_put(nbd);
 	return 0;
 }
 
+static void nbd_put(struct nbd_device *nbd)
+{
+	if (atomic_dec_and_test(&nbd->refs)) {
+		struct block_device *bdev;
+
+		bdev = bdget_disk(nbd->disk, 0);
+		if (!bdev)
+			return;
+
+		mutex_lock(&nbd->config_lock);
+		nbd_dev_dbg_close(nbd);
+		nbd_size_clear(nbd, bdev);
+		device_remove_file(disk_to_dev(nbd->disk), &pid_attr);
+		nbd->task_recv = NULL;
+		nbd_clear_sock(nbd, bdev);
+		if (nbd->num_connections) {
+			int i;
+			for (i = 0; i < nbd->num_connections; i++)
+				kfree(nbd->socks[i]);
+			kfree(nbd->socks);
+			nbd->num_connections = 0;
+			nbd->socks = NULL;
+		}
+		kfree(nbd->args);
+		nbd_reset(nbd);
+		mutex_unlock(&nbd->config_lock);
+		bdput(bdev);
+		module_put(THIS_MODULE);
+	}
+}
+
 static int nbd_start_device(struct nbd_device *nbd, struct block_device *bdev)
 {
 	struct recv_thread_args *args;
 	int num_connections = nbd->num_connections;
 	int error = 0, i;
 
-	if (nbd->task_recv)
-		return -EBUSY;
-	if (!nbd->socks)
+	if (!nbd_get_unless_zero(nbd))
 		return -EINVAL;
+	if (nbd->task_recv) {
+		error = -EBUSY;
+		goto out;
+	}
+	if (!nbd->socks) {
+		error = -EINVAL;
+		goto out;
+	}
+
 	if (num_connections > 1 &&
 	    !(nbd->flags & NBD_FLAG_CAN_MULTI_CONN)) {
 		dev_err(disk_to_dev(nbd->disk), "server does not support multiple connections per device.\n");
 		error = -EINVAL;
-		goto out_err;
+		goto out;
 	}
 
-	set_bit(NBD_RUNNING, &nbd->runtime_flags);
 	blk_mq_update_nr_hw_queues(&nbd->tag_set, nbd->num_connections);
 	args = kcalloc(num_connections, sizeof(*args), GFP_KERNEL);
 	if (!args) {
 		error = -ENOMEM;
-		goto out_err;
+		goto out;
 	}
+	nbd->args = args;
 	nbd->task_recv = current;
 	mutex_unlock(&nbd->config_lock);
 
@@ -815,7 +869,7 @@ static int nbd_start_device(struct nbd_device *nbd, struct block_device *bdev)
 	error = device_create_file(disk_to_dev(nbd->disk), &pid_attr);
 	if (error) {
 		dev_err(disk_to_dev(nbd->disk), "device_create_file failed!\n");
-		goto out_recv;
+		goto out;
 	}
 
 	nbd_size_update(nbd, bdev);
@@ -824,32 +878,26 @@ static int nbd_start_device(struct nbd_device *nbd, struct block_device *bdev)
 	for (i = 0; i < num_connections; i++) {
 		sk_set_memalloc(nbd->socks[i]->sock->sk);
 		atomic_inc(&nbd->recv_threads);
+		atomic_inc(&nbd->refs);
 		INIT_WORK(&args[i].work, recv_work);
 		args[i].nbd = nbd;
 		args[i].index = i;
 		queue_work(recv_workqueue, &args[i].work);
 	}
-	wait_event_interruptible(nbd->recv_wq,
-				 atomic_read(&nbd->recv_threads) == 0);
+	error = wait_event_interruptible(nbd->recv_wq,
+					 atomic_read(&nbd->recv_threads) == 0);
+	if (error)
+		sock_shutdown(nbd);
 	for (i = 0; i < num_connections; i++)
 		flush_work(&args[i].work);
-	nbd_dev_dbg_close(nbd);
-	nbd_size_clear(nbd, bdev);
-	device_remove_file(disk_to_dev(nbd->disk), &pid_attr);
-out_recv:
 	mutex_lock(&nbd->config_lock);
-	nbd->task_recv = NULL;
-out_err:
-	clear_bit(NBD_RUNNING, &nbd->runtime_flags);
-	nbd_clear_sock(nbd, bdev);
-
+out:
 	/* user requested, ignore socket errors */
 	if (test_bit(NBD_DISCONNECT_REQUESTED, &nbd->runtime_flags))
 		error = 0;
 	if (test_bit(NBD_TIMEDOUT, &nbd->runtime_flags))
 		error = -ETIMEDOUT;
-
-	nbd_reset(nbd);
+	nbd_put(nbd);
 	return error;
 }
 
@@ -905,16 +953,21 @@ static int nbd_ioctl(struct block_device *bdev, fmode_t mode,
 {
 	struct nbd_device *nbd = bdev->bd_disk->private_data;
 	int error;
+	bool need_put = false;
 
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
 	BUG_ON(nbd->magic != NBD_MAGIC);
 
+	/* This is to keep us from doing the final put under the config_lock. */
+	if (nbd_get_unless_zero(nbd))
+		need_put = true;
 	mutex_lock(&nbd->config_lock);
 	error = __nbd_ioctl(bdev, nbd, cmd, arg);
 	mutex_unlock(&nbd->config_lock);
-
+	if (need_put)
+		nbd_put(nbd);
 	return error;
 }
 
@@ -1141,12 +1194,15 @@ static int nbd_dev_add(int index)
 
 	nbd->magic = NBD_MAGIC;
 	mutex_init(&nbd->config_lock);
+	atomic_set(&nbd->refs, 0);
+	nbd->args = NULL;
 	disk->major = NBD_MAJOR;
 	disk->first_minor = index << part_shift;
 	disk->fops = &nbd_fops;
 	disk->private_data = nbd;
 	sprintf(disk->disk_name, "nbd%d", index);
 	init_waitqueue_head(&nbd->recv_wq);
+	init_waitqueue_head(&nbd->socks_wq);
 	nbd_reset(nbd);
 	add_disk(disk);
 	return index;
-- 
2.7.4


* [PATCH 3/6] nbd: stop using the bdev everywhere
  2017-02-28 16:57 [PATCH 0/6] Lots of NBD fixes and enhancements Josef Bacik
  2017-02-28 16:57 ` [PATCH 1/6] nbd: handle single path failures gracefully Josef Bacik
  2017-02-28 16:57 ` [PATCH 2/6] nbd: ref count the nbd device Josef Bacik
@ 2017-02-28 16:57 ` Josef Bacik
  2017-02-28 16:57 ` [PATCH 4/6] nbd: set queue timeout properly Josef Bacik
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 10+ messages in thread
From: Josef Bacik @ 2017-02-28 16:57 UTC (permalink / raw)
  To: axboe, nbd-general, linux-block, kernel-team

In preparation for the upcoming netlink interface we need to not rely on
already having the bdev for the NBD device we are doing operations on.
Instead of passing the bdev around, just use it in places where we know
we already have the bdev.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 drivers/block/nbd.c | 93 ++++++++++++++++++++++-------------------------------
 1 file changed, 39 insertions(+), 54 deletions(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 6760ee5..059c80a 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -119,11 +119,6 @@ static inline struct device *nbd_to_dev(struct nbd_device *nbd)
 	return disk_to_dev(nbd->disk);
 }
 
-static bool nbd_is_connected(struct nbd_device *nbd)
-{
-	return !!nbd->task_recv;
-}
-
 static const char *nbdcmd_to_ascii(int cmd)
 {
 	switch (cmd) {
@@ -155,31 +150,26 @@ static int nbd_get_unless_zero(struct nbd_device *nbd)
 	return atomic_add_unless(&nbd->refs, 1, 0);
 }
 
-static int nbd_size_clear(struct nbd_device *nbd, struct block_device *bdev)
+static void nbd_size_clear(struct nbd_device *nbd)
 {
-	bd_set_size(bdev, 0);
 	set_capacity(nbd->disk, 0);
 	kobject_uevent(&nbd_to_dev(nbd)->kobj, KOBJ_CHANGE);
-
-	return 0;
 }
 
-static void nbd_size_update(struct nbd_device *nbd, struct block_device *bdev)
+static void nbd_size_update(struct nbd_device *nbd)
 {
 	blk_queue_logical_block_size(nbd->disk->queue, nbd->blksize);
 	blk_queue_physical_block_size(nbd->disk->queue, nbd->blksize);
-	bd_set_size(bdev, nbd->bytesize);
 	set_capacity(nbd->disk, nbd->bytesize >> 9);
 	kobject_uevent(&nbd_to_dev(nbd)->kobj, KOBJ_CHANGE);
 }
 
-static void nbd_size_set(struct nbd_device *nbd, struct block_device *bdev,
-			loff_t blocksize, loff_t nr_blocks)
+static void nbd_size_set(struct nbd_device *nbd, loff_t blocksize,
+			 loff_t nr_blocks)
 {
 	nbd->blksize = blocksize;
 	nbd->bytesize = blocksize * nr_blocks;
-	if (nbd_is_connected(nbd))
-		nbd_size_update(nbd, bdev);
+	nbd_size_update(nbd);
 }
 
 static void nbd_end_request(struct nbd_cmd *cmd)
@@ -670,8 +660,7 @@ static int nbd_queue_rq(struct blk_mq_hw_ctx *hctx,
 	return BLK_MQ_RQ_QUEUE_OK;
 }
 
-static int nbd_add_socket(struct nbd_device *nbd, struct block_device *bdev,
-			  unsigned long arg)
+static int nbd_add_socket(struct nbd_device *nbd, unsigned long arg)
 {
 	struct socket *sock;
 	struct nbd_sock **socks;
@@ -712,8 +701,6 @@ static int nbd_add_socket(struct nbd_device *nbd, struct block_device *bdev,
 	nsock->sock = sock;
 	socks[nbd->num_connections++] = nsock;
 
-	if (max_part)
-		bdev->bd_invalidated = 1;
 	err = 0;
 out:
 	return err;
@@ -734,18 +721,19 @@ static void nbd_reset(struct nbd_device *nbd)
 
 static void nbd_bdev_reset(struct block_device *bdev)
 {
-	set_device_ro(bdev, false);
-	bdev->bd_inode->i_size = 0;
+	bd_set_size(bdev, 0);
 	if (max_part > 0) {
 		blkdev_reread_part(bdev);
 		bdev->bd_invalidated = 1;
 	}
 }
 
-static void nbd_parse_flags(struct nbd_device *nbd, struct block_device *bdev)
+static void nbd_parse_flags(struct nbd_device *nbd)
 {
 	if (nbd->flags & NBD_FLAG_READ_ONLY)
-		set_device_ro(bdev, true);
+		set_disk_ro(nbd->disk, true);
+	else
+		set_disk_ro(nbd->disk, false);
 	if (nbd->flags & NBD_FLAG_SEND_TRIM)
 		queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, nbd->disk->queue);
 	if (nbd->flags & NBD_FLAG_SEND_FLUSH)
@@ -770,16 +758,11 @@ static void send_disconnects(struct nbd_device *nbd)
 	}
 }
 
-static int nbd_disconnect(struct nbd_device *nbd, struct block_device *bdev)
+static int nbd_disconnect(struct nbd_device *nbd)
 {
 	dev_info(disk_to_dev(nbd->disk), "NBD_DISCONNECT\n");
 	if (!nbd_get_unless_zero(nbd))
 		return -EINVAL;
-
-	mutex_unlock(&nbd->config_lock);
-	fsync_bdev(bdev);
-	mutex_lock(&nbd->config_lock);
-
 	if (!test_and_set_bit(NBD_DISCONNECT_REQUESTED,
 			      &nbd->runtime_flags))
 		send_disconnects(nbd);
@@ -787,33 +770,22 @@ static int nbd_disconnect(struct nbd_device *nbd, struct block_device *bdev)
 	return 0;
 }
 
-static int nbd_clear_sock(struct nbd_device *nbd, struct block_device *bdev)
+static void nbd_clear_sock(struct nbd_device *nbd)
 {
 	sock_shutdown(nbd);
 	nbd_clear_que(nbd);
-	kill_bdev(bdev);
-	nbd_bdev_reset(bdev);
 	nbd->task_setup = NULL;
-	if (test_and_clear_bit(NBD_HAS_SOCKS_REF, &nbd->runtime_flags))
-		nbd_put(nbd);
-	return 0;
 }
 
 static void nbd_put(struct nbd_device *nbd)
 {
 	if (atomic_dec_and_test(&nbd->refs)) {
-		struct block_device *bdev;
-
-		bdev = bdget_disk(nbd->disk, 0);
-		if (!bdev)
-			return;
-
 		mutex_lock(&nbd->config_lock);
 		nbd_dev_dbg_close(nbd);
-		nbd_size_clear(nbd, bdev);
+		nbd_size_clear(nbd);
 		device_remove_file(disk_to_dev(nbd->disk), &pid_attr);
 		nbd->task_recv = NULL;
-		nbd_clear_sock(nbd, bdev);
+		nbd_clear_sock(nbd);
 		if (nbd->num_connections) {
 			int i;
 			for (i = 0; i < nbd->num_connections; i++)
@@ -825,7 +797,6 @@ static void nbd_put(struct nbd_device *nbd)
 		kfree(nbd->args);
 		nbd_reset(nbd);
 		mutex_unlock(&nbd->config_lock);
-		bdput(bdev);
 		module_put(THIS_MODULE);
 	}
 }
@@ -846,7 +817,6 @@ static int nbd_start_device(struct nbd_device *nbd, struct block_device *bdev)
 		error = -EINVAL;
 		goto out;
 	}
-
 	if (num_connections > 1 &&
 	    !(nbd->flags & NBD_FLAG_CAN_MULTI_CONN)) {
 		dev_err(disk_to_dev(nbd->disk), "server does not support multiple connections per device.\n");
@@ -854,6 +824,9 @@ static int nbd_start_device(struct nbd_device *nbd, struct block_device *bdev)
 		goto out;
 	}
 
+	if (max_part)
+		bdev->bd_invalidated = 1;
+
 	blk_mq_update_nr_hw_queues(&nbd->tag_set, nbd->num_connections);
 	args = kcalloc(num_connections, sizeof(*args), GFP_KERNEL);
 	if (!args) {
@@ -864,15 +837,16 @@ static int nbd_start_device(struct nbd_device *nbd, struct block_device *bdev)
 	nbd->task_recv = current;
 	mutex_unlock(&nbd->config_lock);
 
-	nbd_parse_flags(nbd, bdev);
+	nbd_parse_flags(nbd);
 
 	error = device_create_file(disk_to_dev(nbd->disk), &pid_attr);
 	if (error) {
 		dev_err(disk_to_dev(nbd->disk), "device_create_file failed!\n");
 		goto out;
 	}
-
-	nbd_size_update(nbd, bdev);
+	if (max_part)
+		bdev->bd_invalidated = 1;
+	bd_set_size(bdev, nbd->bytesize);
 
 	nbd_dev_dbg_init(nbd);
 	for (i = 0; i < num_connections; i++) {
@@ -901,27 +875,38 @@ static int nbd_start_device(struct nbd_device *nbd, struct block_device *bdev)
 	return error;
 }
 
+static void nbd_clear_sock_ioctl(struct nbd_device *nbd,
+				 struct block_device *bdev)
+{
+	nbd_clear_sock(nbd);
+	kill_bdev(bdev);
+	nbd_bdev_reset(bdev);
+	if (test_and_clear_bit(NBD_HAS_SOCKS_REF, &nbd->runtime_flags))
+		nbd_put(nbd);
+}
+
 /* Must be called with config_lock held */
 static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd,
 		       unsigned int cmd, unsigned long arg)
 {
 	switch (cmd) {
 	case NBD_DISCONNECT:
-		return nbd_disconnect(nbd, bdev);
+		return nbd_disconnect(nbd);
 	case NBD_CLEAR_SOCK:
-		return nbd_clear_sock(nbd, bdev);
+		nbd_clear_sock_ioctl(nbd, bdev);
+		return 0;
 	case NBD_SET_SOCK:
-		return nbd_add_socket(nbd, bdev, arg);
+		return nbd_add_socket(nbd, arg);
 	case NBD_SET_BLKSIZE:
-		nbd_size_set(nbd, bdev, arg,
+		nbd_size_set(nbd, arg,
 			     div_s64(nbd->bytesize, arg));
 		return 0;
 	case NBD_SET_SIZE:
-		nbd_size_set(nbd, bdev, nbd->blksize,
+		nbd_size_set(nbd, nbd->blksize,
 			     div_s64(arg, nbd->blksize));
 		return 0;
 	case NBD_SET_SIZE_BLOCKS:
-		nbd_size_set(nbd, bdev, nbd->blksize, arg);
+		nbd_size_set(nbd, nbd->blksize, arg);
 		return 0;
 	case NBD_SET_TIMEOUT:
 		nbd->tag_set.timeout = arg * HZ;
-- 
2.7.4


* [PATCH 4/6] nbd: set queue timeout properly
  2017-02-28 16:57 [PATCH 0/6] Lots of NBD fixes and enhancements Josef Bacik
                   ` (2 preceding siblings ...)
  2017-02-28 16:57 ` [PATCH 3/6] nbd: stop using the bdev everywhere Josef Bacik
@ 2017-02-28 16:57 ` Josef Bacik
  2017-02-28 16:57 ` [PATCH 5/6] nbd: handle ERESTARTSYS properly Josef Bacik
  2017-02-28 16:57 ` [PATCH 6/6] nbd: add a basic netlink interface Josef Bacik
  5 siblings, 0 replies; 10+ messages in thread
From: Josef Bacik @ 2017-02-28 16:57 UTC (permalink / raw)
  To: axboe, nbd-general, linux-block, kernel-team

We can't just set the timeout on the tagset; we have to set it on the
queue as well, since the queue has already been set up by this point.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 drivers/block/nbd.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 059c80a..ac5a03a 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -909,7 +909,10 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd,
 		nbd_size_set(nbd, nbd->blksize, arg);
 		return 0;
 	case NBD_SET_TIMEOUT:
-		nbd->tag_set.timeout = arg * HZ;
+		if (arg) {
+			nbd->tag_set.timeout = arg * HZ;
+			blk_queue_rq_timeout(nbd->disk->queue, arg * HZ);
+		}
 		return 0;
 
 	case NBD_SET_FLAGS:
-- 
2.7.4


* [PATCH 5/6] nbd: handle ERESTARTSYS properly
  2017-02-28 16:57 [PATCH 0/6] Lots of NBD fixes and enhancements Josef Bacik
                   ` (3 preceding siblings ...)
  2017-02-28 16:57 ` [PATCH 4/6] nbd: set queue timeout properly Josef Bacik
@ 2017-02-28 16:57 ` Josef Bacik
  2017-02-28 16:57 ` [PATCH 6/6] nbd: add a basic netlink interface Josef Bacik
  5 siblings, 0 replies; 10+ messages in thread
From: Josef Bacik @ 2017-02-28 16:57 UTC (permalink / raw)
  To: axboe, nbd-general, linux-block, kernel-team

We can submit IO in a process's context, which means there can be
pending signals.  This isn't a fatal error for NBD, but it does require
some finesse.  If the signal arrives before we transmit anything then we
are ok: just requeue the request and carry on.  However if we've done a
partial transmit we can't allow anything else to be transmitted on this
socket until we transmit the remaining part of the request.  Deal with
this by keeping track of how much we've sent for the current request;
if we get an ERESTARTSYS during any part of our transmission, save
the state of that request and requeue the IO.  If anybody tries to
submit a request that isn't our pending request, requeue that
request until we are able to service the one that is pending.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 drivers/block/nbd.c | 138 +++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 116 insertions(+), 22 deletions(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index ac5a03a..aba1aba 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -49,6 +49,8 @@ struct nbd_sock {
 	struct mutex tx_lock;
 	bool dead;
 	int fallback_index;
+	struct request *pending;
+	int sent;
 };
 
 struct recv_thread_args {
@@ -145,6 +147,13 @@ static struct device_attribute pid_attr = {
 	.show = pid_show,
 };
 
+static void nbd_mark_nsock_dead(struct nbd_sock *nsock)
+{
+	nsock->dead = true;
+	nsock->pending = NULL;
+	nsock->sent = 0;
+}
+
 static int nbd_get_unless_zero(struct nbd_device *nbd)
 {
 	return atomic_add_unless(&nbd->refs, 1, 0);
@@ -200,8 +209,8 @@ static void sock_shutdown(struct nbd_device *nbd)
 		struct nbd_sock *nsock = nbd->socks[i];
 		mutex_lock(&nsock->tx_lock);
 		kernel_sock_shutdown(nsock->sock, SHUT_RDWR);
+		nbd_mark_nsock_dead(nsock);
 		mutex_unlock(&nsock->tx_lock);
-		nsock->dead = true;
 	}
 	dev_warn(disk_to_dev(nbd->disk), "shutting down sockets\n");
 }
@@ -229,8 +238,8 @@ static enum blk_eh_timer_return nbd_xmit_timeout(struct request *req,
 				struct nbd_sock *nsock =
 					nbd->socks[cmd->index];
 				mutex_lock(&nsock->tx_lock);
-				nsock->dead = true;
 				kernel_sock_shutdown(nsock->sock, SHUT_RDWR);
+				nbd_mark_nsock_dead(nsock);
 				mutex_unlock(&nsock->tx_lock);
 			}
 			blk_mq_requeue_request(req, true);
@@ -253,7 +262,7 @@ static enum blk_eh_timer_return nbd_xmit_timeout(struct request *req,
  *  Send or receive packet.
  */
 static int sock_xmit(struct nbd_device *nbd, int index, int send, void *buf,
-		     int size, int msg_flags)
+		     int size, int msg_flags, int *sent)
 {
 	struct socket *sock = nbd->socks[index]->sock;
 	int result;
@@ -292,6 +301,8 @@ static int sock_xmit(struct nbd_device *nbd, int index, int send, void *buf,
 		}
 		size -= result;
 		buf += result;
+		if (sent)
+			*sent += result;
 	} while (size > 0);
 
 	tsk_restore_flags(current, pflags, PF_MEMALLOC);
@@ -300,12 +311,28 @@ static int sock_xmit(struct nbd_device *nbd, int index, int send, void *buf,
 }
 
 static inline int sock_send_bvec(struct nbd_device *nbd, int index,
-				 struct bio_vec *bvec, int flags)
+				 struct bio_vec *bvec, int flags, int *skip,
+				 int *sent)
 {
+	void *kaddr;
+	unsigned int offset = bvec->bv_offset;
+	unsigned int len = bvec->bv_len;
 	int result;
-	void *kaddr = kmap(bvec->bv_page);
-	result = sock_xmit(nbd, index, 1, kaddr + bvec->bv_offset,
-			   bvec->bv_len, flags);
+
+	/* If we had partially sent this request previously we need to skip over
+	 * what we've already sent.
+	 */
+	if (*skip) {
+		if (len <= *skip) {
+			*skip -= len;
+			return len;
+		}
+		offset += *skip;
+		len -= *skip;
+		*skip = 0;
+	}
+	kaddr = kmap(bvec->bv_page);
+	result = sock_xmit(nbd, index, 1, kaddr + offset, len, flags, sent);
 	kunmap(bvec->bv_page);
 	return result;
 }
@@ -314,12 +341,14 @@ static inline int sock_send_bvec(struct nbd_device *nbd, int index,
 static int nbd_send_cmd(struct nbd_device *nbd, struct nbd_cmd *cmd, int index)
 {
 	struct request *req = blk_mq_rq_from_pdu(cmd);
+	struct nbd_sock *nsock = nbd->socks[index];
 	int result;
 	struct nbd_request request;
 	unsigned long size = blk_rq_bytes(req);
 	struct bio *bio;
 	u32 type;
 	u32 tag = blk_mq_unique_tag(req);
+	int sent = nsock->sent, skip = 0;
 
 	switch (req_op(req)) {
 	case REQ_OP_DISCARD:
@@ -345,6 +374,15 @@ static int nbd_send_cmd(struct nbd_device *nbd, struct nbd_cmd *cmd, int index)
 		return -EIO;
 	}
 
+	/* We did a partial send previously, and we at least sent the whole
+	 * request struct, so just go and send the rest of the pages in the
+	 * request.
+	 */
+	if (sent && sent >= sizeof(request)) {
+		skip = sent - sizeof(request);
+		goto send_pages;
+	}
+
 	cmd->index = index;
 	memset(&request, 0, sizeof(request));
 	request.magic = htonl(NBD_REQUEST_MAGIC);
@@ -358,16 +396,33 @@ static int nbd_send_cmd(struct nbd_device *nbd, struct nbd_cmd *cmd, int index)
 	dev_dbg(nbd_to_dev(nbd), "request %p: sending control (%s@%llu,%uB)\n",
 		cmd, nbdcmd_to_ascii(type),
 		(unsigned long long)blk_rq_pos(req) << 9, blk_rq_bytes(req));
-	result = sock_xmit(nbd, index, 1, &request, sizeof(request),
-			(type == NBD_CMD_WRITE) ? MSG_MORE : 0);
+
+	/* We may have only partially sent the request previously, so handle
+	 * this horrible, horrible, terrifying possibility here.
+	 */
+	result = sock_xmit(nbd, index, 1, (void *)(&request) + sent,
+			   sizeof(request) - sent,
+			   (type == NBD_CMD_WRITE) ? MSG_MORE : 0, &sent);
 	if (result <= 0) {
+		if (result == -ERESTARTSYS) {
+			/* If we havne't sent anything we can just return BUSY,
+			 * however if we have sent something we need to make
+			 * sure we only allow this req to be sent until we are
+			 * completely done.
+			 */
+			if (sent) {
+				nsock->pending = req;
+				nsock->sent = sent;
+			}
+			return BLK_MQ_RQ_QUEUE_BUSY;
+		}
 		dev_err_ratelimited(disk_to_dev(nbd->disk),
 			"Send control failed (result %d)\n", result);
 		return -EAGAIN;
 	}
-
+send_pages:
 	if (type != NBD_CMD_WRITE)
-		return 0;
+		goto out;
 
 	bio = req->bio;
 	while (bio) {
@@ -381,8 +436,18 @@ static int nbd_send_cmd(struct nbd_device *nbd, struct nbd_cmd *cmd, int index)
 
 			dev_dbg(nbd_to_dev(nbd), "request %p: sending %d bytes data\n",
 				cmd, bvec.bv_len);
-			result = sock_send_bvec(nbd, index, &bvec, flags);
+			result = sock_send_bvec(nbd, index, &bvec, flags,
+						&skip, &sent);
 			if (result <= 0) {
+				if (result == -ERESTARTSYS) {
+					/* We've already sent the header, we
+					 * have no choice but to set pending and
+					 * return BUSY.
+					 */
+					nsock->pending = req;
+					nsock->sent = sent;
+					return BLK_MQ_RQ_QUEUE_BUSY;
+				}
 				dev_err(disk_to_dev(nbd->disk),
 					"Send data failed (result %d)\n",
 					result);
@@ -399,6 +464,9 @@ static int nbd_send_cmd(struct nbd_device *nbd, struct nbd_cmd *cmd, int index)
 		}
 		bio = next;
 	}
+out:
+	nsock->pending = NULL;
+	nsock->sent = 0;
 	return 0;
 }
 
@@ -408,7 +476,7 @@ static inline int sock_recv_bvec(struct nbd_device *nbd, int index,
 	int result;
 	void *kaddr = kmap(bvec->bv_page);
 	result = sock_xmit(nbd, index, 0, kaddr + bvec->bv_offset,
-			   bvec->bv_len, MSG_WAITALL);
+			   bvec->bv_len, MSG_WAITALL, NULL);
 	kunmap(bvec->bv_page);
 	return result;
 }
@@ -430,7 +498,8 @@ static struct nbd_cmd *nbd_read_stat(struct nbd_device *nbd, int index)
 	u32 tag;
 
 	reply.magic = 0;
-	result = sock_xmit(nbd, index, 0, &reply, sizeof(reply), MSG_WAITALL);
+	result = sock_xmit(nbd, index, 0, &reply, sizeof(reply), MSG_WAITALL,
+			   NULL);
 	if (result <= 0) {
 		if (!nbd_disconnected(nbd))
 			dev_err(disk_to_dev(nbd->disk),
@@ -510,7 +579,10 @@ static void recv_work(struct work_struct *work)
 	while (1) {
 		cmd = nbd_read_stat(nbd, args->index);
 		if (IS_ERR(cmd)) {
-			nbd->socks[args->index]->dead = true;
+			struct nbd_sock *nsock = nbd->socks[args->index];
+			mutex_lock(&nsock->tx_lock);
+			nbd_mark_nsock_dead(nsock);
+			mutex_unlock(&nsock->tx_lock);
 			ret = PTR_ERR(cmd);
 			break;
 		}
@@ -622,14 +694,26 @@ static int nbd_handle_cmd(struct nbd_cmd *cmd, int index)
 	 * returns EAGAIN can be retried on a different socket.
 	 */
 	mutex_lock(&nsock->tx_lock);
+
+	/* Handle the case that we have a pending request that was partially
+	 * transmitted that _has_ to be serviced first.  We need to call requeue
+	 * here so that it gets put _after_ the request that is already on the
+	 * dispatch list.
+	 */
+	if (unlikely(nsock->pending && nsock->pending != req)) {
+		blk_mq_requeue_request(req, true);
+		ret = 0;
+		goto out;
+	}
 	ret = nbd_send_cmd(nbd, cmd, index);
 	if (ret == -EAGAIN) {
 		dev_err_ratelimited(disk_to_dev(nbd->disk),
 				    "Request send failed trying another connection\n");
-		nsock->dead = true;
+		nbd_mark_nsock_dead(nsock);
 		mutex_unlock(&nsock->tx_lock);
 		goto again;
 	}
+out:
 	mutex_unlock(&nsock->tx_lock);
 	nbd_put(nbd);
 	return ret;
@@ -639,6 +723,7 @@ static int nbd_queue_rq(struct blk_mq_hw_ctx *hctx,
 			const struct blk_mq_queue_data *bd)
 {
 	struct nbd_cmd *cmd = blk_mq_rq_to_pdu(bd->rq);
+	int ret;
 
 	/*
 	 * Since we look at the bio's to send the request over the network we
@@ -651,13 +736,20 @@ static int nbd_queue_rq(struct blk_mq_hw_ctx *hctx,
 	 */
 	init_completion(&cmd->send_complete);
 	blk_mq_start_request(bd->rq);
-	if (nbd_handle_cmd(cmd, hctx->queue_num) != 0) {
-		bd->rq->errors++;
-		nbd_end_request(cmd);
-	}
+
+	/* We can be called directly from the user space process, which means we
+	 * could possibly have signals pending so our sendmsg will fail.  In
+	 * this case we need to return that we are busy, otherwise error out as
+	 * appropriate.
+	 */
+	ret = nbd_handle_cmd(cmd, hctx->queue_num);
+	if (ret < 0)
+		ret = BLK_MQ_RQ_QUEUE_ERROR;
+	if (!ret)
+		ret = BLK_MQ_RQ_QUEUE_OK;
 	complete(&cmd->send_complete);
 
-	return BLK_MQ_RQ_QUEUE_OK;
+	return ret;
 }
 
 static int nbd_add_socket(struct nbd_device *nbd, unsigned long arg)
@@ -699,6 +791,8 @@ static int nbd_add_socket(struct nbd_device *nbd, unsigned long arg)
 	nsock->dead = false;
 	mutex_init(&nsock->tx_lock);
 	nsock->sock = sock;
+	nsock->pending = NULL;
+	nsock->sent = 0;
 	socks[nbd->num_connections++] = nsock;
 
 	err = 0;
@@ -751,7 +845,7 @@ static void send_disconnects(struct nbd_device *nbd)
 	request.type = htonl(NBD_CMD_DISC);
 
 	for (i = 0; i < nbd->num_connections; i++) {
-		ret = sock_xmit(nbd, i, 1, &request, sizeof(request), 0);
+		ret = sock_xmit(nbd, i, 1, &request, sizeof(request), 0, NULL);
 		if (ret <= 0)
 			dev_err(disk_to_dev(nbd->disk),
 				"Send disconnect failed %d\n", ret);
-- 
2.7.4


* [PATCH 6/6] nbd: add a basic netlink interface
  2017-02-28 16:57 [PATCH 0/6] Lots of NBD fixes and enhancements Josef Bacik
                   ` (4 preceding siblings ...)
  2017-02-28 16:57 ` [PATCH 5/6] nbd: handle ERESTARTSYS properly Josef Bacik
@ 2017-02-28 16:57 ` Josef Bacik
  2017-03-08 10:07   ` [Nbd] " Wouter Verhelst
  5 siblings, 1 reply; 10+ messages in thread
From: Josef Bacik @ 2017-02-28 16:57 UTC (permalink / raw)
  To: axboe, nbd-general, linux-block, kernel-team

The existing ioctl interface for configuring NBD devices is a bit
cumbersome and hard to extend.  The other problem is we leave a
userspace app sitting in its syscall until the device disconnects,
which is less than ideal.

This patch introduces a netlink interface for adding and disconnecting
nbd devices.  This has the benefits of being easily extendable without
breaking older userspace applications, and allows us to configure an nbd
device without leaving a userspace app sitting waiting for the device to
disconnect.

With this interface we also gain the ability to configure more devices
than are preallocated at insmod time.  We also gain the ability to omit
a particular device and be assigned a free one automatically, so that
userspace doesn't need to hunt for a free device to configure.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 drivers/block/nbd.c              | 328 ++++++++++++++++++++++++++++++++++-----
 include/uapi/linux/nbd-netlink.h |  69 ++++++++
 2 files changed, 354 insertions(+), 43 deletions(-)
 create mode 100644 include/uapi/linux/nbd-netlink.h

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index aba1aba..62a659e 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -40,6 +40,8 @@
 #include <asm/types.h>
 
 #include <linux/nbd.h>
+#include <linux/nbd-netlink.h>
+#include <net/genetlink.h>
 
 static DEFINE_IDR(nbd_index_idr);
 static DEFINE_MUTEX(nbd_index_mutex);
@@ -63,6 +65,7 @@ struct recv_thread_args {
 #define NBD_DISCONNECT_REQUESTED	1
 #define NBD_DISCONNECTED		2
 #define NBD_HAS_SOCKS_REF		3
+#define NBD_BOUND			4
 
 struct nbd_device {
 	u32 flags;
@@ -75,6 +78,7 @@ struct nbd_device {
 
 	struct recv_thread_args *args;
 	int magic;
+	int index;
 
 	struct blk_mq_tag_set tag_set;
 
@@ -115,6 +119,7 @@ static int part_shift;
 static int nbd_dev_dbg_init(struct nbd_device *nbd);
 static void nbd_dev_dbg_close(struct nbd_device *nbd);
 static void nbd_put(struct nbd_device *nbd);
+static void nbd_connect_reply(struct genl_info *info, int index);
 
 static inline struct device *nbd_to_dev(struct nbd_device *nbd)
 {
@@ -752,7 +757,8 @@ static int nbd_queue_rq(struct blk_mq_hw_ctx *hctx,
 	return ret;
 }
 
-static int nbd_add_socket(struct nbd_device *nbd, unsigned long arg)
+static int nbd_add_socket(struct nbd_device *nbd, unsigned long arg,
+			  bool netlink)
 {
 	struct socket *sock;
 	struct nbd_sock **socks;
@@ -764,13 +770,14 @@ static int nbd_add_socket(struct nbd_device *nbd, unsigned long arg)
 		return err;
 
 	err = -EINVAL;
-	if (!nbd->task_setup && !atomic_cmpxchg(&nbd->refs, 0, 1)) {
+	if (!netlink && !nbd->task_setup &&
+	    !atomic_cmpxchg(&nbd->refs, 0, 1)) {
 		nbd->task_setup = current;
 		set_bit(NBD_HAS_SOCKS_REF, &nbd->runtime_flags);
 		try_module_get(THIS_MODULE);
 	}
 
-	if (nbd->task_setup != current) {
+	if (!netlink && nbd->task_setup != current) {
 		dev_err(disk_to_dev(nbd->disk),
 			"Device being setup by another task");
 		goto out;
@@ -895,52 +902,36 @@ static void nbd_put(struct nbd_device *nbd)
 	}
 }
 
-static int nbd_start_device(struct nbd_device *nbd, struct block_device *bdev)
+static int nbd_start_device(struct nbd_device *nbd)
 {
 	struct recv_thread_args *args;
 	int num_connections = nbd->num_connections;
 	int error = 0, i;
 
-	if (!nbd_get_unless_zero(nbd))
+	if (nbd->task_recv)
+		return -EBUSY;
+	if (!nbd->socks)
 		return -EINVAL;
-	if (nbd->task_recv) {
-		error = -EBUSY;
-		goto out;
-	}
-	if (!nbd->socks) {
-		error = -EINVAL;
-		goto out;
-	}
 	if (num_connections > 1 &&
 	    !(nbd->flags & NBD_FLAG_CAN_MULTI_CONN)) {
 		dev_err(disk_to_dev(nbd->disk), "server does not support multiple connections per device.\n");
-		error = -EINVAL;
-		goto out;
+		return -EINVAL;
 	}
 
-	if (max_part)
-		bdev->bd_invalidated = 1;
-
 	blk_mq_update_nr_hw_queues(&nbd->tag_set, nbd->num_connections);
 	args = kcalloc(num_connections, sizeof(*args), GFP_KERNEL);
-	if (!args) {
-		error = -ENOMEM;
-		goto out;
-	}
+	if (!args)
+		return -ENOMEM;
 	nbd->args = args;
 	nbd->task_recv = current;
-	mutex_unlock(&nbd->config_lock);
 
 	nbd_parse_flags(nbd);
 
 	error = device_create_file(disk_to_dev(nbd->disk), &pid_attr);
 	if (error) {
 		dev_err(disk_to_dev(nbd->disk), "device_create_file failed!\n");
-		goto out;
+		return error;
 	}
-	if (max_part)
-		bdev->bd_invalidated = 1;
-	bd_set_size(bdev, nbd->bytesize);
 
 	nbd_dev_dbg_init(nbd);
 	for (i = 0; i < num_connections; i++) {
@@ -952,21 +943,39 @@ static int nbd_start_device(struct nbd_device *nbd, struct block_device *bdev)
 		args[i].index = i;
 		queue_work(recv_workqueue, &args[i].work);
 	}
-	error = wait_event_interruptible(nbd->recv_wq,
+	return error;
+}
+
+static int nbd_start_device_ioctl(struct nbd_device *nbd, struct block_device *bdev)
+{
+	int ret, i;
+
+	if (!nbd_get_unless_zero(nbd))
+		return -EINVAL;
+
+	ret = nbd_start_device(nbd);
+	if (ret)
+		return ret;
+
+	bd_set_size(bdev, nbd->bytesize);
+	if (max_part)
+		bdev->bd_invalidated = 1;
+	mutex_unlock(&nbd->config_lock);
+	ret = wait_event_interruptible(nbd->recv_wq,
 					 atomic_read(&nbd->recv_threads) == 0);
-	if (error)
+	if (ret)
 		sock_shutdown(nbd);
-	for (i = 0; i < num_connections; i++)
-		flush_work(&args[i].work);
+	for (i = 0; i < nbd->num_connections; i++)
+		flush_work(&nbd->args[i].work);
 	mutex_lock(&nbd->config_lock);
-out:
+	bd_set_size(bdev, 0);
 	/* user requested, ignore socket errors */
 	if (test_bit(NBD_DISCONNECT_REQUESTED, &nbd->runtime_flags))
-		error = 0;
+		ret = 0;
 	if (test_bit(NBD_TIMEDOUT, &nbd->runtime_flags))
-		error = -ETIMEDOUT;
+		ret = -ETIMEDOUT;
 	nbd_put(nbd);
-	return error;
+	return ret;
 }
 
 static void nbd_clear_sock_ioctl(struct nbd_device *nbd,
@@ -990,7 +999,7 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd,
 		nbd_clear_sock_ioctl(nbd, bdev);
 		return 0;
 	case NBD_SET_SOCK:
-		return nbd_add_socket(nbd, arg);
+		return nbd_add_socket(nbd, arg, false);
 	case NBD_SET_BLKSIZE:
 		nbd_size_set(nbd, arg,
 			     div_s64(nbd->bytesize, arg));
@@ -1013,7 +1022,7 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd,
 		nbd->flags = arg;
 		return 0;
 	case NBD_DO_IT:
-		return nbd_start_device(nbd, bdev);
+		return nbd_start_device_ioctl(nbd, bdev);
 	case NBD_CLEAR_QUE:
 		/*
 		 * This is for compatibility only.  The queue is always cleared
@@ -1034,7 +1043,7 @@ static int nbd_ioctl(struct block_device *bdev, fmode_t mode,
 		     unsigned int cmd, unsigned long arg)
 {
 	struct nbd_device *nbd = bdev->bd_disk->private_data;
-	int error;
+	int error = -EINVAL;
 	bool need_put = false;
 
 	if (!capable(CAP_SYS_ADMIN))
@@ -1046,7 +1055,15 @@ static int nbd_ioctl(struct block_device *bdev, fmode_t mode,
 	if (nbd_get_unless_zero(nbd))
 		need_put = true;
 	mutex_lock(&nbd->config_lock);
-	error = __nbd_ioctl(bdev, nbd, cmd, arg);
+
+	/* Don't allow ioctl operations on a nbd device that was created with
+	 * netlink, unless it's DISCONNECT or CLEAR_SOCK, which are fine.
+	 */
+	if (!test_bit(NBD_BOUND, &nbd->runtime_flags) ||
+	    (cmd == NBD_DISCONNECT || cmd == NBD_CLEAR_SOCK))
+		error = __nbd_ioctl(bdev, nbd, cmd, arg);
+	else
+		dev_err(nbd_to_dev(nbd), "Cannot use ioctl interface on a netlink controlled device.\n");
 	mutex_unlock(&nbd->config_lock);
 	if (need_put)
 		nbd_put(nbd);
@@ -1242,6 +1259,7 @@ static int nbd_dev_add(int index)
 	if (err < 0)
 		goto out_free_disk;
 
+	nbd->index = index;
 	nbd->disk = disk;
 	nbd->tag_set.ops = &nbd_mq_ops;
 	nbd->tag_set.nr_hw_queues = 1;
@@ -1301,10 +1319,228 @@ static int nbd_dev_add(int index)
 	return err;
 }
 
-/*
- * And here should be modules and kernel interface 
- *  (Just smiley confuses emacs :-)
- */
+static int find_free_cb(int id, void *ptr, void *data)
+{
+	struct nbd_device *nbd = ptr;
+	struct nbd_device **found = data;
+
+	if (!atomic_read(&nbd->refs)) {
+		*found = nbd;
+		return 1;
+	}
+	return 0;
+}
+
+/* Netlink interface. */
+static struct nla_policy nbd_attr_policy[NBD_ATTR_MAX + 1] = {
+	[NBD_ATTR_INDEX]		=	{ .type = NLA_U32 },
+	[NBD_ATTR_SIZE_BYTES]		=	{ .type = NLA_U64 },
+	[NBD_ATTR_BLOCK_SIZE_BYTES]	=	{ .type = NLA_U64 },
+	[NBD_ATTR_TIMEOUT]		=	{ .type = NLA_U64 },
+	[NBD_ATTR_SERVER_FLAGS]		=	{ .type = NLA_U64 },
+	[NBD_ATTR_CLIENT_FLAGS]		=	{ .type = NLA_U64 },
+	[NBD_ATTR_SOCKETS]		=	{ .type = NLA_NESTED},
+};
+
+static struct nla_policy nbd_sock_policy[NBD_SOCK_MAX + 1] = {
+	[NBD_SOCK_FD]			=	{ .type = NLA_U32 },
+};
+
+static int nbd_genl_connect(struct sk_buff *skb, struct genl_info *info)
+{
+	struct nbd_device *nbd = NULL;
+	int index = -1;
+	int ret;
+
+	if (!netlink_capable(skb, CAP_SYS_ADMIN))
+		return -EPERM;
+
+	if (info->attrs[NBD_ATTR_INDEX])
+		index = nla_get_u32(info->attrs[NBD_ATTR_INDEX]);
+	if (!info->attrs[NBD_ATTR_SOCKETS]) {
+		printk(KERN_ERR "nbd: must specify at least one socket\n");
+		return -EINVAL;
+	}
+	if (!info->attrs[NBD_ATTR_SIZE_BYTES]) {
+		printk(KERN_ERR "nbd: must specify a size in bytes for the device\n");
+		return -EINVAL;
+	}
+again:
+	mutex_lock(&nbd_index_mutex);
+	if (index == -1) {
+		ret = idr_for_each(&nbd_index_idr, &find_free_cb, &nbd);
+		if (ret == 0) {
+			int new_index;
+			new_index = nbd_dev_add(-1);
+			if (new_index < 0) {
+				mutex_unlock(&nbd_index_mutex);
+				printk(KERN_ERR "nbd: failed to add new device\n");
+				return ret;
+			}
+			nbd = idr_find(&nbd_index_idr, new_index);
+		}
+	} else {
+		nbd = idr_find(&nbd_index_idr, index);
+	}
+	mutex_unlock(&nbd_index_mutex);
+	if (!nbd) {
+		printk(KERN_ERR "nbd: couldn't find device at index %d\n",
+		       index);
+		return -EINVAL;
+	}
+
+	mutex_lock(&nbd->config_lock);
+	if (atomic_cmpxchg(&nbd->refs, 0, 1)) {
+		mutex_unlock(&nbd->config_lock);
+		if (index == -1)
+			goto again;
+		printk(KERN_ERR "nbd: nbd%d already in use\n", index);
+		return -EBUSY;
+	}
+	try_module_get(THIS_MODULE);
+	/*
+	 * There is a very slight race where we could have just put the last
+	 * reference on this device and are now unable to tear everything down,
+	 * so check for NBD_BOUND or task_recv and cycle the config_mutex to
+	 * allow the teardown to happen.
+	 */
+	if (test_and_set_bit(NBD_BOUND, &nbd->runtime_flags) ||
+	    nbd->task_recv) {
+		mutex_unlock(&nbd->config_lock);
+		mutex_lock(&nbd->config_lock);
+		set_bit(NBD_BOUND, &nbd->runtime_flags);
+	}
+
+	if (info->attrs[NBD_ATTR_SIZE_BYTES]) {
+		u64 bytes = nla_get_u64(info->attrs[NBD_ATTR_SIZE_BYTES]);
+		nbd_size_set(nbd, nbd->blksize,
+			     div64_u64(bytes, nbd->blksize));
+	}
+	if (info->attrs[NBD_ATTR_BLOCK_SIZE_BYTES]) {
+		u64 bsize =
+			nla_get_u64(info->attrs[NBD_ATTR_BLOCK_SIZE_BYTES]);
+		nbd_size_set(nbd, bsize, div64_u64(nbd->bytesize, bsize));
+	}
+	if (info->attrs[NBD_ATTR_TIMEOUT]) {
+		u64 timeout = nla_get_u64(info->attrs[NBD_ATTR_TIMEOUT]);
+		nbd->tag_set.timeout = timeout * HZ;
+	}
+	if (info->attrs[NBD_ATTR_SERVER_FLAGS])
+		nbd->flags =
+			nla_get_u64(info->attrs[NBD_ATTR_SERVER_FLAGS]);
+	if (info->attrs[NBD_ATTR_SOCKETS]) {
+		struct nlattr *attr;
+		int rem, fd;
+
+		nla_for_each_nested(attr, info->attrs[NBD_ATTR_SOCKETS],
+				    rem) {
+			struct nlattr *socks[NBD_SOCK_MAX+1];
+
+			if (nla_type(attr) != NBD_SOCK_ITEM) {
+				printk(KERN_ERR "nbd: socks must be embedded in a SOCK_ITEM attr\n");
+				ret = -EINVAL;
+				goto out;
+			}
+			ret = nla_parse_nested(socks, NBD_SOCK_MAX, attr,
+					       nbd_sock_policy);
+			if (ret != 0) {
+				printk(KERN_ERR "nbd: error processing sock list\n");
+				ret = -EINVAL;
+				goto out;
+			}
+			if (!socks[NBD_SOCK_FD])
+				continue;
+			fd = (int)nla_get_u32(socks[NBD_SOCK_FD]);
+			ret = nbd_add_socket(nbd, fd, true);
+			if (ret)
+				goto out;
+		}
+	}
+	ret = nbd_start_device(nbd);
+out:
+	mutex_unlock(&nbd->config_lock);
+	if (!ret)
+		nbd_connect_reply(info, nbd->index);
+	nbd_put(nbd);
+	return ret;
+}
+
+static int nbd_genl_disconnect(struct sk_buff *skb, struct genl_info *info)
+{
+	struct nbd_device *nbd;
+	int index;
+
+	if (!netlink_capable(skb, CAP_SYS_ADMIN))
+		return -EPERM;
+
+	if (!info->attrs[NBD_ATTR_INDEX]) {
+		printk(KERN_ERR "nbd: must specify an index to disconnect\n");
+		return -EINVAL;
+	}
+	index = nla_get_u32(info->attrs[NBD_ATTR_INDEX]);
+	mutex_lock(&nbd_index_mutex);
+	nbd = idr_find(&nbd_index_idr, index);
+	mutex_unlock(&nbd_index_mutex);
+	if (!nbd) {
+		printk(KERN_ERR "nbd: couldn't find device at index %d\n",
+		       index);
+		return -EINVAL;
+	}
+	if (!nbd_get_unless_zero(nbd))
+		return 0;
+	mutex_lock(&nbd->config_lock);
+	nbd_disconnect(nbd);
+	mutex_unlock(&nbd->config_lock);
+	nbd_put(nbd);
+	return 0;
+}
+
+static const struct genl_ops nbd_connect_genl_ops[] = {
+	{
+		.cmd	= NBD_CMD_CONNECT,
+		.policy	= nbd_attr_policy,
+		.doit	= nbd_genl_connect,
+	},
+	{
+		.cmd	= NBD_CMD_DISCONNECT,
+		.policy	= nbd_attr_policy,
+		.doit	= nbd_genl_disconnect,
+	},
+};
+
+static struct genl_family nbd_genl_family __ro_after_init = {
+	.hdrsize	= 0,
+	.name		= NBD_GENL_FAMILY_NAME,
+	.version	= NBD_GENL_VERSION,
+	.module		= THIS_MODULE,
+	.ops		= nbd_connect_genl_ops,
+	.n_ops		= ARRAY_SIZE(nbd_connect_genl_ops),
+	.maxattr	= NBD_ATTR_MAX,
+};
+
+static void nbd_connect_reply(struct genl_info *info, int index)
+{
+	struct sk_buff *skb;
+	void *msg_head;
+	int ret;
+
+	skb = genlmsg_new(nla_total_size(sizeof(u32)), GFP_KERNEL);
+	if (!skb)
+		return;
+	msg_head = genlmsg_put_reply(skb, info, &nbd_genl_family, 0,
+				     NBD_CMD_CONNECT);
+	if (!msg_head) {
+		nlmsg_free(skb);
+		return;
+	}
+	ret = nla_put_u32(skb, NBD_ATTR_INDEX, index);
+	if (ret) {
+		nlmsg_free(skb);
+		return;
+	}
+	genlmsg_end(skb, msg_head);
+	genlmsg_reply(skb, info);
+}
 
 static int __init nbd_init(void)
 {
@@ -1347,6 +1583,11 @@ static int __init nbd_init(void)
 		return -EIO;
 	}
 
+	if (genl_register_family(&nbd_genl_family)) {
+		unregister_blkdev(NBD_MAJOR, "nbd");
+		destroy_workqueue(recv_workqueue);
+		return -EINVAL;
+	}
 	nbd_dbg_init();
 
 	mutex_lock(&nbd_index_mutex);
@@ -1369,6 +1610,7 @@ static void __exit nbd_cleanup(void)
 
 	idr_for_each(&nbd_index_idr, &nbd_exit_cb, NULL);
 	idr_destroy(&nbd_index_idr);
+	genl_unregister_family(&nbd_genl_family);
 	destroy_workqueue(recv_workqueue);
 	unregister_blkdev(NBD_MAJOR, "nbd");
 }
diff --git a/include/uapi/linux/nbd-netlink.h b/include/uapi/linux/nbd-netlink.h
new file mode 100644
index 0000000..fd0f4e4
--- /dev/null
+++ b/include/uapi/linux/nbd-netlink.h
@@ -0,0 +1,69 @@
+/*
+ * Copyright (C) 2017 Facebook.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+#ifndef _UAPILINUX_NBD_NETLINK_H
+#define _UAPILINUX_NBD_NETLINK_H
+
+#define NBD_GENL_FAMILY_NAME	"nbd"
+#define NBD_GENL_VERSION	0x1
+
+/* Configuration policy attributes, used for CONNECT */
+enum {
+	NBD_ATTR_UNSPEC,
+	NBD_ATTR_INDEX,
+	NBD_ATTR_SIZE_BYTES,
+	NBD_ATTR_BLOCK_SIZE_BYTES,
+	NBD_ATTR_TIMEOUT,
+	NBD_ATTR_SERVER_FLAGS,
+	NBD_ATTR_CLIENT_FLAGS,
+	NBD_ATTR_SOCKETS,
+	__NBD_ATTR_MAX,
+};
+#define NBD_ATTR_MAX (__NBD_ATTR_MAX - 1)
+
+/*
+ * This is the format for multiple sockets with NBD_ATTR_SOCKETS
+ *
+ * [NBD_ATTR_SOCKETS]
+ *   [NBD_SOCK_ITEM]
+ *     [NBD_SOCK_FD]
+ *   [NBD_SOCK_ITEM]
+ *     [NBD_SOCK_FD]
+ */
+enum {
+	NBD_SOCK_ITEM_UNSPEC,
+	NBD_SOCK_ITEM,
+	__NBD_SOCK_ITEM_MAX,
+};
+#define NBD_SOCK_ITEM_MAX (__NBD_SOCK_ITEM_MAX - 1)
+
+enum {
+	NBD_SOCK_UNSPEC,
+	NBD_SOCK_FD,
+	__NBD_SOCK_MAX,
+};
+#define NBD_SOCK_MAX (__NBD_SOCK_MAX - 1)
+
+enum {
+	NBD_CMD_UNSPEC,
+	NBD_CMD_CONNECT,
+	NBD_CMD_DISCONNECT,
+	__NBD_CMD_MAX,
+};
+#define NBD_CMD_MAX	(__NBD_CMD_MAX - 1)
+
+#endif /* _UAPILINUX_NBD_NETLINK_H */
-- 
2.7.4


* Re: [Nbd] [PATCH 6/6] nbd: add a basic netlink interface
  2017-02-28 16:57 ` [PATCH 6/6] nbd: add a basic netlink interface Josef Bacik
@ 2017-03-08 10:07   ` Wouter Verhelst
  2017-03-08 14:56     ` Josef Bacik
  0 siblings, 1 reply; 10+ messages in thread
From: Wouter Verhelst @ 2017-03-08 10:07 UTC (permalink / raw)
  To: Josef Bacik; +Cc: axboe, nbd-general, linux-block, kernel-team

On Tue, Feb 28, 2017 at 11:57:11AM -0500, Josef Bacik wrote:
> The existing ioctl interface for configuring NBD devices is a bit
> cumbersome and hard to extend.  The other problem is we leave a
> userspace app sitting in its syscall until the device disconnects,
> which is less than ideal.

True.

On the other hand, it has the advantage that you leave a userspace app
sitting around until the device disconnects, which allows for some form
of recovery in case you're doing root-on-NBD and the server has a
hiccup. Don't underestimate the advantage of that.

(of course, that requires that the return value of NBD_DO_IT makes a
difference between "unexpected connection drop" and "we sent
NBD_CMD_DISC", but that's a different matter entirely)

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12


* Re: [Nbd] [PATCH 6/6] nbd: add a basic netlink interface
  2017-03-08 10:07   ` [Nbd] " Wouter Verhelst
@ 2017-03-08 14:56     ` Josef Bacik
  2017-03-08 20:44       ` Wouter Verhelst
  0 siblings, 1 reply; 10+ messages in thread
From: Josef Bacik @ 2017-03-08 14:56 UTC (permalink / raw)
  To: Wouter Verhelst; +Cc: axboe, nbd-general, linux-block, kernel-team

On Wed, 2017-03-08 at 11:07 +0100, Wouter Verhelst wrote:
> On Tue, Feb 28, 2017 at 11:57:11AM -0500, Josef Bacik wrote:
> > 
> > The existing ioctl interface for configuring NBD devices is a bit
> > cumbersome and hard to extend.  The other problem is we leave a
> > userspace app sitting in its syscall until the device disconnects,
> > which is less than ideal.
> True.
> 
> On the other hand, it has the advantage that you leave a userspace
> app sitting around until the device disconnects, which allows for
> some form of recovery in case you're doing root-on-NBD and the
> server has a hiccup. Don't underestimate the advantage of that.
> 
> (of course, that requires that the return value of NBD_DO_IT makes a
> difference between "unexpected connection drop" and "we sent
> NBD_CMD_DISC", but that's a different matter entirely)
> 

Stay tuned for further developments ;).  Yeah, the problem is that even
though we can return and allow the user to reconnect, we completely
tear down the device and will return EIO to anything that comes in
while we're reconnecting, which sucks for users.  The patches that I'm
testing now will multicast messages over netlink when a link goes down
so a userspace application can reconnect and provide a new connection
seamlessly.  The next step after that is to allow a complete failure of
all connections: we will simply sit there and queue IO until userspace
reconnects or the configured timeout elapses, at which point we'll tear
down the device.  The end goal of all of this is seamless reconnects
without throwing errors.  Thanks,

Josef


* Re: [Nbd] [PATCH 6/6] nbd: add a basic netlink interface
  2017-03-08 14:56     ` Josef Bacik
@ 2017-03-08 20:44       ` Wouter Verhelst
  0 siblings, 0 replies; 10+ messages in thread
From: Wouter Verhelst @ 2017-03-08 20:44 UTC (permalink / raw)
  To: Josef Bacik; +Cc: axboe, nbd-general, linux-block, kernel-team

On Wed, Mar 08, 2017 at 09:56:48AM -0500, Josef Bacik wrote:
> On Wed, 2017-03-08 at 11:07 +0100, Wouter Verhelst wrote:
> > On Tue, Feb 28, 2017 at 11:57:11AM -0500, Josef Bacik wrote:
> > > 
> > > The existing ioctl interface for configuring NBD devices is a bit
> > > cumbersome and hard to extend.  The other problem is we leave a
> > > userspace app sitting in its syscall until the device disconnects,
> > > which is less than ideal.
> > True.
> > 
> > On the other hand, it has the advantage that you leave a userspace
> > app
> > sitting around until the device disconnects, which allows for some
> > form
> > of recovery in case you're doing root-on-NBD and the server has a
> > hiccup. Don't underestimate the advantage of that.
> > 
> > (of course, that requires that the return value of NBD_DO_IT makes a
> > difference between "unexpected connection drop" and "we sent
> > NBD_CMD_DISC", but that's a different matter entirely)
> 
> Stay tuned for further developments ;).

Heh.

> Yeah the problem is that even
> though we can return and allow the user to reconnect, we completely
> tear down the device and will return EIO to anything that comes in
> while we're reconnecting, which sucks for users.

Quite, yes.

> The patches that I'm testing now will multi-cast messages over netlink
> when a link goes down so a user space application can reconnect and
> provide a new connection seamlessly.  The next step after that is to
> allow a complete failure of all connections and we will simply sit
> there and queue IO until userspace reconnects or the configured
> timeout elapses at which point we'll tear down the device.  The end
> goal of all of this is seamless reconnects without throwing errors.

Awesome. That would mean userspace would need to sit around, but I
suppose that's something we can live with (and it actually has other
advantages, too).

Thanks,

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12


end of thread, other threads:[~2017-03-08 20:44 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-28 16:57 [PATCH 0/6] Lots of NBD fixes and enhancements Josef Bacik
2017-02-28 16:57 ` [PATCH 1/6] nbd: handle single path failures gracefully Josef Bacik
2017-02-28 16:57 ` [PATCH 2/6] nbd: ref count the nbd device Josef Bacik
2017-02-28 16:57 ` [PATCH 3/6] nbd: stop using the bdev everywhere Josef Bacik
2017-02-28 16:57 ` [PATCH 4/6] nbd: set queue timeout properly Josef Bacik
2017-02-28 16:57 ` [PATCH 5/6] nbd: handle ERESTARTSYS properly Josef Bacik
2017-02-28 16:57 ` [PATCH 6/6] nbd: add a basic netlink interface Josef Bacik
2017-03-08 10:07   ` [Nbd] " Wouter Verhelst
2017-03-08 14:56     ` Josef Bacik
2017-03-08 20:44       ` Wouter Verhelst
