From: Mike Christie <mchristi@redhat.com>
To: "sunke (E)" <sunke32@huawei.com>, josef@toxicpanda.com, axboe@kernel.dk
Cc: linux-block@vger.kernel.org, nbd@other.debian.org,
linux-kernel@vger.kernel.org, Xiubo Li <xiubli@redhat.com>
Subject: Re: [PATCH] nbd: fix potential NULL pointer fault in connect and disconnect process
Date: Sun, 9 Feb 2020 21:16:47 -0600 [thread overview]
Message-ID: <5E40CB1F.7070301@redhat.com> (raw)
In-Reply-To: <c15baa64-ef8d-970f-f4e0-ecd10cc0b0a0@huawei.com>
On 01/19/2020 01:10 AM, sunke (E) wrote:
>
> Thanks for your detailed suggestions.
>
> On 2020/1/18 1:32, Mike Christie wrote:
>> On 01/17/2020 05:50 AM, Sun Ke wrote:
>>> Connecting and disconnecting an nbd device repeatedly can cause a
>>> NULL pointer dereference.
>>>
>>> It can be reproduced with the following steps:
>>> 1. Connect the nbd device and disconnect it; at this point the nbd
>>> device is not yet fully torn down.
>>> 2. Immediately connect the same nbd device again; this will fail
>>> in nbd_start_device with an EBUSY return value.
>>> 3. Wait a second to make sure the last config_refs reference is
>>> dropped and nbd_config_put tears the nbd device down completely.
>>> 4. Start another process that opens the nbd device, so config_refs
>>> increases, and disconnect it at the same time.
>>
>> Just to make sure I understood this, for step 4 the process is doing:
>>
>> open(/dev/nbdX);
>> ioctl(NBD_DISCONNECT, /dev/nbdX) or nbd_genl_disconnect(for /dev/nbdX)
>>
>> ?
>>
> It does nbd_genl_disconnect (for /dev/nbdX).
> I tested it. I connected /dev/nbdX through the ioctl interface with
> nbd-client -L -N export localhost /dev/nbdX, and through the netlink
> interface with nbd-client localhost XXXX /dev/nbdX, then disconnected
> /dev/nbdX with nbd-client -d /dev/nbdX.
> Both paths call nbd_genl_disconnect (for /dev/nbdX) and both hit the
> same NULL pointer dereference.
>> There is no successful NBD_DO_IT / nbd_genl_connect between the open and
>> disconnect calls at step #4, because it would normally be done at #2 and
>> that failed. nbd_disconnect_and_put could then reference a null
>> recv_workq. If we are also racing with a close() then that could free
>> the device/config from under nbd_disconnect_and_put.
>>
> Yes, nbd_disconnect_and_put could then reference a null recv_workq.
Hey Sunke,
How about the attached patch? I am still testing it. The basic idea is
that we need to do a flush whenever we have done a sock_shutdown and are
in the disconnect/connect/clear-sock path, so the patch just adds the
flush in that function. We then do not need to keep adding these flushes
everywhere.
[-- Attachment #2: 0001-nbd-fix-crash-during-nbd_genl_disconnect.patch --]
[-- Type: text/x-patch, Size: 5016 bytes --]
From c007f70b73d31a11ea90ae53c748266eec88c0ab Mon Sep 17 00:00:00 2001
From: Mike Christie <mchristi@redhat.com>
Date: Sun, 9 Feb 2020 21:06:00 -0600
Subject: [PATCH] nbd: fix crash during nbd_genl_disconnect
If we open an nbd device but have not done a nbd_genl_connect, we will
crash in nbd_genl_disconnect when we try to flush the workqueue, since
it was never allocated.
This patch moves all the flush calls into sock_shutdown and adds a check
for a valid workqueue.
Note that flush_workqueue will do the right thing if there are no
running works, so we no longer need the if (i) check in
nbd_start_device.
This also fixes a possible bug where you could do nbd_genl_connect, then
NBD_DISCONNECT and NBD_CLEAR_SOCK, and the workqueue would not get
flushed.
---
drivers/block/nbd.c | 44 +++++++++++++++++---------------------------
1 file changed, 17 insertions(+), 27 deletions(-)
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 78181908f0df..1afc70ed1f0a 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -341,9 +341,12 @@ static void nbd_complete_rq(struct request *req)
}
/*
- * Forcibly shutdown the socket causing all listeners to error
+ * Forcibly shutdown the socket causing all listeners to error. If called
+ * from a disconnect/clearing context we must make sure the recv workqueues
+ * have exited so they don't release the last nbd_config reference and try to
+ * destroy the workqueue from inside itself.
*/
-static void sock_shutdown(struct nbd_device *nbd)
+static void sock_shutdown(struct nbd_device *nbd, bool flush_work)
{
struct nbd_config *config = nbd->config;
int i;
@@ -351,7 +354,7 @@ static void sock_shutdown(struct nbd_device *nbd)
if (config->num_connections == 0)
return;
if (test_and_set_bit(NBD_RT_DISCONNECTED, &config->runtime_flags))
- return;
+ goto try_flush;
for (i = 0; i < config->num_connections; i++) {
struct nbd_sock *nsock = config->socks[i];
@@ -360,6 +363,10 @@ static void sock_shutdown(struct nbd_device *nbd)
mutex_unlock(&nsock->tx_lock);
}
dev_warn(disk_to_dev(nbd->disk), "shutting down sockets\n");
+
+try_flush:
+ if (flush_work && nbd->recv_workq)
+ flush_workqueue(nbd->recv_workq);
}
static u32 req_to_nbd_cmd_type(struct request *req)
@@ -446,7 +453,7 @@ static enum blk_eh_timer_return nbd_xmit_timeout(struct request *req,
set_bit(NBD_RT_TIMEDOUT, &config->runtime_flags);
cmd->status = BLK_STS_IOERR;
mutex_unlock(&cmd->lock);
- sock_shutdown(nbd);
+ sock_shutdown(nbd, false);
nbd_config_put(nbd);
done:
blk_mq_complete_request(req);
@@ -910,7 +917,7 @@ static int nbd_handle_cmd(struct nbd_cmd *cmd, int index)
* for the reconnect timer don't trigger the timer again
* and instead just error out.
*/
- sock_shutdown(nbd);
+ sock_shutdown(nbd, false);
nbd_config_put(nbd);
blk_mq_start_request(req);
return -EIO;
@@ -1178,7 +1185,7 @@ static int nbd_disconnect(struct nbd_device *nbd)
static void nbd_clear_sock(struct nbd_device *nbd)
{
- sock_shutdown(nbd);
+ sock_shutdown(nbd, true);
nbd_clear_que(nbd);
nbd->task_setup = NULL;
}
@@ -1264,17 +1271,7 @@ static int nbd_start_device(struct nbd_device *nbd)
args = kzalloc(sizeof(*args), GFP_KERNEL);
if (!args) {
- sock_shutdown(nbd);
- /*
- * If num_connections is m (2 < m),
- * and NO.1 ~ NO.n(1 < n < m) kzallocs are successful.
- * But NO.(n + 1) failed. We still have n recv threads.
- * So, add flush_workqueue here to prevent recv threads
- * dropping the last config_refs and trying to destroy
- * the workqueue from inside the workqueue.
- */
- if (i)
- flush_workqueue(nbd->recv_workq);
+ sock_shutdown(nbd, true);
return -ENOMEM;
}
sk_set_memalloc(config->socks[i]->sock->sk);
@@ -1306,9 +1303,7 @@ static int nbd_start_device_ioctl(struct nbd_device *nbd, struct block_device *b
mutex_unlock(&nbd->config_lock);
ret = wait_event_interruptible(config->recv_wq,
atomic_read(&config->recv_threads) == 0);
- if (ret)
- sock_shutdown(nbd);
- flush_workqueue(nbd->recv_workq);
+ sock_shutdown(nbd, true);
mutex_lock(&nbd->config_lock);
nbd_bdev_reset(bdev);
@@ -1323,7 +1318,7 @@ static int nbd_start_device_ioctl(struct nbd_device *nbd, struct block_device *b
static void nbd_clear_sock_ioctl(struct nbd_device *nbd,
struct block_device *bdev)
{
- sock_shutdown(nbd);
+ sock_shutdown(nbd, true);
__invalidate_device(bdev, true);
nbd_bdev_reset(bdev);
if (test_and_clear_bit(NBD_RT_HAS_CONFIG_REF,
@@ -1986,12 +1981,7 @@ static void nbd_disconnect_and_put(struct nbd_device *nbd)
nbd_disconnect(nbd);
nbd_clear_sock(nbd);
mutex_unlock(&nbd->config_lock);
- /*
- * Make sure recv thread has finished, so it does not drop the last
- * config ref and try to destroy the workqueue from inside the work
- * queue.
- */
- flush_workqueue(nbd->recv_workq);
+
if (test_and_clear_bit(NBD_RT_HAS_CONFIG_REF,
&nbd->config->runtime_flags))
nbd_config_put(nbd);
--
2.20.1
Thread overview: 7+ messages
2020-01-17 11:50 [PATCH] nbd: fix potential NULL pointer fault in connect and disconnect process Sun Ke
2020-01-17 14:18 ` Josef Bacik
2020-01-19 6:27 ` sunke (E)
2020-01-17 17:32 ` Mike Christie
2020-01-19 7:10 ` sunke (E)
2020-02-10 3:16 ` Mike Christie [this message]
2020-02-10 9:15 ` sunke (E)