* [PATCH] nbd: fix potential NULL pointer fault in connect and disconnect process
@ 2020-01-17 11:50 Sun Ke
2020-01-17 14:18 ` Josef Bacik
2020-01-17 17:32 ` Mike Christie
0 siblings, 2 replies; 7+ messages in thread
From: Sun Ke @ 2020-01-17 11:50 UTC (permalink / raw)
To: josef, axboe, sunke32; +Cc: linux-block, nbd, linux-kernel
Connecting and disconnecting an nbd device repeatedly can cause a
NULL pointer dereference.
It can be reproduced with these steps:
1. Connect the nbd device and disconnect it; at this point the nbd
device is not yet fully disconnected.
2. Immediately connect the same nbd device again; this fails in
nbd_start_device with -EBUSY.
3. Wait a second to make sure the last config_refs reference is
dropped and nbd_config_put fully disconnects the nbd device.
4. Start another process that opens the nbd device, increasing
config_refs, and disconnect it at the same time.
To fix it, add an NBD_HAS_STARTED flag. Set it in nbd_start_device_ioctl
and nbd_genl_connect when the nbd device is started successfully.
Clear it in nbd_config_put. Test it in nbd_genl_disconnect and
nbd_genl_reconfigure.
Signed-off-by: Sun Ke <sunke32@huawei.com>
---
drivers/block/nbd.c | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index b4607dd96185..ddd364e208ab 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -83,6 +83,7 @@ struct link_dead_args {
#define NBD_DESTROY_ON_DISCONNECT 0
#define NBD_DISCONNECT_REQUESTED 1
+#define NBD_HAS_STARTED 2
struct nbd_config {
u32 flags;
@@ -1215,6 +1216,7 @@ static void nbd_config_put(struct nbd_device *nbd)
nbd->disk->queue->limits.discard_alignment = 0;
blk_queue_max_discard_sectors(nbd->disk->queue, UINT_MAX);
blk_queue_flag_clear(QUEUE_FLAG_DISCARD, nbd->disk->queue);
+ clear_bit(NBD_HAS_STARTED, &nbd->flags);
mutex_unlock(&nbd->config_lock);
nbd_put(nbd);
@@ -1290,6 +1292,8 @@ static int nbd_start_device_ioctl(struct nbd_device *nbd, struct block_device *b
ret = nbd_start_device(nbd);
if (ret)
return ret;
+ else
+ set_bit(NBD_HAS_STARTED, &nbd->flags);
if (max_part)
bdev->bd_invalidated = 1;
@@ -1961,6 +1965,7 @@ static int nbd_genl_connect(struct sk_buff *skb, struct genl_info *info)
mutex_unlock(&nbd->config_lock);
if (!ret) {
set_bit(NBD_RT_HAS_CONFIG_REF, &config->runtime_flags);
+ set_bit(NBD_HAS_STARTED, &nbd->flags);
refcount_inc(&nbd->config_refs);
nbd_connect_reply(info, nbd->index);
}
@@ -2008,6 +2013,14 @@ static int nbd_genl_disconnect(struct sk_buff *skb, struct genl_info *info)
index);
return -EINVAL;
}
+
+ if (!test_bit(NBD_HAS_STARTED, &nbd->flags)) {
+ mutex_unlock(&nbd_index_mutex);
+ printk(KERN_ERR "nbd: device at index %d failed to start\n",
+ index);
+ return -EBUSY;
+ }
+
if (!refcount_inc_not_zero(&nbd->refs)) {
mutex_unlock(&nbd_index_mutex);
printk(KERN_ERR "nbd: device at index %d is going down\n",
@@ -2049,6 +2062,14 @@ static int nbd_genl_reconfigure(struct sk_buff *skb, struct genl_info *info)
index);
return -EINVAL;
}
+
+ if (!test_bit(NBD_HAS_STARTED, &nbd->flags)) {
+ mutex_unlock(&nbd_index_mutex);
+ printk(KERN_ERR "nbd: device at index %d failed to start\n",
+ index);
+ return -EBUSY;
+ }
+
if (!refcount_inc_not_zero(&nbd->refs)) {
mutex_unlock(&nbd_index_mutex);
printk(KERN_ERR "nbd: device at index %d is going down\n",
--
2.17.2
^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH] nbd: fix potential NULL pointer fault in connect and disconnect process
2020-01-17 11:50 [PATCH] nbd: fix potential NULL pointer fault in connect and disconnect process Sun Ke
@ 2020-01-17 14:18 ` Josef Bacik
2020-01-19 6:27 ` sunke (E)
2020-01-17 17:32 ` Mike Christie
1 sibling, 1 reply; 7+ messages in thread
From: Josef Bacik @ 2020-01-17 14:18 UTC (permalink / raw)
To: Sun Ke, axboe; +Cc: linux-block, nbd, linux-kernel
On 1/17/20 6:50 AM, Sun Ke wrote:
> Connect and disconnect a nbd device repeatedly, will cause
> NULL pointer fault.
>
> It will appear by the steps:
> 1. Connect the nbd device and disconnect it, but now nbd device
> is not disconnected totally.
> 2. Connect the same nbd device again immediately, it will fail
> in nbd_start_device with a EBUSY return value.
> 3. Wait a second to make sure the last config_refs is reduced
> and run nbd_config_put to disconnect the nbd device totally.
> 4. Start another process to open the nbd_device, config_refs
> will increase and at the same time disconnect it.
>
> To fix it, add a NBD_HAS_STARTED flag. Set it in nbd_start_device_ioctl
> and nbd_genl_connect if nbd device is started successfully.
> Clear it in nbd_config_put. Test it in nbd_genl_disconnect and
> nbd_genl_reconfigure.
I don't doubt what you are seeing, but what exactly are we NULL pointer
dereferencing? I can't quite figure it out from the steps.
>
> Signed-off-by: Sun Ke <sunke32@huawei.com>
> ---
> drivers/block/nbd.c | 21 +++++++++++++++++++++
> 1 file changed, 21 insertions(+)
>
> diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
> index b4607dd96185..ddd364e208ab 100644
> --- a/drivers/block/nbd.c
> +++ b/drivers/block/nbd.c
> @@ -83,6 +83,7 @@ struct link_dead_args {
>
> #define NBD_DESTROY_ON_DISCONNECT 0
> #define NBD_DISCONNECT_REQUESTED 1
> +#define NBD_HAS_STARTED 2
>
> struct nbd_config {
> u32 flags;
> @@ -1215,6 +1216,7 @@ static void nbd_config_put(struct nbd_device *nbd)
> nbd->disk->queue->limits.discard_alignment = 0;
> blk_queue_max_discard_sectors(nbd->disk->queue, UINT_MAX);
> blk_queue_flag_clear(QUEUE_FLAG_DISCARD, nbd->disk->queue);
> + clear_bit(NBD_HAS_STARTED, &nbd->flags);
>
> mutex_unlock(&nbd->config_lock);
> nbd_put(nbd);
> @@ -1290,6 +1292,8 @@ static int nbd_start_device_ioctl(struct nbd_device *nbd, struct block_device *b
> ret = nbd_start_device(nbd);
> if (ret)
> return ret;
> + else
> + set_bit(NBD_HAS_STARTED, &nbd->flags);
The else is superfluous here. Thanks,
Josef
* Re: [PATCH] nbd: fix potential NULL pointer fault in connect and disconnect process
2020-01-17 11:50 [PATCH] nbd: fix potential NULL pointer fault in connect and disconnect process Sun Ke
2020-01-17 14:18 ` Josef Bacik
@ 2020-01-17 17:32 ` Mike Christie
2020-01-19 7:10 ` sunke (E)
1 sibling, 1 reply; 7+ messages in thread
From: Mike Christie @ 2020-01-17 17:32 UTC (permalink / raw)
To: Sun Ke, josef, axboe; +Cc: linux-block, nbd, linux-kernel, Xiubo Li
On 01/17/2020 05:50 AM, Sun Ke wrote:
> Connect and disconnect a nbd device repeatedly, will cause
> NULL pointer fault.
>
> It will appear by the steps:
> 1. Connect the nbd device and disconnect it, but now nbd device
> is not disconnected totally.
> 2. Connect the same nbd device again immediately, it will fail
> in nbd_start_device with a EBUSY return value.
> 3. Wait a second to make sure the last config_refs is reduced
> and run nbd_config_put to disconnect the nbd device totally.
> 4. Start another process to open the nbd_device, config_refs
> will increase and at the same time disconnect it.
Just to make sure I understood this, for step 4 the process is doing:
open(/dev/nbdX);
ioctl(NBD_DISCONNECT, /dev/nbdX) or nbd_genl_disconnect(for /dev/nbdX)
?
There is no successful NBD_DO_IT / nbd_genl_connect between the open and
disconnect calls at step #4, because it would normally be done at #2 and
that failed. nbd_disconnect_and_put could then reference a null
recv_workq. If we are also racing with a close() then that could free
the device/config from under nbd_disconnect_and_put.
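The window described here can be sketched in plain userspace C. This is a heavily simplified model with stand-in types and names (model_nbd, model_open, and friends are hypothetical, not the driver's real structures): the recv workqueue only exists after a successful start, so a disconnect arriving on a merely opened device sees a NULL pointer.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, heavily simplified stand-in for struct nbd_device. */
struct model_nbd {
	int config_refs;	/* models refcount_t config_refs */
	void *recv_workq;	/* NULL until a connect fully succeeds */
};

/* open(/dev/nbdX) takes a config reference but creates no workqueue. */
static void model_open(struct model_nbd *nbd)
{
	nbd->config_refs++;
}

/* Only a successful NBD_DO_IT / nbd_genl_connect allocates it. */
static void model_start(struct model_nbd *nbd)
{
	nbd->recv_workq = malloc(1);	/* stands in for alloc_workqueue() */
}

/* The unpatched disconnect path flushes unconditionally; the model
 * returns -1 where the kernel would dereference NULL. */
static int model_disconnect(struct model_nbd *nbd)
{
	if (!nbd->recv_workq)
		return -1;	/* flush_workqueue(NULL) would oops here */
	free(nbd->recv_workq);
	nbd->recv_workq = NULL;
	return 0;
}
```

The sequence open, then disconnect, with no successful start in between, is exactly step 4 above: config_refs is taken, but recv_workq was never created.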
>
> To fix it, add a NBD_HAS_STARTED flag. Set it in nbd_start_device_ioctl
I'm not sure if we need the new bit. We could just add a check for a
non-NULL task_recv in nbd_genl_disconnect, like nbd_start_device and
nbd_genl_reconfigure already do.
The new bit might be more clear, which is nice. If we go this route,
should the new bit be a runtime_flag like the other device state bits?
> and nbd_genl_connect if nbd device is started successfully.
> Clear it in nbd_config_put. Test it in nbd_genl_disconnect and
> nbd_genl_reconfigure.
>
> Signed-off-by: Sun Ke <sunke32@huawei.com>
> ---
> drivers/block/nbd.c | 21 +++++++++++++++++++++
> 1 file changed, 21 insertions(+)
>
> diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
> index b4607dd96185..ddd364e208ab 100644
> --- a/drivers/block/nbd.c
> +++ b/drivers/block/nbd.c
> @@ -83,6 +83,7 @@ struct link_dead_args {
>
> #define NBD_DESTROY_ON_DISCONNECT 0
> #define NBD_DISCONNECT_REQUESTED 1
> +#define NBD_HAS_STARTED 2
>
> struct nbd_config {
> u32 flags;
> @@ -1215,6 +1216,7 @@ static void nbd_config_put(struct nbd_device *nbd)
> nbd->disk->queue->limits.discard_alignment = 0;
> blk_queue_max_discard_sectors(nbd->disk->queue, UINT_MAX);
> blk_queue_flag_clear(QUEUE_FLAG_DISCARD, nbd->disk->queue);
> + clear_bit(NBD_HAS_STARTED, &nbd->flags);
>
> mutex_unlock(&nbd->config_lock);
> nbd_put(nbd);
> @@ -1290,6 +1292,8 @@ static int nbd_start_device_ioctl(struct nbd_device *nbd, struct block_device *b
> ret = nbd_start_device(nbd);
> if (ret)
> return ret;
> + else
> + set_bit(NBD_HAS_STARTED, &nbd->flags);
>
> if (max_part)
> bdev->bd_invalidated = 1;
> @@ -1961,6 +1965,7 @@ static int nbd_genl_connect(struct sk_buff *skb, struct genl_info *info)
> mutex_unlock(&nbd->config_lock);
> if (!ret) {
> set_bit(NBD_RT_HAS_CONFIG_REF, &config->runtime_flags);
> + set_bit(NBD_HAS_STARTED, &nbd->flags);
> refcount_inc(&nbd->config_refs);
> nbd_connect_reply(info, nbd->index);
> }
> @@ -2008,6 +2013,14 @@ static int nbd_genl_disconnect(struct sk_buff *skb, struct genl_info *info)
> index);
> return -EINVAL;
> }
> +
> + if (!test_bit(NBD_HAS_STARTED, &nbd->flags)) {
> + mutex_unlock(&nbd_index_mutex);
> + printk(KERN_ERR "nbd: device at index %d failed to start\n",
> + index);
> + return -EBUSY;
> + }
> +
> if (!refcount_inc_not_zero(&nbd->refs)) {
> mutex_unlock(&nbd_index_mutex);
> printk(KERN_ERR "nbd: device at index %d is going down\n",
> @@ -2049,6 +2062,14 @@ static int nbd_genl_reconfigure(struct sk_buff *skb, struct genl_info *info)
> index);
> return -EINVAL;
> }
> +
> + if (!test_bit(NBD_HAS_STARTED, &nbd->flags)) {
> + mutex_unlock(&nbd_index_mutex);
> + printk(KERN_ERR "nbd: device at index %d failed to start\n",
> + index);
> + return -EBUSY;
> + }
> +
> if (!refcount_inc_not_zero(&nbd->refs)) {
> mutex_unlock(&nbd_index_mutex);
> printk(KERN_ERR "nbd: device at index %d is going down\n",
>
* Re: [PATCH] nbd: fix potential NULL pointer fault in connect and disconnect process
2020-01-17 14:18 ` Josef Bacik
@ 2020-01-19 6:27 ` sunke (E)
0 siblings, 0 replies; 7+ messages in thread
From: sunke (E) @ 2020-01-19 6:27 UTC (permalink / raw)
To: Josef Bacik, axboe; +Cc: linux-block, nbd, linux-kernel
在 2020/1/17 22:18, Josef Bacik 写道:
> On 1/17/20 6:50 AM, Sun Ke wrote:
>> Connect and disconnect a nbd device repeatedly, will cause
>> NULL pointer fault.
>>
>> It will appear by the steps:
>> 1. Connect the nbd device and disconnect it, but now nbd device
>> is not disconnected totally.
>> 2. Connect the same nbd device again immediately, it will fail
>> in nbd_start_device with a EBUSY return value.
>> 3. Wait a second to make sure the last config_refs is reduced
>> and run nbd_config_put to disconnect the nbd device totally.
>> 4. Start another process to open the nbd_device, config_refs
>> will increase and at the same time disconnect it.
>>
>> To fix it, add a NBD_HAS_STARTED flag. Set it in nbd_start_device_ioctl
>> and nbd_genl_connect if nbd device is started successfully.
>> Clear it in nbd_config_put. Test it in nbd_genl_disconnect and
>> nbd_genl_reconfigure.
>
> I don't doubt what you are seeing, but what exactly are we NULL pointer
> dereferencing? I can't quite figure it out from the steps.
The root cause is that on disconnect, the pointers in struct nbd_device
are not freed immediately; they are only released once the last
config_refs reference is dropped.
I got this kernel NULL pointer dereference report:
[ 256.454582] Dev nbd0: unable to read RDB block 0
[ 256.455611] Dev nbd0: unable to read RDB block 0
[ 256.457528] Dev nbd0: unable to read RDB block 0
[ 256.458742] Dev nbd0: unable to read RDB block 0
[ 256.516375] Dev nbd0: unable to read RDB block 0
[ 257.468970] BUG: kernel NULL pointer dereference, address: 0000000000000020
[ 257.469645] #PF: supervisor write access in kernel mode
[ 257.470445] #PF: error_code(0x0002) - not-present page
[ 257.470888] PGD 12ecb7067 P4D 12ecb7067 PUD 12f3f2067 PMD 0
[ 257.471384] Oops: 0002 [#1] SMP
[ 257.471671] CPU: 1 PID: 1651 Comm: nbd-client Not tainted 5.5.0-rc5-00039-gae6088216ce4 #22
[ 257.472501] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20180531_142017-buildhw-08.phx2.fedoraproject.org-1.fc28 04/01/2014
[ 257.473776] RIP: 0010:mutex_lock+0x29/0x60
[ 257.474593] Code: 00 0f 1f 44 00 00 53 48 89 fb 48 83 05 cf f8 01 02 01 e8 ea bd ff ff 48 83 05 ca f8 01 02 01 31 c0 65 48 8b 14 25 00 7d 01 00 <f0> 48 0f b1 13 74 f
[ 257.476221] RSP: 0018:ffffc900004cfa10 EFLAGS: 00010246
[ 257.476670] RAX: 0000000000000000 RBX: 0000000000000020 RCX: 0000000000000000
[ 257.477289] RDX: ffff88812f524a00 RSI: ffffffff82e44212 RDI: 0000000000000020
[ 257.477999] RBP: ffffc900004cfab0 R08: ffff88813bc6c110 R09: 0000000000000000
[ 257.478617] R10: 8080808080808080 R11: 0000000000000018 R12: ffff88813584b000
[ 257.479228] R13: ffffffff838b1f00 R14: ffffc900004cfbb8 R15: ffffc900004cfa40
[ 257.479871] FS: 00007f0c30d75b40(0000) GS:ffff88813bc40000(0000) knlGS:0000000000000000
[ 257.480569] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 257.481336] CR2: 0000000000000020 CR3: 000000012f5ea000 CR4: 00000000000006e0
[ 257.481980] Call Trace:
[ 257.482262] flush_workqueue+0x91/0x690
[ 257.482627] ? nbd_size_update+0x180/0x180 [nbd]
[ 257.483063] nbd_disconnect_and_put+0x80/0xd0 [nbd]
[ 257.483497] nbd_genl_disconnect+0x153/0x2d0 [nbd]
[ 257.483969] genl_rcv_msg+0x2ab/0x620
[ 257.484302] ? netlink_unicast+0x3b8/0x5e0
[ 257.484663] ? __nlmsg_put+0x78/0x90
[ 257.485009] ? genl_family_rcv_msg_attrs_parse+0x1a0/0x1a0
[ 257.485488] netlink_rcv_skb+0x5a/0x1a0
[ 257.485849] genl_rcv+0x34/0x60
[ 257.486129] netlink_unicast+0x2a4/0x5e0
[ 257.486468] netlink_sendmsg+0x369/0x6b0
[ 257.486854] ? rw_copy_check_uvector+0x50/0x1d0
[ 257.487257] ____sys_sendmsg+0x1f7/0x370
[ 257.487604] ? copy_msghdr_from_user+0xff/0x1e0
[ 257.488016] ___sys_sendmsg+0x8c/0xe0
[ 257.488335] ? copy_msghdr_from_user+0xff/0x1e0
[ 257.488730] ? ___sys_recvmsg+0xa1/0xe0
[ 257.489091] ? handle_mm_fault+0x199/0x390
[ 257.489454] __sys_sendmsg+0x6b/0xe0
[ 257.489766] __x64_sys_sendmsg+0x23/0x30
[ 257.490149] do_syscall_64+0xab/0x410
[ 257.490474] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 257.490942] RIP: 0033:0x7f0c3047cb87
[ 257.491288] Code: 64 89 02 48 c7 c0 ff ff ff ff eb b9 0f 1f 80 00 00 00 00 8b 05 6a 2b 2c 00 48 63 d2 48 63 ff 85 c0 75 18 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 8
[ 257.492940] RSP: 002b:00007ffefb59db28 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[ 257.493627] RAX: ffffffffffffffda RBX: 00000000023b0120 RCX: 00007f0c3047cb87
[ 257.494497] RDX: 0000000000000000 RSI: 00007ffefb59db60 RDI: 0000000000000003
[ 257.495116] RBP: 00000000023b01f0 R08: 0000000000000014 R09: 0000000000000002
[ 257.495731] R10: 0000000000000006 R11: 0000000000000246 R12: 00000000023b0030
[ 257.496356] R13: 00007ffefb59db60 R14: 0000000000000001 R15: 00000000ffffffff
[ 257.496989] Modules linked in: nbd
[ 257.497580] CR2: 0000000000000020
Thanks,
Ke
>
>>
>> Signed-off-by: Sun Ke <sunke32@huawei.com>
>> ---
>> drivers/block/nbd.c | 21 +++++++++++++++++++++
>> 1 file changed, 21 insertions(+)
>>
>> diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
>> index b4607dd96185..ddd364e208ab 100644
>> --- a/drivers/block/nbd.c
>> +++ b/drivers/block/nbd.c
>> @@ -83,6 +83,7 @@ struct link_dead_args {
>> #define NBD_DESTROY_ON_DISCONNECT 0
>> #define NBD_DISCONNECT_REQUESTED 1
>> +#define NBD_HAS_STARTED 2
>> struct nbd_config {
>> u32 flags;
>> @@ -1215,6 +1216,7 @@ static void nbd_config_put(struct nbd_device *nbd)
>> nbd->disk->queue->limits.discard_alignment = 0;
>> blk_queue_max_discard_sectors(nbd->disk->queue, UINT_MAX);
>> blk_queue_flag_clear(QUEUE_FLAG_DISCARD, nbd->disk->queue);
>> + clear_bit(NBD_HAS_STARTED, &nbd->flags);
>> mutex_unlock(&nbd->config_lock);
>> nbd_put(nbd);
>> @@ -1290,6 +1292,8 @@ static int nbd_start_device_ioctl(struct
>> nbd_device *nbd, struct block_device *b
>> ret = nbd_start_device(nbd);
>> if (ret)
>> return ret;
>> + else
>> + set_bit(NBD_HAS_STARTED, &nbd->flags);
>
> The else is superfluous here. Thanks,
>
> Josef
>
> .
* Re: [PATCH] nbd: fix potential NULL pointer fault in connect and disconnect process
2020-01-17 17:32 ` Mike Christie
@ 2020-01-19 7:10 ` sunke (E)
2020-02-10 3:16 ` Mike Christie
0 siblings, 1 reply; 7+ messages in thread
From: sunke (E) @ 2020-01-19 7:10 UTC (permalink / raw)
To: Mike Christie, josef, axboe; +Cc: linux-block, nbd, linux-kernel, Xiubo Li
Thanks for your detailed suggestions.
在 2020/1/18 1:32, Mike Christie 写道:
> On 01/17/2020 05:50 AM, Sun Ke wrote:
>> Connect and disconnect a nbd device repeatedly, will cause
>> NULL pointer fault.
>>
>> It will appear by the steps:
>> 1. Connect the nbd device and disconnect it, but now nbd device
>> is not disconnected totally.
>> 2. Connect the same nbd device again immediately, it will fail
>> in nbd_start_device with a EBUSY return value.
>> 3. Wait a second to make sure the last config_refs is reduced
>> and run nbd_config_put to disconnect the nbd device totally.
>> 4. Start another process to open the nbd_device, config_refs
>> will increase and at the same time disconnect it.
>
> Just to make sure I understood this, for step 4 the process is doing:
>
> open(/dev/nbdX);
> ioctl(NBD_DISCONNECT, /dev/nbdX) or nbd_genl_disconnect(for /dev/nbdX)
>
> ?
>
It does nbd_genl_disconnect() for /dev/nbdX.
I tested it. I connected /dev/nbdX both through the ioctl interface
(nbd-client -L -N export localhost /dev/nbdX) and through the netlink
interface (nbd-client localhost XXXX /dev/nbdX), and disconnected
/dev/nbdX with nbd-client -d /dev/nbdX.
Both paths call nbd_genl_disconnect() for /dev/nbdX and both hit the
same NULL pointer dereference.
> There is no successful NBD_DO_IT / nbd_genl_connect between the open and
> disconnect calls at step #4, because it would normally be done at #2 and
> that failed. nbd_disconnect_and_put could then reference a null
> recv_workq. If we are also racing with a close() then that could free
> the device/config from under nbd_disconnect_and_put.
>
Yes, nbd_disconnect_and_put could then reference a null recv_workq.
>>
>> To fix it, add a NBD_HAS_STARTED flag. Set it in nbd_start_device_ioctl
>
> I'm not sure if we need the new bit. We could just add a check for a non
> null task_recv in nbd_genl_disconnect like how nbd_start_device and
> nbd_genl_disconnect do.
>
I am also not sure which is better. In nbd_config_put, it is not only
recv_workq that becomes NULL; nbd->task_recv and nbd->config do as well.
So I suspect that if step 4 did something else, it could also hit a
NULL pointer dereference.
> The new bit might be more clear which is nice. If we got this route,
> should the new bit be a runtime_flag like other device state bits?
>
>
Yes, I see. Just adding a check for a non-NULL task_recv in
nbd_genl_disconnect would be better, right?
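Both guards reduce to the same predicate, "did a connect fully succeed before this disconnect?". A userspace sketch of the two candidate checks (names and types here are stand-ins for illustration, not the driver's real API):

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_HAS_STARTED 2	/* the bit proposed by the patch */

/* Hypothetical stand-in for the relevant struct nbd_device fields. */
struct model_nbd {
	unsigned long flags;	/* models nbd->flags */
	void *task_recv;	/* non-NULL only while the device runs */
};

/* Variant 1: the patch's explicit NBD_HAS_STARTED flag. */
static bool started_by_flag(const struct model_nbd *nbd)
{
	return (nbd->flags >> MODEL_HAS_STARTED) & 1UL;
}

/* Variant 2: reuse task_recv, which is already non-NULL exactly when
 * the device has started, so no new state needs to be kept in sync. */
static bool started_by_task_recv(const struct model_nbd *nbd)
{
	return nbd->task_recv != NULL;
}
```

As long as the flag is set and cleared at the same points where task_recv is assigned and reset, the two checks agree; the trade-off is a clearer name versus no extra state to maintain.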
>> and nbd_genl_connect if nbd device is started successfully.
>> Clear it in nbd_config_put. Test it in nbd_genl_disconnect and
>> nbd_genl_reconfigure.
>>
>> Signed-off-by: Sun Ke <sunke32@huawei.com>
>> ---
>> drivers/block/nbd.c | 21 +++++++++++++++++++++
>> 1 file changed, 21 insertions(+)
>>
>> diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
>> index b4607dd96185..ddd364e208ab 100644
>> --- a/drivers/block/nbd.c
>> +++ b/drivers/block/nbd.c
>> @@ -83,6 +83,7 @@ struct link_dead_args {
>>
>> #define NBD_DESTROY_ON_DISCONNECT 0
>> #define NBD_DISCONNECT_REQUESTED 1
>> +#define NBD_HAS_STARTED 2
>>
>> struct nbd_config {
>> u32 flags;
>> @@ -1215,6 +1216,7 @@ static void nbd_config_put(struct nbd_device *nbd)
>> nbd->disk->queue->limits.discard_alignment = 0;
>> blk_queue_max_discard_sectors(nbd->disk->queue, UINT_MAX);
>> blk_queue_flag_clear(QUEUE_FLAG_DISCARD, nbd->disk->queue);
>> + clear_bit(NBD_HAS_STARTED, &nbd->flags);
>>
>> mutex_unlock(&nbd->config_lock);
>> nbd_put(nbd);
>> @@ -1290,6 +1292,8 @@ static int nbd_start_device_ioctl(struct nbd_device *nbd, struct block_device *b
>> ret = nbd_start_device(nbd);
>> if (ret)
>> return ret;
>> + else
>> + set_bit(NBD_HAS_STARTED, &nbd->flags);
>>
>> if (max_part)
>> bdev->bd_invalidated = 1;
>> @@ -1961,6 +1965,7 @@ static int nbd_genl_connect(struct sk_buff *skb, struct genl_info *info)
>> mutex_unlock(&nbd->config_lock);
>> if (!ret) {
>> set_bit(NBD_RT_HAS_CONFIG_REF, &config->runtime_flags);
>> + set_bit(NBD_HAS_STARTED, &nbd->flags);
>> refcount_inc(&nbd->config_refs);
>> nbd_connect_reply(info, nbd->index);
>> }
>> @@ -2008,6 +2013,14 @@ static int nbd_genl_disconnect(struct sk_buff *skb, struct genl_info *info)
>> index);
>> return -EINVAL;
>> }
>> +
>> + if (!test_bit(NBD_HAS_STARTED, &nbd->flags)) {
>> + mutex_unlock(&nbd_index_mutex);
>> + printk(KERN_ERR "nbd: device at index %d failed to start\n",
>> + index);
>> + return -EBUSY;
>> + }
>> +
>> if (!refcount_inc_not_zero(&nbd->refs)) {
>> mutex_unlock(&nbd_index_mutex);
>> printk(KERN_ERR "nbd: device at index %d is going down\n",
>> @@ -2049,6 +2062,14 @@ static int nbd_genl_reconfigure(struct sk_buff *skb, struct genl_info *info)
>> index);
>> return -EINVAL;
>> }
>> +
>> + if (!test_bit(NBD_HAS_STARTED, &nbd->flags)) {
>> + mutex_unlock(&nbd_index_mutex);
>> + printk(KERN_ERR "nbd: device at index %d failed to start\n",
>> + index);
>> + return -EBUSY;
>> + }
>> +
>> if (!refcount_inc_not_zero(&nbd->refs)) {
>> mutex_unlock(&nbd_index_mutex);
>> printk(KERN_ERR "nbd: device at index %d is going down\n",
>>
I thought the change in nbd_genl_reconfigure was necessary although my
test does not call it, but now I think it is superfluous:
nbd_genl_reconfigure already checks for a non-NULL task_recv.
Thanks,
Ke
>
>
> .
>
* Re: [PATCH] nbd: fix potential NULL pointer fault in connect and disconnect process
2020-01-19 7:10 ` sunke (E)
@ 2020-02-10 3:16 ` Mike Christie
2020-02-10 9:15 ` sunke (E)
0 siblings, 1 reply; 7+ messages in thread
From: Mike Christie @ 2020-02-10 3:16 UTC (permalink / raw)
To: sunke (E), josef, axboe; +Cc: linux-block, nbd, linux-kernel, Xiubo Li
[-- Attachment #1: Type: text/plain, Size: 2110 bytes --]
On 01/19/2020 01:10 AM, sunke (E) wrote:
>
> Thanks for your detailed suggestions.
>
> 在 2020/1/18 1:32, Mike Christie 写道:
>> On 01/17/2020 05:50 AM, Sun Ke wrote:
>>> Connect and disconnect a nbd device repeatedly, will cause
>>> NULL pointer fault.
>>>
>>> It will appear by the steps:
>>> 1. Connect the nbd device and disconnect it, but now nbd device
>>> is not disconnected totally.
>>> 2. Connect the same nbd device again immediately, it will fail
>>> in nbd_start_device with a EBUSY return value.
>>> 3. Wait a second to make sure the last config_refs is reduced
>>> and run nbd_config_put to disconnect the nbd device totally.
>>> 4. Start another process to open the nbd_device, config_refs
>>> will increase and at the same time disconnect it.
>>
>> Just to make sure I understood this, for step 4 the process is doing:
>>
>> open(/dev/nbdX);
>> ioctl(NBD_DISCONNECT, /dev/nbdX) or nbd_genl_disconnect(for /dev/nbdX)
>>
>> ?
>>
> do nbd_genl_disconnect(for /dev/nbdX);
> I tested it. Connect /dev/nbdX
> through ioctl interface by nbd-client -L -N export localhost /dev/nbdX and
> through netlink interface by nbd-client localhost XXXX /dev/nbdX,
> disconnect /dev/nbdX by nbd-client -d /dev/nbdX.
> Both call nbd_genl_disconnect(for /dev/nbdX) and both contain the same
> null pointer dereference.
>> There is no successful NBD_DO_IT / nbd_genl_connect between the open and
>> disconnect calls at step #4, because it would normally be done at #2 and
>> that failed. nbd_disconnect_and_put could then reference a null
>> recv_workq. If we are also racing with a close() then that could free
>> the device/config from under nbd_disconnect_and_put.
>>
> Yes, nbd_disconnect_and_put could then reference a null recv_workq.
Hey Sunke
How about the attached patch? I am still testing it. The basic idea is
that we need to do a flush whenever we have done a sock_shutdown and are
on the disconnect/connect/clear-sock path, so the patch just adds the
flush in that function. We then do not need to keep adding these flushes
everywhere.
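The shape of that refactor can be modeled in userspace C (simplified stand-ins, not the real driver functions): callers on the disconnect/connect/clear-sock paths pass flush_work = true, and the flush itself is guarded so a never-allocated workqueue is skipped instead of dereferenced.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant struct nbd_device fields. */
struct model_nbd {
	void *recv_workq;	/* NULL if nbd_genl_connect never succeeded */
	int flushes;		/* counts model flush_workqueue() calls */
};

static void model_flush_workqueue(struct model_nbd *nbd)
{
	nbd->flushes++;		/* the real code waits for recv works here */
}

/* Models the patched sock_shutdown(nbd, flush_work): the flush lives
 * in one place and is skipped when there is no workqueue to flush. */
static void model_sock_shutdown(struct model_nbd *nbd, bool flush_work)
{
	/* the socket shutdown itself is elided from the model */
	if (flush_work && nbd->recv_workq)
		model_flush_workqueue(nbd);
}
```

Callers that must not sleep waiting for works, such as the timeout path, pass flush_work = false, mirroring the sock_shutdown(nbd, false) call sites in the attached patch.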
[-- Attachment #2: 0001-nbd-fix-crash-during-nbd_genl_disconnect.patch --]
[-- Type: text/x-patch, Size: 5016 bytes --]
From c007f70b73d31a11ea90ae53c748266eec88c0ab Mon Sep 17 00:00:00 2001
From: Mike Christie <mchristi@redhat.com>
Date: Sun, 9 Feb 2020 21:06:00 -0600
Subject: [PATCH] nbd: fix crash during nbd_genl_disconnect
If we open a nbd device, but have not done a nbd_genl_connect we will
crash in nbd_genl_disconnect when we try to flush the workqueue since it
was never allocated.
This patch moves all the flush calls to sock_shutdown and adds a check
for a valid workqueue.
Note that flush_workqueue will do the right thing if there are no
running works, so we did not need the if (i) check in nbd_start_device.
This also fixes a possible bug where you could do nbd_genl_connect, then
do a NBD_DISCONNECT and NBD_CLEAR_SOCK and the workqueue would not get
flushed.
---
drivers/block/nbd.c | 44 +++++++++++++++++---------------------------
1 file changed, 17 insertions(+), 27 deletions(-)
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 78181908f0df..1afc70ed1f0a 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -341,9 +341,12 @@ static void nbd_complete_rq(struct request *req)
}
/*
- * Forcibly shutdown the socket causing all listeners to error
+ * Forcibly shutdown the socket causing all listeners to error. If called
+ * from a disconnect/clearing context we must make sure the recv workqueues
+ * have exited so they don't release the last nbd_config reference and try to
+ * destroy the workqueue from inside itself.
*/
-static void sock_shutdown(struct nbd_device *nbd)
+static void sock_shutdown(struct nbd_device *nbd, bool flush_work)
{
struct nbd_config *config = nbd->config;
int i;
@@ -351,7 +354,7 @@ static void sock_shutdown(struct nbd_device *nbd)
if (config->num_connections == 0)
return;
if (test_and_set_bit(NBD_RT_DISCONNECTED, &config->runtime_flags))
- return;
+ goto try_flush;
for (i = 0; i < config->num_connections; i++) {
struct nbd_sock *nsock = config->socks[i];
@@ -360,6 +363,10 @@ static void sock_shutdown(struct nbd_device *nbd)
mutex_unlock(&nsock->tx_lock);
}
dev_warn(disk_to_dev(nbd->disk), "shutting down sockets\n");
+
+try_flush:
+ if (flush_work && nbd->recv_workq)
+ flush_workqueue(nbd->recv_workq);
}
static u32 req_to_nbd_cmd_type(struct request *req)
@@ -446,7 +453,7 @@ static enum blk_eh_timer_return nbd_xmit_timeout(struct request *req,
set_bit(NBD_RT_TIMEDOUT, &config->runtime_flags);
cmd->status = BLK_STS_IOERR;
mutex_unlock(&cmd->lock);
- sock_shutdown(nbd);
+ sock_shutdown(nbd, false);
nbd_config_put(nbd);
done:
blk_mq_complete_request(req);
@@ -910,7 +917,7 @@ static int nbd_handle_cmd(struct nbd_cmd *cmd, int index)
* for the reconnect timer don't trigger the timer again
* and instead just error out.
*/
- sock_shutdown(nbd);
+ sock_shutdown(nbd, false);
nbd_config_put(nbd);
blk_mq_start_request(req);
return -EIO;
@@ -1178,7 +1185,7 @@ static int nbd_disconnect(struct nbd_device *nbd)
static void nbd_clear_sock(struct nbd_device *nbd)
{
- sock_shutdown(nbd);
+ sock_shutdown(nbd, true);
nbd_clear_que(nbd);
nbd->task_setup = NULL;
}
@@ -1264,17 +1271,7 @@ static int nbd_start_device(struct nbd_device *nbd)
args = kzalloc(sizeof(*args), GFP_KERNEL);
if (!args) {
- sock_shutdown(nbd);
- /*
- * If num_connections is m (2 < m),
- * and NO.1 ~ NO.n(1 < n < m) kzallocs are successful.
- * But NO.(n + 1) failed. We still have n recv threads.
- * So, add flush_workqueue here to prevent recv threads
- * dropping the last config_refs and trying to destroy
- * the workqueue from inside the workqueue.
- */
- if (i)
- flush_workqueue(nbd->recv_workq);
+ sock_shutdown(nbd, true);
return -ENOMEM;
}
sk_set_memalloc(config->socks[i]->sock->sk);
@@ -1306,9 +1303,7 @@ static int nbd_start_device_ioctl(struct nbd_device *nbd, struct block_device *b
mutex_unlock(&nbd->config_lock);
ret = wait_event_interruptible(config->recv_wq,
atomic_read(&config->recv_threads) == 0);
- if (ret)
- sock_shutdown(nbd);
- flush_workqueue(nbd->recv_workq);
+ sock_shutdown(nbd, true);
mutex_lock(&nbd->config_lock);
nbd_bdev_reset(bdev);
@@ -1323,7 +1318,7 @@ static int nbd_start_device_ioctl(struct nbd_device *nbd, struct block_device *b
static void nbd_clear_sock_ioctl(struct nbd_device *nbd,
struct block_device *bdev)
{
- sock_shutdown(nbd);
+ sock_shutdown(nbd, true);
__invalidate_device(bdev, true);
nbd_bdev_reset(bdev);
if (test_and_clear_bit(NBD_RT_HAS_CONFIG_REF,
@@ -1986,12 +1981,7 @@ static void nbd_disconnect_and_put(struct nbd_device *nbd)
nbd_disconnect(nbd);
nbd_clear_sock(nbd);
mutex_unlock(&nbd->config_lock);
- /*
- * Make sure recv thread has finished, so it does not drop the last
- * config ref and try to destroy the workqueue from inside the work
- * queue.
- */
- flush_workqueue(nbd->recv_workq);
+
if (test_and_clear_bit(NBD_RT_HAS_CONFIG_REF,
&nbd->config->runtime_flags))
nbd_config_put(nbd);
--
2.20.1
* Re: [PATCH] nbd: fix potential NULL pointer fault in connect and disconnect process
2020-02-10 3:16 ` Mike Christie
@ 2020-02-10 9:15 ` sunke (E)
0 siblings, 0 replies; 7+ messages in thread
From: sunke (E) @ 2020-02-10 9:15 UTC (permalink / raw)
To: Mike Christie, josef, axboe; +Cc: linux-block, nbd, linux-kernel, Xiubo Li
Hi Mike
Your idea looks good.
Thanks,
Sun Ke
在 2020/2/10 11:16, Mike Christie 写道:
> On 01/19/2020 01:10 AM, sunke (E) wrote:
>>
>> Thanks for your detailed suggestions.
>>
>> 在 2020/1/18 1:32, Mike Christie 写道:
>>> On 01/17/2020 05:50 AM, Sun Ke wrote:
>>>> Connect and disconnect a nbd device repeatedly, will cause
>>>> NULL pointer fault.
>>>>
>>>> It will appear by the steps:
>>>> 1. Connect the nbd device and disconnect it, but now nbd device
>>>> is not disconnected totally.
>>>> 2. Connect the same nbd device again immediately, it will fail
>>>> in nbd_start_device with a EBUSY return value.
>>>> 3. Wait a second to make sure the last config_refs is reduced
>>>> and run nbd_config_put to disconnect the nbd device totally.
>>>> 4. Start another process to open the nbd_device, config_refs
>>>> will increase and at the same time disconnect it.
>>>
>>> Just to make sure I understood this, for step 4 the process is doing:
>>>
>>> open(/dev/nbdX);
>>> ioctl(NBD_DISCONNECT, /dev/nbdX) or nbd_genl_disconnect(for /dev/nbdX)
>>>
>>> ?
>>>
>> do nbd_genl_disconnect(for /dev/nbdX);
>> I tested it. Connect /dev/nbdX
>> through ioctl interface by nbd-client -L -N export localhost /dev/nbdX and
>> through netlink interface by nbd-client localhost XXXX /dev/nbdX,
>> disconnect /dev/nbdX by nbd-client -d /dev/nbdX.
>> Both call nbd_genl_disconnect(for /dev/nbdX) and both contain the same
>> null pointer dereference.
>>> There is no successful NBD_DO_IT / nbd_genl_connect between the open and
>>> disconnect calls at step #4, because it would normally be done at #2 and
>>> that failed. nbd_disconnect_and_put could then reference a null
>>> recv_workq. If we are also racing with a close() then that could free
>>> the device/config from under nbd_disconnect_and_put.
>>>
>> Yes, nbd_disconnect_and_put could then reference a null recv_workq.
>
> Hey Sunke
>
> How about the attached patch. I am still testing it. The basic idea is
> that we need to do a flush whenever we have done a sock_shutdown and are
> in the disconnect/connect/clear sock path, so it just adds the flush in
> that function. We then do not need to keep adding these flushes everywhere.
>
end of thread, other threads:[~2020-02-10 9:15 UTC | newest]
Thread overview: 7+ messages
2020-01-17 11:50 [PATCH] nbd: fix potential NULL pointer fault in connect and disconnect process Sun Ke
2020-01-17 14:18 ` Josef Bacik
2020-01-19 6:27 ` sunke (E)
2020-01-17 17:32 ` Mike Christie
2020-01-19 7:10 ` sunke (E)
2020-02-10 3:16 ` Mike Christie
2020-02-10 9:15 ` sunke (E)