* [PATCH] nbd: fix potential NULL pointer fault in connect and disconnect process
@ 2020-01-17 11:50 Sun Ke
  2020-01-17 14:18 ` Josef Bacik
  2020-01-17 17:32 ` Mike Christie
  0 siblings, 2 replies; 7+ messages in thread
From: Sun Ke @ 2020-01-17 11:50 UTC (permalink / raw)
To: josef, axboe, sunke32; +Cc: linux-block, nbd, linux-kernel

Connect and disconnect a nbd device repeatedly, will cause
NULL pointer fault.

It will appear by the steps:
1. Connect the nbd device and disconnect it, but now nbd device
   is not disconnected totally.
2. Connect the same nbd device again immediately, it will fail
   in nbd_start_device with a EBUSY return value.
3. Wait a second to make sure the last config_refs is reduced
   and run nbd_config_put to disconnect the nbd device totally.
4. Start another process to open the nbd_device, config_refs
   will increase and at the same time disconnect it.

To fix it, add a NBD_HAS_STARTED flag. Set it in nbd_start_device_ioctl
and nbd_genl_connect if nbd device is started successfully.
Clear it in nbd_config_put. Test it in nbd_genl_disconnect and
nbd_genl_reconfigure.
Signed-off-by: Sun Ke <sunke32@huawei.com>
---
 drivers/block/nbd.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index b4607dd96185..ddd364e208ab 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -83,6 +83,7 @@ struct link_dead_args {
 
 #define NBD_DESTROY_ON_DISCONNECT	0
 #define NBD_DISCONNECT_REQUESTED	1
+#define NBD_HAS_STARTED			2
 
 struct nbd_config {
 	u32 flags;
@@ -1215,6 +1216,7 @@ static void nbd_config_put(struct nbd_device *nbd)
 		nbd->disk->queue->limits.discard_alignment = 0;
 		blk_queue_max_discard_sectors(nbd->disk->queue, UINT_MAX);
 		blk_queue_flag_clear(QUEUE_FLAG_DISCARD, nbd->disk->queue);
+		clear_bit(NBD_HAS_STARTED, &nbd->flags);
 
 		mutex_unlock(&nbd->config_lock);
 		nbd_put(nbd);
@@ -1290,6 +1292,8 @@ static int nbd_start_device_ioctl(struct nbd_device *nbd, struct block_device *b
 	ret = nbd_start_device(nbd);
 	if (ret)
 		return ret;
+	else
+		set_bit(NBD_HAS_STARTED, &nbd->flags);
 
 	if (max_part)
 		bdev->bd_invalidated = 1;
@@ -1961,6 +1965,7 @@ static int nbd_genl_connect(struct sk_buff *skb, struct genl_info *info)
 	mutex_unlock(&nbd->config_lock);
 	if (!ret) {
 		set_bit(NBD_RT_HAS_CONFIG_REF, &config->runtime_flags);
+		set_bit(NBD_HAS_STARTED, &nbd->flags);
 		refcount_inc(&nbd->config_refs);
 		nbd_connect_reply(info, nbd->index);
 	}
@@ -2008,6 +2013,14 @@ static int nbd_genl_disconnect(struct sk_buff *skb, struct genl_info *info)
 		       index);
 		return -EINVAL;
 	}
+
+	if (!test_bit(NBD_HAS_STARTED, &nbd->flags)) {
+		mutex_unlock(&nbd_index_mutex);
+		printk(KERN_ERR "nbd: device at index %d failed to start\n",
+		       index);
+		return -EBUSY;
+	}
+
 	if (!refcount_inc_not_zero(&nbd->refs)) {
 		mutex_unlock(&nbd_index_mutex);
 		printk(KERN_ERR "nbd: device at index %d is going down\n",
@@ -2049,6 +2062,14 @@ static int nbd_genl_reconfigure(struct sk_buff *skb, struct genl_info *info)
 		       index);
 		return -EINVAL;
 	}
+
+	if (!test_bit(NBD_HAS_STARTED, &nbd->flags)) {
+		mutex_unlock(&nbd_index_mutex);
+		printk(KERN_ERR "nbd: device at index %d failed to start\n",
+		       index);
+		return -EBUSY;
+	}
+
 	if (!refcount_inc_not_zero(&nbd->refs)) {
 		mutex_unlock(&nbd_index_mutex);
 		printk(KERN_ERR "nbd: device at index %d is going down\n",
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 7+ messages in thread
* Re: [PATCH] nbd: fix potential NULL pointer fault in connect and disconnect process
  2020-01-17 11:50 [PATCH] nbd: fix potential NULL pointer fault in connect and disconnect process Sun Ke
@ 2020-01-17 14:18 ` Josef Bacik
  2020-01-19  6:27   ` sunke (E)
  2020-01-17 17:32 ` Mike Christie
  1 sibling, 1 reply; 7+ messages in thread
From: Josef Bacik @ 2020-01-17 14:18 UTC (permalink / raw)
To: Sun Ke, axboe; +Cc: linux-block, nbd, linux-kernel

On 1/17/20 6:50 AM, Sun Ke wrote:
> Connect and disconnect a nbd device repeatedly, will cause
> NULL pointer fault.
>
> It will appear by the steps:
> 1. Connect the nbd device and disconnect it, but now nbd device
>    is not disconnected totally.
> 2. Connect the same nbd device again immediately, it will fail
>    in nbd_start_device with a EBUSY return value.
> 3. Wait a second to make sure the last config_refs is reduced
>    and run nbd_config_put to disconnect the nbd device totally.
> 4. Start another process to open the nbd_device, config_refs
>    will increase and at the same time disconnect it.
>
> To fix it, add a NBD_HAS_STARTED flag. Set it in nbd_start_device_ioctl
> and nbd_genl_connect if nbd device is started successfully.
> Clear it in nbd_config_put. Test it in nbd_genl_disconnect and
> nbd_genl_reconfigure.

I don't doubt what you are seeing, but what exactly are we NULL pointer
dereferencing?  I can't quite figure it out from the steps.

>
> Signed-off-by: Sun Ke <sunke32@huawei.com>
> ---
>  drivers/block/nbd.c | 21 +++++++++++++++++++++
>  1 file changed, 21 insertions(+)

[...]

> @@ -1290,6 +1292,8 @@ static int nbd_start_device_ioctl(struct nbd_device *nbd, struct block_device *b
>  	ret = nbd_start_device(nbd);
>  	if (ret)
>  		return ret;
> +	else
> +		set_bit(NBD_HAS_STARTED, &nbd->flags);

The else is superfluous here.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: [PATCH] nbd: fix potential NULL pointer fault in connect and disconnect process
  2020-01-17 14:18 ` Josef Bacik
@ 2020-01-19  6:27   ` sunke (E)
  0 siblings, 0 replies; 7+ messages in thread
From: sunke (E) @ 2020-01-19 6:27 UTC (permalink / raw)
To: Josef Bacik, axboe; +Cc: linux-block, nbd, linux-kernel

On 2020/1/17 22:18, Josef Bacik wrote:
> On 1/17/20 6:50 AM, Sun Ke wrote:
>> Connect and disconnect a nbd device repeatedly, will cause
>> NULL pointer fault.
>>
>> It will appear by the steps:
>> 1. Connect the nbd device and disconnect it, but now nbd device
>>    is not disconnected totally.
>> 2. Connect the same nbd device again immediately, it will fail
>>    in nbd_start_device with a EBUSY return value.
>> 3. Wait a second to make sure the last config_refs is reduced
>>    and run nbd_config_put to disconnect the nbd device totally.
>> 4. Start another process to open the nbd_device, config_refs
>>    will increase and at the same time disconnect it.
>>
>> To fix it, add a NBD_HAS_STARTED flag. Set it in nbd_start_device_ioctl
>> and nbd_genl_connect if nbd device is started successfully.
>> Clear it in nbd_config_put. Test it in nbd_genl_disconnect and
>> nbd_genl_reconfigure.
>
> I don't doubt what you are seeing, but what exactly are we NULL pointer
> dereferencing?  I can't quite figure it out from the steps.

The root cause is that on disconnect the pointers in struct nbd_device
are not freed immediately; freeing has to wait until the last
config_refs reference is dropped.
I got this kernel NULL pointer dereference report:

[  256.454582] Dev nbd0: unable to read RDB block 0
[  256.455611] Dev nbd0: unable to read RDB block 0
[  256.457528] Dev nbd0: unable to read RDB block 0
[  256.458742] Dev nbd0: unable to read RDB block 0
[  256.516375] Dev nbd0: unable to read RDB block 0
[  257.468970] BUG: kernel NULL pointer dereference, address: 0000000000000020
[  257.469645] #PF: supervisor write access in kernel mode
[  257.470445] #PF: error_code(0x0002) - not-present page
[  257.470888] PGD 12ecb7067 P4D 12ecb7067 PUD 12f3f2067 PMD 0
[  257.471384] Oops: 0002 [#1] SMP
[  257.471671] CPU: 1 PID: 1651 Comm: nbd-client Not tainted 5.5.0-rc5-00039-gae6088216ce4 #22
[  257.472501] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20180531_142017-buildhw-08.phx2.fedoraproject.org-1.fc28 04/01/2014
[  257.473776] RIP: 0010:mutex_lock+0x29/0x60
[  257.474593] Code: 00 0f 1f 44 00 00 53 48 89 fb 48 83 05 cf f8 01 02 01 e8 ea bd ff ff 48 83 05 ca f8 01 02 01 31 c0 65 48 8b 14 25 00 7d 01 00 <f0> 48 0f b1 13 74 f
[  257.476221] RSP: 0018:ffffc900004cfa10 EFLAGS: 00010246
[  257.476670] RAX: 0000000000000000 RBX: 0000000000000020 RCX: 0000000000000000
[  257.477289] RDX: ffff88812f524a00 RSI: ffffffff82e44212 RDI: 0000000000000020
[  257.477999] RBP: ffffc900004cfab0 R08: ffff88813bc6c110 R09: 0000000000000000
[  257.478617] R10: 8080808080808080 R11: 0000000000000018 R12: ffff88813584b000
[  257.479228] R13: ffffffff838b1f00 R14: ffffc900004cfbb8 R15: ffffc900004cfa40
[  257.479871] FS:  00007f0c30d75b40(0000) GS:ffff88813bc40000(0000) knlGS:0000000000000000
[  257.480569] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  257.481336] CR2: 0000000000000020 CR3: 000000012f5ea000 CR4: 00000000000006e0
[  257.481980] Call Trace:
[  257.482262]  flush_workqueue+0x91/0x690
[  257.482627]  ? nbd_size_update+0x180/0x180 [nbd]
[  257.483063]  nbd_disconnect_and_put+0x80/0xd0 [nbd]
[  257.483497]  nbd_genl_disconnect+0x153/0x2d0 [nbd]
[  257.483969]  genl_rcv_msg+0x2ab/0x620
[  257.484302]  ? netlink_unicast+0x3b8/0x5e0
[  257.484663]  ? __nlmsg_put+0x78/0x90
[  257.485009]  ? genl_family_rcv_msg_attrs_parse+0x1a0/0x1a0
[  257.485488]  netlink_rcv_skb+0x5a/0x1a0
[  257.485849]  genl_rcv+0x34/0x60
[  257.486129]  netlink_unicast+0x2a4/0x5e0
[  257.486468]  netlink_sendmsg+0x369/0x6b0
[  257.486854]  ? rw_copy_check_uvector+0x50/0x1d0
[  257.487257]  ____sys_sendmsg+0x1f7/0x370
[  257.487604]  ? copy_msghdr_from_user+0xff/0x1e0
[  257.488016]  ___sys_sendmsg+0x8c/0xe0
[  257.488335]  ? copy_msghdr_from_user+0xff/0x1e0
[  257.488730]  ? ___sys_recvmsg+0xa1/0xe0
[  257.489091]  ? handle_mm_fault+0x199/0x390
[  257.489454]  __sys_sendmsg+0x6b/0xe0
[  257.489766]  __x64_sys_sendmsg+0x23/0x30
[  257.490149]  do_syscall_64+0xab/0x410
[  257.490474]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  257.490942] RIP: 0033:0x7f0c3047cb87
[  257.491288] Code: 64 89 02 48 c7 c0 ff ff ff ff eb b9 0f 1f 80 00 00 00 00 8b 05 6a 2b 2c 00 48 63 d2 48 63 ff 85 c0 75 18 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 8
[  257.492940] RSP: 002b:00007ffefb59db28 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[  257.493627] RAX: ffffffffffffffda RBX: 00000000023b0120 RCX: 00007f0c3047cb87
[  257.494497] RDX: 0000000000000000 RSI: 00007ffefb59db60 RDI: 0000000000000003
[  257.495116] RBP: 00000000023b01f0 R08: 0000000000000014 R09: 0000000000000002
[  257.495731] R10: 0000000000000006 R11: 0000000000000246 R12: 00000000023b0030
[  257.496356] R13: 00007ffefb59db60 R14: 0000000000000001 R15: 00000000ffffffff
[  257.496989] Modules linked in: nbd
[  257.497580] CR2: 0000000000000020

Thanks,
Ke

>
>> Signed-off-by: Sun Ke <sunke32@huawei.com>
>> ---
>>  drivers/block/nbd.c | 21 +++++++++++++++++++++
>>  1 file changed, 21 insertions(+)

[...]

>> @@ -1290,6 +1292,8 @@ static int nbd_start_device_ioctl(struct nbd_device *nbd, struct block_device *b
>>  	ret = nbd_start_device(nbd);
>>  	if (ret)
>>  		return ret;
>> +	else
>> +		set_bit(NBD_HAS_STARTED, &nbd->flags);
>
> The else is superfluous here.  Thanks,
>
> Josef

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: [PATCH] nbd: fix potential NULL pointer fault in connect and disconnect process
  2020-01-17 11:50 [PATCH] nbd: fix potential NULL pointer fault in connect and disconnect process Sun Ke
  2020-01-17 14:18 ` Josef Bacik
@ 2020-01-17 17:32 ` Mike Christie
  2020-01-19  7:10   ` sunke (E)
  1 sibling, 1 reply; 7+ messages in thread
From: Mike Christie @ 2020-01-17 17:32 UTC (permalink / raw)
To: Sun Ke, josef, axboe; +Cc: linux-block, nbd, linux-kernel, Xiubo Li

On 01/17/2020 05:50 AM, Sun Ke wrote:
> Connect and disconnect a nbd device repeatedly, will cause
> NULL pointer fault.
>
> It will appear by the steps:
> 1. Connect the nbd device and disconnect it, but now nbd device
>    is not disconnected totally.
> 2. Connect the same nbd device again immediately, it will fail
>    in nbd_start_device with a EBUSY return value.
> 3. Wait a second to make sure the last config_refs is reduced
>    and run nbd_config_put to disconnect the nbd device totally.
> 4. Start another process to open the nbd_device, config_refs
>    will increase and at the same time disconnect it.

Just to make sure I understood this, for step 4 the process is doing:

open(/dev/nbdX);
ioctl(NBD_DISCONNECT, /dev/nbdX) or nbd_genl_disconnect(for /dev/nbdX)

?

There is no successful NBD_DO_IT / nbd_genl_connect between the open and
disconnect calls at step #4, because it would normally be done at #2 and
that failed. nbd_disconnect_and_put could then reference a null
recv_workq. If we are also racing with a close() then that could free
the device/config from under nbd_disconnect_and_put.

>
> To fix it, add a NBD_HAS_STARTED flag. Set it in nbd_start_device_ioctl

I'm not sure if we need the new bit. We could just add a check for a non
null task_recv in nbd_genl_disconnect like how nbd_start_device and
nbd_genl_disconnect do.

The new bit might be more clear which is nice. If we got this route,
should the new bit be a runtime_flag like other device state bits?

> and nbd_genl_connect if nbd device is started successfully.
> Clear it in nbd_config_put. Test it in nbd_genl_disconnect and
> nbd_genl_reconfigure.
>
> Signed-off-by: Sun Ke <sunke32@huawei.com>
> ---
>  drivers/block/nbd.c | 21 +++++++++++++++++++++
>  1 file changed, 21 insertions(+)

[...]

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: [PATCH] nbd: fix potential NULL pointer fault in connect and disconnect process
  2020-01-17 17:32 ` Mike Christie
@ 2020-01-19  7:10   ` sunke (E)
  2020-02-10  3:16     ` Mike Christie
  0 siblings, 1 reply; 7+ messages in thread
From: sunke (E) @ 2020-01-19 7:10 UTC (permalink / raw)
To: Mike Christie, josef, axboe; +Cc: linux-block, nbd, linux-kernel, Xiubo Li

Thanks for your detailed suggestions.

On 2020/1/18 1:32, Mike Christie wrote:
> On 01/17/2020 05:50 AM, Sun Ke wrote:
>> Connect and disconnect a nbd device repeatedly, will cause
>> NULL pointer fault.
[...]
> Just to make sure I understood this, for step 4 the process is doing:
>
> open(/dev/nbdX);
> ioctl(NBD_DISCONNECT, /dev/nbdX) or nbd_genl_disconnect(for /dev/nbdX)
>
> ?
>
do nbd_genl_disconnect(for /dev/nbdX);
I tested it. Connect /dev/nbdX
through ioctl interface by nbd-client -L -N export localhost /dev/nbdX and
through netlink interface by nbd-client localhost XXXX /dev/nbdX,
disconnect /dev/nbdX by nbd-client -d /dev/nbdX.
Both call nbd_genl_disconnect(for /dev/nbdX) and both contain the same
null pointer dereference.

> There is no successful NBD_DO_IT / nbd_genl_connect between the open and
> disconnect calls at step #4, because it would normally be done at #2 and
> that failed. nbd_disconnect_and_put could then reference a null
> recv_workq. If we are also racing with a close() then that could free
> the device/config from under nbd_disconnect_and_put.
>
Yes, nbd_disconnect_and_put could then reference a null recv_workq.

>>
>> To fix it, add a NBD_HAS_STARTED flag. Set it in nbd_start_device_ioctl
>
> I'm not sure if we need the new bit. We could just add a check for a non
> null task_recv in nbd_genl_disconnect like how nbd_start_device and
> nbd_genl_disconnect do.
>
I am also not very sure which is better, because in nbd_config_put not
only recv_workq is NULL; nbd->task_recv and nbd->config are as well. So
I suspect that if step 4 did something else it would also dereference a
NULL pointer.

> The new bit might be more clear which is nice. If we got this route,
> should the new bit be a runtime_flag like other device state bits?
>
Yes, I realize it. Just adding a check for a non-NULL task_recv in
nbd_genl_disconnect is better, right?

[...]

I thought the changes in nbd_genl_reconfigure were necessary although my
test does not call it, but now I think they are superfluous:
nbd_genl_reconfigure already checks for a non-NULL task_recv.

Thanks,
Ke

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: [PATCH] nbd: fix potential NULL pointer fault in connect and disconnect process
  2020-01-19  7:10 ` sunke (E)
@ 2020-02-10  3:16   ` Mike Christie
  2020-02-10  9:15     ` sunke (E)
  0 siblings, 1 reply; 7+ messages in thread
From: Mike Christie @ 2020-02-10 3:16 UTC (permalink / raw)
To: sunke (E), josef, axboe; +Cc: linux-block, nbd, linux-kernel, Xiubo Li

[-- Attachment #1: Type: text/plain, Size: 2110 bytes --]

On 01/19/2020 01:10 AM, sunke (E) wrote:
>
> Thanks for your detailed suggestions.
>
[...]
>> There is no successful NBD_DO_IT / nbd_genl_connect between the open and
>> disconnect calls at step #4, because it would normally be done at #2 and
>> that failed. nbd_disconnect_and_put could then reference a null
>> recv_workq. If we are also racing with a close() then that could free
>> the device/config from under nbd_disconnect_and_put.
>>
> Yes, nbd_disconnect_and_put could then reference a null recv_workq.

Hey Sunke,

How about the attached patch. I am still testing it. The basic idea is
that we need to do a flush whenever we have done a sock_shutdown and are
in the disconnect/connect/clear sock path, so it just adds the flush in
that function. We then do not need to keep adding these flushes
everywhere.

[-- Attachment #2: 0001-nbd-fix-crash-during-nbd_genl_disconnect.patch --]
[-- Type: text/x-patch, Size: 5016 bytes --]

From c007f70b73d31a11ea90ae53c748266eec88c0ab Mon Sep 17 00:00:00 2001
From: Mike Christie <mchristi@redhat.com>
Date: Sun, 9 Feb 2020 21:06:00 -0600
Subject: [PATCH] nbd: fix crash during nbd_genl_disconnect

If we open a nbd device, but have not done a nbd_genl_connect we will
crash in nbd_genl_disconnect when we try to flush the workqueue since
it was never allocated.

This patch moves all the flush calls to sock_shutdown and adds a check
for a valid workqueue.

Note that we flush_workqueue will do the right thing if there are no
running works so we did not need the if (i) check in nbd_start_device.

This also fixes a possible bug where you could do nbd_genl_connect,
then do a NBD_DISCONNECT and NBD_CLEAR_SOCK and the workqueue would not
get flushed.
---
 drivers/block/nbd.c | 44 +++++++++++++++++---------------------------
 1 file changed, 17 insertions(+), 27 deletions(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 78181908f0df..1afc70ed1f0a 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -341,9 +341,12 @@ static void nbd_complete_rq(struct request *req)
 }
 
 /*
- * Forcibly shutdown the socket causing all listeners to error
+ * Forcibly shutdown the socket causing all listeners to error. If called
+ * from a disconnect/clearing context we must make sure the recv workqueues
+ * have exited so they don't release the last nbd_config reference and try to
+ * destroy the workqueue from inside itself.
  */
-static void sock_shutdown(struct nbd_device *nbd)
+static void sock_shutdown(struct nbd_device *nbd, bool flush_work)
 {
 	struct nbd_config *config = nbd->config;
 	int i;
@@ -351,7 +354,7 @@ static void sock_shutdown(struct nbd_device *nbd)
 	if (config->num_connections == 0)
 		return;
 	if (test_and_set_bit(NBD_RT_DISCONNECTED, &config->runtime_flags))
-		return;
+		goto try_flush;
 
 	for (i = 0; i < config->num_connections; i++) {
 		struct nbd_sock *nsock = config->socks[i];
@@ -360,6 +363,10 @@ static void sock_shutdown(struct nbd_device *nbd)
 		mutex_unlock(&nsock->tx_lock);
 	}
 	dev_warn(disk_to_dev(nbd->disk), "shutting down sockets\n");
+
+try_flush:
+	if (flush_work && nbd->recv_workq)
+		flush_workqueue(nbd->recv_workq);
 }
 
 static u32 req_to_nbd_cmd_type(struct request *req)
@@ -446,7 +453,7 @@ static enum blk_eh_timer_return nbd_xmit_timeout(struct request *req,
 		set_bit(NBD_RT_TIMEDOUT, &config->runtime_flags);
 		cmd->status = BLK_STS_IOERR;
 		mutex_unlock(&cmd->lock);
-		sock_shutdown(nbd);
+		sock_shutdown(nbd, false);
 		nbd_config_put(nbd);
 done:
 	blk_mq_complete_request(req);
@@ -910,7 +917,7 @@ static int nbd_handle_cmd(struct nbd_cmd *cmd, int index)
 		 * for the reconnect timer don't trigger the timer again
 		 * and instead just error out.
 		 */
-		sock_shutdown(nbd);
+		sock_shutdown(nbd, false);
 		nbd_config_put(nbd);
 		blk_mq_start_request(req);
 		return -EIO;
@@ -1178,7 +1185,7 @@ static int nbd_disconnect(struct nbd_device *nbd)
 
 static void nbd_clear_sock(struct nbd_device *nbd)
 {
-	sock_shutdown(nbd);
+	sock_shutdown(nbd, true);
 	nbd_clear_que(nbd);
 	nbd->task_setup = NULL;
 }
@@ -1264,17 +1271,7 @@ static int nbd_start_device(struct nbd_device *nbd)
 
 		args = kzalloc(sizeof(*args), GFP_KERNEL);
 		if (!args) {
-			sock_shutdown(nbd);
-			/*
-			 * If num_connections is m (2 < m),
-			 * and NO.1 ~ NO.n(1 < n < m) kzallocs are successful.
-			 * But NO.(n + 1) failed. We still have n recv threads.
-			 * So, add flush_workqueue here to prevent recv threads
-			 * dropping the last config_refs and trying to destroy
-			 * the workqueue from inside the workqueue.
-			 */
-			if (i)
-				flush_workqueue(nbd->recv_workq);
+			sock_shutdown(nbd, true);
 			return -ENOMEM;
 		}
 		sk_set_memalloc(config->socks[i]->sock->sk);
@@ -1306,9 +1303,7 @@ static int nbd_start_device_ioctl(struct nbd_device *nbd, struct block_device *b
 	mutex_unlock(&nbd->config_lock);
 	ret = wait_event_interruptible(config->recv_wq,
 				       atomic_read(&config->recv_threads) == 0);
-	if (ret)
-		sock_shutdown(nbd);
-	flush_workqueue(nbd->recv_workq);
+	sock_shutdown(nbd, true);
 
 	mutex_lock(&nbd->config_lock);
 	nbd_bdev_reset(bdev);
@@ -1323,7 +1318,7 @@ static int nbd_start_device_ioctl(struct nbd_device *nbd, struct block_device *b
 static void nbd_clear_sock_ioctl(struct nbd_device *nbd,
 				 struct block_device *bdev)
 {
-	sock_shutdown(nbd);
+	sock_shutdown(nbd, true);
 	__invalidate_device(bdev, true);
 	nbd_bdev_reset(bdev);
 	if (test_and_clear_bit(NBD_RT_HAS_CONFIG_REF,
@@ -1986,12 +1981,7 @@ static void nbd_disconnect_and_put(struct nbd_device *nbd)
 	nbd_disconnect(nbd);
 	nbd_clear_sock(nbd);
 	mutex_unlock(&nbd->config_lock);
-	/*
-	 * Make sure recv thread has finished, so it does not drop the last
-	 * config ref and try to destroy the workqueue from inside the work
-	 * queue.
-	 */
-	flush_workqueue(nbd->recv_workq);
+
 	if (test_and_clear_bit(NBD_RT_HAS_CONFIG_REF,
 			       &nbd->config->runtime_flags))
 		nbd_config_put(nbd);
-- 
2.20.1

^ permalink raw reply related	[flat|nested] 7+ messages in thread
* Re: [PATCH] nbd: fix potential NULL pointer fault in connect and disconnect process
  2020-02-10  3:16 ` Mike Christie
@ 2020-02-10  9:15   ` sunke (E)
  0 siblings, 0 replies; 7+ messages in thread
From: sunke (E) @ 2020-02-10 9:15 UTC (permalink / raw)
To: Mike Christie, josef, axboe; +Cc: linux-block, nbd, linux-kernel, Xiubo Li

Hi Mike,

Your idea looks good.

Thanks,
Sun Ke

On 2020/2/10 11:16, Mike Christie wrote:
> On 01/19/2020 01:10 AM, sunke (E) wrote:
[...]
> How about the attached patch. I am still testing it. The basic idea is
> that we need to do a flush whenever we have done a sock_shutdown and are
> in the disconnect/connect/clear sock path, so it just adds the flush in
> that function. We then do not need to keep adding these flushes
> everywhere.

^ permalink raw reply	[flat|nested] 7+ messages in thread