From: Hou Tao <houtao1@huawei.com>
To: Christoph Hellwig <hch@lst.de>
Cc: Josef Bacik <josef@toxicpanda.com>, Jens Axboe <axboe@kernel.dk>,
<linux-block@vger.kernel.org>, <nbd@other.debian.org>
Subject: Re: [PATCH v2 3/3] nbd: fix race between nbd_alloc_config() and module removal
Date: Wed, 8 Sep 2021 21:03:28 +0800
Message-ID: <319b5ef6-3d73-8795-e252-3c35fbe1b5bc@huawei.com>
In-Reply-To: <730dae5e-5af8-3554-18bf-e22ff576e2b1@huawei.com>

Hi Christoph,

Any comments on this patch?

On 9/7/2021 11:04 AM, Hou Tao wrote:
> Hi,
>
> On 9/6/2021 6:25 PM, Christoph Hellwig wrote:
>> On Mon, Sep 06, 2021 at 06:08:54PM +0800, Hou Tao wrote:
>>>>> + if (!try_module_get(THIS_MODULE))
>>>>> + return ERR_PTR(-ENODEV);
>>>> try_module_get(THIS_MODULE) is an indicator of an unsafe pattern. If
>>>> we don't already hold a reference, it can never close the race.
>>>>
>>>> Looking at the callers:
>>>>
>>>> - nbd_open like all block device operations must have a reference
>>>> already.
>>> Yes. nbd_open() has already taken a reference in dentry_open().
>>>> - for nbd_genl_connect I'm not an expert, but given that struct
>>>> nbd_genl_family has a module member, I suspect the networking
>>>> code already takes a reference.
>>> That was my original thought, but in fact the netlink code doesn't take a module reference
>>> in genl_family_rcv_msg_doit(). Instead, netlink uses genl_lock_all() to serialize module removal
>>> against calls into nbd_connect_genl_ops, so I think using try_module_get() is OK here.
>> How is this going to work? If there was a race, you just shortened it,
>> but it can still happen before you call try_module_get(). So I think we
>> need to look into how the netlink calling conventions are supposed to
>> look and understand the issues there first.
> Let me explain first. It works because of genl_lock_all() in the netlink code.
>
> If the module removal happens before try_module_get() is called, nbd_cleanup()
> calls genl_unregister_family() first, which in turn calls genl_lock_all().
> genl_lock_all() prevents the ops in nbd_connect_genl_ops from being called,
> because the nbd ops are invoked from genl_rcv(), which must acquire cb_lock first.
>
> process A                         process B
>
> module removal
>   genl_unregister_family()
>     genl_lock_all()
>       down_write(&cb_lock)
>                                   receive a new netlink message
>                                     genl_rcv()
>                                       // will wait for the removal of nbd ops
>                                       down_read(&cb_lock)
>
> If nbd_alloc_config() runs before the module removal, genl_rcv() must
> already have acquired cb_lock and genl_mutex, so nbd_cleanup() will block in
> genl_unregister_family(). Because the module is marked as dying before
> nbd_cleanup() is called, the try_module_get() in nbd_alloc_config() will
> find that the module is dying and fail, so nbd_genl_connect() fails too.
>
> process A                         process B
>
> a new netlink message
>   genl_rcv()
>     down_read(&cb_lock)
>     mutex_lock(&genl_mutex)
>     nbd_genl_connect()
>       nbd_alloc_config()
>                                   module removal
>                                     genl_unregister_family()
>         // module is dying, so fail
>         try_module_get()
>                                       genl_lock_all()
>                                         // wait for the completion of nbd ops
>                                         down_write(&cb_lock)
>
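To check the ordering argument above mechanically, here is a minimal userspace model in C. It is a hypothetical sketch, not kernel code: every model_* name is invented for illustration, and cb_lock is not modeled directly. Because cb_lock serializes genl_rcv() against genl_unregister_family(), the model simply runs the steps in each order that the lock permits. It assumes, as delete_module() does, that removal is refused while the module is still referenced and that otherwise the module is marked as going before nbd_cleanup() runs.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical userspace model of the race, NOT kernel code. All model_*
 * names are invented. cb_lock is not modeled directly: because it
 * serializes genl_rcv() against genl_unregister_family(), each scenario
 * just runs the steps in one of the orders the lock allows.
 */
enum model_mod_state { MODEL_LIVE, MODEL_GOING };

struct model_module {
	enum model_mod_state state;
	int refcnt;
};

struct model_world {
	struct model_module nbd;
	bool family_registered;	/* cleared by model_unregister_family() */
	int configs;		/* nbd configs allocated by connect */
};

/* Mirrors try_module_get(): fails once removal has marked the module GOING. */
static bool model_try_module_get(struct model_module *m)
{
	if (m->state == MODEL_GOING)
		return false;
	m->refcnt++;
	return true;
}

/* Start of rmmod: refused while referenced, otherwise mark GOING. */
static int model_begin_removal(struct model_world *w)
{
	if (w->nbd.refcnt != 0)
		return -EWOULDBLOCK;
	w->nbd.state = MODEL_GOING;
	return 0;
}

/* nbd_cleanup() -> genl_unregister_family(), under cb_lock (write). */
static void model_unregister_family(struct model_world *w)
{
	assert(w->nbd.state == MODEL_GOING);	/* exit runs only after GOING */
	w->family_registered = false;
}

/* genl_rcv() -> nbd_genl_connect() -> nbd_alloc_config(), under cb_lock (read). */
static int model_genl_rcv_connect(struct model_world *w)
{
	if (!w->family_registered)
		return -ENOENT;		/* family already gone: nbd op never runs */
	if (!model_try_module_get(&w->nbd))
		return -ENODEV;		/* module is dying: connect fails */
	w->configs++;			/* the new config keeps the module pinned */
	return 0;
}
```

In the first order (removal completes, then the message arrives) the request never reaches nbd and, in this model, gets -ENOENT. In the second (the handler holds cb_lock while removal marks the module as going) try_module_get() fails and connect returns -ENODEV, as in the patch. In the third (connect finishes before rmmod starts) the reference held by the new config makes removal refuse to proceed. No order allocates a config after the module has started going away.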
> I have checked several genl_ops. It seems that they don't need
> try_module_get() because they don't create new objects through genl_ops;
> they only control existing ones. genl_family_rcv_msg_dumpit() does end up
> calling try_module_get(), but according to the history (6dc878a8ca39
> "netlink: add reference of module in netlink_dump_start"), that is because
> inet_diag_handler_cmd() calls __netlink_dump_start().
>
> Regards,
>
> Tao
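The distinction drawn above, between genl_ops that only control existing objects and ops that create an object which lives on after the op returns (as an nbd config does), can be sketched the same way. Again this is a hypothetical userspace model with invented model_* names, not the kernel module API: a "create" op takes a module reference that the new object keeps until it is freed, while a "control" op relies on a reference that someone else already holds.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Hypothetical userspace model, NOT kernel code: an op that only controls
 * an existing object needs no module reference, while an op that creates
 * an object outliving the call pins the module until the object is freed.
 */
struct model_module {
	bool going;	/* set once removal has started */
	int refcnt;
};

struct model_config {
	struct model_module *owner;
	int value;
};

static bool model_try_module_get(struct model_module *m)
{
	if (m->going)
		return false;
	m->refcnt++;
	return true;
}

static void model_module_put(struct model_module *m)
{
	assert(m->refcnt > 0);
	m->refcnt--;
}

/* Removal is refused while any reference is held. */
static int model_try_remove(struct model_module *m)
{
	if (m->refcnt != 0)
		return -EWOULDBLOCK;
	m->going = true;
	return 0;
}

/* "Create" op: the new object outlives the op, so it takes a reference. */
static struct model_config *model_op_create(struct model_module *m)
{
	struct model_config *cfg;

	if (!model_try_module_get(m))
		return NULL;
	cfg = malloc(sizeof(*cfg));
	if (!cfg) {
		model_module_put(m);	/* undo the reference on failure */
		return NULL;
	}
	cfg->owner = m;
	cfg->value = 0;
	return cfg;
}

/* "Control" op: only touches an object whose creator already pinned it. */
static void model_op_control(struct model_config *cfg, int value)
{
	cfg->value = value;
}

/* Freeing the object drops the reference it was holding. */
static void model_config_put(struct model_config *cfg)
{
	struct model_module *m = cfg->owner;

	free(cfg);
	model_module_put(m);
}
```

While the created config exists, removal is refused; once it is freed, removal succeeds, and any later create fails because the module is already going away.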

Thread overview: 15+ messages
2021-09-04 12:25 [PATCH v2 0/3] fix races between nbd setup and module removal Hou Tao
2021-09-04 12:25 ` [PATCH v2 1/3] nbd: use pr_err to output error message Hou Tao
2021-09-06 9:27 ` Christoph Hellwig
2021-09-04 12:25 ` [PATCH v2 2/3] nbd: call genl_unregister_family() first in nbd_cleanup() Hou Tao
2021-09-06 9:27 ` Christoph Hellwig
2021-09-04 12:25 ` [PATCH v2 3/3] nbd: fix race between nbd_alloc_config() and module removal Hou Tao
2021-09-06 9:30 ` Christoph Hellwig
2021-09-06 10:08 ` Hou Tao
2021-09-06 10:25 ` Christoph Hellwig
2021-09-07 3:04 ` Hou Tao
2021-09-08 13:03 ` Hou Tao [this message]
2021-09-09 6:40 ` Christoph Hellwig
2021-09-13 4:32 ` Hou Tao
2021-09-13 15:25 ` Christoph Hellwig
2021-09-14 11:42 ` Wouter Verhelst