Linux-Block Archive on lore.kernel.org
* [PATCH 1/2] nvme-tcp: use sendpage_ok() to check page for kernel_sendpage()
@ 2020-07-26 13:52 Coly Li
  2020-07-26 13:52 ` [PATCH 2/2] drbd: code cleanup by using " Coly Li
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Coly Li @ 2020-07-26 13:52 UTC (permalink / raw)
  To: sagi, philipp.reisner, linux-nvme, linux-block, linux-bcache, hch
  Cc: Coly Li, Chaitanya Kulkarni, Hannes Reinecke, Jan Kara,
	Jens Axboe, Mikhail Skorzhinskii, Vlastimil Babka, stable

Currently nvme_tcp_try_send_data() doesn't use kernel_sendpage() to
send slab pages. But pages allocated by __get_free_pages() without
__GFP_COMP, which also have a refcount of 0, are still sent by
kernel_sendpage() to the remote end, which is problematic.

When bcache uses a remote NVMe SSD via nvme-over-tcp as its cache
device, writing meta data e.g. cache_set->disk_buckets to the remote SSD
may trigger a kernel panic due to the above problem, because the meta
data pages for cache_set->disk_buckets are allocated by
__get_free_pages() without __GFP_COMP.

This problem should be fixed in both the upper layer driver (bcache)
and the nvme-over-tcp code. This patch fixes the nvme-over-tcp code: if
the page refcount is 0, kernel_sendpage() is not used and
sock_no_sendpage() sends the page into the network stack instead.

The check is done by the macro sendpage_ok() in this patch, which is
defined in include/linux/net.h as,
	(!PageSlab(page) && page_count(page) >= 1)
If sendpage_ok() returns false, sock_no_sendpage() will handle the page
instead of kernel_sendpage().

The code comment in this patch is copied and modified from drbd, where
a similar problem was already solved by Philipp Reisner. It explains
the issue better than my own version.

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mikhail Skorzhinskii <mskorzhinskiy@solarflare.com>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Vlastimil Babka <vbabka@suse.com>
Cc: stable@vger.kernel.org
---
Changelog:
v3: introduce a more common name sendpage_ok() for the open coded check
v2: fix typo in patch subject.
v1: the initial version.

 drivers/nvme/host/tcp.c | 13 +++++++++++--
 include/linux/net.h     |  2 ++
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 79ef2b8e2b3c..f9952f6d94b9 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -887,8 +887,17 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
 		else
 			flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
 
-		/* can't zcopy slab pages */
-		if (unlikely(PageSlab(page))) {
+		/*
+		 * e.g. XFS meta- & log-data is in slab pages, or bcache meta
+		 * data pages, or other high order pages allocated by
+		 * __get_free_pages() without __GFP_COMP, which have a page_count
+		 * of 0 and/or have PageSlab() set. We cannot use send_page for
+		 * those, as that does get_page(); put_page(); and would cause
+		 * either a VM_BUG directly, or __page_cache_release a page that
+		 * would actually still be referenced by someone, leading to some
+		 * obscure delayed Oops somewhere else.
+		 */
+		if (unlikely(!sendpage_ok(page))) {
 			ret = sock_no_sendpage(queue->sock, page, offset, len,
 					flags);
 		} else {
diff --git a/include/linux/net.h b/include/linux/net.h
index 016a9c5faa34..41e5d2898e97 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -290,6 +290,8 @@ do {									\
 #define net_get_random_once_wait(buf, nbytes)			\
 	get_random_once_wait((buf), (nbytes))
 
+#define sendpage_ok(page)	(!PageSlab(page) && page_count(page) >= 1)
+
 int kernel_sendmsg(struct socket *sock, struct msghdr *msg, struct kvec *vec,
 		   size_t num, size_t len);
 int kernel_sendmsg_locked(struct sock *sk, struct msghdr *msg,
-- 
2.26.2



* [PATCH 2/2] drbd: code cleanup by using sendpage_ok() to check page for kernel_sendpage()
  2020-07-26 13:52 [PATCH 1/2] nvme-tcp: use sendpage_ok() to check page for kernel_sendpage() Coly Li
@ 2020-07-26 13:52 ` Coly Li
  2020-07-26 15:07 ` [PATCH 1/2] nvme-tcp: use " Christoph Hellwig
  2020-07-27 17:25 ` Sagi Grimberg
  2 siblings, 0 replies; 5+ messages in thread
From: Coly Li @ 2020-07-26 13:52 UTC (permalink / raw)
  To: sagi, philipp.reisner, linux-nvme, linux-block, linux-bcache, hch; +Cc: Coly Li

In _drbd_send_page() a page is checked by the following code before
being sent by kernel_sendpage(),
	(page_count(page) < 1) || PageSlab(page)
If the check is true, the page won't be sent by kernel_sendpage() but
will be handled by sock_no_sendpage() instead.

This kind of check is exactly what the macro sendpage_ok() does; it is
introduced into include/linux/net.h to solve a similar send page issue
in the nvme-tcp code.

This patch uses the macro sendpage_ok() to replace the open-coded
checks of page type and refcount in _drbd_send_page(), as a code
cleanup.

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/block/drbd/drbd_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 45fbd526c453..567d7e1d9f76 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -1552,7 +1552,7 @@ static int _drbd_send_page(struct drbd_peer_device *peer_device, struct page *pa
 	 * put_page(); and would cause either a VM_BUG directly, or
 	 * __page_cache_release a page that would actually still be referenced
 	 * by someone, leading to some obscure delayed Oops somewhere else. */
-	if (drbd_disable_sendpage || (page_count(page) < 1) || PageSlab(page))
+	if (drbd_disable_sendpage || !sendpage_ok(page))
 		return _drbd_no_send_page(peer_device, page, offset, size, msg_flags);
 
 	msg_flags |= MSG_NOSIGNAL;
-- 
2.26.2



* Re: [PATCH 1/2] nvme-tcp: use sendpage_ok() to check page for kernel_sendpage()
  2020-07-26 13:52 [PATCH 1/2] nvme-tcp: use sendpage_ok() to check page for kernel_sendpage() Coly Li
  2020-07-26 13:52 ` [PATCH 2/2] drbd: code cleanup by using " Coly Li
@ 2020-07-26 15:07 ` Christoph Hellwig
  2020-07-27 17:25 ` Sagi Grimberg
  2 siblings, 0 replies; 5+ messages in thread
From: Christoph Hellwig @ 2020-07-26 15:07 UTC (permalink / raw)
  To: Coly Li
  Cc: sagi, philipp.reisner, linux-nvme, linux-block, linux-bcache,
	hch, Chaitanya Kulkarni, Hannes Reinecke, Jan Kara, Jens Axboe,
	Mikhail Skorzhinskii, Vlastimil Babka, stable

The sendpage_ok helper should go into a separate patch and probably
be an inline function.  Also please add the netdev list to the Cc list.


* Re: [PATCH 1/2] nvme-tcp: use sendpage_ok() to check page for kernel_sendpage()
  2020-07-26 13:52 [PATCH 1/2] nvme-tcp: use sendpage_ok() to check page for kernel_sendpage() Coly Li
  2020-07-26 13:52 ` [PATCH 2/2] drbd: code cleanup by using " Coly Li
  2020-07-26 15:07 ` [PATCH 1/2] nvme-tcp: use " Christoph Hellwig
@ 2020-07-27 17:25 ` Sagi Grimberg
  2020-07-28 12:42   ` Coly Li
  2 siblings, 1 reply; 5+ messages in thread
From: Sagi Grimberg @ 2020-07-27 17:25 UTC (permalink / raw)
  To: Coly Li, philipp.reisner, linux-nvme, linux-block, linux-bcache, hch
  Cc: Chaitanya Kulkarni, Hannes Reinecke, Jan Kara, Jens Axboe,
	Mikhail Skorzhinskii, Vlastimil Babka, stable



On 7/26/20 6:52 AM, Coly Li wrote:
> Currently nvme_tcp_try_send_data() doesn't use kernel_sendpage() to
> send slab pages. But pages allocated by __get_free_pages() without
> __GFP_COMP, which also have a refcount of 0, are still sent by
> kernel_sendpage() to the remote end, which is problematic.
> 
> When bcache uses a remote NVMe SSD via nvme-over-tcp as its cache
> device, writing meta data e.g. cache_set->disk_buckets to the remote SSD
> may trigger a kernel panic due to the above problem, because the meta
> data pages for cache_set->disk_buckets are allocated by
> __get_free_pages() without __GFP_COMP.
> 
> This problem should be fixed in both the upper layer driver (bcache)
> and the nvme-over-tcp code. This patch fixes the nvme-over-tcp code: if
> the page refcount is 0, kernel_sendpage() is not used and
> sock_no_sendpage() sends the page into the network stack instead.
> 
> The check is done by the macro sendpage_ok() in this patch, which is
> defined in include/linux/net.h as,
> 	(!PageSlab(page) && page_count(page) >= 1)
> If sendpage_ok() returns false, sock_no_sendpage() will handle the page
> instead of kernel_sendpage().
> 
> The code comment in this patch is copied and modified from drbd, where
> a similar problem was already solved by Philipp Reisner. It explains
> the issue better than my own version.
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Jan Kara <jack@suse.com>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Mikhail Skorzhinskii <mskorzhinskiy@solarflare.com>
> Cc: Philipp Reisner <philipp.reisner@linbit.com>
> Cc: Sagi Grimberg <sagi@grimberg.me>
> Cc: Vlastimil Babka <vbabka@suse.com>
> Cc: stable@vger.kernel.org
> ---
> Changelog:
> v3: introduce a more common name sendpage_ok() for the open coded check
> v2: fix typo in patch subject.
> v1: the initial version.
> 
>   drivers/nvme/host/tcp.c | 13 +++++++++++--
>   include/linux/net.h     |  2 ++
>   2 files changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 79ef2b8e2b3c..f9952f6d94b9 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -887,8 +887,17 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
>   		else
>   			flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
>   
> -		/* can't zcopy slab pages */
> -		if (unlikely(PageSlab(page))) {
> +		/*
> +		 * e.g. XFS meta- & log-data is in slab pages, or bcache meta
> +		 * data pages, or other high order pages allocated by
> +		 * __get_free_pages() without __GFP_COMP, which have a page_count
> +		 * of 0 and/or have PageSlab() set. We cannot use send_page for
> +		 * those, as that does get_page(); put_page(); and would cause
> +		 * either a VM_BUG directly, or __page_cache_release a page that
> +		 * would actually still be referenced by someone, leading to some
> +		 * obscure delayed Oops somewhere else.
> +		 */

I was hoping that this comment would move to the helper as well.

Agree with Christoph's comment as well.


* Re: [PATCH 1/2] nvme-tcp: use sendpage_ok() to check page for kernel_sendpage()
  2020-07-27 17:25 ` Sagi Grimberg
@ 2020-07-28 12:42   ` Coly Li
  0 siblings, 0 replies; 5+ messages in thread
From: Coly Li @ 2020-07-28 12:42 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: philipp.reisner, linux-nvme, linux-block, linux-bcache, hch,
	Chaitanya Kulkarni, Hannes Reinecke, Jan Kara, Jens Axboe,
	Mikhail Skorzhinskii, Vlastimil Babka, stable

On 2020/7/28 01:25, Sagi Grimberg wrote:
> 
> 
> On 7/26/20 6:52 AM, Coly Li wrote:
>> Currently nvme_tcp_try_send_data() doesn't use kernel_sendpage() to
>> send slab pages. But pages allocated by __get_free_pages() without
>> __GFP_COMP, which also have a refcount of 0, are still sent by
>> kernel_sendpage() to the remote end, which is problematic.
>>
>> When bcache uses a remote NVMe SSD via nvme-over-tcp as its cache
>> device, writing meta data e.g. cache_set->disk_buckets to the remote SSD
>> may trigger a kernel panic due to the above problem, because the meta
>> data pages for cache_set->disk_buckets are allocated by
>> __get_free_pages() without __GFP_COMP.
>>
>> This problem should be fixed in both the upper layer driver (bcache)
>> and the nvme-over-tcp code. This patch fixes the nvme-over-tcp code: if
>> the page refcount is 0, kernel_sendpage() is not used and
>> sock_no_sendpage() sends the page into the network stack instead.
>>
>> The check is done by the macro sendpage_ok() in this patch, which is
>> defined in include/linux/net.h as,
>> 	(!PageSlab(page) && page_count(page) >= 1)
>> If sendpage_ok() returns false, sock_no_sendpage() will handle the page
>> instead of kernel_sendpage().
>>
>> The code comment in this patch is copied and modified from drbd, where
>> a similar problem was already solved by Philipp Reisner. It explains
>> the issue better than my own version.
>>
>> Signed-off-by: Coly Li <colyli@suse.de>
>> Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: Hannes Reinecke <hare@suse.de>
>> Cc: Jan Kara <jack@suse.com>
>> Cc: Jens Axboe <axboe@kernel.dk>
>> Cc: Mikhail Skorzhinskii <mskorzhinskiy@solarflare.com>
>> Cc: Philipp Reisner <philipp.reisner@linbit.com>
>> Cc: Sagi Grimberg <sagi@grimberg.me>
>> Cc: Vlastimil Babka <vbabka@suse.com>
>> Cc: stable@vger.kernel.org
>> ---
>> Changelog:
>> v3: introduce a more common name sendpage_ok() for the open coded check
>> v2: fix typo in patch subject.
>> v1: the initial version.
>>
>>   drivers/nvme/host/tcp.c | 13 +++++++++++--
>>   include/linux/net.h     |  2 ++
>>   2 files changed, 13 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index 79ef2b8e2b3c..f9952f6d94b9 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -887,8 +887,17 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
>>           else
>>               flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
>>   -        /* can't zcopy slab pages */
>> -        if (unlikely(PageSlab(page))) {
>> +        /*
>> +         * e.g. XFS meta- & log-data is in slab pages, or bcache meta
>> +         * data pages, or other high order pages allocated by
>> +         * __get_free_pages() without __GFP_COMP, which have a page_count
>> +         * of 0 and/or have PageSlab() set. We cannot use send_page for
>> +         * those, as that does get_page(); put_page(); and would cause
>> +         * either a VM_BUG directly, or __page_cache_release a page that
>> +         * would actually still be referenced by someone, leading to some
>> +         * obscure delayed Oops somewhere else.
>> +         */
> 
> I was hoping that this comment would move to the helper as well.
> 

Sure, I will do that.


> Agree with Christoph's comment as well.

I will move the inline sendpage_ok() into a separate patch.

Coly Li

