From: Coly Li <colyli@suse.de>
To: linux-nvme@lists.infradead.org, linux-block@vger.kernel.org, linux-bcache@vger.kernel.org, hch@lst.de
Cc: Coly Li <colyli@suse.de>, Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>, Hannes Reinecke <hare@suse.de>, Jan Kara <jack@suse.com>, Jens Axboe <axboe@kernel.dk>, Mikhail Skorzhinskii <mskorzhinskiy@solarflare.com>, Philipp Reisner <philipp.reisner@linbit.com>, Sagi Grimberg <sagi@grimberg.me>, Vlastimil Babka <vbabka@suse.com>, stable@vger.kernel.org
Subject: [PATCH v2] nvme-tcp: don't use sendpage for pages not taking reference counter
Date: Mon, 13 Jul 2020 20:44:44 +0800
Message-ID: <20200713124444.19640-1-colyli@suse.de> (raw)

Currently nvme_tcp_try_send_data() does not use kernel_sendpage() to
send slab pages. But pages allocated by __get_free_pages() without
__GFP_COMP, which also have a refcount of 0, are still sent to the
remote end by kernel_sendpage(), which is problematic.

When bcache uses a remote NVMe SSD via nvme-over-tcp as its cache
device, writing meta data such as cache_set->disk_buckets to the remote
SSD may trigger a kernel panic due to the above problem, because the
meta data pages for cache_set->disk_buckets are allocated by
__get_free_pages() without __GFP_COMP.

This problem should be fixed both in the upper layer driver (bcache)
and in the nvme-over-tcp code. This patch fixes the nvme-over-tcp side
by checking whether the page refcount is 0; if so, kernel_sendpage() is
not used and sock_no_sendpage() is called instead to send the page into
the network stack.

The code comment in this patch is copied and modified from drbd, where
the same problem was already solved by Philipp Reisner. It is the best
of the candidate comments, including my own version.
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mikhail Skorzhinskii <mskorzhinskiy@solarflare.com>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Vlastimil Babka <vbabka@suse.com>
Cc: stable@vger.kernel.org
---
Changelog:
v2: fix typo in patch subject.
v1: the initial version.

 drivers/nvme/host/tcp.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 79ef2b8e2b3c..faa71db7522a 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -887,8 +887,17 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
 		else
 			flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
 
-		/* can't zcopy slab pages */
-		if (unlikely(PageSlab(page))) {
+		/*
+		 * e.g. XFS meta- & log-data is in slab pages, or bcache meta
+		 * data pages, or other high order pages allocated by
+		 * __get_free_pages() without __GFP_COMP, which have a page_count
+		 * of 0 and/or have PageSlab() set. We cannot use send_page for
+		 * those, as that does get_page(); put_page(); and would cause
+		 * either a VM_BUG directly, or __page_cache_release a page that
+		 * would actually still be referenced by someone, leading to some
+		 * obscure delayed Oops somewhere else.
+		 */
+		if (unlikely(PageSlab(page) || page_count(page) < 1)) {
 			ret = sock_no_sendpage(queue->sock, page, offset, len,
 					flags);
 		} else {
-- 
2.26.2
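As an aside on the upper-layer half of the fix mentioned in the commit message: one way for a driver like bcache to make its buffers safe for kernel_sendpage() in the first place is to allocate them as compound pages. The sketch below is illustrative only (it is not part of this patch, and the helper name is made up); it shows the __GFP_COMP idea under the assumption that the buffer will later be passed page-by-page into the network stack.

```
/*
 * Sketch only, not from this patch. Allocating a high-order buffer
 * with __GFP_COMP makes it a compound page whose head page carries a
 * real reference count, so the get_page()/put_page() pair done inside
 * kernel_sendpage() is safe on it. Without __GFP_COMP, the pages of a
 * high-order allocation have page_count() == 0 and hit exactly the
 * problem this patch works around on the nvme-tcp side.
 */
#include <linux/gfp.h>

static void *alloc_sendpage_safe_buf(unsigned int order)
{
	return (void *)__get_free_pages(GFP_KERNEL | __GFP_COMP, order);
}
```

A buffer from such a helper could go through the fast kernel_sendpage() path unconditionally; the check added by this patch would simply never match it, since page_count() on the compound head is at least 1.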