* [PATCH V4 0/3] Fix zone revalidation memory allocation failures
@ 2019-06-27  2:49 Damien Le Moal
  2019-06-27  2:49 ` [PATCH V4 1/3] block: Allow mapping of vmalloc-ed buffers Damien Le Moal
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Damien Le Moal @ 2019-06-27  2:49 UTC (permalink / raw)
  To: linux-scsi, Martin K . Petersen, linux-block, Jens Axboe
  Cc: Christoph Hellwig, Bart Van Assche

This series addresses a recurring zone revalidation failure observed
during extensive testing on memory-constrained systems with device
hot-plugging.

The root cause is a failure to allocate large memory areas with
alloc_pages() or kmalloc(), either in blk_revalidate_disk_zones() to
store the disk zone array (struct blk_zone) or in sd_zbc_report_zones()
for the report zones command reply buffer.

The solution proposed here is to:
1) Limit the number of zones reported with a single report zones
command execution, and
2) Use vmalloc() to allocate large-ish arrays and buffers in place of
alloc_pages() or kmalloc().

With these changes, tests no longer show zone revalidation failures,
and the time taken for disk zone inspection during device scan and
revalidation is not impacted.

Changes from v3:
* Reworked use of flush_kernel_vmap_range() and
  invalidate_kernel_vmap_range() to contain the calls within bio.c,
  transparently to the user of bio_map_kern().
* Add similar support to bio_copy_kern().

Changes from v2:
* Move invalidate_kernel_vmap_range() of vmalloc-ed buffer to sd_zbc.c
  in patch 2, after completion of scsi_execute_req().
* In patch 2, add flush_kernel_vmap_range() before scsi_execute_req().

Changes from V1:
* Added call to invalidate_kernel_vmap_range() for vmalloc-ed buffers
  in patch 1.
* Fixed patch 2 compilation error with Sparc64 (kbuild robot)

Damien Le Moal (3):
  block: Allow mapping of vmalloc-ed buffers
  sd_zbc: Fix report zones buffer allocation
  block: Limit zone array allocation size

 block/bio.c            | 43 +++++++++++++++++++++-
 block/blk-zoned.c      | 29 +++++++--------
 drivers/scsi/sd_zbc.c  | 83 +++++++++++++++++++++++++++++++-----------
 include/linux/blkdev.h |  5 +++
 4 files changed, 122 insertions(+), 38 deletions(-)

-- 
2.21.0


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH V4 1/3] block: Allow mapping of vmalloc-ed buffers
  2019-06-27  2:49 [PATCH V4 0/3] Fix zone revalidation memory allocation failures Damien Le Moal
@ 2019-06-27  2:49 ` Damien Le Moal
  2019-06-27  7:28   ` Christoph Hellwig
  2019-06-27  7:47   ` Ming Lei
  2019-06-27  2:49 ` [PATCH V4 2/3] sd_zbc: Fix report zones buffer allocation Damien Le Moal
  2019-06-27  2:49 ` [PATCH V4 3/3] block: Limit zone array allocation size Damien Le Moal
  2 siblings, 2 replies; 12+ messages in thread
From: Damien Le Moal @ 2019-06-27  2:49 UTC (permalink / raw)
  To: linux-scsi, Martin K . Petersen, linux-block, Jens Axboe
  Cc: Christoph Hellwig, Bart Van Assche

To allow the SCSI subsystem scsi_execute_req() function to issue
requests using large buffers that are better allocated with vmalloc()
rather than kmalloc(), modify bio_map_kern() and bio_copy_kern() to
allow passing a buffer allocated with vmalloc(). To do so, detect
vmalloc-ed buffers using is_vmalloc_addr(). For vmalloc-ed buffers,
flush the buffer using flush_kernel_vmap_range(), use vmalloc_to_page()
instead of virt_to_page() to obtain the pages of the buffer, and
invalidate the buffer addresses with invalidate_kernel_vmap_range() on
completion of read BIOs. This last point is executed using the function
bio_invalidate_vmalloc_pages() which is defined only if the
architecture defines ARCH_HAS_FLUSH_KERNEL_DCACHE_PAGE, that is, if the
architecture actually needs the invalidation done.

Fixes: 515ce6061312 ("scsi: sd_zbc: Fix sd_zbc_report_zones() buffer allocation")
Fixes: e76239a3748c ("block: add a report_zones method")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
---
 block/bio.c | 43 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 42 insertions(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index ce797d73bb43..1c21d1e7f1b8 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -16,6 +16,7 @@
 #include <linux/workqueue.h>
 #include <linux/cgroup.h>
 #include <linux/blk-cgroup.h>
+#include <linux/highmem.h>
 
 #include <trace/events/block.h>
 #include "blk.h"
@@ -1479,8 +1480,26 @@ void bio_unmap_user(struct bio *bio)
 	bio_put(bio);
 }
 
+#ifdef ARCH_HAS_FLUSH_KERNEL_DCACHE_PAGE
+static void bio_invalidate_vmalloc_pages(struct bio *bio)
+{
+	if (bio->bi_private) {
+		struct bvec_iter_all iter_all;
+		struct bio_vec *bvec;
+		unsigned long len = 0;
+
+		bio_for_each_segment_all(bvec, bio, iter_all)
+			len += bvec->bv_len;
+		invalidate_kernel_vmap_range(bio->bi_private, len);
+	}
+}
+#else
+static void bio_invalidate_vmalloc_pages(struct bio *bio) {}
+#endif
+
 static void bio_map_kern_endio(struct bio *bio)
 {
+	bio_invalidate_vmalloc_pages(bio);
 	bio_put(bio);
 }
 
@@ -1501,6 +1520,8 @@ struct bio *bio_map_kern(struct request_queue *q, void *data, unsigned int len,
 	unsigned long end = (kaddr + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	unsigned long start = kaddr >> PAGE_SHIFT;
 	const int nr_pages = end - start;
+	bool is_vmalloc = is_vmalloc_addr(data);
+	struct page *page;
 	int offset, i;
 	struct bio *bio;
 
@@ -1508,6 +1529,12 @@ struct bio *bio_map_kern(struct request_queue *q, void *data, unsigned int len,
 	if (!bio)
 		return ERR_PTR(-ENOMEM);
 
+	if (is_vmalloc) {
+		flush_kernel_vmap_range(data, len);
+		if ((!op_is_write(bio_op(bio))))
+			bio->bi_private = data;
+	}
+
 	offset = offset_in_page(kaddr);
 	for (i = 0; i < nr_pages; i++) {
 		unsigned int bytes = PAGE_SIZE - offset;
@@ -1518,7 +1545,11 @@ struct bio *bio_map_kern(struct request_queue *q, void *data, unsigned int len,
 		if (bytes > len)
 			bytes = len;
 
-		if (bio_add_pc_page(q, bio, virt_to_page(data), bytes,
+		if (!is_vmalloc)
+			page = virt_to_page(data);
+		else
+			page = vmalloc_to_page(data);
+		if (bio_add_pc_page(q, bio, page, bytes,
 				    offset) < bytes) {
 			/* we don't support partial mappings */
 			bio_put(bio);
@@ -1531,6 +1562,7 @@ struct bio *bio_map_kern(struct request_queue *q, void *data, unsigned int len,
 	}
 
 	bio->bi_end_io = bio_map_kern_endio;
+
 	return bio;
 }
 EXPORT_SYMBOL(bio_map_kern);
@@ -1543,6 +1575,7 @@ static void bio_copy_kern_endio(struct bio *bio)
 
 static void bio_copy_kern_endio_read(struct bio *bio)
 {
+	unsigned long len = 0;
 	char *p = bio->bi_private;
 	struct bio_vec *bvec;
 	struct bvec_iter_all iter_all;
@@ -1550,8 +1583,12 @@ static void bio_copy_kern_endio_read(struct bio *bio)
 	bio_for_each_segment_all(bvec, bio, iter_all) {
 		memcpy(p, page_address(bvec->bv_page), bvec->bv_len);
 		p += bvec->bv_len;
+		len += bvec->bv_len;
 	}
 
+	if (is_vmalloc_addr(bio->bi_private))
+		invalidate_kernel_vmap_range(bio->bi_private, len);
+
 	bio_copy_kern_endio(bio);
 }
 
@@ -1572,6 +1609,7 @@ struct bio *bio_copy_kern(struct request_queue *q, void *data, unsigned int len,
 	unsigned long kaddr = (unsigned long)data;
 	unsigned long end = (kaddr + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	unsigned long start = kaddr >> PAGE_SHIFT;
+	bool is_vmalloc = is_vmalloc_addr(data);
 	struct bio *bio;
 	void *p = data;
 	int nr_pages = 0;
@@ -1587,6 +1625,9 @@ struct bio *bio_copy_kern(struct request_queue *q, void *data, unsigned int len,
 	if (!bio)
 		return ERR_PTR(-ENOMEM);
 
+	if (is_vmalloc)
+		flush_kernel_vmap_range(data, len);
+
 	while (len) {
 		struct page *page;
 		unsigned int bytes = PAGE_SIZE;
-- 
2.21.0



* [PATCH V4 2/3] sd_zbc: Fix report zones buffer allocation
  2019-06-27  2:49 [PATCH V4 0/3] Fix zone revalidation memory allocation failures Damien Le Moal
  2019-06-27  2:49 ` [PATCH V4 1/3] block: Allow mapping of vmalloc-ed buffers Damien Le Moal
@ 2019-06-27  2:49 ` Damien Le Moal
  2019-06-27  2:49 ` [PATCH V4 3/3] block: Limit zone array allocation size Damien Le Moal
  2 siblings, 0 replies; 12+ messages in thread
From: Damien Le Moal @ 2019-06-27  2:49 UTC (permalink / raw)
  To: linux-scsi, Martin K . Petersen, linux-block, Jens Axboe
  Cc: Christoph Hellwig, Bart Van Assche

During disk scan and revalidation done with sd_revalidate(), the zones
of a zoned disk are checked using the helper function
blk_revalidate_disk_zones() if a configuration change is detected
(change in the number of zones or in the zone size). The function
blk_revalidate_disk_zones() issues report_zones calls that are very
large, that is, requesting zone information for all zones of the disk
with a single command. The report zones command buffer needed for such
a large request is generally smaller than both the disk max_hw_sectors
limit and KMALLOC_MAX_SIZE (4MB), so its allocation succeeds at boot
time (no memory fragmentation), but it often fails at run time (e.g. on
a hot-plug event). This causes the disk revalidation to fail and the
disk capacity to be changed to 0.

This problem can be avoided by using vmalloc() instead of kmalloc() for
the buffer allocation. To limit the amount of memory to be allocated,
this patch also introduces the arbitrary SD_ZBC_REPORT_MAX_ZONES
maximum number of zones to report with a single report zones command.
This limit may be lowered further to satisfy the disk max_hw_sectors
limit. Finally, to ensure that the vmalloc-ed buffer can always be
mapped in a request, the buffer size is further limited to at most
queue_max_segments() pages, allowing successful mapping of the buffer
even in the worst case scenario where none of the buffer pages are
contiguous.

Fixes: 515ce6061312 ("scsi: sd_zbc: Fix sd_zbc_report_zones() buffer allocation")
Fixes: e76239a3748c ("block: add a report_zones method")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
---
 drivers/scsi/sd_zbc.c | 83 ++++++++++++++++++++++++++++++++-----------
 1 file changed, 62 insertions(+), 21 deletions(-)

diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index 7334024b64f1..ecd967fb39c1 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -9,6 +9,7 @@
  */
 
 #include <linux/blkdev.h>
+#include <linux/vmalloc.h>
 
 #include <asm/unaligned.h>
 
@@ -50,7 +51,7 @@ static void sd_zbc_parse_report(struct scsi_disk *sdkp, u8 *buf,
 /**
  * sd_zbc_do_report_zones - Issue a REPORT ZONES scsi command.
  * @sdkp: The target disk
- * @buf: Buffer to use for the reply
+ * @buf: vmalloc-ed buffer to use for the reply
  * @buflen: the buffer size
  * @lba: Start LBA of the report
  * @partial: Do partial report
@@ -79,6 +80,7 @@ static int sd_zbc_do_report_zones(struct scsi_disk *sdkp, unsigned char *buf,
 	put_unaligned_be32(buflen, &cmd[10]);
 	if (partial)
 		cmd[14] = ZBC_REPORT_ZONE_PARTIAL;
+
 	memset(buf, 0, buflen);
 
 	result = scsi_execute_req(sdp, cmd, DMA_FROM_DEVICE,
@@ -103,6 +105,48 @@ static int sd_zbc_do_report_zones(struct scsi_disk *sdkp, unsigned char *buf,
 	return 0;
 }
 
+/*
+ * Maximum number of zones to get with one report zones command.
+ */
+#define SD_ZBC_REPORT_MAX_ZONES		8192U
+
+/**
+ * Allocate a buffer for report zones reply.
+ * @q: Request queue of the target disk
+ * @nr_zones: Maximum number of zones to report
+ * @buflen: Size of the buffer allocated
+ * @gfp_mask: Memory allocation mask
+ *
+ */
+static void *sd_zbc_alloc_report_buffer(struct request_queue *q,
+					unsigned int nr_zones, size_t *buflen,
+					gfp_t gfp_mask)
+{
+	size_t bufsize;
+	void *buf;
+
+	/*
+	 * Report zone buffer size should be at most 64B times the number of
+	 * zones requested plus the 64B reply header, but should be at least
+	 * SECTOR_SIZE for ATA devices.
+	 * Make sure that this size does not exceed the hardware capabilities.
+	 * Furthermore, since the report zone command cannot be split, make
+	 * sure that the allocated buffer can always be mapped by limiting the
+	 * number of pages allocated to the HBA max segments limit.
+	 */
+	nr_zones = min(nr_zones, SD_ZBC_REPORT_MAX_ZONES);
+	bufsize = roundup((nr_zones + 1) * 64, 512);
+	bufsize = min_t(size_t, bufsize,
+			queue_max_hw_sectors(q) << SECTOR_SHIFT);
+	bufsize = min_t(size_t, bufsize, queue_max_segments(q) << PAGE_SHIFT);
+
+	buf = __vmalloc(bufsize, gfp_mask, PAGE_KERNEL);
+	if (buf)
+		*buflen = bufsize;
+
+	return buf;
+}
+
 /**
  * sd_zbc_report_zones - Disk report zones operation.
  * @disk: The target disk
@@ -118,9 +162,9 @@ int sd_zbc_report_zones(struct gendisk *disk, sector_t sector,
 			gfp_t gfp_mask)
 {
 	struct scsi_disk *sdkp = scsi_disk(disk);
-	unsigned int i, buflen, nrz = *nr_zones;
+	unsigned int i, nrz = *nr_zones;
 	unsigned char *buf;
-	size_t offset = 0;
+	size_t buflen = 0, offset = 0;
 	int ret = 0;
 
 	if (!sd_is_zoned(sdkp))
@@ -132,16 +176,14 @@ int sd_zbc_report_zones(struct gendisk *disk, sector_t sector,
 	 * without exceeding the device maximum command size. For ATA disks,
 	 * buffers must be aligned to 512B.
 	 */
-	buflen = min(queue_max_hw_sectors(disk->queue) << 9,
-		     roundup((nrz + 1) * 64, 512));
-	buf = kmalloc(buflen, gfp_mask);
+	buf = sd_zbc_alloc_report_buffer(disk->queue, nrz, &buflen, gfp_mask);
 	if (!buf)
 		return -ENOMEM;
 
 	ret = sd_zbc_do_report_zones(sdkp, buf, buflen,
 			sectors_to_logical(sdkp->device, sector), true);
 	if (ret)
-		goto out_free_buf;
+		goto out;
 
 	nrz = min(nrz, get_unaligned_be32(&buf[0]) / 64);
 	for (i = 0; i < nrz; i++) {
@@ -152,8 +194,8 @@ int sd_zbc_report_zones(struct gendisk *disk, sector_t sector,
 
 	*nr_zones = nrz;
 
-out_free_buf:
-	kfree(buf);
+out:
+	kvfree(buf);
 
 	return ret;
 }
@@ -287,8 +329,6 @@ static int sd_zbc_check_zoned_characteristics(struct scsi_disk *sdkp,
 	return 0;
 }
 
-#define SD_ZBC_BUF_SIZE 131072U
-
 /**
  * sd_zbc_check_zones - Check the device capacity and zone sizes
  * @sdkp: Target disk
@@ -304,22 +344,23 @@ static int sd_zbc_check_zoned_characteristics(struct scsi_disk *sdkp,
  */
 static int sd_zbc_check_zones(struct scsi_disk *sdkp, u32 *zblocks)
 {
+	size_t bufsize, buflen;
 	u64 zone_blocks = 0;
 	sector_t max_lba, block = 0;
 	unsigned char *buf;
 	unsigned char *rec;
-	unsigned int buf_len;
-	unsigned int list_length;
 	int ret;
 	u8 same;
 
 	/* Get a buffer */
-	buf = kmalloc(SD_ZBC_BUF_SIZE, GFP_KERNEL);
+	buf = sd_zbc_alloc_report_buffer(sdkp->disk->queue,
+					 SD_ZBC_REPORT_MAX_ZONES,
+					 &bufsize, GFP_NOIO);
 	if (!buf)
 		return -ENOMEM;
 
 	/* Do a report zone to get max_lba and the same field */
-	ret = sd_zbc_do_report_zones(sdkp, buf, SD_ZBC_BUF_SIZE, 0, false);
+	ret = sd_zbc_do_report_zones(sdkp, buf, bufsize, 0, false);
 	if (ret)
 		goto out_free;
 
@@ -355,12 +396,12 @@ static int sd_zbc_check_zones(struct scsi_disk *sdkp, u32 *zblocks)
 	do {
 
 		/* Parse REPORT ZONES header */
-		list_length = get_unaligned_be32(&buf[0]) + 64;
+		buflen = min_t(size_t, get_unaligned_be32(&buf[0]) + 64,
+			       bufsize);
 		rec = buf + 64;
-		buf_len = min(list_length, SD_ZBC_BUF_SIZE);
 
 		/* Parse zone descriptors */
-		while (rec < buf + buf_len) {
+		while (rec < buf + buflen) {
 			u64 this_zone_blocks = get_unaligned_be64(&rec[8]);
 
 			if (zone_blocks == 0) {
@@ -376,8 +417,8 @@ static int sd_zbc_check_zones(struct scsi_disk *sdkp, u32 *zblocks)
 		}
 
 		if (block < sdkp->capacity) {
-			ret = sd_zbc_do_report_zones(sdkp, buf, SD_ZBC_BUF_SIZE,
-						     block, true);
+			ret = sd_zbc_do_report_zones(sdkp, buf, bufsize, block,
+						     true);
 			if (ret)
 				goto out_free;
 		}
@@ -408,7 +449,7 @@ static int sd_zbc_check_zones(struct scsi_disk *sdkp, u32 *zblocks)
 	}
 
 out_free:
-	kfree(buf);
+	kvfree(buf);
 
 	return ret;
 }
-- 
2.21.0



* [PATCH V4 3/3] block: Limit zone array allocation size
  2019-06-27  2:49 [PATCH V4 0/3] Fix zone revalidation memory allocation failures Damien Le Moal
  2019-06-27  2:49 ` [PATCH V4 1/3] block: Allow mapping of vmalloc-ed buffers Damien Le Moal
  2019-06-27  2:49 ` [PATCH V4 2/3] sd_zbc: Fix report zones buffer allocation Damien Le Moal
@ 2019-06-27  2:49 ` Damien Le Moal
  2 siblings, 0 replies; 12+ messages in thread
From: Damien Le Moal @ 2019-06-27  2:49 UTC (permalink / raw)
  To: linux-scsi, Martin K . Petersen, linux-block, Jens Axboe
  Cc: Christoph Hellwig, Bart Van Assche

Limit the size of the struct blk_zone array used in
blk_revalidate_disk_zones() to avoid memory allocation failures leading
to disk revalidation failure. Further reduce the likelihood of these
failures by using kvmalloc() instead of directly allocating contiguous
pages.

Fixes: 515ce6061312 ("scsi: sd_zbc: Fix sd_zbc_report_zones() buffer allocation")
Fixes: e76239a3748c ("block: add a report_zones method")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-zoned.c      | 29 +++++++++++++----------------
 include/linux/blkdev.h |  5 +++++
 2 files changed, 18 insertions(+), 16 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index ae7e91bd0618..26f878b9b5f5 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -373,22 +373,20 @@ static inline unsigned long *blk_alloc_zone_bitmap(int node,
  * Allocate an array of struct blk_zone to get nr_zones zone information.
  * The allocated array may be smaller than nr_zones.
  */
-static struct blk_zone *blk_alloc_zones(int node, unsigned int *nr_zones)
+static struct blk_zone *blk_alloc_zones(unsigned int *nr_zones)
 {
-	size_t size = *nr_zones * sizeof(struct blk_zone);
-	struct page *page;
-	int order;
-
-	for (order = get_order(size); order >= 0; order--) {
-		page = alloc_pages_node(node, GFP_NOIO | __GFP_ZERO, order);
-		if (page) {
-			*nr_zones = min_t(unsigned int, *nr_zones,
-				(PAGE_SIZE << order) / sizeof(struct blk_zone));
-			return page_address(page);
-		}
+	struct blk_zone *zones;
+	size_t nrz = min(*nr_zones, BLK_ZONED_REPORT_MAX_ZONES);
+
+	zones = kvcalloc(nrz, sizeof(struct blk_zone), GFP_NOIO);
+	if (!zones) {
+		*nr_zones = 0;
+		return NULL;
 	}
 
-	return NULL;
+	*nr_zones = nrz;
+
+	return zones;
 }
 
 void blk_queue_free_zone_bitmaps(struct request_queue *q)
@@ -443,7 +441,7 @@ int blk_revalidate_disk_zones(struct gendisk *disk)
 
 	/* Get zone information and initialize seq_zones_bitmap */
 	rep_nr_zones = nr_zones;
-	zones = blk_alloc_zones(q->node, &rep_nr_zones);
+	zones = blk_alloc_zones(&rep_nr_zones);
 	if (!zones)
 		goto out;
 
@@ -480,8 +478,7 @@ int blk_revalidate_disk_zones(struct gendisk *disk)
 	blk_mq_unfreeze_queue(q);
 
 out:
-	free_pages((unsigned long)zones,
-		   get_order(rep_nr_zones * sizeof(struct blk_zone)));
+	kvfree(zones);
 	kfree(seq_zones_wlock);
 	kfree(seq_zones_bitmap);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 592669bcc536..f7faac856017 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -344,6 +344,11 @@ struct queue_limits {
 
 #ifdef CONFIG_BLK_DEV_ZONED
 
+/*
+ * Maximum number of zones to report with a single report zones command.
+ */
+#define BLK_ZONED_REPORT_MAX_ZONES	8192U
+
 extern unsigned int blkdev_nr_zones(struct block_device *bdev);
 extern int blkdev_report_zones(struct block_device *bdev,
 			       sector_t sector, struct blk_zone *zones,
-- 
2.21.0



* Re: [PATCH V4 1/3] block: Allow mapping of vmalloc-ed buffers
  2019-06-27  2:49 ` [PATCH V4 1/3] block: Allow mapping of vmalloc-ed buffers Damien Le Moal
@ 2019-06-27  7:28   ` Christoph Hellwig
  2019-06-27  8:14     ` Damien Le Moal
  2019-06-27  7:47   ` Ming Lei
  1 sibling, 1 reply; 12+ messages in thread
From: Christoph Hellwig @ 2019-06-27  7:28 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-scsi, Martin K . Petersen, linux-block, Jens Axboe,
	Christoph Hellwig, Bart Van Assche

> +#ifdef ARCH_HAS_FLUSH_KERNEL_DCACHE_PAGE

That seems like an odd construct, as you don't call
flush_kernel_dcache_page().  From looking at who defines it, it seems
to cover the right set of architectures, but that might be a mix of
chance and similar requirements for cache flushing.

> +static void bio_invalidate_vmalloc_pages(struct bio *bio)
> +{
> +	if (bio->bi_private) {
> +		struct bvec_iter_all iter_all;
> +		struct bio_vec *bvec;
> +		unsigned long len = 0;
> +
> +		bio_for_each_segment_all(bvec, bio, iter_all)
> +			len += bvec->bv_len;
> +             invalidate_kernel_vmap_range(bio->bi_private, len);

We control the bio here, so we can directly iterate over the
segments instead of doing the fairly expensive bio_for_each_segment_all
call that goes to each page and builds a bvec for it.

> +	struct page *page;
>  	int offset, i;
>  	struct bio *bio;
>  
> @@ -1508,6 +1529,12 @@ struct bio *bio_map_kern(struct request_queue *q, void *data, unsigned int len,
>  	if (!bio)
>  		return ERR_PTR(-ENOMEM);
>  
> +	if (is_vmalloc) {
> +		flush_kernel_vmap_range(data, len);
> +		if ((!op_is_write(bio_op(bio))))
> +			bio->bi_private = data;
> +	}

We've just allocated the bio, so bio->bi_opf is not actually set at
this point unfortunately.


* Re: [PATCH V4 1/3] block: Allow mapping of vmalloc-ed buffers
  2019-06-27  2:49 ` [PATCH V4 1/3] block: Allow mapping of vmalloc-ed buffers Damien Le Moal
  2019-06-27  7:28   ` Christoph Hellwig
@ 2019-06-27  7:47   ` Ming Lei
  2019-06-27  8:17     ` Damien Le Moal
  1 sibling, 1 reply; 12+ messages in thread
From: Ming Lei @ 2019-06-27  7:47 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-scsi, Martin K . Petersen, linux-block, Jens Axboe,
	Christoph Hellwig, Bart Van Assche

On Thu, Jun 27, 2019 at 11:49:08AM +0900, Damien Le Moal wrote:
> To allow the SCSI subsystem scsi_execute_req() function to issue
> requests using large buffers that are better allocated with vmalloc()
> rather than kmalloc(), modify bio_map_kern() and bio_copy_kern() to
> allow passing a buffer allocated with vmalloc(). To do so, detect
> vmalloc-ed buffers using is_vmalloc_addr(). For vmalloc-ed buffers,
> flush the buffer using flush_kernel_vmap_range(), use vmalloc_to_page()
> instead of virt_to_page() to obtain the pages of the buffer, and
> invalidate the buffer addresses with invalidate_kernel_vmap_range() on
> completion of read BIOs. This last point is executed using the function
> bio_invalidate_vmalloc_pages() which is defined only if the
> architecture defines ARCH_HAS_FLUSH_KERNEL_DCACHE_PAGE, that is, if the
> architecture actually needs the invalidation done.
> 
> Fixes: 515ce6061312 ("scsi: sd_zbc: Fix sd_zbc_report_zones() buffer allocation")
> Fixes: e76239a3748c ("block: add a report_zones method")
> Cc: stable@vger.kernel.org
> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
> ---
>  block/bio.c | 43 ++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 42 insertions(+), 1 deletion(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index ce797d73bb43..1c21d1e7f1b8 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -16,6 +16,7 @@
>  #include <linux/workqueue.h>
>  #include <linux/cgroup.h>
>  #include <linux/blk-cgroup.h>
> +#include <linux/highmem.h>
>  
>  #include <trace/events/block.h>
>  #include "blk.h"
> @@ -1479,8 +1480,26 @@ void bio_unmap_user(struct bio *bio)
>  	bio_put(bio);
>  }
>  
> +#ifdef ARCH_HAS_FLUSH_KERNEL_DCACHE_PAGE
> +static void bio_invalidate_vmalloc_pages(struct bio *bio)
> +{
> +	if (bio->bi_private) {
> +		struct bvec_iter_all iter_all;
> +		struct bio_vec *bvec;
> +		unsigned long len = 0;
> +
> +		bio_for_each_segment_all(bvec, bio, iter_all)
> +			len += bvec->bv_len;
> +		invalidate_kernel_vmap_range(bio->bi_private, len);
> +	}
> +}
> +#else
> +static void bio_invalidate_vmalloc_pages(struct bio *bio) {}
> +#endif
> +
>  static void bio_map_kern_endio(struct bio *bio)
>  {
> +	bio_invalidate_vmalloc_pages(bio);
>  	bio_put(bio);
>  }
>  
> @@ -1501,6 +1520,8 @@ struct bio *bio_map_kern(struct request_queue *q, void *data, unsigned int len,
>  	unsigned long end = (kaddr + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
>  	unsigned long start = kaddr >> PAGE_SHIFT;
>  	const int nr_pages = end - start;
> +	bool is_vmalloc = is_vmalloc_addr(data);
> +	struct page *page;
>  	int offset, i;
>  	struct bio *bio;
>  
> @@ -1508,6 +1529,12 @@ struct bio *bio_map_kern(struct request_queue *q, void *data, unsigned int len,
>  	if (!bio)
>  		return ERR_PTR(-ENOMEM);
>  
> +	if (is_vmalloc) {
> +		flush_kernel_vmap_range(data, len);
> +		if ((!op_is_write(bio_op(bio))))
> +			bio->bi_private = data;
> +	}
> +
>  	offset = offset_in_page(kaddr);
>  	for (i = 0; i < nr_pages; i++) {
>  		unsigned int bytes = PAGE_SIZE - offset;
> @@ -1518,7 +1545,11 @@ struct bio *bio_map_kern(struct request_queue *q, void *data, unsigned int len,
>  		if (bytes > len)
>  			bytes = len;
>  
> -		if (bio_add_pc_page(q, bio, virt_to_page(data), bytes,
> +		if (!is_vmalloc)
> +			page = virt_to_page(data);
> +		else
> +			page = vmalloc_to_page(data);
> +		if (bio_add_pc_page(q, bio, page, bytes,
>  				    offset) < bytes) {
>  			/* we don't support partial mappings */
>  			bio_put(bio);
> @@ -1531,6 +1562,7 @@ struct bio *bio_map_kern(struct request_queue *q, void *data, unsigned int len,
>  	}
>  
>  	bio->bi_end_io = bio_map_kern_endio;
> +
>  	return bio;
>  }
>  EXPORT_SYMBOL(bio_map_kern);
> @@ -1543,6 +1575,7 @@ static void bio_copy_kern_endio(struct bio *bio)
>  
>  static void bio_copy_kern_endio_read(struct bio *bio)
>  {
> +	unsigned long len = 0;
>  	char *p = bio->bi_private;
>  	struct bio_vec *bvec;
>  	struct bvec_iter_all iter_all;
> @@ -1550,8 +1583,12 @@ static void bio_copy_kern_endio_read(struct bio *bio)
>  	bio_for_each_segment_all(bvec, bio, iter_all) {
>  		memcpy(p, page_address(bvec->bv_page), bvec->bv_len);
>  		p += bvec->bv_len;
> +		len += bvec->bv_len;
>  	}
>  
> +	if (is_vmalloc_addr(bio->bi_private))
> +		invalidate_kernel_vmap_range(bio->bi_private, len);
> +
>  	bio_copy_kern_endio(bio);
>  }
>  
> @@ -1572,6 +1609,7 @@ struct bio *bio_copy_kern(struct request_queue *q, void *data, unsigned int len,
>  	unsigned long kaddr = (unsigned long)data;
>  	unsigned long end = (kaddr + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
>  	unsigned long start = kaddr >> PAGE_SHIFT;
> +	bool is_vmalloc = is_vmalloc_addr(data);
>  	struct bio *bio;
>  	void *p = data;
>  	int nr_pages = 0;
> @@ -1587,6 +1625,9 @@ struct bio *bio_copy_kern(struct request_queue *q, void *data, unsigned int len,
>  	if (!bio)
>  		return ERR_PTR(-ENOMEM);
>  
> +	if (is_vmalloc)
> +		flush_kernel_vmap_range(data, len);
> +

Are you sure that invalidate[|flush]_kernel_vmap_range is needed for
bio_copy_kern? The vmalloc buffer isn't involved in the IO and is only
accessed by the CPU.

Thanks,
Ming


* Re: [PATCH V4 1/3] block: Allow mapping of vmalloc-ed buffers
  2019-06-27  7:28   ` Christoph Hellwig
@ 2019-06-27  8:14     ` Damien Le Moal
  2019-06-27  8:25       ` Christoph Hellwig
  0 siblings, 1 reply; 12+ messages in thread
From: Damien Le Moal @ 2019-06-27  8:14 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-scsi, Martin K . Petersen, linux-block, Jens Axboe,
	Bart Van Assche

Christoph,

On 2019/06/27 16:28, Christoph Hellwig wrote:
>> +#ifdef ARCH_HAS_FLUSH_KERNEL_DCACHE_PAGE
> 
> That seems like an odd construct, as you don't call
> flush_kernel_dcache_page().  From looking at who defines it, it seems
> to cover the right set of architectures, but that might be a mix of
> chance and similar requirements for cache flushing.

This comes from include/linux/highmem.h:

#ifndef ARCH_HAS_FLUSH_KERNEL_DCACHE_PAGE
static inline void flush_kernel_dcache_page(struct page *page)
{
}
static inline void flush_kernel_vmap_range(void *vaddr, int size)
{
}
static inline void invalidate_kernel_vmap_range(void *vaddr, int size)
{
}
#endif

which I guessed is for the architectures that do not need the
flush/invalidate vmap functions, so I copied it. Is there a better way?
The point was to avoid doing the loop over the bvecs for the range
length on architectures that have an empty definition of
invalidate_kernel_vmap_range().

> 
>> +static void bio_invalidate_vmalloc_pages(struct bio *bio)
>> +{
>> +	if (bio->bi_private) {
>> +		struct bvec_iter_all iter_all;
>> +		struct bio_vec *bvec;
>> +		unsigned long len = 0;
>> +
>> +		bio_for_each_segment_all(bvec, bio, iter_all)
>> +			len += bvec->bv_len;
>> +             invalidate_kernel_vmap_range(bio->bi_private, len);
> 
> We control the bio here, so we can directly iterate over the
> segments instead of doing the fairly expensive bio_for_each_segment_all
> call that goes to each page and builds a bvec for it.

OK. Got it. Will update it.

> 
>> +	struct page *page;
>>  	int offset, i;
>>  	struct bio *bio;
>>  
>> @@ -1508,6 +1529,12 @@ struct bio *bio_map_kern(struct request_queue *q, void *data, unsigned int len,
>>  	if (!bio)
>>  		return ERR_PTR(-ENOMEM);
>>  
>> +	if (is_vmalloc) {
>> +		flush_kernel_vmap_range(data, len);
>> +		if ((!op_is_write(bio_op(bio))))
>> +			bio->bi_private = data;
>> +	}
> 
> We've just allocate the bio, so bio->bi_opf is not actually set at
> this point unfortunately.

OK. I will move the check to the completion path then.

Thanks !

-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH V4 1/3] block: Allow mapping of vmalloc-ed buffers
  2019-06-27  7:47   ` Ming Lei
@ 2019-06-27  8:17     ` Damien Le Moal
  2019-06-27  8:25       ` Christoph Hellwig
  0 siblings, 1 reply; 12+ messages in thread
From: Damien Le Moal @ 2019-06-27  8:17 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-scsi, Martin K . Petersen, linux-block, Jens Axboe,
	Christoph Hellwig, Bart Van Assche

Hi Ming,

On 2019/06/27 16:47, Ming Lei wrote:
[...]
>>  static void bio_copy_kern_endio_read(struct bio *bio)
>>  {
>> +	unsigned long len = 0;
>>  	char *p = bio->bi_private;
>>  	struct bio_vec *bvec;
>>  	struct bvec_iter_all iter_all;
>> @@ -1550,8 +1583,12 @@ static void bio_copy_kern_endio_read(struct bio *bio)
>>  	bio_for_each_segment_all(bvec, bio, iter_all) {
>>  		memcpy(p, page_address(bvec->bv_page), bvec->bv_len);
>>  		p += bvec->bv_len;
>> +		len += bvec->bv_len;
>>  	}
>>  
>> +	if (is_vmalloc_addr(bio->bi_private))
>> +		invalidate_kernel_vmap_range(bio->bi_private, len);
>> +
>>  	bio_copy_kern_endio(bio);
>>  }
>>  
>> @@ -1572,6 +1609,7 @@ struct bio *bio_copy_kern(struct request_queue *q, void *data, unsigned int len,
>>  	unsigned long kaddr = (unsigned long)data;
>>  	unsigned long end = (kaddr + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
>>  	unsigned long start = kaddr >> PAGE_SHIFT;
>> +	bool is_vmalloc = is_vmalloc_addr(data);
>>  	struct bio *bio;
>>  	void *p = data;
>>  	int nr_pages = 0;
>> @@ -1587,6 +1625,9 @@ struct bio *bio_copy_kern(struct request_queue *q, void *data, unsigned int len,
>>  	if (!bio)
>>  		return ERR_PTR(-ENOMEM);
>>  
>> +	if (is_vmalloc)
>> +		flush_kernel_vmap_range(data, len);
>> +
> 
> Are you sure that invalidate[|flush]_kernel_vmap_range is needed for
> bio_copy_kern? The vmalloc buffer isn't involved in the IO and is only
> accessed by the CPU.

Not sure at all. I am a little out of my league here.
Christoph mentioned that this is necessary.

Christoph, can you confirm ?

Thanks !

-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH V4 1/3] block: Allow mapping of vmalloc-ed buffers
  2019-06-27  8:17     ` Damien Le Moal
@ 2019-06-27  8:25       ` Christoph Hellwig
  2019-06-27  8:36         ` Damien Le Moal
  0 siblings, 1 reply; 12+ messages in thread
From: Christoph Hellwig @ 2019-06-27  8:25 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Ming Lei, linux-scsi, Martin K . Petersen, linux-block,
	Jens Axboe, Christoph Hellwig, Bart Van Assche

On Thu, Jun 27, 2019 at 08:17:08AM +0000, Damien Le Moal wrote:
> > Are you sure that invalidate[|flush]_kernel_vmap_range() is needed for
> > bio_copy_kern()? The vmalloc buffer isn't involved in I/O, and is only
> > accessed by the CPU.
> 
> Not sure at all. I am a little out of my league here.
> Christoph mentioned that this is necessary.
> 
> Christoph, can you confirm?

Ming is right.  Sorry for misleading you.


* Re: [PATCH V4 1/3] block: Allow mapping of vmalloc-ed buffers
  2019-06-27  8:14     ` Damien Le Moal
@ 2019-06-27  8:25       ` Christoph Hellwig
  2019-06-27  8:37         ` Damien Le Moal
  0 siblings, 1 reply; 12+ messages in thread
From: Christoph Hellwig @ 2019-06-27  8:25 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Christoph Hellwig, linux-scsi, Martin K . Petersen, linux-block,
	Jens Axboe, Bart Van Assche

On Thu, Jun 27, 2019 at 08:14:56AM +0000, Damien Le Moal wrote:
> which I guessed is there for the architectures that do not need the
> flush/invalidate vmap functions, so I copied that pattern. Is there a better
> way? The point was to avoid looping over the bvecs to compute the range length
> on architectures that have an empty definition of
> invalidate_kernel_vmap_range().

No, looks like what you did is right.  I blame my lack of attention
on the heat wave here and the resulting lack of sleep...

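The concern Damien raises above — not walking every bvec just to compute a length when `invalidate_kernel_vmap_range()` compiles to nothing — boils down to guarding an O(n) walk behind a cheap predicate. A hedged userspace sketch of that idea (the `HAS_VMAP_CACHE_OPS` macro and both helpers are illustrative stand-ins, not kernel symbols):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Stand-in for architectures where invalidate_kernel_vmap_range() is an
 * empty macro: set this to 0 and the length walk below never runs.
 */
#define HAS_VMAP_CACHE_OPS 1

static int walks;  /* counts how often the O(n) segment walk runs */

/* The expensive part: summing per-segment lengths (the bvec loop). */
static size_t sum_seg_lens(const size_t *seg_lens, size_t n)
{
	size_t len = 0;

	walks++;
	for (size_t i = 0; i < n; i++)
		len += seg_lens[i];
	return len;
}

/*
 * Only compute the range length when the cache op would actually do
 * something and the buffer is vmalloc-backed; otherwise skip the walk.
 */
static size_t maybe_invalidate(bool buf_is_vmalloc,
			       const size_t *seg_lens, size_t n)
{
	if (!HAS_VMAP_CACHE_OPS || !buf_is_vmalloc)
		return 0;
	return sum_seg_lens(seg_lens, n); /* then invalidate this range */
}
```

The cheap `buf_is_vmalloc` test (playing the role of `is_vmalloc_addr()`) and the compile-time constant short-circuit before the walk, so kernels built for architectures with no-op cache maintenance never pay for it — the structure Christoph confirms is right below.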

* Re: [PATCH V4 1/3] block: Allow mapping of vmalloc-ed buffers
  2019-06-27  8:25       ` Christoph Hellwig
@ 2019-06-27  8:36         ` Damien Le Moal
  0 siblings, 0 replies; 12+ messages in thread
From: Damien Le Moal @ 2019-06-27  8:36 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ming Lei, linux-scsi, Martin K . Petersen, linux-block,
	Jens Axboe, Bart Van Assche

On 2019/06/27 17:25, Christoph Hellwig wrote:
> On Thu, Jun 27, 2019 at 08:17:08AM +0000, Damien Le Moal wrote:
>>> Are you sure that invalidate[|flush]_kernel_vmap_range() is needed for
>>> bio_copy_kern()? The vmalloc buffer isn't involved in I/O, and is only
>>> accessed by the CPU.
>>
>> Not sure at all. I am a little out of my league here.
>> Christoph mentioned that this is necessary.
>>
>> Christoph, can you confirm?
> 
> Ming is right.  Sorry for misleading you.

No problem. Sending a V5 with cleanups.

Thanks!


-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH V4 1/3] block: Allow mapping of vmalloc-ed buffers
  2019-06-27  8:25       ` Christoph Hellwig
@ 2019-06-27  8:37         ` Damien Le Moal
  0 siblings, 0 replies; 12+ messages in thread
From: Damien Le Moal @ 2019-06-27  8:37 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-scsi, Martin K . Petersen, linux-block, Jens Axboe,
	Bart Van Assche

On 2019/06/27 17:26, Christoph Hellwig wrote:
> On Thu, Jun 27, 2019 at 08:14:56AM +0000, Damien Le Moal wrote:
>> which I guessed is there for the architectures that do not need the
>> flush/invalidate vmap functions, so I copied that pattern. Is there a better
>> way? The point was to avoid looping over the bvecs to compute the range
>> length on architectures that have an empty definition of
>> invalidate_kernel_vmap_range().
> 
> No, looks like what you did is right.  I blame my lack of attention
> on the heat wave here and the resulting lack of sleep...

No problem! At least we checked twice now!


-- 
Damien Le Moal
Western Digital Research


end of thread, other threads:[~2019-06-27  8:38 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-27  2:49 [PATCH V4 0/3] Fix zone revalidation memory allocation failures Damien Le Moal
2019-06-27  2:49 ` [PATCH V4 1/3] block: Allow mapping of vmalloc-ed buffers Damien Le Moal
2019-06-27  7:28   ` Christoph Hellwig
2019-06-27  8:14     ` Damien Le Moal
2019-06-27  8:25       ` Christoph Hellwig
2019-06-27  8:37         ` Damien Le Moal
2019-06-27  7:47   ` Ming Lei
2019-06-27  8:17     ` Damien Le Moal
2019-06-27  8:25       ` Christoph Hellwig
2019-06-27  8:36         ` Damien Le Moal
2019-06-27  2:49 ` [PATCH V4 2/3] sd_zbc: Fix report zones buffer allocation Damien Le Moal
2019-06-27  2:49 ` [PATCH V4 3/3] block: Limit zone array allocation size Damien Le Moal
