* [PATCH 0/5] block: a virtual block device driver for testing
@ 2017-08-05 15:51 Shaohua Li
  2017-08-05 15:51 ` [PATCH 1/5] testb: add interface Shaohua Li
                   ` (5 more replies)
  0 siblings, 6 replies; 18+ messages in thread
From: Shaohua Li @ 2017-08-05 15:51 UTC (permalink / raw)
  To: linux-block, linux-raid; +Cc: kernel-team, Kyungchan Koh, Shaohua Li

From: Shaohua Li <shli@fb.com>

When testing software RAID, I often found it hard to cover specific cases.
RAID is supposed to keep working even when a disk is in a semi-good state, for
example when some sectors are broken. Since we can't control the behavior of
hardware, it's difficult to create test suites that do destructive tests. But
we can control the behavior of software, so a software-based disk is an obvious
choice for such tests. While we already have several software-based disks for
testing (e.g., null_blk, scsi_debug), none is intended for destructive testing,
which is why we created a new test block device.

Currently the driver can create disks with the following features:
- Bandwidth control. A RAID array consists of several disks, and the disks can
  run at different speeds; for example, one disk is an SSD and another is an
  HDD. raid1 actually has a feature called write-behind for exactly this case.
  To test such raid1 features, we'd like the disks' speed to be controllable.
- Emulated disk cache. Software must flush the disk cache to guarantee data is
  safely stored on media after a power failure. To verify that software handles
  this correctly, we can't simply use a physical disk, because even if the
  software doesn't flush the cache, the hardware probably will. With a software
  implementation of the disk cache, we can fully control how the cache is
  flushed across a power failure.
- Bad blocks. If only part of a disk is broken, software RAID keeps working.
  To test that software RAID handles this well, disks must include some broken
  parts, or bad blocks. Bad blocks are easy to implement in software.

While this is inspired by software RAID testing, the driver is flexible and
easy to extend with new features. The interface is configfs, which can be
driven from a shell script. There is a 'features' attribute exposing all
supported features; by checking it, userspace doesn't need to worry about
compatibility issues. For configuration details, please check the first patch.

This is William's intern project. I made some changes; all errors are mine. You
are more than welcome to test it and add new features!

Thanks,
Shaohua

Kyungchan Koh (4):
  testb: add interface
  testb: implement block device operations
  testb: implement bandwidth control
  testb: emulate disk cache

Shaohua Li (1):
  testb: badblock support

 drivers/block/Kconfig    |    8 +
 drivers/block/Makefile   |    2 +
 drivers/block/test_blk.c | 1294 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 1304 insertions(+)
 create mode 100644 drivers/block/test_blk.c

-- 
2.9.3

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/5] testb: add interface
  2017-08-05 15:51 [PATCH 0/5] block: a virtual block device driver for testing Shaohua Li
@ 2017-08-05 15:51 ` Shaohua Li
  2017-08-05 15:51 ` [PATCH 2/5] testb: implement block device operations Shaohua Li
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Shaohua Li @ 2017-08-05 15:51 UTC (permalink / raw)
  To: linux-block, linux-raid; +Cc: kernel-team, Kyungchan Koh, Kyungchan Koh

From: Kyungchan Koh <kkc6196@fb.com>

The testb block device driver is intended for testing, so configuration
should be easy. We use configfs, which can be driven from a shell
script. Basically, testb is configured as follows:

mount the configfs fs as usual:
mount -t configfs none /mnt

Check which features the driver supports:
cat /mnt/testb/features

The 'features' attribute is for future extension. We will probably add
new features to the driver; userspace can check this attribute to find
the supported ones.

Create a device:
mkdir /mnt/testb/a

Then configure the device by setting attributes under /mnt/testb/a:
size: disk size in bytes
blocksize: sector size; must be a multiple of 512, with a maximum of 4k
discard: whether the disk supports discard
nr_queues: how many queues the disk supports
q_depth: queue depth of the disk

Then power on the device:
echo 1 > /mnt/testb/a/power
This creates a disk, which should appear as /dev/testb_a.
For now, attributes cannot be changed once the device has been powered on.

We can remove the disk by writing 0 to the 'power' attribute. 'rmdir
/mnt/testb/a' deletes the device, and also removes the disk if it hasn't
been removed yet.
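The steps above can be collected into one script (a sketch of the flow
described here; the mount point and the `/dev/testb_a` node name follow the
description above):

```shell
#!/bin/sh
# Sketch of the configfs flow described above; assumes the testb
# module is loaded and /mnt is available as a mount point.
set -e

mount -t configfs none /mnt 2>/dev/null || true  # may already be mounted

cat /mnt/testb/features              # check supported features first

mkdir /mnt/testb/a                   # create a device
echo $((1024 * 1024 * 1024)) > /mnt/testb/a/size   # 1GB disk
echo 4096 > /mnt/testb/a/blocksize
echo 1    > /mnt/testb/a/discard
echo 2    > /mnt/testb/a/nr_queues
echo 64   > /mnt/testb/a/q_depth

echo 1 > /mnt/testb/a/power          # power on; /dev/testb_a appears

# ... run tests against /dev/testb_a here ...

echo 0 > /mnt/testb/a/power          # power off removes the disk
rmdir /mnt/testb/a                   # delete the device
```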

Signed-off-by: Kyungchan Koh <kkc6196@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/block/Kconfig    |   8 ++
 drivers/block/Makefile   |   2 +
 drivers/block/test_blk.c | 261 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 271 insertions(+)
 create mode 100644 drivers/block/test_blk.c

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 8ddc982..2da5d02 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -15,6 +15,14 @@ menuconfig BLK_DEV
 
 if BLK_DEV
 
+config BLK_DEV_TEST_BLK
+	tristate "A test block driver"
+	depends on CONFIGFS_FS
+	help
+	  A memory-based block device driver for testing purposes. Configurable
+	  through configFS. Useful features such as bandwidth throttling,
+	  writeback cache, and power loss emulation.
+
 config BLK_DEV_NULL_BLK
 	tristate "Null test block driver"
 
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index ec8c368..f0f0d21 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -29,6 +29,8 @@ obj-$(CONFIG_VIRTIO_BLK)	+= virtio_blk.o
 
 obj-$(CONFIG_BLK_DEV_SX8)	+= sx8.o
 
+obj-$(CONFIG_BLK_DEV_TEST_BLK)	+= test_blk.o
+
 obj-$(CONFIG_XEN_BLKDEV_FRONTEND)	+= xen-blkfront.o
 obj-$(CONFIG_XEN_BLKDEV_BACKEND)	+= xen-blkback/
 obj-$(CONFIG_BLK_DEV_DRBD)     += drbd/
diff --git a/drivers/block/test_blk.c b/drivers/block/test_blk.c
new file mode 100644
index 0000000..93e8ec2
--- /dev/null
+++ b/drivers/block/test_blk.c
@@ -0,0 +1,261 @@
+/*
+ * test_blk.c - A memory-based test block device driver.
+ *
+ * Copyright (c) 2017 Facebook, Inc.
+ *
+ * Parts derived from drivers/block/null_blk.c and drivers/block/brd.c,
+ * copyright to respective owners.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/blkdev.h>
+#include <linux/configfs.h>
+#include <linux/radix-tree.h>
+
+/*
+ * Status flags for testb_device.
+ *
+ * CONFIGURED:	Device has been configured and turned on. Cannot reconfigure.
+ * UP:		Device is currently on and visible in userspace.
+ */
+enum testb_device_flags {
+	TESTB_DEV_FL_CONFIGURED	= 0,
+	TESTB_DEV_FL_UP		= 1,
+};
+
+/*
+ * testb_device represents the characteristics of a virtual device.
+ *
+ * @item:	The struct used by configfs to represent items in fs.
+ * @lock:	Protect data of the device
+ * @pages:	The storage of the device.
+ * @flags:	TEST_DEV_FL_ flags to indicate various status.
+ *
+ * @power:	1 means on; 0 means off.
+ * @size:	The size of the disk (in bytes).
+ * @blocksize:	The block size for the request queue.
+ * @nr_queues:	The number of queues.
+ * @q_depth:	The depth of each queue.
+ * @discard:	If enable discard
+ */
+struct testb_device {
+	struct config_item item;
+	spinlock_t lock;
+	struct radix_tree_root pages;
+	unsigned long flags;
+
+	uint power;
+	u64 size;
+	uint blocksize;
+	uint nr_queues;
+	uint q_depth;
+	uint discard;
+};
+
+static inline struct testb_device *to_testb_device(struct config_item *item)
+{
+	return item ? container_of(item, struct testb_device, item) : NULL;
+}
+
+static inline ssize_t testb_device_uint_attr_show(uint val, char *page)
+{
+	return snprintf(page, PAGE_SIZE, "%u\n", val);
+}
+
+static ssize_t
+testb_device_uint_attr_store(uint *val, const char *page, size_t count)
+{
+	uint tmp;
+	int result;
+
+	result = kstrtouint(page, 0, &tmp);
+	if (result)
+		return result;
+
+	*val = tmp;
+	return count;
+}
+
+static inline ssize_t testb_device_u64_attr_show(u64 val, char *page)
+{
+	return snprintf(page, PAGE_SIZE, "%llu\n", val);
+}
+
+static ssize_t
+testb_device_u64_attr_store(u64 *val, const char *page, size_t count)
+{
+	int result;
+	u64 tmp;
+
+	result = kstrtoull(page, 0, &tmp);
+	if (result)
+		return result;
+
+	*val = tmp;
+	return count;
+}
+
+/* The following macro should only be used with TYPE = {uint, u64}. */
+#define TESTB_DEVICE_ATTR(NAME, TYPE)						\
+static ssize_t									\
+testb_device_##NAME##_show(struct config_item *item, char *page)		\
+{										\
+	return testb_device_##TYPE##_attr_show(					\
+				to_testb_device(item)->NAME, page);		\
+}										\
+static ssize_t									\
+testb_device_##NAME##_store(struct config_item *item, const char *page,		\
+			    size_t count)					\
+{										\
+	if (test_bit(TESTB_DEV_FL_CONFIGURED, &to_testb_device(item)->flags))	\
+		return -EBUSY;							\
+	return testb_device_##TYPE##_attr_store(				\
+			&to_testb_device(item)->NAME, page, count);		\
+}										\
+CONFIGFS_ATTR(testb_device_, NAME);
+
+TESTB_DEVICE_ATTR(size, u64);
+TESTB_DEVICE_ATTR(blocksize, uint);
+TESTB_DEVICE_ATTR(nr_queues, uint);
+TESTB_DEVICE_ATTR(q_depth, uint);
+TESTB_DEVICE_ATTR(discard, uint);
+
+static ssize_t testb_device_power_show(struct config_item *item, char *page)
+{
+	return testb_device_uint_attr_show(to_testb_device(item)->power, page);
+}
+
+static ssize_t testb_device_power_store(struct config_item *item,
+				     const char *page, size_t count)
+{
+	struct testb_device *t_dev = to_testb_device(item);
+	uint newp = 0;
+	ssize_t ret;
+
+	ret = testb_device_uint_attr_store(&newp, page, count);
+	if (ret < 0)
+		return ret;
+
+	if (!t_dev->power && newp) {
+		if (test_and_set_bit(TESTB_DEV_FL_UP, &t_dev->flags))
+			return count;
+
+		set_bit(TESTB_DEV_FL_CONFIGURED, &t_dev->flags);
+		t_dev->power = newp;
+	} else if (to_testb_device(item)->power && !newp) {
+		t_dev->power = newp;
+		clear_bit(TESTB_DEV_FL_UP, &t_dev->flags);
+	}
+
+	return count;
+}
+
+CONFIGFS_ATTR(testb_device_, power);
+
+static struct configfs_attribute *testb_device_attrs[] = {
+	&testb_device_attr_power,
+	&testb_device_attr_size,
+	&testb_device_attr_blocksize,
+	&testb_device_attr_nr_queues,
+	&testb_device_attr_q_depth,
+	&testb_device_attr_discard,
+	NULL,
+};
+
+static void testb_device_release(struct config_item *item)
+{
+	kfree(to_testb_device(item));
+}
+
+static struct configfs_item_operations testb_device_ops = {
+	.release	= testb_device_release,
+};
+
+static struct config_item_type testb_device_type = {
+	.ct_item_ops	= &testb_device_ops,
+	.ct_attrs	= testb_device_attrs,
+	.ct_owner	= THIS_MODULE,
+};
+
+static struct
+config_item *testb_group_make_item(struct config_group *group, const char *name)
+{
+	struct testb_device *t_dev;
+
+	t_dev = kzalloc(sizeof(struct testb_device), GFP_KERNEL);
+	if (!t_dev)
+		return ERR_PTR(-ENOMEM);
+
+	config_item_init_type_name(&t_dev->item, name, &testb_device_type);
+
+	/* Initialize attributes with default values. */
+	t_dev->size = 1024 * 1024 * 1024ULL;
+	t_dev->blocksize = 512;
+	t_dev->nr_queues = 2;
+	t_dev->q_depth = 64;
+	t_dev->discard = 1;
+
+	return &t_dev->item;
+}
+
+static void
+testb_group_drop_item(struct config_group *group, struct config_item *item)
+{
+	config_item_put(item);
+}
+
+static ssize_t memb_group_features_show(struct config_item *item, char *page)
+{
+	return snprintf(page, PAGE_SIZE, "\n");
+}
+
+CONFIGFS_ATTR_RO(memb_group_, features);
+
+static struct configfs_attribute *testb_group_attrs[] = {
+	&memb_group_attr_features,
+	NULL,
+};
+
+static struct configfs_group_operations testb_group_ops = {
+	.make_item	= testb_group_make_item,
+	.drop_item	= testb_group_drop_item,
+};
+
+static struct config_item_type testb_group_type = {
+	.ct_group_ops	= &testb_group_ops,
+	.ct_attrs	= testb_group_attrs,
+	.ct_owner	= THIS_MODULE,
+};
+
+static struct configfs_subsystem testb_subsys = {
+	.su_group = {
+		.cg_item = {
+			.ci_namebuf = "testb",
+			.ci_type = &testb_group_type,
+		},
+	},
+};
+
+static int __init testb_init(void)
+{
+	int ret = 0;
+	struct configfs_subsystem *subsys = &testb_subsys;
+
+	config_group_init(&subsys->su_group);
+	mutex_init(&subsys->su_mutex);
+
+	ret = configfs_register_subsystem(subsys);
+	return ret;
+}
+
+static void __exit testb_exit(void)
+{
+	configfs_unregister_subsystem(&testb_subsys);
+}
+
+module_init(testb_init);
+module_exit(testb_exit);
+
+MODULE_AUTHOR("Will Koh <kkc6196@fb.com>, Shaohua Li <shli@fb.com>");
+MODULE_LICENSE("GPL");
-- 
2.9.3


* [PATCH 2/5] testb: implement block device operations
  2017-08-05 15:51 [PATCH 0/5] block: a virtual block device driver for testing Shaohua Li
  2017-08-05 15:51 ` [PATCH 1/5] testb: add interface Shaohua Li
@ 2017-08-05 15:51 ` Shaohua Li
  2017-08-08 20:34   ` Jens Axboe
  2017-08-05 15:51 ` [PATCH 3/5] testb: implement bandwidth control Shaohua Li
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 18+ messages in thread
From: Shaohua Li @ 2017-08-05 15:51 UTC (permalink / raw)
  To: linux-block, linux-raid; +Cc: kernel-team, Kyungchan Koh, Kyungchan Koh

From: Kyungchan Koh <kkc6196@fb.com>

This creates/removes the disk when the user writes 1/0 to the 'power'
attribute.
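The storage behind the disk is sparse: pages are allocated on first write and
tracked in a radix tree, with a per-page bitmap recording which sectors hold
data, and reads of unwritten sectors return zeroes. A rough userspace model of
that idea (an illustration only, not the driver code; a plain dict stands in
for the radix tree):

```python
PAGE_SECTORS = 8    # a 4096-byte page holds 8 512-byte sectors
SECTOR_SIZE = 512

class SparseDisk:
    """Userspace model of the driver's page store + sector bitmap."""

    def __init__(self):
        self.pages = {}  # page index -> [bytearray, sector bitmap]

    def write_sector(self, sector, data):
        assert len(data) == SECTOR_SIZE
        idx, bit = divmod(sector, PAGE_SECTORS)
        # allocate the backing page on first write to any of its sectors
        entry = self.pages.setdefault(
            idx, [bytearray(PAGE_SECTORS * SECTOR_SIZE), 0])
        off = bit * SECTOR_SIZE
        entry[0][off:off + SECTOR_SIZE] = data
        entry[1] |= 1 << bit            # mark the sector as holding data

    def read_sector(self, sector):
        idx, bit = divmod(sector, PAGE_SECTORS)
        entry = self.pages.get(idx)
        if entry is None or not (entry[1] & (1 << bit)):
            return bytes(SECTOR_SIZE)   # unwritten sectors read as zeroes
        off = bit * SECTOR_SIZE
        return bytes(entry[0][off:off + SECTOR_SIZE])

    def discard_sector(self, sector):
        idx, bit = divmod(sector, PAGE_SECTORS)
        entry = self.pages.get(idx)
        if entry is None:
            return
        entry[1] &= ~(1 << bit)
        if entry[1] == 0:               # free the page once empty
            del self.pages[idx]
```

This mirrors how discard frees whole pages once their bitmap goes to zero,
which is what keeps the in-memory footprint proportional to written data.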

Signed-off-by: Kyungchan Koh <kkc6196@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/block/test_blk.c | 539 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 538 insertions(+), 1 deletion(-)

diff --git a/drivers/block/test_blk.c b/drivers/block/test_blk.c
index 93e8ec2..1b06ce7 100644
--- a/drivers/block/test_blk.c
+++ b/drivers/block/test_blk.c
@@ -9,9 +9,44 @@
 
 #include <linux/init.h>
 #include <linux/module.h>
+#include <linux/blk-mq.h>
 #include <linux/blkdev.h>
 #include <linux/configfs.h>
 #include <linux/radix-tree.h>
+#include <linux/idr.h>
+
+#define SECTOR_SHIFT		9
+#define PAGE_SECTORS_SHIFT	(PAGE_SHIFT - SECTOR_SHIFT)
+#define PAGE_SECTORS		(1 << PAGE_SECTORS_SHIFT)
+#define SECTOR_SIZE		(1 << SECTOR_SHIFT)
+#define SECTOR_MASK		(PAGE_SECTORS - 1)
+
+#define FREE_BATCH		16
+
+struct testb {
+	unsigned int index;
+	struct request_queue *q;
+	struct gendisk *disk;
+
+	struct testb_device *t_dev;
+
+	struct blk_mq_tag_set tag_set;
+
+	char disk_name[DISK_NAME_LEN];
+};
+
+/*
+ * testb_page is a page in memory for testb devices.
+ *
+ * @page:	The page holding the data.
+ * @bitmap:	The bitmap represents which sector in the page has data.
+ *		Each bit represents one block size. For example, sector 8
+ *		will use the 7th bit
+ */
+struct testb_page {
+	struct page *page;
+	unsigned long bitmap;
+};
 
 /*
  * Status flags for testb_device.
@@ -29,6 +64,7 @@ enum testb_device_flags {
  *
  * @item:	The struct used by configfs to represent items in fs.
  * @lock:	Protect data of the device
+ * @testb:	The device that these attributes belong to.
  * @pages:	The storage of the device.
  * @flags:	TEST_DEV_FL_ flags to indicate various status.
  *
@@ -42,6 +78,7 @@ enum testb_device_flags {
 struct testb_device {
 	struct config_item item;
 	spinlock_t lock;
+	struct testb *testb;
 	struct radix_tree_root pages;
 	unsigned long flags;
 
@@ -53,6 +90,10 @@ struct testb_device {
 	uint discard;
 };
 
+static int testb_poweron_device(struct testb_device *dev);
+static void testb_poweroff_device(struct testb_device *dev);
+static void testb_free_device_storage(struct testb_device *t_dev);
+
 static inline struct testb_device *to_testb_device(struct config_item *item)
 {
 	return item ? container_of(item, struct testb_device, item) : NULL;
@@ -140,11 +181,17 @@ static ssize_t testb_device_power_store(struct config_item *item,
 	if (!t_dev->power && newp) {
 		if (test_and_set_bit(TESTB_DEV_FL_UP, &t_dev->flags))
 			return count;
+		ret = testb_poweron_device(t_dev);
+		if (ret) {
+			clear_bit(TESTB_DEV_FL_UP, &t_dev->flags);
+			return -ENOMEM;
+		}
 
 		set_bit(TESTB_DEV_FL_CONFIGURED, &t_dev->flags);
 		t_dev->power = newp;
 	} else if (to_testb_device(item)->power && !newp) {
 		t_dev->power = newp;
+		testb_poweroff_device(t_dev);
 		clear_bit(TESTB_DEV_FL_UP, &t_dev->flags);
 	}
 
@@ -165,7 +212,10 @@ static struct configfs_attribute *testb_device_attrs[] = {
 
 static void testb_device_release(struct config_item *item)
 {
-	kfree(to_testb_device(item));
+	struct testb_device *t_dev = to_testb_device(item);
+
+	testb_free_device_storage(t_dev);
+	kfree(t_dev);
 }
 
 static struct configfs_item_operations testb_device_ops = {
@@ -186,6 +236,8 @@ config_item *testb_group_make_item(struct config_group *group, const char *name)
 	t_dev = kzalloc(sizeof(struct testb_device), GFP_KERNEL);
 	if (!t_dev)
 		return ERR_PTR(-ENOMEM);
+	spin_lock_init(&t_dev->lock);
+	INIT_RADIX_TREE(&t_dev->pages, GFP_ATOMIC);
 
 	config_item_init_type_name(&t_dev->item, name, &testb_device_type);
 
@@ -202,6 +254,12 @@ config_item *testb_group_make_item(struct config_group *group, const char *name)
 static void
 testb_group_drop_item(struct config_group *group, struct config_item *item)
 {
+	struct testb_device *t_dev = to_testb_device(item);
+
+	if (test_and_clear_bit(TESTB_DEV_FL_UP, &t_dev->flags)) {
+		testb_poweroff_device(t_dev);
+		t_dev->power = 0;
+	}
 	config_item_put(item);
 }
 
@@ -237,6 +295,473 @@ static struct configfs_subsystem testb_subsys = {
 	},
 };
 
+static DEFINE_IDA(testb_indices);
+static DEFINE_MUTEX(testb_lock);
+static int testb_major;
+
+static struct testb_page *testb_alloc_page(gfp_t gfp_flags)
+{
+	struct testb_page *t_page;
+
+	t_page = kmalloc(sizeof(struct testb_page), gfp_flags);
+	if (!t_page)
+		goto out;
+
+	t_page->page = alloc_pages(gfp_flags, 0);
+	if (!t_page->page)
+		goto out_freepage;
+
+	t_page->bitmap = 0;
+	return t_page;
+out_freepage:
+	kfree(t_page);
+out:
+	return NULL;
+}
+
+static void testb_free_page(struct testb_page *t_page)
+{
+	WARN_ON(!t_page);
+
+	__free_page(t_page->page);
+	kfree(t_page);
+}
+
+static void testb_free_sector(struct testb *testb, sector_t sector)
+{
+	unsigned int sector_bit;
+	u64 idx;
+	struct testb_page *t_page, *ret;
+	struct radix_tree_root *root;
+
+	assert_spin_locked(&testb->t_dev->lock);
+
+	root = &testb->t_dev->pages;
+	idx = sector >> PAGE_SECTORS_SHIFT;
+	sector_bit = (sector & SECTOR_MASK);
+
+	t_page = radix_tree_lookup(root, idx);
+	if (t_page) {
+		__clear_bit(sector_bit, &t_page->bitmap);
+
+		if (!t_page->bitmap) {
+			ret = radix_tree_delete_item(root, idx, t_page);
+			WARN_ON(ret != t_page);
+			testb_free_page(ret);
+		}
+	}
+}
+
+static struct testb_page *testb_radix_tree_insert(struct testb *testb, u64 idx,
+	struct testb_page *t_page)
+{
+	struct radix_tree_root *root;
+
+	assert_spin_locked(&testb->t_dev->lock);
+
+	root = &testb->t_dev->pages;
+
+	if (radix_tree_insert(root, idx, t_page)) {
+		testb_free_page(t_page);
+		t_page = radix_tree_lookup(root, idx);
+		WARN_ON(!t_page || t_page->page->index != idx);
+	}
+
+	return t_page;
+}
+
+static void testb_free_device_storage(struct testb_device *t_dev)
+{
+	unsigned long pos = 0;
+	int nr_pages;
+	struct testb_page *ret, *t_pages[FREE_BATCH];
+	struct radix_tree_root *root;
+
+	root = &t_dev->pages;
+
+	do {
+		int i;
+
+		nr_pages = radix_tree_gang_lookup(root,
+				(void **)t_pages, pos, FREE_BATCH);
+
+		for (i = 0; i < nr_pages; i++) {
+			pos = t_pages[i]->page->index;
+			ret = radix_tree_delete_item(root, pos, t_pages[i]);
+			WARN_ON(ret != t_pages[i]);
+			testb_free_page(ret);
+		}
+
+		pos++;
+	} while (nr_pages == FREE_BATCH);
+}
+
+static struct testb_page *testb_lookup_page(struct testb *testb,
+	sector_t sector, bool for_write)
+{
+	unsigned int sector_bit;
+	u64 idx;
+	struct testb_page *t_page;
+
+	assert_spin_locked(&testb->t_dev->lock);
+
+	idx = sector >> PAGE_SECTORS_SHIFT;
+	sector_bit = (sector & SECTOR_MASK);
+
+	t_page = radix_tree_lookup(&testb->t_dev->pages, idx);
+	WARN_ON(t_page && t_page->page->index != idx);
+
+	if (t_page && (for_write || test_bit(sector_bit, &t_page->bitmap)))
+		return t_page;
+
+	return NULL;
+}
+
+static struct testb_page *testb_insert_page(struct testb *testb,
+	sector_t sector, unsigned long *lock_flag)
+{
+	u64 idx;
+	struct testb_page *t_page;
+
+	assert_spin_locked(&testb->t_dev->lock);
+
+	t_page = testb_lookup_page(testb, sector, true);
+	if (t_page)
+		return t_page;
+
+	spin_unlock_irqrestore(&testb->t_dev->lock, *lock_flag);
+
+	t_page = testb_alloc_page(GFP_NOIO);
+	if (!t_page)
+		goto out_lock;
+
+	if (radix_tree_preload(GFP_NOIO))
+		goto out_freepage;
+
+	spin_lock_irqsave(&testb->t_dev->lock, *lock_flag);
+	idx = sector >> PAGE_SECTORS_SHIFT;
+	t_page->page->index = idx;
+	t_page = testb_radix_tree_insert(testb, idx, t_page);
+	radix_tree_preload_end();
+
+	return t_page;
+out_freepage:
+	testb_free_page(t_page);
+out_lock:
+	spin_lock_irqsave(&testb->t_dev->lock, *lock_flag);
+	return testb_lookup_page(testb, sector, true);
+}
+
+static int copy_to_testb(struct testb *testb, struct page *source,
+	unsigned int off, sector_t sector, size_t n, unsigned long *lock_flag)
+{
+	size_t temp, count = 0;
+	unsigned int offset;
+	struct testb_page *t_page;
+	void *dst, *src;
+
+	while (count < n) {
+		temp = min_t(size_t, testb->t_dev->blocksize, n - count);
+
+		offset = (sector & SECTOR_MASK) << SECTOR_SHIFT;
+		t_page = testb_insert_page(testb, sector, lock_flag);
+		if (!t_page)
+			return -ENOSPC;
+
+		src = kmap_atomic(source);
+		dst = kmap_atomic(t_page->page);
+		memcpy(dst + offset, src + off + count, temp);
+		kunmap_atomic(dst);
+		kunmap_atomic(src);
+
+		__set_bit(sector & SECTOR_MASK, &t_page->bitmap);
+
+		count += temp;
+		sector += temp >> SECTOR_SHIFT;
+	}
+	return 0;
+}
+
+static int copy_from_testb(struct testb *testb, struct page *dest,
+	unsigned int off, sector_t sector, size_t n, unsigned long *lock_flag)
+{
+	size_t temp, count = 0;
+	unsigned int offset;
+	struct testb_page *t_page;
+	void *dst, *src;
+
+	while (count < n) {
+		temp = min_t(size_t, testb->t_dev->blocksize, n - count);
+
+		offset = (sector & SECTOR_MASK) << SECTOR_SHIFT;
+		t_page = testb_lookup_page(testb, sector, false);
+
+		dst = kmap_atomic(dest);
+		if (!t_page) {
+			memset(dst + off + count, 0, temp);
+			goto next;
+		}
+		src = kmap_atomic(t_page->page);
+		memcpy(dst + off + count, src + offset, temp);
+		kunmap_atomic(src);
+next:
+		kunmap_atomic(dst);
+
+		count += temp;
+		sector += temp >> SECTOR_SHIFT;
+	}
+	return 0;
+}
+
+static void testb_handle_discard(struct testb *testb, sector_t sector, size_t n)
+{
+	size_t temp;
+	unsigned long lock_flag;
+
+	spin_lock_irqsave(&testb->t_dev->lock, lock_flag);
+	while (n > 0) {
+		temp = min_t(size_t, n, testb->t_dev->blocksize);
+		testb_free_sector(testb, sector);
+		sector += temp >> SECTOR_SHIFT;
+		n -= temp;
+	}
+	spin_unlock_irqrestore(&testb->t_dev->lock, lock_flag);
+}
+
+static int testb_handle_flush(struct testb *testb)
+{
+	return 0;
+}
+
+static int testb_transfer(struct testb *testb, struct page *page,
+	unsigned int len, unsigned int off, bool is_write, sector_t sector,
+	unsigned long *lock_flags)
+{
+	int err = 0;
+
+	if (!is_write) {
+		err = copy_from_testb(testb, page, off, sector, len,
+						lock_flags);
+		flush_dcache_page(page);
+	} else {
+		flush_dcache_page(page);
+		err = copy_to_testb(testb, page, off, sector, len,
+						lock_flags);
+	}
+
+	return err;
+}
+
+static int testb_handle_rq(struct request *rq)
+{
+	struct testb *testb = rq->q->queuedata;
+	int err;
+	unsigned int len;
+	sector_t sector;
+	struct req_iterator iter;
+	struct bio_vec bvec;
+	unsigned long lock_flag;
+
+	sector = blk_rq_pos(rq);
+
+	if (req_op(rq) == REQ_OP_DISCARD) {
+		testb_handle_discard(testb, sector, blk_rq_bytes(rq));
+		return 0;
+	} else if (req_op(rq) == REQ_OP_FLUSH)
+		return testb_handle_flush(testb);
+
+	spin_lock_irqsave(&testb->t_dev->lock, lock_flag);
+	rq_for_each_segment(bvec, rq, iter) {
+		len = bvec.bv_len;
+		err = testb_transfer(testb, bvec.bv_page, len, bvec.bv_offset,
+				     op_is_write(req_op(rq)), sector,
+				     &lock_flag);
+		if (err) {
+			spin_unlock_irqrestore(&testb->t_dev->lock, lock_flag);
+			return err;
+		}
+		sector += len >> SECTOR_SHIFT;
+	}
+	spin_unlock_irqrestore(&testb->t_dev->lock, lock_flag);
+
+	return 0;
+}
+
+static blk_status_t
+testb_queue_rq(struct blk_mq_hw_ctx *hctx, const struct blk_mq_queue_data *bd)
+{
+	int err;
+
+	blk_mq_start_request(bd->rq);
+
+	err = testb_handle_rq(bd->rq);
+	if (err)
+		return errno_to_blk_status(err);
+
+	blk_mq_complete_request(bd->rq);
+	return BLK_STS_OK;
+}
+
+static void testb_softirq_done_fn(struct request *rq)
+{
+	blk_mq_end_request(rq, BLK_STS_OK);
+}
+
+static const struct blk_mq_ops testb_mq_ops = {
+	.queue_rq	= testb_queue_rq,
+	.complete	= testb_softirq_done_fn,
+};
+
+static const struct block_device_operations testb_fops = {
+	.owner		= THIS_MODULE,
+};
+
+static void testb_free_bdev(struct testb *testb)
+{
+	mutex_lock(&testb_lock);
+	ida_simple_remove(&testb_indices, testb->index);
+	mutex_unlock(&testb_lock);
+
+	blk_cleanup_queue(testb->q);
+	blk_mq_free_tag_set(&testb->tag_set);
+
+	kfree(testb);
+}
+
+static void testb_gendisk_unregister(struct testb *testb)
+{
+	del_gendisk(testb->disk);
+
+	put_disk(testb->disk);
+}
+
+static void testb_poweroff_device(struct testb_device *dev)
+{
+	testb_gendisk_unregister(dev->testb);
+	testb_free_bdev(dev->testb);
+}
+
+static void testb_config_discard(struct testb *testb)
+{
+	if (testb->t_dev->discard == 0)
+		return;
+	testb->q->limits.discard_granularity = testb->t_dev->blocksize;
+	testb->q->limits.discard_alignment = testb->t_dev->blocksize;
+	blk_queue_max_discard_sectors(testb->q, UINT_MAX >> 9);
+	queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, testb->q);
+}
+
+static void testb_config_flush(struct testb *testb)
+{
+	blk_queue_write_cache(testb->q, true, true);
+	blk_queue_flush_queueable(testb->q, true);
+}
+
+static int testb_gendisk_register(struct testb *testb)
+{
+	sector_t size;
+	struct gendisk *disk;
+
+	disk = testb->disk = alloc_disk(DISK_MAX_PARTS);
+	if (!disk)
+		return -ENOMEM;
+
+	size = testb->t_dev->size;
+	set_capacity(disk, size >> 9);
+
+	disk->flags		= GENHD_FL_EXT_DEVT;
+	disk->major		= testb_major;
+	disk->first_minor	= testb->index * DISK_MAX_PARTS;
+	disk->fops		= &testb_fops;
+	disk->private_data	= testb;
+	disk->queue		= testb->q;
+	snprintf(disk->disk_name, DISK_NAME_LEN, "%s", testb->disk_name);
+
+	add_disk(testb->disk);
+	return 0;
+}
+
+static int testb_alloc_bdev(struct testb_device *t_dev)
+{
+	int ret;
+	struct testb *testb;
+
+	testb = kzalloc(sizeof(struct testb), GFP_KERNEL);
+	if (!testb) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	t_dev->blocksize = (t_dev->blocksize >> SECTOR_SHIFT) << SECTOR_SHIFT;
+	t_dev->blocksize = clamp_t(uint, t_dev->blocksize, 512, 4096);
+
+	if (t_dev->nr_queues > nr_cpu_ids)
+		t_dev->nr_queues = nr_cpu_ids;
+	else if (!t_dev->nr_queues)
+		t_dev->nr_queues = 1;
+
+	testb->t_dev = t_dev;
+	t_dev->testb = testb;
+
+	testb->tag_set.ops = &testb_mq_ops;
+	testb->tag_set.nr_hw_queues = t_dev->nr_queues;
+	testb->tag_set.queue_depth = t_dev->q_depth;
+	testb->tag_set.numa_node = NUMA_NO_NODE;
+	testb->tag_set.cmd_size = 0;
+	testb->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE |
+			      BLK_MQ_F_BLOCKING;
+	testb->tag_set.driver_data = testb;
+
+	ret = blk_mq_alloc_tag_set(&testb->tag_set);
+	if (ret)
+		goto out_cleanup_queues;
+
+	testb->q = blk_mq_init_queue(&testb->tag_set);
+	if (IS_ERR(testb->q)) {
+		ret = -ENOMEM;
+		goto out_cleanup_tags;
+	}
+
+	testb->q->queuedata = testb;
+	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, testb->q);
+	queue_flag_clear_unlocked(QUEUE_FLAG_ADD_RANDOM, testb->q);
+
+	testb_config_discard(testb);
+	testb_config_flush(testb);
+
+	blk_queue_logical_block_size(testb->q, t_dev->blocksize);
+	blk_queue_physical_block_size(testb->q, t_dev->blocksize);
+
+	snprintf(testb->disk_name, CONFIGFS_ITEM_NAME_LEN, "testb_%s",
+		 t_dev->item.ci_name);
+
+	mutex_lock(&testb_lock);
+	testb->index = ida_simple_get(&testb_indices, 0, 0, GFP_KERNEL);
+	mutex_unlock(&testb_lock);
+
+	return 0;
+out_cleanup_tags:
+	blk_mq_free_tag_set(&testb->tag_set);
+out_cleanup_queues:
+	kfree(testb);
+out:
+	return ret;
+}
+
+static int testb_poweron_device(struct testb_device *dev)
+{
+	int ret;
+
+	ret = testb_alloc_bdev(dev);
+	if (ret)
+		return ret;
+	if (testb_gendisk_register(dev->testb)) {
+		testb_free_bdev(dev->testb);
+		return -EINVAL;
+	}
+	return 0;
+}
+
 static int __init testb_init(void)
 {
 	int ret = 0;
@@ -245,12 +770,24 @@ static int __init testb_init(void)
 	config_group_init(&subsys->su_group);
 	mutex_init(&subsys->su_mutex);
 
+	testb_major = register_blkdev(0, "testb");
+	if (testb_major < 0)
+		return testb_major;
+
 	ret = configfs_register_subsystem(subsys);
+	if (ret)
+		goto out_unregister;
+
+	return 0;
+out_unregister:
+	unregister_blkdev(testb_major, "testb");
 	return ret;
 }
 
 static void __exit testb_exit(void)
 {
+	unregister_blkdev(testb_major, "testb");
+
 	configfs_unregister_subsystem(&testb_subsys);
 }
 
-- 
2.9.3



* [PATCH 3/5] testb: implement bandwidth control
  2017-08-05 15:51 [PATCH 0/5] block: a virtual block device driver for testing Shaohua Li
  2017-08-05 15:51 ` [PATCH 1/5] testb: add interface Shaohua Li
  2017-08-05 15:51 ` [PATCH 2/5] testb: implement block device operations Shaohua Li
@ 2017-08-05 15:51 ` Shaohua Li
  2017-08-05 15:51 ` [PATCH 4/5] testb: emulate disk cache Shaohua Li
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Shaohua Li @ 2017-08-05 15:51 UTC (permalink / raw)
  To: linux-block, linux-raid; +Cc: kernel-team, Kyungchan Koh, Kyungchan Koh

From: Kyungchan Koh <kkc6196@fb.com>

In tests, we usually want controllable disk speed. For example, in a
RAID array, we'd like some disks to be fast and some slow. MD RAID
actually has a feature for this. To test it, we'd like to make a
disk run at a specific speed.

Block throttling could probably be used for this purpose, but it requires
cgroup setup. Here we just implement a simple throttling mechanism in
the driver. There is slight fluctuation in the mechanism, but it's good
enough for testing.
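The mechanism is a tick-based byte budget: a timer refills the budget every
tick (50 ticks per second in the patch), and I/O that exhausts the remaining
budget waits for the next tick, which is where the slight fluctuation comes
from. A userspace sketch of the budget arithmetic (a model of the idea, not
the driver code):

```python
TICKS_PER_SEC = 50  # matches the patch; one tick every 20ms

def bytes_per_tick(mbps):
    # mirrors mb_per_tick() in the patch: the per-tick refill is the
    # per-second byte rate divided across the ticks
    return (1 << 20) // TICKS_PER_SEC * mbps

def ticks_needed(total_bytes, mbps):
    # how many timer ticks a transfer of total_bytes consumes; any
    # remainder forces a wait for one more refill (ceiling division)
    budget = bytes_per_tick(mbps)
    return -(-total_bytes // budget)
```

For example, at a 10 MB/s cap the per-tick budget is 1 MB/50, so a transfer
slightly larger than 50 budgets spills into a 51st tick.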

Signed-off-by: Kyungchan Koh <kkc6196@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/block/test_blk.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 103 insertions(+), 1 deletion(-)

diff --git a/drivers/block/test_blk.c b/drivers/block/test_blk.c
index 1b06ce7..c67c73d 100644
--- a/drivers/block/test_blk.c
+++ b/drivers/block/test_blk.c
@@ -9,9 +9,11 @@
 
 #include <linux/init.h>
 #include <linux/module.h>
+#include <linux/atomic.h>
 #include <linux/blk-mq.h>
 #include <linux/blkdev.h>
 #include <linux/configfs.h>
+#include <linux/hrtimer.h>
 #include <linux/radix-tree.h>
 #include <linux/idr.h>
 
@@ -23,6 +25,14 @@
 
 #define FREE_BATCH		16
 
+#define TICKS_PER_SEC		50ULL
+#define TIMER_INTERVAL		(NSEC_PER_SEC / TICKS_PER_SEC)
+
+static inline u64 mb_per_tick(int mbps)
+{
+	return (1 << 20) / TICKS_PER_SEC * ((u64) mbps);
+}
+
 struct testb {
 	unsigned int index;
 	struct request_queue *q;
@@ -33,6 +43,9 @@ struct testb {
 	struct blk_mq_tag_set tag_set;
 
 	char disk_name[DISK_NAME_LEN];
+
+	atomic_long_t cur_bytes;
+	struct hrtimer timer;
 };
 
 /*
@@ -53,10 +66,12 @@ struct testb_page {
  *
  * CONFIGURED:	Device has been configured and turned on. Cannot reconfigure.
  * UP:		Device is currently on and visible in userspace.
+ * THROTTLED:	Device is being throttled.
  */
 enum testb_device_flags {
 	TESTB_DEV_FL_CONFIGURED	= 0,
 	TESTB_DEV_FL_UP		= 1,
+	TESTB_DEV_FL_THROTTLED	= 2,
 };
 
 /*
@@ -74,6 +89,7 @@ enum testb_device_flags {
  * @nr_queues:	The number of queues.
  * @q_depth:	The depth of each queue.
  * @discard:	If enable discard
+ * @mbps:	Bandwidth throttle cap (in mb/s).
  */
 struct testb_device {
 	struct config_item item;
@@ -88,6 +104,7 @@ struct testb_device {
 	uint nr_queues;
 	uint q_depth;
 	uint discard;
+	uint mbps;
 };
 
 static int testb_poweron_device(struct testb_device *dev);
@@ -161,6 +178,7 @@ TESTB_DEVICE_ATTR(blocksize, uint);
 TESTB_DEVICE_ATTR(nr_queues, uint);
 TESTB_DEVICE_ATTR(q_depth, uint);
 TESTB_DEVICE_ATTR(discard, uint);
+TESTB_DEVICE_ATTR(mbps, uint);
 
 static ssize_t testb_device_power_show(struct config_item *item, char *page)
 {
@@ -207,6 +225,7 @@ static struct configfs_attribute *testb_device_attrs[] = {
 	&testb_device_attr_nr_queues,
 	&testb_device_attr_q_depth,
 	&testb_device_attr_discard,
+	&testb_device_attr_mbps,
 	NULL,
 };
 
@@ -247,6 +266,7 @@ config_item *testb_group_make_item(struct config_group *group, const char *name)
 	t_dev->nr_queues = 2;
 	t_dev->q_depth = 64;
 	t_dev->discard = 1;
+	t_dev->mbps = -1;
 
 	return &t_dev->item;
 }
@@ -265,7 +285,7 @@ testb_group_drop_item(struct config_group *group, struct config_item *item)
 
 static ssize_t memb_group_features_show(struct config_item *item, char *page)
 {
-	return snprintf(page, PAGE_SIZE, "\n");
+	return snprintf(page, PAGE_SIZE, "bandwidth\n");
 }
 
 CONFIGFS_ATTR_RO(memb_group_, features);
@@ -299,6 +319,11 @@ static DEFINE_IDA(testb_indices);
 static DEFINE_MUTEX(testb_lock);
 static int testb_major;
 
+static inline int testb_throttled(struct testb *testb)
+{
+	return test_bit(TESTB_DEV_FL_THROTTLED, &testb->t_dev->flags);
+}
+
 static struct testb_page *testb_alloc_page(gfp_t gfp_flags)
 {
 	struct testb_page *t_page;
@@ -570,6 +595,44 @@ static int testb_handle_rq(struct request *rq)
 	} else if (req_op(rq) == REQ_OP_FLUSH)
 		return testb_handle_flush(testb);
 
+	len = blk_rq_bytes(rq);
+	if (testb_throttled(testb)) {
+		if (!hrtimer_active(&testb->timer))
+			hrtimer_restart(&testb->timer);
+
+		/*
+		 * The lock is taken here to ensure no race happens
+		 * between the timer that a) resets the budget and
+		 * b) starts the queue and the cmd handler that
+		 * c) spends the budget and d) determines to stop
+		 * the queue.
+		 *
+		 * The race happens when the processes interweave.
+		 * For example, c, a, b, d. For a whole interval,
+		 * there may be no I/O occurring due to the race.
+		 *
+		 * Solution: Create an isolated region using the
+		 * device spinlock that ensures a and b occur in
+		 * isolation from c and d such that no interweaving
+		 * occurs.
+		 *
+		 * This also prevents multiple processes or threads from
+		 * calling blk_mq_stop_hw_queues.
+		 */
+		spin_lock_irqsave(&testb->t_dev->lock, lock_flag);
+
+		if (atomic_long_add_negative((int)(-1 * len),
+						&testb->cur_bytes)) {
+			if (!blk_mq_queue_stopped(testb->q))
+				blk_mq_stop_hw_queues(testb->q);
+
+			spin_unlock_irqrestore(&testb->t_dev->lock, lock_flag);
+			/* requeue the request */
+			return -ENOMEM;
+		}
+		spin_unlock_irqrestore(&testb->t_dev->lock, lock_flag);
+	}
+
 	spin_lock_irqsave(&testb->t_dev->lock, lock_flag);
 	rq_for_each_segment(bvec, rq, iter) {
 		len = bvec.bv_len;
@@ -622,6 +685,8 @@ static void testb_free_bdev(struct testb *testb)
 	ida_simple_remove(&testb_indices, testb->index);
 	mutex_unlock(&testb_lock);
 
+	if (testb_throttled(testb))
+		hrtimer_cancel(&testb->timer);
 	blk_cleanup_queue(testb->q);
 	blk_mq_free_tag_set(&testb->tag_set);
 
@@ -657,6 +722,37 @@ static void testb_config_flush(struct testb *testb)
 	blk_queue_flush_queueable(testb->q, true);
 }
 
+static enum hrtimer_restart testb_timer_fn(struct hrtimer *timer)
+{
+	struct testb *testb = container_of(timer, struct testb, timer);
+	ktime_t testb_timer_interval = ktime_set(0, TIMER_INTERVAL);
+	int testb_mbps = testb->t_dev->mbps;
+	unsigned long lock_flag;
+
+	if (atomic_long_read(&testb->cur_bytes) == mb_per_tick(testb_mbps))
+		return HRTIMER_NORESTART;
+
+	spin_lock_irqsave(&testb->t_dev->lock, lock_flag);
+	atomic_long_set(&testb->cur_bytes, mb_per_tick(testb_mbps));
+	if (blk_mq_queue_stopped(testb->q))
+		blk_mq_start_stopped_hw_queues(testb->q, true);
+	spin_unlock_irqrestore(&testb->t_dev->lock, lock_flag);
+
+	hrtimer_forward_now(&testb->timer, testb_timer_interval);
+
+	return HRTIMER_RESTART;
+}
+
+static void testb_setup_timer(struct testb *testb)
+{
+	ktime_t testb_timer_interval = ktime_set(0, TIMER_INTERVAL);
+
+	hrtimer_init(&testb->timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
+	testb->timer.function = &testb_timer_fn;
+	atomic_long_set(&testb->cur_bytes, mb_per_tick(testb->t_dev->mbps));
+	hrtimer_start(&testb->timer, testb_timer_interval, HRTIMER_MODE_ABS);
+}
+
 static int testb_gendisk_register(struct testb *testb)
 {
 	sector_t size;
@@ -722,6 +818,12 @@ static int testb_alloc_bdev(struct testb_device *t_dev)
 		goto out_cleanup_tags;
 	}
 
+	if (t_dev->mbps && t_dev->mbps != -1) {
+		t_dev->mbps = (t_dev->mbps >> 10) > 0 ? 1024 : t_dev->mbps;
+		set_bit(TESTB_DEV_FL_THROTTLED, &t_dev->flags);
+		testb_setup_timer(testb);
+	}
+
 	testb->q->queuedata = testb;
 	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, testb->q);
 	queue_flag_clear_unlocked(QUEUE_FLAG_ADD_RANDOM, testb->q);
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 4/5] testb: emulate disk cache
  2017-08-05 15:51 [PATCH 0/5] block: a virtual block device driver for testing Shaohua Li
                   ` (2 preceding siblings ...)
  2017-08-05 15:51 ` [PATCH 3/5] testb: implement bandwidth control Shaohua Li
@ 2017-08-05 15:51 ` Shaohua Li
  2017-08-05 15:51 ` [PATCH 5/5] testb: badblock support Shaohua Li
  2017-08-07  8:29 ` [PATCH 0/5] block: a virtual block device driver for testing Hannes Reinecke
  5 siblings, 0 replies; 18+ messages in thread
From: Shaohua Li @ 2017-08-05 15:51 UTC (permalink / raw)
  To: linux-block, linux-raid; +Cc: kernel-team, Kyungchan Koh, Kyungchan Koh

From: Kyungchan Koh <kkc6196@fb.com>

Software must flush the disk cache to guarantee data safety. To check
whether software does the disk cache flush correctly, we must know the
behavior of the disk. But physical disk behavior is uncontrollable: even
if software doesn't do the flush, the disk probably does it anyway. This
patch emulates a cache in the test disk.

All writes go to a cache first; when the cache is full, we flush some
data to disk storage. A flush request flushes all data in the cache to
disk storage. If there is a power failure (emulated by writing to the
power attribute, 'echo 0 > disk_name/power'), we discard all data in the
cache but preserve the data in disk storage. Later we can power on the
disk again as usual (write 1 to the 'power' attribute) and check data
integrity to verify that the software does everything correctly.

Signed-off-by: Kyungchan Koh <kkc6196@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/block/test_blk.c | 248 ++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 224 insertions(+), 24 deletions(-)

diff --git a/drivers/block/test_blk.c b/drivers/block/test_blk.c
index c67c73d..631dae4 100644
--- a/drivers/block/test_blk.c
+++ b/drivers/block/test_blk.c
@@ -46,6 +46,7 @@ struct testb {
 
 	atomic_long_t cur_bytes;
 	struct hrtimer timer;
+	unsigned long cache_flush_pos;
 };
 
 /*
@@ -62,16 +63,27 @@ struct testb_page {
 };
 
 /*
+ * The highest 2 bits of the bitmap are for special purposes. LOCK means the
+ * cache page is being flushed to storage. FREE means the cache page is freed
+ * and should be skipped when flushing to storage. Please see
+ * testb_make_cache_space().
+ */
+#define TESTB_PAGE_LOCK (sizeof(unsigned long) * 8 - 1)
+#define TESTB_PAGE_FREE (sizeof(unsigned long) * 8 - 2)
+
+/*
  * Status flags for testb_device.
  *
  * CONFIGURED:	Device has been configured and turned on. Cannot reconfigure.
  * UP:		Device is currently on and visible in userspace.
  * THROTTLED:	Device is being throttled.
+ * CACHE:	Device is using a write-back cache.
  */
 enum testb_device_flags {
 	TESTB_DEV_FL_CONFIGURED	= 0,
 	TESTB_DEV_FL_UP		= 1,
 	TESTB_DEV_FL_THROTTLED	= 2,
+	TESTB_DEV_FL_CACHE	= 3,
 };
 
 /*
@@ -81,6 +93,8 @@ enum testb_device_flags {
  * @lock:	Protect data of the device
  * @testb:	The device that these attributes belong to.
  * @pages:	The storage of the device.
+ * @cache:	The cache of the device.
+ * @curr_cache:	The current cache size.
  * @flags:	TEST_DEV_FL_ flags to indicate various status.
  *
  * @power:	1 means on; 0 means off.
@@ -90,13 +104,16 @@ enum testb_device_flags {
  * @q_depth:	The depth of each queue.
  * @discard:	If enable discard
  * @mbps:	Bandwidth throttle cap (in mb/s).
+ * @cache_size:	The max capacity of the cache.
  */
 struct testb_device {
 	struct config_item item;
 	spinlock_t lock;
 	struct testb *testb;
 	struct radix_tree_root pages;
+	struct radix_tree_root cache;
 	unsigned long flags;
+	unsigned int curr_cache;
 
 	uint power;
 	u64 size;
@@ -105,11 +122,13 @@ struct testb_device {
 	uint q_depth;
 	uint discard;
 	uint mbps;
+	u64 cache_size;
 };
 
 static int testb_poweron_device(struct testb_device *dev);
 static void testb_poweroff_device(struct testb_device *dev);
-static void testb_free_device_storage(struct testb_device *t_dev);
+static void testb_free_device_storage(struct testb_device *t_dev,
+	bool is_cache);
 
 static inline struct testb_device *to_testb_device(struct config_item *item)
 {
@@ -179,6 +198,7 @@ TESTB_DEVICE_ATTR(nr_queues, uint);
 TESTB_DEVICE_ATTR(q_depth, uint);
 TESTB_DEVICE_ATTR(discard, uint);
 TESTB_DEVICE_ATTR(mbps, uint);
+TESTB_DEVICE_ATTR(cache_size, u64);
 
 static ssize_t testb_device_power_show(struct config_item *item, char *page)
 {
@@ -226,6 +246,7 @@ static struct configfs_attribute *testb_device_attrs[] = {
 	&testb_device_attr_q_depth,
 	&testb_device_attr_discard,
 	&testb_device_attr_mbps,
+	&testb_device_attr_cache_size,
 	NULL,
 };
 
@@ -233,7 +254,7 @@ static void testb_device_release(struct config_item *item)
 {
 	struct testb_device *t_dev = to_testb_device(item);
 
-	testb_free_device_storage(t_dev);
+	testb_free_device_storage(t_dev, false);
 	kfree(t_dev);
 }
 
@@ -257,6 +278,7 @@ config_item *testb_group_make_item(struct config_group *group, const char *name)
 		return ERR_PTR(-ENOMEM);
 	spin_lock_init(&t_dev->lock);
 	INIT_RADIX_TREE(&t_dev->pages, GFP_ATOMIC);
+	INIT_RADIX_TREE(&t_dev->cache, GFP_ATOMIC);
 
 	config_item_init_type_name(&t_dev->item, name, &testb_device_type);
 
@@ -267,6 +289,7 @@ config_item *testb_group_make_item(struct config_group *group, const char *name)
 	t_dev->q_depth = 64;
 	t_dev->discard = 1;
 	t_dev->mbps = -1;
+	t_dev->cache_size = 100 * 1024 * 1024ULL;
 
 	return &t_dev->item;
 }
@@ -285,7 +308,7 @@ testb_group_drop_item(struct config_group *group, struct config_item *item)
 
 static ssize_t memb_group_features_show(struct config_item *item, char *page)
 {
-	return snprintf(page, PAGE_SIZE, "bandwidth\n");
+	return snprintf(page, PAGE_SIZE, "bandwidth,cache\n");
 }
 
 CONFIGFS_ATTR_RO(memb_group_, features);
@@ -324,6 +347,11 @@ static inline int testb_throttled(struct testb *testb)
 	return test_bit(TESTB_DEV_FL_THROTTLED, &testb->t_dev->flags);
 }
 
+static inline int testb_cache_active(struct testb *testb)
+{
+	return test_bit(TESTB_DEV_FL_CACHE, &testb->t_dev->flags);
+}
+
 static struct testb_page *testb_alloc_page(gfp_t gfp_flags)
 {
 	struct testb_page *t_page;
@@ -348,11 +376,15 @@ static void testb_free_page(struct testb_page *t_page)
 {
 	WARN_ON(!t_page);
 
+	__set_bit(TESTB_PAGE_FREE, &t_page->bitmap);
+	if (test_bit(TESTB_PAGE_LOCK, &t_page->bitmap))
+		return;
 	__free_page(t_page->page);
 	kfree(t_page);
 }
 
-static void testb_free_sector(struct testb *testb, sector_t sector)
+static void testb_free_sector(struct testb *testb, sector_t sector,
+	bool is_cache)
 {
 	unsigned int sector_bit;
 	u64 idx;
@@ -361,7 +393,7 @@ static void testb_free_sector(struct testb *testb, sector_t sector)
 
 	assert_spin_locked(&testb->t_dev->lock);
 
-	root = &testb->t_dev->pages;
+	root = is_cache ? &testb->t_dev->cache : &testb->t_dev->pages;
 	idx = sector >> PAGE_SECTORS_SHIFT;
 	sector_bit = (sector & SECTOR_MASK);
 
@@ -373,36 +405,40 @@ static void testb_free_sector(struct testb *testb, sector_t sector)
 			ret = radix_tree_delete_item(root, idx, t_page);
 			WARN_ON(ret != t_page);
 			testb_free_page(ret);
+			if (is_cache)
+				testb->t_dev->curr_cache -= PAGE_SIZE;
 		}
 	}
 }
 
 static struct testb_page *testb_radix_tree_insert(struct testb *testb, u64 idx,
-	struct testb_page *t_page)
+	struct testb_page *t_page, bool is_cache)
 {
 	struct radix_tree_root *root;
 
 	assert_spin_locked(&testb->t_dev->lock);
 
-	root = &testb->t_dev->pages;
+	root = is_cache ? &testb->t_dev->cache : &testb->t_dev->pages;
 
 	if (radix_tree_insert(root, idx, t_page)) {
 		testb_free_page(t_page);
 		t_page = radix_tree_lookup(root, idx);
 		WARN_ON(!t_page || t_page->page->index != idx);
-	}
+	} else if (is_cache)
+		testb->t_dev->curr_cache += PAGE_SIZE;
 
 	return t_page;
 }
 
-static void testb_free_device_storage(struct testb_device *t_dev)
+static void testb_free_device_storage(struct testb_device *t_dev,
+	bool is_cache)
 {
 	unsigned long pos = 0;
 	int nr_pages;
 	struct testb_page *ret, *t_pages[FREE_BATCH];
 	struct radix_tree_root *root;
 
-	root = &t_dev->pages;
+	root = is_cache ? &t_dev->cache : &t_dev->pages;
 
 	do {
 		int i;
@@ -419,21 +455,27 @@ static void testb_free_device_storage(struct testb_device *t_dev)
 
 		pos++;
 	} while (nr_pages == FREE_BATCH);
+
+	if (is_cache)
+		t_dev->curr_cache = 0;
 }
 
-static struct testb_page *testb_lookup_page(struct testb *testb,
-	sector_t sector, bool for_write)
+static struct testb_page *__testb_lookup_page(struct testb *testb,
+	sector_t sector, bool for_write, bool is_cache)
 {
 	unsigned int sector_bit;
 	u64 idx;
 	struct testb_page *t_page;
+	struct radix_tree_root *root;
 
 	assert_spin_locked(&testb->t_dev->lock);
 
 	idx = sector >> PAGE_SECTORS_SHIFT;
 	sector_bit = (sector & SECTOR_MASK);
 
-	t_page = radix_tree_lookup(&testb->t_dev->pages, idx);
+	root = is_cache ? &testb->t_dev->cache : &testb->t_dev->pages;
+
+	t_page = radix_tree_lookup(root, idx);
 	WARN_ON(t_page && t_page->page->index != idx);
 
 	if (t_page && (for_write || test_bit(sector_bit, &t_page->bitmap)))
@@ -442,15 +484,27 @@ static struct testb_page *testb_lookup_page(struct testb *testb,
 	return NULL;
 }
 
+static struct testb_page *testb_lookup_page(struct testb *testb,
+	sector_t sector, bool for_write, bool ignore_cache)
+{
+	struct testb_page *page = NULL;
+
+	if (!ignore_cache)
+		page = __testb_lookup_page(testb, sector, for_write, true);
+	if (page)
+		return page;
+	return __testb_lookup_page(testb, sector, for_write, false);
+}
+
 static struct testb_page *testb_insert_page(struct testb *testb,
-	sector_t sector, unsigned long *lock_flag)
+	sector_t sector, unsigned long *lock_flag, bool ignore_cache)
 {
 	u64 idx;
 	struct testb_page *t_page;
 
 	assert_spin_locked(&testb->t_dev->lock);
 
-	t_page = testb_lookup_page(testb, sector, true);
+	t_page = testb_lookup_page(testb, sector, true, ignore_cache);
 	if (t_page)
 		return t_page;
 
@@ -466,7 +520,7 @@ static struct testb_page *testb_insert_page(struct testb *testb,
 	spin_lock_irqsave(&testb->t_dev->lock, *lock_flag);
 	idx = sector >> PAGE_SECTORS_SHIFT;
 	t_page->page->index = idx;
-	t_page = testb_radix_tree_insert(testb, idx, t_page);
+	t_page = testb_radix_tree_insert(testb, idx, t_page, !ignore_cache);
 	radix_tree_preload_end();
 
 	return t_page;
@@ -474,11 +528,122 @@ static struct testb_page *testb_insert_page(struct testb *testb,
 	testb_free_page(t_page);
 out_lock:
 	spin_lock_irqsave(&testb->t_dev->lock, *lock_flag);
-	return testb_lookup_page(testb, sector, true);
+	return testb_lookup_page(testb, sector, true, ignore_cache);
+}
+
+static int
+testb_flush_cache_page(struct testb *testb, struct testb_page *c_page,
+	unsigned long *lock_flag)
+{
+	int i;
+	unsigned int offset;
+	u64 idx;
+	struct testb_page *t_page, *ret;
+	void *dst, *src;
+
+	assert_spin_locked(&testb->t_dev->lock);
+
+	idx = c_page->page->index;
+
+	t_page = testb_insert_page(testb, idx << PAGE_SECTORS_SHIFT,
+		lock_flag, true);
+
+	__clear_bit(TESTB_PAGE_LOCK, &c_page->bitmap);
+	if (test_bit(TESTB_PAGE_FREE, &c_page->bitmap)) {
+		testb_free_page(c_page);
+		if (t_page && t_page->bitmap == 0) {
+			ret = radix_tree_delete_item(&testb->t_dev->pages,
+				idx, t_page);
+			testb_free_page(t_page);
+		}
+		return 0;
+	}
+
+	if (!t_page)
+		return -ENOMEM;
+
+	src = kmap_atomic(c_page->page);
+	dst = kmap_atomic(t_page->page);
+
+	for (i = 0; i < PAGE_SECTORS;
+			i += (testb->t_dev->blocksize >> SECTOR_SHIFT)) {
+		if (test_bit(i, &c_page->bitmap)) {
+			offset = (i << SECTOR_SHIFT);
+			memcpy(dst + offset, src + offset,
+				testb->t_dev->blocksize);
+			__set_bit(i, &t_page->bitmap);
+		}
+	}
+
+	kunmap_atomic(dst);
+	kunmap_atomic(src);
+
+	ret = radix_tree_delete_item(&testb->t_dev->cache, idx, c_page);
+	testb_free_page(ret);
+	testb->t_dev->curr_cache -= PAGE_SIZE;
+
+	return 0;
+}
+
+static int testb_make_cache_space(struct testb *testb,
+	unsigned long *lock_flag, size_t n)
+{
+	int i, err, nr_pages;
+	struct testb_page *c_pages[FREE_BATCH];
+	size_t flushed = 0, one_round;
+
+	assert_spin_locked(&testb->t_dev->lock);
+
+again:
+	if (testb->t_dev->cache_size > testb->t_dev->curr_cache + n ||
+			testb->t_dev->curr_cache == 0)
+		return 0;
+
+	nr_pages = radix_tree_gang_lookup(&testb->t_dev->cache,
+			(void **)c_pages, testb->cache_flush_pos, FREE_BATCH);
+	/*
+	 * testb_flush_cache_page could unlock before using the c_pages. To
+	 * avoid a race, we don't allow the pages to be freed here
+	 */
+	for (i = 0; i < nr_pages; i++) {
+		testb->cache_flush_pos = c_pages[i]->page->index;
+		/*
+		 * We found the page which is being flushed to disk by other
+		 * threads
+		 */
+		if (test_bit(TESTB_PAGE_LOCK, &c_pages[i]->bitmap))
+			c_pages[i] = NULL;
+		else
+			__set_bit(TESTB_PAGE_LOCK, &c_pages[i]->bitmap);
+	}
+
+	one_round = 0;
+	for (i = 0; i < nr_pages; i++) {
+		if (c_pages[i] == NULL)
+			continue;
+		err = testb_flush_cache_page(testb, c_pages[i], lock_flag);
+		if (err)
+			return err;
+		one_round++;
+	}
+	flushed += one_round << PAGE_SHIFT;
+
+	if (n > flushed) {
+		if (nr_pages == 0)
+			testb->cache_flush_pos = 0;
+		if (one_round == 0) {
+			/* give other threads a chance */
+			spin_unlock_irqrestore(&testb->t_dev->lock, *lock_flag);
+			spin_lock_irqsave(&testb->t_dev->lock, *lock_flag);
+		}
+		goto again;
+	}
+	return 0;
 }
 
 static int copy_to_testb(struct testb *testb, struct page *source,
-	unsigned int off, sector_t sector, size_t n, unsigned long *lock_flag)
+	unsigned int off, sector_t sector, size_t n, unsigned long *lock_flag,
+	bool is_fua)
 {
 	size_t temp, count = 0;
 	unsigned int offset;
@@ -488,8 +653,12 @@ static int copy_to_testb(struct testb *testb, struct page *source,
 	while (count < n) {
 		temp = min_t(size_t, testb->t_dev->blocksize, n - count);
 
+		if (testb_cache_active(testb) && !is_fua)
+			testb_make_cache_space(testb, lock_flag, PAGE_SIZE);
+
 		offset = (sector & SECTOR_MASK) << SECTOR_SHIFT;
-		t_page = testb_insert_page(testb, sector, lock_flag);
+		t_page = testb_insert_page(testb, sector, lock_flag,
+			!testb_cache_active(testb) || is_fua);
 		if (!t_page)
 			return -ENOSPC;
 
@@ -501,6 +670,9 @@ static int copy_to_testb(struct testb *testb, struct page *source,
 
 		__set_bit(sector & SECTOR_MASK, &t_page->bitmap);
 
+		if (is_fua)
+			testb_free_sector(testb, sector, true);
+
 		count += temp;
 		sector += temp >> SECTOR_SHIFT;
 	}
@@ -519,7 +691,8 @@ static int copy_from_testb(struct testb *testb, struct page *dest,
 		temp = min_t(size_t, testb->t_dev->blocksize, n - count);
 
 		offset = (sector & SECTOR_MASK) << SECTOR_SHIFT;
-		t_page = testb_lookup_page(testb, sector, false);
+		t_page = testb_lookup_page(testb, sector, false,
+			!testb_cache_active(testb));
 
 		dst = kmap_atomic(dest);
 		if (!t_page) {
@@ -546,7 +719,9 @@ static void testb_handle_discard(struct testb *testb, sector_t sector, size_t n)
 	spin_lock_irqsave(&testb->t_dev->lock, lock_flag);
 	while (n > 0) {
 		temp = min_t(size_t, n, testb->t_dev->blocksize);
-		testb_free_sector(testb, sector);
+		testb_free_sector(testb, sector, false);
+		if (testb_cache_active(testb))
+			testb_free_sector(testb, sector, true);
 		sector += temp >> SECTOR_SHIFT;
 		n -= temp;
 	}
@@ -555,12 +730,28 @@ static void testb_handle_discard(struct testb *testb, sector_t sector, size_t n)
 
 static int testb_handle_flush(struct testb *testb)
 {
+	unsigned long lock_flag;
+	int err;
+
+	if (!testb_cache_active(testb))
+		return 0;
+
+	spin_lock_irqsave(&testb->t_dev->lock, lock_flag);
+	while (true) {
+		err = testb_make_cache_space(testb, &lock_flag,
+			testb->t_dev->cache_size);
+		if (err || testb->t_dev->curr_cache == 0)
+			break;
+	}
+
+	WARN_ON(!radix_tree_empty(&testb->t_dev->cache));
+	spin_unlock_irqrestore(&testb->t_dev->lock, lock_flag);
 	return 0;
 }
 
 static int testb_transfer(struct testb *testb, struct page *page,
 	unsigned int len, unsigned int off, bool is_write, sector_t sector,
-	unsigned long *lock_flags)
+	unsigned long *lock_flags, bool is_fua)
 {
 	int err = 0;
 
@@ -571,7 +762,7 @@ static int testb_transfer(struct testb *testb, struct page *page,
 	} else {
 		flush_dcache_page(page);
 		err = copy_to_testb(testb, page, off, sector, len,
-						lock_flags);
+						lock_flags, is_fua);
 	}
 
 	return err;
@@ -638,7 +829,7 @@ static int testb_handle_rq(struct request *rq)
 		len = bvec.bv_len;
 		err = testb_transfer(testb, bvec.bv_page, len, bvec.bv_offset,
 				     op_is_write(req_op(rq)), sector,
-				     &lock_flag);
+				     &lock_flag, req_op(rq) & REQ_FUA);
 		if (err) {
 			spin_unlock_irqrestore(&testb->t_dev->lock, lock_flag);
 			return err;
@@ -690,6 +881,8 @@ static void testb_free_bdev(struct testb *testb)
 	blk_cleanup_queue(testb->q);
 	blk_mq_free_tag_set(&testb->tag_set);
 
+	if (testb_cache_active(testb))
+		testb_free_device_storage(testb->t_dev, true);
 	kfree(testb);
 }
 
@@ -799,6 +992,9 @@ static int testb_alloc_bdev(struct testb_device *t_dev)
 	testb->t_dev = t_dev;
 	t_dev->testb = testb;
 
+	if (t_dev->cache_size > 0)
+		set_bit(TESTB_DEV_FL_CACHE, &testb->t_dev->flags);
+
 	testb->tag_set.ops = &testb_mq_ops;
 	testb->tag_set.nr_hw_queues = t_dev->nr_queues;
 	testb->tag_set.queue_depth = t_dev->q_depth;
@@ -869,6 +1065,10 @@ static int __init testb_init(void)
 	int ret = 0;
 	struct configfs_subsystem *subsys = &testb_subsys;
 
+	/* check for testb_page.bitmap */
+	if (sizeof(unsigned long) * 8 - 2 < (PAGE_SIZE >> SECTOR_SHIFT))
+		return -EINVAL;
+
 	config_group_init(&subsys->su_group);
 	mutex_init(&subsys->su_mutex);
 
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 5/5] testb: badblock support
  2017-08-05 15:51 [PATCH 0/5] block: a virtual block device driver for testing Shaohua Li
                   ` (3 preceding siblings ...)
  2017-08-05 15:51 ` [PATCH 4/5] testb: emulate disk cache Shaohua Li
@ 2017-08-05 15:51 ` Shaohua Li
  2017-08-06  1:56   ` Dan Williams
  2017-08-07  8:29 ` [PATCH 0/5] block: a virtual block device driver for testing Hannes Reinecke
  5 siblings, 1 reply; 18+ messages in thread
From: Shaohua Li @ 2017-08-05 15:51 UTC (permalink / raw)
  To: linux-block, linux-raid; +Cc: kernel-team, Kyungchan Koh, Shaohua Li

From: Shaohua Li <shli@fb.com>

Sometimes a disk can have broken tracks where the data is inaccessible,
while data in other parts can be accessed normally. MD RAID supports
such disks. But we don't have a good way to test it, because we can't
control which part of a physical disk is bad. For a virtual disk, this
can be easily controlled.

This patch adds a new 'badblock' attribute. Configure it in this way:
echo "+1-100" > xxx/badblock, this marks sectors [1-100] as bad blocks.
echo "-20-30" > xxx/badblock, this marks sectors [20-30] as good again.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/block/test_blk.c | 206 +++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 200 insertions(+), 6 deletions(-)

diff --git a/drivers/block/test_blk.c b/drivers/block/test_blk.c
index 631dae4..54647e9 100644
--- a/drivers/block/test_blk.c
+++ b/drivers/block/test_blk.c
@@ -16,6 +16,7 @@
 #include <linux/hrtimer.h>
 #include <linux/radix-tree.h>
 #include <linux/idr.h>
+#include <linux/interval_tree_generic.h>
 
 #define SECTOR_SHIFT		9
 #define PAGE_SECTORS_SHIFT	(PAGE_SHIFT - SECTOR_SHIFT)
@@ -112,6 +113,7 @@ struct testb_device {
 	struct testb *testb;
 	struct radix_tree_root pages;
 	struct radix_tree_root cache;
+	struct rb_root badblock_tree;
 	unsigned long flags;
 	unsigned int curr_cache;
 
@@ -238,6 +240,185 @@ static ssize_t testb_device_power_store(struct config_item *item,
 
 CONFIGFS_ATTR(testb_device_, power);
 
+struct testb_bb_range {
+	struct rb_node rb;
+	sector_t start;
+	sector_t end;
+	sector_t __subtree_last;
+};
+#define START(node) ((node)->start)
+#define LAST(node) ((node)->end)
+
+INTERVAL_TREE_DEFINE(struct testb_bb_range, rb, sector_t, __subtree_last,
+		START, LAST, static, testb_bb);
+
+static void testb_bb_adjust_range(struct rb_root *root,
+	struct testb_bb_range *range, sector_t start, sector_t end)
+{
+	testb_bb_remove(range, root);
+	range->start = start;
+	range->end = end;
+	testb_bb_insert(range, root);
+}
+
+static int testb_bb_insert_range(struct testb_device *dev, sector_t start,
+	sector_t end)
+{
+	struct rb_root *root = &dev->badblock_tree;
+	struct testb_bb_range *first, *next;
+	sector_t nstart, nend;
+
+	first = testb_bb_iter_first(root, start, end);
+	if (!first) {
+		first = kmalloc(sizeof(*first), GFP_ATOMIC);
+		if (!first)
+			return  -ENOMEM;
+		first->start = start;
+		first->end = end;
+		testb_bb_insert(first, root);
+		return 0;
+	}
+
+	nstart = min(start, first->start);
+	nend = max(first->end, end);
+	while (true) {
+		next = testb_bb_iter_next(first, start, end);
+		if (!next)
+			break;
+		nend = max(nend, next->end);
+		testb_bb_remove(next, root);
+		kfree(next);
+	}
+	testb_bb_adjust_range(root, first, nstart, nend);
+	return 0;
+}
+
+static int testb_bb_remove_range(struct testb_device *dev, sector_t start,
+	sector_t end)
+{
+	struct testb_bb_range *first;
+	struct rb_root *root = &dev->badblock_tree;
+
+	first = testb_bb_iter_first(root, start, end);
+	while (first) {
+		if (first->start < start && first->end > end) {
+			sector_t tmp = first->end;
+
+			testb_bb_adjust_range(root, first, first->start,
+				start - 1);
+			return testb_bb_insert_range(dev, end + 1, tmp);
+		}
+
+		if (first->start >= start && first->end <= end) {
+			testb_bb_remove(first, root);
+			kfree(first);
+			first = testb_bb_iter_first(root, start, end);
+			continue;
+		}
+
+		if (first->start < start) {
+			testb_bb_adjust_range(root, first, first->start,
+				start - 1);
+			first = testb_bb_iter_first(root, start, end);
+			continue;
+		}
+
+		WARN_ON(first->end <= end);
+		testb_bb_adjust_range(root, first, end + 1, first->end);
+		return 0;
+	}
+	return 0;
+}
+
+static bool testb_bb_in_range(struct testb_device *dev, sector_t start,
+	sector_t end)
+{
+	assert_spin_locked(&dev->lock);
+	return testb_bb_iter_first(&dev->badblock_tree, start, end) != NULL;
+}
+
+static void testb_bb_clear_all(struct testb_device *dev)
+{
+	struct testb_bb_range *iter;
+
+	while ((iter = testb_bb_iter_first(&dev->badblock_tree, 0,
+						~(sector_t)0))) {
+		testb_bb_remove(iter, &dev->badblock_tree);
+		kfree(iter);
+	}
+}
+
+static ssize_t testb_device_badblock_show(struct config_item *item, char *page)
+{
+	struct testb_device *t_dev = to_testb_device(item);
+	ssize_t count = 0, left = PAGE_SIZE;
+	struct testb_bb_range *node;
+	bool first = true;
+
+	spin_lock_irq(&t_dev->lock);
+	for (node = testb_bb_iter_first(&t_dev->badblock_tree, 0, ~(sector_t)0);
+		node; node = testb_bb_iter_next(node, 0, ~(sector_t)0)) {
+		ssize_t ret;
+
+		ret = snprintf(page + count, left, "%s%llu-%llu",
+			first ? "":",", (u64)node->start, (u64)node->end);
+		if (ret < 0)
+			break;
+		count += ret;
+		left -= ret;
+		first = false;
+		/* don't output truncated range */
+		if (left < 50)
+			break;
+	}
+	spin_unlock_irq(&t_dev->lock);
+	return count + sprintf(page + count, "\n");
+}
+
+static ssize_t testb_device_badblock_store(struct config_item *item,
+				     const char *page, size_t count)
+{
+	struct testb_device *t_dev = to_testb_device(item);
+	char *orig, *buf, *tmp;
+	u64 start, end;
+	int ret;
+
+	orig = kstrndup(page, count, GFP_KERNEL);
+	if (!orig)
+		return -ENOMEM;
+
+	buf = strstrip(orig);
+
+	ret = -EINVAL;
+	if (buf[0] != '+' && buf[0] != '-')
+		goto out;
+	tmp = strchr(&buf[1], '-');
+	if (!tmp)
+		goto out;
+	*tmp = '\0';
+	ret = kstrtoull(buf + 1, 0, &start);
+	if (ret)
+		goto out;
+	ret = kstrtoull(tmp + 1, 0, &end);
+	if (ret)
+		goto out;
+	ret = -EINVAL;
+	if (start > end)
+		goto out;
+	spin_lock_irq(&t_dev->lock);
+	if (buf[0] == '+')
+		ret = testb_bb_insert_range(t_dev, start, end);
+	else
+		ret = testb_bb_remove_range(t_dev, start, end);
+	spin_unlock_irq(&t_dev->lock);
+	if (ret == 0)
+		ret = count;
+out:
+	kfree(orig);
+	return ret;
+}
+CONFIGFS_ATTR(testb_device_, badblock);
+
 static struct configfs_attribute *testb_device_attrs[] = {
 	&testb_device_attr_power,
 	&testb_device_attr_size,
@@ -247,6 +428,7 @@ static struct configfs_attribute *testb_device_attrs[] = {
 	&testb_device_attr_discard,
 	&testb_device_attr_mbps,
 	&testb_device_attr_cache_size,
+	&testb_device_attr_badblock,
 	NULL,
 };
 
@@ -254,6 +436,7 @@ static void testb_device_release(struct config_item *item)
 {
 	struct testb_device *t_dev = to_testb_device(item);
 
+	testb_bb_clear_all(t_dev);
 	testb_free_device_storage(t_dev, false);
 	kfree(t_dev);
 }
@@ -308,7 +491,7 @@ testb_group_drop_item(struct config_group *group, struct config_item *item)
 
 static ssize_t memb_group_features_show(struct config_item *item, char *page)
 {
-	return snprintf(page, PAGE_SIZE, "bandwidth,cache\n");
+	return snprintf(page, PAGE_SIZE, "bandwidth,cache,badblock\n");
 }
 
 CONFIGFS_ATTR_RO(memb_group_, features);
@@ -711,12 +894,18 @@ static int copy_from_testb(struct testb *testb, struct page *dest,
 	return 0;
 }
 
-static void testb_handle_discard(struct testb *testb, sector_t sector, size_t n)
+static int testb_handle_discard(struct testb *testb, sector_t sector, size_t n)
 {
 	size_t temp;
 	unsigned long lock_flag;
 
 	spin_lock_irqsave(&testb->t_dev->lock, lock_flag);
+	if (testb_bb_in_range(testb->t_dev, sector,
+				sector + (n >> SECTOR_SHIFT) - 1)) {
+		spin_unlock_irqrestore(&testb->t_dev->lock, lock_flag);
+		return -EIO;
+	}
+
 	while (n > 0) {
 		temp = min_t(size_t, n, testb->t_dev->blocksize);
 		testb_free_sector(testb, sector, false);
@@ -726,6 +915,7 @@ static void testb_handle_discard(struct testb *testb, sector_t sector, size_t n)
 		n -= temp;
 	}
 	spin_unlock_irqrestore(&testb->t_dev->lock, lock_flag);
+	return 0;
 }
 
 static int testb_handle_flush(struct testb *testb)
@@ -780,10 +970,9 @@ static int testb_handle_rq(struct request *rq)
 
 	sector = blk_rq_pos(rq);
 
-	if (req_op(rq) == REQ_OP_DISCARD) {
-		testb_handle_discard(testb, sector, blk_rq_bytes(rq));
-		return 0;
-	} else if (req_op(rq) == REQ_OP_FLUSH)
+	if (req_op(rq) == REQ_OP_DISCARD)
+		return testb_handle_discard(testb, sector, blk_rq_bytes(rq));
+	else if (req_op(rq) == REQ_OP_FLUSH)
 		return testb_handle_flush(testb);
 
 	len = blk_rq_bytes(rq);
@@ -825,6 +1014,11 @@ static int testb_handle_rq(struct request *rq)
 	}
 
 	spin_lock_irqsave(&testb->t_dev->lock, lock_flag);
+	if (testb_bb_in_range(testb->t_dev, sector,
+				sector + blk_rq_sectors(rq) - 1)) {
+		spin_unlock_irqrestore(&testb->t_dev->lock, lock_flag);
+		return -EIO;
+	}
 	rq_for_each_segment(bvec, rq, iter) {
 		len = bvec.bv_len;
 		err = testb_transfer(testb, bvec.bv_page, len, bvec.bv_offset,
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 5/5] testb: badblock support
  2017-08-05 15:51 ` [PATCH 5/5] testb: badblock support Shaohua Li
@ 2017-08-06  1:56   ` Dan Williams
  2017-08-07  4:39     ` Shaohua Li
  0 siblings, 1 reply; 18+ messages in thread
From: Dan Williams @ 2017-08-06  1:56 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-block, linux-raid, Kernel Team, Kyungchan Koh, Shaohua Li

On Sat, Aug 5, 2017 at 8:51 AM, Shaohua Li <shli@kernel.org> wrote:
> From: Shaohua Li <shli@fb.com>
>
> Sometimes a disk has broken tracks whose data is inaccessible, while
> data in other parts can still be accessed normally. MD RAID supports
> such disks, but we don't have a good way to test that, because we
> can't control which part of a physical disk is bad. For a virtual
> disk, this can be controlled easily.
>
> This patch adds a new 'badblock' attribute. Configure it in this way:
> echo "+1-100" > xxx/badblock marks sectors [1-100] as bad blocks;
> echo "-20-30" > xxx/badblock marks sectors [20-30] as good again.
Did you happen to overlook block/badblocks.c, or did you find it unsuitable?
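For illustration, the quoted "+start-end" / "-start-end" syntax could be parsed roughly like this in plain userspace C (a hedged sketch; the function name and exact semantics are assumptions, not the driver's actual code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

typedef unsigned long long sector_t;

/*
 * Parse a badblock command such as "+1-100" (mark sectors 1..100 bad)
 * or "-20-30" (mark sectors 20..30 good again).  Returns true on
 * success; *add is set for '+', cleared for '-'.
 */
static bool parse_badblock(const char *buf, bool *add,
			   sector_t *start, sector_t *end)
{
	char op;

	if (sscanf(buf, "%c%llu-%llu", &op, start, end) != 3)
		return false;
	if (op != '+' && op != '-')
		return false;
	if (*start > *end)
		return false;
	*add = (op == '+');
	return true;
}
```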

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 5/5] testb: badblock support
  2017-08-06  1:56   ` Dan Williams
@ 2017-08-07  4:39     ` Shaohua Li
  0 siblings, 0 replies; 18+ messages in thread
From: Shaohua Li @ 2017-08-07  4:39 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-block, linux-raid, Kernel Team, Kyungchan Koh, Shaohua Li

On Sat, Aug 05, 2017 at 06:56:31PM -0700, Dan Williams wrote:
> On Sat, Aug 5, 2017 at 8:51 AM, Shaohua Li <shli@kernel.org> wrote:
> > From: Shaohua Li <shli@fb.com>
> >
> > Sometimes a disk has broken tracks whose data is inaccessible, while
> > data in other parts can still be accessed normally. MD RAID supports
> > such disks, but we don't have a good way to test that, because we
> > can't control which part of a physical disk is bad. For a virtual
> > disk, this can be controlled easily.
> >
> > This patch adds a new 'badblock' attribute. Configure it in this way:
> > echo "+1-100" > xxx/badblock marks sectors [1-100] as bad blocks;
> > echo "-20-30" > xxx/badblock marks sectors [20-30] as good again.
> 
> Did you happen to overlook block/badblocks.c, or did you find it unsuitable?

Hmm, I did overlook it; I assumed I didn't need the features of badblocks
without double-checking the code. I'll use the badblocks interface in the
next post.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/5] block: a virtual block device driver for testing
  2017-08-05 15:51 [PATCH 0/5] block: a virtual block device driver for testing Shaohua Li
                   ` (4 preceding siblings ...)
  2017-08-05 15:51 ` [PATCH 5/5] testb: badblock support Shaohua Li
@ 2017-08-07  8:29 ` Hannes Reinecke
  2017-08-07 16:36   ` Shaohua Li
  5 siblings, 1 reply; 18+ messages in thread
From: Hannes Reinecke @ 2017-08-07  8:29 UTC (permalink / raw)
  To: Shaohua Li, linux-block, linux-raid
  Cc: kernel-team, Kyungchan Koh, Shaohua Li

On 08/05/2017 05:51 PM, Shaohua Li wrote:
> From: Shaohua Li <shli@fb.com>
> 
> In testing software RAID, I usually found it's hard to cover specific cases.
> RAID is supposed to work even disk is in semi good state, for example, some
> sectors are broken. Since we can't control the behavior of hardware, it's
> difficult to create test suites to do destructive tests. But we can control the
> behavior of software, software based disk is an obvious choice for such tests.
> While we already have several software based disks for testing (eg, null_blk,
> scsi_debug), none is for destructive testing, this is the reason we create a
> new test block device.
> 
> Currently the driver can create disk with following features:
> - Bandwidth control. A raid array consists of several disks. The disks could
>   run in different speed, for example, one disk is SSD and the other is HD.
>   Actually raid1 has a feature called write behind just for this. To test such
>   raid1 feature, we'd like the disks speed could be controlled.
> - Emulate disk cache. Software must flush disk cache to guarantee data is
>   safely stored in media after a power failure. To verify if software works
>   well, we can't simply use physical disk, because even software doesn't flush
>   cache, the hardware probably will flush the cache. With a software
>   implementation of disk cache, we can fully control how we flush disk cache in a
>   power failure.
> - Badblock. If only part of a disk is broken, software raid continues working.
>   To test if software raid works well, disks must include some broken parts or
>   bad blocks. Bad blocks can be easily implemented in software.
> 
> While this is inspired by software raid testing, the driver is very flexible
> for extension. We can easily add new features into the driver. The interface is
> configfs, which can be configured with a shell script. There is a 'features'
> attribute exposing all supported features. By checking this, we don't need to
> worry about compability issues. For configuration details, please check the
> first patch.
> 
Any particular reason why you can't fold these changes into brd or null_blk?
After all, without those testing features it is 'just' another ramdisk
driver...

> This is William's intern project. I made some changes, all errors are mine. You
> are more than welcomed to test and add new features!
> 
Ah. But then why isn't he mentioned in the From: or Signed-off: lines?
Shouldn't he be getting some more credit than just 'William'?
(Unless William is in fact an alias for Kyungchan Koh, in which case a
name mapping would be nice ...)

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/5] block: a virtual block device driver for testing
  2017-08-07  8:29 ` [PATCH 0/5] block: a virtual block device driver for testing Hannes Reinecke
@ 2017-08-07 16:36   ` Shaohua Li
  2017-08-08 20:31     ` Jens Axboe
  0 siblings, 1 reply; 18+ messages in thread
From: Shaohua Li @ 2017-08-07 16:36 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: linux-block, linux-raid, kernel-team, Kyungchan Koh, Shaohua Li

On Mon, Aug 07, 2017 at 10:29:05AM +0200, Hannes Reinecke wrote:
> On 08/05/2017 05:51 PM, Shaohua Li wrote:
> > From: Shaohua Li <shli@fb.com>
> > 
> > In testing software RAID, I usually found it's hard to cover specific cases.
> > RAID is supposed to work even disk is in semi good state, for example, some
> > sectors are broken. Since we can't control the behavior of hardware, it's
> > difficult to create test suites to do destructive tests. But we can control the
> > behavior of software, software based disk is an obvious choice for such tests.
> > While we already have several software based disks for testing (eg, null_blk,
> > scsi_debug), none is for destructive testing, this is the reason we create a
> > new test block device.
> > 
> > Currently the driver can create disk with following features:
> > - Bandwidth control. A raid array consists of several disks. The disks could
> >   run in different speed, for example, one disk is SSD and the other is HD.
> >   Actually raid1 has a feature called write behind just for this. To test such
> >   raid1 feature, we'd like the disks speed could be controlled.
> > - Emulate disk cache. Software must flush disk cache to guarantee data is
> >   safely stored in media after a power failure. To verify if software works
> >   well, we can't simply use physical disk, because even software doesn't flush
> >   cache, the hardware probably will flush the cache. With a software
> >   implementation of disk cache, we can fully control how we flush disk cache in a
> >   power failure.
> > - Badblock. If only part of a disk is broken, software raid continues working.
> >   To test if software raid works well, disks must include some broken parts or
> >   bad blocks. Bad blocks can be easily implemented in software.
> > 
> > While this is inspired by software raid testing, the driver is very flexible
> > for extension. We can easily add new features into the driver. The interface is
> > configfs, which can be configured with a shell script. There is a 'features'
> > attribute exposing all supported features. By checking this, we don't need to
> > worry about compability issues. For configuration details, please check the
> > first patch.
> > 
> Any particular reason why you can't fold these changes into brd or null_blk?
> After all, without those testing features it is 'just' another ramdisk
> driver...

null_blk isn't a good fit. ramdisk might be, but I'd rather not use it. We are
adding new interfaces, locks and so on, and adding these features to the
ramdisk driver would clutter it. Tying this to the ramdisk driver would also
make adding new features harder, because the test driver doesn't really care
about performance while ramdisk does.

> > This is William's intern project. I made some changes, all errors are mine. You
> > are more than welcomed to test and add new features!
> > 
> Ah. But then why isn't he mentioned in the From: or Signed-off: lines?
> Shouldn't he be getting some more credits than just 'William'?
> (Unless William is in fact an alias for Kyungchan Koh, in which case a
> name mapping would be nice ...)

William is in fact an alias for Kyungchan Koh. I'll make the names consistent
in the next post.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/5] block: a virtual block device driver for testing
  2017-08-07 16:36   ` Shaohua Li
@ 2017-08-08 20:31     ` Jens Axboe
  2017-08-08 21:05       ` Shaohua Li
  0 siblings, 1 reply; 18+ messages in thread
From: Jens Axboe @ 2017-08-08 20:31 UTC (permalink / raw)
  To: Shaohua Li, Hannes Reinecke
  Cc: linux-block, linux-raid, kernel-team, Kyungchan Koh, Shaohua Li

On 08/07/2017 10:36 AM, Shaohua Li wrote:
> On Mon, Aug 07, 2017 at 10:29:05AM +0200, Hannes Reinecke wrote:
>> On 08/05/2017 05:51 PM, Shaohua Li wrote:
>>> From: Shaohua Li <shli@fb.com>
>>>
>>> In testing software RAID, I usually found it's hard to cover specific cases.
>>> RAID is supposed to work even disk is in semi good state, for example, some
>>> sectors are broken. Since we can't control the behavior of hardware, it's
>>> difficult to create test suites to do destructive tests. But we can control the
>>> behavior of software, software based disk is an obvious choice for such tests.
>>> While we already have several software based disks for testing (eg, null_blk,
>>> scsi_debug), none is for destructive testing, this is the reason we create a
>>> new test block device.
>>>
>>> Currently the driver can create disk with following features:
>>> - Bandwidth control. A raid array consists of several disks. The disks could
>>>   run in different speed, for example, one disk is SSD and the other is HD.
>>>   Actually raid1 has a feature called write behind just for this. To test such
>>>   raid1 feature, we'd like the disks speed could be controlled.
>>> - Emulate disk cache. Software must flush disk cache to guarantee data is
>>>   safely stored in media after a power failure. To verify if software works
>>>   well, we can't simply use physical disk, because even software doesn't flush
>>>   cache, the hardware probably will flush the cache. With a software
>>>   implementation of disk cache, we can fully control how we flush disk cache in a
>>>   power failure.
>>> - Badblock. If only part of a disk is broken, software raid continues working.
>>>   To test if software raid works well, disks must include some broken parts or
>>>   bad blocks. Bad blocks can be easily implemented in software.
>>>
>>> While this is inspired by software raid testing, the driver is very flexible
>>> for extension. We can easily add new features into the driver. The interface is
>>> configfs, which can be configured with a shell script. There is a 'features'
>>> attribute exposing all supported features. By checking this, we don't need to
>>> worry about compability issues. For configuration details, please check the
>>> first patch.
>>>
>> Any particular reason why you can't fold these changes into brd or null_blk?
>> After all, without those testing features it is 'just' another ramdisk
>> driver...
> 
> null_blk isn't a good fit. ramdisk might be, but I try to not. We are adding
> new interfaces, locks and so on. Adding the features into ramdisk driver will
> mess it up. Binding it to ramdisk driver will make adding new features harder
> too, because the test driver doesn't really care about performance while
> ramdisk does.

I'm curious why null_blk isn't a good fit? You'd just need to add RAM
storage to it. That would just be a separate option that should be set,
ram_backing=1 or something like that. That would make it less critical
than using the RAM disk driver as well, since only people that want a "real"
data backing would enable it.

It's not that I'm extremely opposed to adding a(nother) test block driver,
but we at least need some sort of reasoning behind why, which isn't just
"not a good fit".

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/5] testb: implement block device operations
  2017-08-05 15:51 ` [PATCH 2/5] testb: implement block device operations Shaohua Li
@ 2017-08-08 20:34   ` Jens Axboe
  0 siblings, 0 replies; 18+ messages in thread
From: Jens Axboe @ 2017-08-08 20:34 UTC (permalink / raw)
  To: Shaohua Li, linux-block, linux-raid
  Cc: kernel-team, Kyungchan Koh, Kyungchan Koh

On 08/05/2017 09:51 AM, Shaohua Li wrote:
> +static struct testb_page *testb_insert_page(struct testb *testb,
> +	sector_t sector, unsigned long *lock_flag)
> +{
> +	u64 idx;
> +	struct testb_page *t_page;
> +
> +	assert_spin_locked(&testb->t_dev->lock);
> +
> +	t_page = testb_lookup_page(testb, sector, true);
> +	if (t_page)
> +		return t_page;
> +
> +	spin_unlock_irqrestore(&testb->t_dev->lock, *lock_flag);

Passing lock flags through several functions is kind of iffy. It also
used to break on some architectures, though I don't think that's the
case anymore. The problem is that it usually ends up being code
locking instead of data-structure locking; the latter is much
cleaner.

> +static int copy_from_testb(struct testb *testb, struct page *dest,
> +	unsigned int off, sector_t sector, size_t n, unsigned long *lock_flag)

This also gets the lock flags passed in, but doesn't use it.
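The alternative Jens suggests, locking tied to the data structure rather than lock state threaded through callers, can be sketched in userspace C with a pthread mutex standing in for the spinlock (a hypothetical toy, not the driver's code):

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Toy "device": a fixed array of page slots. */
#define NR_SLOTS 16

struct toy_dev {
	pthread_mutex_t lock;   /* protects pages[] */
	void *pages[NR_SLOTS];
};

/*
 * Data-structure locking: the lock lives with the structure and is
 * acquired and released inside the function that touches it, so
 * callers never pass lock state around.
 */
static void *toy_insert_page(struct toy_dev *dev, int idx)
{
	void *page;

	pthread_mutex_lock(&dev->lock);
	if (dev->pages[idx]) {          /* already present */
		page = dev->pages[idx];
		pthread_mutex_unlock(&dev->lock);
		return page;
	}
	pthread_mutex_unlock(&dev->lock);

	/* Allocate outside the lock, as a sleeping allocation would be. */
	page = calloc(1, 64);

	pthread_mutex_lock(&dev->lock);
	if (!dev->pages[idx])
		dev->pages[idx] = page;
	else
		free(page);             /* lost the race; keep existing */
	page = dev->pages[idx];
	pthread_mutex_unlock(&dev->lock);
	return page;
}
```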

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/5] block: a virtual block device driver for testing
  2017-08-08 20:31     ` Jens Axboe
@ 2017-08-08 21:05       ` Shaohua Li
  2017-08-08 21:13         ` Jens Axboe
  0 siblings, 1 reply; 18+ messages in thread
From: Shaohua Li @ 2017-08-08 21:05 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Hannes Reinecke, linux-block, linux-raid, kernel-team,
	Kyungchan Koh, Shaohua Li

On Tue, Aug 08, 2017 at 02:31:54PM -0600, Jens Axboe wrote:
> On 08/07/2017 10:36 AM, Shaohua Li wrote:
> > On Mon, Aug 07, 2017 at 10:29:05AM +0200, Hannes Reinecke wrote:
> >> On 08/05/2017 05:51 PM, Shaohua Li wrote:
> >>> From: Shaohua Li <shli@fb.com>
> >>>
> >>> In testing software RAID, I usually found it's hard to cover specific cases.
> >>> RAID is supposed to work even disk is in semi good state, for example, some
> >>> sectors are broken. Since we can't control the behavior of hardware, it's
> >>> difficult to create test suites to do destructive tests. But we can control the
> >>> behavior of software, software based disk is an obvious choice for such tests.
> >>> While we already have several software based disks for testing (eg, null_blk,
> >>> scsi_debug), none is for destructive testing, this is the reason we create a
> >>> new test block device.
> >>>
> >>> Currently the driver can create disk with following features:
> >>> - Bandwidth control. A raid array consists of several disks. The disks could
> >>>   run in different speed, for example, one disk is SSD and the other is HD.
> >>>   Actually raid1 has a feature called write behind just for this. To test such
> >>>   raid1 feature, we'd like the disks speed could be controlled.
> >>> - Emulate disk cache. Software must flush disk cache to guarantee data is
> >>>   safely stored in media after a power failure. To verify if software works
> >>>   well, we can't simply use physical disk, because even software doesn't flush
> >>>   cache, the hardware probably will flush the cache. With a software
> >>>   implementation of disk cache, we can fully control how we flush disk cache in a
> >>>   power failure.
> >>> - Badblock. If only part of a disk is broken, software raid continues working.
> >>>   To test if software raid works well, disks must include some broken parts or
> >>>   bad blocks. Bad blocks can be easily implemented in software.
> >>>
> >>> While this is inspired by software raid testing, the driver is very flexible
> >>> for extension. We can easily add new features into the driver. The interface is
> >>> configfs, which can be configured with a shell script. There is a 'features'
> >>> attribute exposing all supported features. By checking this, we don't need to
> >>> worry about compability issues. For configuration details, please check the
> >>> first patch.
> >>>
> >> Any particular reason why you can't fold these changes into brd or null_blk?
> >> After all, without those testing features it is 'just' another ramdisk
> >> driver...
> > 
> > null_blk isn't a good fit. ramdisk might be, but I try to not. We are adding
> > new interfaces, locks and so on. Adding the features into ramdisk driver will
> > mess it up. Binding it to ramdisk driver will make adding new features harder
> > too, because the test driver doesn't really care about performance while
> > ramdisk does.
> 
> I'm curious why null_blk isn't a good fit? You'd just need to add RAM
> storage to it. That would just be a separate option that should be set,
> ram_backing=1 or something like that. That would make it less critical
> than using the RAM disk driver as well, since only people that want a "real"
> data backing would enable it.
> 
> It's not that I'm extremely opposed to adding a(nother) test block driver,
> but we at least need some sort of reasoning behind why, which isn't just
> "not a good fit".

Ah, I thought the 'null' in null_blk meant we do nothing for the disks. Of
course we can rename it, which would make this point less meaningful. I think
the main reason is the interface. We will configure the disks with different
parameters and power each disk on/off (which is the key to emulating disk
cache and power loss). The module parameter interface of null_blk doesn't work
for that usage. Of course, these issues can be fixed; for example, we can make
null_blk use the configfs interface. If you really prefer a single driver for
all test purposes, I can move the test_blk functionality into null_blk.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/5] block: a virtual block device driver for testing
  2017-08-08 21:05       ` Shaohua Li
@ 2017-08-08 21:13         ` Jens Axboe
  2017-08-08 22:00             ` Bart Van Assche
  0 siblings, 1 reply; 18+ messages in thread
From: Jens Axboe @ 2017-08-08 21:13 UTC (permalink / raw)
  To: Shaohua Li, Jens Axboe
  Cc: Hannes Reinecke, linux-block, linux-raid, kernel-team,
	Kyungchan Koh, Shaohua Li

On 08/08/2017 03:05 PM, Shaohua Li wrote:
>> I'm curious why null_blk isn't a good fit? You'd just need to add RAM
>> storage to it. That would just be a separate option that should be set,
>> ram_backing=1 or something like that. That would make it less critical
>> than using the RAM disk driver as well, since only people that want a "real"
>> data backing would enable it.
>>
>> It's not that I'm extremely opposed to adding a(nother) test block driver,
>> but we at least need some sort of reasoning behind why, which isn't just
>> "not a good fit".
> 
> Ah, I thought the 'null' of null_blk means we do nothing for the
> disks. Of course we can rename it, which means this point less
> meaningful. I think the main reason is the interface. We will
> configure the disks with different parameters and do power on/off for
> each disks (which is the key we can emulate disk cache and power
> loss). The module paramter interface of null_blk doesn't work for the
> usage. Of course, these issues can be fixed, for example, we can make
> null_blk use the configfs interface. If you really prefer a single
> driver for all test purpose, I can move the test_blk functionalities
> to null_blk.

The idea with null_blk is just that it's a test vehicle. As such, it
would actually be useful to have a mode where it does store the data in
RAM, since that enables you to do other kinds of testing as well. I'd be
fine with augmenting it with configfs for certain things.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/5] block: a virtual block device driver for testing
  2017-08-08 21:13         ` Jens Axboe
@ 2017-08-08 22:00             ` Bart Van Assche
  0 siblings, 0 replies; 18+ messages in thread
From: Bart Van Assche @ 2017-08-08 22:00 UTC (permalink / raw)
  To: shli, axboe
  Cc: linux-raid, kernel-team, hare, linux-block, kyungchan.koh, shli

On Tue, 2017-08-08 at 15:13 -0600, Jens Axboe wrote:
> On 08/08/2017 03:05 PM, Shaohua Li wrote:
> > > I'm curious why null_blk isn't a good fit? You'd just need to add RAM
> > > storage to it. That would just be a separate option that should be
> > > set,
> > > ram_backing=1 or something like that. That would make it less critical
> > > than using the RAM disk driver as well, since only people that want a
> > > "real"
> > > data backing would enable it.
> > > 
> > > It's not that I'm extremely opposed to adding a(nother) test block
> > > driver,
> > > but we at least need some sort of reasoning behind why, which isn't
> > > just
> > > "not a good fit".
> > 
> > Ah, I thought the 'null' of null_blk means we do nothing for the
> > disks. Of course we can rename it, which means this point less
> > meaningful. I think the main reason is the interface. We will
> > configure the disks with different parameters and do power on/off for
> > each disks (which is the key we can emulate disk cache and power
> > loss). The module paramter interface of null_blk doesn't work for the
> > usage. Of course, these issues can be fixed, for example, we can make
> > null_blk use the configfs interface. If you really prefer a single
> > driver for all test purpose, I can move the test_blk functionalities
> > to null_blk.
> 
> The idea with null_blk is just that it's a test vehicle. As such, it
> would actually be useful to have a mode where it does store the data in
> RAM, since that enables you to do other kinds of testing as well. I'd be
> fine with augmenting it with configfs for certain things.

Hello Jens,

Would you consider it acceptable to make the mode in which null_blk stores
data the default? I know several people who got confused by null_blk by
default not retaining data ...

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/5] block: a virtual block device driver for testing
  2017-08-08 22:00             ` Bart Van Assche
  (?)
@ 2017-08-08 22:07             ` Jens Axboe
  -1 siblings, 0 replies; 18+ messages in thread
From: Jens Axboe @ 2017-08-08 22:07 UTC (permalink / raw)
  To: Bart Van Assche, shli
  Cc: linux-raid, kernel-team, hare, linux-block, kyungchan.koh, shli

On 08/08/2017 04:00 PM, Bart Van Assche wrote:
> On Tue, 2017-08-08 at 15:13 -0600, Jens Axboe wrote:
>> On 08/08/2017 03:05 PM, Shaohua Li wrote:
>>>> I'm curious why null_blk isn't a good fit? You'd just need to add RAM
>>>> storage to it. That would just be a separate option that should be
>>>> set,
>>>> ram_backing=1 or something like that. That would make it less critical
>>>> than using the RAM disk driver as well, since only people that want a
>>>> "real"
>>>> data backing would enable it.
>>>>
>>>> It's not that I'm extremely opposed to adding a(nother) test block
>>>> driver,
>>>> but we at least need some sort of reasoning behind why, which isn't
>>>> just
>>>> "not a good fit".
>>>
>>> Ah, I thought the 'null' of null_blk means we do nothing for the
>>> disks. Of course we can rename it, which means this point less
>>> meaningful. I think the main reason is the interface. We will
>>> configure the disks with different parameters and do power on/off for
>>> each disks (which is the key we can emulate disk cache and power
>>> loss). The module paramter interface of null_blk doesn't work for the
>>> usage. Of course, these issues can be fixed, for example, we can make
>>> null_blk use the configfs interface. If you really prefer a single
>>> driver for all test purpose, I can move the test_blk functionalities
>>> to null_blk.
>>
>> The idea with null_blk is just that it's a test vehicle. As such, it
>> would actually be useful to have a mode where it does store the data in
>> RAM, since that enables you to do other kinds of testing as well. I'd be
>> fine with augmenting it with configfs for certain things.
> 
> Hello Jens,
> 
> Would you consider it acceptable to make the mode in which null_blk stores
> data the default? I know several people who got confused by null_blk by
> default not retaining data ...

I don't think we should change the default, since that'll then upset
people that currently use it and all of a sudden see a different
performance profile. It's called null_blk, and the device node is
/dev/nullb0. I think either one of those should reasonably set
expectations for the user that it doesn't really store your data at all.

We could add a module info blurb on it not storing data, I see we don't
have that. The initial commit said:

commit f2298c0403b0dfcaef637eba0c02c4a06d7a25ab
Author: Jens Axboe <axboe@kernel.dk>
Date:   Fri Oct 25 11:52:25 2013 +0100

    null_blk: multi queue aware block test driver
    
    A driver that simply completes IO it receives, it does no
    transfers. Written to fascilitate testing of the blk-mq code.
    It supports various module options to use either bio queueing,
    rq queueing, or mq mode.

We should just add a MODULE_INFO with a short, similarly worded text.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/5] block: a virtual block device driver for testing
  2017-08-08 22:00             ` Bart Van Assche
  (?)
  (?)
@ 2017-08-08 22:08             ` Omar Sandoval
  -1 siblings, 0 replies; 18+ messages in thread
From: Omar Sandoval @ 2017-08-08 22:08 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: shli, axboe, linux-raid, kernel-team, hare, linux-block,
	kyungchan.koh, shli

On Tue, Aug 08, 2017 at 10:00:21PM +0000, Bart Van Assche wrote:
> On Tue, 2017-08-08 at 15:13 -0600, Jens Axboe wrote:
> > On 08/08/2017 03:05 PM, Shaohua Li wrote:
> > > > I'm curious why null_blk isn't a good fit? You'd just need to add RAM
> > > > storage to it. That would just be a separate option that should be
> > > > set,
> > > > ram_backing=1 or something like that. That would make it less critical
> > > > than using the RAM disk driver as well, since only people that want a
> > > > "real"
> > > > data backing would enable it.
> > > > 
> > > > It's not that I'm extremely opposed to adding a(nother) test block
> > > > driver,
> > > > but we at least need some sort of reasoning behind why, which isn't
> > > > just
> > > > "not a good fit".
> > > 
> > > Ah, I thought the 'null' of null_blk means we do nothing for the
> > > disks. Of course we can rename it, which means this point less
> > > meaningful. I think the main reason is the interface. We will
> > > configure the disks with different parameters and do power on/off for
> > > each disks (which is the key we can emulate disk cache and power
> > > loss). The module paramter interface of null_blk doesn't work for the
> > > usage. Of course, these issues can be fixed, for example, we can make
> > > null_blk use the configfs interface. If you really prefer a single
> > > driver for all test purpose, I can move the test_blk functionalities
> > > to null_blk.
> > 
> > The idea with null_blk is just that it's a test vehicle. As such, it
> > would actually be useful to have a mode where it does store the data in
> > RAM, since that enables you to do other kinds of testing as well. I'd be
> > fine with augmenting it with configfs for certain things.
> 
> Hello Jens,
> 
> Would you consider it acceptable to make the mode in which null_blk stores
> data the default? I know several people who got confused by null_blk by
> default not retaining data ...

My minor issue with that is that I'd have to change all of my
performance testing scripts to override the default :) I'd prefer that
it stay as-is.

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2017-08-08 22:08 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-05 15:51 [PATCH 0/5] block: a virtual block device driver for testing Shaohua Li
2017-08-05 15:51 ` [PATCH 1/5] testb: add interface Shaohua Li
2017-08-05 15:51 ` [PATCH 2/5] testb: implement block device operations Shaohua Li
2017-08-08 20:34   ` Jens Axboe
2017-08-05 15:51 ` [PATCH 3/5] testb: implement bandwidth control Shaohua Li
2017-08-05 15:51 ` [PATCH 4/5] testb: emulate disk cache Shaohua Li
2017-08-05 15:51 ` [PATCH 5/5] testb: badblock support Shaohua Li
2017-08-06  1:56   ` Dan Williams
2017-08-07  4:39     ` Shaohua Li
2017-08-07  8:29 ` [PATCH 0/5] block: a virtual block device driver for testing Hannes Reinecke
2017-08-07 16:36   ` Shaohua Li
2017-08-08 20:31     ` Jens Axboe
2017-08-08 21:05       ` Shaohua Li
2017-08-08 21:13         ` Jens Axboe
2017-08-08 22:00           ` Bart Van Assche
2017-08-08 22:07             ` Jens Axboe
2017-08-08 22:08             ` Omar Sandoval
