* [RFC 0/3]block: An IOPS based ioscheduler
@ 2012-01-04  6:53 Shaohua Li
  2012-01-04  6:53 ` [RFC 1/3]block: separate CFQ io context management code Shaohua Li
                   ` (5 more replies)
  0 siblings, 6 replies; 29+ messages in thread
From: Shaohua Li @ 2012-01-04  6:53 UTC (permalink / raw)
  To: linux-kernel; +Cc: axboe, vgoyal, jmoyer

An IOPS-based I/O scheduler

Flash-based storage has characteristics quite different from rotating disks:
1. No I/O seek penalty.
2. Read and write I/O costs usually differ significantly.
3. The time a request takes depends on the request size.
4. High throughput and IOPS, low latency.

CFQ iosched does well for rotating disks, for example fair dispatching and
idling for sequential reads. It also has optimizations for flash-based
storage (for item 1 above), but overall it isn't designed for flash. It's a
slice-based algorithm. Since request cost on flash-based storage is very low,
and drives with a big queue_depth are quite common now, which makes dispatch
cost even lower, CFQ's jiffy-based slice accounting doesn't work well. CFQ
also doesn't consider items 2 and 3 above.

The FIOPS (Fair IOPS) ioscheduler tries to fill these gaps. It is IOPS
based, so it only targets drives without I/O seek. It is quite similar to
CFQ, but the dispatch decision is made according to IOPS instead of time
slices.

The algorithm is simple. Each drive has a service tree, and each task lives
in the tree. The key into the tree is called vios (virtual I/O). Every
request has a vios, which is calculated according to its ioprio, request
size and so on. A task's vios is the sum of the vios of all requests it
dispatches. FIOPS always selects the task with the minimum vios in the
service tree and lets that task dispatch a request. The dispatched request's
vios is then added to the task's vios and the task is repositioned in the
service tree. A toy model of this accounting is sketched below.
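
For illustration only (not part of the patches): a tiny user-space model of
the accounting above, with made-up per-request costs and a linear scan in
place of the service tree. The task with the smaller accumulated vios is
always served next, so a task whose requests cost 3x accumulates vios faster
and ends up with roughly 1/3 of the dispatches.

#include <stdint.h>
#include <stdio.h>

struct task {
	const char *name;
	uint64_t vios;			/* accumulated virtual I/O cost */
};

/* FIOPS keeps tasks in an rbtree keyed by vios; a linear scan is enough
 * for this toy */
static struct task *min_vios_task(struct task *tasks, int n)
{
	struct task *best = &tasks[0];
	int i;

	for (i = 1; i < n; i++)
		if (tasks[i].vios < best->vios)
			best = &tasks[i];
	return best;
}

int main(void)
{
	/* made-up costs: 1 unit per "reader" request, 3 units per "writer"
	 * request */
	struct task tasks[2] = { { "reader", 0 }, { "writer", 0 } };
	int i;

	for (i = 0; i < 8; i++) {
		struct task *t = min_vios_task(tasks, 2);
		uint64_t cost = (t == &tasks[1]) ? 3 : 1;

		printf("dispatch from %s, vios %llu -> %llu\n", t->name,
		       (unsigned long long)t->vios,
		       (unsigned long long)(t->vios + cost));
		t->vios += cost;
	}
	return 0;
}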

The series is organized as follows:
Patch 1: separate CFQ's io context management code. FIOPS will use it too.
Patch 2: the core FIOPS scheduler.
Patch 3: request read/write vios scaling, which demonstrates how the vios is
scaled.

To keep the code simple for review, some of the scaling code isn't included
here, and some isn't implemented yet.

TODO:
1. ioprio support (patch already exists)
2. request size vios scaling
3. cgroup support
4. tracing support
5. automatically select the default iosched according to QUEUE_FLAG_NONROT.

Comments and suggestions are welcome!

Thanks,
Shaohua


* [RFC 1/3]block: separate CFQ io context management code
  2012-01-04  6:53 [RFC 0/3]block: An IOPS based ioscheduler Shaohua Li
@ 2012-01-04  6:53 ` Shaohua Li
  2012-01-04  8:19   ` Namhyung Kim
  2012-01-04  6:53 ` [RFC 2/3]block: FIOPS ioscheduler core Shaohua Li
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 29+ messages in thread
From: Shaohua Li @ 2012-01-04  6:53 UTC (permalink / raw)
  To: linux-kernel; +Cc: axboe, vgoyal, jmoyer, Shaohua Li

[-- Attachment #1: separate-cfq-ioc.patch --]
[-- Type: text/plain, Size: 48325 bytes --]

CFQ's io context management creates a per-device io context for each task.
It's quite generic. Separate it from CFQ so the FIOPS I/O scheduler can use
it as well; a short usage sketch follows.
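
A rough sketch (illustration only, not part of the patch) of how another
I/O scheduler would plug into this code. The "foo" names and foo_ioc_pool
slab are hypothetical; the hooks and helpers are the ones declared in
block/blk.h below.

struct foo_io_context {			/* hypothetical per-device context */
	struct dev_io_context dev_ioc;	/* embeds the generic part */
	/* scheduler-specific fields ... */
};

static struct kmem_cache *foo_ioc_pool;	/* hypothetical slab */

static struct dev_io_context *foo_alloc_ioc(struct ioc_builder *builder,
	struct queue_data *qdata, gfp_t gfp_mask)
{
	struct foo_io_context *ioc = kmem_cache_alloc_node(foo_ioc_pool,
					gfp_mask, qdata->queue->node);

	return ioc ? &ioc->dev_ioc : NULL;
}

static void foo_free_ioc(struct ioc_builder *builder,
	struct dev_io_context *dev_ioc)
{
	kmem_cache_free(foo_ioc_pool,
			container_of(dev_ioc, struct foo_io_context, dev_ioc));
}

static struct ioc_builder foo_builder = {
	.alloc_ioc	= foo_alloc_ioc,
	.free_ioc	= foo_free_ioc,
};

/*
 * Then call ioc_builder_init(&foo_builder) once at module init,
 * ioc_builder_init_queue(&foo_builder, &foo_data->qdata, q) from the
 * elevator's init_queue, and queue_data_get_io_context(&foo_builder,
 * &foo_data->qdata, gfp_mask) wherever a per-task context is needed.
 */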

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
---
 block/blk-ioc.c           |  474 ++++++++++++++++++++++++++++++++++-
 block/blk.h               |   55 ++++
 block/cfq-iosched.c       |  614 ++++++++++------------------------------------
 include/linux/iocontext.h |   30 +-
 4 files changed, 683 insertions(+), 490 deletions(-)

Index: linux/block/blk-ioc.c
===================================================================
--- linux.orig/block/blk-ioc.c	2011-12-27 16:13:02.000000000 +0800
+++ linux/block/blk-ioc.c	2011-12-28 09:42:18.000000000 +0800
@@ -8,6 +8,7 @@
 #include <linux/blkdev.h>
 #include <linux/bootmem.h>	/* for max_pfn/max_low_pfn */
 #include <linux/slab.h>
+#include <linux/idr.h>
 
 #include "blk.h"
 
@@ -16,12 +17,12 @@
  */
 static struct kmem_cache *iocontext_cachep;
 
-static void cfq_dtor(struct io_context *ioc)
+static void queue_data_dtor(struct io_context *ioc)
 {
 	if (!hlist_empty(&ioc->cic_list)) {
-		struct cfq_io_context *cic;
+		struct dev_io_context *cic;
 
-		cic = hlist_entry(ioc->cic_list.first, struct cfq_io_context,
+		cic = hlist_entry(ioc->cic_list.first, struct dev_io_context,
 								cic_list);
 		cic->dtor(ioc);
 	}
@@ -40,7 +41,7 @@ int put_io_context(struct io_context *io
 
 	if (atomic_long_dec_and_test(&ioc->refcount)) {
 		rcu_read_lock();
-		cfq_dtor(ioc);
+		queue_data_dtor(ioc);
 		rcu_read_unlock();
 
 		kmem_cache_free(iocontext_cachep, ioc);
@@ -50,14 +51,14 @@ int put_io_context(struct io_context *io
 }
 EXPORT_SYMBOL(put_io_context);
 
-static void cfq_exit(struct io_context *ioc)
+static void queue_data_exit(struct io_context *ioc)
 {
 	rcu_read_lock();
 
 	if (!hlist_empty(&ioc->cic_list)) {
-		struct cfq_io_context *cic;
+		struct dev_io_context *cic;
 
-		cic = hlist_entry(ioc->cic_list.first, struct cfq_io_context,
+		cic = hlist_entry(ioc->cic_list.first, struct dev_io_context,
 								cic_list);
 		cic->exit(ioc);
 	}
@@ -75,7 +76,7 @@ void exit_io_context(struct task_struct
 	task_unlock(task);
 
 	if (atomic_dec_and_test(&ioc->nr_tasks))
-		cfq_exit(ioc);
+		queue_data_exit(ioc);
 
 	put_io_context(ioc);
 }
@@ -162,3 +163,460 @@ static int __init blk_ioc_init(void)
 	return 0;
 }
 subsys_initcall(blk_ioc_init);
+
+#if IS_ENABLED(CONFIG_IOSCHED_CFQ)
+#define CIC_DEAD_INDEX_SHIFT	1
+
+static inline void *queue_data_dead_key(struct queue_data *qdata)
+{
+	return (void *)(qdata->cic_index << CIC_DEAD_INDEX_SHIFT | CIC_DEAD_KEY);
+}
+
+int ioc_builder_init(struct ioc_builder *builder)
+{
+	if (!builder->alloc_ioc || !builder->free_ioc)
+		return -ENOMEM;
+
+	builder->ioc_count = alloc_percpu(unsigned long);
+	if (!builder->ioc_count)
+		return -ENOMEM;
+
+	builder->ioc_gone = NULL;
+	spin_lock_init(&builder->ioc_gone_lock);
+
+	return 0;
+}
+EXPORT_SYMBOL(ioc_builder_init);
+
+void io_context_builder_exit(struct ioc_builder *builder)
+{
+	DECLARE_COMPLETION_ONSTACK(all_gone);
+
+	builder->ioc_gone = &all_gone;
+	/* ioc_gone's update must be visible before reading ioc_count */
+	smp_wmb();
+
+	/*
+	 * this also protects us from entering cfq_slab_kill() with
+	 * pending RCU callbacks
+	 */
+	if (elv_ioc_count_read(*builder->ioc_count))
+		wait_for_completion(&all_gone);
+
+	free_percpu(builder->ioc_count);
+}
+EXPORT_SYMBOL(io_context_builder_exit);
+
+static DEFINE_SPINLOCK(cic_index_lock);
+static DEFINE_IDA(cic_index_ida);
+static int builder_alloc_cic_index(struct ioc_builder *builder)
+{
+	int index, error;
+	unsigned long flags;
+
+	do {
+		if (!ida_pre_get(&cic_index_ida, GFP_KERNEL))
+			return -ENOMEM;
+
+		spin_lock_irqsave(&cic_index_lock, flags);
+		error = ida_get_new(&cic_index_ida, &index);
+		spin_unlock_irqrestore(&cic_index_lock, flags);
+		if (error && error != -EAGAIN)
+			return error;
+	} while (error);
+
+	return index;
+}
+
+static void builder_free_cic_index(struct ioc_builder *builder, int index)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&cic_index_lock, flags);
+	ida_remove(&cic_index_ida, index);
+	spin_unlock_irqrestore(&cic_index_lock, flags);
+}
+
+int ioc_builder_init_queue(struct ioc_builder *builder,
+	struct queue_data *qdata, struct request_queue *q)
+{
+	/*
+	 * Don't need take queue_lock in the routine, since we are
+	 * initializing the ioscheduler, and nobody is using qdata
+	 */
+	qdata->cic_index = builder_alloc_cic_index(builder);
+	if (qdata->cic_index < 0)
+		return -ENOMEM;
+
+	qdata->queue = q;
+	INIT_LIST_HEAD(&qdata->cic_list);
+
+	return 0;
+}
+EXPORT_SYMBOL(ioc_builder_init_queue);
+
+/*
+ * Call func for each cic attached to this ioc.
+ */
+static void
+call_for_each_cic(struct io_context *ioc,
+		  void (*func)(struct io_context *, struct dev_io_context *))
+{
+	struct dev_io_context *cic;
+	struct hlist_node *n;
+
+	rcu_read_lock();
+
+	hlist_for_each_entry_rcu(cic, n, &ioc->cic_list, cic_list)
+		func(ioc, cic);
+
+	rcu_read_unlock();
+}
+
+static void queue_data_cic_free_rcu(struct rcu_head *head)
+{
+	struct dev_io_context *cic;
+	struct ioc_builder *builder;
+
+	cic = container_of(head, struct dev_io_context, rcu_head);
+	builder = cic->builder;
+
+	builder->free_ioc(builder, cic);
+	elv_ioc_count_dec(*builder->ioc_count);
+
+	if (builder->ioc_gone) {
+		/*
+		 * CFQ scheduler is exiting, grab exit lock and check
+		 * the pending io context count. If it hits zero,
+		 * complete ioc_gone and set it back to NULL
+		 */
+		spin_lock(&builder->ioc_gone_lock);
+		if (builder->ioc_gone &&
+		    !elv_ioc_count_read(*builder->ioc_count)) {
+			complete(builder->ioc_gone);
+			builder->ioc_gone = NULL;
+		}
+		spin_unlock(&builder->ioc_gone_lock);
+	}
+}
+
+static void queue_data_cic_free(struct dev_io_context *cic)
+{
+	call_rcu(&cic->rcu_head, queue_data_cic_free_rcu);
+}
+
+static void cic_free_func(struct io_context *ioc, struct dev_io_context *cic)
+{
+	unsigned long flags;
+	unsigned long dead_key = (unsigned long) cic->key;
+
+	BUG_ON(!(dead_key & CIC_DEAD_KEY));
+
+	spin_lock_irqsave(&ioc->lock, flags);
+	radix_tree_delete(&ioc->radix_root, dead_key >> CIC_DEAD_INDEX_SHIFT);
+	hlist_del_rcu(&cic->cic_list);
+	spin_unlock_irqrestore(&ioc->lock, flags);
+
+	queue_data_cic_free(cic);
+}
+
+/*
+ * Must be called with rcu_read_lock() held or preemption otherwise disabled.
+ * Only two callers of this - ->dtor() which is called with the rcu_read_lock(),
+ * and ->trim() which is called with the task lock held
+ */
+void queue_data_free_io_context(struct io_context *ioc)
+{
+	/*
+	 * ioc->refcount is zero here, or we are called from elv_unregister(),
+	 * so no more cic's are allowed to be linked into this ioc.  So it
+	 * should be ok to iterate over the known list, we will see all cic's
+	 * since no new ones are added.
+	 */
+	call_for_each_cic(ioc, cic_free_func);
+}
+EXPORT_SYMBOL(queue_data_free_io_context);
+
+static void __queue_data_exit_single_io_context(struct queue_data *qdata,
+					 struct dev_io_context *cic)
+{
+	struct io_context *ioc = cic->ioc;
+	struct ioc_builder *builder = cic->builder;
+
+	list_del_init(&cic->queue_list);
+
+	/*
+	 * Make sure dead mark is seen for dead queues
+	 */
+	smp_wmb();
+	cic->key = queue_data_dead_key(qdata);
+
+	rcu_read_lock();
+	if (rcu_dereference(ioc->ioc_data) == cic) {
+		rcu_read_unlock();
+		spin_lock(&ioc->lock);
+		rcu_assign_pointer(ioc->ioc_data, NULL);
+		spin_unlock(&ioc->lock);
+	} else
+		rcu_read_unlock();
+
+	if (builder->cic_exit)
+		builder->cic_exit(qdata, cic);
+}
+
+/* with request_queue lock hold */
+void ioc_builder_exit_queue(struct ioc_builder *builder,
+	struct queue_data *qdata)
+{
+	while (!list_empty(&qdata->cic_list)) {
+		struct dev_io_context *cic = list_entry(qdata->cic_list.next,
+							struct dev_io_context,
+							queue_list);
+
+		__queue_data_exit_single_io_context(qdata, cic);
+	}
+
+	builder_free_cic_index(builder, qdata->cic_index);
+}
+EXPORT_SYMBOL(ioc_builder_exit_queue);
+
+static void queue_data_exit_single_io_context(struct io_context *ioc,
+				       struct dev_io_context *cic)
+{
+	struct queue_data *qdata = cic_to_queue_data(cic);
+
+	if (qdata) {
+		struct request_queue *q = qdata->queue;
+		unsigned long flags;
+
+		spin_lock_irqsave(q->queue_lock, flags);
+
+		/*
+		 * Ensure we get a fresh copy of the ->key to prevent
+		 * race between exiting task and queue
+		 */
+		smp_read_barrier_depends();
+		if (cic->key == qdata)
+			__queue_data_exit_single_io_context(qdata, cic);
+
+		spin_unlock_irqrestore(q->queue_lock, flags);
+	}
+}
+
+/*
+ * The process that ioc belongs to has exited, we need to clean up
+ * and put the internal structures we have that belongs to that process.
+ */
+static void queue_data_exit_io_context(struct io_context *ioc)
+{
+	call_for_each_cic(ioc, queue_data_exit_single_io_context);
+}
+
+static struct dev_io_context *
+queue_data_alloc_io_context(struct ioc_builder *builder,
+	struct queue_data *qdata, gfp_t gfp_mask)
+{
+	struct dev_io_context *cic;
+
+	cic = builder->alloc_ioc(builder, qdata, gfp_mask | __GFP_ZERO);
+
+	if (cic) {
+		cic->builder = builder;
+		if (builder->cic_init)
+			builder->cic_init(qdata, cic);
+		INIT_LIST_HEAD(&cic->queue_list);
+		INIT_HLIST_NODE(&cic->cic_list);
+		cic->dtor = queue_data_free_io_context;
+		cic->exit = queue_data_exit_io_context;
+		elv_ioc_count_inc(*builder->ioc_count);
+	}
+
+	return cic;
+}
+
+/*
+ * We drop dev io contexts lazily, so we may find a dead one.
+ */
+static void
+queue_data_drop_dead_cic(struct queue_data *queue_data, struct io_context *ioc,
+		  struct dev_io_context *cic)
+{
+	unsigned long flags;
+
+	WARN_ON(!list_empty(&cic->queue_list));
+	BUG_ON(cic->key != queue_data_dead_key(queue_data));
+
+	spin_lock_irqsave(&ioc->lock, flags);
+
+	BUG_ON(rcu_dereference_check(ioc->ioc_data,
+		lockdep_is_held(&ioc->lock)) == cic);
+
+	radix_tree_delete(&ioc->radix_root, queue_data->cic_index);
+	hlist_del_rcu(&cic->cic_list);
+	spin_unlock_irqrestore(&ioc->lock, flags);
+
+	queue_data_cic_free(cic);
+}
+
+struct dev_io_context *
+queue_data_cic_lookup(struct queue_data *qdata, struct io_context *ioc)
+{
+	struct dev_io_context *cic;
+	unsigned long flags;
+
+	if (unlikely(!ioc))
+		return NULL;
+
+	rcu_read_lock();
+
+	/*
+	 * we maintain a last-hit cache, to avoid browsing over the tree
+	 */
+	cic = rcu_dereference(ioc->ioc_data);
+	if (cic && cic->key == qdata) {
+		rcu_read_unlock();
+		return cic;
+	}
+
+	do {
+		cic = radix_tree_lookup(&ioc->radix_root, qdata->cic_index);
+		rcu_read_unlock();
+		if (!cic)
+			break;
+		if (unlikely(cic->key != qdata)) {
+			queue_data_drop_dead_cic(qdata, ioc, cic);
+			rcu_read_lock();
+			continue;
+		}
+
+		spin_lock_irqsave(&ioc->lock, flags);
+		rcu_assign_pointer(ioc->ioc_data, cic);
+		spin_unlock_irqrestore(&ioc->lock, flags);
+		break;
+	} while (1);
+
+	return cic;
+}
+EXPORT_SYMBOL(queue_data_cic_lookup);
+
+/*
+ * Add cic into ioc, using qdata as the search key. This enables us to lookup
+ * the process specific dev io context when entered from the block layer.
+ * Also adds the cic to a per-qdata list, used when this queue is removed.
+ */
+static int queue_data_cic_link(struct queue_data *qdata,
+	struct io_context *ioc, struct dev_io_context *cic, gfp_t gfp_mask)
+{
+	unsigned long flags;
+	int ret;
+
+	ret = radix_tree_preload(gfp_mask);
+	if (!ret) {
+		cic->ioc = ioc;
+		cic->key = qdata;
+
+		spin_lock_irqsave(&ioc->lock, flags);
+		ret = radix_tree_insert(&ioc->radix_root,
+					qdata->cic_index, cic);
+		if (!ret)
+			hlist_add_head_rcu(&cic->cic_list, &ioc->cic_list);
+		spin_unlock_irqrestore(&ioc->lock, flags);
+
+		radix_tree_preload_end();
+
+		if (!ret) {
+			spin_lock_irqsave(qdata->queue->queue_lock, flags);
+			list_add(&cic->queue_list, &qdata->cic_list);
+			spin_unlock_irqrestore(qdata->queue->queue_lock, flags);
+		}
+	}
+
+	if (ret && ret != -EEXIST)
+		printk(KERN_ERR "block: cic link failed!\n");
+
+	return ret;
+}
+
+static void changed_ioprio(struct io_context *ioc,
+	struct dev_io_context *gen_cic)
+{
+	struct ioc_builder *builder = gen_cic->builder;
+	if (builder->changed_ioprio)
+		builder->changed_ioprio(ioc, gen_cic);
+}
+
+static void queue_data_ioc_set_ioprio(struct io_context *ioc)
+{
+	call_for_each_cic(ioc, changed_ioprio);
+	ioc->ioprio_changed = 0;
+}
+
+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+static void changed_cgroup(struct io_context *ioc,
+	struct dev_io_context *gen_cic)
+{
+	struct ioc_builder *builder = gen_cic->builder;
+	if (builder->changed_cgroup)
+		builder->changed_cgroup(ioc, gen_cic);
+}
+
+static void queue_data_ioc_set_cgroup(struct io_context *ioc)
+{
+	call_for_each_cic(ioc, changed_cgroup);
+	ioc->cgroup_changed = 0;
+}
+#endif  /* CONFIG_CFQ_GROUP_IOSCHED */
+
+/*
+ * Setup general io context and dev io context. There can be several
+ * dev io contexts per general io context, if this process is doing io to more
+ * than one device managed by elevator.
+ */
+struct dev_io_context *queue_data_get_io_context(struct ioc_builder *builder,
+	struct queue_data *qdata, gfp_t gfp_mask)
+{
+	struct io_context *ioc = NULL;
+	struct dev_io_context *cic;
+	int ret;
+
+	might_sleep_if(gfp_mask & __GFP_WAIT);
+
+	ioc = get_io_context(gfp_mask, qdata->queue->node);
+	if (!ioc)
+		return NULL;
+
+retry:
+	cic = queue_data_cic_lookup(qdata, ioc);
+	if (cic)
+		goto out;
+
+	cic = queue_data_alloc_io_context(builder, qdata, gfp_mask);
+	if (cic == NULL)
+		goto err;
+
+	ret = queue_data_cic_link(qdata, ioc, cic, gfp_mask);
+	if (ret == -EEXIST) {
+		/* someone has linked cic to ioc already */
+		queue_data_cic_free(cic);
+		goto retry;
+	} else if (ret)
+		goto err_free;
+
+out:
+	smp_read_barrier_depends();
+	if (unlikely(ioc->ioprio_changed))
+		queue_data_ioc_set_ioprio(ioc);
+
+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+	if (unlikely(ioc->cgroup_changed))
+		queue_data_ioc_set_cgroup(ioc);
+#endif
+	return cic;
+err_free:
+	queue_data_cic_free(cic);
+err:
+	put_io_context(ioc);
+	return NULL;
+}
+EXPORT_SYMBOL(queue_data_get_io_context);
+#endif
Index: linux/block/blk.h
===================================================================
--- linux.orig/block/blk.h	2011-12-27 16:13:02.000000000 +0800
+++ linux/block/blk.h	2011-12-28 09:42:18.000000000 +0800
@@ -206,4 +206,59 @@ static inline void blk_throtl_exit(struc
 static inline void blk_throtl_release(struct request_queue *q) { }
 #endif /* CONFIG_BLK_DEV_THROTTLING */
 
+#if IS_ENABLED(CONFIG_IOSCHED_CFQ)
+struct queue_data;
+struct ioc_builder {
+	struct dev_io_context *(*alloc_ioc)(struct ioc_builder *builder,
+		struct queue_data *qdata, gfp_t gfp_mask);
+	void (*free_ioc)(struct ioc_builder *builder,
+		struct dev_io_context *dev_ioc);
+
+	void (*changed_ioprio)(struct io_context *ioc,
+		struct dev_io_context *cic);
+	void (*changed_cgroup)(struct io_context *ioc,
+		struct dev_io_context *cic);
+	void (*cic_exit)(struct queue_data *qdata,
+		struct dev_io_context *gen_cic);
+	void (*cic_init)(struct queue_data *qdata,
+		struct dev_io_context *gen_cic);
+
+	unsigned long __percpu *ioc_count;
+	struct completion *ioc_gone;
+	spinlock_t ioc_gone_lock;
+};
+
+struct queue_data {
+	struct request_queue *queue;
+
+	unsigned int cic_index;
+	struct list_head cic_list;
+};
+
+#define CIC_DEAD_KEY	1ul
+static inline struct queue_data *cic_to_queue_data(struct dev_io_context *cic)
+{
+	struct queue_data *qdata = cic->key;
+
+	if (unlikely((unsigned long) qdata & CIC_DEAD_KEY))
+		return NULL;
+
+	return qdata;
+}
+
+int ioc_builder_init(struct ioc_builder *builder);
+void io_context_builder_exit(struct ioc_builder *builder);
+
+int ioc_builder_init_queue(struct ioc_builder *builder,
+	struct queue_data *qdata, struct request_queue *q);
+void ioc_builder_exit_queue(struct ioc_builder *builder,
+	struct queue_data *qdata);
+
+struct dev_io_context *queue_data_get_io_context(struct ioc_builder *builder,
+	struct queue_data *qdata, gfp_t gfp_mask);
+struct dev_io_context *queue_data_cic_lookup(struct queue_data *qdata,
+	struct io_context *ioc);
+void queue_data_free_io_context(struct io_context *ioc);
+#endif
+
 #endif /* BLK_INTERNAL_H */
Index: linux/block/cfq-iosched.c
===================================================================
--- linux.orig/block/cfq-iosched.c	2011-12-27 16:13:02.000000000 +0800
+++ linux/block/cfq-iosched.c	2011-12-28 09:12:06.000000000 +0800
@@ -14,6 +14,7 @@
 #include <linux/rbtree.h>
 #include <linux/ioprio.h>
 #include <linux/blktrace_api.h>
+#include "blk.h"
 #include "cfq.h"
 
 /*
@@ -60,13 +61,7 @@ static const int cfq_hist_divisor = 4;
 
 static struct kmem_cache *cfq_pool;
 static struct kmem_cache *cfq_ioc_pool;
-
-static DEFINE_PER_CPU(unsigned long, cfq_ioc_count);
-static struct completion *ioc_gone;
-static DEFINE_SPINLOCK(ioc_gone_lock);
-
-static DEFINE_SPINLOCK(cic_index_lock);
-static DEFINE_IDA(cic_index_ida);
+static struct ioc_builder ioc_builder;
 
 #define CFQ_PRIO_LISTS		IOPRIO_BE_NR
 #define cfq_class_idle(cfqq)	((cfqq)->ioprio_class == IOPRIO_CLASS_IDLE)
@@ -220,7 +215,8 @@ struct cfq_group {
  * Per block device queue structure
  */
 struct cfq_data {
-	struct request_queue *queue;
+	struct queue_data qdata;
+
 	/* Root service tree for cfq_groups */
 	struct cfq_rb_root grp_service_tree;
 	struct cfq_group root_group;
@@ -290,9 +286,6 @@ struct cfq_data {
 	unsigned int cfq_group_idle;
 	unsigned int cfq_latency;
 
-	unsigned int cic_index;
-	struct list_head cic_list;
-
 	/*
 	 * Fallback dummy cfqq for extreme OOM conditions
 	 */
@@ -306,6 +299,10 @@ struct cfq_data {
 	/* Number of groups which are on blkcg->blkg_list */
 	unsigned int nr_blkcg_linked_grps;
 };
+#define queue_data_to_cfqd(ptr) \
+	container_of(ptr, struct cfq_data, qdata)
+#define dev_ioc_to_cfq_ioc(ptr) \
+	container_of(ptr, struct cfq_io_context, dev_ioc)
 
 static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);
 
@@ -369,21 +366,21 @@ CFQ_CFQQ_FNS(wait_busy);
 
 #ifdef CONFIG_CFQ_GROUP_IOSCHED
 #define cfq_log_cfqq(cfqd, cfqq, fmt, args...)	\
-	blk_add_trace_msg((cfqd)->queue, "cfq%d%c %s " fmt, (cfqq)->pid, \
+	blk_add_trace_msg((cfqd)->qdata.queue, "cfq%d%c %s " fmt, (cfqq)->pid, \
 			cfq_cfqq_sync((cfqq)) ? 'S' : 'A', \
 			blkg_path(&(cfqq)->cfqg->blkg), ##args)
 
 #define cfq_log_cfqg(cfqd, cfqg, fmt, args...)				\
-	blk_add_trace_msg((cfqd)->queue, "%s " fmt,			\
+	blk_add_trace_msg((cfqd)->qdata.queue, "%s " fmt,		\
 				blkg_path(&(cfqg)->blkg), ##args)       \
 
 #else
 #define cfq_log_cfqq(cfqd, cfqq, fmt, args...)	\
-	blk_add_trace_msg((cfqd)->queue, "cfq%d " fmt, (cfqq)->pid, ##args)
+	blk_add_trace_msg((cfqd)->qdata.queue, "cfq%d " fmt, (cfqq)->pid, ##args)
 #define cfq_log_cfqg(cfqd, cfqg, fmt, args...)		do {} while (0)
 #endif
 #define cfq_log(cfqd, fmt, args...)	\
-	blk_add_trace_msg((cfqd)->queue, "cfq " fmt, ##args)
+	blk_add_trace_msg((cfqd)->qdata.queue, "cfq " fmt, ##args)
 
 /* Traverses through cfq group service trees */
 #define for_each_cfqg_st(cfqg, i, j, st) \
@@ -464,8 +461,6 @@ static inline int cfqg_busy_async_queues
 static void cfq_dispatch_insert(struct request_queue *, struct request *);
 static struct cfq_queue *cfq_get_queue(struct cfq_data *, bool,
 				       struct io_context *, gfp_t);
-static struct cfq_io_context *cfq_cic_lookup(struct cfq_data *,
-						struct io_context *);
 
 static inline struct cfq_queue *cic_to_cfqq(struct cfq_io_context *cic,
 					    bool is_sync)
@@ -479,23 +474,6 @@ static inline void cic_set_cfqq(struct c
 	cic->cfqq[is_sync] = cfqq;
 }
 
-#define CIC_DEAD_KEY	1ul
-#define CIC_DEAD_INDEX_SHIFT	1
-
-static inline void *cfqd_dead_key(struct cfq_data *cfqd)
-{
-	return (void *)(cfqd->cic_index << CIC_DEAD_INDEX_SHIFT | CIC_DEAD_KEY);
-}
-
-static inline struct cfq_data *cic_to_cfqd(struct cfq_io_context *cic)
-{
-	struct cfq_data *cfqd = cic->key;
-
-	if (unlikely((unsigned long) cfqd & CIC_DEAD_KEY))
-		return NULL;
-
-	return cfqd;
-}
 
 /*
  * We regard a request as SYNC, if it's either a read or has the SYNC bit
@@ -514,7 +492,7 @@ static inline void cfq_schedule_dispatch
 {
 	if (cfqd->busy_queues) {
 		cfq_log(cfqd, "schedule dispatch");
-		kblockd_schedule_work(cfqd->queue, &cfqd->unplug_work);
+		kblockd_schedule_work(cfqd->qdata.queue, &cfqd->unplug_work);
 	}
 }
 
@@ -1030,7 +1008,7 @@ static void cfq_update_blkio_group_weigh
 static void cfq_init_add_cfqg_lists(struct cfq_data *cfqd,
 			struct cfq_group *cfqg, struct blkio_cgroup *blkcg)
 {
-	struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info;
+	struct backing_dev_info *bdi = &cfqd->qdata.queue->backing_dev_info;
 	unsigned int major, minor;
 
 	/*
@@ -1065,7 +1043,7 @@ static struct cfq_group * cfq_alloc_cfqg
 	int i, j, ret;
 	struct cfq_rb_root *st;
 
-	cfqg = kzalloc_node(sizeof(*cfqg), GFP_ATOMIC, cfqd->queue->node);
+	cfqg = kzalloc_node(sizeof(*cfqg), GFP_ATOMIC, cfqd->qdata.queue->node);
 	if (!cfqg)
 		return NULL;
 
@@ -1097,7 +1075,7 @@ cfq_find_cfqg(struct cfq_data *cfqd, str
 {
 	struct cfq_group *cfqg = NULL;
 	void *key = cfqd;
-	struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info;
+	struct backing_dev_info *bdi = &cfqd->qdata.queue->backing_dev_info;
 	unsigned int major, minor;
 
 	/*
@@ -1125,7 +1103,7 @@ static struct cfq_group *cfq_get_cfqg(st
 {
 	struct blkio_cgroup *blkcg;
 	struct cfq_group *cfqg = NULL, *__cfqg = NULL;
-	struct request_queue *q = cfqd->queue;
+	struct request_queue *q = cfqd->qdata.queue;
 
 	rcu_read_lock();
 	blkcg = task_blkio_cgroup(current);
@@ -1259,9 +1237,9 @@ static void cfq_unlink_blkio_group(void
 	unsigned long  flags;
 	struct cfq_data *cfqd = key;
 
-	spin_lock_irqsave(cfqd->queue->queue_lock, flags);
+	spin_lock_irqsave(cfqd->qdata.queue->queue_lock, flags);
 	cfq_destroy_cfqg(cfqd, cfqg_of_blkg(blkg));
-	spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
+	spin_unlock_irqrestore(cfqd->qdata.queue->queue_lock, flags);
 }
 
 #else /* GROUP_IOSCHED */
@@ -1561,12 +1539,14 @@ static struct request *
 cfq_find_rq_fmerge(struct cfq_data *cfqd, struct bio *bio)
 {
 	struct task_struct *tsk = current;
+	struct dev_io_context *gen_cic;
 	struct cfq_io_context *cic;
 	struct cfq_queue *cfqq;
 
-	cic = cfq_cic_lookup(cfqd, tsk->io_context);
-	if (!cic)
+	gen_cic = queue_data_cic_lookup(&cfqd->qdata, tsk->io_context);
+	if (!gen_cic)
 		return NULL;
+	cic = dev_ioc_to_cfq_ioc(gen_cic);
 
 	cfqq = cic_to_cfqq(cic, cfq_bio_sync(bio));
 	if (cfqq) {
@@ -1675,6 +1655,7 @@ static int cfq_allow_merge(struct reques
 			   struct bio *bio)
 {
 	struct cfq_data *cfqd = q->elevator->elevator_data;
+	struct dev_io_context *gen_cic;
 	struct cfq_io_context *cic;
 	struct cfq_queue *cfqq;
 
@@ -1688,9 +1669,10 @@ static int cfq_allow_merge(struct reques
 	 * Lookup the cfqq that this bio will be queued with. Allow
 	 * merge only if rq is queued there.
 	 */
-	cic = cfq_cic_lookup(cfqd, current->io_context);
-	if (!cic)
+	gen_cic = queue_data_cic_lookup(&cfqd->qdata, current->io_context);
+	if (!gen_cic)
 		return false;
+	cic = dev_ioc_to_cfq_ioc(gen_cic);
 
 	cfqq = cic_to_cfqq(cic, cfq_bio_sync(bio));
 	return cfqq == RQ_CFQQ(rq);
@@ -1774,7 +1756,7 @@ __cfq_slice_expired(struct cfq_data *cfq
 		cfqd->active_queue = NULL;
 
 	if (cfqd->active_cic) {
-		put_io_context(cfqd->active_cic->ioc);
+		put_io_context(cfqd->active_cic->dev_ioc.ioc);
 		cfqd->active_cic = NULL;
 	}
 }
@@ -1976,7 +1958,7 @@ static bool cfq_should_idle(struct cfq_d
 
 	/* We do for queues that were marked with idle window flag. */
 	if (cfq_cfqq_idle_window(cfqq) &&
-	   !(blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag))
+	   !(blk_queue_nonrot(cfqd->qdata.queue) && cfqd->hw_tag))
 		return true;
 
 	/*
@@ -2002,7 +1984,7 @@ static void cfq_arm_slice_timer(struct c
 	 * for devices that support queuing, otherwise we still have a problem
 	 * with sync vs async workloads.
 	 */
-	if (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag)
+	if (blk_queue_nonrot(cfqd->qdata.queue) && cfqd->hw_tag)
 		return;
 
 	WARN_ON(!RB_EMPTY_ROOT(&cfqq->sort_list));
@@ -2029,7 +2011,7 @@ static void cfq_arm_slice_timer(struct c
 	 * task has exited, don't wait
 	 */
 	cic = cfqd->active_cic;
-	if (!cic || !atomic_read(&cic->ioc->nr_tasks))
+	if (!cic || !atomic_read(&cic->dev_ioc.ioc->nr_tasks))
 		return;
 
 	/*
@@ -2423,7 +2405,7 @@ static int __cfq_forced_dispatch_cfqq(st
 	int dispatched = 0;
 
 	while (cfqq->next_rq) {
-		cfq_dispatch_insert(cfqq->cfqd->queue, cfqq->next_rq);
+		cfq_dispatch_insert(cfqq->cfqd->qdata.queue, cfqq->next_rq);
 		dispatched++;
 	}
 
@@ -2577,12 +2559,12 @@ static bool cfq_dispatch_request(struct
 	/*
 	 * insert request into driver dispatch list
 	 */
-	cfq_dispatch_insert(cfqd->queue, rq);
+	cfq_dispatch_insert(cfqd->qdata.queue, rq);
 
 	if (!cfqd->active_cic) {
 		struct cfq_io_context *cic = RQ_CIC(rq);
 
-		atomic_long_inc(&cic->ioc->refcount);
+		atomic_long_inc(&cic->dev_ioc.ioc->refcount);
 		cfqd->active_cic = cic;
 	}
 
@@ -2665,84 +2647,6 @@ static void cfq_put_queue(struct cfq_que
 	cfq_put_cfqg(cfqg);
 }
 
-/*
- * Call func for each cic attached to this ioc.
- */
-static void
-call_for_each_cic(struct io_context *ioc,
-		  void (*func)(struct io_context *, struct cfq_io_context *))
-{
-	struct cfq_io_context *cic;
-	struct hlist_node *n;
-
-	rcu_read_lock();
-
-	hlist_for_each_entry_rcu(cic, n, &ioc->cic_list, cic_list)
-		func(ioc, cic);
-
-	rcu_read_unlock();
-}
-
-static void cfq_cic_free_rcu(struct rcu_head *head)
-{
-	struct cfq_io_context *cic;
-
-	cic = container_of(head, struct cfq_io_context, rcu_head);
-
-	kmem_cache_free(cfq_ioc_pool, cic);
-	elv_ioc_count_dec(cfq_ioc_count);
-
-	if (ioc_gone) {
-		/*
-		 * CFQ scheduler is exiting, grab exit lock and check
-		 * the pending io context count. If it hits zero,
-		 * complete ioc_gone and set it back to NULL
-		 */
-		spin_lock(&ioc_gone_lock);
-		if (ioc_gone && !elv_ioc_count_read(cfq_ioc_count)) {
-			complete(ioc_gone);
-			ioc_gone = NULL;
-		}
-		spin_unlock(&ioc_gone_lock);
-	}
-}
-
-static void cfq_cic_free(struct cfq_io_context *cic)
-{
-	call_rcu(&cic->rcu_head, cfq_cic_free_rcu);
-}
-
-static void cic_free_func(struct io_context *ioc, struct cfq_io_context *cic)
-{
-	unsigned long flags;
-	unsigned long dead_key = (unsigned long) cic->key;
-
-	BUG_ON(!(dead_key & CIC_DEAD_KEY));
-
-	spin_lock_irqsave(&ioc->lock, flags);
-	radix_tree_delete(&ioc->radix_root, dead_key >> CIC_DEAD_INDEX_SHIFT);
-	hlist_del_rcu(&cic->cic_list);
-	spin_unlock_irqrestore(&ioc->lock, flags);
-
-	cfq_cic_free(cic);
-}
-
-/*
- * Must be called with rcu_read_lock() held or preemption otherwise disabled.
- * Only two callers of this - ->dtor() which is called with the rcu_read_lock(),
- * and ->trim() which is called with the task lock held
- */
-static void cfq_free_io_context(struct io_context *ioc)
-{
-	/*
-	 * ioc->refcount is zero here, or we are called from elv_unregister(),
-	 * so no more cic's are allowed to be linked into this ioc.  So it
-	 * should be ok to iterate over the known list, we will see all cic's
-	 * since no new ones are added.
-	 */
-	call_for_each_cic(ioc, cic_free_func);
-}
-
 static void cfq_put_cooperator(struct cfq_queue *cfqq)
 {
 	struct cfq_queue *__cfqq, *next;
@@ -2776,90 +2680,6 @@ static void cfq_exit_cfqq(struct cfq_dat
 	cfq_put_queue(cfqq);
 }
 
-static void __cfq_exit_single_io_context(struct cfq_data *cfqd,
-					 struct cfq_io_context *cic)
-{
-	struct io_context *ioc = cic->ioc;
-
-	list_del_init(&cic->queue_list);
-
-	/*
-	 * Make sure dead mark is seen for dead queues
-	 */
-	smp_wmb();
-	cic->key = cfqd_dead_key(cfqd);
-
-	rcu_read_lock();
-	if (rcu_dereference(ioc->ioc_data) == cic) {
-		rcu_read_unlock();
-		spin_lock(&ioc->lock);
-		rcu_assign_pointer(ioc->ioc_data, NULL);
-		spin_unlock(&ioc->lock);
-	} else
-		rcu_read_unlock();
-
-	if (cic->cfqq[BLK_RW_ASYNC]) {
-		cfq_exit_cfqq(cfqd, cic->cfqq[BLK_RW_ASYNC]);
-		cic->cfqq[BLK_RW_ASYNC] = NULL;
-	}
-
-	if (cic->cfqq[BLK_RW_SYNC]) {
-		cfq_exit_cfqq(cfqd, cic->cfqq[BLK_RW_SYNC]);
-		cic->cfqq[BLK_RW_SYNC] = NULL;
-	}
-}
-
-static void cfq_exit_single_io_context(struct io_context *ioc,
-				       struct cfq_io_context *cic)
-{
-	struct cfq_data *cfqd = cic_to_cfqd(cic);
-
-	if (cfqd) {
-		struct request_queue *q = cfqd->queue;
-		unsigned long flags;
-
-		spin_lock_irqsave(q->queue_lock, flags);
-
-		/*
-		 * Ensure we get a fresh copy of the ->key to prevent
-		 * race between exiting task and queue
-		 */
-		smp_read_barrier_depends();
-		if (cic->key == cfqd)
-			__cfq_exit_single_io_context(cfqd, cic);
-
-		spin_unlock_irqrestore(q->queue_lock, flags);
-	}
-}
-
-/*
- * The process that ioc belongs to has exited, we need to clean up
- * and put the internal structures we have that belongs to that process.
- */
-static void cfq_exit_io_context(struct io_context *ioc)
-{
-	call_for_each_cic(ioc, cfq_exit_single_io_context);
-}
-
-static struct cfq_io_context *
-cfq_alloc_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
-{
-	struct cfq_io_context *cic;
-
-	cic = kmem_cache_alloc_node(cfq_ioc_pool, gfp_mask | __GFP_ZERO,
-							cfqd->queue->node);
-	if (cic) {
-		cic->ttime.last_end_request = jiffies;
-		INIT_LIST_HEAD(&cic->queue_list);
-		INIT_HLIST_NODE(&cic->cic_list);
-		cic->dtor = cfq_free_io_context;
-		cic->exit = cfq_exit_io_context;
-		elv_ioc_count_inc(cfq_ioc_count);
-	}
-
-	return cic;
-}
-
 static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
 {
 	struct task_struct *tsk = current;
@@ -2902,21 +2722,24 @@ static void cfq_init_prio_data(struct cf
 	cfq_clear_cfqq_prio_changed(cfqq);
 }
 
-static void changed_ioprio(struct io_context *ioc, struct cfq_io_context *cic)
+static void changed_ioprio(struct io_context *ioc,
+	struct dev_io_context *gen_cic)
 {
-	struct cfq_data *cfqd = cic_to_cfqd(cic);
+	struct queue_data *qdata = cic_to_queue_data(gen_cic);
+	struct cfq_io_context *cic = dev_ioc_to_cfq_ioc(gen_cic);
+	struct cfq_data *cfqd = queue_data_to_cfqd(qdata);
 	struct cfq_queue *cfqq;
 	unsigned long flags;
 
 	if (unlikely(!cfqd))
 		return;
 
-	spin_lock_irqsave(cfqd->queue->queue_lock, flags);
+	spin_lock_irqsave(cfqd->qdata.queue->queue_lock, flags);
 
 	cfqq = cic->cfqq[BLK_RW_ASYNC];
 	if (cfqq) {
 		struct cfq_queue *new_cfqq;
-		new_cfqq = cfq_get_queue(cfqd, BLK_RW_ASYNC, cic->ioc,
+		new_cfqq = cfq_get_queue(cfqd, BLK_RW_ASYNC, cic->dev_ioc.ioc,
 						GFP_ATOMIC);
 		if (new_cfqq) {
 			cic->cfqq[BLK_RW_ASYNC] = new_cfqq;
@@ -2928,13 +2751,7 @@ static void changed_ioprio(struct io_con
 	if (cfqq)
 		cfq_mark_cfqq_prio_changed(cfqq);
 
-	spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
-}
-
-static void cfq_ioc_set_ioprio(struct io_context *ioc)
-{
-	call_for_each_cic(ioc, changed_ioprio);
-	ioc->ioprio_changed = 0;
+	spin_unlock_irqrestore(cfqd->qdata.queue->queue_lock, flags);
 }
 
 static void cfq_init_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
@@ -2958,17 +2775,20 @@ static void cfq_init_cfqq(struct cfq_dat
 }
 
 #ifdef CONFIG_CFQ_GROUP_IOSCHED
-static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
+static void changed_cgroup(struct io_context *ioc,
+	struct dev_io_context *gen_cic)
 {
+	struct cfq_io_context *cic = dev_ioc_to_cfq_ioc(gen_cic);
 	struct cfq_queue *sync_cfqq = cic_to_cfqq(cic, 1);
-	struct cfq_data *cfqd = cic_to_cfqd(cic);
+	struct queue_data *qdata = cic_to_queue_data(gen_cic);
+	struct cfq_data *cfqd = queue_data_to_cfqd(qdata);
 	unsigned long flags;
 	struct request_queue *q;
 
 	if (unlikely(!cfqd))
 		return;
 
-	q = cfqd->queue;
+	q = cfqd->qdata.queue;
 
 	spin_lock_irqsave(q->queue_lock, flags);
 
@@ -2984,12 +2804,6 @@ static void changed_cgroup(struct io_con
 
 	spin_unlock_irqrestore(q->queue_lock, flags);
 }
-
-static void cfq_ioc_set_cgroup(struct io_context *ioc)
-{
-	call_for_each_cic(ioc, changed_cgroup);
-	ioc->cgroup_changed = 0;
-}
 #endif  /* CONFIG_CFQ_GROUP_IOSCHED */
 
 static struct cfq_queue *
@@ -2997,12 +2811,14 @@ cfq_find_alloc_queue(struct cfq_data *cf
 		     struct io_context *ioc, gfp_t gfp_mask)
 {
 	struct cfq_queue *cfqq, *new_cfqq = NULL;
+	struct dev_io_context *gen_cic;
 	struct cfq_io_context *cic;
 	struct cfq_group *cfqg;
 
 retry:
 	cfqg = cfq_get_cfqg(cfqd);
-	cic = cfq_cic_lookup(cfqd, ioc);
+	gen_cic = queue_data_cic_lookup(&cfqd->qdata, ioc);
+	cic = dev_ioc_to_cfq_ioc(gen_cic);
 	/* cic always exists here */
 	cfqq = cic_to_cfqq(cic, is_sync);
 
@@ -3016,17 +2832,17 @@ retry:
 			cfqq = new_cfqq;
 			new_cfqq = NULL;
 		} else if (gfp_mask & __GFP_WAIT) {
-			spin_unlock_irq(cfqd->queue->queue_lock);
+			spin_unlock_irq(cfqd->qdata.queue->queue_lock);
 			new_cfqq = kmem_cache_alloc_node(cfq_pool,
 					gfp_mask | __GFP_ZERO,
-					cfqd->queue->node);
-			spin_lock_irq(cfqd->queue->queue_lock);
+					cfqd->qdata.queue->node);
+			spin_lock_irq(cfqd->qdata.queue->queue_lock);
 			if (new_cfqq)
 				goto retry;
 		} else {
 			cfqq = kmem_cache_alloc_node(cfq_pool,
 					gfp_mask | __GFP_ZERO,
-					cfqd->queue->node);
+					cfqd->qdata.queue->node);
 		}
 
 		if (cfqq) {
@@ -3088,159 +2904,6 @@ cfq_get_queue(struct cfq_data *cfqd, boo
 	return cfqq;
 }
 
-/*
- * We drop cfq io contexts lazily, so we may find a dead one.
- */
-static void
-cfq_drop_dead_cic(struct cfq_data *cfqd, struct io_context *ioc,
-		  struct cfq_io_context *cic)
-{
-	unsigned long flags;
-
-	WARN_ON(!list_empty(&cic->queue_list));
-	BUG_ON(cic->key != cfqd_dead_key(cfqd));
-
-	spin_lock_irqsave(&ioc->lock, flags);
-
-	BUG_ON(rcu_dereference_check(ioc->ioc_data,
-		lockdep_is_held(&ioc->lock)) == cic);
-
-	radix_tree_delete(&ioc->radix_root, cfqd->cic_index);
-	hlist_del_rcu(&cic->cic_list);
-	spin_unlock_irqrestore(&ioc->lock, flags);
-
-	cfq_cic_free(cic);
-}
-
-static struct cfq_io_context *
-cfq_cic_lookup(struct cfq_data *cfqd, struct io_context *ioc)
-{
-	struct cfq_io_context *cic;
-	unsigned long flags;
-
-	if (unlikely(!ioc))
-		return NULL;
-
-	rcu_read_lock();
-
-	/*
-	 * we maintain a last-hit cache, to avoid browsing over the tree
-	 */
-	cic = rcu_dereference(ioc->ioc_data);
-	if (cic && cic->key == cfqd) {
-		rcu_read_unlock();
-		return cic;
-	}
-
-	do {
-		cic = radix_tree_lookup(&ioc->radix_root, cfqd->cic_index);
-		rcu_read_unlock();
-		if (!cic)
-			break;
-		if (unlikely(cic->key != cfqd)) {
-			cfq_drop_dead_cic(cfqd, ioc, cic);
-			rcu_read_lock();
-			continue;
-		}
-
-		spin_lock_irqsave(&ioc->lock, flags);
-		rcu_assign_pointer(ioc->ioc_data, cic);
-		spin_unlock_irqrestore(&ioc->lock, flags);
-		break;
-	} while (1);
-
-	return cic;
-}
-
-/*
- * Add cic into ioc, using cfqd as the search key. This enables us to lookup
- * the process specific cfq io context when entered from the block layer.
- * Also adds the cic to a per-cfqd list, used when this queue is removed.
- */
-static int cfq_cic_link(struct cfq_data *cfqd, struct io_context *ioc,
-			struct cfq_io_context *cic, gfp_t gfp_mask)
-{
-	unsigned long flags;
-	int ret;
-
-	ret = radix_tree_preload(gfp_mask);
-	if (!ret) {
-		cic->ioc = ioc;
-		cic->key = cfqd;
-
-		spin_lock_irqsave(&ioc->lock, flags);
-		ret = radix_tree_insert(&ioc->radix_root,
-						cfqd->cic_index, cic);
-		if (!ret)
-			hlist_add_head_rcu(&cic->cic_list, &ioc->cic_list);
-		spin_unlock_irqrestore(&ioc->lock, flags);
-
-		radix_tree_preload_end();
-
-		if (!ret) {
-			spin_lock_irqsave(cfqd->queue->queue_lock, flags);
-			list_add(&cic->queue_list, &cfqd->cic_list);
-			spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
-		}
-	}
-
-	if (ret && ret != -EEXIST)
-		printk(KERN_ERR "cfq: cic link failed!\n");
-
-	return ret;
-}
-
-/*
- * Setup general io context and cfq io context. There can be several cfq
- * io contexts per general io context, if this process is doing io to more
- * than one device managed by cfq.
- */
-static struct cfq_io_context *
-cfq_get_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
-{
-	struct io_context *ioc = NULL;
-	struct cfq_io_context *cic;
-	int ret;
-
-	might_sleep_if(gfp_mask & __GFP_WAIT);
-
-	ioc = get_io_context(gfp_mask, cfqd->queue->node);
-	if (!ioc)
-		return NULL;
-
-retry:
-	cic = cfq_cic_lookup(cfqd, ioc);
-	if (cic)
-		goto out;
-
-	cic = cfq_alloc_io_context(cfqd, gfp_mask);
-	if (cic == NULL)
-		goto err;
-
-	ret = cfq_cic_link(cfqd, ioc, cic, gfp_mask);
-	if (ret == -EEXIST) {
-		/* someone has linked cic to ioc already */
-		cfq_cic_free(cic);
-		goto retry;
-	} else if (ret)
-		goto err_free;
-
-out:
-	smp_read_barrier_depends();
-	if (unlikely(ioc->ioprio_changed))
-		cfq_ioc_set_ioprio(ioc);
-
-#ifdef CONFIG_CFQ_GROUP_IOSCHED
-	if (unlikely(ioc->cgroup_changed))
-		cfq_ioc_set_cgroup(ioc);
-#endif
-	return cic;
-err_free:
-	cfq_cic_free(cic);
-err:
-	put_io_context(ioc);
-	return NULL;
-}
 
 static void
 __cfq_update_io_thinktime(struct cfq_ttime *ttime, unsigned long slice_idle)
@@ -3281,7 +2944,7 @@ cfq_update_io_seektime(struct cfq_data *
 	}
 
 	cfqq->seek_history <<= 1;
-	if (blk_queue_nonrot(cfqd->queue))
+	if (blk_queue_nonrot(cfqd->qdata.queue))
 		cfqq->seek_history |= (n_sec < CFQQ_SECT_THR_NONROT);
 	else
 		cfqq->seek_history |= (sdist > CFQQ_SEEK_THR);
@@ -3310,7 +2973,8 @@ cfq_update_idle_window(struct cfq_data *
 
 	if (cfqq->next_rq && (cfqq->next_rq->cmd_flags & REQ_NOIDLE))
 		enable_idle = 0;
-	else if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
+	else if (!atomic_read(&cic->dev_ioc.ioc->nr_tasks) ||
+	    !cfqd->cfq_slice_idle ||
 	    (!cfq_cfqq_deep(cfqq) && CFQQ_SEEKY(cfqq)))
 		enable_idle = 0;
 	else if (sample_valid(cic->ttime.ttime_samples)) {
@@ -3471,7 +3135,7 @@ cfq_rq_enqueued(struct cfq_data *cfqd, s
 			    cfqd->busy_queues > 1) {
 				cfq_del_timer(cfqd, cfqq);
 				cfq_clear_cfqq_wait_request(cfqq);
-				__blk_run_queue(cfqd->queue);
+				__blk_run_queue(cfqd->qdata.queue);
 			} else {
 				cfq_blkiocg_update_idle_time_stats(
 						&cfqq->cfqg->blkg);
@@ -3486,7 +3150,7 @@ cfq_rq_enqueued(struct cfq_data *cfqd, s
 		 * this new queue is RT and the current one is BE
 		 */
 		cfq_preempt_queue(cfqd, cfqq);
-		__blk_run_queue(cfqd->queue);
+		__blk_run_queue(cfqd->qdata.queue);
 	}
 }
 
@@ -3496,7 +3160,7 @@ static void cfq_insert_request(struct re
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
 
 	cfq_log_cfqq(cfqd, cfqq, "insert_request");
-	cfq_init_prio_data(cfqq, RQ_CIC(rq)->ioc);
+	cfq_init_prio_data(cfqq, RQ_CIC(rq)->dev_ioc.ioc);
 
 	rq_set_fifo_time(rq, jiffies + cfqd->cfq_fifo_expire[rq_is_sync(rq)]);
 	list_add_tail(&rq->queuelist, &cfqq->fifo);
@@ -3683,6 +3347,7 @@ static int cfq_may_queue(struct request_
 {
 	struct cfq_data *cfqd = q->elevator->elevator_data;
 	struct task_struct *tsk = current;
+	struct dev_io_context *gen_cic;
 	struct cfq_io_context *cic;
 	struct cfq_queue *cfqq;
 
@@ -3692,13 +3357,14 @@ static int cfq_may_queue(struct request_
 	 * so just lookup a possibly existing queue, or return 'may queue'
 	 * if that fails
 	 */
-	cic = cfq_cic_lookup(cfqd, tsk->io_context);
-	if (!cic)
+	gen_cic = queue_data_cic_lookup(&cfqd->qdata, tsk->io_context);
+	if (!gen_cic)
 		return ELV_MQUEUE_MAY;
+	cic = dev_ioc_to_cfq_ioc(gen_cic);
 
 	cfqq = cic_to_cfqq(cic, rw_is_sync(rw));
 	if (cfqq) {
-		cfq_init_prio_data(cfqq, cic->ioc);
+		cfq_init_prio_data(cfqq, cic->dev_ioc.ioc);
 
 		return __cfq_may_queue(cfqq);
 	}
@@ -3719,7 +3385,7 @@ static void cfq_put_request(struct reque
 		BUG_ON(!cfqq->allocated[rw]);
 		cfqq->allocated[rw]--;
 
-		put_io_context(RQ_CIC(rq)->ioc);
+		put_io_context(RQ_CIC(rq)->dev_ioc.ioc);
 
 		rq->elevator_private[0] = NULL;
 		rq->elevator_private[1] = NULL;
@@ -3772,6 +3438,7 @@ cfq_set_request(struct request_queue *q,
 {
 	struct cfq_data *cfqd = q->elevator->elevator_data;
 	struct cfq_io_context *cic;
+	struct dev_io_context *dev_ioc;
 	const int rw = rq_data_dir(rq);
 	const bool is_sync = rq_is_sync(rq);
 	struct cfq_queue *cfqq;
@@ -3779,7 +3446,12 @@ cfq_set_request(struct request_queue *q,
 
 	might_sleep_if(gfp_mask & __GFP_WAIT);
 
-	cic = cfq_get_io_context(cfqd, gfp_mask);
+	dev_ioc = queue_data_get_io_context(&ioc_builder, &cfqd->qdata,
+						gfp_mask);
+	if (dev_ioc)
+		cic = dev_ioc_to_cfq_ioc(dev_ioc);
+	else
+		cic = NULL;
 
 	spin_lock_irqsave(q->queue_lock, flags);
 
@@ -3789,7 +3461,7 @@ cfq_set_request(struct request_queue *q,
 new_queue:
 	cfqq = cic_to_cfqq(cic, is_sync);
 	if (!cfqq || cfqq == &cfqd->oom_cfqq) {
-		cfqq = cfq_get_queue(cfqd, is_sync, cic->ioc, gfp_mask);
+		cfqq = cfq_get_queue(cfqd, is_sync, cic->dev_ioc.ioc, gfp_mask);
 		cic_set_cfqq(cic, cfqq, is_sync);
 	} else {
 		/*
@@ -3832,10 +3504,10 @@ static void cfq_kick_queue(struct work_s
 {
 	struct cfq_data *cfqd =
 		container_of(work, struct cfq_data, unplug_work);
-	struct request_queue *q = cfqd->queue;
+	struct request_queue *q = cfqd->qdata.queue;
 
 	spin_lock_irq(q->queue_lock);
-	__blk_run_queue(cfqd->queue);
+	__blk_run_queue(q);
 	spin_unlock_irq(q->queue_lock);
 }
 
@@ -3851,7 +3523,7 @@ static void cfq_idle_slice_timer(unsigne
 
 	cfq_log(cfqd, "idle timer fired");
 
-	spin_lock_irqsave(cfqd->queue->queue_lock, flags);
+	spin_lock_irqsave(cfqd->qdata.queue->queue_lock, flags);
 
 	cfqq = cfqd->active_queue;
 	if (cfqq) {
@@ -3892,7 +3564,7 @@ expire:
 out_kick:
 	cfq_schedule_dispatch(cfqd);
 out_cont:
-	spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
+	spin_unlock_irqrestore(cfqd->qdata.queue->queue_lock, flags);
 }
 
 static void cfq_shutdown_timer_wq(struct cfq_data *cfqd)
@@ -3916,10 +3588,35 @@ static void cfq_put_async_queues(struct
 		cfq_put_queue(cfqd->async_idle_cfqq);
 }
 
+static void cfq_init_cic(struct queue_data *qdata,
+	struct dev_io_context *gen_cic)
+{
+	struct cfq_io_context *cic = dev_ioc_to_cfq_ioc(gen_cic);
+
+	cic->ttime.last_end_request = jiffies;
+}
+
+static void cfq_exit_cic(struct queue_data *qdata,
+	struct dev_io_context *gen_cic)
+{
+	struct cfq_io_context *cic = dev_ioc_to_cfq_ioc(gen_cic);
+	struct cfq_data *cfqd = queue_data_to_cfqd(qdata);
+
+	if (cic->cfqq[BLK_RW_ASYNC]) {
+		cfq_exit_cfqq(cfqd, cic->cfqq[BLK_RW_ASYNC]);
+		cic->cfqq[BLK_RW_ASYNC] = NULL;
+	}
+
+	if (cic->cfqq[BLK_RW_SYNC]) {
+		cfq_exit_cfqq(cfqd, cic->cfqq[BLK_RW_SYNC]);
+		cic->cfqq[BLK_RW_SYNC] = NULL;
+	}
+}
+
 static void cfq_exit_queue(struct elevator_queue *e)
 {
 	struct cfq_data *cfqd = e->elevator_data;
-	struct request_queue *q = cfqd->queue;
+	struct request_queue *q = cfqd->qdata.queue;
 	bool wait = false;
 
 	cfq_shutdown_timer_wq(cfqd);
@@ -3929,13 +3626,7 @@ static void cfq_exit_queue(struct elevat
 	if (cfqd->active_queue)
 		__cfq_slice_expired(cfqd, cfqd->active_queue, 0);
 
-	while (!list_empty(&cfqd->cic_list)) {
-		struct cfq_io_context *cic = list_entry(cfqd->cic_list.next,
-							struct cfq_io_context,
-							queue_list);
-
-		__cfq_exit_single_io_context(cfqd, cic);
-	}
+	ioc_builder_exit_queue(&ioc_builder, &cfqd->qdata);
 
 	cfq_put_async_queues(cfqd);
 	cfq_release_cfq_groups(cfqd);
@@ -3951,10 +3642,6 @@ static void cfq_exit_queue(struct elevat
 
 	cfq_shutdown_timer_wq(cfqd);
 
-	spin_lock(&cic_index_lock);
-	ida_remove(&cic_index_ida, cfqd->cic_index);
-	spin_unlock(&cic_index_lock);
-
 	/*
 	 * Wait for cfqg->blkg->key accessors to exit their grace periods.
 	 * Do this wait only if there are other unlinked groups out
@@ -3976,24 +3663,6 @@ static void cfq_exit_queue(struct elevat
 	kfree(cfqd);
 }
 
-static int cfq_alloc_cic_index(void)
-{
-	int index, error;
-
-	do {
-		if (!ida_pre_get(&cic_index_ida, GFP_KERNEL))
-			return -ENOMEM;
-
-		spin_lock(&cic_index_lock);
-		error = ida_get_new(&cic_index_ida, &index);
-		spin_unlock(&cic_index_lock);
-		if (error && error != -EAGAIN)
-			return error;
-	} while (error);
-
-	return index;
-}
-
 static void *cfq_init_queue(struct request_queue *q)
 {
 	struct cfq_data *cfqd;
@@ -4001,24 +3670,15 @@ static void *cfq_init_queue(struct reque
 	struct cfq_group *cfqg;
 	struct cfq_rb_root *st;
 
-	i = cfq_alloc_cic_index();
-	if (i < 0)
+	cfqd = kmalloc_node(sizeof(*cfqd), GFP_KERNEL | __GFP_ZERO, q->node);
+	if (!cfqd)
 		return NULL;
 
-	cfqd = kmalloc_node(sizeof(*cfqd), GFP_KERNEL | __GFP_ZERO, q->node);
-	if (!cfqd) {
-		spin_lock(&cic_index_lock);
-		ida_remove(&cic_index_ida, i);
-		spin_unlock(&cic_index_lock);
+	if (ioc_builder_init_queue(&ioc_builder, &cfqd->qdata, q)) {
+		kfree(cfqd);
 		return NULL;
 	}
 
-	/*
-	 * Don't need take queue_lock in the routine, since we are
-	 * initializing the ioscheduler, and nobody is using cfqd
-	 */
-	cfqd->cic_index = i;
-
 	/* Init root service tree */
 	cfqd->grp_service_tree = CFQ_RB_ROOT;
 
@@ -4044,9 +3704,7 @@ static void *cfq_init_queue(struct reque
 	if (blkio_alloc_blkg_stats(&cfqg->blkg)) {
 		kfree(cfqg);
 
-		spin_lock(&cic_index_lock);
-		ida_remove(&cic_index_ida, cfqd->cic_index);
-		spin_unlock(&cic_index_lock);
+		ioc_builder_exit_queue(&ioc_builder, &cfqd->qdata);
 
 		kfree(cfqd);
 		return NULL;
@@ -4079,9 +3737,6 @@ static void *cfq_init_queue(struct reque
 	cfqd->oom_cfqq.ref++;
 	cfq_link_cfqq_cfqg(&cfqd->oom_cfqq, &cfqd->root_group);
 
-	INIT_LIST_HEAD(&cfqd->cic_list);
-
-	cfqd->queue = q;
 
 	init_timer(&cfqd->idle_slice_timer);
 	cfqd->idle_slice_timer.function = cfq_idle_slice_timer;
@@ -4137,6 +3792,34 @@ fail:
 	return -ENOMEM;
 }
 
+static struct dev_io_context *cfq_alloc_ioc(struct ioc_builder *builder,
+	struct queue_data *qdata, gfp_t gfp_mask)
+{
+	struct cfq_io_context *ioc = kmem_cache_alloc_node(cfq_ioc_pool,
+					gfp_mask, qdata->queue->node);
+	if (ioc)
+		return &ioc->dev_ioc;
+	return NULL;
+}
+
+static void cfq_free_ioc(struct ioc_builder *builder,
+	struct dev_io_context *dev_ioc)
+{
+	struct cfq_io_context *ioc = dev_ioc_to_cfq_ioc(dev_ioc);
+	kmem_cache_free(cfq_ioc_pool, ioc);
+}
+
+static struct ioc_builder ioc_builder = {
+	.alloc_ioc = cfq_alloc_ioc,
+	.free_ioc = cfq_free_ioc,
+	.changed_ioprio = changed_ioprio,
+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+	.changed_cgroup = changed_cgroup,
+#endif
+	.cic_init = cfq_init_cic,
+	.cic_exit = cfq_exit_cic,
+};
+
 /*
  * sysfs parts below -->
  */
@@ -4247,7 +3930,7 @@ static struct elevator_type iosched_cfq
 		.elevator_may_queue_fn =	cfq_may_queue,
 		.elevator_init_fn =		cfq_init_queue,
 		.elevator_exit_fn =		cfq_exit_queue,
-		.trim =				cfq_free_io_context,
+		.trim =				queue_data_free_io_context,
 	},
 	.elevator_attrs =	cfq_attrs,
 	.elevator_name =	"cfq",
@@ -4284,6 +3967,10 @@ static int __init cfq_init(void)
 #endif
 	if (cfq_slab_setup())
 		return -ENOMEM;
+	if (ioc_builder_init(&ioc_builder)) {
+		cfq_slab_kill();
+		return -ENOMEM;
+	}
 
 	elv_register(&iosched_cfq);
 	blkio_policy_register(&blkio_policy_cfq);
@@ -4293,20 +3980,9 @@ static int __init cfq_init(void)
 
 static void __exit cfq_exit(void)
 {
-	DECLARE_COMPLETION_ONSTACK(all_gone);
 	blkio_policy_unregister(&blkio_policy_cfq);
 	elv_unregister(&iosched_cfq);
-	ioc_gone = &all_gone;
-	/* ioc_gone's update must be visible before reading ioc_count */
-	smp_wmb();
-
-	/*
-	 * this also protects us from entering cfq_slab_kill() with
-	 * pending RCU callbacks
-	 */
-	if (elv_ioc_count_read(cfq_ioc_count))
-		wait_for_completion(&all_gone);
-	ida_destroy(&cic_index_ida);
+	io_context_builder_exit(&ioc_builder);
 	cfq_slab_kill();
 }
 
Index: linux/include/linux/iocontext.h
===================================================================
--- linux.orig/include/linux/iocontext.h	2011-12-27 16:13:02.000000000 +0800
+++ linux/include/linux/iocontext.h	2011-12-27 16:16:38.000000000 +0800
@@ -4,6 +4,22 @@
 #include <linux/radix-tree.h>
 #include <linux/rcupdate.h>
 
+struct ioc_builder;
+struct dev_io_context {
+	void *key;
+	struct io_context *ioc;
+
+	struct list_head queue_list;
+	struct hlist_node cic_list;
+
+	void (*dtor)(struct io_context *); /* destructor */
+	void (*exit)(struct io_context *); /* called on task exit */
+
+	struct rcu_head rcu_head;
+
+	struct ioc_builder *builder;
+};
+
 struct cfq_queue;
 struct cfq_ttime {
 	unsigned long last_end_request;
@@ -14,21 +30,9 @@ struct cfq_ttime {
 };
 
 struct cfq_io_context {
-	void *key;
-
+	struct dev_io_context dev_ioc;
 	struct cfq_queue *cfqq[2];
-
-	struct io_context *ioc;
-
 	struct cfq_ttime ttime;
-
-	struct list_head queue_list;
-	struct hlist_node cic_list;
-
-	void (*dtor)(struct io_context *); /* destructor */
-	void (*exit)(struct io_context *); /* called on task exit */
-
-	struct rcu_head rcu_head;
 };
 
 /*



* [RFC 2/3]block: FIOPS ioscheduler core
  2012-01-04  6:53 [RFC 0/3]block: An IOPS based ioscheduler Shaohua Li
  2012-01-04  6:53 ` [RFC 1/3]block: separate CFQ io context management code Shaohua Li
@ 2012-01-04  6:53 ` Shaohua Li
  2012-01-06  6:05   ` Namjae Jeon
  2012-01-07  1:06   ` Zhu Yanhai
  2012-01-04  6:53 ` [RFC 3/3]block: fiops read/write request scale Shaohua Li
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 29+ messages in thread
From: Shaohua Li @ 2012-01-04  6:53 UTC (permalink / raw)
  To: linux-kernel; +Cc: axboe, vgoyal, jmoyer, Shaohua Li

[-- Attachment #1: fiops-iosch.patch --]
[-- Type: text/plain, Size: 19808 bytes --]

The FIOPS (Fair IOPS) ioscheduler is an IOPS-based ioscheduler, so it only
targets drives without I/O seek. It is quite similar to CFQ, but the
dispatch decision is made according to IOPS instead of time slices.

The algorithm is simple. Each drive has a service tree, and each task lives
in the tree. The key into the tree is called vios (virtual I/O). Every
request has a vios, which is calculated according to its ioprio, request
size and so on. A task's vios is the sum of the vios of all requests it
dispatches. FIOPS always selects the task with the minimum vios in the
service tree and lets that task dispatch a request. The dispatched request's
vios is then added to the task's vios and the task is repositioned in the
service tree.

Unlike CFQ, FIOPS doesn't have separate sync/async queues, because with
IO-less writeback a task usually dispatches either sync or async requests,
not both. Biasing reads or writes can still be done with the read/write
scale; a sketch of that scaling follows this description.

One issue: if the workload iodepth is lower than the drive queue_depth, a
task's IOPS share might not strictly follow its priority, request size and
so on. In that case the drive actually has idle capacity. Solving this would
require idling the drive, which hurts performance. I believe CFQ isn't
completely fair between tasks in such cases either.
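
For illustration only (the real scaling is in patch 3, later in the
thread): one way a read/write scale could be charged on top of patch 2's
VIOS_SCALE base cost. The 4x write charge and the helper name are made up.

static u64 fiops_scaled_vios(struct request *rq)
{
	u64 vios = VIOS_SCALE;		/* base cost: one I/O (patch 2) */

	/*
	 * Hypothetical 1:4 read/write scale: charge writes 4x, so a
	 * backlogged pure writer gets about 1/4 the IOPS of a pure reader.
	 */
	if (rq_data_dir(rq) == WRITE)
		vios *= 4;

	return vios;
}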

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
---
 block/Kconfig.iosched |    8 
 block/Makefile        |    1 
 block/blk-ioc.c       |    2 
 block/blk.h           |    2 
 block/fiops-iosched.c |  659 ++++++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 670 insertions(+), 2 deletions(-)

Index: linux/block/Makefile
===================================================================
--- linux.orig/block/Makefile	2011-12-28 09:42:18.000000000 +0800
+++ linux/block/Makefile	2011-12-28 09:42:35.000000000 +0800
@@ -14,6 +14,7 @@ obj-$(CONFIG_BLK_DEV_THROTTLING)	+= blk-
 obj-$(CONFIG_IOSCHED_NOOP)	+= noop-iosched.o
 obj-$(CONFIG_IOSCHED_DEADLINE)	+= deadline-iosched.o
 obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched.o
+obj-$(CONFIG_IOSCHED_FIOPS)	+= fiops-iosched.o
 
 obj-$(CONFIG_BLOCK_COMPAT)	+= compat_ioctl.o
 obj-$(CONFIG_BLK_DEV_INTEGRITY)	+= blk-integrity.o
Index: linux/block/fiops-iosched.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux/block/fiops-iosched.c	2012-01-04 14:01:26.000000000 +0800
@@ -0,0 +1,659 @@
+/*
+ * IOPS based IO scheduler. Based on CFQ.
+ *  Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
+ *  Shaohua Li <shli@kernel.org>
+ */
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/blkdev.h>
+#include <linux/elevator.h>
+#include <linux/jiffies.h>
+#include <linux/rbtree.h>
+#include <linux/ioprio.h>
+#include "blk.h"
+
+#define VIOS_SCALE_SHIFT 10
+#define VIOS_SCALE (1 << VIOS_SCALE_SHIFT)
+
+struct fiops_rb_root {
+	struct rb_root rb;
+	struct rb_node *left;
+	unsigned count;
+
+	u64 min_vios;
+};
+#define FIOPS_RB_ROOT	(struct fiops_rb_root) { .rb = RB_ROOT}
+
+struct fiops_data {
+	struct queue_data qdata;
+
+	struct fiops_rb_root service_tree;
+
+	unsigned int busy_queues;
+
+	struct work_struct unplug_work;
+};
+
+struct fiops_ioc {
+	struct dev_io_context dev_ioc;
+
+	unsigned int flags;
+	struct fiops_data *fiopsd;
+	struct rb_node rb_node;
+	u64 vios; /* key in service_tree */
+	struct fiops_rb_root *service_tree;
+
+	struct rb_root sort_list;
+	struct list_head fifo;
+
+	pid_t pid;
+};
+
+static struct kmem_cache *fiops_ioc_pool;
+static struct ioc_builder ioc_builder;
+#define queue_data_to_fiopsd(ptr) container_of(ptr, struct fiops_data, qdata)
+#define dev_ioc_to_fiops_ioc(ptr) container_of(ptr, struct fiops_ioc, dev_ioc)
+#define ioc_service_tree(ioc) (&((ioc)->fiopsd->service_tree))
+
+#define RQ_CIC(rq)		\
+	((struct fiops_ioc *) (rq)->elevator_private[0])
+enum ioc_state_flags {
+	FIOPS_IOC_FLAG_on_rr = 0,	/* on round-robin busy list */
+};
+
+#define FIOPS_IOC_FNS(name)						\
+static inline void fiops_mark_ioc_##name(struct fiops_ioc *ioc)	\
+{									\
+	ioc->flags |= (1 << FIOPS_IOC_FLAG_##name);			\
+}									\
+static inline void fiops_clear_ioc_##name(struct fiops_ioc *ioc)	\
+{									\
+	ioc->flags &= ~(1 << FIOPS_IOC_FLAG_##name);			\
+}									\
+static inline int fiops_ioc_##name(const struct fiops_ioc *ioc)	\
+{									\
+	return ((ioc)->flags & (1 << FIOPS_IOC_FLAG_##name)) != 0;	\
+}
+
+FIOPS_IOC_FNS(on_rr);
+#undef FIOPS_IOC_FNS
+
+/*
+ * The below is leftmost cache rbtree addon
+ */
+static struct fiops_ioc *fiops_rb_first(struct fiops_rb_root *root)
+{
+	/* Service tree is empty */
+	if (!root->count)
+		return NULL;
+
+	if (!root->left)
+		root->left = rb_first(&root->rb);
+
+	if (root->left)
+		return rb_entry(root->left, struct fiops_ioc, rb_node);
+
+	return NULL;
+}
+
+static void rb_erase_init(struct rb_node *n, struct rb_root *root)
+{
+	rb_erase(n, root);
+	RB_CLEAR_NODE(n);
+}
+
+static void fiops_rb_erase(struct rb_node *n, struct fiops_rb_root *root)
+{
+	if (root->left == n)
+		root->left = NULL;
+	rb_erase_init(n, &root->rb);
+	--root->count;
+}
+
+static inline u64 max_vios(u64 min_vios, u64 vios)
+{
+	s64 delta = (s64)(vios - min_vios);
+	if (delta > 0)
+		min_vios = vios;
+
+	return min_vios;
+}
+
+static void fiops_update_min_vios(struct fiops_rb_root *service_tree)
+{
+	struct fiops_ioc *ioc;
+
+	ioc = fiops_rb_first(service_tree);
+	if (!ioc)
+		return;
+	service_tree->min_vios = max_vios(service_tree->min_vios, ioc->vios);
+}
+
+/*
+ * The fiopsd->service_trees holds all pending fiops_ioc's that have
+ * requests waiting to be processed. It is sorted in the order that
+ * we will service the queues.
+ */
+static void fiops_service_tree_add(struct fiops_data *fiopsd,
+	struct fiops_ioc *ioc)
+{
+	struct rb_node **p, *parent;
+	struct fiops_ioc *__ioc;
+	struct fiops_rb_root *service_tree = ioc_service_tree(ioc);
+	u64 vios;
+	int left;
+
+	/* New added IOC */
+	if (RB_EMPTY_NODE(&ioc->rb_node))
+		vios = max_vios(service_tree->min_vios, ioc->vios);
+	else {
+		vios = ioc->vios;
+		/* ioc->service_tree might not equal to service_tree */
+		fiops_rb_erase(&ioc->rb_node, ioc->service_tree);
+		ioc->service_tree = NULL;
+	}
+
+	left = 1;
+	parent = NULL;
+	ioc->service_tree = service_tree;
+	p = &service_tree->rb.rb_node;
+	while (*p) {
+		struct rb_node **n;
+
+		parent = *p;
+		__ioc = rb_entry(parent, struct fiops_ioc, rb_node);
+
+		/*
+		 * sort by key, that represents service time.
+		 */
+		if (vios <  __ioc->vios)
+			n = &(*p)->rb_left;
+		else {
+			n = &(*p)->rb_right;
+			left = 0;
+		}
+
+		p = n;
+	}
+
+	if (left)
+		service_tree->left = &ioc->rb_node;
+
+	ioc->vios = vios;
+	rb_link_node(&ioc->rb_node, parent, p);
+	rb_insert_color(&ioc->rb_node, &service_tree->rb);
+	service_tree->count++;
+
+	fiops_update_min_vios(service_tree);
+}
+
+/*
+ * Update ioc's position in the service tree.
+ */
+static void fiops_resort_rr_list(struct fiops_data *fiopsd,
+	struct fiops_ioc *ioc)
+{
+	/*
+	 * Resorting requires the ioc to be on the RR list already.
+	 */
+	if (fiops_ioc_on_rr(ioc))
+		fiops_service_tree_add(fiopsd, ioc);
+}
+
+/*
+ * add to busy list of queues for service, trying to be fair in ordering
+ * the pending list according to last request service
+ */
+static void fiops_add_ioc_rr(struct fiops_data *fiopsd, struct fiops_ioc *ioc)
+{
+	BUG_ON(fiops_ioc_on_rr(ioc));
+	fiops_mark_ioc_on_rr(ioc);
+
+	fiopsd->busy_queues++;
+
+	fiops_resort_rr_list(fiopsd, ioc);
+}
+
+/*
+ * Called when the ioc no longer has requests pending, remove it from
+ * the service tree.
+ */
+static void fiops_del_ioc_rr(struct fiops_data *fiopsd, struct fiops_ioc *ioc)
+{
+	BUG_ON(!fiops_ioc_on_rr(ioc));
+	fiops_clear_ioc_on_rr(ioc);
+
+	if (!RB_EMPTY_NODE(&ioc->rb_node)) {
+		fiops_rb_erase(&ioc->rb_node, ioc->service_tree);
+		ioc->service_tree = NULL;
+	}
+
+	BUG_ON(!fiopsd->busy_queues);
+	fiopsd->busy_queues--;
+}
+
+/*
+ * rb tree support functions
+ */
+static void fiops_del_rq_rb(struct request *rq)
+{
+	struct fiops_ioc *ioc = RQ_CIC(rq);
+
+	elv_rb_del(&ioc->sort_list, rq);
+}
+
+static void fiops_add_rq_rb(struct request *rq)
+{
+	struct fiops_ioc *ioc = RQ_CIC(rq);
+	struct fiops_data *fiopsd = ioc->fiopsd;
+
+	elv_rb_add(&ioc->sort_list, rq);
+
+	if (!fiops_ioc_on_rr(ioc))
+		fiops_add_ioc_rr(fiopsd, ioc);
+}
+
+static void fiops_reposition_rq_rb(struct fiops_ioc *ioc, struct request *rq)
+{
+	elv_rb_del(&ioc->sort_list, rq);
+	fiops_add_rq_rb(rq);
+}
+
+static void fiops_remove_request(struct request *rq)
+{
+	list_del_init(&rq->queuelist);
+	fiops_del_rq_rb(rq);
+}
+
+static u64 fiops_scaled_vios(struct fiops_data *fiopsd,
+	struct fiops_ioc *ioc, struct request *rq)
+{
+	return VIOS_SCALE;
+}
+
+/* return vios dispatched */
+static u64 fiops_dispatch_request(struct fiops_data *fiopsd,
+	struct fiops_ioc *ioc)
+{
+	struct request *rq;
+	struct request_queue *q = fiopsd->qdata.queue;
+
+	rq = rq_entry_fifo(ioc->fifo.next);
+
+	fiops_remove_request(rq);
+	elv_dispatch_sort(q, rq);
+
+	return fiops_scaled_vios(fiopsd, ioc, rq);
+}
+
+static int fiops_forced_dispatch(struct fiops_data *fiopsd)
+{
+	struct fiops_ioc *ioc;
+	int dispatched = 0;
+
+	while ((ioc = fiops_rb_first(&fiopsd->service_tree)) != NULL) {
+		while (!list_empty(&ioc->fifo)) {
+			fiops_dispatch_request(fiopsd, ioc);
+			dispatched++;
+		}
+		if (fiops_ioc_on_rr(ioc))
+			fiops_del_ioc_rr(fiopsd, ioc);
+	}
+	return dispatched;
+}
+
+static struct fiops_ioc *fiops_select_ioc(struct fiops_data *fiopsd)
+{
+	struct fiops_ioc *ioc;
+
+	if (RB_EMPTY_ROOT(&fiopsd->service_tree.rb))
+		return NULL;
+	ioc = fiops_rb_first(&fiopsd->service_tree);
+	return ioc;
+}
+
+static void fiops_charge_vios(struct fiops_data *fiopsd,
+	struct fiops_ioc *ioc, u64 vios)
+{
+	struct fiops_rb_root *service_tree = ioc->service_tree;
+	ioc->vios += vios;
+
+	if (RB_EMPTY_ROOT(&ioc->sort_list))
+		fiops_del_ioc_rr(fiopsd, ioc);
+	else
+		fiops_resort_rr_list(fiopsd, ioc);
+
+	fiops_update_min_vios(service_tree);
+}
+
+static int fiops_dispatch_requests(struct request_queue *q, int force)
+{
+	struct fiops_data *fiopsd = q->elevator->elevator_data;
+	struct fiops_ioc *ioc;
+	u64 vios;
+
+	if (unlikely(force))
+		return fiops_forced_dispatch(fiopsd);
+
+	ioc = fiops_select_ioc(fiopsd);
+	if (!ioc)
+		return 0;
+
+	vios = fiops_dispatch_request(fiopsd, ioc);
+
+	fiops_charge_vios(fiopsd, ioc, vios);
+	return 1;
+}
+
+static void fiops_insert_request(struct request_queue *q, struct request *rq)
+{
+	struct fiops_ioc *ioc = RQ_CIC(rq);
+
+	rq_set_fifo_time(rq, jiffies);
+
+	list_add_tail(&rq->queuelist, &ioc->fifo);
+
+	fiops_add_rq_rb(rq);
+}
+
+/*
+ * scheduler run of queue, if there are requests pending and no one in the
+ * driver that will restart queueing
+ */
+static inline void fiops_schedule_dispatch(struct fiops_data *fiopsd)
+{
+	if (fiopsd->busy_queues)
+		kblockd_schedule_work(fiopsd->qdata.queue,
+				      &fiopsd->unplug_work);
+}
+
+static int
+fiops_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
+{
+	struct fiops_data *fiopsd = q->elevator->elevator_data;
+	struct dev_io_context *dev_ioc;
+	struct fiops_ioc *cic;
+
+	might_sleep_if(gfp_mask & __GFP_WAIT);
+
+	dev_ioc = queue_data_get_io_context(&ioc_builder, &fiopsd->qdata,
+		gfp_mask);
+	if (!dev_ioc)
+		goto queue_fail;
+
+	cic = dev_ioc_to_fiops_ioc(dev_ioc);
+
+	/*
+	 * we hold a reference of dev_ioc and nobody else set this request,
+	 * doesn't need locking
+	 */
+	rq->elevator_private[0] = cic;
+
+	return 0;
+
+queue_fail:
+	fiops_schedule_dispatch(fiopsd);
+	return 1;
+}
+
+static void fiops_put_request(struct request *rq)
+{
+	struct fiops_ioc *ioc = RQ_CIC(rq);
+
+	if (ioc) {
+		rq->elevator_private[0] = NULL;
+		put_io_context(ioc->dev_ioc.ioc);
+	}
+}
+
+static struct request *
+fiops_find_rq_fmerge(struct fiops_data *fiopsd, struct bio *bio)
+{
+	struct task_struct *tsk = current;
+	struct dev_io_context *gen_cic;
+	struct fiops_ioc *cic;
+
+	gen_cic = queue_data_cic_lookup(&fiopsd->qdata, tsk->io_context);
+	if (!gen_cic)
+		return NULL;
+	cic = dev_ioc_to_fiops_ioc(gen_cic);
+
+	if (cic) {
+		sector_t sector = bio->bi_sector + bio_sectors(bio);
+
+		return elv_rb_find(&cic->sort_list, sector);
+	}
+
+	return NULL;
+}
+
+static int fiops_merge(struct request_queue *q, struct request **req,
+		     struct bio *bio)
+{
+	struct fiops_data *fiopsd = q->elevator->elevator_data;
+	struct request *__rq;
+
+	__rq = fiops_find_rq_fmerge(fiopsd, bio);
+	if (__rq && elv_rq_merge_ok(__rq, bio)) {
+		*req = __rq;
+		return ELEVATOR_FRONT_MERGE;
+	}
+
+	return ELEVATOR_NO_MERGE;
+}
+
+static void fiops_merged_request(struct request_queue *q, struct request *req,
+			       int type)
+{
+	if (type == ELEVATOR_FRONT_MERGE) {
+		struct fiops_ioc *ioc = RQ_CIC(req);
+
+		fiops_reposition_rq_rb(ioc, req);
+	}
+}
+
+static void
+fiops_merged_requests(struct request_queue *q, struct request *rq,
+		    struct request *next)
+{
+	struct fiops_ioc *ioc = RQ_CIC(rq);
+	struct fiops_data *fiopsd = q->elevator->elevator_data;
+	/*
+	 * reposition in fifo if next is older than rq
+	 */
+	if (!list_empty(&rq->queuelist) && !list_empty(&next->queuelist) &&
+	    time_before(rq_fifo_time(next), rq_fifo_time(rq))) {
+		list_move(&rq->queuelist, &next->queuelist);
+		rq_set_fifo_time(rq, rq_fifo_time(next));
+	}
+
+	fiops_remove_request(next);
+
+	ioc = RQ_CIC(next);
+	/*
+	 * all requests of this task are merged to other tasks, delete it
+	 * from the service tree.
+	 */
+	if (fiops_ioc_on_rr(ioc) && RB_EMPTY_ROOT(&ioc->sort_list))
+		fiops_del_ioc_rr(fiopsd, ioc);
+}
+
+static int fiops_allow_merge(struct request_queue *q, struct request *rq,
+			   struct bio *bio)
+{
+	struct fiops_data *fiopsd = q->elevator->elevator_data;
+	struct dev_io_context *gen_cic;
+	struct fiops_ioc *cic;
+
+	/*
+	 * Lookup the ioc that this bio will be queued with. Allow
+	 * merge only if rq is queued there.
+	 */
+	gen_cic = queue_data_cic_lookup(&fiopsd->qdata, current->io_context);
+	if (!gen_cic)
+		return false;
+	cic = dev_ioc_to_fiops_ioc(gen_cic);
+
+	return cic == RQ_CIC(rq);
+}
+
+static void fiops_exit_queue(struct elevator_queue *e)
+{
+	struct fiops_data *fiopsd = e->elevator_data;
+	struct request_queue *q = fiopsd->qdata.queue;
+
+	cancel_work_sync(&fiopsd->unplug_work);
+
+	spin_lock_irq(q->queue_lock);
+
+	ioc_builder_exit_queue(&ioc_builder, &fiopsd->qdata);
+
+	spin_unlock_irq(q->queue_lock);
+	kfree(fiopsd);
+}
+
+static void fiops_kick_queue(struct work_struct *work)
+{
+	struct fiops_data *fiopsd =
+		container_of(work, struct fiops_data, unplug_work);
+	struct request_queue *q = fiopsd->qdata.queue;
+
+	spin_lock_irq(q->queue_lock);
+	__blk_run_queue(q);
+	spin_unlock_irq(q->queue_lock);
+}
+
+static void *fiops_init_queue(struct request_queue *q)
+{
+	struct fiops_data *fiopsd;
+
+	fiopsd = kzalloc_node(sizeof(*fiopsd), GFP_KERNEL, q->node);
+	if (!fiopsd)
+		return NULL;
+
+	if (ioc_builder_init_queue(&ioc_builder, &fiopsd->qdata, q)) {
+		kfree(fiopsd);
+		return NULL;
+	}
+
+	fiopsd->service_tree = FIOPS_RB_ROOT;
+
+	INIT_WORK(&fiopsd->unplug_work, fiops_kick_queue);
+
+	return fiopsd;
+}
+
+static void fiops_slab_kill(void)
+{
+	/*
+	 * Caller already ensured that pending RCU callbacks are completed,
+	 * so we should have no busy allocations at this point.
+	 */
+	if (fiops_ioc_pool)
+		kmem_cache_destroy(fiops_ioc_pool);
+}
+
+static int __init fiops_slab_setup(void)
+{
+	fiops_ioc_pool = KMEM_CACHE(fiops_ioc, 0);
+	if (!fiops_ioc_pool)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static struct dev_io_context *
+fiops_alloc_ioc(struct ioc_builder *builder, struct queue_data *qdata,
+	gfp_t gfp_mask)
+{
+	struct fiops_ioc *ioc = kmem_cache_alloc_node(fiops_ioc_pool,
+		gfp_mask, qdata->queue->node);
+	if (ioc)
+		return &ioc->dev_ioc;
+	return NULL;
+}
+
+static void fiops_free_ioc(struct ioc_builder *builder,
+	struct dev_io_context *dev_ioc)
+{
+	struct fiops_ioc *ioc = dev_ioc_to_fiops_ioc(dev_ioc);
+	kmem_cache_free(fiops_ioc_pool, ioc);
+}
+
+static void fiops_init_cic(struct queue_data *qdata,
+	struct dev_io_context *gen_cic)
+{
+	struct fiops_data *fiopsd = queue_data_to_fiopsd(qdata);
+	struct fiops_ioc *ioc = dev_ioc_to_fiops_ioc(gen_cic);
+
+	RB_CLEAR_NODE(&ioc->rb_node);
+	INIT_LIST_HEAD(&ioc->fifo);
+	ioc->sort_list = RB_ROOT;
+
+	ioc->fiopsd = fiopsd;
+
+	ioc->pid = current->pid;
+}
+
+static void fiops_exit_cic(struct queue_data *qdata,
+	struct dev_io_context *gen_cic)
+{
+	struct fiops_ioc *ioc = dev_ioc_to_fiops_ioc(gen_cic);
+
+	WARN_ON(fiops_ioc_on_rr(ioc));
+}
+
+static struct ioc_builder ioc_builder = {
+	.alloc_ioc = fiops_alloc_ioc,
+	.free_ioc = fiops_free_ioc,
+	.cic_init = fiops_init_cic,
+	.cic_exit = fiops_exit_cic,
+};
+
+static struct elevator_type iosched_fiops = {
+	.ops = {
+		.elevator_merge_fn =		fiops_merge,
+		.elevator_merged_fn =		fiops_merged_request,
+		.elevator_merge_req_fn =	fiops_merged_requests,
+		.elevator_allow_merge_fn =	fiops_allow_merge,
+		.elevator_dispatch_fn =		fiops_dispatch_requests,
+		.elevator_add_req_fn =		fiops_insert_request,
+		.elevator_former_req_fn =	elv_rb_former_request,
+		.elevator_latter_req_fn =	elv_rb_latter_request,
+		.elevator_set_req_fn =		fiops_set_request,
+		.elevator_put_req_fn =		fiops_put_request,
+		.elevator_init_fn =		fiops_init_queue,
+		.elevator_exit_fn =		fiops_exit_queue,
+		.trim =				queue_data_free_io_context,
+	},
+	.elevator_name =	"fiops",
+	.elevator_owner =	THIS_MODULE,
+};
+
+static int __init fiops_init(void)
+{
+	if (fiops_slab_setup())
+		return -ENOMEM;
+	if (ioc_builder_init(&ioc_builder)) {
+		fiops_slab_kill();
+		return -ENOMEM;
+	}
+
+	elv_register(&iosched_fiops);
+
+	return 0;
+}
+
+static void __exit fiops_exit(void)
+{
+	elv_unregister(&iosched_fiops);
+	io_context_builder_exit(&ioc_builder);
+	fiops_slab_kill();
+}
+
+module_init(fiops_init);
+module_exit(fiops_exit);
+
+MODULE_AUTHOR("Jens Axboe, Shaohua Li <shli@kernel.org>");
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("IOPS based IO scheduler");
Index: linux/block/Kconfig.iosched
===================================================================
--- linux.orig/block/Kconfig.iosched	2011-12-28 09:42:18.000000000 +0800
+++ linux/block/Kconfig.iosched	2012-01-04 13:58:35.000000000 +0800
@@ -43,6 +43,14 @@ config CFQ_GROUP_IOSCHED
 	---help---
 	  Enable group IO scheduling in CFQ.
 
+config IOSCHED_FIOPS
+	tristate "IOPS based I/O scheduler"
+	default y
+	---help---
+	  This is an IOPS based I/O scheduler. It will try to distribute
+          IOPS equally among all processes in the system. It's mainly for
+          Flash based storage.
+
 choice
 	prompt "Default I/O scheduler"
 	default DEFAULT_CFQ
Index: linux/block/blk.h
===================================================================
--- linux.orig/block/blk.h	2011-12-28 09:42:18.000000000 +0800
+++ linux/block/blk.h	2011-12-28 09:42:35.000000000 +0800
@@ -206,7 +206,7 @@ static inline void blk_throtl_exit(struc
 static inline void blk_throtl_release(struct request_queue *q) { }
 #endif /* CONFIG_BLK_DEV_THROTTLING */
 
-#if IS_ENABLED(CONFIG_IOSCHED_CFQ)
+#if IS_ENABLED(CONFIG_IOSCHED_CFQ) || IS_ENABLED(CONFIG_IOSCHED_FIOPS)
 struct queue_data;
 struct ioc_builder {
 	struct dev_io_context *(*alloc_ioc)(struct ioc_builder *builder,
Index: linux/block/blk-ioc.c
===================================================================
--- linux.orig/block/blk-ioc.c	2011-12-28 09:42:18.000000000 +0800
+++ linux/block/blk-ioc.c	2011-12-28 09:42:35.000000000 +0800
@@ -164,7 +164,7 @@ static int __init blk_ioc_init(void)
 }
 subsys_initcall(blk_ioc_init);
 
-#if IS_ENABLED(CONFIG_IOSCHED_CFQ)
+#if IS_ENABLED(CONFIG_IOSCHED_CFQ) || IS_ENABLED(CONFIG_IOSCHED_FIOPS)
 #define CIC_DEAD_INDEX_SHIFT	1
 
 static inline void *queue_data_dead_key(struct queue_data *qdata)


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [RFC 3/3]block: fiops read/write request scale
  2012-01-04  6:53 [RFC 0/3]block: An IOPS based ioscheduler Shaohua Li
  2012-01-04  6:53 ` [RFC 1/3]block: seperate CFQ io context management code Shaohua Li
  2012-01-04  6:53 ` [RFC 2/3]block: FIOPS ioscheduler core Shaohua Li
@ 2012-01-04  6:53 ` Shaohua Li
  2012-01-04  7:19 ` [RFC 0/3]block: An IOPS based ioscheduler Dave Chinner
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 29+ messages in thread
From: Shaohua Li @ 2012-01-04  6:53 UTC (permalink / raw)
  To: linux-kernel; +Cc: axboe, vgoyal, jmoyer, Shaohua Li

[-- Attachment #1: fiops-rw-scale.patch --]
[-- Type: text/plain, Size: 3614 bytes --]

Read and write speeds of flash based storage usually differ. Add a scale to
differentiate read and write. Also add tunables so the user can assign
different scales for read and write. This can also be used to bias reads
or writes in the system.
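
As a rough illustration (not part of the patch), the arithmetic the scale
implies can be sketched in user space. The names mirror fiops_scaled_vios()
below; the read_scale/write_scale values here are just example settings
(both default to 1):

#include <stdio.h>

#define VIOS_SCALE_SHIFT 10
#define VIOS_SCALE (1 << VIOS_SCALE_SHIFT)

static unsigned long long scaled_vios(int is_write, unsigned int read_scale,
				      unsigned int write_scale)
{
	if (!is_write)
		return VIOS_SCALE;
	/* writes are charged write_scale/read_scale times a read */
	return (unsigned long long)VIOS_SCALE * write_scale / read_scale;
}

int main(void)
{
	/* example settings, not the defaults */
	unsigned int read_scale = 1, write_scale = 2;

	printf("read  vios: %llu\n", scaled_vios(0, read_scale, write_scale));
	printf("write vios: %llu\n", scaled_vios(1, read_scale, write_scale));
	return 0;
}

With these example values a writer accumulates vios twice as fast as a
reader, so the service tree picks the reader roughly twice as often when
both are backlogged.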

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
---
 block/fiops-iosched.c |   69 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 68 insertions(+), 1 deletion(-)

Index: linux/block/fiops-iosched.c
===================================================================
--- linux.orig/block/fiops-iosched.c	2012-01-04 14:01:26.000000000 +0800
+++ linux/block/fiops-iosched.c	2012-01-04 14:05:48.000000000 +0800
@@ -15,6 +15,9 @@
 #define VIOS_SCALE_SHIFT 10
 #define VIOS_SCALE (1 << VIOS_SCALE_SHIFT)
 
+#define VIOS_READ_SCALE (1)
+#define VIOS_WRITE_SCALE (1)
+
 struct fiops_rb_root {
 	struct rb_root rb;
 	struct rb_node *left;
@@ -32,6 +35,9 @@ struct fiops_data {
 	unsigned int busy_queues;
 
 	struct work_struct unplug_work;
+
+	unsigned int read_scale;
+	unsigned int write_scale;
 };
 
 struct fiops_ioc {
@@ -268,7 +274,10 @@ static void fiops_remove_request(struct
 static u64 fiops_scaled_vios(struct fiops_data *fiopsd,
 	struct fiops_ioc *ioc, struct request *rq)
 {
-	return VIOS_SCALE;
+	if (rq_data_dir(rq) == READ)
+		return VIOS_SCALE;
+	else
+		return VIOS_SCALE * fiopsd->write_scale / fiopsd->read_scale;
 }
 
 /* return vios dispatched */
@@ -540,6 +549,9 @@ static void *fiops_init_queue(struct req
 
 	INIT_WORK(&fiopsd->unplug_work, fiops_kick_queue);
 
+	fiopsd->read_scale = VIOS_READ_SCALE;
+	fiopsd->write_scale = VIOS_WRITE_SCALE;
+
 	return fiopsd;
 }
 
@@ -610,6 +622,60 @@ static struct ioc_builder ioc_builder =
 	.cic_exit = fiops_exit_cic,
 };
 
+/*
+ * sysfs parts below -->
+ */
+static ssize_t
+fiops_var_show(unsigned int var, char *page)
+{
+	return sprintf(page, "%d\n", var);
+}
+
+static ssize_t
+fiops_var_store(unsigned int *var, const char *page, size_t count)
+{
+	char *p = (char *) page;
+
+	*var = simple_strtoul(p, &p, 10);
+	return count;
+}
+
+#define SHOW_FUNCTION(__FUNC, __VAR)					\
+static ssize_t __FUNC(struct elevator_queue *e, char *page)		\
+{									\
+	struct fiops_data *fiopsd = e->elevator_data;			\
+	return fiops_var_show(__VAR, (page));				\
+}
+SHOW_FUNCTION(fiops_read_scale_show, fiopsd->read_scale);
+SHOW_FUNCTION(fiops_write_scale_show, fiopsd->write_scale);
+#undef SHOW_FUNCTION
+
+#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX)				\
+static ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)	\
+{									\
+	struct fiops_data *fiopsd = e->elevator_data;			\
+	unsigned int __data;						\
+	int ret = fiops_var_store(&__data, (page), count);		\
+	if (__data < (MIN))						\
+		__data = (MIN);						\
+	else if (__data > (MAX))					\
+		__data = (MAX);						\
+	*(__PTR) = __data;						\
+	return ret;							\
+}
+STORE_FUNCTION(fiops_read_scale_store, &fiopsd->read_scale, 1, 100);
+STORE_FUNCTION(fiops_write_scale_store, &fiopsd->write_scale, 1, 100);
+#undef STORE_FUNCTION
+
+#define FIOPS_ATTR(name) \
+	__ATTR(name, S_IRUGO|S_IWUSR, fiops_##name##_show, fiops_##name##_store)
+
+static struct elv_fs_entry fiops_attrs[] = {
+	FIOPS_ATTR(read_scale),
+	FIOPS_ATTR(write_scale),
+	__ATTR_NULL
+};
+
 static struct elevator_type iosched_fiops = {
 	.ops = {
 		.elevator_merge_fn =		fiops_merge,
@@ -626,6 +692,7 @@ static struct elevator_type iosched_fiop
 		.elevator_exit_fn =		fiops_exit_queue,
 		.trim =				queue_data_free_io_context,
 	},
+	.elevator_attrs =	fiops_attrs,
 	.elevator_name =	"fiops",
 	.elevator_owner =	THIS_MODULE,
 };


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 0/3]block: An IOPS based ioscheduler
  2012-01-04  6:53 [RFC 0/3]block: An IOPS based ioscheduler Shaohua Li
                   ` (2 preceding siblings ...)
  2012-01-04  6:53 ` [RFC 3/3]block: fiops read/write request scale Shaohua Li
@ 2012-01-04  7:19 ` Dave Chinner
  2012-01-05  6:50   ` Shaohua Li
  2012-01-06  9:41 ` Zhu Yanhai
  2012-01-15 22:24 ` Vivek Goyal
  5 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2012-01-04  7:19 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-kernel, axboe, vgoyal, jmoyer

On Wed, Jan 04, 2012 at 02:53:37PM +0800, Shaohua Li wrote:
> An IOPS based I/O scheduler
> 
> Flash based storage has some different characteristics against rotate disk.
> 1. no I/O seek.
> 2. read and write I/O cost usually is much different.
> 3. Time which a request takes depends on request size.
> 4. High throughput and IOPS, low latency.
> 
> CFQ iosched does well for rotate disk, for example fair dispatching, idle
> for sequential read. It also has optimization for flash based storage (for
> item 1 above), but overall it's not designed for flash based storage. It's
> a slice based algorithm. Since flash based storage request cost is very
> low, and drive has big queue_depth is quite popular now which makes
> dispatching cost even lower, CFQ's slice accounting (jiffy based)
> doesn't work well. CFQ doesn't consider above item 2 & 3.
> 
> FIOPS (Fair IOPS) ioscheduler is trying to fix the gaps. It's IOPS based, so
> only targets for drive without I/O seek. It's quite similar like CFQ, but
> the dispatch decision is made according to IOPS instead of slice.
> 
> The algorithm is simple. Drive has a service tree, and each task lives in
> the tree. The key into the tree is called vios (virtual I/O). Every request
> has vios, which is calculated according to its ioprio, request size and so
> on. Task's vios is the sum of vios of all requests it dispatches. FIOPS
> always selects task with minimum vios in the service tree and let the task
> dispatch request. The dispatched request's vios is then added to the task's
> vios and the task is repositioned in the sevice tree.
> 
> The series are orgnized as:
> Patch 1: separate CFQ's io context management code. FIOPS will use it too.
> Patch 2: The core FIOPS.
> Patch 3: request read/write vios scale. This demontrates how the vios scale.
> 
> To make the code simple for easy view, some scale code isn't included here,
> some not implementated yet.
> 
> TODO:
> 1. ioprio support (have patch already)
> 2. request size vios scale
> 3. cgroup support
> 4. tracing support
> 5. automatically select default iosched according to QUEUE_FLAG_NONROT.
> 
> Comments and suggestions are welcome!

Benchmark results?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 1/3]block: seperate CFQ io context management code
  2012-01-04  6:53 ` [RFC 1/3]block: seperate CFQ io context management code Shaohua Li
@ 2012-01-04  8:19   ` Namhyung Kim
  0 siblings, 0 replies; 29+ messages in thread
From: Namhyung Kim @ 2012-01-04  8:19 UTC (permalink / raw)
  To: shaohua.li; +Cc: linux-kernel, axboe, vgoyal, jmoyer

Hi,

Two nitpicks below ..


2012-01-04 3:53 PM, Shaohua Li Wrote:
> CFQ's io context management creates a per-device io context for each task.
> It's quite generic. Separate it from CFQ, and use it for fiops I/O scheduler.
> 
> Signed-off-by: Shaohua Li<shaohua.li@intel.com>
> ---

[snip]

> +int ioc_builder_init(struct ioc_builder *builder)
> +{
> +	if (!builder->alloc_ioc || !builder->free_ioc)
> +		return -ENOMEM;
> +
> +	builder->ioc_count = alloc_percpu(unsigned long);
> +	if (!builder->ioc_count)
> +		return -ENOMEM;
> +
> +	builder->ioc_gone = NULL;
> +	spin_lock_init(&builder->ioc_gone_lock);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL(ioc_builder_init);
> +
> +void io_context_builder_exit(struct ioc_builder *builder)

It'd be better using 'ioc_builder_exit' as a name for consistency, IMHO.


> +{
> +	DECLARE_COMPLETION_ONSTACK(all_gone);
> +
> +	builder->ioc_gone =&all_gone;
> +	/* ioc_gone's update must be visible before reading ioc_count */
> +	smp_wmb();
> +
> +	/*
> +	 * this also protects us from entering cfq_slab_kill() with
> +	 * pending RCU callbacks
> +	 */
> +	if (elv_ioc_count_read(*builder->ioc_count))
> +		wait_for_completion(&all_gone);
> +
> +	free_percpu(builder->ioc_count);
> +}
> +EXPORT_SYMBOL(io_context_builder_exit);

[snip]

> +static void queue_data_cic_free_rcu(struct rcu_head *head)
> +{
> +	struct dev_io_context *cic;
> +	struct ioc_builder *builder;
> +
> +	cic = container_of(head, struct dev_io_context, rcu_head);
> +	builder = cic->builder;
> +
> +	builder->free_ioc(builder, cic);
> +	elv_ioc_count_dec(*builder->ioc_count);
> +
> +	if (builder->ioc_gone) {
> +		/*
> +		 * CFQ scheduler is exiting, grab exit lock and check

s/CFQ/IO/ ?

Thanks.
Namhyung Kim

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 0/3]block: An IOPS based ioscheduler
  2012-01-04  7:19 ` [RFC 0/3]block: An IOPS based ioscheduler Dave Chinner
@ 2012-01-05  6:50   ` Shaohua Li
  2012-01-06  5:12     ` Shaohua Li
  0 siblings, 1 reply; 29+ messages in thread
From: Shaohua Li @ 2012-01-05  6:50 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, axboe, vgoyal, jmoyer

On Wed, 2012-01-04 at 18:19 +1100, Dave Chinner wrote:
> On Wed, Jan 04, 2012 at 02:53:37PM +0800, Shaohua Li wrote:
> > An IOPS based I/O scheduler
> > 
> > Flash based storage has some different characteristics against rotate disk.
> > 1. no I/O seek.
> > 2. read and write I/O cost usually is much different.
> > 3. Time which a request takes depends on request size.
> > 4. High throughput and IOPS, low latency.
> > 
> > CFQ iosched does well for rotate disk, for example fair dispatching, idle
> > for sequential read. It also has optimization for flash based storage (for
> > item 1 above), but overall it's not designed for flash based storage. It's
> > a slice based algorithm. Since flash based storage request cost is very
> > low, and drive has big queue_depth is quite popular now which makes
> > dispatching cost even lower, CFQ's slice accounting (jiffy based)
> > doesn't work well. CFQ doesn't consider above item 2 & 3.
> > 
> > FIOPS (Fair IOPS) ioscheduler is trying to fix the gaps. It's IOPS based, so
> > only targets for drive without I/O seek. It's quite similar like CFQ, but
> > the dispatch decision is made according to IOPS instead of slice.
> > 
> > The algorithm is simple. Drive has a service tree, and each task lives in
> > the tree. The key into the tree is called vios (virtual I/O). Every request
> > has vios, which is calculated according to its ioprio, request size and so
> > on. Task's vios is the sum of vios of all requests it dispatches. FIOPS
> > always selects task with minimum vios in the service tree and let the task
> > dispatch request. The dispatched request's vios is then added to the task's
> > vios and the task is repositioned in the sevice tree.
> > 
> > The series are orgnized as:
> > Patch 1: separate CFQ's io context management code. FIOPS will use it too.
> > Patch 2: The core FIOPS.
> > Patch 3: request read/write vios scale. This demontrates how the vios scale.
> > 
> > To make the code simple for easy view, some scale code isn't included here,
> > some not implementated yet.
> > 
> > TODO:
> > 1. ioprio support (have patch already)
> > 2. request size vios scale
> > 3. cgroup support
> > 4. tracing support
> > 5. automatically select default iosched according to QUEUE_FLAG_NONROT.
> > 
> > Comments and suggestions are welcome!
> 
> Benchmark results?
I don't have data yet. The patches are still at an early stage; I want to
focus on the basic idea first.

Thanks,
Shaohua


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 0/3]block: An IOPS based ioscheduler
  2012-01-05  6:50   ` Shaohua Li
@ 2012-01-06  5:12     ` Shaohua Li
  2012-01-06  9:10       ` Namhyung Kim
                         ` (3 more replies)
  0 siblings, 4 replies; 29+ messages in thread
From: Shaohua Li @ 2012-01-06  5:12 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, axboe, vgoyal, jmoyer

On Thu, 2012-01-05 at 14:50 +0800, Shaohua Li wrote:
> On Wed, 2012-01-04 at 18:19 +1100, Dave Chinner wrote:
> > On Wed, Jan 04, 2012 at 02:53:37PM +0800, Shaohua Li wrote:
> > > An IOPS based I/O scheduler
> > > 
> > > Flash based storage has some different characteristics against rotate disk.
> > > 1. no I/O seek.
> > > 2. read and write I/O cost usually is much different.
> > > 3. Time which a request takes depends on request size.
> > > 4. High throughput and IOPS, low latency.
> > > 
> > > CFQ iosched does well for rotate disk, for example fair dispatching, idle
> > > for sequential read. It also has optimization for flash based storage (for
> > > item 1 above), but overall it's not designed for flash based storage. It's
> > > a slice based algorithm. Since flash based storage request cost is very
> > > low, and drive has big queue_depth is quite popular now which makes
> > > dispatching cost even lower, CFQ's slice accounting (jiffy based)
> > > doesn't work well. CFQ doesn't consider above item 2 & 3.
> > > 
> > > FIOPS (Fair IOPS) ioscheduler is trying to fix the gaps. It's IOPS based, so
> > > only targets for drive without I/O seek. It's quite similar like CFQ, but
> > > the dispatch decision is made according to IOPS instead of slice.
> > > 
> > > The algorithm is simple. Drive has a service tree, and each task lives in
> > > the tree. The key into the tree is called vios (virtual I/O). Every request
> > > has vios, which is calculated according to its ioprio, request size and so
> > > on. Task's vios is the sum of vios of all requests it dispatches. FIOPS
> > > always selects task with minimum vios in the service tree and let the task
> > > dispatch request. The dispatched request's vios is then added to the task's
> > > vios and the task is repositioned in the sevice tree.
> > > 
> > > The series are orgnized as:
> > > Patch 1: separate CFQ's io context management code. FIOPS will use it too.
> > > Patch 2: The core FIOPS.
> > > Patch 3: request read/write vios scale. This demontrates how the vios scale.
> > > 
> > > To make the code simple for easy view, some scale code isn't included here,
> > > some not implementated yet.
> > > 
> > > TODO:
> > > 1. ioprio support (have patch already)
> > > 2. request size vios scale
> > > 3. cgroup support
> > > 4. tracing support
> > > 5. automatically select default iosched according to QUEUE_FLAG_NONROT.
> > > 
> > > Comments and suggestions are welcome!
> > 
> > Benchmark results?
> I didn't have data yet. The patches are still in earlier stage, I want
> to focus on the basic idea first.
Since you asked, I tested on a 4-socket machine with a JBOD of 12 X25-M SSDs;
the fs is ext4.

workload		percentage change with fiops against cfq
fio_sync_read_4k        -2
fio_mediaplay_64k       0
fio_mediaplay_128k      0
fio_mediaplay_rr_64k    0
fio_sync_read_rr_4k     0
fio_sync_write_128k     0
fio_sync_write_64k      -1
fio_sync_write_4k       -2
fio_sync_write_64k_create       0
fio_sync_write_rr_64k_create    0
fio_sync_write_128k_create      0
fio_aio_randread_4k     -4
fio_aio_randread_64k    0
fio_aio_randwrite_4k    1
fio_aio_randwrite_64k   0
fio_aio_randrw_4k       -1
fio_aio_randrw_64k      0
fio_tpch        9
fio_tpcc        0
fio_mmap_randread_4k    -1
fio_mmap_randread_64k   1
fio_mmap_randread_1k    -8
fio_mmap_randwrite_4k   35
fio_mmap_randwrite_64k  22
fio_mmap_randwrite_1k   28
fio_mmap_randwrite_4k_halfbusy  24
fio_mmap_randrw_4k      23
fio_mmap_randrw_64k     4
fio_mmap_randrw_1k      22
fio_mmap_randrw_4k_halfbusy     35
fio_mmap_sync_read_4k   0
fio_mmap_sync_read_64k  -1
fio_mmap_sync_read_128k         -1
fio_mmap_sync_read_rr_64k       5
fio_mmap_sync_read_rr_4k        3

fio_mmap_randread_1k shows a regression against 3.2-rc7 but not against the
3.2-rc6 kernel; I'm still checking why. FIOPS shows an improvement for
read/write mixed workloads. CFQ is known to be weak for read/write mixed
workloads.

Thanks,
Shaohua


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 2/3]block: FIOPS ioscheduler core
  2012-01-04  6:53 ` [RFC 2/3]block: FIOPS ioscheduler core Shaohua Li
@ 2012-01-06  6:05   ` Namjae Jeon
  2012-01-07  1:06   ` Zhu Yanhai
  1 sibling, 0 replies; 29+ messages in thread
From: Namjae Jeon @ 2012-01-06  6:05 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-kernel, axboe, vgoyal, jmoyer

> ===================================================================
> --- linux.orig/block/Kconfig.iosched    2011-12-28 09:42:18.000000000 +0800
> +++ linux/block/Kconfig.iosched 2012-01-04 13:58:35.000000000 +0800
> @@ -43,6 +43,14 @@ config CFQ_GROUP_IOSCHED
>        ---help---
>          Enable group IO scheduling in CFQ.
>
> +config IOSCHED_FIOPS
> +       tristate "IOPS based I/O scheduler"
> +       default y
> +       ---help---
> +         This is an IOPS based I/O scheduler. It will try to distribute
> +          IOPS equally among all processes in the system. It's mainly for
> +          Flash based storage.
> +
>  choice
>        prompt "Default I/O scheduler"
>        default DEFAULT_CFQ
Hi.
Is the code below needed?
choice
        prompt "Default I/O scheduler"
        default DEFAULT_CFQ
        help
          Select the I/O scheduler which will be used by default for all
          block devices.

        config DEFAULT_DEADLINE
                bool "Deadline" if IOSCHED_DEADLINE=y

        config DEFAULT_CFQ
                bool "CFQ" if IOSCHED_CFQ=y

+        config DEFAULT_FIOPS
+                bool "FIOPS" if IOSCHED_FIOPS=y

        config DEFAULT_NOOP
                bool "No-op"

config DEFAULT_IOSCHED
        string
        default "deadline" if DEFAULT_DEADLINE
        default "cfq" if DEFAULT_CFQ
        default "noop" if DEFAULT_NOOP
+        default "fiops" if DEFAULT_FIOPS

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 0/3]block: An IOPS based ioscheduler
  2012-01-06  5:12     ` Shaohua Li
@ 2012-01-06  9:10       ` Namhyung Kim
  2012-01-06 14:37       ` Jan Kara
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 29+ messages in thread
From: Namhyung Kim @ 2012-01-06  9:10 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Dave Chinner, linux-kernel, axboe, vgoyal, jmoyer

2012-01-06 PM 2:12, Shaohua Li wrote:
> On Thu, 2012-01-05 at 14:50 +0800, Shaohua Li wrote:
>> On Wed, 2012-01-04 at 18:19 +1100, Dave Chinner wrote:
>>> On Wed, Jan 04, 2012 at 02:53:37PM +0800, Shaohua Li wrote:
>>>> An IOPS based I/O scheduler
>>>>
>>>> Flash based storage has some different characteristics against rotate disk.
>>>> 1. no I/O seek.
>>>> 2. read and write I/O cost usually is much different.
>>>> 3. Time which a request takes depends on request size.
>>>> 4. High throughput and IOPS, low latency.
>>>>
>>>> CFQ iosched does well for rotate disk, for example fair dispatching, idle
>>>> for sequential read. It also has optimization for flash based storage (for
>>>> item 1 above), but overall it's not designed for flash based storage. It's
>>>> a slice based algorithm. Since flash based storage request cost is very
>>>> low, and drive has big queue_depth is quite popular now which makes
>>>> dispatching cost even lower, CFQ's slice accounting (jiffy based)
>>>> doesn't work well. CFQ doesn't consider above item 2&  3.
>>>>
>>>> FIOPS (Fair IOPS) ioscheduler is trying to fix the gaps. It's IOPS based, so
>>>> only targets for drive without I/O seek. It's quite similar like CFQ, but
>>>> the dispatch decision is made according to IOPS instead of slice.
>>>>
>>>> The algorithm is simple. Drive has a service tree, and each task lives in
>>>> the tree. The key into the tree is called vios (virtual I/O). Every request
>>>> has vios, which is calculated according to its ioprio, request size and so
>>>> on. Task's vios is the sum of vios of all requests it dispatches. FIOPS
>>>> always selects task with minimum vios in the service tree and let the task
>>>> dispatch request. The dispatched request's vios is then added to the task's
>>>> vios and the task is repositioned in the sevice tree.
>>>>
>>>> The series are orgnized as:
>>>> Patch 1: separate CFQ's io context management code. FIOPS will use it too.
>>>> Patch 2: The core FIOPS.
>>>> Patch 3: request read/write vios scale. This demontrates how the vios scale.
>>>>
>>>> To make the code simple for easy view, some scale code isn't included here,
>>>> some not implementated yet.
>>>>
>>>> TODO:
>>>> 1. ioprio support (have patch already)
>>>> 2. request size vios scale
>>>> 3. cgroup support
>>>> 4. tracing support
>>>> 5. automatically select default iosched according to QUEUE_FLAG_NONROT.
>>>>
>>>> Comments and suggestions are welcome!
>>>
>>> Benchmark results?
>> I didn't have data yet. The patches are still in earlier stage, I want
>> to focus on the basic idea first.
> since you asked, I tested in a 4 socket machine with 12 X25M SSD jbod,
> fs is ext4.
>
> workload		percentage change with fiops against cfq
> fio_sync_read_4k        -2
> fio_mediaplay_64k       0
> fio_mediaplay_128k      0
> fio_mediaplay_rr_64k    0
> fio_sync_read_rr_4k     0
> fio_sync_write_128k     0
> fio_sync_write_64k      -1
> fio_sync_write_4k       -2
> fio_sync_write_64k_create       0
> fio_sync_write_rr_64k_create    0
> fio_sync_write_128k_create      0
> fio_aio_randread_4k     -4
> fio_aio_randread_64k    0
> fio_aio_randwrite_4k    1
> fio_aio_randwrite_64k   0
> fio_aio_randrw_4k       -1
> fio_aio_randrw_64k      0
> fio_tpch        9
> fio_tpcc        0
> fio_mmap_randread_4k    -1
> fio_mmap_randread_64k   1
> fio_mmap_randread_1k    -8
> fio_mmap_randwrite_4k   35
> fio_mmap_randwrite_64k  22
> fio_mmap_randwrite_1k   28
> fio_mmap_randwrite_4k_halfbusy  24
> fio_mmap_randrw_4k      23
> fio_mmap_randrw_64k     4
> fio_mmap_randrw_1k      22
> fio_mmap_randrw_4k_halfbusy     35
> fio_mmap_sync_read_4k   0
> fio_mmap_sync_read_64k  -1
> fio_mmap_sync_read_128k         -1
> fio_mmap_sync_read_rr_64k       5
> fio_mmap_sync_read_rr_4k        3
>
> The fio_mmap_randread_1k has regression against 3.2-rc7, but no
> regression against 3.2-rc6 kernel, still checking why. The fiops has
> improvement for read/write mixed workload. CFQ is known not good for
> read/write mixed workload.
>
> Thanks,
> Shaohua
>

Hi,

Looks promising. :) Anyway what's your configuration for the test? Did 
you use vios scaling based on IO direction and/or ioprio?

Thanks,
Namhyung Kim

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 0/3]block: An IOPS based ioscheduler
  2012-01-04  6:53 [RFC 0/3]block: An IOPS based ioscheduler Shaohua Li
                   ` (3 preceding siblings ...)
  2012-01-04  7:19 ` [RFC 0/3]block: An IOPS based ioscheduler Dave Chinner
@ 2012-01-06  9:41 ` Zhu Yanhai
  2012-01-15 22:24 ` Vivek Goyal
  5 siblings, 0 replies; 29+ messages in thread
From: Zhu Yanhai @ 2012-01-06  9:41 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-kernel, axboe, vgoyal, jmoyer

Hi,

2012/1/4 Shaohua Li <shaohua.li@intel.com>:
> An IOPS based I/O scheduler
>
> Flash based storage has some different characteristics against rotate disk.
> 1. no I/O seek.
> 2. read and write I/O cost usually is much different.
> 3. Time which a request takes depends on request size.
> 4. High throughput and IOPS, low latency.
>
> CFQ iosched does well for rotate disk, for example fair dispatching, idle
> for sequential read. It also has optimization for flash based storage (for
> item 1 above), but overall it's not designed for flash based storage. It's
> a slice based algorithm. Since flash based storage request cost is very
> low, and drive has big queue_depth is quite popular now which makes
> dispatching cost even lower, CFQ's slice accounting (jiffy based)
> doesn't work well. CFQ doesn't consider above item 2 & 3.
>
> FIOPS (Fair IOPS) ioscheduler is trying to fix the gaps. It's IOPS based, so
> only targets for drive without I/O seek. It's quite similar like CFQ, but
> the dispatch decision is made according to IOPS instead of slice.
>
> The algorithm is simple. Drive has a service tree, and each task lives in
> the tree. The key into the tree is called vios (virtual I/O). Every request
> has vios, which is calculated according to its ioprio, request size and so
> on. Task's vios is the sum of vios of all requests it dispatches. FIOPS
> always selects task with minimum vios in the service tree and let the task
> dispatch request. The dispatched request's vios is then added to the task's
> vios and the task is repositioned in the sevice tree.
>
> The series are orgnized as:
> Patch 1: separate CFQ's io context management code. FIOPS will use it too.
> Patch 2: The core FIOPS.
> Patch 3: request read/write vios scale. This demontrates how the vios scale.
>
> To make the code simple for easy view, some scale code isn't included here,
> some not implementated yet.
>
> TODO:
> 1. ioprio support (have patch already)
> 2. request size vios scale
> 3. cgroup support
> 4. tracing support
> 5. automatically select default iosched according to QUEUE_FLAG_NONROT.

Actually most RAID cards we have ever seen report the wrong value for
this flag, including LSI's 1078E, 9260/9265, HBA cards, etc. So it may
need a human being to set it up correctly for this adaptive code to work.
Also there is a much more complicated scenario ahead :) hybrid storage.
E.g. one or two SSDs mixed with several SAS disks, or one expensive
FusionIO ioDrive mixed with several SAS disks, powered by Facebook's
Flashcache. AFAIK there is no I/O bandwidth control solution for such
setups. However, such setups are really quite popular these days.
I'm going to run some tests with this code against real workloads this
month; hopefully we can get more ideas, and of course feedback data.
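
For reference, a minimal user-space check of what the kernel currently
reports for a given disk. It only assumes the standard
/sys/block/<dev>/queue/rotational attribute, which an admin can also write
0 or 1 to when an HBA misreports:

#include <stdio.h>

int main(int argc, char **argv)
{
	char path[256];
	int rot;
	FILE *f;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <disk>, e.g. %s sda\n", argv[0], argv[0]);
		return 1;
	}
	snprintf(path, sizeof(path), "/sys/block/%s/queue/rotational", argv[1]);
	f = fopen(path, "r");
	if (!f || fscanf(f, "%d", &rot) != 1) {
		perror(path);
		return 1;
	}
	fclose(f);
	printf("%s: rotational=%d (%s)\n", argv[1], rot,
	       rot ? "treated as seeking disk" : "treated as non-rotational");
	return 0;
}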

--
Thanks,
Zhu Yanhai
Taobao Inc.

>
> Comments and suggestions are welcome!
>
> Thanks,
> Shaohua
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 0/3]block: An IOPS based ioscheduler
  2012-01-06  5:12     ` Shaohua Li
  2012-01-06  9:10       ` Namhyung Kim
@ 2012-01-06 14:37       ` Jan Kara
  2012-01-09  1:26         ` Shaohua Li
  2012-01-08 22:16       ` Dave Chinner
  2012-01-15 22:28       ` Vivek Goyal
  3 siblings, 1 reply; 29+ messages in thread
From: Jan Kara @ 2012-01-06 14:37 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Dave Chinner, linux-kernel, axboe, vgoyal, jmoyer

On Fri 06-01-12 13:12:29, Shaohua Li wrote:
> On Thu, 2012-01-05 at 14:50 +0800, Shaohua Li wrote:
> > On Wed, 2012-01-04 at 18:19 +1100, Dave Chinner wrote:
> > > On Wed, Jan 04, 2012 at 02:53:37PM +0800, Shaohua Li wrote:
> > > > An IOPS based I/O scheduler
> > > > 
> > > > Flash based storage has some different characteristics against rotate disk.
> > > > 1. no I/O seek.
> > > > 2. read and write I/O cost usually is much different.
> > > > 3. Time which a request takes depends on request size.
> > > > 4. High throughput and IOPS, low latency.
> > > > 
> > > > CFQ iosched does well for rotate disk, for example fair dispatching, idle
> > > > for sequential read. It also has optimization for flash based storage (for
> > > > item 1 above), but overall it's not designed for flash based storage. It's
> > > > a slice based algorithm. Since flash based storage request cost is very
> > > > low, and drive has big queue_depth is quite popular now which makes
> > > > dispatching cost even lower, CFQ's slice accounting (jiffy based)
> > > > doesn't work well. CFQ doesn't consider above item 2 & 3.
> > > > 
> > > > FIOPS (Fair IOPS) ioscheduler is trying to fix the gaps. It's IOPS based, so
> > > > only targets for drive without I/O seek. It's quite similar like CFQ, but
> > > > the dispatch decision is made according to IOPS instead of slice.
> > > > 
> > > > The algorithm is simple. Drive has a service tree, and each task lives in
> > > > the tree. The key into the tree is called vios (virtual I/O). Every request
> > > > has vios, which is calculated according to its ioprio, request size and so
> > > > on. Task's vios is the sum of vios of all requests it dispatches. FIOPS
> > > > always selects task with minimum vios in the service tree and let the task
> > > > dispatch request. The dispatched request's vios is then added to the task's
> > > > vios and the task is repositioned in the sevice tree.
> > > > 
> > > > The series are orgnized as:
> > > > Patch 1: separate CFQ's io context management code. FIOPS will use it too.
> > > > Patch 2: The core FIOPS.
> > > > Patch 3: request read/write vios scale. This demontrates how the vios scale.
> > > > 
> > > > To make the code simple for easy view, some scale code isn't included here,
> > > > some not implementated yet.
> > > > 
> > > > TODO:
> > > > 1. ioprio support (have patch already)
> > > > 2. request size vios scale
> > > > 3. cgroup support
> > > > 4. tracing support
> > > > 5. automatically select default iosched according to QUEUE_FLAG_NONROT.
> > > > 
> > > > Comments and suggestions are welcome!
> > > 
> > > Benchmark results?
> > I didn't have data yet. The patches are still in earlier stage, I want
> > to focus on the basic idea first.
> since you asked, I tested in a 4 socket machine with 12 X25M SSD jbod,
> fs is ext4.
> 
> workload		percentage change with fiops against cfq
> fio_sync_read_4k        -2
> fio_mediaplay_64k       0
> fio_mediaplay_128k      0
> fio_mediaplay_rr_64k    0
> fio_sync_read_rr_4k     0
> fio_sync_write_128k     0
> fio_sync_write_64k      -1
> fio_sync_write_4k       -2
> fio_sync_write_64k_create       0
> fio_sync_write_rr_64k_create    0
> fio_sync_write_128k_create      0
> fio_aio_randread_4k     -4
> fio_aio_randread_64k    0
> fio_aio_randwrite_4k    1
> fio_aio_randwrite_64k   0
> fio_aio_randrw_4k       -1
> fio_aio_randrw_64k      0
> fio_tpch        9
> fio_tpcc        0
> fio_mmap_randread_4k    -1
> fio_mmap_randread_64k   1
> fio_mmap_randread_1k    -8
> fio_mmap_randwrite_4k   35
> fio_mmap_randwrite_64k  22
> fio_mmap_randwrite_1k   28
> fio_mmap_randwrite_4k_halfbusy  24
> fio_mmap_randrw_4k      23
> fio_mmap_randrw_64k     4
> fio_mmap_randrw_1k      22
> fio_mmap_randrw_4k_halfbusy     35
> fio_mmap_sync_read_4k   0
> fio_mmap_sync_read_64k  -1
> fio_mmap_sync_read_128k         -1
> fio_mmap_sync_read_rr_64k       5
> fio_mmap_sync_read_rr_4k        3
> 
> The fio_mmap_randread_1k has regression against 3.2-rc7, but no
> regression against 3.2-rc6 kernel, still checking why. The fiops has
> improvement for read/write mixed workload. CFQ is known not good for
> read/write mixed workload.
  Nice, but this is only about throughput, isn't it? Part of the reason why
CFQ takes a hit in throughput is that it prefers sync IO (i.e. reads and
synchronous writes) over other IO. Does your scheduler do anything like
that? Could you, for example, compare the latency of reads while running
heavy background writes between CFQ and your scheduler? Loads like this
were the original motivation for CFQ, I believe.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 2/3]block: FIOPS ioscheduler core
  2012-01-04  6:53 ` [RFC 2/3]block: FIOPS ioscheduler core Shaohua Li
  2012-01-06  6:05   ` Namjae Jeon
@ 2012-01-07  1:06   ` Zhu Yanhai
  1 sibling, 0 replies; 29+ messages in thread
From: Zhu Yanhai @ 2012-01-07  1:06 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-kernel, axboe, vgoyal, jmoyer

2012/1/4 Shaohua Li <shaohua.li@intel.com>:
> FIOPS (Fair IOPS) ioscheduler is IOPS based ioscheduler, so only targets
> for drive without I/O seek. It's quite similar like CFQ, but the dispatch
> decision is made according to IOPS instead of slice.
>
> The algorithm is simple. Drive has a service tree, and each task lives in
> the tree. The key into the tree is called vios (virtual I/O). Every request
> has vios, which is calculated according to its ioprio, request size and so
> on. Task's vios is the sum of vios of all requests it dispatches. FIOPS
> always selects task with minimum vios in the service tree and let the task
> dispatch request. The dispatched request's vios is then added to the task's
> vios and the task is repositioned in the sevice tree.
>
> Unlike CFQ, FIOPS doesn't have separate sync/async queues, because with I/O
> less writeback, usually a task can only dispatch either sync or async requests.
> Bias read or write request can still be done with read/write scale.
>
> One issue is if workload iodepth is lower than drive queue_depth, IOPS
> share of a task might not be strictly according to its priority, request
> size and so on. In this case, the drive is in idle actually. Solving the
> problem need make drive idle, so impact performance. I believe CFQ isn't
> completely fair between tasks in such case too.
>
> Signed-off-by: Shaohua Li <shaohua.li@intel.com>
> ---
>  block/Kconfig.iosched |    8
>  block/Makefile        |    1
>  block/blk-ioc.c       |    2
>  block/blk.h           |    2
>  block/fiops-iosched.c |  659 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  5 files changed, 670 insertions(+), 2 deletions(-)
>
> Index: linux/block/Makefile
> ===================================================================
> --- linux.orig/block/Makefile   2011-12-28 09:42:18.000000000 +0800
> +++ linux/block/Makefile        2011-12-28 09:42:35.000000000 +0800
> @@ -14,6 +14,7 @@ obj-$(CONFIG_BLK_DEV_THROTTLING)      += blk-
>  obj-$(CONFIG_IOSCHED_NOOP)     += noop-iosched.o
>  obj-$(CONFIG_IOSCHED_DEADLINE) += deadline-iosched.o
>  obj-$(CONFIG_IOSCHED_CFQ)      += cfq-iosched.o
> +obj-$(CONFIG_IOSCHED_FIOPS)    += fiops-iosched.o
>
>  obj-$(CONFIG_BLOCK_COMPAT)     += compat_ioctl.o
>  obj-$(CONFIG_BLK_DEV_INTEGRITY)        += blk-integrity.o
> Index: linux/block/fiops-iosched.c
> ===================================================================
> --- /dev/null   1970-01-01 00:00:00.000000000 +0000
> +++ linux/block/fiops-iosched.c 2012-01-04 14:01:26.000000000 +0800
> @@ -0,0 +1,659 @@
> +/*
> + * IOPS based IO scheduler. Based on CFQ.
> + *  Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
> + *  Shaohua Li <shli@kernel.org>
> + */
> +#include <linux/module.h>
> +#include <linux/slab.h>
> +#include <linux/blkdev.h>
> +#include <linux/elevator.h>
> +#include <linux/jiffies.h>
> +#include <linux/rbtree.h>
> +#include <linux/ioprio.h>
> +#include "blk.h"
> +
> +#define VIOS_SCALE_SHIFT 10
> +#define VIOS_SCALE (1 << VIOS_SCALE_SHIFT)
> +
> +struct fiops_rb_root {
> +       struct rb_root rb;
> +       struct rb_node *left;
> +       unsigned count;
> +
> +       u64 min_vios;
> +};
> +#define FIOPS_RB_ROOT  (struct fiops_rb_root) { .rb = RB_ROOT}
> +
> +struct fiops_data {
> +       struct queue_data qdata;
> +
> +       struct fiops_rb_root service_tree;
> +
> +       unsigned int busy_queues;
> +
> +       struct work_struct unplug_work;
> +};
> +
> +struct fiops_ioc {
> +       struct dev_io_context dev_ioc;
> +
> +       unsigned int flags;
> +       struct fiops_data *fiopsd;
> +       struct rb_node rb_node;
> +       u64 vios; /* key in service_tree */
> +       struct fiops_rb_root *service_tree;
> +
> +       struct rb_root sort_list;
> +       struct list_head fifo;
> +
> +       pid_t pid;
> +};
> +
> +static struct kmem_cache *fiops_ioc_pool;
> +static struct ioc_builder ioc_builder;
> +#define queue_data_to_fiopsd(ptr) container_of(ptr, struct fiops_data, qdata)
> +#define dev_ioc_to_fiops_ioc(ptr) container_of(ptr, struct fiops_ioc, dev_ioc)
> +#define ioc_service_tree(ioc) (&((ioc)->fiopsd->service_tree))
> +
> +#define RQ_CIC(rq)             \
> +       ((struct fiops_ioc *) (rq)->elevator_private[0])
> +enum ioc_state_flags {
> +       FIOPS_IOC_FLAG_on_rr = 0,       /* on round-robin busy list */
> +};
> +
> +#define FIOPS_IOC_FNS(name)                                            \
> +static inline void fiops_mark_ioc_##name(struct fiops_ioc *ioc)        \
> +{                                                                      \
> +       ioc->flags |= (1 << FIOPS_IOC_FLAG_##name);                     \
> +}                                                                      \
> +static inline void fiops_clear_ioc_##name(struct fiops_ioc *ioc)       \
> +{                                                                      \
> +       ioc->flags &= ~(1 << FIOPS_IOC_FLAG_##name);                    \
> +}                                                                      \
> +static inline int fiops_ioc_##name(const struct fiops_ioc *ioc)        \
> +{                                                                      \
> +       return ((ioc)->flags & (1 << FIOPS_IOC_FLAG_##name)) != 0;      \
> +}
> +
> +FIOPS_IOC_FNS(on_rr);
> +#undef FIOPS_IOC_FNS
> +
> +/*
> + * The below is leftmost cache rbtree addon
> + */
> +static struct fiops_ioc *fiops_rb_first(struct fiops_rb_root *root)
> +{
> +       /* Service tree is empty */
> +       if (!root->count)
> +               return NULL;
> +
> +       if (!root->left)
> +               root->left = rb_first(&root->rb);
> +
> +       if (root->left)
> +               return rb_entry(root->left, struct fiops_ioc, rb_node);
> +
> +       return NULL;
> +}
> +
> +static void rb_erase_init(struct rb_node *n, struct rb_root *root)
> +{
> +       rb_erase(n, root);
> +       RB_CLEAR_NODE(n);
> +}
> +
> +static void fiops_rb_erase(struct rb_node *n, struct fiops_rb_root *root)
> +{
> +       if (root->left == n)
> +               root->left = NULL;
> +       rb_erase_init(n, &root->rb);
> +       --root->count;
> +}
> +
> +static inline u64 max_vios(u64 min_vios, u64 vios)
> +{
> +       s64 delta = (s64)(vios - min_vios);
> +       if (delta > 0)
> +               min_vios = vios;
> +
> +       return min_vios;
> +}
> +
> +static void fiops_update_min_vios(struct fiops_rb_root *service_tree)
> +{
> +       struct fiops_ioc *ioc;
> +
> +       ioc = fiops_rb_first(service_tree);
> +       if (!ioc)
> +               return;
> +       service_tree->min_vios = max_vios(service_tree->min_vios, ioc->vios);
This is tracking the max vios here, not the min one. Should be min_vios()?
> +}
> +
> +/*
> + * The fiopsd->service_trees holds all pending fiops_ioc's that have
> + * requests waiting to be processed. It is sorted in the order that
> + * we will service the queues.
> + */
> +static void fiops_service_tree_add(struct fiops_data *fiopsd,
> +       struct fiops_ioc *ioc)
> +{
> +       struct rb_node **p, *parent;
> +       struct fiops_ioc *__ioc;
> +       struct fiops_rb_root *service_tree = ioc_service_tree(ioc);
> +       u64 vios;
> +       int left;
> +
> +       /* New added IOC */
> +       if (RB_EMPTY_NODE(&ioc->rb_node))
> +               vios = max_vios(service_tree->min_vios, ioc->vios);
> +       else {
> +               vios = ioc->vios;
> +               /* ioc->service_tree might not equal to service_tree */
> +               fiops_rb_erase(&ioc->rb_node, ioc->service_tree);
> +               ioc->service_tree = NULL;
> +       }
> +
> +       left = 1;
> +       parent = NULL;
> +       ioc->service_tree = service_tree;
> +       p = &service_tree->rb.rb_node;
> +       while (*p) {
> +               struct rb_node **n;
> +
> +               parent = *p;
> +               __ioc = rb_entry(parent, struct fiops_ioc, rb_node);
> +
> +               /*
> +                * sort by key, that represents service time.
> +                */
> +               if (vios <  __ioc->vios)
> +                       n = &(*p)->rb_left;
> +               else {
> +                       n = &(*p)->rb_right;
> +                       left = 0;
> +               }
> +
> +               p = n;
> +       }
> +
> +       if (left)
> +               service_tree->left = &ioc->rb_node;
> +
> +       ioc->vios = vios;
> +       rb_link_node(&ioc->rb_node, parent, p);
> +       rb_insert_color(&ioc->rb_node, &service_tree->rb);
> +       service_tree->count++;
> +
> +       fiops_update_min_vios(service_tree);
> +}
> +
> +/*
> + * Update ioc's position in the service tree.
> + */
> +static void fiops_resort_rr_list(struct fiops_data *fiopsd,
> +       struct fiops_ioc *ioc)
> +{
> +       /*
> +        * Resorting requires the ioc to be on the RR list already.
> +        */
> +       if (fiops_ioc_on_rr(ioc))
> +               fiops_service_tree_add(fiopsd, ioc);
> +}
> +
> +/*
> + * add to busy list of queues for service, trying to be fair in ordering
> + * the pending list according to last request service
> + */
> +static void fiops_add_ioc_rr(struct fiops_data *fiopsd, struct fiops_ioc *ioc)
> +{
> +       BUG_ON(fiops_ioc_on_rr(ioc));
> +       fiops_mark_ioc_on_rr(ioc);
> +
> +       fiopsd->busy_queues++;
> +
> +       fiops_resort_rr_list(fiopsd, ioc);
> +}
> +
> +/*
> + * Called when the ioc no longer has requests pending, remove it from
> + * the service tree.
> + */
> +static void fiops_del_ioc_rr(struct fiops_data *fiopsd, struct fiops_ioc *ioc)
> +{
> +       BUG_ON(!fiops_ioc_on_rr(ioc));
> +       fiops_clear_ioc_on_rr(ioc);
> +
> +       if (!RB_EMPTY_NODE(&ioc->rb_node)) {
> +               fiops_rb_erase(&ioc->rb_node, ioc->service_tree);
> +               ioc->service_tree = NULL;
> +       }
> +
> +       BUG_ON(!fiopsd->busy_queues);
> +       fiopsd->busy_queues--;
> +}
> +
> +/*
> + * rb tree support functions
> + */
> +static void fiops_del_rq_rb(struct request *rq)
> +{
> +       struct fiops_ioc *ioc = RQ_CIC(rq);
> +
> +       elv_rb_del(&ioc->sort_list, rq);
> +}
> +
> +static void fiops_add_rq_rb(struct request *rq)
> +{
> +       struct fiops_ioc *ioc = RQ_CIC(rq);
> +       struct fiops_data *fiopsd = ioc->fiopsd;
> +
> +       elv_rb_add(&ioc->sort_list, rq);
> +
> +       if (!fiops_ioc_on_rr(ioc))
> +               fiops_add_ioc_rr(fiopsd, ioc);
> +}
> +
> +static void fiops_reposition_rq_rb(struct fiops_ioc *ioc, struct request *rq)
> +{
> +       elv_rb_del(&ioc->sort_list, rq);
> +       fiops_add_rq_rb(rq);
> +}
> +
> +static void fiops_remove_request(struct request *rq)
> +{
> +       list_del_init(&rq->queuelist);
> +       fiops_del_rq_rb(rq);
> +}
> +
> +static u64 fiops_scaled_vios(struct fiops_data *fiopsd,
> +       struct fiops_ioc *ioc, struct request *rq)
> +{
> +       return VIOS_SCALE;
> +}
> +
> +/* return vios dispatched */
> +static u64 fiops_dispatch_request(struct fiops_data *fiopsd,
> +       struct fiops_ioc *ioc)
> +{
> +       struct request *rq;
> +       struct request_queue *q = fiopsd->qdata.queue;
> +
> +       rq = rq_entry_fifo(ioc->fifo.next);
> +
> +       fiops_remove_request(rq);
> +       elv_dispatch_sort(q, rq);
> +
> +       return fiops_scaled_vios(fiopsd, ioc, rq);
> +}
> +
> +static int fiops_forced_dispatch(struct fiops_data *fiopsd)
> +{
> +       struct fiops_ioc *ioc;
> +       int dispatched = 0;
> +
> +       while ((ioc = fiops_rb_first(&fiopsd->service_tree)) != NULL) {
> +               while (!list_empty(&ioc->fifo)) {
> +                       fiops_dispatch_request(fiopsd, ioc);
> +                       dispatched++;
> +               }
> +               if (fiops_ioc_on_rr(ioc))
> +                       fiops_del_ioc_rr(fiopsd, ioc);
> +       }
> +       return dispatched;
> +}
> +
> +static struct fiops_ioc *fiops_select_ioc(struct fiops_data *fiopsd)
> +{
> +       struct fiops_ioc *ioc;
> +
> +       if (RB_EMPTY_ROOT(&fiopsd->service_tree.rb))
> +               return NULL;
> +       ioc = fiops_rb_first(&fiopsd->service_tree);
> +       return ioc;
> +}
> +
> +static void fiops_charge_vios(struct fiops_data *fiopsd,
> +       struct fiops_ioc *ioc, u64 vios)
> +{
> +       struct fiops_rb_root *service_tree = ioc->service_tree;
> +       ioc->vios += vios;
> +
> +       if (RB_EMPTY_ROOT(&ioc->sort_list))
> +               fiops_del_ioc_rr(fiopsd, ioc);
> +       else
> +               fiops_resort_rr_list(fiopsd, ioc);
> +
> +       fiops_update_min_vios(service_tree);
> +}
> +
> +static int fiops_dispatch_requests(struct request_queue *q, int force)
> +{
> +       struct fiops_data *fiopsd = q->elevator->elevator_data;
> +       struct fiops_ioc *ioc;
> +       u64 vios;
> +
> +       if (unlikely(force))
> +               return fiops_forced_dispatch(fiopsd);
> +
> +       ioc = fiops_select_ioc(fiopsd);
> +       if (!ioc)
> +               return 0;
> +
> +       vios = fiops_dispatch_request(fiopsd, ioc);
> +
> +       fiops_charge_vios(fiopsd, ioc, vios);
> +       return 1;
> +}
> +
> +static void fiops_insert_request(struct request_queue *q, struct request *rq)
> +{
> +       struct fiops_ioc *ioc = RQ_CIC(rq);
> +
> +       rq_set_fifo_time(rq, jiffies);
> +
> +       list_add_tail(&rq->queuelist, &ioc->fifo);
> +
> +       fiops_add_rq_rb(rq);
> +}
> +
> +/*
> + * scheduler run of queue, if there are requests pending and no one in the
> + * driver that will restart queueing
> + */
> +static inline void fiops_schedule_dispatch(struct fiops_data *fiopsd)
> +{
> +       if (fiopsd->busy_queues)
> +               kblockd_schedule_work(fiopsd->qdata.queue,
> +                                     &fiopsd->unplug_work);
> +}
> +
> +static int
> +fiops_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
> +{
> +       struct fiops_data *fiopsd = q->elevator->elevator_data;
> +       struct dev_io_context *dev_ioc;
> +       struct fiops_ioc *cic;
> +
> +       might_sleep_if(gfp_mask & __GFP_WAIT);
> +
> +       dev_ioc = queue_data_get_io_context(&ioc_builder, &fiopsd->qdata,
> +               gfp_mask);
> +       if (!dev_ioc)
> +               goto queue_fail;
> +
> +       cic = dev_ioc_to_fiops_ioc(dev_ioc);
> +
> +       /*
> +        * we hold a reference of dev_ioc and nobody else set this request,
> +        * doesn't need locking
> +        */
> +       rq->elevator_private[0] = cic;
> +
> +       return 0;
> +
> +queue_fail:
> +       fiops_schedule_dispatch(fiopsd);
> +       return 1;
> +}
> +
> +static void fiops_put_request(struct request *rq)
> +{
> +       struct fiops_ioc *ioc = RQ_CIC(rq);
> +
> +       if (ioc) {
> +               rq->elevator_private[0] = NULL;
> +               put_io_context(ioc->dev_ioc.ioc);
> +       }
> +}
> +
> +static struct request *
> +fiops_find_rq_fmerge(struct fiops_data *fiopsd, struct bio *bio)
> +{
> +       struct task_struct *tsk = current;
> +       struct dev_io_context *gen_cic;
> +       struct fiops_ioc *cic;
> +
> +       gen_cic = queue_data_cic_lookup(&fiopsd->qdata, tsk->io_context);
> +       if (!gen_cic)
> +               return NULL;
> +       cic = dev_ioc_to_fiops_ioc(gen_cic);
> +
> +       if (cic) {
> +               sector_t sector = bio->bi_sector + bio_sectors(bio);
> +
> +               return elv_rb_find(&cic->sort_list, sector);
> +       }
> +
> +       return NULL;
> +}
> +
> +static int fiops_merge(struct request_queue *q, struct request **req,
> +                    struct bio *bio)
> +{
> +       struct fiops_data *fiopsd = q->elevator->elevator_data;
> +       struct request *__rq;
> +
> +       __rq = fiops_find_rq_fmerge(fiopsd, bio);
> +       if (__rq && elv_rq_merge_ok(__rq, bio)) {
> +               *req = __rq;
> +               return ELEVATOR_FRONT_MERGE;
> +       }
> +
> +       return ELEVATOR_NO_MERGE;
> +}
> +
> +static void fiops_merged_request(struct request_queue *q, struct request *req,
> +                              int type)
> +{
> +       if (type == ELEVATOR_FRONT_MERGE) {
> +               struct fiops_ioc *ioc = RQ_CIC(req);
> +
> +               fiops_reposition_rq_rb(ioc, req);
> +       }
> +}
> +
> +static void
> +fiops_merged_requests(struct request_queue *q, struct request *rq,
> +                   struct request *next)
> +{
> +       struct fiops_ioc *ioc = RQ_CIC(rq);
> +       struct fiops_data *fiopsd = q->elevator->elevator_data;
> +       /*
> +        * reposition in fifo if next is older than rq
> +        */
> +       if (!list_empty(&rq->queuelist) && !list_empty(&next->queuelist) &&
> +           time_before(rq_fifo_time(next), rq_fifo_time(rq))) {
> +               list_move(&rq->queuelist, &next->queuelist);
> +               rq_set_fifo_time(rq, rq_fifo_time(next));
> +       }
> +
> +       fiops_remove_request(next);
> +
> +       ioc = RQ_CIC(next);
> +       /*
> +        * all requests of this task are merged to other tasks, delete it
> +        * from the service tree.
> +        */
> +       if (fiops_ioc_on_rr(ioc) && RB_EMPTY_ROOT(&ioc->sort_list))
> +               fiops_del_ioc_rr(fiopsd, ioc);
> +}
> +
> +static int fiops_allow_merge(struct request_queue *q, struct request *rq,
> +                          struct bio *bio)
> +{
> +       struct fiops_data *fiopsd = q->elevator->elevator_data;
> +       struct dev_io_context *gen_cic;
> +       struct fiops_ioc *cic;
> +
> +       /*
> +        * Lookup the ioc that this bio will be queued with. Allow
> +        * merge only if rq is queued there.
> +        */
> +       gen_cic = queue_data_cic_lookup(&fiopsd->qdata, current->io_context);
> +       if (!gen_cic)
> +               return false;
> +       cic = dev_ioc_to_fiops_ioc(gen_cic);
> +
> +       return cic == RQ_CIC(rq);
> +}
> +
> +static void fiops_exit_queue(struct elevator_queue *e)
> +{
> +       struct fiops_data *fiopsd = e->elevator_data;
> +       struct request_queue *q = fiopsd->qdata.queue;
> +
> +       cancel_work_sync(&fiopsd->unplug_work);
> +
> +       spin_lock_irq(q->queue_lock);
> +
> +       ioc_builder_exit_queue(&ioc_builder, &fiopsd->qdata);
> +
> +       spin_unlock_irq(q->queue_lock);
> +       kfree(fiopsd);
> +}
> +
> +static void fiops_kick_queue(struct work_struct *work)
> +{
> +       struct fiops_data *fiopsd =
> +               container_of(work, struct fiops_data, unplug_work);
> +       struct request_queue *q = fiopsd->qdata.queue;
> +
> +       spin_lock_irq(q->queue_lock);
> +       __blk_run_queue(q);
> +       spin_unlock_irq(q->queue_lock);
> +}
> +
> +static void *fiops_init_queue(struct request_queue *q)
> +{
> +       struct fiops_data *fiopsd;
> +
> +       fiopsd = kzalloc_node(sizeof(*fiopsd), GFP_KERNEL, q->node);
> +       if (!fiopsd)
> +               return NULL;
> +
> +       if (ioc_builder_init_queue(&ioc_builder, &fiopsd->qdata, q)) {
> +               kfree(fiopsd);
> +               return NULL;
> +       }
> +
> +       fiopsd->service_tree = FIOPS_RB_ROOT;
> +
> +       INIT_WORK(&fiopsd->unplug_work, fiops_kick_queue);
> +
> +       return fiopsd;
> +}
> +
> +static void fiops_slab_kill(void)
> +{
> +       /*
> +        * Caller already ensured that pending RCU callbacks are completed,
> +        * so we should have no busy allocations at this point.
> +        */
> +       if (fiops_ioc_pool)
> +               kmem_cache_destroy(fiops_ioc_pool);
> +}
> +
> +static int __init fiops_slab_setup(void)
> +{
> +       fiops_ioc_pool = KMEM_CACHE(fiops_ioc, 0);
> +       if (!fiops_ioc_pool)
> +               return -ENOMEM;
> +
> +       return 0;
> +}
> +
> +static struct dev_io_context *
> +fiops_alloc_ioc(struct ioc_builder *builder, struct queue_data *qdata,
> +       gfp_t gfp_mask)
> +{
> +       struct fiops_ioc *ioc = kmem_cache_alloc_node(fiops_ioc_pool,
> +               gfp_mask, qdata->queue->node);
> +       if (ioc)
> +               return &ioc->dev_ioc;
> +       return NULL;
> +}
> +
> +static void fiops_free_ioc(struct ioc_builder *builder,
> +       struct dev_io_context *dev_ioc)
> +{
> +       struct fiops_ioc *ioc = dev_ioc_to_fiops_ioc(dev_ioc);
> +       kmem_cache_free(fiops_ioc_pool, ioc);
> +}
> +
> +static void fiops_init_cic(struct queue_data *qdata,
> +       struct dev_io_context *gen_cic)
> +{
> +       struct fiops_data *fiopsd = queue_data_to_fiopsd(qdata);
> +       struct fiops_ioc *ioc = dev_ioc_to_fiops_ioc(gen_cic);
> +
> +       RB_CLEAR_NODE(&ioc->rb_node);
> +       INIT_LIST_HEAD(&ioc->fifo);
> +       ioc->sort_list = RB_ROOT;
> +
> +       ioc->fiopsd = fiopsd;
> +
> +       ioc->pid = current->pid;
> +}
> +
> +static void fiops_exit_cic(struct queue_data *qdata,
> +       struct dev_io_context *gen_cic)
> +{
> +       struct fiops_ioc *ioc = dev_ioc_to_fiops_ioc(gen_cic);
> +
> +       WARN_ON(fiops_ioc_on_rr(ioc));
> +}
> +
> +static struct ioc_builder ioc_builder = {
> +       .alloc_ioc = fiops_alloc_ioc,
> +       .free_ioc = fiops_free_ioc,
> +       .cic_init = fiops_init_cic,
> +       .cic_exit = fiops_exit_cic,
> +};
> +
> +static struct elevator_type iosched_fiops = {
> +       .ops = {
> +               .elevator_merge_fn =            fiops_merge,
> +               .elevator_merged_fn =           fiops_merged_request,
> +               .elevator_merge_req_fn =        fiops_merged_requests,
> +               .elevator_allow_merge_fn =      fiops_allow_merge,
> +               .elevator_dispatch_fn =         fiops_dispatch_requests,
> +               .elevator_add_req_fn =          fiops_insert_request,
> +               .elevator_former_req_fn =       elv_rb_former_request,
> +               .elevator_latter_req_fn =       elv_rb_latter_request,
> +               .elevator_set_req_fn =          fiops_set_request,
> +               .elevator_put_req_fn =          fiops_put_request,
> +               .elevator_init_fn =             fiops_init_queue,
> +               .elevator_exit_fn =             fiops_exit_queue,
> +               .trim =                         queue_data_free_io_context,
> +       },
> +       .elevator_name =        "fiops",
> +       .elevator_owner =       THIS_MODULE,
> +};
> +
> +static int __init fiops_init(void)
> +{
> +       if (fiops_slab_setup())
> +               return -ENOMEM;
> +       if (ioc_builder_init(&ioc_builder)) {
> +               fiops_slab_kill();
> +               return -ENOMEM;
> +       }
> +
> +       elv_register(&iosched_fiops);
> +
> +       return 0;
> +}
> +
> +static void __exit fiops_exit(void)
> +{
> +       elv_unregister(&iosched_fiops);
> +       io_context_builder_exit(&ioc_builder);
> +       fiops_slab_kill();
> +}
> +
> +module_init(fiops_init);
> +module_exit(fiops_exit);
> +
> +MODULE_AUTHOR("Jens Axboe, Shaohua Li <shli@kernel.org>");
> +MODULE_LICENSE("GPL");
> +MODULE_DESCRIPTION("IOPS based IO scheduler");
> Index: linux/block/Kconfig.iosched
> ===================================================================
> --- linux.orig/block/Kconfig.iosched    2011-12-28 09:42:18.000000000 +0800
> +++ linux/block/Kconfig.iosched 2012-01-04 13:58:35.000000000 +0800
> @@ -43,6 +43,14 @@ config CFQ_GROUP_IOSCHED
>        ---help---
>          Enable group IO scheduling in CFQ.
>
> +config IOSCHED_FIOPS
> +       tristate "IOPS based I/O scheduler"
> +       default y
> +       ---help---
> +         This is an IOPS based I/O scheduler. It will try to distribute
> +          IOPS equally among all processes in the system. It's mainly for
> +          Flash based storage.
> +
>  choice
>        prompt "Default I/O scheduler"
>        default DEFAULT_CFQ
> Index: linux/block/blk.h
> ===================================================================
> --- linux.orig/block/blk.h      2011-12-28 09:42:18.000000000 +0800
> +++ linux/block/blk.h   2011-12-28 09:42:35.000000000 +0800
> @@ -206,7 +206,7 @@ static inline void blk_throtl_exit(struc
>  static inline void blk_throtl_release(struct request_queue *q) { }
>  #endif /* CONFIG_BLK_DEV_THROTTLING */
>
> -#if IS_ENABLED(CONFIG_IOSCHED_CFQ)
> +#if IS_ENABLED(CONFIG_IOSCHED_CFQ) || IS_ENABLED(CONFIG_IOSCHED_FIOPS)
>  struct queue_data;
>  struct ioc_builder {
>        struct dev_io_context *(*alloc_ioc)(struct ioc_builder *builder,
> Index: linux/block/blk-ioc.c
> ===================================================================
> --- linux.orig/block/blk-ioc.c  2011-12-28 09:42:18.000000000 +0800
> +++ linux/block/blk-ioc.c       2011-12-28 09:42:35.000000000 +0800
> @@ -164,7 +164,7 @@ static int __init blk_ioc_init(void)
>  }
>  subsys_initcall(blk_ioc_init);
>
> -#if IS_ENABLED(CONFIG_IOSCHED_CFQ)
> +#if IS_ENABLED(CONFIG_IOSCHED_CFQ) || IS_ENABLED(CONFIG_IOSCHED_FIOPS)
>  #define CIC_DEAD_INDEX_SHIFT   1
>
>  static inline void *queue_data_dead_key(struct queue_data *qdata)
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 0/3]block: An IOPS based ioscheduler
  2012-01-06  5:12     ` Shaohua Li
  2012-01-06  9:10       ` Namhyung Kim
  2012-01-06 14:37       ` Jan Kara
@ 2012-01-08 22:16       ` Dave Chinner
  2012-01-09  1:09         ` Shaohua Li
  2012-01-15 22:28       ` Vivek Goyal
  3 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2012-01-08 22:16 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-kernel, axboe, vgoyal, jmoyer

On Fri, Jan 06, 2012 at 01:12:29PM +0800, Shaohua Li wrote:
> On Thu, 2012-01-05 at 14:50 +0800, Shaohua Li wrote:
> > On Wed, 2012-01-04 at 18:19 +1100, Dave Chinner wrote:
> > > On Wed, Jan 04, 2012 at 02:53:37PM +0800, Shaohua Li wrote:
> > > > An IOPS based I/O scheduler
> > > > 
> > > > Flash based storage has some different characteristics against rotate disk.
> > > > 1. no I/O seek.
> > > > 2. read and write I/O cost usually is much different.
> > > > 3. Time which a request takes depends on request size.
> > > > 4. High throughput and IOPS, low latency.
> > > > 
> > > > CFQ iosched does well for rotate disk, for example fair dispatching, idle
> > > > for sequential read. It also has optimization for flash based storage (for
> > > > item 1 above), but overall it's not designed for flash based storage. It's
> > > > a slice based algorithm. Since flash based storage request cost is very
> > > > low, and drive has big queue_depth is quite popular now which makes
> > > > dispatching cost even lower, CFQ's slice accounting (jiffy based)
> > > > doesn't work well. CFQ doesn't consider above item 2 & 3.
> > > > 
> > > > FIOPS (Fair IOPS) ioscheduler is trying to fix the gaps. It's IOPS based, so
> > > > only targets for drive without I/O seek. It's quite similar like CFQ, but
> > > > the dispatch decision is made according to IOPS instead of slice.
> > > > 
> > > > The algorithm is simple. Drive has a service tree, and each task lives in
> > > > the tree. The key into the tree is called vios (virtual I/O). Every request
> > > > has vios, which is calculated according to its ioprio, request size and so
> > > > on. Task's vios is the sum of vios of all requests it dispatches. FIOPS
> > > > always selects task with minimum vios in the service tree and let the task
> > > > dispatch request. The dispatched request's vios is then added to the task's
> > > > vios and the task is repositioned in the sevice tree.
> > > > 
> > > > The series are orgnized as:
> > > > Patch 1: separate CFQ's io context management code. FIOPS will use it too.
> > > > Patch 2: The core FIOPS.
> > > > Patch 3: request read/write vios scale. This demontrates how the vios scale.
> > > > 
> > > > To make the code simple for easy view, some scale code isn't included here,
> > > > some not implementated yet.
> > > > 
> > > > TODO:
> > > > 1. ioprio support (have patch already)
> > > > 2. request size vios scale
> > > > 3. cgroup support
> > > > 4. tracing support
> > > > 5. automatically select default iosched according to QUEUE_FLAG_NONROT.
> > > > 
> > > > Comments and suggestions are welcome!
> > > 
> > > Benchmark results?
> > I didn't have data yet. The patches are still in earlier stage, I want
> > to focus on the basic idea first.
> since you asked, I tested in a 4 socket machine with 12 X25M SSD jbod,
> fs is ext4.
> 
> workload		percentage change with fiops against cfq
> fio_sync_read_4k        -2
> fio_mediaplay_64k       0
> fio_mediaplay_128k      0
> fio_mediaplay_rr_64k    0
> fio_sync_read_rr_4k     0
> fio_sync_write_128k     0
> fio_sync_write_64k      -1
> fio_sync_write_4k       -2
> fio_sync_write_64k_create       0
> fio_sync_write_rr_64k_create    0
> fio_sync_write_128k_create      0
> fio_aio_randread_4k     -4
> fio_aio_randread_64k    0
> fio_aio_randwrite_4k    1
> fio_aio_randwrite_64k   0
> fio_aio_randrw_4k       -1
> fio_aio_randrw_64k      0
> fio_tpch        9
> fio_tpcc        0
> fio_mmap_randread_4k    -1
> fio_mmap_randread_64k   1
> fio_mmap_randread_1k    -8
> fio_mmap_randwrite_4k   35
> fio_mmap_randwrite_64k  22
> fio_mmap_randwrite_1k   28
> fio_mmap_randwrite_4k_halfbusy  24
> fio_mmap_randrw_4k      23
> fio_mmap_randrw_64k     4
> fio_mmap_randrw_1k      22
> fio_mmap_randrw_4k_halfbusy     35
> fio_mmap_sync_read_4k   0
> fio_mmap_sync_read_64k  -1
> fio_mmap_sync_read_128k         -1
> fio_mmap_sync_read_rr_64k       5
> fio_mmap_sync_read_rr_4k        3
> 
> The fio_mmap_randread_1k has regression against 3.2-rc7, but no
> regression against 3.2-rc6 kernel, still checking why. The fiops has
> improvement for read/write mixed workload. CFQ is known not good for
> read/write mixed workload.

Numbers like this are meaningless without knowing what the hardware's
capability is and how the numbers compare to that raw capability. They
tell me only that mmap-based random writes improve in performance, and
only one specific type of random write improves, not all types.

That raises more questions than it answers: why do AIO-based random
writes not go any faster? Is that because, even with CFQ, AIO-based
random writes saturate the device? I.e., is AIO-based IO so much
faster than mmap-based IO that there is no scope for improvement on
your hardware?

You need to present raw numbers and give us some idea of how close
those numbers are to raw hardware capability for us to have any idea
what improvements these numbers actually demonstrate.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 0/3]block: An IOPS based ioscheduler
  2012-01-08 22:16       ` Dave Chinner
@ 2012-01-09  1:09         ` Shaohua Li
  2012-01-15 22:45           ` Vivek Goyal
  0 siblings, 1 reply; 29+ messages in thread
From: Shaohua Li @ 2012-01-09  1:09 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, axboe, vgoyal, jmoyer

On Mon, 2012-01-09 at 09:16 +1100, Dave Chinner wrote:
> On Fri, Jan 06, 2012 at 01:12:29PM +0800, Shaohua Li wrote:
> > On Thu, 2012-01-05 at 14:50 +0800, Shaohua Li wrote:
> > > On Wed, 2012-01-04 at 18:19 +1100, Dave Chinner wrote:
> > > > On Wed, Jan 04, 2012 at 02:53:37PM +0800, Shaohua Li wrote:
> > > > > An IOPS based I/O scheduler
> > > > > 
> > > > > Flash based storage has some different characteristics against rotate disk.
> > > > > 1. no I/O seek.
> > > > > 2. read and write I/O cost usually is much different.
> > > > > 3. Time which a request takes depends on request size.
> > > > > 4. High throughput and IOPS, low latency.
> > > > > 
> > > > > CFQ iosched does well for rotate disk, for example fair dispatching, idle
> > > > > for sequential read. It also has optimization for flash based storage (for
> > > > > item 1 above), but overall it's not designed for flash based storage. It's
> > > > > a slice based algorithm. Since flash based storage request cost is very
> > > > > low, and drive has big queue_depth is quite popular now which makes
> > > > > dispatching cost even lower, CFQ's slice accounting (jiffy based)
> > > > > doesn't work well. CFQ doesn't consider above item 2 & 3.
> > > > > 
> > > > > FIOPS (Fair IOPS) ioscheduler is trying to fix the gaps. It's IOPS based, so
> > > > > only targets for drive without I/O seek. It's quite similar like CFQ, but
> > > > > the dispatch decision is made according to IOPS instead of slice.
> > > > > 
> > > > > The algorithm is simple. Drive has a service tree, and each task lives in
> > > > > the tree. The key into the tree is called vios (virtual I/O). Every request
> > > > > has vios, which is calculated according to its ioprio, request size and so
> > > > > on. Task's vios is the sum of vios of all requests it dispatches. FIOPS
> > > > > always selects task with minimum vios in the service tree and let the task
> > > > > dispatch request. The dispatched request's vios is then added to the task's
> > > > > vios and the task is repositioned in the sevice tree.
> > > > > 
> > > > > The series are orgnized as:
> > > > > Patch 1: separate CFQ's io context management code. FIOPS will use it too.
> > > > > Patch 2: The core FIOPS.
> > > > > Patch 3: request read/write vios scale. This demontrates how the vios scale.
> > > > > 
> > > > > To make the code simple for easy view, some scale code isn't included here,
> > > > > some not implementated yet.
> > > > > 
> > > > > TODO:
> > > > > 1. ioprio support (have patch already)
> > > > > 2. request size vios scale
> > > > > 3. cgroup support
> > > > > 4. tracing support
> > > > > 5. automatically select default iosched according to QUEUE_FLAG_NONROT.
> > > > > 
> > > > > Comments and suggestions are welcome!
> > > > 
> > > > Benchmark results?
> > > I didn't have data yet. The patches are still in earlier stage, I want
> > > to focus on the basic idea first.
> > since you asked, I tested in a 4 socket machine with 12 X25M SSD jbod,
> > fs is ext4.
> > 
> > workload		percentage change with fiops against cfq
> > fio_sync_read_4k        -2
> > fio_mediaplay_64k       0
> > fio_mediaplay_128k      0
> > fio_mediaplay_rr_64k    0
> > fio_sync_read_rr_4k     0
> > fio_sync_write_128k     0
> > fio_sync_write_64k      -1
> > fio_sync_write_4k       -2
> > fio_sync_write_64k_create       0
> > fio_sync_write_rr_64k_create    0
> > fio_sync_write_128k_create      0
> > fio_aio_randread_4k     -4
> > fio_aio_randread_64k    0
> > fio_aio_randwrite_4k    1
> > fio_aio_randwrite_64k   0
> > fio_aio_randrw_4k       -1
> > fio_aio_randrw_64k      0
> > fio_tpch        9
> > fio_tpcc        0
> > fio_mmap_randread_4k    -1
> > fio_mmap_randread_64k   1
> > fio_mmap_randread_1k    -8
> > fio_mmap_randwrite_4k   35
> > fio_mmap_randwrite_64k  22
> > fio_mmap_randwrite_1k   28
> > fio_mmap_randwrite_4k_halfbusy  24
> > fio_mmap_randrw_4k      23
> > fio_mmap_randrw_64k     4
> > fio_mmap_randrw_1k      22
> > fio_mmap_randrw_4k_halfbusy     35
> > fio_mmap_sync_read_4k   0
> > fio_mmap_sync_read_64k  -1
> > fio_mmap_sync_read_128k         -1
> > fio_mmap_sync_read_rr_64k       5
> > fio_mmap_sync_read_rr_4k        3
> > 
> > The fio_mmap_randread_1k has regression against 3.2-rc7, but no
> > regression against 3.2-rc6 kernel, still checking why. The fiops has
> > improvement for read/write mixed workload. CFQ is known not good for
> > read/write mixed workload.
> 
> Numbers like this are meaningless without knowing what the hardware
> capability is and how the numbers compare to that raw capability.
> They tell me only mmap based random write improves in
> performance, and only one specific type of random write improves,
> not all types.
> 
> That raises more questions than it answers: why do AIO based random
> writes not go any faster? Is that because even with CFQ, AIO based
> random writes saturate the device?  i.e. is AIO based IO that much
> faster than mmap based IO that there is no scope for improvement on
> your hardware?
> 
> You need to present raw numbers and give us some idea of how close
> those numbers are to raw hardware capability for us to have any idea
> what improvements these numbers actually demonstrate.
Yes, your guess is right. The hardware is the limitation: 12 SSDs exceed
the JBOD's capability for both throughput and IOPS, which is why only the
read/write mixed workloads are affected. I'll use fewer SSDs in later
tests, which should demonstrate the performance better. I'll report both
raw numbers and fiops/cfq numbers later.

Thanks,
Shaohua


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 0/3]block: An IOPS based ioscheduler
  2012-01-06 14:37       ` Jan Kara
@ 2012-01-09  1:26         ` Shaohua Li
  2012-01-15 22:32           ` Vivek Goyal
  0 siblings, 1 reply; 29+ messages in thread
From: Shaohua Li @ 2012-01-09  1:26 UTC (permalink / raw)
  To: Jan Kara; +Cc: Dave Chinner, linux-kernel, axboe, vgoyal, jmoyer

On Fri, 2012-01-06 at 15:37 +0100, Jan Kara wrote:
> On Fri 06-01-12 13:12:29, Shaohua Li wrote:
> > On Thu, 2012-01-05 at 14:50 +0800, Shaohua Li wrote:
> > > On Wed, 2012-01-04 at 18:19 +1100, Dave Chinner wrote:
> > > > On Wed, Jan 04, 2012 at 02:53:37PM +0800, Shaohua Li wrote:
> > > > > An IOPS based I/O scheduler
> > > > > 
> > > > > Flash based storage has some different characteristics against rotate disk.
> > > > > 1. no I/O seek.
> > > > > 2. read and write I/O cost usually is much different.
> > > > > 3. Time which a request takes depends on request size.
> > > > > 4. High throughput and IOPS, low latency.
> > > > > 
> > > > > CFQ iosched does well for rotate disk, for example fair dispatching, idle
> > > > > for sequential read. It also has optimization for flash based storage (for
> > > > > item 1 above), but overall it's not designed for flash based storage. It's
> > > > > a slice based algorithm. Since flash based storage request cost is very
> > > > > low, and drive has big queue_depth is quite popular now which makes
> > > > > dispatching cost even lower, CFQ's slice accounting (jiffy based)
> > > > > doesn't work well. CFQ doesn't consider above item 2 & 3.
> > > > > 
> > > > > FIOPS (Fair IOPS) ioscheduler is trying to fix the gaps. It's IOPS based, so
> > > > > only targets for drive without I/O seek. It's quite similar like CFQ, but
> > > > > the dispatch decision is made according to IOPS instead of slice.
> > > > > 
> > > > > The algorithm is simple. Drive has a service tree, and each task lives in
> > > > > the tree. The key into the tree is called vios (virtual I/O). Every request
> > > > > has vios, which is calculated according to its ioprio, request size and so
> > > > > on. Task's vios is the sum of vios of all requests it dispatches. FIOPS
> > > > > always selects task with minimum vios in the service tree and let the task
> > > > > dispatch request. The dispatched request's vios is then added to the task's
> > > > > vios and the task is repositioned in the sevice tree.
> > > > > 
> > > > > The series are orgnized as:
> > > > > Patch 1: separate CFQ's io context management code. FIOPS will use it too.
> > > > > Patch 2: The core FIOPS.
> > > > > Patch 3: request read/write vios scale. This demontrates how the vios scale.
> > > > > 
> > > > > To make the code simple for easy view, some scale code isn't included here,
> > > > > some not implementated yet.
> > > > > 
> > > > > TODO:
> > > > > 1. ioprio support (have patch already)
> > > > > 2. request size vios scale
> > > > > 3. cgroup support
> > > > > 4. tracing support
> > > > > 5. automatically select default iosched according to QUEUE_FLAG_NONROT.
> > > > > 
> > > > > Comments and suggestions are welcome!
> > > > 
> > > > Benchmark results?
> > > I didn't have data yet. The patches are still in earlier stage, I want
> > > to focus on the basic idea first.
> > since you asked, I tested in a 4 socket machine with 12 X25M SSD jbod,
> > fs is ext4.
> > 
> > workload		percentage change with fiops against cfq
> > fio_sync_read_4k        -2
> > fio_mediaplay_64k       0
> > fio_mediaplay_128k      0
> > fio_mediaplay_rr_64k    0
> > fio_sync_read_rr_4k     0
> > fio_sync_write_128k     0
> > fio_sync_write_64k      -1
> > fio_sync_write_4k       -2
> > fio_sync_write_64k_create       0
> > fio_sync_write_rr_64k_create    0
> > fio_sync_write_128k_create      0
> > fio_aio_randread_4k     -4
> > fio_aio_randread_64k    0
> > fio_aio_randwrite_4k    1
> > fio_aio_randwrite_64k   0
> > fio_aio_randrw_4k       -1
> > fio_aio_randrw_64k      0
> > fio_tpch        9
> > fio_tpcc        0
> > fio_mmap_randread_4k    -1
> > fio_mmap_randread_64k   1
> > fio_mmap_randread_1k    -8
> > fio_mmap_randwrite_4k   35
> > fio_mmap_randwrite_64k  22
> > fio_mmap_randwrite_1k   28
> > fio_mmap_randwrite_4k_halfbusy  24
> > fio_mmap_randrw_4k      23
> > fio_mmap_randrw_64k     4
> > fio_mmap_randrw_1k      22
> > fio_mmap_randrw_4k_halfbusy     35
> > fio_mmap_sync_read_4k   0
> > fio_mmap_sync_read_64k  -1
> > fio_mmap_sync_read_128k         -1
> > fio_mmap_sync_read_rr_64k       5
> > fio_mmap_sync_read_rr_4k        3
> > 
> > The fio_mmap_randread_1k has regression against 3.2-rc7, but no
> > regression against 3.2-rc6 kernel, still checking why. The fiops has
> > improvement for read/write mixed workload. CFQ is known not good for
> > read/write mixed workload.
>   Nice, but this is only about throughput, isn't it? Part of the reason why
> CFQ takes a hit in throughput is that it prefers sync IO (i.e. reads and
> synchronous writes) over other IO. Does your scheduler do anything like
> that?
It does. The vios is scaled based on read vs. write. By default, reads get
2.5 times more share than writes, which matches CFQ. The ratio is easy to
change with sysfs knobs.
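
To make that concrete, here is a minimal sketch of the kind of scaling I
mean, built on the fiops_scaled_vios() stub from patch 2. The
read_scale/write_scale fields and the 2.5:1 default below are placeholders
for this example, not the actual patch 3 code:

/*
 * Sketch only: charge writes more vios than reads so that reads get a
 * larger share of the dispatch budget.  write_scale/read_scale are
 * assumed to be new fiops_data fields exported via sysfs (not part of
 * patch 2).
 */
static u64 fiops_scaled_vios(struct fiops_data *fiopsd,
	struct fiops_ioc *ioc, struct request *rq)
{
	u64 vios = VIOS_SCALE;

	/* e.g. read_scale = 2, write_scale = 5 charges writes 2.5x */
	if (rq_data_dir(rq) == WRITE)
		vios = vios * fiopsd->write_scale / fiopsd->read_scale;

	return vios;
}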

>  Could you, for example, compare the latency of reads while running heavy
> background writes between CFQ and your scheduler? Loads like this were the
> original motivation for CFQ, I believe.
CFQ supports preemption and FIOPS doesn't, so I suppose CFQ's read latency
is still better in such a workload.
In this initial post I just want to demonstrate the basic idea of the
ioscheduler. I'll post more data for both latency and throughput in the
next round.

Thanks,
Shaohua


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 0/3]block: An IOPS based ioscheduler
  2012-01-04  6:53 [RFC 0/3]block: An IOPS based ioscheduler Shaohua Li
                   ` (4 preceding siblings ...)
  2012-01-06  9:41 ` Zhu Yanhai
@ 2012-01-15 22:24 ` Vivek Goyal
  5 siblings, 0 replies; 29+ messages in thread
From: Vivek Goyal @ 2012-01-15 22:24 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-kernel, axboe, jmoyer

On Wed, Jan 04, 2012 at 02:53:37PM +0800, Shaohua Li wrote:
> An IOPS based I/O scheduler
> 
> Flash based storage has some different characteristics against rotate disk.
> 1. no I/O seek.
> 2. read and write I/O cost usually is much different.
> 3. Time which a request takes depends on request size.
> 4. High throughput and IOPS, low latency.
> 
> CFQ iosched does well for rotate disk, for example fair dispatching, idle
> for sequential read. It also has optimization for flash based storage (for
> item 1 above), but overall it's not designed for flash based storage. It's
> a slice based algorithm. Since flash based storage request cost is very
> low, and drive has big queue_depth is quite popular now which makes
> dispatching cost even lower, CFQ's slice accounting (jiffy based)
> doesn't work well. CFQ doesn't consider above item 2 & 3.
> 
> FIOPS (Fair IOPS) ioscheduler is trying to fix the gaps. It's IOPS based, so
> only targets for drive without I/O seek. It's quite similar like CFQ, but
> the dispatch decision is made according to IOPS instead of slice.

Hi Shaohua,

What problem are you trying to fix? If you just want to provide fairness
in terms of IOPS instead of time, then I think the existing code should be
easily modifiable instead of writing a new IO scheduler altogether. It is
just like switching the mode based on either a tunable or the disk type.

If slice_idle is zero, we already have some code that effectively does
accounting in terms of IOPS. This is primarily at the group level. We
should be able to extend it to the queue level.
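
For context, this is roughly the check that flips the group accounting
into IOPS mode (a paraphrased sketch from memory, not verbatim
cfq-iosched.c):

/*
 * Paraphrased sketch, not verbatim kernel code: with slice_idle disabled
 * on an NCQ drive, per-queue time cannot be measured meaningfully, so
 * group service is charged in IOs dispatched instead of jiffies.
 */
static inline bool iops_mode(struct cfq_data *cfqd)
{
	return !cfqd->cfq_slice_idle && cfqd->hw_tag;
}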

These are implementation details. What I am not able to understand is
what the problem is and what you are going to gain by doing accounting
in terms of IOPS.

Also, for flash-based storage, isn't noop or deadline a good choice of
scheduler? Why do we want to come up with something CFQ-like? For such a
fast device any idling will kill performance, and if you don't idle, then
I think practically most workloads will end up being served more or less
round robin.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 0/3]block: An IOPS based ioscheduler
  2012-01-06  5:12     ` Shaohua Li
                         ` (2 preceding siblings ...)
  2012-01-08 22:16       ` Dave Chinner
@ 2012-01-15 22:28       ` Vivek Goyal
  3 siblings, 0 replies; 29+ messages in thread
From: Vivek Goyal @ 2012-01-15 22:28 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Dave Chinner, linux-kernel, axboe, jmoyer

On Fri, Jan 06, 2012 at 01:12:29PM +0800, Shaohua Li wrote:
> On Thu, 2012-01-05 at 14:50 +0800, Shaohua Li wrote:
> > On Wed, 2012-01-04 at 18:19 +1100, Dave Chinner wrote:
> > > On Wed, Jan 04, 2012 at 02:53:37PM +0800, Shaohua Li wrote:
> > > > An IOPS based I/O scheduler
> > > > 
> > > > Flash based storage has some different characteristics against rotate disk.
> > > > 1. no I/O seek.
> > > > 2. read and write I/O cost usually is much different.
> > > > 3. Time which a request takes depends on request size.
> > > > 4. High throughput and IOPS, low latency.
> > > > 
> > > > CFQ iosched does well for rotate disk, for example fair dispatching, idle
> > > > for sequential read. It also has optimization for flash based storage (for
> > > > item 1 above), but overall it's not designed for flash based storage. It's
> > > > a slice based algorithm. Since flash based storage request cost is very
> > > > low, and drive has big queue_depth is quite popular now which makes
> > > > dispatching cost even lower, CFQ's slice accounting (jiffy based)
> > > > doesn't work well. CFQ doesn't consider above item 2 & 3.
> > > > 
> > > > FIOPS (Fair IOPS) ioscheduler is trying to fix the gaps. It's IOPS based, so
> > > > only targets for drive without I/O seek. It's quite similar like CFQ, but
> > > > the dispatch decision is made according to IOPS instead of slice.
> > > > 
> > > > The algorithm is simple. Drive has a service tree, and each task lives in
> > > > the tree. The key into the tree is called vios (virtual I/O). Every request
> > > > has vios, which is calculated according to its ioprio, request size and so
> > > > on. Task's vios is the sum of vios of all requests it dispatches. FIOPS
> > > > always selects task with minimum vios in the service tree and let the task
> > > > dispatch request. The dispatched request's vios is then added to the task's
> > > > vios and the task is repositioned in the sevice tree.
> > > > 
> > > > The series are orgnized as:
> > > > Patch 1: separate CFQ's io context management code. FIOPS will use it too.
> > > > Patch 2: The core FIOPS.
> > > > Patch 3: request read/write vios scale. This demontrates how the vios scale.
> > > > 
> > > > To make the code simple for easy view, some scale code isn't included here,
> > > > some not implementated yet.
> > > > 
> > > > TODO:
> > > > 1. ioprio support (have patch already)
> > > > 2. request size vios scale
> > > > 3. cgroup support
> > > > 4. tracing support
> > > > 5. automatically select default iosched according to QUEUE_FLAG_NONROT.
> > > > 
> > > > Comments and suggestions are welcome!
> > > 
> > > Benchmark results?
> > I didn't have data yet. The patches are still in earlier stage, I want
> > to focus on the basic idea first.
> since you asked, I tested in a 4 socket machine with 12 X25M SSD jbod,
> fs is ext4.
> 
> workload		percentage change with fiops against cfq
> fio_sync_read_4k        -2
> fio_mediaplay_64k       0
> fio_mediaplay_128k      0
> fio_mediaplay_rr_64k    0
> fio_sync_read_rr_4k     0
> fio_sync_write_128k     0
> fio_sync_write_64k      -1
> fio_sync_write_4k       -2
> fio_sync_write_64k_create       0
> fio_sync_write_rr_64k_create    0
> fio_sync_write_128k_create      0
> fio_aio_randread_4k     -4
> fio_aio_randread_64k    0
> fio_aio_randwrite_4k    1
> fio_aio_randwrite_64k   0
> fio_aio_randrw_4k       -1
> fio_aio_randrw_64k      0
> fio_tpch        9
> fio_tpcc        0
> fio_mmap_randread_4k    -1
> fio_mmap_randread_64k   1
> fio_mmap_randread_1k    -8
> fio_mmap_randwrite_4k   35
> fio_mmap_randwrite_64k  22
> fio_mmap_randwrite_1k   28
> fio_mmap_randwrite_4k_halfbusy  24
> fio_mmap_randrw_4k      23
> fio_mmap_randrw_64k     4
> fio_mmap_randrw_1k      22
> fio_mmap_randrw_4k_halfbusy     35
> fio_mmap_sync_read_4k   0
> fio_mmap_sync_read_64k  -1
> fio_mmap_sync_read_128k         -1
> fio_mmap_sync_read_rr_64k       5
> fio_mmap_sync_read_rr_4k        3
> 
> The fio_mmap_randread_1k has regression against 3.2-rc7, but no
> regression against 3.2-rc6 kernel, still checking why. The fiops has
> improvement for read/write mixed workload. CFQ is known not good for
> read/write mixed workload.

Set slice_idle=0 and possibly increase quantum to 32 or 64 and you
should get better performance. I think it then practically becomes an
IOPS-based scheduler, with slightly imprecise accounting.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 0/3]block: An IOPS based ioscheduler
  2012-01-09  1:26         ` Shaohua Li
@ 2012-01-15 22:32           ` Vivek Goyal
  0 siblings, 0 replies; 29+ messages in thread
From: Vivek Goyal @ 2012-01-15 22:32 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Jan Kara, Dave Chinner, linux-kernel, axboe, jmoyer

On Mon, Jan 09, 2012 at 09:26:45AM +0800, Shaohua Li wrote:

[..]
> >  Could you for example compare a latency of reads while running heavy
> > background writing between CFQ and your scheduler? Loads like this where
> > original motivation for CFQ I believe.
> CFQ supports preemption, FIOPS doesn't, so I suppose read latency of CFQ
> is still better in such workload.
> In this initial post, I just want to demonstrate the basic idea of the
> ioscheduler. I'll post more data for both latency and throughput in next
> round.

I think, before the numbers, what would be more helpful to know is what you
are trying to achieve and why the existing CFQ code cannot do that with
little modification.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 0/3]block: An IOPS based ioscheduler
  2012-01-09  1:09         ` Shaohua Li
@ 2012-01-15 22:45           ` Vivek Goyal
  2012-01-16  4:36             ` Shaohua Li
  0 siblings, 1 reply; 29+ messages in thread
From: Vivek Goyal @ 2012-01-15 22:45 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Dave Chinner, linux-kernel, axboe, jmoyer

On Mon, Jan 09, 2012 at 09:09:35AM +0800, Shaohua Li wrote:

[..]
> > You need to present raw numbers and give us some idea of how close
> > those numbers are to raw hardware capability for us to have any idea
> > what improvements these numbers actually demonstrate.
> Yes, your guess is right. The hardware has limitation. 12 SSD exceeds
> the jbod capability, for both throughput and IOPS, that's why only
> read/write mixed workload impacts. I'll use less SSD in later tests,
> which will demonstrate the performance better. I'll report both raw
> numbers and fiops/cfq numbers later.

If the fiops numbers are better, please explain why they are better.
If you cut down on idling, it is obvious that you will get higher
throughput on these flash devices. CFQ does disable queue idling for
non-rotational NCQ devices. If the higher throughput is due to driving
deeper queue depths, then CFQ can do that too, just by changing quantum
and disabling idling.

So I really don't understand what you are doing fundamentally
differently in the FIOPS ioscheduler.

The only thing I can think of is more accurate per-queue accounting in
terms of number of IOs instead of time, which can just serve to improve
fairness a bit for certain workloads. In practice, I think it might
not matter much.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 0/3]block: An IOPS based ioscheduler
  2012-01-15 22:45           ` Vivek Goyal
@ 2012-01-16  4:36             ` Shaohua Li
  2012-01-16  7:11               ` Vivek Goyal
  0 siblings, 1 reply; 29+ messages in thread
From: Shaohua Li @ 2012-01-16  4:36 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Dave Chinner, linux-kernel, axboe, jmoyer

On Sun, 2012-01-15 at 17:45 -0500, Vivek Goyal wrote:
> On Mon, Jan 09, 2012 at 09:09:35AM +0800, Shaohua Li wrote:
> 
> [..]
> > > You need to present raw numbers and give us some idea of how close
> > > those numbers are to raw hardware capability for us to have any idea
> > > what improvements these numbers actually demonstrate.
> > Yes, your guess is right. The hardware has limitation. 12 SSD exceeds
> > the jbod capability, for both throughput and IOPS, that's why only
> > read/write mixed workload impacts. I'll use less SSD in later tests,
> > which will demonstrate the performance better. I'll report both raw
> > numbers and fiops/cfq numbers later.
> 
> If fiops number are better please explain why those numbers are better.
> If you cut down on idling, it is obivious that you will get higher
> throughput on these flash devices. CFQ does disable queue idling for
> non rotational NCQ devices. If higher throughput is due to driving
> deeper queue depths, then CFQ can do that too just by changing quantum
> and disabling idling. 
It's because of the quantum. Surely you can increase the quantum, and CFQ
performance will improve, but then you will find CFQ is very unfair.

> So I really don't understand that what are you doing fundamentally
> different in FIOPS ioscheduler. 
> 
> The only thing I can think of more accurate accounting per queue in
> terms of number of IOs instead of time. Which can just serve to improve
> fairness a bit for certain workloads. In practice, I think it might
> not matter much.
If the quantum is big, CFQ will have better performance, but it effectively
falls back to noop, with no fairness at all. Fairness is important and is
why we introduced CFQ.

In summary, CFQ doesn't give both fairness and good performance. FIOPS is
trying to be fair and still perform well. I don't think any time-based
accounting can achieve that goal for NCQ SSDs (even the CFQ cgroup code
has an iops mode, so I suppose you already know this well).

Surely you can change CFQ to make it IOPS based, but that would mess up the
code a lot, and FIOPS already shares a lot of code with CFQ. So I'd like to
have a separate ioscheduler which is IOPS based.

Thanks,
Shaohua


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 0/3]block: An IOPS based ioscheduler
  2012-01-16  4:36             ` Shaohua Li
@ 2012-01-16  7:11               ` Vivek Goyal
  2012-01-16  7:55                 ` Shaohua Li
  0 siblings, 1 reply; 29+ messages in thread
From: Vivek Goyal @ 2012-01-16  7:11 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Dave Chinner, linux-kernel, axboe, jmoyer

On Mon, Jan 16, 2012 at 12:36:30PM +0800, Shaohua Li wrote:
> On Sun, 2012-01-15 at 17:45 -0500, Vivek Goyal wrote:
> > On Mon, Jan 09, 2012 at 09:09:35AM +0800, Shaohua Li wrote:
> > 
> > [..]
> > > > You need to present raw numbers and give us some idea of how close
> > > > those numbers are to raw hardware capability for us to have any idea
> > > > what improvements these numbers actually demonstrate.
> > > Yes, your guess is right. The hardware has limitation. 12 SSD exceeds
> > > the jbod capability, for both throughput and IOPS, that's why only
> > > read/write mixed workload impacts. I'll use less SSD in later tests,
> > > which will demonstrate the performance better. I'll report both raw
> > > numbers and fiops/cfq numbers later.
> > 
> > If fiops number are better please explain why those numbers are better.
> > If you cut down on idling, it is obivious that you will get higher
> > throughput on these flash devices. CFQ does disable queue idling for
> > non rotational NCQ devices. If higher throughput is due to driving
> > deeper queue depths, then CFQ can do that too just by changing quantum
> > and disabling idling. 
> it's because of quantum. Surely you can change the quantum, and CFQ
> performance will increase, but you will find CFQ is very unfair then.

Why does increasing the quantum lead to CFQ being unfair? In terms of time
it still tries to be fair. It is a separate matter that, with NCQ, accurate
time measurement is not possible with requests from multiple queues
being in the driver/disk at the same time. So accounting in terms of
IOPS per queue might make sense.

> 
> > So I really don't understand that what are you doing fundamentally
> > different in FIOPS ioscheduler. 
> > 
> > The only thing I can think of more accurate accounting per queue in
> > terms of number of IOs instead of time. Which can just serve to improve
> > fairness a bit for certain workloads. In practice, I think it might
> > not matter much.
> If quantum is big, CFQ will have better performance, but it actually
> fallbacks to Noop, no any fairness. fairness is important and is why we
> introduce CFQ.

It is not exactly noop. It still preempts writes and prioritizes reads
and direct writes. 

Also, what's the real-life workload where you face issues using, say,
deadline with this flash-based storage?

> 
> In summary, CFQ isn't both fair and good performance. FIOPS is trying to
> be fair and have good performance. I didn't think any time based
> accounting can make the goal happen for NCQ and SSD (even cfq cgroup
> code has iops mode, so suppose you should already know this well).
> 
> Surely you can change CFQ to make it IOPS based, but this will mess the
> code a lot, and FIOPS shares a lot of code with CFQ. So I'd like to have
> a separate ioscheduler which is IOPS based.

I think writing a separate IO scheduler just to do accounting in IOPS while
retaining the rest of the CFQ code is not a very good idea. Modifying the
CFQ code to deal with both time-based and IOPS accounting might turn out to
be simpler.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 0/3]block: An IOPS based ioscheduler
  2012-01-16  7:11               ` Vivek Goyal
@ 2012-01-16  7:55                 ` Shaohua Li
  2012-01-16  8:29                   ` Vivek Goyal
  0 siblings, 1 reply; 29+ messages in thread
From: Shaohua Li @ 2012-01-16  7:55 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Dave Chinner, linux-kernel, axboe, jmoyer

On Mon, 2012-01-16 at 02:11 -0500, Vivek Goyal wrote:
> On Mon, Jan 16, 2012 at 12:36:30PM +0800, Shaohua Li wrote:
> > On Sun, 2012-01-15 at 17:45 -0500, Vivek Goyal wrote:
> > > On Mon, Jan 09, 2012 at 09:09:35AM +0800, Shaohua Li wrote:
> > > 
> > > [..]
> > > > > You need to present raw numbers and give us some idea of how close
> > > > > those numbers are to raw hardware capability for us to have any idea
> > > > > what improvements these numbers actually demonstrate.
> > > > Yes, your guess is right. The hardware has limitation. 12 SSD exceeds
> > > > the jbod capability, for both throughput and IOPS, that's why only
> > > > read/write mixed workload impacts. I'll use less SSD in later tests,
> > > > which will demonstrate the performance better. I'll report both raw
> > > > numbers and fiops/cfq numbers later.
> > > 
> > > If fiops number are better please explain why those numbers are better.
> > > If you cut down on idling, it is obivious that you will get higher
> > > throughput on these flash devices. CFQ does disable queue idling for
> > > non rotational NCQ devices. If higher throughput is due to driving
> > > deeper queue depths, then CFQ can do that too just by changing quantum
> > > and disabling idling. 
> > it's because of quantum. Surely you can change the quantum, and CFQ
> > performance will increase, but you will find CFQ is very unfair then.
> 
> Why increasing quantum leads to CFQ being unfair? In terms of time it
> still tries to be fair. 
We can dispatch a lot of requests to an NCQ SSD in a very small time
interval, and the disk can finish a lot of requests in a small time
interval too; the per-request service time is typically tens to hundreds
of microseconds, far less than 1 jiffy. Increasing the quantum lets a task
dispatch requests faster and makes the accounting worse, because with a
small quantum the task has to wait to dispatch. You can easily verify this
with a simple fio test.

> That's a different thing that with NCQ, right
> time measurement is not possible with requests from multiple queues
> being in the driver/disk at the same time. So accouting in terms of
> iops per queue might make sense.
yes.

> > > So I really don't understand that what are you doing fundamentally
> > > different in FIOPS ioscheduler. 
> > > 
> > > The only thing I can think of more accurate accounting per queue in
> > > terms of number of IOs instead of time. Which can just serve to improve
> > > fairness a bit for certain workloads. In practice, I think it might
> > > not matter much.
> > If quantum is big, CFQ will have better performance, but it actually
> > fallbacks to Noop, no any fairness. fairness is important and is why we
> > introduce CFQ.
> 
> It is not exactly noop. It still preempts writes and prioritizes reads
> and direct writes. 
Sure, I mostly mean fairness here.

> Also, what's the real life workload where you face issues with using
> say deadline with these flash based storage.
Deadline doesn't provide fairness. It's mainly cgroup workloads. Workloads
with different ioprio values have issues too, but I don't know which real
workloads use ioprio.

> > 
> > In summary, CFQ isn't both fair and good performance. FIOPS is trying to
> > be fair and have good performance. I didn't think any time based
> > accounting can make the goal happen for NCQ and SSD (even cfq cgroup
> > code has iops mode, so suppose you should already know this well).
> > 
> > Surely you can change CFQ to make it IOPS based, but this will mess the
> > code a lot, and FIOPS shares a lot of code with CFQ. So I'd like to have
> > a separate ioscheduler which is IOPS based.
> 
> I think writing a separate IO scheduler just to do accouting in IOPS while
> retaining rest of the CFQ code is not a very good idea. Modifying CFQ code
> to be able to deal with both time based as well as IOPS accounting might
> turn out to be simpler.
Changing CFQ works, but I really want to avoid having something like
if (iops)
   xxx
else
   xxx
all over the code. I plan to add scaling for read/write and for request
size, etc., because read and write costs differ and requests of different
sizes have different costs on an SSD (a rough sketch follows below). This
can be added to CFQ too, with pain. That said, I don't completely object
to making CFQ support IOPS accounting, but my feeling is that a separate
ioscheduler is cleaner.
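
Roughly what I have in mind for the size scaling, as a sketch only (the
64k reference size and the helper name are made up for this example, not
taken from the posted patches):

/*
 * Illustrative sketch: charge vios proportionally to request size above
 * a base size, so a large request does not cost the same as a 4k one.
 * The 64k base is an assumption for this example.
 */
static u64 fiops_size_scaled_vios(struct request *rq, u64 vios)
{
	unsigned int kb = blk_rq_bytes(rq) >> 10;

	if (kb > 64)
		vios = vios * kb / 64;

	return vios;
}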

Thanks,
Shaohua


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 0/3]block: An IOPS based ioscheduler
  2012-01-16  7:55                 ` Shaohua Li
@ 2012-01-16  8:29                   ` Vivek Goyal
  2012-01-17  1:06                     ` Shaohua Li
  0 siblings, 1 reply; 29+ messages in thread
From: Vivek Goyal @ 2012-01-16  8:29 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Dave Chinner, linux-kernel, axboe, jmoyer

On Mon, Jan 16, 2012 at 03:55:41PM +0800, Shaohua Li wrote:
> On Mon, 2012-01-16 at 02:11 -0500, Vivek Goyal wrote:
> > On Mon, Jan 16, 2012 at 12:36:30PM +0800, Shaohua Li wrote:
> > > On Sun, 2012-01-15 at 17:45 -0500, Vivek Goyal wrote:
> > > > On Mon, Jan 09, 2012 at 09:09:35AM +0800, Shaohua Li wrote:
> > > > 
> > > > [..]
> > > > > > You need to present raw numbers and give us some idea of how close
> > > > > > those numbers are to raw hardware capability for us to have any idea
> > > > > > what improvements these numbers actually demonstrate.
> > > > > Yes, your guess is right. The hardware has limitation. 12 SSD exceeds
> > > > > the jbod capability, for both throughput and IOPS, that's why only
> > > > > read/write mixed workload impacts. I'll use less SSD in later tests,
> > > > > which will demonstrate the performance better. I'll report both raw
> > > > > numbers and fiops/cfq numbers later.
> > > > 
> > > > If fiops number are better please explain why those numbers are better.
> > > > If you cut down on idling, it is obivious that you will get higher
> > > > throughput on these flash devices. CFQ does disable queue idling for
> > > > non rotational NCQ devices. If higher throughput is due to driving
> > > > deeper queue depths, then CFQ can do that too just by changing quantum
> > > > and disabling idling. 
> > > it's because of quantum. Surely you can change the quantum, and CFQ
> > > performance will increase, but you will find CFQ is very unfair then.
> > 
> > Why increasing quantum leads to CFQ being unfair? In terms of time it
> > still tries to be fair. 
> we can dispatch a lot of requests to NCQ SSD with very small time
> interval. The disk can finish a lot of requests in small time interval
> too. The time is much smaller than 1 jiffy. Increasing quantum can lead
> a task dispatches request more faster and makes the accounting worse,
> because with small quantum the task needs wait to dispatch. you can
> easily verify this with a simple fio test.
> 
> > That's a different thing that with NCQ, right
> > time measurement is not possible with requests from multiple queues
> > being in the driver/disk at the same time. So accouting in terms of
> > iops per queue might make sense.
> yes.
> 
> > > > So I really don't understand that what are you doing fundamentally
> > > > different in FIOPS ioscheduler. 
> > > > 
> > > > The only thing I can think of more accurate accounting per queue in
> > > > terms of number of IOs instead of time. Which can just serve to improve
> > > > fairness a bit for certain workloads. In practice, I think it might
> > > > not matter much.
> > > If quantum is big, CFQ will have better performance, but it actually
> > > fallbacks to Noop, no any fairness. fairness is important and is why we
> > > introduce CFQ.
> > 
> > It is not exactly noop. It still preempts writes and prioritizes reads
> > and direct writes. 
> sure, I mean fairness mostly here.
> 
> > Also, what's the real life workload where you face issues with using
> > say deadline with these flash based storage.
> deadline doesn't provide fairness. mainly cgroup workload. workload with
> different ioprio has issues too, but I don't know which real workload
> uses ioprio.

Personally I have not run into any workload which sustains deep queue
depths for a very long time; I had to run fio to create such scenarios.

Not driving deep queue depths leads to queue expiration (otherwise
idling would kill performance on these fast devices). And without
idling, most of the slice and accounting logic does not help: a queue
dispatches some requests and expires, irrespective of the time slice
you had allocated to it based on ioprio.

That's why I am insisting that any move in this direction should be
driven by some real workload instead of synthetic ones.

> 
> > > 
> > > In summary, CFQ isn't both fair and good performance. FIOPS is trying to
> > > be fair and have good performance. I didn't think any time based
> > > accounting can make the goal happen for NCQ and SSD (even cfq cgroup
> > > code has iops mode, so suppose you should already know this well).
> > > 
> > > Surely you can change CFQ to make it IOPS based, but this will mess the
> > > code a lot, and FIOPS shares a lot of code with CFQ. So I'd like to have
> > > a separate ioscheduler which is IOPS based.
> > 
> > I think writing a separate IO scheduler just to do accouting in IOPS while
> > retaining rest of the CFQ code is not a very good idea. Modifying CFQ code
> > to be able to deal with both time based as well as IOPS accounting might
> > turn out to be simpler.
> changing CFQ works, but I really want to avoid having something like
> if (iops) 
>    xxx
> else
>    xxx

I think you can provide a wrapper function and abstract out the unit of
time as a "charge" or something like that.

> I plan adding scales for read/write, request size, etc, because
> read/write cost is different and request with different size has
> different cost in SSD. This can be added to CFQ too with pain. That said
> I didn't completely object to make CFQ support IOPS accounting, but my
> feeling is a separate ioscheduler is more clean.

You can provide one function to calculate the "charge", which can be
either time or some kind of IOPS unit adjusted for request size, and
use that.
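
Just to illustrate what I mean (a sketch only, not against any
particular tree; the helper name, the request size scaling and the use
of blk_queue_nonrot() as the switch are all made up here):

/*
 * The rest of CFQ would account this opaque "charge" instead of raw
 * jiffies.  Only this helper knows whether the charge is elapsed time
 * (rotational disks) or an IOPS-style unit scaled by request size
 * (flash).
 */
static u64 cfq_rq_charge(struct cfq_data *cfqd, struct request *rq,
                         u64 elapsed_jiffies)
{
        if (!blk_queue_nonrot(cfqd->queue))
                return elapsed_jiffies;         /* time based accounting */

        /* IOPS mode: one unit per request plus one unit per 4KB of data */
        return 1 + (blk_rq_bytes(rq) >> 12);
}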

Given that the new IOPS scheduler will share lots of code with CFQ (I
presume 80-90% of the concepts are the same) and the only major
difference is accounting, I tend to think that modifying CFQ is better.

Jens, what do you think?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 0/3]block: An IOPS based ioscheduler
  2012-01-16  8:29                   ` Vivek Goyal
@ 2012-01-17  1:06                     ` Shaohua Li
  2012-01-17  9:02                       ` Vivek Goyal
  0 siblings, 1 reply; 29+ messages in thread
From: Shaohua Li @ 2012-01-17  1:06 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Dave Chinner, linux-kernel, axboe, jmoyer, zhu.yanhai

On Mon, 2012-01-16 at 03:29 -0500, Vivek Goyal wrote:
> On Mon, Jan 16, 2012 at 03:55:41PM +0800, Shaohua Li wrote:
> > On Mon, 2012-01-16 at 02:11 -0500, Vivek Goyal wrote:
> > > On Mon, Jan 16, 2012 at 12:36:30PM +0800, Shaohua Li wrote:
> > > > On Sun, 2012-01-15 at 17:45 -0500, Vivek Goyal wrote:
> > > > > On Mon, Jan 09, 2012 at 09:09:35AM +0800, Shaohua Li wrote:
> > > > > 
> > > > > [..]
> > > > > > > You need to present raw numbers and give us some idea of how close
> > > > > > > those numbers are to raw hardware capability for us to have any idea
> > > > > > > what improvements these numbers actually demonstrate.
> > > > > > Yes, your guess is right. The hardware has limitation. 12 SSD exceeds
> > > > > > the jbod capability, for both throughput and IOPS, that's why only
> > > > > > read/write mixed workload impacts. I'll use less SSD in later tests,
> > > > > > which will demonstrate the performance better. I'll report both raw
> > > > > > numbers and fiops/cfq numbers later.
> > > > > 
> > > > > If fiops number are better please explain why those numbers are better.
> > > > > If you cut down on idling, it is obivious that you will get higher
> > > > > throughput on these flash devices. CFQ does disable queue idling for
> > > > > non rotational NCQ devices. If higher throughput is due to driving
> > > > > deeper queue depths, then CFQ can do that too just by changing quantum
> > > > > and disabling idling. 
> > > > it's because of quantum. Surely you can change the quantum, and CFQ
> > > > performance will increase, but you will find CFQ is very unfair then.
> > > 
> > > Why increasing quantum leads to CFQ being unfair? In terms of time it
> > > still tries to be fair. 
> > we can dispatch a lot of requests to NCQ SSD with very small time
> > interval. The disk can finish a lot of requests in small time interval
> > too. The time is much smaller than 1 jiffy. Increasing quantum can lead
> > a task dispatches request more faster and makes the accounting worse,
> > because with small quantum the task needs wait to dispatch. you can
> > easily verify this with a simple fio test.
> > 
> > > That's a different thing that with NCQ, right
> > > time measurement is not possible with requests from multiple queues
> > > being in the driver/disk at the same time. So accouting in terms of
> > > iops per queue might make sense.
> > yes.
> > 
> > > > > So I really don't understand that what are you doing fundamentally
> > > > > different in FIOPS ioscheduler. 
> > > > > 
> > > > > The only thing I can think of more accurate accounting per queue in
> > > > > terms of number of IOs instead of time. Which can just serve to improve
> > > > > fairness a bit for certain workloads. In practice, I think it might
> > > > > not matter much.
> > > > If quantum is big, CFQ will have better performance, but it actually
> > > > fallbacks to Noop, no any fairness. fairness is important and is why we
> > > > introduce CFQ.
> > > 
> > > It is not exactly noop. It still preempts writes and prioritizes reads
> > > and direct writes. 
> > sure, I mean fairness mostly here.
> > 
> > > Also, what's the real life workload where you face issues with using
> > > say deadline with these flash based storage.
> > deadline doesn't provide fairness. mainly cgroup workload. workload with
> > different ioprio has issues too, but I don't know which real workload
> > uses ioprio.
> 
> Personally I have not run into any workload which provides deep queue depths
> constantly for a very long time. I had to run fio to create such
> scnearios.
> 
> Not running deep queue depths will lead to expiration of queue (Otherwise
> idling will kill performance on these fast devices). And without idling
> most of the logic of slice and accounting does not help. A queue
> dispatches some requests and expires (irrespective of what time slice
> you had allocated it based on ioprio).
That's true. If the workload doesn't drive deep queue depths, no
accounting scheme helps for NCQ disks, as far as I have tried. Idling
is the only way to make the accounting correct, but it hurts
performance too much.

> That's why I am insisting that it would be nice that any move in this
> direction should be driven by some real workload instead of just coming
> up with synthetic workloads.
I think yanhai from taobao (cc-ed) has a real workload where he found
CFQ performance suffers a lot.

Thanks,
Shaohua


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 0/3]block: An IOPS based ioscheduler
  2012-01-17  1:06                     ` Shaohua Li
@ 2012-01-17  9:02                       ` Vivek Goyal
  2012-01-18  1:20                         ` Shaohua Li
  0 siblings, 1 reply; 29+ messages in thread
From: Vivek Goyal @ 2012-01-17  9:02 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Dave Chinner, linux-kernel, axboe, jmoyer, zhu.yanhai

On Tue, Jan 17, 2012 at 09:06:28AM +0800, Shaohua Li wrote:
> On Mon, 2012-01-16 at 03:29 -0500, Vivek Goyal wrote:
> > On Mon, Jan 16, 2012 at 03:55:41PM +0800, Shaohua Li wrote:
> > > On Mon, 2012-01-16 at 02:11 -0500, Vivek Goyal wrote:
> > > > On Mon, Jan 16, 2012 at 12:36:30PM +0800, Shaohua Li wrote:
> > > > > On Sun, 2012-01-15 at 17:45 -0500, Vivek Goyal wrote:
> > > > > > On Mon, Jan 09, 2012 at 09:09:35AM +0800, Shaohua Li wrote:
> > > > > > 
> > > > > > [..]
> > > > > > > > You need to present raw numbers and give us some idea of how close
> > > > > > > > those numbers are to raw hardware capability for us to have any idea
> > > > > > > > what improvements these numbers actually demonstrate.
> > > > > > > Yes, your guess is right. The hardware has limitation. 12 SSD exceeds
> > > > > > > the jbod capability, for both throughput and IOPS, that's why only
> > > > > > > read/write mixed workload impacts. I'll use less SSD in later tests,
> > > > > > > which will demonstrate the performance better. I'll report both raw
> > > > > > > numbers and fiops/cfq numbers later.
> > > > > > 
> > > > > > If fiops number are better please explain why those numbers are better.
> > > > > > If you cut down on idling, it is obivious that you will get higher
> > > > > > throughput on these flash devices. CFQ does disable queue idling for
> > > > > > non rotational NCQ devices. If higher throughput is due to driving
> > > > > > deeper queue depths, then CFQ can do that too just by changing quantum
> > > > > > and disabling idling. 
> > > > > it's because of quantum. Surely you can change the quantum, and CFQ
> > > > > performance will increase, but you will find CFQ is very unfair then.
> > > > 
> > > > Why increasing quantum leads to CFQ being unfair? In terms of time it
> > > > still tries to be fair. 
> > > we can dispatch a lot of requests to NCQ SSD with very small time
> > > interval. The disk can finish a lot of requests in small time interval
> > > too. The time is much smaller than 1 jiffy. Increasing quantum can lead
> > > a task dispatches request more faster and makes the accounting worse,
> > > because with small quantum the task needs wait to dispatch. you can
> > > easily verify this with a simple fio test.
> > > 
> > > > That's a different thing that with NCQ, right
> > > > time measurement is not possible with requests from multiple queues
> > > > being in the driver/disk at the same time. So accouting in terms of
> > > > iops per queue might make sense.
> > > yes.
> > > 
> > > > > > So I really don't understand that what are you doing fundamentally
> > > > > > different in FIOPS ioscheduler. 
> > > > > > 
> > > > > > The only thing I can think of more accurate accounting per queue in
> > > > > > terms of number of IOs instead of time. Which can just serve to improve
> > > > > > fairness a bit for certain workloads. In practice, I think it might
> > > > > > not matter much.
> > > > > If quantum is big, CFQ will have better performance, but it actually
> > > > > fallbacks to Noop, no any fairness. fairness is important and is why we
> > > > > introduce CFQ.
> > > > 
> > > > It is not exactly noop. It still preempts writes and prioritizes reads
> > > > and direct writes. 
> > > sure, I mean fairness mostly here.
> > > 
> > > > Also, what's the real life workload where you face issues with using
> > > > say deadline with these flash based storage.
> > > deadline doesn't provide fairness. mainly cgroup workload. workload with
> > > different ioprio has issues too, but I don't know which real workload
> > > uses ioprio.
> > 
> > Personally I have not run into any workload which provides deep queue depths
> > constantly for a very long time. I had to run fio to create such
> > scnearios.
> > 
> > Not running deep queue depths will lead to expiration of queue (Otherwise
> > idling will kill performance on these fast devices). And without idling
> > most of the logic of slice and accounting does not help. A queue
> > dispatches some requests and expires (irrespective of what time slice
> > you had allocated it based on ioprio).
> That's true, if workload doesn't drive deep queue depths, any accounting
> can't help for NCQ disks as far as I tried. Idling is the only method to
> make accounting correct, but it impacts performance too much.

Idling kills performance, and the faster the device, the more
prominent the effects of idling. So to me using CFQ on these fast
devices is not a very good idea, and deadline might serve just as well.

> 
> > That's why I am insisting that it would be nice that any move in this
> > direction should be driven by some real workload instead of just coming
> > up with synthetic workloads.
> I thought yanhai from taobao (cc-ed) has real workload and he found cfq
> performance suffers a lot.

Can we run that real workload with "deadline" and see what kind of
concerns we have? Is anybody getting starved for a long time? If not,
then we don't have to do anything.

I think trying to make CFQ work (or trying to come up with a CFQ-like
IOPS scheduler) on these fast devices might not lead us anywhere.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 0/3]block: An IOPS based ioscheduler
  2012-01-17  9:02                       ` Vivek Goyal
@ 2012-01-18  1:20                         ` Shaohua Li
  2012-01-18 13:04                           ` Vivek Goyal
  0 siblings, 1 reply; 29+ messages in thread
From: Shaohua Li @ 2012-01-18  1:20 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Dave Chinner, linux-kernel, axboe, jmoyer, zhu.yanhai

On Tue, 2012-01-17 at 04:02 -0500, Vivek Goyal wrote:
> On Tue, Jan 17, 2012 at 09:06:28AM +0800, Shaohua Li wrote:
> > On Mon, 2012-01-16 at 03:29 -0500, Vivek Goyal wrote:
> > > On Mon, Jan 16, 2012 at 03:55:41PM +0800, Shaohua Li wrote:
> > > > On Mon, 2012-01-16 at 02:11 -0500, Vivek Goyal wrote:
> > > > > On Mon, Jan 16, 2012 at 12:36:30PM +0800, Shaohua Li wrote:
> > > > > > On Sun, 2012-01-15 at 17:45 -0500, Vivek Goyal wrote:
> > > > > > > On Mon, Jan 09, 2012 at 09:09:35AM +0800, Shaohua Li wrote:
> > > > > > > 
> > > > > > > [..]
> > > > > > > > > You need to present raw numbers and give us some idea of how close
> > > > > > > > > those numbers are to raw hardware capability for us to have any idea
> > > > > > > > > what improvements these numbers actually demonstrate.
> > > > > > > > Yes, your guess is right. The hardware has limitation. 12 SSD exceeds
> > > > > > > > the jbod capability, for both throughput and IOPS, that's why only
> > > > > > > > read/write mixed workload impacts. I'll use less SSD in later tests,
> > > > > > > > which will demonstrate the performance better. I'll report both raw
> > > > > > > > numbers and fiops/cfq numbers later.
> > > > > > > 
> > > > > > > If fiops number are better please explain why those numbers are better.
> > > > > > > If you cut down on idling, it is obivious that you will get higher
> > > > > > > throughput on these flash devices. CFQ does disable queue idling for
> > > > > > > non rotational NCQ devices. If higher throughput is due to driving
> > > > > > > deeper queue depths, then CFQ can do that too just by changing quantum
> > > > > > > and disabling idling. 
> > > > > > it's because of quantum. Surely you can change the quantum, and CFQ
> > > > > > performance will increase, but you will find CFQ is very unfair then.
> > > > > 
> > > > > Why increasing quantum leads to CFQ being unfair? In terms of time it
> > > > > still tries to be fair. 
> > > > we can dispatch a lot of requests to NCQ SSD with very small time
> > > > interval. The disk can finish a lot of requests in small time interval
> > > > too. The time is much smaller than 1 jiffy. Increasing quantum can lead
> > > > a task dispatches request more faster and makes the accounting worse,
> > > > because with small quantum the task needs wait to dispatch. you can
> > > > easily verify this with a simple fio test.
> > > > 
> > > > > That's a different thing that with NCQ, right
> > > > > time measurement is not possible with requests from multiple queues
> > > > > being in the driver/disk at the same time. So accouting in terms of
> > > > > iops per queue might make sense.
> > > > yes.
> > > > 
> > > > > > > So I really don't understand that what are you doing fundamentally
> > > > > > > different in FIOPS ioscheduler. 
> > > > > > > 
> > > > > > > The only thing I can think of more accurate accounting per queue in
> > > > > > > terms of number of IOs instead of time. Which can just serve to improve
> > > > > > > fairness a bit for certain workloads. In practice, I think it might
> > > > > > > not matter much.
> > > > > > If quantum is big, CFQ will have better performance, but it actually
> > > > > > fallbacks to Noop, no any fairness. fairness is important and is why we
> > > > > > introduce CFQ.
> > > > > 
> > > > > It is not exactly noop. It still preempts writes and prioritizes reads
> > > > > and direct writes. 
> > > > sure, I mean fairness mostly here.
> > > > 
> > > > > Also, what's the real life workload where you face issues with using
> > > > > say deadline with these flash based storage.
> > > > deadline doesn't provide fairness. mainly cgroup workload. workload with
> > > > different ioprio has issues too, but I don't know which real workload
> > > > uses ioprio.
> > > 
> > > Personally I have not run into any workload which provides deep queue depths
> > > constantly for a very long time. I had to run fio to create such
> > > scnearios.
> > > 
> > > Not running deep queue depths will lead to expiration of queue (Otherwise
> > > idling will kill performance on these fast devices). And without idling
> > > most of the logic of slice and accounting does not help. A queue
> > > dispatches some requests and expires (irrespective of what time slice
> > > you had allocated it based on ioprio).
> > That's true, if workload doesn't drive deep queue depths, any accounting
> > can't help for NCQ disks as far as I tried. Idling is the only method to
> > make accounting correct, but it impacts performance too much.
> 
> Idiling will kill performance and faster the device, more prominent are
> the effects of idling. So to me using CFQ on these fast devices is not
> a very good idea and deadline might just serve well.
> 
> > 
> > > That's why I am insisting that it would be nice that any move in this
> > > direction should be driven by some real workload instead of just coming
> > > up with synthetic workloads.
> > I thought yanhai from taobao (cc-ed) has real workload and he found cfq
> > performance suffers a lot.
> 
> Can we run that real workload with "deadline" and see what kind of
> concerns do we have. Is anybody getting starved for long time. If not,
> then we don't have to do anything.
> 
> I think trying to make to make CFQ work (Or trying to come up with CFQ
> like IOPS scheduler) on these fast devices might not lead us anywhere.
If only performance mattered, I'd rather use noop for SSD. There is a
requirement for cgroup support (and maybe ioprio) to give different
tasks different bandwidth.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 0/3]block: An IOPS based ioscheduler
  2012-01-18  1:20                         ` Shaohua Li
@ 2012-01-18 13:04                           ` Vivek Goyal
  2012-01-19  1:21                             ` Shaohua Li
  0 siblings, 1 reply; 29+ messages in thread
From: Vivek Goyal @ 2012-01-18 13:04 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Dave Chinner, linux-kernel, axboe, jmoyer, zhu.yanhai

On Wed, Jan 18, 2012 at 09:20:37AM +0800, Shaohua Li wrote:

[..]
> > I think trying to make to make CFQ work (Or trying to come up with CFQ
> > like IOPS scheduler) on these fast devices might not lead us anywhere.
> If only performance matters, I'd rather use noop for ssd. There is
> requirement to have cgroup support (maybe ioprio) to give different
> tasks different bandwidth.

Sure, but the issue is that we need to idle in an effort to prioritize
a task, and idling kills performance. So you can implement something,
but I doubt it is going to be very useful on fast hardware.

Another issue is that flash based storage can drive really deep queue
depths. If that's the case, then the ioscheduler alone can't solve the
prioritization issues (unless the ioscheduler stops driving deep queue
depths, which kills performance). We need some kind of cooperation from
the device (for example, the device understanding the notion of io
priority), so that the device can prioritize requests and we don't need
to idle. That way, we might be able to get service differentiation
while keeping reasonable throughput.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC 0/3]block: An IOPS based ioscheduler
  2012-01-18 13:04                           ` Vivek Goyal
@ 2012-01-19  1:21                             ` Shaohua Li
  0 siblings, 0 replies; 29+ messages in thread
From: Shaohua Li @ 2012-01-19  1:21 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Dave Chinner, linux-kernel, axboe, jmoyer, zhu.yanhai

On Wed, 2012-01-18 at 08:04 -0500, Vivek Goyal wrote:
> On Wed, Jan 18, 2012 at 09:20:37AM +0800, Shaohua Li wrote:
> 
> [..]
> > > I think trying to make to make CFQ work (Or trying to come up with CFQ
> > > like IOPS scheduler) on these fast devices might not lead us anywhere.
> > If only performance matters, I'd rather use noop for ssd. There is
> > requirement to have cgroup support (maybe ioprio) to give different
> > tasks different bandwidth.
> 
> Sure but the issue is that we need to idle in an effort to prioritize
> a task and idling kills performance. So you can implement something but
> I have doubts that on a fast hardware it is going to be very useful.
I don't do idling in fiops. If the workload iodepth is low this causes
a fairness problem, and I just leave it be. There is no way to make a
low-iodepth workload fair without sacrificing performance; CFQ on SSD
has the same problem.

> Another issue is that with flash based storage, it can drive really deep
> queue depths. If that's the case, then just ioscheduler can't solve the
> prioritazaion issues (until and unless ioscheduler does not drive deep
> queue depths and kills performance). We need some kind of cooperation
> from device (like device understanding the notion of iopriority), so
> that device can prioritize the requests and one need not to idle. That
> way, we might be able to get service differentiation while getting
> reasonable throughput.
SSD is still like a normal disk in terms of queue depth. I don't know
the iodepth of a PCIe flash card. If the queue depth of such a card is
very big (I suppose there is a limit, because beyond a critical point
increasing the queue depth doesn't increase device performance), we
definitely need to reconsider this.

It would be great if the device let the ioscheduler know more info. In
my mind, I hope the device can dynamically adjust its queue depth. For
example, on my SSD, if the request size is 4k I get max throughput with
queue depth 32, while with 128k requests a queue depth of just 2 is
enough to reach peak throughput. If the device could stop fetching
requests once 2 128k requests are pending, that would solve some
fairness issues for low-iodepth workloads. Unfortunately device
capability for different request sizes depends heavily on the vendor.
The fiops request size scale tries to address this, but I haven't found
a good scale model for it yet.
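
Just to illustrate the idea (the helper and the constants below are
made up, not what the posted patches do), the scaling could look
roughly like:

/*
 * Rough sketch: charge a request a base vios plus an extra amount
 * proportional to its size beyond 4KB.  The base, the 4KB threshold
 * and the 1/8 slope are placeholders; the right model is device
 * dependent.
 */
#define VIOS_SCALE_SHIFT        10
#define VIOS_SCALE              (1 << VIOS_SCALE_SHIFT)

static u64 fiops_scaled_vios(struct request *rq)
{
        unsigned int kb = blk_rq_bytes(rq) >> 10;
        u64 vios = VIOS_SCALE;                  /* base cost of one request */

        if (kb > 4)
                vios += (u64)(kb - 4) * (VIOS_SCALE >> 3);

        return vios;
}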

I suppose the device cannot do good prioritization either if the
workload iodepth is low. If there are only a few requests pending,
nobody (device or ioscheduler) can make a correct judgment, because
there isn't enough information.

Thanks,
Shaohua


^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread

Thread overview: 29+ messages
2012-01-04  6:53 [RFC 0/3]block: An IOPS based ioscheduler Shaohua Li
2012-01-04  6:53 ` [RFC 1/3]block: seperate CFQ io context management code Shaohua Li
2012-01-04  8:19   ` Namhyung Kim
2012-01-04  6:53 ` [RFC 2/3]block: FIOPS ioscheduler core Shaohua Li
2012-01-06  6:05   ` Namjae Jeon
2012-01-07  1:06   ` Zhu Yanhai
2012-01-04  6:53 ` [RFC 3/3]block: fiops read/write request scale Shaohua Li
2012-01-04  7:19 ` [RFC 0/3]block: An IOPS based ioscheduler Dave Chinner
2012-01-05  6:50   ` Shaohua Li
2012-01-06  5:12     ` Shaohua Li
2012-01-06  9:10       ` Namhyung Kim
2012-01-06 14:37       ` Jan Kara
2012-01-09  1:26         ` Shaohua Li
2012-01-15 22:32           ` Vivek Goyal
2012-01-08 22:16       ` Dave Chinner
2012-01-09  1:09         ` Shaohua Li
2012-01-15 22:45           ` Vivek Goyal
2012-01-16  4:36             ` Shaohua Li
2012-01-16  7:11               ` Vivek Goyal
2012-01-16  7:55                 ` Shaohua Li
2012-01-16  8:29                   ` Vivek Goyal
2012-01-17  1:06                     ` Shaohua Li
2012-01-17  9:02                       ` Vivek Goyal
2012-01-18  1:20                         ` Shaohua Li
2012-01-18 13:04                           ` Vivek Goyal
2012-01-19  1:21                             ` Shaohua Li
2012-01-15 22:28       ` Vivek Goyal
2012-01-06  9:41 ` Zhu Yanhai
2012-01-15 22:24 ` Vivek Goyal
