linux-kernel.vger.kernel.org archive mirror
* [patch v2 0/8]block: An IOPS based ioscheduler
@ 2012-01-30  7:02 Shaohua Li
  2012-01-30  7:02 ` [patch v2 1/8]block: fiops ioscheduler core Shaohua Li
                   ` (9 more replies)
  0 siblings, 10 replies; 13+ messages in thread
From: Shaohua Li @ 2012-01-30  7:02 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, vgoyal, david, jack, zhu.yanhai, namhyung.kim, shaohua.li

An IOPS based I/O scheduler

Flash based storage has different characteristics from rotating disks:
1. No I/O seek.
2. Read and write I/O costs are usually quite different.
3. The time a request takes depends on the request size.
4. High throughput and IOPS, low latency.

CFQ iosched does well for rotating disks, for example fair dispatching and
idling for sequential reads. It also has optimizations for flash based
storage (for item 1 above), but overall it's not designed for flash based
storage. It's a slice based algorithm. Since the per-request cost of flash
based storage is very low, and drives with a big queue_depth are quite
popular now, which makes the dispatching cost even lower, CFQ's jiffy based
slice accounting doesn't work well. CFQ also doesn't consider items 2 and 3
above.

FIOPS (Fair IOPS) ioscheduler tries to fix these gaps. It's IOPS based, so
it only targets drives without I/O seek. It's quite similar to CFQ, but the
dispatch decision is made according to IOPS instead of slices.

To illustrate the design goals, let's compare Noop and CFQ:
Noop: best throughput; no fairness and high latency for sync.
CFQ: lower throughput in some cases; fairness and low latency for sync.
CFQ throughput is sometimes low because it doesn't drive a deep queue depth.
FIOPS adopts some merits of CFQ, for example fairness and biasing toward
sync workloads, and it will generally be faster than CFQ.

Note, if the workload iodepth is low, there is no way to maintain fairness
without sacrificing performance; CFQ can't either. In such cases FIOPS
chooses not to lose performance, because flash based storage is usually very
fast and expensive, so performance is more important.

The algorithm is simple. The drive has a service tree, and each task lives
in the tree. The key into the tree is called vios (virtual I/O). Every
request has a vios, which is calculated according to its ioprio, request
size and so on. A task's vios is the sum of the vios of all requests it
dispatches. FIOPS always selects the task with the minimum vios in the
service tree and lets that task dispatch a request. The dispatched request's
vios is then added to the task's vios and the task is repositioned in the
service tree.
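
To make the accounting concrete, below is a toy userspace simulation of the
min-vios selection (not part of the patches; all names are illustrative).
With two tasks whose per-request cost differs 3:1, the cheaper task ends up
dispatching about 3 times as many requests:

/* Toy model of vios accounting; plain userspace C, not kernel code. */
#include <stdio.h>

int main(void)
{
	/* per-request vios: task 0 is charged 1024, task 1 is charged 3072 */
	unsigned long long cost[2] = { 1024, 3072 };
	unsigned long long vios[2] = { 0, 0 };
	unsigned long dispatched[2] = { 0, 0 };
	int i;

	for (i = 0; i < 100000; i++) {
		/* always pick the task with the minimum accumulated vios */
		int t = (vios[0] <= vios[1]) ? 0 : 1;

		vios[t] += cost[t];	/* charge the dispatched request */
		dispatched[t]++;	/* ... and "reposition" the task */
	}
	/* prints roughly a 3:1 ratio, the inverse of the cost ratio */
	printf("task0 %lu task1 %lu\n", dispatched[0], dispatched[1]);
	return 0;
}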

Benchmark results:
The SSD I'm using: max throughput read 250MB/s, write 80MB/s; max IOPS for
4k requests read 40k/s, write 20k/s.
Latency and fairness tests are done on a desktop with one SSD and kernel
parameter mem=1G. I compare noop, cfq and fiops for those workloads. The
test scripts and results are attached. Throughput tests are done on a
4-socket server with 8 SSDs. There I compare cfq and fiops.

Latency
--------------------------
latency-1read-iodepth32-test
latency-8read-iodepth1-test
latency-8read-iodepth4-test
latency-32read-iodepth1-test
latency-32read-iodepth4-test
In all the tests, sync workloads have lower latency with CFQ. FIOPS is worse
than CFQ but much better than noop, because it doesn't do preemption and
strictly follows the 2.5:1 sync/async share ratio.
If preemption is added (I have a debug patch - the last patch in the
series), FIOPS gets results similar to CFQ.

Fairness
-------------------------
fairness-2read-iodepth8-test
fairness-2read-iodepth32-test
fairness-8read-iodepth4-test
fairness-32read-iodepth2-test

In the tests, thread group 2 should get about 2.33 times the IOPS of thread
group 1.

The first test doesn't drive a big io depth (the drive io depth is 31). No
ioscheduler is fair there. The thread2/thread1 ratio is: 0.8 (CFQ), 1 (NOOP,
FIOPS).
In the last 3 tests, the ratios with CFQ are 2.69, 2.78, 7.54; the ratios
with FIOPS are 2.33, 2.32, 2.32; NOOP always gives 1.

FIOPS is fairer than CFQ, because CFQ uses jiffies to measure slices, and
one jiffy is too coarse for SSDs and NCQ disks.
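
To put numbers on it: with HZ=1000 a jiffy is 1ms, and an SSD doing 40k IOPS
(the 4k read number above) completes roughly 40 requests per jiffy, so
jiffy-granularity slice accounting can't resolve individual dispatches,
while IOPS accounting charges each request exactly.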

Note that in all the tests, NOOP and FIOPS can drive the peak IOPS, while
CFQ can only drive the peak IOPS in the second test.

Throughput
------------------------
                    workload    cfq     fiops   changes
              fio_sync_read_4k  3186.3  3304.0  3.6%
             fio_mediaplay_64k  3303.7  3372.0  2.0%
            fio_mediaplay_128k  3256.3  3405.7  4.4%
           fio_sync_read_rr_4k  4058.3  4071.3  0.3%
              fio_media_rr_64k  3946.0  4013.3  1.7%
  fio_sync_write_rr_64k_create  700.7   692.7   -1.2%
     fio_sync_write_64k_create  697.0   696.7   -0.0%
    fio_sync_write_128k_create  672.7   675.7   0.4%
             fio_sync_write_4k  667.7   682.3   2.1%
            fio_sync_write_64k  721.3   714.7   -0.9%
           fio_sync_write_128k  704.7   703.0   -0.2%
           fio_aio_randread_4k  534.3   656.7   18.6%
          fio_aio_randread_64k  1877.0  1881.3  0.2%
          fio_aio_randwrite_4k  306.0   366.0   16.4%
         fio_aio_randwrite_64k  481.0   485.3   0.9%
             fio_aio_randrw_4k  92.5    215.7   57.1%
            fio_aio_randrw_64k  352.0   346.3   -1.6%
                      fio_tpcc  328/98  341.6/99.1 3.9%/1.1%
                      fio_tpch  11576.3 11583.3 0.1%
          fio_mmap_randread_1k  6464.0  6472.0  0.1%
          fio_mmap_randread_4k  9321.3  9636.0  3.3%
         fio_mmap_randread_64k  11507.7 11420.0 -0.8%
         fio_mmap_randwrite_1k  68.1    63.4    -7.4%
         fio_mmap_randwrite_4k  261.7   250.3   -4.5%
        fio_mmap_randwrite_64k  414.0   414.7   0.2%
            fio_mmap_randrw_1k  65.8    64.5    -2.1%
            fio_mmap_randrw_4k  260.7   241.3   -8.0%
           fio_mmap_randrw_64k  424.0   429.7   1.3%
         fio_mmap_sync_read_4k  3235.3  3239.7  0.1%
        fio_mmap_sync_read_64k  3265.3  3208.3  -1.8%
       fio_mmap_sync_read_128k  3202.3  3250.3  1.5%
      fio_mmap_sync_read_rr_4k  2328.7  2368.0  1.7%
     fio_mmap_sync_read_rr_64k  2425.0  2416.0  -0.4%

FIOPS is much better for some aio workloads, because it can drive a deep
queue depth. For workloads where a low queue depth already saturates the
SSD, CFQ and FIOPS show no difference.

For some mmap rand read/write workloads, CFQ is better. Again this is
because CFQ has sync preemption. The debug patch, the last one in the
series, can close the gap.

Benchmark Summary
------------------------
FIOPS is fairer and has higher throughput. The throughput gain comes from
driving a deeper queue depth. The fairness gain comes from IOPS based
accounting being more accurate.
FIOPS is worse at biasing toward sync workloads and has lower throughput in
some tests. This is fixable (like the debug patch mentioned above), but I
didn't want to push that patch in, because it will starve async workloads
(the same as CFQ). When we talk about biasing sync, I think we need a knob
for how strong the bias should be; starving async doesn't sound optimal
either.

CGROUP
-----------------------
CGROUP isn't implemented yet. FIOPS is fairer, which is very important for
CGROUP. Given that FIOPS uses vios to index the service tree, implementing
CGROUP should be relatively easy. Hierarchical CGROUP can be easily
implemented too, which CFQ is still lacking.

The series is organized as follows:
Patch 1: the core FIOPS.
Patch 2: request read/write vios scaling. This demonstrates how the vios
scales.
Patch 3: sync/async scaling.
Patch 4: ioprio support.
Patch 5: a tweak to preserve a deep-iodepth task's share.
Patch 6: a tweak to further bias sync tasks.
Patch 7: basic trace message support.
Patch 8: a debug patch to do sync workload preemption.

TODO:
1. request size based vios scale
2. cgroup support
3. automatically select default iosched according to QUEUE_FLAG_NONROT.

Comments and suggestions are welcome!


* [patch v2 1/8]block: fiops ioscheduler core
  2012-01-30  7:02 [patch v2 0/8]block: An IOPS based ioscheduler Shaohua Li
@ 2012-01-30  7:02 ` Shaohua Li
  2012-01-30  7:02 ` [patch v2 2/8]block: fiops read/write request scale Shaohua Li
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 13+ messages in thread
From: Shaohua Li @ 2012-01-30  7:02 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, vgoyal, david, jack, zhu.yanhai, namhyung.kim, shaohua.li

[-- Attachment #1: fiops-iosch.patch --]
[-- Type: text/plain, Size: 16453 bytes --]

FIOPS (Fair IOPS) is an IOPS based ioscheduler, so it only targets drives
without I/O seek. It's quite similar to CFQ, but the dispatch decision is
made according to IOPS instead of slices.

The algorithm is simple. The drive has a service tree, and each task lives
in the tree. The key into the tree is called vios (virtual I/O). Every
request has a vios, which is calculated according to its ioprio, request
size and so on. A task's vios is the sum of the vios of all requests it
dispatches. FIOPS always selects the task with the minimum vios in the
service tree and lets that task dispatch a request. The dispatched request's
vios is then added to the task's vios and the task is repositioned in the
service tree.

Unlike CFQ, FIOPS doesn't have separate sync/async queues, because with
I/O-less writeback a task usually dispatches either only sync or only async
requests. Biasing read or write requests can still be done with the
read/write scale.

One issue is that if the workload iodepth is lower than the drive
queue_depth, a task's IOPS share might not strictly follow its priority,
request size and so on. In this case the drive is actually idle at times.
Solving that would require idling the drive, which hurts performance. I
believe CFQ isn't completely fair between tasks in such cases either.
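
Usage note (assuming the standard elevator sysfs interface): once built, the
scheduler can be selected per device at runtime in the usual way, e.g.
	echo fiops > /sys/block/<dev>/queue/scheduler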

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
---
 block/Kconfig.iosched |   12 +
 block/Makefile        |    1 
 block/fiops-iosched.c |  556 ++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 569 insertions(+)

Index: linux/block/Makefile
===================================================================
--- linux.orig/block/Makefile	2012-01-30 13:48:43.000000000 +0800
+++ linux/block/Makefile	2012-01-30 14:34:27.000000000 +0800
@@ -15,6 +15,7 @@ obj-$(CONFIG_BLK_DEV_THROTTLING)	+= blk-
 obj-$(CONFIG_IOSCHED_NOOP)	+= noop-iosched.o
 obj-$(CONFIG_IOSCHED_DEADLINE)	+= deadline-iosched.o
 obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched.o
+obj-$(CONFIG_IOSCHED_FIOPS)	+= fiops-iosched.o
 
 obj-$(CONFIG_BLOCK_COMPAT)	+= compat_ioctl.o
 obj-$(CONFIG_BLK_DEV_INTEGRITY)	+= blk-integrity.o
Index: linux/block/fiops-iosched.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux/block/fiops-iosched.c	2012-01-30 14:34:27.000000000 +0800
@@ -0,0 +1,556 @@
+/*
+ * IOPS based IO scheduler. Based on CFQ.
+ *  Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
+ *  Shaohua Li <shli@kernel.org>
+ */
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/blkdev.h>
+#include <linux/elevator.h>
+#include <linux/jiffies.h>
+#include <linux/rbtree.h>
+#include <linux/ioprio.h>
+#include "blk.h"
+
+#define VIOS_SCALE_SHIFT 10
+#define VIOS_SCALE (1 << VIOS_SCALE_SHIFT)
+
+struct fiops_rb_root {
+	struct rb_root rb;
+	struct rb_node *left;
+	unsigned count;
+
+	u64 min_vios;
+};
+#define FIOPS_RB_ROOT	(struct fiops_rb_root) { .rb = RB_ROOT}
+
+struct fiops_data {
+	struct request_queue *queue;
+
+	struct fiops_rb_root service_tree;
+
+	unsigned int busy_queues;
+	unsigned int in_flight[2];
+
+	struct work_struct unplug_work;
+};
+
+struct fiops_ioc {
+	struct io_cq icq;
+
+	unsigned int flags;
+	struct fiops_data *fiopsd;
+	struct rb_node rb_node;
+	u64 vios; /* key in service_tree */
+	struct fiops_rb_root *service_tree;
+
+	unsigned int in_flight;
+
+	struct rb_root sort_list;
+	struct list_head fifo;
+
+	pid_t pid;
+};
+
+#define ioc_service_tree(ioc) (&((ioc)->fiopsd->service_tree))
+#define RQ_CIC(rq)		icq_to_cic((rq)->elv.icq)
+
+enum ioc_state_flags {
+	FIOPS_IOC_FLAG_on_rr = 0,	/* on round-robin busy list */
+};
+
+#define FIOPS_IOC_FNS(name)						\
+static inline void fiops_mark_ioc_##name(struct fiops_ioc *ioc)	\
+{									\
+	ioc->flags |= (1 << FIOPS_IOC_FLAG_##name);			\
+}									\
+static inline void fiops_clear_ioc_##name(struct fiops_ioc *ioc)	\
+{									\
+	ioc->flags &= ~(1 << FIOPS_IOC_FLAG_##name);			\
+}									\
+static inline int fiops_ioc_##name(const struct fiops_ioc *ioc)	\
+{									\
+	return ((ioc)->flags & (1 << FIOPS_IOC_FLAG_##name)) != 0;	\
+}
+
+FIOPS_IOC_FNS(on_rr);
+#undef FIOPS_IOC_FNS
+
+static inline struct fiops_ioc *icq_to_cic(struct io_cq *icq)
+{
+	/* cic->icq is the first member, %NULL will convert to %NULL */
+	return container_of(icq, struct fiops_ioc, icq);
+}
+
+static inline struct fiops_ioc *fiops_cic_lookup(struct fiops_data *fiopsd,
+					       struct io_context *ioc)
+{
+	if (ioc)
+		return icq_to_cic(ioc_lookup_icq(ioc, fiopsd->queue));
+	return NULL;
+}
+
+/*
+ * The below is leftmost cache rbtree addon
+ */
+static struct fiops_ioc *fiops_rb_first(struct fiops_rb_root *root)
+{
+	/* Service tree is empty */
+	if (!root->count)
+		return NULL;
+
+	if (!root->left)
+		root->left = rb_first(&root->rb);
+
+	if (root->left)
+		return rb_entry(root->left, struct fiops_ioc, rb_node);
+
+	return NULL;
+}
+
+static void rb_erase_init(struct rb_node *n, struct rb_root *root)
+{
+	rb_erase(n, root);
+	RB_CLEAR_NODE(n);
+}
+
+static void fiops_rb_erase(struct rb_node *n, struct fiops_rb_root *root)
+{
+	if (root->left == n)
+		root->left = NULL;
+	rb_erase_init(n, &root->rb);
+	--root->count;
+}
+
+static inline u64 max_vios(u64 min_vios, u64 vios)
+{
+	s64 delta = (s64)(vios - min_vios);
+	if (delta > 0)
+		min_vios = vios;
+
+	return min_vios;
+}
+
+static void fiops_update_min_vios(struct fiops_rb_root *service_tree)
+{
+	struct fiops_ioc *ioc;
+
+	ioc = fiops_rb_first(service_tree);
+	if (!ioc)
+		return;
+	service_tree->min_vios = max_vios(service_tree->min_vios, ioc->vios);
+}
+
+/*
+ * The fiopsd->service_trees holds all pending fiops_ioc's that have
+ * requests waiting to be processed. It is sorted in the order that
+ * we will service the queues.
+ */
+static void fiops_service_tree_add(struct fiops_data *fiopsd,
+	struct fiops_ioc *ioc)
+{
+	struct rb_node **p, *parent;
+	struct fiops_ioc *__ioc;
+	struct fiops_rb_root *service_tree = ioc_service_tree(ioc);
+	u64 vios;
+	int left;
+
+	/* New added IOC */
+	if (RB_EMPTY_NODE(&ioc->rb_node))
+		vios = max_vios(service_tree->min_vios, ioc->vios);
+	else {
+		vios = ioc->vios;
+		/* ioc->service_tree might not equal to service_tree */
+		fiops_rb_erase(&ioc->rb_node, ioc->service_tree);
+		ioc->service_tree = NULL;
+	}
+
+	left = 1;
+	parent = NULL;
+	ioc->service_tree = service_tree;
+	p = &service_tree->rb.rb_node;
+	while (*p) {
+		struct rb_node **n;
+
+		parent = *p;
+		__ioc = rb_entry(parent, struct fiops_ioc, rb_node);
+
+		/*
+		 * sort by key, that represents service time.
+		 */
+		if (vios <  __ioc->vios)
+			n = &(*p)->rb_left;
+		else {
+			n = &(*p)->rb_right;
+			left = 0;
+		}
+
+		p = n;
+	}
+
+	if (left)
+		service_tree->left = &ioc->rb_node;
+
+	ioc->vios = vios;
+	rb_link_node(&ioc->rb_node, parent, p);
+	rb_insert_color(&ioc->rb_node, &service_tree->rb);
+	service_tree->count++;
+
+	fiops_update_min_vios(service_tree);
+}
+
+/*
+ * Update ioc's position in the service tree.
+ */
+static void fiops_resort_rr_list(struct fiops_data *fiopsd,
+	struct fiops_ioc *ioc)
+{
+	/*
+	 * Resorting requires the ioc to be on the RR list already.
+	 */
+	if (fiops_ioc_on_rr(ioc))
+		fiops_service_tree_add(fiopsd, ioc);
+}
+
+/*
+ * add to busy list of queues for service, trying to be fair in ordering
+ * the pending list according to last request service
+ */
+static void fiops_add_ioc_rr(struct fiops_data *fiopsd, struct fiops_ioc *ioc)
+{
+	BUG_ON(fiops_ioc_on_rr(ioc));
+	fiops_mark_ioc_on_rr(ioc);
+
+	fiopsd->busy_queues++;
+
+	fiops_resort_rr_list(fiopsd, ioc);
+}
+
+/*
+ * Called when the ioc no longer has requests pending, remove it from
+ * the service tree.
+ */
+static void fiops_del_ioc_rr(struct fiops_data *fiopsd, struct fiops_ioc *ioc)
+{
+	BUG_ON(!fiops_ioc_on_rr(ioc));
+	fiops_clear_ioc_on_rr(ioc);
+
+	if (!RB_EMPTY_NODE(&ioc->rb_node)) {
+		fiops_rb_erase(&ioc->rb_node, ioc->service_tree);
+		ioc->service_tree = NULL;
+	}
+
+	BUG_ON(!fiopsd->busy_queues);
+	fiopsd->busy_queues--;
+}
+
+/*
+ * rb tree support functions
+ */
+static void fiops_del_rq_rb(struct request *rq)
+{
+	struct fiops_ioc *ioc = RQ_CIC(rq);
+
+	elv_rb_del(&ioc->sort_list, rq);
+}
+
+static void fiops_add_rq_rb(struct request *rq)
+{
+	struct fiops_ioc *ioc = RQ_CIC(rq);
+	struct fiops_data *fiopsd = ioc->fiopsd;
+
+	elv_rb_add(&ioc->sort_list, rq);
+
+	if (!fiops_ioc_on_rr(ioc))
+		fiops_add_ioc_rr(fiopsd, ioc);
+}
+
+static void fiops_reposition_rq_rb(struct fiops_ioc *ioc, struct request *rq)
+{
+	elv_rb_del(&ioc->sort_list, rq);
+	fiops_add_rq_rb(rq);
+}
+
+static void fiops_remove_request(struct request *rq)
+{
+	list_del_init(&rq->queuelist);
+	fiops_del_rq_rb(rq);
+}
+
+static u64 fiops_scaled_vios(struct fiops_data *fiopsd,
+	struct fiops_ioc *ioc, struct request *rq)
+{
+	return VIOS_SCALE;
+}
+
+/* return vios dispatched */
+static u64 fiops_dispatch_request(struct fiops_data *fiopsd,
+	struct fiops_ioc *ioc)
+{
+	struct request *rq;
+	struct request_queue *q = fiopsd->queue;
+
+	rq = rq_entry_fifo(ioc->fifo.next);
+
+	fiops_remove_request(rq);
+	elv_dispatch_add_tail(q, rq);
+
+	fiopsd->in_flight[rq_is_sync(rq)]++;
+	ioc->in_flight++;
+
+	return fiops_scaled_vios(fiopsd, ioc, rq);
+}
+
+static int fiops_forced_dispatch(struct fiops_data *fiopsd)
+{
+	struct fiops_ioc *ioc;
+	int dispatched = 0;
+
+	while ((ioc = fiops_rb_first(&fiopsd->service_tree)) != NULL) {
+		while (!list_empty(&ioc->fifo)) {
+			fiops_dispatch_request(fiopsd, ioc);
+			dispatched++;
+		}
+		if (fiops_ioc_on_rr(ioc))
+			fiops_del_ioc_rr(fiopsd, ioc);
+	}
+	return dispatched;
+}
+
+static struct fiops_ioc *fiops_select_ioc(struct fiops_data *fiopsd)
+{
+	struct fiops_ioc *ioc;
+
+	if (RB_EMPTY_ROOT(&fiopsd->service_tree.rb))
+		return NULL;
+	ioc = fiops_rb_first(&fiopsd->service_tree);
+	return ioc;
+}
+
+static void fiops_charge_vios(struct fiops_data *fiopsd,
+	struct fiops_ioc *ioc, u64 vios)
+{
+	struct fiops_rb_root *service_tree = ioc->service_tree;
+	ioc->vios += vios;
+
+	if (RB_EMPTY_ROOT(&ioc->sort_list))
+		fiops_del_ioc_rr(fiopsd, ioc);
+	else
+		fiops_resort_rr_list(fiopsd, ioc);
+
+	fiops_update_min_vios(service_tree);
+}
+
+static int fiops_dispatch_requests(struct request_queue *q, int force)
+{
+	struct fiops_data *fiopsd = q->elevator->elevator_data;
+	struct fiops_ioc *ioc;
+	u64 vios;
+
+	if (unlikely(force))
+		return fiops_forced_dispatch(fiopsd);
+
+	ioc = fiops_select_ioc(fiopsd);
+	if (!ioc)
+		return 0;
+
+	vios = fiops_dispatch_request(fiopsd, ioc);
+
+	fiops_charge_vios(fiopsd, ioc, vios);
+	return 1;
+}
+
+static void fiops_insert_request(struct request_queue *q, struct request *rq)
+{
+	struct fiops_ioc *ioc = RQ_CIC(rq);
+
+	list_add_tail(&rq->queuelist, &ioc->fifo);
+
+	fiops_add_rq_rb(rq);
+}
+
+/*
+ * scheduler run of queue, if there are requests pending and no one in the
+ * driver that will restart queueing
+ */
+static inline void fiops_schedule_dispatch(struct fiops_data *fiopsd)
+{
+	if (fiopsd->busy_queues)
+		kblockd_schedule_work(fiopsd->queue, &fiopsd->unplug_work);
+}
+
+static void fiops_completed_request(struct request_queue *q, struct request *rq)
+{
+	struct fiops_data *fiopsd = q->elevator->elevator_data;
+	struct fiops_ioc *ioc = RQ_CIC(rq);
+
+	fiopsd->in_flight[rq_is_sync(rq)]--;
+	ioc->in_flight--;
+
+	if (fiopsd->in_flight[0] + fiopsd->in_flight[1] == 0)
+		fiops_schedule_dispatch(fiopsd);
+}
+
+static struct request *
+fiops_find_rq_fmerge(struct fiops_data *fiopsd, struct bio *bio)
+{
+	struct task_struct *tsk = current;
+	struct fiops_ioc *cic;
+
+	cic = fiops_cic_lookup(fiopsd, tsk->io_context);
+
+	if (cic) {
+		sector_t sector = bio->bi_sector + bio_sectors(bio);
+
+		return elv_rb_find(&cic->sort_list, sector);
+	}
+
+	return NULL;
+}
+
+static int fiops_merge(struct request_queue *q, struct request **req,
+		     struct bio *bio)
+{
+	struct fiops_data *fiopsd = q->elevator->elevator_data;
+	struct request *__rq;
+
+	__rq = fiops_find_rq_fmerge(fiopsd, bio);
+	if (__rq && elv_rq_merge_ok(__rq, bio)) {
+		*req = __rq;
+		return ELEVATOR_FRONT_MERGE;
+	}
+
+	return ELEVATOR_NO_MERGE;
+}
+
+static void fiops_merged_request(struct request_queue *q, struct request *req,
+			       int type)
+{
+	if (type == ELEVATOR_FRONT_MERGE) {
+		struct fiops_ioc *ioc = RQ_CIC(req);
+
+		fiops_reposition_rq_rb(ioc, req);
+	}
+}
+
+static void
+fiops_merged_requests(struct request_queue *q, struct request *rq,
+		    struct request *next)
+{
+	struct fiops_ioc *ioc = RQ_CIC(rq);
+	struct fiops_data *fiopsd = q->elevator->elevator_data;
+
+	fiops_remove_request(next);
+
+	ioc = RQ_CIC(next);
+	/*
+	 * all requests of this task are merged to other tasks, delete it
+	 * from the service tree.
+	 */
+	if (fiops_ioc_on_rr(ioc) && RB_EMPTY_ROOT(&ioc->sort_list))
+		fiops_del_ioc_rr(fiopsd, ioc);
+}
+
+static int fiops_allow_merge(struct request_queue *q, struct request *rq,
+			   struct bio *bio)
+{
+	struct fiops_data *fiopsd = q->elevator->elevator_data;
+	struct fiops_ioc *cic;
+
+	/*
+	 * Lookup the ioc that this bio will be queued with. Allow
+	 * merge only if rq is queued there.
+	 */
+	cic = fiops_cic_lookup(fiopsd, current->io_context);
+
+	return cic == RQ_CIC(rq);
+}
+
+static void fiops_exit_queue(struct elevator_queue *e)
+{
+	struct fiops_data *fiopsd = e->elevator_data;
+
+	cancel_work_sync(&fiopsd->unplug_work);
+
+	kfree(fiopsd);
+}
+
+static void fiops_kick_queue(struct work_struct *work)
+{
+	struct fiops_data *fiopsd =
+		container_of(work, struct fiops_data, unplug_work);
+	struct request_queue *q = fiopsd->queue;
+
+	spin_lock_irq(q->queue_lock);
+	__blk_run_queue(q);
+	spin_unlock_irq(q->queue_lock);
+}
+
+static void *fiops_init_queue(struct request_queue *q)
+{
+	struct fiops_data *fiopsd;
+
+	fiopsd = kzalloc_node(sizeof(*fiopsd), GFP_KERNEL, q->node);
+	if (!fiopsd)
+		return NULL;
+
+	fiopsd->queue = q;
+
+	fiopsd->service_tree = FIOPS_RB_ROOT;
+
+	INIT_WORK(&fiopsd->unplug_work, fiops_kick_queue);
+
+	return fiopsd;
+}
+
+static void fiops_init_icq(struct io_cq *icq)
+{
+	struct fiops_data *fiopsd = icq->q->elevator->elevator_data;
+	struct fiops_ioc *ioc = icq_to_cic(icq);
+
+	RB_CLEAR_NODE(&ioc->rb_node);
+	INIT_LIST_HEAD(&ioc->fifo);
+	ioc->sort_list = RB_ROOT;
+
+	ioc->fiopsd = fiopsd;
+
+	ioc->pid = current->pid;
+}
+
+static struct elevator_type iosched_fiops = {
+	.ops = {
+		.elevator_merge_fn =		fiops_merge,
+		.elevator_merged_fn =		fiops_merged_request,
+		.elevator_merge_req_fn =	fiops_merged_requests,
+		.elevator_allow_merge_fn =	fiops_allow_merge,
+		.elevator_dispatch_fn =		fiops_dispatch_requests,
+		.elevator_add_req_fn =		fiops_insert_request,
+		.elevator_completed_req_fn =	fiops_completed_request,
+		.elevator_former_req_fn =	elv_rb_former_request,
+		.elevator_latter_req_fn =	elv_rb_latter_request,
+		.elevator_init_icq_fn =		fiops_init_icq,
+		.elevator_init_fn =		fiops_init_queue,
+		.elevator_exit_fn =		fiops_exit_queue,
+	},
+	.icq_size	=	sizeof(struct fiops_ioc),
+	.icq_align	=	__alignof__(struct fiops_ioc),
+	.elevator_name =	"fiops",
+	.elevator_owner =	THIS_MODULE,
+};
+
+static int __init fiops_init(void)
+{
+	return elv_register(&iosched_fiops);
+}
+
+static void __exit fiops_exit(void)
+{
+	elv_unregister(&iosched_fiops);
+}
+
+module_init(fiops_init);
+module_exit(fiops_exit);
+
+MODULE_AUTHOR("Jens Axboe, Shaohua Li <shli@kernel.org>");
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("IOPS based IO scheduler");
Index: linux/block/Kconfig.iosched
===================================================================
--- linux.orig/block/Kconfig.iosched	2012-01-30 13:48:43.000000000 +0800
+++ linux/block/Kconfig.iosched	2012-01-30 14:34:27.000000000 +0800
@@ -43,6 +43,14 @@ config CFQ_GROUP_IOSCHED
 	---help---
 	  Enable group IO scheduling in CFQ.
 
+config IOSCHED_FIOPS
+	tristate "IOPS based I/O scheduler"
+	default y
+	---help---
+	  This is an IOPS based I/O scheduler. It will try to distribute
+          IOPS equally among all processes in the system. It's mainly for
+          Flash based storage.
+
 choice
 	prompt "Default I/O scheduler"
 	default DEFAULT_CFQ
@@ -56,6 +64,9 @@ choice
 	config DEFAULT_CFQ
 		bool "CFQ" if IOSCHED_CFQ=y
 
+	config DEFAULT_FIOPS
+		bool "FIOPS" if IOSCHED_FIOPS=y
+
 	config DEFAULT_NOOP
 		bool "No-op"
 
@@ -65,6 +76,7 @@ config DEFAULT_IOSCHED
 	string
 	default "deadline" if DEFAULT_DEADLINE
 	default "cfq" if DEFAULT_CFQ
+	default "fiops" if DEFAULT_FIOPS
 	default "noop" if DEFAULT_NOOP
 
 endmenu



* [patch v2 2/8]block: fiops read/write request scale
  2012-01-30  7:02 [patch v2 0/8]block: An IOPS based ioscheduler Shaohua Li
  2012-01-30  7:02 ` [patch v2 1/8]block: fiops ioscheduler core Shaohua Li
@ 2012-01-30  7:02 ` Shaohua Li
  2012-01-30  7:02 ` [patch v2 3/8]block: fiops sync/async scale Shaohua Li
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 13+ messages in thread
From: Shaohua Li @ 2012-01-30  7:02 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, vgoyal, david, jack, zhu.yanhai, namhyung.kim, shaohua.li

[-- Attachment #1: fiops-rw-scale.patch --]
[-- Type: text/plain, Size: 3721 bytes --]

The read and write speeds of flash based storage are usually different. For
example, on my SSD the maximum throughput of reads is about 3 times that of
writes. Add a scale to differentiate reads and writes, plus tunables so the
user can assign different scales for reads and writes.

By default the scale is 1:1, which makes the scaling a noop.
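
A worked example (the numbers are illustrative, not defaults): for the SSD
above, where reads are about 3 times faster than writes, one could set
read_scale=1 and write_scale=3. A write request is then charged
VIOS_SCALE * write_scale / read_scale = 1024 * 3 / 1 = 3072 vios versus 1024
for a read, so a pure writer ends up dispatching roughly one third as many
requests as a pure reader.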

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
---
 block/fiops-iosched.c |   71 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 70 insertions(+), 1 deletion(-)

Index: linux/block/fiops-iosched.c
===================================================================
--- linux.orig/block/fiops-iosched.c	2012-01-18 14:33:32.000000000 +0800
+++ linux/block/fiops-iosched.c	2012-01-18 14:33:51.000000000 +0800
@@ -15,6 +15,9 @@
 #define VIOS_SCALE_SHIFT 10
 #define VIOS_SCALE (1 << VIOS_SCALE_SHIFT)
 
+#define VIOS_READ_SCALE (1)
+#define VIOS_WRITE_SCALE (1)
+
 struct fiops_rb_root {
 	struct rb_root rb;
 	struct rb_node *left;
@@ -33,6 +36,9 @@ struct fiops_data {
 	unsigned int in_flight[2];
 
 	struct work_struct unplug_work;
+
+	unsigned int read_scale;
+	unsigned int write_scale;
 };
 
 struct fiops_ioc {
@@ -280,7 +286,12 @@ static void fiops_remove_request(struct
 static u64 fiops_scaled_vios(struct fiops_data *fiopsd,
 	struct fiops_ioc *ioc, struct request *rq)
 {
-	return VIOS_SCALE;
+	int vios = VIOS_SCALE;
+
+	if (rq_data_dir(rq) == WRITE)
+		vios = vios * fiopsd->write_scale / fiopsd->read_scale;
+
+	return vios;
 }
 
 /* return vios dispatched */
@@ -500,6 +511,9 @@ static void *fiops_init_queue(struct req
 
 	INIT_WORK(&fiopsd->unplug_work, fiops_kick_queue);
 
+	fiopsd->read_scale = VIOS_READ_SCALE;
+	fiopsd->write_scale = VIOS_WRITE_SCALE;
+
 	return fiopsd;
 }
 
@@ -517,6 +531,60 @@ static void fiops_init_icq(struct io_cq
 	ioc->pid = current->pid;
 }
 
+/*
+ * sysfs parts below -->
+ */
+static ssize_t
+fiops_var_show(unsigned int var, char *page)
+{
+	return sprintf(page, "%d\n", var);
+}
+
+static ssize_t
+fiops_var_store(unsigned int *var, const char *page, size_t count)
+{
+	char *p = (char *) page;
+
+	*var = simple_strtoul(p, &p, 10);
+	return count;
+}
+
+#define SHOW_FUNCTION(__FUNC, __VAR)					\
+static ssize_t __FUNC(struct elevator_queue *e, char *page)		\
+{									\
+	struct fiops_data *fiopsd = e->elevator_data;			\
+	return fiops_var_show(__VAR, (page));				\
+}
+SHOW_FUNCTION(fiops_read_scale_show, fiopsd->read_scale);
+SHOW_FUNCTION(fiops_write_scale_show, fiopsd->write_scale);
+#undef SHOW_FUNCTION
+
+#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX)				\
+static ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)	\
+{									\
+	struct fiops_data *fiopsd = e->elevator_data;			\
+	unsigned int __data;						\
+	int ret = fiops_var_store(&__data, (page), count);		\
+	if (__data < (MIN))						\
+		__data = (MIN);						\
+	else if (__data > (MAX))					\
+		__data = (MAX);						\
+	*(__PTR) = __data;						\
+	return ret;							\
+}
+STORE_FUNCTION(fiops_read_scale_store, &fiopsd->read_scale, 1, 100);
+STORE_FUNCTION(fiops_write_scale_store, &fiopsd->write_scale, 1, 100);
+#undef STORE_FUNCTION
+
+#define FIOPS_ATTR(name) \
+	__ATTR(name, S_IRUGO|S_IWUSR, fiops_##name##_show, fiops_##name##_store)
+
+static struct elv_fs_entry fiops_attrs[] = {
+	FIOPS_ATTR(read_scale),
+	FIOPS_ATTR(write_scale),
+	__ATTR_NULL
+};
+
 static struct elevator_type iosched_fiops = {
 	.ops = {
 		.elevator_merge_fn =		fiops_merge,
@@ -534,6 +602,7 @@ static struct elevator_type iosched_fiop
 	},
 	.icq_size	=	sizeof(struct fiops_ioc),
 	.icq_align	=	__alignof__(struct fiops_ioc),
+	.elevator_attrs =	fiops_attrs,
 	.elevator_name =	"fiops",
 	.elevator_owner =	THIS_MODULE,
 };



* [patch v2 3/8]block: fiops sync/async scale
  2012-01-30  7:02 [patch v2 0/8]block: An IOPS based ioscheduler Shaohua Li
  2012-01-30  7:02 ` [patch v2 1/8]block: fiops ioscheduler core Shaohua Li
  2012-01-30  7:02 ` [patch v2 2/8]block: fiops read/write request scale Shaohua Li
@ 2012-01-30  7:02 ` Shaohua Li
  2012-01-30  7:02 ` [patch v2 4/8]block: fiops add ioprio support Shaohua Li
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 13+ messages in thread
From: Shaohua Li @ 2012-01-30  7:02 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, vgoyal, david, jack, zhu.yanhai, namhyung.kim, shaohua.li

[-- Attachment #1: fiops-sync-async-scale.patch --]
[-- Type: text/plain, Size: 2611 bytes --]

Give sync workloads 2.5 times more share than async ones. This matches CFQ.

Note this is different from the read/write scale. We have 3 types of
requests:
1. read
2. sync write
3. write
CFQ doesn't differentiate types 1 and 2, but the request costs of 1 and 2
are usually different for flash based storage. So we have both sync/async
and read/write scales here.
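
With the defaults below (sync_scale=2, async_scale=5), an async request is
charged vios * async_scale / sync_scale = 1024 * 5 / 2 = 2560 vios versus
1024 for a sync request, i.e. 2.5 times the cost, which yields the 2.5:1
sync/async share mentioned above.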

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
---
 block/fiops-iosched.c |   15 +++++++++++++++
 1 file changed, 15 insertions(+)

Index: linux/block/fiops-iosched.c
===================================================================
--- linux.orig/block/fiops-iosched.c	2012-01-18 14:33:51.000000000 +0800
+++ linux/block/fiops-iosched.c	2012-01-18 14:33:59.000000000 +0800
@@ -17,6 +17,8 @@
 
 #define VIOS_READ_SCALE (1)
 #define VIOS_WRITE_SCALE (1)
+#define VIOS_SYNC_SCALE (2)
+#define VIOS_ASYNC_SCALE (5)
 
 struct fiops_rb_root {
 	struct rb_root rb;
@@ -39,6 +41,8 @@ struct fiops_data {
 
 	unsigned int read_scale;
 	unsigned int write_scale;
+	unsigned int sync_scale;
+	unsigned int async_scale;
 };
 
 struct fiops_ioc {
@@ -291,6 +295,9 @@ static u64 fiops_scaled_vios(struct fiop
 	if (rq_data_dir(rq) == WRITE)
 		vios = vios * fiopsd->write_scale / fiopsd->read_scale;
 
+	if (!rq_is_sync(rq))
+		vios = vios * fiopsd->async_scale / fiopsd->sync_scale;
+
 	return vios;
 }
 
@@ -513,6 +520,8 @@ static void *fiops_init_queue(struct req
 
 	fiopsd->read_scale = VIOS_READ_SCALE;
 	fiopsd->write_scale = VIOS_WRITE_SCALE;
+	fiopsd->sync_scale = VIOS_SYNC_SCALE;
+	fiopsd->async_scale = VIOS_ASYNC_SCALE;
 
 	return fiopsd;
 }
@@ -557,6 +566,8 @@ static ssize_t __FUNC(struct elevator_qu
 }
 SHOW_FUNCTION(fiops_read_scale_show, fiopsd->read_scale);
 SHOW_FUNCTION(fiops_write_scale_show, fiopsd->write_scale);
+SHOW_FUNCTION(fiops_sync_scale_show, fiopsd->sync_scale);
+SHOW_FUNCTION(fiops_async_scale_show, fiopsd->async_scale);
 #undef SHOW_FUNCTION
 
 #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX)				\
@@ -574,6 +585,8 @@ static ssize_t __FUNC(struct elevator_qu
 }
 STORE_FUNCTION(fiops_read_scale_store, &fiopsd->read_scale, 1, 100);
 STORE_FUNCTION(fiops_write_scale_store, &fiopsd->write_scale, 1, 100);
+STORE_FUNCTION(fiops_sync_scale_store, &fiopsd->sync_scale, 1, 100);
+STORE_FUNCTION(fiops_async_scale_store, &fiopsd->async_scale, 1, 100);
 #undef STORE_FUNCTION
 
 #define FIOPS_ATTR(name) \
@@ -582,6 +595,8 @@ STORE_FUNCTION(fiops_write_scale_store,
 static struct elv_fs_entry fiops_attrs[] = {
 	FIOPS_ATTR(read_scale),
 	FIOPS_ATTR(write_scale),
+	FIOPS_ATTR(sync_scale),
+	FIOPS_ATTR(async_scale),
 	__ATTR_NULL
 };
 



* [patch v2 4/8]block: fiops add ioprio support
  2012-01-30  7:02 [patch v2 0/8]block: An IOPS based ioscheduler Shaohua Li
                   ` (2 preceding siblings ...)
  2012-01-30  7:02 ` [patch v2 3/8]block: fiops sync/async scale Shaohua Li
@ 2012-01-30  7:02 ` Shaohua Li
  2012-01-30  7:02 ` [patch v2 5/8]block: fiops preserve vios key for deep queue depth workload Shaohua Li
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 13+ messages in thread
From: Shaohua Li @ 2012-01-30  7:02 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, vgoyal, david, jack, zhu.yanhai, namhyung.kim, shaohua.li

[-- Attachment #1: fiops-priority-support.patch --]
[-- Type: text/plain, Size: 5562 bytes --]

Add CFQ-like ioprio support. A task at priority N will get about 20% more
share than one at priority N+1, which matches CFQ.
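
A worked example of the formula in the patch (vios += vios *
(ioprio - IOPRIO_NORM) / VIOS_PRIO_SCALE, with IOPRIO_NORM=4 and
VIOS_PRIO_SCALE=5), taking a plain sync read with base cost 1024: an
ioprio 5 task is charged 1024 + 1024 * (5 - 4) / 5 = 1228 vios per request,
while an ioprio 4 task is charged 1024, so the ioprio 4 task gets about 20%
more dispatches.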

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
---
 block/fiops-iosched.c |  103 ++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 92 insertions(+), 11 deletions(-)

Index: linux/block/fiops-iosched.c
===================================================================
--- linux.orig/block/fiops-iosched.c	2012-01-18 14:42:18.000000000 +0800
+++ linux/block/fiops-iosched.c	2012-01-18 14:43:51.000000000 +0800
@@ -20,6 +20,8 @@
 #define VIOS_SYNC_SCALE (2)
 #define VIOS_ASYNC_SCALE (5)
 
+#define VIOS_PRIO_SCALE (5)
+
 struct fiops_rb_root {
 	struct rb_root rb;
 	struct rb_node *left;
@@ -29,10 +31,17 @@ struct fiops_rb_root {
 };
 #define FIOPS_RB_ROOT	(struct fiops_rb_root) { .rb = RB_ROOT}
 
+enum wl_prio_t {
+	IDLE_WORKLOAD = 0,
+	BE_WORKLOAD = 1,
+	RT_WORKLOAD = 2,
+	FIOPS_PRIO_NR,
+};
+
 struct fiops_data {
 	struct request_queue *queue;
 
-	struct fiops_rb_root service_tree;
+	struct fiops_rb_root service_tree[FIOPS_PRIO_NR];
 
 	unsigned int busy_queues;
 	unsigned int in_flight[2];
@@ -60,13 +69,16 @@ struct fiops_ioc {
 	struct list_head fifo;
 
 	pid_t pid;
+	unsigned short ioprio;
+	enum wl_prio_t wl_type;
 };
 
-#define ioc_service_tree(ioc) (&((ioc)->fiopsd->service_tree))
+#define ioc_service_tree(ioc) (&((ioc)->fiopsd->service_tree[(ioc)->wl_type]))
 #define RQ_CIC(rq)		icq_to_cic((rq)->elv.icq)
 
 enum ioc_state_flags {
 	FIOPS_IOC_FLAG_on_rr = 0,	/* on round-robin busy list */
+	FIOPS_IOC_FLAG_prio_changed,	/* task priority has changed */
 };
 
 #define FIOPS_IOC_FNS(name)						\
@@ -84,8 +96,18 @@ static inline int fiops_ioc_##name(const
 }
 
 FIOPS_IOC_FNS(on_rr);
+FIOPS_IOC_FNS(prio_changed);
 #undef FIOPS_IOC_FNS
 
+enum wl_prio_t fiops_wl_type(short prio_class)
+{
+	if (prio_class == IOPRIO_CLASS_RT)
+		return RT_WORKLOAD;
+	if (prio_class == IOPRIO_CLASS_BE)
+		return BE_WORKLOAD;
+	return IDLE_WORKLOAD;
+}
+
 static inline struct fiops_ioc *icq_to_cic(struct io_cq *icq)
 {
 	/* cic->icq is the first member, %NULL will convert to %NULL */
@@ -298,6 +320,8 @@ static u64 fiops_scaled_vios(struct fiop
 	if (!rq_is_sync(rq))
 		vios = vios * fiopsd->async_scale / fiopsd->sync_scale;
 
+	vios +=  vios * (ioc->ioprio - IOPRIO_NORM) / VIOS_PRIO_SCALE;
+
 	return vios;
 }
 
@@ -323,14 +347,19 @@ static int fiops_forced_dispatch(struct
 {
 	struct fiops_ioc *ioc;
 	int dispatched = 0;
+	int i;
 
-	while ((ioc = fiops_rb_first(&fiopsd->service_tree)) != NULL) {
-		while (!list_empty(&ioc->fifo)) {
-			fiops_dispatch_request(fiopsd, ioc);
-			dispatched++;
+	for (i = RT_WORKLOAD; i >= IDLE_WORKLOAD; i--) {
+		while (!RB_EMPTY_ROOT(&fiopsd->service_tree[i].rb)) {
+			ioc = fiops_rb_first(&fiopsd->service_tree[i]);
+
+			while (!list_empty(&ioc->fifo)) {
+				fiops_dispatch_request(fiopsd, ioc);
+				dispatched++;
+			}
+			if (fiops_ioc_on_rr(ioc))
+				fiops_del_ioc_rr(fiopsd, ioc);
 		}
-		if (fiops_ioc_on_rr(ioc))
-			fiops_del_ioc_rr(fiopsd, ioc);
 	}
 	return dispatched;
 }
@@ -338,10 +367,20 @@ static int fiops_forced_dispatch(struct
 static struct fiops_ioc *fiops_select_ioc(struct fiops_data *fiopsd)
 {
 	struct fiops_ioc *ioc;
+	struct fiops_rb_root *service_tree = NULL;
+	int i;
+
+	for (i = RT_WORKLOAD; i >= IDLE_WORKLOAD; i--) {
+		if (!RB_EMPTY_ROOT(&fiopsd->service_tree[i].rb)) {
+			service_tree = &fiopsd->service_tree[i];
+			break;
+		}
+	}
 
-	if (RB_EMPTY_ROOT(&fiopsd->service_tree.rb))
+	if (!service_tree)
 		return NULL;
-	ioc = fiops_rb_first(&fiopsd->service_tree);
+
+	ioc = fiops_rb_first(service_tree);
 	return ioc;
 }
 
@@ -378,10 +417,49 @@ static int fiops_dispatch_requests(struc
 	return 1;
 }
 
+static void fiops_init_prio_data(struct fiops_ioc *cic)
+{
+	struct task_struct *tsk = current;
+	struct io_context *ioc = cic->icq.ioc;
+	int ioprio_class;
+
+	if (!fiops_ioc_prio_changed(cic))
+		return;
+
+	ioprio_class = IOPRIO_PRIO_CLASS(ioc->ioprio);
+	switch (ioprio_class) {
+	default:
+		printk(KERN_ERR "fiops: bad prio %x\n", ioprio_class);
+	case IOPRIO_CLASS_NONE:
+		/*
+		 * no prio set, inherit CPU scheduling settings
+		 */
+		cic->ioprio = task_nice_ioprio(tsk);
+		cic->wl_type = fiops_wl_type(task_nice_ioclass(tsk));
+		break;
+	case IOPRIO_CLASS_RT:
+		cic->ioprio = task_ioprio(ioc);
+		cic->wl_type = fiops_wl_type(IOPRIO_CLASS_RT);
+		break;
+	case IOPRIO_CLASS_BE:
+		cic->ioprio = task_ioprio(ioc);
+		cic->wl_type = fiops_wl_type(IOPRIO_CLASS_BE);
+		break;
+	case IOPRIO_CLASS_IDLE:
+		cic->wl_type = fiops_wl_type(IOPRIO_CLASS_IDLE);
+		cic->ioprio = 7;
+		break;
+	}
+
+	fiops_clear_ioc_prio_changed(cic);
+}
+
 static void fiops_insert_request(struct request_queue *q, struct request *rq)
 {
 	struct fiops_ioc *ioc = RQ_CIC(rq);
 
+	fiops_init_prio_data(ioc);
+
 	list_add_tail(&rq->queuelist, &ioc->fifo);
 
 	fiops_add_rq_rb(rq);
@@ -507,6 +585,7 @@ static void fiops_kick_queue(struct work
 static void *fiops_init_queue(struct request_queue *q)
 {
 	struct fiops_data *fiopsd;
+	int i;
 
 	fiopsd = kzalloc_node(sizeof(*fiopsd), GFP_KERNEL, q->node);
 	if (!fiopsd)
@@ -514,7 +593,8 @@ static void *fiops_init_queue(struct req
 
 	fiopsd->queue = q;
 
-	fiopsd->service_tree = FIOPS_RB_ROOT;
+	for (i = IDLE_WORKLOAD; i <= RT_WORKLOAD; i++)
+		fiopsd->service_tree[i] = FIOPS_RB_ROOT;
 
 	INIT_WORK(&fiopsd->unplug_work, fiops_kick_queue);
 
@@ -538,6 +618,7 @@ static void fiops_init_icq(struct io_cq
 	ioc->fiopsd = fiopsd;
 
 	ioc->pid = current->pid;
+	fiops_mark_ioc_prio_changed(ioc);
 }
 
 /*



* [patch v2 5/8]block: fiops preserve vios key for deep queue depth workload
  2012-01-30  7:02 [patch v2 0/8]block: An IOPS based ioscheduler Shaohua Li
                   ` (3 preceding siblings ...)
  2012-01-30  7:02 ` [patch v2 4/8]block: fiops add ioprio support Shaohua Li
@ 2012-01-30  7:02 ` Shaohua Li
  2012-01-30  7:02 ` [patch v2 6/8]block: fiops bias sync workload Shaohua Li
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 13+ messages in thread
From: Shaohua Li @ 2012-01-30  7:02 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, vgoyal, david, jack, zhu.yanhai, namhyung.kim, shaohua.li

[-- Attachment #1: fiops-vios-key-preserve.patch --]
[-- Type: text/plain, Size: 1134 bytes --]

If a task has requests in flight, we preserve its vios key even when it is
newly added to the service tree, so it won't lose its share. This works for
tasks driving a big queue depth. For single-depth tasks there is no way to
preserve the vios key.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
---
 block/fiops-iosched.c |    9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

Index: linux/block/fiops-iosched.c
===================================================================
--- linux.orig/block/fiops-iosched.c	2012-01-18 14:43:51.000000000 +0800
+++ linux/block/fiops-iosched.c	2012-01-18 14:44:25.000000000 +0800
@@ -188,9 +188,12 @@ static void fiops_service_tree_add(struc
 	int left;
 
 	/* New added IOC */
-	if (RB_EMPTY_NODE(&ioc->rb_node))
-		vios = max_vios(service_tree->min_vios, ioc->vios);
-	else {
+	if (RB_EMPTY_NODE(&ioc->rb_node)) {
+		if (ioc->in_flight > 0)
+			vios = ioc->vios;
+		else
+			vios = max_vios(service_tree->min_vios, ioc->vios);
+	} else {
 		vios = ioc->vios;
 		/* ioc->service_tree might not equal to service_tree */
 		fiops_rb_erase(&ioc->rb_node, ioc->service_tree);



* [patch v2 6/8]block: fiops bias sync workload
  2012-01-30  7:02 [patch v2 0/8]block: An IOPS based ioscheduler Shaohua Li
                   ` (4 preceding siblings ...)
  2012-01-30  7:02 ` [patch v2 5/8]block: fiops preserve vios key for deep queue depth workload Shaohua Li
@ 2012-01-30  7:02 ` Shaohua Li
  2012-01-30  7:02 ` [patch v2 7/8]block: fiops add some trace information Shaohua Li
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 13+ messages in thread
From: Shaohua Li @ 2012-01-30  7:02 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, vgoyal, david, jack, zhu.yanhai, namhyung.kim, shaohua.li

[-- Attachment #1: fiops-limit-async-workload.patch --]
[-- Type: text/plain, Size: 1321 bytes --]

If sync requests are in flight, delay the async workload. Otherwise the
async workload (usually very deep iodepth) will use up the whole queue depth
and later sync requests will be delayed for a long time. The idea is from
CFQ.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
---
 block/fiops-iosched.c |   12 ++++++++++++
 1 file changed, 12 insertions(+)

Index: linux/block/fiops-iosched.c
===================================================================
--- linux.orig/block/fiops-iosched.c	2012-01-19 11:21:01.000000000 +0800
+++ linux/block/fiops-iosched.c	2012-01-21 10:30:46.000000000 +0800
@@ -372,6 +372,7 @@ static struct fiops_ioc *fiops_select_io
 	struct fiops_ioc *ioc;
 	struct fiops_rb_root *service_tree = NULL;
 	int i;
+	struct request *rq;
 
 	for (i = RT_WORKLOAD; i >= IDLE_WORKLOAD; i--) {
 		if (!RB_EMPTY_ROOT(&fiopsd->service_tree[i].rb)) {
@@ -384,6 +385,17 @@ static struct fiops_ioc *fiops_select_io
 		return NULL;
 
 	ioc = fiops_rb_first(service_tree);
+
+	rq = rq_entry_fifo(ioc->fifo.next);
+	/*
+	 * we are the only async task and sync requests are in flight, delay a
+	 * moment. If there are other tasks coming, sync tasks have no chance
+	 * to be starved, don't delay
+	 */
+	if (!rq_is_sync(rq) && fiopsd->in_flight[1] != 0 &&
+			service_tree->count == 1)
+		return NULL;
+
 	return ioc;
 }
 



* [patch v2 7/8]block: fiops add some trace information
  2012-01-30  7:02 [patch v2 0/8]block: An IOPS based ioscheduler Shaohua Li
                   ` (5 preceding siblings ...)
  2012-01-30  7:02 ` [patch v2 6/8]block: fiops bias sync workload Shaohua Li
@ 2012-01-30  7:02 ` Shaohua Li
  2012-01-30  7:02 ` [patch v2 8/8]block: fiops sync preempts async Shaohua Li
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 13+ messages in thread
From: Shaohua Li @ 2012-01-30  7:02 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, vgoyal, david, jack, zhu.yanhai, namhyung.kim, shaohua.li

[-- Attachment #1: fiops-tracing.patch --]
[-- Type: text/plain, Size: 2300 bytes --]

Add some trace information, which is helpful for debugging.
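
The messages are emitted with blk_add_trace_msg(), so they should show up as
message events in the normal blktrace/blkparse output for the queue,
alongside the regular block trace events.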

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
---
 block/fiops-iosched.c |   19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

Index: linux/block/fiops-iosched.c
===================================================================
--- linux.orig/block/fiops-iosched.c	2012-01-21 10:30:46.000000000 +0800
+++ linux/block/fiops-iosched.c	2012-01-21 10:35:04.000000000 +0800
@@ -10,6 +10,7 @@
 #include <linux/jiffies.h>
 #include <linux/rbtree.h>
 #include <linux/ioprio.h>
+#include <linux/blktrace_api.h>
 #include "blk.h"
 
 #define VIOS_SCALE_SHIFT 10
@@ -99,6 +100,11 @@ FIOPS_IOC_FNS(on_rr);
 FIOPS_IOC_FNS(prio_changed);
 #undef FIOPS_IOC_FNS
 
+#define fiops_log_ioc(fiopsd, ioc, fmt, args...)	\
+	blk_add_trace_msg((fiopsd)->queue, "ioc%d " fmt, (ioc)->pid, ##args)
+#define fiops_log(fiopsd, fmt, args...)	\
+	blk_add_trace_msg((fiopsd)->queue, "fiops " fmt, ##args)
+
 enum wl_prio_t fiops_wl_type(short prio_class)
 {
 	if (prio_class == IOPRIO_CLASS_RT)
@@ -200,6 +206,8 @@ static void fiops_service_tree_add(struc
 		ioc->service_tree = NULL;
 	}
 
+	fiops_log_ioc(fiopsd, ioc, "service tree add, vios %lld", vios);
+
 	left = 1;
 	parent = NULL;
 	ioc->service_tree = service_tree;
@@ -393,8 +401,12 @@ static struct fiops_ioc *fiops_select_io
 	 * to be starved, don't delay
 	 */
 	if (!rq_is_sync(rq) && fiopsd->in_flight[1] != 0 &&
-			service_tree->count == 1)
+			service_tree->count == 1) {
+		fiops_log_ioc(fiopsd, ioc,
+				"postpone async, in_flight async %d sync %d",
+				fiopsd->in_flight[0], fiopsd->in_flight[1]);
 		return NULL;
+	}
 
 	return ioc;
 }
@@ -405,6 +417,8 @@ static void fiops_charge_vios(struct fio
 	struct fiops_rb_root *service_tree = ioc->service_tree;
 	ioc->vios += vios;
 
+	fiops_log_ioc(fiopsd, ioc, "charge vios %lld, new vios %lld", vios, ioc->vios);
+
 	if (RB_EMPTY_ROOT(&ioc->sort_list))
 		fiops_del_ioc_rr(fiopsd, ioc);
 	else
@@ -498,6 +512,9 @@ static void fiops_completed_request(stru
 	fiopsd->in_flight[rq_is_sync(rq)]--;
 	ioc->in_flight--;
 
+	fiops_log_ioc(fiopsd, ioc, "in_flight %d, busy queues %d",
+		ioc->in_flight, fiopsd->busy_queues);
+
 	if (fiopsd->in_flight[0] + fiopsd->in_flight[1] == 0)
 		fiops_schedule_dispatch(fiopsd);
 }



* [patch v2 8/8]block: fiops sync preempts async
  2012-01-30  7:02 [patch v2 0/8]block: An IOPS based ioscheduler Shaohua Li
                   ` (6 preceding siblings ...)
  2012-01-30  7:02 ` [patch v2 7/8]block: fiops add some trace information Shaohua Li
@ 2012-01-30  7:02 ` Shaohua Li
  2012-01-30 11:53   ` Hillf Danton
  2012-01-30  7:09 ` [patch v2 0/8]block: An IOPS based ioscheduler Shaohua Li
  2012-01-30 15:30 ` Vivek Goyal
  9 siblings, 1 reply; 13+ messages in thread
From: Shaohua Li @ 2012-01-30  7:02 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, vgoyal, david, jack, zhu.yanhai, namhyung.kim, shaohua.li

[-- Attachment #1: fiops-preempt-async.patch --]
[-- Type: text/plain, Size: 963 bytes --]

Debug patch only.

This implements CFQ-like behavior where sync preempts async. But like CFQ,
this will starve async.

---
 block/fiops-iosched.c |   15 +++++++++++++++
 1 file changed, 15 insertions(+)

Index: linux/block/fiops-iosched.c
===================================================================
--- linux.orig/block/fiops-iosched.c	2012-01-21 10:35:04.000000000 +0800
+++ linux/block/fiops-iosched.c	2012-01-21 10:48:52.000000000 +0800
@@ -408,6 +408,21 @@ static struct fiops_ioc *fiops_select_io
 		return NULL;
 	}
 
+	/* Let sync request preempt async queue */
+	if (!rq_is_sync(rq) && service_tree->count > 1) {
+		struct rb_node *tmp = rb_next(&ioc->rb_node);
+		struct fiops_ioc *sync_ioc = NULL;
+		while (tmp) {
+			sync_ioc = rb_entry(tmp, struct fiops_ioc, rb_node);
+			rq = rq_entry_fifo(sync_ioc->fifo.next);
+			if (rq_is_sync(rq))
+				break;
+			tmp = rb_next(&sync_ioc->rb_node);
+		}
+		if (sync_ioc)
+			ioc = sync_ioc;
+	}
+
 	return ioc;
 }
 



* Re: [patch v2 0/8]block: An IOPS based ioscheduler
  2012-01-30  7:02 [patch v2 0/8]block: An IOPS based ioscheduler Shaohua Li
                   ` (7 preceding siblings ...)
  2012-01-30  7:02 ` [patch v2 8/8]block: fiops sync preempts async Shaohua Li
@ 2012-01-30  7:09 ` Shaohua Li
  2012-01-30 15:30 ` Vivek Goyal
  9 siblings, 0 replies; 13+ messages in thread
From: Shaohua Li @ 2012-01-30  7:09 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, vgoyal, david, jack, zhu.yanhai, namhyung.kim

[-- Attachment #1: Type: text/plain, Size: 2677 bytes --]

On Mon, 2012-01-30 at 15:02 +0800, Shaohua Li wrote:
> An IOPS based I/O scheduler
> 
> Flash based storage has some different characteristics against rotate disk.
> 1. no I/O seek.
> 2. read and write I/O cost usually is much different.
> 3. Time which a request takes depends on request size.
> 4. High throughput and IOPS, low latency.
> 
> CFQ iosched does well for rotate disk, for example fair dispatching, idle
> for sequential read. It also has optimization for flash based storage (for
> item 1 above), but overall it's not designed for flash based storage. It's
> a slice based algorithm. Since flash based storage request cost is very
> low, and drive has big queue_depth is quite popular now which makes
> dispatching cost even lower, CFQ's slice accounting (jiffy based)
> doesn't work well. CFQ doesn't consider above item 2 & 3.
> 
> FIOPS (Fair IOPS) ioscheduler is trying to fix the gaps. It's IOPS based, so
> only targets for drive without I/O seek. It's quite similar like CFQ, but
> the dispatch decision is made according to IOPS instead of slice.
> 
> To illustrate the design goals, let's compare Noop and CFQ:
> Noop: best throughput; No fairness and high latency for sync.
> CFQ: lower throughput in some cases; fairness and low latency for sync.
> CFQ throughput is slow sometimes because it doesn't drive deep queue depth.
> FIOPS adopts some merits of CFQ, for example, fairness and bias sync workload.
> And it will be faster than CFQ in general.
> 
> Note, if workload iodepth is low, there is no way to maintain fairness without
> performance sacrifice. Neither with CFQ. In such case, FIOPS will choose to not
> lose performance because flash based storage is usually very fast and expensive,
> performance is more important.
> 
> The algorithm is simple. Drive has a service tree, and each task lives in
> the tree. The key into the tree is called vios (virtual I/O). Every request
> has vios, which is calculated according to its ioprio, request size and so
> on. Task's vios is the sum of vios of all requests it dispatches. FIOPS
> always selects task with minimum vios in the service tree and let the task
> dispatch request. The dispatched request's vios is then added to the task's
> vios and the task is repositioned in the sevice tree.
> 
> Benchmarks results:
> SSD I'm using: max throughput read: 250m/s; write: 80m/s. max IOPS for 4k
> request read 40k/s; write 20k/s
> Latency and fairness tests are done in a desktop with one SSD and kernel
> parameter mem=1G. I'll compare noop, cfq and fiops in such workload. The test
> script and result is attached. 
Attached are the fio scripts I used for the latency and fairness tests, and
the resulting data.

[-- Attachment #2: fiops-data.tgz --]
[-- Type: application/x-compressed-tar, Size: 14188 bytes --]


* Re: [patch v2 8/8]block: fiops sync preempts async
  2012-01-30  7:02 ` [patch v2 8/8]block: fiops sync preempts async Shaohua Li
@ 2012-01-30 11:53   ` Hillf Danton
  0 siblings, 0 replies; 13+ messages in thread
From: Hillf Danton @ 2012-01-30 11:53 UTC (permalink / raw)
  To: Shaohua Li
  Cc: axboe, linux-kernel, vgoyal, david, jack, zhu.yanhai, namhyung.kim

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=UTF-8, Size: 1684 bytes --]

Hello Shaohua

On Mon, Jan 30, 2012 at 3:02 PM, Shaohua Li <shaohua.li@intel.com> wrote:
> Debug patch only.
>
> This implements CFQ like sync preempts async. But like CFQ, this
> will starve async.
>
> ---
>  block/fiops-iosched.c |   15 +++++++++++++++
>  1 file changed, 15 insertions(+)
>
> Index: linux/block/fiops-iosched.c
> ===================================================================
> --- linux.orig/block/fiops-iosched.c    2012-01-21 10:35:04.000000000 +0800
> +++ linux/block/fiops-iosched.c 2012-01-21 10:48:52.000000000 +0800
> @@ -408,6 +408,21 @@ static struct fiops_ioc *fiops_select_io
>                return NULL;
>        }
>
> +       /* Let sync request preempt async queue */
> +       if (!rq_is_sync(rq) && service_tree->count > 1) {
> +               struct rb_node *tmp = rb_next(&ioc->rb_node);
> +               struct fiops_ioc *sync_ioc = NULL;
> +               while (tmp) {
> +                       sync_ioc = rb_entry(tmp, struct fiops_ioc, rb_node);
> +                       rq = rq_entry_fifo(sync_ioc->fifo.next);
> +                       if (rq_is_sync(rq))
> +                               break;
> +                       tmp = rb_next(&sync_ioc->rb_node);

			sync_ioc = NULL;        yes?
Hillf
> +               }
> +               if (sync_ioc)
> +                       ioc = sync_ioc;
> +       }
> +
>        return ioc;
>  }
>


* Re: [patch v2 0/8]block: An IOPS based ioscheduler
  2012-01-30  7:02 [patch v2 0/8]block: An IOPS based ioscheduler Shaohua Li
                   ` (8 preceding siblings ...)
  2012-01-30  7:09 ` [patch v2 0/8]block: An IOPS based ioscheduler Shaohua Li
@ 2012-01-30 15:30 ` Vivek Goyal
  2012-01-31  0:50   ` Shaohua Li
  9 siblings, 1 reply; 13+ messages in thread
From: Vivek Goyal @ 2012-01-30 15:30 UTC (permalink / raw)
  To: Shaohua Li; +Cc: axboe, linux-kernel, david, jack, zhu.yanhai, namhyung.kim

On Mon, Jan 30, 2012 at 03:02:13PM +0800, Shaohua Li wrote:
> An IOPS based I/O scheduler
> 
> Flash based storage has some different characteristics against rotate disk.
> 1. no I/O seek.
> 2. read and write I/O cost usually is much different.
> 3. Time which a request takes depends on request size.
> 4. High throughput and IOPS, low latency.

Hi Shaohua,

Last time we agreed that you would try to extend the CFQ iops mode to take
care of this case. I was wondering if that idea is out of the window?

Also, what's the real workload where this is going to benefit us? I had
struggled to run something that constantly drove deep queue depths to get
fairness without idling.

Thanks
Vivek


* Re: [patch v2 0/8]block: An IOPS based ioscheduler
  2012-01-30 15:30 ` Vivek Goyal
@ 2012-01-31  0:50   ` Shaohua Li
  0 siblings, 0 replies; 13+ messages in thread
From: Shaohua Li @ 2012-01-31  0:50 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: axboe, linux-kernel, david, jack, zhu.yanhai, namhyung.kim

On Mon, 2012-01-30 at 10:30 -0500, Vivek Goyal wrote:
> On Mon, Jan 30, 2012 at 03:02:13PM +0800, Shaohua Li wrote:
> > An IOPS based I/O scheduler
> > 
> > Flash based storage has some different characteristics against rotate disk.
> > 1. no I/O seek.
> > 2. read and write I/O cost usually is much different.
> > 3. Time which a request takes depends on request size.
> > 4. High throughput and IOPS, low latency.
> 
> Hi Shaohua,
> 
> Last time we agreed that you will try to extend CFQ iops mode to take care
> of this case. I was wondering that if that idea is out of the window?
I felt there would be complexity in doing the merge and it would mess up the
code even with wrapping some functions as you suggested. Another thing is
that I want to avoid the complexity of CFQ, for example allocating queues
and sharing them between tasks. cfq_set_request is one place where the
queue_lock gets contended; keeping the code simple avoids such contention.

> Also what's the real workload where this is going to benefit us. I had
> struggled to run something which drove constantly deep queue depths to
> get the fairness without idling.
To be honest, I don't have a real workload. Our test environment runs some
micro benchmarks, which have their own problems. I thought we all agreed CFQ
has some limitations. Hopefully somebody with a real workload can take a
look.

Thanks,
Shaohua



end of thread, other threads:[~2012-01-31  0:51 UTC | newest]

Thread overview: 13+ messages
2012-01-30  7:02 [patch v2 0/8]block: An IOPS based ioscheduler Shaohua Li
2012-01-30  7:02 ` [patch v2 1/8]block: fiops ioscheduler core Shaohua Li
2012-01-30  7:02 ` [patch v2 2/8]block: fiops read/write request scale Shaohua Li
2012-01-30  7:02 ` [patch v2 3/8]block: fiops sync/async scale Shaohua Li
2012-01-30  7:02 ` [patch v2 4/8]block: fiops add ioprio support Shaohua Li
2012-01-30  7:02 ` [patch v2 5/8]block: fiops preserve vios key for deep queue depth workload Shaohua Li
2012-01-30  7:02 ` [patch v2 6/8]block: fiops bias sync workload Shaohua Li
2012-01-30  7:02 ` [patch v2 7/8]block: fiops add some trace information Shaohua Li
2012-01-30  7:02 ` [patch v2 8/8]block: fiops sync preempts async Shaohua Li
2012-01-30 11:53   ` Hillf Danton
2012-01-30  7:09 ` [patch v2 0/8]block: An IOPS based ioscheduler Shaohua Li
2012-01-30 15:30 ` Vivek Goyal
2012-01-31  0:50   ` Shaohua Li
