* [RFC 0/3] block: An IOPS based ioscheduler
From: Shaohua Li @ 2012-01-04 6:53 UTC
To: linux-kernel; +Cc: axboe, vgoyal, jmoyer

An IOPS based I/O scheduler

Flash based storage has several characteristics that differ from rotating disks:
1. No I/O seek.
2. Read and write I/O costs are usually quite different.
3. The time a request takes depends on the request size.
4. High throughput and IOPS, low latency.

CFQ does well for rotating disks, for example fair dispatching and idling for sequential reads. It also has optimizations for flash based storage (item 1 above), but overall it isn't designed for flash based storage. It's a slice based algorithm: since the per-request cost of flash based storage is very low, and drives with a big queue depth are quite common now (which makes dispatching cost even lower), CFQ's jiffy based slice accounting doesn't work well. CFQ also doesn't consider items 2 and 3 above.

FIOPS (Fair IOPS) is an ioscheduler that tries to fill these gaps. It's IOPS based, so it only targets drives without I/O seek. It's quite similar to CFQ, but the dispatch decision is made according to IOPS instead of time slices.

The algorithm is simple. The drive has a service tree, and each task lives in the tree. The key into the tree is called vios (virtual I/O). Every request has a vios, which is calculated from its ioprio, request size and so on. A task's vios is the sum of the vios of all requests it dispatches. FIOPS always selects the task with the minimum vios in the service tree and lets that task dispatch a request. The dispatched request's vios is then added to the task's vios and the task is repositioned in the service tree.

The series is organized as:
Patch 1: separate CFQ's io context management code. FIOPS will use it too.
Patch 2: the core FIOPS scheduler.
Patch 3: request read/write vios scaling.
Patch 3 demonstrates how the vios scaling works. To keep the code simple for review, some of the scaling code isn't included here, and some is not implemented yet.

TODO:
1. ioprio support (patch already exists)
2. request size vios scaling
3. cgroup support
4. tracing support
5. automatically select the default iosched according to QUEUE_FLAG_NONROT

Comments and suggestions are welcome!

Thanks,
Shaohua
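The min-vios dispatch loop described above can be sketched in userspace C. This is a hypothetical simplification, not the patch's code: a linear scan stands in for the kernel's rbtree service tree, and the VIOS_READ/VIOS_WRITE constants are illustrative costs (the real patch derives vios from ioprio and request size):

```c
#include <stddef.h>

/* Illustrative, assumed vios costs: writes are taken to be 4x more
 * expensive than reads on flash. The real scaling is in patch 3. */
#define VIOS_READ  1UL
#define VIOS_WRITE 4UL

struct fiops_task {
	const char *name;
	unsigned long vios;       /* virtual I/O charged so far */
	unsigned long dispatched; /* requests dispatched so far */
};

/* Select the task with the minimum vios. The kernel patch keeps tasks
 * sorted in an rbtree service tree; a linear scan is enough here. */
static struct fiops_task *fiops_select(struct fiops_task *tasks, size_t n)
{
	struct fiops_task *min = &tasks[0];
	size_t i;

	for (i = 1; i < n; i++)
		if (tasks[i].vios < min->vios)
			min = &tasks[i];
	return min;
}

/* Charge one dispatched request: add the request's vios to the task's
 * accumulated vios, which repositions it relative to other tasks. */
static void fiops_charge(struct fiops_task *t, int is_write)
{
	t->vios += is_write ? VIOS_WRITE : VIOS_READ;
	t->dispatched++;
}
```

With one reader and one writer competing, repeatedly dispatching from the min-vios task lets the reader dispatch roughly four requests for every write, matching the assumed 4:1 cost ratio — the fairness is in vios consumed, not in request counts.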
* [RFC 1/3] block: separate CFQ io context management code
From: Shaohua Li @ 2012-01-04 6:53 UTC
To: linux-kernel; +Cc: axboe, vgoyal, jmoyer, Shaohua Li

[-- Attachment #1: separate-cfq-ioc.patch --]
[-- Type: text/plain, Size: 48325 bytes --]

CFQ's io context management creates a per-device io context for each task. It's quite generic. Separate it from CFQ so that the FIOPS I/O scheduler can use it too.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
---
 block/blk-ioc.c           | 474 ++++++++++++++++++++++++++++++++++-
 block/blk.h               |  55 ++++
 block/cfq-iosched.c       | 614 ++++++++++------------------------------------
 include/linux/iocontext.h |  30 +-
 4 files changed, 683 insertions(+), 490 deletions(-)

Index: linux/block/blk-ioc.c
===================================================================
--- linux.orig/block/blk-ioc.c	2011-12-27 16:13:02.000000000 +0800
+++ linux/block/blk-ioc.c	2011-12-28 09:42:18.000000000 +0800
@@ -8,6 +8,7 @@ #include <linux/blkdev.h> #include <linux/bootmem.h> /* for max_pfn/max_low_pfn */ #include <linux/slab.h> +#include <linux/idr.h> #include "blk.h" @@ -16,12 +17,12 @@ */ static struct kmem_cache *iocontext_cachep; -static void cfq_dtor(struct io_context *ioc) +static void queue_data_dtor(struct io_context *ioc) { if (!hlist_empty(&ioc->cic_list)) { - struct cfq_io_context *cic; + struct dev_io_context *cic; - cic = hlist_entry(ioc->cic_list.first, struct cfq_io_context, + cic = hlist_entry(ioc->cic_list.first, struct dev_io_context, cic_list); cic->dtor(ioc); } @@ -40,7 +41,7 @@ int put_io_context(struct io_context *io if (atomic_long_dec_and_test(&ioc->refcount)) { rcu_read_lock(); - cfq_dtor(ioc); + queue_data_dtor(ioc); rcu_read_unlock();
kmem_cache_free(iocontext_cachep, ioc); @@ -50,14 +51,14 @@ int put_io_context(struct io_context *io } EXPORT_SYMBOL(put_io_context); -static void cfq_exit(struct io_context *ioc) +static void queue_data_exit(struct io_context *ioc) { rcu_read_lock(); if (!hlist_empty(&ioc->cic_list)) { - struct cfq_io_context *cic; + struct dev_io_context *cic; - cic = hlist_entry(ioc->cic_list.first, struct cfq_io_context, + cic = hlist_entry(ioc->cic_list.first, struct dev_io_context, cic_list); cic->exit(ioc); } @@ -75,7 +76,7 @@ void exit_io_context(struct task_struct task_unlock(task); if (atomic_dec_and_test(&ioc->nr_tasks)) - cfq_exit(ioc); + queue_data_exit(ioc); put_io_context(ioc); } @@ -162,3 +163,460 @@ static int __init blk_ioc_init(void) return 0; } subsys_initcall(blk_ioc_init); + +#if IS_ENABLED(CONFIG_IOSCHED_CFQ) +#define CIC_DEAD_INDEX_SHIFT 1 + +static inline void *queue_data_dead_key(struct queue_data *qdata) +{ + return (void *)(qdata->cic_index << CIC_DEAD_INDEX_SHIFT | CIC_DEAD_KEY); +} + +int ioc_builder_init(struct ioc_builder *builder) +{ + if (!builder->alloc_ioc || !builder->free_ioc) + return -ENOMEM; + + builder->ioc_count = alloc_percpu(unsigned long); + if (!builder->ioc_count) + return -ENOMEM; + + builder->ioc_gone = NULL; + spin_lock_init(&builder->ioc_gone_lock); + + return 0; +} +EXPORT_SYMBOL(ioc_builder_init); + +void io_context_builder_exit(struct ioc_builder *builder) +{ + DECLARE_COMPLETION_ONSTACK(all_gone); + + builder->ioc_gone = &all_gone; + /* ioc_gone's update must be visible before reading ioc_count */ + smp_wmb(); + + /* + * this also protects us from entering cfq_slab_kill() with + * pending RCU callbacks + */ + if (elv_ioc_count_read(*builder->ioc_count)) + wait_for_completion(&all_gone); + + free_percpu(builder->ioc_count); +} +EXPORT_SYMBOL(io_context_builder_exit); + +static DEFINE_SPINLOCK(cic_index_lock); +static DEFINE_IDA(cic_index_ida); +static int builder_alloc_cic_index(struct ioc_builder *builder) +{ + int index, 
error; + unsigned long flags; + + do { + if (!ida_pre_get(&cic_index_ida, GFP_KERNEL)) + return -ENOMEM; + + spin_lock_irqsave(&cic_index_lock, flags); + error = ida_get_new(&cic_index_ida, &index); + spin_unlock_irqrestore(&cic_index_lock, flags); + if (error && error != -EAGAIN) + return error; + } while (error); + + return index; +} + +static void builder_free_cic_index(struct ioc_builder *builder, int index) +{ + unsigned long flags; + + spin_lock_irqsave(&cic_index_lock, flags); + ida_remove(&cic_index_ida, index); + spin_unlock_irqrestore(&cic_index_lock, flags); +} + +int ioc_builder_init_queue(struct ioc_builder *builder, + struct queue_data *qdata, struct request_queue *q) +{ + /* + * Don't need take queue_lock in the routine, since we are + * initializing the ioscheduler, and nobody is using qdata + */ + qdata->cic_index = builder_alloc_cic_index(builder); + if (qdata->cic_index < 0) + return -ENOMEM; + + qdata->queue = q; + INIT_LIST_HEAD(&qdata->cic_list); + + return 0; +} +EXPORT_SYMBOL(ioc_builder_init_queue); + +/* + * Call func for each cic attached to this ioc. + */ +static void +call_for_each_cic(struct io_context *ioc, + void (*func)(struct io_context *, struct dev_io_context *)) +{ + struct dev_io_context *cic; + struct hlist_node *n; + + rcu_read_lock(); + + hlist_for_each_entry_rcu(cic, n, &ioc->cic_list, cic_list) + func(ioc, cic); + + rcu_read_unlock(); +} + +static void queue_data_cic_free_rcu(struct rcu_head *head) +{ + struct dev_io_context *cic; + struct ioc_builder *builder; + + cic = container_of(head, struct dev_io_context, rcu_head); + builder = cic->builder; + + builder->free_ioc(builder, cic); + elv_ioc_count_dec(*builder->ioc_count); + + if (builder->ioc_gone) { + /* + * CFQ scheduler is exiting, grab exit lock and check + * the pending io context count. 
If it hits zero, + * complete ioc_gone and set it back to NULL + */ + spin_lock(&builder->ioc_gone_lock); + if (builder->ioc_gone && + !elv_ioc_count_read(*builder->ioc_count)) { + complete(builder->ioc_gone); + builder->ioc_gone = NULL; + } + spin_unlock(&builder->ioc_gone_lock); + } +} + +static void queue_data_cic_free(struct dev_io_context *cic) +{ + call_rcu(&cic->rcu_head, queue_data_cic_free_rcu); +} + +static void cic_free_func(struct io_context *ioc, struct dev_io_context *cic) +{ + unsigned long flags; + unsigned long dead_key = (unsigned long) cic->key; + + BUG_ON(!(dead_key & CIC_DEAD_KEY)); + + spin_lock_irqsave(&ioc->lock, flags); + radix_tree_delete(&ioc->radix_root, dead_key >> CIC_DEAD_INDEX_SHIFT); + hlist_del_rcu(&cic->cic_list); + spin_unlock_irqrestore(&ioc->lock, flags); + + queue_data_cic_free(cic); +} + +/* + * Must be called with rcu_read_lock() held or preemption otherwise disabled. + * Only two callers of this - ->dtor() which is called with the rcu_read_lock(), + * and ->trim() which is called with the task lock held + */ +void queue_data_free_io_context(struct io_context *ioc) +{ + /* + * ioc->refcount is zero here, or we are called from elv_unregister(), + * so no more cic's are allowed to be linked into this ioc. So it + * should be ok to iterate over the known list, we will see all cic's + * since no new ones are added. 
+ */ + call_for_each_cic(ioc, cic_free_func); +} +EXPORT_SYMBOL(queue_data_free_io_context); + +static void __queue_data_exit_single_io_context(struct queue_data *qdata, + struct dev_io_context *cic) +{ + struct io_context *ioc = cic->ioc; + struct ioc_builder *builder = cic->builder; + + list_del_init(&cic->queue_list); + + /* + * Make sure dead mark is seen for dead queues + */ + smp_wmb(); + cic->key = queue_data_dead_key(qdata); + + rcu_read_lock(); + if (rcu_dereference(ioc->ioc_data) == cic) { + rcu_read_unlock(); + spin_lock(&ioc->lock); + rcu_assign_pointer(ioc->ioc_data, NULL); + spin_unlock(&ioc->lock); + } else + rcu_read_unlock(); + + if (builder->cic_exit) + builder->cic_exit(qdata, cic); +} + +/* with request_queue lock hold */ +void ioc_builder_exit_queue(struct ioc_builder *builder, + struct queue_data *qdata) +{ + while (!list_empty(&qdata->cic_list)) { + struct dev_io_context *cic = list_entry(qdata->cic_list.next, + struct dev_io_context, + queue_list); + + __queue_data_exit_single_io_context(qdata, cic); + } + + builder_free_cic_index(builder, qdata->cic_index); +} +EXPORT_SYMBOL(ioc_builder_exit_queue); + +static void queue_data_exit_single_io_context(struct io_context *ioc, + struct dev_io_context *cic) +{ + struct queue_data *qdata = cic_to_queue_data(cic); + + if (qdata) { + struct request_queue *q = qdata->queue; + unsigned long flags; + + spin_lock_irqsave(q->queue_lock, flags); + + /* + * Ensure we get a fresh copy of the ->key to prevent + * race between exiting task and queue + */ + smp_read_barrier_depends(); + if (cic->key == qdata) + __queue_data_exit_single_io_context(qdata, cic); + + spin_unlock_irqrestore(q->queue_lock, flags); + } +} + +/* + * The process that ioc belongs to has exited, we need to clean up + * and put the internal structures we have that belongs to that process. 
+ */ +static void queue_data_exit_io_context(struct io_context *ioc) +{ + call_for_each_cic(ioc, queue_data_exit_single_io_context); +} + +static struct dev_io_context * +queue_data_alloc_io_context(struct ioc_builder *builder, + struct queue_data *qdata, gfp_t gfp_mask) +{ + struct dev_io_context *cic; + + cic = builder->alloc_ioc(builder, qdata, gfp_mask | __GFP_ZERO); + + if (cic) { + cic->builder = builder; + if (builder->cic_init) + builder->cic_init(qdata, cic); + INIT_LIST_HEAD(&cic->queue_list); + INIT_HLIST_NODE(&cic->cic_list); + cic->dtor = queue_data_free_io_context; + cic->exit = queue_data_exit_io_context; + elv_ioc_count_inc(*builder->ioc_count); + } + + return cic; +} + +/* + * We drop dev io contexts lazily, so we may find a dead one. + */ +static void +queue_data_drop_dead_cic(struct queue_data *queue_data, struct io_context *ioc, + struct dev_io_context *cic) +{ + unsigned long flags; + + WARN_ON(!list_empty(&cic->queue_list)); + BUG_ON(cic->key != queue_data_dead_key(queue_data)); + + spin_lock_irqsave(&ioc->lock, flags); + + BUG_ON(rcu_dereference_check(ioc->ioc_data, + lockdep_is_held(&ioc->lock)) == cic); + + radix_tree_delete(&ioc->radix_root, queue_data->cic_index); + hlist_del_rcu(&cic->cic_list); + spin_unlock_irqrestore(&ioc->lock, flags); + + queue_data_cic_free(cic); +} + +struct dev_io_context * +queue_data_cic_lookup(struct queue_data *qdata, struct io_context *ioc) +{ + struct dev_io_context *cic; + unsigned long flags; + + if (unlikely(!ioc)) + return NULL; + + rcu_read_lock(); + + /* + * we maintain a last-hit cache, to avoid browsing over the tree + */ + cic = rcu_dereference(ioc->ioc_data); + if (cic && cic->key == qdata) { + rcu_read_unlock(); + return cic; + } + + do { + cic = radix_tree_lookup(&ioc->radix_root, qdata->cic_index); + rcu_read_unlock(); + if (!cic) + break; + if (unlikely(cic->key != qdata)) { + queue_data_drop_dead_cic(qdata, ioc, cic); + rcu_read_lock(); + continue; + } + + spin_lock_irqsave(&ioc->lock, 
flags); + rcu_assign_pointer(ioc->ioc_data, cic); + spin_unlock_irqrestore(&ioc->lock, flags); + break; + } while (1); + + return cic; +} +EXPORT_SYMBOL(queue_data_cic_lookup); + +/* + * Add cic into ioc, using qdata as the search key. This enables us to lookup + * the process specific dev io context when entered from the block layer. + * Also adds the cic to a per-qdata list, used when this queue is removed. + */ +static int queue_data_cic_link(struct queue_data *qdata, + struct io_context *ioc, struct dev_io_context *cic, gfp_t gfp_mask) +{ + unsigned long flags; + int ret; + + ret = radix_tree_preload(gfp_mask); + if (!ret) { + cic->ioc = ioc; + cic->key = qdata; + + spin_lock_irqsave(&ioc->lock, flags); + ret = radix_tree_insert(&ioc->radix_root, + qdata->cic_index, cic); + if (!ret) + hlist_add_head_rcu(&cic->cic_list, &ioc->cic_list); + spin_unlock_irqrestore(&ioc->lock, flags); + + radix_tree_preload_end(); + + if (!ret) { + spin_lock_irqsave(qdata->queue->queue_lock, flags); + list_add(&cic->queue_list, &qdata->cic_list); + spin_unlock_irqrestore(qdata->queue->queue_lock, flags); + } + } + + if (ret && ret != -EEXIST) + printk(KERN_ERR "block: cic link failed!\n"); + + return ret; +} + +static void changed_ioprio(struct io_context *ioc, + struct dev_io_context *gen_cic) +{ + struct ioc_builder *builder = gen_cic->builder; + if (builder->changed_ioprio) + builder->changed_ioprio(ioc, gen_cic); +} + +static void queue_data_ioc_set_ioprio(struct io_context *ioc) +{ + call_for_each_cic(ioc, changed_ioprio); + ioc->ioprio_changed = 0; +} + +#ifdef CONFIG_CFQ_GROUP_IOSCHED +static void changed_cgroup(struct io_context *ioc, + struct dev_io_context *gen_cic) +{ + struct ioc_builder *builder = gen_cic->builder; + if (builder->changed_cgroup) + builder->changed_cgroup(ioc, gen_cic); +} + +static void queue_data_ioc_set_cgroup(struct io_context *ioc) +{ + call_for_each_cic(ioc, changed_cgroup); + ioc->cgroup_changed = 0; +} +#endif /* CONFIG_CFQ_GROUP_IOSCHED */ + 
+/* + * Setup general io context and dev io context. There can be several + * dev io contexts per general io context, if this process is doing io to more + * than one device managed by elevator. + */ +struct dev_io_context *queue_data_get_io_context(struct ioc_builder *builder, + struct queue_data *qdata, gfp_t gfp_mask) +{ + struct io_context *ioc = NULL; + struct dev_io_context *cic; + int ret; + + might_sleep_if(gfp_mask & __GFP_WAIT); + + ioc = get_io_context(gfp_mask, qdata->queue->node); + if (!ioc) + return NULL; + +retry: + cic = queue_data_cic_lookup(qdata, ioc); + if (cic) + goto out; + + cic = queue_data_alloc_io_context(builder, qdata, gfp_mask); + if (cic == NULL) + goto err; + + ret = queue_data_cic_link(qdata, ioc, cic, gfp_mask); + if (ret == -EEXIST) { + /* someone has linked cic to ioc already */ + queue_data_cic_free(cic); + goto retry; + } else if (ret) + goto err_free; + +out: + smp_read_barrier_depends(); + if (unlikely(ioc->ioprio_changed)) + queue_data_ioc_set_ioprio(ioc); + +#ifdef CONFIG_CFQ_GROUP_IOSCHED + if (unlikely(ioc->cgroup_changed)) + queue_data_ioc_set_cgroup(ioc); +#endif + return cic; +err_free: + queue_data_cic_free(cic); +err: + put_io_context(ioc); + return NULL; +} +EXPORT_SYMBOL(queue_data_get_io_context); +#endif Index: linux/block/blk.h =================================================================== --- linux.orig/block/blk.h 2011-12-27 16:13:02.000000000 +0800 +++ linux/block/blk.h 2011-12-28 09:42:18.000000000 +0800 @@ -206,4 +206,59 @@ static inline void blk_throtl_exit(struc static inline void blk_throtl_release(struct request_queue *q) { } #endif /* CONFIG_BLK_DEV_THROTTLING */ +#if IS_ENABLED(CONFIG_IOSCHED_CFQ) +struct queue_data; +struct ioc_builder { + struct dev_io_context *(*alloc_ioc)(struct ioc_builder *builder, + struct queue_data *qdata, gfp_t gfp_mask); + void (*free_ioc)(struct ioc_builder *builder, + struct dev_io_context *dev_ioc); + + void (*changed_ioprio)(struct io_context *ioc, + struct 
dev_io_context *cic); + void (*changed_cgroup)(struct io_context *ioc, + struct dev_io_context *cic); + void (*cic_exit)(struct queue_data *qdata, + struct dev_io_context *gen_cic); + void (*cic_init)(struct queue_data *qdata, + struct dev_io_context *gen_cic); + + unsigned long __percpu *ioc_count; + struct completion *ioc_gone; + spinlock_t ioc_gone_lock; +}; + +struct queue_data { + struct request_queue *queue; + + unsigned int cic_index; + struct list_head cic_list; +}; + +#define CIC_DEAD_KEY 1ul +static inline struct queue_data *cic_to_queue_data(struct dev_io_context *cic) +{ + struct queue_data *qdata = cic->key; + + if (unlikely((unsigned long) qdata & CIC_DEAD_KEY)) + return NULL; + + return qdata; +} + +int ioc_builder_init(struct ioc_builder *builder); +void io_context_builder_exit(struct ioc_builder *builder); + +int ioc_builder_init_queue(struct ioc_builder *builder, + struct queue_data *qdata, struct request_queue *q); +void ioc_builder_exit_queue(struct ioc_builder *builder, + struct queue_data *qdata); + +struct dev_io_context *queue_data_get_io_context(struct ioc_builder *builder, + struct queue_data *qdata, gfp_t gfp_mask); +struct dev_io_context *queue_data_cic_lookup(struct queue_data *qdata, + struct io_context *ioc); +void queue_data_free_io_context(struct io_context *ioc); +#endif + #endif /* BLK_INTERNAL_H */ Index: linux/block/cfq-iosched.c =================================================================== --- linux.orig/block/cfq-iosched.c 2011-12-27 16:13:02.000000000 +0800 +++ linux/block/cfq-iosched.c 2011-12-28 09:12:06.000000000 +0800 @@ -14,6 +14,7 @@ #include <linux/rbtree.h> #include <linux/ioprio.h> #include <linux/blktrace_api.h> +#include "blk.h" #include "cfq.h" /* @@ -60,13 +61,7 @@ static const int cfq_hist_divisor = 4; static struct kmem_cache *cfq_pool; static struct kmem_cache *cfq_ioc_pool; - -static DEFINE_PER_CPU(unsigned long, cfq_ioc_count); -static struct completion *ioc_gone; -static 
DEFINE_SPINLOCK(ioc_gone_lock); - -static DEFINE_SPINLOCK(cic_index_lock); -static DEFINE_IDA(cic_index_ida); +static struct ioc_builder ioc_builder; #define CFQ_PRIO_LISTS IOPRIO_BE_NR #define cfq_class_idle(cfqq) ((cfqq)->ioprio_class == IOPRIO_CLASS_IDLE) @@ -220,7 +215,8 @@ struct cfq_group { * Per block device queue structure */ struct cfq_data { - struct request_queue *queue; + struct queue_data qdata; + /* Root service tree for cfq_groups */ struct cfq_rb_root grp_service_tree; struct cfq_group root_group; @@ -290,9 +286,6 @@ struct cfq_data { unsigned int cfq_group_idle; unsigned int cfq_latency; - unsigned int cic_index; - struct list_head cic_list; - /* * Fallback dummy cfqq for extreme OOM conditions */ @@ -306,6 +299,10 @@ struct cfq_data { /* Number of groups which are on blkcg->blkg_list */ unsigned int nr_blkcg_linked_grps; }; +#define queue_data_to_cfqd(ptr) \ + container_of(ptr, struct cfq_data, qdata) +#define dev_ioc_to_cfq_ioc(ptr) \ + container_of(ptr, struct cfq_io_context, dev_ioc) static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd); @@ -369,21 +366,21 @@ CFQ_CFQQ_FNS(wait_busy); #ifdef CONFIG_CFQ_GROUP_IOSCHED #define cfq_log_cfqq(cfqd, cfqq, fmt, args...) \ - blk_add_trace_msg((cfqd)->queue, "cfq%d%c %s " fmt, (cfqq)->pid, \ + blk_add_trace_msg((cfqd)->qdata.queue, "cfq%d%c %s " fmt, (cfqq)->pid, \ cfq_cfqq_sync((cfqq)) ? 'S' : 'A', \ blkg_path(&(cfqq)->cfqg->blkg), ##args) #define cfq_log_cfqg(cfqd, cfqg, fmt, args...) \ - blk_add_trace_msg((cfqd)->queue, "%s " fmt, \ + blk_add_trace_msg((cfqd)->qdata.queue, "%s " fmt, \ blkg_path(&(cfqg)->blkg), ##args) \ #else #define cfq_log_cfqq(cfqd, cfqq, fmt, args...) \ - blk_add_trace_msg((cfqd)->queue, "cfq%d " fmt, (cfqq)->pid, ##args) + blk_add_trace_msg((cfqd)->qdata.queue, "cfq%d " fmt, (cfqq)->pid, ##args) #define cfq_log_cfqg(cfqd, cfqg, fmt, args...) do {} while (0) #endif #define cfq_log(cfqd, fmt, args...) 
\ - blk_add_trace_msg((cfqd)->queue, "cfq " fmt, ##args) + blk_add_trace_msg((cfqd)->qdata.queue, "cfq " fmt, ##args) /* Traverses through cfq group service trees */ #define for_each_cfqg_st(cfqg, i, j, st) \ @@ -464,8 +461,6 @@ static inline int cfqg_busy_async_queues static void cfq_dispatch_insert(struct request_queue *, struct request *); static struct cfq_queue *cfq_get_queue(struct cfq_data *, bool, struct io_context *, gfp_t); -static struct cfq_io_context *cfq_cic_lookup(struct cfq_data *, - struct io_context *); static inline struct cfq_queue *cic_to_cfqq(struct cfq_io_context *cic, bool is_sync) @@ -479,23 +474,6 @@ static inline void cic_set_cfqq(struct c cic->cfqq[is_sync] = cfqq; } -#define CIC_DEAD_KEY 1ul -#define CIC_DEAD_INDEX_SHIFT 1 - -static inline void *cfqd_dead_key(struct cfq_data *cfqd) -{ - return (void *)(cfqd->cic_index << CIC_DEAD_INDEX_SHIFT | CIC_DEAD_KEY); -} - -static inline struct cfq_data *cic_to_cfqd(struct cfq_io_context *cic) -{ - struct cfq_data *cfqd = cic->key; - - if (unlikely((unsigned long) cfqd & CIC_DEAD_KEY)) - return NULL; - - return cfqd; -} /* * We regard a request as SYNC, if it's either a read or has the SYNC bit @@ -514,7 +492,7 @@ static inline void cfq_schedule_dispatch { if (cfqd->busy_queues) { cfq_log(cfqd, "schedule dispatch"); - kblockd_schedule_work(cfqd->queue, &cfqd->unplug_work); + kblockd_schedule_work(cfqd->qdata.queue, &cfqd->unplug_work); } } @@ -1030,7 +1008,7 @@ static void cfq_update_blkio_group_weigh static void cfq_init_add_cfqg_lists(struct cfq_data *cfqd, struct cfq_group *cfqg, struct blkio_cgroup *blkcg) { - struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info; + struct backing_dev_info *bdi = &cfqd->qdata.queue->backing_dev_info; unsigned int major, minor; /* @@ -1065,7 +1043,7 @@ static struct cfq_group * cfq_alloc_cfqg int i, j, ret; struct cfq_rb_root *st; - cfqg = kzalloc_node(sizeof(*cfqg), GFP_ATOMIC, cfqd->queue->node); + cfqg = kzalloc_node(sizeof(*cfqg), GFP_ATOMIC, 
cfqd->qdata.queue->node); if (!cfqg) return NULL; @@ -1097,7 +1075,7 @@ cfq_find_cfqg(struct cfq_data *cfqd, str { struct cfq_group *cfqg = NULL; void *key = cfqd; - struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info; + struct backing_dev_info *bdi = &cfqd->qdata.queue->backing_dev_info; unsigned int major, minor; /* @@ -1125,7 +1103,7 @@ static struct cfq_group *cfq_get_cfqg(st { struct blkio_cgroup *blkcg; struct cfq_group *cfqg = NULL, *__cfqg = NULL; - struct request_queue *q = cfqd->queue; + struct request_queue *q = cfqd->qdata.queue; rcu_read_lock(); blkcg = task_blkio_cgroup(current); @@ -1259,9 +1237,9 @@ static void cfq_unlink_blkio_group(void unsigned long flags; struct cfq_data *cfqd = key; - spin_lock_irqsave(cfqd->queue->queue_lock, flags); + spin_lock_irqsave(cfqd->qdata.queue->queue_lock, flags); cfq_destroy_cfqg(cfqd, cfqg_of_blkg(blkg)); - spin_unlock_irqrestore(cfqd->queue->queue_lock, flags); + spin_unlock_irqrestore(cfqd->qdata.queue->queue_lock, flags); } #else /* GROUP_IOSCHED */ @@ -1561,12 +1539,14 @@ static struct request * cfq_find_rq_fmerge(struct cfq_data *cfqd, struct bio *bio) { struct task_struct *tsk = current; + struct dev_io_context *gen_cic; struct cfq_io_context *cic; struct cfq_queue *cfqq; - cic = cfq_cic_lookup(cfqd, tsk->io_context); - if (!cic) + gen_cic = queue_data_cic_lookup(&cfqd->qdata, tsk->io_context); + if (!gen_cic) return NULL; + cic = dev_ioc_to_cfq_ioc(gen_cic); cfqq = cic_to_cfqq(cic, cfq_bio_sync(bio)); if (cfqq) { @@ -1675,6 +1655,7 @@ static int cfq_allow_merge(struct reques struct bio *bio) { struct cfq_data *cfqd = q->elevator->elevator_data; + struct dev_io_context *gen_cic; struct cfq_io_context *cic; struct cfq_queue *cfqq; @@ -1688,9 +1669,10 @@ static int cfq_allow_merge(struct reques * Lookup the cfqq that this bio will be queued with. Allow * merge only if rq is queued there. 
*/ - cic = cfq_cic_lookup(cfqd, current->io_context); - if (!cic) + gen_cic = queue_data_cic_lookup(&cfqd->qdata, current->io_context); + if (!gen_cic) return false; + cic = dev_ioc_to_cfq_ioc(gen_cic); cfqq = cic_to_cfqq(cic, cfq_bio_sync(bio)); return cfqq == RQ_CFQQ(rq); @@ -1774,7 +1756,7 @@ __cfq_slice_expired(struct cfq_data *cfq cfqd->active_queue = NULL; if (cfqd->active_cic) { - put_io_context(cfqd->active_cic->ioc); + put_io_context(cfqd->active_cic->dev_ioc.ioc); cfqd->active_cic = NULL; } } @@ -1976,7 +1958,7 @@ static bool cfq_should_idle(struct cfq_d /* We do for queues that were marked with idle window flag. */ if (cfq_cfqq_idle_window(cfqq) && - !(blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag)) + !(blk_queue_nonrot(cfqd->qdata.queue) && cfqd->hw_tag)) return true; /* @@ -2002,7 +1984,7 @@ static void cfq_arm_slice_timer(struct c * for devices that support queuing, otherwise we still have a problem * with sync vs async workloads. */ - if (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag) + if (blk_queue_nonrot(cfqd->qdata.queue) && cfqd->hw_tag) return; WARN_ON(!RB_EMPTY_ROOT(&cfqq->sort_list)); @@ -2029,7 +2011,7 @@ static void cfq_arm_slice_timer(struct c * task has exited, don't wait */ cic = cfqd->active_cic; - if (!cic || !atomic_read(&cic->ioc->nr_tasks)) + if (!cic || !atomic_read(&cic->dev_ioc.ioc->nr_tasks)) return; /* @@ -2423,7 +2405,7 @@ static int __cfq_forced_dispatch_cfqq(st int dispatched = 0; while (cfqq->next_rq) { - cfq_dispatch_insert(cfqq->cfqd->queue, cfqq->next_rq); + cfq_dispatch_insert(cfqq->cfqd->qdata.queue, cfqq->next_rq); dispatched++; } @@ -2577,12 +2559,12 @@ static bool cfq_dispatch_request(struct /* * insert request into driver dispatch list */ - cfq_dispatch_insert(cfqd->queue, rq); + cfq_dispatch_insert(cfqd->qdata.queue, rq); if (!cfqd->active_cic) { struct cfq_io_context *cic = RQ_CIC(rq); - atomic_long_inc(&cic->ioc->refcount); + atomic_long_inc(&cic->dev_ioc.ioc->refcount); cfqd->active_cic = cic; } @@ -2665,84 
+2647,6 @@ static void cfq_put_queue(struct cfq_que cfq_put_cfqg(cfqg); } -/* - * Call func for each cic attached to this ioc. - */ -static void -call_for_each_cic(struct io_context *ioc, - void (*func)(struct io_context *, struct cfq_io_context *)) -{ - struct cfq_io_context *cic; - struct hlist_node *n; - - rcu_read_lock(); - - hlist_for_each_entry_rcu(cic, n, &ioc->cic_list, cic_list) - func(ioc, cic); - - rcu_read_unlock(); -} - -static void cfq_cic_free_rcu(struct rcu_head *head) -{ - struct cfq_io_context *cic; - - cic = container_of(head, struct cfq_io_context, rcu_head); - - kmem_cache_free(cfq_ioc_pool, cic); - elv_ioc_count_dec(cfq_ioc_count); - - if (ioc_gone) { - /* - * CFQ scheduler is exiting, grab exit lock and check - * the pending io context count. If it hits zero, - * complete ioc_gone and set it back to NULL - */ - spin_lock(&ioc_gone_lock); - if (ioc_gone && !elv_ioc_count_read(cfq_ioc_count)) { - complete(ioc_gone); - ioc_gone = NULL; - } - spin_unlock(&ioc_gone_lock); - } -} - -static void cfq_cic_free(struct cfq_io_context *cic) -{ - call_rcu(&cic->rcu_head, cfq_cic_free_rcu); -} - -static void cic_free_func(struct io_context *ioc, struct cfq_io_context *cic) -{ - unsigned long flags; - unsigned long dead_key = (unsigned long) cic->key; - - BUG_ON(!(dead_key & CIC_DEAD_KEY)); - - spin_lock_irqsave(&ioc->lock, flags); - radix_tree_delete(&ioc->radix_root, dead_key >> CIC_DEAD_INDEX_SHIFT); - hlist_del_rcu(&cic->cic_list); - spin_unlock_irqrestore(&ioc->lock, flags); - - cfq_cic_free(cic); -} - -/* - * Must be called with rcu_read_lock() held or preemption otherwise disabled. - * Only two callers of this - ->dtor() which is called with the rcu_read_lock(), - * and ->trim() which is called with the task lock held - */ -static void cfq_free_io_context(struct io_context *ioc) -{ - /* - * ioc->refcount is zero here, or we are called from elv_unregister(), - * so no more cic's are allowed to be linked into this ioc. 
So it - * should be ok to iterate over the known list, we will see all cic's - * since no new ones are added. - */ - call_for_each_cic(ioc, cic_free_func); -} - static void cfq_put_cooperator(struct cfq_queue *cfqq) { struct cfq_queue *__cfqq, *next; @@ -2776,90 +2680,6 @@ static void cfq_exit_cfqq(struct cfq_dat cfq_put_queue(cfqq); } -static void __cfq_exit_single_io_context(struct cfq_data *cfqd, - struct cfq_io_context *cic) -{ - struct io_context *ioc = cic->ioc; - - list_del_init(&cic->queue_list); - - /* - * Make sure dead mark is seen for dead queues - */ - smp_wmb(); - cic->key = cfqd_dead_key(cfqd); - - rcu_read_lock(); - if (rcu_dereference(ioc->ioc_data) == cic) { - rcu_read_unlock(); - spin_lock(&ioc->lock); - rcu_assign_pointer(ioc->ioc_data, NULL); - spin_unlock(&ioc->lock); - } else - rcu_read_unlock(); - - if (cic->cfqq[BLK_RW_ASYNC]) { - cfq_exit_cfqq(cfqd, cic->cfqq[BLK_RW_ASYNC]); - cic->cfqq[BLK_RW_ASYNC] = NULL; - } - - if (cic->cfqq[BLK_RW_SYNC]) { - cfq_exit_cfqq(cfqd, cic->cfqq[BLK_RW_SYNC]); - cic->cfqq[BLK_RW_SYNC] = NULL; - } -} - -static void cfq_exit_single_io_context(struct io_context *ioc, - struct cfq_io_context *cic) -{ - struct cfq_data *cfqd = cic_to_cfqd(cic); - - if (cfqd) { - struct request_queue *q = cfqd->queue; - unsigned long flags; - - spin_lock_irqsave(q->queue_lock, flags); - - /* - * Ensure we get a fresh copy of the ->key to prevent - * race between exiting task and queue - */ - smp_read_barrier_depends(); - if (cic->key == cfqd) - __cfq_exit_single_io_context(cfqd, cic); - - spin_unlock_irqrestore(q->queue_lock, flags); - } -} - -/* - * The process that ioc belongs to has exited, we need to clean up - * and put the internal structures we have that belongs to that process. 
- */ -static void cfq_exit_io_context(struct io_context *ioc) -{ - call_for_each_cic(ioc, cfq_exit_single_io_context); -} - -static struct cfq_io_context * -cfq_alloc_io_context(struct cfq_data *cfqd, gfp_t gfp_mask) -{ - struct cfq_io_context *cic; - - cic = kmem_cache_alloc_node(cfq_ioc_pool, gfp_mask | __GFP_ZERO, - cfqd->queue->node); - if (cic) { - cic->ttime.last_end_request = jiffies; - INIT_LIST_HEAD(&cic->queue_list); - INIT_HLIST_NODE(&cic->cic_list); - cic->dtor = cfq_free_io_context; - cic->exit = cfq_exit_io_context; - elv_ioc_count_inc(cfq_ioc_count); - } - - return cic; -} - static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc) { struct task_struct *tsk = current; @@ -2902,21 +2722,24 @@ static void cfq_init_prio_data(struct cf cfq_clear_cfqq_prio_changed(cfqq); } -static void changed_ioprio(struct io_context *ioc, struct cfq_io_context *cic) +static void changed_ioprio(struct io_context *ioc, + struct dev_io_context *gen_cic) { - struct cfq_data *cfqd = cic_to_cfqd(cic); + struct queue_data *qdata = cic_to_queue_data(gen_cic); + struct cfq_io_context *cic = dev_ioc_to_cfq_ioc(gen_cic); + struct cfq_data *cfqd = queue_data_to_cfqd(qdata); struct cfq_queue *cfqq; unsigned long flags; if (unlikely(!cfqd)) return; - spin_lock_irqsave(cfqd->queue->queue_lock, flags); + spin_lock_irqsave(cfqd->qdata.queue->queue_lock, flags); cfqq = cic->cfqq[BLK_RW_ASYNC]; if (cfqq) { struct cfq_queue *new_cfqq; - new_cfqq = cfq_get_queue(cfqd, BLK_RW_ASYNC, cic->ioc, + new_cfqq = cfq_get_queue(cfqd, BLK_RW_ASYNC, cic->dev_ioc.ioc, GFP_ATOMIC); if (new_cfqq) { cic->cfqq[BLK_RW_ASYNC] = new_cfqq; @@ -2928,13 +2751,7 @@ static void changed_ioprio(struct io_con if (cfqq) cfq_mark_cfqq_prio_changed(cfqq); - spin_unlock_irqrestore(cfqd->queue->queue_lock, flags); -} - -static void cfq_ioc_set_ioprio(struct io_context *ioc) -{ - call_for_each_cic(ioc, changed_ioprio); - ioc->ioprio_changed = 0; + spin_unlock_irqrestore(cfqd->qdata.queue->queue_lock, 
flags); } static void cfq_init_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq, @@ -2958,17 +2775,20 @@ static void cfq_init_cfqq(struct cfq_dat } #ifdef CONFIG_CFQ_GROUP_IOSCHED -static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic) +static void changed_cgroup(struct io_context *ioc, + struct dev_io_context *gen_cic) { + struct cfq_io_context *cic = dev_ioc_to_cfq_ioc(gen_cic); struct cfq_queue *sync_cfqq = cic_to_cfqq(cic, 1); - struct cfq_data *cfqd = cic_to_cfqd(cic); + struct queue_data *qdata = cic_to_queue_data(gen_cic); + struct cfq_data *cfqd = queue_data_to_cfqd(qdata); unsigned long flags; struct request_queue *q; if (unlikely(!cfqd)) return; - q = cfqd->queue; + q = cfqd->qdata.queue; spin_lock_irqsave(q->queue_lock, flags); @@ -2984,12 +2804,6 @@ static void changed_cgroup(struct io_con spin_unlock_irqrestore(q->queue_lock, flags); } - -static void cfq_ioc_set_cgroup(struct io_context *ioc) -{ - call_for_each_cic(ioc, changed_cgroup); - ioc->cgroup_changed = 0; -} #endif /* CONFIG_CFQ_GROUP_IOSCHED */ static struct cfq_queue * @@ -2997,12 +2811,14 @@ cfq_find_alloc_queue(struct cfq_data *cf struct io_context *ioc, gfp_t gfp_mask) { struct cfq_queue *cfqq, *new_cfqq = NULL; + struct dev_io_context *gen_cic; struct cfq_io_context *cic; struct cfq_group *cfqg; retry: cfqg = cfq_get_cfqg(cfqd); - cic = cfq_cic_lookup(cfqd, ioc); + gen_cic = queue_data_cic_lookup(&cfqd->qdata, ioc); + cic = dev_ioc_to_cfq_ioc(gen_cic); /* cic always exists here */ cfqq = cic_to_cfqq(cic, is_sync); @@ -3016,17 +2832,17 @@ retry: cfqq = new_cfqq; new_cfqq = NULL; } else if (gfp_mask & __GFP_WAIT) { - spin_unlock_irq(cfqd->queue->queue_lock); + spin_unlock_irq(cfqd->qdata.queue->queue_lock); new_cfqq = kmem_cache_alloc_node(cfq_pool, gfp_mask | __GFP_ZERO, - cfqd->queue->node); - spin_lock_irq(cfqd->queue->queue_lock); + cfqd->qdata.queue->node); + spin_lock_irq(cfqd->qdata.queue->queue_lock); if (new_cfqq) goto retry; } else { cfqq = 
kmem_cache_alloc_node(cfq_pool, gfp_mask | __GFP_ZERO, - cfqd->queue->node); + cfqd->qdata.queue->node); } if (cfqq) { @@ -3088,159 +2904,6 @@ cfq_get_queue(struct cfq_data *cfqd, boo return cfqq; } -/* - * We drop cfq io contexts lazily, so we may find a dead one. - */ -static void -cfq_drop_dead_cic(struct cfq_data *cfqd, struct io_context *ioc, - struct cfq_io_context *cic) -{ - unsigned long flags; - - WARN_ON(!list_empty(&cic->queue_list)); - BUG_ON(cic->key != cfqd_dead_key(cfqd)); - - spin_lock_irqsave(&ioc->lock, flags); - - BUG_ON(rcu_dereference_check(ioc->ioc_data, - lockdep_is_held(&ioc->lock)) == cic); - - radix_tree_delete(&ioc->radix_root, cfqd->cic_index); - hlist_del_rcu(&cic->cic_list); - spin_unlock_irqrestore(&ioc->lock, flags); - - cfq_cic_free(cic); -} - -static struct cfq_io_context * -cfq_cic_lookup(struct cfq_data *cfqd, struct io_context *ioc) -{ - struct cfq_io_context *cic; - unsigned long flags; - - if (unlikely(!ioc)) - return NULL; - - rcu_read_lock(); - - /* - * we maintain a last-hit cache, to avoid browsing over the tree - */ - cic = rcu_dereference(ioc->ioc_data); - if (cic && cic->key == cfqd) { - rcu_read_unlock(); - return cic; - } - - do { - cic = radix_tree_lookup(&ioc->radix_root, cfqd->cic_index); - rcu_read_unlock(); - if (!cic) - break; - if (unlikely(cic->key != cfqd)) { - cfq_drop_dead_cic(cfqd, ioc, cic); - rcu_read_lock(); - continue; - } - - spin_lock_irqsave(&ioc->lock, flags); - rcu_assign_pointer(ioc->ioc_data, cic); - spin_unlock_irqrestore(&ioc->lock, flags); - break; - } while (1); - - return cic; -} - -/* - * Add cic into ioc, using cfqd as the search key. This enables us to lookup - * the process specific cfq io context when entered from the block layer. - * Also adds the cic to a per-cfqd list, used when this queue is removed. 
- */ -static int cfq_cic_link(struct cfq_data *cfqd, struct io_context *ioc, - struct cfq_io_context *cic, gfp_t gfp_mask) -{ - unsigned long flags; - int ret; - - ret = radix_tree_preload(gfp_mask); - if (!ret) { - cic->ioc = ioc; - cic->key = cfqd; - - spin_lock_irqsave(&ioc->lock, flags); - ret = radix_tree_insert(&ioc->radix_root, - cfqd->cic_index, cic); - if (!ret) - hlist_add_head_rcu(&cic->cic_list, &ioc->cic_list); - spin_unlock_irqrestore(&ioc->lock, flags); - - radix_tree_preload_end(); - - if (!ret) { - spin_lock_irqsave(cfqd->queue->queue_lock, flags); - list_add(&cic->queue_list, &cfqd->cic_list); - spin_unlock_irqrestore(cfqd->queue->queue_lock, flags); - } - } - - if (ret && ret != -EEXIST) - printk(KERN_ERR "cfq: cic link failed!\n"); - - return ret; -} - -/* - * Setup general io context and cfq io context. There can be several cfq - * io contexts per general io context, if this process is doing io to more - * than one device managed by cfq. - */ -static struct cfq_io_context * -cfq_get_io_context(struct cfq_data *cfqd, gfp_t gfp_mask) -{ - struct io_context *ioc = NULL; - struct cfq_io_context *cic; - int ret; - - might_sleep_if(gfp_mask & __GFP_WAIT); - - ioc = get_io_context(gfp_mask, cfqd->queue->node); - if (!ioc) - return NULL; - -retry: - cic = cfq_cic_lookup(cfqd, ioc); - if (cic) - goto out; - - cic = cfq_alloc_io_context(cfqd, gfp_mask); - if (cic == NULL) - goto err; - - ret = cfq_cic_link(cfqd, ioc, cic, gfp_mask); - if (ret == -EEXIST) { - /* someone has linked cic to ioc already */ - cfq_cic_free(cic); - goto retry; - } else if (ret) - goto err_free; - -out: - smp_read_barrier_depends(); - if (unlikely(ioc->ioprio_changed)) - cfq_ioc_set_ioprio(ioc); - -#ifdef CONFIG_CFQ_GROUP_IOSCHED - if (unlikely(ioc->cgroup_changed)) - cfq_ioc_set_cgroup(ioc); -#endif - return cic; -err_free: - cfq_cic_free(cic); -err: - put_io_context(ioc); - return NULL; -} static void __cfq_update_io_thinktime(struct cfq_ttime *ttime, unsigned long slice_idle) 
@@ -3281,7 +2944,7 @@ cfq_update_io_seektime(struct cfq_data * } cfqq->seek_history <<= 1; - if (blk_queue_nonrot(cfqd->queue)) + if (blk_queue_nonrot(cfqd->qdata.queue)) cfqq->seek_history |= (n_sec < CFQQ_SECT_THR_NONROT); else cfqq->seek_history |= (sdist > CFQQ_SEEK_THR); @@ -3310,7 +2973,8 @@ cfq_update_idle_window(struct cfq_data * if (cfqq->next_rq && (cfqq->next_rq->cmd_flags & REQ_NOIDLE)) enable_idle = 0; - else if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle || + else if (!atomic_read(&cic->dev_ioc.ioc->nr_tasks) || + !cfqd->cfq_slice_idle || (!cfq_cfqq_deep(cfqq) && CFQQ_SEEKY(cfqq))) enable_idle = 0; else if (sample_valid(cic->ttime.ttime_samples)) { @@ -3471,7 +3135,7 @@ cfq_rq_enqueued(struct cfq_data *cfqd, s cfqd->busy_queues > 1) { cfq_del_timer(cfqd, cfqq); cfq_clear_cfqq_wait_request(cfqq); - __blk_run_queue(cfqd->queue); + __blk_run_queue(cfqd->qdata.queue); } else { cfq_blkiocg_update_idle_time_stats( &cfqq->cfqg->blkg); @@ -3486,7 +3150,7 @@ cfq_rq_enqueued(struct cfq_data *cfqd, s * this new queue is RT and the current one is BE */ cfq_preempt_queue(cfqd, cfqq); - __blk_run_queue(cfqd->queue); + __blk_run_queue(cfqd->qdata.queue); } } @@ -3496,7 +3160,7 @@ static void cfq_insert_request(struct re struct cfq_queue *cfqq = RQ_CFQQ(rq); cfq_log_cfqq(cfqd, cfqq, "insert_request"); - cfq_init_prio_data(cfqq, RQ_CIC(rq)->ioc); + cfq_init_prio_data(cfqq, RQ_CIC(rq)->dev_ioc.ioc); rq_set_fifo_time(rq, jiffies + cfqd->cfq_fifo_expire[rq_is_sync(rq)]); list_add_tail(&rq->queuelist, &cfqq->fifo); @@ -3683,6 +3347,7 @@ static int cfq_may_queue(struct request_ { struct cfq_data *cfqd = q->elevator->elevator_data; struct task_struct *tsk = current; + struct dev_io_context *gen_cic; struct cfq_io_context *cic; struct cfq_queue *cfqq; @@ -3692,13 +3357,14 @@ static int cfq_may_queue(struct request_ * so just lookup a possibly existing queue, or return 'may queue' * if that fails */ - cic = cfq_cic_lookup(cfqd, tsk->io_context); - if (!cic) + 
gen_cic = queue_data_cic_lookup(&cfqd->qdata, tsk->io_context); + if (!gen_cic) return ELV_MQUEUE_MAY; + cic = dev_ioc_to_cfq_ioc(gen_cic); cfqq = cic_to_cfqq(cic, rw_is_sync(rw)); if (cfqq) { - cfq_init_prio_data(cfqq, cic->ioc); + cfq_init_prio_data(cfqq, cic->dev_ioc.ioc); return __cfq_may_queue(cfqq); } @@ -3719,7 +3385,7 @@ static void cfq_put_request(struct reque BUG_ON(!cfqq->allocated[rw]); cfqq->allocated[rw]--; - put_io_context(RQ_CIC(rq)->ioc); + put_io_context(RQ_CIC(rq)->dev_ioc.ioc); rq->elevator_private[0] = NULL; rq->elevator_private[1] = NULL; @@ -3772,6 +3438,7 @@ cfq_set_request(struct request_queue *q, { struct cfq_data *cfqd = q->elevator->elevator_data; struct cfq_io_context *cic; + struct dev_io_context *dev_ioc; const int rw = rq_data_dir(rq); const bool is_sync = rq_is_sync(rq); struct cfq_queue *cfqq; @@ -3779,7 +3446,12 @@ cfq_set_request(struct request_queue *q, might_sleep_if(gfp_mask & __GFP_WAIT); - cic = cfq_get_io_context(cfqd, gfp_mask); + dev_ioc = queue_data_get_io_context(&ioc_builder, &cfqd->qdata, + gfp_mask); + if (dev_ioc) + cic = dev_ioc_to_cfq_ioc(dev_ioc); + else + cic = NULL; spin_lock_irqsave(q->queue_lock, flags); @@ -3789,7 +3461,7 @@ cfq_set_request(struct request_queue *q, new_queue: cfqq = cic_to_cfqq(cic, is_sync); if (!cfqq || cfqq == &cfqd->oom_cfqq) { - cfqq = cfq_get_queue(cfqd, is_sync, cic->ioc, gfp_mask); + cfqq = cfq_get_queue(cfqd, is_sync, cic->dev_ioc.ioc, gfp_mask); cic_set_cfqq(cic, cfqq, is_sync); } else { /* @@ -3832,10 +3504,10 @@ static void cfq_kick_queue(struct work_s { struct cfq_data *cfqd = container_of(work, struct cfq_data, unplug_work); - struct request_queue *q = cfqd->queue; + struct request_queue *q = cfqd->qdata.queue; spin_lock_irq(q->queue_lock); - __blk_run_queue(cfqd->queue); + __blk_run_queue(q); spin_unlock_irq(q->queue_lock); } @@ -3851,7 +3523,7 @@ static void cfq_idle_slice_timer(unsigne cfq_log(cfqd, "idle timer fired"); - spin_lock_irqsave(cfqd->queue->queue_lock, flags); + 
spin_lock_irqsave(cfqd->qdata.queue->queue_lock, flags); cfqq = cfqd->active_queue; if (cfqq) { @@ -3892,7 +3564,7 @@ expire: out_kick: cfq_schedule_dispatch(cfqd); out_cont: - spin_unlock_irqrestore(cfqd->queue->queue_lock, flags); + spin_unlock_irqrestore(cfqd->qdata.queue->queue_lock, flags); } static void cfq_shutdown_timer_wq(struct cfq_data *cfqd) @@ -3916,10 +3588,35 @@ static void cfq_put_async_queues(struct cfq_put_queue(cfqd->async_idle_cfqq); } +static void cfq_init_cic(struct queue_data *qdata, + struct dev_io_context *gen_cic) +{ + struct cfq_io_context *cic = dev_ioc_to_cfq_ioc(gen_cic); + + cic->ttime.last_end_request = jiffies; +} + +static void cfq_exit_cic(struct queue_data *qdata, + struct dev_io_context *gen_cic) +{ + struct cfq_io_context *cic = dev_ioc_to_cfq_ioc(gen_cic); + struct cfq_data *cfqd = queue_data_to_cfqd(qdata); + + if (cic->cfqq[BLK_RW_ASYNC]) { + cfq_exit_cfqq(cfqd, cic->cfqq[BLK_RW_ASYNC]); + cic->cfqq[BLK_RW_ASYNC] = NULL; + } + + if (cic->cfqq[BLK_RW_SYNC]) { + cfq_exit_cfqq(cfqd, cic->cfqq[BLK_RW_SYNC]); + cic->cfqq[BLK_RW_SYNC] = NULL; + } +} + static void cfq_exit_queue(struct elevator_queue *e) { struct cfq_data *cfqd = e->elevator_data; - struct request_queue *q = cfqd->queue; + struct request_queue *q = cfqd->qdata.queue; bool wait = false; cfq_shutdown_timer_wq(cfqd); @@ -3929,13 +3626,7 @@ static void cfq_exit_queue(struct elevat if (cfqd->active_queue) __cfq_slice_expired(cfqd, cfqd->active_queue, 0); - while (!list_empty(&cfqd->cic_list)) { - struct cfq_io_context *cic = list_entry(cfqd->cic_list.next, - struct cfq_io_context, - queue_list); - - __cfq_exit_single_io_context(cfqd, cic); - } + ioc_builder_exit_queue(&ioc_builder, &cfqd->qdata); cfq_put_async_queues(cfqd); cfq_release_cfq_groups(cfqd); @@ -3951,10 +3642,6 @@ static void cfq_exit_queue(struct elevat cfq_shutdown_timer_wq(cfqd); - spin_lock(&cic_index_lock); - ida_remove(&cic_index_ida, cfqd->cic_index); - spin_unlock(&cic_index_lock); - /* * Wait for 
cfqg->blkg->key accessors to exit their grace periods. * Do this wait only if there are other unlinked groups out @@ -3976,24 +3663,6 @@ static void cfq_exit_queue(struct elevat kfree(cfqd); } -static int cfq_alloc_cic_index(void) -{ - int index, error; - - do { - if (!ida_pre_get(&cic_index_ida, GFP_KERNEL)) - return -ENOMEM; - - spin_lock(&cic_index_lock); - error = ida_get_new(&cic_index_ida, &index); - spin_unlock(&cic_index_lock); - if (error && error != -EAGAIN) - return error; - } while (error); - - return index; -} - static void *cfq_init_queue(struct request_queue *q) { struct cfq_data *cfqd; @@ -4001,24 +3670,15 @@ static void *cfq_init_queue(struct reque struct cfq_group *cfqg; struct cfq_rb_root *st; - i = cfq_alloc_cic_index(); - if (i < 0) + cfqd = kmalloc_node(sizeof(*cfqd), GFP_KERNEL | __GFP_ZERO, q->node); + if (!cfqd) return NULL; - cfqd = kmalloc_node(sizeof(*cfqd), GFP_KERNEL | __GFP_ZERO, q->node); - if (!cfqd) { - spin_lock(&cic_index_lock); - ida_remove(&cic_index_ida, i); - spin_unlock(&cic_index_lock); + if (ioc_builder_init_queue(&ioc_builder, &cfqd->qdata, q)) { + kfree(cfqd); return NULL; } - /* - * Don't need take queue_lock in the routine, since we are - * initializing the ioscheduler, and nobody is using cfqd - */ - cfqd->cic_index = i; - /* Init root service tree */ cfqd->grp_service_tree = CFQ_RB_ROOT; @@ -4044,9 +3704,7 @@ static void *cfq_init_queue(struct reque if (blkio_alloc_blkg_stats(&cfqg->blkg)) { kfree(cfqg); - spin_lock(&cic_index_lock); - ida_remove(&cic_index_ida, cfqd->cic_index); - spin_unlock(&cic_index_lock); + ioc_builder_exit_queue(&ioc_builder, &cfqd->qdata); kfree(cfqd); return NULL; @@ -4079,9 +3737,6 @@ static void *cfq_init_queue(struct reque cfqd->oom_cfqq.ref++; cfq_link_cfqq_cfqg(&cfqd->oom_cfqq, &cfqd->root_group); - INIT_LIST_HEAD(&cfqd->cic_list); - - cfqd->queue = q; init_timer(&cfqd->idle_slice_timer); cfqd->idle_slice_timer.function = cfq_idle_slice_timer; @@ -4137,6 +3792,34 @@ fail: return 
-ENOMEM; } +static struct dev_io_context *cfq_alloc_ioc(struct ioc_builder *builder, + struct queue_data *qdata, gfp_t gfp_mask) +{ + struct cfq_io_context *ioc = kmem_cache_alloc_node(cfq_ioc_pool, + gfp_mask, qdata->queue->node); + if (ioc) + return &ioc->dev_ioc; + return NULL; +} + +static void cfq_free_ioc(struct ioc_builder *builder, + struct dev_io_context *dev_ioc) +{ + struct cfq_io_context *ioc = dev_ioc_to_cfq_ioc(dev_ioc); + kmem_cache_free(cfq_ioc_pool, ioc); +} + +static struct ioc_builder ioc_builder = { + .alloc_ioc = cfq_alloc_ioc, + .free_ioc = cfq_free_ioc, + .changed_ioprio = changed_ioprio, +#ifdef CONFIG_CFQ_GROUP_IOSCHED + .changed_cgroup = changed_cgroup, +#endif + .cic_init = cfq_init_cic, + .cic_exit = cfq_exit_cic, +}; + /* * sysfs parts below --> */ @@ -4247,7 +3930,7 @@ static struct elevator_type iosched_cfq .elevator_may_queue_fn = cfq_may_queue, .elevator_init_fn = cfq_init_queue, .elevator_exit_fn = cfq_exit_queue, - .trim = cfq_free_io_context, + .trim = queue_data_free_io_context, }, .elevator_attrs = cfq_attrs, .elevator_name = "cfq", @@ -4284,6 +3967,10 @@ static int __init cfq_init(void) #endif if (cfq_slab_setup()) return -ENOMEM; + if (ioc_builder_init(&ioc_builder)) { + cfq_slab_kill(); + return -ENOMEM; + } elv_register(&iosched_cfq); blkio_policy_register(&blkio_policy_cfq); @@ -4293,20 +3980,9 @@ static int __init cfq_init(void) static void __exit cfq_exit(void) { - DECLARE_COMPLETION_ONSTACK(all_gone); blkio_policy_unregister(&blkio_policy_cfq); elv_unregister(&iosched_cfq); - ioc_gone = &all_gone; - /* ioc_gone's update must be visible before reading ioc_count */ - smp_wmb(); - - /* - * this also protects us from entering cfq_slab_kill() with - * pending RCU callbacks - */ - if (elv_ioc_count_read(cfq_ioc_count)) - wait_for_completion(&all_gone); - ida_destroy(&cic_index_ida); + io_context_builder_exit(&ioc_builder); cfq_slab_kill(); } Index: linux/include/linux/iocontext.h 
=================================================================== --- linux.orig/include/linux/iocontext.h 2011-12-27 16:13:02.000000000 +0800 +++ linux/include/linux/iocontext.h 2011-12-27 16:16:38.000000000 +0800 @@ -4,6 +4,22 @@ #include <linux/radix-tree.h> #include <linux/rcupdate.h> +struct ioc_builder; +struct dev_io_context { + void *key; + struct io_context *ioc; + + struct list_head queue_list; + struct hlist_node cic_list; + + void (*dtor)(struct io_context *); /* destructor */ + void (*exit)(struct io_context *); /* called on task exit */ + + struct rcu_head rcu_head; + + struct ioc_builder *builder; +}; + struct cfq_queue; struct cfq_ttime { unsigned long last_end_request; @@ -14,21 +30,9 @@ struct cfq_ttime { }; struct cfq_io_context { - void *key; - + struct dev_io_context dev_ioc; struct cfq_queue *cfqq[2]; - - struct io_context *ioc; - struct cfq_ttime ttime; - - struct list_head queue_list; - struct hlist_node cic_list; - - void (*dtor)(struct io_context *); /* destructor */ - void (*exit)(struct io_context *); /* called on task exit */ - - struct rcu_head rcu_head; }; /* ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC 1/3]block: seperate CFQ io context management code 2012-01-04 6:53 ` [RFC 1/3]block: seperate CFQ io context management code Shaohua Li @ 2012-01-04 8:19 ` Namhyung Kim 0 siblings, 0 replies; 29+ messages in thread From: Namhyung Kim @ 2012-01-04 8:19 UTC (permalink / raw) To: shaohua.li; +Cc: linux-kernel, axboe, vgoyal, jmoyer Hi, Two nitpicks below. 2012-01-04 3:53 PM, Shaohua Li wrote: > CFQ's io context management creates a per-device io context for each task. > It's quite generic. Separate it from CFQ, and use it for fiops I/O scheduler. > > Signed-off-by: Shaohua Li<shaohua.li@intel.com> > --- [snip] > +int ioc_builder_init(struct ioc_builder *builder) > +{ > + if (!builder->alloc_ioc || !builder->free_ioc) > + return -ENOMEM; > + > + builder->ioc_count = alloc_percpu(unsigned long); > + if (!builder->ioc_count) > + return -ENOMEM; > + > + builder->ioc_gone = NULL; > + spin_lock_init(&builder->ioc_gone_lock); > + > + return 0; > +} > +EXPORT_SYMBOL(ioc_builder_init); > + > +void io_context_builder_exit(struct ioc_builder *builder) It'd be better to use 'ioc_builder_exit' as the name, for consistency, IMHO. 
> +{ > + DECLARE_COMPLETION_ONSTACK(all_gone); > + > + builder->ioc_gone =&all_gone; > + /* ioc_gone's update must be visible before reading ioc_count */ > + smp_wmb(); > + > + /* > + * this also protects us from entering cfq_slab_kill() with > + * pending RCU callbacks > + */ > + if (elv_ioc_count_read(*builder->ioc_count)) > + wait_for_completion(&all_gone); > + > + free_percpu(builder->ioc_count); > +} > +EXPORT_SYMBOL(io_context_builder_exit); [snip] > +static void queue_data_cic_free_rcu(struct rcu_head *head) > +{ > + struct dev_io_context *cic; > + struct ioc_builder *builder; > + > + cic = container_of(head, struct dev_io_context, rcu_head); > + builder = cic->builder; > + > + builder->free_ioc(builder, cic); > + elv_ioc_count_dec(*builder->ioc_count); > + > + if (builder->ioc_gone) { > + /* > + * CFQ scheduler is exiting, grab exit lock and check s/CFQ/IO/ ? Thanks. Namhyung Kim ^ permalink raw reply [flat|nested] 29+ messages in thread
* [RFC 2/3]block: FIOPS ioscheduler core 2012-01-04 6:53 [RFC 0/3]block: An IOPS based ioscheduler Shaohua Li 2012-01-04 6:53 ` [RFC 1/3]block: seperate CFQ io context management code Shaohua Li @ 2012-01-04 6:53 ` Shaohua Li 2012-01-06 6:05 ` Namjae Jeon 2012-01-07 1:06 ` Zhu Yanhai 2012-01-04 6:53 ` [RFC 3/3]block: fiops read/write request scale Shaohua Li ` (3 subsequent siblings) 5 siblings, 2 replies; 29+ messages in thread From: Shaohua Li @ 2012-01-04 6:53 UTC (permalink / raw) To: linux-kernel; +Cc: axboe, vgoyal, jmoyer, Shaohua Li [-- Attachment #1: fiops-iosch.patch --] [-- Type: text/plain, Size: 19808 bytes --]

FIOPS (Fair IOPS) is an IOPS based ioscheduler, so it only targets drives without I/O seek. It's quite similar to CFQ, but the dispatch decision is made according to IOPS instead of time slice.

The algorithm is simple. The drive has a service tree, and each task lives in the tree. The key into the tree is called vios (virtual I/O). Every request has a vios, which is calculated according to its ioprio, request size and so on. A task's vios is the sum of the vios of all requests it dispatches. FIOPS always selects the task with the minimum vios in the service tree and lets that task dispatch a request. The dispatched request's vios is then added to the task's vios, and the task is repositioned in the service tree.

Unlike CFQ, FIOPS doesn't have separate sync/async queues, because with IO-less writeback a task usually dispatches either sync or async requests, not both. Biasing reads or writes can still be done with the read/write scale.

One issue is that if the workload iodepth is lower than the drive queue_depth, a task's IOPS share might not strictly follow its priority, request size and so on. In that case the drive actually has idle capacity; enforcing strict fairness would require idling the drive, which would hurt performance. I believe CFQ isn't completely fair between tasks in such a case either. 
Signed-off-by: Shaohua Li <shaohua.li@intel.com> --- block/Kconfig.iosched | 8 block/Makefile | 1 block/blk-ioc.c | 2 block/blk.h | 2 block/fiops-iosched.c | 659 ++++++++++++++++++++++++++++++++++++++++++++++++++ 5 files changed, 670 insertions(+), 2 deletions(-) Index: linux/block/Makefile =================================================================== --- linux.orig/block/Makefile 2011-12-28 09:42:18.000000000 +0800 +++ linux/block/Makefile 2011-12-28 09:42:35.000000000 +0800 @@ -14,6 +14,7 @@ obj-$(CONFIG_BLK_DEV_THROTTLING) += blk- obj-$(CONFIG_IOSCHED_NOOP) += noop-iosched.o obj-$(CONFIG_IOSCHED_DEADLINE) += deadline-iosched.o obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o +obj-$(CONFIG_IOSCHED_FIOPS) += fiops-iosched.o obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o obj-$(CONFIG_BLK_DEV_INTEGRITY) += blk-integrity.o Index: linux/block/fiops-iosched.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux/block/fiops-iosched.c 2012-01-04 14:01:26.000000000 +0800 @@ -0,0 +1,659 @@ +/* + * IOPS based IO scheduler. Based on CFQ. 
+ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk> + * Shaohua Li <shli@kernel.org> + */ +#include <linux/module.h> +#include <linux/slab.h> +#include <linux/blkdev.h> +#include <linux/elevator.h> +#include <linux/jiffies.h> +#include <linux/rbtree.h> +#include <linux/ioprio.h> +#include "blk.h" + +#define VIOS_SCALE_SHIFT 10 +#define VIOS_SCALE (1 << VIOS_SCALE_SHIFT) + +struct fiops_rb_root { + struct rb_root rb; + struct rb_node *left; + unsigned count; + + u64 min_vios; +}; +#define FIOPS_RB_ROOT (struct fiops_rb_root) { .rb = RB_ROOT} + +struct fiops_data { + struct queue_data qdata; + + struct fiops_rb_root service_tree; + + unsigned int busy_queues; + + struct work_struct unplug_work; +}; + +struct fiops_ioc { + struct dev_io_context dev_ioc; + + unsigned int flags; + struct fiops_data *fiopsd; + struct rb_node rb_node; + u64 vios; /* key in service_tree */ + struct fiops_rb_root *service_tree; + + struct rb_root sort_list; + struct list_head fifo; + + pid_t pid; +}; + +static struct kmem_cache *fiops_ioc_pool; +static struct ioc_builder ioc_builder; +#define queue_data_to_fiopsd(ptr) container_of(ptr, struct fiops_data, qdata) +#define dev_ioc_to_fiops_ioc(ptr) container_of(ptr, struct fiops_ioc, dev_ioc) +#define ioc_service_tree(ioc) (&((ioc)->fiopsd->service_tree)) + +#define RQ_CIC(rq) \ + ((struct fiops_ioc *) (rq)->elevator_private[0]) +enum ioc_state_flags { + FIOPS_IOC_FLAG_on_rr = 0, /* on round-robin busy list */ +}; + +#define FIOPS_IOC_FNS(name) \ +static inline void fiops_mark_ioc_##name(struct fiops_ioc *ioc) \ +{ \ + ioc->flags |= (1 << FIOPS_IOC_FLAG_##name); \ +} \ +static inline void fiops_clear_ioc_##name(struct fiops_ioc *ioc) \ +{ \ + ioc->flags &= ~(1 << FIOPS_IOC_FLAG_##name); \ +} \ +static inline int fiops_ioc_##name(const struct fiops_ioc *ioc) \ +{ \ + return ((ioc)->flags & (1 << FIOPS_IOC_FLAG_##name)) != 0; \ +} + +FIOPS_IOC_FNS(on_rr); +#undef FIOPS_IOC_FNS + +/* + * The below is leftmost cache rbtree addon + */ +static 
struct fiops_ioc *fiops_rb_first(struct fiops_rb_root *root) +{ + /* Service tree is empty */ + if (!root->count) + return NULL; + + if (!root->left) + root->left = rb_first(&root->rb); + + if (root->left) + return rb_entry(root->left, struct fiops_ioc, rb_node); + + return NULL; +} + +static void rb_erase_init(struct rb_node *n, struct rb_root *root) +{ + rb_erase(n, root); + RB_CLEAR_NODE(n); +} + +static void fiops_rb_erase(struct rb_node *n, struct fiops_rb_root *root) +{ + if (root->left == n) + root->left = NULL; + rb_erase_init(n, &root->rb); + --root->count; +} + +static inline u64 max_vios(u64 min_vios, u64 vios) +{ + s64 delta = (s64)(vios - min_vios); + if (delta > 0) + min_vios = vios; + + return min_vios; +} + +static void fiops_update_min_vios(struct fiops_rb_root *service_tree) +{ + struct fiops_ioc *ioc; + + ioc = fiops_rb_first(service_tree); + if (!ioc) + return; + service_tree->min_vios = max_vios(service_tree->min_vios, ioc->vios); +} + +/* + * The fiopsd->service_trees holds all pending fiops_ioc's that have + * requests waiting to be processed. It is sorted in the order that + * we will service the queues. + */ +static void fiops_service_tree_add(struct fiops_data *fiopsd, + struct fiops_ioc *ioc) +{ + struct rb_node **p, *parent; + struct fiops_ioc *__ioc; + struct fiops_rb_root *service_tree = ioc_service_tree(ioc); + u64 vios; + int left; + + /* New added IOC */ + if (RB_EMPTY_NODE(&ioc->rb_node)) + vios = max_vios(service_tree->min_vios, ioc->vios); + else { + vios = ioc->vios; + /* ioc->service_tree might not equal to service_tree */ + fiops_rb_erase(&ioc->rb_node, ioc->service_tree); + ioc->service_tree = NULL; + } + + left = 1; + parent = NULL; + ioc->service_tree = service_tree; + p = &service_tree->rb.rb_node; + while (*p) { + struct rb_node **n; + + parent = *p; + __ioc = rb_entry(parent, struct fiops_ioc, rb_node); + + /* + * sort by key, that represents service time. 
+ */ + if (vios < __ioc->vios) + n = &(*p)->rb_left; + else { + n = &(*p)->rb_right; + left = 0; + } + + p = n; + } + + if (left) + service_tree->left = &ioc->rb_node; + + ioc->vios = vios; + rb_link_node(&ioc->rb_node, parent, p); + rb_insert_color(&ioc->rb_node, &service_tree->rb); + service_tree->count++; + + fiops_update_min_vios(service_tree); +} + +/* + * Update ioc's position in the service tree. + */ +static void fiops_resort_rr_list(struct fiops_data *fiopsd, + struct fiops_ioc *ioc) +{ + /* + * Resorting requires the ioc to be on the RR list already. + */ + if (fiops_ioc_on_rr(ioc)) + fiops_service_tree_add(fiopsd, ioc); +} + +/* + * add to busy list of queues for service, trying to be fair in ordering + * the pending list according to last request service + */ +static void fiops_add_ioc_rr(struct fiops_data *fiopsd, struct fiops_ioc *ioc) +{ + BUG_ON(fiops_ioc_on_rr(ioc)); + fiops_mark_ioc_on_rr(ioc); + + fiopsd->busy_queues++; + + fiops_resort_rr_list(fiopsd, ioc); +} + +/* + * Called when the ioc no longer has requests pending, remove it from + * the service tree. 
+ */ +static void fiops_del_ioc_rr(struct fiops_data *fiopsd, struct fiops_ioc *ioc) +{ + BUG_ON(!fiops_ioc_on_rr(ioc)); + fiops_clear_ioc_on_rr(ioc); + + if (!RB_EMPTY_NODE(&ioc->rb_node)) { + fiops_rb_erase(&ioc->rb_node, ioc->service_tree); + ioc->service_tree = NULL; + } + + BUG_ON(!fiopsd->busy_queues); + fiopsd->busy_queues--; +} + +/* + * rb tree support functions + */ +static void fiops_del_rq_rb(struct request *rq) +{ + struct fiops_ioc *ioc = RQ_CIC(rq); + + elv_rb_del(&ioc->sort_list, rq); +} + +static void fiops_add_rq_rb(struct request *rq) +{ + struct fiops_ioc *ioc = RQ_CIC(rq); + struct fiops_data *fiopsd = ioc->fiopsd; + + elv_rb_add(&ioc->sort_list, rq); + + if (!fiops_ioc_on_rr(ioc)) + fiops_add_ioc_rr(fiopsd, ioc); +} + +static void fiops_reposition_rq_rb(struct fiops_ioc *ioc, struct request *rq) +{ + elv_rb_del(&ioc->sort_list, rq); + fiops_add_rq_rb(rq); +} + +static void fiops_remove_request(struct request *rq) +{ + list_del_init(&rq->queuelist); + fiops_del_rq_rb(rq); +} + +static u64 fiops_scaled_vios(struct fiops_data *fiopsd, + struct fiops_ioc *ioc, struct request *rq) +{ + return VIOS_SCALE; +} + +/* return vios dispatched */ +static u64 fiops_dispatch_request(struct fiops_data *fiopsd, + struct fiops_ioc *ioc) +{ + struct request *rq; + struct request_queue *q = fiopsd->qdata.queue; + + rq = rq_entry_fifo(ioc->fifo.next); + + fiops_remove_request(rq); + elv_dispatch_sort(q, rq); + + return fiops_scaled_vios(fiopsd, ioc, rq); +} + +static int fiops_forced_dispatch(struct fiops_data *fiopsd) +{ + struct fiops_ioc *ioc; + int dispatched = 0; + + while ((ioc = fiops_rb_first(&fiopsd->service_tree)) != NULL) { + while (!list_empty(&ioc->fifo)) { + fiops_dispatch_request(fiopsd, ioc); + dispatched++; + } + if (fiops_ioc_on_rr(ioc)) + fiops_del_ioc_rr(fiopsd, ioc); + } + return dispatched; +} + +static struct fiops_ioc *fiops_select_ioc(struct fiops_data *fiopsd) +{ + struct fiops_ioc *ioc; + + if (RB_EMPTY_ROOT(&fiopsd->service_tree.rb)) + 
return NULL; + ioc = fiops_rb_first(&fiopsd->service_tree); + return ioc; +} + +static void fiops_charge_vios(struct fiops_data *fiopsd, + struct fiops_ioc *ioc, u64 vios) +{ + struct fiops_rb_root *service_tree = ioc->service_tree; + ioc->vios += vios; + + if (RB_EMPTY_ROOT(&ioc->sort_list)) + fiops_del_ioc_rr(fiopsd, ioc); + else + fiops_resort_rr_list(fiopsd, ioc); + + fiops_update_min_vios(service_tree); +} + +static int fiops_dispatch_requests(struct request_queue *q, int force) +{ + struct fiops_data *fiopsd = q->elevator->elevator_data; + struct fiops_ioc *ioc; + u64 vios; + + if (unlikely(force)) + return fiops_forced_dispatch(fiopsd); + + ioc = fiops_select_ioc(fiopsd); + if (!ioc) + return 0; + + vios = fiops_dispatch_request(fiopsd, ioc); + + fiops_charge_vios(fiopsd, ioc, vios); + return 1; +} + +static void fiops_insert_request(struct request_queue *q, struct request *rq) +{ + struct fiops_ioc *ioc = RQ_CIC(rq); + + rq_set_fifo_time(rq, jiffies); + + list_add_tail(&rq->queuelist, &ioc->fifo); + + fiops_add_rq_rb(rq); +} + +/* + * scheduler run of queue, if there are requests pending and no one in the + * driver that will restart queueing + */ +static inline void fiops_schedule_dispatch(struct fiops_data *fiopsd) +{ + if (fiopsd->busy_queues) + kblockd_schedule_work(fiopsd->qdata.queue, + &fiopsd->unplug_work); +} + +static int +fiops_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask) +{ + struct fiops_data *fiopsd = q->elevator->elevator_data; + struct dev_io_context *dev_ioc; + struct fiops_ioc *cic; + + might_sleep_if(gfp_mask & __GFP_WAIT); + + dev_ioc = queue_data_get_io_context(&ioc_builder, &fiopsd->qdata, + gfp_mask); + if (!dev_ioc) + goto queue_fail; + + cic = dev_ioc_to_fiops_ioc(dev_ioc); + + /* + * we hold a reference of dev_ioc and nobody else set this request, + * doesn't need locking + */ + rq->elevator_private[0] = cic; + + return 0; + +queue_fail: + fiops_schedule_dispatch(fiopsd); + return 1; +} + +static void 
fiops_put_request(struct request *rq) +{ + struct fiops_ioc *ioc = RQ_CIC(rq); + + if (ioc) { + rq->elevator_private[0] = NULL; + put_io_context(ioc->dev_ioc.ioc); + } +} + +static struct request * +fiops_find_rq_fmerge(struct fiops_data *fiopsd, struct bio *bio) +{ + struct task_struct *tsk = current; + struct dev_io_context *gen_cic; + struct fiops_ioc *cic; + + gen_cic = queue_data_cic_lookup(&fiopsd->qdata, tsk->io_context); + if (!gen_cic) + return NULL; + cic = dev_ioc_to_fiops_ioc(gen_cic); + + if (cic) { + sector_t sector = bio->bi_sector + bio_sectors(bio); + + return elv_rb_find(&cic->sort_list, sector); + } + + return NULL; +} + +static int fiops_merge(struct request_queue *q, struct request **req, + struct bio *bio) +{ + struct fiops_data *fiopsd = q->elevator->elevator_data; + struct request *__rq; + + __rq = fiops_find_rq_fmerge(fiopsd, bio); + if (__rq && elv_rq_merge_ok(__rq, bio)) { + *req = __rq; + return ELEVATOR_FRONT_MERGE; + } + + return ELEVATOR_NO_MERGE; +} + +static void fiops_merged_request(struct request_queue *q, struct request *req, + int type) +{ + if (type == ELEVATOR_FRONT_MERGE) { + struct fiops_ioc *ioc = RQ_CIC(req); + + fiops_reposition_rq_rb(ioc, req); + } +} + +static void +fiops_merged_requests(struct request_queue *q, struct request *rq, + struct request *next) +{ + struct fiops_ioc *ioc = RQ_CIC(rq); + struct fiops_data *fiopsd = q->elevator->elevator_data; + /* + * reposition in fifo if next is older than rq + */ + if (!list_empty(&rq->queuelist) && !list_empty(&next->queuelist) && + time_before(rq_fifo_time(next), rq_fifo_time(rq))) { + list_move(&rq->queuelist, &next->queuelist); + rq_set_fifo_time(rq, rq_fifo_time(next)); + } + + fiops_remove_request(next); + + ioc = RQ_CIC(next); + /* + * all requests of this task are merged to other tasks, delete it + * from the service tree. 
+ */ + if (fiops_ioc_on_rr(ioc) && RB_EMPTY_ROOT(&ioc->sort_list)) + fiops_del_ioc_rr(fiopsd, ioc); +} + +static int fiops_allow_merge(struct request_queue *q, struct request *rq, + struct bio *bio) +{ + struct fiops_data *fiopsd = q->elevator->elevator_data; + struct dev_io_context *gen_cic; + struct fiops_ioc *cic; + + /* + * Lookup the ioc that this bio will be queued with. Allow + * merge only if rq is queued there. + */ + gen_cic = queue_data_cic_lookup(&fiopsd->qdata, current->io_context); + if (!gen_cic) + return false; + cic = dev_ioc_to_fiops_ioc(gen_cic); + + return cic == RQ_CIC(rq); +} + +static void fiops_exit_queue(struct elevator_queue *e) +{ + struct fiops_data *fiopsd = e->elevator_data; + struct request_queue *q = fiopsd->qdata.queue; + + cancel_work_sync(&fiopsd->unplug_work); + + spin_lock_irq(q->queue_lock); + + ioc_builder_exit_queue(&ioc_builder, &fiopsd->qdata); + + spin_unlock_irq(q->queue_lock); + kfree(fiopsd); +} + +static void fiops_kick_queue(struct work_struct *work) +{ + struct fiops_data *fiopsd = + container_of(work, struct fiops_data, unplug_work); + struct request_queue *q = fiopsd->qdata.queue; + + spin_lock_irq(q->queue_lock); + __blk_run_queue(q); + spin_unlock_irq(q->queue_lock); +} + +static void *fiops_init_queue(struct request_queue *q) +{ + struct fiops_data *fiopsd; + + fiopsd = kzalloc_node(sizeof(*fiopsd), GFP_KERNEL, q->node); + if (!fiopsd) + return NULL; + + if (ioc_builder_init_queue(&ioc_builder, &fiopsd->qdata, q)) { + kfree(fiopsd); + return NULL; + } + + fiopsd->service_tree = FIOPS_RB_ROOT; + + INIT_WORK(&fiopsd->unplug_work, fiops_kick_queue); + + return fiopsd; +} + +static void fiops_slab_kill(void) +{ + /* + * Caller already ensured that pending RCU callbacks are completed, + * so we should have no busy allocations at this point. 
+ */ + if (fiops_ioc_pool) + kmem_cache_destroy(fiops_ioc_pool); +} + +static int __init fiops_slab_setup(void) +{ + fiops_ioc_pool = KMEM_CACHE(fiops_ioc, 0); + if (!fiops_ioc_pool) + return -ENOMEM; + + return 0; +} + +static struct dev_io_context * +fiops_alloc_ioc(struct ioc_builder *builder, struct queue_data *qdata, + gfp_t gfp_mask) +{ + struct fiops_ioc *ioc = kmem_cache_alloc_node(fiops_ioc_pool, + gfp_mask, qdata->queue->node); + if (ioc) + return &ioc->dev_ioc; + return NULL; +} + +static void fiops_free_ioc(struct ioc_builder *builder, + struct dev_io_context *dev_ioc) +{ + struct fiops_ioc *ioc = dev_ioc_to_fiops_ioc(dev_ioc); + kmem_cache_free(fiops_ioc_pool, ioc); +} + +static void fiops_init_cic(struct queue_data *qdata, + struct dev_io_context *gen_cic) +{ + struct fiops_data *fiopsd = queue_data_to_fiopsd(qdata); + struct fiops_ioc *ioc = dev_ioc_to_fiops_ioc(gen_cic); + + RB_CLEAR_NODE(&ioc->rb_node); + INIT_LIST_HEAD(&ioc->fifo); + ioc->sort_list = RB_ROOT; + + ioc->fiopsd = fiopsd; + + ioc->pid = current->pid; +} + +static void fiops_exit_cic(struct queue_data *qdata, + struct dev_io_context *gen_cic) +{ + struct fiops_ioc *ioc = dev_ioc_to_fiops_ioc(gen_cic); + + WARN_ON(fiops_ioc_on_rr(ioc)); +} + +static struct ioc_builder ioc_builder = { + .alloc_ioc = fiops_alloc_ioc, + .free_ioc = fiops_free_ioc, + .cic_init = fiops_init_cic, + .cic_exit = fiops_exit_cic, +}; + +static struct elevator_type iosched_fiops = { + .ops = { + .elevator_merge_fn = fiops_merge, + .elevator_merged_fn = fiops_merged_request, + .elevator_merge_req_fn = fiops_merged_requests, + .elevator_allow_merge_fn = fiops_allow_merge, + .elevator_dispatch_fn = fiops_dispatch_requests, + .elevator_add_req_fn = fiops_insert_request, + .elevator_former_req_fn = elv_rb_former_request, + .elevator_latter_req_fn = elv_rb_latter_request, + .elevator_set_req_fn = fiops_set_request, + .elevator_put_req_fn = fiops_put_request, + .elevator_init_fn = fiops_init_queue, + .elevator_exit_fn = 
fiops_exit_queue, + .trim = queue_data_free_io_context, + }, + .elevator_name = "fiops", + .elevator_owner = THIS_MODULE, +}; + +static int __init fiops_init(void) +{ + if (fiops_slab_setup()) + return -ENOMEM; + if (ioc_builder_init(&ioc_builder)) { + fiops_slab_kill(); + return -ENOMEM; + } + + elv_register(&iosched_fiops); + + return 0; +} + +static void __exit fiops_exit(void) +{ + elv_unregister(&iosched_fiops); + io_context_builder_exit(&ioc_builder); + fiops_slab_kill(); +} + +module_init(fiops_init); +module_exit(fiops_exit); + +MODULE_AUTHOR("Jens Axboe, Shaohua Li <shli@kernel.org>"); +MODULE_LICENSE("GPL"); +MODULE_DESCRIPTION("IOPS based IO scheduler"); Index: linux/block/Kconfig.iosched =================================================================== --- linux.orig/block/Kconfig.iosched 2011-12-28 09:42:18.000000000 +0800 +++ linux/block/Kconfig.iosched 2012-01-04 13:58:35.000000000 +0800 @@ -43,6 +43,14 @@ config CFQ_GROUP_IOSCHED ---help--- Enable group IO scheduling in CFQ. +config IOSCHED_FIOPS + tristate "IOPS based I/O scheduler" + default y + ---help--- + This is an IOPS based I/O scheduler. It will try to distribute + IOPS equally among all processes in the system. It's mainly for + Flash based storage. 
+ choice prompt "Default I/O scheduler" default DEFAULT_CFQ Index: linux/block/blk.h =================================================================== --- linux.orig/block/blk.h 2011-12-28 09:42:18.000000000 +0800 +++ linux/block/blk.h 2011-12-28 09:42:35.000000000 +0800 @@ -206,7 +206,7 @@ static inline void blk_throtl_exit(struc static inline void blk_throtl_release(struct request_queue *q) { } #endif /* CONFIG_BLK_DEV_THROTTLING */ -#if IS_ENABLED(CONFIG_IOSCHED_CFQ) +#if IS_ENABLED(CONFIG_IOSCHED_CFQ) || IS_ENABLED(CONFIG_IOSCHED_FIOPS) struct queue_data; struct ioc_builder { struct dev_io_context *(*alloc_ioc)(struct ioc_builder *builder, Index: linux/block/blk-ioc.c =================================================================== --- linux.orig/block/blk-ioc.c 2011-12-28 09:42:18.000000000 +0800 +++ linux/block/blk-ioc.c 2011-12-28 09:42:35.000000000 +0800 @@ -164,7 +164,7 @@ static int __init blk_ioc_init(void) } subsys_initcall(blk_ioc_init); -#if IS_ENABLED(CONFIG_IOSCHED_CFQ) +#if IS_ENABLED(CONFIG_IOSCHED_CFQ) || IS_ENABLED(CONFIG_IOSCHED_FIOPS) #define CIC_DEAD_INDEX_SHIFT 1 static inline void *queue_data_dead_key(struct queue_data *qdata) ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC 2/3]block: FIOPS ioscheduler core 2012-01-04 6:53 ` [RFC 2/3]block: FIOPS ioscheduler core Shaohua Li @ 2012-01-06 6:05 ` Namjae Jeon 2012-01-07 1:06 ` Zhu Yanhai 1 sibling, 0 replies; 29+ messages in thread From: Namjae Jeon @ 2012-01-06 6:05 UTC (permalink / raw) To: Shaohua Li; +Cc: linux-kernel, axboe, vgoyal, jmoyer > =================================================================== > --- linux.orig/block/Kconfig.iosched 2011-12-28 09:42:18.000000000 +0800 > +++ linux/block/Kconfig.iosched 2012-01-04 13:58:35.000000000 +0800 > @@ -43,6 +43,14 @@ config CFQ_GROUP_IOSCHED > ---help--- > Enable group IO scheduling in CFQ. > > +config IOSCHED_FIOPS > + tristate "IOPS based I/O scheduler" > + default y > + ---help--- > + This is an IOPS based I/O scheduler. It will try to distribute > + IOPS equally among all processes in the system. It's mainly for > + Flash based storage. > + > choice > prompt "Default I/O scheduler" > default DEFAULT_CFQ Hi. Isn't the code below needed as well? choice prompt "Default I/O scheduler" default DEFAULT_CFQ help Select the I/O scheduler which will be used by default for all block devices. config DEFAULT_DEADLINE bool "Deadline" if IOSCHED_DEADLINE=y config DEFAULT_CFQ bool "CFQ" if IOSCHED_CFQ=y + config DEFAULT_FIOPS + bool "FIOPS" if IOSCHED_FIOPS=y config DEFAULT_NOOP bool "No-op" config DEFAULT_IOSCHED string default "deadline" if DEFAULT_DEADLINE default "cfq" if DEFAULT_CFQ default "noop" if DEFAULT_NOOP + default "fiops" if DEFAULT_FIOPS ^ permalink raw reply [flat|nested] 29+ messages in thread
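[Editor's note] For reference, the complete choice block with the hunks Namjae suggests applied would look roughly as follows. This is a sketch against the block/Kconfig.iosched of that era; exact context lines and help text may differ:

```
choice
	prompt "Default I/O scheduler"
	default DEFAULT_CFQ
	help
	  Select the I/O scheduler which will be used by default for all
	  block devices.

	config DEFAULT_DEADLINE
		bool "Deadline" if IOSCHED_DEADLINE=y

	config DEFAULT_CFQ
		bool "CFQ" if IOSCHED_CFQ=y

	config DEFAULT_FIOPS
		bool "FIOPS" if IOSCHED_FIOPS=y

	config DEFAULT_NOOP
		bool "No-op"

endchoice

config DEFAULT_IOSCHED
	string
	default "deadline" if DEFAULT_DEADLINE
	default "cfq" if DEFAULT_CFQ
	default "fiops" if DEFAULT_FIOPS
	default "noop" if DEFAULT_NOOP
```

Without these entries, fiops can be selected at runtime via /sys/block/<dev>/queue/scheduler but never chosen as the build-time default.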
* Re: [RFC 2/3]block: FIOPS ioscheduler core 2012-01-04 6:53 ` [RFC 2/3]block: FIOPS ioscheduler core Shaohua Li 2012-01-06 6:05 ` Namjae Jeon @ 2012-01-07 1:06 ` Zhu Yanhai 1 sibling, 0 replies; 29+ messages in thread From: Zhu Yanhai @ 2012-01-07 1:06 UTC (permalink / raw) To: Shaohua Li; +Cc: linux-kernel, axboe, vgoyal, jmoyer 2012/1/4 Shaohua Li <shaohua.li@intel.com>: > FIOPS (Fair IOPS) is an IOPS based ioscheduler, so it only targets > drives without I/O seek. It's quite similar to CFQ, but the dispatch > decision is made according to IOPS instead of slice. > > The algorithm is simple. The drive has a service tree, and each task lives in > the tree. The key into the tree is called vios (virtual I/O). Every request > has a vios, which is calculated according to its ioprio, request size and so > on. A task's vios is the sum of the vios of all requests it dispatches. FIOPS > always selects the task with minimum vios in the service tree and lets that task > dispatch a request. The dispatched request's vios is then added to the task's > vios and the task is repositioned in the service tree. > > Unlike CFQ, FIOPS doesn't have separate sync/async queues, because with I/O-less > writeback a task usually dispatches either sync or async requests, not both. > Biasing reads or writes can still be done with the read/write scale. > > One issue is that if the workload's iodepth is lower than the drive's queue_depth, > a task's IOPS share might not strictly follow its priority, request > size and so on. In this case the drive actually has idle capacity. Solving the > problem would require idling the drive, which would hurt performance. I believe CFQ isn't > completely fair between tasks in such cases either. 
> > Signed-off-by: Shaohua Li <shaohua.li@intel.com> > --- > block/Kconfig.iosched | 8 > block/Makefile | 1 > block/blk-ioc.c | 2 > block/blk.h | 2 > block/fiops-iosched.c | 659 ++++++++++++++++++++++++++++++++++++++++++++++++++ > 5 files changed, 670 insertions(+), 2 deletions(-) > > Index: linux/block/Makefile > =================================================================== > --- linux.orig/block/Makefile 2011-12-28 09:42:18.000000000 +0800 > +++ linux/block/Makefile 2011-12-28 09:42:35.000000000 +0800 > @@ -14,6 +14,7 @@ obj-$(CONFIG_BLK_DEV_THROTTLING) += blk- > obj-$(CONFIG_IOSCHED_NOOP) += noop-iosched.o > obj-$(CONFIG_IOSCHED_DEADLINE) += deadline-iosched.o > obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o > +obj-$(CONFIG_IOSCHED_FIOPS) += fiops-iosched.o > > obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o > obj-$(CONFIG_BLK_DEV_INTEGRITY) += blk-integrity.o > Index: linux/block/fiops-iosched.c > =================================================================== > --- /dev/null 1970-01-01 00:00:00.000000000 +0000 > +++ linux/block/fiops-iosched.c 2012-01-04 14:01:26.000000000 +0800 > @@ -0,0 +1,659 @@ > +/* > + * IOPS based IO scheduler. Based on CFQ. 
> + * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk> > + * Shaohua Li <shli@kernel.org> > + */ > +#include <linux/module.h> > +#include <linux/slab.h> > +#include <linux/blkdev.h> > +#include <linux/elevator.h> > +#include <linux/jiffies.h> > +#include <linux/rbtree.h> > +#include <linux/ioprio.h> > +#include "blk.h" > + > +#define VIOS_SCALE_SHIFT 10 > +#define VIOS_SCALE (1 << VIOS_SCALE_SHIFT) > + > +struct fiops_rb_root { > + struct rb_root rb; > + struct rb_node *left; > + unsigned count; > + > + u64 min_vios; > +}; > +#define FIOPS_RB_ROOT (struct fiops_rb_root) { .rb = RB_ROOT} > + > +struct fiops_data { > + struct queue_data qdata; > + > + struct fiops_rb_root service_tree; > + > + unsigned int busy_queues; > + > + struct work_struct unplug_work; > +}; > + > +struct fiops_ioc { > + struct dev_io_context dev_ioc; > + > + unsigned int flags; > + struct fiops_data *fiopsd; > + struct rb_node rb_node; > + u64 vios; /* key in service_tree */ > + struct fiops_rb_root *service_tree; > + > + struct rb_root sort_list; > + struct list_head fifo; > + > + pid_t pid; > +}; > + > +static struct kmem_cache *fiops_ioc_pool; > +static struct ioc_builder ioc_builder; > +#define queue_data_to_fiopsd(ptr) container_of(ptr, struct fiops_data, qdata) > +#define dev_ioc_to_fiops_ioc(ptr) container_of(ptr, struct fiops_ioc, dev_ioc) > +#define ioc_service_tree(ioc) (&((ioc)->fiopsd->service_tree)) > + > +#define RQ_CIC(rq) \ > + ((struct fiops_ioc *) (rq)->elevator_private[0]) > +enum ioc_state_flags { > + FIOPS_IOC_FLAG_on_rr = 0, /* on round-robin busy list */ > +}; > + > +#define FIOPS_IOC_FNS(name) \ > +static inline void fiops_mark_ioc_##name(struct fiops_ioc *ioc) \ > +{ \ > + ioc->flags |= (1 << FIOPS_IOC_FLAG_##name); \ > +} \ > +static inline void fiops_clear_ioc_##name(struct fiops_ioc *ioc) \ > +{ \ > + ioc->flags &= ~(1 << FIOPS_IOC_FLAG_##name); \ > +} \ > +static inline int fiops_ioc_##name(const struct fiops_ioc *ioc) \ > +{ \ > + return ((ioc)->flags & (1 << 
FIOPS_IOC_FLAG_##name)) != 0; \ > +} > + > +FIOPS_IOC_FNS(on_rr); > +#undef FIOPS_IOC_FNS > + > +/* > + * The below is leftmost cache rbtree addon > + */ > +static struct fiops_ioc *fiops_rb_first(struct fiops_rb_root *root) > +{ > + /* Service tree is empty */ > + if (!root->count) > + return NULL; > + > + if (!root->left) > + root->left = rb_first(&root->rb); > + > + if (root->left) > + return rb_entry(root->left, struct fiops_ioc, rb_node); > + > + return NULL; > +} > + > +static void rb_erase_init(struct rb_node *n, struct rb_root *root) > +{ > + rb_erase(n, root); > + RB_CLEAR_NODE(n); > +} > + > +static void fiops_rb_erase(struct rb_node *n, struct fiops_rb_root *root) > +{ > + if (root->left == n) > + root->left = NULL; > + rb_erase_init(n, &root->rb); > + --root->count; > +} > + > +static inline u64 max_vios(u64 min_vios, u64 vios) > +{ > + s64 delta = (s64)(vios - min_vios); > + if (delta > 0) > + min_vios = vios; > + > + return min_vios; > +} > + > +static void fiops_update_min_vios(struct fiops_rb_root *service_tree) > +{ > + struct fiops_ioc *ioc; > + > + ioc = fiops_rb_first(service_tree); > + if (!ioc) > + return; > + service_tree->min_vios = max_vios(service_tree->min_vios, ioc->vios); This is tracking the max vios here, not the min one. Should be min_vios()? > +} > + > +/* > + * The fiopsd->service_trees holds all pending fiops_ioc's that have > + * requests waiting to be processed. It is sorted in the order that > + * we will service the queues. 
> + */ > +static void fiops_service_tree_add(struct fiops_data *fiopsd, > + struct fiops_ioc *ioc) > +{ > + struct rb_node **p, *parent; > + struct fiops_ioc *__ioc; > + struct fiops_rb_root *service_tree = ioc_service_tree(ioc); > + u64 vios; > + int left; > + > + /* New added IOC */ > + if (RB_EMPTY_NODE(&ioc->rb_node)) > + vios = max_vios(service_tree->min_vios, ioc->vios); > + else { > + vios = ioc->vios; > + /* ioc->service_tree might not equal to service_tree */ > + fiops_rb_erase(&ioc->rb_node, ioc->service_tree); > + ioc->service_tree = NULL; > + } > + > + left = 1; > + parent = NULL; > + ioc->service_tree = service_tree; > + p = &service_tree->rb.rb_node; > + while (*p) { > + struct rb_node **n; > + > + parent = *p; > + __ioc = rb_entry(parent, struct fiops_ioc, rb_node); > + > + /* > + * sort by key, that represents service time. > + */ > + if (vios < __ioc->vios) > + n = &(*p)->rb_left; > + else { > + n = &(*p)->rb_right; > + left = 0; > + } > + > + p = n; > + } > + > + if (left) > + service_tree->left = &ioc->rb_node; > + > + ioc->vios = vios; > + rb_link_node(&ioc->rb_node, parent, p); > + rb_insert_color(&ioc->rb_node, &service_tree->rb); > + service_tree->count++; > + > + fiops_update_min_vios(service_tree); > +} > + > +/* > + * Update ioc's position in the service tree. > + */ > +static void fiops_resort_rr_list(struct fiops_data *fiopsd, > + struct fiops_ioc *ioc) > +{ > + /* > + * Resorting requires the ioc to be on the RR list already. 
> + */ > + if (fiops_ioc_on_rr(ioc)) > + fiops_service_tree_add(fiopsd, ioc); > +} > + > +/* > + * add to busy list of queues for service, trying to be fair in ordering > + * the pending list according to last request service > + */ > +static void fiops_add_ioc_rr(struct fiops_data *fiopsd, struct fiops_ioc *ioc) > +{ > + BUG_ON(fiops_ioc_on_rr(ioc)); > + fiops_mark_ioc_on_rr(ioc); > + > + fiopsd->busy_queues++; > + > + fiops_resort_rr_list(fiopsd, ioc); > +} > + > +/* > + * Called when the ioc no longer has requests pending, remove it from > + * the service tree. > + */ > +static void fiops_del_ioc_rr(struct fiops_data *fiopsd, struct fiops_ioc *ioc) > +{ > + BUG_ON(!fiops_ioc_on_rr(ioc)); > + fiops_clear_ioc_on_rr(ioc); > + > + if (!RB_EMPTY_NODE(&ioc->rb_node)) { > + fiops_rb_erase(&ioc->rb_node, ioc->service_tree); > + ioc->service_tree = NULL; > + } > + > + BUG_ON(!fiopsd->busy_queues); > + fiopsd->busy_queues--; > +} > + > +/* > + * rb tree support functions > + */ > +static void fiops_del_rq_rb(struct request *rq) > +{ > + struct fiops_ioc *ioc = RQ_CIC(rq); > + > + elv_rb_del(&ioc->sort_list, rq); > +} > + > +static void fiops_add_rq_rb(struct request *rq) > +{ > + struct fiops_ioc *ioc = RQ_CIC(rq); > + struct fiops_data *fiopsd = ioc->fiopsd; > + > + elv_rb_add(&ioc->sort_list, rq); > + > + if (!fiops_ioc_on_rr(ioc)) > + fiops_add_ioc_rr(fiopsd, ioc); > +} > + > +static void fiops_reposition_rq_rb(struct fiops_ioc *ioc, struct request *rq) > +{ > + elv_rb_del(&ioc->sort_list, rq); > + fiops_add_rq_rb(rq); > +} > + > +static void fiops_remove_request(struct request *rq) > +{ > + list_del_init(&rq->queuelist); > + fiops_del_rq_rb(rq); > +} > + > +static u64 fiops_scaled_vios(struct fiops_data *fiopsd, > + struct fiops_ioc *ioc, struct request *rq) > +{ > + return VIOS_SCALE; > +} > + > +/* return vios dispatched */ > +static u64 fiops_dispatch_request(struct fiops_data *fiopsd, > + struct fiops_ioc *ioc) > +{ > + struct request *rq; > + struct request_queue 
*q = fiopsd->qdata.queue; > + > + rq = rq_entry_fifo(ioc->fifo.next); > + > + fiops_remove_request(rq); > + elv_dispatch_sort(q, rq); > + > + return fiops_scaled_vios(fiopsd, ioc, rq); > +} > + > +static int fiops_forced_dispatch(struct fiops_data *fiopsd) > +{ > + struct fiops_ioc *ioc; > + int dispatched = 0; > + > + while ((ioc = fiops_rb_first(&fiopsd->service_tree)) != NULL) { > + while (!list_empty(&ioc->fifo)) { > + fiops_dispatch_request(fiopsd, ioc); > + dispatched++; > + } > + if (fiops_ioc_on_rr(ioc)) > + fiops_del_ioc_rr(fiopsd, ioc); > + } > + return dispatched; > +} > + > +static struct fiops_ioc *fiops_select_ioc(struct fiops_data *fiopsd) > +{ > + struct fiops_ioc *ioc; > + > + if (RB_EMPTY_ROOT(&fiopsd->service_tree.rb)) > + return NULL; > + ioc = fiops_rb_first(&fiopsd->service_tree); > + return ioc; > +} > + > +static void fiops_charge_vios(struct fiops_data *fiopsd, > + struct fiops_ioc *ioc, u64 vios) > +{ > + struct fiops_rb_root *service_tree = ioc->service_tree; > + ioc->vios += vios; > + > + if (RB_EMPTY_ROOT(&ioc->sort_list)) > + fiops_del_ioc_rr(fiopsd, ioc); > + else > + fiops_resort_rr_list(fiopsd, ioc); > + > + fiops_update_min_vios(service_tree); > +} > + > +static int fiops_dispatch_requests(struct request_queue *q, int force) > +{ > + struct fiops_data *fiopsd = q->elevator->elevator_data; > + struct fiops_ioc *ioc; > + u64 vios; > + > + if (unlikely(force)) > + return fiops_forced_dispatch(fiopsd); > + > + ioc = fiops_select_ioc(fiopsd); > + if (!ioc) > + return 0; > + > + vios = fiops_dispatch_request(fiopsd, ioc); > + > + fiops_charge_vios(fiopsd, ioc, vios); > + return 1; > +} > + > +static void fiops_insert_request(struct request_queue *q, struct request *rq) > +{ > + struct fiops_ioc *ioc = RQ_CIC(rq); > + > + rq_set_fifo_time(rq, jiffies); > + > + list_add_tail(&rq->queuelist, &ioc->fifo); > + > + fiops_add_rq_rb(rq); > +} > + > +/* > + * scheduler run of queue, if there are requests pending and no one in the > + * driver 
that will restart queueing > + */ > +static inline void fiops_schedule_dispatch(struct fiops_data *fiopsd) > +{ > + if (fiopsd->busy_queues) > + kblockd_schedule_work(fiopsd->qdata.queue, > + &fiopsd->unplug_work); > +} > + > +static int > +fiops_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask) > +{ > + struct fiops_data *fiopsd = q->elevator->elevator_data; > + struct dev_io_context *dev_ioc; > + struct fiops_ioc *cic; > + > + might_sleep_if(gfp_mask & __GFP_WAIT); > + > + dev_ioc = queue_data_get_io_context(&ioc_builder, &fiopsd->qdata, > + gfp_mask); > + if (!dev_ioc) > + goto queue_fail; > + > + cic = dev_ioc_to_fiops_ioc(dev_ioc); > + > + /* > + * we hold a reference of dev_ioc and nobody else set this request, > + * doesn't need locking > + */ > + rq->elevator_private[0] = cic; > + > + return 0; > + > +queue_fail: > + fiops_schedule_dispatch(fiopsd); > + return 1; > +} > + > +static void fiops_put_request(struct request *rq) > +{ > + struct fiops_ioc *ioc = RQ_CIC(rq); > + > + if (ioc) { > + rq->elevator_private[0] = NULL; > + put_io_context(ioc->dev_ioc.ioc); > + } > +} > + > +static struct request * > +fiops_find_rq_fmerge(struct fiops_data *fiopsd, struct bio *bio) > +{ > + struct task_struct *tsk = current; > + struct dev_io_context *gen_cic; > + struct fiops_ioc *cic; > + > + gen_cic = queue_data_cic_lookup(&fiopsd->qdata, tsk->io_context); > + if (!gen_cic) > + return NULL; > + cic = dev_ioc_to_fiops_ioc(gen_cic); > + > + if (cic) { > + sector_t sector = bio->bi_sector + bio_sectors(bio); > + > + return elv_rb_find(&cic->sort_list, sector); > + } > + > + return NULL; > +} > + > +static int fiops_merge(struct request_queue *q, struct request **req, > + struct bio *bio) > +{ > + struct fiops_data *fiopsd = q->elevator->elevator_data; > + struct request *__rq; > + > + __rq = fiops_find_rq_fmerge(fiopsd, bio); > + if (__rq && elv_rq_merge_ok(__rq, bio)) { > + *req = __rq; > + return ELEVATOR_FRONT_MERGE; > + } > + > + return 
ELEVATOR_NO_MERGE; > +} > + > +static void fiops_merged_request(struct request_queue *q, struct request *req, > + int type) > +{ > + if (type == ELEVATOR_FRONT_MERGE) { > + struct fiops_ioc *ioc = RQ_CIC(req); > + > + fiops_reposition_rq_rb(ioc, req); > + } > +} > + > +static void > +fiops_merged_requests(struct request_queue *q, struct request *rq, > + struct request *next) > +{ > + struct fiops_ioc *ioc = RQ_CIC(rq); > + struct fiops_data *fiopsd = q->elevator->elevator_data; > + /* > + * reposition in fifo if next is older than rq > + */ > + if (!list_empty(&rq->queuelist) && !list_empty(&next->queuelist) && > + time_before(rq_fifo_time(next), rq_fifo_time(rq))) { > + list_move(&rq->queuelist, &next->queuelist); > + rq_set_fifo_time(rq, rq_fifo_time(next)); > + } > + > + fiops_remove_request(next); > + > + ioc = RQ_CIC(next); > + /* > + * all requests of this task are merged to other tasks, delete it > + * from the service tree. > + */ > + if (fiops_ioc_on_rr(ioc) && RB_EMPTY_ROOT(&ioc->sort_list)) > + fiops_del_ioc_rr(fiopsd, ioc); > +} > + > +static int fiops_allow_merge(struct request_queue *q, struct request *rq, > + struct bio *bio) > +{ > + struct fiops_data *fiopsd = q->elevator->elevator_data; > + struct dev_io_context *gen_cic; > + struct fiops_ioc *cic; > + > + /* > + * Lookup the ioc that this bio will be queued with. Allow > + * merge only if rq is queued there. 
> + */ > + gen_cic = queue_data_cic_lookup(&fiopsd->qdata, current->io_context); > + if (!gen_cic) > + return false; > + cic = dev_ioc_to_fiops_ioc(gen_cic); > + > + return cic == RQ_CIC(rq); > +} > + > +static void fiops_exit_queue(struct elevator_queue *e) > +{ > + struct fiops_data *fiopsd = e->elevator_data; > + struct request_queue *q = fiopsd->qdata.queue; > + > + cancel_work_sync(&fiopsd->unplug_work); > + > + spin_lock_irq(q->queue_lock); > + > + ioc_builder_exit_queue(&ioc_builder, &fiopsd->qdata); > + > + spin_unlock_irq(q->queue_lock); > + kfree(fiopsd); > +} > + > +static void fiops_kick_queue(struct work_struct *work) > +{ > + struct fiops_data *fiopsd = > + container_of(work, struct fiops_data, unplug_work); > + struct request_queue *q = fiopsd->qdata.queue; > + > + spin_lock_irq(q->queue_lock); > + __blk_run_queue(q); > + spin_unlock_irq(q->queue_lock); > +} > + > +static void *fiops_init_queue(struct request_queue *q) > +{ > + struct fiops_data *fiopsd; > + > + fiopsd = kzalloc_node(sizeof(*fiopsd), GFP_KERNEL, q->node); > + if (!fiopsd) > + return NULL; > + > + if (ioc_builder_init_queue(&ioc_builder, &fiopsd->qdata, q)) { > + kfree(fiopsd); > + return NULL; > + } > + > + fiopsd->service_tree = FIOPS_RB_ROOT; > + > + INIT_WORK(&fiopsd->unplug_work, fiops_kick_queue); > + > + return fiopsd; > +} > + > +static void fiops_slab_kill(void) > +{ > + /* > + * Caller already ensured that pending RCU callbacks are completed, > + * so we should have no busy allocations at this point. 
> + */ > + if (fiops_ioc_pool) > + kmem_cache_destroy(fiops_ioc_pool); > +} > + > +static int __init fiops_slab_setup(void) > +{ > + fiops_ioc_pool = KMEM_CACHE(fiops_ioc, 0); > + if (!fiops_ioc_pool) > + return -ENOMEM; > + > + return 0; > +} > + > +static struct dev_io_context * > +fiops_alloc_ioc(struct ioc_builder *builder, struct queue_data *qdata, > + gfp_t gfp_mask) > +{ > + struct fiops_ioc *ioc = kmem_cache_alloc_node(fiops_ioc_pool, > + gfp_mask, qdata->queue->node); > + if (ioc) > + return &ioc->dev_ioc; > + return NULL; > +} > + > +static void fiops_free_ioc(struct ioc_builder *builder, > + struct dev_io_context *dev_ioc) > +{ > + struct fiops_ioc *ioc = dev_ioc_to_fiops_ioc(dev_ioc); > + kmem_cache_free(fiops_ioc_pool, ioc); > +} > + > +static void fiops_init_cic(struct queue_data *qdata, > + struct dev_io_context *gen_cic) > +{ > + struct fiops_data *fiopsd = queue_data_to_fiopsd(qdata); > + struct fiops_ioc *ioc = dev_ioc_to_fiops_ioc(gen_cic); > + > + RB_CLEAR_NODE(&ioc->rb_node); > + INIT_LIST_HEAD(&ioc->fifo); > + ioc->sort_list = RB_ROOT; > + > + ioc->fiopsd = fiopsd; > + > + ioc->pid = current->pid; > +} > + > +static void fiops_exit_cic(struct queue_data *qdata, > + struct dev_io_context *gen_cic) > +{ > + struct fiops_ioc *ioc = dev_ioc_to_fiops_ioc(gen_cic); > + > + WARN_ON(fiops_ioc_on_rr(ioc)); > +} > + > +static struct ioc_builder ioc_builder = { > + .alloc_ioc = fiops_alloc_ioc, > + .free_ioc = fiops_free_ioc, > + .cic_init = fiops_init_cic, > + .cic_exit = fiops_exit_cic, > +}; > + > +static struct elevator_type iosched_fiops = { > + .ops = { > + .elevator_merge_fn = fiops_merge, > + .elevator_merged_fn = fiops_merged_request, > + .elevator_merge_req_fn = fiops_merged_requests, > + .elevator_allow_merge_fn = fiops_allow_merge, > + .elevator_dispatch_fn = fiops_dispatch_requests, > + .elevator_add_req_fn = fiops_insert_request, > + .elevator_former_req_fn = elv_rb_former_request, > + .elevator_latter_req_fn = elv_rb_latter_request, > + 
.elevator_set_req_fn = fiops_set_request, > + .elevator_put_req_fn = fiops_put_request, > + .elevator_init_fn = fiops_init_queue, > + .elevator_exit_fn = fiops_exit_queue, > + .trim = queue_data_free_io_context, > + }, > + .elevator_name = "fiops", > + .elevator_owner = THIS_MODULE, > +}; > + > +static int __init fiops_init(void) > +{ > + if (fiops_slab_setup()) > + return -ENOMEM; > + if (ioc_builder_init(&ioc_builder)) { > + fiops_slab_kill(); > + return -ENOMEM; > + } > + > + elv_register(&iosched_fiops); > + > + return 0; > +} > + > +static void __exit fiops_exit(void) > +{ > + elv_unregister(&iosched_fiops); > + io_context_builder_exit(&ioc_builder); > + fiops_slab_kill(); > +} > + > +module_init(fiops_init); > +module_exit(fiops_exit); > + > +MODULE_AUTHOR("Jens Axboe, Shaohua Li <shli@kernel.org>"); > +MODULE_LICENSE("GPL"); > +MODULE_DESCRIPTION("IOPS based IO scheduler"); > Index: linux/block/Kconfig.iosched > =================================================================== > --- linux.orig/block/Kconfig.iosched 2011-12-28 09:42:18.000000000 +0800 > +++ linux/block/Kconfig.iosched 2012-01-04 13:58:35.000000000 +0800 > @@ -43,6 +43,14 @@ config CFQ_GROUP_IOSCHED > ---help--- > Enable group IO scheduling in CFQ. > > +config IOSCHED_FIOPS > + tristate "IOPS based I/O scheduler" > + default y > + ---help--- > + This is an IOPS based I/O scheduler. It will try to distribute > + IOPS equally among all processes in the system. It's mainly for > + Flash based storage. 
> + > choice > prompt "Default I/O scheduler" > default DEFAULT_CFQ > Index: linux/block/blk.h > =================================================================== > --- linux.orig/block/blk.h 2011-12-28 09:42:18.000000000 +0800 > +++ linux/block/blk.h 2011-12-28 09:42:35.000000000 +0800 > @@ -206,7 +206,7 @@ static inline void blk_throtl_exit(struc > static inline void blk_throtl_release(struct request_queue *q) { } > #endif /* CONFIG_BLK_DEV_THROTTLING */ > > -#if IS_ENABLED(CONFIG_IOSCHED_CFQ) > +#if IS_ENABLED(CONFIG_IOSCHED_CFQ) || IS_ENABLED(CONFIG_IOSCHED_FIOPS) > struct queue_data; > struct ioc_builder { > struct dev_io_context *(*alloc_ioc)(struct ioc_builder *builder, > Index: linux/block/blk-ioc.c > =================================================================== > --- linux.orig/block/blk-ioc.c 2011-12-28 09:42:18.000000000 +0800 > +++ linux/block/blk-ioc.c 2011-12-28 09:42:35.000000000 +0800 > @@ -164,7 +164,7 @@ static int __init blk_ioc_init(void) > } > subsys_initcall(blk_ioc_init); > > -#if IS_ENABLED(CONFIG_IOSCHED_CFQ) > +#if IS_ENABLED(CONFIG_IOSCHED_CFQ) || IS_ENABLED(CONFIG_IOSCHED_FIOPS) > #define CIC_DEAD_INDEX_SHIFT 1 > > static inline void *queue_data_dead_key(struct queue_data *qdata) > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 29+ messages in thread
* [RFC 3/3]block: fiops read/write request scale 2012-01-04 6:53 [RFC 0/3]block: An IOPS based ioscheduler Shaohua Li 2012-01-04 6:53 ` [RFC 1/3]block: seperate CFQ io context management code Shaohua Li 2012-01-04 6:53 ` [RFC 2/3]block: FIOPS ioscheduler core Shaohua Li @ 2012-01-04 6:53 ` Shaohua Li 2012-01-04 7:19 ` [RFC 0/3]block: An IOPS based ioscheduler Dave Chinner ` (2 subsequent siblings) 5 siblings, 0 replies; 29+ messages in thread From: Shaohua Li @ 2012-01-04 6:53 UTC (permalink / raw) To: linux-kernel; +Cc: axboe, vgoyal, jmoyer, Shaohua Li [-- Attachment #1: fiops-rw-scale.patch --] [-- Type: text/plain, Size: 3614 bytes --] Read and write speeds of flash based storage usually differ. Add a scale to differentiate reads and writes, plus a tunable so users can assign different scales for each. This can also be used to bias reads or writes in the system. Signed-off-by: Shaohua Li <shaohua.li@intel.com> --- block/fiops-iosched.c | 69 +++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 68 insertions(+), 1 deletion(-) Index: linux/block/fiops-iosched.c =================================================================== --- linux.orig/block/fiops-iosched.c 2012-01-04 14:01:26.000000000 +0800 +++ linux/block/fiops-iosched.c 2012-01-04 14:05:48.000000000 +0800 @@ -15,6 +15,9 @@ #define VIOS_SCALE_SHIFT 10 #define VIOS_SCALE (1 << VIOS_SCALE_SHIFT) +#define VIOS_READ_SCALE (1) +#define VIOS_WRITE_SCALE (1) + struct fiops_rb_root { struct rb_root rb; struct rb_node *left; @@ -32,6 +35,9 @@ struct fiops_data { unsigned int busy_queues; struct work_struct unplug_work; + + unsigned int read_scale; + unsigned int write_scale; }; struct fiops_ioc { @@ -268,7 +274,10 @@ static void fiops_remove_request(struct static u64 fiops_scaled_vios(struct fiops_data *fiopsd, struct fiops_ioc *ioc, struct request *rq) { - return VIOS_SCALE; + if (rq_data_dir(rq) == READ) + return VIOS_SCALE; + else + return VIOS_SCALE * fiopsd->write_scale / 
fiopsd->read_scale; } /* return vios dispatched */ @@ -540,6 +549,9 @@ static void *fiops_init_queue(struct req INIT_WORK(&fiopsd->unplug_work, fiops_kick_queue); + fiopsd->read_scale = VIOS_READ_SCALE; + fiopsd->write_scale = VIOS_WRITE_SCALE; + return fiopsd; } @@ -610,6 +622,60 @@ static struct ioc_builder ioc_builder = .cic_exit = fiops_exit_cic, }; +/* + * sysfs parts below --> + */ +static ssize_t +fiops_var_show(unsigned int var, char *page) +{ + return sprintf(page, "%d\n", var); +} + +static ssize_t +fiops_var_store(unsigned int *var, const char *page, size_t count) +{ + char *p = (char *) page; + + *var = simple_strtoul(p, &p, 10); + return count; +} + +#define SHOW_FUNCTION(__FUNC, __VAR) \ +static ssize_t __FUNC(struct elevator_queue *e, char *page) \ +{ \ + struct fiops_data *fiopsd = e->elevator_data; \ + return fiops_var_show(__VAR, (page)); \ +} +SHOW_FUNCTION(fiops_read_scale_show, fiopsd->read_scale); +SHOW_FUNCTION(fiops_write_scale_show, fiopsd->write_scale); +#undef SHOW_FUNCTION + +#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX) \ +static ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count) \ +{ \ + struct fiops_data *fiopsd = e->elevator_data; \ + unsigned int __data; \ + int ret = fiops_var_store(&__data, (page), count); \ + if (__data < (MIN)) \ + __data = (MIN); \ + else if (__data > (MAX)) \ + __data = (MAX); \ + *(__PTR) = __data; \ + return ret; \ +} +STORE_FUNCTION(fiops_read_scale_store, &fiopsd->read_scale, 1, 100); +STORE_FUNCTION(fiops_write_scale_store, &fiopsd->write_scale, 1, 100); +#undef STORE_FUNCTION + +#define FIOPS_ATTR(name) \ + __ATTR(name, S_IRUGO|S_IWUSR, fiops_##name##_show, fiops_##name##_store) + +static struct elv_fs_entry fiops_attrs[] = { + FIOPS_ATTR(read_scale), + FIOPS_ATTR(write_scale), + __ATTR_NULL +}; + static struct elevator_type iosched_fiops = { .ops = { .elevator_merge_fn = fiops_merge, @@ -626,6 +692,7 @@ static struct elevator_type iosched_fiop .elevator_exit_fn = 
fiops_exit_queue, .trim = queue_data_free_io_context, }, + .elevator_attrs = fiops_attrs, .elevator_name = "fiops", .elevator_owner = THIS_MODULE, }; ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC 0/3]block: An IOPS based ioscheduler 2012-01-04 6:53 [RFC 0/3]block: An IOPS based ioscheduler Shaohua Li ` (2 preceding siblings ...) 2012-01-04 6:53 ` [RFC 3/3]block: fiops read/write request scale Shaohua Li @ 2012-01-04 7:19 ` Dave Chinner 2012-01-05 6:50 ` Shaohua Li 2012-01-06 9:41 ` Zhu Yanhai 2012-01-15 22:24 ` Vivek Goyal 5 siblings, 1 reply; 29+ messages in thread From: Dave Chinner @ 2012-01-04 7:19 UTC (permalink / raw) To: Shaohua Li; +Cc: linux-kernel, axboe, vgoyal, jmoyer On Wed, Jan 04, 2012 at 02:53:37PM +0800, Shaohua Li wrote: > An IOPS based I/O scheduler > > Flash based storage has some different characteristics against rotate disk. > 1. no I/O seek. > 2. read and write I/O cost usually is much different. > 3. Time which a request takes depends on request size. > 4. High throughput and IOPS, low latency. > > CFQ iosched does well for rotate disk, for example fair dispatching, idle > for sequential read. It also has optimization for flash based storage (for > item 1 above), but overall it's not designed for flash based storage. It's > a slice based algorithm. Since flash based storage request cost is very > low, and drive has big queue_depth is quite popular now which makes > dispatching cost even lower, CFQ's slice accounting (jiffy based) > doesn't work well. CFQ doesn't consider above item 2 & 3. > > FIOPS (Fair IOPS) ioscheduler is trying to fix the gaps. It's IOPS based, so > only targets for drive without I/O seek. It's quite similar like CFQ, but > the dispatch decision is made according to IOPS instead of slice. > > The algorithm is simple. Drive has a service tree, and each task lives in > the tree. The key into the tree is called vios (virtual I/O). Every request > has vios, which is calculated according to its ioprio, request size and so > on. Task's vios is the sum of vios of all requests it dispatches. FIOPS > always selects task with minimum vios in the service tree and let the task > dispatch request. 
The dispatched request's vios is then added to the task's > vios and the task is repositioned in the sevice tree. > > The series are orgnized as: > Patch 1: separate CFQ's io context management code. FIOPS will use it too. > Patch 2: The core FIOPS. > Patch 3: request read/write vios scale. This demontrates how the vios scale. > > To make the code simple for easy view, some scale code isn't included here, > some not implementated yet. > > TODO: > 1. ioprio support (have patch already) > 2. request size vios scale > 3. cgroup support > 4. tracing support > 5. automatically select default iosched according to QUEUE_FLAG_NONROT. > > Comments and suggestions are welcome! Benchmark results? Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC 0/3]block: An IOPS based ioscheduler 2012-01-04 7:19 ` [RFC 0/3]block: An IOPS based ioscheduler Dave Chinner @ 2012-01-05 6:50 ` Shaohua Li 2012-01-06 5:12 ` Shaohua Li 0 siblings, 1 reply; 29+ messages in thread From: Shaohua Li @ 2012-01-05 6:50 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-kernel, axboe, vgoyal, jmoyer On Wed, 2012-01-04 at 18:19 +1100, Dave Chinner wrote: > On Wed, Jan 04, 2012 at 02:53:37PM +0800, Shaohua Li wrote: > > An IOPS based I/O scheduler > > > > Flash based storage has some different characteristics against rotate disk. > > 1. no I/O seek. > > 2. read and write I/O cost usually is much different. > > 3. Time which a request takes depends on request size. > > 4. High throughput and IOPS, low latency. > > > > CFQ iosched does well for rotate disk, for example fair dispatching, idle > > for sequential read. It also has optimization for flash based storage (for > > item 1 above), but overall it's not designed for flash based storage. It's > > a slice based algorithm. Since flash based storage request cost is very > > low, and drive has big queue_depth is quite popular now which makes > > dispatching cost even lower, CFQ's slice accounting (jiffy based) > > doesn't work well. CFQ doesn't consider above item 2 & 3. > > > > FIOPS (Fair IOPS) ioscheduler is trying to fix the gaps. It's IOPS based, so > > only targets for drive without I/O seek. It's quite similar like CFQ, but > > the dispatch decision is made according to IOPS instead of slice. > > > > The algorithm is simple. Drive has a service tree, and each task lives in > > the tree. The key into the tree is called vios (virtual I/O). Every request > > has vios, which is calculated according to its ioprio, request size and so > > on. Task's vios is the sum of vios of all requests it dispatches. FIOPS > > always selects task with minimum vios in the service tree and let the task > > dispatch request. 
The dispatched request's vios is then added to the task's > > vios and the task is repositioned in the sevice tree. > > > > The series are orgnized as: > > Patch 1: separate CFQ's io context management code. FIOPS will use it too. > > Patch 2: The core FIOPS. > > Patch 3: request read/write vios scale. This demontrates how the vios scale. > > > > To make the code simple for easy view, some scale code isn't included here, > > some not implementated yet. > > > > TODO: > > 1. ioprio support (have patch already) > > 2. request size vios scale > > 3. cgroup support > > 4. tracing support > > 5. automatically select default iosched according to QUEUE_FLAG_NONROT. > > > > Comments and suggestions are welcome! > > Benchmark results? I don't have data yet. The patches are still at an early stage; I want to focus on the basic idea first. Thanks, Shaohua ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC 0/3]block: An IOPS based ioscheduler 2012-01-05 6:50 ` Shaohua Li @ 2012-01-06 5:12 ` Shaohua Li 2012-01-06 9:10 ` Namhyung Kim ` (3 more replies) 0 siblings, 4 replies; 29+ messages in thread From: Shaohua Li @ 2012-01-06 5:12 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-kernel, axboe, vgoyal, jmoyer On Thu, 2012-01-05 at 14:50 +0800, Shaohua Li wrote: > On Wed, 2012-01-04 at 18:19 +1100, Dave Chinner wrote: > > On Wed, Jan 04, 2012 at 02:53:37PM +0800, Shaohua Li wrote: > > > An IOPS based I/O scheduler > > > > > > Flash based storage has some different characteristics against rotate disk. > > > 1. no I/O seek. > > > 2. read and write I/O cost usually is much different. > > > 3. Time which a request takes depends on request size. > > > 4. High throughput and IOPS, low latency. > > > > > > CFQ iosched does well for rotate disk, for example fair dispatching, idle > > > for sequential read. It also has optimization for flash based storage (for > > > item 1 above), but overall it's not designed for flash based storage. It's > > > a slice based algorithm. Since flash based storage request cost is very > > > low, and drive has big queue_depth is quite popular now which makes > > > dispatching cost even lower, CFQ's slice accounting (jiffy based) > > > doesn't work well. CFQ doesn't consider above item 2 & 3. > > > > > > FIOPS (Fair IOPS) ioscheduler is trying to fix the gaps. It's IOPS based, so > > > only targets for drive without I/O seek. It's quite similar like CFQ, but > > > the dispatch decision is made according to IOPS instead of slice. > > > > > > The algorithm is simple. Drive has a service tree, and each task lives in > > > the tree. The key into the tree is called vios (virtual I/O). Every request > > > has vios, which is calculated according to its ioprio, request size and so > > > on. Task's vios is the sum of vios of all requests it dispatches. 
FIOPS > > > always selects task with minimum vios in the service tree and let the task > > > dispatch request. The dispatched request's vios is then added to the task's > > > vios and the task is repositioned in the sevice tree. > > > > > > The series are orgnized as: > > > Patch 1: separate CFQ's io context management code. FIOPS will use it too. > > > Patch 2: The core FIOPS. > > > Patch 3: request read/write vios scale. This demontrates how the vios scale. > > > > > > To make the code simple for easy view, some scale code isn't included here, > > > some not implementated yet. > > > > > > TODO: > > > 1. ioprio support (have patch already) > > > 2. request size vios scale > > > 3. cgroup support > > > 4. tracing support > > > 5. automatically select default iosched according to QUEUE_FLAG_NONROT. > > > > > > Comments and suggestions are welcome! > > > > Benchmark results? > I didn't have data yet. The patches are still in earlier stage, I want > to focus on the basic idea first. since you asked, I tested in a 4 socket machine with 12 X25M SSD jbod, fs is ext4. 
workload                        percentage change with fiops against cfq
fio_sync_read_4k                -2
fio_mediaplay_64k                0
fio_mediaplay_128k               0
fio_mediaplay_rr_64k             0
fio_sync_read_rr_4k              0
fio_sync_write_128k              0
fio_sync_write_64k              -1
fio_sync_write_4k               -2
fio_sync_write_64k_create        0
fio_sync_write_rr_64k_create     0
fio_sync_write_128k_create       0
fio_aio_randread_4k             -4
fio_aio_randread_64k             0
fio_aio_randwrite_4k             1
fio_aio_randwrite_64k            0
fio_aio_randrw_4k               -1
fio_aio_randrw_64k               0
fio_tpch                         9
fio_tpcc                         0
fio_mmap_randread_4k            -1
fio_mmap_randread_64k            1
fio_mmap_randread_1k            -8
fio_mmap_randwrite_4k           35
fio_mmap_randwrite_64k          22
fio_mmap_randwrite_1k           28
fio_mmap_randwrite_4k_halfbusy  24
fio_mmap_randrw_4k              23
fio_mmap_randrw_64k              4
fio_mmap_randrw_1k              22
fio_mmap_randrw_4k_halfbusy     35
fio_mmap_sync_read_4k            0
fio_mmap_sync_read_64k          -1
fio_mmap_sync_read_128k         -1
fio_mmap_sync_read_rr_64k        5
fio_mmap_sync_read_rr_4k         3

The fio_mmap_randread_1k case has a regression against 3.2-rc7, but no
regression against the 3.2-rc6 kernel; I'm still checking why. FIOPS shows an
improvement for read/write mixed workloads; CFQ is known to be weak for
read/write mixed workloads.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [RFC 0/3]block: An IOPS based ioscheduler 2012-01-06 5:12 ` Shaohua Li @ 2012-01-06 9:10 ` Namhyung Kim 2012-01-06 14:37 ` Jan Kara ` (2 subsequent siblings) 3 siblings, 0 replies; 29+ messages in thread From: Namhyung Kim @ 2012-01-06 9:10 UTC (permalink / raw) To: Shaohua Li; +Cc: Dave Chinner, linux-kernel, axboe, vgoyal, jmoyer 2012-01-06 PM 2:12, Shaohua Li wrote: > On Thu, 2012-01-05 at 14:50 +0800, Shaohua Li wrote: >> On Wed, 2012-01-04 at 18:19 +1100, Dave Chinner wrote: >>> On Wed, Jan 04, 2012 at 02:53:37PM +0800, Shaohua Li wrote: >>>> An IOPS based I/O scheduler >>>> >>>> Flash based storage has some different characteristics against rotate disk. >>>> 1. no I/O seek. >>>> 2. read and write I/O cost usually is much different. >>>> 3. Time which a request takes depends on request size. >>>> 4. High throughput and IOPS, low latency. >>>> >>>> CFQ iosched does well for rotate disk, for example fair dispatching, idle >>>> for sequential read. It also has optimization for flash based storage (for >>>> item 1 above), but overall it's not designed for flash based storage. It's >>>> a slice based algorithm. Since flash based storage request cost is very >>>> low, and drive has big queue_depth is quite popular now which makes >>>> dispatching cost even lower, CFQ's slice accounting (jiffy based) >>>> doesn't work well. CFQ doesn't consider above item 2& 3. >>>> >>>> FIOPS (Fair IOPS) ioscheduler is trying to fix the gaps. It's IOPS based, so >>>> only targets for drive without I/O seek. It's quite similar like CFQ, but >>>> the dispatch decision is made according to IOPS instead of slice. >>>> >>>> The algorithm is simple. Drive has a service tree, and each task lives in >>>> the tree. The key into the tree is called vios (virtual I/O). Every request >>>> has vios, which is calculated according to its ioprio, request size and so >>>> on. Task's vios is the sum of vios of all requests it dispatches. 
FIOPS >>>> always selects task with minimum vios in the service tree and let the task >>>> dispatch request. The dispatched request's vios is then added to the task's >>>> vios and the task is repositioned in the sevice tree. >>>> >>>> The series are orgnized as: >>>> Patch 1: separate CFQ's io context management code. FIOPS will use it too. >>>> Patch 2: The core FIOPS. >>>> Patch 3: request read/write vios scale. This demontrates how the vios scale. >>>> >>>> To make the code simple for easy view, some scale code isn't included here, >>>> some not implementated yet. >>>> >>>> TODO: >>>> 1. ioprio support (have patch already) >>>> 2. request size vios scale >>>> 3. cgroup support >>>> 4. tracing support >>>> 5. automatically select default iosched according to QUEUE_FLAG_NONROT. >>>> >>>> Comments and suggestions are welcome! >>> >>> Benchmark results? >> I didn't have data yet. The patches are still in earlier stage, I want >> to focus on the basic idea first. > since you asked, I tested in a 4 socket machine with 12 X25M SSD jbod, > fs is ext4. 
>
> workload                        percentage change with fiops against cfq
> fio_sync_read_4k                -2
> fio_mediaplay_64k                0
> fio_mediaplay_128k               0
> fio_mediaplay_rr_64k             0
> fio_sync_read_rr_4k              0
> fio_sync_write_128k              0
> fio_sync_write_64k              -1
> fio_sync_write_4k               -2
> fio_sync_write_64k_create        0
> fio_sync_write_rr_64k_create     0
> fio_sync_write_128k_create       0
> fio_aio_randread_4k             -4
> fio_aio_randread_64k             0
> fio_aio_randwrite_4k             1
> fio_aio_randwrite_64k            0
> fio_aio_randrw_4k               -1
> fio_aio_randrw_64k               0
> fio_tpch                         9
> fio_tpcc                         0
> fio_mmap_randread_4k            -1
> fio_mmap_randread_64k            1
> fio_mmap_randread_1k            -8
> fio_mmap_randwrite_4k           35
> fio_mmap_randwrite_64k          22
> fio_mmap_randwrite_1k           28
> fio_mmap_randwrite_4k_halfbusy  24
> fio_mmap_randrw_4k              23
> fio_mmap_randrw_64k              4
> fio_mmap_randrw_1k              22
> fio_mmap_randrw_4k_halfbusy     35
> fio_mmap_sync_read_4k            0
> fio_mmap_sync_read_64k          -1
> fio_mmap_sync_read_128k         -1
> fio_mmap_sync_read_rr_64k        5
> fio_mmap_sync_read_rr_4k         3
>
> The fio_mmap_randread_1k has regression against 3.2-rc7, but no
> regression against 3.2-rc6 kernel, still checking why. The fiops has
> improvement for read/write mixed workload. CFQ is known not good for
> read/write mixed workload.
>
> Thanks,
> Shaohua
>

Hi,

Looks promising. :)

Anyway, what's your configuration for the test? Did you use vios scaling
based on IO direction and/or ioprio?

Thanks,
Namhyung Kim

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [RFC 0/3]block: An IOPS based ioscheduler 2012-01-06 5:12 ` Shaohua Li 2012-01-06 9:10 ` Namhyung Kim @ 2012-01-06 14:37 ` Jan Kara 2012-01-09 1:26 ` Shaohua Li 2012-01-08 22:16 ` Dave Chinner 2012-01-15 22:28 ` Vivek Goyal 3 siblings, 1 reply; 29+ messages in thread From: Jan Kara @ 2012-01-06 14:37 UTC (permalink / raw) To: Shaohua Li; +Cc: Dave Chinner, linux-kernel, axboe, vgoyal, jmoyer On Fri 06-01-12 13:12:29, Shaohua Li wrote: > On Thu, 2012-01-05 at 14:50 +0800, Shaohua Li wrote: > > On Wed, 2012-01-04 at 18:19 +1100, Dave Chinner wrote: > > > On Wed, Jan 04, 2012 at 02:53:37PM +0800, Shaohua Li wrote: > > > > An IOPS based I/O scheduler > > > > > > > > Flash based storage has some different characteristics against rotate disk. > > > > 1. no I/O seek. > > > > 2. read and write I/O cost usually is much different. > > > > 3. Time which a request takes depends on request size. > > > > 4. High throughput and IOPS, low latency. > > > > > > > > CFQ iosched does well for rotate disk, for example fair dispatching, idle > > > > for sequential read. It also has optimization for flash based storage (for > > > > item 1 above), but overall it's not designed for flash based storage. It's > > > > a slice based algorithm. Since flash based storage request cost is very > > > > low, and drive has big queue_depth is quite popular now which makes > > > > dispatching cost even lower, CFQ's slice accounting (jiffy based) > > > > doesn't work well. CFQ doesn't consider above item 2 & 3. > > > > > > > > FIOPS (Fair IOPS) ioscheduler is trying to fix the gaps. It's IOPS based, so > > > > only targets for drive without I/O seek. It's quite similar like CFQ, but > > > > the dispatch decision is made according to IOPS instead of slice. > > > > > > > > The algorithm is simple. Drive has a service tree, and each task lives in > > > > the tree. The key into the tree is called vios (virtual I/O). 
Every request > > > > has vios, which is calculated according to its ioprio, request size and so > > > > on. Task's vios is the sum of vios of all requests it dispatches. FIOPS > > > > always selects task with minimum vios in the service tree and let the task > > > > dispatch request. The dispatched request's vios is then added to the task's > > > > vios and the task is repositioned in the sevice tree. > > > > > > > > The series are orgnized as: > > > > Patch 1: separate CFQ's io context management code. FIOPS will use it too. > > > > Patch 2: The core FIOPS. > > > > Patch 3: request read/write vios scale. This demontrates how the vios scale. > > > > > > > > To make the code simple for easy view, some scale code isn't included here, > > > > some not implementated yet. > > > > > > > > TODO: > > > > 1. ioprio support (have patch already) > > > > 2. request size vios scale > > > > 3. cgroup support > > > > 4. tracing support > > > > 5. automatically select default iosched according to QUEUE_FLAG_NONROT. > > > > > > > > Comments and suggestions are welcome! > > > > > > Benchmark results? > > I didn't have data yet. The patches are still in earlier stage, I want > > to focus on the basic idea first. > since you asked, I tested in a 4 socket machine with 12 X25M SSD jbod, > fs is ext4. 
>
> workload                        percentage change with fiops against cfq
> fio_sync_read_4k                -2
> fio_mediaplay_64k                0
> fio_mediaplay_128k               0
> fio_mediaplay_rr_64k             0
> fio_sync_read_rr_4k              0
> fio_sync_write_128k              0
> fio_sync_write_64k              -1
> fio_sync_write_4k               -2
> fio_sync_write_64k_create        0
> fio_sync_write_rr_64k_create     0
> fio_sync_write_128k_create       0
> fio_aio_randread_4k             -4
> fio_aio_randread_64k             0
> fio_aio_randwrite_4k             1
> fio_aio_randwrite_64k            0
> fio_aio_randrw_4k               -1
> fio_aio_randrw_64k               0
> fio_tpch                         9
> fio_tpcc                         0
> fio_mmap_randread_4k            -1
> fio_mmap_randread_64k            1
> fio_mmap_randread_1k            -8
> fio_mmap_randwrite_4k           35
> fio_mmap_randwrite_64k          22
> fio_mmap_randwrite_1k           28
> fio_mmap_randwrite_4k_halfbusy  24
> fio_mmap_randrw_4k              23
> fio_mmap_randrw_64k              4
> fio_mmap_randrw_1k              22
> fio_mmap_randrw_4k_halfbusy     35
> fio_mmap_sync_read_4k            0
> fio_mmap_sync_read_64k          -1
> fio_mmap_sync_read_128k         -1
> fio_mmap_sync_read_rr_64k        5
> fio_mmap_sync_read_rr_4k         3
>
> The fio_mmap_randread_1k has regression against 3.2-rc7, but no
> regression against 3.2-rc6 kernel, still checking why. The fiops has
> improvement for read/write mixed workload. CFQ is known not good for
> read/write mixed workload.

  Nice, but this is only about throughput, isn't it? Part of the reason why
CFQ takes a hit in throughput is that it prefers sync IO (i.e. reads and
synchronous writes) over other IO. Does your scheduler do anything like
that? Could you, for example, compare the latency of reads while running
heavy background writing between CFQ and your scheduler? Loads like this
were the original motivation for CFQ, I believe.

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [RFC 0/3]block: An IOPS based ioscheduler 2012-01-06 14:37 ` Jan Kara @ 2012-01-09 1:26 ` Shaohua Li 2012-01-15 22:32 ` Vivek Goyal 0 siblings, 1 reply; 29+ messages in thread From: Shaohua Li @ 2012-01-09 1:26 UTC (permalink / raw) To: Jan Kara; +Cc: Dave Chinner, linux-kernel, axboe, vgoyal, jmoyer On Fri, 2012-01-06 at 15:37 +0100, Jan Kara wrote: > On Fri 06-01-12 13:12:29, Shaohua Li wrote: > > On Thu, 2012-01-05 at 14:50 +0800, Shaohua Li wrote: > > > On Wed, 2012-01-04 at 18:19 +1100, Dave Chinner wrote: > > > > On Wed, Jan 04, 2012 at 02:53:37PM +0800, Shaohua Li wrote: > > > > > An IOPS based I/O scheduler > > > > > > > > > > Flash based storage has some different characteristics against rotate disk. > > > > > 1. no I/O seek. > > > > > 2. read and write I/O cost usually is much different. > > > > > 3. Time which a request takes depends on request size. > > > > > 4. High throughput and IOPS, low latency. > > > > > > > > > > CFQ iosched does well for rotate disk, for example fair dispatching, idle > > > > > for sequential read. It also has optimization for flash based storage (for > > > > > item 1 above), but overall it's not designed for flash based storage. It's > > > > > a slice based algorithm. Since flash based storage request cost is very > > > > > low, and drive has big queue_depth is quite popular now which makes > > > > > dispatching cost even lower, CFQ's slice accounting (jiffy based) > > > > > doesn't work well. CFQ doesn't consider above item 2 & 3. > > > > > > > > > > FIOPS (Fair IOPS) ioscheduler is trying to fix the gaps. It's IOPS based, so > > > > > only targets for drive without I/O seek. It's quite similar like CFQ, but > > > > > the dispatch decision is made according to IOPS instead of slice. > > > > > > > > > > The algorithm is simple. Drive has a service tree, and each task lives in > > > > > the tree. The key into the tree is called vios (virtual I/O). 
Every request > > > > > has vios, which is calculated according to its ioprio, request size and so > > > > > on. Task's vios is the sum of vios of all requests it dispatches. FIOPS > > > > > always selects task with minimum vios in the service tree and let the task > > > > > dispatch request. The dispatched request's vios is then added to the task's > > > > > vios and the task is repositioned in the sevice tree. > > > > > > > > > > The series are orgnized as: > > > > > Patch 1: separate CFQ's io context management code. FIOPS will use it too. > > > > > Patch 2: The core FIOPS. > > > > > Patch 3: request read/write vios scale. This demontrates how the vios scale. > > > > > > > > > > To make the code simple for easy view, some scale code isn't included here, > > > > > some not implementated yet. > > > > > > > > > > TODO: > > > > > 1. ioprio support (have patch already) > > > > > 2. request size vios scale > > > > > 3. cgroup support > > > > > 4. tracing support > > > > > 5. automatically select default iosched according to QUEUE_FLAG_NONROT. > > > > > > > > > > Comments and suggestions are welcome! > > > > > > > > Benchmark results? > > > I didn't have data yet. The patches are still in earlier stage, I want > > > to focus on the basic idea first. > > since you asked, I tested in a 4 socket machine with 12 X25M SSD jbod, > > fs is ext4. 
> >
> > workload                        percentage change with fiops against cfq
> > fio_sync_read_4k                -2
> > fio_mediaplay_64k                0
> > fio_mediaplay_128k               0
> > fio_mediaplay_rr_64k             0
> > fio_sync_read_rr_4k              0
> > fio_sync_write_128k              0
> > fio_sync_write_64k              -1
> > fio_sync_write_4k               -2
> > fio_sync_write_64k_create        0
> > fio_sync_write_rr_64k_create     0
> > fio_sync_write_128k_create       0
> > fio_aio_randread_4k             -4
> > fio_aio_randread_64k             0
> > fio_aio_randwrite_4k             1
> > fio_aio_randwrite_64k            0
> > fio_aio_randrw_4k               -1
> > fio_aio_randrw_64k               0
> > fio_tpch                         9
> > fio_tpcc                         0
> > fio_mmap_randread_4k            -1
> > fio_mmap_randread_64k            1
> > fio_mmap_randread_1k            -8
> > fio_mmap_randwrite_4k           35
> > fio_mmap_randwrite_64k          22
> > fio_mmap_randwrite_1k           28
> > fio_mmap_randwrite_4k_halfbusy  24
> > fio_mmap_randrw_4k              23
> > fio_mmap_randrw_64k              4
> > fio_mmap_randrw_1k              22
> > fio_mmap_randrw_4k_halfbusy     35
> > fio_mmap_sync_read_4k            0
> > fio_mmap_sync_read_64k          -1
> > fio_mmap_sync_read_128k         -1
> > fio_mmap_sync_read_rr_64k        5
> > fio_mmap_sync_read_rr_4k         3
> >
> > The fio_mmap_randread_1k has regression against 3.2-rc7, but no
> > regression against 3.2-rc6 kernel, still checking why. The fiops has
> > improvement for read/write mixed workload. CFQ is known not good for
> > read/write mixed workload.
>   Nice, but this is only about throughput, isn't it? Part of the reason why
> CFQ takes a hit in throughput is that it prefers sync IO (i.e. reads and
> synchronous writes) over other IO. Does your scheduler do anything like
> that?
It has. The vios will be scaled based on read/write direction. By default,
I'll give reads 2.5 times more share than writes, which matches CFQ. It's
quite easy to change the ratio with the sysfs knobs.

> Could you, for example, compare the latency of reads while running heavy
> background writing between CFQ and your scheduler? Loads like this were
> the original motivation for CFQ, I believe.
CFQ supports preemption and FIOPS doesn't, so I suppose read latency with
CFQ is still better in such a workload.
In this initial post, I just want to demonstrate the basic idea of the
ioscheduler. I'll post more data for both latency and throughput in the next
round.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [RFC 0/3]block: An IOPS based ioscheduler 2012-01-09 1:26 ` Shaohua Li @ 2012-01-15 22:32 ` Vivek Goyal 0 siblings, 0 replies; 29+ messages in thread From: Vivek Goyal @ 2012-01-15 22:32 UTC (permalink / raw) To: Shaohua Li; +Cc: Jan Kara, Dave Chinner, linux-kernel, axboe, jmoyer

On Mon, Jan 09, 2012 at 09:26:45AM +0800, Shaohua Li wrote:

[..]
> > Could you, for example, compare the latency of reads while running heavy
> > background writing between CFQ and your scheduler? Loads like this were
> > the original motivation for CFQ, I believe.
> CFQ supports preemption, FIOPS doesn't, so I suppose read latency of CFQ
> is still better in such workload.
> In this initial post, I just want to demonstrate the basic idea of the
> ioscheduler. I'll post more data for both latency and throughput in next
> round.

I think that, before numbers, what would be more helpful to know is what
you are trying to achieve and why the existing CFQ code cannot do that
with little modification.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [RFC 0/3]block: An IOPS based ioscheduler 2012-01-06 5:12 ` Shaohua Li 2012-01-06 9:10 ` Namhyung Kim 2012-01-06 14:37 ` Jan Kara @ 2012-01-08 22:16 ` Dave Chinner 2012-01-09 1:09 ` Shaohua Li 2012-01-15 22:28 ` Vivek Goyal 3 siblings, 1 reply; 29+ messages in thread From: Dave Chinner @ 2012-01-08 22:16 UTC (permalink / raw) To: Shaohua Li; +Cc: linux-kernel, axboe, vgoyal, jmoyer On Fri, Jan 06, 2012 at 01:12:29PM +0800, Shaohua Li wrote: > On Thu, 2012-01-05 at 14:50 +0800, Shaohua Li wrote: > > On Wed, 2012-01-04 at 18:19 +1100, Dave Chinner wrote: > > > On Wed, Jan 04, 2012 at 02:53:37PM +0800, Shaohua Li wrote: > > > > An IOPS based I/O scheduler > > > > > > > > Flash based storage has some different characteristics against rotate disk. > > > > 1. no I/O seek. > > > > 2. read and write I/O cost usually is much different. > > > > 3. Time which a request takes depends on request size. > > > > 4. High throughput and IOPS, low latency. > > > > > > > > CFQ iosched does well for rotate disk, for example fair dispatching, idle > > > > for sequential read. It also has optimization for flash based storage (for > > > > item 1 above), but overall it's not designed for flash based storage. It's > > > > a slice based algorithm. Since flash based storage request cost is very > > > > low, and drive has big queue_depth is quite popular now which makes > > > > dispatching cost even lower, CFQ's slice accounting (jiffy based) > > > > doesn't work well. CFQ doesn't consider above item 2 & 3. > > > > > > > > FIOPS (Fair IOPS) ioscheduler is trying to fix the gaps. It's IOPS based, so > > > > only targets for drive without I/O seek. It's quite similar like CFQ, but > > > > the dispatch decision is made according to IOPS instead of slice. > > > > > > > > The algorithm is simple. Drive has a service tree, and each task lives in > > > > the tree. The key into the tree is called vios (virtual I/O). 
Every request > > > > has vios, which is calculated according to its ioprio, request size and so > > > > on. Task's vios is the sum of vios of all requests it dispatches. FIOPS > > > > always selects task with minimum vios in the service tree and let the task > > > > dispatch request. The dispatched request's vios is then added to the task's > > > > vios and the task is repositioned in the sevice tree. > > > > > > > > The series are orgnized as: > > > > Patch 1: separate CFQ's io context management code. FIOPS will use it too. > > > > Patch 2: The core FIOPS. > > > > Patch 3: request read/write vios scale. This demontrates how the vios scale. > > > > > > > > To make the code simple for easy view, some scale code isn't included here, > > > > some not implementated yet. > > > > > > > > TODO: > > > > 1. ioprio support (have patch already) > > > > 2. request size vios scale > > > > 3. cgroup support > > > > 4. tracing support > > > > 5. automatically select default iosched according to QUEUE_FLAG_NONROT. > > > > > > > > Comments and suggestions are welcome! > > > > > > Benchmark results? > > I didn't have data yet. The patches are still in earlier stage, I want > > to focus on the basic idea first. > since you asked, I tested in a 4 socket machine with 12 X25M SSD jbod, > fs is ext4. 
>
> workload                        percentage change with fiops against cfq
> fio_sync_read_4k                -2
> fio_mediaplay_64k                0
> fio_mediaplay_128k               0
> fio_mediaplay_rr_64k             0
> fio_sync_read_rr_4k              0
> fio_sync_write_128k              0
> fio_sync_write_64k              -1
> fio_sync_write_4k               -2
> fio_sync_write_64k_create        0
> fio_sync_write_rr_64k_create     0
> fio_sync_write_128k_create       0
> fio_aio_randread_4k             -4
> fio_aio_randread_64k             0
> fio_aio_randwrite_4k             1
> fio_aio_randwrite_64k            0
> fio_aio_randrw_4k               -1
> fio_aio_randrw_64k               0
> fio_tpch                         9
> fio_tpcc                         0
> fio_mmap_randread_4k            -1
> fio_mmap_randread_64k            1
> fio_mmap_randread_1k            -8
> fio_mmap_randwrite_4k           35
> fio_mmap_randwrite_64k          22
> fio_mmap_randwrite_1k           28
> fio_mmap_randwrite_4k_halfbusy  24
> fio_mmap_randrw_4k              23
> fio_mmap_randrw_64k              4
> fio_mmap_randrw_1k              22
> fio_mmap_randrw_4k_halfbusy     35
> fio_mmap_sync_read_4k            0
> fio_mmap_sync_read_64k          -1
> fio_mmap_sync_read_128k         -1
> fio_mmap_sync_read_rr_64k        5
> fio_mmap_sync_read_rr_4k         3
>
> The fio_mmap_randread_1k has regression against 3.2-rc7, but no
> regression against 3.2-rc6 kernel, still checking why. The fiops has
> improvement for read/write mixed workload. CFQ is known not good for
> read/write mixed workload.

Numbers like this are meaningless without knowing what the hardware
capability is and how the numbers compare to that raw capability. They tell
me only mmap based random writes improve in performance, and only one
specific type of random write improves, not all types.

That raises more questions than it answers: why do AIO based random writes
not go any faster? Is that because, even with CFQ, AIO based random writes
saturate the device? i.e. is AIO based IO that much faster than mmap based
IO that there is no scope for improvement on your hardware?

You need to present raw numbers and give us some idea of how close those
numbers are to raw hardware capability for us to have any idea what
improvements these numbers actually demonstrate.

Cheers,
Dave.
-- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC 0/3]block: An IOPS based ioscheduler 2012-01-08 22:16 ` Dave Chinner @ 2012-01-09 1:09 ` Shaohua Li 2012-01-15 22:45 ` Vivek Goyal 0 siblings, 1 reply; 29+ messages in thread From: Shaohua Li @ 2012-01-09 1:09 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-kernel, axboe, vgoyal, jmoyer On Mon, 2012-01-09 at 09:16 +1100, Dave Chinner wrote: > On Fri, Jan 06, 2012 at 01:12:29PM +0800, Shaohua Li wrote: > > On Thu, 2012-01-05 at 14:50 +0800, Shaohua Li wrote: > > > On Wed, 2012-01-04 at 18:19 +1100, Dave Chinner wrote: > > > > On Wed, Jan 04, 2012 at 02:53:37PM +0800, Shaohua Li wrote: > > > > > An IOPS based I/O scheduler > > > > > > > > > > Flash based storage has some different characteristics against rotate disk. > > > > > 1. no I/O seek. > > > > > 2. read and write I/O cost usually is much different. > > > > > 3. Time which a request takes depends on request size. > > > > > 4. High throughput and IOPS, low latency. > > > > > > > > > > CFQ iosched does well for rotate disk, for example fair dispatching, idle > > > > > for sequential read. It also has optimization for flash based storage (for > > > > > item 1 above), but overall it's not designed for flash based storage. It's > > > > > a slice based algorithm. Since flash based storage request cost is very > > > > > low, and drive has big queue_depth is quite popular now which makes > > > > > dispatching cost even lower, CFQ's slice accounting (jiffy based) > > > > > doesn't work well. CFQ doesn't consider above item 2 & 3. > > > > > > > > > > FIOPS (Fair IOPS) ioscheduler is trying to fix the gaps. It's IOPS based, so > > > > > only targets for drive without I/O seek. It's quite similar like CFQ, but > > > > > the dispatch decision is made according to IOPS instead of slice. > > > > > > > > > > The algorithm is simple. Drive has a service tree, and each task lives in > > > > > the tree. The key into the tree is called vios (virtual I/O). 
Every request > > > > > has vios, which is calculated according to its ioprio, request size and so > > > > > on. Task's vios is the sum of vios of all requests it dispatches. FIOPS > > > > > always selects task with minimum vios in the service tree and let the task > > > > > dispatch request. The dispatched request's vios is then added to the task's > > > > > vios and the task is repositioned in the sevice tree. > > > > > > > > > > The series are orgnized as: > > > > > Patch 1: separate CFQ's io context management code. FIOPS will use it too. > > > > > Patch 2: The core FIOPS. > > > > > Patch 3: request read/write vios scale. This demontrates how the vios scale. > > > > > > > > > > To make the code simple for easy view, some scale code isn't included here, > > > > > some not implementated yet. > > > > > > > > > > TODO: > > > > > 1. ioprio support (have patch already) > > > > > 2. request size vios scale > > > > > 3. cgroup support > > > > > 4. tracing support > > > > > 5. automatically select default iosched according to QUEUE_FLAG_NONROT. > > > > > > > > > > Comments and suggestions are welcome! > > > > > > > > Benchmark results? > > > I didn't have data yet. The patches are still in earlier stage, I want > > > to focus on the basic idea first. > > since you asked, I tested in a 4 socket machine with 12 X25M SSD jbod, > > fs is ext4. 
> > workload                          percentage change with fiops against cfq
> > fio_sync_read_4k                  -2
> > fio_mediaplay_64k                  0
> > fio_mediaplay_128k                 0
> > fio_mediaplay_rr_64k               0
> > fio_sync_read_rr_4k                0
> > fio_sync_write_128k                0
> > fio_sync_write_64k                -1
> > fio_sync_write_4k                 -2
> > fio_sync_write_64k_create          0
> > fio_sync_write_rr_64k_create       0
> > fio_sync_write_128k_create         0
> > fio_aio_randread_4k               -4
> > fio_aio_randread_64k               0
> > fio_aio_randwrite_4k               1
> > fio_aio_randwrite_64k              0
> > fio_aio_randrw_4k                 -1
> > fio_aio_randrw_64k                 0
> > fio_tpch                           9
> > fio_tpcc                           0
> > fio_mmap_randread_4k              -1
> > fio_mmap_randread_64k              1
> > fio_mmap_randread_1k              -8
> > fio_mmap_randwrite_4k             35
> > fio_mmap_randwrite_64k            22
> > fio_mmap_randwrite_1k             28
> > fio_mmap_randwrite_4k_halfbusy    24
> > fio_mmap_randrw_4k                23
> > fio_mmap_randrw_64k                4
> > fio_mmap_randrw_1k                22
> > fio_mmap_randrw_4k_halfbusy       35
> > fio_mmap_sync_read_4k              0
> > fio_mmap_sync_read_64k            -1
> > fio_mmap_sync_read_128k           -1
> > fio_mmap_sync_read_rr_64k          5
> > fio_mmap_sync_read_rr_4k           3
> >
> > The fio_mmap_randread_1k has regression against 3.2-rc7, but no regression against 3.2-rc6 kernel, still checking why. The fiops has improvement for read/write mixed workload. CFQ is known not good for read/write mixed workload.
>
> Numbers like this are meaningless without knowing what the hardware capability is and how the numbers compare to that raw capability. They tell me only mmap based random write improves in performance, and only one specific type of random write improves, not all types.
>
> That raises more questions than it answers: why do AIO based random writes not go any faster? Is that because even with CFQ, AIO based random writes saturate the device? i.e. is AIO based IO that much faster than mmap based IO that there is no scope for improvement on your hardware?
> You need to present raw numbers and give us some idea of how close those numbers are to raw hardware capability for us to have any idea what improvements these numbers actually demonstrate.

Yes, your guess is right: the hardware is the limit. 12 SSDs exceed the JBOD's capability, for both throughput and IOPS, which is why only the read/write mixed workloads are affected. I'll use fewer SSDs in later tests, which will demonstrate the performance better, and I'll report both raw numbers and fiops/cfq numbers then.

Thanks,
Shaohua
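The vios bookkeeping described in the cover letter (quoted above) can be modeled in a few lines. This is a user-space sketch only: the actual patches use an rbtree-based service tree and kernel types, while a linear scan and plain C stand in here.

```c
#include <stddef.h>

/* Minimal model of the FIOPS dispatch rule: every task carries a
 * cumulative vios (virtual I/O) count; the scheduler always services
 * the task with minimum vios, then charges it the vios of the request
 * it just dispatched. */
struct fiops_task {
    unsigned long vios; /* sum of vios of all requests this task dispatched */
};

/* Return the index of the task allowed to dispatch next (minimum vios). */
static size_t fiops_select(const struct fiops_task *tasks, size_t n)
{
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (tasks[i].vios < tasks[best].vios)
            best = i;
    return best;
}

/* Account a dispatched request; in the kernel the task would then be
 * repositioned in the service tree. */
static void fiops_charge(struct fiops_task *t, unsigned long request_vios)
{
    t->vios += request_vios;
}
```

A task that dispatches expensive requests accumulates vios quickly and is naturally deprioritized until the others catch up.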
* Re: [RFC 0/3]block: An IOPS based ioscheduler
From: Vivek Goyal @ 2012-01-15 22:45 UTC
To: Shaohua Li; +Cc: Dave Chinner, linux-kernel, axboe, jmoyer

On Mon, Jan 09, 2012 at 09:09:35AM +0800, Shaohua Li wrote:

[..]
> > You need to present raw numbers and give us some idea of how close those numbers are to raw hardware capability for us to have any idea what improvements these numbers actually demonstrate.
> Yes, your guess is right. The hardware has limitation. 12 SSD exceeds the jbod capability, for both throughput and IOPS, that's why only read/write mixed workload impacts. I'll use less SSD in later tests, which will demonstrate the performance better. I'll report both raw numbers and fiops/cfq numbers later.

If the fiops numbers are better, please explain why. If you cut down on idling, it is obvious that you will get higher throughput on these flash devices. CFQ already disables queue idling for non-rotational NCQ devices. If the higher throughput comes from driving deeper queue depths, then CFQ can do that too, just by changing the quantum and disabling idling.

So I really don't understand what you are doing fundamentally differently in the FIOPS ioscheduler. The only thing I can think of is more accurate accounting per queue, in terms of number of IOs instead of time, which can just serve to improve fairness a bit for certain workloads. In practice, I think it might not matter much.

Thanks
Vivek
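For reference, the CFQ knobs Vivek refers to are sysfs tunables. A quick way to try "changing quantum and disabling idling" looks like the following (the device name is a placeholder; the iosched attributes exist only while cfq is the active scheduler):

```shell
# Select CFQ on the device (sdb is illustrative).
echo cfq > /sys/block/sdb/queue/scheduler

# Disable queue idling, as CFQ already does for non-rotational NCQ devices.
echo 0 > /sys/block/sdb/queue/iosched/slice_idle

# Raise the dispatch quantum so more requests per queue reach the device.
echo 32 > /sys/block/sdb/queue/iosched/quantum
```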
* Re: [RFC 0/3]block: An IOPS based ioscheduler 2012-01-15 22:45 ` Vivek Goyal @ 2012-01-16 4:36 ` Shaohua Li 2012-01-16 7:11 ` Vivek Goyal 0 siblings, 1 reply; 29+ messages in thread From: Shaohua Li @ 2012-01-16 4:36 UTC (permalink / raw) To: Vivek Goyal; +Cc: Dave Chinner, linux-kernel, axboe, jmoyer On Sun, 2012-01-15 at 17:45 -0500, Vivek Goyal wrote: > On Mon, Jan 09, 2012 at 09:09:35AM +0800, Shaohua Li wrote: > > [..] > > > You need to present raw numbers and give us some idea of how close > > > those numbers are to raw hardware capability for us to have any idea > > > what improvements these numbers actually demonstrate. > > Yes, your guess is right. The hardware has limitation. 12 SSD exceeds > > the jbod capability, for both throughput and IOPS, that's why only > > read/write mixed workload impacts. I'll use less SSD in later tests, > > which will demonstrate the performance better. I'll report both raw > > numbers and fiops/cfq numbers later. > > If fiops number are better please explain why those numbers are better. > If you cut down on idling, it is obivious that you will get higher > throughput on these flash devices. CFQ does disable queue idling for > non rotational NCQ devices. If higher throughput is due to driving > deeper queue depths, then CFQ can do that too just by changing quantum > and disabling idling. it's because of quantum. Surely you can change the quantum, and CFQ performance will increase, but you will find CFQ is very unfair then. > So I really don't understand that what are you doing fundamentally > different in FIOPS ioscheduler. > > The only thing I can think of more accurate accounting per queue in > terms of number of IOs instead of time. Which can just serve to improve > fairness a bit for certain workloads. In practice, I think it might > not matter much. If quantum is big, CFQ will have better performance, but it actually fallbacks to Noop, no any fairness. fairness is important and is why we introduce CFQ. 
In summary, CFQ can't be both fair and fast here. FIOPS is trying to be fair and still have good performance. I don't think any time-based accounting can achieve that goal for NCQ and SSD (even the cfq cgroup code has an iops mode, so I suppose you already know this well).

Surely you can change CFQ to make it IOPS based, but that would mess up the code a lot, while FIOPS already shares a lot of code with CFQ. So I'd like to have a separate ioscheduler which is IOPS based.

Thanks,
Shaohua
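The accounting-granularity point can be made concrete: with NCQ, a queue may dispatch and complete a batch of requests in far less than one jiffy, so jiffy-based charging rounds its service down to nothing, while an IOPS count keeps it. A toy illustration (numbers and names are illustrative, not kernel code):

```c
/* Compare jiffy-based vs IOPS-based charging for a queue that is
 * serviced for much less than one jiffy. */

/* Time-based charge: integer division rounds sub-jiffy work down to 0,
 * so a queue that did real work may be charged nothing. */
static unsigned long charge_jiffies(unsigned long elapsed_us,
                                    unsigned long us_per_jiffy)
{
    return elapsed_us / us_per_jiffy;
}

/* IOPS-based charge: every dispatched request is accounted, however
 * quickly the batch completed. */
static unsigned long charge_iops(unsigned long nr_requests)
{
    return nr_requests;
}
```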
* Re: [RFC 0/3]block: An IOPS based ioscheduler 2012-01-16 4:36 ` Shaohua Li @ 2012-01-16 7:11 ` Vivek Goyal 2012-01-16 7:55 ` Shaohua Li 0 siblings, 1 reply; 29+ messages in thread From: Vivek Goyal @ 2012-01-16 7:11 UTC (permalink / raw) To: Shaohua Li; +Cc: Dave Chinner, linux-kernel, axboe, jmoyer On Mon, Jan 16, 2012 at 12:36:30PM +0800, Shaohua Li wrote: > On Sun, 2012-01-15 at 17:45 -0500, Vivek Goyal wrote: > > On Mon, Jan 09, 2012 at 09:09:35AM +0800, Shaohua Li wrote: > > > > [..] > > > > You need to present raw numbers and give us some idea of how close > > > > those numbers are to raw hardware capability for us to have any idea > > > > what improvements these numbers actually demonstrate. > > > Yes, your guess is right. The hardware has limitation. 12 SSD exceeds > > > the jbod capability, for both throughput and IOPS, that's why only > > > read/write mixed workload impacts. I'll use less SSD in later tests, > > > which will demonstrate the performance better. I'll report both raw > > > numbers and fiops/cfq numbers later. > > > > If fiops number are better please explain why those numbers are better. > > If you cut down on idling, it is obivious that you will get higher > > throughput on these flash devices. CFQ does disable queue idling for > > non rotational NCQ devices. If higher throughput is due to driving > > deeper queue depths, then CFQ can do that too just by changing quantum > > and disabling idling. > it's because of quantum. Surely you can change the quantum, and CFQ > performance will increase, but you will find CFQ is very unfair then. Why increasing quantum leads to CFQ being unfair? In terms of time it still tries to be fair. That's a different thing that with NCQ, right time measurement is not possible with requests from multiple queues being in the driver/disk at the same time. So accouting in terms of iops per queue might make sense. 
> > > So I really don't understand that what are you doing fundamentally > > different in FIOPS ioscheduler. > > > > The only thing I can think of more accurate accounting per queue in > > terms of number of IOs instead of time. Which can just serve to improve > > fairness a bit for certain workloads. In practice, I think it might > > not matter much. > If quantum is big, CFQ will have better performance, but it actually > fallbacks to Noop, no any fairness. fairness is important and is why we > introduce CFQ. It is not exactly noop. It still preempts writes and prioritizes reads and direct writes. Also, what's the real life workload where you face issues with using say deadline with these flash based storage. > > In summary, CFQ isn't both fair and good performance. FIOPS is trying to > be fair and have good performance. I didn't think any time based > accounting can make the goal happen for NCQ and SSD (even cfq cgroup > code has iops mode, so suppose you should already know this well). > > Surely you can change CFQ to make it IOPS based, but this will mess the > code a lot, and FIOPS shares a lot of code with CFQ. So I'd like to have > a separate ioscheduler which is IOPS based. I think writing a separate IO scheduler just to do accouting in IOPS while retaining rest of the CFQ code is not a very good idea. Modifying CFQ code to be able to deal with both time based as well as IOPS accounting might turn out to be simpler. Thanks Vivek ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC 0/3]block: An IOPS based ioscheduler 2012-01-16 7:11 ` Vivek Goyal @ 2012-01-16 7:55 ` Shaohua Li 2012-01-16 8:29 ` Vivek Goyal 0 siblings, 1 reply; 29+ messages in thread From: Shaohua Li @ 2012-01-16 7:55 UTC (permalink / raw) To: Vivek Goyal; +Cc: Dave Chinner, linux-kernel, axboe, jmoyer On Mon, 2012-01-16 at 02:11 -0500, Vivek Goyal wrote: > On Mon, Jan 16, 2012 at 12:36:30PM +0800, Shaohua Li wrote: > > On Sun, 2012-01-15 at 17:45 -0500, Vivek Goyal wrote: > > > On Mon, Jan 09, 2012 at 09:09:35AM +0800, Shaohua Li wrote: > > > > > > [..] > > > > > You need to present raw numbers and give us some idea of how close > > > > > those numbers are to raw hardware capability for us to have any idea > > > > > what improvements these numbers actually demonstrate. > > > > Yes, your guess is right. The hardware has limitation. 12 SSD exceeds > > > > the jbod capability, for both throughput and IOPS, that's why only > > > > read/write mixed workload impacts. I'll use less SSD in later tests, > > > > which will demonstrate the performance better. I'll report both raw > > > > numbers and fiops/cfq numbers later. > > > > > > If fiops number are better please explain why those numbers are better. > > > If you cut down on idling, it is obivious that you will get higher > > > throughput on these flash devices. CFQ does disable queue idling for > > > non rotational NCQ devices. If higher throughput is due to driving > > > deeper queue depths, then CFQ can do that too just by changing quantum > > > and disabling idling. > > it's because of quantum. Surely you can change the quantum, and CFQ > > performance will increase, but you will find CFQ is very unfair then. > > Why increasing quantum leads to CFQ being unfair? In terms of time it > still tries to be fair. we can dispatch a lot of requests to NCQ SSD with very small time interval. The disk can finish a lot of requests in small time interval too. The time is much smaller than 1 jiffy. 
Increasing the quantum lets a task dispatch requests faster and makes the accounting worse; with a small quantum the task has to wait to dispatch. You can easily verify this with a simple fio test.

> That's a different thing that with NCQ, right time measurement is not possible with requests from multiple queues being in the driver/disk at the same time. So accouting in terms of iops per queue might make sense.

yes.

> > > So I really don't understand that what are you doing fundamentally different in FIOPS ioscheduler.
> > >
> > > The only thing I can think of more accurate accounting per queue in terms of number of IOs instead of time. Which can just serve to improve fairness a bit for certain workloads. In practice, I think it might not matter much.
> > If quantum is big, CFQ will have better performance, but it actually fallbacks to Noop, no any fairness. fairness is important and is why we introduce CFQ.
>
> It is not exactly noop. It still preempts writes and prioritizes reads and direct writes.

sure, I mean fairness mostly here.

> Also, what's the real life workload where you face issues with using say deadline with these flash based storage.

deadline doesn't provide fairness. It's mainly the cgroup workload; workloads with different ioprio have issues too, but I don't know which real workload uses ioprio.

> > In summary, CFQ isn't both fair and good performance. FIOPS is trying to be fair and have good performance. I didn't think any time based accounting can make the goal happen for NCQ and SSD (even cfq cgroup code has iops mode, so suppose you should already know this well).
> >
> > Surely you can change CFQ to make it IOPS based, but this will mess the code a lot, and FIOPS shares a lot of code with CFQ. So I'd like to have a separate ioscheduler which is IOPS based.
> I think writing a separate IO scheduler just to do accouting in IOPS while retaining rest of the CFQ code is not a very good idea. Modifying CFQ code to be able to deal with both time based as well as IOPS accounting might turn out to be simpler.

Changing CFQ works, but I really want to avoid having something like:

    if (iops)
        xxx
    else
        xxx

I plan to add scales for read/write, request size, etc., because read and write costs differ, and requests of different sizes have different costs on SSD. This could be added to CFQ too, but painfully. That said, I don't completely object to making CFQ support IOPS accounting; my feeling is just that a separate ioscheduler is cleaner.

Thanks,
Shaohua
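The scales Shaohua mentions (read/write and request size) might look like this in outline. The shift, the write multiplier, and the size term below are illustrative assumptions for the sketch, not the values from the patches:

```c
/* Hypothetical vios cost for one request: a fixed base unit, scaled up
 * for writes (more expensive on flash) and for larger requests. */
#define VIOS_SCALE_SHIFT 7
#define VIOS_SCALE (1UL << VIOS_SCALE_SHIFT)   /* base vios of a 4k read */

static unsigned long request_vios(int is_write, unsigned long bytes)
{
    unsigned long vios = VIOS_SCALE;

    if (is_write)
        vios = vios * 5 / 2;                   /* example write penalty */

    /* Size component: one eighth of the base unit per 4k chunk. */
    vios += (bytes >> 12) * (VIOS_SCALE >> 3);

    return vios;
}
```

With a costing function like this, a large write charges the issuing task far more vios than a small read, so the min-vios dispatch rule automatically compensates for the asymmetric cost.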
* Re: [RFC 0/3]block: An IOPS based ioscheduler 2012-01-16 7:55 ` Shaohua Li @ 2012-01-16 8:29 ` Vivek Goyal 2012-01-17 1:06 ` Shaohua Li 0 siblings, 1 reply; 29+ messages in thread From: Vivek Goyal @ 2012-01-16 8:29 UTC (permalink / raw) To: Shaohua Li; +Cc: Dave Chinner, linux-kernel, axboe, jmoyer On Mon, Jan 16, 2012 at 03:55:41PM +0800, Shaohua Li wrote: > On Mon, 2012-01-16 at 02:11 -0500, Vivek Goyal wrote: > > On Mon, Jan 16, 2012 at 12:36:30PM +0800, Shaohua Li wrote: > > > On Sun, 2012-01-15 at 17:45 -0500, Vivek Goyal wrote: > > > > On Mon, Jan 09, 2012 at 09:09:35AM +0800, Shaohua Li wrote: > > > > > > > > [..] > > > > > > You need to present raw numbers and give us some idea of how close > > > > > > those numbers are to raw hardware capability for us to have any idea > > > > > > what improvements these numbers actually demonstrate. > > > > > Yes, your guess is right. The hardware has limitation. 12 SSD exceeds > > > > > the jbod capability, for both throughput and IOPS, that's why only > > > > > read/write mixed workload impacts. I'll use less SSD in later tests, > > > > > which will demonstrate the performance better. I'll report both raw > > > > > numbers and fiops/cfq numbers later. > > > > > > > > If fiops number are better please explain why those numbers are better. > > > > If you cut down on idling, it is obivious that you will get higher > > > > throughput on these flash devices. CFQ does disable queue idling for > > > > non rotational NCQ devices. If higher throughput is due to driving > > > > deeper queue depths, then CFQ can do that too just by changing quantum > > > > and disabling idling. > > > it's because of quantum. Surely you can change the quantum, and CFQ > > > performance will increase, but you will find CFQ is very unfair then. > > > > Why increasing quantum leads to CFQ being unfair? In terms of time it > > still tries to be fair. > we can dispatch a lot of requests to NCQ SSD with very small time > interval. 
The disk can finish a lot of requests in small time interval > too. The time is much smaller than 1 jiffy. Increasing quantum can lead > a task dispatches request more faster and makes the accounting worse, > because with small quantum the task needs wait to dispatch. you can > easily verify this with a simple fio test. > > > That's a different thing that with NCQ, right > > time measurement is not possible with requests from multiple queues > > being in the driver/disk at the same time. So accouting in terms of > > iops per queue might make sense. > yes. > > > > > So I really don't understand that what are you doing fundamentally > > > > different in FIOPS ioscheduler. > > > > > > > > The only thing I can think of more accurate accounting per queue in > > > > terms of number of IOs instead of time. Which can just serve to improve > > > > fairness a bit for certain workloads. In practice, I think it might > > > > not matter much. > > > If quantum is big, CFQ will have better performance, but it actually > > > fallbacks to Noop, no any fairness. fairness is important and is why we > > > introduce CFQ. > > > > It is not exactly noop. It still preempts writes and prioritizes reads > > and direct writes. > sure, I mean fairness mostly here. > > > Also, what's the real life workload where you face issues with using > > say deadline with these flash based storage. > deadline doesn't provide fairness. mainly cgroup workload. workload with > different ioprio has issues too, but I don't know which real workload > uses ioprio. Personally I have not run into any workload which provides deep queue depths constantly for a very long time. I had to run fio to create such scnearios. Not running deep queue depths will lead to expiration of queue (Otherwise idling will kill performance on these fast devices). And without idling most of the logic of slice and accounting does not help. 
A queue dispatches some requests and expires (irrespective of what time slice you had allocated it based on ioprio). That's why I am insisting that it would be nice that any move in this direction should be driven by some real workload instead of just coming up with synthetic workloads. > > > > > > > In summary, CFQ isn't both fair and good performance. FIOPS is trying to > > > be fair and have good performance. I didn't think any time based > > > accounting can make the goal happen for NCQ and SSD (even cfq cgroup > > > code has iops mode, so suppose you should already know this well). > > > > > > Surely you can change CFQ to make it IOPS based, but this will mess the > > > code a lot, and FIOPS shares a lot of code with CFQ. So I'd like to have > > > a separate ioscheduler which is IOPS based. > > > > I think writing a separate IO scheduler just to do accouting in IOPS while > > retaining rest of the CFQ code is not a very good idea. Modifying CFQ code > > to be able to deal with both time based as well as IOPS accounting might > > turn out to be simpler. > changing CFQ works, but I really want to avoid having something like > if (iops) > xxx > else > xxx I think you can provide a wrapper function and abstract out unit of time as "charge" or something like that. > I plan adding scales for read/write, request size, etc, because > read/write cost is different and request with different size has > different cost in SSD. This can be added to CFQ too with pain. That said > I didn't completely object to make CFQ support IOPS accounting, but my > feeling is a separate ioscheduler is more clean. You can provide one function to calculate the "charge" which can be either time or some kind of IOPS unit adjusted with request size and use that. Given the fact that new IOPS scheduler will be sharing lots of code with CFQ (I presume 80-90% of concepts are same) and the only major difference is accounting, I would tend to think that modifying CFQ is better. 
Jens, what do you think?

Thanks
Vivek
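The "charge" wrapper Vivek suggests could be as small as one helper that both accounting modes funnel through; the names here are hypothetical, not actual CFQ symbols:

```c
#include <stdbool.h>

/* One place where the unit of service is decided: elapsed time in the
 * time-based mode, a request count (which could further be adjusted by
 * request size) in IOPS mode. The rest of the scheduler would only ever
 * add the returned charge to a queue's usage, with no iops/time branches
 * scattered elsewhere. */
static unsigned long cfqq_charge(bool iops_mode,
                                 unsigned long elapsed_jiffies,
                                 unsigned long nr_dispatched)
{
    return iops_mode ? nr_dispatched : elapsed_jiffies;
}
```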
* Re: [RFC 0/3]block: An IOPS based ioscheduler 2012-01-16 8:29 ` Vivek Goyal @ 2012-01-17 1:06 ` Shaohua Li 2012-01-17 9:02 ` Vivek Goyal 0 siblings, 1 reply; 29+ messages in thread From: Shaohua Li @ 2012-01-17 1:06 UTC (permalink / raw) To: Vivek Goyal; +Cc: Dave Chinner, linux-kernel, axboe, jmoyer, zhu.yanhai On Mon, 2012-01-16 at 03:29 -0500, Vivek Goyal wrote: > On Mon, Jan 16, 2012 at 03:55:41PM +0800, Shaohua Li wrote: > > On Mon, 2012-01-16 at 02:11 -0500, Vivek Goyal wrote: > > > On Mon, Jan 16, 2012 at 12:36:30PM +0800, Shaohua Li wrote: > > > > On Sun, 2012-01-15 at 17:45 -0500, Vivek Goyal wrote: > > > > > On Mon, Jan 09, 2012 at 09:09:35AM +0800, Shaohua Li wrote: > > > > > > > > > > [..] > > > > > > > You need to present raw numbers and give us some idea of how close > > > > > > > those numbers are to raw hardware capability for us to have any idea > > > > > > > what improvements these numbers actually demonstrate. > > > > > > Yes, your guess is right. The hardware has limitation. 12 SSD exceeds > > > > > > the jbod capability, for both throughput and IOPS, that's why only > > > > > > read/write mixed workload impacts. I'll use less SSD in later tests, > > > > > > which will demonstrate the performance better. I'll report both raw > > > > > > numbers and fiops/cfq numbers later. > > > > > > > > > > If fiops number are better please explain why those numbers are better. > > > > > If you cut down on idling, it is obivious that you will get higher > > > > > throughput on these flash devices. CFQ does disable queue idling for > > > > > non rotational NCQ devices. If higher throughput is due to driving > > > > > deeper queue depths, then CFQ can do that too just by changing quantum > > > > > and disabling idling. > > > > it's because of quantum. Surely you can change the quantum, and CFQ > > > > performance will increase, but you will find CFQ is very unfair then. > > > > > > Why increasing quantum leads to CFQ being unfair? 
In terms of time it > > > still tries to be fair. > > we can dispatch a lot of requests to NCQ SSD with very small time > > interval. The disk can finish a lot of requests in small time interval > > too. The time is much smaller than 1 jiffy. Increasing quantum can lead > > a task dispatches request more faster and makes the accounting worse, > > because with small quantum the task needs wait to dispatch. you can > > easily verify this with a simple fio test. > > > > > That's a different thing that with NCQ, right > > > time measurement is not possible with requests from multiple queues > > > being in the driver/disk at the same time. So accouting in terms of > > > iops per queue might make sense. > > yes. > > > > > > > So I really don't understand that what are you doing fundamentally > > > > > different in FIOPS ioscheduler. > > > > > > > > > > The only thing I can think of more accurate accounting per queue in > > > > > terms of number of IOs instead of time. Which can just serve to improve > > > > > fairness a bit for certain workloads. In practice, I think it might > > > > > not matter much. > > > > If quantum is big, CFQ will have better performance, but it actually > > > > fallbacks to Noop, no any fairness. fairness is important and is why we > > > > introduce CFQ. > > > > > > It is not exactly noop. It still preempts writes and prioritizes reads > > > and direct writes. > > sure, I mean fairness mostly here. > > > > > Also, what's the real life workload where you face issues with using > > > say deadline with these flash based storage. > > deadline doesn't provide fairness. mainly cgroup workload. workload with > > different ioprio has issues too, but I don't know which real workload > > uses ioprio. > > Personally I have not run into any workload which provides deep queue depths > constantly for a very long time. I had to run fio to create such > scnearios. 
> Not running deep queue depths will lead to expiration of queue (Otherwise idling will kill performance on these fast devices). And without idling most of the logic of slice and accounting does not help. A queue dispatches some requests and expires (irrespective of what time slice you had allocated it based on ioprio).

That's true: if the workload doesn't drive deep queue depths, no accounting can help for NCQ disks, as far as I have tried. Idling is the only method that makes the accounting correct, but it hurts performance too much.

> That's why I am insisting that it would be nice that any move in this direction should be driven by some real workload instead of just coming up with synthetic workloads.

I thought yanhai from taobao (cc-ed) has a real workload where he found CFQ performance suffers a lot.

Thanks,
Shaohua
* Re: [RFC 0/3]block: An IOPS based ioscheduler 2012-01-17 1:06 ` Shaohua Li @ 2012-01-17 9:02 ` Vivek Goyal 2012-01-18 1:20 ` Shaohua Li 0 siblings, 1 reply; 29+ messages in thread From: Vivek Goyal @ 2012-01-17 9:02 UTC (permalink / raw) To: Shaohua Li; +Cc: Dave Chinner, linux-kernel, axboe, jmoyer, zhu.yanhai On Tue, Jan 17, 2012 at 09:06:28AM +0800, Shaohua Li wrote: > On Mon, 2012-01-16 at 03:29 -0500, Vivek Goyal wrote: > > On Mon, Jan 16, 2012 at 03:55:41PM +0800, Shaohua Li wrote: > > > On Mon, 2012-01-16 at 02:11 -0500, Vivek Goyal wrote: > > > > On Mon, Jan 16, 2012 at 12:36:30PM +0800, Shaohua Li wrote: > > > > > On Sun, 2012-01-15 at 17:45 -0500, Vivek Goyal wrote: > > > > > > On Mon, Jan 09, 2012 at 09:09:35AM +0800, Shaohua Li wrote: > > > > > > > > > > > > [..] > > > > > > > > You need to present raw numbers and give us some idea of how close > > > > > > > > those numbers are to raw hardware capability for us to have any idea > > > > > > > > what improvements these numbers actually demonstrate. > > > > > > > Yes, your guess is right. The hardware has limitation. 12 SSD exceeds > > > > > > > the jbod capability, for both throughput and IOPS, that's why only > > > > > > > read/write mixed workload impacts. I'll use less SSD in later tests, > > > > > > > which will demonstrate the performance better. I'll report both raw > > > > > > > numbers and fiops/cfq numbers later. > > > > > > > > > > > > If fiops number are better please explain why those numbers are better. > > > > > > If you cut down on idling, it is obivious that you will get higher > > > > > > throughput on these flash devices. CFQ does disable queue idling for > > > > > > non rotational NCQ devices. If higher throughput is due to driving > > > > > > deeper queue depths, then CFQ can do that too just by changing quantum > > > > > > and disabling idling. > > > > > it's because of quantum. 
Surely you can change the quantum, and CFQ > > > > > performance will increase, but you will find CFQ is very unfair then. > > > > > > > > Why increasing quantum leads to CFQ being unfair? In terms of time it > > > > still tries to be fair. > > > we can dispatch a lot of requests to NCQ SSD with very small time > > > interval. The disk can finish a lot of requests in small time interval > > > too. The time is much smaller than 1 jiffy. Increasing quantum can lead > > > a task dispatches request more faster and makes the accounting worse, > > > because with small quantum the task needs wait to dispatch. you can > > > easily verify this with a simple fio test. > > > > > > > That's a different thing that with NCQ, right > > > > time measurement is not possible with requests from multiple queues > > > > being in the driver/disk at the same time. So accouting in terms of > > > > iops per queue might make sense. > > > yes. > > > > > > > > > So I really don't understand that what are you doing fundamentally > > > > > > different in FIOPS ioscheduler. > > > > > > > > > > > > The only thing I can think of more accurate accounting per queue in > > > > > > terms of number of IOs instead of time. Which can just serve to improve > > > > > > fairness a bit for certain workloads. In practice, I think it might > > > > > > not matter much. > > > > > If quantum is big, CFQ will have better performance, but it actually > > > > > fallbacks to Noop, no any fairness. fairness is important and is why we > > > > > introduce CFQ. > > > > > > > > It is not exactly noop. It still preempts writes and prioritizes reads > > > > and direct writes. > > > sure, I mean fairness mostly here. > > > > > > > Also, what's the real life workload where you face issues with using > > > > say deadline with these flash based storage. > > > deadline doesn't provide fairness. mainly cgroup workload. workload with > > > different ioprio has issues too, but I don't know which real workload > > > uses ioprio. 
> > Personally I have not run into any workload which provides deep queue depths constantly for a very long time. I had to run fio to create such scnearios.
> >
> > Not running deep queue depths will lead to expiration of queue (Otherwise idling will kill performance on these fast devices). And without idling most of the logic of slice and accounting does not help. A queue dispatches some requests and expires (irrespective of what time slice you had allocated it based on ioprio).
> That's true, if workload doesn't drive deep queue depths, any accounting can't help for NCQ disks as far as I tried. Idling is the only method to make accounting correct, but it impacts performance too much.

Idling will kill performance, and the faster the device, the more prominent the effects of idling. So to me, using CFQ on these fast devices is not a very good idea, and deadline might serve just as well.

> > That's why I am insisting that it would be nice that any move in this direction should be driven by some real workload instead of just coming up with synthetic workloads.
> I thought yanhai from taobao (cc-ed) has real workload and he found cfq performance suffers a lot.

Can we run that real workload with "deadline" and see what kind of concerns we have? Is anybody getting starved for a long time? If not, then we don't have to do anything. I think trying to make CFQ work (or trying to come up with a CFQ-like IOPS scheduler) on these fast devices might not lead us anywhere.

Thanks
Vivek
* Re: [RFC 0/3]block: An IOPS based ioscheduler
  2012-01-17  9:02 ` Vivek Goyal
@ 2012-01-18  1:20 ` Shaohua Li
  2012-01-18 13:04 ` Vivek Goyal
  0 siblings, 1 reply; 29+ messages in thread
From: Shaohua Li @ 2012-01-18 1:20 UTC (permalink / raw)
To: Vivek Goyal; +Cc: Dave Chinner, linux-kernel, axboe, jmoyer, zhu.yanhai

On Tue, 2012-01-17 at 04:02 -0500, Vivek Goyal wrote:
> On Tue, Jan 17, 2012 at 09:06:28AM +0800, Shaohua Li wrote:
> > On Mon, 2012-01-16 at 03:29 -0500, Vivek Goyal wrote:
> > > On Mon, Jan 16, 2012 at 03:55:41PM +0800, Shaohua Li wrote:
> > > > On Mon, 2012-01-16 at 02:11 -0500, Vivek Goyal wrote:
> > > > > On Mon, Jan 16, 2012 at 12:36:30PM +0800, Shaohua Li wrote:
> > > > > > On Sun, 2012-01-15 at 17:45 -0500, Vivek Goyal wrote:
> > > > > > > On Mon, Jan 09, 2012 at 09:09:35AM +0800, Shaohua Li wrote:
> > > > > > >
> > > > > > > [..]
> > > > > > > > > You need to present raw numbers and give us some idea of how close
> > > > > > > > > those numbers are to raw hardware capability for us to have any idea
> > > > > > > > > what improvements these numbers actually demonstrate.
> > > > > > > > Yes, your guess is right. The hardware has limitations. 12 SSDs exceed
> > > > > > > > the jbod capability, for both throughput and IOPS; that's why only
> > > > > > > > the read/write mixed workload is impacted. I'll use fewer SSDs in later
> > > > > > > > tests, which will demonstrate the performance better. I'll report both
> > > > > > > > raw numbers and fiops/cfq numbers later.
> > > > > > >
> > > > > > > If the fiops numbers are better, please explain why those numbers are better.
> > > > > > > If you cut down on idling, it is obvious that you will get higher
> > > > > > > throughput on these flash devices. CFQ does disable queue idling for
> > > > > > > non-rotational NCQ devices. If higher throughput is due to driving
> > > > > > > deeper queue depths, then CFQ can do that too just by changing the quantum
> > > > > > > and disabling idling.
> > > > > > it's because of quantum.
> > > > > > Surely you can change the quantum, and CFQ
> > > > > > performance will increase, but you will find CFQ is very unfair then.
> > > > >
> > > > > Why does increasing the quantum lead to CFQ being unfair? In terms of time it
> > > > > still tries to be fair.
> > > > we can dispatch a lot of requests to an NCQ SSD within a very small time
> > > > interval. The disk can finish a lot of requests in a small time interval
> > > > too. The time is much smaller than 1 jiffy. Increasing the quantum lets
> > > > a task dispatch requests faster and makes the accounting worse,
> > > > because with a small quantum the task has to wait to dispatch. You can
> > > > easily verify this with a simple fio test.
> > > >
> > > > > That's a different thing: with NCQ, correct
> > > > > time measurement is not possible with requests from multiple queues
> > > > > being in the driver/disk at the same time. So accounting in terms of
> > > > > iops per queue might make sense.
> > > > yes.
> > > >
> > > > > > > So I really don't understand what you are doing fundamentally
> > > > > > > differently in the FIOPS ioscheduler.
> > > > > > >
> > > > > > > The only thing I can think of is more accurate accounting per queue in
> > > > > > > terms of number of IOs instead of time, which can just serve to improve
> > > > > > > fairness a bit for certain workloads. In practice, I think it might
> > > > > > > not matter much.
> > > > > > If the quantum is big, CFQ will have better performance, but it actually
> > > > > > falls back to noop, with no fairness at all. Fairness is important and is
> > > > > > why we introduced CFQ.
> > > > >
> > > > > It is not exactly noop. It still preempts writes and prioritizes reads
> > > > > and direct writes.
> > > > sure, I mean fairness mostly here.
> > > >
> > > > > Also, what's the real life workload where you face issues with using,
> > > > > say, deadline with this flash based storage?
> > > > deadline doesn't provide fairness. It's mainly cgroup workloads.
> > > > workloads with
> > > > different ioprio have issues too, but I don't know which real workload
> > > > uses ioprio.
> > >
> > > Personally I have not run into any workload which provides deep queue depths
> > > constantly for a very long time. I had to run fio to create such
> > > scenarios.
> > >
> > > Not running deep queue depths will lead to expiration of the queue (otherwise
> > > idling will kill performance on these fast devices). And without idling
> > > most of the logic of slices and accounting does not help. A queue
> > > dispatches some requests and expires (irrespective of what time slice
> > > you had allocated it based on ioprio).
> > That's true; if the workload doesn't drive deep queue depths, no accounting
> > can help for NCQ disks, as far as I have tried. Idling is the only method to
> > make accounting correct, but it impacts performance too much.
>
> Idling will kill performance, and the faster the device, the more prominent
> the effects of idling are. So to me using CFQ on these fast devices is not
> a very good idea and deadline might just serve well.
>
> > > That's why I am insisting that it would be nice if any move in this
> > > direction were driven by some real workload instead of just coming
> > > up with synthetic workloads.
> > I thought yanhai from taobao (cc-ed) has a real workload where he found cfq
> > performance suffers a lot.
>
> Can we run that real workload with "deadline" and see what kind of
> concerns we have? Is anybody getting starved for a long time? If not,
> then we don't have to do anything.
>
> I think trying to make CFQ work (or trying to come up with a CFQ-like
> IOPS scheduler) on these fast devices might not lead us anywhere.
If only performance matters, I'd rather use noop for SSDs. There is a
requirement to have cgroup support (and maybe ioprio) to give different
tasks different bandwidth.

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [RFC 0/3]block: An IOPS based ioscheduler
  2012-01-18  1:20 ` Shaohua Li
@ 2012-01-18 13:04 ` Vivek Goyal
  2012-01-19  1:21 ` Shaohua Li
  0 siblings, 1 reply; 29+ messages in thread
From: Vivek Goyal @ 2012-01-18 13:04 UTC (permalink / raw)
To: Shaohua Li; +Cc: Dave Chinner, linux-kernel, axboe, jmoyer, zhu.yanhai

On Wed, Jan 18, 2012 at 09:20:37AM +0800, Shaohua Li wrote:

[..]
> > I think trying to make CFQ work (or trying to come up with a CFQ-like
> > IOPS scheduler) on these fast devices might not lead us anywhere.
> If only performance matters, I'd rather use noop for SSDs. There is a
> requirement to have cgroup support (and maybe ioprio) to give different
> tasks different bandwidth.

Sure, but the issue is that we need to idle in an effort to prioritize
a task, and idling kills performance. So you can implement something, but
I have doubts that on fast hardware it is going to be very useful.

Another issue is that flash based storage can drive really deep
queue depths. If that's the case, then the ioscheduler alone can't solve the
prioritization issues (unless the ioscheduler stops driving deep
queue depths, which kills performance). We need some kind of cooperation
from the device (like the device understanding the notion of io priority), so
that the device can prioritize the requests and one need not idle.
That way, we might be able to get service differentiation while getting
reasonable throughput.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [RFC 0/3]block: An IOPS based ioscheduler
  2012-01-18 13:04 ` Vivek Goyal
@ 2012-01-19  1:21 ` Shaohua Li
  0 siblings, 0 replies; 29+ messages in thread
From: Shaohua Li @ 2012-01-19 1:21 UTC (permalink / raw)
To: Vivek Goyal; +Cc: Dave Chinner, linux-kernel, axboe, jmoyer, zhu.yanhai

On Wed, 2012-01-18 at 08:04 -0500, Vivek Goyal wrote:
> On Wed, Jan 18, 2012 at 09:20:37AM +0800, Shaohua Li wrote:
>
> [..]
> > > I think trying to make CFQ work (or trying to come up with a CFQ-like
> > > IOPS scheduler) on these fast devices might not lead us anywhere.
> > If only performance matters, I'd rather use noop for SSDs. There is a
> > requirement to have cgroup support (and maybe ioprio) to give different
> > tasks different bandwidth.
>
> Sure, but the issue is that we need to idle in an effort to prioritize
> a task, and idling kills performance. So you can implement something, but
> I have doubts that on fast hardware it is going to be very useful.
I don't do idling in fiops. If the workload iodepth is low, this will cause
fairness problems, and I just leave it be. There is no way to make a
low-iodepth workload fair without sacrificing performance. CFQ for SSDs
has the same problem too.

> Another issue is that flash based storage can drive really deep
> queue depths. If that's the case, then the ioscheduler alone can't solve the
> prioritization issues (unless the ioscheduler stops driving deep
> queue depths, which kills performance). We need some kind of cooperation
> from the device (like the device understanding the notion of io priority), so
> that the device can prioritize the requests and one need not idle.
> That way, we might be able to get service differentiation while getting
> reasonable throughput.
SSDs still look like normal disks in terms of queue depth. I don't know the
iodepth of PCIe flash cards.
If the queue depth of such a card is very big (I suppose there is a
limit, because past a critical point increasing queue depth doesn't
increase device performance), we definitely need to reconsider this. It
would be great if the device let the ioscheduler know more info.

In my mind, I hope the device can dynamically adjust its queue depth. For
example, on my SSD, if the request size is 4k, I get max throughput with
queue depth 32. If the request size is 128k, a queue depth of just 2 is
enough to get peak throughput. If the device can stop fetching requests
after two 128k requests are pending, this will solve some fairness issues
for low-iodepth workloads. Unfortunately device capability for different
request sizes highly depends on the vendor. The fiops request size scale
tries to solve the issue, but I still haven't found a good scaling model
for this yet.

I suppose the device cannot do good prioritization if the workload iodepth
is low. If there are just a few requests pending, nobody (device or
ioscheduler) can make a correct judgment in such a case because there
isn't enough info.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 29+ messages in thread
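The dynamic queue depth idea above amounts to a lookup from request size to the smallest depth that still saturates the device. A minimal sketch, using the two measured points Shaohua quotes for his SSD (depth 32 at 4k, depth 2 at 128k); the 64k entry is an assumed intermediate value, and a real table would be vendor-specific:

```python
# Sketch of per-request-size queue depth limiting.
# (max request size in bytes, depth that reaches peak throughput)
DEPTH_TABLE = [
    (4 * 1024, 32),
    (64 * 1024, 8),    # assumed intermediate point, for illustration
    (128 * 1024, 2),
]

def target_depth(request_bytes):
    """Smallest measured queue depth that still reaches peak throughput."""
    for size, depth in DEPTH_TABLE:
        if request_bytes <= size:
            return depth
    return 2  # even larger requests: minimal depth already saturates

print(target_depth(4 * 1024))    # -> 32
print(target_depth(128 * 1024))  # -> 2
```

A device (or driver) that stopped fetching once `target_depth` requests were pending would bound how far ahead any one task can run, which is exactly the fairness benefit described above for low-iodepth competitors.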
* Re: [RFC 0/3]block: An IOPS based ioscheduler
  2012-01-06  5:12 ` Shaohua Li
  ` (2 preceding siblings ...)
  2012-01-08 22:16 ` Dave Chinner
@ 2012-01-15 22:28 ` Vivek Goyal
  3 siblings, 0 replies; 29+ messages in thread
From: Vivek Goyal @ 2012-01-15 22:28 UTC (permalink / raw)
To: Shaohua Li; +Cc: Dave Chinner, linux-kernel, axboe, jmoyer

On Fri, Jan 06, 2012 at 01:12:29PM +0800, Shaohua Li wrote:
> On Thu, 2012-01-05 at 14:50 +0800, Shaohua Li wrote:
> > On Wed, 2012-01-04 at 18:19 +1100, Dave Chinner wrote:
> > > On Wed, Jan 04, 2012 at 02:53:37PM +0800, Shaohua Li wrote:
> > > > An IOPS based I/O scheduler
> > > >
> > > > Flash based storage has some different characteristics from rotating disks:
> > > > 1. no I/O seek.
> > > > 2. read and write I/O cost is usually much different.
> > > > 3. the time a request takes depends on the request size.
> > > > 4. high throughput and IOPS, low latency.
> > > >
> > > > CFQ iosched does well for rotating disks, for example fair dispatching and
> > > > idling for sequential reads. It also has optimizations for flash based
> > > > storage (for item 1 above), but overall it's not designed for flash based
> > > > storage. It's a slice based algorithm. Since flash based storage request
> > > > cost is very low, and drives with big queue_depth are quite popular now,
> > > > which makes dispatching cost even lower, CFQ's slice accounting (jiffy
> > > > based) doesn't work well. CFQ doesn't consider items 2 & 3 above.
> > > >
> > > > FIOPS (Fair IOPS) ioscheduler tries to fix the gaps. It's IOPS based, so it
> > > > only targets drives without I/O seek. It's quite similar to CFQ, but
> > > > the dispatch decision is made according to IOPS instead of slices.
> > > >
> > > > The algorithm is simple. The drive has a service tree, and each task lives
> > > > in the tree. The key into the tree is called vios (virtual I/O). Every
> > > > request has a vios, which is calculated according to its ioprio, request
> > > > size and so on.
> > > > A task's vios is the sum of the vios of all requests it dispatches. FIOPS
> > > > always selects the task with minimum vios in the service tree and lets the
> > > > task dispatch a request. The dispatched request's vios is then added to the
> > > > task's vios and the task is repositioned in the service tree.
> > > >
> > > > The series is organized as:
> > > > Patch 1: separate CFQ's io context management code. FIOPS will use it too.
> > > > Patch 2: The core FIOPS.
> > > > Patch 3: request read/write vios scale. This demonstrates how the vios scales.
> > > >
> > > > To keep the code simple for easy review, some scale code isn't included
> > > > here, and some is not implemented yet.
> > > >
> > > > TODO:
> > > > 1. ioprio support (have a patch already)
> > > > 2. request size vios scale
> > > > 3. cgroup support
> > > > 4. tracing support
> > > > 5. automatically select the default iosched according to QUEUE_FLAG_NONROT.
> > > >
> > > > Comments and suggestions are welcome!
> > >
> > > Benchmark results?
> > I don't have data yet. The patches are still at an early stage; I want
> > to focus on the basic idea first.
> since you asked, I tested on a 4 socket machine with a 12 X25-M SSD jbod;
> the fs is ext4.
>
> workload                        percentage change with fiops against cfq
> fio_sync_read_4k                -2
> fio_mediaplay_64k                0
> fio_mediaplay_128k               0
> fio_mediaplay_rr_64k             0
> fio_sync_read_rr_4k              0
> fio_sync_write_128k              0
> fio_sync_write_64k              -1
> fio_sync_write_4k               -2
> fio_sync_write_64k_create        0
> fio_sync_write_rr_64k_create     0
> fio_sync_write_128k_create       0
> fio_aio_randread_4k             -4
> fio_aio_randread_64k             0
> fio_aio_randwrite_4k             1
> fio_aio_randwrite_64k            0
> fio_aio_randrw_4k               -1
> fio_aio_randrw_64k               0
> fio_tpch                         9
> fio_tpcc                         0
> fio_mmap_randread_4k            -1
> fio_mmap_randread_64k            1
> fio_mmap_randread_1k            -8
> fio_mmap_randwrite_4k           35
> fio_mmap_randwrite_64k          22
> fio_mmap_randwrite_1k           28
> fio_mmap_randwrite_4k_halfbusy  24
> fio_mmap_randrw_4k              23
> fio_mmap_randrw_64k              4
> fio_mmap_randrw_1k              22
> fio_mmap_randrw_4k_halfbusy     35
> fio_mmap_sync_read_4k            0
> fio_mmap_sync_read_64k          -1
> fio_mmap_sync_read_128k         -1
> fio_mmap_sync_read_rr_64k        5
> fio_mmap_sync_read_rr_4k         3
>
> fio_mmap_randread_1k has a regression against 3.2-rc7, but no
> regression against the 3.2-rc6 kernel; still checking why. fiops shows an
> improvement for read/write mixed workloads. CFQ is known to be not good
> for read/write mixed workloads.

Set slice_idle=0 and possibly increase the quantum to 32 or 64, and you
should get better performance. I think it practically becomes an IOPS
based scheduler with slightly imprecise accounting.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 29+ messages in thread
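For reference, the CFQ knobs Vivek mentions live under sysfs. A sketch of the suggested tuning (the device name `sdb` is a placeholder for whatever device is under test, and root is required):

```shell
# Switch the example device to CFQ, disable idling, and raise the quantum,
# per the suggestion above. These are the standard cfq iosched tunables.
echo cfq > /sys/block/sdb/queue/scheduler
echo 0   > /sys/block/sdb/queue/iosched/slice_idle
echo 64  > /sys/block/sdb/queue/iosched/quantum
```

With slice_idle at 0 CFQ stops waiting for more IO from the current queue, and a large quantum lets each queue push many requests per dispatch round, which is what makes the accounting effectively per-IO rather than per-slice.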
* Re: [RFC 0/3]block: An IOPS based ioscheduler
  2012-01-04  6:53 [RFC 0/3]block: An IOPS based ioscheduler Shaohua Li
  ` (3 preceding siblings ...)
  2012-01-04  7:19 ` [RFC 0/3]block: An IOPS based ioscheduler Dave Chinner
@ 2012-01-06  9:41 ` Zhu Yanhai
  2012-01-15 22:24 ` Vivek Goyal
  5 siblings, 0 replies; 29+ messages in thread
From: Zhu Yanhai @ 2012-01-06 9:41 UTC (permalink / raw)
To: Shaohua Li; +Cc: linux-kernel, axboe, vgoyal, jmoyer

Hi,

2012/1/4 Shaohua Li <shaohua.li@intel.com>:
> An IOPS based I/O scheduler
>
> Flash based storage has some different characteristics from rotating disks:
> 1. no I/O seek.
> 2. read and write I/O cost is usually much different.
> 3. the time a request takes depends on the request size.
> 4. high throughput and IOPS, low latency.
>
> CFQ iosched does well for rotating disks, for example fair dispatching and
> idling for sequential reads. It also has optimizations for flash based
> storage (for item 1 above), but overall it's not designed for flash based
> storage. It's a slice based algorithm. Since flash based storage request
> cost is very low, and drives with big queue_depth are quite popular now,
> which makes dispatching cost even lower, CFQ's slice accounting (jiffy
> based) doesn't work well. CFQ doesn't consider items 2 & 3 above.
>
> FIOPS (Fair IOPS) ioscheduler tries to fix the gaps. It's IOPS based, so it
> only targets drives without I/O seek. It's quite similar to CFQ, but
> the dispatch decision is made according to IOPS instead of slices.
>
> The algorithm is simple. The drive has a service tree, and each task lives
> in the tree. The key into the tree is called vios (virtual I/O). Every
> request has a vios, which is calculated according to its ioprio, request
> size and so on. A task's vios is the sum of the vios of all requests it
> dispatches. FIOPS always selects the task with minimum vios in the service
> tree and lets the task dispatch a request.
> The dispatched request's vios is then added to the task's
> vios and the task is repositioned in the service tree.
>
> The series is organized as:
> Patch 1: separate CFQ's io context management code. FIOPS will use it too.
> Patch 2: The core FIOPS.
> Patch 3: request read/write vios scale. This demonstrates how the vios scales.
>
> To keep the code simple for easy review, some scale code isn't included
> here, and some is not implemented yet.
>
> TODO:
> 1. ioprio support (have a patch already)
> 2. request size vios scale
> 3. cgroup support
> 4. tracing support
> 5. automatically select the default iosched according to QUEUE_FLAG_NONROT.

Actually most RAID cards we have ever seen report the wrong value for
this flag, including LSI's 1078E, 9260/9265, HBA cards, etc. So it may
need a human to set it up correctly to make this adaptive code work.

Also there is a much more complicated scenario ahead :) that's hybrid
storage. E.g. one or two SSDs mixed with several SAS disks, or one
expensive FusionIO ioDrive mixed with several SAS disks, powered by
Facebook's Flashcache. AFAIK there isn't any IO bandwidth control
solution for such things. However, such setups are really very popular
these days.

I'm going to run some tests with this code against the real workload
this month; hopefully we can get more ideas, and of course the feedback
data.

--
Thanks,
Zhu Yanhai
Taobao Inc.

>
> Comments and suggestions are welcome!
>
> Thanks,
> Shaohua
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 29+ messages in thread
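The vios bookkeeping quoted above can be sketched in a few lines. This is an illustrative model, not the patch's code; the service tree is modeled as a min-heap, and the 3:1 write/read cost ratio is an assumed value for demonstration (patch 3 defines the real read/write scaling):

```python
import heapq

# Minimal model of the FIOPS dispatch loop: the service tree holds
# (vios, task) pairs; the scheduler repeatedly picks the task with
# minimum vios, charges it the dispatched request's vios, and
# repositions it in the tree.
VIOS_COST = {"read": 1, "write": 3}  # assumed per-request scale factors

def dispatch_order(tasks, rounds):
    """tasks maps task name -> request type; returns the dispatch sequence."""
    tree = [(0, name) for name in sorted(tasks)]  # everyone starts at vios 0
    heapq.heapify(tree)
    order = []
    for _ in range(rounds):
        vios, name = heapq.heappop(tree)   # task with minimum vios dispatches
        order.append(name)
        heapq.heappush(tree, (vios + VIOS_COST[tasks[name]], name))
    return order

# With writes costing 3x reads, a pure reader gets 3x the dispatches.
seq = dispatch_order({"reader": "read", "writer": "write"}, 12)
print(seq.count("reader"), seq.count("writer"))  # -> 9 3
```

This shows how per-request cost scaling (points 2 and 3 in the RFC) turns directly into a bandwidth ratio: fairness is in virtual IO units, so cheaper requests earn proportionally more dispatches.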
* Re: [RFC 0/3]block: An IOPS based ioscheduler
  2012-01-04  6:53 [RFC 0/3]block: An IOPS based ioscheduler Shaohua Li
  ` (4 preceding siblings ...)
  2012-01-06  9:41 ` Zhu Yanhai
@ 2012-01-15 22:24 ` Vivek Goyal
  5 siblings, 0 replies; 29+ messages in thread
From: Vivek Goyal @ 2012-01-15 22:24 UTC (permalink / raw)
To: Shaohua Li; +Cc: linux-kernel, axboe, jmoyer

On Wed, Jan 04, 2012 at 02:53:37PM +0800, Shaohua Li wrote:
> An IOPS based I/O scheduler
>
> Flash based storage has some different characteristics from rotating disks:
> 1. no I/O seek.
> 2. read and write I/O cost is usually much different.
> 3. the time a request takes depends on the request size.
> 4. high throughput and IOPS, low latency.
>
> CFQ iosched does well for rotating disks, for example fair dispatching and
> idling for sequential reads. It also has optimizations for flash based
> storage (for item 1 above), but overall it's not designed for flash based
> storage. It's a slice based algorithm. Since flash based storage request
> cost is very low, and drives with big queue_depth are quite popular now,
> which makes dispatching cost even lower, CFQ's slice accounting (jiffy
> based) doesn't work well. CFQ doesn't consider items 2 & 3 above.
>
> FIOPS (Fair IOPS) ioscheduler tries to fix the gaps. It's IOPS based, so it
> only targets drives without I/O seek. It's quite similar to CFQ, but
> the dispatch decision is made according to IOPS instead of slices.

Hi Shaohua,

What problem are you trying to fix? If you just want to provide
fairness in terms of IOPS instead of time, then I think the existing code
should be easily modifiable instead of writing a new IO scheduler
altogether. It is just a matter of switching the mode, either based on a
tunable or based on the disk type.

If slice_idle is zero, we already have some code to effectively do
accounting in terms of IOPS. This is primarily at the group level. We
should be able to extend it to the queue level. These are implementation
details.
What I am not able to understand is what the problem is and what you are
going to gain by doing accounting in terms of IOPS.

Also, for flash based storage, isn't noop or deadline a good choice of
scheduler? Why do we want to come up with something CFQ-like? For these
fast devices any idling will kill performance, and if you don't idle,
then I think practically most workloads will almost degenerate into
round robin serving.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 29+ messages in thread
end of thread, other threads:[~2012-01-19 1:20 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-01-04  6:53 [RFC 0/3]block: An IOPS based ioscheduler Shaohua Li
2012-01-04  6:53 ` [RFC 1/3]block: seperate CFQ io context management code Shaohua Li
2012-01-04  8:19 ` Namhyung Kim
2012-01-04  6:53 ` [RFC 2/3]block: FIOPS ioscheduler core Shaohua Li
2012-01-06  6:05 ` Namjae Jeon
2012-01-07  1:06 ` Zhu Yanhai
2012-01-04  6:53 ` [RFC 3/3]block: fiops read/write request scale Shaohua Li
2012-01-04  7:19 ` [RFC 0/3]block: An IOPS based ioscheduler Dave Chinner
2012-01-05  6:50 ` Shaohua Li
2012-01-06  5:12 ` Shaohua Li
2012-01-06  9:10 ` Namhyung Kim
2012-01-06 14:37 ` Jan Kara
2012-01-09  1:26 ` Shaohua Li
2012-01-15 22:32 ` Vivek Goyal
2012-01-08 22:16 ` Dave Chinner
2012-01-09  1:09 ` Shaohua Li
2012-01-15 22:45 ` Vivek Goyal
2012-01-16  4:36 ` Shaohua Li
2012-01-16  7:11 ` Vivek Goyal
2012-01-16  7:55 ` Shaohua Li
2012-01-16  8:29 ` Vivek Goyal
2012-01-17  1:06 ` Shaohua Li
2012-01-17  9:02 ` Vivek Goyal
2012-01-18  1:20 ` Shaohua Li
2012-01-18 13:04 ` Vivek Goyal
2012-01-19  1:21 ` Shaohua Li
2012-01-15 22:28 ` Vivek Goyal
2012-01-06  9:41 ` Zhu Yanhai
2012-01-15 22:24 ` Vivek Goyal