* Time sliced CFQ io scheduler
@ 2004-12-02 13:04 Jens Axboe
2004-12-02 13:48 ` Jens Axboe
2004-12-02 14:28 ` Giuliano Pochini
0 siblings, 2 replies; 66+ messages in thread
From: Jens Axboe @ 2004-12-02 13:04 UTC (permalink / raw)
To: Linux Kernel
Hi,
Some time ago I pondered modifying CFQ to do fairness based on slices of
disk time. It appeared to be a minor modification, but with some nice
bonus points:
- It scales nicely with CPU scheduler slices, making io priorities a
cinch to implement.
- It can potentially match the anticipatory scheduler when multiple
processes compete for disk bandwidth
So I implemented it and ran some tests; the results are pretty
astonishing. A note on the test cases, read_files and write_files: they
either read or write a number of files, sequentially or randomly, with
each file being driven by a dedicated process. IO bypasses the page
cache by using O_DIRECT. Runtime is capped at 30 seconds for each test.
Each test case was run on deadline, as, and new cfq. Drive used was an
IDE drive (results similar for SCSI), fs used was ext2.
Scroll past results for the executive summary.
Case 1: read_files, sequential, bs=4k
-------------------------------------
Scheduler: deadline
Clients Max bwidth Min bwidth Agg bwidth Max latency
1 19837 19837 19837 22msec
2 2116 2114 4230 22msec
4 361 360 1444 41msec
8 150 149 1201 111msec
Note: bandwidth quickly becomes seek bound as clients are added.
Scheduler: as
Clients Max bwidth Min bwidth Agg bwidth Max latency
1 19480 19480 19480 30msec
2 9250 9189 18434 261msec
4 4513 4469 17970 488msec
8 2238 2157 17581 934msec
Note: as maintains good aggregate bandwidth as clients are added, while
still being fair between clients. Latency rises quickly.
Scheduler: cfq
Clients Max bwidth Min bwidth Agg bwidth Max latency
1 19433 19433 19433 9msec
2 8686 8628 17312 90msec
4 4507 4471 17963 254msec
8 2181 2104 17134 578msec
Note: cfq performs close to as. Aggregate bandwidth doesn't suffer as
clients are added; inter-client latency and throughput are excellent.
Latency is half that of as.
Case 2: read_files, random, bs=64k
-------------------------------------
Scheduler: deadline
Clients Max bwidth Min bwidth Agg bwidth Max latency
1 7042 7042 7042 20msec
2 3052 3051 6103 28msec
4 1560 1498 6124 101msec
8 802 581 5487 231msec
Scheduler: as
Clients Max bwidth Min bwidth Agg bwidth Max latency
1 7041 7041 7041 18msec
2 4616 2298 6912 270msec
4 3190 928 6901 360msec
8 1524 645 6765 636msec
Note: Aggregate bandwidth remains good, but as has big problems with
inter-client fairness.
Scheduler: cfq
Clients Max bwidth Min bwidth Agg bwidth Max latency
1 7027 7027 7027 19msec
2 3429 3413 6841 107msec
4 1718 1700 6844 282msec
8 875 827 6795 627msec
Note: Aggregate bandwidth remains good and is basically identical to as;
latencies are similar, with cfq a little better. Inter-client fairness
is very good.
Case 3: write_files, sequential, bs=4k
-------------------------------------
Scheduler: deadline
Clients Max bwidth Min bwidth Agg bwidth Max latency
1 13406 13406 13406 21msec
2 1553 1551 3104 171msec
4 690 689 2759 116msec
8 329 318 2604 106msec
Note: Aggregate bandwidth quickly drops with number of clients. Latency
is good.
Scheduler: as
Clients Max bwidth Min bwidth Agg bwidth Max latency
1 13330 13330 13330 21msec
2 2694 2694 5388 77msec
4 1754 17 4988 762msec
8 638 342 3866 848msec
Note: Aggregate bandwidth is better than deadline's, but still not very
good. Latency is poor, and inter-client fairness is horrible.
Scheduler: cfq
Clients Max bwidth Min bwidth Agg bwidth Max latency
1 13267 13267 13267 30msec
2 6352 6150 12459 239msec
4 3230 2945 12524 289msec
8 1640 1640 12564 599msec
Note: Aggregate bandwidth remains high as clients are added, latencies
stay reasonable, and inter-client fairness is very good.
Case 4: write_files, random, bs=4k
-------------------------------------
Scheduler: deadline
Clients Max bwidth Min bwidth Agg bwidth Max latency
1 6749 6749 6749 112msec
2 1299 1277 2574 813msec
4 432 418 1715 227msec
8 291 247 2147 1723msec
Note: Same again for deadline: aggregate bandwidth drops sharply as
clients are added, but at least inter-client fairness is good.
Scheduler: as
Clients Max bwidth Min bwidth Agg bwidth Max latency
1 4110 4110 4110 114msec
2 815 809 1623 631msec
4 482 349 1760 606msec
8 476 111 2863 752msec
Note: Does generally worse than deadline and has fairness issues.
Scheduler: cfq
Clients Max bwidth Min bwidth Agg bwidth Max latency
1 4493 4493 4493 129msec
2 1710 1513 3216 321msec
4 521 482 2002 476msec
8 938 877 7210 927msec
Good results for such a quick hack; I'm generally surprised how well it
does without any tuning. The results above use the default settings for
the cfq slices: an 83ms slice time with a 4ms allowed idle period (a
queue is preempted if it exceeds this idle time). With the disk time
slices, aggregate bandwidth stays close to real disk performance even
with many clients.
Patch against BK-current as of today.
===== drivers/block/cfq-iosched.c 1.15 vs edited =====
--- 1.15/drivers/block/cfq-iosched.c 2004-11-30 07:56:58 +01:00
+++ edited/drivers/block/cfq-iosched.c 2004-12-02 14:03:56 +01:00
@@ -22,21 +22,22 @@
#include <linux/rbtree.h>
#include <linux/mempool.h>
-static unsigned long max_elapsed_crq;
-static unsigned long max_elapsed_dispatch;
-
/*
* tunables
*/
static int cfq_quantum = 4; /* max queue in one round of service */
static int cfq_queued = 8; /* minimum rq allocate limit per-queue*/
-static int cfq_service = HZ; /* period over which service is avg */
static int cfq_fifo_expire_r = HZ / 2; /* fifo timeout for sync requests */
static int cfq_fifo_expire_w = 5 * HZ; /* fifo timeout for async requests */
static int cfq_fifo_rate = HZ / 8; /* fifo expiry rate */
static int cfq_back_max = 16 * 1024; /* maximum backwards seek, in KiB */
static int cfq_back_penalty = 2; /* penalty of a backwards seek */
+static int cfq_slice = HZ / 12;
+static int cfq_idle = HZ / 249;
+
+static int cfq_max_depth = 4;
+
/*
* for the hash of cfqq inside the cfqd
*/
@@ -55,6 +56,7 @@
#define list_entry_hash(ptr) hlist_entry((ptr), struct cfq_rq, hash)
#define list_entry_cfqq(ptr) list_entry((ptr), struct cfq_queue, cfq_list)
+#define list_entry_fifo(ptr) list_entry((ptr), struct request, queuelist)
#define RQ_DATA(rq) (rq)->elevator_private
@@ -76,22 +78,18 @@
#define rq_rb_key(rq) (rq)->sector
/*
- * threshold for switching off non-tag accounting
- */
-#define CFQ_MAX_TAG (4)
-
-/*
* sort key types and names
*/
enum {
CFQ_KEY_PGID,
CFQ_KEY_TGID,
+ CFQ_KEY_PID,
CFQ_KEY_UID,
CFQ_KEY_GID,
CFQ_KEY_LAST,
};
-static char *cfq_key_types[] = { "pgid", "tgid", "uid", "gid", NULL };
+static char *cfq_key_types[] = { "pgid", "tgid", "pid", "uid", "gid", NULL };
/*
* spare queue
@@ -103,6 +101,8 @@
static kmem_cache_t *cfq_ioc_pool;
struct cfq_data {
+ atomic_t ref;
+
struct list_head rr_list;
struct list_head empty_list;
@@ -114,8 +114,6 @@
unsigned int max_queued;
- atomic_t ref;
-
int key_type;
mempool_t *crq_pool;
@@ -127,6 +125,14 @@
int rq_in_driver;
/*
+ * schedule slice state info
+ */
+ struct timer_list timer;
+ struct work_struct unplug_work;
+ struct cfq_queue *active_queue;
+ unsigned int dispatch_slice;
+
+ /*
* tunables, see top of file
*/
unsigned int cfq_quantum;
@@ -137,8 +143,9 @@
unsigned int cfq_back_penalty;
unsigned int cfq_back_max;
unsigned int find_best_crq;
-
- unsigned int cfq_tagged;
+ unsigned int cfq_slice;
+ unsigned int cfq_idle;
+ unsigned int cfq_max_depth;
};
struct cfq_queue {
@@ -150,8 +157,6 @@
struct hlist_node cfq_hash;
/* hash key */
unsigned long key;
- /* whether queue is on rr (or empty) list */
- int on_rr;
/* on either rr or empty list of cfqd */
struct list_head cfq_list;
/* sorted list of pending requests */
@@ -169,15 +174,22 @@
int key_type;
- unsigned long service_start;
+ unsigned long slice_start;
unsigned long service_used;
+ unsigned long service_rq;
+ unsigned long service_last;
- unsigned int max_rate;
+ /* whether queue is on rr (or empty) list */
+ unsigned int on_rr : 1;
+ unsigned int wait_request : 1;
+ unsigned int must_dispatch : 1;
/* number of requests that have been handed to the driver */
int in_flight;
/* number of currently allocated requests */
int alloc_limit[2];
+ /* last rq was sync */
+ char name[16];
};
struct cfq_rq {
@@ -219,6 +231,8 @@
default:
case CFQ_KEY_TGID:
return tsk->tgid;
+ case CFQ_KEY_PID:
+ return tsk->pid;
case CFQ_KEY_UID:
return tsk->uid;
case CFQ_KEY_GID:
@@ -309,7 +323,7 @@
if (blk_barrier_rq(rq))
break;
-
+
if (distance < abs(s1 - rq->sector + rq->nr_sectors)) {
distance = abs(s1 - rq->sector +rq->nr_sectors);
last = rq->sector + rq->nr_sectors;
@@ -406,67 +420,22 @@
cfqq->next_crq = cfq_find_next_crq(cfqq->cfqd, cfqq, crq);
}
-static int cfq_check_sort_rr_list(struct cfq_queue *cfqq)
-{
- struct list_head *head = &cfqq->cfqd->rr_list;
- struct list_head *next, *prev;
-
- /*
- * list might still be ordered
- */
- next = cfqq->cfq_list.next;
- if (next != head) {
- struct cfq_queue *cnext = list_entry_cfqq(next);
-
- if (cfqq->service_used > cnext->service_used)
- return 1;
- }
-
- prev = cfqq->cfq_list.prev;
- if (prev != head) {
- struct cfq_queue *cprev = list_entry_cfqq(prev);
-
- if (cfqq->service_used < cprev->service_used)
- return 1;
- }
-
- return 0;
-}
-
-static void cfq_sort_rr_list(struct cfq_queue *cfqq, int new_queue)
+static void cfq_resort_rr_list(struct cfq_queue *cfqq)
{
struct list_head *entry = &cfqq->cfqd->rr_list;
- if (!cfqq->on_rr)
- return;
- if (!new_queue && !cfq_check_sort_rr_list(cfqq))
- return;
-
list_del(&cfqq->cfq_list);
/*
- * sort by our mean service_used, sub-sort by in-flight requests
+ * sort by when queue was last serviced
*/
while ((entry = entry->prev) != &cfqq->cfqd->rr_list) {
struct cfq_queue *__cfqq = list_entry_cfqq(entry);
- if (cfqq->service_used > __cfqq->service_used)
+ if (!__cfqq->service_last)
+ break;
+ if (time_before(__cfqq->service_last, cfqq->service_last))
break;
- else if (cfqq->service_used == __cfqq->service_used) {
- struct list_head *prv;
-
- while ((prv = entry->prev) != &cfqq->cfqd->rr_list) {
- __cfqq = list_entry_cfqq(prv);
-
- WARN_ON(__cfqq->service_used > cfqq->service_used);
- if (cfqq->service_used != __cfqq->service_used)
- break;
- if (cfqq->in_flight > __cfqq->in_flight)
- break;
-
- entry = prv;
- }
- }
}
list_add(&cfqq->cfq_list, entry);
@@ -479,16 +448,12 @@
static inline void
cfq_add_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
{
- /*
- * it's currently on the empty list
- */
- cfqq->on_rr = 1;
- cfqd->busy_queues++;
+ BUG_ON(cfqq->on_rr);
- if (time_after(jiffies, cfqq->service_start + cfq_service))
- cfqq->service_used >>= 3;
+ cfqd->busy_queues++;
+ cfqq->on_rr = 1;
- cfq_sort_rr_list(cfqq, 1);
+ cfq_resort_rr_list(cfqq);
}
static inline void
@@ -512,10 +477,10 @@
struct cfq_data *cfqd = cfqq->cfqd;
BUG_ON(!cfqq->queued[crq->is_sync]);
+ cfqq->queued[crq->is_sync]--;
cfq_update_next_crq(crq);
- cfqq->queued[crq->is_sync]--;
rb_erase(&crq->rb_node, &cfqq->sort_list);
RB_CLEAR_COLOR(&crq->rb_node);
@@ -622,11 +587,6 @@
if (crq) {
struct cfq_queue *cfqq = crq->cfq_queue;
- if (cfqq->cfqd->cfq_tagged) {
- cfqq->service_used--;
- cfq_sort_rr_list(cfqq, 0);
- }
-
crq->accounted = 0;
cfqq->cfqd->rq_in_driver--;
}
@@ -640,9 +600,7 @@
if (crq) {
cfq_remove_merge_hints(q, crq);
list_del_init(&rq->queuelist);
-
- if (crq->cfq_queue)
- cfq_del_crq_rb(crq);
+ cfq_del_crq_rb(crq);
}
}
@@ -724,6 +682,101 @@
}
/*
+ * current cfqq expired its slice (or was too idle), select new one
+ */
+static inline void cfq_slice_expired(struct cfq_data *cfqd)
+{
+ struct cfq_queue *cfqq = cfqd->active_queue;
+ unsigned long now = jiffies;
+
+ if (cfqq) {
+ if (cfqq->wait_request)
+ del_timer(&cfqd->timer);
+
+ cfqq->service_used += now - cfqq->slice_start;
+ cfqq->service_rq += cfqd->dispatch_slice;
+ cfqq->service_last = now;
+ cfqq->must_dispatch = 0;
+
+ if (cfqq->on_rr)
+ cfq_resort_rr_list(cfqq);
+
+ cfqq = NULL;
+ }
+
+ if (!list_empty(&cfqd->rr_list)) {
+ cfqq = list_entry_cfqq(cfqd->rr_list.next);
+
+ cfqq->slice_start = now;
+ cfqq->wait_request = 0;
+ }
+
+ cfqd->active_queue = cfqq;
+ cfqd->dispatch_slice = 0;
+}
+
+static int cfq_arm_slice_timer(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+{
+ WARN_ON(!RB_EMPTY(&cfqq->sort_list));
+
+ cfqq->wait_request = 1;
+
+ if (!cfqd->cfq_idle)
+ return 0;
+
+ if (!timer_pending(&cfqd->timer)) {
+ unsigned long now = jiffies, slice_left;
+
+ slice_left = cfqd->cfq_slice - (now - cfqq->slice_start);
+ cfqd->timer.expires = now + min(cfqd->cfq_idle,(unsigned int)slice_left);
+ add_timer(&cfqd->timer);
+ }
+
+ return 1;
+}
+
+/*
+ * get next queue for service
+ */
+static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
+{
+ struct cfq_queue *cfqq = cfqd->active_queue;
+ unsigned long slice_used;
+
+ cfqq = cfqd->active_queue;
+ if (!cfqq)
+ goto new_queue;
+
+ slice_used = jiffies - cfqq->slice_start;
+
+ if (cfqq->must_dispatch)
+ goto must_queue;
+
+ /*
+ * slice has expired
+ */
+ if (slice_used >= cfqd->cfq_slice)
+ goto new_queue;
+
+ /*
+ * if queue has requests, dispatch one. if not, check if
+ * enough slice is left to wait for one
+ */
+must_queue:
+ if (!RB_EMPTY(&cfqq->sort_list))
+ goto keep_queue;
+ else if (cfqd->cfq_slice - slice_used >= cfqd->cfq_idle) {
+ if (cfq_arm_slice_timer(cfqd, cfqq))
+ return NULL;
+ }
+
+new_queue:
+ cfq_slice_expired(cfqd);
+keep_queue:
+ return cfqd->active_queue;
+}
+
+/*
* we dispatch cfqd->cfq_quantum requests in total from the rr_list queues,
* this function sector sorts the selected request to minimize seeks. we start
* at cfqd->last_sector, not 0.
@@ -741,9 +794,7 @@
list_del(&crq->request->queuelist);
last = cfqd->last_sector;
- while ((entry = entry->prev) != head) {
- __rq = list_entry_rq(entry);
-
+ list_for_each_entry_reverse(__rq, head, queuelist) {
if (blk_barrier_rq(crq->request))
break;
if (!blk_fs_request(crq->request))
@@ -777,95 +828,86 @@
if (time_before(now, cfqq->last_fifo_expire + cfqd->cfq_fifo_batch_expire))
return NULL;
- crq = RQ_DATA(list_entry(cfqq->fifo[0].next, struct request, queuelist));
- if (reads && time_after(now, crq->queue_start + cfqd->cfq_fifo_expire_r)) {
- cfqq->last_fifo_expire = now;
- return crq;
+ if (reads) {
+ crq = RQ_DATA(list_entry_fifo(cfqq->fifo[READ].next));
+ if (time_after(now, crq->queue_start + cfqd->cfq_fifo_expire_r)) {
+ cfqq->last_fifo_expire = now;
+ return crq;
+ }
}
- crq = RQ_DATA(list_entry(cfqq->fifo[1].next, struct request, queuelist));
- if (writes && time_after(now, crq->queue_start + cfqd->cfq_fifo_expire_w)) {
- cfqq->last_fifo_expire = now;
- return crq;
+ if (writes) {
+ crq = RQ_DATA(list_entry_fifo(cfqq->fifo[WRITE].next));
+ if (time_after(now, crq->queue_start + cfqd->cfq_fifo_expire_w)) {
+ cfqq->last_fifo_expire = now;
+ return crq;
+ }
}
return NULL;
}
-/*
- * dispatch a single request from given queue
- */
-static inline void
-cfq_dispatch_request(request_queue_t *q, struct cfq_data *cfqd,
- struct cfq_queue *cfqq)
+static int
+__cfq_dispatch_requests(struct cfq_data *cfqd, struct cfq_queue *cfqq,
+ int max_dispatch)
{
- struct cfq_rq *crq;
+ int dispatched = 0;
- /*
- * follow expired path, else get first next available
- */
- if ((crq = cfq_check_fifo(cfqq)) == NULL) {
- if (cfqd->find_best_crq)
- crq = cfqq->next_crq;
- else
- crq = rb_entry_crq(rb_first(&cfqq->sort_list));
- }
+ BUG_ON(RB_EMPTY(&cfqq->sort_list));
- cfqd->last_sector = crq->request->sector + crq->request->nr_sectors;
+ do {
+ struct cfq_rq *crq;
- /*
- * finally, insert request into driver list
- */
- cfq_dispatch_sort(q, crq);
+ /*
+ * follow expired path, else get first next available
+ */
+ if ((crq = cfq_check_fifo(cfqq)) == NULL) {
+ if (cfqd->find_best_crq)
+ crq = cfqq->next_crq;
+ else
+ crq = rb_entry_crq(rb_first(&cfqq->sort_list));
+ }
+
+ cfqd->last_sector = crq->request->sector + crq->request->nr_sectors;
+
+ /*
+ * finally, insert request into driver list
+ */
+ cfq_dispatch_sort(cfqd->queue, crq);
+
+ cfqd->dispatch_slice++;
+ dispatched++;
+
+ if (RB_EMPTY(&cfqq->sort_list))
+ break;
+
+ } while (dispatched < max_dispatch);
+
+ return dispatched;
}
static int cfq_dispatch_requests(request_queue_t *q, int max_dispatch)
{
struct cfq_data *cfqd = q->elevator->elevator_data;
struct cfq_queue *cfqq;
- struct list_head *entry, *tmp;
- int queued, busy_queues, first_round;
if (list_empty(&cfqd->rr_list))
return 0;
- queued = 0;
- first_round = 1;
-restart:
- busy_queues = 0;
- list_for_each_safe(entry, tmp, &cfqd->rr_list) {
- cfqq = list_entry_cfqq(entry);
-
- BUG_ON(RB_EMPTY(&cfqq->sort_list));
-
- /*
- * first round of queueing, only select from queues that
- * don't already have io in-flight
- */
- if (first_round && cfqq->in_flight)
- continue;
-
- cfq_dispatch_request(q, cfqd, cfqq);
-
- if (!RB_EMPTY(&cfqq->sort_list))
- busy_queues++;
-
- queued++;
- }
-
- if ((queued < max_dispatch) && (busy_queues || first_round)) {
- first_round = 0;
- goto restart;
- }
+ cfqq = cfq_select_queue(cfqd);
+ if (!cfqq)
+ return 0;
- return queued;
+ cfqq->wait_request = 0;
+ cfqq->must_dispatch = 0;
+ del_timer(&cfqd->timer);
+ return __cfq_dispatch_requests(cfqd, cfqq, max_dispatch);
}
static inline void cfq_account_dispatch(struct cfq_rq *crq)
{
struct cfq_queue *cfqq = crq->cfq_queue;
struct cfq_data *cfqd = cfqq->cfqd;
- unsigned long now, elapsed;
/*
* accounted bit is necessary since some drivers will call
@@ -874,62 +916,34 @@
if (crq->accounted)
return;
- now = jiffies;
- if (cfqq->service_start == ~0UL)
- cfqq->service_start = now;
-
- /*
- * on drives with tagged command queueing, command turn-around time
- * doesn't necessarily reflect the time spent processing this very
- * command inside the drive. so do the accounting differently there,
- * by just sorting on the number of requests
- */
- if (cfqd->cfq_tagged) {
- if (time_after(now, cfqq->service_start + cfq_service)) {
- cfqq->service_start = now;
- cfqq->service_used /= 10;
- }
-
- cfqq->service_used++;
- cfq_sort_rr_list(cfqq, 0);
- }
-
- elapsed = now - crq->queue_start;
- if (elapsed > max_elapsed_dispatch)
- max_elapsed_dispatch = elapsed;
-
crq->accounted = 1;
- crq->service_start = now;
-
- if (++cfqd->rq_in_driver >= CFQ_MAX_TAG && !cfqd->cfq_tagged) {
- cfqq->cfqd->cfq_tagged = 1;
- printk("cfq: depth %d reached, tagging now on\n", CFQ_MAX_TAG);
- }
+ crq->service_start = jiffies;
+ cfqd->rq_in_driver++;
}
static inline void
cfq_account_completion(struct cfq_queue *cfqq, struct cfq_rq *crq)
{
struct cfq_data *cfqd = cfqq->cfqd;
+ unsigned long now = jiffies;
WARN_ON(!cfqd->rq_in_driver);
cfqd->rq_in_driver--;
- if (!cfqd->cfq_tagged) {
- unsigned long now = jiffies;
- unsigned long duration = now - crq->service_start;
-
- if (time_after(now, cfqq->service_start + cfq_service)) {
- cfqq->service_start = now;
- cfqq->service_used >>= 3;
- }
+ cfqq->service_used += now - crq->service_start;
- cfqq->service_used += duration;
- cfq_sort_rr_list(cfqq, 0);
+ /*
+ * queue was preempted while this request was servicing
+ */
+ if (cfqd->active_queue != cfqq)
+ return;
- if (duration > max_elapsed_crq)
- max_elapsed_crq = duration;
- }
+ /*
+ * no requests. if last request was a sync request, wait for
+ * a new one.
+ */
+ if (RB_EMPTY(&cfqq->sort_list) && crq->is_sync)
+ cfq_arm_slice_timer(cfqd, cfqq);
}
static struct request *cfq_next_request(request_queue_t *q)
@@ -937,6 +951,9 @@
struct cfq_data *cfqd = q->elevator->elevator_data;
struct request *rq;
+ if (cfqd->rq_in_driver >= cfqd->cfq_max_depth)
+ return NULL;
+
if (!list_empty(&q->queue_head)) {
struct cfq_rq *crq;
dispatch:
@@ -964,6 +981,8 @@
*/
static void cfq_put_queue(struct cfq_queue *cfqq)
{
+ struct cfq_data *cfqd = cfqq->cfqd;
+
BUG_ON(!atomic_read(&cfqq->ref));
if (!atomic_dec_and_test(&cfqq->ref))
@@ -972,6 +991,9 @@
BUG_ON(rb_first(&cfqq->sort_list));
BUG_ON(cfqq->on_rr);
+ if (unlikely(cfqd->active_queue == cfqq))
+ cfqd->active_queue = NULL;
+
cfq_put_cfqd(cfqq->cfqd);
/*
@@ -1117,6 +1139,7 @@
cic->ioc = ioc;
cic->cfqq = __cfqq;
atomic_inc(&__cfqq->ref);
+ atomic_inc(&cfqd->ref);
} else {
struct cfq_io_context *__cic;
unsigned long flags;
@@ -1159,10 +1182,10 @@
__cic->ioc = ioc;
__cic->cfqq = __cfqq;
atomic_inc(&__cfqq->ref);
+ atomic_inc(&cfqd->ref);
spin_lock_irqsave(&ioc->lock, flags);
list_add(&__cic->list, &cic->list);
spin_unlock_irqrestore(&ioc->lock, flags);
-
cic = __cic;
*cfqq = __cfqq;
}
@@ -1199,8 +1222,11 @@
new_cfqq = kmem_cache_alloc(cfq_pool, gfp_mask);
spin_lock_irq(cfqd->queue->queue_lock);
goto retry;
- } else
- goto out;
+ } else {
+ cfqq = kmem_cache_alloc(cfq_pool, gfp_mask);
+ if (!cfqq)
+ goto out;
+ }
memset(cfqq, 0, sizeof(*cfqq));
@@ -1216,7 +1242,8 @@
cfqq->cfqd = cfqd;
atomic_inc(&cfqd->ref);
cfqq->key_type = cfqd->key_type;
- cfqq->service_start = ~0UL;
+ cfqq->service_last = 0;
+ strncpy(cfqq->name, current->comm, sizeof(cfqq->name) - 1);
}
if (new_cfqq)
@@ -1243,14 +1270,27 @@
static void cfq_enqueue(struct cfq_data *cfqd, struct cfq_rq *crq)
{
- crq->is_sync = 0;
- if (rq_data_dir(crq->request) == READ || current->flags & PF_SYNCWRITE)
- crq->is_sync = 1;
+ struct cfq_queue *cfqq = crq->cfq_queue;
+ struct request *rq = crq->request;
+
+ crq->is_sync = rq_data_dir(rq) == READ || current->flags & PF_SYNCWRITE;
cfq_add_crq_rb(crq);
crq->queue_start = jiffies;
- list_add_tail(&crq->request->queuelist, &crq->cfq_queue->fifo[crq->is_sync]);
+ list_add_tail(&rq->queuelist, &crq->cfq_queue->fifo[crq->is_sync]);
+
+ /*
+ * if we are waiting for a request for this queue, let it rip
+ * immediately and flag that we must not expire this queue just now
+ */
+ if (cfqq->wait_request && cfqq == cfqd->active_queue) {
+ request_queue_t *q = cfqd->queue;
+
+ cfqq->must_dispatch = 1;
+ del_timer(&cfqd->timer);
+ q->request_fn(q);
+ }
}
static void
@@ -1339,31 +1379,34 @@
struct cfq_data *cfqd = q->elevator->elevator_data;
struct cfq_queue *cfqq;
int ret = ELV_MQUEUE_MAY;
+ int limit;
if (current->flags & PF_MEMALLOC)
return ELV_MQUEUE_MAY;
cfqq = cfq_find_cfq_hash(cfqd, cfq_hash_key(cfqd, current));
- if (cfqq) {
- int limit = cfqd->max_queued;
-
- if (cfqq->allocated[rw] < cfqd->cfq_queued)
- return ELV_MQUEUE_MUST;
-
- if (cfqd->busy_queues)
- limit = q->nr_requests / cfqd->busy_queues;
-
- if (limit < cfqd->cfq_queued)
- limit = cfqd->cfq_queued;
- else if (limit > cfqd->max_queued)
- limit = cfqd->max_queued;
+ if (unlikely(!cfqq))
+ return ELV_MQUEUE_MAY;
- if (cfqq->allocated[rw] >= limit) {
- if (limit > cfqq->alloc_limit[rw])
- cfqq->alloc_limit[rw] = limit;
+ if (cfqq->allocated[rw] < cfqd->cfq_queued)
+ return ELV_MQUEUE_MUST;
+ if (cfqq->wait_request)
+ return ELV_MQUEUE_MUST;
+
+ limit = cfqd->max_queued;
+ if (cfqd->busy_queues)
+ limit = q->nr_requests / cfqd->busy_queues;
+
+ if (limit < cfqd->cfq_queued)
+ limit = cfqd->cfq_queued;
+ else if (limit > cfqd->max_queued)
+ limit = cfqd->max_queued;
+
+ if (cfqq->allocated[rw] >= limit) {
+ if (limit > cfqq->alloc_limit[rw])
+ cfqq->alloc_limit[rw] = limit;
- ret = ELV_MQUEUE_NO;
- }
+ ret = ELV_MQUEUE_NO;
}
return ret;
@@ -1395,12 +1438,12 @@
BUG_ON(q->last_merge == rq);
BUG_ON(!hlist_unhashed(&crq->hash));
- if (crq->io_context)
- put_io_context(crq->io_context->ioc);
-
BUG_ON(!cfqq->allocated[crq->is_write]);
cfqq->allocated[crq->is_write]--;
+ if (crq->io_context)
+ put_io_context(crq->io_context->ioc);
+
mempool_free(crq, cfqd->crq_pool);
rq->elevator_private = NULL;
@@ -1473,6 +1516,7 @@
crq->is_write = rw;
rq->elevator_private = crq;
cfqq->alloc_limit[rw] = 0;
+ smp_mb();
return 0;
}
@@ -1486,6 +1530,44 @@
return 1;
}
+static void cfq_kick_queue(void *data)
+{
+ request_queue_t *q = data;
+
+ blk_run_queue(q);
+}
+
+static void cfq_schedule_timer(unsigned long data)
+{
+ struct cfq_data *cfqd = (struct cfq_data *) data;
+ struct cfq_queue *cfqq;
+ unsigned long flags;
+
+ spin_lock_irqsave(cfqd->queue->queue_lock, flags);
+
+ if ((cfqq = cfqd->active_queue) != NULL) {
+ /*
+ * expired
+ */
+ if (time_after(jiffies, cfqq->slice_start + cfqd->cfq_slice))
+ goto out;
+
+ /*
+ * not expired and it has a request pending, let it dispatch
+ */
+ if (!RB_EMPTY(&cfqq->sort_list)) {
+ cfqq->must_dispatch = 1;
+ goto out_cont;
+ }
+ }
+
+out:
+ cfq_slice_expired(cfqd);
+out_cont:
+ spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
+ kblockd_schedule_work(&cfqd->unplug_work);
+}
+
static void cfq_put_cfqd(struct cfq_data *cfqd)
{
request_queue_t *q = cfqd->queue;
@@ -1494,6 +1576,8 @@
if (!atomic_dec_and_test(&cfqd->ref))
return;
+ blk_sync_queue(q);
+
/*
* kill spare queue, getting it means we have two refences to it.
* drop both
@@ -1567,8 +1651,15 @@
q->nr_requests = 1024;
cfqd->max_queued = q->nr_requests / 16;
q->nr_batching = cfq_queued;
- cfqd->key_type = CFQ_KEY_TGID;
+ cfqd->key_type = CFQ_KEY_PID;
cfqd->find_best_crq = 1;
+
+ init_timer(&cfqd->timer);
+ cfqd->timer.function = cfq_schedule_timer;
+ cfqd->timer.data = (unsigned long) cfqd;
+
+ INIT_WORK(&cfqd->unplug_work, cfq_kick_queue, q);
+
atomic_set(&cfqd->ref, 1);
cfqd->cfq_queued = cfq_queued;
@@ -1578,6 +1669,9 @@
cfqd->cfq_fifo_batch_expire = cfq_fifo_rate;
cfqd->cfq_back_max = cfq_back_max;
cfqd->cfq_back_penalty = cfq_back_penalty;
+ cfqd->cfq_slice = cfq_slice;
+ cfqd->cfq_idle = cfq_idle;
+ cfqd->cfq_max_depth = cfq_max_depth;
return 0;
out_spare:
@@ -1624,7 +1718,6 @@
return -ENOMEM;
}
-
/*
* sysfs parts below -->
*/
@@ -1650,13 +1743,6 @@
}
static ssize_t
-cfq_clear_elapsed(struct cfq_data *cfqd, const char *page, size_t count)
-{
- max_elapsed_dispatch = max_elapsed_crq = 0;
- return count;
-}
-
-static ssize_t
cfq_set_key_type(struct cfq_data *cfqd, const char *page, size_t count)
{
spin_lock_irq(cfqd->queue->queue_lock);
@@ -1664,6 +1750,8 @@
cfqd->key_type = CFQ_KEY_PGID;
else if (!strncmp(page, "tgid", 4))
cfqd->key_type = CFQ_KEY_TGID;
+ else if (!strncmp(page, "pid", 3))
+ cfqd->key_type = CFQ_KEY_PID;
else if (!strncmp(page, "uid", 3))
cfqd->key_type = CFQ_KEY_UID;
else if (!strncmp(page, "gid", 3))
@@ -1688,6 +1776,52 @@
return len;
}
+static ssize_t
+cfq_status_show(struct cfq_data *cfqd, char *page)
+{
+ struct list_head *entry;
+ struct cfq_queue *cfqq;
+ ssize_t len;
+ int i = 0, queues;
+
+ len = sprintf(page, "Busy queues: %u\n", cfqd->busy_queues);
+ len += sprintf(page+len, "key type: %s\n",
+ cfq_key_types[cfqd->key_type]);
+ len += sprintf(page+len, "last sector: %Lu\n",
+ (unsigned long long)cfqd->last_sector);
+
+ len += sprintf(page+len, "Busy queue list:\n");
+ spin_lock_irq(cfqd->queue->queue_lock);
+ list_for_each(entry, &cfqd->rr_list) {
+ i++;
+ cfqq = list_entry_cfqq(entry);
+ len += sprintf(page+len, " cfqq: key=%lu alloc=%d/%d, "
+ "queued=%d/%d, service=%lu/%lu\n",
+ cfqq->key, cfqq->allocated[0], cfqq->allocated[1],
+ cfqq->queued[0], cfqq->queued[1], cfqq->service_used,
+ cfqq->service_rq);
+ }
+ len += sprintf(page+len, " busy queues total: %d\n", i);
+ queues = i;
+
+ len += sprintf(page+len, "Empty queue list:\n");
+ i = 0;
+ list_for_each(entry, &cfqd->empty_list) {
+ i++;
+ cfqq = list_entry_cfqq(entry);
+ len += sprintf(page+len, " cfqq: key=%lu alloc=%d/%d, "
+ "queued=%d/%d, service=%lu/%lu\n",
+ cfqq->key, cfqq->allocated[0], cfqq->allocated[1],
+ cfqq->queued[0], cfqq->queued[1], cfqq->service_used,
+ cfqq->service_rq);
+ }
+ len += sprintf(page+len, " empty queues total: %d\n", i);
+ queues += i;
+ len += sprintf(page+len, "Total queues: %d\n", queues);
+ spin_unlock_irq(cfqd->queue->queue_lock);
+ return len;
+}
+
#define SHOW_FUNCTION(__FUNC, __VAR, __CONV) \
static ssize_t __FUNC(struct cfq_data *cfqd, char *page) \
{ \
@@ -1704,6 +1838,9 @@
SHOW_FUNCTION(cfq_find_best_show, cfqd->find_best_crq, 0);
SHOW_FUNCTION(cfq_back_max_show, cfqd->cfq_back_max, 0);
SHOW_FUNCTION(cfq_back_penalty_show, cfqd->cfq_back_penalty, 0);
+SHOW_FUNCTION(cfq_idle_show, cfqd->cfq_idle, 1);
+SHOW_FUNCTION(cfq_slice_show, cfqd->cfq_slice, 1);
+SHOW_FUNCTION(cfq_max_depth_show, cfqd->cfq_max_depth, 0);
#undef SHOW_FUNCTION
#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV) \
@@ -1729,6 +1866,9 @@
STORE_FUNCTION(cfq_find_best_store, &cfqd->find_best_crq, 0, 1, 0);
STORE_FUNCTION(cfq_back_max_store, &cfqd->cfq_back_max, 0, UINT_MAX, 0);
STORE_FUNCTION(cfq_back_penalty_store, &cfqd->cfq_back_penalty, 1, UINT_MAX, 0);
+STORE_FUNCTION(cfq_idle_store, &cfqd->cfq_idle, 0, cfqd->cfq_slice/2, 1);
+STORE_FUNCTION(cfq_slice_store, &cfqd->cfq_slice, 1, UINT_MAX, 1);
+STORE_FUNCTION(cfq_max_depth_store, &cfqd->cfq_max_depth, 2, UINT_MAX, 0);
#undef STORE_FUNCTION
static struct cfq_fs_entry cfq_quantum_entry = {
@@ -1771,15 +1911,30 @@
.show = cfq_back_penalty_show,
.store = cfq_back_penalty_store,
};
-static struct cfq_fs_entry cfq_clear_elapsed_entry = {
- .attr = {.name = "clear_elapsed", .mode = S_IWUSR },
- .store = cfq_clear_elapsed,
+static struct cfq_fs_entry cfq_slice_entry = {
+ .attr = {.name = "slice", .mode = S_IRUGO | S_IWUSR },
+ .show = cfq_slice_show,
+ .store = cfq_slice_store,
+};
+static struct cfq_fs_entry cfq_idle_entry = {
+ .attr = {.name = "idle", .mode = S_IRUGO | S_IWUSR },
+ .show = cfq_idle_show,
+ .store = cfq_idle_store,
+};
+static struct cfq_fs_entry cfq_misc_entry = {
+ .attr = {.name = "show_status", .mode = S_IRUGO },
+ .show = cfq_status_show,
};
static struct cfq_fs_entry cfq_key_type_entry = {
.attr = {.name = "key_type", .mode = S_IRUGO | S_IWUSR },
.show = cfq_read_key_type,
.store = cfq_set_key_type,
};
+static struct cfq_fs_entry cfq_max_depth_entry = {
+ .attr = {.name = "max_depth", .mode = S_IRUGO | S_IWUSR },
+ .show = cfq_max_depth_show,
+ .store = cfq_max_depth_store,
+};
static struct attribute *default_attrs[] = {
&cfq_quantum_entry.attr,
@@ -1791,7 +1946,10 @@
&cfq_find_best_entry.attr,
&cfq_back_max_entry.attr,
&cfq_back_penalty_entry.attr,
- &cfq_clear_elapsed_entry.attr,
+ &cfq_misc_entry.attr,
+ &cfq_slice_entry.attr,
+ &cfq_idle_entry.attr,
+ &cfq_max_depth_entry.attr,
NULL,
};
@@ -1856,7 +2014,7 @@
.elevator_owner = THIS_MODULE,
};
-int cfq_init(void)
+static int __init cfq_init(void)
{
int ret;
@@ -1864,17 +2022,35 @@
return -ENOMEM;
ret = elv_register(&iosched_cfq);
- if (!ret) {
- __module_get(THIS_MODULE);
- return 0;
- }
+ if (ret)
+ cfq_slab_kill();
- cfq_slab_kill();
return ret;
}
static void __exit cfq_exit(void)
{
+ struct task_struct *g, *p;
+ unsigned long flags;
+
+ read_lock_irqsave(&tasklist_lock, flags);
+
+ /*
+ * iterate each process in the system, removing our io_context
+ */
+ do_each_thread(g, p) {
+ struct io_context *ioc = p->io_context;
+
+ if (ioc && ioc->cic) {
+ ioc->cic->exit(ioc->cic);
+ cfq_free_io_context(ioc->cic);
+ ioc->cic = NULL;
+ }
+
+ } while_each_thread(g, p);
+
+ read_unlock_irqrestore(&tasklist_lock, flags);
+
cfq_slab_kill();
elv_unregister(&iosched_cfq);
}
--
Jens Axboe
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Time sliced CFQ io scheduler
2004-12-02 13:04 Time sliced CFQ io scheduler Jens Axboe
@ 2004-12-02 13:48 ` Jens Axboe
2004-12-02 19:48 ` Andrew Morton
2004-12-02 14:28 ` Giuliano Pochini
1 sibling, 1 reply; 66+ messages in thread
From: Jens Axboe @ 2004-12-02 13:48 UTC (permalink / raw)
To: Linux Kernel; +Cc: Nick Piggin
Hi,
One more test case, while the box is booted... This just demonstrates a
process doing a file write (bs=64k) with a competing process doing a
file read (bs=64k) at the same time, again capped at 30sec.
deadline:
Reader: 2520KiB/sec (max_lat=45msec)
Writer: 1258KiB/sec (max_lat=85msec)
as:
Reader: 27985KiB/sec (max_lat=34msec)
Writer: 64KiB/sec (max_lat=1042msec)
cfq:
Reader: 12703KiB/sec (max_lat=108msec)
Writer: 9743KiB/sec (max_lat=89msec)
If you look at vmstat while running these tests, cfq and deadline give
equal bandwidth for the reader and writer all the time, while as
basically doesn't give anything to the writer (a single block per second
only). Nick, is the write batching broken or something?
--
Jens Axboe
* Re: Time sliced CFQ io scheduler
2004-12-02 13:04 Time sliced CFQ io scheduler Jens Axboe
2004-12-02 13:48 ` Jens Axboe
@ 2004-12-02 14:28 ` Giuliano Pochini
2004-12-02 14:41 ` Jens Axboe
1 sibling, 1 reply; 66+ messages in thread
From: Giuliano Pochini @ 2004-12-02 14:28 UTC (permalink / raw)
To: Jens Axboe; +Cc: Linux Kernel
On Thu, 2 Dec 2004, Jens Axboe wrote:
> Case 4: write_files, random, bs=4k
Just a thought... in this test the results don't look right. Why is
aggregate bandwidth with 8 clients higher than with 4 and 2 clients?
In the cfq test with 8 clients, aggregate bandwidth is also higher than
with a single client.
--
Giuliano.
* Re: Time sliced CFQ io scheduler
2004-12-02 14:28 ` Giuliano Pochini
@ 2004-12-02 14:41 ` Jens Axboe
2004-12-04 13:05 ` Giuliano Pochini
0 siblings, 1 reply; 66+ messages in thread
From: Jens Axboe @ 2004-12-02 14:41 UTC (permalink / raw)
To: Giuliano Pochini; +Cc: Linux Kernel
On Thu, Dec 02 2004, Giuliano Pochini wrote:
>
>
> On Thu, 2 Dec 2004, Jens Axboe wrote:
>
> > Case 4: write_files, random, bs=4k
>
> Just a thought... in this test the results don't look right. Why
> aggregate bandwidth with 8 clients is higher than with 4 and 2 clients ?
> In the cfq test with 8 clients aggregate bw is also higher than with
> a single client.
I don't know what happens in the 4 client case, but it's not that
unlikely that aggregate bandwidth will be higher with more threads doing
random writes, as request coalescing will help minimize seeks.
I did think the 4 client dip was strange, but it is reproducible (as are
all the results, they have very little variance).
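A toy model (disk size and request counts made up, not from the thread) illustrates the seek argument: the elevator sorts whatever requests are queued, so the average seek between consecutive random requests shrinks as queue depth grows.

```c
#include <stdlib.h>

static int cmp_long(const void *a, const void *b)
{
	long x = *(const long *)a, y = *(const long *)b;

	return (x > y) - (x < y);
}

/*
 * Average distance between consecutive requests once the elevator has
 * sorted `nreq` uniformly random positions on a disk of `blocks` blocks.
 */
long avg_sorted_gap(int nreq, long blocks, unsigned int seed)
{
	long *pos = malloc(nreq * sizeof(*pos));
	long total = 0;
	int i;

	if (!pos || nreq < 2)
		return -1;
	srand(seed);
	for (i = 0; i < nreq; i++)
		pos[i] = rand() % blocks;
	qsort(pos, nreq, sizeof(*pos), cmp_long);
	for (i = 1; i < nreq; i++)
		total += pos[i] - pos[i - 1];
	free(pos);
	return total / (nreq - 1);
}
```

For n uniformly random requests the expected gap is roughly blocks/(n+1): with 8 clients each keeping a request queued, consecutive sorted requests are about a ninth of the disk apart on average, versus a third for 2 requests. That is why aggregate random write bandwidth can rise with client count.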
--
Jens Axboe
* Re: Time sliced CFQ io scheduler
2004-12-02 13:48 ` Jens Axboe
@ 2004-12-02 19:48 ` Andrew Morton
2004-12-02 19:52 ` Jens Axboe
0 siblings, 1 reply; 66+ messages in thread
From: Andrew Morton @ 2004-12-02 19:48 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-kernel, nickpiggin
Jens Axboe <axboe@suse.de> wrote:
>
> as:
> Reader: 27985KiB/sec (max_lat=34msec)
> Writer: 64KiB/sec (max_lat=1042msec)
>
> cfq:
> Reader: 12703KiB/sec (max_lat=108msec)
> Writer: 9743KiB/sec (max_lat=89msec)
>
> If you look at vmstat while running these tests, cfq and deadline give
> equal bandwidth for the reader and writer all the time, while as
> basically doesn't give anything to the writer (a single block per second
> only). Nick, is the write batching broken or something?
Looks like it. We used to do 2/3rds-read, 1/3rd-write in that testcase.
* Re: Time sliced CFQ io scheduler
2004-12-02 19:48 ` Andrew Morton
@ 2004-12-02 19:52 ` Jens Axboe
2004-12-02 20:19 ` Andrew Morton
2004-12-08 0:37 ` Andrea Arcangeli
0 siblings, 2 replies; 66+ messages in thread
From: Jens Axboe @ 2004-12-02 19:52 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, nickpiggin
On Thu, Dec 02 2004, Andrew Morton wrote:
> Jens Axboe <axboe@suse.de> wrote:
> >
> > as:
> > Reader: 27985KiB/sec (max_lat=34msec)
> > Writer: 64KiB/sec (max_lat=1042msec)
> >
> > cfq:
> > Reader: 12703KiB/sec (max_lat=108msec)
> > Writer: 9743KiB/sec (max_lat=89msec)
> >
> > If you look at vmstat while running these tests, cfq and deadline give
> > equal bandwidth for the reader and writer all the time, while as
> > basically doesn't give anything to the writer (a single block per second
> > only). Nick, is the write batching broken or something?
>
> Looks like it. We used to do 2/3rds-read, 1/3rd-write in that testcase.
But 'as' has had no real changes in about 9 months, so it's really
strange. Twiddling with the write expire and write batch expire settings
makes no real difference. Upping the ante to 4 clients, two readers and
two writers, works out about the same: 27MiB/sec aggregate read
bandwidth, ~100KiB/sec write.
Something needs to be done about it. I don't know what kernel this is a
regression against, but it means that current 2.6 with its default io
scheduler has basically zero write performance in the presence of reads.
--
Jens Axboe
* Re: Time sliced CFQ io scheduler
2004-12-02 20:19 ` Andrew Morton
@ 2004-12-02 20:19 ` Jens Axboe
2004-12-02 20:34 ` Andrew Morton
2004-12-02 22:18 ` Prakash K. Cheemplavam
1 sibling, 1 reply; 66+ messages in thread
From: Jens Axboe @ 2004-12-02 20:19 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, nickpiggin
On Thu, Dec 02 2004, Andrew Morton wrote:
> Jens Axboe <axboe@suse.de> wrote:
> >
> > n Thu, Dec 02 2004, Andrew Morton wrote:
> > > Jens Axboe <axboe@suse.de> wrote:
> > > >
> > > > as:
> > > > Reader: 27985KiB/sec (max_lat=34msec)
> > > > Writer: 64KiB/sec (max_lat=1042msec)
> > > >
> > > > cfq:
> > > > Reader: 12703KiB/sec (max_lat=108msec)
> > > > Writer: 9743KiB/sec (max_lat=89msec)
> > > >
> > > > If you look at vmstat while running these tests, cfq and deadline give
> > > > equal bandwidth for the reader and writer all the time, while as
> > > > basically doesn't give anything to the writer (a single block per second
> > > > only). Nick, is the write batching broken or something?
> > >
> > > Looks like it. We used to do 2/3rds-read, 1/3rd-write in that testcase.
> >
> > But 'as' has had no real changes in about 9 months time, it's really
> > strange. Twiddling with write expire and write batch expire settings
> > make no real difference. Upping the ante to 4 clients, two readers and
> > two writers work about the same: 27MiB/sec aggregate read bandwidth,
> > ~100KiB/sec write.
>
> Did a quick test here, things seem OK.
>
> Writer:
>
> while true
> do
> dd if=/dev/zero of=foo bs=1M count=1000 conv=notrunc
> done
>
> Reader:
>
> time cat 1-gig-file > /dev/null
> cat x > /dev/null 0.07s user 1.55s system 3% cpu 45.434 total
>
> `vmstat 1' says:
>
>
> 0 5 1168 3248 472 220972 0 0 28 24896 1249 212 0 7 0 94
> 0 7 1168 3248 492 220952 0 0 28 28056 1284 204 0 5 0 96
> 0 8 1168 3248 500 221012 0 0 28 30632 1255 194 0 5 0 95
> 0 7 1168 2800 508 221344 0 0 16 20432 1183 170 0 3 0 97
> 0 8 1168 3024 484 221164 0 0 15008 12488 1246 460 0 4 0 96
> 1 8 1168 2252 484 221980 0 0 27808 6092 1270 624 0 4 0 96
> 0 8 1168 3248 468 221044 0 0 32420 4596 1290 690 0 4 0 96
> 0 9 1164 2084 456 222212 4 0 28964 1800 1285 596 0 3 0 96
> 1 7 1164 3032 392 221256 0 0 23456 6820 1270 527 0 4 0 96
[snip]
Looks fine, yes.
> So what are you doing different?
Doing sync io, most likely. My results above are 64k O_DIRECT reads and
writes, see the mention of the test cases in the first mail. I'll
repeat the testing with both sync and async writes tomorrow on the same
box and see what that does to fairness.
Async writes are not very interesting; it takes quite some effort to
make those go slow :-)
--
Jens Axboe
* Re: Time sliced CFQ io scheduler
2004-12-02 19:52 ` Jens Axboe
@ 2004-12-02 20:19 ` Andrew Morton
2004-12-02 20:19 ` Jens Axboe
2004-12-02 22:18 ` Prakash K. Cheemplavam
2004-12-08 0:37 ` Andrea Arcangeli
1 sibling, 2 replies; 66+ messages in thread
From: Andrew Morton @ 2004-12-02 20:19 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-kernel, nickpiggin
Jens Axboe <axboe@suse.de> wrote:
>
> n Thu, Dec 02 2004, Andrew Morton wrote:
> > Jens Axboe <axboe@suse.de> wrote:
> > >
> > > as:
> > > Reader: 27985KiB/sec (max_lat=34msec)
> > > Writer: 64KiB/sec (max_lat=1042msec)
> > >
> > > cfq:
> > > Reader: 12703KiB/sec (max_lat=108msec)
> > > Writer: 9743KiB/sec (max_lat=89msec)
> > >
> > > If you look at vmstat while running these tests, cfq and deadline give
> > > equal bandwidth for the reader and writer all the time, while as
> > > basically doesn't give anything to the writer (a single block per second
> > > only). Nick, is the write batching broken or something?
> >
> > Looks like it. We used to do 2/3rds-read, 1/3rd-write in that testcase.
>
> But 'as' has had no real changes in about 9 months time, it's really
> strange. Twiddling with write expire and write batch expire settings
> make no real difference. Upping the ante to 4 clients, two readers and
> two writers work about the same: 27MiB/sec aggregate read bandwidth,
> ~100KiB/sec write.
Did a quick test here, things seem OK.
Writer:
while true
do
dd if=/dev/zero of=foo bs=1M count=1000 conv=notrunc
done
Reader:
time cat 1-gig-file > /dev/null
cat x > /dev/null 0.07s user 1.55s system 3% cpu 45.434 total
`vmstat 1' says:
0 5 1168 3248 472 220972 0 0 28 24896 1249 212 0 7 0 94
0 7 1168 3248 492 220952 0 0 28 28056 1284 204 0 5 0 96
0 8 1168 3248 500 221012 0 0 28 30632 1255 194 0 5 0 95
0 7 1168 2800 508 221344 0 0 16 20432 1183 170 0 3 0 97
0 8 1168 3024 484 221164 0 0 15008 12488 1246 460 0 4 0 96
1 8 1168 2252 484 221980 0 0 27808 6092 1270 624 0 4 0 96
0 8 1168 3248 468 221044 0 0 32420 4596 1290 690 0 4 0 96
0 9 1164 2084 456 222212 4 0 28964 1800 1285 596 0 3 0 96
1 7 1164 3032 392 221256 0 0 23456 6820 1270 527 0 4 0 96
0 9 1164 3200 388 221124 0 0 27040 7868 1269 588 0 3 0 97
0 9 1164 2540 384 221808 0 0 21536 4024 1247 540 0 4 0 96
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
1 10 1164 2052 392 222276 0 0 33572 5268 1298 745 0 4 0 96
0 9 1164 3032 400 221316 0 0 28704 5448 1282 611 0 4 0 97
1 9 1164 2076 388 222144 0 0 9992 17176 1229 325 0 2 0 98
1 8 1164 3060 376 221136 0 0 9100 13168 1221 284 0 2 0 98
0 8 1164 2628 384 221536 0 0 28964 3348 1280 635 0 4 0 97
0 8 1164 2920 344 221372 0 0 27052 5744 1275 657 0 6 0 95
1 8 1164 3072 328 221252 0 0 26664 5256 1270 653 0 5 0 95
0 9 1160 2176 356 222100 12 0 26928 6320 1276 605 0 5 0 95
0 9 1160 2268 332 221920 0 0 17300 9580 1242 428 0 3 0 98
0 8 1160 3256 332 221036 0 0 23588 9280 1345 586 0 4 0 96
0 8 1160 3220 320 221116 0 0 16916 9476 1251 425 0 3 0 97
0 10 1160 3000 320 221388 0 0 21416 8168 1260 565 0 5 0 95
0 11 1160 2020 324 222268 0 0 23580 10144 1269 528 0 3 0 97
0 11 1160 2076 340 222252 0 0 20900 3896 1244 486 1 3 0 97
0 10 1160 2656 356 221692 0 0 23968 8108 1272 564 0 5 0 95
0 10 1160 3464 348 220892 0 0 26140 10272 1513 618 0 2 0 98
0 10 1160 3124 320 221260 0 0 15512 11368 1227 442 0 3 0 97
0 10 1156 3072 336 221148 32 0 22212 6776 1280 539 0 4 0 97
0 11 1156 2544 352 221608 0 0 25004 7224 1262 596 0 4 0 95
0 12 1156 2132 364 222140 0 0 20636 9500 1246 510 0 3 0 97
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
1 12 1156 2132 372 222064 0 0 25880 6104 1291 550 0 3 0 97
0 11 1156 3260 368 220980 0 0 19868 12860 1277 496 0 3 0 97
0 12 1156 2328 360 221872 0 0 20764 7712 1256 513 0 4 0 97
0 10 1156 3072 356 221128 0 0 17056 7800 1239 474 0 4 0 96
0 11 1156 2180 336 221964 0 0 20252 10464 1252 520 0 4 0 96
0 11 1156 2076 360 222144 0 0 22512 7448 1250 554 1 4 0 96
0 10 1156 2620 364 221692 0 0 23372 4236 1256 543 0 4 0 96
0 11 1156 2136 344 222120 0 0 22172 8060 1260 528 0 3 0 97
0 10 1156 3340 316 221060 0 0 17688 12036 1242 474 0 3 0 97
0 10 1156 2580 296 221760 0 0 18460 5608 1243 501 0 3 0 97
0 10 1156 2960 308 221408 0 0 17176 10544 1233 462 0 3 0 97
0 11 1156 2132 308 222224 0 0 32376 2048 1291 715 0 4 0 96
0 10 1156 3280 300 221008 0 0 23628 10768 1278 556 0 4 0 96
0 11 1156 2132 320 222144 0 0 18076 10888 1365 481 0 3 0 97
0 11 1156 2504 312 221880 0 0 23448 10068 1256 526 0 3 0 97
0 10 1156 2532 324 221664 0 0 18084 6012 1259 476 0 5 0 96
0 10 1156 2580 332 221792 0 0 26400 6776 1279 626 0 4 0 96
0 10 1156 3312 324 221052 0 0 22044 6036 1247 508 0 4 0 96
0 10 1152 2144 328 222204 4 0 11996 15068 1235 394 0 4 0 97
0 5 1152 3128 344 221236 0 0 20 24068 1200 172 0 3 2 95
So what are you doing different?
* Re: Time sliced CFQ io scheduler
2004-12-02 20:19 ` Jens Axboe
@ 2004-12-02 20:34 ` Andrew Morton
2004-12-02 20:37 ` Jens Axboe
0 siblings, 1 reply; 66+ messages in thread
From: Andrew Morton @ 2004-12-02 20:34 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-kernel, nickpiggin
Jens Axboe <axboe@suse.de> wrote:
>
> > So what are you doing different?
>
> Doing sync io, most likely. My results above are 64k O_DIRECT reads and
> writes, see the mention of the test cases in the first mail.
OK.
Writer:
while true
do
write-and-fsync -o -m 100 -c 65536 foo
done
Reader:
time-read -o -b 65536 -n 256 x (This is O_DIRECT)
or: time-read -b 65536 -n 256 x (This is buffered)
`vmstat 1':
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
1 1 1032 137412 4276 84388 32 0 15456 25344 1659 1538 0 3 50 47
0 1 1032 137468 4276 84388 0 0 0 32128 1521 1027 0 2 51 48
0 1 1032 137476 4276 84388 0 0 0 32064 1519 1026 0 1 50 49
0 1 1032 137476 4276 84388 0 0 0 33920 1556 1102 0 2 50 49
0 1 1032 137476 4276 84388 0 0 0 33088 1541 1074 0 1 50 49
0 2 1032 135676 4284 85944 0 0 1656 29732 1868 2506 0 3 49 47
1 1 1032 96532 4292 125172 0 0 39220 128 10813 39313 0 31 35 34
0 2 1032 57724 4332 163892 0 0 38828 128 10716 38907 0 28 38 35
0 2 1032 18860 4368 202684 0 0 38768 128 10701 38845 1 28 38 35
0 2 1032 3672 4248 217764 0 0 39188 128 10803 39327 0 28 37 34
0 1 1032 2832 4260 218840 0 0 16812 17932 5504 17457 0 14 46 40
0 1 1032 2832 4260 218840 0 0 0 30876 1501 974 0 1 50 49
0 1 1032 2944 4260 218840 0 0 0 33472 1537 1068 0 2 50 48
0 1 1032 2944 4260 218840 0 0 0 33216 1533 1046 0 2 50 48
Ugly.
(write-and-fsync and time-read are from
http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz)
* Re: Time sliced CFQ io scheduler
2004-12-02 20:34 ` Andrew Morton
@ 2004-12-02 20:37 ` Jens Axboe
2004-12-07 23:11 ` Nick Piggin
0 siblings, 1 reply; 66+ messages in thread
From: Jens Axboe @ 2004-12-02 20:37 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, nickpiggin
On Thu, Dec 02 2004, Andrew Morton wrote:
> Jens Axboe <axboe@suse.de> wrote:
> >
> > > So what are you doing different?
> >
> > Doing sync io, most likely. My results above are 64k O_DIRECT reads and
> > writes, see the mention of the test cases in the first mail.
>
> OK.
>
> Writer:
>
> while true
> do
> write-and-fsync -o -m 100 -c 65536 foo
> done
>
> Reader:
>
> time-read -o -b 65536 -n 256 x (This is O_DIRECT)
> or: time-read -b 65536 -n 256 x (This is buffered)
>
> `vmstat 1':
>
> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
> r b swpd free buff cache si so bi bo in cs us sy id wa
> 1 1 1032 137412 4276 84388 32 0 15456 25344 1659 1538 0 3 50 47
> 0 1 1032 137468 4276 84388 0 0 0 32128 1521 1027 0 2 51 48
> 0 1 1032 137476 4276 84388 0 0 0 32064 1519 1026 0 1 50 49
> 0 1 1032 137476 4276 84388 0 0 0 33920 1556 1102 0 2 50 49
> 0 1 1032 137476 4276 84388 0 0 0 33088 1541 1074 0 1 50 49
> 0 2 1032 135676 4284 85944 0 0 1656 29732 1868 2506 0 3 49 47
> 1 1 1032 96532 4292 125172 0 0 39220 128 10813 39313 0 31 35 34
> 0 2 1032 57724 4332 163892 0 0 38828 128 10716 38907 0 28 38 35
> 0 2 1032 18860 4368 202684 0 0 38768 128 10701 38845 1 28 38 35
> 0 2 1032 3672 4248 217764 0 0 39188 128 10803 39327 0 28 37 34
> 0 1 1032 2832 4260 218840 0 0 16812 17932 5504 17457 0 14 46 40
Well there you go, exactly what I saw. The writer(s) basically make no
progress as long as the reader is going. Since 'as' treats sync writes
like reads internally, and given the really bad fairness problems
demonstrated for same-direction clients, this might be the same problem.
> Ugly.
>
> (write-and-fsync and time-read are from
> http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz)
I'll try and post my cruddy test programs tomorrow as well. Pretty handy
for getting a good feel for N client read/write performance.
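Such a tool is basically a fork-per-client harness. A minimal sketch of the shape it could take (the stand-in client function and the pipe-based result collection are assumptions, not Jens's actual code):

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stand-in for a real per-client io loop; returns e.g. KiB moved. */
static long client_stub(int id)
{
	return id + 1;
}

/*
 * Fork `n` concurrent clients, run `fn(id)` in each, and return the
 * aggregate of the per-client results collected over a pipe.  A real
 * tool would also record per-client min/max bandwidth and latency.
 */
static long run_clients(int n, long (*fn)(int))
{
	int pfd[2];
	long total = 0, one;
	int i;

	if (pipe(pfd) < 0)
		return -1;
	for (i = 0; i < n; i++) {
		pid_t pid = fork();

		if (pid < 0)
			return -1;
		if (pid == 0) {
			long res = fn(i);

			write(pfd[1], &res, sizeof(res));
			_exit(0);
		}
	}
	for (i = 0; i < n; i++) {
		if (read(pfd[0], &one, sizeof(one)) == sizeof(one))
			total += one;
		wait(NULL);
	}
	close(pfd[0]);
	close(pfd[1]);
	return total;
}
```

client_stub just returns a number here so the harness itself is easy to exercise; in the real tools each child would run an O_DIRECT read or write loop against its own file.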
--
Jens Axboe
* Re: Time sliced CFQ io scheduler
2004-12-02 20:19 ` Andrew Morton
2004-12-02 20:19 ` Jens Axboe
@ 2004-12-02 22:18 ` Prakash K. Cheemplavam
2004-12-03 7:01 ` Jens Axboe
1 sibling, 1 reply; 66+ messages in thread
From: Prakash K. Cheemplavam @ 2004-12-02 22:18 UTC (permalink / raw)
To: Andrew Morton; +Cc: Jens Axboe, linux-kernel, nickpiggin
Andrew Morton schrieb:
> Jens Axboe <axboe@suse.de> wrote:
>
>>n Thu, Dec 02 2004, Andrew Morton wrote:
>>
>>>Jens Axboe <axboe@suse.de> wrote:
>>>
>>>>as:
>>>>Reader: 27985KiB/sec (max_lat=34msec)
>>>>Writer: 64KiB/sec (max_lat=1042msec)
>>>>
>>>>cfq:
>>>>Reader: 12703KiB/sec (max_lat=108msec)
>>>>Writer: 9743KiB/sec (max_lat=89msec)
>
>
> Did a quick test here, things seem OK.
>
> Writer:
>
> while true
> do
> dd if=/dev/zero of=foo bs=1M count=1000 conv=notrunc
> done
>
> Reader:
>
> time cat 1-gig-file > /dev/null
> cat x > /dev/null 0.07s user 1.55s system 3% cpu 45.434 total
>
> `vmstat 1' says:
>
>
> 0 5 1168 3248 472 220972 0 0 28 24896 1249 212 0 7 0 94
> 0 7 1168 3248 492 220952 0 0 28 28056 1284 204 0 5 0 96
> 0 8 1168 3248 500 221012 0 0 28 30632 1255 194 0 5 0 95
> 0 7 1168 2800 508 221344 0 0 16 20432 1183 170 0 3 0 97
> 0 8 1168 3024 484 221164 0 0 15008 12488 1246 460 0 4 0 96
> 1 8 1168 2252 484 221980 0 0 27808 6092 1270 624 0 4 0 96
> 0 8 1168 3248 468 221044 0 0 32420 4596 1290 690 0 4 0 96
> 0 9 1164 2084 456 222212 4 0 28964 1800 1285 596 0 3 0 96
> 1 7 1164 3032 392 221256 0 0 23456 6820 1270 527 0 4 0 96
> 0 9 1164 3200 388 221124 0 0 27040 7868 1269 588 0 3 0 97
> 0 9 1164 2540 384 221808 0 0 21536 4024 1247 540 0 4 0 96
> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
> r b swpd free buff cache si so bi bo in cs us sy id wa
> 1 10 1164 2052 392 222276 0 0 33572 5268 1298 745 0 4 0 96
> 0 9 1164 3032 400 221316 0 0 28704 5448 1282 611 0 4 0 97
>
I am happy that the kernel devs finally see that there is a problem. In my
case (as I mentioned in another thread) the reader is pretty much starving
while a writer is active (especially my email client has trouble while
writing is going on). This is vmstat output using your scripts above
(though using the cfq2 scheduler):
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
1 3 3080 2528 1256 818976 0 0 6404 101236 1332 929 1 26 0 73
0 3 3080 2768 1252 820104 0 0 2820 32632 1328 1087 1 20 0 79
2 1 3080 6992 1292 814808 0 0 4928 106912 1337 1364 16 29 0 55
0 3 3080 2772 1252 818516 0 0 3076 42176 1357 1351 2 41 0 57
0 3 3080 2644 1256 817548 0 0 3332 110104 1375 873 1 36 0 63
0 3 3080 2592 1248 815928 0 0 2820 76860 1324 894 1 41 0 58
7 3 3080 2208 1248 817176 0 0 3328 134144 1352 1058 2 30 0 68
4 4 3080 2516 1248 817516 0 0 3588 47768 1327 1244 1 19 0 80
0 3 3080 2400 1220 818688 0 0 3844 24760 1312 1251 1 23 0 76
0 3 3080 2656 1196 816468 0 0 3588 132372 1352 1126 1 52 0 47
0 3 3080 2688 1188 815316 0 0 3076 77824 1328 933 1 35 0 64
0 3 3080 2336 1156 816744 0 0 2924 114688 1333 1038 1 25 0 74
0 3 3080 2528 1184 816812 0 0 2508 67736 1340 882 1 12 0 87
0 3 3080 2208 1156 817712 0 0 3592 75624 1326 2289 1 36 0 63
0 3 3080 2664 1156 818240 0 0 5124 15692 1302 992 1 18 0 81
0 3 3080 2580 1160 815832 0 0 4356 155792 1375 1064 1 39 0 60
0 3 3080 2472 1160 817124 0 0 3076 100852 1345 1138 1 23 0 76
2 4 3080 2836 1148 816228 0 0 3336 100412 1352 1379 1 47 0 52
0 4 3080 2708 1144 815964 0 0 3844 48908 1343 871 1 25 0 74
0 3 3080 2748 1152 815984 0 0 3332 71996 1338 843 1 27 0 72
Cheers,
Prakash
* Re: Time sliced CFQ io scheduler
2004-12-02 22:18 ` Prakash K. Cheemplavam
@ 2004-12-03 7:01 ` Jens Axboe
2004-12-03 9:12 ` Prakash K. Cheemplavam
0 siblings, 1 reply; 66+ messages in thread
From: Jens Axboe @ 2004-12-03 7:01 UTC (permalink / raw)
To: Prakash K. Cheemplavam; +Cc: Andrew Morton, linux-kernel, nickpiggin
On Thu, Dec 02 2004, Prakash K. Cheemplavam wrote:
> Andrew Morton schrieb:
> >Jens Axboe <axboe@suse.de> wrote:
> >
> >>n Thu, Dec 02 2004, Andrew Morton wrote:
> >>
> >>>Jens Axboe <axboe@suse.de> wrote:
> >>>
> >>>>as:
> >>>>Reader: 27985KiB/sec (max_lat=34msec)
> >>>>Writer: 64KiB/sec (max_lat=1042msec)
> >>>>
> >>>>cfq:
> >>>>Reader: 12703KiB/sec (max_lat=108msec)
> >>>>Writer: 9743KiB/sec (max_lat=89msec)
> >
> >
> >Did a quick test here, things seem OK.
> >
> >Writer:
> >
> > while true
> > do
> > dd if=/dev/zero of=foo bs=1M count=1000 conv=notrunc
> > done
> >
> >Reader:
> >
> > time cat 1-gig-file > /dev/null
> > cat x > /dev/null 0.07s user 1.55s system 3% cpu 45.434 total
> >
> >`vmstat 1' says:
> >
> >
> > 0 5 1168 3248 472 220972 0 0 28 24896 1249 212 0 7 0 94
> > 0 7 1168 3248 492 220952 0 0 28 28056 1284 204 0 5 0 96
> > 0 8 1168 3248 500 221012 0 0 28 30632 1255 194 0 5 0 95
> > 0 7 1168 2800 508 221344 0 0 16 20432 1183 170 0 3 0 97
> > 0 8 1168 3024 484 221164 0 0 15008 12488 1246 460 0 4 0 96
> > 1 8 1168 2252 484 221980 0 0 27808 6092 1270 624 0 4 0 96
> > 0 8 1168 3248 468 221044 0 0 32420 4596 1290 690 0 4 0 96
> > 0 9 1164 2084 456 222212 4 0 28964 1800 1285 596 0 3 0 96
> > 1 7 1164 3032 392 221256 0 0 23456 6820 1270 527 0 4 0 96
> > 0 9 1164 3200 388 221124 0 0 27040 7868 1269 588 0 3 0 97
> > 0 9 1164 2540 384 221808 0 0 21536 4024 1247 540 0 4 0 96
> >procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
> > r b swpd free buff cache si so bi bo in cs us sy id wa
> > 1 10 1164 2052 392 222276 0 0 33572 5268 1298 745 0 4 0 96
> > 0 9 1164 3032 400 221316 0 0 28704 5448 1282 611 0 4 0 97
> >
>
> I am happy that finally the kernel devs see that there is a problem. In my
> case (as I mentioned in another thread) the reader is pretty starving while
> I a writer is active. (esp my email client makes trouble while writing is
> going on.) This is vmstat using your scripts above (though using cfq2
This was the reverse case, though :-)
> scheduler):
>
> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
> r b swpd free buff cache si so bi bo in cs us sy id wa
> 1 3 3080 2528 1256 818976 0 0 6404 101236 1332 929 1 26 0 73
> 0 3 3080 2768 1252 820104 0 0 2820 32632 1328 1087 1 20 0 79
> 2 1 3080 6992 1292 814808 0 0 4928 106912 1337 1364 16 29 0 55
> 0 3 3080 2772 1252 818516 0 0 3076 42176 1357 1351 2 41 0 57
> 0 3 3080 2644 1256 817548 0 0 3332 110104 1375 873 1 36 0 63
> 0 3 3080 2592 1248 815928 0 0 2820 76860 1324 894 1 41 0 58
> 7 3 3080 2208 1248 817176 0 0 3328 134144 1352 1058 2 30 0 68
> 4 4 3080 2516 1248 817516 0 0 3588 47768 1327 1244 1 19 0 80
> 0 3 3080 2400 1220 818688 0 0 3844 24760 1312 1251 1 23 0 76
> 0 3 3080 2656 1196 816468 0 0 3588 132372 1352 1126 1 52 0 47
> 0 3 3080 2688 1188 815316 0 0 3076 77824 1328 933 1 35 0 64
> 0 3 3080 2336 1156 816744 0 0 2924 114688 1333 1038 1 25 0 74
> 0 3 3080 2528 1184 816812 0 0 2508 67736 1340 882 1 12 0 87
> 0 3 3080 2208 1156 817712 0 0 3592 75624 1326 2289 1 36 0 63
> 0 3 3080 2664 1156 818240 0 0 5124 15692 1302 992 1 18 0 81
> 0 3 3080 2580 1160 815832 0 0 4356 155792 1375 1064 1 39 0 60
> 0 3 3080 2472 1160 817124 0 0 3076 100852 1345 1138 1 23 0 76
> 2 4 3080 2836 1148 816228 0 0 3336 100412 1352 1379 1 47 0 52
> 0 4 3080 2708 1144 815964 0 0 3844 48908 1343 871 1 25 0 74
> 0 3 3080 2748 1152 815984 0 0 3332 71996 1338 843 1 27 0 72
Can you try with the patch that is in the parent of this thread? The
above doesn't look that bad, although read performance could be better
of course. But try with the patch please, I'm sure it should help you
quite a lot.
--
Jens Axboe
* Re: Time sliced CFQ io scheduler
2004-12-03 7:01 ` Jens Axboe
@ 2004-12-03 9:12 ` Prakash K. Cheemplavam
2004-12-03 9:18 ` Jens Axboe
2004-12-03 9:26 ` Andrew Morton
0 siblings, 2 replies; 66+ messages in thread
From: Prakash K. Cheemplavam @ 2004-12-03 9:12 UTC (permalink / raw)
To: Jens Axboe; +Cc: akpm, linux-kernel, nickpiggin
Jens Axboe schrieb:
> On Thu, Dec 02 2004, Prakash K. Cheemplavam wrote:
>
>> 0 3 3080 2208 1156 817712 0 0 3592 75624 1326 2289 1 36 0 63
>> 0 3 3080 2664 1156 818240 0 0 5124 15692 1302 992 1 18 0 81
>> 0 3 3080 2580 1160 815832 0 0 4356 155792 1375 1064 1 39 0 60
>> 0 3 3080 2472 1160 817124 0 0 3076 100852 1345 1138 1 23 0 76
>> 2 4 3080 2836 1148 816228 0 0 3336 100412 1352 1379 1 47 0 52
>> 0 4 3080 2708 1144 815964 0 0 3844 48908 1343 871 1 25 0 74
>> 0 3 3080 2748 1152 815984 0 0 3332 71996 1338 843 1 27 0 72
>
>
> Can you try with the patch that is in the parent of this thread? The
> above doesn't look that bad, although read performance could be better
> of course. But try with the patch please, I'm sure it should help you
> quite a lot.
>
It actually got worse: though the read rate seems acceptable, it is not,
as interactivity is dead while writing. I cannot start programs; other
programs which want to do i/o pretty much hang. This happens only while
writing; while reading there is no such problem.
Prakash
5 2692 5440 1640 917964 0 0 2332 72364 1337 782 1 28 0 71
0 5 2692 5536 1540 917116 0 0 2116 85360 1346 785 2 27 0 71
1 4 2692 7016 1496 919488 0 0 2152 71664 1329 740 3 29 0 68
0 4 2692 5112 1476 922536 0 0 872 110592 1328 798 0 17 0 83
0 4 2692 5560 1492 922144 0 0 1316 57348 1323 2162 1 21 0 78
0 4 2692 5240 1500 921808 0 0 2088 92200 1352 1285 1 26 0 73
0 4 2692 5816 1576 922064 0 0 1352 60716 1316 737 1 5 0 94
0 5 2692 5484 1588 920004 0 0 2072 87732 1327 3522 2 50 0 48
0 4 2692 5696 1660 920628 0 0 956 97284 1336 676 1 28 0 71
0 4 2692 5368 1592 921808 0 0 1296 23208 1367 4870 2 35 0 63
0 4 2692 5176 1628 922708 0 0 1576 67932 1400 721 0 4 0 96
1 4 2692 5496 1684 922604 0 0 2372 53216 1320 684 1 6 0 93
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
1 4 2692 6432 1744 924664 0 0 3144 31484 1331 651 1 5 0 94
0 4 2692 5496 1724 922056 0 0 2336 117040 1306 7770 1 63 0 36
0 4 2692 5500 1724 921588 0 0 2576 28992 1314 1244 1 26 0 73
0 5 2692 5484 1728 919340 0 0 1168 128796 1334 77435 2 45 0 53
0 4 2692 5432 1756 920864 0 0 1488 100392 1325 1100 1 25 0 74
1 4 2692 5368 1772 921900 0 0 1312 52180 1312 2087 1 21 0 78
0 4 2692 5240 1716 922272 0 0 2076 56352 1305 939 1 13 0 86
0 4 2692 5496 1732 921592 0 0 2596 68576 1320 1170 1 18 0 81
0 4 2692 5368 1776 921364 0 0 1588 21904 1281 1201 1 23 0 76
0 4 2692 5516 1852 921840 0 0 6560 93864 1593 967 1 8 0 91
0 4 2692 5176 1816 922148 0 0 1068 73728 1581 4683 2 37 0 61
0 4 2692 5484 1756 922408 0 0 2096 73632 1450 1456 2 20 0 78
* Re: Time sliced CFQ io scheduler
2004-12-03 9:12 ` Prakash K. Cheemplavam
@ 2004-12-03 9:18 ` Jens Axboe
2004-12-03 9:35 ` Prakash K. Cheemplavam
2004-12-03 9:26 ` Andrew Morton
1 sibling, 1 reply; 66+ messages in thread
From: Jens Axboe @ 2004-12-03 9:18 UTC (permalink / raw)
To: Prakash K. Cheemplavam; +Cc: akpm, linux-kernel, nickpiggin
On Fri, Dec 03 2004, Prakash K. Cheemplavam wrote:
> Jens Axboe schrieb:
> >On Thu, Dec 02 2004, Prakash K. Cheemplavam wrote:
> >
>
> >>0 3 3080 2208 1156 817712 0 0 3592 75624 1326 2289 1 36 0 63
> >>0 3 3080 2664 1156 818240 0 0 5124 15692 1302 992 1 18 0 81
> >>0 3 3080 2580 1160 815832 0 0 4356 155792 1375 1064 1 39 0 60
> >>0 3 3080 2472 1160 817124 0 0 3076 100852 1345 1138 1 23 0 76
> >>2 4 3080 2836 1148 816228 0 0 3336 100412 1352 1379 1 47 0 52
> >>0 4 3080 2708 1144 815964 0 0 3844 48908 1343 871 1 25 0 74
> >>0 3 3080 2748 1152 815984 0 0 3332 71996 1338 843 1 27 0 72
> >
> >
> >Can you try with the patch that is in the parent of this thread? The
> >above doesn't look that bad, although read performance could be better
> >of course. But try with the patch please, I'm sure it should help you
> >quite a lot.
> >
>
> It actually got worse: Though the read rate seems accepteble, it is not, as
> interactivity is dead while writing. I cannot start porgrammes, other
> programmes which want to do i/o pretty much hang. This is only while
> writing. While reading there is no such problem.
Interesting, thanks for testing. I'll run some tests here as well; so
far only the cases mentioned yesterday have been tested.
You could try and bump the slice period. But I'll experiment and see
what happens. What is your test case?
--
Jens Axboe
* Re: Time sliced CFQ io scheduler
2004-12-03 9:12 ` Prakash K. Cheemplavam
2004-12-03 9:18 ` Jens Axboe
@ 2004-12-03 9:26 ` Andrew Morton
2004-12-03 9:34 ` Prakash K. Cheemplavam
2004-12-03 9:39 ` Jens Axboe
1 sibling, 2 replies; 66+ messages in thread
From: Andrew Morton @ 2004-12-03 9:26 UTC (permalink / raw)
To: Prakash K. Cheemplavam; +Cc: axboe, linux-kernel, nickpiggin
"Prakash K. Cheemplavam" <prakashkc@gmx.de> wrote:
>
> > Can you try with the patch that is in the parent of this thread? The
> > above doesn't look that bad, although read performance could be better
> > of course. But try with the patch please, I'm sure it should help you
> > quite a lot.
> >
>
> It actually got worse: Though the read rate seems accepteble, it is not, as
> interactivity is dead while writing.
Is this a parallel IDE system? SATA? SCSI? If the latter, what driver
and what is the TCQ depth?
* Re: Time sliced CFQ io scheduler
2004-12-03 9:26 ` Andrew Morton
@ 2004-12-03 9:34 ` Prakash K. Cheemplavam
2004-12-03 9:39 ` Jens Axboe
1 sibling, 0 replies; 66+ messages in thread
From: Prakash K. Cheemplavam @ 2004-12-03 9:34 UTC (permalink / raw)
To: Andrew Morton; +Cc: axboe, linux-kernel, nickpiggin
Andrew Morton schrieb:
> "Prakash K. Cheemplavam" <prakashkc@gmx.de> wrote:
>
>>>Can you try with the patch that is in the parent of this thread? The
>>
>> > above doesn't look that bad, although read performance could be better
>> > of course. But try with the patch please, I'm sure it should help you
>> > quite a lot.
>> >
>>
>> It actually got worse: Though the read rate seems accepteble, it is not, as
>> interactivity is dead while writing.
>
>
> Is this a parallel IDE system? SATA? SCSI? If the latter, what driver
> and what is the TCQ depth?
Hmm yes, this is a RAID0 configuration, so the regression of time
sliced CFQ might be related to it, but the unresponsiveness while
writing as such was present on my single disk as well. Here one HD is
SATA (libata, Silicon Image) and one is on an IDE controller (nforce2).
No TCQ.
BTW, I haven't checked the problem on my IDE disk, only on SATA. Will
try to free some space and do so...
Prakash
* Re: Time sliced CFQ io scheduler
2004-12-03 9:18 ` Jens Axboe
@ 2004-12-03 9:35 ` Prakash K. Cheemplavam
2004-12-03 9:43 ` Jens Axboe
0 siblings, 1 reply; 66+ messages in thread
From: Prakash K. Cheemplavam @ 2004-12-03 9:35 UTC (permalink / raw)
To: Jens Axboe; +Cc: akpm, linux-kernel, nickpiggin
Jens Axboe schrieb:
> On Fri, Dec 03 2004, Prakash K. Cheemplavam wrote:
>
>>Jens Axboe schrieb:
>>
>>>On Thu, Dec 02 2004, Prakash K. Cheemplavam wrote:
>>>
>>
>>>>0 3 3080 2208 1156 817712 0 0 3592 75624 1326 2289 1 36 0 63
>>>>0 3 3080 2664 1156 818240 0 0 5124 15692 1302 992 1 18 0 81
>>>>0 3 3080 2580 1160 815832 0 0 4356 155792 1375 1064 1 39 0 60
>>>>0 3 3080 2472 1160 817124 0 0 3076 100852 1345 1138 1 23 0 76
>>>>2 4 3080 2836 1148 816228 0 0 3336 100412 1352 1379 1 47 0 52
>>>>0 4 3080 2708 1144 815964 0 0 3844 48908 1343 871 1 25 0 74
>>>>0 3 3080 2748 1152 815984 0 0 3332 71996 1338 843 1 27 0 72
>>>
>>>
>>>Can you try with the patch that is in the parent of this thread? The
>>>above doesn't look that bad, although read performance could be better
>>>of course. But try with the patch please, I'm sure it should help you
>>>quite a lot.
>>>
>>
>>It actually got worse: Though the read rate seems acceptable, it is not, as
>>interactivity is dead while writing. I cannot start programmes; other
>>programmes which want to do i/o pretty much hang. This is only while
>>writing. While reading there is no such problem.
>
>
> Interesting, thanks for testing. I'll run some tests here as well, so
> far only the cases mentioned yesterday have been tested.
BTW, in case it is misread: Above (except the io performance as such) is
no regression: The other schedulers behave the same on my system.
> You could try and bump the slice period. But I'll experiment and see
> what happens. What is your test case?
[slice bumping] Uhm, is it doable via proc? I haven't seen any text docs for
your patch and I am not good at kernel code ;-)
My test case was akpm's: write 1GB continuously and try to cat a
several-GB file to /dev/null. At the same time I tried starting
other apps/using my email client...
For me the problem has been in mainline for quite some time... checked
back to kernel 2.6.7.
Prakash
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
* Re: Time sliced CFQ io scheduler
2004-12-03 9:26 ` Andrew Morton
2004-12-03 9:34 ` Prakash K. Cheemplavam
@ 2004-12-03 9:39 ` Jens Axboe
2004-12-03 9:54 ` Prakash K. Cheemplavam
[not found] ` <41B03722.5090001@gmx.de>
1 sibling, 2 replies; 66+ messages in thread
From: Jens Axboe @ 2004-12-03 9:39 UTC (permalink / raw)
To: Andrew Morton; +Cc: Prakash K. Cheemplavam, linux-kernel, nickpiggin
On Fri, Dec 03 2004, Andrew Morton wrote:
> "Prakash K. Cheemplavam" <prakashkc@gmx.de> wrote:
> >
> > > Can you try with the patch that is in the parent of this thread? The
> > > above doesn't look that bad, although read performance could be better
> > > of course. But try with the patch please, I'm sure it should help you
> > > quite a lot.
> > >
> >
> > It actually got worse: Though the read rate seems acceptable, it is not, as
> > interactivity is dead while writing.
>
> Is this a parallel IDE system? SATA? SCSI? If the latter, what driver
> and what is the TCQ depth?
Yeah, that would be interesting to know. Or if the device is on dm or
raid. And what filesystem is being used?
TCQ depth doesn't matter with cfq really, as you can fully control how
deep you go with the drive (default is 4) via max_depth.
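A minimal sketch of the depth cap just described, outside the kernel: dispatch stops once the count of requests already handed to the driver reaches the tunable limit. Struct and function names here are illustrative, not the kernel's API.

```c
#include <assert.h>

/* Sketch of the max_depth gate: the time-sliced CFQ stops handing requests
 * to the driver once rq_in_driver reaches the tunable cap (default 4).
 * Names are made up for illustration. */
struct depth_gate {
	int rq_in_driver;   /* requests currently owned by the driver */
	int cfq_max_depth;  /* tunable cap */
};

/* returns 1 if another request may be dispatched, 0 if the cap is hit */
static int may_dispatch(const struct depth_gate *g)
{
	return g->rq_in_driver < g->cfq_max_depth;
}
```

This mirrors the `rq_in_driver >= cfqd->cfq_max_depth` check that the patch below adds to cfq_next_request().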
Running buffer reads and writes here with new cfq, I get about ~7MiB/sec
read and ~14MiB/sec write aggregate performance with 4 clients (2 of
each) with the default settings. If I up idle period to 6ms and slice
period to 150ms, I get ~13MiB/sec read and ~11MiB/sec write aggregate
for the same run.
So Prakash, please try the same test with those settings:
# cd /sys/block/<dev>/queue/iosched
# echo 6 > idle
# echo 150 > slice
These are the first settings I tried; there may be better ones. If you have
your filesystem on dm/raid, you probably want to do the above for each
device the dm/raid is composed of.
--
Jens Axboe
* Re: Time sliced CFQ io scheduler
2004-12-03 9:35 ` Prakash K. Cheemplavam
@ 2004-12-03 9:43 ` Jens Axboe
0 siblings, 0 replies; 66+ messages in thread
From: Jens Axboe @ 2004-12-03 9:43 UTC (permalink / raw)
To: Prakash K. Cheemplavam; +Cc: akpm, linux-kernel, nickpiggin
On Fri, Dec 03 2004, Prakash K. Cheemplavam wrote:
> Jens Axboe wrote:
> >On Fri, Dec 03 2004, Prakash K. Cheemplavam wrote:
> >
> >>Jens Axboe wrote:
> >>
> >>>On Thu, Dec 02 2004, Prakash K. Cheemplavam wrote:
> >>>
> >>
> >>>>0 3 3080 2208 1156 817712 0 0 3592 75624 1326 2289 1 36 0 63
> >>>>0 3 3080 2664 1156 818240 0 0 5124 15692 1302 992 1 18 0 81
> >>>>0 3 3080 2580 1160 815832 0 0 4356 155792 1375 1064 1 39 0 60
> >>>>0 3 3080 2472 1160 817124 0 0 3076 100852 1345 1138 1 23 0 76
> >>>>2 4 3080 2836 1148 816228 0 0 3336 100412 1352 1379 1 47 0 52
> >>>>0 4 3080 2708 1144 815964 0 0 3844 48908 1343 871 1 25 0 74
> >>>>0 3 3080 2748 1152 815984 0 0 3332 71996 1338 843 1 27 0 72
> >>>
> >>>
> >>>Can you try with the patch that is in the parent of this thread? The
> >>>above doesn't look that bad, although read performance could be better
> >>>of course. But try with the patch please, I'm sure it should help you
> >>>quite a lot.
> >>>
> >>
> >>It actually got worse: Though the read rate seems acceptable, it is not,
> >>as interactivity is dead while writing. I cannot start programmes; other
> >>programmes which want to do i/o pretty much hang. This is only while
> >>writing. While reading there is no such problem.
> >
> >
> >Interesting, thanks for testing. I'll run some tests here as well, so
> >far only the cases mentioned yesterday have been tested.
>
> BTW, in case it is misread: Above (except the io performance as such) is
> no regression: The other schedulers behave the same on my system.
Yes, that's what I assumed. Another thing to keep in mind is that even
with just a single writer, you could have 3 people doing writeout for
you (pdflush for each disk, and the writer itself), while the reader is
on its own. This could affect latencies/bandwidth for the reader in
not-so pleasant ways.
> >You could try and bump the slice period. But I'll experiment and see
> >what happens. What is your test case?
>
> [slice bumping] Uhm, is it doable via proc? I haven't seen any text docs for
> your patch and I am not good at kernel code ;-)
:-)
See my previous mail, it tells you how to do it.
--
Jens Axboe
* Re: Time sliced CFQ io scheduler
2004-12-03 9:39 ` Jens Axboe
@ 2004-12-03 9:54 ` Prakash K. Cheemplavam
[not found] ` <41B03722.5090001@gmx.de>
1 sibling, 0 replies; 66+ messages in thread
From: Prakash K. Cheemplavam @ 2004-12-03 9:54 UTC (permalink / raw)
To: Jens Axboe; +Cc: Andrew Morton, linux-kernel, nickpiggin
[-- Attachment #1: Type: text/plain, Size: 3141 bytes --]
Jens Axboe wrote:
> On Fri, Dec 03 2004, Andrew Morton wrote:
>
>>Is this a parallel IDE system? SATA? SCSI? If the latter, what driver
>>and what is the TCQ depth?
>
>
> Yeah, that would be interesting to know. Or of the device is on dm or
> raid. And what filesystem is being used?
It is ext3. (The writing-makes-reading-starve problem happens on reiserfs
as well. ext2 is not so bad and xfs behaves best, i.e. my email client
doesn't become unusable as in my earlier tests, but is "only" very slow. But
then I only wrote 2GB and nothing continuously.)
> # cd /sys/block/<dev>/queue/iosched
> # echo 6 > idle
> # echo 150 > slice
>
> These are the first I tried, there may be better settings. If you have
> your filesystem on dm/raid, you probably want to do the above for each
> device the dm/raid is composed of.
Yes, I have linux raid (testing md1). I have applied both settings on both
drives and got an interesting new pattern: Now it alternates. My email
client is still not usable while writing though...
0 3 4120 5276 1792 856784 0 0 3880 81372 1528 931 1 17 0 82
0 4 4120 5576 1800 856148 0 0 1292 121136 1379 872 5 12 0 83
0 3 4120 6624 1796 857700 0 0 1548 4464 1324 712 4 7 0 89
1 3 4120 5200 1568 859600 0 0 42608 2308 1679 1392 3 28 0 69
1 3 4120 5212 1472 856672 0 0 12372 94992 1451 1047 1 30 0 69
1 2 4120 5700 1476 856892 0 0 2576 27252 1337 770 2 9 0 89
0 3 4120 5404 1484 860292 0 0 2076 63876 1323 758 2 13 0 85
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r b swpd free buff cache si so bi bo in cs us sy id wa
1 2 4120 5536 1492 860240 0 0 49248 12 1783 1478 2 26 0 72
0 3 4120 5476 1492 857976 0 0 21852 98552 1500 1021 2 20 0 78
0 3 4120 5340 1552 860104 0 0 2672 36920 1321 717 1 15 0 84
0 4 4120 5436 1588 861080 0 0 2364 20748 1331 716 1 12 0 87
0 3 4120 5092 1632 860012 0 0 59520 0 1810 1591 3 32 0 65
0 3 4120 5568 1616 858252 0 0 58232 0 1833 1519 2 30 0 68
0 2 4120 5508 1548 857784 0 0 3864 70500 1347 767 1 13 0 86
0 2 4120 5376 1488 857760 0 0 5164 41440 1317 800 2 15 0 83
0 3 4120 5256 1484 858448 0 0 6452 111292 1342 829 2 22 0 76
0 3 4120 5832 1488 858768 0 0 2060 26624 1320 769 2 5 0 93
3 4 4120 5564 1492 859644 0 0 20568 12 1426 1048 1 25 0 75
0 2 4120 5448 1516 857548 0 0 41056 47636 1746 1355 2 29 0 69
0 2 4120 5572 1524 858020 0 0 2332 25020 1330 737 1 10 0 89
0 3 4120 5508 1568 858020 0 0 4152 130164 1338 844 2 20 0 78
1 3 4120 6260 1588 858840 0 0 836 14288 1314 747 1 3 0 96
1 3 4120 5192 1644 860304 0 0 41628 12 1677 2226 2 31 0 67
2 3 4120 5308 1588 857448 0 0 53324 1456 2044 9456 2 60 0 38
Prakash
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
* Re: Time sliced CFQ io scheduler
[not found] ` <41B03722.5090001@gmx.de>
@ 2004-12-03 10:31 ` Jens Axboe
2004-12-03 10:38 ` Jens Axboe
0 siblings, 1 reply; 66+ messages in thread
From: Jens Axboe @ 2004-12-03 10:31 UTC (permalink / raw)
To: Prakash K. Cheemplavam; +Cc: Andrew Morton, Linux Kernel, nickpiggin
On Fri, Dec 03 2004, Prakash K. Cheemplavam wrote:
> Jens Axboe wrote:
> >On Fri, Dec 03 2004, Andrew Morton wrote:
> >
> >>"Prakash K. Cheemplavam" <prakashkc@gmx.de> wrote:
> >>
> >>Is this a parallel IDE system? SATA? SCSI? If the latter, what driver
> >>and what is the TCQ depth?
> >
> >
> >Yeah, that would be interesting to know. Or of the device is on dm or
> >raid. And what filesystem is being used?
>
> It is ext3. (The writing-makes-reading-starve problem happens on reiserfs
> as well. ext2 is not so bad and xfs behaves best, i.e. my email client
> doesn't become unusable as in my earlier tests, but is "only" very slow. But
> then I only wrote 2GB and nothing continuously.)
It's impossible to give really good results on ext3/reiser in my
experience, because reads often need to generate a write as well. What
could work is if a reader got PF_SYNCWRITE set while that happens.
Or even better would be to kill that horrible PF_SYNCWRITE hack (Andrew,
how could you!) and really have the fs use the proper WRITE_SYNC
instead.
> >So Prakash, please try the same test with those settings:
> >
> ># cd /sys/block/<dev>/queue/iosched
> ># echo 6 > idle
> ># echo 150 > slice
> >
> >These are the first I tried, there may be better settings. If you have
> >your filesystem on dm/raid, you probably want to do the above for each
> >device the dm/raid is composed of.
>
> Yes, I have linux raid (testing md1). I have applied both settings on
> both drives and got an interesting new pattern: Now it alternates. My
> email client is still not usable while writing though...
Funky. It looks like another case of the io scheduler being at the wrong
place - if raid sends dependent reads to different drives, it screws up
the io scheduling. The right way to fix that would be to do the io
scheduling before raid (the reverse of what we do now), but that is a lot
of work. A hack would be to try and tie processes to one md component for
periods of time, sort of like cfq slicing.
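The "tie a process to one md component for a period of time" hack could be sketched like this. The hashing scheme, names, and parameters are all invented for illustration; the idea is only that the component choice stays stable within a coarse time window, so a process keeps hitting the same drive for a slice-like period.

```c
#include <assert.h>

/* Hypothetical sketch of pinning a process to one md component per time
 * window, as suggested above. The pid is combined with a coarse time
 * epoch so the choice is stable within a window but may rotate across
 * windows. Invented for illustration only. */
static int md_component(int pid, unsigned long now, unsigned long window,
			int nr_components)
{
	unsigned long epoch = now / window;	/* constant within one window */

	return (int)(((unsigned long)pid ^ epoch) % nr_components);
}
```

Within one window every request from a given pid maps to the same component; across windows the mapping can shift so no process is stuck on one drive forever.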
--
Jens Axboe
* Re: Time sliced CFQ io scheduler
2004-12-03 10:31 ` Jens Axboe
@ 2004-12-03 10:38 ` Jens Axboe
2004-12-03 10:45 ` Prakash K. Cheemplavam
0 siblings, 1 reply; 66+ messages in thread
From: Jens Axboe @ 2004-12-03 10:38 UTC (permalink / raw)
To: Prakash K. Cheemplavam; +Cc: Andrew Morton, Linux Kernel, nickpiggin
On Fri, Dec 03 2004, Jens Axboe wrote:
> > >So Prakash, please try the same test with those settings:
> > >
> > ># cd /sys/block/<dev>/queue/iosched
> > ># echo 6 > idle
> > ># echo 150 > slice
> > >
> > >These are the first I tried, there may be better settings. If you have
> > >your filesystem on dm/raid, you probably want to do the above for each
> > >device the dm/raid is composed of.
> >
> > Yes, I have linux raid (testing md1). I have applied both settings on
> > both drives and got an interesting new pattern: Now it alternates. My
> > email client is still not usable while writing though...
>
> Funky. It looks like another case of the io scheduler being at the wrong
> place - if raid sends dependent reads to different drives, it screws up
> the io scheduling. The right way to fix that would be to do the io
> scheduling before raid (the reverse of what we do now), but that is a lot
> of work. A hack would be to try and tie processes to one md component for
> periods of time, sort of like cfq slicing.
It makes sense to split the slice period for sync and async requests,
since async requests usually get a lot of requests queued in a short
period of time. It might even make sense to introduce a slice_rq value as
well, limiting the number of requests queued in a given slice.
But at least this patch lets you set slice_sync and slice_async
separately, if you want to experiment.
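In the patch the split ends up as a two-entry slice table indexed by whether the dispatch round saw any sync request (`cfqq->slice_end = cfqd->cfq_slice[!!sync] + jiffies`). A standalone sketch of that selection, with made-up jiffy values:

```c
#include <assert.h>

/* Sketch of the split slice selection from the patch: slice[1] (sync) is
 * used if at least one dispatched request in the round was synchronous,
 * else slice[0] (async). Struct name and values are illustrative. */
struct slice_tbl {
	unsigned long slice[2];	/* [0] = async length, [1] = sync length */
};

static unsigned long slice_end(const struct slice_tbl *t, unsigned long now,
			       int sync_seen)
{
	return now + t->slice[!!sync_seen];	/* !! folds a count to 0/1 */
}
```

With the patch's defaults of HZ/10 sync and HZ/25 async, a queue that dispatched even one read gets the longer sync slice.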
===== drivers/block/cfq-iosched.c 1.15 vs edited =====
--- 1.15/drivers/block/cfq-iosched.c 2004-11-30 07:56:58 +01:00
+++ edited/drivers/block/cfq-iosched.c 2004-12-03 11:36:09 +01:00
@@ -22,21 +22,23 @@
#include <linux/rbtree.h>
#include <linux/mempool.h>
-static unsigned long max_elapsed_crq;
-static unsigned long max_elapsed_dispatch;
-
/*
* tunables
*/
static int cfq_quantum = 4; /* max queue in one round of service */
static int cfq_queued = 8; /* minimum rq allocate limit per-queue*/
-static int cfq_service = HZ; /* period over which service is avg */
static int cfq_fifo_expire_r = HZ / 2; /* fifo timeout for sync requests */
static int cfq_fifo_expire_w = 5 * HZ; /* fifo timeout for async requests */
static int cfq_fifo_rate = HZ / 8; /* fifo expiry rate */
static int cfq_back_max = 16 * 1024; /* maximum backwards seek, in KiB */
static int cfq_back_penalty = 2; /* penalty of a backwards seek */
+static int cfq_slice_sync = HZ / 10;
+static int cfq_slice_async = HZ / 25;
+static int cfq_idle = HZ / 249;
+
+static int cfq_max_depth = 4;
+
/*
* for the hash of cfqq inside the cfqd
*/
@@ -55,6 +57,7 @@
#define list_entry_hash(ptr) hlist_entry((ptr), struct cfq_rq, hash)
#define list_entry_cfqq(ptr) list_entry((ptr), struct cfq_queue, cfq_list)
+#define list_entry_fifo(ptr) list_entry((ptr), struct request, queuelist)
#define RQ_DATA(rq) (rq)->elevator_private
@@ -76,22 +79,18 @@
#define rq_rb_key(rq) (rq)->sector
/*
- * threshold for switching off non-tag accounting
- */
-#define CFQ_MAX_TAG (4)
-
-/*
* sort key types and names
*/
enum {
CFQ_KEY_PGID,
CFQ_KEY_TGID,
+ CFQ_KEY_PID,
CFQ_KEY_UID,
CFQ_KEY_GID,
CFQ_KEY_LAST,
};
-static char *cfq_key_types[] = { "pgid", "tgid", "uid", "gid", NULL };
+static char *cfq_key_types[] = { "pgid", "tgid", "pid", "uid", "gid", NULL };
/*
* spare queue
@@ -103,6 +102,8 @@
static kmem_cache_t *cfq_ioc_pool;
struct cfq_data {
+ atomic_t ref;
+
struct list_head rr_list;
struct list_head empty_list;
@@ -114,8 +115,6 @@
unsigned int max_queued;
- atomic_t ref;
-
int key_type;
mempool_t *crq_pool;
@@ -127,6 +126,14 @@
int rq_in_driver;
/*
+ * schedule slice state info
+ */
+ struct timer_list timer;
+ struct work_struct unplug_work;
+ struct cfq_queue *active_queue;
+ unsigned int dispatch_slice;
+
+ /*
* tunables, see top of file
*/
unsigned int cfq_quantum;
@@ -137,8 +144,9 @@
unsigned int cfq_back_penalty;
unsigned int cfq_back_max;
unsigned int find_best_crq;
-
- unsigned int cfq_tagged;
+ unsigned int cfq_slice[2];
+ unsigned int cfq_idle;
+ unsigned int cfq_max_depth;
};
struct cfq_queue {
@@ -150,8 +158,6 @@
struct hlist_node cfq_hash;
/* hash key */
unsigned long key;
- /* whether queue is on rr (or empty) list */
- int on_rr;
/* on either rr or empty list of cfqd */
struct list_head cfq_list;
/* sorted list of pending requests */
@@ -169,10 +175,14 @@
int key_type;
- unsigned long service_start;
- unsigned long service_used;
+ unsigned long slice_start;
+ unsigned long slice_end;
+ unsigned long service_last;
- unsigned int max_rate;
+ /* whether queue is on rr (or empty) list */
+ unsigned int on_rr : 1;
+ unsigned int wait_request : 1;
+ unsigned int must_dispatch : 1;
/* number of requests that have been handed to the driver */
int in_flight;
@@ -219,6 +229,8 @@
default:
case CFQ_KEY_TGID:
return tsk->tgid;
+ case CFQ_KEY_PID:
+ return tsk->pid;
case CFQ_KEY_UID:
return tsk->uid;
case CFQ_KEY_GID:
@@ -406,67 +418,22 @@
cfqq->next_crq = cfq_find_next_crq(cfqq->cfqd, cfqq, crq);
}
-static int cfq_check_sort_rr_list(struct cfq_queue *cfqq)
-{
- struct list_head *head = &cfqq->cfqd->rr_list;
- struct list_head *next, *prev;
-
- /*
- * list might still be ordered
- */
- next = cfqq->cfq_list.next;
- if (next != head) {
- struct cfq_queue *cnext = list_entry_cfqq(next);
-
- if (cfqq->service_used > cnext->service_used)
- return 1;
- }
-
- prev = cfqq->cfq_list.prev;
- if (prev != head) {
- struct cfq_queue *cprev = list_entry_cfqq(prev);
-
- if (cfqq->service_used < cprev->service_used)
- return 1;
- }
-
- return 0;
-}
-
-static void cfq_sort_rr_list(struct cfq_queue *cfqq, int new_queue)
+static void cfq_resort_rr_list(struct cfq_queue *cfqq)
{
struct list_head *entry = &cfqq->cfqd->rr_list;
- if (!cfqq->on_rr)
- return;
- if (!new_queue && !cfq_check_sort_rr_list(cfqq))
- return;
-
list_del(&cfqq->cfq_list);
/*
- * sort by our mean service_used, sub-sort by in-flight requests
+ * sort by when queue was last serviced
*/
while ((entry = entry->prev) != &cfqq->cfqd->rr_list) {
struct cfq_queue *__cfqq = list_entry_cfqq(entry);
- if (cfqq->service_used > __cfqq->service_used)
+ if (!__cfqq->service_last)
+ break;
+ if (time_before(__cfqq->service_last, cfqq->service_last))
break;
- else if (cfqq->service_used == __cfqq->service_used) {
- struct list_head *prv;
-
- while ((prv = entry->prev) != &cfqq->cfqd->rr_list) {
- __cfqq = list_entry_cfqq(prv);
-
- WARN_ON(__cfqq->service_used > cfqq->service_used);
- if (cfqq->service_used != __cfqq->service_used)
- break;
- if (cfqq->in_flight > __cfqq->in_flight)
- break;
-
- entry = prv;
- }
- }
}
list_add(&cfqq->cfq_list, entry);
@@ -479,16 +446,12 @@
static inline void
cfq_add_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
{
- /*
- * it's currently on the empty list
- */
- cfqq->on_rr = 1;
- cfqd->busy_queues++;
+ BUG_ON(cfqq->on_rr);
- if (time_after(jiffies, cfqq->service_start + cfq_service))
- cfqq->service_used >>= 3;
+ cfqd->busy_queues++;
+ cfqq->on_rr = 1;
- cfq_sort_rr_list(cfqq, 1);
+ cfq_resort_rr_list(cfqq);
}
static inline void
@@ -512,10 +475,10 @@
struct cfq_data *cfqd = cfqq->cfqd;
BUG_ON(!cfqq->queued[crq->is_sync]);
+ cfqq->queued[crq->is_sync]--;
cfq_update_next_crq(crq);
- cfqq->queued[crq->is_sync]--;
rb_erase(&crq->rb_node, &cfqq->sort_list);
RB_CLEAR_COLOR(&crq->rb_node);
@@ -622,11 +585,6 @@
if (crq) {
struct cfq_queue *cfqq = crq->cfq_queue;
- if (cfqq->cfqd->cfq_tagged) {
- cfqq->service_used--;
- cfq_sort_rr_list(cfqq, 0);
- }
-
crq->accounted = 0;
cfqq->cfqd->rq_in_driver--;
}
@@ -640,9 +598,7 @@
if (crq) {
cfq_remove_merge_hints(q, crq);
list_del_init(&rq->queuelist);
-
- if (crq->cfq_queue)
- cfq_del_crq_rb(crq);
+ cfq_del_crq_rb(crq);
}
}
@@ -724,6 +680,98 @@
}
/*
+ * current cfqq expired its slice (or was too idle), select new one
+ */
+static inline void cfq_slice_expired(struct cfq_data *cfqd)
+{
+ struct cfq_queue *cfqq = cfqd->active_queue;
+ unsigned long now = jiffies;
+
+ if (cfqq) {
+ if (cfqq->wait_request)
+ del_timer(&cfqd->timer);
+
+ cfqq->service_last = now;
+ cfqq->must_dispatch = 0;
+
+ if (cfqq->on_rr)
+ cfq_resort_rr_list(cfqq);
+
+ cfqq = NULL;
+ }
+
+ if (!list_empty(&cfqd->rr_list)) {
+ cfqq = list_entry_cfqq(cfqd->rr_list.next);
+
+ cfqq->slice_start = now;
+ cfqq->slice_end = 0;
+ cfqq->wait_request = 0;
+ }
+
+ cfqd->active_queue = cfqq;
+ cfqd->dispatch_slice = 0;
+}
+
+static int cfq_arm_slice_timer(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+{
+ WARN_ON(!RB_EMPTY(&cfqq->sort_list));
+
+ cfqq->wait_request = 1;
+
+ if (!cfqd->cfq_idle)
+ return 0;
+
+ if (!timer_pending(&cfqd->timer)) {
+ unsigned long now = jiffies, slice_left;
+
+ slice_left = cfqq->slice_end - now;
+ cfqd->timer.expires = now + min(cfqd->cfq_idle,(unsigned int)slice_left);
+ add_timer(&cfqd->timer);
+ }
+
+ return 1;
+}
+
+/*
+ * get next queue for service
+ */
+static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
+{
+ struct cfq_queue *cfqq = cfqd->active_queue;
+ unsigned long now = jiffies;
+
+ cfqq = cfqd->active_queue;
+ if (!cfqq)
+ goto new_queue;
+
+ if (cfqq->must_dispatch)
+ goto must_queue;
+
+ /*
+ * slice has expired
+ */
+ if (time_after(jiffies, cfqq->slice_end))
+ goto new_queue;
+
+ /*
+ * if queue has requests, dispatch one. if not, check if
+ * enough slice is left to wait for one
+ */
+must_queue:
+ if (!RB_EMPTY(&cfqq->sort_list))
+ goto keep_queue;
+ else if (cfqq->slice_end - now >= cfqd->cfq_idle) {
+ if (cfq_arm_slice_timer(cfqd, cfqq))
+ return NULL;
+ }
+
+new_queue:
+ cfq_slice_expired(cfqd);
+keep_queue:
+ return cfqd->active_queue;
+}
+
+/*
* we dispatch cfqd->cfq_quantum requests in total from the rr_list queues,
* this function sector sorts the selected request to minimize seeks. we start
* at cfqd->last_sector, not 0.
@@ -741,9 +789,7 @@
list_del(&crq->request->queuelist);
last = cfqd->last_sector;
- while ((entry = entry->prev) != head) {
- __rq = list_entry_rq(entry);
-
+ list_for_each_entry_reverse(__rq, head, queuelist) {
if (blk_barrier_rq(crq->request))
break;
if (!blk_fs_request(crq->request))
@@ -777,95 +823,95 @@
if (time_before(now, cfqq->last_fifo_expire + cfqd->cfq_fifo_batch_expire))
return NULL;
- crq = RQ_DATA(list_entry(cfqq->fifo[0].next, struct request, queuelist));
- if (reads && time_after(now, crq->queue_start + cfqd->cfq_fifo_expire_r)) {
- cfqq->last_fifo_expire = now;
- return crq;
+ if (reads) {
+ crq = RQ_DATA(list_entry_fifo(cfqq->fifo[READ].next));
+ if (time_after(now, crq->queue_start + cfqd->cfq_fifo_expire_r)) {
+ cfqq->last_fifo_expire = now;
+ return crq;
+ }
}
- crq = RQ_DATA(list_entry(cfqq->fifo[1].next, struct request, queuelist));
- if (writes && time_after(now, crq->queue_start + cfqd->cfq_fifo_expire_w)) {
- cfqq->last_fifo_expire = now;
- return crq;
+ if (writes) {
+ crq = RQ_DATA(list_entry_fifo(cfqq->fifo[WRITE].next));
+ if (time_after(now, crq->queue_start + cfqd->cfq_fifo_expire_w)) {
+ cfqq->last_fifo_expire = now;
+ return crq;
+ }
}
return NULL;
}
-/*
- * dispatch a single request from given queue
- */
-static inline void
-cfq_dispatch_request(request_queue_t *q, struct cfq_data *cfqd,
- struct cfq_queue *cfqq)
+static int
+__cfq_dispatch_requests(struct cfq_data *cfqd, struct cfq_queue *cfqq,
+ int max_dispatch)
{
- struct cfq_rq *crq;
+ int dispatched = 0, sync = 0;
- /*
- * follow expired path, else get first next available
- */
- if ((crq = cfq_check_fifo(cfqq)) == NULL) {
- if (cfqd->find_best_crq)
- crq = cfqq->next_crq;
- else
- crq = rb_entry_crq(rb_first(&cfqq->sort_list));
- }
+ BUG_ON(RB_EMPTY(&cfqq->sort_list));
+
+ do {
+ struct cfq_rq *crq;
+
+ /*
+ * follow expired path, else get first next available
+ */
+ if ((crq = cfq_check_fifo(cfqq)) == NULL) {
+ if (cfqd->find_best_crq)
+ crq = cfqq->next_crq;
+ else
+ crq = rb_entry_crq(rb_first(&cfqq->sort_list));
+ }
+
+ cfqd->last_sector = crq->request->sector + crq->request->nr_sectors;
+
+ /*
+ * finally, insert request into driver list
+ */
+ cfq_dispatch_sort(cfqd->queue, crq);
+
+ cfqd->dispatch_slice++;
+ dispatched++;
+ sync += crq->is_sync;
+
+ if (RB_EMPTY(&cfqq->sort_list))
+ break;
- cfqd->last_sector = crq->request->sector + crq->request->nr_sectors;
+ } while (dispatched < max_dispatch);
/*
- * finally, insert request into driver list
+ * if slice end isn't set yet, set it. if at least one request was
+ * sync, use the sync time slice value
*/
- cfq_dispatch_sort(q, crq);
+ if (!cfqq->slice_end)
+ cfqq->slice_end = cfqd->cfq_slice[!!sync] + jiffies;
+
+
+ return dispatched;
}
static int cfq_dispatch_requests(request_queue_t *q, int max_dispatch)
{
struct cfq_data *cfqd = q->elevator->elevator_data;
struct cfq_queue *cfqq;
- struct list_head *entry, *tmp;
- int queued, busy_queues, first_round;
if (list_empty(&cfqd->rr_list))
return 0;
- queued = 0;
- first_round = 1;
-restart:
- busy_queues = 0;
- list_for_each_safe(entry, tmp, &cfqd->rr_list) {
- cfqq = list_entry_cfqq(entry);
-
- BUG_ON(RB_EMPTY(&cfqq->sort_list));
-
- /*
- * first round of queueing, only select from queues that
- * don't already have io in-flight
- */
- if (first_round && cfqq->in_flight)
- continue;
-
- cfq_dispatch_request(q, cfqd, cfqq);
-
- if (!RB_EMPTY(&cfqq->sort_list))
- busy_queues++;
-
- queued++;
- }
-
- if ((queued < max_dispatch) && (busy_queues || first_round)) {
- first_round = 0;
- goto restart;
- }
+ cfqq = cfq_select_queue(cfqd);
+ if (!cfqq)
+ return 0;
- return queued;
+ cfqq->wait_request = 0;
+ cfqq->must_dispatch = 0;
+ del_timer(&cfqd->timer);
+ return __cfq_dispatch_requests(cfqd, cfqq, max_dispatch);
}
static inline void cfq_account_dispatch(struct cfq_rq *crq)
{
struct cfq_queue *cfqq = crq->cfq_queue;
struct cfq_data *cfqd = cfqq->cfqd;
- unsigned long now, elapsed;
/*
* accounted bit is necessary since some drivers will call
@@ -874,37 +920,9 @@
if (crq->accounted)
return;
- now = jiffies;
- if (cfqq->service_start == ~0UL)
- cfqq->service_start = now;
-
- /*
- * on drives with tagged command queueing, command turn-around time
- * doesn't necessarily reflect the time spent processing this very
- * command inside the drive. so do the accounting differently there,
- * by just sorting on the number of requests
- */
- if (cfqd->cfq_tagged) {
- if (time_after(now, cfqq->service_start + cfq_service)) {
- cfqq->service_start = now;
- cfqq->service_used /= 10;
- }
-
- cfqq->service_used++;
- cfq_sort_rr_list(cfqq, 0);
- }
-
- elapsed = now - crq->queue_start;
- if (elapsed > max_elapsed_dispatch)
- max_elapsed_dispatch = elapsed;
-
crq->accounted = 1;
- crq->service_start = now;
-
- if (++cfqd->rq_in_driver >= CFQ_MAX_TAG && !cfqd->cfq_tagged) {
- cfqq->cfqd->cfq_tagged = 1;
- printk("cfq: depth %d reached, tagging now on\n", CFQ_MAX_TAG);
- }
+ crq->service_start = jiffies;
+ cfqd->rq_in_driver++;
}
static inline void
@@ -915,21 +933,18 @@
WARN_ON(!cfqd->rq_in_driver);
cfqd->rq_in_driver--;
- if (!cfqd->cfq_tagged) {
- unsigned long now = jiffies;
- unsigned long duration = now - crq->service_start;
-
- if (time_after(now, cfqq->service_start + cfq_service)) {
- cfqq->service_start = now;
- cfqq->service_used >>= 3;
- }
-
- cfqq->service_used += duration;
- cfq_sort_rr_list(cfqq, 0);
+ /*
+ * queue was preempted while this request was servicing
+ */
+ if (cfqd->active_queue != cfqq)
+ return;
- if (duration > max_elapsed_crq)
- max_elapsed_crq = duration;
- }
+ /*
+ * no requests. if last request was a sync request, wait for
+ * a new one.
+ */
+ if (RB_EMPTY(&cfqq->sort_list) && crq->is_sync)
+ cfq_arm_slice_timer(cfqd, cfqq);
}
static struct request *cfq_next_request(request_queue_t *q)
@@ -937,6 +952,9 @@
struct cfq_data *cfqd = q->elevator->elevator_data;
struct request *rq;
+ if (cfqd->rq_in_driver >= cfqd->cfq_max_depth)
+ return NULL;
+
if (!list_empty(&q->queue_head)) {
struct cfq_rq *crq;
dispatch:
@@ -964,6 +982,8 @@
*/
static void cfq_put_queue(struct cfq_queue *cfqq)
{
+ struct cfq_data *cfqd = cfqq->cfqd;
+
BUG_ON(!atomic_read(&cfqq->ref));
if (!atomic_dec_and_test(&cfqq->ref))
@@ -972,6 +992,9 @@
BUG_ON(rb_first(&cfqq->sort_list));
BUG_ON(cfqq->on_rr);
+ if (unlikely(cfqd->active_queue == cfqq))
+ cfqd->active_queue = NULL;
+
cfq_put_cfqd(cfqq->cfqd);
/*
@@ -1117,6 +1140,7 @@
cic->ioc = ioc;
cic->cfqq = __cfqq;
atomic_inc(&__cfqq->ref);
+ atomic_inc(&cfqd->ref);
} else {
struct cfq_io_context *__cic;
unsigned long flags;
@@ -1159,10 +1183,10 @@
__cic->ioc = ioc;
__cic->cfqq = __cfqq;
atomic_inc(&__cfqq->ref);
+ atomic_inc(&cfqd->ref);
spin_lock_irqsave(&ioc->lock, flags);
list_add(&__cic->list, &cic->list);
spin_unlock_irqrestore(&ioc->lock, flags);
-
cic = __cic;
*cfqq = __cfqq;
}
@@ -1199,8 +1223,11 @@
new_cfqq = kmem_cache_alloc(cfq_pool, gfp_mask);
spin_lock_irq(cfqd->queue->queue_lock);
goto retry;
- } else
- goto out;
+ } else {
+ cfqq = kmem_cache_alloc(cfq_pool, gfp_mask);
+ if (!cfqq)
+ goto out;
+ }
memset(cfqq, 0, sizeof(*cfqq));
@@ -1216,7 +1243,7 @@
cfqq->cfqd = cfqd;
atomic_inc(&cfqd->ref);
cfqq->key_type = cfqd->key_type;
- cfqq->service_start = ~0UL;
+ cfqq->service_last = 0;
}
if (new_cfqq)
@@ -1243,14 +1270,25 @@
static void cfq_enqueue(struct cfq_data *cfqd, struct cfq_rq *crq)
{
- crq->is_sync = 0;
- if (rq_data_dir(crq->request) == READ || current->flags & PF_SYNCWRITE)
- crq->is_sync = 1;
+ struct cfq_queue *cfqq = crq->cfq_queue;
+ struct request *rq = crq->request;
+
+ crq->is_sync = rq_data_dir(rq) == READ || current->flags & PF_SYNCWRITE;
cfq_add_crq_rb(crq);
crq->queue_start = jiffies;
- list_add_tail(&crq->request->queuelist, &crq->cfq_queue->fifo[crq->is_sync]);
+ list_add_tail(&rq->queuelist, &cfqq->fifo[crq->is_sync]);
+
+ /*
+ * if we are waiting for a request for this queue, let it rip
+ * immediately and flag that we must not expire this queue just now
+ */
+ if (cfqq->wait_request && cfqq == cfqd->active_queue) {
+ cfqq->must_dispatch = 1;
+ del_timer(&cfqd->timer);
+ cfqd->queue->request_fn(cfqd->queue);
+ }
}
static void
@@ -1339,31 +1377,34 @@
struct cfq_data *cfqd = q->elevator->elevator_data;
struct cfq_queue *cfqq;
int ret = ELV_MQUEUE_MAY;
+ int limit;
if (current->flags & PF_MEMALLOC)
return ELV_MQUEUE_MAY;
cfqq = cfq_find_cfq_hash(cfqd, cfq_hash_key(cfqd, current));
- if (cfqq) {
- int limit = cfqd->max_queued;
-
- if (cfqq->allocated[rw] < cfqd->cfq_queued)
- return ELV_MQUEUE_MUST;
-
- if (cfqd->busy_queues)
- limit = q->nr_requests / cfqd->busy_queues;
-
- if (limit < cfqd->cfq_queued)
- limit = cfqd->cfq_queued;
- else if (limit > cfqd->max_queued)
- limit = cfqd->max_queued;
+ if (unlikely(!cfqq))
+ return ELV_MQUEUE_MAY;
- if (cfqq->allocated[rw] >= limit) {
- if (limit > cfqq->alloc_limit[rw])
- cfqq->alloc_limit[rw] = limit;
+ if (cfqq->allocated[rw] < cfqd->cfq_queued)
+ return ELV_MQUEUE_MUST;
+ if (cfqq->wait_request)
+ return ELV_MQUEUE_MUST;
+
+ limit = cfqd->max_queued;
+ if (cfqd->busy_queues)
+ limit = q->nr_requests / cfqd->busy_queues;
+
+ if (limit < cfqd->cfq_queued)
+ limit = cfqd->cfq_queued;
+ else if (limit > cfqd->max_queued)
+ limit = cfqd->max_queued;
+
+ if (cfqq->allocated[rw] >= limit) {
+ if (limit > cfqq->alloc_limit[rw])
+ cfqq->alloc_limit[rw] = limit;
- ret = ELV_MQUEUE_NO;
- }
+ ret = ELV_MQUEUE_NO;
}
return ret;
@@ -1395,12 +1436,12 @@
BUG_ON(q->last_merge == rq);
BUG_ON(!hlist_unhashed(&crq->hash));
- if (crq->io_context)
- put_io_context(crq->io_context->ioc);
-
BUG_ON(!cfqq->allocated[crq->is_write]);
cfqq->allocated[crq->is_write]--;
+ if (crq->io_context)
+ put_io_context(crq->io_context->ioc);
+
mempool_free(crq, cfqd->crq_pool);
rq->elevator_private = NULL;
@@ -1473,6 +1514,7 @@
crq->is_write = rw;
rq->elevator_private = crq;
cfqq->alloc_limit[rw] = 0;
+ smp_mb();
return 0;
}
@@ -1486,6 +1528,44 @@
return 1;
}
+static void cfq_kick_queue(void *data)
+{
+ request_queue_t *q = data;
+
+ blk_run_queue(q);
+}
+
+static void cfq_schedule_timer(unsigned long data)
+{
+ struct cfq_data *cfqd = (struct cfq_data *) data;
+ struct cfq_queue *cfqq;
+ unsigned long flags;
+
+ spin_lock_irqsave(cfqd->queue->queue_lock, flags);
+
+ if ((cfqq = cfqd->active_queue) != NULL) {
+ /*
+ * expired
+ */
+ if (time_after(jiffies, cfqq->slice_end))
+ goto out;
+
+ /*
+ * not expired and it has a request pending, let it dispatch
+ */
+ if (!RB_EMPTY(&cfqq->sort_list)) {
+ cfqq->must_dispatch = 1;
+ goto out_cont;
+ }
+ }
+
+out:
+ cfq_slice_expired(cfqd);
+out_cont:
+ spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
+ kblockd_schedule_work(&cfqd->unplug_work);
+}
+
static void cfq_put_cfqd(struct cfq_data *cfqd)
{
request_queue_t *q = cfqd->queue;
@@ -1494,6 +1574,8 @@
if (!atomic_dec_and_test(&cfqd->ref))
return;
+ blk_sync_queue(q);
+
/*
 * kill spare queue, getting it means we have two references to it.
* drop both
@@ -1567,8 +1649,15 @@
q->nr_requests = 1024;
cfqd->max_queued = q->nr_requests / 16;
q->nr_batching = cfq_queued;
- cfqd->key_type = CFQ_KEY_TGID;
+ cfqd->key_type = CFQ_KEY_PID;
cfqd->find_best_crq = 1;
+
+ init_timer(&cfqd->timer);
+ cfqd->timer.function = cfq_schedule_timer;
+ cfqd->timer.data = (unsigned long) cfqd;
+
+ INIT_WORK(&cfqd->unplug_work, cfq_kick_queue, q);
+
atomic_set(&cfqd->ref, 1);
cfqd->cfq_queued = cfq_queued;
@@ -1578,6 +1667,10 @@
cfqd->cfq_fifo_batch_expire = cfq_fifo_rate;
cfqd->cfq_back_max = cfq_back_max;
cfqd->cfq_back_penalty = cfq_back_penalty;
+ cfqd->cfq_slice[0] = cfq_slice_async;
+ cfqd->cfq_slice[1] = cfq_slice_sync;
+ cfqd->cfq_idle = cfq_idle;
+ cfqd->cfq_max_depth = cfq_max_depth;
return 0;
out_spare:
@@ -1624,7 +1717,6 @@
return -ENOMEM;
}
-
/*
* sysfs parts below -->
*/
@@ -1650,13 +1742,6 @@
}
static ssize_t
-cfq_clear_elapsed(struct cfq_data *cfqd, const char *page, size_t count)
-{
- max_elapsed_dispatch = max_elapsed_crq = 0;
- return count;
-}
-
-static ssize_t
cfq_set_key_type(struct cfq_data *cfqd, const char *page, size_t count)
{
spin_lock_irq(cfqd->queue->queue_lock);
@@ -1664,6 +1749,8 @@
cfqd->key_type = CFQ_KEY_PGID;
else if (!strncmp(page, "tgid", 4))
cfqd->key_type = CFQ_KEY_TGID;
+ else if (!strncmp(page, "pid", 3))
+ cfqd->key_type = CFQ_KEY_PID;
else if (!strncmp(page, "uid", 3))
cfqd->key_type = CFQ_KEY_UID;
else if (!strncmp(page, "gid", 3))
@@ -1704,6 +1791,10 @@
SHOW_FUNCTION(cfq_find_best_show, cfqd->find_best_crq, 0);
SHOW_FUNCTION(cfq_back_max_show, cfqd->cfq_back_max, 0);
SHOW_FUNCTION(cfq_back_penalty_show, cfqd->cfq_back_penalty, 0);
+SHOW_FUNCTION(cfq_idle_show, cfqd->cfq_idle, 1);
+SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
+SHOW_FUNCTION(cfq_slice_sync_show, cfqd->cfq_slice[1], 1);
+SHOW_FUNCTION(cfq_max_depth_show, cfqd->cfq_max_depth, 0);
#undef SHOW_FUNCTION
#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV) \
@@ -1729,6 +1820,10 @@
STORE_FUNCTION(cfq_find_best_store, &cfqd->find_best_crq, 0, 1, 0);
STORE_FUNCTION(cfq_back_max_store, &cfqd->cfq_back_max, 0, UINT_MAX, 0);
STORE_FUNCTION(cfq_back_penalty_store, &cfqd->cfq_back_penalty, 1, UINT_MAX, 0);
+STORE_FUNCTION(cfq_idle_store, &cfqd->cfq_idle, 0, UINT_MAX, 1);
+STORE_FUNCTION(cfq_slice_async_store, &cfqd->cfq_slice[0], 1, UINT_MAX, 1);
+STORE_FUNCTION(cfq_slice_sync_store, &cfqd->cfq_slice[1], 1, UINT_MAX, 1);
+STORE_FUNCTION(cfq_max_depth_store, &cfqd->cfq_max_depth, 2, UINT_MAX, 0);
#undef STORE_FUNCTION
static struct cfq_fs_entry cfq_quantum_entry = {
@@ -1771,15 +1866,31 @@
.show = cfq_back_penalty_show,
.store = cfq_back_penalty_store,
};
-static struct cfq_fs_entry cfq_clear_elapsed_entry = {
- .attr = {.name = "clear_elapsed", .mode = S_IWUSR },
- .store = cfq_clear_elapsed,
+static struct cfq_fs_entry cfq_slice_sync_entry = {
+ .attr = {.name = "slice_sync", .mode = S_IRUGO | S_IWUSR },
+ .show = cfq_slice_sync_show,
+ .store = cfq_slice_sync_store,
+};
+static struct cfq_fs_entry cfq_slice_async_entry = {
+ .attr = {.name = "slice_async", .mode = S_IRUGO | S_IWUSR },
+ .show = cfq_slice_async_show,
+ .store = cfq_slice_async_store,
+};
+static struct cfq_fs_entry cfq_idle_entry = {
+ .attr = {.name = "idle", .mode = S_IRUGO | S_IWUSR },
+ .show = cfq_idle_show,
+ .store = cfq_idle_store,
};
static struct cfq_fs_entry cfq_key_type_entry = {
.attr = {.name = "key_type", .mode = S_IRUGO | S_IWUSR },
.show = cfq_read_key_type,
.store = cfq_set_key_type,
};
+static struct cfq_fs_entry cfq_max_depth_entry = {
+ .attr = {.name = "max_depth", .mode = S_IRUGO | S_IWUSR },
+ .show = cfq_max_depth_show,
+ .store = cfq_max_depth_store,
+};
static struct attribute *default_attrs[] = {
&cfq_quantum_entry.attr,
@@ -1791,7 +1902,10 @@
&cfq_find_best_entry.attr,
&cfq_back_max_entry.attr,
&cfq_back_penalty_entry.attr,
- &cfq_clear_elapsed_entry.attr,
+ &cfq_slice_sync_entry.attr,
+ &cfq_slice_async_entry.attr,
+ &cfq_idle_entry.attr,
+ &cfq_max_depth_entry.attr,
NULL,
};
@@ -1856,7 +1970,7 @@
.elevator_owner = THIS_MODULE,
};
-int cfq_init(void)
+static int __init cfq_init(void)
{
int ret;
@@ -1864,17 +1978,35 @@
return -ENOMEM;
ret = elv_register(&iosched_cfq);
- if (!ret) {
- __module_get(THIS_MODULE);
- return 0;
- }
+ if (ret)
+ cfq_slab_kill();
- cfq_slab_kill();
return ret;
}
static void __exit cfq_exit(void)
{
+ struct task_struct *g, *p;
+ unsigned long flags;
+
+ read_lock_irqsave(&tasklist_lock, flags);
+
+ /*
+ * iterate each process in the system, removing our io_context
+ */
+ do_each_thread(g, p) {
+ struct io_context *ioc = p->io_context;
+
+ if (ioc && ioc->cic) {
+ ioc->cic->exit(ioc->cic);
+ cfq_free_io_context(ioc->cic);
+ ioc->cic = NULL;
+ }
+
+ } while_each_thread(g, p);
+
+ read_unlock_irqrestore(&tasklist_lock, flags);
+
cfq_slab_kill();
elv_unregister(&iosched_cfq);
}
--
Jens Axboe
* Re: Time sliced CFQ io scheduler
2004-12-03 10:38 ` Jens Axboe
@ 2004-12-03 10:45 ` Prakash K. Cheemplavam
2004-12-03 10:48 ` Jens Axboe
0 siblings, 1 reply; 66+ messages in thread
From: Prakash K. Cheemplavam @ 2004-12-03 10:45 UTC (permalink / raw)
To: Jens Axboe; +Cc: Andrew Morton, Linux Kernel, nickpiggin
Jens Axboe schrieb:
> On Fri, Dec 03 2004, Jens Axboe wrote:
>
>>Funky. It looks like another case of the io scheduler being at the wrong
>>place - if raid sends dependent reads to different drives, it screws up
>>the io scheduling. The right way to fix that would be to do the io
>>scheduling before raid (the reverse of what we do now), but that is a lot
>>of work. A hack would be to try and tie processes to one md component for
>>periods of time, sort of like cfq slicing.
>
>
> It makes sense to split the slice period for sync and async requests,
> since async requests usually get a lot of requests queued in a short
> period of time. Might even make sense to introduce a slice_rq value as
> well, limiting the number of requests queued in a given slice.
>
> But at least this patch lets you set slice_sync and slice_async
> separately, if you want to experiment.
Any idea which values I should try?
In general I rather have the impression that the problem I am experiencing
is not a problem of the io scheduler alone; otherwise, why would they all
show the same problem?
BTW, I just did my little test on the ide drive and it shows the same
problem, so it is not sata / libata related.
Prakash
* Re: Time sliced CFQ io scheduler
2004-12-03 10:45 ` Prakash K. Cheemplavam
@ 2004-12-03 10:48 ` Jens Axboe
2004-12-03 11:27 ` Prakash K. Cheemplavam
0 siblings, 1 reply; 66+ messages in thread
From: Jens Axboe @ 2004-12-03 10:48 UTC (permalink / raw)
To: Prakash K. Cheemplavam; +Cc: Andrew Morton, Linux Kernel, nickpiggin
On Fri, Dec 03 2004, Prakash K. Cheemplavam wrote:
> Jens Axboe schrieb:
> >On Fri, Dec 03 2004, Jens Axboe wrote:
> >
> >>Funky. It looks like another case of the io scheduler being at the wrong
> >>place - if raid sends dependent reads to different drives, it screws up
> >>the io scheduling. The right way to fix that would be to do the io
> >>scheduling before raid (the reverse of what we do now), but that is a lot
> >>of work. A hack would be to try and tie processes to one md component for
> >>periods of time, sort of like cfq slicing.
> >
> >
> >It makes sense to split the slice period for sync and async requests,
> >since async requests usually get a lot of requests queued in a short
> >period of time. Might even make sense to introduce a slice_rq value as
> >well, limiting the number of requests queued in a given slice.
> >
> >But at least this patch lets you set slice_sync and slice_async
> >separately, if you want to experiment.
>
> Any idea which values I should try?
Just see if the default ones work (or how they work :-)
> In general I rather have the impression that the problem I am experiencing
> is not a problem of the io scheduler alone; otherwise, why would they all
> show the same problem?
It is not, but some io schedulers perform better than others.
> BTW, I just did my little test on the ide drive and it shows the same
> problem, so it is not sata / libata related.
Single read/writer case works fine here for me, about half the bandwidth
for each. Please show some vmstats for this case, too. Right now I'm not
terribly interested in problems with raid alone, as I can poke holes in
that. If the single drive case is correct, then we can focus on raid.
--
Jens Axboe
* Re: Time sliced CFQ io scheduler
2004-12-03 10:48 ` Jens Axboe
@ 2004-12-03 11:27 ` Prakash K. Cheemplavam
2004-12-03 11:29 ` Jens Axboe
0 siblings, 1 reply; 66+ messages in thread
From: Prakash K. Cheemplavam @ 2004-12-03 11:27 UTC (permalink / raw)
To: Jens Axboe; +Cc: Andrew Morton, Linux Kernel, nickpiggin
Jens Axboe schrieb:
> On Fri, Dec 03 2004, Prakash K. Cheemplavam wrote:
>
>>>But at least this patch lets you set slice_sync and slice_async
>>>separately, if you want to experiment.
>>
>>Any idea which values I should try?
>
>
> Just see if the default ones work (or how they work :-)
>
>>BTW, I just did my little test on the ide drive and it shows the same
>>problem, so it is not sata / libata related.
>
>
> Single read/writer case works fine here for me, about half the bandwidth
> for each. Please show some vmstats for this case, too. Right now I'm not
> terribly interested in problems with raid alone, as I can poke holes in
> that. If the single drive case is correct, then we can focus on raid.
I don't have enough space to perform this test on the ide drive, so I did
it on the sata drive (single disk). The patch doesn't seem to be better.
(On the other hand, I haven't tested your first version on a single disk.)
At least it still doesn't look good enough in my eyes.
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r b swpd free buff cache si so bi bo in cs us sy id wa
 1 3 2704 5368 1528 906540 0 4 2176 24068 1245 743 0 7 0 93
 0 3 2704 5432 1532 906252 0 0 5072 28160 1277 782 1 8 0 91
 0 5 2704 5688 1532 906080 0 0 9280 4524 1309 842 1 10 0 89
 1 3 2704 5232 1544 906208 0 0 6404 76388 1285 716 1 14 0 85
 0 3 2704 5496 1544 906524 0 0 8328 26624 1301 856 1 8 0 91
 0 3 2704 5512 1528 906636 0 0 9484 22016 1302 883 1 8 0 91
 0 3 2704 5816 1500 906296 0 0 5508 10288 1270 749 1 9 0 90
 0 4 2704 5620 1488 906608 0 0 3076 19920 1267 818 0 13 0 87
 1 4 2704 5684 1456 906432 0 0 3204 18432 1252 704 1 8 0 91
 1 3 2704 5504 1408 906168 0 0 5252 28672 1279 777 1 14 0 85
 0 4 2704 5120 1404 906296 0 0 8968 16384 1351 876 1 9 0 90
 0 4 2704 5364 1404 905620 0 0 5252 26112 1339 835 1 14 0 85
 0 4 2704 5600 1432 905036 0 0 1468 15876 1312 741 2 8 0 90
 1 4 2704 5556 1424 904704 0 0 1664 26112 1243 714 1 10 0 89
 0 4 2704 5492 1428 904100 0 0 1412 31232 1253 760 1 15 0 84
 0 4 2704 5568 1432 903456 0 0 1668 29696 1253 703 1 14 0 85
 1 4 2704 5620 1408 902980 0 0 1280 28672 1248 732 0 14 0 86
 0 4 2704 5236 1404 902888 0 0 2180 28704 1252 705 1 11 0 88
 0 4 2704 5632 1388 902180 0 0 1536 28160 1251 731 1 11 0 88
 0 3 2704 5120 1356 905968 0 0 384 57896 1257 751 1 14 0 85
What I don't like about the time sliced cfq (first version as well) is
that I don't get a good sustained rate anymore if I have only one writer
and nothing else. IIRC with plain cfq I at least got near maximum
throughput (40-50MB/sec); now it oscillates much more. I have to recheck
with plain cfq though. It might be ext3 related...
 0 2 2684 7016 9384 900664 0 0 0 59128 1217 576 1 7 0 92
 1 1 2684 5160 9368 898660 0 0 0 12300 1239 4861 1 60 0 39
 0 3 2684 5532 9364 896360 0 0 0 18684 1246 1723 1 48 0 51
 0 3 2684 5596 9364 896616 0 0 0 24576 1246 686 1 9 0 90
 0 3 2684 5596 9364 896612 0 0 0 38400 1261 718 0 13 0 87
 0 3 2684 5532 9360 896564 0 0 0 37888 1257 708 1 13 0 86
 0 3 2684 5532 8848 896884 0 0 0 36864 1260 825 1 12 0 87
 1 3 2696 5596 7440 898120 0 0 0 31744 1247 703 1 11 0 88
 0 3 2700 5660 5352 900080 0 0 0 37888 1258 768 1 13 0 86
 0 2 2700 6816 5216 900436 0 0 0 68772 1266 783 1 25 0 74
 0 2 2700 6884 5216 900436 0 0 0 19616 1247 679 2 1 0 97
 1 2 2700 7096 5216 900436 0 0 0 14976 1249 786 1 3 0 96
 0 2 2700 5352 4572 902432 0 0 4 66544 1263 2333 1 21 0 78
Prakash
* Re: Time sliced CFQ io scheduler
2004-12-03 11:27 ` Prakash K. Cheemplavam
@ 2004-12-03 11:29 ` Jens Axboe
2004-12-03 11:52 ` Prakash K. Cheemplavam
0 siblings, 1 reply; 66+ messages in thread
From: Jens Axboe @ 2004-12-03 11:29 UTC (permalink / raw)
To: Prakash K. Cheemplavam; +Cc: Andrew Morton, Linux Kernel, nickpiggin
On Fri, Dec 03 2004, Prakash K. Cheemplavam wrote:
> Jens Axboe schrieb:
> >On Fri, Dec 03 2004, Prakash K. Cheemplavam wrote:
> >
> >>>But at least this patch lets you set slice_sync and slice_async
> >>>separately, if you want to experiment.
> >>
> >>Any idea which values I should try?
> >
> >
> >Just see if the default ones work (or how they work :-)
> >
> >>BTW, I just did my little test on the ide drive and it shows the same
> >>problem, so it is not sata / libata related.
> >
> >
> >Single read/writer case works fine here for me, about half the bandwidth
> >for each. Please show some vmstats for this case, too. Right now I'm not
> >terribly interested in problems with raid alone, as I can poke holes in
> >that. If the single drive case is correct, then we can focus on raid.
>
> I don't have enough space to perform this test on the ide drive, so I did
> it on the sata drive (single disk). The patch doesn't seem to be better.
> (On the other hand, I haven't tested your first version on a single disk.)
> At least it still doesn't look good enough in my eyes.
>
> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
> r b swpd free buff cache si so bi bo in cs us sy id wa
> 1 3 2704 5368 1528 906540 0 4 2176 24068 1245 743 0 7 0 93
> 0 3 2704 5432 1532 906252 0 0 5072 28160 1277 782 1 8 0 91
> 0 5 2704 5688 1532 906080 0 0 9280 4524 1309 842 1 10 0 89
> 1 3 2704 5232 1544 906208 0 0 6404 76388 1285 716 1 14 0 85
> 0 3 2704 5496 1544 906524 0 0 8328 26624 1301 856 1 8 0 91
> 0 3 2704 5512 1528 906636 0 0 9484 22016 1302 883 1 8 0 91
> 0 3 2704 5816 1500 906296 0 0 5508 10288 1270 749 1 9 0 90
> 0 4 2704 5620 1488 906608 0 0 3076 19920 1267 818 0 13 0 87
> 1 4 2704 5684 1456 906432 0 0 3204 18432 1252 704 1 8 0 91
> 1 3 2704 5504 1408 906168 0 0 5252 28672 1279 777 1 14 0 85
> 0 4 2704 5120 1404 906296 0 0 8968 16384 1351 876 1 9 0 90
> 0 4 2704 5364 1404 905620 0 0 5252 26112 1339 835 1 14 0 85
> 0 4 2704 5600 1432 905036 0 0 1468 15876 1312 741 2 8 0 90
> 1 4 2704 5556 1424 904704 0 0 1664 26112 1243 714 1 10 0 89
> 0 4 2704 5492 1428 904100 0 0 1412 31232 1253 760 1 15 0 84
> 0 4 2704 5568 1432 903456 0 0 1668 29696 1253 703 1 14 0 85
> 1 4 2704 5620 1408 902980 0 0 1280 28672 1248 732 0 14 0 86
> 0 4 2704 5236 1404 902888 0 0 2180 28704 1252 705 1 11 0 88
> 0 4 2704 5632 1388 902180 0 0 1536 28160 1251 731 1 11 0 88
> 0 3 2704 5120 1356 905968 0 0 384 57896 1257 751 1 14 0 85
Try increasing slice_sync and idle, just for fun.
> What I don't like about the time sliced cfq (first version as well) is
> that I don't get a good sustained rate anymore if I have only one writer
> and nothing else. IIRC with plain cfq I at least got near maximum
> throughput (40-50MB/sec); now it oscillates much more. I have to recheck
> with plain cfq though. It might be ext3 related...
>
> 0 2 2684 7016 9384 900664 0 0 0 59128 1217 576 1 7 0 92
> 1 1 2684 5160 9368 898660 0 0 0 12300 1239 4861 1 60 0 39
> 0 3 2684 5532 9364 896360 0 0 0 18684 1246 1723 1 48 0 51
> 0 3 2684 5596 9364 896616 0 0 0 24576 1246 686 1
That's a bug, I've noticed that too. Sustained write rate for a single
thread is somewhat lower than it should be; it's on my todo list to
investigate.
--
Jens Axboe
* Re: Time sliced CFQ io scheduler
2004-12-03 11:29 ` Jens Axboe
@ 2004-12-03 11:52 ` Prakash K. Cheemplavam
0 siblings, 0 replies; 66+ messages in thread
From: Prakash K. Cheemplavam @ 2004-12-03 11:52 UTC (permalink / raw)
To: Jens Axboe; +Cc: Andrew Morton, Linux Kernel, nickpiggin
Jens Axboe schrieb:
> On Fri, Dec 03 2004, Prakash K. Cheemplavam wrote:
>
>>Jens Axboe schrieb:
>>
>>>On Fri, Dec 03 2004, Prakash K. Cheemplavam wrote:
>>>
>>>
>>>>>But at least this patch lets you set slice_sync and slice_async
>>>>>separately, if you want to experiment.
>>>>
>>>>Any idea which values I should try?
>>>
>>>
>>>Just see if the default ones work (or how they work :-)
>>>
>>>
>>>>BTW, I just did my little test on the ide drive and it shows the same
>>>>problem, so it is not sata / libata related.
>>>
>>>
>>>Single read/writer case works fine here for me, about half the bandwidth
>>>for each. Please show some vmstats for this case, too. Right now I'm not
>>>terribly interested in problems with raid alone, as I can poke holes in
>>>that. If the single drive case is correct, then we can focus on raid.
>>
>>I don't have enough space to perform this test on the ide drive, so I did
>>it on the sata drive (single disk). The patch doesn't seem to be better.
>>(On the other hand, I haven't tested your first version on a single disk.)
>>At least it still doesn't look good enough in my eyes.
>>
>> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>> r b swpd free buff cache si so bi bo in cs us sy id wa
>> 1 3 2704 5368 1528 906540 0 4 2176 24068 1245 743 0 7 0 93
>> 0 3 2704 5432 1532 906252 0 0 5072 28160 1277 782 1 8 0 91
>> 0 5 2704 5688 1532 906080 0 0 9280 4524 1309 842 1 10 0 89
>> 1 3 2704 5232 1544 906208 0 0 6404 76388 1285 716 1 14 0 85
>> 0 3 2704 5496 1544 906524 0 0 8328 26624 1301 856 1 8 0 91
>> 0 3 2704 5512 1528 906636 0 0 9484 22016 1302 883 1 8 0 91
>> 0 3 2704 5816 1500 906296 0 0 5508 10288 1270 749 1 9 0 90
>> 0 4 2704 5620 1488 906608 0 0 3076 19920 1267 818 0 13 0 87
>> 1 4 2704 5684 1456 906432 0 0 3204 18432 1252 704 1 8 0 91
>> 1 3 2704 5504 1408 906168 0 0 5252 28672 1279 777 1 14 0 85
>> 0 4 2704 5120 1404 906296 0 0 8968 16384 1351 876 1 9 0 90
>> 0 4 2704 5364 1404 905620 0 0 5252 26112 1339 835 1 14 0 85
>> 0 4 2704 5600 1432 905036 0 0 1468 15876 1312 741 2 8 0 90
>> 1 4 2704 5556 1424 904704 0 0 1664 26112 1243 714 1 10 0 89
>> 0 4 2704 5492 1428 904100 0 0 1412 31232 1253 760 1 15 0 84
>> 0 4 2704 5568 1432 903456 0 0 1668 29696 1253 703 1 14 0 85
>> 1 4 2704 5620 1408 902980 0 0 1280 28672 1248 732 0 14 0 86
>> 0 4 2704 5236 1404 902888 0 0 2180 28704 1252 705 1 11 0 88
>> 0 4 2704 5632 1388 902180 0 0 1536 28160 1251 731 1 11 0 88
>> 0 3 2704 5120 1356 905968 0 0 384 57896 1257 751 1 14 0 85
>
>
> Try increasing slice_sync and idle, just for fun.
Changed slice_sync to 150 and idle to 6:
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r b swpd free buff cache si so bi bo in cs us sy id wa
 1 5 2704 5720 960 900020 0 0 68 26624 1251 741 1 16 0 83
 1 3 2704 5708 1004 900312 0 0 312 4044 1294 686 1 11 0 88
 0 1 2704 5484 1024 899800 0 0 396 40008 1236 608 1 5 0 94
 0 3 2704 5284 1036 900696 0 0 516 49196 1246 682 1 5 0 94
 1 3 2704 5640 1040 900956 0 0 1416 21504 1252 722 1 4 0 95
 0 3 2704 5120 1040 902108 0 0 2688 12288 1230 672 1 2 0 97
 1 3 2704 5416 1036 902276 0 0 3076 0 1248 632 0 2 0 98
 0 4 2704 5448 1092 902748 0 0 11700 16 1306 857 1 16 0 83
 0 3 2704 5712 1132 900704 0 0 1064 63488 1259 755 1 15 0 84
 0 3 2704 5476 1156 901336 0 0 5656 8296 1272 725 1 7 0 92
 0 3 2704 5320 1208 900996 0 0 2988 3972 1256 696 1 18 0 81
 1 4 2704 5288 1240 899660 0 0 1956 60964 1278 757 1 12 0 87
 0 3 2704 5596 1292 899032 0 0 1688 24732 1284 813 1 8 0 91
 0 3 2704 6124 1308 899776 0 0 1424 42496 1253 678 1 7 0 92
 1 3 2704 5744 1324 900124 0 0 16 23552 1250 707 1 9 0 90
 0 3 2704 5108 1332 900768 0 0 1800 19968 1242 703 1 4 0 95
 0 3 2704 5640 1332 900132 0 0 3204 16896 1240 689 1 1 0 98
 0 3 2704 5512 1344 900696 0 0 2564 3036 1255 652 1 2 0 97
 2 3 2704 5264 1364 901384 0 0 2704 42572 1253 726 1 10 0 89
 1 3 2704 5096 1368 898108 0 0 1808 51724 1240 1984 1 53 0 46
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r b swpd free buff cache si so bi bo in cs us sy id wa
 1 4 2704 5572 1348 896164 0 0 2816 18944 1239 1304 1 30 0 69
 0 4 2704 5436 1332 896152 0 0 3204 17408 1239 716 0 6 0 94
 0 4 2704 5452 1324 895884 0 0 3076 20480 1248 711 2 8 0 90
 0 4 2704 5444 1328 895668 0 0 3020 16384 1360 830 1 7 0 92
 0 4 2704 5708 1328 895248 0 0 1976 21952 1509 1213 4 8 0 88
 0 4 2704 5708 1328 895020 0 0 1536 25200 1258 803 2 10 0 88
 0 4 2704 5836 1332 894880 0 0 3204 16264 1281 908 3 8 0 89
 0 4 2704 5668 1320 895084 0 0 896 18172 1433 941 1 7 0 92
 0 4 2704 5324 1324 895644 0 0 4612 15924 1450 968 1 7 0 92
 0 3 2704 5464 1324 897836 0 0 7176 42820 1421 1074 1 29 0 70
 1 3 2704 5304 1324 898092 0 0 896 11516 1266 727 1 2 0 97
 0 4 2704 5336 1312 898080 0 0 2436 16684 1270 971 1 10 0 89
 0 3 2704 5608 1328 897816 0 0 17040 14124 1463 1162 3 7 0 90
 0 3 2704 5272 1348 897960 0 0 18196 11264 1435 1281 2 13 0 85
 0 3 2704 5592 1348 897488 0 0 6792 24284 1348 1102 6 8 0 86
 0 3 2704 5528 1364 897448 0 0 872 19516 1239 760 1 6 0 93
 0 3 2704 5592 1364 897348 0 0 1976 22408 1253 761 1 5 0 94
 0 3 2704 5528 1364 897252 0 0 2048 30820 1267 858 1 8 0 91
 0 3 2704 5528 1372 897132 0 0 5640 18812 1382 907 1 6 0 93
 0 3 2704 5208 1368 897388 0 0 2820 17356 1352 863 1 5 0 94
Prakash
* Re: Time sliced CFQ io scheduler
2004-12-02 14:41 ` Jens Axboe
@ 2004-12-04 13:05 ` Giuliano Pochini
0 siblings, 0 replies; 66+ messages in thread
From: Giuliano Pochini @ 2004-12-04 13:05 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-kernel
On Thu, 2 Dec 2004 15:41:34 +0100
Jens Axboe <axboe@suse.de> wrote:
> > > Case 4: write_files, random, bs=4k
> >
> > Just a thought... in this test the results don't look right. Why is
> > aggregate bandwidth with 8 clients higher than with 4 and 2 clients?
> > In the cfq test with 8 clients aggregate bw is also higher than with
> > a single client.
>
> I don't know what happens with the 4 client case, but it's not that
> unlikely that aggregate bandwidth will be higher for more threads doing
> random writes, as request coalescing will help minimize seeks.
In order to keep the probability that requests get coalesced constant, the
size of the test file should be a multiple of the number of clients.
--
Giuliano.
* Re: Time sliced CFQ io scheduler
2004-12-02 20:37 ` Jens Axboe
@ 2004-12-07 23:11 ` Nick Piggin
0 siblings, 0 replies; 66+ messages in thread
From: Nick Piggin @ 2004-12-07 23:11 UTC (permalink / raw)
To: Jens Axboe; +Cc: Andrew Morton, linux-kernel
Jens Axboe wrote:
> On Thu, Dec 02 2004, Andrew Morton wrote:
>
>>Jens Axboe <axboe@suse.de> wrote:
>>
>>>>So what are you doing different?
>>>
>>>Doing sync io, most likely. My results above are 64k O_DIRECT reads and
>>>writes, see the mention of the test cases in the first mail.
>>
>>OK.
>>
>>Writer:
>>
>> while true
>> do
>> write-and-fsync -o -m 100 -c 65536 foo
>> done
>>
>>Reader:
>>
>> time-read -o -b 65536 -n 256 x (This is O_DIRECT)
>>or: time-read -b 65536 -n 256 x (This is buffered)
>>
>>`vmstat 1':
>>
>>procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>> r b swpd free buff cache si so bi bo in cs us sy id wa
>> 1 1 1032 137412 4276 84388 32 0 15456 25344 1659 1538 0 3 50 47
>> 0 1 1032 137468 4276 84388 0 0 0 32128 1521 1027 0 2 51 48
>> 0 1 1032 137476 4276 84388 0 0 0 32064 1519 1026 0 1 50 49
>> 0 1 1032 137476 4276 84388 0 0 0 33920 1556 1102 0 2 50 49
>> 0 1 1032 137476 4276 84388 0 0 0 33088 1541 1074 0 1 50 49
>> 0 2 1032 135676 4284 85944 0 0 1656 29732 1868 2506 0 3 49 47
>> 1 1 1032 96532 4292 125172 0 0 39220 128 10813 39313 0 31 35 34
>> 0 2 1032 57724 4332 163892 0 0 38828 128 10716 38907 0 28 38 35
>> 0 2 1032 18860 4368 202684 0 0 38768 128 10701 38845 1 28 38 35
>> 0 2 1032 3672 4248 217764 0 0 39188 128 10803 39327 0 28 37 34
>> 0 1 1032 2832 4260 218840 0 0 16812 17932 5504 17457 0 14 46 40
>
>
> Well there you go, exactly what I saw. The writer(s) basically make no
> progress as long as the reader is going. Since 'as' treats the sync
> writes like reads internally and given the really bad fairness problems
> demonstrated for same direction clients, that might be the same problem.
>
>
>>Ugly.
>>
>>(write-and-fsync and time-read are from
>>http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz)
>
>
> I'll try and post my cruddy test programs tomorrow as well. Pretty handy
> for getting a good feel for N client read/write performance.
>
OK, sorry for not jumping in earlier. Yes, it will be synch IO that
is your problem.
I'll see if I can try improving things there for AS. I see (from your
first results in this thread) that CFQ does quite nicely here, better
than deadline.
* Re: Time sliced CFQ io scheduler
2004-12-02 19:52 ` Jens Axboe
2004-12-02 20:19 ` Andrew Morton
@ 2004-12-08 0:37 ` Andrea Arcangeli
2004-12-08 0:54 ` Nick Piggin
1 sibling, 1 reply; 66+ messages in thread
From: Andrea Arcangeli @ 2004-12-08 0:37 UTC (permalink / raw)
To: Jens Axboe; +Cc: Andrew Morton, linux-kernel, nickpiggin
On Thu, Dec 02, 2004 at 08:52:36PM +0100, Jens Axboe wrote:
> with its default io scheduler has basically zero write performance in
IMHO the default io scheduler should be changed to cfq. as is anything but
general purpose, so it's a mistake to leave it the default (plus, as Jens
found, the write bandwidth is nonexistent during reads; no surprise it
falls apart under any database load). We had to make cfq the default
for the enterprise release already. The first thing I do is add
elevator=cfq on a new install. I really like how well cfq has been
designed, implemented and tuned; Jens's results with his last patch are
quite impressive.
BTW, a bit of history that may be funny to read (and I believe nobody
on l-k knows about it): the first sfq I/O elevator idea (sfq is the
ancestor of cfq; cfq still falls back to sfq mode in the unlikely case
that no atomic memory is available during I/O) started at an openmosix
conference in Bologna, when I was listening to one guy fixing the
latency of some videogame app migrating from server to server with
openmosix. That way they could use a few clustered boxes to host some
hundred videogame servers migrating depending on the load (I recall they
said the users tend to move from one game to the other all at the same time).
I had never heard of sfq before, but when I understood how it worked for
the packet scheduler and how they were using it to fix a latency issue in
the responsiveness of their game while the server was migrating, I
immediately got the idea that I could use the very same sfq algorithm for
the disk elevator too (at that time it was being used only in the networking
qdisc packet scheduler). I wasn't really sure at first if it would work
equally well for disk too (the network pays nothing for seeks). But it
seemed conceptually worth mentioning the idea to Jens so he could
evaluate it (I think he was already working on something similar, but I
hope I did provide him with some useful hint). You know the rest: he
quickly turned it into cfq and made numerous improvements. The funny
thing I meant to say is that if I hadn't incidentally listened to the
videogame talk (a talk I'd normally avoid) we wouldn't have cfq today in
the I/O scheduler in its current great state (of course we could still
have it, since Jens was already working on something similar, but perhaps
it would be at least a bit behind his current great development).
Even a videogame server may turn out to be very useful ;). I'm quite
sure the developer who gave the videogame openmosix talk doesn't know his
talk had an impact on the kernel I/O scheduler ;). Hope he reads this
email.
* Re: Time sliced CFQ io scheduler
2004-12-08 0:37 ` Andrea Arcangeli
@ 2004-12-08 0:54 ` Nick Piggin
2004-12-08 1:37 ` Andrea Arcangeli
2004-12-08 6:49 ` Jens Axboe
0 siblings, 2 replies; 66+ messages in thread
From: Nick Piggin @ 2004-12-08 0:54 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Jens Axboe, Andrew Morton, linux-kernel
On Wed, 2004-12-08 at 01:37 +0100, Andrea Arcangeli wrote:
> On Thu, Dec 02, 2004 at 08:52:36PM +0100, Jens Axboe wrote:
> > with its default io scheduler has basically zero write performance in
>
> IMHO the default io scheduler should be changed to cfq. as is anything but
> general purpose, so it's a mistake to leave it the default (plus, as Jens
I think it is actually pretty good at general purpose stuff. For
example, the old writes-starve-reads problem. It is especially bad
when doing small dependent reads like `find | xargs grep`. (Although
CFQ is probably better at this than deadline too.)
It also tends to degrade more gracefully under memory load because
it doesn't require much readahead.
> found, the write bandwidth is nonexistent during reads; no surprise it
> falls apart under any database load). We had to make cfq the default
> for the enterprise release already. The first thing I do is add
> elevator=cfq on a new install. I really like how well cfq has been
> designed, implemented and tuned; Jens's results with his last patch are
> quite impressive.
>
That is synch write bandwidth. Yes that seems to be a problem.
* Re: Time sliced CFQ io scheduler
2004-12-08 0:54 ` Nick Piggin
@ 2004-12-08 1:37 ` Andrea Arcangeli
2004-12-08 1:47 ` Nick Piggin
2004-12-08 2:00 ` Andrew Morton
2004-12-08 6:49 ` Jens Axboe
1 sibling, 2 replies; 66+ messages in thread
From: Andrea Arcangeli @ 2004-12-08 1:37 UTC (permalink / raw)
To: Nick Piggin; +Cc: Jens Axboe, Andrew Morton, linux-kernel
On Wed, Dec 08, 2004 at 11:54:13AM +1100, Nick Piggin wrote:
> That is synch write bandwidth. Yes that seems to be a problem.
It's not just sync writes, it's writes in general; blkdev doesn't know
whether the one waiting is pdflush or some other task. Once this is
fixed I will have to reconsider my opinion of course, but I guess after
it gets fixed the benefit of "as" on the desktop will decrease as well
compared to cfq. The desktop is ok with "as" simply because it's
normally optimal to stop writes completely, since there are few apps
doing write journaling or heavy writes, and there's normally no
contiguous read happening in the background. The desktop just needs a
temporary peak in read bandwidth when you click on openoffice or a
similar app (and "as" provides it). But on a mixed server doing some
significant reading and writing (i.e. somebody downloading the kernel from
kernel.org and installing it on some application server) I don't think
"as" is general purpose enough. Another example is multiuser usage,
with one user reading a big mbox folder in mutt while the other user
is exiting mutt at the same time. The one exiting will practically have to
wait for the first user to finish his read I/O. All I/O becomes sync when
it exceeds the max size of the writeback cache.
"as" is clearly the best for the common case of pure desktop usage
(i.e. machine 99.9% idle and without any I/O except when starting an app
or saving a file, and the user noticing delay only while waiting for the
window to open up after he clicked the button). But I believe cfq is
better for general purpose usage where we cannot assume how the kernel
will be used.
* Re: Time sliced CFQ io scheduler
2004-12-08 1:37 ` Andrea Arcangeli
@ 2004-12-08 1:47 ` Nick Piggin
2004-12-08 2:09 ` Andrea Arcangeli
2004-12-08 6:52 ` Jens Axboe
2004-12-08 2:00 ` Andrew Morton
1 sibling, 2 replies; 66+ messages in thread
From: Nick Piggin @ 2004-12-08 1:47 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Jens Axboe, Andrew Morton, linux-kernel
On Wed, 2004-12-08 at 02:37 +0100, Andrea Arcangeli wrote:
> On Wed, Dec 08, 2004 at 11:54:13AM +1100, Nick Piggin wrote:
> > That is synch write bandwidth. Yes that seems to be a problem.
>
> It's not just sync writes, it's writes in general; blkdev doesn't know
> if the one waiting is pdflush or some other task. Once this is fixed
> I will have to reconsider my opinion of course, but I guess after
Yeah those sorts of dependencies are tricky. I think the best bet is
to not get _too_ fancy, and try to cover the basics like keeping
fairness good, and minimising write latency as much as possible.
> it gets fixed the benefit of "as" on the desktop will decrease as well
> compared to cfq. The desktop is ok with "as" simply because it's
> normally optimal to stop writes completely, since there are few apps
> doing write journaling or heavy writes, and there's normally no
> contiguous read happening in the background. The desktop just needs a
> temporary peak of read bandwidth when you click on openoffice or a
> similar app (and "as" provides it). But on a mixed server doing some
> significant reading and writing (i.e. somebody downloading the kernel from
> kernel.org and installing it on some application server) I don't think
> "as" is general purpose enough. Another example is multiuser usage,
> with one user reading a big mbox folder in mutt while the other user
> is exiting mutt at the same time. The one exiting will practically have to
> wait for the first user to finish its read I/O. All I/O becomes sync when it
> exceeds the max size of the writeback cache.
>
AS is surprisingly good when doing concurrent reads and buffered writes.
The buffered writes don't get starved too badly. Basically, AS just
ensures a reader will get the chance to play out its entire read batch
before switching to another reader or a writer.
Buffered writes don't suffer the same problem obviously because the
disk can easily be kept fed from cache. Any read vs buffered write
starvation you see will mainly be due to the /sys tunables that give
more priority to reads (which isn't a bad idea, generally).
> "as" is clearly the best for the common case of pure desktop usage
> (i.e. machine 99.9% idle and without any I/O except when starting an app
> or saving a file, and the user noticing delay only while waiting for the
> window to open up after he clicked the button). But I believe cfq is
> better for general purpose usage where we cannot assume how the kernel
> will be used.
Maybe. CFQ may be a bit closer to a traditional elevator behaviour,
while AS uses some significantly different concepts which I guess
aren't as well tested and optimised for.
* Re: Time sliced CFQ io scheduler
2004-12-08 1:37 ` Andrea Arcangeli
2004-12-08 1:47 ` Nick Piggin
@ 2004-12-08 2:00 ` Andrew Morton
2004-12-08 2:08 ` Andrew Morton
` (2 more replies)
1 sibling, 3 replies; 66+ messages in thread
From: Andrew Morton @ 2004-12-08 2:00 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: nickpiggin, axboe, linux-kernel
Andrea Arcangeli <andrea@suse.de> wrote:
>
> The desktop is ok with "as" simply because it's
> normally optimal to stop writes completely
AS doesn't "stop writes completely". With the current settings it
apportions about 1/3 of the disk's bandwidth to writes.
This thing Jens has found is for direct-io writes only. It's a bug.
The other problem with AS is that it basically doesn't work at all with a
TCQ depth greater than four or so, and lots of people blindly look at
untuned SCSI benchmark results without realising that. If a distro is
always selecting CFQ then they've probably gone and deoptimised all their
IDE users.
AS needs another iteration of development to fix these things. Right now
it's probably the case that we need CFQ or deadline for servers and AS for
desktops. That's awkward.
* Re: Time sliced CFQ io scheduler
2004-12-08 2:00 ` Andrew Morton
@ 2004-12-08 2:08 ` Andrew Morton
2004-12-08 6:55 ` Jens Axboe
2004-12-08 2:20 ` Andrea Arcangeli
2004-12-08 6:55 ` Jens Axboe
2 siblings, 1 reply; 66+ messages in thread
From: Andrew Morton @ 2004-12-08 2:08 UTC (permalink / raw)
To: andrea, nickpiggin, axboe, linux-kernel
Andrew Morton <akpm@osdl.org> wrote:
>
> If a distro is
> always selecting CFQ then they've probably gone and deoptimised all their
> IDE users.
That being said, yeah, once we get the time-sliced-CFQ happening, it should
probably be made the default, at least until AS gets fixed up. We need to
run the numbers and settle on that.
* Re: Time sliced CFQ io scheduler
2004-12-08 1:47 ` Nick Piggin
@ 2004-12-08 2:09 ` Andrea Arcangeli
2004-12-08 2:11 ` Andrew Morton
2004-12-08 6:52 ` Jens Axboe
1 sibling, 1 reply; 66+ messages in thread
From: Andrea Arcangeli @ 2004-12-08 2:09 UTC (permalink / raw)
To: Nick Piggin; +Cc: Jens Axboe, Andrew Morton, linux-kernel
On Wed, Dec 08, 2004 at 12:47:08PM +1100, Nick Piggin wrote:
> Buffered writes don't suffer the same problem obviously because the
> disk can easily be kept fed from cache. Any read vs buffered write
This is true for very small buffered writes, which is the case for
desktop usage, but for more server oriented usage if the write isn't so
small, and you flush the writeback cache to disk very slowly, eventually
it will become a _sync_ write. So I agree that as long as the write
doesn't become synchronous "as" provides better behaviour.
One hidden side effect of "as" is that by writing so slowly (and
64KiB/sec really is slow), it increases the time it will take for a
dirty page to be flushed to disk (with tons of ram and lots of continuous
readers I wouldn't be surprised if it could take hours for the data to
hit disk in an artificial testcase; you can do the math and find how
long it would take for the last page in the list to hit disk at
64KiB/sec).
> starvation you see will mainly be due to the /sys tunables that give
> more priority to reads (which isn't a bad idea, generally).
sure.
> Maybe. CFQ may be a bit closer to a traditional elevator behaviour,
> while AS uses some significantly different concepts which I guess
> aren't as well tested and optimised for.
It's already the best for desktop usage (even the 64KiB/sec is the best
on desktop), but as you said above it uses significantly different
concepts, and that makes it by definition not general purpose (and
definitely a no-go for databases, while cfq isn't a no-go on the
desktop).
* Re: Time sliced CFQ io scheduler
2004-12-08 2:09 ` Andrea Arcangeli
@ 2004-12-08 2:11 ` Andrew Morton
2004-12-08 2:22 ` Andrea Arcangeli
0 siblings, 1 reply; 66+ messages in thread
From: Andrew Morton @ 2004-12-08 2:11 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: nickpiggin, axboe, linux-kernel
Andrea Arcangeli <andrea@suse.de> wrote:
>
> One hidden side effect of "as" is that by writing so slowly (and
> 64KiB/sec really is slow), it increases the time it will take for a
> dirty page to be flushed to disk
The 64k/sec only happens for direct-io, and those pages aren't dirty.
* Re: Time sliced CFQ io scheduler
2004-12-08 2:00 ` Andrew Morton
2004-12-08 2:08 ` Andrew Morton
@ 2004-12-08 2:20 ` Andrea Arcangeli
2004-12-08 2:25 ` Andrew Morton
2004-12-08 6:55 ` Jens Axboe
2 siblings, 1 reply; 66+ messages in thread
From: Andrea Arcangeli @ 2004-12-08 2:20 UTC (permalink / raw)
To: Andrew Morton; +Cc: nickpiggin, axboe, linux-kernel
On Tue, Dec 07, 2004 at 06:00:33PM -0800, Andrew Morton wrote:
> untuned SCSI benchmark results without realising that. If a distro is
> always selecting CFQ then they've probably gone and deoptimised all their
> IDE users.
The enterprise distro definitely shouldn't use "as" by default: database
apps _must_ not use AS, they have to use either CFQ or deadline. CFQ is
definitely the best for enterprise distros. This is a tangible result,
SCSI/IDE doesn't matter at all (and keep in mind they use O_DIRECT a
lot, so the 64KiB/sec behaviour Jens found would be a showstopper for an
enterprise release; selecting something other than "as" is a _must_ for
an enterprise distro).
In the desktop distro you'll notice /proc/cmdline has elevator="as",
because more people are going to use desktop distros as desktops, as
expected.
But for enterprise distros this isn't the case, and cfq (or deadline)
must be the default, certainly not "as". So claiming that selecting cfq
by default (I said in the enterprise distro) deoptimises users is a
wrong statement, and the opposite of reality.
And personally I use cfq even on the desktop (since I'm not a normal
desktop user and I have apps writing too).
> AS needs another iteration of development to fix these things. Right now
> it's probably the case that we need CFQ or deadline for servers and AS for
> desktops. That's awkward.
Exactly.
If you believe AS is going to perform better than CFQ for database
enterprise usage, we just need to prove it in practice after the round
of fixes; then changing the default back to "as" will be an additional
one-liner on top of the blocker direct-io bug.
Desktop is already forced to "as" by /proc/cmdline, so it's not affected
by how we change the default of the enterprise distro AFAIK.
* Re: Time sliced CFQ io scheduler
2004-12-08 2:11 ` Andrew Morton
@ 2004-12-08 2:22 ` Andrea Arcangeli
0 siblings, 0 replies; 66+ messages in thread
From: Andrea Arcangeli @ 2004-12-08 2:22 UTC (permalink / raw)
To: Andrew Morton; +Cc: nickpiggin, axboe, linux-kernel
On Tue, Dec 07, 2004 at 06:11:37PM -0800, Andrew Morton wrote:
> Andrea Arcangeli <andrea@suse.de> wrote:
> >
> > One hidden side effect of "as" is that by writing so slowly (and
> > 64KiB/sec really is slow), it increases the time it will take for a
> > dirty page to be flushed to disk
>
> The 64k/sec only happens for direct-io, and those pages aren't dirty.
I agree my above claim was wrong, thanks for correcting it.
* Re: Time sliced CFQ io scheduler
2004-12-08 2:20 ` Andrea Arcangeli
@ 2004-12-08 2:25 ` Andrew Morton
2004-12-08 2:33 ` Andrea Arcangeli
2004-12-08 2:33 ` Nick Piggin
0 siblings, 2 replies; 66+ messages in thread
From: Andrew Morton @ 2004-12-08 2:25 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: nickpiggin, axboe, linux-kernel
Andrea Arcangeli <andrea@suse.de> wrote:
>
> On Tue, Dec 07, 2004 at 06:00:33PM -0800, Andrew Morton wrote:
> > untuned SCSI benchmark results without realising that. If a distro is
> > always selecting CFQ then they've probably gone and deoptimised all their
> > IDE users.
>
> The enterprise distro definitely shouldn't use "as" by default: database
> apps _must_ not use AS, they have to use either CFQ or deadline. CFQ is
> definitely the best for enterprise distros. This is a tangible result,
> SCSI/IDE doesn't matter at all (and keep in mind they use O_DIRECT a
> lot, so the 64KiB/sec behaviour Jens found would be a showstopper for an
> enterprise release; selecting something other than "as" is a _must_ for
> an enterprise distro).
That's a missing hint in the direct-io code. This fixes it up:
--- 25/fs/direct-io.c~a 2004-12-07 18:12:25.491602512 -0800
+++ 25-akpm/fs/direct-io.c 2004-12-07 18:13:13.661279608 -0800
@@ -1161,6 +1161,8 @@ __blockdev_direct_IO(int rw, struct kioc
struct dio *dio;
int reader_with_isem = (rw == READ && dio_lock_type == DIO_OWN_LOCKING);
+ current->flags |= PF_SYNCWRITE;
+
if (bdev)
bdev_blkbits = blksize_bits(bdev_hardsect_size(bdev));
@@ -1244,6 +1246,7 @@ __blockdev_direct_IO(int rw, struct kioc
out:
if (reader_with_isem)
up(&inode->i_sem);
+ current->flags &= ~PF_SYNCWRITE;
return retval;
}
EXPORT_SYMBOL(__blockdev_direct_IO);
_
> ...
>
> If you believe AS is going to perform better than CFQ for database
> enterprise usage, we just need to prove it in practice after the round
> of fixes; then changing the default back to "as" will be an additional
> one-liner on top of the blocker direct-io bug.
I don't think AS will ever meet the performance of CFQ or deadline for the
seeky database loads, unfortunately. We busted a gut over that and were
never able to get better than 90% or so.
* Re: Time sliced CFQ io scheduler
2004-12-08 2:25 ` Andrew Morton
@ 2004-12-08 2:33 ` Andrea Arcangeli
2004-12-08 2:33 ` Nick Piggin
1 sibling, 0 replies; 66+ messages in thread
From: Andrea Arcangeli @ 2004-12-08 2:33 UTC (permalink / raw)
To: Andrew Morton; +Cc: nickpiggin, axboe, linux-kernel
On Tue, Dec 07, 2004 at 06:25:57PM -0800, Andrew Morton wrote:
> That's a missing hint in the direct-io code. This fixes it up:
>
> --- 25/fs/direct-io.c~a 2004-12-07 18:12:25.491602512 -0800
> +++ 25-akpm/fs/direct-io.c 2004-12-07 18:13:13.661279608 -0800
> @@ -1161,6 +1161,8 @@ __blockdev_direct_IO(int rw, struct kioc
> struct dio *dio;
> int reader_with_isem = (rw == READ && dio_lock_type == DIO_OWN_LOCKING);
>
> + current->flags |= PF_SYNCWRITE;
> +
> if (bdev)
> bdev_blkbits = blksize_bits(bdev_hardsect_size(bdev));
>
> @@ -1244,6 +1246,7 @@ __blockdev_direct_IO(int rw, struct kioc
> out:
> if (reader_with_isem)
> up(&inode->i_sem);
> + current->flags &= ~PF_SYNCWRITE;
> return retval;
> }
> EXPORT_SYMBOL(__blockdev_direct_IO);
that was fast ;) great, thanks!
> I don't think AS will ever meet the performance of CFQ or deadline for the
This is my expectation too, since for these apps write latency is almost
more important than read latency and writes are often sync.
* Re: Time sliced CFQ io scheduler
2004-12-08 2:25 ` Andrew Morton
2004-12-08 2:33 ` Andrea Arcangeli
@ 2004-12-08 2:33 ` Nick Piggin
2004-12-08 2:51 ` Andrea Arcangeli
2004-12-08 6:58 ` Jens Axboe
1 sibling, 2 replies; 66+ messages in thread
From: Nick Piggin @ 2004-12-08 2:33 UTC (permalink / raw)
To: Andrew Morton; +Cc: Andrea Arcangeli, axboe, linux-kernel
On Tue, 2004-12-07 at 18:25 -0800, Andrew Morton wrote:
> Andrea Arcangeli <andrea@suse.de> wrote:
> >
> > On Tue, Dec 07, 2004 at 06:00:33PM -0800, Andrew Morton wrote:
> > > untuned SCSI benchmark results without realising that. If a distro is
> > > always selecting CFQ then they've probably gone and deoptimised all their
> > > IDE users.
> >
> > The enterprise distro definitely shouldn't use "as" by default: database
> > apps _must_ not use AS, they've to use either CFQ or deadline. CFQ is
> > definitely the best for enterprise distros. This is a tangible result,
> > SCSI/IDE doesn't matter at all (and keep in mind they use O_DIRECT a
> > lot, so such 64kib Jens found would be a showstopper for a enterprise
> > release, slelecting something different than "as" is a _must_ for
> > enterprise distro).
>
> That's a missing hint in the direct-io code. This fixes it up:
>
> --- 25/fs/direct-io.c~a 2004-12-07 18:12:25.491602512 -0800
> +++ 25-akpm/fs/direct-io.c 2004-12-07 18:13:13.661279608 -0800
> @@ -1161,6 +1161,8 @@ __blockdev_direct_IO(int rw, struct kioc
> struct dio *dio;
> int reader_with_isem = (rw == READ && dio_lock_type == DIO_OWN_LOCKING);
>
> + current->flags |= PF_SYNCWRITE;
> +
> if (bdev)
> bdev_blkbits = blksize_bits(bdev_hardsect_size(bdev));
>
> @@ -1244,6 +1246,7 @@ __blockdev_direct_IO(int rw, struct kioc
> out:
> if (reader_with_isem)
> up(&inode->i_sem);
> + current->flags &= ~PF_SYNCWRITE;
> return retval;
> }
> EXPORT_SYMBOL(__blockdev_direct_IO);
> _
>
> > ...
> >
> > If you believe AS is going to perform better than CFQ on the database
> > enterprise usage, we just need to prove it in practice after the round
> > of fixes, then changing the default back to "as" it'll be an additional
> > one liner on top of the blocker direct-io bug.
>
> I don't think AS will ever meet the performance of CFQ or deadline for the
> seeky database loads, unfortunately. We busted a gut over that and were
> never able to get better than 90% or so.
>
I think we could detect when a disk asks for more than, say, 4
concurrent requests, and in that case turn off read anticipation
and all the anti-starvation for TCQ by default (with the option
to force it back on).
I think this would be a decent "it works" solution that would make
AS acceptable as a default.
* Re: Time sliced CFQ io scheduler
2004-12-08 2:33 ` Nick Piggin
@ 2004-12-08 2:51 ` Andrea Arcangeli
2004-12-08 3:02 ` Nick Piggin
2004-12-08 6:58 ` Jens Axboe
1 sibling, 1 reply; 66+ messages in thread
From: Andrea Arcangeli @ 2004-12-08 2:51 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, axboe, linux-kernel
On Wed, Dec 08, 2004 at 01:33:33PM +1100, Nick Piggin wrote:
> I think we could detect when a disk asks for more than, say, 4
> concurrent requests, and in that case turn off read anticipation
> and all the anti-starvation for TCQ by default (with the option
> to force it back on).
What do you mean by "disk asks for more than 4 concurrent requests"?
You mean checking the TCQ capability of the hardware storage?
> I think this would be a decent "it works" solution that would make
> AS acceptable as a default.
Perhaps the code would be the same but if you disable it completely on
certain hardware that's not AS anymore...
Then I believe it would be better to switch to cfq for storage capable
of more than 4 concurrent tagged queued requests instead of sticking
with a "disabled AS". What's the point of AS if the features of AS are
disabled?
One relevant feature of cfq is the fairness property of pid against pid
or user against user. You don't get that fairness with the other I/O
schedulers. It was designed for fairness from the start: fairness
of writes against writes, reads against reads, writes against reads,
and reads against writes.
* Re: Time sliced CFQ io scheduler
2004-12-08 2:51 ` Andrea Arcangeli
@ 2004-12-08 3:02 ` Nick Piggin
0 siblings, 0 replies; 66+ messages in thread
From: Nick Piggin @ 2004-12-08 3:02 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Andrew Morton, axboe, linux-kernel
On Wed, 2004-12-08 at 03:51 +0100, Andrea Arcangeli wrote:
> On Wed, Dec 08, 2004 at 01:33:33PM +1100, Nick Piggin wrote:
> > I think we could detect when a disk asks for more than, say, 4
> > concurrent requests, and in that case turn off read anticipation
> > and all the anti-starvation for TCQ by default (with the option
> > to force it back on).
>
> What do you mean by "disk asks for more than 4 concurrent requests"?
> You mean checking the TCQ capability of the hardware storage?
>
Yeah. Just check if there are more than 4 outstanding requests at once.
> > I think this would be a decent "it works" solution that would make
> > AS acceptable as a default.
>
> Perhaps the code would be the same but if you disable it completely on
> certain hardware that's not AS anymore...
>
Which is what we want on those systems ;)
> Then I believe it would be better to switch to cfq for storage capable
> of more than 4 concurrent tagged queued requests instead of sticking
> with a "disabled AS". What's the point of AS if the features of AS are
> disabled?
>
For everyone else, who do want the AS features (ie. not databases).
> One relevant feature of cfq is the fairness property of pid against pid
> or user against user. You don't get that fairness with the other I/O
> schedulers. It was designed for fairness from the start: fairness
> of writes against writes, reads against reads, writes against reads,
> and reads against writes.
That is something, I'll grant you that.
* Re: Time sliced CFQ io scheduler
2004-12-08 0:54 ` Nick Piggin
2004-12-08 1:37 ` Andrea Arcangeli
@ 2004-12-08 6:49 ` Jens Axboe
1 sibling, 0 replies; 66+ messages in thread
From: Jens Axboe @ 2004-12-08 6:49 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrea Arcangeli, Andrew Morton, linux-kernel
On Wed, Dec 08 2004, Nick Piggin wrote:
> On Wed, 2004-12-08 at 01:37 +0100, Andrea Arcangeli wrote:
> > On Thu, Dec 02, 2004 at 08:52:36PM +0100, Jens Axboe wrote:
> > > with its default io scheduler has basically zero write performance in
> >
> > IMHO the default io scheduler should be changed to cfq. as is all but
> > general purpose so it's a mistake to leave it the default (plus as Jens
>
> I think it is actually pretty good at general purpose stuff. For
> example, the old writes starve reads thing. It is especially bad
> when doing small dependent reads like `find | xargs grep`. (Although
> CFQ is probably better at this than deadline too).
Time sliced cfq fixes this.
> It also tends to degrade more gracefully under memory load because
> it doesn't require much readahead.
Ditto.
> > found the write bandwidth is nonexistent during reads, no surprise it
> > falls apart in any database load). We had to make the cfq the default
> > for the enterprise release already. The first thing I do is to add
> > elevator=cfq on a new install. I really like how well cfq has been
> > designed, implemented and tuned, Jens's results with his last patch are
> > quite impressive.
> >
>
> That is synch write bandwidth. Yes that seems to be a problem.
A pretty big one :-)
--
Jens Axboe
* Re: Time sliced CFQ io scheduler
2004-12-08 1:47 ` Nick Piggin
2004-12-08 2:09 ` Andrea Arcangeli
@ 2004-12-08 6:52 ` Jens Axboe
1 sibling, 0 replies; 66+ messages in thread
From: Jens Axboe @ 2004-12-08 6:52 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrea Arcangeli, Andrew Morton, linux-kernel
On Wed, Dec 08 2004, Nick Piggin wrote:
> > it gets fixed the benefit of "as" on the desktop will decrease as well
> > compared to cfq. The desktop is ok with "as" simply because it's
> > normally optimal to stop writes completely, since there are few apps
> > doing write journaling or heavy writes, and there's normally no
> > contiguous read happening in the background. The desktop just needs a
> > temporary peak of read bandwidth when you click on openoffice or a
> > similar app (and "as" provides it). But on a mixed server doing some
> > significant reading and writing (i.e. somebody downloading the kernel from
> > kernel.org and installing it on some application server) I don't think
> > "as" is general purpose enough. Another example is multiuser usage,
> > with one user reading a big mbox folder in mutt while the other user
> > is exiting mutt at the same time. The one exiting will practically have to
> > wait for the first user to finish its read I/O. All I/O becomes sync when it
> > exceeds the max size of the writeback cache.
> >
>
> AS is surprisingly good when doing concurrent reads and buffered writes.
> The buffered writes don't get starved too badly. Basically, AS just
> ensures a reader will get the chance to play out its entire read batch
> before switching to another reader or a writer.
AS doesn't give a lot of bandwidth to the writes, about 10-20% only.
Time sliced cfq is more fair; you get closer to 50/50 in that case.
> Buffered writes don't suffer the same problem obviously because the
> disk can easily be kept fed from cache. Any read vs buffered write
> starvation you see will mainly be due to the /sys tunables that give
> more priority to reads (which isn't a bad idea, generally).
Depends entirely on the workload, I don't think you can say something
like that in general. For a desktop load, sure.
> > "as" is clearly the best for the common case of pure desktop usage
> > (i.e. machine 99.9% idle and without any I/O except when starting an app
> > or saving a file, and the user noticing delay only while waiting for the
> > window to open up after he clicked the button). But I believe cfq is
> > better for general purpose usage where we cannot assume how the kernel
> > will be used.
>
> Maybe. CFQ may be a bit closer to a traditional elevator behaviour,
> while AS uses some significantly different concepts which I guess
> aren't as well tested and optimised for.
You should read the new cfq code, there isn't that much difference
when it comes to the plain act of ordering io or finding the next
request (I stole some code :-).
The data direction batching that AS does I don't see the point of.
--
Jens Axboe
* Re: Time sliced CFQ io scheduler
2004-12-08 2:00 ` Andrew Morton
2004-12-08 2:08 ` Andrew Morton
2004-12-08 2:20 ` Andrea Arcangeli
@ 2004-12-08 6:55 ` Jens Axboe
2004-12-08 7:08 ` Nick Piggin
2004-12-08 10:52 ` Helge Hafting
2 siblings, 2 replies; 66+ messages in thread
From: Jens Axboe @ 2004-12-08 6:55 UTC (permalink / raw)
To: Andrew Morton; +Cc: Andrea Arcangeli, nickpiggin, linux-kernel
On Tue, Dec 07 2004, Andrew Morton wrote:
> Andrea Arcangeli <andrea@suse.de> wrote:
> >
> > The desktop is ok with "as" simply because it's
> > normally optimal to stop writes completely
>
> AS doesn't "stop writes completely". With the current settings it
> apportions about 1/3 of the disk's bandwidth to writes.
>
> This thing Jens has found is for direct-io writes only. It's a bug.
Indeed. It's a special case one, but nasty for that case.
> The other problem with AS is that it basically doesn't work at all with a
> TCQ depth greater than four or so, and lots of people blindly look at
> untuned SCSI benchmark results without realising that. If a distro is
That's pretty easy to fix. I added something like that to cfq, and it's
not a lot of lines of code (grep for rq_in_driver and cfq_max_depth).
> always selecting CFQ then they've probably gone and deoptimised all their
> IDE users.
Andrew, AS has other issues, it's not a case of AS always being faster
at everything.
> AS needs another iteration of development to fix these things. Right now
> it's probably the case that we need CFQ or deadline for servers and AS for
> desktops. That's awkward.
Currently I think the time sliced cfq is the best all around. There are
still a few kinks to be shaken out, but generally I think the concept is
sounder than AS.
--
Jens Axboe
* Re: Time sliced CFQ io scheduler
2004-12-08 2:08 ` Andrew Morton
@ 2004-12-08 6:55 ` Jens Axboe
0 siblings, 0 replies; 66+ messages in thread
From: Jens Axboe @ 2004-12-08 6:55 UTC (permalink / raw)
To: Andrew Morton; +Cc: andrea, nickpiggin, linux-kernel
On Tue, Dec 07 2004, Andrew Morton wrote:
> Andrew Morton <akpm@osdl.org> wrote:
> >
> > If a distro is
> > always selecting CFQ then they've probably gone and deoptimised all their
> > IDE users.
>
> That being said, yeah, once we get the time-sliced-CFQ happening, it should
> probably be made the default, at least until AS gets fixed up. We need to
> run the numbers and settle on that.
I'll do a new round of numbers today.
--
Jens Axboe
* Re: Time sliced CFQ io scheduler
2004-12-08 2:33 ` Nick Piggin
2004-12-08 2:51 ` Andrea Arcangeli
@ 2004-12-08 6:58 ` Jens Axboe
2004-12-08 7:14 ` Nick Piggin
1 sibling, 1 reply; 66+ messages in thread
From: Jens Axboe @ 2004-12-08 6:58 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, Andrea Arcangeli, linux-kernel
On Wed, Dec 08 2004, Nick Piggin wrote:
> On Tue, 2004-12-07 at 18:25 -0800, Andrew Morton wrote:
> > Andrea Arcangeli <andrea@suse.de> wrote:
> > >
> > > On Tue, Dec 07, 2004 at 06:00:33PM -0800, Andrew Morton wrote:
> > > > untuned SCSI benchmark results without realising that. If a distro is
> > > > always selecting CFQ then they've probably gone and deoptimised all their
> > > > IDE users.
> > >
> > > The enterprise distro definitely shouldn't use "as" by default: database
> > > apps _must_ not use AS, they have to use either CFQ or deadline. CFQ is
> > > definitely the best for enterprise distros. This is a tangible result,
> > > SCSI/IDE doesn't matter at all (and keep in mind they use O_DIRECT a
> > > lot, so the 64KiB/sec behaviour Jens found would be a showstopper for an
> > > enterprise release; selecting something other than "as" is a _must_ for
> > > an enterprise distro).
> >
> > That's a missing hint in the direct-io code. This fixes it up:
> >
> > --- 25/fs/direct-io.c~a 2004-12-07 18:12:25.491602512 -0800
> > +++ 25-akpm/fs/direct-io.c 2004-12-07 18:13:13.661279608 -0800
> > @@ -1161,6 +1161,8 @@ __blockdev_direct_IO(int rw, struct kioc
> > struct dio *dio;
> > int reader_with_isem = (rw == READ && dio_lock_type == DIO_OWN_LOCKING);
> >
> > + current->flags |= PF_SYNCWRITE;
> > +
> > if (bdev)
> > bdev_blkbits = blksize_bits(bdev_hardsect_size(bdev));
> >
> > @@ -1244,6 +1246,7 @@ __blockdev_direct_IO(int rw, struct kioc
> > out:
> > if (reader_with_isem)
> > up(&inode->i_sem);
> > + current->flags &= ~PF_SYNCWRITE;
> > return retval;
> > }
> > EXPORT_SYMBOL(__blockdev_direct_IO);
> > _
> >
> > > ...
> > >
> > > If you believe AS is going to perform better than CFQ for database
> > > enterprise usage, we just need to prove it in practice after the round
> > > of fixes; then changing the default back to "as" will be an additional
> > > one-liner on top of the blocker direct-io bug.
> >
> > I don't think AS will ever meet the performance of CFQ or deadline for the
> > seeky database loads, unfortunately. We busted a gut over that and were
> > never able to get better than 90% or so.
> >
>
> I think we could detect when a disk asks for more than, say, 4
> concurrent requests, and in that case turn off read anticipation
> and all the anti-starvation for TCQ by default (with the option
> to force it back on).
CFQ only allows a certain depth at the hardware level, you can control
that. I don't think you should drop the AS behaviour in that case, you
should look at when the last request comes in and what type it is.
With time sliced cfq I'm seeing some silly SCSI disk behaviour as well;
it gets harder to get good read bandwidth as the disk is trying pretty
hard to starve me. Maybe killing write-back caching would help, I'll
have to try.
--
Jens Axboe
* Re: Time sliced CFQ io scheduler
2004-12-08 6:55 ` Jens Axboe
@ 2004-12-08 7:08 ` Nick Piggin
2004-12-08 7:11 ` Jens Axboe
2004-12-08 10:52 ` Helge Hafting
1 sibling, 1 reply; 66+ messages in thread
From: Nick Piggin @ 2004-12-08 7:08 UTC (permalink / raw)
To: Jens Axboe; +Cc: Andrew Morton, Andrea Arcangeli, linux-kernel
On Wed, 2004-12-08 at 07:55 +0100, Jens Axboe wrote:
> On Tue, Dec 07 2004, Andrew Morton wrote:
> > Andrea Arcangeli <andrea@suse.de> wrote:
> > >
> > > The desktop is ok with "as" simply because it's
> > > normally optimal to stop writes completely
> >
> > AS doesn't "stop writes completely". With the current settings it
> > apportions about 1/3 of the disk's bandwidth to writes.
> >
> > This thing Jens has found is for direct-io writes only. It's a bug.
>
> Indeed. It's a special case one, but nasty for that case.
>
> > The other problem with AS is that it basically doesn't work at all with a
> > TCQ depth greater than four or so, and lots of people blindly look at
> > untuned SCSI benchmark results without realising that. If a distro is
>
> That's pretty easy to fix. I added something like that to cfq, and it's
> not a lot of lines of code (grep for rq_in_driver and cfq_max_depth).
>
> > always selecting CFQ then they've probably gone and deoptimised all their
> > IDE users.
>
> Andrew, AS has other issues, it's not a case of AS always being faster
> at everything.
>
> > AS needs another iteration of development to fix these things. Right now
> > it's probably the case that we need CFQ or deadline for servers and AS for
> > desktops. That's awkward.
>
> Currently I think the time sliced cfq is the best all around. There's
> still a few kinks to be shaken out, but generally I think the concept is
> sounder than AS.
>
But aren't you basically unconditionally allowing a 4ms idle time after
reads? The complexity of AS (other than all the work we had to do to get
the block layer to cope with it), is getting it to turn off at (mostly)
the right times. Other than that, it is basically the deadline
scheduler.
I could be wrong, but it looks like you'll just run into the same sorts
of performance problems as AS initially had.
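The worry above can be put in rough numbers. A toy model (all figures are
illustrative, not taken from either scheduler's actual code): an idle
window granted after every read only pays off for the fraction of reads
that are actually followed by a dependent read.

```python
# Toy model of the cost of an unconditional idle window after reads.
# All numbers are illustrative, not measured from CFQ or AS.

def effective_utilisation(service_ms, idle_ms, dependent_read_fraction):
    """Fraction of disk time spent doing useful work when the scheduler
    idles for idle_ms after every read, but only dependent_read_fraction
    of reads are followed by a dependent read that uses the window."""
    # Idle windows where no dependent read arrives are pure waste.
    wasted = idle_ms * (1.0 - dependent_read_fraction)
    return service_ms / (service_ms + wasted)

# A 4 ms idle after each 8 ms read costs little if most reads are part
# of a dependent chain, and a lot if few are.
print(effective_utilisation(8.0, 4.0, 0.9))  # mostly dependent readers
print(effective_utilisation(8.0, 4.0, 0.1))  # mostly independent reads
```

This is exactly why AS spends its complexity deciding when to turn
anticipation off: the fixed window is cheap for the workloads that use
it and expensive for the ones that don't.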
* Re: Time sliced CFQ io scheduler
2004-12-08 7:08 ` Nick Piggin
@ 2004-12-08 7:11 ` Jens Axboe
2004-12-08 7:19 ` Nick Piggin
0 siblings, 1 reply; 66+ messages in thread
From: Jens Axboe @ 2004-12-08 7:11 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, Andrea Arcangeli, linux-kernel
On Wed, Dec 08 2004, Nick Piggin wrote:
> On Wed, 2004-12-08 at 07:55 +0100, Jens Axboe wrote:
> > On Tue, Dec 07 2004, Andrew Morton wrote:
> > > Andrea Arcangeli <andrea@suse.de> wrote:
> > > >
> > > > The desktop is ok with "as" simply because it's
> > > > normally optimal to stop writes completely
> > >
> > > AS doesn't "stop writes completely". With the current settings it
> > > apportions about 1/3 of the disk's bandwidth to writes.
> > >
> > > This thing Jens has found is for direct-io writes only. It's a bug.
> >
> > Indeed. It's a special case one, but nasty for that case.
> >
> > > The other problem with AS is that it basically doesn't work at all with a
> > > TCQ depth greater than four or so, and lots of people blindly look at
> > > untuned SCSI benchmark results without realising that. If a distro is
> >
> > That's pretty easy to fix. I added something like that to cfq, and it's
> > not a lot of lines of code (grep for rq_in_driver and cfq_max_depth).
> >
> > > always selecting CFQ then they've probably gone and deoptimised all their
> > > IDE users.
> >
> > Andrew, AS has other issues, it's not a case of AS always being faster
> > at everything.
> >
> > > AS needs another iteration of development to fix these things. Right now
> > > it's probably the case that we need CFQ or deadline for servers and AS for
> > > desktops. That's awkward.
> >
> > Currently I think the time sliced cfq is the best all around. There's
> > still a few kinks to be shaken out, but generally I think the concept is
> > sounder than AS.
> >
>
> But aren't you basically unconditionally allowing a 4ms idle time after
> reads? The complexity of AS (other than all the work we had to do to get
> the block layer to cope with it), is getting it to turn off at (mostly)
> the right times. Other than that, it is basically the deadline
> scheduler.
Yes, the concept is similar and there will be time wasting currently.
I've got some cases covered that AS doesn't, and there are definitely
some the other way around as well.
If you have any test cases/programs, I'd like to see them.
--
Jens Axboe
* Re: Time sliced CFQ io scheduler
2004-12-08 6:58 ` Jens Axboe
@ 2004-12-08 7:14 ` Nick Piggin
2004-12-08 7:20 ` Jens Axboe
2004-12-08 13:48 ` Jens Axboe
0 siblings, 2 replies; 66+ messages in thread
From: Nick Piggin @ 2004-12-08 7:14 UTC (permalink / raw)
To: Jens Axboe; +Cc: Andrew Morton, Andrea Arcangeli, linux-kernel
On Wed, 2004-12-08 at 07:58 +0100, Jens Axboe wrote:
> On Wed, Dec 08 2004, Nick Piggin wrote:
> > On Tue, 2004-12-07 at 18:25 -0800, Andrew Morton wrote:
> > I think we could detect when a disk asks for more than, say, 4
> > concurrent requests, and in that case turn off read anticipation
> > and all the anti-starvation for TCQ by default (with the option
> > to force it back on).
>
> CFQ only allows a certain depth at the hardware level, you can control
> that. I don't think you should drop the AS behaviour in that case, you
> should look at when the last request comes in and what type it is.
>
> With time sliced cfq I'm seeing some silly SCSI disk behaviour as well,
> it gets harder to get good read bandwidth as the disk is trying pretty
> hard to starve me. Maybe killing write back caching would help, I'll
> have to try.
>
I "fixed" this in AS. It gets (or got, last time we checked, many months
ago) pretty good read latency even with a big write and a very large
tag depth.
What were the main things I had to do... hmm, I think the main one was
to not start on a new batch until all requests from a previous batch
are reported to have completed. So eg. you get all reads completing
before you start issuing any more writes. The write->read side of things
isn't so clear cut with your "smart" write caches on the IO systems, but
no doubt that helps a bit.
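The gating rule described above can be sketched as follows (illustrative
Python, hypothetical names, not AS's actual code): no request from the
next batch is dispatched until every request from the previous batch has
completed.

```python
# Sketch of batch gating: do not start dispatching the next batch until
# every request from the previous batch has completed, so e.g. all reads
# finish before any more writes are issued. Names are illustrative.

class BatchGate:
    def __init__(self):
        self.in_flight = 0      # requests dispatched but not yet completed
        self.current = "reads"  # which batch currently owns the drive

    def may_dispatch(self, direction):
        # A request may go to the drive only if it belongs to the current
        # batch, or the previous batch has fully drained.
        return direction == self.current or self.in_flight == 0

    def dispatch(self, direction):
        if direction != self.current and self.in_flight == 0:
            self.current = direction  # switch batches at the drain point
        self.in_flight += 1

    def complete(self):
        self.in_flight -= 1

gate = BatchGate()
gate.dispatch("reads")
assert not gate.may_dispatch("writes")  # reads still in flight: hold writes
gate.complete()
assert gate.may_dispatch("writes")      # drained: writes may now start
```

The drain point is what defeats deep TCQ: the drive cannot hoard a
backlog of writes to reorder ahead of pending reads.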
Of course, after you do all that your database performance has well and
truly gone down the shitter. It is also hampered by the more fundamental
issue that read anticipation can block up the pipe for IO that is cached
on the controller/disks and would get satisfied immediately.
* Re: Time sliced CFQ io scheduler
2004-12-08 7:11 ` Jens Axboe
@ 2004-12-08 7:19 ` Nick Piggin
2004-12-08 7:26 ` Jens Axboe
0 siblings, 1 reply; 66+ messages in thread
From: Nick Piggin @ 2004-12-08 7:19 UTC (permalink / raw)
To: Jens Axboe; +Cc: Andrew Morton, Andrea Arcangeli, linux-kernel
On Wed, 2004-12-08 at 08:11 +0100, Jens Axboe wrote:
> On Wed, Dec 08 2004, Nick Piggin wrote:
> > On Wed, 2004-12-08 at 07:55 +0100, Jens Axboe wrote:
> > > Currently I think the time sliced cfq is the best all around. There's
> > > still a few kinks to be shaken out, but generally I think the concept is
> > > sounder than AS.
> > >
> >
> > But aren't you basically unconditionally allowing a 4ms idle time after
> > reads? The complexity of AS (other than all the work we had to do to get
> > the block layer to cope with it), is getting it to turn off at (mostly)
> > the right times. Other than that, it is basically the deadline
> > scheduler.
>
> Yes, the concept is similar and there will be time wasting currently.
> I've got some cases covered that AS doesn't, and there are definitely
> some the other way around as well.
>
Oh? What have you got covered that AS doesn't? (I'm only reading the
patch itself, which isn't trivial to follow).
> If you have any test cases/programs, I'd like to see them.
>
Hmm, damn. Lots of stuff. I guess some of the notable ones that I've
had trouble with are OraSim (Oracle might give you a copy), Andrew's
patch scripts when applying a stack of patches, pgbench... can't
really remember any others off the top of my head.
I've got a small set of basic test programs that are similar to the
sort of tests you've been running in this thread as well.
* Re: Time sliced CFQ io scheduler
2004-12-08 7:14 ` Nick Piggin
@ 2004-12-08 7:20 ` Jens Axboe
2004-12-08 7:29 ` Nick Piggin
2004-12-08 7:30 ` Andrew Morton
2004-12-08 13:48 ` Jens Axboe
1 sibling, 2 replies; 66+ messages in thread
From: Jens Axboe @ 2004-12-08 7:20 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, Andrea Arcangeli, linux-kernel
On Wed, Dec 08 2004, Nick Piggin wrote:
> On Wed, 2004-12-08 at 07:58 +0100, Jens Axboe wrote:
> > On Wed, Dec 08 2004, Nick Piggin wrote:
> > > On Tue, 2004-12-07 at 18:25 -0800, Andrew Morton wrote:
>
> > > I think we could detect when a disk asks for more than, say, 4
> > > concurrent requests, and in that case turn off read anticipation
> > > and all the anti-starvation for TCQ by default (with the option
> > > to force it back on).
> >
> > CFQ only allows a certain depth at the hardware level, you can control
> > that. I don't think you should drop the AS behaviour in that case, you
> > should look at when the last request comes in and what type it is.
> >
> > With time sliced cfq I'm seeing some silly SCSI disk behaviour as well,
> > it gets harder to get good read bandwidth as the disk is trying pretty
> > hard to starve me. Maybe killing write back caching would help, I'll
> > have to try.
> >
>
> I "fixed" this in AS. It gets (or got, last time we checked, many months
> ago) pretty good read latency even with a big write and a very large
> tag depth.
>
> What were the main things I had to do... hmm, I think the main one was
> to not start on a new batch until all requests from a previous batch
> are reported to have completed. So eg. you get all reads completing
> before you start issuing any more writes. The write->read side of things
> isn't so clear cut with your "smart" write caches on the IO systems, but
> no doubt that helps a bit.
I can see the read/write batching being helpful there, at least to
prevent writes starving reads if you let the queue drain completely
before starting a new batch.
CFQ does something similar, just not batched together. But it does let
the depth build up a little and drain out. In fact I think I'm missing
a little fix there thinking about it, that could be why the read
latencies hurt on write intensive loads (the dispatch queue is drained,
the hardware queue is not fully).
> Of course, after you do all that your database performance has well and
> truly gone down the shitter. It is also hampered by the more fundamental
> issue that read anticipation can block up the pipe for IO that is cached
> on the controller/disks and would get satisfied immediately.
I think we need to end up with something that sets the machine profile
for the interesting disks. Some things you can check for at runtime
(like the writes being extremely fast is a good indicator of write
caching), but it is just not possible to cover it all. Plus, you end up
with 30-40% of the code being convoluted stuff added to detect it.
--
Jens Axboe
* Re: Time sliced CFQ io scheduler
2004-12-08 7:19 ` Nick Piggin
@ 2004-12-08 7:26 ` Jens Axboe
2004-12-08 9:35 ` Jens Axboe
0 siblings, 1 reply; 66+ messages in thread
From: Jens Axboe @ 2004-12-08 7:26 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, Andrea Arcangeli, linux-kernel
On Wed, Dec 08 2004, Nick Piggin wrote:
> On Wed, 2004-12-08 at 08:11 +0100, Jens Axboe wrote:
> > On Wed, Dec 08 2004, Nick Piggin wrote:
> > > On Wed, 2004-12-08 at 07:55 +0100, Jens Axboe wrote:
>
> > > > Currently I think the time sliced cfq is the best all around. There's
> > > > still a few kinks to be shaken out, but generally I think the concept is
> > > > sounder than AS.
> > > >
> > >
> > > But aren't you basically unconditionally allowing a 4ms idle time after
> > > reads? The complexity of AS (other than all the work we had to do to get
> > > the block layer to cope with it), is getting it to turn off at (mostly)
> > > the right times. Other than that, it is basically the deadline
> > > scheduler.
> >
> > Yes, the concept is similar and there will be time wasting currently.
> > I've got some cases covered that AS doesn't, and there are definitely
> > some the other way around as well.
> >
>
> Oh? What have you got covered that AS doesn't? (I'm only reading the
> patch itself, which isn't trivial to follow).
You are only thinking in terms of single-process characteristics, like
whether it will exit and its think times; the inter-process
characteristics are very haphazard. You might find the applied code
easier to read, I think.
> > If you have any test cases/programs, I'd like to see them.
> >
>
> Hmm, damn. Lots of stuff. I guess some of the notable ones that I've
> had trouble with are OraSim (Oracle might give you a copy), Andrew's
> patch scripts when applying a stack of patches, pgbench... can't
> really remember any others off the top of my head.
The patch scripts case is interesting, last night (when committing other
patches) I was thinking I should try and bench that today. It has a good
mix of reads and writes.
There's still lots of tuning in the pipeline. As I wrote originally,
this was basically just a quick hack that I was surprised did so well
:-) It has grown a little since then and I think the concept is really
sound, so I'll continue to work on it.
> I've got a small set of basic test programs that are similar to the
> sort of tests you've been running in this thread as well.
Ok
--
Jens Axboe
* Re: Time sliced CFQ io scheduler
2004-12-08 7:20 ` Jens Axboe
@ 2004-12-08 7:29 ` Nick Piggin
2004-12-08 7:32 ` Jens Axboe
2004-12-08 7:30 ` Andrew Morton
1 sibling, 1 reply; 66+ messages in thread
From: Nick Piggin @ 2004-12-08 7:29 UTC (permalink / raw)
To: Jens Axboe; +Cc: Andrew Morton, Andrea Arcangeli, linux-kernel
On Wed, 2004-12-08 at 08:20 +0100, Jens Axboe wrote:
> On Wed, Dec 08 2004, Nick Piggin wrote:
> > On Wed, 2004-12-08 at 07:58 +0100, Jens Axboe wrote:
> > > On Wed, Dec 08 2004, Nick Piggin wrote:
> > > > On Tue, 2004-12-07 at 18:25 -0800, Andrew Morton wrote:
> >
> > > > I think we could detect when a disk asks for more than, say, 4
> > > > concurrent requests, and in that case turn off read anticipation
> > > > and all the anti-starvation for TCQ by default (with the option
> > > > to force it back on).
> > >
> > > CFQ only allows a certain depth at the hardware level, you can control
> > > that. I don't think you should drop the AS behaviour in that case, you
> > > should look at when the last request comes in and what type it is.
> > >
> > > With time sliced cfq I'm seeing some silly SCSI disk behaviour as well,
> > > it gets harder to get good read bandwidth as the disk is trying pretty
> > > hard to starve me. Maybe killing write back caching would help, I'll
> > > have to try.
> > >
> >
> > I "fixed" this in AS. It gets (or got, last time we checked, many months
> > ago) pretty good read latency even with a big write and a very large
> > tag depth.
> >
> > What were the main things I had to do... hmm, I think the main one was
> > to not start on a new batch until all requests from a previous batch
> > are reported to have completed. So eg. you get all reads completing
> > before you start issuing any more writes. The write->read side of things
> > isn't so clear cut with your "smart" write caches on the IO systems, but
> > no doubt that helps a bit.
>
> I can see the read/write batching being helpful there, at least to
> prevent writes starving reads if you let the queue drain completely
> before starting a new batch.
>
> CFQ does something similar, just not batched together. But it does let
> the depth build up a little and drain out. In fact I think I'm missing
> a little fix there thinking about it, that could be why the read
> latencies hurt on write intensive loads (the dispatch queue is drained,
> the hardware queue is not fully).
>
OK, you should look into that, because I found it was quite effective.
Maybe you have a little bug or oversight somewhere if you read latencies
are really bad. Note that AS read latencies at 256 tags aren't as good
as at 2 tags... but I think they're an order of magnitude better than
with deadline on the hardware we were testing.
> > Of course, after you do all that your database performance has well and
> > truly gone down the shitter. It is also hampered by the more fundamental
> > issue that read anticipation can block up the pipe for IO that is cached
> > on the controller/disks and would get satisfied immediately.
>
> I think we need to end up with something that sets the machine profile
> for the interesting disks. Some things you can check for at runtime
> (like the writes being extremely fast is a good indicator of write
> caching), but it is just not possible to cover it all. Plus, you end up
> with 30-40% of the code being convoluted stuff added to detect it.
>
Ideally maybe we would have a userspace program that is run to detect
various disk parameters and ask the user / config file what sort of
workloads we want to do, and spits out a recommended IO scheduler and
/sys configuration to accompany it.
That at least could be made quite a bit more sophisticated than a kernel
solution,
and could gather quite a lot of "static" disk properties.
Of course there will be also some things that need to be done in
kernel...
* Re: Time sliced CFQ io scheduler
2004-12-08 7:20 ` Jens Axboe
2004-12-08 7:29 ` Nick Piggin
@ 2004-12-08 7:30 ` Andrew Morton
2004-12-08 7:36 ` Jens Axboe
1 sibling, 1 reply; 66+ messages in thread
From: Andrew Morton @ 2004-12-08 7:30 UTC (permalink / raw)
To: Jens Axboe; +Cc: nickpiggin, andrea, linux-kernel
Jens Axboe <axboe@suse.de> wrote:
>
> I think we need to end up with something that sets the machine profile
> for the interesting disks. Some things you can check for at runtime
> (like the writes being extremely fast is a good indicator of write
> caching), but it is just not possible to cover it all. Plus, you end up
> with 30-40% of the code being convoluted stuff added to detect it.
We can detect these things from userspace. Parse the hdparm/scsiinfo
output, then poke numbers into /sys tunables.
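For the simple cases, that userspace side really is just text scraping
plus a sysfs write. A rough sketch (the hdparm output format is from
`hdparm -W`; the exact sysfs tunable path below is an assumption and
varies by kernel and scheduler version):

```python
# Rough sketch of the userspace approach: scrape hdparm output and poke
# a scheduler tunable. The sysfs path in the comment is hypothetical and
# depends on kernel/scheduler version.
import re

def parse_write_caching(hdparm_output):
    """Extract the write-caching flag from `hdparm -W /dev/hdX` output,
    e.g. ' write-caching =  1 (on)'. Returns None if not found."""
    m = re.search(r"write-caching\s*=\s*(\d+)", hdparm_output)
    return int(m.group(1)) if m else None

sample = """/dev/hda:
 write-caching =  1 (on)
"""

if parse_write_caching(sample):
    # With write-back caching on, one might cap the queue depth, e.g.
    # (path hypothetical):
    # open("/sys/block/hda/queue/iosched/cfq_max_depth", "w").write("4\n")
    pass
```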
* Re: Time sliced CFQ io scheduler
2004-12-08 7:29 ` Nick Piggin
@ 2004-12-08 7:32 ` Jens Axboe
0 siblings, 0 replies; 66+ messages in thread
From: Jens Axboe @ 2004-12-08 7:32 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, Andrea Arcangeli, linux-kernel
On Wed, Dec 08 2004, Nick Piggin wrote:
> On Wed, 2004-12-08 at 08:20 +0100, Jens Axboe wrote:
> > On Wed, Dec 08 2004, Nick Piggin wrote:
> > > On Wed, 2004-12-08 at 07:58 +0100, Jens Axboe wrote:
> > > > On Wed, Dec 08 2004, Nick Piggin wrote:
> > > > > On Tue, 2004-12-07 at 18:25 -0800, Andrew Morton wrote:
> > >
> > > > > I think we could detect when a disk asks for more than, say, 4
> > > > > concurrent requests, and in that case turn off read anticipation
> > > > > and all the anti-starvation for TCQ by default (with the option
> > > > > to force it back on).
> > > >
> > > > CFQ only allows a certain depth at the hardware level, you can control
> > > > that. I don't think you should drop the AS behaviour in that case, you
> > > > should look at when the last request comes in and what type it is.
> > > >
> > > > With time sliced cfq I'm seeing some silly SCSI disk behaviour as well,
> > > > it gets harder to get good read bandwidth as the disk is trying pretty
> > > > hard to starve me. Maybe killing write back caching would help, I'll
> > > > have to try.
> > > >
> > >
> > > I "fixed" this in AS. It gets (or got, last time we checked, many months
> > > ago) pretty good read latency even with a big write and a very large
> > > tag depth.
> > >
> > > What were the main things I had to do... hmm, I think the main one was
> > > to not start on a new batch until all requests from a previous batch
> > > are reported to have completed. So eg. you get all reads completing
> > > before you start issuing any more writes. The write->read side of things
> > > isn't so clear cut with your "smart" write caches on the IO systems, but
> > > no doubt that helps a bit.
> >
> > I can see the read/write batching being helpful there, at least to
> > prevent writes starving reads if you let the queue drain completely
> > before starting a new batch.
> >
> > CFQ does something similar, just not batched together. But it does let
> > the depth build up a little and drain out. In fact I think I'm missing
> > a little fix there thinking about it, that could be why the read
> > latencies hurt on write intensive loads (the dispatch queue is drained,
> > the hardware queue is not fully).
> >
>
> OK, you should look into that, because I found it was quite effective.
> Maybe you have a little bug or oversight somewhere if you read latencies
> are really bad. Note that AS read latencies at 256 tags aren't so good
> as at 2 tags... but I think they're an order of magnitude better than
> with deadline on the hardware we were testing.
It wasn't _that_ bad, the main issue really was that it was interfering
with the cfq slices and you didn't get really good aggregate throughput
for several threads. Once that happens, there's the nasty tendency for
both latency to rise and throughput to plummet quickly :-)
I cap the depth at a variable setting right now, so no more than 4 by
default.
> > > Of course, after you do all that your database performance has well and
> > > truly gone down the shitter. It is also hampered by the more fundamental
> > > issue that read anticipation can block up the pipe for IO that is cached
> > > on the controller/disks and would get satisfied immediately.
> >
> > I think we need to end up with something that sets the machine profile
> > for the interesting disks. Some things you can check for at runtime
> > (like the writes being extremely fast is a good indicator of write
> > caching), but it is just not possible to cover it all. Plus, you end up
> > with 30-40% of the code being convoluted stuff added to detect it.
> >
>
> Ideally maybe we would have a userspace program that is run to detect
> various disk parameters and ask the user / config file what sort of
> workloads we want to do, and spits out a recommended IO scheduler and
> /sys configuration to accompany it.
Well, or have the user give a profile of the drive. There's no point in
attempting to guess things the user knows. And then there are things you
probably cannot get right in either case :)
> That at least could be made quite sophisticated than a kernel solution,
> and could gather quite a lot of "static" disk properties.
And move some code to user space.
> Of course there will be also some things that need to be done in
> kernel...
Always, we should of course run as well as we can without magic disk
programs being needed.
--
Jens Axboe
* Re: Time sliced CFQ io scheduler
2004-12-08 7:30 ` Andrew Morton
@ 2004-12-08 7:36 ` Jens Axboe
0 siblings, 0 replies; 66+ messages in thread
From: Jens Axboe @ 2004-12-08 7:36 UTC (permalink / raw)
To: Andrew Morton; +Cc: nickpiggin, andrea, linux-kernel
On Tue, Dec 07 2004, Andrew Morton wrote:
> Jens Axboe <axboe@suse.de> wrote:
> >
> > I think we need to end up with something that sets the machine profile
> > for the interesting disks. Some things you can check for at runtime
> > (like the writes being extremely fast is a good indicator of write
> > caching), but it is just not possible to cover it all. Plus, you end up
> > with 30-40% of the code being convoluted stuff added to detect it.
>
> We can detect these things from userspace. Parse the hdparm/scsiinfo
> output, then poke numbers into /sys tunables.
The simple things, like cache settings and queue depth - definitely. The
harder things, like how this drive actually behaves, you cannot. And
unfortunately the former is also pretty easy to control (at least for
the depth) and to gather at runtime. So I think a user mode helper
only makes sense if it can help you with real drive characteristics that
are hard to detect. Plus, settings have a knack for changing while we
are running as well.
Hmm so perhaps not such a hot idea after all. I don't envision anyone
actually doing it anyways, so...
--
Jens Axboe
* Re: Time sliced CFQ io scheduler
2004-12-08 7:26 ` Jens Axboe
@ 2004-12-08 9:35 ` Jens Axboe
2004-12-08 10:08 ` Jens Axboe
2004-12-08 12:47 ` Jens Axboe
0 siblings, 2 replies; 66+ messages in thread
From: Jens Axboe @ 2004-12-08 9:35 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, Andrea Arcangeli, linux-kernel
On Wed, Dec 08 2004, Jens Axboe wrote:
> > Hmm, damn. Lots of stuff. I guess some of the notable ones that I've
> > had trouble with are OraSim (Oracle might give you a copy), Andrew's
> > patch scripts when applying a stack of patches, pgbench... can't
> > really remember any others off the top of my head.
>
> The patch scripts case is interesting, last night (when committing other
> patches) I was thinking I should try and bench that today. It has a good
> mix of reads and writes.
AS is currently 10 seconds faster for that workload (untar of a kernel
and then applying 2237 patches). AS completes it in 155 seconds, CFQ
takes 164 seconds.
I still need to fix the streamed write performance regression, then I'll
see how the above compares again. CFQ doesn't do very well in eg
tiobench streamed write case (it's about 30% slower than AS).
(btw, any mention of CFQ in this thread refers to time sliced cfq).
--
Jens Axboe
* Re: Time sliced CFQ io scheduler
2004-12-08 9:35 ` Jens Axboe
@ 2004-12-08 10:08 ` Jens Axboe
2004-12-08 12:47 ` Jens Axboe
1 sibling, 0 replies; 66+ messages in thread
From: Jens Axboe @ 2004-12-08 10:08 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, Andrea Arcangeli, linux-kernel
On Wed, Dec 08 2004, Jens Axboe wrote:
> On Wed, Dec 08 2004, Jens Axboe wrote:
> > > Hmm, damn. Lots of stuff. I guess some of the notable ones that I've
> > > had trouble with are OraSim (Oracle might give you a copy), Andrew's
> > > patch scripts when applying a stack of patches, pgbench... can't
> > > really remember any others off the top of my head.
> >
> > The patch scripts case is interesting, last night (when committing other
> > patches) I was thinking I should try and bench that today. It has a good
> > mix of reads and writes.
>
> AS is currently 10 seconds faster for that workload (untar of a kernel
> and then applying 2237 patches). AS completes it in 155 seconds, CFQ
> takes 164 seconds.
DEADLINE does 160 seconds, btw.
Something like
for i in patches.*/*; do cp "$i" /dev/null; done
while running a
dd if=/dev/zero of=testfile bs=64k
could be better for both schedulers. AS completes the workload in
4min 14sec, CFQ in 3min 5sec. I don't have the time to try DEADLINE :-)
--
Jens Axboe
* Re: Time sliced CFQ io scheduler
2004-12-08 10:52 ` Helge Hafting
@ 2004-12-08 10:49 ` Jens Axboe
0 siblings, 0 replies; 66+ messages in thread
From: Jens Axboe @ 2004-12-08 10:49 UTC (permalink / raw)
To: Helge Hafting; +Cc: Andrew Morton, Andrea Arcangeli, nickpiggin, linux-kernel
On Wed, Dec 08 2004, Helge Hafting wrote:
> >>AS needs another iteration of development to fix these things. Right now
> >>it's probably the case that we need CFQ or deadline for servers and AS for
> >>desktops. That's awkward.
> >>
> >>
> >
> >Currently I think the time sliced cfq is the best all around. There's
> >still a few kinks to be shaken out, but generally I think the concept is
> >sounder than AS.
> >
> >
> I wonder, would it make sense to add some limited anticipation
> to the cfq scheduler? It seems to me that there is room to
> get some of the AS benefit without getting too unfair:
>
> AS does a wait that is short compared to a seek, getting some
> more locality almost for free. Consider if CFQ did this, with
> the added limitation that it only let a few extra read requests
> in this way before doing the next seek anyway. For example,
> allowing up to 3 extra anticipated read requests before
> seeking could quadruple read bandwidth in some cases. This is
> clearly not as fair, but the extra reads will be almost free
> because those few reads take little time compared to the seek
> that follows anyway. Therefore, the latency for other requests
> shouldn't change much and we get the best of both AS and CFQ.
> Or have I made a broken assumption?
This is basically what time sliced cfq does. For sync requests, cfq
allows a definable idle period where we give the process a chance to
submit a new request if it has enough time slice to do so. This
'anticipation' then is just an artifact of the design of time sliced
cfq, where we do assign a finite time period where a given process owns
the disk.
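The decision above can be sketched as (illustrative Python, hypothetical
names and a hypothetical 4 ms default, not CFQ's actual code): the idle
is only granted to the slice owner, and only while it has slice left to
use the window.

```python
# Sketch of the slice-idle decision: anticipate only for synchronous
# requests, and only while the owning process has enough slice remaining
# to make use of the idle window. Names and the 4 ms default are
# illustrative, not CFQ's actual code.

def should_idle(last_request_sync, slice_remaining_ms, idle_ms=4):
    """Return True if the scheduler should wait for the slice owner's
    next request instead of seeking away to another process."""
    return last_request_sync and slice_remaining_ms > idle_ms

assert should_idle(True, 20)        # mid-slice sync reader: wait for it
assert not should_idle(True, 2)     # slice nearly expired: move on
assert not should_idle(False, 20)   # async (write) request: no anticipation
```

Bounding the idle by the slice is what keeps the anticipation from
turning into unfairness: a process can only "waste" time it already owns.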
See my initial posting on time sliced cfq. That is why time sliced cfq
does as well (or better) than AS for the many client cases, while still
being fair.
--
Jens Axboe
* Re: Time sliced CFQ io scheduler
2004-12-08 6:55 ` Jens Axboe
2004-12-08 7:08 ` Nick Piggin
@ 2004-12-08 10:52 ` Helge Hafting
2004-12-08 10:49 ` Jens Axboe
1 sibling, 1 reply; 66+ messages in thread
From: Helge Hafting @ 2004-12-08 10:52 UTC (permalink / raw)
To: Jens Axboe; +Cc: Andrew Morton, Andrea Arcangeli, nickpiggin, linux-kernel
Jens Axboe wrote:
>On Tue, Dec 07 2004, Andrew Morton wrote:
>
>
>>Andrea Arcangeli <andrea@suse.de> wrote:
>>
>>
>>>The desktop is ok with "as" simply because it's
>>> normally optimal to stop writes completely
>>>
>>>
>>AS doesn't "stop writes completely". With the current settings it
>>apportions about 1/3 of the disk's bandwidth to writes.
>>
>>This thing Jens has found is for direct-io writes only. It's a bug.
>>
>>
>
>Indeed. It's a special case one, but nasty for that case.
>
>
>
>>The other problem with AS is that it basically doesn't work at all with a
>>TCQ depth greater than four or so, and lots of people blindly look at
>>untuned SCSI benchmark results without realising that. If a distro is
>>
>>
>
>That's pretty easy to fix. I added something like that to cfq, and it's
>not a lot of lines of code (grep for rq_in_driver and cfq_max_depth).
>
>
>
>>always selecting CFQ then they've probably gone and deoptimised all their
>>IDE users.
>>
>>
>
>Andrew, AS has other issues, it's not a case of AS always being faster
>at everything.
>
>
>
>>AS needs another iteration of development to fix these things. Right now
>>it's probably the case that we need CFQ or deadline for servers and AS for
>>desktops. That's awkward.
>>
>>
>
>Currently I think the time sliced cfq is the best all around. There's
>still a few kinks to be shaken out, but generally I think the concept is
>sounder than AS.
>
>
I wonder, would it make sense to add some limited anticipation
to the cfq scheduler? It seems to me that there is room to
get some of the AS benefit without getting too unfair:
AS does a wait that is short compared to a seek, getting some
more locality almost for free. Consider if CFQ did this, with
the added limitation that it only let a few extra read requests
in this way before doing the next seek anyway. For example,
allowing up to 3 extra anticipated read requests before
seeking could quadruple read bandwidth in some cases. This is
clearly not as fair, but the extra reads will be almost free
because those few reads take little time compared to the seek
that follows anyway. Therefore, the latency for other requests
shouldn't change much and we get the best of both AS and CFQ.
Or have I made a broken assumption?
The max number of requests to anticipate could even be
configurable, just set it to 0 to get pure CFQ.
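The arithmetic behind the proposal roughly holds under a simple cost
model (all timings illustrative): with a 10 ms seek and 0.5 ms reads,
allowing 3 extra anticipated reads gives about 3.5x the bandwidth,
approaching 4x as reads get cheaper relative to seeks.

```python
# Back-of-envelope model for the proposal above: each seek is followed
# by reads_per_seek sequential reads. Timings are illustrative only.

def bandwidth(reads_per_seek, seek_ms=10.0, read_ms=0.5):
    """Reads completed per second for the given reads-per-seek ratio."""
    cycle_ms = seek_ms + reads_per_seek * read_ms
    return reads_per_seek * 1000.0 / cycle_ms

pure_cfq = bandwidth(1)  # one read, then seek to the next process
limited = bandwidth(4)   # 1 original + 3 anticipated reads per seek
print(limited / pure_cfq)
```

The latency cost to other processes is only the extra reads themselves
(1.5 ms here), small against the seek that follows anyway, which is the
assumption Helge is making.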
Helge Hafting
* Re: Time sliced CFQ io scheduler
2004-12-08 9:35 ` Jens Axboe
2004-12-08 10:08 ` Jens Axboe
@ 2004-12-08 12:47 ` Jens Axboe
1 sibling, 0 replies; 66+ messages in thread
From: Jens Axboe @ 2004-12-08 12:47 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, Andrea Arcangeli, linux-kernel
On Wed, Dec 08 2004, Jens Axboe wrote:
> On Wed, Dec 08 2004, Jens Axboe wrote:
> > > Hmm, damn. Lots of stuff. I guess some of the notable ones that I've
> > > had trouble with are OraSim (Oracle might give you a copy), Andrew's
> > > patch scripts when applying a stack of patches, pgbench... can't
> > > really remember any others off the top of my head.
> >
> > The patch scripts case is interesting, last night (when committing other
> > patches) I was thinking I should try and bench that today. It has a good
> > mix of reads and writes.
>
> AS is currently 10 seconds faster for that workload (untar of a kernel
> and then applying 2237 patches). AS completes it in 155 seconds, CFQ
> takes 164 seconds.
Turned out to be a stupid dispatch sort error in cfq; now it has the
exact same runtime as AS.
--
Jens Axboe
* Re: Time sliced CFQ io scheduler
2004-12-08 7:14 ` Nick Piggin
2004-12-08 7:20 ` Jens Axboe
@ 2004-12-08 13:48 ` Jens Axboe
1 sibling, 0 replies; 66+ messages in thread
From: Jens Axboe @ 2004-12-08 13:48 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, Andrea Arcangeli, linux-kernel
On Wed, Dec 08 2004, Nick Piggin wrote:
> On Wed, 2004-12-08 at 07:58 +0100, Jens Axboe wrote:
> > On Wed, Dec 08 2004, Nick Piggin wrote:
> > > On Tue, 2004-12-07 at 18:25 -0800, Andrew Morton wrote:
>
> > > I think we could detect when a disk asks for more than, say, 4
> > > concurrent requests, and in that case turn off read anticipation
> > > and all the anti-starvation for TCQ by default (with the option
> > > to force it back on).
> >
> > CFQ only allows a certain depth at the hardware level, you can control
> > that. I don't think you should drop the AS behaviour in that case, you
> > should look at when the last request comes in and what type it is.
> >
> > With time sliced cfq I'm seeing some silly SCSI disk behaviour as well,
> > it gets harder to get good read bandwidth as the disk is trying pretty
> > hard to starve me. Maybe killing write back caching would help, I'll
> > have to try.
> >
>
> I "fixed" this in AS. It gets (or got, last time we checked, many months
> ago) pretty good read latency even with a big write and a very large
> tag depth.
This problem was also caused by the dispatch sort bug. So you were
right, it was 'some little bug' in the code :)
--
Jens Axboe
* Re: Time sliced CFQ io scheduler
@ 2004-12-03 20:52 Chuck Ebbert
0 siblings, 0 replies; 66+ messages in thread
From: Chuck Ebbert @ 2004-12-03 20:52 UTC (permalink / raw)
To: Jens Axboe
Cc: Prakash K. Cheemplavam, Andrew Morton, linux-kernel, Nick Piggin,
Neil Brown
On Fri, 3 Dec 2004 at 11:31:30 +0100 Jens Axboe wrote:
>> Yes, I have linux raid (testing md1). I have applied both settings on
>> both drives and got an interesting new pattern: now it alternates. My
>> email client is still not usable while writing, though...
>
> Funky. It looks like another case of the io scheduler being in the wrong
> place - if raid sends dependent reads to different drives, it screws up
> the io scheduling. The right way to fix that would be to do the io
> scheduling before raid (the reverse of what we do now), but that is a
> lot of work. A hack would be to try and tie processes to one md
> component for periods of time, sort of like cfq slicing.
How about having the raid1 read balance code send each read to every drive
in the mirror, and just take the first one that returns data? It could then
cancel the rest, or just ignore them... ;)
--Chuck Ebbert 03-Dec-04 15:43:54
Thread overview: 66+ messages
2004-12-02 13:04 Time sliced CFQ io scheduler Jens Axboe
2004-12-02 13:48 ` Jens Axboe
2004-12-02 19:48 ` Andrew Morton
2004-12-02 19:52 ` Jens Axboe
2004-12-02 20:19 ` Andrew Morton
2004-12-02 20:19 ` Jens Axboe
2004-12-02 20:34 ` Andrew Morton
2004-12-02 20:37 ` Jens Axboe
2004-12-07 23:11 ` Nick Piggin
2004-12-02 22:18 ` Prakash K. Cheemplavam
2004-12-03 7:01 ` Jens Axboe
2004-12-03 9:12 ` Prakash K. Cheemplavam
2004-12-03 9:18 ` Jens Axboe
2004-12-03 9:35 ` Prakash K. Cheemplavam
2004-12-03 9:43 ` Jens Axboe
2004-12-03 9:26 ` Andrew Morton
2004-12-03 9:34 ` Prakash K. Cheemplavam
2004-12-03 9:39 ` Jens Axboe
2004-12-03 9:54 ` Prakash K. Cheemplavam
[not found] ` <41B03722.5090001@gmx.de>
2004-12-03 10:31 ` Jens Axboe
2004-12-03 10:38 ` Jens Axboe
2004-12-03 10:45 ` Prakash K. Cheemplavam
2004-12-03 10:48 ` Jens Axboe
2004-12-03 11:27 ` Prakash K. Cheemplavam
2004-12-03 11:29 ` Jens Axboe
2004-12-03 11:52 ` Prakash K. Cheemplavam
2004-12-08 0:37 ` Andrea Arcangeli
2004-12-08 0:54 ` Nick Piggin
2004-12-08 1:37 ` Andrea Arcangeli
2004-12-08 1:47 ` Nick Piggin
2004-12-08 2:09 ` Andrea Arcangeli
2004-12-08 2:11 ` Andrew Morton
2004-12-08 2:22 ` Andrea Arcangeli
2004-12-08 6:52 ` Jens Axboe
2004-12-08 2:00 ` Andrew Morton
2004-12-08 2:08 ` Andrew Morton
2004-12-08 6:55 ` Jens Axboe
2004-12-08 2:20 ` Andrea Arcangeli
2004-12-08 2:25 ` Andrew Morton
2004-12-08 2:33 ` Andrea Arcangeli
2004-12-08 2:33 ` Nick Piggin
2004-12-08 2:51 ` Andrea Arcangeli
2004-12-08 3:02 ` Nick Piggin
2004-12-08 6:58 ` Jens Axboe
2004-12-08 7:14 ` Nick Piggin
2004-12-08 7:20 ` Jens Axboe
2004-12-08 7:29 ` Nick Piggin
2004-12-08 7:32 ` Jens Axboe
2004-12-08 7:30 ` Andrew Morton
2004-12-08 7:36 ` Jens Axboe
2004-12-08 13:48 ` Jens Axboe
2004-12-08 6:55 ` Jens Axboe
2004-12-08 7:08 ` Nick Piggin
2004-12-08 7:11 ` Jens Axboe
2004-12-08 7:19 ` Nick Piggin
2004-12-08 7:26 ` Jens Axboe
2004-12-08 9:35 ` Jens Axboe
2004-12-08 10:08 ` Jens Axboe
2004-12-08 12:47 ` Jens Axboe
2004-12-08 10:52 ` Helge Hafting
2004-12-08 10:49 ` Jens Axboe
2004-12-08 6:49 ` Jens Axboe
2004-12-02 14:28 ` Giuliano Pochini
2004-12-02 14:41 ` Jens Axboe
2004-12-04 13:05 ` Giuliano Pochini
2004-12-03 20:52 Chuck Ebbert