* Time sliced CFQ io scheduler
@ 2004-12-02 13:04 Jens Axboe
  2004-12-02 13:48 ` Jens Axboe
  2004-12-02 14:28 ` Giuliano Pochini
  0 siblings, 2 replies; 66+ messages in thread
From: Jens Axboe @ 2004-12-02 13:04 UTC (permalink / raw)
  To: Linux Kernel

Hi,

Some time ago I pondered modifying CFQ to do fairness based on slices of
disk time. It appeared to be a minor modification, but with some nice
bonus points:

- It scales nicely with CPU scheduler slices, making io priorities a
  cinch to implement.

- It has the possibility to equal the anticipatory scheduler for
  multiple processes competing for disk bandwidth.

So I implemented it and ran some tests; the results are pretty
astonishing. A note on the test cases, read_files and write_files: they
either read or write a number of files sequentially or randomly, with
each file having io done to it by a dedicated process. IO bypasses the
page cache by using O_DIRECT, and runtime is capped at 30 seconds for
each test. Each test case was run on deadline, as, and the new cfq. The
drive used was an IDE drive (results are similar for SCSI), the fs used
was ext2.
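
(The actual read_files/write_files programs aren't included in this
mail. Purely as an illustration of what one client does - a single
process doing O_DIRECT io to its own file at a fixed block size for 30
seconds - a rough sketch of a sequential reader could look like the
below. File name, block size and the missing latency accounting are
made up for the example; this is not the real test code.)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>

int main(int argc, char *argv[])
{
	const char *file = argc > 1 ? argv[1] : "testfile";
	const size_t bs = 4096;			/* the bs=4k case */
	const int runtime = 30;			/* seconds */
	unsigned long long bytes = 0;
	struct timeval start, now;
	void *buf;
	int fd;

	fd = open(file, O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* O_DIRECT needs an aligned buffer */
	if (posix_memalign(&buf, 4096, bs) != 0) {
		fprintf(stderr, "no aligned memory\n");
		return 1;
	}

	gettimeofday(&start, NULL);
	for (;;) {
		ssize_t ret;

		gettimeofday(&now, NULL);
		if (now.tv_sec - start.tv_sec >= runtime)
			break;

		ret = read(fd, buf, bs);	/* the random case would pread()
						   at a random aligned offset */
		if (ret < 0) {
			perror("read");
			break;
		}
		if (ret == 0) {			/* EOF, wrap to the start */
			lseek(fd, 0, SEEK_SET);
			continue;
		}
		bytes += ret;
	}

	printf("%llu KiB/sec\n", bytes / 1024 / runtime);
	close(fd);
	return 0;
}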

Scroll past results for the executive summary.

Case 1: read_files, sequential, bs=4k
-------------------------------------

Scheduler: deadline

Clients   Max bwidth   Min bwidth    Agg bwidth    Max latency
   1        19837         19837         19837        22msec
   2         2116          2114          4230        22msec
   4          361           360          1444        41msec
   8          150           149          1201       111msec

Note: bandwidth quickly becomes seek bound as clients are added.

Scheduler: as

Clients   Max bwidth   Min bwidth    Agg bwidth    Max latency
   1        19480         19480         19480        30msec
   2         9250          9189         18434       261msec
   4         4513          4469         17970       488msec
   8         2238          2157         17581       934msec

Note: as maintains good aggregate bandwidth as clients are added, while
still being fair between clients. Latency rises quickly.

Scheduler: cfq

Clients   Max bwidth   Min bwidth    Agg bwidth    Max latency
   1        19433         19433         19433         9msec
   2         8686          8628         17312        90msec
   4         4507          4471         17963       254msec
   8         2181          2104         17134       578msec

Note: cfq performs close to as. Aggregate bandwidth doesn't suffer as
clients are added, and inter-client latency and throughput are
excellent. Max latency is roughly half that of as.


Case 2: read_files, random, bs=64k
-------------------------------------

Scheduler: deadline

Clients   Max bwidth   Min bwidth    Agg bwidth    Max latency
   1         7042          7042          7042        20msec
   2         3052          3051          6103        28msec
   4         1560          1498          6124       101msec
   8          802           581          5487       231msec

Scheduler: as

Clients   Max bwidth   Min bwidth    Agg bwidth    Max latency
   1         7041          7041          7041        18msec
   2         4616          2298          6912       270msec
   4         3190           928          6901       360msec
   8         1524           645          6765       636msec

Note: Aggregate bandwidth remains good, but as has big problems with
inter-client fairness.

Scheduler: cfq

Clients   Max bwidth   Min bwidth    Agg bwidth    Max latency
   1         7027          7027          7027        19msec
   2         3429          3413          6841       107msec
   4         1718          1700          6844       282msec
   8          875           827          6795       627msec

Note: Aggregate bandwidth remains good and is basically identical to
as; the same goes for the latencies, where cfq is a little better.
Inter-client fairness is very good.


Case 3: write_files, sequential, bs=4k
-------------------------------------

Scheduler: deadline

Clients   Max bwidth   Min bwidth    Agg bwidth    Max latency
   1        13406         13406         13406        21msec
   2         1553          1551          3104       171msec
   4          690           689          2759       116msec
   8          329           318          2604       106msec

Note: Aggregate bandwidth quickly drops with the number of clients.
Latency is good.

Scheduler: as

Clients   Max bwidth   Min bwidth    Agg bwidth    Max latency
   1        13330         13330         13330        21msec
   2         2694          2694          5388        77msec
   4         1754            17          4988       762msec
   8          638           342          3866       848msec

Note: Aggregate bandwidth is better than deadline, but still not very
good. Latency is poor and inter-client fairness is horrible.

Scheduler: cfq

Clients   Max bwidth   Min bwidth    Agg bwidth    Max latency
   1        13267         13267         13267        30msec
   2         6352          6150         12459       239msec
   4         3230          2945         12524       289msec
   8         1640          1640         12564       599msec

Note: Aggregate bandwidth remains high as clients are added; latencies
are similar to as, with cfq a little better at the higher client
counts. Inter-client fairness is very good.


Case 4: write_files, random, bs=4k
-------------------------------------

Scheduler: deadline

Clients   Max bwidth   Min bwidth    Agg bwidth    Max latency
   1         6749          6749          6749       112msec
   2         1299          1277          2574       813msec
   4          432           418          1715       227msec
   8          291           247          2147      1723msec

Note: Same again for deadline - aggregate bandwidth really drops as
clients are added, but at least client fairness is good.

Scheduler: as

Clients   Max bwidth   Min bwidth    Agg bwidth    Max latency
   1         4110          4110          4110       114msec
   2          815           809          1623       631msec
   4          482           349          1760       606msec
   8          476           111          2863       752msec

Note: Generally does worse than deadline and has fairness issues.

Scheduler: cfq

Clients   Max bwidth   Min bwidth    Agg bwidth    Max latency
   1         4493          4493          4493       129msec
   2         1710          1513          3216       321msec
   4          521           482          2002       476msec
   8          938           877          7210       927msec


Good results for such a quick hack; I'm generally surprised how well it
does without any tuning. The results above use the default settings for
the cfq slices: an 83ms slice time with a 4ms allowed idle period (a
queue is preempted if it exceeds this idle time). With the disk time
slices, aggregate bandwidth stays close to the real disk performance
even with many clients.
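
(For reference, those defaults come straight from the HZ fractions in
the patch. The snippet below just illustrates the arithmetic, assuming
HZ=1000 as on typical 2.6 x86 configs; with a different HZ the jiffies
counts differ.)

/*
 * illustration only: how the cfq_slice/cfq_idle defaults map to
 * wall-clock time, assuming HZ=1000
 */
#include <stdio.h>

#define HZ 1000

int main(void)
{
	int cfq_slice = HZ / 12;	/* 83 jiffies -> ~83ms slice per queue */
	int cfq_idle  = HZ / 249;	/* 4 jiffies  -> ~4ms allowed idle time */

	printf("slice=%dms idle=%dms\n",
	       cfq_slice * 1000 / HZ, cfq_idle * 1000 / HZ);
	return 0;
}

The patch also exports slice, idle and max_depth as sysfs attributes
(under the queue's iosched directory), so they can be tweaked at
runtime.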

Patch against BK-current as of today.

===== drivers/block/cfq-iosched.c 1.15 vs edited =====
--- 1.15/drivers/block/cfq-iosched.c	2004-11-30 07:56:58 +01:00
+++ edited/drivers/block/cfq-iosched.c	2004-12-02 14:03:56 +01:00
@@ -22,21 +22,22 @@
 #include <linux/rbtree.h>
 #include <linux/mempool.h>
 
-static unsigned long max_elapsed_crq;
-static unsigned long max_elapsed_dispatch;
-
 /*
  * tunables
  */
 static int cfq_quantum = 4;		/* max queue in one round of service */
 static int cfq_queued = 8;		/* minimum rq allocate limit per-queue*/
-static int cfq_service = HZ;		/* period over which service is avg */
 static int cfq_fifo_expire_r = HZ / 2;	/* fifo timeout for sync requests */
 static int cfq_fifo_expire_w = 5 * HZ;	/* fifo timeout for async requests */
 static int cfq_fifo_rate = HZ / 8;	/* fifo expiry rate */
 static int cfq_back_max = 16 * 1024;	/* maximum backwards seek, in KiB */
 static int cfq_back_penalty = 2;	/* penalty of a backwards seek */
 
+static int cfq_slice = HZ / 12;
+static int cfq_idle = HZ / 249;
+
+static int cfq_max_depth = 4;
+
 /*
  * for the hash of cfqq inside the cfqd
  */
@@ -55,6 +56,7 @@
 #define list_entry_hash(ptr)	hlist_entry((ptr), struct cfq_rq, hash)
 
 #define list_entry_cfqq(ptr)	list_entry((ptr), struct cfq_queue, cfq_list)
+#define list_entry_fifo(ptr)	list_entry((ptr), struct request, queuelist)
 
 #define RQ_DATA(rq)		(rq)->elevator_private
 
@@ -76,22 +78,18 @@
 #define rq_rb_key(rq)		(rq)->sector
 
 /*
- * threshold for switching off non-tag accounting
- */
-#define CFQ_MAX_TAG		(4)
-
-/*
  * sort key types and names
  */
 enum {
 	CFQ_KEY_PGID,
 	CFQ_KEY_TGID,
+	CFQ_KEY_PID,
 	CFQ_KEY_UID,
 	CFQ_KEY_GID,
 	CFQ_KEY_LAST,
 };
 
-static char *cfq_key_types[] = { "pgid", "tgid", "uid", "gid", NULL };
+static char *cfq_key_types[] = { "pgid", "tgid", "pid", "uid", "gid", NULL };
 
 /*
  * spare queue
@@ -103,6 +101,8 @@
 static kmem_cache_t *cfq_ioc_pool;
 
 struct cfq_data {
+	atomic_t ref;
+
 	struct list_head rr_list;
 	struct list_head empty_list;
 
@@ -114,8 +114,6 @@
 
 	unsigned int max_queued;
 
-	atomic_t ref;
-
 	int key_type;
 
 	mempool_t *crq_pool;
@@ -127,6 +125,14 @@
 	int rq_in_driver;
 
 	/*
+	 * schedule slice state info
+	 */
+	struct timer_list timer;
+	struct work_struct unplug_work;
+	struct cfq_queue *active_queue;
+	unsigned int dispatch_slice;
+
+	/*
 	 * tunables, see top of file
 	 */
 	unsigned int cfq_quantum;
@@ -137,8 +143,9 @@
 	unsigned int cfq_back_penalty;
 	unsigned int cfq_back_max;
 	unsigned int find_best_crq;
-
-	unsigned int cfq_tagged;
+	unsigned int cfq_slice;
+	unsigned int cfq_idle;
+	unsigned int cfq_max_depth;
 };
 
 struct cfq_queue {
@@ -150,8 +157,6 @@
 	struct hlist_node cfq_hash;
 	/* hash key */
 	unsigned long key;
-	/* whether queue is on rr (or empty) list */
-	int on_rr;
 	/* on either rr or empty list of cfqd */
 	struct list_head cfq_list;
 	/* sorted list of pending requests */
@@ -169,15 +174,22 @@
 
 	int key_type;
 
-	unsigned long service_start;
+	unsigned long slice_start;
 	unsigned long service_used;
+	unsigned long service_rq;
+	unsigned long service_last;
 
-	unsigned int max_rate;
+	/* whether queue is on rr (or empty) list */
+	unsigned int on_rr : 1;
+	unsigned int wait_request : 1;
+	unsigned int must_dispatch : 1;
 
 	/* number of requests that have been handed to the driver */
 	int in_flight;
 	/* number of currently allocated requests */
 	int alloc_limit[2];
+	/* last rq was sync */
+	char name[16];
 };
 
 struct cfq_rq {
@@ -219,6 +231,8 @@
 		default:
 		case CFQ_KEY_TGID:
 			return tsk->tgid;
+		case CFQ_KEY_PID:
+			return tsk->pid;
 		case CFQ_KEY_UID:
 			return tsk->uid;
 		case CFQ_KEY_GID:
@@ -309,7 +323,7 @@
 
 			if (blk_barrier_rq(rq))
 				break;
-
+	
 			if (distance < abs(s1 - rq->sector + rq->nr_sectors)) {
 				distance = abs(s1 - rq->sector +rq->nr_sectors);
 				last = rq->sector + rq->nr_sectors;
@@ -406,67 +420,22 @@
 		cfqq->next_crq = cfq_find_next_crq(cfqq->cfqd, cfqq, crq);
 }
 
-static int cfq_check_sort_rr_list(struct cfq_queue *cfqq)
-{
-	struct list_head *head = &cfqq->cfqd->rr_list;
-	struct list_head *next, *prev;
-
-	/*
-	 * list might still be ordered
-	 */
-	next = cfqq->cfq_list.next;
-	if (next != head) {
-		struct cfq_queue *cnext = list_entry_cfqq(next);
-
-		if (cfqq->service_used > cnext->service_used)
-			return 1;
-	}
-
-	prev = cfqq->cfq_list.prev;
-	if (prev != head) {
-		struct cfq_queue *cprev = list_entry_cfqq(prev);
-
-		if (cfqq->service_used < cprev->service_used)
-			return 1;
-	}
-
-	return 0;
-}
-
-static void cfq_sort_rr_list(struct cfq_queue *cfqq, int new_queue)
+static void cfq_resort_rr_list(struct cfq_queue *cfqq)
 {
 	struct list_head *entry = &cfqq->cfqd->rr_list;
 
-	if (!cfqq->on_rr)
-		return;
-	if (!new_queue && !cfq_check_sort_rr_list(cfqq))
-		return;
-
 	list_del(&cfqq->cfq_list);
 
 	/*
-	 * sort by our mean service_used, sub-sort by in-flight requests
+	 * sort by when queue was last serviced
 	 */
 	while ((entry = entry->prev) != &cfqq->cfqd->rr_list) {
 		struct cfq_queue *__cfqq = list_entry_cfqq(entry);
 
-		if (cfqq->service_used > __cfqq->service_used)
+		if (!__cfqq->service_last)
+			break;
+		if (time_before(__cfqq->service_last, cfqq->service_last))
 			break;
-		else if (cfqq->service_used == __cfqq->service_used) {
-			struct list_head *prv;
-
-			while ((prv = entry->prev) != &cfqq->cfqd->rr_list) {
-				__cfqq = list_entry_cfqq(prv);
-
-				WARN_ON(__cfqq->service_used > cfqq->service_used);
-				if (cfqq->service_used != __cfqq->service_used)
-					break;
-				if (cfqq->in_flight > __cfqq->in_flight)
-					break;
-
-				entry = prv;
-			}
-		}
 	}
 
 	list_add(&cfqq->cfq_list, entry);
@@ -479,16 +448,12 @@
 static inline void
 cfq_add_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
-	/*
-	 * it's currently on the empty list
-	 */
-	cfqq->on_rr = 1;
-	cfqd->busy_queues++;
+	BUG_ON(cfqq->on_rr);
 
-	if (time_after(jiffies, cfqq->service_start + cfq_service))
-		cfqq->service_used >>= 3;
+	cfqd->busy_queues++;
+	cfqq->on_rr = 1;
 
-	cfq_sort_rr_list(cfqq, 1);
+	cfq_resort_rr_list(cfqq);
 }
 
 static inline void
@@ -512,10 +477,10 @@
 		struct cfq_data *cfqd = cfqq->cfqd;
 
 		BUG_ON(!cfqq->queued[crq->is_sync]);
+		cfqq->queued[crq->is_sync]--;
 
 		cfq_update_next_crq(crq);
 
-		cfqq->queued[crq->is_sync]--;
 		rb_erase(&crq->rb_node, &cfqq->sort_list);
 		RB_CLEAR_COLOR(&crq->rb_node);
 
@@ -622,11 +587,6 @@
 	if (crq) {
 		struct cfq_queue *cfqq = crq->cfq_queue;
 
-		if (cfqq->cfqd->cfq_tagged) {
-			cfqq->service_used--;
-			cfq_sort_rr_list(cfqq, 0);
-		}
-
 		crq->accounted = 0;
 		cfqq->cfqd->rq_in_driver--;
 	}
@@ -640,9 +600,7 @@
 	if (crq) {
 		cfq_remove_merge_hints(q, crq);
 		list_del_init(&rq->queuelist);
-
-		if (crq->cfq_queue)
-			cfq_del_crq_rb(crq);
+		cfq_del_crq_rb(crq);
 	}
 }
 
@@ -724,6 +682,101 @@
 }
 
 /*
+ * current cfqq expired its slice (or was too idle), select new one
+ */
+static inline void cfq_slice_expired(struct cfq_data *cfqd)
+{
+	struct cfq_queue *cfqq = cfqd->active_queue;
+	unsigned long now = jiffies;
+
+	if (cfqq) {
+		if (cfqq->wait_request)
+			del_timer(&cfqd->timer);
+
+		cfqq->service_used += now - cfqq->slice_start;
+		cfqq->service_rq += cfqd->dispatch_slice;
+		cfqq->service_last = now;
+		cfqq->must_dispatch = 0;
+
+		if (cfqq->on_rr)
+			cfq_resort_rr_list(cfqq);
+
+		cfqq = NULL;
+	}
+
+	if (!list_empty(&cfqd->rr_list)) {
+		cfqq = list_entry_cfqq(cfqd->rr_list.next);
+
+		cfqq->slice_start = now;
+		cfqq->wait_request = 0;
+	}
+
+	cfqd->active_queue = cfqq;
+	cfqd->dispatch_slice = 0;
+}
+
+static int cfq_arm_slice_timer(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+{
+	WARN_ON(!RB_EMPTY(&cfqq->sort_list));
+
+	cfqq->wait_request = 1;
+
+	if (!cfqd->cfq_idle)
+		return 0;
+
+	if (!timer_pending(&cfqd->timer)) {
+		unsigned long now = jiffies, slice_left;
+
+		slice_left = cfqd->cfq_slice - (now - cfqq->slice_start);
+		cfqd->timer.expires = now + min(cfqd->cfq_idle,(unsigned int)slice_left);
+		add_timer(&cfqd->timer);
+	}
+
+	return 1;
+}
+
+/*
+ * get next queue for service
+ */
+static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
+{
+	struct cfq_queue *cfqq = cfqd->active_queue;
+	unsigned long slice_used;
+
+	cfqq = cfqd->active_queue;
+	if (!cfqq)
+		goto new_queue;
+
+	slice_used = jiffies - cfqq->slice_start;
+
+	if (cfqq->must_dispatch)
+		goto must_queue;
+
+	/*
+	 * slice has expired
+	 */
+	if (slice_used >= cfqd->cfq_slice)
+		goto new_queue;
+
+	/*
+	 * if queue has requests, dispatch one. if not, check if
+	 * enough slice is left to wait for one
+	 */
+must_queue:
+	if (!RB_EMPTY(&cfqq->sort_list))
+		goto keep_queue;
+	else if (cfqd->cfq_slice - slice_used >= cfqd->cfq_idle) {
+		if (cfq_arm_slice_timer(cfqd, cfqq))
+			return NULL;
+	}
+
+new_queue:
+	cfq_slice_expired(cfqd);
+keep_queue:
+	return cfqd->active_queue;
+}
+
+/*
  * we dispatch cfqd->cfq_quantum requests in total from the rr_list queues,
  * this function sector sorts the selected request to minimize seeks. we start
  * at cfqd->last_sector, not 0.
@@ -741,9 +794,7 @@
 	list_del(&crq->request->queuelist);
 
 	last = cfqd->last_sector;
-	while ((entry = entry->prev) != head) {
-		__rq = list_entry_rq(entry);
-
+	list_for_each_entry_reverse(__rq, head, queuelist) {
 		if (blk_barrier_rq(crq->request))
 			break;
 		if (!blk_fs_request(crq->request))
@@ -777,95 +828,86 @@
 	if (time_before(now, cfqq->last_fifo_expire + cfqd->cfq_fifo_batch_expire))
 		return NULL;
 
-	crq = RQ_DATA(list_entry(cfqq->fifo[0].next, struct request, queuelist));
-	if (reads && time_after(now, crq->queue_start + cfqd->cfq_fifo_expire_r)) {
-		cfqq->last_fifo_expire = now;
-		return crq;
+	if (reads) {
+		crq = RQ_DATA(list_entry_fifo(cfqq->fifo[READ].next));
+		if (time_after(now, crq->queue_start + cfqd->cfq_fifo_expire_r)) {
+			cfqq->last_fifo_expire = now;
+			return crq;
+		}
 	}
 
-	crq = RQ_DATA(list_entry(cfqq->fifo[1].next, struct request, queuelist));
-	if (writes && time_after(now, crq->queue_start + cfqd->cfq_fifo_expire_w)) {
-		cfqq->last_fifo_expire = now;
-		return crq;
+	if (writes) {
+		crq = RQ_DATA(list_entry_fifo(cfqq->fifo[WRITE].next));
+		if (time_after(now, crq->queue_start + cfqd->cfq_fifo_expire_w)) {
+			cfqq->last_fifo_expire = now;
+			return crq;
+		}
 	}
 
 	return NULL;
 }
 
-/*
- * dispatch a single request from given queue
- */
-static inline void
-cfq_dispatch_request(request_queue_t *q, struct cfq_data *cfqd,
-		     struct cfq_queue *cfqq)
+static int
+__cfq_dispatch_requests(struct cfq_data *cfqd, struct cfq_queue *cfqq,
+			int max_dispatch)
 {
-	struct cfq_rq *crq;
+	int dispatched = 0;
 
-	/*
-	 * follow expired path, else get first next available
-	 */
-	if ((crq = cfq_check_fifo(cfqq)) == NULL) {
-		if (cfqd->find_best_crq)
-			crq = cfqq->next_crq;
-		else
-			crq = rb_entry_crq(rb_first(&cfqq->sort_list));
-	}
+	BUG_ON(RB_EMPTY(&cfqq->sort_list));
 
-	cfqd->last_sector = crq->request->sector + crq->request->nr_sectors;
+	do {
+		struct cfq_rq *crq;
 
-	/*
-	 * finally, insert request into driver list
-	 */
-	cfq_dispatch_sort(q, crq);
+		/*
+		 * follow expired path, else get first next available
+		 */
+		if ((crq = cfq_check_fifo(cfqq)) == NULL) {
+			if (cfqd->find_best_crq)
+				crq = cfqq->next_crq;
+			else
+				crq = rb_entry_crq(rb_first(&cfqq->sort_list));
+		}
+
+		cfqd->last_sector = crq->request->sector + crq->request->nr_sectors;
+
+		/*
+		 * finally, insert request into driver list
+		 */
+		cfq_dispatch_sort(cfqd->queue, crq);
+
+		cfqd->dispatch_slice++;
+		dispatched++;
+
+		if (RB_EMPTY(&cfqq->sort_list))
+			break;
+
+	} while (dispatched < max_dispatch);
+
+	return dispatched;
 }
 
 static int cfq_dispatch_requests(request_queue_t *q, int max_dispatch)
 {
 	struct cfq_data *cfqd = q->elevator->elevator_data;
 	struct cfq_queue *cfqq;
-	struct list_head *entry, *tmp;
-	int queued, busy_queues, first_round;
 
 	if (list_empty(&cfqd->rr_list))
 		return 0;
 
-	queued = 0;
-	first_round = 1;
-restart:
-	busy_queues = 0;
-	list_for_each_safe(entry, tmp, &cfqd->rr_list) {
-		cfqq = list_entry_cfqq(entry);
-
-		BUG_ON(RB_EMPTY(&cfqq->sort_list));
-
-		/*
-		 * first round of queueing, only select from queues that
-		 * don't already have io in-flight
-		 */
-		if (first_round && cfqq->in_flight)
-			continue;
-
-		cfq_dispatch_request(q, cfqd, cfqq);
-
-		if (!RB_EMPTY(&cfqq->sort_list))
-			busy_queues++;
-
-		queued++;
-	}
-
-	if ((queued < max_dispatch) && (busy_queues || first_round)) {
-		first_round = 0;
-		goto restart;
-	}
+	cfqq = cfq_select_queue(cfqd);
+	if (!cfqq)
+		return 0;
 
-	return queued;
+	cfqq->wait_request = 0;
+	cfqq->must_dispatch = 0;
+	del_timer(&cfqd->timer);
+	return __cfq_dispatch_requests(cfqd, cfqq, max_dispatch);
 }
 
 static inline void cfq_account_dispatch(struct cfq_rq *crq)
 {
 	struct cfq_queue *cfqq = crq->cfq_queue;
 	struct cfq_data *cfqd = cfqq->cfqd;
-	unsigned long now, elapsed;
 
 	/*
 	 * accounted bit is necessary since some drivers will call
@@ -874,62 +916,34 @@
 	if (crq->accounted)
 		return;
 
-	now = jiffies;
-	if (cfqq->service_start == ~0UL)
-		cfqq->service_start = now;
-
-	/*
-	 * on drives with tagged command queueing, command turn-around time
-	 * doesn't necessarily reflect the time spent processing this very
-	 * command inside the drive. so do the accounting differently there,
-	 * by just sorting on the number of requests
-	 */
-	if (cfqd->cfq_tagged) {
-		if (time_after(now, cfqq->service_start + cfq_service)) {
-			cfqq->service_start = now;
-			cfqq->service_used /= 10;
-		}
-
-		cfqq->service_used++;
-		cfq_sort_rr_list(cfqq, 0);
-	}
-
-	elapsed = now - crq->queue_start;
-	if (elapsed > max_elapsed_dispatch)
-		max_elapsed_dispatch = elapsed;
-
 	crq->accounted = 1;
-	crq->service_start = now;
-
-	if (++cfqd->rq_in_driver >= CFQ_MAX_TAG && !cfqd->cfq_tagged) {
-		cfqq->cfqd->cfq_tagged = 1;
-		printk("cfq: depth %d reached, tagging now on\n", CFQ_MAX_TAG);
-	}
+	crq->service_start = jiffies;
+	cfqd->rq_in_driver++;
 }
 
 static inline void
 cfq_account_completion(struct cfq_queue *cfqq, struct cfq_rq *crq)
 {
 	struct cfq_data *cfqd = cfqq->cfqd;
+	unsigned long now = jiffies;
 
 	WARN_ON(!cfqd->rq_in_driver);
 	cfqd->rq_in_driver--;
 
-	if (!cfqd->cfq_tagged) {
-		unsigned long now = jiffies;
-		unsigned long duration = now - crq->service_start;
-
-		if (time_after(now, cfqq->service_start + cfq_service)) {
-			cfqq->service_start = now;
-			cfqq->service_used >>= 3;
-		}
+	cfqq->service_used += now - crq->service_start;
 
-		cfqq->service_used += duration;
-		cfq_sort_rr_list(cfqq, 0);
+	/*
+	 * queue was preempted while this request was servicing
+	 */
+	if (cfqd->active_queue != cfqq)
+		return;
 
-		if (duration > max_elapsed_crq)
-			max_elapsed_crq = duration;
-	}
+	/*
+	 * no requests. if last request was a sync request, wait for
+	 * a new one.
+	 */
+	if (RB_EMPTY(&cfqq->sort_list) && crq->is_sync)
+		cfq_arm_slice_timer(cfqd, cfqq);
 }
 
 static struct request *cfq_next_request(request_queue_t *q)
@@ -937,6 +951,9 @@
 	struct cfq_data *cfqd = q->elevator->elevator_data;
 	struct request *rq;
 
+	if (cfqd->rq_in_driver >= cfqd->cfq_max_depth)
+		return NULL;
+
 	if (!list_empty(&q->queue_head)) {
 		struct cfq_rq *crq;
 dispatch:
@@ -964,6 +981,8 @@
  */
 static void cfq_put_queue(struct cfq_queue *cfqq)
 {
+	struct cfq_data *cfqd = cfqq->cfqd;
+
 	BUG_ON(!atomic_read(&cfqq->ref));
 
 	if (!atomic_dec_and_test(&cfqq->ref))
@@ -972,6 +991,9 @@
 	BUG_ON(rb_first(&cfqq->sort_list));
 	BUG_ON(cfqq->on_rr);
 
+	if (unlikely(cfqd->active_queue == cfqq))
+		cfqd->active_queue = NULL;
+
 	cfq_put_cfqd(cfqq->cfqd);
 
 	/*
@@ -1117,6 +1139,7 @@
 		cic->ioc = ioc;
 		cic->cfqq = __cfqq;
 		atomic_inc(&__cfqq->ref);
+		atomic_inc(&cfqd->ref);
 	} else {
 		struct cfq_io_context *__cic;
 		unsigned long flags;
@@ -1159,10 +1182,10 @@
 		__cic->ioc = ioc;
 		__cic->cfqq = __cfqq;
 		atomic_inc(&__cfqq->ref);
+		atomic_inc(&cfqd->ref);
 		spin_lock_irqsave(&ioc->lock, flags);
 		list_add(&__cic->list, &cic->list);
 		spin_unlock_irqrestore(&ioc->lock, flags);
-
 		cic = __cic;
 		*cfqq = __cfqq;
 	}
@@ -1199,8 +1222,11 @@
 			new_cfqq = kmem_cache_alloc(cfq_pool, gfp_mask);
 			spin_lock_irq(cfqd->queue->queue_lock);
 			goto retry;
-		} else
-			goto out;
+		} else {
+			cfqq = kmem_cache_alloc(cfq_pool, gfp_mask);
+			if (!cfqq)
+				goto out;
+		}
 
 		memset(cfqq, 0, sizeof(*cfqq));
 
@@ -1216,7 +1242,8 @@
 		cfqq->cfqd = cfqd;
 		atomic_inc(&cfqd->ref);
 		cfqq->key_type = cfqd->key_type;
-		cfqq->service_start = ~0UL;
+		cfqq->service_last = 0;
+		strncpy(cfqq->name, current->comm, sizeof(cfqq->name) - 1);
 	}
 
 	if (new_cfqq)
@@ -1243,14 +1270,27 @@
 
 static void cfq_enqueue(struct cfq_data *cfqd, struct cfq_rq *crq)
 {
-	crq->is_sync = 0;
-	if (rq_data_dir(crq->request) == READ || current->flags & PF_SYNCWRITE)
-		crq->is_sync = 1;
+	struct cfq_queue *cfqq = crq->cfq_queue;
+	struct request *rq = crq->request;
+
+	crq->is_sync = rq_data_dir(rq) == READ || current->flags & PF_SYNCWRITE;
 
 	cfq_add_crq_rb(crq);
 	crq->queue_start = jiffies;
 
-	list_add_tail(&crq->request->queuelist, &crq->cfq_queue->fifo[crq->is_sync]);
+	list_add_tail(&rq->queuelist, &crq->cfq_queue->fifo[crq->is_sync]);
+
+	/*
+	 * if we are waiting for a request for this queue, let it rip
+	 * immediately and flag that we must not expire this queue just now
+	 */
+	if (cfqq->wait_request && cfqq == cfqd->active_queue) {
+		request_queue_t *q = cfqd->queue;
+
+		cfqq->must_dispatch = 1;
+		del_timer(&cfqd->timer);
+		q->request_fn(q);
+	}
 }
 
 static void
@@ -1339,31 +1379,34 @@
 	struct cfq_data *cfqd = q->elevator->elevator_data;
 	struct cfq_queue *cfqq;
 	int ret = ELV_MQUEUE_MAY;
+	int limit;
 
 	if (current->flags & PF_MEMALLOC)
 		return ELV_MQUEUE_MAY;
 
 	cfqq = cfq_find_cfq_hash(cfqd, cfq_hash_key(cfqd, current));
-	if (cfqq) {
-		int limit = cfqd->max_queued;
-
-		if (cfqq->allocated[rw] < cfqd->cfq_queued)
-			return ELV_MQUEUE_MUST;
-
-		if (cfqd->busy_queues)
-			limit = q->nr_requests / cfqd->busy_queues;
-
-		if (limit < cfqd->cfq_queued)
-			limit = cfqd->cfq_queued;
-		else if (limit > cfqd->max_queued)
-			limit = cfqd->max_queued;
+	if (unlikely(!cfqq))
+		return ELV_MQUEUE_MAY;
 
-		if (cfqq->allocated[rw] >= limit) {
-			if (limit > cfqq->alloc_limit[rw])
-				cfqq->alloc_limit[rw] = limit;
+	if (cfqq->allocated[rw] < cfqd->cfq_queued)
+		return ELV_MQUEUE_MUST;
+	if (cfqq->wait_request)
+		return ELV_MQUEUE_MUST;
+
+	limit = cfqd->max_queued;
+	if (cfqd->busy_queues)
+		limit = q->nr_requests / cfqd->busy_queues;
+
+	if (limit < cfqd->cfq_queued)
+		limit = cfqd->cfq_queued;
+	else if (limit > cfqd->max_queued)
+		limit = cfqd->max_queued;
+
+	if (cfqq->allocated[rw] >= limit) {
+		if (limit > cfqq->alloc_limit[rw])
+			cfqq->alloc_limit[rw] = limit;
 
-			ret = ELV_MQUEUE_NO;
-		}
+		ret = ELV_MQUEUE_NO;
 	}
 
 	return ret;
@@ -1395,12 +1438,12 @@
 		BUG_ON(q->last_merge == rq);
 		BUG_ON(!hlist_unhashed(&crq->hash));
 
-		if (crq->io_context)
-			put_io_context(crq->io_context->ioc);
-
 		BUG_ON(!cfqq->allocated[crq->is_write]);
 		cfqq->allocated[crq->is_write]--;
 
+		if (crq->io_context)
+			put_io_context(crq->io_context->ioc);
+
 		mempool_free(crq, cfqd->crq_pool);
 		rq->elevator_private = NULL;
 
@@ -1473,6 +1516,7 @@
 		crq->is_write = rw;
 		rq->elevator_private = crq;
 		cfqq->alloc_limit[rw] = 0;
+		smp_mb();
 		return 0;
 	}
 
@@ -1486,6 +1530,44 @@
 	return 1;
 }
 
+static void cfq_kick_queue(void *data)
+{
+	request_queue_t *q = data;
+
+	blk_run_queue(q);
+}
+
+static void cfq_schedule_timer(unsigned long data)
+{
+	struct cfq_data *cfqd = (struct cfq_data *) data;
+	struct cfq_queue *cfqq;
+	unsigned long flags;
+
+	spin_lock_irqsave(cfqd->queue->queue_lock, flags);
+
+	if ((cfqq = cfqd->active_queue) != NULL) {
+		/*
+		 * expired
+		 */
+		if (time_after(jiffies, cfqq->slice_start + cfqd->cfq_slice))
+			goto out;
+
+		/*
+		 * not expired and it has a request pending, let it dispatch
+		 */
+		if (!RB_EMPTY(&cfqq->sort_list)) {
+			cfqq->must_dispatch = 1;
+			goto out_cont;
+		}
+	} 
+
+out:
+	cfq_slice_expired(cfqd);
+out_cont:
+	spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
+	kblockd_schedule_work(&cfqd->unplug_work);
+}
+
 static void cfq_put_cfqd(struct cfq_data *cfqd)
 {
 	request_queue_t *q = cfqd->queue;
@@ -1494,6 +1576,8 @@
 	if (!atomic_dec_and_test(&cfqd->ref))
 		return;
 
+	blk_sync_queue(q);
+
 	/*
 	 * kill spare queue, getting it means we have two refences to it.
 	 * drop both
@@ -1567,8 +1651,15 @@
 	q->nr_requests = 1024;
 	cfqd->max_queued = q->nr_requests / 16;
 	q->nr_batching = cfq_queued;
-	cfqd->key_type = CFQ_KEY_TGID;
+	cfqd->key_type = CFQ_KEY_PID;
 	cfqd->find_best_crq = 1;
+
+	init_timer(&cfqd->timer);
+	cfqd->timer.function = cfq_schedule_timer;
+	cfqd->timer.data = (unsigned long) cfqd;
+
+	INIT_WORK(&cfqd->unplug_work, cfq_kick_queue, q);
+
 	atomic_set(&cfqd->ref, 1);
 
 	cfqd->cfq_queued = cfq_queued;
@@ -1578,6 +1669,9 @@
 	cfqd->cfq_fifo_batch_expire = cfq_fifo_rate;
 	cfqd->cfq_back_max = cfq_back_max;
 	cfqd->cfq_back_penalty = cfq_back_penalty;
+	cfqd->cfq_slice = cfq_slice;
+	cfqd->cfq_idle = cfq_idle;
+	cfqd->cfq_max_depth = cfq_max_depth;
 
 	return 0;
 out_spare:
@@ -1624,7 +1718,6 @@
 	return -ENOMEM;
 }
 
-
 /*
  * sysfs parts below -->
  */
@@ -1650,13 +1743,6 @@
 }
 
 static ssize_t
-cfq_clear_elapsed(struct cfq_data *cfqd, const char *page, size_t count)
-{
-	max_elapsed_dispatch = max_elapsed_crq = 0;
-	return count;
-}
-
-static ssize_t
 cfq_set_key_type(struct cfq_data *cfqd, const char *page, size_t count)
 {
 	spin_lock_irq(cfqd->queue->queue_lock);
@@ -1664,6 +1750,8 @@
 		cfqd->key_type = CFQ_KEY_PGID;
 	else if (!strncmp(page, "tgid", 4))
 		cfqd->key_type = CFQ_KEY_TGID;
+	else if (!strncmp(page, "pid", 3))
+		cfqd->key_type = CFQ_KEY_PID;
 	else if (!strncmp(page, "uid", 3))
 		cfqd->key_type = CFQ_KEY_UID;
 	else if (!strncmp(page, "gid", 3))
@@ -1688,6 +1776,52 @@
 	return len;
 }
 
+static ssize_t
+cfq_status_show(struct cfq_data *cfqd, char *page)
+{
+	struct list_head *entry;
+	struct cfq_queue *cfqq;
+	ssize_t len;
+	int i = 0, queues;
+
+	len = sprintf(page, "Busy queues: %u\n", cfqd->busy_queues);
+	len += sprintf(page+len, "key type: %s\n",
+				cfq_key_types[cfqd->key_type]);
+	len += sprintf(page+len, "last sector: %Lu\n",
+				(unsigned long long)cfqd->last_sector);
+
+	len += sprintf(page+len, "Busy queue list:\n");
+	spin_lock_irq(cfqd->queue->queue_lock);
+	list_for_each(entry, &cfqd->rr_list) {
+		i++;
+		cfqq = list_entry_cfqq(entry);
+		len += sprintf(page+len, "  cfqq: key=%lu alloc=%d/%d, "
+			"queued=%d/%d, service=%lu/%lu\n",
+			cfqq->key, cfqq->allocated[0], cfqq->allocated[1],
+			cfqq->queued[0], cfqq->queued[1], cfqq->service_used,
+			cfqq->service_rq);
+	}
+	len += sprintf(page+len, "  busy queues total: %d\n", i);
+	queues = i;
+	
+	len += sprintf(page+len, "Empty queue list:\n");
+	i = 0;
+	list_for_each(entry, &cfqd->empty_list) {
+		i++;
+		cfqq = list_entry_cfqq(entry);
+		len += sprintf(page+len, "  cfqq: key=%lu alloc=%d/%d, "
+			"queued=%d/%d, service=%lu/%lu\n",
+			cfqq->key, cfqq->allocated[0], cfqq->allocated[1],
+			cfqq->queued[0], cfqq->queued[1], cfqq->service_used,
+			cfqq->service_rq);
+	}
+	len += sprintf(page+len, "  empty queues total: %d\n", i);
+	queues += i;
+	len += sprintf(page+len, "Total queues: %d\n", queues);
+	spin_unlock_irq(cfqd->queue->queue_lock);
+	return len;
+}
+
 #define SHOW_FUNCTION(__FUNC, __VAR, __CONV)				\
 static ssize_t __FUNC(struct cfq_data *cfqd, char *page)		\
 {									\
@@ -1704,6 +1838,9 @@
 SHOW_FUNCTION(cfq_find_best_show, cfqd->find_best_crq, 0);
 SHOW_FUNCTION(cfq_back_max_show, cfqd->cfq_back_max, 0);
 SHOW_FUNCTION(cfq_back_penalty_show, cfqd->cfq_back_penalty, 0);
+SHOW_FUNCTION(cfq_idle_show, cfqd->cfq_idle, 1);
+SHOW_FUNCTION(cfq_slice_show, cfqd->cfq_slice, 1);
+SHOW_FUNCTION(cfq_max_depth_show, cfqd->cfq_max_depth, 0);
 #undef SHOW_FUNCTION
 
 #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
@@ -1729,6 +1866,9 @@
 STORE_FUNCTION(cfq_find_best_store, &cfqd->find_best_crq, 0, 1, 0);
 STORE_FUNCTION(cfq_back_max_store, &cfqd->cfq_back_max, 0, UINT_MAX, 0);
 STORE_FUNCTION(cfq_back_penalty_store, &cfqd->cfq_back_penalty, 1, UINT_MAX, 0);
+STORE_FUNCTION(cfq_idle_store, &cfqd->cfq_idle, 0, cfqd->cfq_slice/2, 1);
+STORE_FUNCTION(cfq_slice_store, &cfqd->cfq_slice, 1, UINT_MAX, 1);
+STORE_FUNCTION(cfq_max_depth_store, &cfqd->cfq_max_depth, 2, UINT_MAX, 0);
 #undef STORE_FUNCTION
 
 static struct cfq_fs_entry cfq_quantum_entry = {
@@ -1771,15 +1911,30 @@
 	.show = cfq_back_penalty_show,
 	.store = cfq_back_penalty_store,
 };
-static struct cfq_fs_entry cfq_clear_elapsed_entry = {
-	.attr = {.name = "clear_elapsed", .mode = S_IWUSR },
-	.store = cfq_clear_elapsed,
+static struct cfq_fs_entry cfq_slice_entry = {
+	.attr = {.name = "slice", .mode = S_IRUGO | S_IWUSR },
+	.show = cfq_slice_show,
+	.store = cfq_slice_store,
+};
+static struct cfq_fs_entry cfq_idle_entry = {
+	.attr = {.name = "idle", .mode = S_IRUGO | S_IWUSR },
+	.show = cfq_idle_show,
+	.store = cfq_idle_store,
+};
+static struct cfq_fs_entry cfq_misc_entry = {
+	.attr = {.name = "show_status", .mode = S_IRUGO },
+	.show = cfq_status_show,
 };
 static struct cfq_fs_entry cfq_key_type_entry = {
 	.attr = {.name = "key_type", .mode = S_IRUGO | S_IWUSR },
 	.show = cfq_read_key_type,
 	.store = cfq_set_key_type,
 };
+static struct cfq_fs_entry cfq_max_depth_entry = {
+	.attr = {.name = "max_depth", .mode = S_IRUGO | S_IWUSR },
+	.show = cfq_max_depth_show,
+	.store = cfq_max_depth_store,
+};
 
 static struct attribute *default_attrs[] = {
 	&cfq_quantum_entry.attr,
@@ -1791,7 +1946,10 @@
 	&cfq_find_best_entry.attr,
 	&cfq_back_max_entry.attr,
 	&cfq_back_penalty_entry.attr,
-	&cfq_clear_elapsed_entry.attr,
+	&cfq_misc_entry.attr,
+	&cfq_slice_entry.attr,
+	&cfq_idle_entry.attr,
+	&cfq_max_depth_entry.attr,
 	NULL,
 };
 
@@ -1856,7 +2014,7 @@
 	.elevator_owner =	THIS_MODULE,
 };
 
-int cfq_init(void)
+static int __init cfq_init(void)
 {
 	int ret;
 
@@ -1864,17 +2022,35 @@
 		return -ENOMEM;
 
 	ret = elv_register(&iosched_cfq);
-	if (!ret) {
-		__module_get(THIS_MODULE);
-		return 0;
-	}
+	if (ret)
+		cfq_slab_kill();
 
-	cfq_slab_kill();
 	return ret;
 }
 
 static void __exit cfq_exit(void)
 {
+	struct task_struct *g, *p;
+	unsigned long flags;
+
+	read_lock_irqsave(&tasklist_lock, flags);
+
+	/*
+	 * iterate each process in the system, removing our io_context
+	 */
+	do_each_thread(g, p) {
+		struct io_context *ioc = p->io_context;
+
+		if (ioc && ioc->cic) {
+			ioc->cic->exit(ioc->cic);
+			cfq_free_io_context(ioc->cic);
+			ioc->cic = NULL;
+		}
+
+	} while_each_thread(g, p);
+
+	read_unlock_irqrestore(&tasklist_lock, flags);
+
 	cfq_slab_kill();
 	elv_unregister(&iosched_cfq);
 }


-- 
Jens Axboe



* Re: Time sliced CFQ io scheduler
  2004-12-02 13:04 Time sliced CFQ io scheduler Jens Axboe
@ 2004-12-02 13:48 ` Jens Axboe
  2004-12-02 19:48   ` Andrew Morton
  2004-12-02 14:28 ` Giuliano Pochini
  1 sibling, 1 reply; 66+ messages in thread
From: Jens Axboe @ 2004-12-02 13:48 UTC (permalink / raw)
  To: Linux Kernel; +Cc: Nick Piggin

Hi,

One more test case, while the box is booted... This just demonstrates a
process doing a file write (bs=64k) with a competing process doing a
file read (bs=64k) at the same time, again capped at 30sec.
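
(Again just an illustrative sketch of the competing pair, not the real
test code: the parent does 64k O_DIRECT reads from one file while the
child does 64k O_DIRECT writes to another. File names are made up and
latency isn't tracked here.)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <time.h>

static void run(const char *name, int flags, int is_write)
{
	unsigned long long bytes = 0;
	time_t start = time(NULL);
	void *buf;
	int fd;

	fd = open(name, flags, 0644);
	if (fd < 0 || posix_memalign(&buf, 4096, 65536) != 0) {
		perror(name);
		exit(1);
	}
	memset(buf, 0, 65536);

	while (time(NULL) - start < 30) {
		ssize_t ret = is_write ? write(fd, buf, 65536)
				       : read(fd, buf, 65536);
		if (ret <= 0) {
			lseek(fd, 0, SEEK_SET);	/* wrap on EOF */
			continue;
		}
		bytes += ret;
	}
	printf("%s: %llu KiB/sec\n", is_write ? "Writer" : "Reader",
	       bytes / 1024 / 30);
	exit(0);
}

int main(void)
{
	if (fork() == 0)
		run("writefile", O_WRONLY | O_CREAT | O_DIRECT, 1);
	run("readfile", O_RDONLY | O_DIRECT, 0);
	return 0;
}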

deadline:
Reader:  2520KiB/sec (max_lat=45msec)
Writer:  1258KiB/sec (max_lat=85msec)

as:
Reader: 27985KiB/sec (max_lat=34msec)
Writer:    64KiB/sec (max_lat=1042msec)

cfq:
Reader: 12703KiB/sec (max_lat=108msec)
Writer:  9743KiB/sec (max_lat=89msec)

If you look at vmstat while running these tests, cfq and deadline give
equal bandwidth for the reader and writer all the time, while as
basically doesn't give anything to the writer (a single block per second
only). Nick, is the write batching broken or something?

-- 
Jens Axboe



* Re: Time sliced CFQ io scheduler
  2004-12-02 13:04 Time sliced CFQ io scheduler Jens Axboe
  2004-12-02 13:48 ` Jens Axboe
@ 2004-12-02 14:28 ` Giuliano Pochini
  2004-12-02 14:41   ` Jens Axboe
  1 sibling, 1 reply; 66+ messages in thread
From: Giuliano Pochini @ 2004-12-02 14:28 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Linux Kernel



On Thu, 2 Dec 2004, Jens Axboe wrote:

> Case 4: write_files, random, bs=4k

Just a thought... in this test the results don't look right. Why is
aggregate bandwidth with 8 clients higher than with 4 and 2 clients?
In the cfq test with 8 clients, aggregate bw is also higher than with
a single client.


--
Giuliano.


* Re: Time sliced CFQ io scheduler
  2004-12-02 14:28 ` Giuliano Pochini
@ 2004-12-02 14:41   ` Jens Axboe
  2004-12-04 13:05     ` Giuliano Pochini
  0 siblings, 1 reply; 66+ messages in thread
From: Jens Axboe @ 2004-12-02 14:41 UTC (permalink / raw)
  To: Giuliano Pochini; +Cc: Linux Kernel

On Thu, Dec 02 2004, Giuliano Pochini wrote:
> 
> 
> On Thu, 2 Dec 2004, Jens Axboe wrote:
> 
> > Case 4: write_files, random, bs=4k
> 
> Just a thought... in this test the results don't look right. Why is
> aggregate bandwidth with 8 clients higher than with 4 and 2 clients?
> In the cfq test with 8 clients, aggregate bw is also higher than with
> a single client.

I don't know what happens in the 4 client case, but it's not that
unlikely that aggregate bandwidth will be higher for more threads doing
random writes, as request coalescing will help minimize seeks.

But I did think the dip in the 4 client case was strange; it was
reproducible though (as are all the results, they have very little
variance).

-- 
Jens Axboe



* Re: Time sliced CFQ io scheduler
  2004-12-02 13:48 ` Jens Axboe
@ 2004-12-02 19:48   ` Andrew Morton
  2004-12-02 19:52     ` Jens Axboe
  0 siblings, 1 reply; 66+ messages in thread
From: Andrew Morton @ 2004-12-02 19:48 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, nickpiggin

Jens Axboe <axboe@suse.de> wrote:
>
> as:
> Reader: 27985KiB/sec (max_lat=34msec)
> Writer:    64KiB/sec (max_lat=1042msec)
> 
> cfq:
> Reader: 12703KiB/sec (max_lat=108msec)
> Writer:  9743KiB/sec (max_lat=89msec)
> 
> If you look at vmstat while running these tests, cfq and deadline give
> equal bandwidth for the reader and writer all the time, while as
> basically doesn't give anything to the writer (a single block per second
> only). Nick, is the write batching broken or something?

Looks like it.  We used to do 2/3rds-read, 1/3rd-write in that testcase.


* Re: Time sliced CFQ io scheduler
  2004-12-02 19:48   ` Andrew Morton
@ 2004-12-02 19:52     ` Jens Axboe
  2004-12-02 20:19       ` Andrew Morton
  2004-12-08  0:37       ` Andrea Arcangeli
  0 siblings, 2 replies; 66+ messages in thread
From: Jens Axboe @ 2004-12-02 19:52 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, nickpiggin

On Thu, Dec 02 2004, Andrew Morton wrote:
> Jens Axboe <axboe@suse.de> wrote:
> >
> > as:
> > Reader: 27985KiB/sec (max_lat=34msec)
> > Writer:    64KiB/sec (max_lat=1042msec)
> > 
> > cfq:
> > Reader: 12703KiB/sec (max_lat=108msec)
> > Writer:  9743KiB/sec (max_lat=89msec)
> > 
> > If you look at vmstat while running these tests, cfq and deadline give
> > equal bandwidth for the reader and writer all the time, while as
> > basically doesn't give anything to the writer (a single block per second
> > only). Nick, is the write batching broken or something?
> 
> Looks like it.  We used to do 2/3rds-read, 1/3rd-write in that testcase.

But 'as' has had no real changes in about 9 months' time; it's really
strange. Twiddling with the write expire and write batch expire settings
makes no real difference. Upping the ante to 4 clients, two readers and
two writers, works out about the same: 27MiB/sec aggregate read
bandwidth, ~100KiB/sec write.

At least something needs to be done about it. I don't know what kernel
this is a regression against, but it means that current 2.6 with its
default io scheduler has basically zero write performance in the
presence of reads.

-- 
Jens Axboe



* Re: Time sliced CFQ io scheduler
  2004-12-02 20:19       ` Andrew Morton
@ 2004-12-02 20:19         ` Jens Axboe
  2004-12-02 20:34           ` Andrew Morton
  2004-12-02 22:18         ` Prakash K. Cheemplavam
  1 sibling, 1 reply; 66+ messages in thread
From: Jens Axboe @ 2004-12-02 20:19 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, nickpiggin

On Thu, Dec 02 2004, Andrew Morton wrote:
> Jens Axboe <axboe@suse.de> wrote:
> >
> > n Thu, Dec 02 2004, Andrew Morton wrote:
> > > Jens Axboe <axboe@suse.de> wrote:
> > > >
> > > > as:
> > > > Reader: 27985KiB/sec (max_lat=34msec)
> > > > Writer:    64KiB/sec (max_lat=1042msec)
> > > > 
> > > > cfq:
> > > > Reader: 12703KiB/sec (max_lat=108msec)
> > > > Writer:  9743KiB/sec (max_lat=89msec)
> > > > 
> > > > If you look at vmstat while running these tests, cfq and deadline give
> > > > equal bandwidth for the reader and writer all the time, while as
> > > > basically doesn't give anything to the writer (a single block per second
> > > > only). Nick, is the write batching broken or something?
> > > 
> > > Looks like it.  We used to do 2/3rds-read, 1/3rd-write in that testcase.
> > 
> > But 'as' has had no real changes in about 9 months time, it's really
> > strange. Twiddling with write expire and write batch expire settings
> > make no real difference. Upping the ante to 4 clients, two readers and
> > two writers work about the same: 27MiB/sec aggregate read bandwidth,
> > ~100KiB/sec write.
> 
> Did a quick test here, things seem OK.
> 
> Writer:
> 
> 	while true
> 	do
> 	dd if=/dev/zero of=foo bs=1M count=1000 conv=notrunc
> 	done
> 
> Reader:
> 
> 	time cat 1-gig-file > /dev/null
> 	cat x > /dev/null  0.07s user 1.55s system 3% cpu 45.434 total
> 
> `vmstat 1' says:
> 
> 
>  0  5   1168   3248    472 220972    0    0    28 24896 1249   212  0  7  0 94
>  0  7   1168   3248    492 220952    0    0    28 28056 1284   204  0  5  0 96
>  0  8   1168   3248    500 221012    0    0    28 30632 1255   194  0  5  0 95
>  0  7   1168   2800    508 221344    0    0    16 20432 1183   170  0  3  0 97
>  0  8   1168   3024    484 221164    0    0 15008 12488 1246   460  0  4  0 96
>  1  8   1168   2252    484 221980    0    0 27808  6092 1270   624  0  4  0 96
>  0  8   1168   3248    468 221044    0    0 32420  4596 1290   690  0  4  0 96
>  0  9   1164   2084    456 222212    4    0 28964  1800 1285   596  0  3  0 96
>  1  7   1164   3032    392 221256    0    0 23456  6820 1270   527  0  4  0 96

[snip]

Looks fine, yes.

> So what are you doing different?

Doing sync io, most likely. My results above are 64k O_DIRECT reads and
writes, see the mention of the test cases in the first mail.  I'll
repeat the testing with both sync and async writes tomorrow on the same
box and see what that does to fairness.

Async writes are not very interesting, it takes quite some effort to
make those go slow :-)

-- 
Jens Axboe



* Re: Time sliced CFQ io scheduler
  2004-12-02 19:52     ` Jens Axboe
@ 2004-12-02 20:19       ` Andrew Morton
  2004-12-02 20:19         ` Jens Axboe
  2004-12-02 22:18         ` Prakash K. Cheemplavam
  2004-12-08  0:37       ` Andrea Arcangeli
  1 sibling, 2 replies; 66+ messages in thread
From: Andrew Morton @ 2004-12-02 20:19 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, nickpiggin

Jens Axboe <axboe@suse.de> wrote:
>
> n Thu, Dec 02 2004, Andrew Morton wrote:
> > Jens Axboe <axboe@suse.de> wrote:
> > >
> > > as:
> > > Reader: 27985KiB/sec (max_lat=34msec)
> > > Writer:    64KiB/sec (max_lat=1042msec)
> > > 
> > > cfq:
> > > Reader: 12703KiB/sec (max_lat=108msec)
> > > Writer:  9743KiB/sec (max_lat=89msec)
> > > 
> > > If you look at vmstat while running these tests, cfq and deadline give
> > > equal bandwidth for the reader and writer all the time, while as
> > > basically doesn't give anything to the writer (a single block per second
> > > only). Nick, is the write batching broken or something?
> > 
> > Looks like it.  We used to do 2/3rds-read, 1/3rd-write in that testcase.
> 
> But 'as' has had no real changes in about 9 months time, it's really
> strange. Twiddling with write expire and write batch expire settings
> make no real difference. Upping the ante to 4 clients, two readers and
> two writers work about the same: 27MiB/sec aggregate read bandwidth,
> ~100KiB/sec write.

Did a quick test here, things seem OK.

Writer:

	while true
	do
	dd if=/dev/zero of=foo bs=1M count=1000 conv=notrunc
	done

Reader:

	time cat 1-gig-file > /dev/null
	cat x > /dev/null  0.07s user 1.55s system 3% cpu 45.434 total

`vmstat 1' says:


 0  5   1168   3248    472 220972    0    0    28 24896 1249   212  0  7  0 94
 0  7   1168   3248    492 220952    0    0    28 28056 1284   204  0  5  0 96
 0  8   1168   3248    500 221012    0    0    28 30632 1255   194  0  5  0 95
 0  7   1168   2800    508 221344    0    0    16 20432 1183   170  0  3  0 97
 0  8   1168   3024    484 221164    0    0 15008 12488 1246   460  0  4  0 96
 1  8   1168   2252    484 221980    0    0 27808  6092 1270   624  0  4  0 96
 0  8   1168   3248    468 221044    0    0 32420  4596 1290   690  0  4  0 96
 0  9   1164   2084    456 222212    4    0 28964  1800 1285   596  0  3  0 96
 1  7   1164   3032    392 221256    0    0 23456  6820 1270   527  0  4  0 96
 0  9   1164   3200    388 221124    0    0 27040  7868 1269   588  0  3  0 97
 0  9   1164   2540    384 221808    0    0 21536  4024 1247   540  0  4  0 96
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1 10   1164   2052    392 222276    0    0 33572  5268 1298   745  0  4  0 96
 0  9   1164   3032    400 221316    0    0 28704  5448 1282   611  0  4  0 97
 1  9   1164   2076    388 222144    0    0  9992 17176 1229   325  0  2  0 98
 1  8   1164   3060    376 221136    0    0  9100 13168 1221   284  0  2  0 98
 0  8   1164   2628    384 221536    0    0 28964  3348 1280   635  0  4  0 97
 0  8   1164   2920    344 221372    0    0 27052  5744 1275   657  0  6  0 95
 1  8   1164   3072    328 221252    0    0 26664  5256 1270   653  0  5  0 95
 0  9   1160   2176    356 222100   12    0 26928  6320 1276   605  0  5  0 95
 0  9   1160   2268    332 221920    0    0 17300  9580 1242   428  0  3  0 98
 0  8   1160   3256    332 221036    0    0 23588  9280 1345   586  0  4  0 96
 0  8   1160   3220    320 221116    0    0 16916  9476 1251   425  0  3  0 97
 0 10   1160   3000    320 221388    0    0 21416  8168 1260   565  0  5  0 95
 0 11   1160   2020    324 222268    0    0 23580 10144 1269   528  0  3  0 97
 0 11   1160   2076    340 222252    0    0 20900  3896 1244   486  1  3  0 97
 0 10   1160   2656    356 221692    0    0 23968  8108 1272   564  0  5  0 95
 0 10   1160   3464    348 220892    0    0 26140 10272 1513   618  0  2  0 98
 0 10   1160   3124    320 221260    0    0 15512 11368 1227   442  0  3  0 97
 0 10   1156   3072    336 221148   32    0 22212  6776 1280   539  0  4  0 97
 0 11   1156   2544    352 221608    0    0 25004  7224 1262   596  0  4  0 95
 0 12   1156   2132    364 222140    0    0 20636  9500 1246   510  0  3  0 97
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1 12   1156   2132    372 222064    0    0 25880  6104 1291   550  0  3  0 97
 0 11   1156   3260    368 220980    0    0 19868 12860 1277   496  0  3  0 97
 0 12   1156   2328    360 221872    0    0 20764  7712 1256   513  0  4  0 97
 0 10   1156   3072    356 221128    0    0 17056  7800 1239   474  0  4  0 96
 0 11   1156   2180    336 221964    0    0 20252 10464 1252   520  0  4  0 96
 0 11   1156   2076    360 222144    0    0 22512  7448 1250   554  1  4  0 96
 0 10   1156   2620    364 221692    0    0 23372  4236 1256   543  0  4  0 96
 0 11   1156   2136    344 222120    0    0 22172  8060 1260   528  0  3  0 97
 0 10   1156   3340    316 221060    0    0 17688 12036 1242   474  0  3  0 97
 0 10   1156   2580    296 221760    0    0 18460  5608 1243   501  0  3  0 97
 0 10   1156   2960    308 221408    0    0 17176 10544 1233   462  0  3  0 97
 0 11   1156   2132    308 222224    0    0 32376  2048 1291   715  0  4  0 96
 0 10   1156   3280    300 221008    0    0 23628 10768 1278   556  0  4  0 96
 0 11   1156   2132    320 222144    0    0 18076 10888 1365   481  0  3  0 97
 0 11   1156   2504    312 221880    0    0 23448 10068 1256   526  0  3  0 97
 0 10   1156   2532    324 221664    0    0 18084  6012 1259   476  0  5  0 96
 0 10   1156   2580    332 221792    0    0 26400  6776 1279   626  0  4  0 96
 0 10   1156   3312    324 221052    0    0 22044  6036 1247   508  0  4  0 96
 0 10   1152   2144    328 222204    4    0 11996 15068 1235   394  0  4  0 97
 0  5   1152   3128    344 221236    0    0    20 24068 1200   172  0  3  2 95


So what are you doing different?



* Re: Time sliced CFQ io scheduler
  2004-12-02 20:19         ` Jens Axboe
@ 2004-12-02 20:34           ` Andrew Morton
  2004-12-02 20:37             ` Jens Axboe
  0 siblings, 1 reply; 66+ messages in thread
From: Andrew Morton @ 2004-12-02 20:34 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, nickpiggin

Jens Axboe <axboe@suse.de> wrote:
>
> > So what are you doing different?
> 
> Doing sync io, most likely. My results above are 64k O_DIRECT reads and
> writes, see the mention of the test cases in the first mail.

OK.

Writer:

	while true
	do
	write-and-fsync -o -m 100 -c 65536 foo 
	done

Reader:

	time-read -o -b 65536 -n 256 x      (This is O_DIRECT)
or:	time-read -b 65536 -n 256 x	    (This is buffered)

`vmstat 1':

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  1   1032 137412   4276  84388   32    0 15456 25344 1659  1538  0  3 50 47
 0  1   1032 137468   4276  84388    0    0     0 32128 1521  1027  0  2 51 48
 0  1   1032 137476   4276  84388    0    0     0 32064 1519  1026  0  1 50 49
 0  1   1032 137476   4276  84388    0    0     0 33920 1556  1102  0  2 50 49
 0  1   1032 137476   4276  84388    0    0     0 33088 1541  1074  0  1 50 49
 0  2   1032 135676   4284  85944    0    0  1656 29732 1868  2506  0  3 49 47
 1  1   1032  96532   4292 125172    0    0 39220   128 10813 39313  0 31 35 34
 0  2   1032  57724   4332 163892    0    0 38828   128 10716 38907  0 28 38 35
 0  2   1032  18860   4368 202684    0    0 38768   128 10701 38845  1 28 38 35
 0  2   1032   3672   4248 217764    0    0 39188   128 10803 39327  0 28 37 34
 0  1   1032   2832   4260 218840    0    0 16812 17932 5504 17457  0 14 46 40
 0  1   1032   2832   4260 218840    0    0     0 30876 1501   974  0  1 50 49
 0  1   1032   2944   4260 218840    0    0     0 33472 1537  1068  0  2 50 48
 0  1   1032   2944   4260 218840    0    0     0 33216 1533  1046  0  2 50 48
 
Ugly.

(write-and-fsync and time-read are from
http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz)


* Re: Time sliced CFQ io scheduler
  2004-12-02 20:34           ` Andrew Morton
@ 2004-12-02 20:37             ` Jens Axboe
  2004-12-07 23:11               ` Nick Piggin
  0 siblings, 1 reply; 66+ messages in thread
From: Jens Axboe @ 2004-12-02 20:37 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, nickpiggin

On Thu, Dec 02 2004, Andrew Morton wrote:
> Jens Axboe <axboe@suse.de> wrote:
> >
> > > So what are you doing different?
> > 
> > Doing sync io, most likely. My results above are 64k O_DIRECT reads and
> > writes, see the mention of the test cases in the first mail.
> 
> OK.
> 
> Writer:
> 
> 	while true
> 	do
> 	write-and-fsync -o -m 100 -c 65536 foo 
> 	done
> 
> Reader:
> 
> 	time-read -o -b 65536 -n 256 x      (This is O_DIRECT)
> or:	time-read -b 65536 -n 256 x	    (This is buffered)
> 
> `vmstat 1':
> 
> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
>  1  1   1032 137412   4276  84388   32    0 15456 25344 1659  1538  0  3 50 47
>  0  1   1032 137468   4276  84388    0    0     0 32128 1521  1027  0  2 51 48
>  0  1   1032 137476   4276  84388    0    0     0 32064 1519  1026  0  1 50 49
>  0  1   1032 137476   4276  84388    0    0     0 33920 1556  1102  0  2 50 49
>  0  1   1032 137476   4276  84388    0    0     0 33088 1541  1074  0  1 50 49
>  0  2   1032 135676   4284  85944    0    0  1656 29732 1868  2506  0  3 49 47
>  1  1   1032  96532   4292 125172    0    0 39220   128 10813 39313  0 31 35 34
>  0  2   1032  57724   4332 163892    0    0 38828   128 10716 38907  0 28 38 35
>  0  2   1032  18860   4368 202684    0    0 38768   128 10701 38845  1 28 38 35
>  0  2   1032   3672   4248 217764    0    0 39188   128 10803 39327  0 28 37 34
>  0  1   1032   2832   4260 218840    0    0 16812 17932 5504 17457  0 14 46 40

Well there you go, exactly what I saw. The writer(s) basically make no
progress as long as the reader is going. Since 'as' treats sync writes
like reads internally, and given the really bad fairness problems
demonstrated for same-direction clients, it might be the same problem.

> Ugly.
> 
> (write-and-fsync and time-read are from
> http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz)

I'll try and post my cruddy test programs tomorrow as well. Pretty handy
for getting a good feel for N client read/write performance.

-- 
Jens Axboe



* Re: Time sliced CFQ io scheduler
  2004-12-02 20:19       ` Andrew Morton
  2004-12-02 20:19         ` Jens Axboe
@ 2004-12-02 22:18         ` Prakash K. Cheemplavam
  2004-12-03  7:01           ` Jens Axboe
  1 sibling, 1 reply; 66+ messages in thread
From: Prakash K. Cheemplavam @ 2004-12-02 22:18 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jens Axboe, linux-kernel, nickpiggin


Andrew Morton schrieb:
> Jens Axboe <axboe@suse.de> wrote:
> 
>>n Thu, Dec 02 2004, Andrew Morton wrote:
>>
>>>Jens Axboe <axboe@suse.de> wrote:
>>>
>>>>as:
>>>>Reader: 27985KiB/sec (max_lat=34msec)
>>>>Writer:    64KiB/sec (max_lat=1042msec)
>>>>
>>>>cfq:
>>>>Reader: 12703KiB/sec (max_lat=108msec)
>>>>Writer:  9743KiB/sec (max_lat=89msec)
> 
> 
> Did a quick test here, things seem OK.
> 
> Writer:
> 
> 	while true
> 	do
> 	dd if=/dev/zero of=foo bs=1M count=1000 conv=notrunc
> 	done
> 
> Reader:
> 
> 	time cat 1-gig-file > /dev/null
> 	cat x > /dev/null  0.07s user 1.55s system 3% cpu 45.434 total
> 
> `vmstat 1' says:
> 
> 
>  0  5   1168   3248    472 220972    0    0    28 24896 1249   212  0  7  0 94
>  0  7   1168   3248    492 220952    0    0    28 28056 1284   204  0  5  0 96
>  0  8   1168   3248    500 221012    0    0    28 30632 1255   194  0  5  0 95
>  0  7   1168   2800    508 221344    0    0    16 20432 1183   170  0  3  0 97
>  0  8   1168   3024    484 221164    0    0 15008 12488 1246   460  0  4  0 96
>  1  8   1168   2252    484 221980    0    0 27808  6092 1270   624  0  4  0 96
>  0  8   1168   3248    468 221044    0    0 32420  4596 1290   690  0  4  0 96
>  0  9   1164   2084    456 222212    4    0 28964  1800 1285   596  0  3  0 96
>  1  7   1164   3032    392 221256    0    0 23456  6820 1270   527  0  4  0 96
>  0  9   1164   3200    388 221124    0    0 27040  7868 1269   588  0  3  0 97
>  0  9   1164   2540    384 221808    0    0 21536  4024 1247   540  0  4  0 96
> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
>  1 10   1164   2052    392 222276    0    0 33572  5268 1298   745  0  4  0 96
>  0  9   1164   3032    400 221316    0    0 28704  5448 1282   611  0  4  0 97
> 

I am happy that finally the kernel devs see that there is a problem. In
my case (as I mentioned in another thread) the reader is pretty much
starved while a writer is active (especially my email client has trouble
while writing is going on). This is vmstat using your scripts above
(though using the cfq2 scheduler):

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
  r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
  1  3   3080   2528   1256 818976    0    0  6404 101236 1332   929  1 26  0 73
  0  3   3080   2768   1252 820104    0    0  2820 32632 1328  1087  1 20  0 79
  2  1   3080   6992   1292 814808    0    0  4928 106912 1337  1364 16 29  0 55
  0  3   3080   2772   1252 818516    0    0  3076 42176 1357  1351  2 41  0 57
  0  3   3080   2644   1256 817548    0    0  3332 110104 1375   873  1 36  0 63
  0  3   3080   2592   1248 815928    0    0  2820 76860 1324   894  1 41  0 58
  7  3   3080   2208   1248 817176    0    0  3328 134144 1352  1058  2 30  0 68
  4  4   3080   2516   1248 817516    0    0  3588 47768 1327  1244  1 19  0 80
  0  3   3080   2400   1220 818688    0    0  3844 24760 1312  1251  1 23  0 76
  0  3   3080   2656   1196 816468    0    0  3588 132372 1352  1126  1 52  0 47
  0  3   3080   2688   1188 815316    0    0  3076 77824 1328   933  1 35  0 64
  0  3   3080   2336   1156 816744    0    0  2924 114688 1333  1038  1 25  0 74
  0  3   3080   2528   1184 816812    0    0  2508 67736 1340   882  1 12  0 87
  0  3   3080   2208   1156 817712    0    0  3592 75624 1326  2289  1 36  0 63
  0  3   3080   2664   1156 818240    0    0  5124 15692 1302   992  1 18  0 81
  0  3   3080   2580   1160 815832    0    0  4356 155792 1375  1064  1 39  0 60
  0  3   3080   2472   1160 817124    0    0  3076 100852 1345  1138  1 23  0 76
  2  4   3080   2836   1148 816228    0    0  3336 100412 1352  1379  1 47  0 52
  0  4   3080   2708   1144 815964    0    0  3844 48908 1343   871  1 25  0 74
  0  3   3080   2748   1152 815984    0    0  3332 71996 1338   843  1 27  0 72

Cheers,

Prakash

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-02 22:18         ` Prakash K. Cheemplavam
@ 2004-12-03  7:01           ` Jens Axboe
  2004-12-03  9:12             ` Prakash K. Cheemplavam
  0 siblings, 1 reply; 66+ messages in thread
From: Jens Axboe @ 2004-12-03  7:01 UTC (permalink / raw)
  To: Prakash K. Cheemplavam; +Cc: Andrew Morton, linux-kernel, nickpiggin

On Thu, Dec 02 2004, Prakash K. Cheemplavam wrote:
> Andrew Morton schrieb:
> >Jens Axboe <axboe@suse.de> wrote:
> >
> >>n Thu, Dec 02 2004, Andrew Morton wrote:
> >>
> >>>Jens Axboe <axboe@suse.de> wrote:
> >>>
> >>>>as:
> >>>>Reader: 27985KiB/sec (max_lat=34msec)
> >>>>Writer:    64KiB/sec (max_lat=1042msec)
> >>>>
> >>>>cfq:
> >>>>Reader: 12703KiB/sec (max_lat=108msec)
> >>>>Writer:  9743KiB/sec (max_lat=89msec)
> >
> >
> >Did a quick test here, things seem OK.
> >
> >Writer:
> >
> >	while true
> >	do
> >	dd if=/dev/zero of=foo bs=1M count=1000 conv=notrunc
> >	done
> >
> >Reader:
> >
> >	time cat 1-gig-file > /dev/null
> >	cat x > /dev/null  0.07s user 1.55s system 3% cpu 45.434 total
> >
> >`vmstat 1' says:
> >
> >
> > 0  5   1168   3248    472 220972    0    0    28 24896 1249   212  0  7  0 94
> > 0  7   1168   3248    492 220952    0    0    28 28056 1284   204  0  5  0 96
> > 0  8   1168   3248    500 221012    0    0    28 30632 1255   194  0  5  0 95
> > 0  7   1168   2800    508 221344    0    0    16 20432 1183   170  0  3  0 97
> > 0  8   1168   3024    484 221164    0    0 15008 12488 1246   460  0  4  0 96
> > 1  8   1168   2252    484 221980    0    0 27808  6092 1270   624  0  4  0 96
> > 0  8   1168   3248    468 221044    0    0 32420  4596 1290   690  0  4  0 96
> > 0  9   1164   2084    456 222212    4    0 28964  1800 1285   596  0  3  0 96
> > 1  7   1164   3032    392 221256    0    0 23456  6820 1270   527  0  4  0 96
> > 0  9   1164   3200    388 221124    0    0 27040  7868 1269   588  0  3  0 97
> > 0  9   1164   2540    384 221808    0    0 21536  4024 1247   540  0  4  0 96
> >procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
> > r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
> > 1 10   1164   2052    392 222276    0    0 33572  5268 1298   745  0  4  0 96
> > 0  9   1164   3032    400 221316    0    0 28704  5448 1282   611  0  4  0 97
> >
> 
> I am happy that finally the kernel devs see that there is a problem. In my 
> case (as I mentioned in another thread) the reader is pretty starving while 
> I a writer is active. (esp my email client makes trouble while writing is 
> going on.) This is vmstat using your scripts above (though using cfq2 

This was the reverse case, though :-)

> scheduler):
> 
> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
>  1  3   3080   2528   1256 818976    0    0  6404 101236 1332   929  1 26  0 73
>  0  3   3080   2768   1252 820104    0    0  2820 32632 1328  1087  1 20  0 79
>  2  1   3080   6992   1292 814808    0    0  4928 106912 1337  1364 16 29  0 55
>  0  3   3080   2772   1252 818516    0    0  3076 42176 1357  1351  2 41  0 57
>  0  3   3080   2644   1256 817548    0    0  3332 110104 1375   873  1 36  0 63
>  0  3   3080   2592   1248 815928    0    0  2820 76860 1324   894  1 41  0 58
>  7  3   3080   2208   1248 817176    0    0  3328 134144 1352  1058  2 30  0 68
>  4  4   3080   2516   1248 817516    0    0  3588 47768 1327  1244  1 19  0 80
>  0  3   3080   2400   1220 818688    0    0  3844 24760 1312  1251  1 23  0 76
>  0  3   3080   2656   1196 816468    0    0  3588 132372 1352  1126  1 52  0 47
>  0  3   3080   2688   1188 815316    0    0  3076 77824 1328   933  1 35  0 64
>  0  3   3080   2336   1156 816744    0    0  2924 114688 1333  1038  1 25  0 74
>  0  3   3080   2528   1184 816812    0    0  2508 67736 1340   882  1 12  0 87
>  0  3   3080   2208   1156 817712    0    0  3592 75624 1326  2289  1 36  0 63
>  0  3   3080   2664   1156 818240    0    0  5124 15692 1302   992  1 18  0 81
>  0  3   3080   2580   1160 815832    0    0  4356 155792 1375  1064  1 39  0 60
>  0  3   3080   2472   1160 817124    0    0  3076 100852 1345  1138  1 23  0 76
>  2  4   3080   2836   1148 816228    0    0  3336 100412 1352  1379  1 47  0 52
>  0  4   3080   2708   1144 815964    0    0  3844 48908 1343   871  1 25  0 74
>  0  3   3080   2748   1152 815984    0    0  3332 71996 1338   843  1 27  0 72

Can you try with the patch that is in the parent of this thread? The
above doesn't look that bad, although read performance could be better
of course. But try with the patch please, I'm sure it should help you
quite a lot.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-03  7:01           ` Jens Axboe
@ 2004-12-03  9:12             ` Prakash K. Cheemplavam
  2004-12-03  9:18               ` Jens Axboe
  2004-12-03  9:26               ` Andrew Morton
  0 siblings, 2 replies; 66+ messages in thread
From: Prakash K. Cheemplavam @ 2004-12-03  9:12 UTC (permalink / raw)
  To: Jens Axboe; +Cc: akpm, linux-kernel, nickpiggin

[-- Attachment #1: Type: text/plain, Size: 3273 bytes --]

Jens Axboe schrieb:
> On Thu, Dec 02 2004, Prakash K. Cheemplavam wrote:
> 

>> 0  3   3080   2208   1156 817712    0    0  3592 75624 1326  2289  1 36  0 63
>> 0  3   3080   2664   1156 818240    0    0  5124 15692 1302   992  1 18  0 81
>> 0  3   3080   2580   1160 815832    0    0  4356 155792 1375  1064  1 39  0 60
>> 0  3   3080   2472   1160 817124    0    0  3076 100852 1345  1138  1 23  0 76
>> 2  4   3080   2836   1148 816228    0    0  3336 100412 1352  1379  1 47  0 52
>> 0  4   3080   2708   1144 815964    0    0  3844 48908 1343   871  1 25  0 74
>> 0  3   3080   2748   1152 815984    0    0  3332 71996 1338   843  1 27  0 72
> 
> 
> Can you try with the patch that is in the parent of this thread? The
> above doesn't look that bad, although read performance could be better
> of course. But try with the patch please, I'm sure it should help you
> quite a lot.
> 

It actually got worse: though the read rate seems acceptable, it is not, as 
interactivity is dead while writing. I cannot start programmes, and other programmes 
which want to do i/o pretty much hang. This is only while writing; while reading 
there is no such problem.

Prakash

  5   2692   5440   1640 917964    0    0  2332 72364 1337   782  1 28  0 71
  0  5   2692   5536   1540 917116    0    0  2116 85360 1346   785  2 27  0 71
  1  4   2692   7016   1496 919488    0    0  2152 71664 1329   740  3 29  0 68
  0  4   2692   5112   1476 922536    0    0   872 110592 1328   798  0 17  0 83
  0  4   2692   5560   1492 922144    0    0  1316 57348 1323  2162  1 21  0 78
  0  4   2692   5240   1500 921808    0    0  2088 92200 1352  1285  1 26  0 73
  0  4   2692   5816   1576 922064    0    0  1352 60716 1316   737  1  5  0 94
  0  5   2692   5484   1588 920004    0    0  2072 87732 1327  3522  2 50  0 48
  0  4   2692   5696   1660 920628    0    0   956 97284 1336   676  1 28  0 71
  0  4   2692   5368   1592 921808    0    0  1296 23208 1367  4870  2 35  0 63
  0  4   2692   5176   1628 922708    0    0  1576 67932 1400   721  0  4  0 96
  1  4   2692   5496   1684 922604    0    0  2372 53216 1320   684  1  6  0 93
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
  r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
  1  4   2692   6432   1744 924664    0    0  3144 31484 1331   651  1  5  0 94
  0  4   2692   5496   1724 922056    0    0  2336 117040 1306  7770  1 63  0 36
  0  4   2692   5500   1724 921588    0    0  2576 28992 1314  1244  1 26  0 73
  0  5   2692   5484   1728 919340    0    0  1168 128796 1334 77435  2 45  0 53
  0  4   2692   5432   1756 920864    0    0  1488 100392 1325  1100  1 25  0 74
  1  4   2692   5368   1772 921900    0    0  1312 52180 1312  2087  1 21  0 78
  0  4   2692   5240   1716 922272    0    0  2076 56352 1305   939  1 13  0 86
  0  4   2692   5496   1732 921592    0    0  2596 68576 1320  1170  1 18  0 81
  0  4   2692   5368   1776 921364    0    0  1588 21904 1281  1201  1 23  0 76
  0  4   2692   5516   1852 921840    0    0  6560 93864 1593   967  1  8  0 91
  0  4   2692   5176   1816 922148    0    0  1068 73728 1581  4683  2 37  0 61
  0  4   2692   5484   1756 922408    0    0  2096 73632 1450  1456  2 20  0 78

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-03  9:12             ` Prakash K. Cheemplavam
@ 2004-12-03  9:18               ` Jens Axboe
  2004-12-03  9:35                 ` Prakash K. Cheemplavam
  2004-12-03  9:26               ` Andrew Morton
  1 sibling, 1 reply; 66+ messages in thread
From: Jens Axboe @ 2004-12-03  9:18 UTC (permalink / raw)
  To: Prakash K. Cheemplavam; +Cc: akpm, linux-kernel, nickpiggin

On Fri, Dec 03 2004, Prakash K. Cheemplavam wrote:
> Jens Axboe schrieb:
> >On Thu, Dec 02 2004, Prakash K. Cheemplavam wrote:
> >
> 
> >>0  3   3080   2208   1156 817712    0    0  3592 75624 1326  2289  1 36  0 63
> >>0  3   3080   2664   1156 818240    0    0  5124 15692 1302   992  1 18  0 81
> >>0  3   3080   2580   1160 815832    0    0  4356 155792 1375  1064  1 39  0 60
> >>0  3   3080   2472   1160 817124    0    0  3076 100852 1345  1138  1 23  0 76
> >>2  4   3080   2836   1148 816228    0    0  3336 100412 1352  1379  1 47  0 52
> >>0  4   3080   2708   1144 815964    0    0  3844 48908 1343   871  1 25  0 74
> >>0  3   3080   2748   1152 815984    0    0  3332 71996 1338   843  1 27  0 72
> >
> >
> >Can you try with the patch that is in the parent of this thread? The
> >above doesn't look that bad, although read performance could be better
> >of course. But try with the patch please, I'm sure it should help you
> >quite a lot.
> >
> 
> It actually got worse: Though the read rate seems accepteble, it is not, as 
> interactivity is dead while writing. I cannot start porgrammes, other 
> programmes which want to do i/o pretty much hang. This is only while 
> writing. While reading there is no such problem.

Interesting, thanks for testing. I'll run some tests here as well, so
far only the cases mentioned yesterday have been tested.

You could try and bump the slice period. But I'll experiment and see
what happens. What is your test case?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-03  9:12             ` Prakash K. Cheemplavam
  2004-12-03  9:18               ` Jens Axboe
@ 2004-12-03  9:26               ` Andrew Morton
  2004-12-03  9:34                 ` Prakash K. Cheemplavam
  2004-12-03  9:39                 ` Jens Axboe
  1 sibling, 2 replies; 66+ messages in thread
From: Andrew Morton @ 2004-12-03  9:26 UTC (permalink / raw)
  To: Prakash K. Cheemplavam; +Cc: axboe, linux-kernel, nickpiggin

"Prakash K. Cheemplavam" <prakashkc@gmx.de> wrote:
>
> > Can you try with the patch that is in the parent of this thread? The
>  > above doesn't look that bad, although read performance could be better
>  > of course. But try with the patch please, I'm sure it should help you
>  > quite a lot.
>  > 
> 
>  It actually got worse: Though the read rate seems accepteble, it is not, as 
>  interactivity is dead while writing.

Is this a parallel IDE system?  SATA?  SCSI?  If the latter, what driver
and what is the TCQ depth?


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-03  9:26               ` Andrew Morton
@ 2004-12-03  9:34                 ` Prakash K. Cheemplavam
  2004-12-03  9:39                 ` Jens Axboe
  1 sibling, 0 replies; 66+ messages in thread
From: Prakash K. Cheemplavam @ 2004-12-03  9:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: axboe, linux-kernel, nickpiggin

[-- Attachment #1: Type: text/plain, Size: 961 bytes --]

Andrew Morton schrieb:
> "Prakash K. Cheemplavam" <prakashkc@gmx.de> wrote:
> 
>>>Can you try with the patch that is in the parent of this thread? The
>>
>> > above doesn't look that bad, although read performance could be better
>> > of course. But try with the patch please, I'm sure it should help you
>> > quite a lot.
>> > 
>>
>> It actually got worse: Though the read rate seems accepteble, it is not, as 
>> interactivity is dead while writing.
> 
> 
> Is this a parallel IDE system?  SATA?  SCSI?  If the latter, what driver
> and what is the TCQ depth?

Hmm yes, this is a RAID0 configuration, so the regression of time 
sliced CFQ might be related to it, but the problem of unresponsiveness 
while writing as such was on my single disk as well. Here one HD is 
SATA (libata silicon image) and one is on an IDE controller (nforce2). 
No TCQ. BTW, I haven't checked the problem on my IDE disk, only on 
SATA. Will try to free some space and do so...

Prakash

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-03  9:18               ` Jens Axboe
@ 2004-12-03  9:35                 ` Prakash K. Cheemplavam
  2004-12-03  9:43                   ` Jens Axboe
  0 siblings, 1 reply; 66+ messages in thread
From: Prakash K. Cheemplavam @ 2004-12-03  9:35 UTC (permalink / raw)
  To: Jens Axboe; +Cc: akpm, linux-kernel, nickpiggin

[-- Attachment #1: Type: text/plain, Size: 2087 bytes --]

Jens Axboe schrieb:
> On Fri, Dec 03 2004, Prakash K. Cheemplavam wrote:
> 
>>Jens Axboe schrieb:
>>
>>>On Thu, Dec 02 2004, Prakash K. Cheemplavam wrote:
>>>
>>
>>>>0  3   3080   2208   1156 817712    0    0  3592 75624 1326  2289  1 36  0 63
>>>>0  3   3080   2664   1156 818240    0    0  5124 15692 1302   992  1 18  0 81
>>>>0  3   3080   2580   1160 815832    0    0  4356 155792 1375  1064  1 39  0 60
>>>>0  3   3080   2472   1160 817124    0    0  3076 100852 1345  1138  1 23  0 76
>>>>2  4   3080   2836   1148 816228    0    0  3336 100412 1352  1379  1 47  0 52
>>>>0  4   3080   2708   1144 815964    0    0  3844 48908 1343   871  1 25  0 74
>>>>0  3   3080   2748   1152 815984    0    0  3332 71996 1338   843  1 27  0 72
>>>
>>>
>>>Can you try with the patch that is in the parent of this thread? The
>>>above doesn't look that bad, although read performance could be better
>>>of course. But try with the patch please, I'm sure it should help you
>>>quite a lot.
>>>
>>
>>It actually got worse: Though the read rate seems accepteble, it is not, as 
>>interactivity is dead while writing. I cannot start porgrammes, other 
>>programmes which want to do i/o pretty much hang. This is only while 
>>writing. While reading there is no such problem.
> 
> 
> Interesting, thanks for testing. I'll run some tests here as well, so
> far only the cases mentioned yesterday have been tested.

BTW, in case it is misread: the above (except the io performance as such)
is not a regression; the other schedulers behave the same on my system.


> You could try and bumb the slice period. But I'll experiment and see
> what happens. What is your test case?

[slice bumping] Um, is it doable via proc? I haven't seen any text docs for
your patch and I am not good at kernel code ;-)

My test case was akpm's one: write 1GB continuously and try to cat a
several-GB file to /dev/null. At the same time I checked starting
other apps/using my email client...
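
(For reference, a rough sketch of that test as a shell script; the scratch
directory and the name of the big file are made up, everything else is just
the dd/cat commands quoted earlier in the thread:)

	#!/bin/sh
	# hypothetical reproduction of the reader-vs-writer test from this thread
	cd /mnt/scratch                # assumed scratch mount with enough free space
	# writer: rewrite a 1GB file in a loop
	( while true; do
		dd if=/dev/zero of=foo bs=1M count=1000 conv=notrunc
	  done ) &
	writer=$!
	# reader: time a sequential read of a pre-existing multi-GB file
	time cat bigfile > /dev/null
	# stop the writer loop (the in-flight dd finishes its current pass)
	kill $writer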

For me the problem has been in mainline for quite some time... checked 
back to kernel 2.6.7.

Prakash

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-03  9:26               ` Andrew Morton
  2004-12-03  9:34                 ` Prakash K. Cheemplavam
@ 2004-12-03  9:39                 ` Jens Axboe
  2004-12-03  9:54                   ` Prakash K. Cheemplavam
       [not found]                   ` <41B03722.5090001@gmx.de>
  1 sibling, 2 replies; 66+ messages in thread
From: Jens Axboe @ 2004-12-03  9:39 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Prakash K. Cheemplavam, linux-kernel, nickpiggin

On Fri, Dec 03 2004, Andrew Morton wrote:
> "Prakash K. Cheemplavam" <prakashkc@gmx.de> wrote:
> >
> > > Can you try with the patch that is in the parent of this thread? The
> >  > above doesn't look that bad, although read performance could be better
> >  > of course. But try with the patch please, I'm sure it should help you
> >  > quite a lot.
> >  > 
> > 
> >  It actually got worse: Though the read rate seems accepteble, it is not, as 
> >  interactivity is dead while writing.
> 
> Is this a parallel IDE system?  SATA?  SCSI?  If the latter, what driver
> and what is the TCQ depth?

Yeah, that would be interesting to know. Or whether the device is on dm or
raid. And what filesystem is being used?

TCQ depth doesn't matter with cfq really, as you can fully control how
big you go with the drive (default is 4) with max_depth.

Running buffer reads and writes here with new cfq, I get about ~7MiB/sec
read and ~14MiB/sec write aggregate performance with 4 clients (2 of
each) with the default settings. If I up idle period to 6ms and slice
period to 150ms, I get ~13MiB/sec read and ~11MiB/sec write aggregate
for the same run.

So Prakash, please try the same test with those settings:

# cd /sys/block/<dev>/queue/iosched
# echo 6 > idle
# echo 150 > slice

These are the first values I tried; there may be better settings. If you have
your filesystem on dm/raid, you probably want to do the above for each
device the dm/raid is composed of.
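
(A quick sketch of doing that for an md array; the component drive names
below are just examples, and the tunable names are the ones given above,
which may differ between versions of the patch:)

	#!/bin/sh
	# apply the suggested cfq settings to each component drive of the array
	for dev in sda hdc; do         # example md1 components, adjust to your setup
		echo 6   > /sys/block/$dev/queue/iosched/idle
		echo 150 > /sys/block/$dev/queue/iosched/slice
	done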


-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-03  9:35                 ` Prakash K. Cheemplavam
@ 2004-12-03  9:43                   ` Jens Axboe
  0 siblings, 0 replies; 66+ messages in thread
From: Jens Axboe @ 2004-12-03  9:43 UTC (permalink / raw)
  To: Prakash K. Cheemplavam; +Cc: akpm, linux-kernel, nickpiggin

On Fri, Dec 03 2004, Prakash K. Cheemplavam wrote:
> Jens Axboe schrieb:
> >On Fri, Dec 03 2004, Prakash K. Cheemplavam wrote:
> >
> >>Jens Axboe schrieb:
> >>
> >>>On Thu, Dec 02 2004, Prakash K. Cheemplavam wrote:
> >>>
> >>
> >>>>0  3   3080   2208   1156 817712    0    0  3592 75624 1326  2289  1 36  0 63
> >>>>0  3   3080   2664   1156 818240    0    0  5124 15692 1302   992  1 18  0 81
> >>>>0  3   3080   2580   1160 815832    0    0  4356 155792 1375  1064  1 39  0 60
> >>>>0  3   3080   2472   1160 817124    0    0  3076 100852 1345  1138  1 23  0 76
> >>>>2  4   3080   2836   1148 816228    0    0  3336 100412 1352  1379  1 47  0 52
> >>>>0  4   3080   2708   1144 815964    0    0  3844 48908 1343   871  1 25  0 74
> >>>>0  3   3080   2748   1152 815984    0    0  3332 71996 1338   843  1 27  0 72
> >>>
> >>>
> >>>Can you try with the patch that is in the parent of this thread? The
> >>>above doesn't look that bad, although read performance could be better
> >>>of course. But try with the patch please, I'm sure it should help you
> >>>quite a lot.
> >>>
> >>
> >>It actually got worse: Though the read rate seems accepteble, it is not, 
> >>as interactivity is dead while writing. I cannot start porgrammes, other 
> >>programmes which want to do i/o pretty much hang. This is only while 
> >>writing. While reading there is no such problem.
> >
> >
> >Interesting, thanks for testing. I'll run some tests here as well, so
> >far only the cases mentioned yesterday have been tested.
> 
> BTW, in case it is misread: Above (except the io performance as such) is
> no regression: The other schedulers behave the same on my system.

Yes, that's what I assumed. Another thing to keep in mind is that even
with just a single writer, you could have 3 people doing writeout for
you (pdflush for each disk, and the writer itself), while the reader is
on its own. This could affect latencies/bandwidth for the reader in
not-so-pleasant ways.

> >You could try and bumb the slice period. But I'll experiment and see
> >what happens. What is your test case?
> 
> [slice bumping] Uhm, is it doable via proc? I haven't seen text docs to
> your patch and I am not good at kernel code ;-)

:-)

See my previous mail, it tells you how to do it.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-03  9:39                 ` Jens Axboe
@ 2004-12-03  9:54                   ` Prakash K. Cheemplavam
       [not found]                   ` <41B03722.5090001@gmx.de>
  1 sibling, 0 replies; 66+ messages in thread
From: Prakash K. Cheemplavam @ 2004-12-03  9:54 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Andrew Morton, linux-kernel, nickpiggin

[-- Attachment #1: Type: text/plain, Size: 3141 bytes --]

Jens Axboe schrieb:
> On Fri, Dec 03 2004, Andrew Morton wrote:
> 
>>Is this a parallel IDE system?  SATA?  SCSI?  If the latter, what driver
>>and what is the TCQ depth?
> 
> 
> Yeah, that would be interesting to know. Or of the device is on dm or
> raid. And what filesystem is being used?

It is ext3. (The writing-makes-reading-starve problem happens on reiserfs 
as well. ext2 is not so bad and xfs behaves best, i.e. my email client 
doesn't get unusable with my earlier tests, but is "only" very slow. But 
then I only wrote 2GB and nothing continuously.)

> # cd /sys/block/<dev>/queue/iosched
> # echo 6 > idle
> # echo 150 > slice
> 
> These are the first I tried, there may be better settings. If you have
> your filesystem on dm/raid, you probably want to do the above for each
> device the dm/raid is composed of.

Yes, I have linux raid (testing md1). I have applied both settings on both 
drives and got an interesting new pattern: now it alternates. My email 
client is still not usable while writing though...


 0  3   4120   5276   1792 856784    0    0  3880 81372 1528   931  1 17  0 82
 0  4   4120   5576   1800 856148    0    0  1292 121136 1379   872  5 12  0 83
 0  3   4120   6624   1796 857700    0    0  1548  4464 1324   712  4  7  0 89
 1  3   4120   5200   1568 859600    0    0 42608  2308 1679  1392  3 28  0 69
 1  3   4120   5212   1472 856672    0    0 12372 94992 1451  1047  1 30  0 69
 1  2   4120   5700   1476 856892    0    0  2576 27252 1337   770  2  9  0 89
 0  3   4120   5404   1484 860292    0    0  2076 63876 1323   758  2 13  0 85
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  2   4120   5536   1492 860240    0    0 49248    12 1783  1478  2 26  0 72
 0  3   4120   5476   1492 857976    0    0 21852 98552 1500  1021  2 20  0 78
 0  3   4120   5340   1552 860104    0    0  2672 36920 1321   717  1 15  0 84
 0  4   4120   5436   1588 861080    0    0  2364 20748 1331   716  1 12  0 87
 0  3   4120   5092   1632 860012    0    0 59520     0 1810  1591  3 32  0 65
 0  3   4120   5568   1616 858252    0    0 58232     0 1833  1519  2 30  0 68
 0  2   4120   5508   1548 857784    0    0  3864 70500 1347   767  1 13  0 86
 0  2   4120   5376   1488 857760    0    0  5164 41440 1317   800  2 15  0 83
 0  3   4120   5256   1484 858448    0    0  6452 111292 1342   829  2 22  0 76
 0  3   4120   5832   1488 858768    0    0  2060 26624 1320   769  2  5  0 93
 3  4   4120   5564   1492 859644    0    0 20568    12 1426  1048  1 25  0 75
 0  2   4120   5448   1516 857548    0    0 41056 47636 1746  1355  2 29  0 69
 0  2   4120   5572   1524 858020    0    0  2332 25020 1330   737  1 10  0 89
 0  3   4120   5508   1568 858020    0    0  4152 130164 1338   844  2 20  0 78
 1  3   4120   6260   1588 858840    0    0   836 14288 1314   747  1  3  0 96
 1  3   4120   5192   1644 860304    0    0 41628    12 1677  2226  2 31  0 67
 2  3   4120   5308   1588 857448    0    0 53324  1456 2044  9456  2 60  0 38


Prakash

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
       [not found]                   ` <41B03722.5090001@gmx.de>
@ 2004-12-03 10:31                     ` Jens Axboe
  2004-12-03 10:38                       ` Jens Axboe
  0 siblings, 1 reply; 66+ messages in thread
From: Jens Axboe @ 2004-12-03 10:31 UTC (permalink / raw)
  To: Prakash K. Cheemplavam; +Cc: Andrew Morton, Linux Kernel, nickpiggin

On Fri, Dec 03 2004, Prakash K. Cheemplavam wrote:
> Jens Axboe schrieb:
> >On Fri, Dec 03 2004, Andrew Morton wrote:
> >
> >>"Prakash K. Cheemplavam" <prakashkc@gmx.de> wrote:
> >>
> >>Is this a parallel IDE system?  SATA?  SCSI?  If the latter, what driver
> >>and what is the TCQ depth?
> >
> >
> >Yeah, that would be interesting to know. Or of the device is on dm or
> >raid. And what filesystem is being used?
> 
> It is ext3. (The writing-makes-reading-starve problem happen on reiserfs 
> as well. ext2 is not so bad and xfs behaves best, ie my email client 
> doesn't get unuasable with my earlier tests, but "only" very slow. But 
> then I only wrote down 2gb and nothing continuesly.)

It's impossible to give really good results on ext3/reiser in my
experience, because reads often need to generate a write as well. What
could work is if a reader got PF_SYNCWRITE set while that happens.

Or even better would be to kill that horrible PF_SYNCWRITE hack (Andrew,
how could you!) and really have the fs use the proper WRITE_SYNC
instead.

> >So Prakash, please try the same test with those settings:
> >
> ># cd /sys/block/<dev>/queue/iosched
> ># echo 6 > idle
> ># echo 150 > slice
> >
> >These are the first I tried, there may be better settings. If you have
> >your filesystem on dm/raid, you probably want to do the above for each
> >device the dm/raid is composed of.
> 
> Yeas, I have linux raid (testing md1). Have appield both settings on 
> both drives and got a interesting new pattern: Now it alternates. My 
> email client is still not usale while writing though...

Funky. It looks like another case of the io scheduler being at the wrong
place - if raid sends dependent reads to different drives, it screws up
the io scheduling. The right way to fix that would be to do the io
scheduling before raid (reverse of what we do now), but that is a lot of work. A
hack would be to try and tie processes to one md component for periods
of time, sort of like cfq slicing.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-03 10:31                     ` Jens Axboe
@ 2004-12-03 10:38                       ` Jens Axboe
  2004-12-03 10:45                         ` Prakash K. Cheemplavam
  0 siblings, 1 reply; 66+ messages in thread
From: Jens Axboe @ 2004-12-03 10:38 UTC (permalink / raw)
  To: Prakash K. Cheemplavam; +Cc: Andrew Morton, Linux Kernel, nickpiggin

On Fri, Dec 03 2004, Jens Axboe wrote:
> > >So Prakash, please try the same test with those settings:
> > >
> > ># cd /sys/block/<dev>/queue/iosched
> > ># echo 6 > idle
> > ># echo 150 > slice
> > >
> > >These are the first I tried, there may be better settings. If you have
> > >your filesystem on dm/raid, you probably want to do the above for each
> > >device the dm/raid is composed of.
> > 
> > Yeas, I have linux raid (testing md1). Have appield both settings on 
> > both drives and got a interesting new pattern: Now it alternates. My 
> > email client is still not usale while writing though...
> 
> Funky. It looks like another case of the io scheduler being at the wrong
> place - if raid sends dependent reads to different drives, it screws up
> the io scheduling. The right way to fix that would be to io scheduler
> before raid (reverse of what we do now), but that is a lot of work. A
> hack would be to try and tie processes to one md component for periods
> of time, sort of like cfq slicing.

It makes sense to split the slice period for sync and async requests,
since async requests usually get a lot of requests queued in a short
period of time. Might even make sense to introduce a slice_rq value as
well, limiting the number of requests queued in a given slice.

But at least this patch lets you set slice_sync and slice_async
separately, if you want to experiment.
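
(A short sketch of how the new tunables could be poked once the patch below
is applied; the values are roughly the compiled-in defaults expressed in ms,
and <dev> stands for the drive as in the earlier mail:)

	cd /sys/block/<dev>/queue/iosched
	echo 100 > slice_sync     # time slice for queues doing sync io, in ms
	echo 40  > slice_async    # time slice for queues doing async io, in ms
	echo 4   > idle           # idle wait for a new sync request, in ms
	echo 4   > max_depth      # max requests outstanding at the drive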

===== drivers/block/cfq-iosched.c 1.15 vs edited =====
--- 1.15/drivers/block/cfq-iosched.c	2004-11-30 07:56:58 +01:00
+++ edited/drivers/block/cfq-iosched.c	2004-12-03 11:36:09 +01:00
@@ -22,21 +22,23 @@
 #include <linux/rbtree.h>
 #include <linux/mempool.h>
 
-static unsigned long max_elapsed_crq;
-static unsigned long max_elapsed_dispatch;
-
 /*
  * tunables
  */
 static int cfq_quantum = 4;		/* max queue in one round of service */
 static int cfq_queued = 8;		/* minimum rq allocate limit per-queue*/
-static int cfq_service = HZ;		/* period over which service is avg */
 static int cfq_fifo_expire_r = HZ / 2;	/* fifo timeout for sync requests */
 static int cfq_fifo_expire_w = 5 * HZ;	/* fifo timeout for async requests */
 static int cfq_fifo_rate = HZ / 8;	/* fifo expiry rate */
 static int cfq_back_max = 16 * 1024;	/* maximum backwards seek, in KiB */
 static int cfq_back_penalty = 2;	/* penalty of a backwards seek */
 
+static int cfq_slice_sync = HZ / 10;
+static int cfq_slice_async = HZ / 25;
+static int cfq_idle = HZ / 249;
+
+static int cfq_max_depth = 4;
+
 /*
  * for the hash of cfqq inside the cfqd
  */
@@ -55,6 +57,7 @@
 #define list_entry_hash(ptr)	hlist_entry((ptr), struct cfq_rq, hash)
 
 #define list_entry_cfqq(ptr)	list_entry((ptr), struct cfq_queue, cfq_list)
+#define list_entry_fifo(ptr)	list_entry((ptr), struct request, queuelist)
 
 #define RQ_DATA(rq)		(rq)->elevator_private
 
@@ -76,22 +79,18 @@
 #define rq_rb_key(rq)		(rq)->sector
 
 /*
- * threshold for switching off non-tag accounting
- */
-#define CFQ_MAX_TAG		(4)
-
-/*
  * sort key types and names
  */
 enum {
 	CFQ_KEY_PGID,
 	CFQ_KEY_TGID,
+	CFQ_KEY_PID,
 	CFQ_KEY_UID,
 	CFQ_KEY_GID,
 	CFQ_KEY_LAST,
 };
 
-static char *cfq_key_types[] = { "pgid", "tgid", "uid", "gid", NULL };
+static char *cfq_key_types[] = { "pgid", "tgid", "pid", "uid", "gid", NULL };
 
 /*
  * spare queue
@@ -103,6 +102,8 @@
 static kmem_cache_t *cfq_ioc_pool;
 
 struct cfq_data {
+	atomic_t ref;
+
 	struct list_head rr_list;
 	struct list_head empty_list;
 
@@ -114,8 +115,6 @@
 
 	unsigned int max_queued;
 
-	atomic_t ref;
-
 	int key_type;
 
 	mempool_t *crq_pool;
@@ -127,6 +126,14 @@
 	int rq_in_driver;
 
 	/*
+	 * schedule slice state info
+	 */
+	struct timer_list timer;
+	struct work_struct unplug_work;
+	struct cfq_queue *active_queue;
+	unsigned int dispatch_slice;
+
+	/*
 	 * tunables, see top of file
 	 */
 	unsigned int cfq_quantum;
@@ -137,8 +144,9 @@
 	unsigned int cfq_back_penalty;
 	unsigned int cfq_back_max;
 	unsigned int find_best_crq;
-
-	unsigned int cfq_tagged;
+	unsigned int cfq_slice[2];
+	unsigned int cfq_idle;
+	unsigned int cfq_max_depth;
 };
 
 struct cfq_queue {
@@ -150,8 +158,6 @@
 	struct hlist_node cfq_hash;
 	/* hash key */
 	unsigned long key;
-	/* whether queue is on rr (or empty) list */
-	int on_rr;
 	/* on either rr or empty list of cfqd */
 	struct list_head cfq_list;
 	/* sorted list of pending requests */
@@ -169,10 +175,14 @@
 
 	int key_type;
 
-	unsigned long service_start;
-	unsigned long service_used;
+	unsigned long slice_start;
+	unsigned long slice_end;
+	unsigned long service_last;
 
-	unsigned int max_rate;
+	/* whether queue is on rr (or empty) list */
+	unsigned int on_rr : 1;
+	unsigned int wait_request : 1;
+	unsigned int must_dispatch : 1;
 
 	/* number of requests that have been handed to the driver */
 	int in_flight;
@@ -219,6 +229,8 @@
 		default:
 		case CFQ_KEY_TGID:
 			return tsk->tgid;
+		case CFQ_KEY_PID:
+			return tsk->pid;
 		case CFQ_KEY_UID:
 			return tsk->uid;
 		case CFQ_KEY_GID:
@@ -406,67 +418,22 @@
 		cfqq->next_crq = cfq_find_next_crq(cfqq->cfqd, cfqq, crq);
 }
 
-static int cfq_check_sort_rr_list(struct cfq_queue *cfqq)
-{
-	struct list_head *head = &cfqq->cfqd->rr_list;
-	struct list_head *next, *prev;
-
-	/*
-	 * list might still be ordered
-	 */
-	next = cfqq->cfq_list.next;
-	if (next != head) {
-		struct cfq_queue *cnext = list_entry_cfqq(next);
-
-		if (cfqq->service_used > cnext->service_used)
-			return 1;
-	}
-
-	prev = cfqq->cfq_list.prev;
-	if (prev != head) {
-		struct cfq_queue *cprev = list_entry_cfqq(prev);
-
-		if (cfqq->service_used < cprev->service_used)
-			return 1;
-	}
-
-	return 0;
-}
-
-static void cfq_sort_rr_list(struct cfq_queue *cfqq, int new_queue)
+static void cfq_resort_rr_list(struct cfq_queue *cfqq)
 {
 	struct list_head *entry = &cfqq->cfqd->rr_list;
 
-	if (!cfqq->on_rr)
-		return;
-	if (!new_queue && !cfq_check_sort_rr_list(cfqq))
-		return;
-
 	list_del(&cfqq->cfq_list);
 
 	/*
-	 * sort by our mean service_used, sub-sort by in-flight requests
+	 * sort by when queue was last serviced
 	 */
 	while ((entry = entry->prev) != &cfqq->cfqd->rr_list) {
 		struct cfq_queue *__cfqq = list_entry_cfqq(entry);
 
-		if (cfqq->service_used > __cfqq->service_used)
+		if (!__cfqq->service_last)
+			break;
+		if (time_before(__cfqq->service_last, cfqq->service_last))
 			break;
-		else if (cfqq->service_used == __cfqq->service_used) {
-			struct list_head *prv;
-
-			while ((prv = entry->prev) != &cfqq->cfqd->rr_list) {
-				__cfqq = list_entry_cfqq(prv);
-
-				WARN_ON(__cfqq->service_used > cfqq->service_used);
-				if (cfqq->service_used != __cfqq->service_used)
-					break;
-				if (cfqq->in_flight > __cfqq->in_flight)
-					break;
-
-				entry = prv;
-			}
-		}
 	}
 
 	list_add(&cfqq->cfq_list, entry);
@@ -479,16 +446,12 @@
 static inline void
 cfq_add_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
-	/*
-	 * it's currently on the empty list
-	 */
-	cfqq->on_rr = 1;
-	cfqd->busy_queues++;
+	BUG_ON(cfqq->on_rr);
 
-	if (time_after(jiffies, cfqq->service_start + cfq_service))
-		cfqq->service_used >>= 3;
+	cfqd->busy_queues++;
+	cfqq->on_rr = 1;
 
-	cfq_sort_rr_list(cfqq, 1);
+	cfq_resort_rr_list(cfqq);
 }
 
 static inline void
@@ -512,10 +475,10 @@
 		struct cfq_data *cfqd = cfqq->cfqd;
 
 		BUG_ON(!cfqq->queued[crq->is_sync]);
+		cfqq->queued[crq->is_sync]--;
 
 		cfq_update_next_crq(crq);
 
-		cfqq->queued[crq->is_sync]--;
 		rb_erase(&crq->rb_node, &cfqq->sort_list);
 		RB_CLEAR_COLOR(&crq->rb_node);
 
@@ -622,11 +585,6 @@
 	if (crq) {
 		struct cfq_queue *cfqq = crq->cfq_queue;
 
-		if (cfqq->cfqd->cfq_tagged) {
-			cfqq->service_used--;
-			cfq_sort_rr_list(cfqq, 0);
-		}
-
 		crq->accounted = 0;
 		cfqq->cfqd->rq_in_driver--;
 	}
@@ -640,9 +598,7 @@
 	if (crq) {
 		cfq_remove_merge_hints(q, crq);
 		list_del_init(&rq->queuelist);
-
-		if (crq->cfq_queue)
-			cfq_del_crq_rb(crq);
+		cfq_del_crq_rb(crq);
 	}
 }
 
@@ -724,6 +680,98 @@
 }
 
 /*
+ * current cfqq expired its slice (or was too idle), select new one
+ */
+static inline void cfq_slice_expired(struct cfq_data *cfqd)
+{
+	struct cfq_queue *cfqq = cfqd->active_queue;
+	unsigned long now = jiffies;
+
+	if (cfqq) {
+		if (cfqq->wait_request)
+			del_timer(&cfqd->timer);
+
+		cfqq->service_last = now;
+		cfqq->must_dispatch = 0;
+
+		if (cfqq->on_rr)
+			cfq_resort_rr_list(cfqq);
+
+		cfqq = NULL;
+	}
+
+	if (!list_empty(&cfqd->rr_list)) {
+		cfqq = list_entry_cfqq(cfqd->rr_list.next);
+
+		cfqq->slice_start = now;
+		cfqq->slice_end = 0;
+		cfqq->wait_request = 0;
+	}
+
+	cfqd->active_queue = cfqq;
+	cfqd->dispatch_slice = 0;
+}
+
+static int cfq_arm_slice_timer(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+{
+	WARN_ON(!RB_EMPTY(&cfqq->sort_list));
+
+	cfqq->wait_request = 1;
+
+	if (!cfqd->cfq_idle)
+		return 0;
+
+	if (!timer_pending(&cfqd->timer)) {
+		unsigned long now = jiffies, slice_left;
+
+		slice_left = cfqq->slice_end - now;
+		cfqd->timer.expires = now + min(cfqd->cfq_idle,(unsigned int)slice_left);
+		add_timer(&cfqd->timer);
+	}
+
+	return 1;
+}
+
+/*
+ * get next queue for service
+ */
+static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
+{
+	struct cfq_queue *cfqq = cfqd->active_queue;
+	unsigned long now = jiffies;
+
+	cfqq = cfqd->active_queue;
+	if (!cfqq)
+		goto new_queue;
+
+	if (cfqq->must_dispatch)
+		goto must_queue;
+
+	/*
+	 * slice has expired
+	 */
+	if (time_after(jiffies, cfqq->slice_end))
+		goto new_queue;
+
+	/*
+	 * if queue has requests, dispatch one. if not, check if
+	 * enough slice is left to wait for one
+	 */
+must_queue:
+	if (!RB_EMPTY(&cfqq->sort_list))
+		goto keep_queue;
+	else if (cfqq->slice_end - now >= cfqd->cfq_idle) {
+		if (cfq_arm_slice_timer(cfqd, cfqq))
+			return NULL;
+	}
+
+new_queue:
+	cfq_slice_expired(cfqd);
+keep_queue:
+	return cfqd->active_queue;
+}
+
+/*
  * we dispatch cfqd->cfq_quantum requests in total from the rr_list queues,
  * this function sector sorts the selected request to minimize seeks. we start
  * at cfqd->last_sector, not 0.
@@ -741,9 +789,7 @@
 	list_del(&crq->request->queuelist);
 
 	last = cfqd->last_sector;
-	while ((entry = entry->prev) != head) {
-		__rq = list_entry_rq(entry);
-
+	list_for_each_entry_reverse(__rq, head, queuelist) {
 		if (blk_barrier_rq(crq->request))
 			break;
 		if (!blk_fs_request(crq->request))
@@ -777,95 +823,95 @@
 	if (time_before(now, cfqq->last_fifo_expire + cfqd->cfq_fifo_batch_expire))
 		return NULL;
 
-	crq = RQ_DATA(list_entry(cfqq->fifo[0].next, struct request, queuelist));
-	if (reads && time_after(now, crq->queue_start + cfqd->cfq_fifo_expire_r)) {
-		cfqq->last_fifo_expire = now;
-		return crq;
+	if (reads) {
+		crq = RQ_DATA(list_entry_fifo(cfqq->fifo[READ].next));
+		if (time_after(now, crq->queue_start + cfqd->cfq_fifo_expire_r)) {
+			cfqq->last_fifo_expire = now;
+			return crq;
+		}
 	}
 
-	crq = RQ_DATA(list_entry(cfqq->fifo[1].next, struct request, queuelist));
-	if (writes && time_after(now, crq->queue_start + cfqd->cfq_fifo_expire_w)) {
-		cfqq->last_fifo_expire = now;
-		return crq;
+	if (writes) {
+		crq = RQ_DATA(list_entry_fifo(cfqq->fifo[WRITE].next));
+		if (time_after(now, crq->queue_start + cfqd->cfq_fifo_expire_w)) {
+			cfqq->last_fifo_expire = now;
+			return crq;
+		}
 	}
 
 	return NULL;
 }
 
-/*
- * dispatch a single request from given queue
- */
-static inline void
-cfq_dispatch_request(request_queue_t *q, struct cfq_data *cfqd,
-		     struct cfq_queue *cfqq)
+static int
+__cfq_dispatch_requests(struct cfq_data *cfqd, struct cfq_queue *cfqq,
+			int max_dispatch)
 {
-	struct cfq_rq *crq;
+	int dispatched = 0, sync = 0;
 
-	/*
-	 * follow expired path, else get first next available
-	 */
-	if ((crq = cfq_check_fifo(cfqq)) == NULL) {
-		if (cfqd->find_best_crq)
-			crq = cfqq->next_crq;
-		else
-			crq = rb_entry_crq(rb_first(&cfqq->sort_list));
-	}
+	BUG_ON(RB_EMPTY(&cfqq->sort_list));
+
+	do {
+		struct cfq_rq *crq;
+
+		/*
+		 * follow expired path, else get first next available
+		 */
+		if ((crq = cfq_check_fifo(cfqq)) == NULL) {
+			if (cfqd->find_best_crq)
+				crq = cfqq->next_crq;
+			else
+				crq = rb_entry_crq(rb_first(&cfqq->sort_list));
+		}
+
+		cfqd->last_sector = crq->request->sector + crq->request->nr_sectors;
+
+		/*
+		 * finally, insert request into driver list
+		 */
+		cfq_dispatch_sort(cfqd->queue, crq);
+
+		cfqd->dispatch_slice++;
+		dispatched++;
+		sync += crq->is_sync;
+
+		if (RB_EMPTY(&cfqq->sort_list))
+			break;
 
-	cfqd->last_sector = crq->request->sector + crq->request->nr_sectors;
+	} while (dispatched < max_dispatch);
 
 	/*
-	 * finally, insert request into driver list
+	 * if slice end isn't set yet, set it. if at least one request was
+	 * sync, use the sync time slice value
 	 */
-	cfq_dispatch_sort(q, crq);
+	if (!cfqq->slice_end)
+		cfqq->slice_end = cfqd->cfq_slice[!!sync] + jiffies;
+
+
+	return dispatched;
 }
 
 static int cfq_dispatch_requests(request_queue_t *q, int max_dispatch)
 {
 	struct cfq_data *cfqd = q->elevator->elevator_data;
 	struct cfq_queue *cfqq;
-	struct list_head *entry, *tmp;
-	int queued, busy_queues, first_round;
 
 	if (list_empty(&cfqd->rr_list))
 		return 0;
 
-	queued = 0;
-	first_round = 1;
-restart:
-	busy_queues = 0;
-	list_for_each_safe(entry, tmp, &cfqd->rr_list) {
-		cfqq = list_entry_cfqq(entry);
-
-		BUG_ON(RB_EMPTY(&cfqq->sort_list));
-
-		/*
-		 * first round of queueing, only select from queues that
-		 * don't already have io in-flight
-		 */
-		if (first_round && cfqq->in_flight)
-			continue;
-
-		cfq_dispatch_request(q, cfqd, cfqq);
-
-		if (!RB_EMPTY(&cfqq->sort_list))
-			busy_queues++;
-
-		queued++;
-	}
-
-	if ((queued < max_dispatch) && (busy_queues || first_round)) {
-		first_round = 0;
-		goto restart;
-	}
+	cfqq = cfq_select_queue(cfqd);
+	if (!cfqq)
+		return 0;
 
-	return queued;
+	cfqq->wait_request = 0;
+	cfqq->must_dispatch = 0;
+	del_timer(&cfqd->timer);
+	return __cfq_dispatch_requests(cfqd, cfqq, max_dispatch);
 }
 
 static inline void cfq_account_dispatch(struct cfq_rq *crq)
 {
 	struct cfq_queue *cfqq = crq->cfq_queue;
 	struct cfq_data *cfqd = cfqq->cfqd;
-	unsigned long now, elapsed;
 
 	/*
 	 * accounted bit is necessary since some drivers will call
@@ -874,37 +920,9 @@
 	if (crq->accounted)
 		return;
 
-	now = jiffies;
-	if (cfqq->service_start == ~0UL)
-		cfqq->service_start = now;
-
-	/*
-	 * on drives with tagged command queueing, command turn-around time
-	 * doesn't necessarily reflect the time spent processing this very
-	 * command inside the drive. so do the accounting differently there,
-	 * by just sorting on the number of requests
-	 */
-	if (cfqd->cfq_tagged) {
-		if (time_after(now, cfqq->service_start + cfq_service)) {
-			cfqq->service_start = now;
-			cfqq->service_used /= 10;
-		}
-
-		cfqq->service_used++;
-		cfq_sort_rr_list(cfqq, 0);
-	}
-
-	elapsed = now - crq->queue_start;
-	if (elapsed > max_elapsed_dispatch)
-		max_elapsed_dispatch = elapsed;
-
 	crq->accounted = 1;
-	crq->service_start = now;
-
-	if (++cfqd->rq_in_driver >= CFQ_MAX_TAG && !cfqd->cfq_tagged) {
-		cfqq->cfqd->cfq_tagged = 1;
-		printk("cfq: depth %d reached, tagging now on\n", CFQ_MAX_TAG);
-	}
+	crq->service_start = jiffies;
+	cfqd->rq_in_driver++;
 }
 
 static inline void
@@ -915,21 +933,18 @@
 	WARN_ON(!cfqd->rq_in_driver);
 	cfqd->rq_in_driver--;
 
-	if (!cfqd->cfq_tagged) {
-		unsigned long now = jiffies;
-		unsigned long duration = now - crq->service_start;
-
-		if (time_after(now, cfqq->service_start + cfq_service)) {
-			cfqq->service_start = now;
-			cfqq->service_used >>= 3;
-		}
-
-		cfqq->service_used += duration;
-		cfq_sort_rr_list(cfqq, 0);
+	/*
+	 * queue was preempted while this request was servicing
+	 */
+	if (cfqd->active_queue != cfqq)
+		return;
 
-		if (duration > max_elapsed_crq)
-			max_elapsed_crq = duration;
-	}
+	/*
+	 * no requests. if last request was a sync request, wait for
+	 * a new one.
+	 */
+	if (RB_EMPTY(&cfqq->sort_list) && crq->is_sync)
+		cfq_arm_slice_timer(cfqd, cfqq);
 }
 
 static struct request *cfq_next_request(request_queue_t *q)
@@ -937,6 +952,9 @@
 	struct cfq_data *cfqd = q->elevator->elevator_data;
 	struct request *rq;
 
+	if (cfqd->rq_in_driver >= cfqd->cfq_max_depth)
+		return NULL;
+
 	if (!list_empty(&q->queue_head)) {
 		struct cfq_rq *crq;
 dispatch:
@@ -964,6 +982,8 @@
  */
 static void cfq_put_queue(struct cfq_queue *cfqq)
 {
+	struct cfq_data *cfqd = cfqq->cfqd;
+
 	BUG_ON(!atomic_read(&cfqq->ref));
 
 	if (!atomic_dec_and_test(&cfqq->ref))
@@ -972,6 +992,9 @@
 	BUG_ON(rb_first(&cfqq->sort_list));
 	BUG_ON(cfqq->on_rr);
 
+	if (unlikely(cfqd->active_queue == cfqq))
+		cfqd->active_queue = NULL;
+
 	cfq_put_cfqd(cfqq->cfqd);
 
 	/*
@@ -1117,6 +1140,7 @@
 		cic->ioc = ioc;
 		cic->cfqq = __cfqq;
 		atomic_inc(&__cfqq->ref);
+		atomic_inc(&cfqd->ref);
 	} else {
 		struct cfq_io_context *__cic;
 		unsigned long flags;
@@ -1159,10 +1183,10 @@
 		__cic->ioc = ioc;
 		__cic->cfqq = __cfqq;
 		atomic_inc(&__cfqq->ref);
+		atomic_inc(&cfqd->ref);
 		spin_lock_irqsave(&ioc->lock, flags);
 		list_add(&__cic->list, &cic->list);
 		spin_unlock_irqrestore(&ioc->lock, flags);
-
 		cic = __cic;
 		*cfqq = __cfqq;
 	}
@@ -1199,8 +1223,11 @@
 			new_cfqq = kmem_cache_alloc(cfq_pool, gfp_mask);
 			spin_lock_irq(cfqd->queue->queue_lock);
 			goto retry;
-		} else
-			goto out;
+		} else {
+			cfqq = kmem_cache_alloc(cfq_pool, gfp_mask);
+			if (!cfqq)
+				goto out;
+		}
 
 		memset(cfqq, 0, sizeof(*cfqq));
 
@@ -1216,7 +1243,7 @@
 		cfqq->cfqd = cfqd;
 		atomic_inc(&cfqd->ref);
 		cfqq->key_type = cfqd->key_type;
-		cfqq->service_start = ~0UL;
+		cfqq->service_last = 0;
 	}
 
 	if (new_cfqq)
@@ -1243,14 +1270,25 @@
 
 static void cfq_enqueue(struct cfq_data *cfqd, struct cfq_rq *crq)
 {
-	crq->is_sync = 0;
-	if (rq_data_dir(crq->request) == READ || current->flags & PF_SYNCWRITE)
-		crq->is_sync = 1;
+	struct cfq_queue *cfqq = crq->cfq_queue;
+	struct request *rq = crq->request;
+
+	crq->is_sync = rq_data_dir(rq) == READ || current->flags & PF_SYNCWRITE;
 
 	cfq_add_crq_rb(crq);
 	crq->queue_start = jiffies;
 
-	list_add_tail(&crq->request->queuelist, &crq->cfq_queue->fifo[crq->is_sync]);
+	list_add_tail(&rq->queuelist, &cfqq->fifo[crq->is_sync]);
+
+	/*
+	 * if we are waiting for a request for this queue, let it rip
+	 * immediately and flag that we must not expire this queue just now
+	 */
+	if (cfqq->wait_request && cfqq == cfqd->active_queue) {
+		cfqq->must_dispatch = 1;
+		del_timer(&cfqd->timer);
+		cfqd->queue->request_fn(cfqd->queue);
+	}
 }
 
 static void
@@ -1339,31 +1377,34 @@
 	struct cfq_data *cfqd = q->elevator->elevator_data;
 	struct cfq_queue *cfqq;
 	int ret = ELV_MQUEUE_MAY;
+	int limit;
 
 	if (current->flags & PF_MEMALLOC)
 		return ELV_MQUEUE_MAY;
 
 	cfqq = cfq_find_cfq_hash(cfqd, cfq_hash_key(cfqd, current));
-	if (cfqq) {
-		int limit = cfqd->max_queued;
-
-		if (cfqq->allocated[rw] < cfqd->cfq_queued)
-			return ELV_MQUEUE_MUST;
-
-		if (cfqd->busy_queues)
-			limit = q->nr_requests / cfqd->busy_queues;
-
-		if (limit < cfqd->cfq_queued)
-			limit = cfqd->cfq_queued;
-		else if (limit > cfqd->max_queued)
-			limit = cfqd->max_queued;
+	if (unlikely(!cfqq))
+		return ELV_MQUEUE_MAY;
 
-		if (cfqq->allocated[rw] >= limit) {
-			if (limit > cfqq->alloc_limit[rw])
-				cfqq->alloc_limit[rw] = limit;
+	if (cfqq->allocated[rw] < cfqd->cfq_queued)
+		return ELV_MQUEUE_MUST;
+	if (cfqq->wait_request)
+		return ELV_MQUEUE_MUST;
+
+	limit = cfqd->max_queued;
+	if (cfqd->busy_queues)
+		limit = q->nr_requests / cfqd->busy_queues;
+
+	if (limit < cfqd->cfq_queued)
+		limit = cfqd->cfq_queued;
+	else if (limit > cfqd->max_queued)
+		limit = cfqd->max_queued;
+
+	if (cfqq->allocated[rw] >= limit) {
+		if (limit > cfqq->alloc_limit[rw])
+			cfqq->alloc_limit[rw] = limit;
 
-			ret = ELV_MQUEUE_NO;
-		}
+		ret = ELV_MQUEUE_NO;
 	}
 
 	return ret;
@@ -1395,12 +1436,12 @@
 		BUG_ON(q->last_merge == rq);
 		BUG_ON(!hlist_unhashed(&crq->hash));
 
-		if (crq->io_context)
-			put_io_context(crq->io_context->ioc);
-
 		BUG_ON(!cfqq->allocated[crq->is_write]);
 		cfqq->allocated[crq->is_write]--;
 
+		if (crq->io_context)
+			put_io_context(crq->io_context->ioc);
+
 		mempool_free(crq, cfqd->crq_pool);
 		rq->elevator_private = NULL;
 
@@ -1473,6 +1514,7 @@
 		crq->is_write = rw;
 		rq->elevator_private = crq;
 		cfqq->alloc_limit[rw] = 0;
+		smp_mb();
 		return 0;
 	}
 
@@ -1486,6 +1528,44 @@
 	return 1;
 }
 
+static void cfq_kick_queue(void *data)
+{
+	request_queue_t *q = data;
+
+	blk_run_queue(q);
+}
+
+static void cfq_schedule_timer(unsigned long data)
+{
+	struct cfq_data *cfqd = (struct cfq_data *) data;
+	struct cfq_queue *cfqq;
+	unsigned long flags;
+
+	spin_lock_irqsave(cfqd->queue->queue_lock, flags);
+
+	if ((cfqq = cfqd->active_queue) != NULL) {
+		/*
+		 * expired
+		 */
+		if (time_after(jiffies, cfqq->slice_end))
+			goto out;
+
+		/*
+		 * not expired and it has a request pending, let it dispatch
+		 */
+		if (!RB_EMPTY(&cfqq->sort_list)) {
+			cfqq->must_dispatch = 1;
+			goto out_cont;
+		}
+	} 
+
+out:
+	cfq_slice_expired(cfqd);
+out_cont:
+	spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
+	kblockd_schedule_work(&cfqd->unplug_work);
+}
+
 static void cfq_put_cfqd(struct cfq_data *cfqd)
 {
 	request_queue_t *q = cfqd->queue;
@@ -1494,6 +1574,8 @@
 	if (!atomic_dec_and_test(&cfqd->ref))
 		return;
 
+	blk_sync_queue(q);
+
 	/*
 	 * kill spare queue, getting it means we have two refences to it.
 	 * drop both
@@ -1567,8 +1649,15 @@
 	q->nr_requests = 1024;
 	cfqd->max_queued = q->nr_requests / 16;
 	q->nr_batching = cfq_queued;
-	cfqd->key_type = CFQ_KEY_TGID;
+	cfqd->key_type = CFQ_KEY_PID;
 	cfqd->find_best_crq = 1;
+
+	init_timer(&cfqd->timer);
+	cfqd->timer.function = cfq_schedule_timer;
+	cfqd->timer.data = (unsigned long) cfqd;
+
+	INIT_WORK(&cfqd->unplug_work, cfq_kick_queue, q);
+
 	atomic_set(&cfqd->ref, 1);
 
 	cfqd->cfq_queued = cfq_queued;
@@ -1578,6 +1667,10 @@
 	cfqd->cfq_fifo_batch_expire = cfq_fifo_rate;
 	cfqd->cfq_back_max = cfq_back_max;
 	cfqd->cfq_back_penalty = cfq_back_penalty;
+	cfqd->cfq_slice[0] = cfq_slice_async;
+	cfqd->cfq_slice[1] = cfq_slice_sync;
+	cfqd->cfq_idle = cfq_idle;
+	cfqd->cfq_max_depth = cfq_max_depth;
 
 	return 0;
 out_spare:
@@ -1624,7 +1717,6 @@
 	return -ENOMEM;
 }
 
-
 /*
  * sysfs parts below -->
  */
@@ -1650,13 +1742,6 @@
 }
 
 static ssize_t
-cfq_clear_elapsed(struct cfq_data *cfqd, const char *page, size_t count)
-{
-	max_elapsed_dispatch = max_elapsed_crq = 0;
-	return count;
-}
-
-static ssize_t
 cfq_set_key_type(struct cfq_data *cfqd, const char *page, size_t count)
 {
 	spin_lock_irq(cfqd->queue->queue_lock);
@@ -1664,6 +1749,8 @@
 		cfqd->key_type = CFQ_KEY_PGID;
 	else if (!strncmp(page, "tgid", 4))
 		cfqd->key_type = CFQ_KEY_TGID;
+	else if (!strncmp(page, "pid", 3))
+		cfqd->key_type = CFQ_KEY_PID;
 	else if (!strncmp(page, "uid", 3))
 		cfqd->key_type = CFQ_KEY_UID;
 	else if (!strncmp(page, "gid", 3))
@@ -1704,6 +1791,10 @@
 SHOW_FUNCTION(cfq_find_best_show, cfqd->find_best_crq, 0);
 SHOW_FUNCTION(cfq_back_max_show, cfqd->cfq_back_max, 0);
 SHOW_FUNCTION(cfq_back_penalty_show, cfqd->cfq_back_penalty, 0);
+SHOW_FUNCTION(cfq_idle_show, cfqd->cfq_idle, 1);
+SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
+SHOW_FUNCTION(cfq_slice_sync_show, cfqd->cfq_slice[1], 1);
+SHOW_FUNCTION(cfq_max_depth_show, cfqd->cfq_max_depth, 0);
 #undef SHOW_FUNCTION
 
 #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
@@ -1729,6 +1820,10 @@
 STORE_FUNCTION(cfq_find_best_store, &cfqd->find_best_crq, 0, 1, 0);
 STORE_FUNCTION(cfq_back_max_store, &cfqd->cfq_back_max, 0, UINT_MAX, 0);
 STORE_FUNCTION(cfq_back_penalty_store, &cfqd->cfq_back_penalty, 1, UINT_MAX, 0);
+STORE_FUNCTION(cfq_idle_store, &cfqd->cfq_idle, 0, UINT_MAX, 1);
+STORE_FUNCTION(cfq_slice_async_store, &cfqd->cfq_slice[0], 1, UINT_MAX, 1);
+STORE_FUNCTION(cfq_slice_sync_store, &cfqd->cfq_slice[1], 1, UINT_MAX, 1);
+STORE_FUNCTION(cfq_max_depth_store, &cfqd->cfq_max_depth, 2, UINT_MAX, 0);
 #undef STORE_FUNCTION
 
 static struct cfq_fs_entry cfq_quantum_entry = {
@@ -1771,15 +1866,31 @@
 	.show = cfq_back_penalty_show,
 	.store = cfq_back_penalty_store,
 };
-static struct cfq_fs_entry cfq_clear_elapsed_entry = {
-	.attr = {.name = "clear_elapsed", .mode = S_IWUSR },
-	.store = cfq_clear_elapsed,
+static struct cfq_fs_entry cfq_slice_sync_entry = {
+	.attr = {.name = "slice_sync", .mode = S_IRUGO | S_IWUSR },
+	.show = cfq_slice_sync_show,
+	.store = cfq_slice_sync_store,
+};
+static struct cfq_fs_entry cfq_slice_async_entry = {
+	.attr = {.name = "slice_async", .mode = S_IRUGO | S_IWUSR },
+	.show = cfq_slice_async_show,
+	.store = cfq_slice_async_store,
+};
+static struct cfq_fs_entry cfq_idle_entry = {
+	.attr = {.name = "idle", .mode = S_IRUGO | S_IWUSR },
+	.show = cfq_idle_show,
+	.store = cfq_idle_store,
 };
 static struct cfq_fs_entry cfq_key_type_entry = {
 	.attr = {.name = "key_type", .mode = S_IRUGO | S_IWUSR },
 	.show = cfq_read_key_type,
 	.store = cfq_set_key_type,
 };
+static struct cfq_fs_entry cfq_max_depth_entry = {
+	.attr = {.name = "max_depth", .mode = S_IRUGO | S_IWUSR },
+	.show = cfq_max_depth_show,
+	.store = cfq_max_depth_store,
+};
 
 static struct attribute *default_attrs[] = {
 	&cfq_quantum_entry.attr,
@@ -1791,7 +1902,10 @@
 	&cfq_find_best_entry.attr,
 	&cfq_back_max_entry.attr,
 	&cfq_back_penalty_entry.attr,
-	&cfq_clear_elapsed_entry.attr,
+	&cfq_slice_sync_entry.attr,
+	&cfq_slice_async_entry.attr,
+	&cfq_idle_entry.attr,
+	&cfq_max_depth_entry.attr,
 	NULL,
 };
 
@@ -1856,7 +1970,7 @@
 	.elevator_owner =	THIS_MODULE,
 };
 
-int cfq_init(void)
+static int __init cfq_init(void)
 {
 	int ret;
 
@@ -1864,17 +1978,35 @@
 		return -ENOMEM;
 
 	ret = elv_register(&iosched_cfq);
-	if (!ret) {
-		__module_get(THIS_MODULE);
-		return 0;
-	}
+	if (ret)
+		cfq_slab_kill();
 
-	cfq_slab_kill();
 	return ret;
 }
 
 static void __exit cfq_exit(void)
 {
+	struct task_struct *g, *p;
+	unsigned long flags;
+
+	read_lock_irqsave(&tasklist_lock, flags);
+
+	/*
+	 * iterate each process in the system, removing our io_context
+	 */
+	do_each_thread(g, p) {
+		struct io_context *ioc = p->io_context;
+
+		if (ioc && ioc->cic) {
+			ioc->cic->exit(ioc->cic);
+			cfq_free_io_context(ioc->cic);
+			ioc->cic = NULL;
+		}
+
+	} while_each_thread(g, p);
+
+	read_unlock_irqrestore(&tasklist_lock, flags);
+
 	cfq_slab_kill();
 	elv_unregister(&iosched_cfq);
 }
 

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-03 10:38                       ` Jens Axboe
@ 2004-12-03 10:45                         ` Prakash K. Cheemplavam
  2004-12-03 10:48                           ` Jens Axboe
  0 siblings, 1 reply; 66+ messages in thread
From: Prakash K. Cheemplavam @ 2004-12-03 10:45 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Andrew Morton, Linux Kernel, nickpiggin

[-- Attachment #1: Type: text/plain, Size: 1184 bytes --]

Jens Axboe schrieb:
> On Fri, Dec 03 2004, Jens Axboe wrote:
> 
>>Funky. It looks like another case of the io scheduler being at the wrong
>>place - if raid sends dependent reads to different drives, it screws up
>>the io scheduling. The right way to fix that would be to io scheduler
>>before raid (reverse of what we do now), but that is a lot of work. A
>>hack would be to try and tie processes to one md component for periods
>>of time, sort of like cfq slicing.
> 
> 
> It makes sense to split the slice period for sync and async requests,
> since async requests usually get a lot of requests queued in a short
> period of time. Might even make sense to introduce a slice_rq value as
> well, limiting the number of requests queued in a given slice.
> 
> But at least this patch lets you set slice_sync and slice_async
> seperately, if you want to experiement.

Any idea which values I should try?

In general I rather have the impression that the problem I am experiencing 
is not caused by the io scheduler alone; or why do they all show the same 
problem?

BTW, I just did my little test on the ide drive and it shows the same 
problem, so it is not sata / libata related.

Prakash

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-03 10:45                         ` Prakash K. Cheemplavam
@ 2004-12-03 10:48                           ` Jens Axboe
  2004-12-03 11:27                             ` Prakash K. Cheemplavam
  0 siblings, 1 reply; 66+ messages in thread
From: Jens Axboe @ 2004-12-03 10:48 UTC (permalink / raw)
  To: Prakash K. Cheemplavam; +Cc: Andrew Morton, Linux Kernel, nickpiggin

On Fri, Dec 03 2004, Prakash K. Cheemplavam wrote:
> Jens Axboe schrieb:
> >On Fri, Dec 03 2004, Jens Axboe wrote:
> >
> >>Funky. It looks like another case of the io scheduler being at the wrong
> >>place - if raid sends dependent reads to different drives, it screws up
> >>the io scheduling. The right way to fix that would be to io scheduler
> >>before raid (reverse of what we do now), but that is a lot of work. A
> >>hack would be to try and tie processes to one md component for periods
> >>of time, sort of like cfq slicing.
> >
> >
> >It makes sense to split the slice period for sync and async requests,
> >since async requests usually get a lot of requests queued in a short
> >period of time. Might even make sense to introduce a slice_rq value as
> >well, limiting the number of requests queued in a given slice.
> >
> >But at least this patch lets you set slice_sync and slice_async
> >seperately, if you want to experiement.
> 
> An idea, which values I should try?

Just see if the default ones work (or how they work :-)

> In generell I rather have the impression the problem I am experiencing 
> is not the problem of the io scheduler alone or why do all show the same 
> problem?

It is not, but some io schedulers perform better than others.

> BTW, I just did my little test on the ide drive and it shows the same 
> problem, so it is not sata / libata related.

Single read/writer case works fine here for me, about half the bandwidth
for each. Please show some vmstats for this case, too. Right now I'm not
terribly interested in problems with raid alone, as I can poke holes in
that. If the single drive case is correct, then we can focus on raid.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-03 10:48                           ` Jens Axboe
@ 2004-12-03 11:27                             ` Prakash K. Cheemplavam
  2004-12-03 11:29                               ` Jens Axboe
  0 siblings, 1 reply; 66+ messages in thread
From: Prakash K. Cheemplavam @ 2004-12-03 11:27 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Andrew Morton, Linux Kernel, nickpiggin

[-- Attachment #1: Type: text/plain, Size: 4168 bytes --]

Jens Axboe schrieb:
> On Fri, Dec 03 2004, Prakash K. Cheemplavam wrote:
> 
>>>But at least this patch lets you set slice_sync and slice_async
>>>seperately, if you want to experiement.
>>
>>An idea, which values I should try?
> 
> 
> Just see if the default ones work (or how they work :-)
> 
>>BTW, I just did my little test on the ide drive and it shows the same 
>>problem, so it is not sata / libata related.
> 
> 
> Single read/writer case works fine here for me, about half the bandwidth
> for each. Please show some vmstats for this case, too. Right now I'm not
> terribly interested in problems with raid alone, as I can poke holes in
> that. If the single drive case is correct, then we can focus on raid.

I don't have enough space to perform this test on the ide drive, so I did 
it on the sata drive (single disk). The patch doesn't seem to do better. 
(But on the other hand I haven't tested your first version on a single 
disk.) At least it still doesn't look good enough in my eyes.

 procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
  1  3   2704   5368   1528 906540    0    4  2176 24068 1245   743  0  7  0 93
  0  3   2704   5432   1532 906252    0    0  5072 28160 1277   782  1  8  0 91
  0  5   2704   5688   1532 906080    0    0  9280  4524 1309   842  1 10  0 89
  1  3   2704   5232   1544 906208    0    0  6404 76388 1285   716  1 14  0 85
  0  3   2704   5496   1544 906524    0    0  8328 26624 1301   856  1  8  0 91
  0  3   2704   5512   1528 906636    0    0  9484 22016 1302   883  1  8  0 91
  0  3   2704   5816   1500 906296    0    0  5508 10288 1270   749  1  9  0 90
  0  4   2704   5620   1488 906608    0    0  3076 19920 1267   818  0 13  0 87
  1  4   2704   5684   1456 906432    0    0  3204 18432 1252   704  1  8  0 91
  1  3   2704   5504   1408 906168    0    0  5252 28672 1279   777  1 14  0 85
  0  4   2704   5120   1404 906296    0    0  8968 16384 1351   876  1  9  0 90
  0  4   2704   5364   1404 905620    0    0  5252 26112 1339   835  1 14  0 85
  0  4   2704   5600   1432 905036    0    0  1468 15876 1312   741  2  8  0 90
  1  4   2704   5556   1424 904704    0    0  1664 26112 1243   714  1 10  0 89
  0  4   2704   5492   1428 904100    0    0  1412 31232 1253   760  1 15  0 84
  0  4   2704   5568   1432 903456    0    0  1668 29696 1253   703  1 14  0 85
  1  4   2704   5620   1408 902980    0    0  1280 28672 1248   732  0 14  0 86
  0  4   2704   5236   1404 902888    0    0  2180 28704 1252   705  1 11  0 88
  0  4   2704   5632   1388 902180    0    0  1536 28160 1251   731  1 11  0 88
  0  3   2704   5120   1356 905968    0    0   384 57896 1257   751  1 14  0 85




What I don't like about the time sliced cfq (the first version as well) is 
that I don't get a good sustained rate anymore if I have only one writer 
and nothing else. IIRC, with plain cfq I at least got near maximum 
throughput (40-50MB/sec); now it oscillates much more. I have to recheck 
with plain cfq though. It might be ext3 related...

  0  2   2684   7016   9384 900664    0    0     0 59128 1217   576  1  7  0 92
  1  1   2684   5160   9368 898660    0    0     0 12300 1239  4861  1 60  0 39
  0  3   2684   5532   9364 896360    0    0     0 18684 1246  1723  1 48  0 51
  0  3   2684   5596   9364 896616    0    0     0 24576 1246   686  1  9  0 90
  0  3   2684   5596   9364 896612    0    0     0 38400 1261   718  0 13  0 87
  0  3   2684   5532   9360 896564    0    0     0 37888 1257   708  1 13  0 86
  0  3   2684   5532   8848 896884    0    0     0 36864 1260   825  1 12  0 87
  1  3   2696   5596   7440 898120    0    0     0 31744 1247   703  1 11  0 88
  0  3   2700   5660   5352 900080    0    0     0 37888 1258   768  1 13  0 86
  0  2   2700   6816   5216 900436    0    0     0 68772 1266   783  1 25  0 74
  0  2   2700   6884   5216 900436    0    0     0 19616 1247   679  2  1  0 97
  1  2   2700   7096   5216 900436    0    0     0 14976 1249   786  1  3  0 96
  0  2   2700   5352   4572 902432    0    0     4 66544 1263  2333  1 21  0 78



Prakash

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-03 11:27                             ` Prakash K. Cheemplavam
@ 2004-12-03 11:29                               ` Jens Axboe
  2004-12-03 11:52                                 ` Prakash K. Cheemplavam
  0 siblings, 1 reply; 66+ messages in thread
From: Jens Axboe @ 2004-12-03 11:29 UTC (permalink / raw)
  To: Prakash K. Cheemplavam; +Cc: Andrew Morton, Linux Kernel, nickpiggin

On Fri, Dec 03 2004, Prakash K. Cheemplavam wrote:
> Jens Axboe schrieb:
> >On Fri, Dec 03 2004, Prakash K. Cheemplavam wrote:
> >
> >>>But at least this patch lets you set slice_sync and slice_async
> >>>seperately, if you want to experiement.
> >>
> >>An idea, which values I should try?
> >
> >
> >Just see if the default ones work (or how they work :-)
> >
> >>BTW, I just did my little test on the ide drive and it shows the same 
> >>problem, so it is not sata / libata related.
> >
> >
> >Single read/writer case works fine here for me, about half the bandwidth
> >for each. Please show some vmstats for this case, too. Right now I'm not
> >terribly interested in problems with raid alone, as I can poke holes in
> >that. If the single drive case is correct, then we can focus on raid.
> 
> I have not enough space to perform this test on the ide drive, so I did 
> it on the sata (single disk). The patch doesn't seem to be better. (But 
> on the other hand I haven't tested you first version on single disk.) At 
> least it still doesn't look good enough in my eyes.
> 
> [vmstat output snipped, see previous mail]

Try increasing slice_sync and idle, just for fun.

> What I don't like about the time sliced cfq (first version as well) is 
> that I don't get good sustained rate anymore if I have only one writer 
> and nothing else. IIRC with plain cfq I at least got near to maximum 
> throughput (40-50mb/sec) now it oscillates much more. I have to recheck 
> with plain cfq though. It might be ext3 related...
> 
>  0  2   2684   7016   9384 900664    0    0     0 59128 1217   576  1  7  0 92
>  1  1   2684   5160   9368 898660    0    0     0 12300 1239  4861  1 60  0 39
>  0  3   2684   5532   9364 896360    0    0     0 18684 1246  1723  1 48  0 51
>  0  3   2684   5596   9364 896616    0    0     0 24576 1246   686  1  9  0 90

That's a bug, I've noticed it too. The sustained write rate for a single
thread is somewhat lower than it should be; it's on my todo list to
investigate.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-03 11:29                               ` Jens Axboe
@ 2004-12-03 11:52                                 ` Prakash K. Cheemplavam
  0 siblings, 0 replies; 66+ messages in thread
From: Prakash K. Cheemplavam @ 2004-12-03 11:52 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Andrew Morton, Linux Kernel, nickpiggin

[-- Attachment #1: Type: text/plain, Size: 6584 bytes --]

Jens Axboe schrieb:
> On Fri, Dec 03 2004, Prakash K. Cheemplavam wrote:
> 
>>Jens Axboe schrieb:
>>
>>>On Fri, Dec 03 2004, Prakash K. Cheemplavam wrote:
>>>
>>>
>>>>>But at least this patch lets you set slice_sync and slice_async
>>>>>seperately, if you want to experiement.
>>>>
>>>>An idea, which values I should try?
>>>
>>>
>>>Just see if the default ones work (or how they work :-)
>>>
>>>
>>>>BTW, I just did my little test on the ide drive and it shows the same 
>>>>problem, so it is not sata / libata related.
>>>
>>>
>>>Single read/writer case works fine here for me, about half the bandwidth
>>>for each. Please show some vmstats for this case, too. Right now I'm not
>>>terribly interested in problems with raid alone, as I can poke holes in
>>>that. If the single drive case is correct, then we can focus on raid.
>>
>>I have not enough space to perform this test on the ide drive, so I did 
>>it on the sata (single disk). The patch doesn't seem to be better. (But 
>>on the other hand I haven't tested you first version on single disk.) At 
>>least it still doesn't look good enough in my eyes.
>>
>> [vmstat output snipped, see previous mails]
> 
> 
> Try increasing slice_sync and idle, just for fun.

I changed them to 150 and 6, respectively:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
  r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
  1  5   2704   5720    960 900020    0    0    68 26624 1251   741  1 16  0 83
  1  3   2704   5708   1004 900312    0    0   312  4044 1294   686  1 11  0 88
  0  1   2704   5484   1024 899800    0    0   396 40008 1236   608  1  5  0 94
  0  3   2704   5284   1036 900696    0    0   516 49196 1246   682  1  5  0 94
  1  3   2704   5640   1040 900956    0    0  1416 21504 1252   722  1  4  0 95
  0  3   2704   5120   1040 902108    0    0  2688 12288 1230   672  1  2  0 97
  1  3   2704   5416   1036 902276    0    0  3076     0 1248   632  0  2  0 98
  0  4   2704   5448   1092 902748    0    0 11700    16 1306   857  1 16  0 83
  0  3   2704   5712   1132 900704    0    0  1064 63488 1259   755  1 15  0 84
  0  3   2704   5476   1156 901336    0    0  5656  8296 1272   725  1  7  0 92
  0  3   2704   5320   1208 900996    0    0  2988  3972 1256   696  1 18  0 81
  1  4   2704   5288   1240 899660    0    0  1956 60964 1278   757  1 12  0 87
  0  3   2704   5596   1292 899032    0    0  1688 24732 1284   813  1  8  0 91
  0  3   2704   6124   1308 899776    0    0  1424 42496 1253   678  1  7  0 92
  1  3   2704   5744   1324 900124    0    0    16 23552 1250   707  1  9  0 90
  0  3   2704   5108   1332 900768    0    0  1800 19968 1242   703  1  4  0 95
  0  3   2704   5640   1332 900132    0    0  3204 16896 1240   689  1  1  0 98
  0  3   2704   5512   1344 900696    0    0  2564  3036 1255   652  1  2  0 97
  2  3   2704   5264   1364 901384    0    0  2704 42572 1253   726  1 10  0 89
  1  3   2704   5096   1368 898108    0    0  1808 51724 1240  1984  1 53  0 46
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
  r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
  1  4   2704   5572   1348 896164    0    0  2816 18944 1239  1304  1 30  0 69
  0  4   2704   5436   1332 896152    0    0  3204 17408 1239   716  0  6  0 94
  0  4   2704   5452   1324 895884    0    0  3076 20480 1248   711  2  8  0 90
  0  4   2704   5444   1328 895668    0    0  3020 16384 1360   830  1  7  0 92
  0  4   2704   5708   1328 895248    0    0  1976 21952 1509  1213  4  8  0 88
  0  4   2704   5708   1328 895020    0    0  1536 25200 1258   803  2 10  0 88
  0  4   2704   5836   1332 894880    0    0  3204 16264 1281   908  3  8  0 89
  0  4   2704   5668   1320 895084    0    0   896 18172 1433   941  1  7  0 92
  0  4   2704   5324   1324 895644    0    0  4612 15924 1450   968  1  7  0 92
  0  3   2704   5464   1324 897836    0    0  7176 42820 1421  1074  1 29  0 70
  1  3   2704   5304   1324 898092    0    0   896 11516 1266   727  1  2  0 97
  0  4   2704   5336   1312 898080    0    0  2436 16684 1270   971  1 10  0 89
  0  3   2704   5608   1328 897816    0    0 17040 14124 1463  1162  3  7  0 90
  0  3   2704   5272   1348 897960    0    0 18196 11264 1435  1281  2 13  0 85
  0  3   2704   5592   1348 897488    0    0  6792 24284 1348  1102  6  8  0 86
  0  3   2704   5528   1364 897448    0    0   872 19516 1239   760  1  6  0 93
  0  3   2704   5592   1364 897348    0    0  1976 22408 1253   761  1  5  0 94
  0  3   2704   5528   1364 897252    0    0  2048 30820 1267   858  1  8  0 91
  0  3   2704   5528   1372 897132    0    0  5640 18812 1382   907  1  6  0 93
  0  3   2704   5208   1368 897388    0    0  2820 17356 1352   863  1  5  0 94

Prakash

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-02 14:41   ` Jens Axboe
@ 2004-12-04 13:05     ` Giuliano Pochini
  0 siblings, 0 replies; 66+ messages in thread
From: Giuliano Pochini @ 2004-12-04 13:05 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel

On Thu, 2 Dec 2004 15:41:34 +0100
Jens Axboe <axboe@suse.de> wrote:

> > > Case 4: write_files, random, bs=4k
> >
> > Just a thought... in this test the results don't look right. Why
> > aggregate bandwidth with 8 clients is higher than with 4 and 2 clients ?
> > In the cfq test with 8 clients aggregate bw is also higher than with
> > a single client.
>
> I don't know what happens with the 4 client case, but it's not that
> unlikely that aggregate bandwidth will be higher for more threads doing
> random writes, as request coalescing will help minimize seeks.

In order to keep the probability that requests get coalesced constant, the
size of the test file should be a multiple of the number of clients.


--
Giuliano.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-02 20:37             ` Jens Axboe
@ 2004-12-07 23:11               ` Nick Piggin
  0 siblings, 0 replies; 66+ messages in thread
From: Nick Piggin @ 2004-12-07 23:11 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Andrew Morton, linux-kernel

Jens Axboe wrote:
> On Thu, Dec 02 2004, Andrew Morton wrote:
> 
>>Jens Axboe <axboe@suse.de> wrote:
>>
>>>>So what are you doing different?
>>>
>>>Doing sync io, most likely. My results above are 64k O_DIRECT reads and
>>>writes, see the mention of the test cases in the first mail.
>>
>>OK.
>>
>>Writer:
>>
>>	while true
>>	do
>>	write-and-fsync -o -m 100 -c 65536 foo 
>>	done
>>
>>Reader:
>>
>>	time-read -o -b 65536 -n 256 x      (This is O_DIRECT)
>>or:	time-read -b 65536 -n 256 x	    (This is buffered)
>>
>>`vmstat 1':
>>
>>procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>> r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
>> 1  1   1032 137412   4276  84388   32    0 15456 25344 1659  1538  0  3 50 47
>> 0  1   1032 137468   4276  84388    0    0     0 32128 1521  1027  0  2 51 48
>> 0  1   1032 137476   4276  84388    0    0     0 32064 1519  1026  0  1 50 49
>> 0  1   1032 137476   4276  84388    0    0     0 33920 1556  1102  0  2 50 49
>> 0  1   1032 137476   4276  84388    0    0     0 33088 1541  1074  0  1 50 49
>> 0  2   1032 135676   4284  85944    0    0  1656 29732 1868  2506  0  3 49 47
>> 1  1   1032  96532   4292 125172    0    0 39220   128 10813 39313  0 31 35 34
>> 0  2   1032  57724   4332 163892    0    0 38828   128 10716 38907  0 28 38 35
>> 0  2   1032  18860   4368 202684    0    0 38768   128 10701 38845  1 28 38 35
>> 0  2   1032   3672   4248 217764    0    0 39188   128 10803 39327  0 28 37 34
>> 0  1   1032   2832   4260 218840    0    0 16812 17932 5504 17457  0 14 46 40
> 
> 
> Well there you go, exactly what I saw. The writer(s) basically make no
> progress as long as the reader is going. Since 'as' treats the sync
> writes like reads internally and given the really bad fairness problems
> demonstrated for same direction clients, that might be the same problem.
> 
> 
>>Ugly.
>>
>>(write-and-fsync and time-read are from
>>http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz)
> 
> 
> I'll try and post my cruddy test programs tomorrow as well. Pretty handy
> for getting a good feel for N client read/write performance.
> 


OK, sorry for not jumping in earlier. Yes, it will be synch IO that
is your problem.

I'll see if I can try improving things there for AS. I see (from your
first results in this thread) that CFQ does quite nicely here, better
than deadline.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-02 19:52     ` Jens Axboe
  2004-12-02 20:19       ` Andrew Morton
@ 2004-12-08  0:37       ` Andrea Arcangeli
  2004-12-08  0:54         ` Nick Piggin
  1 sibling, 1 reply; 66+ messages in thread
From: Andrea Arcangeli @ 2004-12-08  0:37 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Andrew Morton, linux-kernel, nickpiggin

On Thu, Dec 02, 2004 at 08:52:36PM +0100, Jens Axboe wrote:
> with its default io scheduler has basically zero write performance in

IMHO the default io scheduler should be changed to cfq. "as" is all but
general purpose, so it's a mistake to leave it as the default (plus, as
Jens found, the write bandwidth is nonexistent during reads; no surprise
it falls apart in any database load). We had to make cfq the default
for the enterprise release already. The first thing I do is add
elevator=cfq on a new install. I really like how well cfq has been
designed, implemented and tuned; Jens's results with his last patch are
quite impressive.

BTW, a bit of history that may be funny to read (and I believe nobody
on l-k knows about it): the first sfq I/O elevator idea (sfq is the
ancestor of cfq; cfq still falls back to sfq mode in the unlikely case
that no atomic memory is available during I/O) started at an openmosix
conference in Bologna, when I was listening to one guy fixing the
latency of some videogame app migrating from server to server with
openmosix. So they could use a few clustered boxes to host some hundred
videogame servers migrating depending on the load (I recall they said
the users tend to move from one game to the other all at the same time).
I had never heard of sfq before, but when I understood how it worked for
the packet scheduler and how they were using it to fix a latency issue in
the responsiveness of their game while the server was migrating, I
immediately got the idea that I could use the very same sfq algorithm for
the disk elevator too (at that time it was being used only in the
networking qdisc packet scheduler). I wasn't really sure at first if it
would work equally well for disk too (the network pays nothing for
seeks). But conceptually it seemed worth mentioning the idea to Jens so
he could evaluate it (I think he was already working on something
similar, but I hope I provided him with some useful hints). You know the
rest: he quickly turned it into cfq and made numerous improvements. The
funny thing I meant to say is that if I hadn't incidentally listened to
the videogame talk (a talk I'd normally avoid) we wouldn't have cfq today
in the I/O scheduler in its current great shape (of course we could still
have it, since Jens was already working on something similar, but perhaps
it would be at least a bit behind its current state of development).

Even a videogame server may turn out to be very useful ;). I'm quite
sure the developer who gave the openmosix videogame talk doesn't know his
talk had an impact on the kernel I/O scheduler ;). Hope he reads this
email.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-08  0:37       ` Andrea Arcangeli
@ 2004-12-08  0:54         ` Nick Piggin
  2004-12-08  1:37           ` Andrea Arcangeli
  2004-12-08  6:49           ` Jens Axboe
  0 siblings, 2 replies; 66+ messages in thread
From: Nick Piggin @ 2004-12-08  0:54 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Jens Axboe, Andrew Morton, linux-kernel

On Wed, 2004-12-08 at 01:37 +0100, Andrea Arcangeli wrote:
> On Thu, Dec 02, 2004 at 08:52:36PM +0100, Jens Axboe wrote:
> > with its default io scheduler has basically zero write performance in
> 
> IMHO the default io scheduler should be changed to cfq. as is all but
> general purpose so it's a mistake to leave it the default (plus as Jens

I think it is actually pretty good at general purpose stuff. For
example, it handles the old writes-starve-reads problem, which is
especially bad when doing small dependent reads like `find | xargs grep`.
(Although CFQ is probably better at this than deadline too).
It also tends to degrade more gracefully under memory load because
it doesn't require much readahead.

> found the write bandwidth is not existent during reads, no surprise it
> falls apart in any database load). We had to make the cfq the default
> for the enterprise release already. The first thing I do is to add
> elevator=cfq on a new install. I really like how well cfq has been
> designed, implemented and turned, Jens's results with his last patch are
> quite impressive.
> 

That is synch write bandwidth. Yes that seems to be a problem.




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-08  0:54         ` Nick Piggin
@ 2004-12-08  1:37           ` Andrea Arcangeli
  2004-12-08  1:47             ` Nick Piggin
  2004-12-08  2:00             ` Andrew Morton
  2004-12-08  6:49           ` Jens Axboe
  1 sibling, 2 replies; 66+ messages in thread
From: Andrea Arcangeli @ 2004-12-08  1:37 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Jens Axboe, Andrew Morton, linux-kernel

On Wed, Dec 08, 2004 at 11:54:13AM +1100, Nick Piggin wrote:
> That is synch write bandwidth. Yes that seems to be a problem.

It's not just sync writes, it's writes in general; blkdev doesn't know
whether the one waiting is pdflush or some other task. Once this is
fixed I will have to reconsider my opinion of course, but I guess after
it gets fixed the benefit of "as" on the desktop will decrease compared
to cfq as well. The desktop is ok with "as" simply because it's
normally optimal to stop writes completely, since there are few apps
doing write journaling or heavy writes, and there's normally no
contiguous read happening in the background. The desktop just needs a
temporary peak in read bandwidth when you click on openoffice or a
similar app (and "as" provides it). But on a mixed server doing some
significant reading and writing (i.e. somebody downloading the kernel
from kernel.org and installing it on some application server) I don't
think "as" is general purpose enough. Another example is multiuser usage
with one user reading a big mbox folder in mutt while the other user is
exiting mutt at the same time. The one exiting will practically have to
wait for the first user to finish his read I/O. All I/O becomes sync
when it exceeds the max size of the writeback cache.

"as" is clearly the best for the common case of the very desktop usage
(i.e. a machine 99.9% idle and without any I/O except when starting an app
or saving a file, and the user noticing a delay only while waiting for the
window to open up after he clicked the button). But I believe cfq is
better for general purpose usage where we cannot assume how the kernel
will be used.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-08  1:37           ` Andrea Arcangeli
@ 2004-12-08  1:47             ` Nick Piggin
  2004-12-08  2:09               ` Andrea Arcangeli
  2004-12-08  6:52               ` Jens Axboe
  2004-12-08  2:00             ` Andrew Morton
  1 sibling, 2 replies; 66+ messages in thread
From: Nick Piggin @ 2004-12-08  1:47 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Jens Axboe, Andrew Morton, linux-kernel

On Wed, 2004-12-08 at 02:37 +0100, Andrea Arcangeli wrote:
> On Wed, Dec 08, 2004 at 11:54:13AM +1100, Nick Piggin wrote:
> > That is synch write bandwidth. Yes that seems to be a problem.
> 
> It's not just sync write, it's a write in general, blkdev doesn't know
> if the one waiting is pdflush or some other task. Once this will be
> fixed I will have to reconsider my opinion of course, but I guess after

Yeah those sorts of dependencies are tricky. I think the best bet is
to not get _too_ fancy, and try to cover the basics like keeping
fairness good, and minimising write latency as much as possible.

> it gets fixed the benefit of "as" on the desktop will as well decrease
> compared to cfq. The desktop is ok with "as" simply because it's
> normally optimal to stop writes completely, since there are few apps
> doing write journaling or heavy writes, and there's normally no
> contigous read happening in the background. Desktop just needs a
> temporary peak read max bandwidth when you click on openoffice or
> similar app (and "as" provides it). But on a mixed server doing some
> significant read and write (i.e.  somebody downloading the kernel from
> kernel.org and installing it on some application server) I don't think
> "as" is general purpose enough. Another example is the multiuser usage
> with one user reading a big mbox folder in mutt, whole the other user
> s exiting mutt at the same time. The one exiting will pratically have to
> wait the first user to finish its read I/O. All I/O becomes sync when it
> exceeds the max size of the writeback cache.
> 

AS is surprisingly good when doing concurrent reads and buffered writes.
The buffered writes don't get starved too badly. Basically, AS just
ensures a reader will get the chance to play out its entire read batch
before switching to another reader or a writer.

Buffered writes don't suffer the same problem obviously because the
disk can easily be kept fed from cache. Any read vs buffered write
starvation you see will mainly be due to the /sys tunables that give
more priority to reads (which isn't a bad idea, generally).


> "as" is clearly the best for the common case of the very desktop usage
> (i.e. machine 99.9% idle and without any I/O except when starting an app
> or saving a file, and the user noticing delay only while waiting the
> window to open up after he clicked the button).  But I believe cfq is
> better for a general purpose usage where we cannot assume how the kernel
> will be used.

Maybe. CFQ may be a bit closer to a traditional elevator behaviour,
while AS uses some significantly different concepts which I guess
aren't as well tested and optimised for.



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-08  1:37           ` Andrea Arcangeli
  2004-12-08  1:47             ` Nick Piggin
@ 2004-12-08  2:00             ` Andrew Morton
  2004-12-08  2:08               ` Andrew Morton
                                 ` (2 more replies)
  1 sibling, 3 replies; 66+ messages in thread
From: Andrew Morton @ 2004-12-08  2:00 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: nickpiggin, axboe, linux-kernel

Andrea Arcangeli <andrea@suse.de> wrote:
>
> The desktop is ok with "as" simply because it's
>  normally optimal to stop writes completely

AS doesn't "stop writes completely".  With the current settings it
apportions about 1/3 of the disk's bandwidth to writes.

This thing Jens has found is for direct-io writes only.  It's a bug.

The other problem with AS is that it basically doesn't work at all with a
TCQ depth greater than four or so, and lots of people blindly look at
untuned SCSI benchmark results without realising that.  If a distro is
always selecting CFQ then they've probably gone and deoptimised all their
IDE users.  

AS needs another iteration of development to fix these things.  Right now
it's probably the case that we need CFQ or deadline for servers and AS for
desktops.   That's awkward.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-08  2:00             ` Andrew Morton
@ 2004-12-08  2:08               ` Andrew Morton
  2004-12-08  6:55                 ` Jens Axboe
  2004-12-08  2:20               ` Andrea Arcangeli
  2004-12-08  6:55               ` Jens Axboe
  2 siblings, 1 reply; 66+ messages in thread
From: Andrew Morton @ 2004-12-08  2:08 UTC (permalink / raw)
  To: andrea, nickpiggin, axboe, linux-kernel

Andrew Morton <akpm@osdl.org> wrote:
>
> If a distro is
>  always selecting CFQ then they've probably gone and deoptimised all their
>  IDE users.  

That being said, yeah, once we get the time-sliced-CFQ happening, it should
probably be made the default, at least until AS gets fixed up.  We need to
run the numbers and settle on that.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-08  1:47             ` Nick Piggin
@ 2004-12-08  2:09               ` Andrea Arcangeli
  2004-12-08  2:11                 ` Andrew Morton
  2004-12-08  6:52               ` Jens Axboe
  1 sibling, 1 reply; 66+ messages in thread
From: Andrea Arcangeli @ 2004-12-08  2:09 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Jens Axboe, Andrew Morton, linux-kernel

On Wed, Dec 08, 2004 at 12:47:08PM +1100, Nick Piggin wrote:
> Buffered writes don't suffer the same problem obviously because the
> disk can can easily be kept fed from cache. Any read vs buffered write

This is true for very small buffered writes, which is the case for
desktop usage, but for more server oriented usage, if the write isn't so
small and you flush the writeback cache to disk very slowly, eventually
it will become a _sync_ write. So I agree that as long as the write
doesn't become synchronous "as" provides better behaviour.

One hidden side effect of "as" is that by writing so slowly (and
64KiB/sec really is slow), it increases the time it will take for a
dirty page to be flushed to disk (with tons of ram and lots of continuous
readers I wouldn't be surprised if it could take hours for the data to
hit disk in an artificial testcase; you can do the math and find how
long it would take for the last page in the list to hit disk at
64KiB/sec).

> starvation you see will mainly be due to the /sys tunables that give
> more priority to reads (which isn't a bad idea, generally).

sure.

> Maybe. CFQ may be a bit closer to a traditional elevator behaviour,
> while AS uses some significantly different concepts which I guess
> aren't as well tested and optimised for.

It's already the best for desktop usage (even the 64KiB/sec behaviour is
the best on the desktop), but as you said above it uses significantly
different concepts, and that makes it by definition not general purpose
(and definitely a no-go for databases, while cfq isn't a no-go on the
desktop).

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-08  2:09               ` Andrea Arcangeli
@ 2004-12-08  2:11                 ` Andrew Morton
  2004-12-08  2:22                   ` Andrea Arcangeli
  0 siblings, 1 reply; 66+ messages in thread
From: Andrew Morton @ 2004-12-08  2:11 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: nickpiggin, axboe, linux-kernel

Andrea Arcangeli <andrea@suse.de> wrote:
>
>  One hidden side effect of "as" is that by writing so slowly (and
>  64KiB/sec really is slow), it increases the time it will take for a
>  dirty page to be flushed to disk

The 64k/sec only happens for direct-io, and those pages aren't dirty.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-08  2:00             ` Andrew Morton
  2004-12-08  2:08               ` Andrew Morton
@ 2004-12-08  2:20               ` Andrea Arcangeli
  2004-12-08  2:25                 ` Andrew Morton
  2004-12-08  6:55               ` Jens Axboe
  2 siblings, 1 reply; 66+ messages in thread
From: Andrea Arcangeli @ 2004-12-08  2:20 UTC (permalink / raw)
  To: Andrew Morton; +Cc: nickpiggin, axboe, linux-kernel

On Tue, Dec 07, 2004 at 06:00:33PM -0800, Andrew Morton wrote:
> untuned SCSI benchmark results without realising that.  If a distro is
> always selecting CFQ then they've probably gone and deoptimised all their
> IDE users.  

The enterprise distro definitely shouldn't use "as" by default: database
apps _must_ not use AS, they have to use either CFQ or deadline. CFQ is
definitely the best for enterprise distros. This is a tangible result;
SCSI/IDE doesn't matter at all (and keep in mind they use O_DIRECT a
lot, so the 64KiB/sec issue Jens found would be a showstopper for an
enterprise release; selecting something different than "as" is a _must_
for an enterprise distro).

In the desktop distro you'll notice that /proc/cmdline has elevator="as",
because desktop distros are, as expected, mostly going to be used for
desktop work.

But for enterprise distros this isn't the case, and cfq (or deadline)
must be the default, certainly not "as". So claiming that selecting cfq
by default (in the enterprise distro, as I said) is deoptimising users is
a wrong statement and the opposite of reality.

And personally I use cfq even on the desktop (since I'm not a normal
desktop user and I have apps writing too).

> AS needs another iteration of development to fix these things.  Right now
> it's probably the case that we need CFQ or deadline for servers and AS for
> desktops.   That's awkward.

Exactly.

If you believe AS is going to perform better than CFQ for enterprise
database usage, we just need to prove it in practice after the round of
fixes; changing the default back to "as" would then be an additional
one-liner on top of the fix for the blocker direct-io bug.

The desktop is already forced to "as" by /proc/cmdline, so it's not
affected by how we change the default of the enterprise distro, AFAIK.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-08  2:11                 ` Andrew Morton
@ 2004-12-08  2:22                   ` Andrea Arcangeli
  0 siblings, 0 replies; 66+ messages in thread
From: Andrea Arcangeli @ 2004-12-08  2:22 UTC (permalink / raw)
  To: Andrew Morton; +Cc: nickpiggin, axboe, linux-kernel

On Tue, Dec 07, 2004 at 06:11:37PM -0800, Andrew Morton wrote:
> Andrea Arcangeli <andrea@suse.de> wrote:
> >
> >  One hidden side effect of "as" is that by writing so slowly (and
> >  64KiB/sec really is slow), it increases the time it will take for a
> >  dirty page to be flushed to disk
> 
> The 64k/sec only happens for direct-io, and those pages aren't dirty.

I agree my above claim was wrong, thanks for correcting it.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-08  2:20               ` Andrea Arcangeli
@ 2004-12-08  2:25                 ` Andrew Morton
  2004-12-08  2:33                   ` Andrea Arcangeli
  2004-12-08  2:33                   ` Nick Piggin
  0 siblings, 2 replies; 66+ messages in thread
From: Andrew Morton @ 2004-12-08  2:25 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: nickpiggin, axboe, linux-kernel

Andrea Arcangeli <andrea@suse.de> wrote:
>
> On Tue, Dec 07, 2004 at 06:00:33PM -0800, Andrew Morton wrote:
> > untuned SCSI benchmark results without realising that.  If a distro is
> > always selecting CFQ then they've probably gone and deoptimised all their
> > IDE users.  
> 
> The enterprise distro definitely shouldn't use "as" by default: database
> apps _must_ not use AS, they've to use either CFQ or deadline. CFQ is
> definitely the best for enterprise distros. This is a tangible result,
> SCSI/IDE doesn't matter at all (and keep in mind they use O_DIRECT a
> lot, so such 64kib Jens found would be a showstopper for a enterprise
> release, slelecting something different than "as" is a _must_ for
> enterprise distro).

That's a missing hint in the direct-io code.  This fixes it up:

--- 25/fs/direct-io.c~a	2004-12-07 18:12:25.491602512 -0800
+++ 25-akpm/fs/direct-io.c	2004-12-07 18:13:13.661279608 -0800
@@ -1161,6 +1161,8 @@ __blockdev_direct_IO(int rw, struct kioc
 	struct dio *dio;
 	int reader_with_isem = (rw == READ && dio_lock_type == DIO_OWN_LOCKING);
 
+	current->flags |= PF_SYNCWRITE;
+
 	if (bdev)
 		bdev_blkbits = blksize_bits(bdev_hardsect_size(bdev));
 
@@ -1244,6 +1246,7 @@ __blockdev_direct_IO(int rw, struct kioc
 out:
 	if (reader_with_isem)
 		up(&inode->i_sem);
+	current->flags &= ~PF_SYNCWRITE;
 	return retval;
 }
 EXPORT_SYMBOL(__blockdev_direct_IO);
_
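
For reference, the consumer side of this hint in an io scheduler is
equally small: a request is treated as synchronous if it is a read, or
if it is a write issued by a task that has PF_SYNCWRITE set. A minimal
sketch of that check (illustrative only; the helper name is made up, not
the actual AS/CFQ code):

	/*
	 * Illustrative sketch: classify a request as synchronous.  Reads
	 * are always sync; a write counts as sync when the submitting task
	 * has flagged itself with PF_SYNCWRITE, as the direct-io hunk
	 * above does.
	 */
	static inline int request_is_sync(struct request *rq, struct task_struct *tsk)
	{
		if (rq_data_dir(rq) == READ)
			return 1;

		return (tsk->flags & PF_SYNCWRITE) != 0;
	}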

> ...
> 
> If you believe AS is going to perform better than CFQ on the database
> enterprise usage, we just need to prove it in practice after the round
> of fixes, then changing the default back to "as" it'll be an additional
> one liner on top of the blocker direct-io bug.

I don't think AS will ever meet the performance of CFQ or deadline for the
seeky database loads, unfortunately.  We busted a gut over that and were
never able to get better than 90% or so.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-08  2:25                 ` Andrew Morton
@ 2004-12-08  2:33                   ` Andrea Arcangeli
  2004-12-08  2:33                   ` Nick Piggin
  1 sibling, 0 replies; 66+ messages in thread
From: Andrea Arcangeli @ 2004-12-08  2:33 UTC (permalink / raw)
  To: Andrew Morton; +Cc: nickpiggin, axboe, linux-kernel

On Tue, Dec 07, 2004 at 06:25:57PM -0800, Andrew Morton wrote:
> That's a missing hint in the direct-io code.  This fixes it up:
> 
> --- 25/fs/direct-io.c~a	2004-12-07 18:12:25.491602512 -0800
> +++ 25-akpm/fs/direct-io.c	2004-12-07 18:13:13.661279608 -0800
> @@ -1161,6 +1161,8 @@ __blockdev_direct_IO(int rw, struct kioc
>  	struct dio *dio;
>  	int reader_with_isem = (rw == READ && dio_lock_type == DIO_OWN_LOCKING);
>  
> +	current->flags |= PF_SYNCWRITE;
> +
>  	if (bdev)
>  		bdev_blkbits = blksize_bits(bdev_hardsect_size(bdev));
>  
> @@ -1244,6 +1246,7 @@ __blockdev_direct_IO(int rw, struct kioc
>  out:
>  	if (reader_with_isem)
>  		up(&inode->i_sem);
> +	current->flags &= ~PF_SYNCWRITE;
>  	return retval;
>  }
>  EXPORT_SYMBOL(__blockdev_direct_IO);

that was fast ;) great, thanks!

> I don't think AS will ever meet the performance of CFQ or deadline for the

This is my expectation too, since for these apps write latency is almost
more important than read latency and writes are often sync.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-08  2:25                 ` Andrew Morton
  2004-12-08  2:33                   ` Andrea Arcangeli
@ 2004-12-08  2:33                   ` Nick Piggin
  2004-12-08  2:51                     ` Andrea Arcangeli
  2004-12-08  6:58                     ` Jens Axboe
  1 sibling, 2 replies; 66+ messages in thread
From: Nick Piggin @ 2004-12-08  2:33 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, axboe, linux-kernel

On Tue, 2004-12-07 at 18:25 -0800, Andrew Morton wrote:
> Andrea Arcangeli <andrea@suse.de> wrote:
> >
> > On Tue, Dec 07, 2004 at 06:00:33PM -0800, Andrew Morton wrote:
> > > untuned SCSI benchmark results without realising that.  If a distro is
> > > always selecting CFQ then they've probably gone and deoptimised all their
> > > IDE users.  
> > 
> > The enterprise distro definitely shouldn't use "as" by default: database
> > apps _must_ not use AS, they've to use either CFQ or deadline. CFQ is
> > definitely the best for enterprise distros. This is a tangible result,
> > SCSI/IDE doesn't matter at all (and keep in mind they use O_DIRECT a
> > lot, so such 64kib Jens found would be a showstopper for a enterprise
> > release, slelecting something different than "as" is a _must_ for
> > enterprise distro).
> 
> That's a missing hint in the direct-io code.  This fixes it up:
> 
> --- 25/fs/direct-io.c~a	2004-12-07 18:12:25.491602512 -0800
> +++ 25-akpm/fs/direct-io.c	2004-12-07 18:13:13.661279608 -0800
> @@ -1161,6 +1161,8 @@ __blockdev_direct_IO(int rw, struct kioc
>  	struct dio *dio;
>  	int reader_with_isem = (rw == READ && dio_lock_type == DIO_OWN_LOCKING);
>  
> +	current->flags |= PF_SYNCWRITE;
> +
>  	if (bdev)
>  		bdev_blkbits = blksize_bits(bdev_hardsect_size(bdev));
>  
> @@ -1244,6 +1246,7 @@ __blockdev_direct_IO(int rw, struct kioc
>  out:
>  	if (reader_with_isem)
>  		up(&inode->i_sem);
> +	current->flags &= ~PF_SYNCWRITE;
>  	return retval;
>  }
>  EXPORT_SYMBOL(__blockdev_direct_IO);
> _
> 
> > ...
> > 
> > If you believe AS is going to perform better than CFQ on the database
> > enterprise usage, we just need to prove it in practice after the round
> > of fixes, then changing the default back to "as" it'll be an additional
> > one liner on top of the blocker direct-io bug.
> 
> I don't think AS will ever meet the performance of CFQ or deadline for the
> seeky database loads, unfortunately.  We busted a gut over that and were
> never able to get better than 90% or so.
> 

I think we could detect when a disk asks for more than, say, 4
concurrent requests, and in that case turn off read anticipation
and all the anti-starvation for TCQ by default (with the option
to force it back on).

I think this would be a decent "it works" solution that would make
AS acceptable as a default.
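
Roughly, the detection amounts to keeping a count of requests handed to
the driver but not yet completed, and latching anticipation off once that
count ever exceeds a small limit. A sketch of the idea (names and the
threshold are illustrative, this is not the actual AS code):

	#define TCQ_DETECT_THRESHOLD	4

	struct antic_state {
		unsigned int in_flight;	/* issued to driver, not yet completed */
		int antic_enabled;	/* read anticipation on/off */
	};

	static void request_issued(struct antic_state *ad)
	{
		if (++ad->in_flight > TCQ_DETECT_THRESHOLD)
			ad->antic_enabled = 0;	/* a tunable could force it back on */
	}

	static void request_completed(struct antic_state *ad)
	{
		ad->in_flight--;
	}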



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-08  2:33                   ` Nick Piggin
@ 2004-12-08  2:51                     ` Andrea Arcangeli
  2004-12-08  3:02                       ` Nick Piggin
  2004-12-08  6:58                     ` Jens Axboe
  1 sibling, 1 reply; 66+ messages in thread
From: Andrea Arcangeli @ 2004-12-08  2:51 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, axboe, linux-kernel

On Wed, Dec 08, 2004 at 01:33:33PM +1100, Nick Piggin wrote:
> I think we could detect when a disk asks for more than, say, 4
> concurrent requests, and in that case turn off read anticipation
> and all the anti-starvation for TCQ by default (with the option
> to force it back on).

What do you mean by "disk asks for more than 4 concurrent requests"?
Do you mean checking the TCQ capability of the hardware storage?

> I think this would be a decent "it works" solution that would make
> AS acceptable as a default.

Perhaps the code would be the same but if you disable it completely on
certain hardware that's not AS anymore...

Then I believe it would be better to switch to cfq for storage capable
of more than 4 concurrent tagged queued requests instead of sticking
with a "disabled AS". What's the point of AS if the features of AS are
disabled?

One relevant feature of cfq is its fairness property, pid against pid
or user against user. You don't get that fairness with the other I/O
schedulers. It was designed for fairness from the start: fairness of
writes against writes and reads against reads, and of writes against
reads and reads against writes.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-08  2:51                     ` Andrea Arcangeli
@ 2004-12-08  3:02                       ` Nick Piggin
  0 siblings, 0 replies; 66+ messages in thread
From: Nick Piggin @ 2004-12-08  3:02 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Andrew Morton, axboe, linux-kernel

On Wed, 2004-12-08 at 03:51 +0100, Andrea Arcangeli wrote:
> On Wed, Dec 08, 2004 at 01:33:33PM +1100, Nick Piggin wrote:
> > I think we could detect when a disk asks for more than, say, 4
> > concurrent requests, and in that case turn off read anticipation
> > and all the anti-starvation for TCQ by default (with the option
> > to force it back on).
> 
> What do you mean with "disk asks for more than 4 concurrent requests?"
> You mean checking the TCQ capability of the hardware storage?
> 

Yeah. Just check if there are more than 4 outstanding requests at once.

> > I think this would be a decent "it works" solution that would make
> > AS acceptable as a default.
> 
> Perhaps the code would be the same but if you disable it completely on
> certain hardware that's not AS anymore...
> 

Which is what we want on those systems ;)

> Then I believe it would be better to switch to cfq for storage capable
> of more than 4 concurrent tagged queued requests instead of sticking
> with a "disabled AS". What's the point of AS if the features of AS are
> disabled?
> 

For everyone else, who do want the AS features (ie. not databases).

> One relevant feature of cfq is the fairness property of pid against pid
> or user against user. You don't get that fairness with the other I/O
> schedulers. It was designed for fairness since the first place. Fariness
> of writes against writes and reads against reads and write against reads
> and read against writes.

That is something, I'll grant you that.



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-08  0:54         ` Nick Piggin
  2004-12-08  1:37           ` Andrea Arcangeli
@ 2004-12-08  6:49           ` Jens Axboe
  1 sibling, 0 replies; 66+ messages in thread
From: Jens Axboe @ 2004-12-08  6:49 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrea Arcangeli, Andrew Morton, linux-kernel

On Wed, Dec 08 2004, Nick Piggin wrote:
> On Wed, 2004-12-08 at 01:37 +0100, Andrea Arcangeli wrote:
> > On Thu, Dec 02, 2004 at 08:52:36PM +0100, Jens Axboe wrote:
> > > with its default io scheduler has basically zero write performance in
> > 
> > IMHO the default io scheduler should be changed to cfq. as is all but
> > general purpose so it's a mistake to leave it the default (plus as Jens
> 
> I think it is actually pretty good at general purpose stuff. For
> example, the old writes starve reads thing. It is especially bad
> when doing small dependent reads like `find | xargs grep`. (Although
> CFQ is probably better at this than deadline too).

Time sliced cfq fixes this.

> It also tends to degrade more gracefully under memory load because
> it doesn't require much readahead.

Ditto.

> > found the write bandwidth is not existent during reads, no surprise it
> > falls apart in any database load). We had to make the cfq the default
> > for the enterprise release already. The first thing I do is to add
> > elevator=cfq on a new install. I really like how well cfq has been
> > designed, implemented and turned, Jens's results with his last patch are
> > quite impressive.
> > 
> 
> That is synch write bandwidth. Yes that seems to be a problem.

A pretty big one :-)

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-08  1:47             ` Nick Piggin
  2004-12-08  2:09               ` Andrea Arcangeli
@ 2004-12-08  6:52               ` Jens Axboe
  1 sibling, 0 replies; 66+ messages in thread
From: Jens Axboe @ 2004-12-08  6:52 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrea Arcangeli, Andrew Morton, linux-kernel

On Wed, Dec 08 2004, Nick Piggin wrote:
> > it gets fixed the benefit of "as" on the desktop will as well decrease
> > compared to cfq. The desktop is ok with "as" simply because it's
> > normally optimal to stop writes completely, since there are few apps
> > doing write journaling or heavy writes, and there's normally no
> > contigous read happening in the background. Desktop just needs a
> > temporary peak read max bandwidth when you click on openoffice or
> > similar app (and "as" provides it). But on a mixed server doing some
> > significant read and write (i.e.  somebody downloading the kernel from
> > kernel.org and installing it on some application server) I don't think
> > "as" is general purpose enough. Another example is the multiuser usage
> > with one user reading a big mbox folder in mutt, whole the other user
> > s exiting mutt at the same time. The one exiting will pratically have to
> > wait the first user to finish its read I/O. All I/O becomes sync when it
> > exceeds the max size of the writeback cache.
> > 
> 
> AS is surprisingly good when doing concurrent reads and buffered writes.
> The buffered writes don't get starved too badly. Basically, AS just
> ensures a reader will get the chance to play out its entire read batch
> before switching to another reader or a writer.

AS doesn't give a lot of bandwidth to the writes, only about 10-20%.
Time sliced cfq is more fair; you get closer to 50/50 in that case.

> Buffered writes don't suffer the same problem obviously because the
> disk can can easily be kept fed from cache. Any read vs buffered write
> starvation you see will mainly be due to the /sys tunables that give
> more priority to reads (which isn't a bad idea, generally).

Depends entirely on the workload; I don't think you can say something
like that in general. For a desktop load, sure.

> > "as" is clearly the best for the common case of the very desktop usage
> > (i.e. machine 99.9% idle and without any I/O except when starting an app
> > or saving a file, and the user noticing delay only while waiting the
> > window to open up after he clicked the button).  But I believe cfq is
> > better for a general purpose usage where we cannot assume how the kernel
> > will be used.
> 
> Maybe. CFQ may be a bit closer to a traditional elevator behaviour,
> while AS uses some significantly different concepts which I guess
> aren't as well tested and optimised for.

You should read the new cfq code; there isn't that much difference
when it comes to the plain act of ordering io or finding the next
request (I stole some code :-).

I don't see the point of the data direction batching that AS does.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-08  2:00             ` Andrew Morton
  2004-12-08  2:08               ` Andrew Morton
  2004-12-08  2:20               ` Andrea Arcangeli
@ 2004-12-08  6:55               ` Jens Axboe
  2004-12-08  7:08                 ` Nick Piggin
  2004-12-08 10:52                 ` Helge Hafting
  2 siblings, 2 replies; 66+ messages in thread
From: Jens Axboe @ 2004-12-08  6:55 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, nickpiggin, linux-kernel

On Tue, Dec 07 2004, Andrew Morton wrote:
> Andrea Arcangeli <andrea@suse.de> wrote:
> >
> > The desktop is ok with "as" simply because it's
> >  normally optimal to stop writes completely
> 
> AS doesn't "stop writes completely".  With the current settings it
> apportions about 1/3 of the disk's bandwidth to writes.
> 
> This thing Jens has found is for direct-io writes only.  It's a bug.

Indeed. It's a special case one, but nasty for that case.

> The other problem with AS is that it basically doesn't work at all with a
> TCQ depth greater than four or so, and lots of people blindly look at
> untuned SCSI benchmark results without realising that.  If a distro is

That's pretty easy to fix. I added something like that to cfq, and it's
not a lot of lines of code (grep for rq_in_driver and cfq_max_depth).
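
In short: rq_in_driver counts requests handed to the driver and not yet
completed, and dispatch backs off once it reaches the max_depth tunable.
A simplified sketch (only those two names are from the patch, the rest
is illustrative):

	struct depth_cap {
		unsigned int rq_in_driver;	/* issued to driver, not yet completed */
		unsigned int cfq_max_depth;	/* tunable; keep small for deep-TCQ devices */
	};

	static int cfq_may_dispatch(struct depth_cap *cfqd)
	{
		return cfqd->rq_in_driver < cfqd->cfq_max_depth;
	}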

> always selecting CFQ then they've probably gone and deoptimised all their
> IDE users.  

Andrew, AS has other issues; it's not a case of AS always being faster
at everything.

> AS needs another iteration of development to fix these things.  Right now
> it's probably the case that we need CFQ or deadline for servers and AS for
> desktops.   That's awkward.

Currently I think the time sliced cfq is the best all around. There are
still a few kinks to be shaken out, but generally I think the concept is
sounder than AS.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-08  2:08               ` Andrew Morton
@ 2004-12-08  6:55                 ` Jens Axboe
  0 siblings, 0 replies; 66+ messages in thread
From: Jens Axboe @ 2004-12-08  6:55 UTC (permalink / raw)
  To: Andrew Morton; +Cc: andrea, nickpiggin, linux-kernel

On Tue, Dec 07 2004, Andrew Morton wrote:
> Andrew Morton <akpm@osdl.org> wrote:
> >
> > If a distro is
> >  always selecting CFQ then they've probably gone and deoptimised all their
> >  IDE users.  
> 
> That being said, yeah, once we get the time-sliced-CFQ happening, it should
> probably be made the default, at least until AS gets fixed up.  We need to
> run the numbers and settle on that.

I'll do a new round of numbers today.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-08  2:33                   ` Nick Piggin
  2004-12-08  2:51                     ` Andrea Arcangeli
@ 2004-12-08  6:58                     ` Jens Axboe
  2004-12-08  7:14                       ` Nick Piggin
  1 sibling, 1 reply; 66+ messages in thread
From: Jens Axboe @ 2004-12-08  6:58 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, Andrea Arcangeli, linux-kernel

On Wed, Dec 08 2004, Nick Piggin wrote:
> On Tue, 2004-12-07 at 18:25 -0800, Andrew Morton wrote:
> > Andrea Arcangeli <andrea@suse.de> wrote:
> > >
> > > On Tue, Dec 07, 2004 at 06:00:33PM -0800, Andrew Morton wrote:
> > > > untuned SCSI benchmark results without realising that.  If a distro is
> > > > always selecting CFQ then they've probably gone and deoptimised all their
> > > > IDE users.  
> > > 
> > > The enterprise distro definitely shouldn't use "as" by default: database
> > > apps _must_ not use AS, they've to use either CFQ or deadline. CFQ is
> > > definitely the best for enterprise distros. This is a tangible result,
> > > SCSI/IDE doesn't matter at all (and keep in mind they use O_DIRECT a
> > > lot, so such 64kib Jens found would be a showstopper for a enterprise
> > > release, slelecting something different than "as" is a _must_ for
> > > enterprise distro).
> > 
> > That's a missing hint in the direct-io code.  This fixes it up:
> > 
> > --- 25/fs/direct-io.c~a	2004-12-07 18:12:25.491602512 -0800
> > +++ 25-akpm/fs/direct-io.c	2004-12-07 18:13:13.661279608 -0800
> > @@ -1161,6 +1161,8 @@ __blockdev_direct_IO(int rw, struct kioc
> >  	struct dio *dio;
> >  	int reader_with_isem = (rw == READ && dio_lock_type == DIO_OWN_LOCKING);
> >  
> > +	current->flags |= PF_SYNCWRITE;
> > +
> >  	if (bdev)
> >  		bdev_blkbits = blksize_bits(bdev_hardsect_size(bdev));
> >  
> > @@ -1244,6 +1246,7 @@ __blockdev_direct_IO(int rw, struct kioc
> >  out:
> >  	if (reader_with_isem)
> >  		up(&inode->i_sem);
> > +	current->flags &= ~PF_SYNCWRITE;
> >  	return retval;
> >  }
> >  EXPORT_SYMBOL(__blockdev_direct_IO);
> > _
> > 
> > > ...
> > > 
> > > If you believe AS is going to perform better than CFQ on the database
> > > enterprise usage, we just need to prove it in practice after the round
> > > of fixes, then changing the default back to "as" it'll be an additional
> > > one liner on top of the blocker direct-io bug.
> > 
> > I don't think AS will ever meet the performance of CFQ or deadline for the
> > seeky database loads, unfortunately.  We busted a gut over that and were
> > never able to get better than 90% or so.
> > 
> 
> I think we could detect when a disk asks for more than, say, 4
> concurrent requests, and in that case turn off read anticipation
> and all the anti-starvation for TCQ by default (with the option
> to force it back on).

CFQ only allows a certain depth at the hardware level, and you can control
that. I don't think you should drop the AS behaviour in that case; you
should look at when the last request comes in and what type it is.

With time sliced cfq I'm seeing some silly SCSI disk behaviour as well;
it gets harder to get good read bandwidth as the disk is trying pretty
hard to starve me. Maybe killing write back caching would help; I'll
have to try.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-08  6:55               ` Jens Axboe
@ 2004-12-08  7:08                 ` Nick Piggin
  2004-12-08  7:11                   ` Jens Axboe
  2004-12-08 10:52                 ` Helge Hafting
  1 sibling, 1 reply; 66+ messages in thread
From: Nick Piggin @ 2004-12-08  7:08 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Andrew Morton, Andrea Arcangeli, linux-kernel

On Wed, 2004-12-08 at 07:55 +0100, Jens Axboe wrote:
> On Tue, Dec 07 2004, Andrew Morton wrote:
> > Andrea Arcangeli <andrea@suse.de> wrote:
> > >
> > > The desktop is ok with "as" simply because it's
> > >  normally optimal to stop writes completely
> > 
> > AS doesn't "stop writes completely".  With the current settings it
> > apportions about 1/3 of the disk's bandwidth to writes.
> > 
> > This thing Jens has found is for direct-io writes only.  It's a bug.
> 
> Indeed. It's a special case one, but nasty for that case.
> 
> > The other problem with AS is that it basically doesn't work at all with a
> > TCQ depth greater than four or so, and lots of people blindly look at
> > untuned SCSI benchmark results without realising that.  If a distro is
> 
> That's pretty easy to fix. I added something like that to cfq, and it's
> not a lot of lines of code (grep for rq_in_driver and cfq_max_depth).
> 
> > always selecting CFQ then they've probably gone and deoptimised all their
> > IDE users.  
> 
> Andrew, AS has other issues, it's not a case of AS always being faster
> at everything.
> 
> > AS needs another iteration of development to fix these things.  Right now
> > it's probably the case that we need CFQ or deadline for servers and AS for
> > desktops.   That's awkward.
> 
> Currently I think the time sliced cfq is the best all around. There's
> still a few kinks to be shaken out, but generally I think the concept is
> sounder than AS.
> 

But aren't you basically unconditionally allowing a 4ms idle time after
reads? The complexity of AS (other than all the work we had to do to get
the block layer to cope with it), is getting it to turn off at (mostly)
the right times. Other than that, it is basically the deadline
scheduler.

I could be wrong, but it looks like you'll just run into the same sorts
of performance problems as AS initially had.



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-08  7:08                 ` Nick Piggin
@ 2004-12-08  7:11                   ` Jens Axboe
  2004-12-08  7:19                     ` Nick Piggin
  0 siblings, 1 reply; 66+ messages in thread
From: Jens Axboe @ 2004-12-08  7:11 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, Andrea Arcangeli, linux-kernel

On Wed, Dec 08 2004, Nick Piggin wrote:
> On Wed, 2004-12-08 at 07:55 +0100, Jens Axboe wrote:
> > On Tue, Dec 07 2004, Andrew Morton wrote:
> > > Andrea Arcangeli <andrea@suse.de> wrote:
> > > >
> > > > The desktop is ok with "as" simply because it's
> > > >  normally optimal to stop writes completely
> > > 
> > > AS doesn't "stop writes completely".  With the current settings it
> > > apportions about 1/3 of the disk's bandwidth to writes.
> > > 
> > > This thing Jens has found is for direct-io writes only.  It's a bug.
> > 
> > Indeed. It's a special case one, but nasty for that case.
> > 
> > > The other problem with AS is that it basically doesn't work at all with a
> > > TCQ depth greater than four or so, and lots of people blindly look at
> > > untuned SCSI benchmark results without realising that.  If a distro is
> > 
> > That's pretty easy to fix. I added something like that to cfq, and it's
> > not a lot of lines of code (grep for rq_in_driver and cfq_max_depth).
> > 
> > > always selecting CFQ then they've probably gone and deoptimised all their
> > > IDE users.  
> > 
> > Andrew, AS has other issues, it's not a case of AS always being faster
> > at everything.
> > 
> > > AS needs another iteration of development to fix these things.  Right now
> > > it's probably the case that we need CFQ or deadline for servers and AS for
> > > desktops.   That's awkward.
> > 
> > Currently I think the time sliced cfq is the best all around. There's
> > still a few kinks to be shaken out, but generally I think the concept is
> > sounder than AS.
> > 
> 
> But aren't you basically unconditionally allowing a 4ms idle time after
> reads? The complexity of AS (other than all the work we had to do to get
> the block layer to cope with it), is getting it to turn off at (mostly)
> the right times. Other than that, it is basically the deadline
> scheduler.

Yes, the concept is similar and there will be time wasting currently.
I've got some cases covered that AS doesn't, and there are definitely
some the other way around as well.

If you have any test cases/programs, I'd like to see them.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-08  6:58                     ` Jens Axboe
@ 2004-12-08  7:14                       ` Nick Piggin
  2004-12-08  7:20                         ` Jens Axboe
  2004-12-08 13:48                         ` Jens Axboe
  0 siblings, 2 replies; 66+ messages in thread
From: Nick Piggin @ 2004-12-08  7:14 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Andrew Morton, Andrea Arcangeli, linux-kernel

On Wed, 2004-12-08 at 07:58 +0100, Jens Axboe wrote:
> On Wed, Dec 08 2004, Nick Piggin wrote:
> > On Tue, 2004-12-07 at 18:25 -0800, Andrew Morton wrote:

> > I think we could detect when a disk asks for more than, say, 4
> > concurrent requests, and in that case turn off read anticipation
> > and all the anti-starvation for TCQ by default (with the option
> > to force it back on).
> 
> CFQ only allows a certain depth a the hardware level, you can control
> that. I don't think you should drop the AS behaviour in that case, you
> should look at when the last request comes in and what type it is.
> 
> With time sliced cfq I'm seeing some silly SCSI disk behaviour as well,
> it gets harder to get good read bandwidth as the disk is trying pretty
> hard to starve me. Maybe killing write back caching would help, I'll
> have to try.
> 

I "fixed" this in AS. It gets (or got, last time we checked, many months
ago) pretty good read latency even with a big write and a very large
tag depth.

What were the main things I had to do... hmm, I think the main one was
to not start on a new batch until all requests from a previous batch
are reported to have completed. So eg. you get all reads completing
before you start issuing any more writes. The write->read side of things
isn't so clear cut with your "smart" write caches on the IO systems, but
no doubt that helps a bit.
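
As a rough userspace sketch of that rule (names invented, nothing to do
with the real AS code): the data direction is only allowed to flip once
the in-flight count from the current batch has drained to zero.

#include <stdio.h>

enum batch_dir { BATCH_READS, BATCH_WRITES };

/* Toy batcher: a new batch in the other direction only starts once every
 * request issued in the current batch has been reported complete. */
struct batcher {
	enum batch_dir dir;
	int in_flight;
};

static int may_switch(const struct batcher *b)
{
	return b->in_flight == 0;
}

int main(void)
{
	struct batcher b = { BATCH_READS, 3 };

	while (b.in_flight) {
		b.in_flight--;	/* a read completion comes back from the drive */
		printf("read done, %d still in flight\n", b.in_flight);
	}
	if (may_switch(&b)) {
		b.dir = BATCH_WRITES;
		printf("batch drained, switching to writes\n");
	}
	return 0;
}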

Of course, after you do all that your database performance has well and
truly gone down the shitter. It is also hampered by the more fundamental
issue that read anticipation can block up the pipe for IO that is cached
on the controller/disks and would get satisfied immediately.



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-08  7:11                   ` Jens Axboe
@ 2004-12-08  7:19                     ` Nick Piggin
  2004-12-08  7:26                       ` Jens Axboe
  0 siblings, 1 reply; 66+ messages in thread
From: Nick Piggin @ 2004-12-08  7:19 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Andrew Morton, Andrea Arcangeli, linux-kernel

On Wed, 2004-12-08 at 08:11 +0100, Jens Axboe wrote:
> On Wed, Dec 08 2004, Nick Piggin wrote:
> > On Wed, 2004-12-08 at 07:55 +0100, Jens Axboe wrote:

> > > Currently I think the time sliced cfq is the best all around. There's
> > > still a few kinks to be shaken out, but generally I think the concept is
> > > sounder than AS.
> > > 
> > 
> > But aren't you basically unconditionally allowing a 4ms idle time after
> > reads? The complexity of AS (other than all the work we had to do to get
> > the block layer to cope with it), is getting it to turn off at (mostly)
> > the right times. Other than that, it is basically the deadline
> > scheduler.
> 
> Yes, the concept is similar and there will be time wasting currently.
> I've got some cases covered that AS doesn't, and there are definitely
> some the other way around as well.
> 

Oh? What have you got covered that AS doesn't? (I'm only reading the
patch itself, which isn't trivial to follow).

> If you have any test cases/programs, I'd like to see them.
> 

Hmm, damn. Lots of stuff. I guess some of the notable ones that I've
had trouble with are OraSim (Oracle might give you a copy), Andrew's
patch scripts when applying a stack of patches, pgbench... can't
really remember any others off the top of my head.

I've got a small set of basic test programs that are similar to the
sort of tests you've been running in this thread as well.



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-08  7:14                       ` Nick Piggin
@ 2004-12-08  7:20                         ` Jens Axboe
  2004-12-08  7:29                           ` Nick Piggin
  2004-12-08  7:30                           ` Andrew Morton
  2004-12-08 13:48                         ` Jens Axboe
  1 sibling, 2 replies; 66+ messages in thread
From: Jens Axboe @ 2004-12-08  7:20 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, Andrea Arcangeli, linux-kernel

On Wed, Dec 08 2004, Nick Piggin wrote:
> On Wed, 2004-12-08 at 07:58 +0100, Jens Axboe wrote:
> > On Wed, Dec 08 2004, Nick Piggin wrote:
> > > On Tue, 2004-12-07 at 18:25 -0800, Andrew Morton wrote:
> 
> > > I think we could detect when a disk asks for more than, say, 4
> > > concurrent requests, and in that case turn off read anticipation
> > > and all the anti-starvation for TCQ by default (with the option
> > > to force it back on).
> > 
> > CFQ only allows a certain depth a the hardware level, you can control
> > that. I don't think you should drop the AS behaviour in that case, you
> > should look at when the last request comes in and what type it is.
> > 
> > With time sliced cfq I'm seeing some silly SCSI disk behaviour as well,
> > it gets harder to get good read bandwidth as the disk is trying pretty
> > hard to starve me. Maybe killing write back caching would help, I'll
> > have to try.
> > 
> 
> I "fixed" this in AS. It gets (or got, last time we checked, many months
> ago) pretty good read latency even with a big write and a very large
> tag depth.
> 
> What were the main things I had to do... hmm, I think the main one was
> to not start on a new batch until all requests from a previous batch
> are reported to have completed. So eg. you get all reads completing
> before you start issuing any more writes. The write->read side of things
> isn't so clear cut with your "smart" write caches on the IO systems, but
> no doubt that helps a bit.

I can see the read/write batching being helpful there, at least to
prevent writes starving reads if you let the queue drain completely
before starting a new batch.

CFQ does something similar, just not batched together. But it does let
the depth build up a little and drain out. In fact, thinking about it, I
think I'm missing a little fix there; that could be why the read
latencies hurt on write intensive loads (the dispatch queue is drained,
but the hardware queue is not fully drained).

> Of course, after you do all that your database performance has well and
> truly gone down the shitter. It is also hampered by the more fundamental
> issue that read anticipating can block up the pipe for IO that is cached
> on the controller/disks and would get satisfied immediately.

I think we need to end up with something that sets the machine profile
for the interesting disks. Some things you can check for at runtime
(like the writes being extremely fast is a good indicator of write
caching), but it is just not possible to cover it all. Plus, you end up
with 30-40% of the code being convoluted stuff added to detect it.
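
A very crude userspace sketch of that particular runtime check (the file
name and threshold are made up, and a real probe would need far more
care): time a synchronous write and guess write-back caching if it
returns faster than a platter plausibly could.

#include <fcntl.h>
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

int main(void)
{
	char buf[4096] = { 0 };
	struct timeval t1, t2;
	long usec;
	int fd;

	/* O_SYNC should force the write out to the device; with write-back
	 * caching enabled it typically still completes almost instantly */
	fd = open("probe.tmp", O_CREAT | O_WRONLY | O_SYNC, 0600);
	if (fd < 0)
		return 1;

	gettimeofday(&t1, NULL);
	write(fd, buf, sizeof(buf));
	gettimeofday(&t2, NULL);
	close(fd);
	unlink("probe.tmp");

	usec = (t2.tv_sec - t1.tv_sec) * 1000000L + (t2.tv_usec - t1.tv_usec);
	printf("sync write took %ld usec -> %s\n", usec,
	       usec < 500 ? "probably write-back cached" : "probably hit the media");
	return 0;
}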

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-08  7:19                     ` Nick Piggin
@ 2004-12-08  7:26                       ` Jens Axboe
  2004-12-08  9:35                         ` Jens Axboe
  0 siblings, 1 reply; 66+ messages in thread
From: Jens Axboe @ 2004-12-08  7:26 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, Andrea Arcangeli, linux-kernel

On Wed, Dec 08 2004, Nick Piggin wrote:
> On Wed, 2004-12-08 at 08:11 +0100, Jens Axboe wrote:
> > On Wed, Dec 08 2004, Nick Piggin wrote:
> > > On Wed, 2004-12-08 at 07:55 +0100, Jens Axboe wrote:
> 
> > > > Currently I think the time sliced cfq is the best all around. There's
> > > > still a few kinks to be shaken out, but generally I think the concept is
> > > > sounder than AS.
> > > > 
> > > 
> > > But aren't you basically unconditionally allowing a 4ms idle time after
> > > reads? The complexity of AS (other than all the work we had to do to get
> > > the block layer to cope with it), is getting it to turn off at (mostly)
> > > the right times. Other than that, it is basically the deadline
> > > scheduler.
> > 
> > Yes, the concept is similar and there will be time wasting currently.
> > I've got some cases covered that AS doesn't, and there are definitely
> > some the other way around as well.
> > 
> 
> Oh? What have you got covered that AS doesn't? (I'm only reading the
> patch itself, which isn't trivial to follow).

You are only thinking in terms of single process characteristics, like
whether it will exit and its think times; the inter-process characteristics
are very haphazard. You might find the applied code easier to read, I think.

> > If you have any test cases/programs, I'd like to see them.
> > 
> 
> Hmm, damn. Lots of stuff. I guess some of the notable ones that I've
> had trouble with are OraSim (Oracle might give you a copy), Andrew's
> patch scripts when applying a stack of patches, pgbench... can't
> really remember any others off the top of my head.

The patch scripts case is interesting; last night (when committing other
patches) I was thinking I should try and bench that today. It has a good
mix of reads and writes.

There's still lots of tuning in the pipeline. As I wrote originally,
this was basically just a quick hack that I was surprised did so well
:-) It has grown a little since then and I think the concept is really
sound, so I'll continue to work on it.

> I've got a small set of basic test programs that are similar to the
> sort of tests you've been running in this thread as well.

Ok

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-08  7:20                         ` Jens Axboe
@ 2004-12-08  7:29                           ` Nick Piggin
  2004-12-08  7:32                             ` Jens Axboe
  2004-12-08  7:30                           ` Andrew Morton
  1 sibling, 1 reply; 66+ messages in thread
From: Nick Piggin @ 2004-12-08  7:29 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Andrew Morton, Andrea Arcangeli, linux-kernel

On Wed, 2004-12-08 at 08:20 +0100, Jens Axboe wrote:
> On Wed, Dec 08 2004, Nick Piggin wrote:
> > On Wed, 2004-12-08 at 07:58 +0100, Jens Axboe wrote:
> > > On Wed, Dec 08 2004, Nick Piggin wrote:
> > > > On Tue, 2004-12-07 at 18:25 -0800, Andrew Morton wrote:
> > 
> > > > I think we could detect when a disk asks for more than, say, 4
> > > > concurrent requests, and in that case turn off read anticipation
> > > > and all the anti-starvation for TCQ by default (with the option
> > > > to force it back on).
> > > 
> > > CFQ only allows a certain depth a the hardware level, you can control
> > > that. I don't think you should drop the AS behaviour in that case, you
> > > should look at when the last request comes in and what type it is.
> > > 
> > > With time sliced cfq I'm seeing some silly SCSI disk behaviour as well,
> > > it gets harder to get good read bandwidth as the disk is trying pretty
> > > hard to starve me. Maybe killing write back caching would help, I'll
> > > have to try.
> > > 
> > 
> > I "fixed" this in AS. It gets (or got, last time we checked, many months
> > ago) pretty good read latency even with a big write and a very large
> > tag depth.
> > 
> > What were the main things I had to do... hmm, I think the main one was
> > to not start on a new batch until all requests from a previous batch
> > are reported to have completed. So eg. you get all reads completing
> > before you start issuing any more writes. The write->read side of things
> > isn't so clear cut with your "smart" write caches on the IO systems, but
> > no doubt that helps a bit.
> 
> I can see the read/write batching being helpful there, at least to
> prevent writes starving reads if you let the queue drain completely
> before starting a new batch.
> 
> CFQ does something similar, just not batched together. But it does let
> the depth build up a little and drain out. In fact I think I'm missing
> a little fix there thinking about it, that could be why the read
> latencies hurt on write intensive loads (the dispatch queue is drained,
> the hardware queue is not fully).
> 

OK, you should look into that, because I found it was quite effective.
Maybe you have a little bug or oversight somewhere if your read latencies
are really bad. Note that AS read latencies at 256 tags aren't as good
as at 2 tags... but I think they're an order of magnitude better than
with deadline on the hardware we were testing.

> > Of course, after you do all that your database performance has well and
> > truly gone down the shitter. It is also hampered by the more fundamental
> > issue that read anticipating can block up the pipe for IO that is cached
> > on the controller/disks and would get satisfied immediately.
> 
> I think we need to end up with something that sets the machine profile
> for the interesting disks. Some things you can check for at runtime
> (like the writes being extremely fast is a good indicator of write
> caching), but it is just not possible to cover it all. Plus, you end up
> with 30-40% of the code being convoluted stuff added to detect it.
> 

Ideally maybe we would have a userspace program that detects various
disk parameters, asks the user / a config file what sort of workloads we
want to run, and spits out a recommended IO scheduler and /sys
configuration to accompany it.

That at least could be made rather more sophisticated than a kernel
solution, and could gather quite a lot of "static" disk properties.

Of course there will be also some things that need to be done in
kernel...



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-08  7:20                         ` Jens Axboe
  2004-12-08  7:29                           ` Nick Piggin
@ 2004-12-08  7:30                           ` Andrew Morton
  2004-12-08  7:36                             ` Jens Axboe
  1 sibling, 1 reply; 66+ messages in thread
From: Andrew Morton @ 2004-12-08  7:30 UTC (permalink / raw)
  To: Jens Axboe; +Cc: nickpiggin, andrea, linux-kernel

Jens Axboe <axboe@suse.de> wrote:
>
>  I think we need to end up with something that sets the machine profile
>  for the interesting disks. Some things you can check for at runtime
>  (like the writes being extremely fast is a good indicator of write
>  caching), but it is just not possible to cover it all. Plus, you end up
>  with 30-40% of the code being convoluted stuff added to detect it.

We can detect these things from userspace.  Parse the hdparm/scsiinfo
output, then poke numbers into /sys tunables.
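
A minimal sketch of the second half of that (the paths and values here
are just examples for an IDE disk; the hdparm/scsiinfo parsing is left
out entirely):

#include <stdio.h>

/* Write one value into a block-device sysfs tunable. */
static int set_tunable(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	fprintf(f, "%s\n", val);
	fclose(f);
	return 0;
}

int main(void)
{
	/* pick the elevator and queue size based on whatever the probing
	 * stage decided; /dev/hda is just an example device */
	set_tunable("/sys/block/hda/queue/scheduler", "cfq");
	set_tunable("/sys/block/hda/queue/nr_requests", "128");
	return 0;
}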


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-08  7:29                           ` Nick Piggin
@ 2004-12-08  7:32                             ` Jens Axboe
  0 siblings, 0 replies; 66+ messages in thread
From: Jens Axboe @ 2004-12-08  7:32 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, Andrea Arcangeli, linux-kernel

On Wed, Dec 08 2004, Nick Piggin wrote:
> On Wed, 2004-12-08 at 08:20 +0100, Jens Axboe wrote:
> > On Wed, Dec 08 2004, Nick Piggin wrote:
> > > On Wed, 2004-12-08 at 07:58 +0100, Jens Axboe wrote:
> > > > On Wed, Dec 08 2004, Nick Piggin wrote:
> > > > > On Tue, 2004-12-07 at 18:25 -0800, Andrew Morton wrote:
> > > 
> > > > > I think we could detect when a disk asks for more than, say, 4
> > > > > concurrent requests, and in that case turn off read anticipation
> > > > > and all the anti-starvation for TCQ by default (with the option
> > > > > to force it back on).
> > > > 
> > > > CFQ only allows a certain depth a the hardware level, you can control
> > > > that. I don't think you should drop the AS behaviour in that case, you
> > > > should look at when the last request comes in and what type it is.
> > > > 
> > > > With time sliced cfq I'm seeing some silly SCSI disk behaviour as well,
> > > > it gets harder to get good read bandwidth as the disk is trying pretty
> > > > hard to starve me. Maybe killing write back caching would help, I'll
> > > > have to try.
> > > > 
> > > 
> > > I "fixed" this in AS. It gets (or got, last time we checked, many months
> > > ago) pretty good read latency even with a big write and a very large
> > > tag depth.
> > > 
> > > What were the main things I had to do... hmm, I think the main one was
> > > to not start on a new batch until all requests from a previous batch
> > > are reported to have completed. So eg. you get all reads completing
> > > before you start issuing any more writes. The write->read side of things
> > > isn't so clear cut with your "smart" write caches on the IO systems, but
> > > no doubt that helps a bit.
> > 
> > I can see the read/write batching being helpful there, at least to
> > prevent writes starving reads if you let the queue drain completely
> > before starting a new batch.
> > 
> > CFQ does something similar, just not batched together. But it does let
> > the depth build up a little and drain out. In fact I think I'm missing
> > a little fix there thinking about it, that could be why the read
> > latencies hurt on write intensive loads (the dispatch queue is drained,
> > the hardware queue is not fully).
> > 
> 
> OK, you should look into that, because I found it was quite effective.
> Maybe you have a little bug or oversight somewhere if you read latencies
> are really bad. Note that AS read latencies at 256 tags aren't so good
> as at 2 tags... but I think they're an order of magnitude better than
> with deadline on the hardware we were testing.

It wasn't _that_ bad; the main issue really was that it was interfering
with the cfq slices and you didn't get really good aggregate throughput
for several threads. Once that happens, there's a nasty tendency for
both latency to rise and throughput to plummet quickly :-)

I cap the depth at a variable setting right now, so no more than 4 by
default.

> > > Of course, after you do all that your database performance has well and
> > > truly gone down the shitter. It is also hampered by the more fundamental
> > > issue that read anticipating can block up the pipe for IO that is cached
> > > on the controller/disks and would get satisfied immediately.
> > 
> > I think we need to end up with something that sets the machine profile
> > for the interesting disks. Some things you can check for at runtime
> > (like the writes being extremely fast is a good indicator of write
> > caching), but it is just not possible to cover it all. Plus, you end up
> > with 30-40% of the code being convoluted stuff added to detect it.
> > 
> 
> Ideally maybe we would have a userspace program that is run to detect
> various disk parameters and ask the user / config file what sort of
> workloads we want to do, and spits out a recommended IO scheduler and
> /sys configuration to accompany it.

Well, or have the user give a profile of the drive. There's no point in
attempting to guess things the user knows. And then there are things you
probably cannot get right in either case :)

> That at least could be made quite sophisticated than a kernel solution,
> and could gather quite a lot of "static" disk properties.

And move some code to user space.

> Of course there will be also some things that need to be done in
> kernel...

Always, we should of course run as well as we can without magic disk
programs being needed.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-08  7:30                           ` Andrew Morton
@ 2004-12-08  7:36                             ` Jens Axboe
  0 siblings, 0 replies; 66+ messages in thread
From: Jens Axboe @ 2004-12-08  7:36 UTC (permalink / raw)
  To: Andrew Morton; +Cc: nickpiggin, andrea, linux-kernel

On Tue, Dec 07 2004, Andrew Morton wrote:
> Jens Axboe <axboe@suse.de> wrote:
> >
> >  I think we need to end up with something that sets the machine profile
> >  for the interesting disks. Some things you can check for at runtime
> >  (like the writes being extremely fast is a good indicator of write
> >  caching), but it is just not possible to cover it all. Plus, you end up
> >  with 30-40% of the code being convoluted stuff added to detect it.
> 
> We can detect these things from userspace.  Parse the hdparm/scsiinfo
> output, then poke numbers into /sys tunables.

The simple things, like cache settings and queue depth - definitely. The
harder things, like how the drive actually behaves, you cannot. And
unfortunately the former is also pretty easy to control (at least for
the depth) and to gather at runtime. So I think a user mode helper
only makes sense if it can help you with real drive characteristics that
are hard to detect. Plus, settings have a knack for changing while we
are running as well.

Hmm so perhaps not such a hot idea after all. I don't envision anyone
actually doing it anyways, so...

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-08  7:26                       ` Jens Axboe
@ 2004-12-08  9:35                         ` Jens Axboe
  2004-12-08 10:08                           ` Jens Axboe
  2004-12-08 12:47                           ` Jens Axboe
  0 siblings, 2 replies; 66+ messages in thread
From: Jens Axboe @ 2004-12-08  9:35 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, Andrea Arcangeli, linux-kernel

On Wed, Dec 08 2004, Jens Axboe wrote:
> > Hmm, damn. Lots of stuff. I guess some of the notable ones that I've
> > had trouble with are OraSim (Oracle might give you a copy), Andrew's
> > patch scripts when applying a stack of patches, pgbench... can't
> > really remember any others off the top of my head.
> 
> The patch scripts case is interesting, last night (when committing other
> patches) I was thinking I should try and bench that today. It has a good
> mix of reads and writes.

AS is currently 10 seconds faster for that workload (untar of a kernel
and then applying 2237 patches). AS completes it in 155 seconds, CFQ
takes 164 seconds.

I still need to fix the streamed write performance regression, then I'll
see how the above compares again. CFQ doesn't do very well in e.g. the
tiobench streamed write case (it's about 30% slower than AS).

(btw, any mention of CFQ in this thread refers to time sliced cfq).

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-08  9:35                         ` Jens Axboe
@ 2004-12-08 10:08                           ` Jens Axboe
  2004-12-08 12:47                           ` Jens Axboe
  1 sibling, 0 replies; 66+ messages in thread
From: Jens Axboe @ 2004-12-08 10:08 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, Andrea Arcangeli, linux-kernel

On Wed, Dec 08 2004, Jens Axboe wrote:
> On Wed, Dec 08 2004, Jens Axboe wrote:
> > > Hmm, damn. Lots of stuff. I guess some of the notable ones that I've
> > > had trouble with are OraSim (Oracle might give you a copy), Andrew's
> > > patch scripts when applying a stack of patches, pgbench... can't
> > > really remember any others off the top of my head.
> > 
> > The patch scripts case is interesting, last night (when committing other
> > patches) I was thinking I should try and bench that today. It has a good
> > mix of reads and writes.
> 
> AS is currently 10 seconds faster for that workload (untar of a kernel
> and then applying 2237 patches). AS completes it in 155 seconds, CFQ
> takes 164 seconds.

DEADLINE does 160 seconds, btw.

Something like

for i in patches.*/*; do cp "$i" /dev/null; done

while running a

dd if=/dev/zero of=testfile bs=64k

could be better for both schedulers. AS completes the workload in
4min 14sec, CFQ in 3min 5sec. I don't have the time to try DEADLINE :-)

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-08 10:52                 ` Helge Hafting
@ 2004-12-08 10:49                   ` Jens Axboe
  0 siblings, 0 replies; 66+ messages in thread
From: Jens Axboe @ 2004-12-08 10:49 UTC (permalink / raw)
  To: Helge Hafting; +Cc: Andrew Morton, Andrea Arcangeli, nickpiggin, linux-kernel

On Wed, Dec 08 2004, Helge Hafting wrote:
> >>AS needs another iteration of development to fix these things.  Right now
> >>it's probably the case that we need CFQ or deadline for servers and AS for
> >>desktops.   That's awkward.
> >>   
> >>
> >
> >Currently I think the time sliced cfq is the best all around. There's
> >still a few kinks to be shaken out, but generally I think the concept is
> >sounder than AS.
> > 
> >
> I wonder, would it make sense to add some limited anticipation
> to the cfq scheduler?  It seems to me that there is room to
> get some of the AS benefit without getting too unfair:
> 
> AS does a wait that is short compared to a seek, getting some
> more locality almost for free.  Consider if CFQ did this, with
> the added limitation that it only let a few extra read requests
> in this way before doing the next seek anyway.  For example,
> allowing up to 3 extra anticipated read requests before
> seeking could quadruple read bandwith in some cases.  This is
> clearly not as fair, but the extra reads will be almost free
> because those few reads take little time compared to the seek
> that follows anyway.  Therefore, the latency for other requests
> shouldn't change much and we get the best of both AS and CFQ.
> Or have I made a broken assumption?

This is basically what time sliced cfq does. For sync requests, cfq
allows a definable idle period in which we give the process a chance to
submit a new request if it has enough time slice left to do so. This
'anticipation' is then just an artifact of the design of time sliced
cfq, where we assign a finite time period during which a given process
owns the disk.
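
In toy form (the numbers and structure are invented, only the shape
matters): a queue owns the disk while its slice lasts, and after a sync
request completes the scheduler is willing to idle a few milliseconds
for the next one instead of seeking away.

#include <stdio.h>

/* Toy slice: slice_left is the disk time this queue still owns,
 * idle_window is how long we will wait for a dependent sync request. */
struct slice {
	int slice_left;		/* ms */
	int idle_window;	/* ms */
};

static int may_idle(const struct slice *s)
{
	return s->slice_left >= s->idle_window;
}

int main(void)
{
	struct slice s = { 10, 4 };	/* toy values: 10ms slice, 4ms idle */

	while (s.slice_left > 0) {
		s.slice_left -= 3;	/* service one request, charge its time */
		if (!may_idle(&s)) {
			printf("not enough slice left to idle, giving up the disk\n");
			break;
		}
		/* otherwise idle up to idle_window ms for the next sync request */
	}
	printf("switching to the next queue\n");
	return 0;
}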

See my initial posting on time sliced cfq. That is why time sliced cfq
does as well as (or better than) AS for the many-client cases, while
still being fair.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-08  6:55               ` Jens Axboe
  2004-12-08  7:08                 ` Nick Piggin
@ 2004-12-08 10:52                 ` Helge Hafting
  2004-12-08 10:49                   ` Jens Axboe
  1 sibling, 1 reply; 66+ messages in thread
From: Helge Hafting @ 2004-12-08 10:52 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Andrew Morton, Andrea Arcangeli, nickpiggin, linux-kernel

Jens Axboe wrote:

>On Tue, Dec 07 2004, Andrew Morton wrote:
>  
>
>>Andrea Arcangeli <andrea@suse.de> wrote:
>>    
>>
>>>The desktop is ok with "as" simply because it's
>>> normally optimal to stop writes completely
>>>      
>>>
>>AS doesn't "stop writes completely".  With the current settings it
>>apportions about 1/3 of the disk's bandwidth to writes.
>>
>>This thing Jens has found is for direct-io writes only.  It's a bug.
>>    
>>
>
>Indeed. It's a special case one, but nasty for that case.
>
>  
>
>>The other problem with AS is that it basically doesn't work at all with a
>>TCQ depth greater than four or so, and lots of people blindly look at
>>untuned SCSI benchmark results without realising that.  If a distro is
>>    
>>
>
>That's pretty easy to fix. I added something like that to cfq, and it's
>not a lot of lines of code (grep for rq_in_driver and cfq_max_depth).
>
>  
>
>>always selecting CFQ then they've probably gone and deoptimised all their
>>IDE users.  
>>    
>>
>
>Andrew, AS has other issues, it's not a case of AS always being faster
>at everything.
>
>  
>
>>AS needs another iteration of development to fix these things.  Right now
>>it's probably the case that we need CFQ or deadline for servers and AS for
>>desktops.   That's awkward.
>>    
>>
>
>Currently I think the time sliced cfq is the best all around. There's
>still a few kinks to be shaken out, but generally I think the concept is
>sounder than AS.
>  
>
I wonder, would it make sense to add some limited anticipation
to the cfq scheduler?  It seems to me that there is room to
get some of the AS benefit without getting too unfair:

AS does a wait that is short compared to a seek, getting some
more locality almost for free.  Consider if CFQ did this, with
the added limitation that it only let a few extra read requests
in this way before doing the next seek anyway.  For example,
allowing up to 3 extra anticipated read requests before
seeking could quadruple read bandwidth in some cases.  This is
clearly not as fair, but the extra reads will be almost free
because those few reads take little time compared to the seek
that follows anyway.  Therefore, the latency for other requests
shouldn't change much and we get the best of both AS and CFQ.
Or have I made a broken assumption?

The max number of requests to anticipate could even be
configurable; just set it to 0 to get pure CFQ.

Helge Hafting


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-08  9:35                         ` Jens Axboe
  2004-12-08 10:08                           ` Jens Axboe
@ 2004-12-08 12:47                           ` Jens Axboe
  1 sibling, 0 replies; 66+ messages in thread
From: Jens Axboe @ 2004-12-08 12:47 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, Andrea Arcangeli, linux-kernel

On Wed, Dec 08 2004, Jens Axboe wrote:
> On Wed, Dec 08 2004, Jens Axboe wrote:
> > > Hmm, damn. Lots of stuff. I guess some of the notable ones that I've
> > > had trouble with are OraSim (Oracle might give you a copy), Andrew's
> > > patch scripts when applying a stack of patches, pgbench... can't
> > > really remember any others off the top of my head.
> > 
> > The patch scripts case is interesting, last night (when committing other
> > patches) I was thinking I should try and bench that today. It has a good
> > mix of reads and writes.
> 
> AS is currently 10 seconds faster for that workload (untar of a kernel
> and then applying 2237 patches). AS completes it in 155 seconds, CFQ
> takes 164 seconds.

It turned out to be a stupid dispatch sort error in cfq; now it has the
exact same runtime as AS.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
  2004-12-08  7:14                       ` Nick Piggin
  2004-12-08  7:20                         ` Jens Axboe
@ 2004-12-08 13:48                         ` Jens Axboe
  1 sibling, 0 replies; 66+ messages in thread
From: Jens Axboe @ 2004-12-08 13:48 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, Andrea Arcangeli, linux-kernel

On Wed, Dec 08 2004, Nick Piggin wrote:
> On Wed, 2004-12-08 at 07:58 +0100, Jens Axboe wrote:
> > On Wed, Dec 08 2004, Nick Piggin wrote:
> > > On Tue, 2004-12-07 at 18:25 -0800, Andrew Morton wrote:
> 
> > > I think we could detect when a disk asks for more than, say, 4
> > > concurrent requests, and in that case turn off read anticipation
> > > and all the anti-starvation for TCQ by default (with the option
> > > to force it back on).
> > 
> > CFQ only allows a certain depth a the hardware level, you can control
> > that. I don't think you should drop the AS behaviour in that case, you
> > should look at when the last request comes in and what type it is.
> > 
> > With time sliced cfq I'm seeing some silly SCSI disk behaviour as well,
> > it gets harder to get good read bandwidth as the disk is trying pretty
> > hard to starve me. Maybe killing write back caching would help, I'll
> > have to try.
> > 
> 
> I "fixed" this in AS. It gets (or got, last time we checked, many months
> ago) pretty good read latency even with a big write and a very large
> tag depth.

This problem was also caused by the dispatch sort bug. So you were
right, it was 'some little bug' in the code :)

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Time sliced CFQ io scheduler
@ 2004-12-03 20:52 Chuck Ebbert
  0 siblings, 0 replies; 66+ messages in thread
From: Chuck Ebbert @ 2004-12-03 20:52 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Prakash K. Cheemplavam, Andrew Morton, linux-kernel, Nick Piggin,
	Neil Brown

On Fri, 3 Dec 2004 at 11:31:30 +0100 Jens Axboe wrote:

>> Yes, I have linux raid (testing md1). I have applied both settings on
>> both drives and got an interesting new pattern: now it alternates. My
>> email client is still not usable while writing, though...
>
> Funky. It looks like another case of the io scheduler being at the wrong
> place - if raid sends dependent reads to different drives, it screws up
> the io scheduling. The right way to fix that would be to do the io
> scheduling before raid (the reverse of what we do now), but that is a lot
> of work. A
> hack would be to try and tie processes to one md component for periods
> of time, sort of like cfq slicing.

 How about having the raid1 read balance code send each read to every drive
in the mirror, and just take the first one that returns data?  It could then
cancel the rest, or just ignore them...  ;)


--Chuck Ebbert  03-Dec-04  15:43:54

^ permalink raw reply	[flat|nested] 66+ messages in thread

end of thread, other threads:[~2004-12-08 13:49 UTC | newest]

Thread overview: 66+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-12-02 13:04 Time sliced CFQ io scheduler Jens Axboe
2004-12-02 13:48 ` Jens Axboe
2004-12-02 19:48   ` Andrew Morton
2004-12-02 19:52     ` Jens Axboe
2004-12-02 20:19       ` Andrew Morton
2004-12-02 20:19         ` Jens Axboe
2004-12-02 20:34           ` Andrew Morton
2004-12-02 20:37             ` Jens Axboe
2004-12-07 23:11               ` Nick Piggin
2004-12-02 22:18         ` Prakash K. Cheemplavam
2004-12-03  7:01           ` Jens Axboe
2004-12-03  9:12             ` Prakash K. Cheemplavam
2004-12-03  9:18               ` Jens Axboe
2004-12-03  9:35                 ` Prakash K. Cheemplavam
2004-12-03  9:43                   ` Jens Axboe
2004-12-03  9:26               ` Andrew Morton
2004-12-03  9:34                 ` Prakash K. Cheemplavam
2004-12-03  9:39                 ` Jens Axboe
2004-12-03  9:54                   ` Prakash K. Cheemplavam
     [not found]                   ` <41B03722.5090001@gmx.de>
2004-12-03 10:31                     ` Jens Axboe
2004-12-03 10:38                       ` Jens Axboe
2004-12-03 10:45                         ` Prakash K. Cheemplavam
2004-12-03 10:48                           ` Jens Axboe
2004-12-03 11:27                             ` Prakash K. Cheemplavam
2004-12-03 11:29                               ` Jens Axboe
2004-12-03 11:52                                 ` Prakash K. Cheemplavam
2004-12-08  0:37       ` Andrea Arcangeli
2004-12-08  0:54         ` Nick Piggin
2004-12-08  1:37           ` Andrea Arcangeli
2004-12-08  1:47             ` Nick Piggin
2004-12-08  2:09               ` Andrea Arcangeli
2004-12-08  2:11                 ` Andrew Morton
2004-12-08  2:22                   ` Andrea Arcangeli
2004-12-08  6:52               ` Jens Axboe
2004-12-08  2:00             ` Andrew Morton
2004-12-08  2:08               ` Andrew Morton
2004-12-08  6:55                 ` Jens Axboe
2004-12-08  2:20               ` Andrea Arcangeli
2004-12-08  2:25                 ` Andrew Morton
2004-12-08  2:33                   ` Andrea Arcangeli
2004-12-08  2:33                   ` Nick Piggin
2004-12-08  2:51                     ` Andrea Arcangeli
2004-12-08  3:02                       ` Nick Piggin
2004-12-08  6:58                     ` Jens Axboe
2004-12-08  7:14                       ` Nick Piggin
2004-12-08  7:20                         ` Jens Axboe
2004-12-08  7:29                           ` Nick Piggin
2004-12-08  7:32                             ` Jens Axboe
2004-12-08  7:30                           ` Andrew Morton
2004-12-08  7:36                             ` Jens Axboe
2004-12-08 13:48                         ` Jens Axboe
2004-12-08  6:55               ` Jens Axboe
2004-12-08  7:08                 ` Nick Piggin
2004-12-08  7:11                   ` Jens Axboe
2004-12-08  7:19                     ` Nick Piggin
2004-12-08  7:26                       ` Jens Axboe
2004-12-08  9:35                         ` Jens Axboe
2004-12-08 10:08                           ` Jens Axboe
2004-12-08 12:47                           ` Jens Axboe
2004-12-08 10:52                 ` Helge Hafting
2004-12-08 10:49                   ` Jens Axboe
2004-12-08  6:49           ` Jens Axboe
2004-12-02 14:28 ` Giuliano Pochini
2004-12-02 14:41   ` Jens Axboe
2004-12-04 13:05     ` Giuliano Pochini
2004-12-03 20:52 Chuck Ebbert

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).