Date: Thu, 11 Feb 2016 17:22:10 -0500
From: Tejun Heo
To: Paolo Valente
Cc: Jens Axboe, Fabio Checconi, Arianna Avanzini,
	linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	ulf.hansson@linaro.org, linus.walleij@linaro.org, broonie@kernel.org
Subject: Re: [PATCH RFC 09/22] block, cfq: replace CFQ with the BFQ-v0 I/O scheduler
Message-ID: <20160211222210.GC3741@mtj.duckdns.org>
In-Reply-To: <1454364778-25179-10-git-send-email-paolo.valente@linaro.org>

Hello,

Here's the biggest question I have.  If I'm reading the paper and code
correctly, bfq is using bandwidth as the virtual time used for
scheduling.  This doesn't make sense to me because for most IO
devices, and especially for rotating disks, bandwidth is a number
which fluctuates easily and widely.  Trying to schedule a process with
a sequential IO pattern against another with random IO on a bandwidth
basis would be disastrous.

bfq adjusts for this by "punishing" seeky workloads, which, I guess,
is because doing naive bandwidth-based scheduling is simply
unworkable.  The punishment is naturally charging the queue more than
it actually consumed, but that looks like a twisted way of doing
time-based scheduling.
It uses some heuristics to determine how much a seeky queue should get
over-charged so that the overcharged amount equals the actual IO
resource consumed, which is IO time on the device.  cfq may have a lot
of shortcomings, but abstracting IO resource in terms of device IO
time isn't one of them, and bfq follows all the necessary scheduling
mechanics to be able to do the same - a single issuer at a time with
opportunistic idling.  So, I'm scratching my head wondering why this
wasn't time based to begin with.  My limited understanding of the
scheduling algorithm doesn't reveal any reason why this couldn't have
been time based.  What am I missing?

I haven't scrutinized the code closely and what follows are all
trivial stylistic comments.

>  block/Kconfig.iosched |    8 +-
>  block/bfq.h           |  502 ++++++

Why does this need to be in a separate file?  It doesn't seem to get
used anywhere other than cfq-iosched.c.

> +/**
> + * struct bfq_data - per device data structure.
> + * @queue: request queue for the managed device.
> + * @sched_data: root @bfq_sched_data for the device.
> + * @busy_queues: number of bfq_queues containing requests (including the
> + *               queue in service, even if it is idling).
...

I'm personally not a big fan of documenting struct fields this way.
It's too easy for them to get out of sync.

> + * All the fields are protected by the @queue lock.
> + */
> +struct bfq_data {
> +	struct request_queue *queue;
> +
> +	struct bfq_sched_data sched_data;

and also I think it's easier on the eyes to align field names

	struct request_queue	*queue;

	struct bfq_sched_data	sched_data;

That said, this is fundamentally bike-shedding, so please feel free to
ignore.

> +enum bfqq_state_flags {
> +	BFQ_BFQQ_FLAG_busy = 0,		/* has requests or is in service */
> +	BFQ_BFQQ_FLAG_wait_request,	/* waiting for a request */
> +	BFQ_BFQQ_FLAG_must_alloc,	/* must be allowed rq alloc */
...
> +#define BFQ_BFQQ_FNS(name)						\
...
> +
> +BFQ_BFQQ_FNS(busy);
> +BFQ_BFQQ_FNS(wait_request);
> +BFQ_BFQQ_FNS(must_alloc);

Heh... can we not do this?  What's wrong with just doing the
following?

	enum {
		BFQQ_BUSY		= 1U << 0,
		BFQQ_WAIT_REQUEST	= 1U << 1,
		BFQQ_MUST_ALLOC		= 1U << 2,
		...
	};

> +/* Logging facilities. */
> +#define bfq_log_bfqq(bfqd, bfqq, fmt, args...) \
> +	blk_add_trace_msg((bfqd)->queue, "bfq%d " fmt, (bfqq)->pid, ##args)

I don't think it's too productive to repeat bfq's.  bfqq_log()?

> +#define bfq_log(bfqd, fmt, args...) \
> +	blk_add_trace_msg((bfqd)->queue, "bfq " fmt, ##args)
> +
> +/* Expiration reasons. */
> +enum bfqq_expiration {
> +	BFQ_BFQQ_TOO_IDLE = 0,		/* queue has been idling too long */
> +	BFQ_BFQQ_BUDGET_TIMEOUT,	/* budget took too long to be used */
> +	BFQ_BFQQ_BUDGET_EXHAUSTED,	/* budget consumed */
> +	BFQ_BFQQ_NO_MORE_REQUESTS,	/* the queue has no more requests */
> +};

Given that these are less common than the flags, maybe use something
like BFQQ_EXP_TOO_IDLE?

> +static struct bfq_queue *bfq_entity_to_bfqq(struct bfq_entity *entity);

bfqe_to_bfqq()?  And so on.  I don't think being verbose with heavily
repeated keywords helps much.

> +static struct bfq_queue *bic_to_bfqq(struct bfq_io_cq *bic, bool is_sync)
> +{
> +	return bic->bfqq[is_sync];
> +}
> +
> +static void bic_set_bfqq(struct bfq_io_cq *bic, struct bfq_queue *bfqq,
> +			 bool is_sync)
> +{
> +	bic->bfqq[is_sync] = bfqq;
> +}

Do helpers like the above add anything?  Don't these obfuscate more
than help?

> +static struct bfq_data *bfq_get_bfqd_locked(void **ptr, unsigned long *flags)

bfqd_get_and_lock_irqsave()?

> +{
> +	struct bfq_data *bfqd;
> +
> +	rcu_read_lock();
> +	bfqd = rcu_dereference(*(struct bfq_data **)ptr);
> +
> +	if (bfqd != NULL) {
> +		spin_lock_irqsave(bfqd->queue->queue_lock, *flags);
> +		if (ptr == NULL)
> +			pr_crit("get_bfqd_locked pointer NULL\n");

I don't get it.  How can @ptr be NULL at this point?  The kernel would
have crashed on the rcu_dereference() above.
> +		else if (*ptr == bfqd)
> +			goto out;
> +		spin_unlock_irqrestore(bfqd->queue->queue_lock, *flags);
> +	}
> +
> +	bfqd = NULL;
> +out:
> +	rcu_read_unlock();
> +	return bfqd;
> +}
> +
> +static void bfq_put_bfqd_unlock(struct bfq_data *bfqd, unsigned long *flags)
> +{
> +	spin_unlock_irqrestore(bfqd->queue->queue_lock, *flags);
> +}

Ditto, I don't think these single liners help anything.  Just make
sure that the counterpart's name clearly indicates that it's a locking
function.

...

> +/* hw_tag detection: parallel requests threshold and min samples needed. */
> +#define BFQ_HW_QUEUE_THRESHOLD	4
> +#define BFQ_HW_QUEUE_SAMPLES	32
> +
> +#define BFQQ_SEEK_THR	 (sector_t)(8 * 1024)
> +#define BFQQ_SEEKY(bfqq) ((bfqq)->seek_mean > BFQQ_SEEK_THR)

Inconsistent indenting?

...

> + * Shift for timestamp calculations.  This actually limits the maximum
> + * service allowed in one timestamp delta (small shift values increase it),
> + * the maximum total weight that can be used for the queues in the system
> + * (big shift values increase it), and the period of virtual time
> + * wraparounds.
> + */
> +#define WFQ_SERVICE_SHIFT	22

Collect constants in one place?

...

> +static struct bfq_entity *bfq_entity_of(struct rb_node *node)
> +{
> +	struct bfq_entity *entity = NULL;
>
> +	if (node)
> +		entity = rb_entry(node, struct bfq_entity, rb_node);
>
> +	return entity;
> +}

Maybe

	if (!node)
		return NULL;
	return rb_entry(node, ...);

> +static unsigned short bfq_weight_to_ioprio(int weight)
> {
> +	return IOPRIO_BE_NR * BFQ_WEIGHT_CONVERSION_COEFF - weight < 0 ?
> +		0 : IOPRIO_BE_NR * BFQ_WEIGHT_CONVERSION_COEFF - weight;

Heh, how about?
	return max(IOPRIO_BE_NR * BFQ_WEIGHT_CONVERSION_COEFF - weight, 0);

> +static struct rb_node *bfq_find_deepest(struct rb_node *node)
> {
> +	struct rb_node *deepest;
> +
> +	if (!node->rb_right && !node->rb_left)
> +		deepest = rb_parent(node);
> +	else if (!node->rb_right)
> +		deepest = node->rb_left;
> +	else if (!node->rb_left)
> +		deepest = node->rb_right;
> +	else {
> +		deepest = rb_next(node);
> +		if (deepest->rb_right)
> +			deepest = deepest->rb_right;
> +		else if (rb_parent(deepest) != node)
> +			deepest = rb_parent(deepest);
> +	}

According to CodingStyle, if any clause gets braced, all clauses get
braced.

And there were some lines which were unnecessarily broken when joining
them would still fit under the 80col limit.

Thanks.

-- 
tejun