Re: [PATCH RFC 09/22] block, cfq: replace CFQ with the BFQ-v0 I/O scheduler

From: Tejun Heo <tj@kernel.org>
To: Paolo Valente <paolo.valente@linaro.org>
Cc: Jens Axboe <axboe@kernel.dk>,
	Fabio Checconi <fchecconi@gmail.com>,
	Arianna Avanzini <avanzini.arianna@gmail.com>,
	linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	ulf.hansson@linaro.org, linus.walleij@linaro.org,
	broonie@kernel.org
Subject: Re: [PATCH RFC 09/22] block, cfq: replace CFQ with the BFQ-v0 I/O scheduler
Date: Thu, 11 Feb 2016 17:22:10 -0500	[thread overview]
Message-ID: <20160211222210.GC3741@mtj.duckdns.org> (raw)
In-Reply-To: <1454364778-25179-10-git-send-email-paolo.valente@linaro.org>

Hello,

Here's the biggest question I have.  If I'm reading the paper and code
correctly, bfq is using bandwidth as the virtual time used for
scheduling.  This doesn't make sense to me because for most IO devices
and especially for rotating disks bandwidth is a number which
fluctuates easily and widely.  Trying to schedule a process with
sequential IO pattern and another with random IO on bandwidth basis
would be disastrous.

bfq adjusts for the fact by "punishing" seeky workloads, which, I
guess, is because doing naive bandwidth based scheduling is simply
unworkable.  The punishment is naturally charging it more than it
consumed but that looks like a twisted way of doing time based
scheduling.  It uses some heuristics to determine how much a seeky
queue should get over-charged so that the overcharged amount equals
the actual IO resource consumed, which is IO time on the device.

cfq may have a lot of shortcomings but abstracting IO resource in
terms of device IO time isn't one of them and bfq follows all the
necessary pieces of scheduling mechanics to be able to do so too -
single issuer at a time with opportunistic idling.

So, I'm scratching my head wondering why this wasn't time based to
begin with.  My limited understanding of the scheduling algorithm
doesn't show anything why this couldn't have been time based.  What am
I missing?

I haven't scrutinized the code closely and what follows are all
trivial stylistic comments.

>  block/Kconfig.iosched |    8 +-
>  block/bfq.h           |  502 ++++++

Why does this need to be in a separate file?  It doesn't seem to get
used anywhere other than cfq-iosched.c.

> +/**
> + * struct bfq_data - per device data structure.
> + * @queue: request queue for the managed device.
> + * @sched_data: root @bfq_sched_data for the device.
> + * @busy_queues: number of bfq_queues containing requests (including the
> + *		 queue in service, even if it is idling).
...

I'm personally not a big fan of documenting struct fields this way.
It's too easy to get them out of sync.

> + * All the fields are protected by the @queue lock.
> + */
> +struct bfq_data {
> +	struct request_queue *queue;
> +
> +	struct bfq_sched_data sched_data;

and also I think it's easier on the eyes to align field names

	struct request_queue		*queue;
	struct bfq_sched_data		sched_data;

That said, this is fundamentally bike-shedding, so please feel free
to ignore.

> +enum bfqq_state_flags {
> +	BFQ_BFQQ_FLAG_busy = 0,		/* has requests or is in service */
> +	BFQ_BFQQ_FLAG_wait_request,	/* waiting for a request */
> +	BFQ_BFQQ_FLAG_must_alloc,	/* must be allowed rq alloc */
...
> +#define BFQ_BFQQ_FNS(name)						\
...
> +
> +BFQ_BFQQ_FNS(busy);
> +BFQ_BFQQ_FNS(wait_request);
> +BFQ_BFQQ_FNS(must_alloc);

Heh... can we not do this?  What's wrong with just doing the following?

enum {
	BFQQ_BUSY		= 1U << 0,
	BFQQ_WAIT_REQUEST	= 1U << 1,
	BFQQ_MUST_ALLOC		= 1U << 2,
	...
};

> +/* Logging facilities. */
> +#define bfq_log_bfqq(bfqd, bfqq, fmt, args...) \
> +	blk_add_trace_msg((bfqd)->queue, "bfq%d " fmt, (bfqq)->pid, ##args)

I don't think it's too productive to repeat bfq's.  bfqq_log()?

> +#define bfq_log(bfqd, fmt, args...) \
> +	blk_add_trace_msg((bfqd)->queue, "bfq " fmt, ##args)
> +
> +/* Expiration reasons. */
> +enum bfqq_expiration {
> +	BFQ_BFQQ_TOO_IDLE = 0,		/*
> +					 * queue has been idling for
> +					 * too long
> +					 */
> +	BFQ_BFQQ_BUDGET_TIMEOUT,	/* budget took too long to be used */
> +	BFQ_BFQQ_BUDGET_EXHAUSTED,	/* budget consumed */
> +	BFQ_BFQQ_NO_MORE_REQUESTS,	/* the queue has no more requests */
> +};

Given that these are less common than the flags, maybe use something
like BFQQ_EXP_TOO_IDLE?

> +static struct bfq_queue *bfq_entity_to_bfqq(struct bfq_entity *entity);

bfqe_to_bfqq()?  And so on.  I don't think being verbose with heavily
repeated keywords helps much.

> +static struct bfq_queue *bic_to_bfqq(struct bfq_io_cq *bic, bool is_sync)
> +{
> +	return bic->bfqq[is_sync];
> +}
> +
> +static void bic_set_bfqq(struct bfq_io_cq *bic, struct bfq_queue *bfqq,
> +			 bool is_sync)
> +{
> +	bic->bfqq[is_sync] = bfqq;
> +}

Do helpers like the above add anything?  Don't these obfuscate more
than help?

> +static struct bfq_data *bfq_get_bfqd_locked(void **ptr, unsigned long *flags)

bfqd_get_and_lock_irqsave()?

> +{
> +	struct bfq_data *bfqd;
> +
> +	rcu_read_lock();
> +	bfqd = rcu_dereference(*(struct bfq_data **)ptr);
> +
> +	if (bfqd != NULL) {
> +		spin_lock_irqsave(bfqd->queue->queue_lock, *flags);
> +		if (ptr == NULL)
> +			pr_crit("get_bfqd_locked pointer NULL\n");

I don't get it.  How can @ptr can be NULL at this point?  The kernel
would have crashed on the above rcu_deref.

> +		else if (*ptr == bfqd)
> +			goto out;
> +		spin_unlock_irqrestore(bfqd->queue->queue_lock, *flags);
> +	}
> +
> +	bfqd = NULL;
> +out:
> +	rcu_read_unlock();
> +	return bfqd;
> +}
> +
> +static void bfq_put_bfqd_unlock(struct bfq_data *bfqd, unsigned long *flags)
> +{
> +	spin_unlock_irqrestore(bfqd->queue->queue_lock, *flags);
> +}

Ditto, I don't think these single liners help anything.  Just make
sure that the counterpart's name clearly indicates that it's a locking
function.

...
> +/* hw_tag detection: parallel requests threshold and min samples needed. */
> +#define BFQ_HW_QUEUE_THRESHOLD	4
> +#define BFQ_HW_QUEUE_SAMPLES	32
> +
> +#define BFQQ_SEEK_THR	 (sector_t)(8 * 1024)
> +#define BFQQ_SEEKY(bfqq) ((bfqq)->seek_mean > BFQQ_SEEK_THR)

Inconsistent indenting?

...
> + * Shift for timestamp calculations.  This actually limits the maximum
> + * service allowed in one timestamp delta (small shift values increase it),
> + * the maximum total weight that can be used for the queues in the system
> + * (big shift values increase it), and the period of virtual time
> + * wraparounds.
>   */
> +#define WFQ_SERVICE_SHIFT	22

Collect constants in one place?

...
> +static struct bfq_entity *bfq_entity_of(struct rb_node *node)
> +{
> +	struct bfq_entity *entity = NULL;
>  
> +	if (node)
> +		entity = rb_entry(node, struct bfq_entity, rb_node);
>  
> +	return entity;
> +}

Maybe

	if (!node)
		return NULL;
	return rb_entry(node, ...);

> +static unsigned short bfq_weight_to_ioprio(int weight)
>  {
> +	return IOPRIO_BE_NR * BFQ_WEIGHT_CONVERSION_COEFF - weight < 0 ?
> +		0 : IOPRIO_BE_NR * BFQ_WEIGHT_CONVERSION_COEFF - weight;

Heh, how about?

	return max(IOPRIO_BE_NR * BFQ_WEIGHT_CONVERSION_COEFF - weight, 0);

> +static struct rb_node *bfq_find_deepest(struct rb_node *node)
>  {
> +	struct rb_node *deepest;
> +
> +	if (!node->rb_right && !node->rb_left)
> +		deepest = rb_parent(node);
> +	else if (!node->rb_right)
> +		deepest = node->rb_left;
> +	else if (!node->rb_left)
> +		deepest = node->rb_right;
> +	else {
> +		deepest = rb_next(node);
> +		if (deepest->rb_right)
> +			deepest = deepest->rb_right;
> +		else if (rb_parent(deepest) != node)
> +			deepest = rb_parent(deepest);
> +	}

According to CodyingStyle, if any clause gets braced, all clauses get
braced.

And there were some lines which were unnecessarily broken when joining
them would still fit under 80col limit.

Thanks.

-- 
tejun