LKML Archive on lore.kernel.org
 help / Atom feed
* [PATCH BUGFIX 0/2] bfq: fix unbalanced decrements causing loss of throughput
@ 2018-12-06 18:18 Paolo Valente
  2018-12-06 18:18 ` [PATCH BUGFIX 1/2] block, bfq: fix decrement of num_active_groups Paolo Valente
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Paolo Valente @ 2018-12-06 18:18 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-kernel, ulf.hansson, linus.walleij, broonie,
	bfq-iosched, oleksandr, federico, Paolo Valente

Hi Jens,
the first patch in this series fixes an error in the decrementing of
the counter of the number of groups with pending I/O. This wrong
decrement caused loss of throughput or, less likely, of control on
I/O. The second patch is a fix of some wrong comments, which somehow
contributed to making the above bug more difficult to find.

Thanks,
Paolo

Paolo Valente (2):
  block, bfq: fix decrement of num_active_groups
  block, bfq: fix comments on __bfq_deactivate_entity

 block/bfq-iosched.c | 76 +++++++++++++++++++++++++++++++++++++----------------
 block/bfq-iosched.h | 51 +++++++++++++++++++++++++++++++++--
 block/bfq-wf2q.c    | 16 ++++++-----
 3 files changed, 112 insertions(+), 31 deletions(-)

--
2.16.1

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH BUGFIX 1/2] block, bfq: fix decrement of num_active_groups
  2018-12-06 18:18 [PATCH BUGFIX 0/2] bfq: fix unbalanced decrements causing loss of throughput Paolo Valente
@ 2018-12-06 18:18 ` Paolo Valente
  2018-12-06 18:18 ` [PATCH BUGFIX 2/2] block, bfq: fix comments on __bfq_deactivate_entity Paolo Valente
  2018-12-07  2:23 ` [PATCH BUGFIX 0/2] bfq: fix unbalanced decrements causing loss of throughput Jens Axboe
  2 siblings, 0 replies; 6+ messages in thread
From: Paolo Valente @ 2018-12-06 18:18 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-kernel, ulf.hansson, linus.walleij, broonie,
	bfq-iosched, oleksandr, federico, Paolo Valente

Since commit '2d29c9f89fcd ("block, bfq: improve asymmetric scenarios
detection")', if there are process groups with I/O requests waiting for
completion, then BFQ tags the scenario as 'asymmetric'. This detection
is needed for preserving service guarantees (for details, see comments
on the computation * of the variable asymmetric_scenario in the
function bfq_better_to_idle).

Unfortunately, commit '2d29c9f89fcd ("block, bfq: improve asymmetric
scenarios detection")' contains an error exactly in the updating of
the number of groups with I/O requests waiting for completion: if a
group has more than one descendant process, then the above number of
groups, which is renamed from num_active_groups to a more appropriate
num_groups_with_pending_reqs by this commit, may happen to be wrongly
decremented multiple times, namely every time one of the descendant
processes gets all its pending I/O requests completed.

A correct, complete solution should work as follows. Consider a group
that is inactive, i.e., that has no descendant process with pending
I/O inside BFQ queues. Then suppose that num_groups_with_pending_reqs
is still accounting for this group, because the group still has some
descendant process with some I/O request still in
flight. num_groups_with_pending_reqs should be decremented when the
in-flight request of the last descendant process is finally completed
(assuming that nothing else has changed for the group in the meantime,
in terms of composition of the group and active/inactive state of
child groups and processes). To accomplish this, an additional
pending-request counter must be added to entities, and must be
updated correctly.

To avoid this additional field and operations, this commit resorts to
the following tradeoff between simplicity and accuracy: for an
inactive group that is still counted in num_groups_with_pending_reqs,
this commit decrements num_groups_with_pending_reqs when the first
descendant process of the group remains with no request waiting for
completion.

This simplified scheme provides a fix to the unbalanced decrements
introduced by 2d29c9f89fcd. Since this error was also caused by lack
of comments on this non-trivial issue, this commit also adds related
comments.

Fixes: 2d29c9f89fcd ("block, bfq: improve asymmetric scenarios detection")
Reported-by: Steven Barrett <steven@liquorix.net>
Tested-by: Steven Barrett <steven@liquorix.net>
Tested-by: Lucjan Lucjanov <lucjan.lucjanov@gmail.com>
Reviewed-by: Federico Motta <federico@willer.it>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
---
 block/bfq-iosched.c | 76 +++++++++++++++++++++++++++++++++++++----------------
 block/bfq-iosched.h | 51 +++++++++++++++++++++++++++++++++--
 block/bfq-wf2q.c    |  5 +++-
 3 files changed, 107 insertions(+), 25 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 3a27d31fcda6..97337214bec4 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -638,7 +638,7 @@ static bool bfq_varied_queue_weights_or_active_groups(struct bfq_data *bfqd)
 		 bfqd->queue_weights_tree.rb_node->rb_right)
 #ifdef CONFIG_BFQ_GROUP_IOSCHED
 	       ) ||
-		(bfqd->num_active_groups > 0
+		(bfqd->num_groups_with_pending_reqs > 0
 #endif
 	       );
 }
@@ -802,7 +802,21 @@ void bfq_weights_tree_remove(struct bfq_data *bfqd,
 			 */
 			break;
 		}
-		bfqd->num_active_groups--;
+
+		/*
+		 * The decrement of num_groups_with_pending_reqs is
+		 * not performed immediately upon the deactivation of
+		 * entity, but it is delayed to when it also happens
+		 * that the first leaf descendant bfqq of entity gets
+		 * all its pending requests completed. The following
+		 * instructions perform this delayed decrement, if
+		 * needed. See the comments on
+		 * num_groups_with_pending_reqs for details.
+		 */
+		if (entity->in_groups_with_pending_reqs) {
+			entity->in_groups_with_pending_reqs = false;
+			bfqd->num_groups_with_pending_reqs--;
+		}
 	}
 }
 
@@ -3529,27 +3543,44 @@ static bool bfq_better_to_idle(struct bfq_queue *bfqq)
 	 * fact, if there are active groups, then, for condition (i)
 	 * to become false, it is enough that an active group contains
 	 * more active processes or sub-groups than some other active
-	 * group. We address this issue with the following bi-modal
-	 * behavior, implemented in the function
+	 * group. More precisely, for condition (i) to hold because of
+	 * such a group, it is not even necessary that the group is
+	 * (still) active: it is sufficient that, even if the group
+	 * has become inactive, some of its descendant processes still
+	 * have some request already dispatched but still waiting for
+	 * completion. In fact, requests have still to be guaranteed
+	 * their share of the throughput even after being
+	 * dispatched. In this respect, it is easy to show that, if a
+	 * group frequently becomes inactive while still having
+	 * in-flight requests, and if, when this happens, the group is
+	 * not considered in the calculation of whether the scenario
+	 * is asymmetric, then the group may fail to be guaranteed its
+	 * fair share of the throughput (basically because idling may
+	 * not be performed for the descendant processes of the group,
+	 * but it had to be).  We address this issue with the
+	 * following bi-modal behavior, implemented in the function
 	 * bfq_symmetric_scenario().
 	 *
-	 * If there are active groups, then the scenario is tagged as
+	 * If there are groups with requests waiting for completion
+	 * (as commented above, some of these groups may even be
+	 * already inactive), then the scenario is tagged as
 	 * asymmetric, conservatively, without checking any of the
 	 * conditions (i) and (ii). So the device is idled for bfqq.
 	 * This behavior matches also the fact that groups are created
-	 * exactly if controlling I/O (to preserve bandwidth and
-	 * latency guarantees) is a primary concern.
+	 * exactly if controlling I/O is a primary concern (to
+	 * preserve bandwidth and latency guarantees).
 	 *
-	 * On the opposite end, if there are no active groups, then
-	 * only condition (i) is actually controlled, i.e., provided
-	 * that condition (i) holds, idling is not performed,
-	 * regardless of whether condition (ii) holds. In other words,
-	 * only if condition (i) does not hold, then idling is
-	 * allowed, and the device tends to be prevented from queueing
-	 * many requests, possibly of several processes. Since there
-	 * are no active groups, then, to control condition (i) it is
-	 * enough to check whether all active queues have the same
-	 * weight.
+	 * On the opposite end, if there are no groups with requests
+	 * waiting for completion, then only condition (i) is actually
+	 * controlled, i.e., provided that condition (i) holds, idling
+	 * is not performed, regardless of whether condition (ii)
+	 * holds. In other words, only if condition (i) does not hold,
+	 * then idling is allowed, and the device tends to be
+	 * prevented from queueing many requests, possibly of several
+	 * processes. Since there are no groups with requests waiting
+	 * for completion, then, to control condition (i) it is enough
+	 * to check just whether all the queues with requests waiting
+	 * for completion also have the same weight.
 	 *
 	 * Not checking condition (ii) evidently exposes bfqq to the
 	 * risk of getting less throughput than its fair share.
@@ -3607,10 +3638,11 @@ static bool bfq_better_to_idle(struct bfq_queue *bfqq)
 	 * bfqq is weight-raised is checked explicitly here. More
 	 * precisely, the compound condition below takes into account
 	 * also the fact that, even if bfqq is being weight-raised,
-	 * the scenario is still symmetric if all active queues happen
-	 * to be weight-raised. Actually, we should be even more
-	 * precise here, and differentiate between interactive weight
-	 * raising and soft real-time weight raising.
+	 * the scenario is still symmetric if all queues with requests
+	 * waiting for completion happen to be
+	 * weight-raised. Actually, we should be even more precise
+	 * here, and differentiate between interactive weight raising
+	 * and soft real-time weight raising.
 	 *
 	 * As a side note, it is worth considering that the above
 	 * device-idling countermeasures may however fail in the
@@ -5417,7 +5449,7 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 	bfqd->idle_slice_timer.function = bfq_idle_slice_timer;
 
 	bfqd->queue_weights_tree = RB_ROOT;
-	bfqd->num_active_groups = 0;
+	bfqd->num_groups_with_pending_reqs = 0;
 
 	INIT_LIST_HEAD(&bfqd->active_list);
 	INIT_LIST_HEAD(&bfqd->idle_list);
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index 77651d817ecd..0b02bf302de0 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -196,6 +196,9 @@ struct bfq_entity {
 
 	/* flag, set to request a weight, ioprio or ioprio_class change  */
 	int prio_changed;
+
+	/* flag, set if the entity is counted in groups_with_pending_reqs */
+	bool in_groups_with_pending_reqs;
 };
 
 struct bfq_group;
@@ -448,10 +451,54 @@ struct bfq_data {
 	 * bfq_weights_tree_[add|remove] for further details).
 	 */
 	struct rb_root queue_weights_tree;
+
 	/*
-	 * number of groups with requests still waiting for completion
+	 * Number of groups with at least one descendant process that
+	 * has at least one request waiting for completion. Note that
+	 * this accounts for also requests already dispatched, but not
+	 * yet completed. Therefore this number of groups may differ
+	 * (be larger) than the number of active groups, as a group is
+	 * considered active only if its corresponding entity has
+	 * descendant queues with at least one request queued. This
+	 * number is used to decide whether a scenario is symmetric.
+	 * For a detailed explanation see comments on the computation
+	 * of the variable asymmetric_scenario in the function
+	 * bfq_better_to_idle().
+	 *
+	 * However, it is hard to compute this number exactly, for
+	 * groups with multiple descendant processes. Consider a group
+	 * that is inactive, i.e., that has no descendant process with
+	 * pending I/O inside BFQ queues. Then suppose that
+	 * num_groups_with_pending_reqs is still accounting for this
+	 * group, because the group has descendant processes with some
+	 * I/O request still in flight. num_groups_with_pending_reqs
+	 * should be decremented when the in-flight request of the
+	 * last descendant process is finally completed (assuming that
+	 * nothing else has changed for the group in the meantime, in
+	 * terms of composition of the group and active/inactive state of child
+	 * groups and processes). To accomplish this, an additional
+	 * pending-request counter must be added to entities, and must
+	 * be updated correctly. To avoid this additional field and operations,
+	 * we resort to the following tradeoff between simplicity and
+	 * accuracy: for an inactive group that is still counted in
+	 * num_groups_with_pending_reqs, we decrement
+	 * num_groups_with_pending_reqs when the first descendant
+	 * process of the group remains with no request waiting for
+	 * completion.
+	 *
+	 * Even this simpler decrement strategy requires a little
+	 * carefulness: to avoid multiple decrements, we flag a group,
+	 * more precisely an entity representing a group, as still
+	 * counted in num_groups_with_pending_reqs when it becomes
+	 * inactive. Then, when the first descendant queue of the
+	 * entity remains with no request waiting for completion,
+	 * num_groups_with_pending_reqs is decremented, and this flag
+	 * is reset. After this flag is reset for the entity,
+	 * num_groups_with_pending_reqs won't be decremented any
+	 * longer in case a new descendant queue of the entity remains
+	 * with no request waiting for completion.
 	 */
-	unsigned int num_active_groups;
+	unsigned int num_groups_with_pending_reqs;
 
 	/*
 	 * Number of bfq_queues containing requests (including the
diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
index 4b0d5fb69160..63e0f12be7c9 100644
--- a/block/bfq-wf2q.c
+++ b/block/bfq-wf2q.c
@@ -1012,7 +1012,10 @@ static void __bfq_activate_entity(struct bfq_entity *entity,
 			container_of(entity, struct bfq_group, entity);
 		struct bfq_data *bfqd = bfqg->bfqd;
 
-		bfqd->num_active_groups++;
+		if (!entity->in_groups_with_pending_reqs) {
+			entity->in_groups_with_pending_reqs = true;
+			bfqd->num_groups_with_pending_reqs++;
+		}
 	}
 #endif
 
-- 
2.16.1


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH BUGFIX 2/2] block, bfq: fix comments on __bfq_deactivate_entity
  2018-12-06 18:18 [PATCH BUGFIX 0/2] bfq: fix unbalanced decrements causing loss of throughput Paolo Valente
  2018-12-06 18:18 ` [PATCH BUGFIX 1/2] block, bfq: fix decrement of num_active_groups Paolo Valente
@ 2018-12-06 18:18 ` Paolo Valente
  2018-12-07  2:23 ` [PATCH BUGFIX 0/2] bfq: fix unbalanced decrements causing loss of throughput Jens Axboe
  2 siblings, 0 replies; 6+ messages in thread
From: Paolo Valente @ 2018-12-06 18:18 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-kernel, ulf.hansson, linus.walleij, broonie,
	bfq-iosched, oleksandr, federico, Paolo Valente

Comments on function __bfq_deactivate_entity contains two imprecise or
wrong statements:
1) The function performs the deactivation of the entity.
2) The function must be invoked only if the entity is on a service tree.

This commits replaces both statements with the correct ones:
1) The functions updates sched_data and service trees for the entity,
so as to represent entity as inactive (which is only part of the steps
needed for the deactivation of the entity).
2) The function must be invoked on every entity being deactivated.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
---
 block/bfq-wf2q.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
index 63e0f12be7c9..72adbbe975d5 100644
--- a/block/bfq-wf2q.c
+++ b/block/bfq-wf2q.c
@@ -1154,15 +1154,14 @@ static void bfq_activate_requeue_entity(struct bfq_entity *entity,
 }
 
 /**
- * __bfq_deactivate_entity - deactivate an entity from its service tree.
- * @entity: the entity to deactivate.
+ * __bfq_deactivate_entity - update sched_data and service trees for
+ * entity, so as to represent entity as inactive
+ * @entity: the entity being deactivated.
  * @ins_into_idle_tree: if false, the entity will not be put into the
  *			idle tree.
  *
- * Deactivates an entity, independently of its previous state.  Must
- * be invoked only if entity is on a service tree. Extracts the entity
- * from that tree, and if necessary and allowed, puts it into the idle
- * tree.
+ * If necessary and allowed, puts entity into the idle tree. NOTE:
+ * entity may be on no tree if in service.
  */
 bool __bfq_deactivate_entity(struct bfq_entity *entity, bool ins_into_idle_tree)
 {
-- 
2.16.1


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH BUGFIX 0/2] bfq: fix unbalanced decrements causing loss of throughput
  2018-12-06 18:18 [PATCH BUGFIX 0/2] bfq: fix unbalanced decrements causing loss of throughput Paolo Valente
  2018-12-06 18:18 ` [PATCH BUGFIX 1/2] block, bfq: fix decrement of num_active_groups Paolo Valente
  2018-12-06 18:18 ` [PATCH BUGFIX 2/2] block, bfq: fix comments on __bfq_deactivate_entity Paolo Valente
@ 2018-12-07  2:23 ` Jens Axboe
  2018-12-07 10:01   ` Paolo Valente
  2 siblings, 1 reply; 6+ messages in thread
From: Jens Axboe @ 2018-12-07  2:23 UTC (permalink / raw)
  To: Paolo Valente
  Cc: linux-block, linux-kernel, ulf.hansson, linus.walleij, broonie,
	bfq-iosched, oleksandr, federico

On 12/6/18 11:18 AM, Paolo Valente wrote:
> Hi Jens,
> the first patch in this series fixes an error in the decrementing of
> the counter of the number of groups with pending I/O. This wrong
> decrement caused loss of throughput or, less likely, of control on
> I/O. The second patch is a fix of some wrong comments, which somehow
> contributed to making the above bug more difficult to find.

Are you fine with this going into 4.21? I can't quite tell what your
intent is. The first patch has a Fixes for something that went into
this series, but then patch 2 is a comment update that would not
normally be something to be applied at this stage.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH BUGFIX 0/2] bfq: fix unbalanced decrements causing loss of throughput
  2018-12-07  2:23 ` [PATCH BUGFIX 0/2] bfq: fix unbalanced decrements causing loss of throughput Jens Axboe
@ 2018-12-07 10:01   ` Paolo Valente
  2018-12-07 14:40     ` Jens Axboe
  0 siblings, 1 reply; 6+ messages in thread
From: Paolo Valente @ 2018-12-07 10:01 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-kernel, Ulf Hansson, Linus Walleij,
	Mark Brown, bfq-iosched, oleksandr, federico



> Il giorno 7 dic 2018, alle ore 03:23, Jens Axboe <axboe@kernel.dk> ha scritto:
> 
> On 12/6/18 11:18 AM, Paolo Valente wrote:
>> Hi Jens,
>> the first patch in this series fixes an error in the decrementing of
>> the counter of the number of groups with pending I/O. This wrong
>> decrement caused loss of throughput or, less likely, of control on
>> I/O. The second patch is a fix of some wrong comments, which somehow
>> contributed to making the above bug more difficult to find.
> 
> Are you fine with this going into 4.21? I can't quite tell what your
> intent is. The first patch has a Fixes for something

yep, that fixes a serious error.

> that went into
> this series, but then patch 2 is a comment update that would not
> normally be something to be applied at this stage.
> 

and yes, only comments changed by the second one

May it make sense to apply them in two steps, one in the 4.20 and the other one in the 4.21?

Thanks,
Paolo

> -- 
> Jens Axboe
> 


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH BUGFIX 0/2] bfq: fix unbalanced decrements causing loss of throughput
  2018-12-07 10:01   ` Paolo Valente
@ 2018-12-07 14:40     ` Jens Axboe
  0 siblings, 0 replies; 6+ messages in thread
From: Jens Axboe @ 2018-12-07 14:40 UTC (permalink / raw)
  To: Paolo Valente
  Cc: linux-block, linux-kernel, Ulf Hansson, Linus Walleij,
	Mark Brown, bfq-iosched, oleksandr, federico

On 12/7/18 3:01 AM, Paolo Valente wrote:
> 
> 
>> Il giorno 7 dic 2018, alle ore 03:23, Jens Axboe <axboe@kernel.dk> ha scritto:
>>
>> On 12/6/18 11:18 AM, Paolo Valente wrote:
>>> Hi Jens,
>>> the first patch in this series fixes an error in the decrementing of
>>> the counter of the number of groups with pending I/O. This wrong
>>> decrement caused loss of throughput or, less likely, of control on
>>> I/O. The second patch is a fix of some wrong comments, which somehow
>>> contributed to making the above bug more difficult to find.
>>
>> Are you fine with this going into 4.21? I can't quite tell what your
>> intent is. The first patch has a Fixes for something
> 
> yep, that fixes a serious error.
> 
>> that went into
>> this series, but then patch 2 is a comment update that would not
>> normally be something to be applied at this stage.
>>
> 
> and yes, only comments changed by the second one
> 
> May it make sense to apply them in two steps, one in the 4.20 and the other one in the 4.21?

I think so, I'll do that.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, back to index

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-12-06 18:18 [PATCH BUGFIX 0/2] bfq: fix unbalanced decrements causing loss of throughput Paolo Valente
2018-12-06 18:18 ` [PATCH BUGFIX 1/2] block, bfq: fix decrement of num_active_groups Paolo Valente
2018-12-06 18:18 ` [PATCH BUGFIX 2/2] block, bfq: fix comments on __bfq_deactivate_entity Paolo Valente
2018-12-07  2:23 ` [PATCH BUGFIX 0/2] bfq: fix unbalanced decrements causing loss of throughput Jens Axboe
2018-12-07 10:01   ` Paolo Valente
2018-12-07 14:40     ` Jens Axboe

LKML Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/lkml/0 lkml/git/0.git
	git clone --mirror https://lore.kernel.org/lkml/1 lkml/git/1.git
	git clone --mirror https://lore.kernel.org/lkml/2 lkml/git/2.git
	git clone --mirror https://lore.kernel.org/lkml/3 lkml/git/3.git
	git clone --mirror https://lore.kernel.org/lkml/4 lkml/git/4.git
	git clone --mirror https://lore.kernel.org/lkml/5 lkml/git/5.git
	git clone --mirror https://lore.kernel.org/lkml/6 lkml/git/6.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 lkml lkml/ https://lore.kernel.org/lkml \
		linux-kernel@vger.kernel.org linux-kernel@archiver.kernel.org
	public-inbox-index lkml


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-kernel


AGPL code for this site: git clone https://public-inbox.org/ public-inbox