From: Jan Kara <jack@suse.cz>
To: Jens Axboe <axboe@kernel.dk>
Cc: <linux-block@vger.kernel.org>,
	Paolo Valente <paolo.valente@linaro.org>, Jan Kara <jack@suse.cz>
Subject: [PATCH 0/8 v5] bfq: Limit number of allocated scheduler tags per cgroup
Date: Thu, 25 Nov 2021 14:36:33 +0100	[thread overview]
Message-ID: <20211125133131.14018-1-jack@suse.cz> (raw)

Hello!

Here is the fifth revision of my patches to fix how bfq weights apply on
cgroup throughput and on throughput of processes with different IO priorities.
The only change since the previous version is that I've rebased the series
on top of Jens' linux-block.git for-5.17/block branch, which required a bit of
reshuffling of IOC passing.

Jens, can you please merge the series?

Changes since v4:
* Rebased on top of linux-block.git for-5.17/block

Changes since v3:
* Rebased on top of 5.16-rc2
* Added Reviewed-by and Acked-by tags

Changes since v2:
* Rebased on top of current Linus' tree
* Updated computation of scheduler tag proportions to work correctly even
  for processes within the same cgroup but with different IO priorities
* Added comment roughly explaining why we limit tag depth
* Added patch limiting waker / wakee detection in time to avoid at least the
  most obvious false positives
* Added patch to log waker / wakee detections in blktrace for better debugging
* Added patch to properly account injected IO

Changes since v1:
* Fixed computation of appropriate proportion of scheduler tags for a cgroup
  to work with deeper cgroup hierarchies.

Original cover letter:

I was looking into why cgroup weights do not have any measurable impact on
writeback throughput from different cgroups. This is actually a regression from
CFQ, where things worked more or less OK and weights had roughly the impact
they should. The problem can be reproduced e.g. by running the following simple
fio job in two cgroups with different weights:

[writer]
directory=/mnt/repro/
numjobs=1
rw=write
size=8g
time_based
runtime=30
ramp_time=10
blocksize=1m
direct=0
ioengine=sync

I can observe that there's no significant difference in the amount of data
written from the different cgroups despite their weights being in, say, a 1:3
ratio.

After some debugging I've understood the dynamics of the system. There are two
issues:

1) The number of scheduler tags needs to be significantly larger than the
number of device tags. Otherwise there are not enough requests waiting in BFQ
to be dispatched to the device and thus there is nothing for the scheduler to
choose from.

2) Even with enough scheduler tags, writers from two cgroups eventually start
contending on scheduler tag allocation. Tags are handed out on a first-come,
first-served basis, so writers from both cgroups feed requests into bfq at
approximately the same speed. Since bfq prefers IO from the heavier cgroup,
that IO is submitted and completed faster, and eventually we end up in a
situation where there's no IO from the heavier cgroup in bfq and all scheduler
tags are consumed by requests from the lighter cgroup. At that point bfq just
dispatches lots of IO from the lighter cgroup since there's no contender for
disk throughput. As a result, the observed throughput for both cgroups is the
same.

This series fixes the problem by accounting how many scheduler tags are
allocated to each cgroup, and if a cgroup has more tags allocated than its
fair share (based on weights) in its service tree, we heavily limit the
scheduler tag bitmap depth for it so that it is not able to starve other
cgroups of scheduler tags.
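
To illustrate the idea, here is a rough sketch in C; the function and
parameter names below are made up for this example and are not the
identifiers used in the actual patches:

/*
 * Sketch of the depth-limiting idea described above (not the code from the
 * patches): derive an entity's fair share of the scheduler tag depth from
 * its weight relative to the total weight in its service tree, and heavily
 * limit entities that already hold more requests than that share.
 */
static unsigned int bfq_depth_limit_sketch(unsigned int total_depth,
					   unsigned int entity_weight,
					   unsigned int tree_weight,
					   unsigned int allocated)
{
	unsigned int share;

	/* No competition in the service tree: the full depth may be used. */
	if (tree_weight <= entity_weight)
		return total_depth;

	/* Fair share of scheduler tags for this entity, based on its weight. */
	share = total_depth * entity_weight / tree_weight;
	if (!share)
		share = 1;

	/*
	 * An entity that already holds its share (or more) gets a minimal
	 * depth so that it cannot starve other entities of scheduler tags.
	 */
	if (allocated >= share)
		return 1;

	return share;
}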

								Honza

Previous versions:
Link: http://lore.kernel.org/r/20210712171146.12231-1-jack@suse.cz # v1
Link: http://lore.kernel.org/r/20210715132047.20874-1-jack@suse.cz # v2
Link: http://lore.kernel.org/r/20211006164110.10817-1-jack@suse.cz # v3
Link: http://lore.kernel.org/r/20211123101109.20879-1-jack@suse.cz # v4

Thread overview:
2021-11-25 13:36 Jan Kara [this message]
2021-11-25 13:36 ` [PATCH 1/8] block: Provide blk_mq_sched_get_icq() Jan Kara
2021-11-25 16:04   ` Jens Axboe
2021-11-25 13:36 ` [PATCH 2/8] bfq: Track number of allocated requests in bfq_entity Jan Kara
2021-11-25 13:36 ` [PATCH 3/8] bfq: Store full bitmap depth in bfq_data Jan Kara
2021-11-25 13:36 ` [PATCH 4/8] bfq: Limit number of requests consumed by each cgroup Jan Kara
2021-11-25 13:36 ` [PATCH 5/8] bfq: Limit waker detection in time Jan Kara
2021-11-25 13:36 ` [PATCH 6/8] bfq: Provide helper to generate bfqq name Jan Kara
2021-11-25 13:36 ` [PATCH 7/8] bfq: Log waker detections Jan Kara
2021-11-25 13:36 ` [PATCH 8/8] bfq: Do not let waker requests skip proper accounting Jan Kara
