From: Weiping Zhang <zwp10758@gmail.com>
To: Keith Busch <kbusch@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>,
	Bart Van Assche <bvanassche@acm.org>,
	linux-nvme@lists.infradead.org, Ming Lei <ming.lei@redhat.com>,
	linux-block@vger.kernel.org, Minwoo Im <minwoo.im.dev@gmail.com>,
	cgroups@vger.kernel.org, Tejun Heo <tj@kernel.org>,
	"Nadolski, Edmund" <edmund.nadolski@intel.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Christoph Hellwig <hch@lst.de>
Subject: Re: [PATCH v5 0/4] Add support Weighted Round Robin for blkcg and nvme
Date: Sun, 16 Feb 2020 16:09:34 +0800	[thread overview]
Message-ID: <CAA70yB5qAj8YnNiPVD5zmPrrTr0A0F3v2cC6t2S1Fb0kiECLfw@mail.gmail.com> (raw)
In-Reply-To: <20200204154200.GA5831@redsun51.ssa.fujisawa.hgst.com>

Keith Busch <kbusch@kernel.org> wrote on Tue, Feb 4, 2020 at 11:42 PM:
>
> On Tue, Feb 04, 2020 at 11:30:45AM +0800, Weiping Zhang wrote:
> > This series tries to add Weighted Round Robin (WRR) support for the
> > block cgroup and the nvme driver. When multiple containers share a
> > single nvme device, we want to protect the IO-critical container from
> > interference by other containers. We add a blkio.wrr interface that
> > lets users control their IO priority. blkio.wrr accepts five priority
> > levels: "urgent", "high", "medium", "low" and "none"; "none" disables
> > WRR for the cgroup.
>
Hi Keith,

> The NVMe protocol really doesn't define WRR to be a mechanism to mitigate
> interference, though. It defines credits among the weighted queues
> only for command fetching, and an urgent strict priority class that
> starves the rest. It has nothing to do with how the controller should
> prioritize completion of those commands, even if it may be reasonable to
> assume influencing when the command is fetched should affect its
> completion.
>
Thanks for your feedback. The fio test results on WRR show that the
high-WRR fio jobs get more bandwidth/IOPS and lower latency. I think it's
a good feature for running multiple workloads with different priorities,
especially for container colocation.
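
For reference, here is a minimal user-space sketch of how a workload's
cgroup might opt into a WRR level through the proposed blkio.wrr file;
the exact write format accepted by the file is my assumption, the series'
documentation is authoritative:

#include <stdio.h>

/*
 * Hypothetical helper: write a WRR level ("urgent", "high", "medium",
 * "low" or "none", per the cover letter) into a cgroup's blkio.wrr file.
 */
static int set_blkio_wrr(const char *cgroup_dir, const char *level)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "%s/blkio.wrr", cgroup_dir);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%s\n", level);
	return fclose(f) == 0 ? 0 : -1;
}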

> On the "weighted" strict priority, there's nothing separating "high"
> from "low" other than the name: the "set features" credit assignment
> can invert which queues have higher command fetch rates such that the
> "low" is favoured over the "high".
>
If there is no such limitation in the hardware controller, we can add
more checking to the Set Features command handling. I think most people
won't give "low" more credits than "high"; that really would not make
sense.
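
A minimal sketch of such a check, assuming the High/Medium/Low Priority
Weight fields (HPW/MPW/LPW) of the NVMe Arbitration feature (Set
Features, FID 01h); the function name and placement are illustrative:

/*
 * Reject inverted credit assignments so a "low" queue can never be
 * granted a higher command fetch rate than a "high" queue.
 */
static int nvme_validate_wrr_weights(u8 hpw, u8 mpw, u8 lpw)
{
	if (lpw > mpw || mpw > hpw)
		return -EINVAL;
	return 0;
}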

> There's no protection against the "urgent" class starving others: normal
> IO will timeout and trigger repeated controller resets, while polled IO
> will consume 100% of CPU cycles without making any progress if we make
> this type of queue available without any additional code to ensure the
> host behaves.
>
I think we can simply disable the urgent class in the software layer;
actually, I have no real application that needs it.
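
For example, the driver could refuse to allocate urgent queues no matter
what the user requests; a rough sketch, with made-up field names:

/*
 * Sketch: force the urgent WRR queue count to zero so the urgent class
 * can never starve normal or polled IO.
 */
if (dev->nr_wrr_urgent_queues) {
	dev_warn(dev->ctrl.device,
		 "urgent WRR queues can starve other IO, disabling them\n");
	dev->nr_wrr_urgent_queues = 0;
}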

> On the driver implementation, the number of module parameters being
> added here is problematic. We already have 2 special classes of queues,
> and defining this at the module level is considered too coarse when
> the system has different devices on opposite ends of the capability
> spectrum. For example, users want polled queues for the fast devices,
> and none for the slower tier. We just don't have a good mechanism to
> define per-controller resources, and more queue classes will make this
> problem worse.
>
We can add a new "string" module parameter that contains a model number.
In most cases, devices from the same product line share a common
model-number prefix, so nvme can use it to distinguish devices of
different performance classes (high or low end). Before creating the IO
queues, the nvme driver can read the device's Model Number (40 bytes) and
compare it with the module parameter to decide how many IO queues to
allocate for each disk:

/* If model_number is MODEL_ANY, these parameters apply to all nvme devices. */
char dev_io_queues[1024] =
	"model_number=MODEL_ANY,poll=0,read=0,wrr_low=0,wrr_medium=0,wrr_high=0,wrr_urgent=0";

/* These parameters only affect nvme disks whose model number is "XXX". */
char dev_io_queues[1024] =
	"model_number=XXX,poll=1,read=2,wrr_low=3,wrr_medium=4,wrr_high=5,wrr_urgent=0";

struct dev_io_queues {
	char model_number[40];
	unsigned int poll;
	unsigned int read;
	unsigned int wrr_low;
	unsigned int wrr_medium;
	unsigned int wrr_high;
	unsigned int wrr_urgent;
};
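
A rough parsing sketch for such a parameter string, using kernel string
helpers (illustrative only, not actual driver code):

static int parse_dev_io_queues(const char *buf, struct dev_io_queues *cfg)
{
	char *opts, *p, *tok;

	opts = kstrdup(buf, GFP_KERNEL);
	if (!opts)
		return -ENOMEM;

	p = opts;
	while ((tok = strsep(&p, ",")) != NULL) {
		if (!strncmp(tok, "model_number=", 13))
			strscpy(cfg->model_number, tok + 13,
				sizeof(cfg->model_number));
		else if (sscanf(tok, "poll=%u", &cfg->poll) == 1)
			continue;
		else if (sscanf(tok, "read=%u", &cfg->read) == 1)
			continue;
		else if (sscanf(tok, "wrr_low=%u", &cfg->wrr_low) == 1)
			continue;
		/* ... likewise for wrr_medium, wrr_high and wrr_urgent ... */
	}

	kfree(opts);
	return 0;
}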

We can use these two variables to store the IO queue configurations:

/* Default values for all disks whose model number is not in io_queues_cfg. */
struct dev_io_queues io_queues_def = {};

/* User-defined values for a specific model number. */
struct dev_io_queues io_queues_cfg = {};

If we need more than two configurations, we can also extend dev_io_queues
to support that.
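
The per-controller selection could then be a simple lookup. A sketch,
assuming the model string from Identify Controller is reachable from the
controller structure (the field location is an assumption):

/*
 * Pick the IO queue configuration for a controller by matching its
 * Identify Controller model number; fall back to the defaults.
 */
static struct dev_io_queues *nvme_lookup_io_queues(struct nvme_ctrl *ctrl)
{
	if (!strncmp(ctrl->subsys->model, io_queues_cfg.model_number,
		     sizeof(io_queues_cfg.model_number)))
		return &io_queues_cfg;
	return &io_queues_def;
}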

> On the blk-mq side, this implementation doesn't work with the IO
> schedulers. If one is in use, requests may be reordered such that a
> request on your high-priority hctx may be dispatched later than more
> recent ones associated with lower priority. I don't think that's what
> you'd want to happen, so priority should be considered with schedulers
> too.
>
Currently, nvme does not use an IO scheduler by default. If users want to
make WRR compatible with an IO scheduler, we can add further patches to
handle that.

> But really, though, NVMe's WRR is too heavy weight and difficult to use.
> The technical working group can come up with something better, but it looks
> like they've lost interest in TPAR 4011 (no discussion in 2 years, afaics).

Based on the test results, I think it's a useful feature. It really gives
high-priority applications higher IOPS/bandwidth and lower latency, and it
keeps the software layer thin and simple.

_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

Thread overview: 17+ messages
2020-02-04  3:30 [PATCH v5 0/4] Add support Weighted Round Robin for blkcg and nvme Weiping Zhang
2020-02-04  3:31 ` [PATCH v5 1/4] block: add weighted round robin for blkcgroup Weiping Zhang
2020-02-04  3:31 ` [PATCH v5 2/4] nvme: add get_ams for nvme_ctrl_ops Weiping Zhang
2020-02-04  3:31 ` [PATCH v5 3/4] nvme-pci: rename module parameter write_queues to read_queues Weiping Zhang
2020-02-04  3:31 ` [PATCH v5 4/4] nvme: add support weighted round robin queue Weiping Zhang
2020-02-04 15:42 ` [PATCH v5 0/4] Add support Weighted Round Robin for blkcg and nvme Keith Busch
2020-02-16  8:09   ` Weiping Zhang [this message]
2020-03-31  6:17     ` Weiping Zhang
2020-03-31 10:29       ` Paolo Valente
2020-03-31 14:36       ` Tejun Heo
2020-03-31 15:47         ` Weiping Zhang
2020-03-31 15:51           ` Tejun Heo
2020-03-31 15:52             ` Christoph Hellwig
2020-03-31 15:54               ` Tejun Heo
2020-03-31 16:31               ` Weiping Zhang
2020-03-31 16:33                 ` Christoph Hellwig
2020-03-31 16:52                   ` Weiping Zhang
