From mboxrd@z Thu Jan 1 00:00:00 1970
From: Weiping Zhang
Date: Sun, 16 Feb 2020 16:09:34 +0800
Subject: Re: [PATCH v5 0/4] Add support Weighted Round Robin for blkcg and nvme
To: Keith Busch
Cc: Jens Axboe, Tejun Heo, Christoph Hellwig, Bart Van Assche, Minwoo Im,
 Thomas Gleixner, Ming Lei, "Nadolski, Edmund", linux-block@vger.kernel.org,
 cgroups@vger.kernel.org, linux-nvme@lists.infradead.org
In-Reply-To: <20200204154200.GA5831@redsun51.ssa.fujisawa.hgst.com>
References: <20200204154200.GA5831@redsun51.ssa.fujisawa.hgst.com>
Content-Type: text/plain; charset="UTF-8"
Keith Busch wrote on Tue, Feb 4, 2020 at 11:42 PM:
>
> On Tue, Feb 04, 2020 at 11:30:45AM +0800, Weiping Zhang wrote:
> > This series try to add Weighted Round Robin for block cgroup and nvme
> > driver. When multiple containers share a single nvme device, we want
> > to protect IO critical container from not be interfernced by other
> > containers. We add blkio.wrr interface to user to control their IO
> > priority. The blkio.wrr accept five level priorities, which contains
> > "urgent", "high", "medium", "low" and "none", the "none" is used for
> > disable WRR for this cgroup.
>
Hi Keith,

> The NVMe protocol really doesn't define WRR to be a mechanism to mitigate
> interference, though. It defines credits among the weighted queues
> only for command fetching, and an urgent strict priority class that
> starves the rest. It has nothing to do with how the controller should
> prioritize completion of those commands, even if it may be reasonable to
> assume influencing when the command is fetched should affect its
> completion.
>
Thanks for your feedback. The fio test results on WRR show that the
high-wrr fio jobs get more bandwidth/IOPS and lower latency. I think it is
a good feature for running multiple workloads with different priorities,
especially for container colocation.

> On the "weighted" strict priority, there's nothing separating "high"
> from "low" other than the name: the "set features" credit assignment
> can invert which queues have higher command fetch rates such that the
> "low" is favoured over the "high".
>
If there is no limitation in the hardware controller, we can add more
checking in the "set features" command path. I think most people won't
give "low" more credits than "high"; it really does not make sense.
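To make that checking concrete, here is a minimal user-space sketch of how
the WRR credits could be packed and sanity-checked before issuing Set
Features. It assumes the Arbitration feature layout (Feature ID 01h,
Command Dword 11: Arbitration Burst in bits 2:0, Low/Medium/High Priority
Weight in bytes 1-3); the helper names are made up for illustration and are
not existing driver code:

#include <stdint.h>
#include <stdio.h>

#define NVME_FEAT_ARBITRATION 0x01 /* Arbitration feature identifier */

/*
 * Pack the WRR credits into Set Features (Arbitration) Command Dword 11:
 *   bits  2:0  Arbitration Burst (AB)
 *   bits 15:8  Low Priority Weight (LPW)
 *   bits 23:16 Medium Priority Weight (MPW)
 *   bits 31:24 High Priority Weight (HPW)
 */
static uint32_t nvme_arb_dword11(uint8_t ab, uint8_t lpw, uint8_t mpw, uint8_t hpw)
{
        return (uint32_t)(ab & 0x7) | ((uint32_t)lpw << 8) |
               ((uint32_t)mpw << 16) | ((uint32_t)hpw << 24);
}

/* The extra checking suggested above: reject inverted weight assignments. */
static int nvme_arb_weights_sane(uint8_t lpw, uint8_t mpw, uint8_t hpw)
{
        return lpw <= mpw && mpw <= hpw;
}

int main(void)
{
        uint8_t ab = 3, lpw = 1, mpw = 4, hpw = 16; /* example credit values */

        if (!nvme_arb_weights_sane(lpw, mpw, hpw)) {
                fprintf(stderr, "rejecting inverted WRR weights\n");
                return 1;
        }
        printf("Set Features fid=0x%02x dw11=0x%08x\n",
               NVME_FEAT_ARBITRATION, nvme_arb_dword11(ab, lpw, mpw, hpw));
        return 0;
}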
> There's no protection against the "urgent" class starving others: normal
> IO will timeout and trigger repeated controller resets, while polled IO
> will consume 100% of CPU cycles without making any progress if we make
> this type of queue available without any additional code to ensure the
> host behaves..
>
I think we can just disable it in the software layer; actually, I have no
real application that needs this.

> On the driver implementation, the number of module parameters being
> added here is problematic. We already have 2 special classes of queues,
> and defining this at the module level is considered too coarse when
> the system has different devices on opposite ends of the capability
> spectrum. For example, users want polled queues for the fast devices,
> and none for the slower tier. We just don't have a good mechanism to
> define per-controller resources, and more queue classes will make this
> problem worse.
>
We can add a new "string" module parameter which contains a model number;
in most cases the same product line shares a common model-number prefix, so
nvme can distinguish devices of different performance (high end or low end).
Before creating the IO queues, the nvme driver can read the device's model
number (40 bytes) and compare it with the module parameter to decide how
many IO queues of each type to create for that disk:

/*
 * If model_number is MODEL_ANY, these parameters will be applied to
 * all nvme devices.
 */
char dev_io_queues[1024] =
        "model_number=MODEL_ANY,poll=0,read=0,wrr_low=0,wrr_medium=0,wrr_high=0,wrr_urgent=0";

/* These parameters only affect nvme disks whose model number is "XXX". */
char dev_io_queues[1024] =
        "model_number=XXX,poll=1,read=2,wrr_low=3,wrr_medium=4,wrr_high=5,wrr_urgent=0";

struct dev_io_queues {
        char model_number[40];
        unsigned int poll;
        unsigned int read;
        unsigned int wrr_low;
        unsigned int wrr_medium;
        unsigned int wrr_high;
        unsigned int wrr_urgent;
};

We can use these two variables to store the IO queue configurations:

/* default values for all disks whose model number is not in io_queues_cfg */
struct dev_io_queues io_queues_def = {};

/* user defined values for a specific model number */
struct dev_io_queues io_queues_cfg = {};

If we need multiple configurations (> 2), we can also extend dev_io_queues
to support them (a rough parsing sketch is appended at the end of this mail).

> On the blk-mq side, this implementation doesn't work with the IO
> schedulers. If one is in use, requests may be reordered such that a
> request on your high-priority hctx may be dispatched later than more
> recent ones associated with lower priority. I don't think that's what
> you'd want to happen, so priority should be considered with schedulers
> too.
>
Currently, nvme does not use an IO scheduler by default; if users want to
make WRR compatible with the IO schedulers, we can add other patches to
handle that.

> But really, though, NVMe's WRR is too heavy weight and difficult to use.
> The techincal work group can come up with something better, but it looks
> like they've lost interest in TPAR 4011 (no discussion in 2 years, afaics).

From the test results, I think it is a useful feature. It really gives
high-priority applications higher IOPS/bandwidth and lower latency, and it
makes the software very thin and simple.
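Coming back to the dev_io_queues parameter above, here is a rough standalone
sketch of how such a string could be parsed and how the per-model
configuration could be selected. The parameter format, structure and helper
names are only the proposal above, nothing that exists in the driver today:

#include <stdio.h>
#include <string.h>

/* Mirrors the structure proposed above; names are illustrative only. */
struct dev_io_queues {
        char model_number[41];     /* NVMe model number is 40 bytes + NUL */
        unsigned int poll;
        unsigned int read;
        unsigned int wrr_low;
        unsigned int wrr_medium;
        unsigned int wrr_high;
        unsigned int wrr_urgent;
};

/*
 * Parse a string such as
 *   "model_number=XXX,poll=1,read=2,wrr_low=3,wrr_medium=4,wrr_high=5,wrr_urgent=0"
 * into the structure above. Returns 0 on success, -1 on a malformed field.
 */
static int parse_dev_io_queues(const char *param, struct dev_io_queues *cfg)
{
        char buf[1024];
        char *tok, *save = NULL;

        memset(cfg, 0, sizeof(*cfg));
        snprintf(buf, sizeof(buf), "%s", param);

        for (tok = strtok_r(buf, ",;", &save); tok; tok = strtok_r(NULL, ",;", &save)) {
                while (*tok == ' ')
                        tok++;
                if (!strncmp(tok, "model_number=", 13))
                        snprintf(cfg->model_number, sizeof(cfg->model_number), "%s", tok + 13);
                else if (sscanf(tok, "poll=%u", &cfg->poll) == 1 ||
                         sscanf(tok, "read=%u", &cfg->read) == 1 ||
                         sscanf(tok, "wrr_low=%u", &cfg->wrr_low) == 1 ||
                         sscanf(tok, "wrr_medium=%u", &cfg->wrr_medium) == 1 ||
                         sscanf(tok, "wrr_high=%u", &cfg->wrr_high) == 1 ||
                         sscanf(tok, "wrr_urgent=%u", &cfg->wrr_urgent) == 1)
                        continue;
                else
                        return -1;
        }
        return 0;
}

/* Use the per-model config when the controller's model number matches, else the default. */
static const struct dev_io_queues *
select_io_queues(const char *ctrl_model, const struct dev_io_queues *def,
                 const struct dev_io_queues *cfg)
{
        return !strcmp(ctrl_model, cfg->model_number) ? cfg : def;
}

int main(void)
{
        struct dev_io_queues def, cfg;

        parse_dev_io_queues("model_number=MODEL_ANY,poll=0,read=0,wrr_low=0,"
                            "wrr_medium=0,wrr_high=0,wrr_urgent=0", &def);
        parse_dev_io_queues("model_number=XXX,poll=1,read=2,wrr_low=3,"
                            "wrr_medium=4,wrr_high=5,wrr_urgent=0", &cfg);

        const struct dev_io_queues *sel = select_io_queues("XXX", &def, &cfg);
        printf("model=%s poll=%u read=%u wrr=%u/%u/%u/%u\n",
               sel->model_number, sel->poll, sel->read,
               sel->wrr_low, sel->wrr_medium, sel->wrr_high, sel->wrr_urgent);
        return 0;
}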