From mboxrd@z Thu Jan 1 00:00:00 1970
From: Weiping Zhang
Date: Sun, 16 Feb 2020 16:09:34 +0800
Subject: Re: [PATCH v5 0/4] Add support Weighted Round Robin for blkcg and nvme
To: Keith Busch
Cc: Jens Axboe, Tejun Heo, Christoph Hellwig, Bart Van Assche, Minwoo Im,
 Thomas Gleixner, Ming Lei, "Nadolski, Edmund", linux-block@vger.kernel.org,
 cgroups@vger.kernel.org, linux-nvme@lists.infradead.org
In-Reply-To: <20200204154200.GA5831@redsun51.ssa.fujisawa.hgst.com>
References: <20200204154200.GA5831@redsun51.ssa.fujisawa.hgst.com>
Content-Type: text/plain; charset="UTF-8"
Keith Busch wrote on Tue, Feb 4, 2020 at 11:42 PM:
>
> On Tue, Feb 04, 2020 at 11:30:45AM +0800, Weiping Zhang wrote:
> > This series try to add Weighted Round Robin for block cgroup and nvme
> > driver. When multiple containers share a single nvme device, we want
> > to protect IO critical container from not be interfernced by other
> > containers. We add blkio.wrr interface to user to control their IO
> > priority. The blkio.wrr accept five level priorities, which contains
> > "urgent", "high", "medium", "low" and "none", the "none" is used for
> > disable WRR for this cgroup.
>
Hi Keith,

> The NVMe protocol really doesn't define WRR to be a mechanism to mitigate
> interference, though. It defines credits among the weighted queues
> only for command fetching, and an urgent strict priority class that
> starves the rest. It has nothing to do with how the controller should
> prioritize completion of those commands, even if it may be reasonable to
> assume influencing when the command is fetched should affect its
> completion.
>
Thanks for your feedback. The fio test results on WRR show that the
high-wrr fio jobs get more bandwidth/IOPS and lower latency. I think it is
a good feature for running multiple workloads with different priorities,
especially for container colocation.

> On the "weighted" strict priority, there's nothing separating "high"
> from "low" other than the name: the "set features" credit assignment
> can invert which queues have higher command fetch rates such that the
> "low" is favoured over the "high".
>
If there is no limitation in the hardware controller, we can add more
checking in the "set features" command path. I think most people won't
give "low" more credits than "high"; it really does not make sense.
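To make that checking concrete, here is a minimal user-space sketch of how
the WRR credits could be packed and sanity-checked before issuing Set
Features. It assumes the Arbitration feature layout (Feature ID 01h,
Command Dword 11: Arbitration Burst in bits 2:0, Low/Medium/High Priority
Weight in bytes 1-3); the helper names are made up for illustration and are
not existing driver code:

#include <stdint.h>
#include <stdio.h>

#define NVME_FEAT_ARBITRATION 0x01 /* Arbitration feature identifier */

/*
 * Pack the WRR credits into Set Features (Arbitration) Command Dword 11:
 *   bits  2:0  Arbitration Burst (AB)
 *   bits 15:8  Low Priority Weight (LPW)
 *   bits 23:16 Medium Priority Weight (MPW)
 *   bits 31:24 High Priority Weight (HPW)
 */
static uint32_t nvme_arb_dword11(uint8_t ab, uint8_t lpw, uint8_t mpw, uint8_t hpw)
{
        return (uint32_t)(ab & 0x7) | ((uint32_t)lpw << 8) |
               ((uint32_t)mpw << 16) | ((uint32_t)hpw << 24);
}

/* The extra checking suggested above: reject inverted weight assignments. */
static int nvme_arb_weights_sane(uint8_t lpw, uint8_t mpw, uint8_t hpw)
{
        return lpw <= mpw && mpw <= hpw;
}

int main(void)
{
        uint8_t ab = 3, lpw = 1, mpw = 4, hpw = 16; /* example credit values */

        if (!nvme_arb_weights_sane(lpw, mpw, hpw)) {
                fprintf(stderr, "rejecting inverted WRR weights\n");
                return 1;
        }
        printf("Set Features fid=0x%02x dw11=0x%08x\n",
               NVME_FEAT_ARBITRATION, nvme_arb_dword11(ab, lpw, mpw, hpw));
        return 0;
}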
> There's no protection against the "urgent" class starving others: normal
> IO will timeout and trigger repeated controller resets, while polled IO
> will consume 100% of CPU cycles without making any progress if we make
> this type of queue available without any additional code to ensure the
> host behaves..
>
I think we can just disable it in the software layer; actually, I have no
real application that needs this.

> On the driver implementation, the number of module parameters being
> added here is problematic. We already have 2 special classes of queues,
> and defining this at the module level is considered too coarse when
> the system has different devices on opposite ends of the capability
> spectrum. For example, users want polled queues for the fast devices,
> and none for the slower tier. We just don't have a good mechanism to
> define per-controller resources, and more queue classes will make this
> problem worse.
>
We can add a new "string" module parameter which contains a model number;
in most cases the same product line shares a common model-number prefix, so
nvme can distinguish devices of different performance (high end or low end).
Before creating the IO queues, the nvme driver can read the device's model
number (40 bytes) and compare it with the module parameter to decide how
many IO queues of each type to create for that disk:

/*
 * If model_number is MODEL_ANY, these parameters will be applied to
 * all nvme devices.
 */
char dev_io_queues[1024] =
        "model_number=MODEL_ANY,poll=0,read=0,wrr_low=0,wrr_medium=0,wrr_high=0,wrr_urgent=0";

/* These parameters only affect nvme disks whose model number is "XXX". */
char dev_io_queues[1024] =
        "model_number=XXX,poll=1,read=2,wrr_low=3,wrr_medium=4,wrr_high=5,wrr_urgent=0";

struct dev_io_queues {
        char model_number[40];
        unsigned int poll;
        unsigned int read;
        unsigned int wrr_low;
        unsigned int wrr_medium;
        unsigned int wrr_high;
        unsigned int wrr_urgent;
};

We can use these two variables to store the IO queue configurations:

/* default values for all disks whose model number is not in io_queues_cfg */
struct dev_io_queues io_queues_def = {};

/* user defined values for a specific model number */
struct dev_io_queues io_queues_cfg = {};

If we need multiple configurations (> 2), we can also extend dev_io_queues
to support them (a rough parsing sketch is appended at the end of this mail).

> On the blk-mq side, this implementation doesn't work with the IO
> schedulers. If one is in use, requests may be reordered such that a
> request on your high-priority hctx may be dispatched later than more
> recent ones associated with lower priority. I don't think that's what
> you'd want to happen, so priority should be considered with schedulers
> too.
>
Currently, nvme does not use an IO scheduler by default; if users want to
make WRR compatible with the IO schedulers, we can add other patches to
handle that.

> But really, though, NVMe's WRR is too heavy weight and difficult to use.
> The techincal work group can come up with something better, but it looks
> like they've lost interest in TPAR 4011 (no discussion in 2 years, afaics).

From the test results, I think it is a useful feature. It really gives
high-priority applications higher IOPS/bandwidth and lower latency, and it
makes the software very thin and simple.
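Coming back to the dev_io_queues parameter above, here is a rough standalone
sketch of how such a string could be parsed and how the per-model
configuration could be selected. The parameter format, structure and helper
names are only the proposal above, nothing that exists in the driver today:

#include <stdio.h>
#include <string.h>

/* Mirrors the structure proposed above; names are illustrative only. */
struct dev_io_queues {
        char model_number[41];     /* NVMe model number is 40 bytes + NUL */
        unsigned int poll;
        unsigned int read;
        unsigned int wrr_low;
        unsigned int wrr_medium;
        unsigned int wrr_high;
        unsigned int wrr_urgent;
};

/*
 * Parse a string such as
 *   "model_number=XXX,poll=1,read=2,wrr_low=3,wrr_medium=4,wrr_high=5,wrr_urgent=0"
 * into the structure above. Returns 0 on success, -1 on a malformed field.
 */
static int parse_dev_io_queues(const char *param, struct dev_io_queues *cfg)
{
        char buf[1024];
        char *tok, *save = NULL;

        memset(cfg, 0, sizeof(*cfg));
        snprintf(buf, sizeof(buf), "%s", param);

        for (tok = strtok_r(buf, ",;", &save); tok; tok = strtok_r(NULL, ",;", &save)) {
                while (*tok == ' ')
                        tok++;
                if (!strncmp(tok, "model_number=", 13))
                        snprintf(cfg->model_number, sizeof(cfg->model_number), "%s", tok + 13);
                else if (sscanf(tok, "poll=%u", &cfg->poll) == 1 ||
                         sscanf(tok, "read=%u", &cfg->read) == 1 ||
                         sscanf(tok, "wrr_low=%u", &cfg->wrr_low) == 1 ||
                         sscanf(tok, "wrr_medium=%u", &cfg->wrr_medium) == 1 ||
                         sscanf(tok, "wrr_high=%u", &cfg->wrr_high) == 1 ||
                         sscanf(tok, "wrr_urgent=%u", &cfg->wrr_urgent) == 1)
                        continue;
                else
                        return -1;
        }
        return 0;
}

/* Use the per-model config when the controller's model number matches, else the default. */
static const struct dev_io_queues *
select_io_queues(const char *ctrl_model, const struct dev_io_queues *def,
                 const struct dev_io_queues *cfg)
{
        return !strcmp(ctrl_model, cfg->model_number) ? cfg : def;
}

int main(void)
{
        struct dev_io_queues def, cfg;

        parse_dev_io_queues("model_number=MODEL_ANY,poll=0,read=0,wrr_low=0,"
                            "wrr_medium=0,wrr_high=0,wrr_urgent=0", &def);
        parse_dev_io_queues("model_number=XXX,poll=1,read=2,wrr_low=3,"
                            "wrr_medium=4,wrr_high=5,wrr_urgent=0", &cfg);

        const struct dev_io_queues *sel = select_io_queues("XXX", &def, &cfg);
        printf("model=%s poll=%u read=%u wrr=%u/%u/%u/%u\n",
               sel->model_number, sel->poll, sel->read,
               sel->wrr_low, sel->wrr_medium, sel->wrr_high, sel->wrr_urgent);
        return 0;
}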