Subject: Re: [PATCH 0/8 v3] bfq: Limit number of allocated scheduler tags per cgroup
From: Paolo Valente
Date: Mon, 25 Oct 2021 09:58:11 +0200
To: Jan Kara
Cc: linux-block, Jens Axboe, Michal Koutný, 237826@studenti.unimore.it,
 224833@studenti.unimore.it, Giacomo Guiduzzi <224804@studenti.unimore.it>,
 238290@studenti.unimore.it, PAOLO CROTTI <204572@studenti.unimore.it>
References: <20211006164110.10817-1-jack@suse.cz>
X-Mailing-List: linux-block@vger.kernel.org

> On 7 Oct 2021, at 18:33, Paolo Valente wrote:
> 
> 
> 
>> On 6 Oct 2021, at 19:31, Jan Kara wrote:
>> 
>> Hello!
>> 
>> Here is the third revision of my patches to fix how bfq weights apply on
>> cgroup throughput and on the throughput of processes with different IO
>> priorities. Since v2 I've added some more patches so that now IO priorities
>> also result in service differentiation (previously they had no effect on
>> service differentiation for some workloads, similarly to cgroup weights).
>> The last patch in the series still needs some work, as in its current state
>> it causes a notable regression (~20-30%) with the dbench benchmark for
>> large numbers of clients. I've verified that the last patch is indeed
>> necessary for service differentiation with the workload described in its
>> changelog. As we discussed with Paolo, I have also found out that if I
>> remove the "waker has enough budget" condition from bfq_select_queue(),
>> dbench performance is restored and the service differentiation is still
>> good. But we probably need some justification, or a cleaner solution than
>> just removing the condition, so that is still up for discussion. The first
>> seven patches already noticeably improve the situation for lots of
>> workloads, though, so IMO they stand on their own and can be merged
>> regardless of how we go about the last patch.
>> 
> 
> Hi Jan,
> I have just one more (easy-to-resolve) doubt: you seem to have tested
> these patches mostly on the throughput side. Did you run a
> startup-latency test as well? I can run some for you, if you prefer.
> Just give me a few days.
> 

We are finally testing your patches right now, checking for regressions
with our typical benchmarks ...

Paolo

> When that is cleared, your first seven patches are ok for me.
> Actually I think they are a very good and relevant contribution.
> Patch number eight probably deserves some more analysis, as you pointed
> out. As I already told you in our call, we can look at that budget
> condition together, and figure out the best tests to use to check
> that I/O control is not lost too much.
> 
> Thanks,
> Paolo
> 
>> Changes since v2:
>> * Rebased on top of current Linus' tree
>> * Updated computation of scheduler tag proportions to work correctly even
>>   for processes within the same cgroup but with different IO priorities
>> * Added comment roughly explaining why we limit tag depth
>> * Added patch limiting waker / wakee detection in time to avoid at least
>>   the most obvious false positives
>> * Added patch to log waker / wakee detections in blktrace for better
>>   debugging
>> * Added patch to properly account injected IO
>> 
>> Changes since v1:
>> * Fixed computation of the appropriate proportion of scheduler tags for a
>>   cgroup to work with deeper cgroup hierarchies.
>> 
>> Original cover letter:
>> 
>> I was looking into why cgroup weights do not have any measurable impact on
>> writeback throughput from different cgroups. This is actually a regression
>> from CFQ, where things work more or less OK and weights have roughly the
>> impact they should. The problem can be reproduced e.g. by running the
>> following easy fio job in two cgroups with different weights:
>> 
>> [writer]
>> directory=/mnt/repro/
>> numjobs=1
>> rw=write
>> size=8g
>> time_based
>> runtime=30
>> ramp_time=10
>> blocksize=1m
>> direct=0
>> ioengine=sync
>> 
>> I can observe that there's no significant difference in the amount of data
>> written from the different cgroups, even though their weights are in, say,
>> a 1:3 ratio.
>> 
>> After some debugging I've understood the dynamics of the system. There are
>> two issues:
>> 
>> 1) The number of scheduler tags needs to be significantly larger than the
>> number of device tags. Otherwise there are not enough requests waiting in
>> BFQ to be dispatched to the device, and thus there's nothing to schedule
>> on.
>> 
>> 2) Even with enough scheduler tags, writers from the two cgroups eventually
>> start contending on scheduler tag allocation. Tags are served on a
>> first-come, first-served basis, so writers from both cgroups feed requests
>> into bfq at approximately the same speed. Since bfq prefers IO from the
>> heavier cgroup, that IO is submitted and completed faster, and eventually
>> we end up in a situation where there's no IO from the heavier cgroup in bfq
>> and all scheduler tags are consumed by requests from the lighter cgroup. At
>> that point bfq just dispatches lots of IO from the lighter cgroup, since
>> there's no contender for disk throughput. As a result, the observed
>> throughput for both cgroups is the same.
>> 
>> This series fixes the problem by accounting how many scheduler tags are
>> allocated for each cgroup, and if a cgroup has more tags allocated than its
>> fair share (based on weights) in its service tree, we heavily limit the
>> scheduler tag bitmap depth for it so that it is not able to starve other
>> cgroups of scheduler tags.
>> 
>> What do people think about this?
>> 
>> Honza
>> 
>> Previous versions:
>> Link: http://lore.kernel.org/r/20210712171146.12231-1-jack@suse.cz # v1
>> Link: http://lore.kernel.org/r/20210715132047.20874-1-jack@suse.cz # v2
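
P.S. For readers skimming the archive, here is a rough, illustrative sketch
of the depth-limiting idea described in the cover letter: account per-cgroup
scheduler tag usage, and shrink the tag-allocation depth for a cgroup that
exceeds its weight-based fair share. This is not the actual patch code; the
names (total_sched_tags, cgroup_tags_in_flight, cgroup_weight, total_weight)
are hypothetical placeholders for the bookkeeping the series adds.

/*
 * Illustrative sketch only -- not the real implementation.
 * Returns the tag-allocation depth a cgroup should be allowed to use.
 */
static unsigned int limited_tag_depth(unsigned int total_sched_tags,
				      unsigned int cgroup_tags_in_flight,
				      unsigned int cgroup_weight,
				      unsigned int total_weight)
{
	/* Fair share of scheduler tags implied by the cgroup's weight. */
	unsigned int fair_share =
		total_sched_tags * cgroup_weight / total_weight;

	/*
	 * A cgroup already holding more tags than its fair share gets a
	 * heavily reduced depth, so it cannot starve other cgroups of
	 * scheduler tags; everyone else keeps the full depth.
	 */
	if (cgroup_tags_in_flight > fair_share)
		return fair_share > 4 ? fair_share / 4 : 1;

	return total_sched_tags;
}

The point of shrinking the depth, rather than failing allocations outright,
is presumably that an over-quota writer simply blocks in scheduler tag
allocation, which throttles its submission rate until the heavier cgroup can
get requests into bfq again.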