From: Chris Mason
Subject: Re: [PATCH 09/24] xfs: don't allow log IO to be throttled
Date: Fri, 2 Aug 2019 14:11:53 +0000
Message-ID: <7093F5C3-53D2-4C49-9C0D-64B20C565D18@fb.com>
In-Reply-To: <20190801235849.GO7777@dread.disaster.area>
References: <20190801021752.4986-1-david@fromorbit.com> <20190801021752.4986-10-david@fromorbit.com> <20190801235849.GO7777@dread.disaster.area>
To: Dave Chinner
Cc: linux-xfs@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Jens Axboe

On 1 Aug 2019, at 19:58, Dave Chinner wrote:

> On Thu, Aug 01, 2019 at 01:39:34PM +0000, Chris Mason wrote:
>> On 31 Jul 2019, at 22:17, Dave Chinner wrote:
>>
>>> From: Dave Chinner
>>>
>>> Running metadata intensive workloads, I've been seeing the AIL
>>> pushing getting stuck on pinned buffers and triggering log forces.
>>> The log force is taking a long time to run because the log IO is
>>> getting throttled by wbt_wait() - the block layer writeback
>>> throttle. It's being throttled because there is a huge amount of
>>> metadata writeback going on which is filling the request queue.
>>>
>>> IOWs, we have a priority inversion problem here.
>>>
>>> Mark the log IO bios with REQ_IDLE so they don't get throttled
>>> by the block layer writeback throttle. When we are forcing the CIL,
>>> we are likely to need to do tens of log IOs, and they are issued as
>>> fast as they can be built and their IO completed. Hence REQ_IDLE is
>>> appropriate - it's an indication that more IO will follow shortly.
>>>
>>> And because we also set REQ_SYNC, the writeback throttle will now
>>> treat log IO the same way it treats direct IO writes - it will not
>>> throttle them at all. Hence we solve the priority inversion problem
>>> caused by the writeback throttle being unable to distinguish between
>>> high priority log IO and background metadata writeback.
>>>
>> [ cc Jens ]
>>
>> We spent a lot of time getting rid of these inversions in io.latency
>> (and the new io.cost), where REQ_META just blows through the
>> throttling and goes into back charging instead.
>
> Which simply reinforces the fact that request type based throttling
> is a fundamentally broken architecture.
>
>> It feels awkward to have one set of prio inversion workarounds for
>> io.* and another for wbt. Jens, should we make an explicit one that
>> doesn't rely on magic side effects, or just decide that metadata is
>> meta enough to break all the rules?
>
> The problem isn't that REQ_META blows through the throttling, the
> problem is that different REQ_META IOs have different priority.

Yes and no. At some point important FS threads have the potential to
wait on every single REQ_META IO on the box, so every single REQ_META
IO has the potential to create priority inversions.
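
For anyone reading along without the block layer code handy, the wbt
decision this all hinges on is roughly the check below. This is a
paraphrase of the write-throttling test in block/blk-wbt.c, not a
verbatim copy of any particular tree, and the comments are mine:

static bool wbt_should_throttle(struct rq_wb *rwb, struct bio *bio)
{
        switch (bio_op(bio)) {
        case REQ_OP_WRITE:
                /*
                 * Writes tagged REQ_SYNC | REQ_IDLE (direct IO style)
                 * bypass the throttle entirely; tagging the log bios
                 * this way is what sidesteps the inversion.
                 */
                if ((bio->bi_opf & (REQ_SYNC | REQ_IDLE)) ==
                    (REQ_SYNC | REQ_IDLE))
                        return false;
                /* fall through */
        case REQ_OP_DISCARD:
                return true;
        default:
                return false;
        }
}

So with REQ_SYNC | REQ_IDLE set, wbt never sees the log bios as
throttleable writes at all, which is the side effect the patch leans
on.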

> IOWs, the problem here is that we are trying to infer priority from
> the request type rather than an actual priority assigned by the
> submitter. There is no way direct IO has higher priority in a
> filesystem than log IO tagged with REQ_META, as direct IO can require
> log IO to make progress. Priority is a policy determined by the
> submitter, not the mechanism doing the throttling.
>
> Can we please move this all over to priorities based on
> bio->b_ioprio? And then document how the range of priorities are
> managed, such as:
>
> (99 = highest prio to 0 = lowest)
>
> swap out
> swap in                       >90
> User hard RT max               89
> User hard RT min               80
> filesystem max                 79
> ionice max                     60
> background data writeback      40
> ionice min                     20
> filesystem min                 10
> idle                            0
>
> So that we can appropriately prioritise different types of kernel
> internal IO w.r.t user controlled IO priorities? This way we can
> still tag the bios with the type of data they contain, but we no
> longer use that to determine whether to throttle that IO or not -
> throttling/scheduling should be done entirely on a priority basis.

I think you and I are describing solutions to different problems. The
reason the back charging works so well in io.latency and io.cost is
that the IO controllers are able to remember that a given cgroup
created X amount of IO, and then just make that cgroup wait at a safe
time, instead of trying to assign priority to things that have
infinite priority.

I can't really see bio->b_ioprio working without the rest of the IO
controller logic creating a sensible system, and giving userland the
framework to define weights etc. My question is whether it's worth
trying this inside of the wbt code, or if we should just let the
metadata go through.

Tejun reminded me that in a lot of ways, swap is user IO and it's
actually fine to have it prioritized at the same level as user IO. We
don't want to let a low prio app thrash the drive swapping things in
and out all the time, and it's actually fine to make them wait as long
as other higher priority processes aren't waiting for the memory. This
depends on the cgroup config, so wrt your current patches it probably
sounds crazy, but we have a lot of data around this from the fleet.

-chris