From mboxrd@z Thu Jan 1 00:00:00 1970
From: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
To: Ming Lei
CC: Damien Le Moal, "linux-block@vger.kernel.org", Jens Axboe
Subject: Re: commit 01e99aeca397 causes longer runtime of block/004
Date: Tue, 10 Mar 2020 11:07:04 +0000
Message-ID: <20200310110703.hvaek3opc67lcr74@shindev.dhcp.fujisawa.hgst.com>
In-Reply-To: <20200310080744.GA7618@ming.t460p>
References: <20200305024808.GA26733@ming.t460p>
 <20200306060622.t2jl7qkzvkwvvcbx@shindev.dhcp.fujisawa.hgst.com>
 <20200306081309.GA29683@ming.t460p>
 <20200307010222.gtrwivafqe2254i6@shindev.dhcp.fujisawa.hgst.com>
 <20200307041343.GB20579@ming.t460p>
 <20200309000715.sqgsssrauzmmpli2@shindev.dhcp.fujisawa.hgst.com>
 <20200309161430.GA4871@ming.t460p>
 <20200310055417.o3jghx4sl5xtztci@shindev.dhcp.fujisawa.hgst.com>
 <20200310080744.GA7618@ming.t460p>

On Mar 10, 2020 / 16:07, Ming Lei wrote:
> On Tue, Mar 10, 2020 at 05:54:18AM +0000, Shinichiro Kawasaki wrote:
> > Ming, thank you for sharing the log files and analysis.
> >
> > On Mar 10, 2020 / 03:07, Damien Le Moal wrote:
> > > On 2020/03/10 1:14, Ming Lei wrote:
> > > > On Mon, Mar 09, 2020 at 12:07:16AM +0000, Shinichiro Kawasaki wrote:
> > > >> On Mar 07, 2020 / 12:13, Ming Lei wrote:
> > > >>> On Sat, Mar 07, 2020 at 01:02:23AM +0000, Shinichiro Kawasaki wrote:
> > > >>>> On Mar 06, 2020 / 16:13, Ming Lei wrote:
> > > >>>>> On Fri, Mar 06, 2020 at 06:06:23AM +0000, Shinichiro Kawasaki wrote:
> > > >>>>>> On Mar 05, 2020 / 10:48, Ming Lei wrote:
> > > >>>>>>> Hi Shinichiro,
> > > >>>>>>>
> > > >>>>>>> On Thu, Mar 05, 2020 at 01:19:02AM +0000, Shinichiro Kawasaki wrote:
> > > >>>>>>>> On Mar 04, 2020 / 17:53, Ming Lei wrote:
> > > >>>>>>>>> On Wed, Mar 04, 2020 at 06:11:37AM +0000, Shinichiro Kawasaki wrote:
> > > >>>>>>>>>> On Mar 04, 2020 / 11:46, Ming Lei wrote:
> > > >>>>>>>>>>> On Wed, Mar 04, 2020 at 02:38:43AM +0000, Shinichiro Kawasaki wrote:
> > > >>>>>>>>>>>> I noticed that blktests block/004 takes longer runtime with 5.6-rc4 than
> > > >>>>>>>>>>>> 5.6-rc3, and found that the commit 01e99aeca397 ("blk-mq: insert passthrough
> > > >>>>>>>>>>>> request into hctx->dispatch directly") triggers it.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> The longer runtime was observed with dm-linear device which maps SATA SMR HDD
> > > >>>>>>>>>>>> connected via AHCI. It was not observed with dm-linear on SAS/SATA SMR HDDs
> > > >>>>>>>>>>>> connected via SAS-HBA. Not observed with dm-linear on non-SMR HDDs either.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Before the commit, block/004 took around 130 seconds. After the commit, it takes
> > > >>>>>>>>>>>> around 300 seconds.
> > > >>>>>>>>>>>> I need to dig into further details to understand why the
> > > >>>>>>>>>>>> commit makes the test case longer.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> The test case block/004 does a "flush intensive workload". Is this longer runtime
> > > >>>>>>>>>>>> expected?
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> The following patch might address this issue:
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> https://lore.kernel.org/linux-block/20200207190416.99928-1-sqazi@google.com/#t
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Please test and provide us the result.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> thanks,
> > > >>>>>>>>>>> Ming
> > > >>>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> Hi Ming,
> > > >>>>>>>>>>
> > > >>>>>>>>>> I applied the patch to 5.6-rc4 but I still observed the longer runtime of block/004.
> > > >>>>>>>>>> It still takes around 300 seconds.
> > > >>>>>>>>>
> > > >>>>>>>>> Hello Shinichiro,
> > > >>>>>>>>>
> > > >>>>>>>>> block/004 only sends 1564 sync randwrites, and 130s already seems slow
> > > >>>>>>>>> enough.
> > > >>>>>>>>>
> > > >>>>>>>>> There are two related effects in that commit for your issue:
> > > >>>>>>>>>
> > > >>>>>>>>> 1) 'at_head' is applied in blk_mq_sched_insert_request() for flush
> > > >>>>>>>>> requests
> > > >>>>>>>>>
> > > >>>>>>>>> 2) all IO is added back to the tail of hctx->dispatch after .queue_rq()
> > > >>>>>>>>> returns STS_RESOURCE
> > > >>>>>>>>>
> > > >>>>>>>>> It seems more related to 2) given you can't reproduce the issue on
> > > >>>>>>>>> SAS.
> > > >>>>>>>>>
> > > >>>>>>>>> So please test the following two patches, and see which one makes a
> > > >>>>>>>>> difference for you.
> > > >>>>>>>>>
> > > >>>>>>>>> BTW, neither of the two looks reasonable; they are just for narrowing down the issue.
> > > >>>>>>>>>
> > > >>>>>>>>> 1) patch 1
> > > >>>>>>>>>
> > > >>>>>>>>> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > > >>>>>>>>> index 856356b1619e..86137c75283c 100644
> > > >>>>>>>>> --- a/block/blk-mq-sched.c
> > > >>>>>>>>> +++ b/block/blk-mq-sched.c
> > > >>>>>>>>> @@ -398,7 +398,7 @@ void blk_mq_sched_insert_request(struct request *rq, bool at_head,
> > > >>>>>>>>>         WARN_ON(e && (rq->tag != -1));
> > > >>>>>>>>>
> > > >>>>>>>>>         if (blk_mq_sched_bypass_insert(hctx, !!e, rq)) {
> > > >>>>>>>>> -               blk_mq_request_bypass_insert(rq, at_head, false);
> > > >>>>>>>>> +               blk_mq_request_bypass_insert(rq, true, false);
> > > >>>>>>>>>                 goto run;
> > > >>>>>>>>>         }
> > > >>>>>>>>
> > > >>>>>>>> Ming, thank you for the trial patches.
> > > >>>>>>>> This "patch 1" reduced the runtime, to as short as rc3.
> > > >>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> 2) patch 2
> > > >>>>>>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > >>>>>>>>> index d92088dec6c3..447d5cb39832 100644
> > > >>>>>>>>> --- a/block/blk-mq.c
> > > >>>>>>>>> +++ b/block/blk-mq.c
> > > >>>>>>>>> @@ -1286,7 +1286,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
> > > >>>>>>>>>                         q->mq_ops->commit_rqs(hctx);
> > > >>>>>>>>>
> > > >>>>>>>>>                 spin_lock(&hctx->lock);
> > > >>>>>>>>> -               list_splice_tail_init(list, &hctx->dispatch);
> > > >>>>>>>>> +               list_splice_init(list, &hctx->dispatch);
> > > >>>>>>>>>                 spin_unlock(&hctx->lock);
> > > >>>>>>>>>
> > > >>>>>>>>>                 /*
> > > >>>>>>>>
> > > >>>>>>>> This patch 2 didn't reduce the runtime.
> > > >>>>>>>>
> > > >>>>>>>> I hope this report helps.
> > > >>>>>>>
> > > >>>>>>> Your feedback does help. Then please test the following patch:
> > > >>>>>>
> > > >>>>>> Hi Ming, thank you for the patch.
> > > >>>>>> I applied it on top of rc4 and confirmed that it reduces the runtime, to as short as rc3. Good.
> > > >>>>>
> > > >>>>> Hi Shinichiro,
> > > >>>>>
> > > >>>>> Thanks for your test!
> > > >>>>>
> > > >>>>> Then I think the following change should actually make the difference;
> > > >>>>> you may double check and confirm whether it is that.
> > > >>>>>
> > > >>>>>> @@ -334,7 +334,7 @@ static void blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
> > > >>>>>>         flush_rq->rq_disk = first_rq->rq_disk;
> > > >>>>>>         flush_rq->end_io = flush_end_io;
> > > >>>>>>
> > > >>>>>> -       blk_flush_queue_rq(flush_rq, false);
> > > >>>>>> +       blk_flush_queue_rq(flush_rq, true);
> > > >>>>
> > > >>>> Yes, with only the one-line change above, the runtime was reduced.
> > > >>>>
> > > >>>>>
> > > >>>>> However, the flush request had been added to the tail of the dispatch queue[1] for a long time.
> > > >>>>> 0cacba6cf825 ("blk-mq-sched: bypass the scheduler for flushes entirely")
> > > >>>>> and its predecessor (all the mq scheduler start patches) changed this to add the flush request
> > > >>>>> to the front of the dispatch queue for blk-mq by ignoring the 'add_queue' parameter of
> > > >>>>> blk_mq_sched_insert_flush(). That change may have been by accident, and I am not sure it is
> > > >>>>> correct.
> > > >>>>>
> > > >>>>> I guess once the flush rq is added to the tail of the dispatch queue in block/004,
> > > >>>>> in which lots of FS requests may stay in hctx->dispatch because of the low
> > > >>>>> AHCI queue depth, then it may take a while for the flush rq to be
> > > >>>>> submitted to the LLD.
> > > >>>>>
> > > >>>>> I'd suggest to root cause/understand the issue, given it isn't obviously
> > > >>>>> correct to queue the flush rq at the front of the dispatch queue, so could you collect
> > > >>>>> the following trace via the following script with/without the single line
> > > >>>>> patch?
> > > >>>>
> > > >>>> Thank you for the thoughts on the correct design. I have taken the two traces,
> > > >>>> with and without the one-liner patch above. The gzip archived trace files have
> > > >>>> 1.6MB size. That looks too large to post to the list. Please let me know how you
> > > >>>> want the trace files shared.
> > > >>>
> > > >>> I didn't think the trace could be so big given the IOs should be just
> > > >>> 256 * 64 (1564).
> > > >>>
> > > >>> You may put the log somewhere on the Internet, cloud storage, web, or
> > > >>> whatever. Then just provide us the link.
> > > >>>
> > > >>> Or if you can't find a place to hold it, just send it to me, and I will put
> > > >>> it at my RH people web link.
> > > >>
> > > >> I have sent another e-mail only to you attaching the log files gzipped.
> > > >> Your confirmation will be appreciated.
> > > >
> > > > Yeah, I got the log, and it has been put at the following link:
> > > >
> > > > http://people.redhat.com/minlei/tests/logs/blktests_block_004_perf_degrade.tar.gz
> > > >
> > > > Also I have run some analysis, and block/004 basically calls pwrite() &
> > > > fsync() in each job.
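
(For reference, each block/004 job behaves roughly like the stand-alone sketch
below: a synchronous random write followed by an fsync() on every iteration,
which is what makes the workload flush intensive. The file path, write size and
loop count are made up for illustration; the actual test generates this pattern
with 64 concurrent jobs, not this program.)

/*
 * Minimal sketch of the block/004 I/O pattern: sync random writes,
 * each followed by an explicit cache flush via fsync(). Illustrative
 * only; this is not the blktests code.
 */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
        char buf[4096] = { 0 };
        int fd = open("/mnt/test/file", O_WRONLY | O_CREAT, 0644);

        if (fd < 0)
                return 1;

        for (int i = 0; i < 256; i++) {
                off_t off = (off_t)(rand() % 1024) * 4096;

                pwrite(fd, buf, sizeof(buf), off); /* one sync random write...          */
                fsync(fd);                         /* ...then a flush (rwbs=FWS request) */
        }
        close(fd);
        return 0;
}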
> > > >
> > > > 1) v5.6-rc kernel
> > > > - write rq average latency: 0.091s
> > > > - flush rq average latency: 0.018s
> > > > - average issue times of a write request: 1.978 // how many times trace_block_rq_issue() is called for one rq
> > > > - average issue times of a flush request: 1.052
> > > >
> > > > 2) v5.6-rc patched kernel
> > > > - write rq average latency: 0.031s
> > > > - flush rq average latency: 0.035s
> > > > - average issue times of a write request: 1.466
> > > > - average issue times of a flush request: 1.610
> > > >
> > > > block/004 starts 64 jobs and AHCI's queue can become saturated easily,
> > > > then BLK_MQ_S_SCHED_RESTART should be set, so a write request in the dispatch
> > > > queue can only move on after one request is completed.
> > > >
> > > > It looks like the flush request takes a shorter time than normal write IO.
> > > > If the flush request is added to the front of the dispatch queue, the next normal
> > > > write IO can be queued to the LLD quicker than when it is added to the tail of the
> > > > dispatch queue.
> > > >
> > > > trace_block_rq_issue() being called more than once is because of
> > > > AHCI or the drive's implementation. It usually means that
> > > > host->hostt->queuecommand() fails several times when queuing one
> > > > single request. For AHCI, I understand it shouldn't fail normally, given
> > > > we guarantee that the actual queue depth is <= either the LUN's or the host's
> > > > queue depth. Maybe it depends on your SMR drive's implementation of handling
> > > > flush/write IO. Could you check why .queuecommand() fails in block/004?
> >
> > I put some debug prints and confirmed that the .queuecommand function is
> > ata_scsi_queuecmd() and it returns SCSI_MLQUEUE_DEVICE_BUSY because
> > ata_std_qc_defer() returns ATA_DEFER_LINK. The comment of ata_std_qc_defer()
> > notes that "Non-NCQ commands cannot run with any other command, NCQ or not. As
> > upper layer only knows the queue depth, we are responsible for maintaining
> > exclusion. This function checks whether a new command @qc can be issued." So
> > I guess .queuecommand() fails because the non-NCQ flush command and the NCQ
> > write commands are waiting for each other's completion.
>
> OK, got it.
>
> >
> > >
> > > Indeed, it is weird that queuecommand fails. There is nothing SMR specific in
> > > the AHCI code beside disk probe checks, so write & flush handling does not
> > > differ between SMR and regular disks. The same applies to the drive side: write
> > > and flush commands are normal commands, no change at all. The only
> > > difference is the sequential write constraint, which the drive honors by not
> > > executing the queued write commands out of order. But there is no constraint for
> > > flush on SMR, so we get whatever the drive does; that is, it is totally drive dependent.
> > >
> > > I wonder if the issue may be with the particular AHCI chipset used for this test.
> > >
> > > >
> > > > Also, can you provide the queue flags via the following command?
> > > >
> > > > cat /sys/kernel/debug/block/sdN/state
> >
> > The state debugfs attribute was as follows:
> >
> > SAME_COMP|IO_STAT|ADD_RANDOM|INIT_DONE|WC|STATS|REGISTERED|SCSI_PASSTHROUGH|26
> >
> > It didn't change before and after the block/004 run.
> >
> >
> > > >
> > > > I understand a flush request should be slower than a normal write, however
> > > > that looks not true on this hardware.
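
(Returning to the ATA_DEFER_LINK behaviour described above: the mutual exclusion
between the non-NCQ flush and the NCQ writes can be modelled with the small
stand-alone program below. It is only a paraphrase of the documented libata rule;
the type and function names are made up and this is not the kernel code.)

/*
 * Toy model of the libata qc_defer rule quoted above: a non-NCQ command
 * (e.g. FLUSH CACHE) cannot be issued while NCQ commands are outstanding,
 * and NCQ commands cannot be issued while a non-NCQ command is active,
 * so queuecommand gets deferred (SCSI_MLQUEUE_DEVICE_BUSY).
 */
#include <stdbool.h>
#include <stdio.h>

struct link_state {
        int  ncq_outstanding;  /* NCQ writes in flight                     */
        bool non_ncq_active;   /* a non-NCQ command (e.g. flush) in flight */
};

/* true: issue now; false: defer (think "return ATA_DEFER_LINK") */
static bool may_issue(const struct link_state *l, bool new_cmd_is_ncq)
{
        if (l->non_ncq_active)
                return false;           /* nothing runs beside a non-NCQ cmd */
        if (new_cmd_is_ncq)
                return true;            /* NCQ can join other NCQ commands   */
        return l->ncq_outstanding == 0; /* non-NCQ needs an idle link        */
}

int main(void)
{
        struct link_state l = { .ncq_outstanding = 8, .non_ncq_active = false };

        /* A flush arriving while NCQ writes are queued is deferred... */
        printf("flush while NCQ writes queued: %s\n",
               may_issue(&l, false) ? "issue" : "defer");

        /* ...and while the flush runs, new NCQ writes are deferred. */
        l.ncq_outstanding = 0;
        l.non_ncq_active = true;
        printf("write while flush is active:   %s\n",
               may_issue(&l, true) ? "issue" : "defer");
        return 0;
}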
> > >
> > > Probably due to the fact that, since the writes are sequential, there is a lot of
> > > drive-internal optimization that can be done to minimize the cost of flush
> > > (internal data streaming from cache to media, media-cache use, etc.). That is true
> > > for regular disks too. And all of this is highly dependent on the hardware
> > > implementation.
> >
> > This discussion tempted me to take a closer look at the traces. And I noticed that
> > the number of flush commands issued is different with and without the patch.
> >
> >                          | without patch | with patch
> > -------------------------+---------------+------------
> > block_getrq: rwbs=FWS    |         32640 |      32640
> > block_rq_issue: rwbs=FF  |         32101 |       7593
>
> It looks like too many flush requests are issued, given that the flush machinery
> should avoid queueing duplicated flush requests.
>
> I will investigate the flush code a bit.
>
> >
> > Without the patch, a flush command is issued between every two write commands. With the
> > patch, some write commands are executed without a flush between them.
> >
> > I wonder how the requeue list position of the flush command (head vs tail) changes
> > the number of flush commands issued.
> >
> > Another weird thing is the number of block_getrq traces of flush (rwbs=FWS). It
> > is double the number of writes (256 * 64 = 16320). I will chase this further.
>
> Indeed, I do not see such an issue when I run the test on kvm ahci.

I found that the cause of the doubled flush commands is the dm-linear device I set
up. I mapped two areas of a single SMR drive to the dm-linear device, using a dm
table with two lines. My understanding is that when a flush is requested to the
dm-linear device, the flush request is duplicated and forwarded to the destination
devices listed in the dm table lines. In this case, a flush request is duplicated
for the two lines in the dm table, and both flush requests go to the single SMR
drive. This doubles the number of flush commands. I changed the dm table to a
single line mapping a single area, and observed that the number of flush commands
(block_getrq with rwbs=FWS) was reduced to 16320, the same number as the write
commands.

This means that the workload with which I observed the longer block/004 runtime on
kernel 5.6-rc4 is really flush intensive: it issues two flush commands per write
command. It might be too sensitive to flush command behavior. I'm not so sure this
workload is worth caring about for performance.

-- 
Best Regards,
Shin'ichiro Kawasaki