From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Subject: Re: [PATCH 0/4] blk-mq: support to use hw tag for scheduling
To: Ming Lei
References: <20170428151539.25514-1-ming.lei@redhat.com>
 <839682da-f375-8eab-d6f5-fcf1457150f1@fb.com>
 <20170503040303.GA20187@ming.t460p>
 <370fbeb6-d832-968a-2759-47f16b866551@kernel.dk>
 <20170503150351.GA7927@ming.t460p>
Cc: linux-block@vger.kernel.org, Christoph Hellwig, Omar Sandoval
From: Jens Axboe
Message-ID: <31bb973e-d9cf-9454-58fd-4893701088c5@kernel.dk>
Date: Wed, 3 May 2017 09:08:34 -0600
MIME-Version: 1.0
In-Reply-To: <20170503150351.GA7927@ming.t460p>
Content-Type: text/plain; charset=windows-1252
List-ID:

On 05/03/2017 09:03 AM, Ming Lei wrote:
> On Wed, May 03, 2017 at 08:10:58AM -0600, Jens Axboe wrote:
>> On 05/03/2017 08:08 AM, Jens Axboe wrote:
>>> On 05/02/2017 10:03 PM, Ming Lei wrote:
>>>> On Fri, Apr 28, 2017 at 02:29:18PM -0600, Jens Axboe wrote:
>>>>> On 04/28/2017 09:15 AM, Ming Lei wrote:
>>>>>> Hi,
>>>>>>
>>>>>> This patchset introduces the flag BLK_MQ_F_SCHED_USE_HW_TAG and
>>>>>> allows the hardware tag to be used directly for I/O scheduling when
>>>>>> the queue depth is big enough. This way we avoid allocating extra
>>>>>> tags and a request pool for I/O scheduling, and the scheduler tag
>>>>>> allocation/release is saved in the I/O submit path.
>>>>>
>>>>> Ming, I like this approach, it's pretty clean. It'd be nice to have a
>>>>> bit of performance data to back up that it's useful to add this code,
>>>>> though. Have you run anything on eg kyber on nvme that shows a
>>>>> reduction in overhead when getting rid of separate scheduler tags?
>>>>
>>>> I can observe a small improvement in the following tests:
>>>>
>>>> 1) fio script
>>>>
>>>> # io scheduler: kyber
>>>> RWS="randread read randwrite write"
>>>> for RW in $RWS; do
>>>> 	echo "Running test $RW"
>>>> 	echo 3 | sudo tee /proc/sys/vm/drop_caches
>>>> 	sudo fio --direct=1 --size=128G --bsrange=4k-4k --runtime=20 \
>>>> 		--numjobs=1 --ioengine=libaio --iodepth=10240 \
>>>> 		--group_reporting=1 --filename=$DISK \
>>>> 		--name=$DISK-test-$RW --rw=$RW --output-format=json
>>>> done
>>>>
>>>> 2) results
>>>>
>>>> ------------------------------------------------------------------
>>>>           | sched tag (iops/lat) | use hw tag to sched (iops/lat)
>>>> ------------------------------------------------------------------
>>>> randread  | 188940/54107         | 193865/52734
>>>> ------------------------------------------------------------------
>>>> read      | 192646/53069         | 199738/51188
>>>> ------------------------------------------------------------------
>>>> randwrite | 171048/59777         | 179038/57112
>>>> ------------------------------------------------------------------
>>>> write     | 171886/59492         | 181029/56491
>>>> ------------------------------------------------------------------
>>>>
>>>> I guess it may be a bit more obvious when running the test on a slow
>>>> NVMe device, and I will try to find one and run the test again.
>>>
>>> Thanks for running that. As I said in my original reply, I think this
>>> is a good optimization, and the implementation is clean. I'm fine with
>>> the current limitations of when to enable it, and it's not like we
>>> can't extend this later, if we want.
>>>
>>> I do agree with Bart that patch 1+4 should be combined. I'll do that.
>>
>> Actually, can you do that when reposting? Looks like you needed to
>> do that anyway.
>
> Yeah, I will do that in V1.

V2? :-)

Sounds good. I just wanted to check the numbers here, but the series
applied on top of for-linus crashes when switching to kyber. A few hunks
threw fuzz, but it looked fine to me.
But I bet I fat-fingered something. So it'd be great if you could respin
against my for-linus branch.

-- 
Jens Axboe
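
For context when reading this thread later: the core idea of the series is
that when the hardware queue exposes enough tags for the scheduler's needs,
requests can carry the hardware tag directly, so no separate scheduler tag
space or request pool has to be allocated. The standalone C sketch below only
illustrates that decision; it is not the kernel patch itself, and every name
in it (hw_queue, sched_cfg, can_reuse_hw_tags) is hypothetical.

/*
 * Illustrative userspace sketch, not kernel code: models the choice the
 * cover letter describes -- reuse hardware tags for scheduling when the
 * hardware queue depth is large enough, otherwise fall back to a
 * separate scheduler tag space.
 */
#include <stdbool.h>
#include <stdio.h>

struct hw_queue {
	unsigned int hw_tag_depth;	/* tags exposed by the device/driver */
	bool use_hw_tag_for_sched;	/* mirrors a BLK_MQ_F_SCHED_USE_HW_TAG-style flag */
};

struct sched_cfg {
	unsigned int min_depth;		/* depth the I/O scheduler wants to work with */
};

/* Scheduler tags can alias the hardware tags only if the hw depth suffices. */
static bool can_reuse_hw_tags(const struct hw_queue *hq,
			      const struct sched_cfg *cfg)
{
	return hq->hw_tag_depth >= cfg->min_depth;
}

int main(void)
{
	struct hw_queue nvme_like = { .hw_tag_depth = 1024 };
	struct sched_cfg kyber_like = { .min_depth = 256 };

	nvme_like.use_hw_tag_for_sched = can_reuse_hw_tags(&nvme_like, &kyber_like);

	/* With a deep hardware queue, no extra sched tag/request pool is needed. */
	printf("use hw tag for sched: %s\n",
	       nvme_like.use_hw_tag_for_sched ? "yes" : "no");
	return 0;
}

When the check fails (a shallow hardware queue), the existing scheduler tag
path would still be used, which matches the cover letter's condition that the
flag only applies when the queue's depth is big enough.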