From: Paolo Valente <paolo.valente@linaro.org>
To: Weiping Zhang <zwp10758@gmail.com>
Cc: Keith Busch <kbusch@kernel.org>, Jens Axboe <axboe@kernel.dk>,
	Tejun Heo <tj@kernel.org>, Christoph Hellwig <hch@lst.de>,
	Bart Van Assche <bvanassche@acm.org>, Minwoo Im <minwoo.im.dev@gmail.com>,
	Thomas Gleixner <tglx@linutronix.de>, Ming Lei <ming.lei@redhat.com>,
	"Nadolski, Edmund" <edmund.nadolski@intel.com>,
	linux-block@vger.kernel.org, cgroups@vger.kernel.org,
	linux-nvme@lists.infradead.org
Subject: Re: [PATCH v5 0/4] Add support Weighted Round Robin for blkcg and nvme
Date: Tue, 31 Mar 2020 12:29:05 +0200	[thread overview]
Message-ID: <7CD57B83-F067-4918-878C-BAC413C6A2B3@linaro.org> (raw)
In-Reply-To: <CAA70yB62_6JD_8dJTGPjnjJfyJSa1xqiCVwwNYtsTCUXQR5uCA@mail.gmail.com>

> On 31 Mar 2020, at 08:17, Weiping Zhang <zwp10758@gmail.com> wrote:
> 
>>> On the driver implementation, the number of module parameters being
>>> added here is problematic. We already have 2 special classes of queues,
>>> and defining this at the module level is considered too coarse when
>>> the system has different devices on opposite ends of the capability
>>> spectrum. For example, users want polled queues for the fast devices,
>>> and none for the slower tier. We just don't have a good mechanism to
>>> define per-controller resources, and more queue classes will make this
>>> problem worse.
>>> 
>> We can add a new "string" module parameter that contains a model number;
>> in most cases devices from the same product share a common model number
>> prefix, so nvme can distinguish devices of different performance classes
>> (high or low end). Before creating the I/O queues, the nvme driver can
>> read the device's model number (40 bytes) and compare it with the module
>> parameter to decide how many I/O queues each disk gets:
>> 
>> /* if model_number is MODEL_ANY, these parameters will be applied to
>>  * all nvme devices */
>> char dev_io_queues[1024] = "model_number=MODEL_ANY,"
>> 	"poll=0,read=0,wrr_low=0,wrr_medium=0,wrr_high=0,wrr_urgent=0";
>> /* these parameters only affect the nvme disk whose model number is "XXX" */
>> char dev_io_queues[1024] = "model_number=XXX,"
>> 	"poll=1,read=2,wrr_low=3,wrr_medium=4,wrr_high=5,wrr_urgent=0;";
>> 
>> struct dev_io_queues {
>> 	char model_number[40];
>> 	unsigned int poll;
>> 	unsigned int read;
>> 	unsigned int wrr_low;
>> 	unsigned int wrr_medium;
>> 	unsigned int wrr_high;
>> 	unsigned int wrr_urgent;
>> };
>> 
>> We can use these two variables to store the I/O queue configurations:
>> 
>> /* default values for all disks whose model number is not in io_queues_cfg */
>> struct dev_io_queues io_queues_def = {};
>> 
>> /* user-defined values for a specific model number */
>> struct dev_io_queues io_queues_cfg = {};
>> 
>> If we need multiple configurations (> 2), we can also extend
>> dev_io_queues to support that.
>> 
> 
> Hi Maintainers,
> 
> If we add a patch to support these queue counts at the controller level
> instead of the module level, shall we add WRR?
> 
> Recently I did some cgroup I/O weight testing:
> https://github.com/dublio/iotrack/wiki/cgroup-io-weight-test
> I think a proper I/O weight policy should consider the high-weight
> cgroup's IOPS and latency while also taking the whole disk's throughput
> into account; that is to say, the policy should trade off carefully
> between a cgroup's I/O performance and the whole disk's throughput.
> I know one policy cannot do all things perfectly, but the test results
> show that nvme-wrr can work well.
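As a side note on the proposal above, a minimal userspace sketch of how
such a dev_io_queues string could be parsed into struct dev_io_queues
might look as follows. The helper parse_dev_io_queues() is illustrative,
not existing nvme driver code; an in-kernel version would parse the
module parameter with strsep() and kstrtouint() rather than sscanf():

#define _DEFAULT_SOURCE /* for strsep() on glibc */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct dev_io_queues {
	char model_number[41];	/* 40 spec bytes + NUL for this sketch */
	unsigned int poll;
	unsigned int read;
	unsigned int wrr_low;
	unsigned int wrr_medium;
	unsigned int wrr_high;
	unsigned int wrr_urgent;
};

/* Split "key=value" pairs on ',' (and a trailing ';') and fill in *q. */
static int parse_dev_io_queues(const char *param, struct dev_io_queues *q)
{
	char *copy = strdup(param), *cur = copy, *tok;

	if (!copy)
		return -1;
	memset(q, 0, sizeof(*q));
	while ((tok = strsep(&cur, ",;")) != NULL) {
		unsigned int v;

		while (*tok == ' ')	/* tolerate spaces after commas */
			tok++;
		if (sscanf(tok, "model_number=%40[^,;]",
			   q->model_number) == 1)
			continue;
		if (sscanf(tok, "poll=%u", &v) == 1)
			q->poll = v;
		else if (sscanf(tok, "read=%u", &v) == 1)
			q->read = v;
		else if (sscanf(tok, "wrr_low=%u", &v) == 1)
			q->wrr_low = v;
		else if (sscanf(tok, "wrr_medium=%u", &v) == 1)
			q->wrr_medium = v;
		else if (sscanf(tok, "wrr_high=%u", &v) == 1)
			q->wrr_high = v;
		else if (sscanf(tok, "wrr_urgent=%u", &v) == 1)
			q->wrr_urgent = v;
	}
	free(copy);
	return 0;
}

Usage would be one call per model chunk, e.g.
parse_dev_io_queues("model_number=XXX,poll=1,read=2,wrr_low=3,"
"wrr_medium=4,wrr_high=5,wrr_urgent=0;", &io_queues_cfg);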
> 
> From the following test results, nvme-wrr works well for both the
> cgroups' latency and IOPS and the whole disk's throughput.
> 
> Notes:
> blk-iocost: only set the qos model, not the percentile latency targets.
> nvme-wrr: weights were set by:
> h=64;m=32;l=8;ab=0; nvme set-feature /dev/nvme1n1 -f 1 -v $(printf "0x%x\n" $(($ab<<0|$l<<8|$m<<16|$h<<24)))
> echo "$major:$minor high" > /sys/fs/cgroup/test1/io.wrr
> echo "$major:$minor low" > /sys/fs/cgroup/test2/io.wrr
> 
> 
> Randread vs Randread:
> cgroup.test1.weight : cgroup.test2.weight = 8 : 1
> high weight cgroup test1: randread, fio: numjobs=8, iodepth=32, bs=4K
> low weight cgroup test2: randread, fio: numjobs=8, iodepth=32, bs=4K
> 
> test case     bw(KB/s)    iops  rd_avg_lat(us)  wr_avg_lat(us)  rd_p99_lat(us)  wr_p99_lat(us)
> ===============================================================================================
> bfq_test1       767226  191806         1333.30            0.00          536.00            0.00
> bfq_test2        94607   23651        10816.06            0.00          610.00            0.00
> iocost_test1   1457718  364429          701.76            0.00         1630.00            0.00
> iocost_test2   1466337  366584          697.62            0.00         1613.00            0.00
> none_test1     1456585  364146          702.22            0.00         1646.00            0.00
> none_test2     1463090  365772          699.12            0.00         1613.00            0.00
> wrr_test1      2635391  658847          387.94            0.00         1236.00            0.00
> wrr_test2       365428   91357         2801.00            0.00         5537.00            0.00
> 
> https://github.com/dublio/iotrack/wiki/cgroup-io-weight-test#215-summary-fio-output
> 

Glad to see that BFQ meets the weights. Sad to see how much it suffers
in terms of IOPS on your system. Good job with your scheduler!

However, as for I/O control, the hard-to-control cases are not the ones
with constantly full, deep queues. BFQ's complexity stems from the need
to control the tough cases as well. An example is sync I/O at I/O depth
one competing against async I/O. On the other hand, those use cases may
not be of interest for your scheduler.

Thanks,
Paolo

> Randread vs Seq Write:
> cgroup.test1.weight : cgroup.test2.weight = 8 : 1
> high weight cgroup test1: randread, fio: numjobs=8, iodepth=32, bs=4K
> low weight cgroup test2: seq write, fio: numjobs=1, iodepth=32, bs=256K
> 
> test case     bw(KB/s)    iops  rd_avg_lat(us)  wr_avg_lat(us)  rd_p99_lat(us)  wr_p99_lat(us)
> ===============================================================================================
> bfq_test1       814327  203581         1256.19            0.00          593.00            0.00
> bfq_test2       104758     409            0.00        78196.32            0.00      1052770.00
> iocost_test1    270467   67616         3784.02            0.00         9371.00            0.00
> iocost_test2   1541575    6021            0.00         5313.02            0.00         6848.00
> none_test1      271708   67927         3767.01            0.00         9502.00            0.00
> none_test2     1541951    6023            0.00         5311.50            0.00         6848.00
> wrr_test1       775005  193751         1320.17            0.00         4112.00            0.00
> wrr_test2      1198319    4680            0.00         6835.30            0.00         8847.00
> 
> https://github.com/dublio/iotrack/wiki/cgroup-io-weight-test#225-summary-fio-output
> 
> Thanks
> Weiping
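A note on the set-feature invocation in the test setup above: feature 1
is the NVMe Arbitration feature, and the shell arithmetic packs the
arbitration burst and the three WRR weights into one 32-bit value. A
minimal C sketch of the same packing (the helper name nvme_arb_value()
is illustrative, not an nvme-cli API):

#include <stdint.h>
#include <stdio.h>

/* Dword 11 layout of the Arbitration feature (NVMe spec, Feature 01h):
 *   bits  2:0  Arbitration Burst (AB), bits 7:3 reserved
 *   bits 15:8  Low Priority Weight (LPW)
 *   bits 23:16 Medium Priority Weight (MPW)
 *   bits 31:24 High Priority Weight (HPW)
 */
static uint32_t nvme_arb_value(uint8_t ab, uint8_t lpw, uint8_t mpw,
			       uint8_t hpw)
{
	return (uint32_t)ab | ((uint32_t)lpw << 8) |
	       ((uint32_t)mpw << 16) | ((uint32_t)hpw << 24);
}

int main(void)
{
	/* h=64, m=32, l=8, ab=0 as in the test setup above */
	printf("0x%x\n", nvme_arb_value(0, 8, 32, 64)); /* prints 0x40200800 */
	return 0;
}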