From: Ming Lei <ming.lei@redhat.com>
To: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: linux-scsi@vger.kernel.org, linux-block@vger.kernel.org
Subject: Re: Performance drop due to "blk-mq-sched: improve sequential I/O performance"
Date: Wed, 2 May 2018 18:34:05 +0800
Message-ID: <20180502103403.GB31961@ming.t460p>
In-Reply-To: <0efdcd2c4aa241f5d9d6acad915ee4fe@mail.gmail.com>

On Wed, May 02, 2018 at 03:32:53PM +0530, Kashyap Desai wrote:
> > -----Original Message-----
> > From: Ming Lei [mailto:ming.lei@redhat.com]
> > Sent: Wednesday, May 2, 2018 3:17 PM
> > To: Kashyap Desai
> > Cc: linux-scsi@vger.kernel.org; linux-block@vger.kernel.org
> > Subject: Re: Performance drop due to "blk-mq-sched: improve sequential
> > I/O performance"
> >
> > On Wed, May 02, 2018 at 01:13:34PM +0530, Kashyap Desai wrote:
> > > Hi Ming,
> > >
> > > I was running some performance tests on the latest 4.17-rc and found a
> > > performance drop (approximately 15%) caused by the patch set below:
> > > https://marc.info/?l=linux-block&m=150802309522847&w=2
> > >
> > > I observed the drop on the latest 4.16.6 stable and 4.17-rc kernels as
> > > well. Taking a bisect approach, I figured out that the issue is not
> > > observed with the last stable kernel, 4.14.38.
> > > I picked the 4.14.38 stable kernel as the baseline and applied the above
> > > patch set to confirm the behavior.
> > >
> > > lscpu output -
> > >
> > > Architecture:          x86_64
> > > CPU op-mode(s):        32-bit, 64-bit
> > > Byte Order:            Little Endian
> > > CPU(s):                72
> > > On-line CPU(s) list:   0-71
> > > Thread(s) per core:    2
> > > Core(s) per socket:    18
> > > Socket(s):             2
> > > NUMA node(s):          2
> > > Vendor ID:             GenuineIntel
> > > CPU family:            6
> > > Model:                 85
> > > Model name:            Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz
> > > Stepping:              4
> > > CPU MHz:               1457.182
> > > CPU max MHz:           2701.0000
> > > CPU min MHz:           1200.0000
> > > BogoMIPS:              5400.00
> > > Virtualization:        VT-x
> > > L1d cache:             32K
> > > L1i cache:             32K
> > > L2 cache:              1024K
> > > L3 cache:              25344K
> > > NUMA node0 CPU(s):     0-17,36-53
> > > NUMA node1 CPU(s):     18-35,54-71
> > >
> > > I have 16 SSDs ("SDLL1DLR400GCCA1") and created two R0 VDs (each VD
> > > consists of 8 SSDs) using a MegaRaid Ventura series adapter.
> > >
> > > fio script -
> > > numactl -N 1 fio 2vd.fio --bs=4k --iodepth=128 --rw=randread
> > > --group_reporting --ioscheduler=none --numjobs=4
> > >
> > >
> > >                         | v4.14.38-stable | patched v4.14.38-stable
> > >                         | mq-none         | mq-none
> > > ------------------------------------------------------------------
> > > randread "iops"         | 1597k           | 1377k
> > >
> > >
> > > Below is the perf report without the patch set. (Lock contention looks
> > > like the cause of this drop, so only the relevant snippet is provided.)
> > >
> > > -    3.19%     2.89%  fio    [kernel.vmlinux]    [k] _raw_spin_lock
> > >    - 2.43% io_submit
> > >       - 2.30% entry_SYSCALL_64
> > >          - do_syscall_64
> > >             - 2.18% do_io_submit
> > >                - 1.59% blk_finish_plug
> > >                   - 1.59% blk_flush_plug_list
> > >                      - 1.59% blk_mq_flush_plug_list
> > >                         - 1.00% __blk_mq_delay_run_hw_queue
> > >                            - 0.99% blk_mq_sched_dispatch_requests
> > >                               - 0.63% blk_mq_dispatch_rq_list
> > >                                    0.60% scsi_queue_rq
> > >                         - 0.57% blk_mq_sched_insert_requests
> > >                            - 0.56% blk_mq_insert_requests
> > >                                 0.51% _raw_spin_lock
> > >
> > > Below is the perf report after applying the patch set.
> > >
> > > -    4.10%     3.51%  fio    [kernel.vmlinux]    [k] _raw_spin_lock
> > >    - 3.09% io_submit
> > >       - 2.97% entry_SYSCALL_64
> > >          - do_syscall_64
> > >             - 2.85% do_io_submit
> > >                - 2.35% blk_finish_plug
> > >                   - 2.35% blk_flush_plug_list
> > >                      - 2.35% blk_mq_flush_plug_list
> > >                         - 1.83% __blk_mq_delay_run_hw_queue
> > >                            - 1.83% __blk_mq_run_hw_queue
> > >                               - 1.83% blk_mq_sched_dispatch_requests
> > >                                  - 1.82% blk_mq_do_dispatch_ctx
> > >                                     - 1.14% blk_mq_dequeue_from_ctx
> > >                                        - 1.11% dispatch_rq_from_ctx
> > >                                             1.03% _raw_spin_lock
> > >                           0.50% blk_mq_sched_insert_requests
> > >
> > > Let me know if you want more data, or is this a known implication of
> > > the patch set?
> >
> > The per-cpu 'ctx->lock' shouldn't take so much CPU in
> > dispatch_rq_from_ctx(), and the reason may be that the single sbitmap is
> > shared among all CPUs (nodes).
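
For reference, the dispatch path in question looks roughly like the sketch
below. It is a simplified version of dispatch_rq_from_ctx() from
block/blk-mq.c around these kernel versions (the exact code in the tree
under test may differ slightly): blk_mq_do_dispatch_ctx() walks the hctx's
shared sbitmap of software queues, and this callback takes the per-ctx lock
for every marked ctx, which is where the extra _raw_spin_lock time shows up.

struct dispatch_rq_data {
	struct blk_mq_hw_ctx *hctx;
	struct request *rq;
};

/* Callback run for each bit set in the hctx's ctx_map sbitmap; it pops
 * one request from that software queue.
 */
static bool dispatch_rq_from_ctx(struct sbitmap *sb, unsigned int bitnr,
				 void *data)
{
	struct dispatch_rq_data *dispatch_data = data;
	struct blk_mq_hw_ctx *hctx = dispatch_data->hctx;
	struct blk_mq_ctx *ctx = hctx->ctxs[bitnr];

	/* This per-ctx lock is the one showing up in the perf trace. */
	spin_lock(&ctx->lock);
	if (!list_empty(&ctx->rq_list)) {
		dispatch_data->rq = list_entry_rq(ctx->rq_list.next);
		list_del_init(&dispatch_data->rq->queuelist);
		if (list_empty(&ctx->rq_list))
			sbitmap_clear_bit(sb, bitnr);
	}
	spin_unlock(&ctx->lock);

	/* Returning false stops the sbitmap walk once a request is found. */
	return !dispatch_data->rq;
}
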
> >
> > So this issue may be same with your previous report, I will provide the
> per-
> > host tagset patches against v4.17-rc3 for you to test this week.
> >
> > Could you run your benchmark and test the patches against a v4.17-rc
> > kernel next time?
> 
> 4.17-rc shows the same drop. I just used the 4.14 kernel to narrow down
> the patch set. I can test your patches against 4.17-rc.
> 
> >
> > BTW, could you let us know whether the previous CPU lockup issue is
> > fixed after commit adbe552349f2 ("scsi: megaraid_sas: fix selection of
> > reply queue")?
> 
> This commit is good and fixes the issue around the CPU online/offline
> test case. I can still see CPU lockups even with the above commit (just
> running plain IO with more submitters than reply queues), but that will
> really only be fixed once we use irq-poll.
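
(For context, that commit makes the driver pick the reply queue from a
per-CPU map built from the MSI-x vectors' IRQ affinity. The sketch below
shows the general idea with placeholder names such as my_instance and
my_setup_reply_map(); it is not the exact megaraid_sas code.)

#include <linux/pci.h>
#include <linux/cpumask.h>

struct my_instance {			/* stand-in for the adapter instance */
	struct pci_dev *pdev;
	unsigned int msix_vectors;
	unsigned int reply_map[NR_CPUS];
};

static void my_setup_reply_map(struct my_instance *instance)
{
	const struct cpumask *mask;
	unsigned int queue, cpu;

	for (queue = 0; queue < instance->msix_vectors; queue++) {
		mask = pci_irq_get_affinity(instance->pdev, queue);
		if (!mask)
			goto fallback;

		/* Every CPU in this vector's affinity mask completes on it. */
		for_each_cpu(cpu, mask)
			instance->reply_map[cpu] = queue;
	}
	return;

fallback:
	/* No affinity information: fall back to a simple modulo mapping. */
	for_each_possible_cpu(cpu)
		instance->reply_map[cpu] = cpu % instance->msix_vectors;
}

/*
 * The submit path then derives the reply queue (MSI-x index) from the
 * submitting CPU, e.g.:
 *
 *	cmd->msix_index = instance->reply_map[raw_smp_processor_id()];
 */
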

OK, I suppose there isn't such an issue if the number of submitters is the
same as, or close to, the number of reply queues.

More submitters than reply queues is another story, since the completion
CPU can easily be used up, especially because the completion path of the
megaraid driver takes a lot of CPU, as can be observed in your previous
perf trace.

> 
> I have created internal code changes based on the RFC below, and with irq
> poll the CPU lockup issue is resolved.
> https://www.spinics.net/lists/linux-scsi/msg116668.html
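
(For anyone following the thread, the in-kernel irq_poll interface that
such a change builds on is used roughly as in the sketch below. It is only
a minimal illustration with placeholder names like my_reply_queue and
my_process_one_reply(), not the actual megaraid_sas changes from the RFC.)

#include <linux/interrupt.h>
#include <linux/irq_poll.h>

#define MY_IRQ_POLL_BUDGET	64	/* completions handled per poll run */

struct my_reply_queue {
	struct irq_poll iop;
	/* ... per-reply-queue ring state ... */
};

/* Placeholder: consume one completion from the reply ring, if any. */
static bool my_process_one_reply(struct my_reply_queue *rq);

static int my_irqpoll_handler(struct irq_poll *iop, int budget)
{
	struct my_reply_queue *rq =
		container_of(iop, struct my_reply_queue, iop);
	int done = 0;

	/* Process at most 'budget' completions, then give the CPU back. */
	while (done < budget && my_process_one_reply(rq))
		done++;

	/* Ring drained before the budget ran out: stop polling for now. */
	if (done < budget)
		irq_poll_complete(iop);

	return done;
}

static irqreturn_t my_hard_irq(int irq, void *dev_id)
{
	struct my_reply_queue *rq = dev_id;

	/* Defer completion processing so a completion flood cannot
	 * monopolize the interrupted CPU in hard-irq context. */
	irq_poll_sched(&rq->iop);
	return IRQ_HANDLED;
}

/* During queue setup:
 *	irq_poll_init(&rq->iop, MY_IRQ_POLL_BUDGET, my_irqpoll_handler);
 */
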

Could we use the 1:1 mapping and not apply the out-of-tree irq poll in the
following test, so that we can easily stay on the same page?

Thanks,
Ming
