From: Ming Lei <ming.lei@redhat.com>
To: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: linux-scsi@vger.kernel.org, linux-block@vger.kernel.org
Subject: Re: Performance drop due to "blk-mq-sched: improve sequential I/O performance"
Date: Wed, 2 May 2018 18:34:05 +0800
Message-ID: <20180502103403.GB31961@ming.t460p>
In-Reply-To: <0efdcd2c4aa241f5d9d6acad915ee4fe@mail.gmail.com>

On Wed, May 02, 2018 at 03:32:53PM +0530, Kashyap Desai wrote:
> > -----Original Message-----
> > From: Ming Lei [mailto:ming.lei@redhat.com]
> > Sent: Wednesday, May 2, 2018 3:17 PM
> > To: Kashyap Desai
> > Cc: linux-scsi@vger.kernel.org; linux-block@vger.kernel.org
> > Subject: Re: Performance drop due to "blk-mq-sched: improve sequential
> > I/O performance"
> >
> > On Wed, May 02, 2018 at 01:13:34PM +0530, Kashyap Desai wrote:
> > > Hi Ming,
> > >
> > > I was running some performance tests on the latest 4.17-rc and found a
> > > performance drop (approximately 15%) caused by the patch set below:
> > > https://marc.info/?l=linux-block&m=150802309522847&w=2
> > >
> > > I observed the drop on the latest 4.16.6 stable and 4.17-rc kernels as
> > > well. Taking a bisect approach, I figured out that the issue is not
> > > observed with the last stable kernel, 4.14.38.
> > > I picked the 4.14.38 stable kernel as the baseline and applied the above
> > > patch set to confirm the behavior.
> > >
> > > lscpu output -
> > >
> > > Architecture:          x86_64
> > > CPU op-mode(s):        32-bit, 64-bit
> > > Byte Order:            Little Endian
> > > CPU(s):                72
> > > On-line CPU(s) list:   0-71
> > > Thread(s) per core:    2
> > > Core(s) per socket:    18
> > > Socket(s):             2
> > > NUMA node(s):          2
> > > Vendor ID:             GenuineIntel
> > > CPU family:            6
> > > Model:                 85
> > > Model name:            Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz
> > > Stepping:              4
> > > CPU MHz:               1457.182
> > > CPU max MHz:           2701.0000
> > > CPU min MHz:           1200.0000
> > > BogoMIPS:              5400.00
> > > Virtualization:        VT-x
> > > L1d cache:             32K
> > > L1i cache:             32K
> > > L2 cache:              1024K
> > > L3 cache:              25344K
> > > NUMA node0 CPU(s):     0-17,36-53
> > > NUMA node1 CPU(s):     18-35,54-71
> > >
> > > I have 16 SSDs ("SDLL1DLR400GCCA1") and created two R0 VDs (each VD
> > > consists of 8 SSDs) using a MegaRaid Ventura series adapter.
> > >
> > > fio script -
> > > numactl -N 1 fio 2vd.fio --bs=4k --iodepth=128 --rw=randread
> > > --group_reporting --ioscheduler=none --numjobs=4
> > >
> > >
> > >                         | v4.14.38-stable | patched v4.14.38-stable
> > >                         | mq-none         | mq-none
> > > ------------------------------------------------------------------
> > > randread "iops"         | 1597k           | 1377k
> > >
> > >
> > > Below is the perf report without the patch set. (Lock contention looks
> > > like the cause of this drop, so only the relevant snippet is provided.)
> > >
> > > -    3.19%     2.89%  fio    [kernel.vmlinux]    [k] _raw_spin_lock
> > >    - 2.43% io_submit
> > >       - 2.30% entry_SYSCALL_64
> > >          - do_syscall_64
> > >             - 2.18% do_io_submit
> > >                - 1.59% blk_finish_plug
> > >                   - 1.59% blk_flush_plug_list
> > >                      - 1.59% blk_mq_flush_plug_list
> > >                         - 1.00% __blk_mq_delay_run_hw_queue
> > >                            - 0.99% blk_mq_sched_dispatch_requests
> > >                               - 0.63% blk_mq_dispatch_rq_list
> > >                                    0.60% scsi_queue_rq
> > >                         - 0.57% blk_mq_sched_insert_requests
> > >                            - 0.56% blk_mq_insert_requests
> > >                                 0.51% _raw_spin_lock
> > >
> > > Below is the perf report after applying the patch set.
> > >
> > > -    4.10%     3.51%  fio    [kernel.vmlinux]    [k] _raw_spin_lock
> > >    - 3.09% io_submit
> > >       - 2.97% entry_SYSCALL_64
> > >          - do_syscall_64
> > >             - 2.85% do_io_submit
> > >                - 2.35% blk_finish_plug
> > >                   - 2.35% blk_flush_plug_list
> > >                      - 2.35% blk_mq_flush_plug_list
> > >                         - 1.83% __blk_mq_delay_run_hw_queue
> > >                            - 1.83% __blk_mq_run_hw_queue
> > >                               - 1.83% blk_mq_sched_dispatch_requests
> > >                                  - 1.82% blk_mq_do_dispatch_ctx
> > >                                     - 1.14% blk_mq_dequeue_from_ctx
> > >                                        - 1.11% dispatch_rq_from_ctx
> > >                                             1.03% _raw_spin_lock
> > >                           0.50% blk_mq_sched_insert_requests
> > >
> > > Let me know if you want more data, or is this a known implication of
> > > the patch set?
> >
> > The per-cpu 'ctx->lock' shouldn't take so much CPU in
> > dispatch_rq_from_ctx(), and the reason may be that the single sbitmap is
> > shared among all CPUs (nodes).
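
For reference, the dispatch path in question looks roughly like the sketch
below. It is a simplified version of dispatch_rq_from_ctx() from
block/blk-mq.c around these kernel versions (the exact code in the tree
under test may differ slightly): blk_mq_do_dispatch_ctx() walks the hctx's
shared sbitmap of software queues, and this callback takes the per-ctx lock
for every marked ctx, which is where the extra _raw_spin_lock time shows up.

struct dispatch_rq_data {
	struct blk_mq_hw_ctx *hctx;
	struct request *rq;
};

/* Callback run for each bit set in the hctx's ctx_map sbitmap; it pops
 * one request from that software queue.
 */
static bool dispatch_rq_from_ctx(struct sbitmap *sb, unsigned int bitnr,
				 void *data)
{
	struct dispatch_rq_data *dispatch_data = data;
	struct blk_mq_hw_ctx *hctx = dispatch_data->hctx;
	struct blk_mq_ctx *ctx = hctx->ctxs[bitnr];

	/* This per-ctx lock is the one showing up in the perf trace. */
	spin_lock(&ctx->lock);
	if (!list_empty(&ctx->rq_list)) {
		dispatch_data->rq = list_entry_rq(ctx->rq_list.next);
		list_del_init(&dispatch_data->rq->queuelist);
		if (list_empty(&ctx->rq_list))
			sbitmap_clear_bit(sb, bitnr);
	}
	spin_unlock(&ctx->lock);

	/* Returning false stops the sbitmap walk once a request is found. */
	return !dispatch_data->rq;
}
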
> >
> > So this issue may be same with your previous report, I will provide the
> per-
> > host tagset patches against v4.17-rc3 for you to test this week.
> >
> > Could you run your benchmark and test the patches against a v4.17-rc
> > kernel next time?
> 
> 4.17-rc shows the same drop. I just used the 4.14 kernel to narrow down
> the patch set. I can test your patches against 4.17-rc.
> 
> >
> > BTW, could you let us know whether the previous CPU lockup issue is
> > fixed after commit adbe552349f2 ("scsi: megaraid_sas: fix selection of
> > reply queue")?
> 
> This commit is good and fixes the issue around the CPU online/offline
> test case. I can still see CPU lockups even with the above commit (just
> running plain IO with more submitters than reply queues), but that will
> really only be fixed once we use irq-poll.
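
(For context, that commit makes the driver pick the reply queue from a
per-CPU map built from the MSI-x vectors' IRQ affinity. The sketch below
shows the general idea with placeholder names such as my_instance and
my_setup_reply_map(); it is not the exact megaraid_sas code.)

#include <linux/pci.h>
#include <linux/cpumask.h>

struct my_instance {			/* stand-in for the adapter instance */
	struct pci_dev *pdev;
	unsigned int msix_vectors;
	unsigned int reply_map[NR_CPUS];
};

static void my_setup_reply_map(struct my_instance *instance)
{
	const struct cpumask *mask;
	unsigned int queue, cpu;

	for (queue = 0; queue < instance->msix_vectors; queue++) {
		mask = pci_irq_get_affinity(instance->pdev, queue);
		if (!mask)
			goto fallback;

		/* Every CPU in this vector's affinity mask completes on it. */
		for_each_cpu(cpu, mask)
			instance->reply_map[cpu] = queue;
	}
	return;

fallback:
	/* No affinity information: fall back to a simple modulo mapping. */
	for_each_possible_cpu(cpu)
		instance->reply_map[cpu] = cpu % instance->msix_vectors;
}

/*
 * The submit path then derives the reply queue (MSI-x index) from the
 * submitting CPU, e.g.:
 *
 *	cmd->msix_index = instance->reply_map[raw_smp_processor_id()];
 */
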

OK, I suppose there isn't such an issue if the number of submitters is the
same as, or close to, the number of reply queues.

More submitters than reply queues is another story, since the completion
CPU can easily be used up, especially because the completion path of the
megaraid driver takes a lot of CPU, as can be observed in your previous
perf trace.

> 
> I have created internal code changes based on the RFC below, and with irq
> poll the CPU lockup issue is resolved.
> https://www.spinics.net/lists/linux-scsi/msg116668.html
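
(For anyone following the thread, the in-kernel irq_poll interface that
such a change builds on is used roughly as in the sketch below. It is only
a minimal illustration with placeholder names like my_reply_queue and
my_process_one_reply(), not the actual megaraid_sas changes from the RFC.)

#include <linux/interrupt.h>
#include <linux/irq_poll.h>

#define MY_IRQ_POLL_BUDGET	64	/* completions handled per poll run */

struct my_reply_queue {
	struct irq_poll iop;
	/* ... per-reply-queue ring state ... */
};

/* Placeholder: consume one completion from the reply ring, if any. */
static bool my_process_one_reply(struct my_reply_queue *rq);

static int my_irqpoll_handler(struct irq_poll *iop, int budget)
{
	struct my_reply_queue *rq =
		container_of(iop, struct my_reply_queue, iop);
	int done = 0;

	/* Process at most 'budget' completions, then give the CPU back. */
	while (done < budget && my_process_one_reply(rq))
		done++;

	/* Ring drained before the budget ran out: stop polling for now. */
	if (done < budget)
		irq_poll_complete(iop);

	return done;
}

static irqreturn_t my_hard_irq(int irq, void *dev_id)
{
	struct my_reply_queue *rq = dev_id;

	/* Defer completion processing so a completion flood cannot
	 * monopolize the interrupted CPU in hard-irq context. */
	irq_poll_sched(&rq->iop);
	return IRQ_HANDLED;
}

/* During queue setup:
 *	irq_poll_init(&rq->iop, MY_IRQ_POLL_BUDGET, my_irqpoll_handler);
 */
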

Could we use the 1:1 mapping and not apply the out-of-tree irq poll in the
following test, so that we can easily stay on the same page?

Thanks,
Ming
