From: Mike Snitzer <snitzer@redhat.com>
To: Ming Lin <mlin@kernel.org>
Cc: Ming Lei <ming.lei@canonical.com>,
	dm-devel@redhat.com, Christoph Hellwig <hch@lst.de>,
	Alasdair G Kergon <agk@redhat.com>,
	Lars Ellenberg <drbd-dev@lists.linbit.com>,
	Philip Kelleher <pjk1939@linux.vnet.ibm.com>,
	Joshua Morris <josh.h.morris@us.ibm.com>,
	Christoph Hellwig <hch@infradead.org>,
	Kent Overstreet <kent.overstreet@gmail.com>,
	Nitin Gupta <ngupta@vflare.org>,
	Oleg Drokin <oleg.drokin@intel.com>,
	Al Viro <viro@zeniv.linux.org.uk>, Jens Axboe <axboe@kernel.dk>,
	Andreas Dilger <andreas.dilger@intel.com>,
	Geoff Levand <geoff@infradead.org>, Jiri Kosina <jkosina@suse.cz>,
	lkml <linux-kernel@vger.kernel.org>, Jim Paris <jim@jtan.com>,
	Minchan Kim <minchan@kernel.org>, Dongsu Park <dpark@posteo.net>,
	drbd-user@lists.linbit.com
Subject: Re: [PATCH v4 01/11] block: make generic_make_request handle arbitrarily sized bios
Date: Thu, 4 Jun 2015 17:06:17 -0400	[thread overview]
Message-ID: <20150604210617.GA23710@redhat.com> (raw)
In-Reply-To: <CAF1ivSY_7M1OhrMuXg0OKu7BPy=SbTHSHCk+2q6vfEVgvJL8YA@mail.gmail.com>

On Tue, Jun 02 2015 at  4:59pm -0400,
Ming Lin <mlin@kernel.org> wrote:

> On Sun, May 31, 2015 at 11:02 PM, Ming Lin <mlin@kernel.org> wrote:
> > On Thu, 2015-05-28 at 01:36 +0100, Alasdair G Kergon wrote:
> >> On Wed, May 27, 2015 at 04:42:44PM -0700, Ming Lin wrote:
> >> > Here are fio results of XFS on a DM striped target with 2 SSDs + 1 HDD.
> >> > Does it make sense?
> >>
> >> To stripe across devices with different characteristics?
> >>
> >> Some suggestions.
> >>
> >> Prepare 3 kernels.
> >>   O - Old kernel.
> >>   M - Old kernel with merge_bvec_fn disabled.
> >>   N - New kernel.
> >>
> >> You're trying to search for counter-examples to the hypothesis that
> >> "Kernel N always outperforms Kernel O".  Then if you find any, trying
> >> to show either that the performance impediment is small enough that
> >> it doesn't matter or that the cases are sufficiently rare or obscure
> >> that they may be ignored because of the greater benefits of N in much more
> >> common cases.
> >>
> >> (1) You're looking to set up configurations where kernel O performs noticeably
> >> better than M.  Then you're comparing the performance of O and N in those
> >> situations.
> >>
> >> (2) You're looking at other sensible configurations where O and M have
> >> similar performance, and comparing that with the performance of N.
> >
> > I didn't find case (1).
> >
> > But the important thing for this series is to simplify the block layer
> > based on immutable biovecs. I don't expect a performance improvement.

No, simplifying isn't the important thing.  Any change that removes the
merge_bvec callbacks must not introduce performance regressions on
enterprise systems with large RAID arrays, etc.

It is fine if there isn't a performance improvement, but I really don't
think the limited testing you've done on a relatively small storage
configuration has come even close to showing that these changes don't
introduce performance regressions.

> > Here are the change statistics.
> >
> > "68 files changed, 336 insertions(+), 1331 deletions(-)"
> >
> > I ran the 3 test cases below to make sure the series doesn't bring any regressions.
> > Test environment: 2 NVMe drives on a 2-socket server.
> > Each case ran for 30 minutes.
> >
> > 1) btrfs raid0
> >
> > mkfs.btrfs -f -d raid0 /dev/nvme0n1 /dev/nvme1n1
> > mount /dev/nvme0n1 /mnt
> >
> > Then run 8K read.
> >
> > [global]
> > ioengine=libaio
> > iodepth=64
> > direct=1
> > runtime=1800
> > time_based
> > group_reporting
> > numjobs=4
> > rw=read
> >
> > [job1]
> > bs=8K
> > directory=/mnt
> > size=1G
> >
> > 2) ext4 on MD raid5
> >
> > mdadm --create /dev/md0 --level=5 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
> > mkfs.ext4 /dev/md0
> > mount /dev/md0 /mnt
> >
> > fio script same as btrfs test
> >
> > 3) xfs on DM striped target
> >
> > pvcreate /dev/nvme0n1 /dev/nvme1n1
> > vgcreate striped_vol_group /dev/nvme0n1 /dev/nvme1n1
> > lvcreate -i2 -I4 -L250G -nstriped_logical_volume striped_vol_group
> > mkfs.xfs -f /dev/striped_vol_group/striped_logical_volume
> > mount /dev/striped_vol_group/striped_logical_volume /mnt
> >
> > fio script same as btrfs test
> >
> > ------
> >
> > Results:
> >
> >         4.1-rc4         4.1-rc4-patched
> > btrfs   1818.6MB/s      1874.1MB/s
> > ext4    717307KB/s      714030KB/s
> > xfs     1396.6MB/s      1398.6MB/s
> 
> Hi Alasdair & Mike,
> 
> Would you like these numbers?
> I'd like to address your concerns to move forward.

I really don't see that these NVMe results prove much.

We need to test on large HW RAID setups like a NetApp filer (or even
local SAS drives connected via some SAS controller), e.g. an 8+2 drive
RAID6 or 8+1 RAID5 setup.  Testing with MD raid on JBOD setups with 8
devices is also useful.  It is the larger RAID setups that will be more
sensitive to IO sizes being properly aligned on RAID stripe and/or chunk
size boundaries.
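
To make that concrete, here is a sketch of the sort of MD RAID6 setup I
mean; the device names, chunk size and filesystem choice are purely
illustrative, not something that has been tested:

  # 8 data + 2 parity disks out of a 10-disk JBOD (device names are examples)
  mdadm --create /dev/md0 --level=6 --raid-devices=10 --chunk=256 /dev/sd[b-k]
  mkfs.xfs -f /dev/md0
  mount /dev/md0 /mnt
  # then re-run the same 8K read fio job from the btrfs test against /mnt

Comparing the O/M/N kernels on a layout like that is where misaligned or
overly split IO would show up as a throughput drop.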

There are tradeoffs between creating a really large bio and creating a
properly sized bio from the start.  And yes, to one of neilb's original
points, limits do change and we suck at restacking limits... so what was
once properly sized may no longer be, but that is a relatively rare
occurrence.  Late splitting does do away with the limits stacking
disconnect.  And in general I like the idea of removing all the
merge_bvec code.  I just don't think I can confidently Ack such a
wholesale switch at this point with such limited performance analysis.
If we (the DM/lvm team at Red Hat) are being painted into a corner of
having to provide our own testing that meets our definition of
"thorough", then we'll need time to carry out those tests.  But I'd hate
to hold up everyone because DM is not in agreement on this change...

So taking a step back, why can't we introduce late bio splitting in a
phased approach?

1: introduce late bio splitting to block core BUT still keep established
   merge_bvec infrastructure
2: establish a way for upper layers to skip merge_bvec if they'd like to
   do so (e.g. block-core exposes a 'use_late_bio_splitting' knob or
   something for userspace or upper layers to set, and can also have a
   Kconfig option that enables this feature by default; a rough sketch
   follows below)
3: we gain confidence in late bio-splitting and then carry on with the
   removal of merge_bvec et al (could be incrementally done on a
   per-driver basis, e.g. DM, MD, btrfs, etc, etc).
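
To make step 2 a bit more concrete, here is a rough sketch of how such a
knob could be exposed (both the sysfs attribute and the Kconfig symbol
below are placeholders, nothing with these names exists today):

  # opt a device's queue into late bio splitting at runtime
  echo 1 > /sys/block/dm-0/queue/use_late_bio_splitting

  # or enable the behaviour by default at build time
  CONFIG_BLK_USE_LATE_BIO_SPLITTING=y

Either mechanism would let us flip between the merge_bvec path and late
splitting on the same kernel while confidence is being built up.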

Mike
