* poor read performance on rbd+LVM, LVM overload
@ 2013-10-16 14:46 ` Ugis
  0 siblings, 0 replies; 36+ messages in thread
From: Ugis @ 2013-10-16 14:46 UTC (permalink / raw)
  To: linux-lvm, ceph-devel, ceph-users

Hello ceph&LVM communities!

I noticed very slow reads from an xfs mount that is on a ceph
client (rbd + gpt partition + LVM PV + xfs on LE).
To find the cause I created another rbd in the same pool, formatted it
straight away with xfs, and mounted it.

Write performance for both xfs mounts is similar, ~12MB/s.

Reads with "dd if=/mnt/somefile bs=1M | pv | dd of=/dev/null" are as follows:
with LVM: ~4MB/s
pure xfs: ~30MB/s

I watched performance with atop while doing the reads. In the LVM case atop
shows the LV overloaded:
LVM | s-LV_backups | busy 95% | read 21515 | write 0 | KiB/r 4 | KiB/w 0 | MBr/s 4.20 | MBw/s 0.00 | avq 1.00 | avio 0.85 ms |

client kernel 3.10.10
ceph version 0.67.4

My considerations:
I have expanded the rbd under LVM a couple of times (accordingly expanding the
gpt partition, PV, VG, LV, and xfs afterwards), but that should have no
impact on performance (I tested a clean rbd+LVM and got the same read
performance as for the expanded one).

As with device-mapper, after LVM is initialized it is just a small
table with the LE->PE mapping that should reside in a close CPU cache.
I am guessing this could be related to the old CPU used; probably caching
near the CPU does not work well (I also tested local HDDs with/without LVM
and got read speeds of ~13MB/s vs 46MB/s, with atop showing the same overload
in the LVM case).

What could make such a great difference when LVM is used, and what/how should I
tune? As write performance does not differ, DM extent lookup should
not be lagging; so where is the trick?

CPU used:
# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      : Intel(R) Xeon(TM) CPU 3.20GHz
stepping        : 10
microcode       : 0x2
cpu MHz         : 3200.077
cache size      : 2048 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
syscall nx lm constant_tsc pebs bts nopl pni dtes64 monitor ds_cpl
cid cx16 xtpr lahf_lm
bogomips        : 6400.15
clflush size    : 64
cache_alignment : 128
address sizes   : 36 bits physical, 48 bits virtual
power management:

Br,
Ugis

* Re: poor read performance on rbd+LVM, LVM overload
  2013-10-16 14:46 ` [linux-lvm] " Ugis
@ 2013-10-16 16:16   ` Sage Weil
  -1 siblings, 0 replies; 36+ messages in thread
From: Sage Weil @ 2013-10-16 16:16 UTC (permalink / raw)
  To: Ugis; +Cc: linux-lvm, ceph-devel, ceph-users

Hi,

On Wed, 16 Oct 2013, Ugis wrote:
> Hello ceph&LVM communities!
> 
> I noticed very slow reads from xfs mount that is on ceph
> client(rbd+gpt partition+LVM PV + xfs on LE)
> To find a cause I created another rbd in the same pool, formatted it
> straight away with xfs, mounted.
> 
> Write performance for both xfs mounts is similar ~12MB/s
> 
> reads with "dd if=/mnt/somefile bs=1M | pv | dd of=/dev/null" as follows:
> with LVM ~4MB/s
> pure xfs ~30MB/s
> 
> Watched performance while doing reads with atop. In LVM case atop
> shows LVM overloaded:
> LVM | s-LV_backups  | busy     95% |  read   21515 | write      0  |
> KiB/r      4 |               | KiB/w      0 |  MBr/s   4.20 | MBw/s
> 0.00  | avq     1.00 |  avio 0.85 ms |
> 
> client kernel 3.10.10
> ceph version 0.67.4
> 
> My considerations:
> I have expanded rbd under LVM couple of times(accordingly expanding
> gpt partition, PV,VG,LV, xfs afterwards), but that should have no
> impact on performance(tested clean rbd+LVM, same read performance as
> for expanded one).
> 
> As with device-mapper, after LVM is initialized it is just a small
> table with LE->PE  mapping that should reside in close CPU cache.
> I am guessing this could be related to old CPU used, probably caching
> near CPU does not work well(I tested also local HDDs with/without LVM
> and got read speed ~13MB/s vs 46MB/s with atop showing same overload
> in  LVM case).
> 
> What could make so great difference when LVM is used and what/how to
> tune? As write performance does not differ, DM extent lookup should
> not be lagging, where is the trick?

My first guess is that LVM is shifting the content of the device such that
it no longer aligns well with the RBD striping (by default, 4MB).  The 
non-aligned reads/writes would need to touch two objects instead of 
one, and dd is generally doing these synchronously (i.e., lots of 
waiting).

I'm not sure what options LVM provides for aligning things to the 
underlying storage...

sage


> 
> CPU used:
> # cat /proc/cpuinfo
> processor       : 0
> vendor_id       : GenuineIntel
> cpu family      : 15
> model           : 4
> model name      : Intel(R) Xeon(TM) CPU 3.20GHz
> stepping        : 10
> microcode       : 0x2
> cpu MHz         : 3200.077
> cache size      : 2048 KB
> physical id     : 0
> siblings        : 2
> core id         : 0
> cpu cores       : 1
> apicid          : 0
> initial apicid  : 0
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 5
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
> mca cmov pat pse36 clflush dts acpi mmx fxsr ss
>                                                        e sse2 ss ht tm
> pbe syscall nx lm constant_tsc pebs bts nopl pni dtes64 monitor ds_cpl
> cid cx16 xtpr lahf_lm
> bogomips        : 6400.15
> clflush size    : 64
> cache_alignment : 128
> address sizes   : 36 bits physical, 48 bits virtual
> power management:
> 
> Br,
> Ugis
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

* Re: poor read performance on rbd+LVM, LVM overload
  2013-10-16 16:16   ` [linux-lvm] " Sage Weil
@ 2013-10-17  9:06       ` David McBride
  -1 siblings, 0 replies; 36+ messages in thread
From: David McBride @ 2013-10-17  9:06 UTC (permalink / raw)
  To: Sage Weil
  Cc: ceph-devel, ceph-users, linux-lvm

On 16/10/2013 17:16, Sage Weil wrote:

> I'm not sure what options LVM provides for aligning things to the
> underlying storage...

There is a generic kernel ABI for exposing performance properties of 
block devices to higher layers, so that they can automatically tune 
themselves according to those performance properties, and report their 
performance properties to users higher up the stack.

LVM supports reading this data from underlying physical devices and
configuring itself as appropriate, as well as reporting this data to
users of LVs so that they can do the same.

(For example, mkfs.xfs uses libblkid to automatically select the optimal 
stripe-size, stride width, etc. of an LVM volume sitting on top of an MD 
disk array.)

A good starting point appears to be:

 
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=c72758f33784e5e2a1a4bb9421ef3e6de8f9fcf3

If Ceph RBD block devices don't currently expose this information, that 
should be a relatively simple addition that will result in all higher 
layers, whether LVM or a native filesystem, automatically tuning 
themselves at creation-time for the RBD's performance characteristics.
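
(For example, assuming /dev/rbd1 as an illustrative device name, a quick
way to check what topology a device is currently advertising to the
layers above it is:

  blockdev --getiomin --getioopt /dev/rbd1   # minimum_io_size and optimal_io_size, in bytes
  lsblk -t /dev/rbd1                         # MIN-IO / OPT-IO columns for the whole device stack

If these report only the sector size and zero, nothing useful is being
exposed and the higher layers will fall back to their defaults.)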

(As an aside, it's possible that OSD journalling performance could also 
be improved by teaching it to heed this topology information.  I can 
imagine that when writing directly to block devices it may be possible 
to improve performance, such as when using LVM-on-an-SSD, or a DOS 
partition on a 4k-sector SATA disk.)

  ~ ~ ~

In the mean time, the documentation I found for LVM2 suggests that the 
`pvcreate` command supports the "--dataalignment" and 
"--dataalignmentoffset" flags.

The former should be the RBD object size, e.g. 4MB by default.  In this
case, you'll also need to set the latter to compensate for the offset
introduced by the GPT place-holder partition table at the start of the
device so that LVM data extents begin on an object boundary.
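
A rough sketch (untested here, and assuming the default 4MB object size
and a GPT data partition that starts 1MiB into the rbd image; adjust both
to your actual layout):

  # align PV data extents to 4MB, shifted so they land on rbd object boundaries:
  # 1MiB (partition start) + 3MiB (offset) = 4MiB from the start of the image
  pvcreate --dataalignment 4m --dataalignmentoffset 3m /dev/rbd1p1

pvs -o +pe_start should then show where the first PE actually ended up.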

Cheers,
David
-- 
David McBride <dwm37@cam.ac.uk>
Unix Specialist, University Computing Service

* Re: poor read performance on rbd+LVM, LVM overload
  2013-10-16 16:16   ` [linux-lvm] " Sage Weil
@ 2013-10-17 15:18     ` Mike Snitzer
  -1 siblings, 0 replies; 36+ messages in thread
From: Mike Snitzer @ 2013-10-17 15:18 UTC (permalink / raw)
  To: Sage Weil; +Cc: Ugis, ceph-devel, ceph-users, linux-lvm

On Wed, Oct 16 2013 at 12:16pm -0400,
Sage Weil <sage@inktank.com> wrote:

> Hi,
> 
> On Wed, 16 Oct 2013, Ugis wrote:
> > 
> > What could make so great difference when LVM is used and what/how to
> > tune? As write performance does not differ, DM extent lookup should
> > not be lagging, where is the trick?
> 
> My first guess is that LVM is shifting the content of hte device such that 
> it no longer aligns well with the RBD striping (by default, 4MB).  The 
> non-aligned reads/writes would need to touch two objects instead of 
> one, and dd is generally doing these synchronously (i.e., lots of 
> waiting).
> 
> I'm not sure what options LVM provides for aligning things to the 
> underlying storage...

LVM will consume the underlying storage's device limits.  So if rbd
establishes appropriate minimum_io_size and optimal_io_size that reflect
the striping config LVM will pick it up -- provided
'data_alignment_detection' is enabled in lvm.conf (which it is by
default).
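
(To double-check that setting, something along the lines of

  lvm dumpconfig devices/data_alignment_detection

should print the effective value.)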

Ugis, please provide the output of:

RBD_DEVICE=<rbd device name>
pvs -o pe_start $RBD_DEVICE
cat /sys/block/$RBD_DEVICE/queue/minimum_io_size
cat /sys/block/$RBD_DEVICE/queue/optimal_io_size

The 'pvs' command will tell you where LVM aligned the start of the data
area (which follows the LVM metadata area).  Hopefully it reflects what
was published in sysfs for rbd's striping.

* Re: poor read performance on rbd+LVM, LVM overload
  2013-10-17 15:18     ` [linux-lvm] " Mike Snitzer
@ 2013-10-18  7:56         ` Ugis
  -1 siblings, 0 replies; 36+ messages in thread
From: Ugis @ 2013-10-18  7:56 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: ceph-devel, ceph-users, linux-lvm

> Ugis, please provide the output of:
>
> RBD_DEVICE=<rbd device name>
> pvs -o pe_start $RBD_DEVICE
> cat /sys/block/$RBD_DEVICE/queue/minimum_io_size
> cat /sys/block/$RBD_DEVICE/queue/optimal_io_size
>
> The 'pvs' command will tell you where LVM aligned the start of the data
> area (which follows the LVM metadata area).  Hopefully it reflects what
> was published in sysfs for rbd's striping.

output follows:
#pvs -o pe_start /dev/rbd1p1
  1st PE
    4.00m
# cat /sys/block/rbd1/queue/minimum_io_size
4194304
# cat /sys/block/rbd1/queue/optimal_io_size
4194304

Seems correct in terms of ceph-LVM io parameter negotiation? I wondered
about the gpt header + PV metadata - they introduce some shift from the
beginning of the first ceph block. Does this mean that all following LVM 4m
data blocks are shifted by this amount and span 2 ceph objects?
If so, performance will be affected.

Ugis

* Re: poor read performance on rbd+LVM, LVM overload
  2013-10-18  7:56         ` [linux-lvm] " Ugis
@ 2013-10-19  0:01           ` Sage Weil
  -1 siblings, 0 replies; 36+ messages in thread
From: Sage Weil @ 2013-10-19  0:01 UTC (permalink / raw)
  To: Ugis; +Cc: Mike Snitzer, ceph-devel, ceph-users, linux-lvm

On Fri, 18 Oct 2013, Ugis wrote:
> > Ugis, please provide the output of:
> >
> > RBD_DEVICE=<rbd device name>
> > pvs -o pe_start $RBD_DEVICE
> > cat /sys/block/$RBD_DEVICE/queue/minimum_io_size
> > cat /sys/block/$RBD_DEVICE/queue/optimal_io_size
> >
> > The 'pvs' command will tell you where LVM aligned the start of the data
> > area (which follows the LVM metadata area).  Hopefully it reflects what
> > was published in sysfs for rbd's striping.
> 
> output follows:
> #pvs -o pe_start /dev/rbd1p1
>   1st PE
>     4.00m
> # cat /sys/block/rbd1/queue/minimum_io_size
> 4194304
> # cat /sys/block/rbd1/queue/optimal_io_size
> 4194304

Well, the parameters are being set at least.  Mike, is it possible that 
having minimum_io_size set to 4m is causing some read amplification 
in LVM, translating a small read into a complete fetch of the PE (or 
something along those lines)?

Ugis, if your cluster is on the small side, it might be interesting to see 
what requests the client is generating in the LVM and non-LVM case by
setting 'debug ms = 1' on the osds (e.g., ceph tell osd.* injectargs 
'--debug-ms 1') and then looking at the osd_op messages that appear in 
/var/log/ceph/ceph-osd*.log.  It may be obvious that the IO pattern is 
different.
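
(Roughly something like the following, with the mount point and file name
being placeholders:

  ceph tell osd.* injectargs '--debug-ms 1'
  dd if=/mnt/somefile bs=1M count=256 | pv | dd of=/dev/null
  grep -h 'osd_op(' /var/log/ceph/ceph-osd*.log | \
      grep -o 'read [0-9]*~[0-9]*' | sort | uniq -c | sort -rn | head
  ceph tell osd.* injectargs '--debug-ms 0'

The uniq -c counts should make the request-size distribution obvious.)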

> Seems correct in terms of ceph-LVM io parameter negotiation? I wonded
> about gpt header+PV metadata - it makes some shift starting from ceph
> 1st block beginning. Does this mean that all following LVM 4m data
> blocks are shifted by this part and span 2 ceph objects?
> If so, performance will be affected.

I'm no LVM expert, but I would guess that LVM is aligning things properly 
based on the above device properties...

sage

* Re: poor read performance on rbd+LVM, LVM overload
  2013-10-19  0:01           ` [linux-lvm] " Sage Weil
@ 2013-10-20 15:18               ` Ugis
  -1 siblings, 0 replies; 36+ messages in thread
From: Ugis @ 2013-10-20 15:18 UTC (permalink / raw)
  To: Sage Weil
  Cc: ceph-devel, ceph-users, Mike Snitzer, linux-lvm

>> output follows:
>> #pvs -o pe_start /dev/rbd1p1
>>   1st PE
>>     4.00m
>> # cat /sys/block/rbd1/queue/minimum_io_size
>> 4194304
>> # cat /sys/block/rbd1/queue/optimal_io_size
>> 4194304
>
> Well, the parameters are being set at least.  Mike, is it possible that
> having minimum_io_size set to 4m is causing some read amplification
> in LVM, translating a small read into a complete fetch of the PE (or
> somethinga long those lines)?
>
> Ugis, if your cluster is on the small side, it might be interesting to see
> what requests the client is generated in the LVM and non-LVM case by
> setting 'debug ms = 1' on the osds (e.g., ceph tell osd.* injectargs
> '--debug-ms 1') and then looking at the osd_op messages that appear in
> /var/log/ceph/ceph-osd*.log.  It may be obvious that the IO pattern is
> different.
>
Sage, here follows the debug output. I am no pro at reading this, but
it seems the read block sizes differ (or what is that number following the ~ sign)?

OSD.2 read with LVM:
2013-10-20 16:59:05.307159 7f95acfa5700  1 -- x.x.x.x:6804/1944 -->
x.x.x.y:0/269199468 -- osd_op_reply(176566434
rbd_data.3ad974b0dc51.0000000000007cef [read 4083712~4096] ondisk = 0)
v4 -- ?+0 0xdc35c00 con 0xd9e4840
2013-10-20 16:59:05.307655 7f95b27b0700  1 -- x.x.x.x:6804/1944 <==
client.38069 x.x.x.y:0/269199468 5548 ====
osd_op(client.38069.1:176566435 rbd_data.3ad974b0dc51.0000000000007cef
[read 4087808~4096] 4.5672f053 e6870) v4 ==== 177+0+0 (1554835253 0 0)
0x12593d80 con 0xd9e4840
2013-10-20 16:59:05.307824 7f95ac7a4700  1 -- x.x.x.x:6804/1944 -->
x.x.x.y:0/269199468 -- osd_op_reply(176566435
rbd_data.3ad974b0dc51.0000000000007cef [read 4087808~4096] ondisk = 0)
v4 -- ?+0 0xe24fc00 con 0xd9e4840
2013-10-20 16:59:05.308316 7f95b27b0700  1 -- x.x.x.x:6804/1944 <==
client.38069 x.x.x.y:0/269199468 5549 ====
osd_op(client.38069.1:176566436 rbd_data.3ad974b0dc51.0000000000007cef
[read 4091904~4096] 4.5672f053 e6870) v4 ==== 177+0+0 (3467296840 0 0)
0xe28f6c0 con 0xd9e4840
2013-10-20 16:59:05.308499 7f95acfa5700  1 -- x.x.x.x:6804/1944 -->
x.x.x.y:0/269199468 -- osd_op_reply(176566436
rbd_data.3ad974b0dc51.0000000000007cef [read 4091904~4096] ondisk = 0)
v4 -- ?+0 0xdc35a00 con 0xd9e4840
2013-10-20 16:59:05.308985 7f95b27b0700  1 -- x.x.x.x:6804/1944 <==
client.38069 x.x.x.y:0/269199468 5550 ====
osd_op(client.38069.1:176566437 rbd_data.3ad974b0dc51.0000000000007cef
[read 4096000~4096] 4.5672f053 e6870) v4 ==== 177+0+0 (3104591620 0 0)
0xe0b46c0 con 0xd9e4840

OSD.2 read without LVM
2013-10-20 17:03:13.730881 7f95ac7a4700  1 -- x.x.x.x:6804/1944 -->
x.x.x.y:0/269199468 -- osd_op_reply(176708854
rb.0.967b.238e1f29.000000000071 [read 2359296~131072] ondisk = 0) v4
-- ?+0 0x1019d200 con 0xd9e4840
2013-10-20 17:03:13.731318 7f95b27b0700  1 -- x.x.x.x:6804/1944 <==
client.38069 x.x.x.y:0/269199468 18232 ====
osd_op(client.38069.1:176708855 rb.0.967b.238e1f29.000000000071 [read
2490368~131072] 4.c0d1e4cb e6870) v4 ==== 170+0+0 (1987168552 0 0)
0x171a7480 con 0xd9e4840
2013-10-20 17:03:13.731664 7f95acfa5700  1 -- x.x.x.x:6804/1944 -->
x.x.x.y:0/269199468 -- osd_op_reply(176708855
rb.0.967b.238e1f29.000000000071 [read 2490368~131072] ondisk = 0) v4
-- ?+0 0x12b81200 con 0xd9e4840
2013-10-20 17:03:13.733112 7f95b27b0700  1 -- x.x.x.x:6804/1944 <==
client.38069 x.x.x.y:0/269199468 18233 ====
osd_op(client.38069.1:176708856 rb.0.967b.238e1f29.000000000071 [read
2621440~131072] 4.c0d1e4cb e6870) v4 ==== 170+0+0 (527551382 0 0)
0x12593d80 con 0xd9e4840
2013-10-20 17:03:13.733393 7f95ac7a4700  1 -- x.x.x.x:6804/1944 -->
x.x.x.y:0/269199468 -- osd_op_reply(176708856
rb.0.967b.238e1f29.000000000071 [read 2621440~131072] ondisk = 0) v4
-- ?+0 0xeba9000 con 0xd9e4840
2013-10-20 17:03:13.733741 7f95b27b0700  1 -- x.x.x.x:6804/1944 <==
client.38069 x.x.x.y:0/269199468 18234 ====
osd_op(client.38069.1:176708857 rb.0.967b.238e1f29.000000000071 [read
2752512~131072] 4.c0d1e4cb e6870) v4 ==== 170+0+0 (178955972 0 0)
0xe0b4d80 con 0xd9e4840

How should I proceed with tuning read performance on LVM? Is some change
needed in the code of ceph/LVM, or does my config need to be tuned?
If what is shown in the logs means a 4k read block in the LVM case, then it
seems I need to tell LVM (or does xfs on top of LVM dictate the read block
size?) that the io block should rather be 4m?

Ugis

* Re: poor read performance on rbd+LVM, LVM overload
  2013-10-20 15:18               ` [linux-lvm] " Ugis
@ 2013-10-20 18:21                   ` Josh Durgin
  -1 siblings, 0 replies; 36+ messages in thread
From: Josh Durgin @ 2013-10-20 18:21 UTC (permalink / raw)
  To: Ugis, Sage Weil
  Cc: ceph-devel, ceph-users, Mike Snitzer, linux-lvm

On 10/20/2013 08:18 AM, Ugis wrote:
>>> output follows:
>>> #pvs -o pe_start /dev/rbd1p1
>>>    1st PE
>>>      4.00m
>>> # cat /sys/block/rbd1/queue/minimum_io_size
>>> 4194304
>>> # cat /sys/block/rbd1/queue/optimal_io_size
>>> 4194304
>>
>> Well, the parameters are being set at least.  Mike, is it possible that
>> having minimum_io_size set to 4m is causing some read amplification
>> in LVM, translating a small read into a complete fetch of the PE (or
>> somethinga long those lines)?
>>
>> Ugis, if your cluster is on the small side, it might be interesting to see
>> what requests the client is generated in the LVM and non-LVM case by
>> setting 'debug ms = 1' on the osds (e.g., ceph tell osd.* injectargs
>> '--debug-ms 1') and then looking at the osd_op messages that appear in
>> /var/log/ceph/ceph-osd*.log.  It may be obvious that the IO pattern is
>> different.
>>
> Sage, here follows debug output. I am no pro in reading this, but
> seems read block size differ(or what is that number following ~ sign)?

Yes, that's the I/O length. LVM is sending requests for 4k at a time,
while plain kernel rbd is sending 128k.

<request logs showing this>

> How to proceed with tuning read performance on LVM? Is there some
> chanage needed in code of ceph/LVM or my config needs to be tuned?
> If what is shown in logs means 4k read block in LVM case - then it
> seems I need to tell LVM(or xfs on top of LVM dictates read block
> side?) that io block should be rather 4m?

It's a client side issue of sending much smaller requests than it needs
to. Check the queue minimum and optimal sizes for the lvm device - it
sounds like they might be getting set to 4k for some reason.
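
(Something along these lines, with the VG/LV names as placeholders:

  DM_DEV=$(basename $(readlink -f /dev/myvg/mylv))   # resolves to the underlying dm-N node
  cat /sys/block/$DM_DEV/queue/minimum_io_size
  cat /sys/block/$DM_DEV/queue/optimal_io_size

Values like 4096, or 512/0, instead of 4194304 would mean the rbd limits
are not making it through to the LV.)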

Josh

* Re: poor read performance on rbd+LVM, LVM overload
  2013-10-20 15:18               ` [linux-lvm] " Ugis
@ 2013-10-21  3:58                   ` Sage Weil
  -1 siblings, 0 replies; 36+ messages in thread
From: Sage Weil @ 2013-10-21  3:58 UTC (permalink / raw)
  To: Ugis
  Cc: ceph-devel, ceph-users, Mike Snitzer, linux-lvm

On Sun, 20 Oct 2013, Ugis wrote:
> >> output follows:
> >> #pvs -o pe_start /dev/rbd1p1
> >>   1st PE
> >>     4.00m
> >> # cat /sys/block/rbd1/queue/minimum_io_size
> >> 4194304
> >> # cat /sys/block/rbd1/queue/optimal_io_size
> >> 4194304
> >
> > Well, the parameters are being set at least.  Mike, is it possible that
> > having minimum_io_size set to 4m is causing some read amplification
> > in LVM, translating a small read into a complete fetch of the PE (or
> > somethinga long those lines)?
> >
> > Ugis, if your cluster is on the small side, it might be interesting to see
> > what requests the client is generated in the LVM and non-LVM case by
> > setting 'debug ms = 1' on the osds (e.g., ceph tell osd.* injectargs
> > '--debug-ms 1') and then looking at the osd_op messages that appear in
> > /var/log/ceph/ceph-osd*.log.  It may be obvious that the IO pattern is
> > different.
> >
> Sage, here follows debug output. I am no pro in reading this, but
> seems read block size differ(or what is that number following ~ sign)?

Yep, it's offset~length.  

It looks like without LVM we're getting 128KB requests (which IIRC is 
typical), but with LVM it's only 4KB.  Unfortunately my memory is a bit 
fuzzy here, but I seem to recall a property on the request_queue or device 
that affected this.  RBD is currently doing

	segment_size = rbd_obj_bytes(&rbd_dev->header);
	blk_queue_max_hw_sectors(q, segment_size / SECTOR_SIZE);
	blk_queue_max_segment_size(q, segment_size);
	blk_queue_io_min(q, segment_size);
	blk_queue_io_opt(q, segment_size);

where segment_size is 4MB (so, much more than 128KB); maybe it has 
something to do with how many smaller ios get coalesced into larger requests?

In any case, something appears to be lost due to the pass through LVM, but 
I'm not very familiar with the block layer code at all...  :/
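
(It might be worth comparing what the rbd queue advertises with what the
dm device ends up with; a rough check, with dm-X standing in for whatever
node the LV maps to:

  for f in minimum_io_size optimal_io_size max_sectors_kb max_segment_size; do
      echo "$f: rbd1=$(cat /sys/block/rbd1/queue/$f) lvm=$(cat /sys/block/dm-X/queue/$f)"
  done

If max_sectors_kb or max_segment_size collapses on the dm side, that alone
would cap request sizes regardless of the elevator.)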

sage


> 
> OSD.2 read with LVM:
> 2013-10-20 16:59:05.307159 7f95acfa5700  1 -- x.x.x.x:6804/1944 -->
> x.x.x.y:0/269199468 -- osd_op_reply(176566434
> rbd_data.3ad974b0dc51.0000000000007cef [read 4083712~4096] ondisk = 0)
> v4 -- ?+0 0xdc35c00 con 0xd9e4840
> 2013-10-20 16:59:05.307655 7f95b27b0700  1 -- x.x.x.x:6804/1944 <==
> client.38069 x.x.x.y:0/269199468 5548 ====
> osd_op(client.38069.1:176566435 rbd_data.3ad974b0dc51.0000000000007cef
> [read 4087808~4096] 4.5672f053 e6870) v4 ==== 177+0+0 (1554835253 0 0)
> 0x12593d80 con 0xd9e4840
> 2013-10-20 16:59:05.307824 7f95ac7a4700  1 -- x.x.x.x:6804/1944 -->
> x.x.x.y:0/269199468 -- osd_op_reply(176566435
> rbd_data.3ad974b0dc51.0000000000007cef [read 4087808~4096] ondisk = 0)
> v4 -- ?+0 0xe24fc00 con 0xd9e4840
> 2013-10-20 16:59:05.308316 7f95b27b0700  1 -- x.x.x.x:6804/1944 <==
> client.38069 x.x.x.y:0/269199468 5549 ====
> osd_op(client.38069.1:176566436 rbd_data.3ad974b0dc51.0000000000007cef
> [read 4091904~4096] 4.5672f053 e6870) v4 ==== 177+0+0 (3467296840 0 0)
> 0xe28f6c0 con 0xd9e4840
> 2013-10-20 16:59:05.308499 7f95acfa5700  1 -- x.x.x.x:6804/1944 -->
> x.x.x.y:0/269199468 -- osd_op_reply(176566436
> rbd_data.3ad974b0dc51.0000000000007cef [read 4091904~4096] ondisk = 0)
> v4 -- ?+0 0xdc35a00 con 0xd9e4840
> 2013-10-20 16:59:05.308985 7f95b27b0700  1 -- x.x.x.x:6804/1944 <==
> client.38069 x.x.x.y:0/269199468 5550 ====
> osd_op(client.38069.1:176566437 rbd_data.3ad974b0dc51.0000000000007cef
> [read 4096000~4096] 4.5672f053 e6870) v4 ==== 177+0+0 (3104591620 0 0)
> 0xe0b46c0 con 0xd9e4840
> 
> OSD.2 read without LVM
> 2013-10-20 17:03:13.730881 7f95ac7a4700  1 -- x.x.x.x:6804/1944 -->
> x.x.x.y:0/269199468 -- osd_op_reply(176708854
> rb.0.967b.238e1f29.000000000071 [read 2359296~131072] ondisk = 0) v4
> -- ?+0 0x1019d200 con 0xd9e4840
> 2013-10-20 17:03:13.731318 7f95b27b0700  1 -- x.x.x.x:6804/1944 <==
> client.38069 x.x.x.y:0/269199468 18232 ====
> osd_op(client.38069.1:176708855 rb.0.967b.238e1f29.000000000071 [read
> 2490368~131072] 4.c0d1e4cb e6870) v4 ==== 170+0+0 (1987168552 0 0)
> 0x171a7480 con 0xd9e4840
> 2013-10-20 17:03:13.731664 7f95acfa5700  1 -- x.x.x.x:6804/1944 -->
> x.x.x.y:0/269199468 -- osd_op_reply(176708855
> rb.0.967b.238e1f29.000000000071 [read 2490368~131072] ondisk = 0) v4
> -- ?+0 0x12b81200 con 0xd9e4840
> 2013-10-20 17:03:13.733112 7f95b27b0700  1 -- x.x.x.x:6804/1944 <==
> client.38069 x.x.x.y:0/269199468 18233 ====
> osd_op(client.38069.1:176708856 rb.0.967b.238e1f29.000000000071 [read
> 2621440~131072] 4.c0d1e4cb e6870) v4 ==== 170+0+0 (527551382 0 0)
> 0x12593d80 con 0xd9e4840
> 2013-10-20 17:03:13.733393 7f95ac7a4700  1 -- x.x.x.x:6804/1944 -->
> x.x.x.y:0/269199468 -- osd_op_reply(176708856
> rb.0.967b.238e1f29.000000000071 [read 2621440~131072] ondisk = 0) v4
> -- ?+0 0xeba9000 con 0xd9e4840
> 2013-10-20 17:03:13.733741 7f95b27b0700  1 -- x.x.x.x:6804/1944 <==
> client.38069 x.x.x.y:0/269199468 18234 ====
> osd_op(client.38069.1:176708857 rb.0.967b.238e1f29.000000000071 [read
> 2752512~131072] 4.c0d1e4cb e6870) v4 ==== 170+0+0 (178955972 0 0)
> 0xe0b4d80 con 0xd9e4840
> 
> How to proceed with tuning read performance on LVM? Is there some
> chanage needed in code of ceph/LVM or my config needs to be tuned?
> If what is shown in logs means 4k read block in LVM case - then it
> seems I need to tell LVM(or xfs on top of LVM dictates read block
> side?) that io block should be rather 4m?
> 
> Ugis
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

* Re: poor read performance on rbd+LVM, LVM overload
  2013-10-21  3:58                   ` [linux-lvm] " Sage Weil
@ 2013-10-21 14:11                     ` Christoph Hellwig
  -1 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2013-10-21 14:11 UTC (permalink / raw)
  To: Sage Weil; +Cc: Ugis, Mike Snitzer, ceph-devel, ceph-users, linux-lvm, elder

On Sun, Oct 20, 2013 at 08:58:58PM -0700, Sage Weil wrote:
> It looks like without LVM we're getting 128KB requests (which IIRC is 
> typical), but with LVM it's only 4KB.  Unfortunately my memory is a bit 
> fuzzy here, but I seem to recall a property on the request_queue or device 
> that affected this.  RBD is currently doing

Unfortunately most device mapper modules still split all I/O into 4k
chunks before handling them.  They rely on the elevator to merge them
back together down the line, which isn't overly efficient but should at
least provide larger segments for the common cases.
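
(One way to see whether the elevator is in fact merging those 4k bios back
into larger requests is to watch the average request size during the read
test, e.g. with

  iostat -x 1

and compare the avgrq-sz column, which is in 512-byte sectors, for the
dm-* device and the rbd device; ~256 would correspond to 128KB requests,
~8 to 4KB.)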


* Re: poor read performance on rbd+LVM, LVM overload
  2013-10-21 14:11                     ` [linux-lvm] " Christoph Hellwig
@ 2013-10-21 15:01                         ` Mike Snitzer
  -1 siblings, 0 replies; 36+ messages in thread
From: Mike Snitzer @ 2013-10-21 15:01 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-lvm, ceph-devel, ceph-users

On Mon, Oct 21 2013 at 10:11am -0400,
Christoph Hellwig <hch@infradead.org> wrote:

> On Sun, Oct 20, 2013 at 08:58:58PM -0700, Sage Weil wrote:
> > It looks like without LVM we're getting 128KB requests (which IIRC is 
> > typical), but with LVM it's only 4KB.  Unfortunately my memory is a bit 
> > fuzzy here, but I seem to recall a property on the request_queue or device 
> > that affected this.  RBD is currently doing
> 
> Unfortunately most device mapper modules still split all I/O into 4k
> chunks before handling them.  They rely on the elevator to merge them
> back together down the line, which isn't overly efficient but should at
> least provide larger segments for the common cases.

It isn't DM that splits the IO into 4K chunks; it is the VM subsystem
no?  Unless care is taken to assemble larger bios (higher up the IO
stack, e.g. in XFS), all buffered IO will come to bio-based DM targets
in $PAGE_SIZE granularity.

I would expect direct IO to before better here because it will make use
of bio_add_page to build up larger IOs.
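
(That would be easy to test on the existing LV, e.g.

  dd if=/mnt/somefile of=/dev/null bs=4M iflag=direct

with the path as a placeholder; if direct IO reads run at the non-LVM
speed, the small requests are coming from the buffered IO path rather
than from device mapper itself.)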

Taking a step back, the rbd driver is exposing both the minimum_io_size
and optimal_io_size as 4M.  This symmetry will cause XFS to _not_ detect
the exposed limits as striping.  Therefore, AFAIK, XFS won't take steps
to respect the limits when it assembles its bios (via bio_add_page).
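
(A quick way to check what mkfs.xfs actually detected is to look at the
stripe geometry it recorded, e.g.

  xfs_info /mnt/lvm_mount | grep sunit

with the mount point being a placeholder; sunit=0, swidth=0 means no
stripe geometry was picked up.)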

Sage, any reason why you don't use traditional raid geometry based IO
limits? E.g.:

minimum_io_size = raid chunk size
optimal_io_size = raid chunk size * N stripes (aka full stripe)

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [linux-lvm] poor read performance on rbd+LVM, LVM overload
@ 2013-10-21 15:01                         ` Mike Snitzer
  0 siblings, 0 replies; 36+ messages in thread
From: Mike Snitzer @ 2013-10-21 15:01 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: elder, Sage Weil, Ugis, linux-lvm, ceph-devel, ceph-users

On Mon, Oct 21 2013 at 10:11am -0400,
Christoph Hellwig <hch@infradead.org> wrote:

> On Sun, Oct 20, 2013 at 08:58:58PM -0700, Sage Weil wrote:
> > It looks like without LVM we're getting 128KB requests (which IIRC is 
> > typical), but with LVM it's only 4KB.  Unfortunately my memory is a bit 
> > fuzzy here, but I seem to recall a property on the request_queue or device 
> > that affected this.  RBD is currently doing
> 
> Unfortunately most device mapper modules still split all I/O into 4k
> chunks before handling them.  They rely on the elevator to merge them
> back together down the line, which isn't overly efficient but should at
> least provide larger segments for the common cases.

It isn't DM that splits the IO into 4K chunks; it is the VM subsystem
no?  Unless care is taken to assemble larger bios (higher up the IO
stack, e.g. in XFS), all buffered IO will come to bio-based DM targets
in $PAGE_SIZE granularity.

I would expect direct IO to before better here because it will make use
of bio_add_page to build up larger IOs.
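
That part is easy to check from the reporter's side; a quick sketch (the
file path is only an example):

# dd if=/mnt/lvm/somefile of=/dev/null bs=4M iflag=direct

If the direct IO read is much faster than the buffered one on the LVM
device, the small bios are coming from the buffered path.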

Taking a step back, the rbd driver is exposing both the minimum_io_size
and optimal_io_size as 4M.  This symmetry will cause XFS to _not_ detect
the exposed limits as striping.  Therefore, AFAIK, XFS won't take steps
to respect the limits when it assembles its bios (via bio_add_page).

Sage, any reason why you don't use traditional raid geometry based IO
limits? E.g.:

minimum_io_size = raid chunk size
optimal_io_size = raid chunk size * N stripes (aka full stripe)
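
For what it's worth, those are exactly the hints mkfs.xfs turns into its
su/sw geometry, and the same thing can be forced by hand when a device
does not advertise a geometry it recognizes; a sketch only, with made-up
values and a placeholder device name:

# mkfs.xfs -d su=64k,sw=4 /dev/VG/LV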

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [linux-lvm] poor read performance on rbd+LVM, LVM overload
@ 2013-10-21 15:06                             ` Mike Snitzer
  0 siblings, 0 replies; 36+ messages in thread
From: Mike Snitzer @ 2013-10-21 15:06 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: elder, Sage Weil, Ugis, linux-lvm, ceph-devel, ceph-users

On Mon, Oct 21 2013 at 11:01am -0400,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Mon, Oct 21 2013 at 10:11am -0400,
> Christoph Hellwig <hch@infradead.org> wrote:
> 
> > On Sun, Oct 20, 2013 at 08:58:58PM -0700, Sage Weil wrote:
> > > It looks like without LVM we're getting 128KB requests (which IIRC is 
> > > typical), but with LVM it's only 4KB.  Unfortunately my memory is a bit 
> > > fuzzy here, but I seem to recall a property on the request_queue or device 
> > > that affected this.  RBD is currently doing
> > 
> > Unfortunately most device mapper modules still split all I/O into 4k
> > chunks before handling them.  They rely on the elevator to merge them
> > back together down the line, which isn't overly efficient but should at
> > least provide larger segments for the common cases.
> 
> It isn't DM that splits the IO into 4K chunks; it is the VM subsystem
> no?  Unless care is taken to assemble larger bios (higher up the IO
> stack, e.g. in XFS), all buffered IO will come to bio-based DM targets
> in $PAGE_SIZE granularity.
> 
> I would expect direct IO to before better here because it will make use
> of bio_add_page to build up larger IOs.

s/before/perform/ ;)
 
> Taking a step back, the rbd driver is exposing both the minimum_io_size
> and optimal_io_size as 4M.  This symmetry will cause XFS to _not_ detect
> the exposed limits as striping.  Therefore, AFAIK, XFS won't take steps
> to respect the limits when it assembles its bios (via bio_add_page).
> 
> Sage, any reason why you don't use traditional raid geomtry based IO
> limits?, e.g.:
> 
> minimum_io_size = raid chunk size
> optimal_io_size = raid chunk size * N stripes (aka full stripe)
> 
> _______________________________________________
> linux-lvm mailing list
> linux-lvm@redhat.com
> https://www.redhat.com/mailman/listinfo/linux-lvm
> read the LVM HOW-TO at http://tldp.org/HOWTO/LVM-HOWTO/

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [linux-lvm] poor read performance on rbd+LVM, LVM overload
@ 2013-10-21 16:02                           ` Sage Weil
  0 siblings, 0 replies; 36+ messages in thread
From: Sage Weil @ 2013-10-21 16:02 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: elder, Christoph Hellwig, Ugis, linux-lvm, ceph-devel, ceph-users

On Mon, 21 Oct 2013, Mike Snitzer wrote:
> On Mon, Oct 21 2013 at 10:11am -0400,
> Christoph Hellwig <hch@infradead.org> wrote:
> 
> > On Sun, Oct 20, 2013 at 08:58:58PM -0700, Sage Weil wrote:
> > > It looks like without LVM we're getting 128KB requests (which IIRC is 
> > > typical), but with LVM it's only 4KB.  Unfortunately my memory is a bit 
> > > fuzzy here, but I seem to recall a property on the request_queue or device 
> > > that affected this.  RBD is currently doing
> > 
> > Unfortunately most device mapper modules still split all I/O into 4k
> > chunks before handling them.  They rely on the elevator to merge them
> > back together down the line, which isn't overly efficient but should at
> > least provide larger segments for the common cases.
> 
> It isn't DM that splits the IO into 4K chunks; it is the VM subsystem
> no?  Unless care is taken to assemble larger bios (higher up the IO
> stack, e.g. in XFS), all buffered IO will come to bio-based DM targets
> in $PAGE_SIZE granularity.
> 
> I would expect direct IO to before better here because it will make use
> of bio_add_page to build up larger IOs.

I do know that we regularly see 128 KB requests when we put XFS (or 
whatever else) directly on top of /dev/rbd*.

> Taking a step back, the rbd driver is exposing both the minimum_io_size
> and optimal_io_size as 4M.  This symmetry will cause XFS to _not_ detect
> the exposed limits as striping.  Therefore, AFAIK, XFS won't take steps
> to respect the limits when it assembles its bios (via bio_add_page).
> 
> Sage, any reason why you don't use traditional raid geomtry based IO
> limits?, e.g.:
> 
> minimum_io_size = raid chunk size
> optimal_io_size = raid chunk size * N stripes (aka full stripe)

We are... by default we stripe 4M chunks across 4M objects.  You're 
suggesting it would actually help to advertise a smaller minimum_io_size
(say, 1MB)?  This could easily be made tunable.
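
For anyone following along, the object size behind those defaults is
visible per image; a sketch with placeholder pool/image names:

# rbd info rbd/myimage | grep order
        order 22 (4096 kB objects)

order 22 means 2^22 bytes, i.e. the 4MB default.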

sage

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [linux-lvm] poor read performance on rbd+LVM, LVM overload
@ 2013-10-21 17:48                               ` Mike Snitzer
  0 siblings, 0 replies; 36+ messages in thread
From: Mike Snitzer @ 2013-10-21 17:48 UTC (permalink / raw)
  To: Sage Weil
  Cc: elder, Christoph Hellwig, Ugis, linux-lvm, ceph-devel, ceph-users

On Mon, Oct 21 2013 at 12:02pm -0400,
Sage Weil <sage@inktank.com> wrote:

> On Mon, 21 Oct 2013, Mike Snitzer wrote:
> > On Mon, Oct 21 2013 at 10:11am -0400,
> > Christoph Hellwig <hch@infradead.org> wrote:
> > 
> > > On Sun, Oct 20, 2013 at 08:58:58PM -0700, Sage Weil wrote:
> > > > It looks like without LVM we're getting 128KB requests (which IIRC is 
> > > > typical), but with LVM it's only 4KB.  Unfortunately my memory is a bit 
> > > > fuzzy here, but I seem to recall a property on the request_queue or device 
> > > > that affected this.  RBD is currently doing
> > > 
> > > Unfortunately most device mapper modules still split all I/O into 4k
> > > chunks before handling them.  They rely on the elevator to merge them
> > > back together down the line, which isn't overly efficient but should at
> > > least provide larger segments for the common cases.
> > 
> > It isn't DM that splits the IO into 4K chunks; it is the VM subsystem
> > no?  Unless care is taken to assemble larger bios (higher up the IO
> > stack, e.g. in XFS), all buffered IO will come to bio-based DM targets
> > in $PAGE_SIZE granularity.
> > 
> > I would expect direct IO to before better here because it will make use
> > of bio_add_page to build up larger IOs.
> 
> I do know that we regularly see 128 KB requests when we put XFS (or 
> whatever else) directly on top of /dev/rbd*.

Should be pretty straight-forward to identify any limits that are
different by walking sysfs/queue, e.g.:

grep -r . /sys/block/rbdXXX/queue
vs
grep -r . /sys/block/dm-X/queue

Could be there is an unexpected difference.  For instance, there was
this fix recently: http://patchwork.usersys.redhat.com/patch/69661/
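
A one-liner that shows only the fields that differ; a sketch, adjust the
device names to the ones actually in use:

# diff <(grep -r . /sys/block/rbdXXX/queue | sed 's|.*/queue/||' | sort) \
       <(grep -r . /sys/block/dm-X/queue | sed 's|.*/queue/||' | sort)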

> > Taking a step back, the rbd driver is exposing both the minimum_io_size
> > and optimal_io_size as 4M.  This symmetry will cause XFS to _not_ detect
> > the exposed limits as striping.  Therefore, AFAIK, XFS won't take steps
> > to respect the limits when it assembles its bios (via bio_add_page).
> > 
> > Sage, any reason why you don't use traditional raid geomtry based IO
> > limits?, e.g.:
> > 
> > minimum_io_size = raid chunk size
> > optimal_io_size = raid chunk size * N stripes (aka full stripe)
> 
> We are... by default we stripe 4M chunks across 4M objects.  You're 
> suggesting it would actually help to advertise a smaller minimim_io_size 
> (say, 1MB)?  This could easily be made tunable.

You're striping 4MB chunks across 4 million stripes?

So the full stripe size in bytes is 17592186044416 (or 16TB)?  Yeah, I
cannot see how XFS could make use of that ;)

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [linux-lvm] poor read performance on rbd+LVM, LVM overload
@ 2013-10-21 18:05                                   ` Sage Weil
  0 siblings, 0 replies; 36+ messages in thread
From: Sage Weil @ 2013-10-21 18:05 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: elder, Christoph Hellwig, Ugis, linux-lvm, ceph-devel, ceph-users

On Mon, 21 Oct 2013, Mike Snitzer wrote:
> On Mon, Oct 21 2013 at 12:02pm -0400,
> Sage Weil <sage@inktank.com> wrote:
> 
> > On Mon, 21 Oct 2013, Mike Snitzer wrote:
> > > On Mon, Oct 21 2013 at 10:11am -0400,
> > > Christoph Hellwig <hch@infradead.org> wrote:
> > > 
> > > > On Sun, Oct 20, 2013 at 08:58:58PM -0700, Sage Weil wrote:
> > > > > It looks like without LVM we're getting 128KB requests (which IIRC is 
> > > > > typical), but with LVM it's only 4KB.  Unfortunately my memory is a bit 
> > > > > fuzzy here, but I seem to recall a property on the request_queue or device 
> > > > > that affected this.  RBD is currently doing
> > > > 
> > > > Unfortunately most device mapper modules still split all I/O into 4k
> > > > chunks before handling them.  They rely on the elevator to merge them
> > > > back together down the line, which isn't overly efficient but should at
> > > > least provide larger segments for the common cases.
> > > 
> > > It isn't DM that splits the IO into 4K chunks; it is the VM subsystem
> > > no?  Unless care is taken to assemble larger bios (higher up the IO
> > > stack, e.g. in XFS), all buffered IO will come to bio-based DM targets
> > > in $PAGE_SIZE granularity.
> > > 
> > > I would expect direct IO to before better here because it will make use
> > > of bio_add_page to build up larger IOs.
> > 
> > I do know that we regularly see 128 KB requests when we put XFS (or 
> > whatever else) directly on top of /dev/rbd*.
> 
> Should be pretty straight-forward to identify any limits that are
> different by walking sysfs/queue, e.g.:
> 
> grep -r . /sys/block/rdbXXX/queue
> vs
> grep -r . /sys/block/dm-X/queue
> 
> Could be there is an unexpected difference.  For instance, there was
> this fix recently: http://patchwork.usersys.redhat.com/patch/69661/
> 
> > > Taking a step back, the rbd driver is exposing both the minimum_io_size
> > > and optimal_io_size as 4M.  This symmetry will cause XFS to _not_ detect
> > > the exposed limits as striping.  Therefore, AFAIK, XFS won't take steps
> > > to respect the limits when it assembles its bios (via bio_add_page).
> > > 
> > > Sage, any reason why you don't use traditional raid geomtry based IO
> > > limits?, e.g.:
> > > 
> > > minimum_io_size = raid chunk size
> > > optimal_io_size = raid chunk size * N stripes (aka full stripe)
> > 
> > We are... by default we stripe 4M chunks across 4M objects.  You're 
> > suggesting it would actually help to advertise a smaller minimim_io_size 
> > (say, 1MB)?  This could easily be made tunable.
> 
> You're striping 4MB chunks across 4 million stripes?
> 
> So the full stripe size in bytes is 17592186044416 (or 16TB)?  Yeah
> cannot see how XFS could make use of that ;)

Sorry, I mean the stripe count is effectively 1.  Each 4MB gets mapped to 
a new 4MB object (for a total of image_size / 4MB objects).  So I think 
minimum_io_size and optimal_io_size are technically correct in this case.
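
As a concrete example (the numbers are only illustrative): a 1 TB image
at the default 4MB order is simply

# echo $(( 1024 * 1024 / 4 ))
262144

separate 4MB objects, one per 4MB of the block device.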

sage

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [linux-lvm] poor read performance on rbd+LVM, LVM overload
@ 2013-10-21 18:06                           ` Christoph Hellwig
  0 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2013-10-21 18:06 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: elder, Sage Weil, Christoph Hellwig, Ugis, linux-lvm, ceph-devel,
	ceph-users

On Mon, Oct 21, 2013 at 11:01:29AM -0400, Mike Snitzer wrote:
> It isn't DM that splits the IO into 4K chunks; it is the VM subsystem
> no?

Well, it's the block layer based on what DM tells it.  Take a look at
dm_merge_bvec

From dm_merge_bvec:

	/*
	 * If the target doesn't support merge method and some of the devices
	 * provided their merge_bvec method (we know this by looking at
	 * queue_max_hw_sectors), then we can't allow bios with multiple vector
	 * entries.  So always set max_size to 0, and the code below allows
	 * just one page.
	 */

Although it's not the general case, just if the driver has a
merge_bvec method.  But this happens if you're using DM on top of MD, where I
saw it as well as on rbd, which is why it's correct in this context, too.

Sorry for over generalizing a bit.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [linux-lvm] poor read performance on rbd+LVM, LVM overload
@ 2013-10-21 18:27                               ` Mike Snitzer
  0 siblings, 0 replies; 36+ messages in thread
From: Mike Snitzer @ 2013-10-21 18:27 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: elder, Sage Weil, Ugis, linux-lvm, ceph-devel, ceph-users

On Mon, Oct 21 2013 at  2:06pm -0400,
Christoph Hellwig <hch@infradead.org> wrote:

> On Mon, Oct 21, 2013 at 11:01:29AM -0400, Mike Snitzer wrote:
> > It isn't DM that splits the IO into 4K chunks; it is the VM subsystem
> > no?
> 
> Well, it's the block layer based on what DM tells it.  Take a look at
> dm_merge_bvec
> 
> >From dm_merge_bvec:
> 
> 	/*
>          * If the target doesn't support merge method and some of the devices
>          * provided their merge_bvec method (we know this by looking at
>          * queue_max_hw_sectors), then we can't allow bios with multiple vector
>          * entries.  So always set max_size to 0, and the code below allows
>          * just one page.
>          */
> 	
> Although it's not the general case, just if the driver has a
> merge_bvec method.  But this happens if you using DM ontop of MD where I
> saw it aswell as on rbd, which is why it's correct in this context, too.

Right, but only if the DM target that is being used doesn't have a
.merge method.  I don't think it was ever shared which DM target is in
use here.. but both the linear and stripe DM targets provide a .merge
method.
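
That is easy to confirm on the reporter's side with dmsetup; a sketch,
the device name and output are only an example:

# dmsetup table VG-LV
0 209715200 linear 250:2 8192

where the third field is the target name (linear here).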
 
> Sorry for over generalizing a bit.

No problem.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [linux-lvm] poor read performance on rbd+LVM, LVM overload
@ 2013-10-30 14:53                                 ` Ugis
  0 siblings, 0 replies; 36+ messages in thread
From: Ugis @ 2013-10-30 14:53 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Alex Elder, Sage Weil, Christoph Hellwig, linux-lvm, ceph-devel,
	ceph-users

Hi, I'm back from my trip, sorry for the pause in this thread; I wanted to wrap it up.
I reread the thread, but I still do not see what could be done from the admin
side to tune LVM for better read performance on ceph (parts of my LVM
config are included below), at least for an already deployed LVM.
It seems there is no clear agreement on why the IO is lost, so it seems that
LVM is not recommended on ceph rbd currently.

In case there is still hope for tuning, the info follows.
Mike wrote:
"Should be pretty straight-forward to identify any limits that are
different by walking sysfs/queue, e.g.:
grep -r . /sys/block/rbdXXX/queue
vs
grep -r . /sys/block/dm-X/queue
"

Here it is
# grep -r . /sys/block/rbd2/queue/
/sys/block/rbd2/queue/nomerges:0
/sys/block/rbd2/queue/logical_block_size:512
/sys/block/rbd2/queue/rq_affinity:1
/sys/block/rbd2/queue/discard_zeroes_data:0
/sys/block/rbd2/queue/max_segments:128
/sys/block/rbd2/queue/max_segment_size:4194304
/sys/block/rbd2/queue/rotational:1
/sys/block/rbd2/queue/scheduler:noop [deadline] cfq
/sys/block/rbd2/queue/read_ahead_kb:128
/sys/block/rbd2/queue/max_hw_sectors_kb:4096
/sys/block/rbd2/queue/discard_granularity:0
/sys/block/rbd2/queue/discard_max_bytes:0
/sys/block/rbd2/queue/write_same_max_bytes:0
/sys/block/rbd2/queue/max_integrity_segments:0
/sys/block/rbd2/queue/max_sectors_kb:512
/sys/block/rbd2/queue/physical_block_size:512
/sys/block/rbd2/queue/add_random:1
/sys/block/rbd2/queue/nr_requests:128
/sys/block/rbd2/queue/minimum_io_size:4194304
/sys/block/rbd2/queue/hw_sector_size:512
/sys/block/rbd2/queue/optimal_io_size:4194304
/sys/block/rbd2/queue/iosched/read_expire:500
/sys/block/rbd2/queue/iosched/write_expire:5000
/sys/block/rbd2/queue/iosched/fifo_batch:16
/sys/block/rbd2/queue/iosched/front_merges:1
/sys/block/rbd2/queue/iosched/writes_starved:2
/sys/block/rbd2/queue/iostats:1

# grep -r . /sys/block/dm-2/queue/
/sys/block/dm-2/queue/nomerges:0
/sys/block/dm-2/queue/logical_block_size:512
/sys/block/dm-2/queue/rq_affinity:0
/sys/block/dm-2/queue/discard_zeroes_data:0
/sys/block/dm-2/queue/max_segments:128
/sys/block/dm-2/queue/max_segment_size:65536
/sys/block/dm-2/queue/rotational:1
/sys/block/dm-2/queue/scheduler:none
/sys/block/dm-2/queue/read_ahead_kb:0
/sys/block/dm-2/queue/max_hw_sectors_kb:4096
/sys/block/dm-2/queue/discard_granularity:0
/sys/block/dm-2/queue/discard_max_bytes:0
/sys/block/dm-2/queue/write_same_max_bytes:0
/sys/block/dm-2/queue/max_integrity_segments:0
/sys/block/dm-2/queue/max_sectors_kb:512
/sys/block/dm-2/queue/physical_block_size:512
/sys/block/dm-2/queue/add_random:0
/sys/block/dm-2/queue/nr_requests:128
/sys/block/dm-2/queue/minimum_io_size:4194304
/sys/block/dm-2/queue/hw_sector_size:512
/sys/block/dm-2/queue/optimal_io_size:4194304
/sys/block/dm-2/queue/iostats:0

Chunks of /etc/lvm/lvm.conf, in case this helps:
devices {
    dir = "/dev"
    scan = [ "/dev/rbd" ,"/dev" ]
    preferred_names = [ ]
    filter = [ "a/.*/" ]
    cache_dir = "/etc/lvm/cache"
    cache_file_prefix = ""
    write_cache_state = 0
    types = [ "rbd", 250 ]
    sysfs_scan = 1
    md_component_detection = 1
    md_chunk_alignment = 1
    data_alignment_detection = 1
    data_alignment = 0
    data_alignment_offset_detection = 1
    ignore_suspended_devices = 0
}
...
activation {
    udev_sync = 1
    udev_rules = 1
    missing_stripe_filler = "error"
    reserved_stack = 256
    reserved_memory = 8192
    process_priority = -18
    mirror_region_size = 512
    readahead = "none"
    mirror_log_fault_policy = "allocate"
    mirror_image_fault_policy = "remove"
    use_mlockall = 0
    monitoring = 1
    polling_interval = 15
}
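
One difference that stands out in the dumps above is read_ahead_kb (128
on rbd2, 0 on dm-2), which matches the readahead = "none" setting in
lvm.conf and could matter for exactly this kind of streaming read.
Whether it explains the whole gap is untested, but it is cheap to try;
a sketch, the value and LV name are only examples:

# blockdev --getra /dev/dm-2
# blockdev --setra 8192 /dev/dm-2
# lvchange --readahead auto VG/LV

(8192 sectors is 4MB, i.e. one rbd object; the lvchange form makes the
setting persistent.)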

I hope something can still be done, or I will have to move several TB
off the LVM :)
Anyway, it does not feel like the cause of the problem is clear.  Maybe I
need to file a bug if that is relevant, but where to?

Ugis

2013/10/21 Mike Snitzer <snitzer@redhat.com>:
> On Mon, Oct 21 2013 at  2:06pm -0400,
> Christoph Hellwig <hch@infradead.org> wrote:
>
>> On Mon, Oct 21, 2013 at 11:01:29AM -0400, Mike Snitzer wrote:
>> > It isn't DM that splits the IO into 4K chunks; it is the VM subsystem
>> > no?
>>
>> Well, it's the block layer based on what DM tells it.  Take a look at
>> dm_merge_bvec
>>
>> >From dm_merge_bvec:
>>
>>       /*
>>          * If the target doesn't support merge method and some of the devices
>>          * provided their merge_bvec method (we know this by looking at
>>          * queue_max_hw_sectors), then we can't allow bios with multiple vector
>>          * entries.  So always set max_size to 0, and the code below allows
>>          * just one page.
>>          */
>>
>> Although it's not the general case, just if the driver has a
>> merge_bvec method.  But this happens if you using DM ontop of MD where I
>> saw it aswell as on rbd, which is why it's correct in this context, too.
>
> Right, but only if the DM target that is being used doesn't have a
> .merge method.  I don't think it was ever shared which DM target is in
> use here.. but both the linear and stripe DM targets provide a .merge
> method.
>
>> Sorry for over generalizing a bit.
>
> No problem.

^ permalink raw reply	[flat|nested] 36+ messages in thread
