linux-lvm.redhat.com archive mirror
* [linux-lvm] Best way to run LVM over multiple SW RAIDs?
@ 2019-10-29  8:47 Daniel Janzon
  2019-12-07 16:16 ` Anatoly Pugachev
  0 siblings, 1 reply; 14+ messages in thread
From: Daniel Janzon @ 2019-10-29  8:47 UTC (permalink / raw)
  To: linux-lvm


Hello,

I have a server with a very high load using four NVMe SSDs and therefore no HW RAID. Instead I used SW RAID with the mdadm tool. Using one RAID5 volume does not work well, since the md driver can only utilize one CPU core, which spikes at 100% and harms performance. Therefore I created 8 partitions on each disk and built 8 RAID5 arrays across the four disks.
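
For reference, the layout was built along these lines (a sketch only; device
names and options are illustrative, assuming partitions p1..p8 already exist
on each of the four drives, not the exact commands I ran):

  # one RAID5 per partition index, spanning all four NVMe drives
  for i in $(seq 1 8); do
      mdadm --create /dev/md$i --level=5 --raid-devices=4 \
          /dev/nvme0n1p$i /dev/nvme1n1p$i /dev/nvme2n1p$i /dev/nvme3n1p$i
  done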

Now I want to bring them together with LVM. If I do not use a striped volume I get high performance (in the expected magnitude according to the disk specs), but when I use a striped volume, performance drops by an order of magnitude. The reason I am looking for a striped setup is to ensure that data is spread well over the drives, to guarantee good worst-case performance. With linear allocation rather than striping, if load is directed to files on the first PV (a SW RAID), the system is again exposed to the 1-core limitation.

I tried "--stripes 8 --stripesize 512", and would appreciate any ideas for other things to try. I guess the performance hit could also be in the file system; I tried XFS and EXT4 with default settings.
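
Roughly, the LVM commands looked like this (a sketch; the VG/LV names and the
size are placeholders, and 512 is the stripe size in KiB):

  pvcreate /dev/md[1-8]
  vgcreate datavg /dev/md[1-8]
  # striped variant (what I tried, slow):
  lvcreate --stripes 8 --stripesize 512 -n data -L 10T datavg
  # or, the linear variant (fast, but fills one PV after another):
  lvcreate -n data -L 10T datavg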

Kind Regards,
Daniel


* Re: [linux-lvm] Best way to run LVM over multiple SW RAIDs?
@ 2019-12-09 10:26 Daniel Janzon
  2019-12-09 14:26 ` Marian Csontos
  2019-12-10 11:23 ` Gionatan Danti
  0 siblings, 2 replies; 14+ messages in thread
From: Daniel Janzon @ 2019-12-09 10:26 UTC (permalink / raw)
  To: linux-lvm


> From: "John Stoffel" <john@stoffel.org>

> Stuart> The mdadm layer already does the striping.  So doing it again
> Stuart> in the LVM layer completely screws it up.  You want plain JBOD
> Stuart> (Just a Bunch Of Disks).

> Umm... not really.  The problem here is more the MD layer not being
> able to run RAID5 across multiple cores at the same time, which is why
> he split things the way he did.

Exactly. The md driver executes on a single core, but with a bunch of RAID5s
I can distribute the load over many cores. That's also why I cannot join the
bunch of RAID5s with a RAID0 (as someone suggested): then all data is again
pulled through a single core.
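
You can see the bottleneck directly: each array has its own mdX_raid5 kernel
thread, and per-thread CPU usage shows which one is saturated (a sketch,
assuming the arrays are named md1..md8):

  ps -eo pid,psr,pcpu,comm | grep raid5     # one thread per array, and the CPU it runs on
  top -H                                    # interactive per-thread CPU view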

> But we don't know the Kernel version, the LVM version, or the OS
> release so as to give better ideas of what to do.

It is Red Hat 7, kernel 3.10. The active I/O scheduler is "none" according to
/sys/block/nvme0n1/queue/scheduler (it shows "[none] mq-deadline kyber").
The LVM version is 2.02.185(2)-RHEL7.

But I wonder if fine-tuning e.g. the I/O scheduler is going to cut it, since I am
looking for something like a 10x improvement.
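
For the record, the scheduler can be inspected and switched per device via the
standard sysfs interface (mq-deadline below is just an example choice):

  cat /sys/block/nvme0n1/queue/scheduler             # e.g. "[none] mq-deadline kyber"
  echo mq-deadline > /sys/block/nvme0n1/queue/scheduler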

> The biggest harm to performance here is really the RAID5, and if you
> can instead move to RAID 10 (mirror then stripe across mirrors) then
> you should see a performance boost.

The origin of my problem is indeed the poor performance of RAID5,
which maxes out the single core the driver runs on. But if I accept that
as a given, the next problem is LVM striping, since I get 10x better
performance with linear allocation.

> As Daniel says, he's got lots of disk load, but plenty of CPU, so the
> single thread for RAID5 is a big bottleneck.

Yes. That should be fixed since NVMe SSDs now outperform a single
CPU core. But that's a topic for another mailing list I suppose.

> I assume he wants to use LVM so he can create volume(s) larger than
> individual RAID5 volumes, so in that case, I'd probably just build a
> regular non-striped LVM VG holding all your RAID5 disks.  Hopefully
> the Parity disk is spread across all the partitions, though NVMe
> drives should have enough IOPs capacity to mask the RMW cost of RAID5
> to a degree.

The problem is the linear allocation of LVM. It will tend to fill the first
RAID5 first, then the next, and so on. The access pattern is such that files
written close in time will be read close in time: we have live video
streams being written and read 24/7. What I want to avoid is that at
some point a majority of all reads end up on a single RAID5, which
will then fail to perform. That is bound to happen in an always-on system.
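
The allocation pattern is easy to see from the LVM reporting commands (the VG
name datavg is a placeholder):

  lvs -o lv_name,seg_start_pe,seg_size,devices datavg   # which RAID5 each segment sits on
  pvs -o pv_name,pv_size,pv_used datavg                 # the first PVs fill up first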

> In any case, I'd just build it like:
>
>  pvcreate /dev/md#     (do for each of 8 RAID5 MD devices)
>  vgcreate datavg /dev/md[#-#]   (give all 8 RAID5 MD devices here)
>  lvcreate -n "name" -L <size> datavg

I think this is basically what I did: what I refer to as a "linearly allocated"
volume group, as compared to a striped one. It does indeed perform well most of
the time, but it has, I believe, a poor guarantee for the worst case.


> If you can, I'd get more SSDs and move to RAID1+0 (RAID10) instead,
> though you do have the problem where a double disk failure could kill
> your data if it happens to both halves of a mirror.

Yes, throwing money at the problem is a good way to solve it. I was
hoping to avoid that for this application, since I thought I had just done
something wrong with the stripes.

Kind Regards,
Daniel

* [linux-lvm] Best way to run LVM over multiple SW RAIDs?
@ 2019-12-16  8:22 Daniel Janzon
  0 siblings, 0 replies; 14+ messages in thread
From: Daniel Janzon @ 2019-12-16  8:22 UTC (permalink / raw)
  To: linux-lvm

> From: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
>> On 12/7/19 11:44 PM, John Stoffel wrote:
>> As Daniel says, he's got lots of disk load, but plenty of CPU, so the
>> single thread for RAID5 is a big bottleneck.

> Perhaps setting "/sys/block/mdx/md/group_thread_cnt" could help here,

Now I finally had a chance to test this. It turns out to work great! It's not as fast as a non-RAIDed, linearly allocated LVM volume (about half the performance, without getting a fat tail of high read/write response times), so there is a price for redundancy, but it is worth it in my application. It's now up there in the same order of magnitude.
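
For reference, the knob is set per array through sysfs; a sketch (the thread
count of 4 is just an example, and the setting does not persist across reboots
unless scripted):

  # give each RAID5 array a pool of stripe-handling worker threads
  for f in /sys/block/md*/md/group_thread_cnt; do echo 4 > "$f"; done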

Thanks a lot Guoqing! You really helped me a lot here. I'd also like to thank John Stoffel for valuable input.



Thread overview: 14+ messages
2019-10-29  8:47 [linux-lvm] Best way to run LVM over multiple SW RAIDs? Daniel Janzon
2019-12-07 16:16 ` Anatoly Pugachev
2019-12-07 17:37   ` Roberto Fastec
2019-12-07 20:34     ` Stuart D. Gathman
2019-12-07 22:44       ` John Stoffel
2019-12-07 23:14         ` Stuart D. Gathman
2019-12-08 11:57           ` Gionatan Danti
2019-12-08 22:51           ` John Stoffel
2019-12-09 10:40         ` Guoqing Jiang
2019-12-09 10:26 Daniel Janzon
2019-12-09 14:26 ` Marian Csontos
2019-12-10 11:23 ` Gionatan Danti
2019-12-10 21:29   ` John Stoffel
2019-12-16  8:22 Daniel Janzon
