From: Daniel Janzon
Date: Tue, 29 Oct 2019 08:47:29 +0000
Subject: [linux-lvm] Best way to run LVM over multiple SW RAIDs?

Hello,

I have a server with very high load using four NVMe SSDs and therefore no HW
RAID. Instead I used SW RAID with the mdadm tool. Using one RAID5 volume does
not work well since the driver can only utilize one CPU core, which spikes at
100% and harms performance. Therefore I created 8 partitions on each disk, and
8 RAID5s across the four disks.

Now I want to bring them together with LVM. If I do not use a striped volume I
get high performance (of the expected magnitude according to disk specs). But
when I use a striped volume, performance drops by an order of magnitude. The
reason I am looking for a striped setup is to ensure that data is spread well
over the drives to guarantee good worst-case performance. With linear
allocation rather than striping, if load is directed to files on the first PV
(a SW RAID) the system is again exposed to the 1-core limitation.

I tried "--stripes 8 --stripesize 512", and would appreciate any ideas of
other things to try. I guess the performance hit could be in the file system
as well. I tried XFS and EXT4 with default settings.

Kind Regards,
Daniel
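For readers who want to picture the layout Daniel describes, here is a minimal
sketch. The device names (/dev/nvme0n1 through /dev/nvme3n1), the partitioning
scheme and the sizes are assumptions for illustration, not details from the
original post.

    # Sketch only: 8 partitions per NVMe disk, 8 four-way RAID5s, then a striped LV.
    # Device names and partition sizes are assumed; adapt to the actual system.
    for d in /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1; do
        parted -s "$d" mklabel gpt
        for i in $(seq 1 8); do
            parted -s "$d" mkpart "p$i" $(( (i-1)*12 ))% $(( i*12 ))%
        done
    done

    # One RAID5 per partition index, each spanning all four disks.
    for i in $(seq 1 8); do
        mdadm --create "/dev/md$i" --level=5 --raid-devices=4 \
            "/dev/nvme0n1p$i" "/dev/nvme1n1p$i" "/dev/nvme2n1p$i" "/dev/nvme3n1p$i"
    done

    # The striped LV variant Daniel tried: one stripe per RAID5, 512 KiB stripe size.
    pvcreate /dev/md{1..8}
    vgcreate datavg /dev/md{1..8}
    lvcreate --stripes 8 --stripesize 512 -l 100%FREE -n data datavg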
From: Anatoly Pugachev
Date: Sat, 7 Dec 2019 19:16:20 +0300
Subject: Re: [linux-lvm] Best way to run LVM over multiple SW RAIDs?

On Tue, Oct 29, 2019 at 12:14 PM Daniel Janzon wrote:
> [...]

Daniel,

a bit more about your system? Like kernel version, IO scheduler, etc.
Have you tried the MQ (multi-queue) schedulers available in recent
kernels (noop, deadline)?
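The scheduler in use can be checked and changed per device through sysfs; a
small sketch, assuming an NVMe device named nvme0n1 and a root shell:

    # Show the available schedulers; the active one is shown in brackets.
    cat /sys/block/nvme0n1/queue/scheduler

    # Switch the scheduler at runtime (the value must be one of those listed above).
    echo mq-deadline > /sys/block/nvme0n1/queue/scheduler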
From: Roberto Fastec
Date: Sat, 07 Dec 2019 18:37:45 +0100
Subject: Re: [linux-lvm] Best way to run LVM over multiple SW RAIDs?

Have you thought about RAID 50?

On 7 December 2019 17:17, Anatoly Pugachev wrote:
> [...]
Gathman" In-Reply-To: <1p3erjcoc4qsk3gplvduhoep.1575740265800@gmail.com> Message-ID: References: <1p3erjcoc4qsk3gplvduhoep.1575740265800@gmail.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Subject: Re: [linux-lvm] Best way to run LVM over multiple SW RAIDs? Reply-To: LVM general discussion and development List-Id: LVM general discussion and development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , List-Id: Content-Type: text/plain; charset="us-ascii"; format="flowed" To: LVM general discussion and development On Tue, Oct 29, 2019 at 12:14 PM Daniel Janzon wrote: > I have a server with very high load using four NVMe SSDs and > therefore no HW RAID. Instead I used SW RAID with the mdadm tool. > Using one RAID5 volume does not work well since the driver can only > utilize one CPU core which spikes at 100% and harms performance. > Therefore I created 8 partitions on each disk, and 8 RAID5s across > the four disks. > Now I want to bring them together with LVM. If I do not use a striped > volume I get high performance (in expected magnitude according to disk > specs). But when I use a striped volume, performance drops to a > magnitude below. The reason I am looking for a striped setup is to The mdadm layer already does the striping. So doing it again in the LVM layer completely screws it up. You want plain JBOD (Just a Bunch Of Disks). -- Stuart D. Gathman "Confutatis maledictis, flammis acribus addictis" - background song for a Microsoft sponsored "Where do you want to go from here?" commercial. From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mimecast-mx02.redhat.com (mimecast06.extmail.prod.ext.rdu2.redhat.com [10.11.55.22]) by smtp.corp.redhat.com (Postfix) with ESMTPS id 4D1C2A2895 for ; Sat, 7 Dec 2019 22:44:07 +0000 (UTC) Received: from us-smtp-1.mimecast.com (us-smtp-delivery-1.mimecast.com [205.139.110.120]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-SHA384 (256/256 bits)) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id E6E0818E6C56 for ; Sat, 7 Dec 2019 22:44:06 +0000 (UTC) Received: from quad.stoffel.org (66-189-75-104.dhcp.oxfr.ma.charter.com [66.189.75.104]) (using TLSv1.2 with cipher ADH-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.stoffel.org (Postfix) with ESMTPSA id 2D5111E12B for ; Sat, 7 Dec 2019 17:44:03 -0500 (EST) MIME-Version: 1.0 Message-ID: <24044.11058.338208.602498@quad.stoffel.home> Date: Sat, 7 Dec 2019 17:44:02 -0500 From: "John Stoffel" In-Reply-To: References: <1p3erjcoc4qsk3gplvduhoep.1575740265800@gmail.com> Content-Transfer-Encoding: 8bit Subject: Re: [linux-lvm] Best way to run LVM over multiple SW RAIDs? Reply-To: LVM general discussion and development List-Id: LVM general discussion and development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , List-Id: Content-Type: text/plain; charset="us-ascii" To: LVM general discussion and development >>>>> "Stuart" == Stuart D Gathman writes: Stuart> On Tue, Oct 29, 2019 at 12:14 PM Daniel Janzon wrote: >> I have a server with very high load using four NVMe SSDs and >> therefore no HW RAID. Instead I used SW RAID with the mdadm tool. >> Using one RAID5 volume does not work well since the driver can only >> utilize one CPU core which spikes at 100% and harms performance. >> Therefore I created 8 partitions on each disk, and 8 RAID5s across >> the four disks. >> Now I want to bring them together with LVM. 
From: "John Stoffel"
Date: Sat, 7 Dec 2019 17:44:02 -0500
Subject: Re: [linux-lvm] Best way to run LVM over multiple SW RAIDs?

Stuart> The mdadm layer already does the striping. So doing it again
Stuart> in the LVM layer completely screws it up. You want plain JBOD
Stuart> (Just a Bunch Of Disks).

Umm... not really. The problem here is more the MD layer not being
able to run RAID5 across multiple cores at the same time, which is why
he split things the way he did.

But we don't know the kernel version, the LVM version, or the OS
release, so it's hard to give better ideas of what to do.

The biggest harm to performance here is really the RAID5, and if you
can instead move to RAID 10 (mirror then stripe across mirrors) then
you should see a performance boost.

As Daniel says, he's got lots of disk load, but plenty of CPU, so the
single thread for RAID5 is a big bottleneck.

I assume he wants to use LVM so he can create volume(s) larger than
individual RAID5 volumes, so in that case, I'd probably just build a
regular non-striped LVM VG holding all your RAID5 devices. Hopefully
the parity is spread across all the partitions, though NVMe drives
should have enough IOPS capacity to mask the RMW cost of RAID5 to a
degree.

In any case, I'd just build it like:

    pvcreate /dev/md#            (do for each of the 8 RAID5 MD devices)
    vgcreate datavg /dev/md[#-#] (give all 8 RAID5 MD devices here)
    lvcreate -n <name> -L <size> datavg

And then test your performance. Since you only have four disks, the 8
RAID5 volumes in your VG are all going to suck for small writes, but
NVMe SSDs will mask that to an extent.

If you can, I'd get more SSDs and move to RAID1+0 (RAID10) instead,
though you do have the problem where a double disk failure could kill
your data if it happens to both halves of a mirror.

But, numbers talk, BS walks. So if the original poster can provide
some details and numbers... then maybe we can help more.

John
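Spelled out for the eight arrays, that linear (non-striped) volume group might
look like the sketch below. The device names /dev/md1 through /dev/md8 and the
LV/VG names are assumptions for illustration:

    # Plain linear allocation across all eight RAID5 arrays, no --stripes.
    for md in /dev/md{1..8}; do
        pvcreate "$md"
    done
    vgcreate datavg /dev/md{1..8}
    lvcreate -n data -l 100%FREE datavg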
From: "Stuart D. Gathman"
Date: Sat, 7 Dec 2019 18:14:05 -0500 (EST)
Subject: Re: [linux-lvm] Best way to run LVM over multiple SW RAIDs?

On Sat, 7 Dec 2019, John Stoffel wrote:

> The biggest harm to performance here is really the RAID5, and if you
> can instead move to RAID 10 (mirror then stripe across mirrors) then
> you should see a performance boost.

Yeah, that's what I do. RAID10, and use LVM to join them together as
JBOD. I forgot about the RAID5 bottleneck part, sorry.

> As Daniel says, he's got lots of disk load, but plenty of CPU, so the
> single thread for RAID5 is a big bottleneck.

> I assume he wants to use LVM so he can create volume(s) larger than
> individual RAID5 volumes, so in that case, I'd probably just build a
> regular non-striped LVM VG holding all your RAID5 devices.

Wait, that's what I suggested!

> If you can, I'd get more SSDs and move to RAID1+0 (RAID10) instead,
> though you do have the problem where a double disk failure could kill
> your data if it happens to both halves of a mirror.

No worse than RAID5. In fact, better, because the 2nd fault always
kills the RAID5, but only has a 33% or less chance of killing the
RAID10. (And in either case, it is usually just specific sectors, not
the entire drive, and other manual recovery techniques can come into
play.)

--
Stuart D. Gathman
"Confutatis maledictis, flammis acribus addictis" - background song for
a Microsoft sponsored "Where do you want to go from here?" commercial.
From: Gionatan Danti
Date: Sun, 08 Dec 2019 12:57:36 +0100
Subject: Re: [linux-lvm] Best way to run LVM over multiple SW RAIDs?

On 08-12-2019 00:14, Stuart D. Gathman wrote:
> On Sat, 7 Dec 2019, John Stoffel wrote:
>> The biggest harm to performance here is really the RAID5, and if you
>> can instead move to RAID 10 (mirror then stripe across mirrors) then
>> you should see a performance boost.
>
> Yeah, that's what I do. RAID10, and use LVM to join them together as
> JBOD. I forgot about the RAID5 bottleneck part, sorry.
>
>> If you can, I'd get more SSDs and move to RAID1+0 (RAID10) instead,
>> though you do have the problem where a double disk failure could kill
>> your data if it happens to both halves of a mirror.
>
> No worse than RAID5. In fact, better, because the 2nd fault always
> kills the RAID5, but only has a 33% or less chance of killing the
> RAID10. (And in either case, it is usually just specific sectors, not
> the entire drive, and other manual recovery techniques can come into
> play.)

While I agree with both (especially regarding RAID10), I propose another
setup: an MD RAID0 of the eight MD RAID5 arrays. If I remember
correctly, LVM striping code is based on device mapper rather than on
the MD RAID code. Maybe the latter is more efficient at striping on
fast NVMe drives?

Regards.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8
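A rough sketch of that layered setup, assuming the eight RAID5 arrays are
/dev/md1 through /dev/md8 and a 64 KiB chunk size (both are illustrative
choices, not values from the thread):

    # Stripe the eight RAID5 arrays together with MD RAID0, then put LVM on top.
    mdadm --create /dev/md100 --level=0 --raid-devices=8 --chunk=64 /dev/md{1..8}
    pvcreate /dev/md100
    vgcreate datavg /dev/md100
    lvcreate -n data -l 100%FREE datavg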
From: "John Stoffel"
Date: Sun, 8 Dec 2019 17:51:39 -0500
Subject: Re: [linux-lvm] Best way to run LVM over multiple SW RAIDs?

Stuart> Yeah, that's what I do. RAID10, and use LVM to join them
Stuart> together as JBOD. I forgot about the RAID5 bottleneck part,
Stuart> sorry.

Yeah, it's not ideal, and I don't know enough about the code to figure
out if it's even possible to fix that issue without major restructuring.

>> I assume he wants to use LVM so he can create volume(s) larger than
>> individual RAID5 volumes, so in that case, I'd probably just build a
>> regular non-striped LVM VG holding all your RAID5 devices.

Stuart> Wait, that's what I suggested!

Must have missed that, sorry! Again, let's see if the original poster
can provide more details of the setup.

Stuart> No worse than RAID5. In fact, better, because the 2nd fault
Stuart> always kills the RAID5, but only has a 33% or less chance of
Stuart> killing the RAID10. (And in either case, it is usually just
Stuart> specific sectors, not the entire drive, and other manual
Stuart> recovery techniques can come into play.)

I don't know the failure mode of NVMe drives, but a bunch of SSDs I've
seen didn't so much fail single sectors as just up and die instantly,
without any chance of recovery. So I worry about the NVMe drive failure
modes, and I'd want some hot spares in the system if at all possible,
because you know they're going to fail just as you get home and stop
checking email... so having it rebuild automatically is a big help.

If your business can afford it. Can it afford not to? :-)

John
From: Daniel Janzon
Date: Mon, 9 Dec 2019 10:26:55 +0000
Subject: Re: [linux-lvm] Best way to run LVM over multiple SW RAIDs?

> From: "John Stoffel"
> Stuart> The mdadm layer already does the striping. So doing it again
> Stuart> in the LVM layer completely screws it up. You want plain JBOD
> Stuart> (Just a Bunch Of Disks).
> Umm... not really. The problem here is more the MD layer not being
> able to run RAID5 across multiple cores at the same time, which is why
> he split things the way he did.

Exactly. The md driver executes on a single core, but with a bunch of
RAID5s I can distribute the load over many cores. That's also why I
cannot join the bunch of RAID5s with a RAID0 (as someone suggested),
because then again all data is pulled through a single core.

> But we don't know the kernel version, the LVM version, or the OS
> release, so it's hard to give better ideas of what to do.

It is Red Hat 7, kernel 3.10, and the scheduler seems to be "[none]
mq-deadline kyber" according to /sys/block/nvme0n1/queue/scheduler. LVM
version 2.02.185(2)-RHEL7. But I wonder if fine-tuning e.g. the IO
scheduler is going to cut it, since I am looking for something like a
10x improvement.

> The biggest harm to performance here is really the RAID5, and if you
> can instead move to RAID 10 (mirror then stripe across mirrors) then
> you should see a performance boost.

The origin of my problem is indeed the poor performance of RAID5, which
maxes out the single core the driver runs on. But if I accept that as a
given, the next problem is LVM striping, since I do get 10x better
performance with linear allocation.

> As Daniel says, he's got lots of disk load, but plenty of CPU, so the
> single thread for RAID5 is a big bottleneck.

Yes. That should be fixed since NVMe SSDs now outperform a single CPU
core. But that's a topic for another mailing list, I suppose.

> I assume he wants to use LVM so he can create volume(s) larger than
> individual RAID5 volumes, so in that case, I'd probably just build a
> regular non-striped LVM VG holding all your RAID5 devices. Hopefully
> the parity is spread across all the partitions, though NVMe drives
> should have enough IOPS capacity to mask the RMW cost of RAID5 to a
> degree.

The problem is the linear allocation of LVM. It will tend to fill the
first RAID5 first, then the next, etc. The access pattern is such that
files written close in time will be read close in time. We have live
video streams being written and read 24/7. What I want to avoid is that
at some point a majority of all reads end up on a single RAID5, which
will then fail to perform. Bound to happen in an always-on system.

> In any case, I'd just build it like:
>
>     pvcreate /dev/md#            (do for each of the 8 RAID5 MD devices)
>     vgcreate datavg /dev/md[#-#] (give all 8 RAID5 MD devices here)
>     lvcreate -n <name> -L <size> datavg

I think this is basically what I did, what I refer to as a "linearly
allocated" as opposed to a striped group. It does indeed perform well
most of the time, but has, I believe, a poor guarantee for the worst
case.

> If you can, I'd get more SSDs and move to RAID1+0 (RAID10) instead,
> though you do have the problem where a double disk failure could kill
> your data if it happens to both halves of a mirror.

Yes, throwing money at the problem is a good way to solve it. I was
hoping to avoid that for this application, since I thought I had just
done something wrong with the stripes.

Kind Regards,
Daniel
From: Guoqing Jiang
Date: Mon, 9 Dec 2019 11:40:22 +0100
Subject: Re: [linux-lvm] Best way to run LVM over multiple SW RAIDs?

On 12/7/19 11:44 PM, John Stoffel wrote:
> Umm... not really. The problem here is more the MD layer not being
> able to run RAID5 across multiple cores at the same time, which is why
> he split things the way he did.
>
> As Daniel says, he's got lots of disk load, but plenty of CPU, so the
> single thread for RAID5 is a big bottleneck.

Perhaps setting "/sys/block/mdX/md/group_thread_cnt" could help here,
see the commits below:

commit b721420e8719131896b009b11edbbd27d9b85e98
Author: Shaohua Li
Date:   Tue Aug 27 17:50:42 2013 +0800

    raid5: sysfs entry to control worker thread number

commit 851c30c9badfc6b294c98e887624bff53644ad21
Author: Shaohua Li
Date:   Wed Aug 28 14:30:16 2013 +0800

    raid5: offload stripe handle to workqueue

Thanks,
Guoqing
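As a concrete sketch of that suggestion, assuming the eight RAID5 arrays are
/dev/md1 through /dev/md8, a root shell, and a kernel that carries the two
commits above (the worker count of 4 is an arbitrary example value):

    # Let each RAID5 array spread stripe handling over 4 worker threads.
    # The sysfs file only exists on kernels with the 2013 raid5 workqueue patches.
    for md in /dev/md{1..8}; do
        echo 4 > "/sys/block/$(basename "$md")/md/group_thread_cnt"
    done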
From: Marian Csontos
Date: Mon, 9 Dec 2019 15:26:04 +0100
Subject: Re: [linux-lvm] Best way to run LVM over multiple SW RAIDs?

On 12/9/19 11:26 AM, Daniel Janzon wrote:
> The origin of my problem is indeed the poor performance of RAID5, which
> maxes out the single core the driver runs on. But if I accept that as a
> given, the next problem is LVM striping, since I do get 10x better
> performance with linear allocation.

What stripesize was used for the striped LV? IIRC the default is 64k.

IIUC you are serving mostly large files. I have no numbers, and no HW to
test the hypothesis, but using a larger stripesize could help here, as
it would still split the load over multiple RAID5 volumes while not
splitting the IOs too early into too many small requests.

--
Marian
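For illustration, a striped LV with a much larger stripe size than the 64k
default could be created like this; the 1 MiB value, the VG name and the LV
name are assumptions for the sketch, not a tested recommendation from the
thread:

    # One stripe per RAID5, 1 MiB (1024 KiB) stripe size, so each request stays
    # reasonably large before it is split across the eight arrays.
    lvcreate --stripes 8 --stripesize 1024 -l 100%FREE -n data datavg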
From: Gionatan Danti
Date: Tue, 10 Dec 2019 12:23:14 +0100
Subject: Re: [linux-lvm] Best way to run LVM over multiple SW RAIDs?

On 09/12/19 11:26, Daniel Janzon wrote:
> Exactly. The md driver executes on a single core, but with a bunch of
> RAID5s I can distribute the load over many cores. That's also why I
> cannot join the bunch of RAID5s with a RAID0 (as someone suggested),
> because then again all data is pulled through a single core.

MD RAID0 is extremely fast; using a single core at the striping level
should pose no problem. Did you actually try this setup?

Anyway, the suggestion from Guoqing Jiang sounds promising. Let me quote
him:

> Perhaps setting "/sys/block/mdX/md/group_thread_cnt" could help here,
> see the commits below: [...]

Regards.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8


From: "John Stoffel"
Date: Tue, 10 Dec 2019 16:29:18 -0500
Subject: Re: [linux-lvm] Best way to run LVM over multiple SW RAIDs?

Gionatan> MD RAID0 is extremely fast; using a single core at the
Gionatan> striping level should pose no problem. Did you actually try
Gionatan> this setup?

Gionatan> Anyway, the suggestion from Guoqing Jiang sounds promising.
Gionatan> Let me quote him:
>> Perhaps setting "/sys/block/mdX/md/group_thread_cnt" could help here,
>> see the commits below: [...]

I think this requires a much newer kernel, but since he's running on
RHEL7 using kernel 3.10.x with RH patches and such, that feature
doesn't exist. I just checked on one of my RHEL7.6 systems and I don't
see that option. And I just set up a four-device RAID5 and it doesn't
have that option either.

So I think maybe you need to try:

    mdadm -C -l 0 -c 64 md_stripe /dev/md_raid5[1-8]

But thinking some more, maybe you want to pin the RAID5 threads for
each of your RAID5s to a separate CPU using cpusets? Maybe that will
help performance?

But wait, why would you use an MD stripe on top of the RAID5 setup? Or
are you? Can you please provide the setup of the system?

    cat /proc/mdstat
    vgs -av
    pvs -av
    lvs -av

Just so we can look at what you're doing? Also, what's the queue depth
of your devices? Maybe with NVMe you can bump it up higher? Or maybe it
wants to be lower... something else to check.

John


From: Daniel Janzon
Date: Mon, 16 Dec 2019 08:22:56 +0000
Subject: [linux-lvm] Best way to run LVM over multiple SW RAIDs?

> From: Guoqing Jiang
> On 12/7/19 11:44 PM, John Stoffel wrote:
>> As Daniel says, he's got lots of disk load, but plenty of CPU, so the
>> single thread for RAID5 is a big bottleneck.
> Perhaps setting "/sys/block/mdX/md/group_thread_cnt" could help here,

Now I finally had a chance to test this. It turns out to work great! It
is not as fast as a non-raided, linearly allocated LVM volume (about
half of the performance, without getting a fat tail of high read/write
response times). So there is a price for redundancy, but that is worth
it in my application. It's now up there in the same magnitude.

Thanks a lot Guoqing! You really helped me a lot here. I'd also like to
thank John Stoffel for valuable input.