From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
References: <20171019184916.GB23283@redhat.com>
In-Reply-To: <20171019184916.GB23283@redhat.com>
From: Joe Thornber
Date: Fri, 20 Oct 2017 11:07:50 +0000
Content-Type: multipart/alternative; boundary="001a114419b06ca111055bf87c62"
Subject: Re: [linux-lvm] cache on SSD makes system unresponsive
Reply-To: LVM general discussion and development
List-Id: LVM general discussion and development
To: Mike Snitzer, Oleg Cherkasov
Cc: ejt@redhat.com, linux-lvm@redhat.com

--001a114419b06ca111055bf87c62
Content-Type: text/plain; charset="UTF-8"

I can't look at this until Sunday.  But if it's something that only
exhibits in writeback mode rather than writethrough, then I'd guess it's
to do with the list of writeback work that the policy builds.  So check
whether the list is growing endlessly, and check the work object is
being freed once the copy has completed.

On Thu, 19 Oct 2017 at 19:49 Mike Snitzer wrote:

> On Thu, Oct 19 2017 at 1:54pm -0400,
> Oleg Cherkasov wrote:
>
> > Hi,
> >
> > Recently I have decided to try out LVM cache feature on one of our
> > Dell NX3100 servers running CentOS 7.4.1708 with 110Tb disk array
> > (hardware RAID5 with H710 and H830 Dell adapters).  Two SSD disks
> > each 256Gb are in hardware RAID1 using H710 adapter with primary and
> > extended partitions so I decided to make ~240Gb LVM cache to see if
> > system I/O may be improved.  The server is running Bareos storage
> > daemon and beside sshd and Dell OpenManage monitoring does not have
> > any other services.  Unfortunately testing did not go as I expected,
> > nonetheless at the end system is up and running with no data
> > corrupted.
> >
> > Initially I have tried the default writethrough mode and after
> > running dd reading test with 250Gb file got system unresponsive for
> > roughly 15min with cache allocation around 50%.  Writing to disks
> > seems to speed up the system, however marginally, so around 10% on my
> > tests and I did manage to pull more than 32Tb via backup from
> > different hosts and once system became unresponsive to ssh and icmp
> > requests however for a very short time.
> >
> > I thought it may be something with cache mode so switched to
> > writeback via lvconvert and ran dd reading test again with 250Gb
> > file however that time everything went completely unexpected.
> > System started to respond slowly to simple user interactions like
> > list files and run top.  And then became completely unresponsive for
> > about half an hour.  Switching to main console via iLO I saw a lot
> > of OOM messages and kernel tried to survive therefore randomly
> > killed almost all processes.  Eventually I did manage to reboot and
> > immediately uncached the array.
> >
> > My question is about very strange behavior of LVM cache.  Well, I
> > may expect no performance boost or even I/O degradation however I do
> > not expect to run out of memory and then OOM kicks in.  That server
> > has only 12Gb RAM however it does run only sshd, bareos SD daemon and
> > OpenManage java based monitoring system so no RAM problems were
> > noticed for last few years running without LVM cache.
> >
> > Any ideas what may be wrong?  I have second NX3200 server with
> > similar hardware setup and it would be switched to FreeBSD 11.1 with
> > ZFS very soon however I may try to install CentOS 7.4 first and
> > see if the problem may be reproduced.
> >
> > LVM2 installed is version lvm2-2.02.171-8.el7.x86_64.
>
> Your experience is _not_ unique.  It is unfortunate but there would seem
> to be some systemic issues with dm-cache being too resource heavy.  Not
> aware of any particular issue(s) yet.
>
> I'm focusing on this now since we've had some internal reports that
> writeback is quite slow (and that tests don't complete).  That IO
> latencies are high.  Etc.
>
> I'll work through it and likely enlist Joe Thornber's help next week.
>
> I'll keep you posted as progress is made though.
>
> Thanks,
> Mike
>

--001a114419b06ca111055bf87c62
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
--001a114419b06ca111055bf87c62--