From: Manuel Riel
Subject: Re: Linux RAID with btrfs stuck and consume 100 % CPU
Date: Thu, 11 Feb 2021 11:14:56 +0800
To: Vojtech Myslivec
Cc: Chris Murphy, Michal Moravec, Linux-RAID <linux-raid@vger.kernel.org>,
    songliubraving@fb.com, guoqing.jiang@cloud.ionos.com

I'm also hitting the exact same problem, with XFS on RAID6 using a RAID1
write journal on two NVMes. CentOS 8, kernel 4.18.0-240.10.1.el8_3.x86_64.

Symptoms:

- high CPU usage of the md4_raid6 process
- IO wait goes up
- IO on that file system locks up for tens of minutes and the kernel
  reports:

[Wed Feb 10 23:23:05 2021] INFO: task md4_reclaim:1070 blocked for more than 20 seconds.
[Wed Feb 10 23:23:05 2021]       Not tainted 4.18.0-240.10.1.el8_3.x86_64 #1
[Wed Feb 10 23:23:05 2021] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Wed Feb 10 23:23:05 2021] md4_reclaim     D    0  1070      2 0x80004000

I already confirmed it's not a timeout mismatch: no drive errors are
reported, and SCT Error Recovery Control is set to 7 seconds.
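(For reference, this is how I check and set ERC with smartmontools;
/dev/sda is just an example device here. The first command queries the
current setting, the second sets read/write recovery to 7 seconds - the
value is in units of 100 ms, so 70 means 7 s:)

# smartctl -l scterc /dev/sda
# smartctl -l scterc,70,70 /dev/sda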
>> It's kind of a complicated setup. When this problem happens, can you
>> check swap pressure?

There is a RAID1 swap partition, but it's almost unused, since the
server has ample RAM.

>> /sys/fs/cgroup/memory.stat
>>
>> pgfault and maybe also pgmajfault - see if they're going up; or also
>> you can look at vmstat and see how heavily swap is being used at the
>> time. The thing is:
>>
>> Any heavy eviction means writes to dm-0 -> md0 raid1 -> sdg+sdh
>> SSDs, which are the same SSDs that you have the md1 raid6 mdadm
>> journal going to. So if you have any kind of swap pressure, it very
>> likely will stall the journal or at least substantially slow it down,
>> and you get blocked tasks as the pressure builds more and more,
>> because now you have a ton of dirty writes in Btrfs that can't make
>> it to disk.

I've disabled swap to test this theory.

>> If there is minimal swap usage, then this hypothesis is false and
>> something else is going on. I also don't have an explanation for why
>> your workaround works.

> Sadly, I am not able to _disable the journal_. If I do - just by
> removing the device from the array - the MD device instantly fails and
> the btrfs volume remounts read-only. I cannot find any other way to
> disable the journal; it seems it is not supported. I can see only an
> `--add-journal` option and no corresponding `--delete-journal` one.
>
> I welcome any advice on how to exchange the write-journal for an
> internal bitmap.

I read that the array needs to be in read-only mode. Then you can fail
and replace the write journal. (not tested)

# mdadm --readonly /dev/md0
# mdadm /dev/md0 --fail /dev/
# mdadm --manage /dev/md0 --add-journal /dev/

> Any other possible changes that come to my mind are:
> - Enlarge the write-journal

My write journal is about 180 GB.

> - Move the write-journal to the physical sdg/sdh SSDs (out of the md0
>   raid1 device).

I may try this, but as you say, it's risky, especially when using
"write-back" journal mode.

> I find the latter a bit risky, as the write-journal is not redundant
> then. That's the reason we chose a write journal on a RAID device.

I'm also experimenting with write-back/write-through mode and different
stripe_cache_size values, hoping to find something. If nothing helps, it
may simply not be possible/supported to put a write journal on another
RAID device?
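In case it helps anyone debugging the same thing, these are the knobs
I'm toggling via sysfs (md4 is my array name - adjust as needed;
journal_mode accepts write-through or write-back, and stripe_cache_size
defaults to 256):

# cat /sys/block/md4/md/journal_mode
# echo write-back > /sys/block/md4/md/journal_mode
# echo 4096 > /sys/block/md4/md/stripe_cache_size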