Date: Sat, 2 May 2020 00:18:26 -0400
From: Zygo Blaxell
To: Phil Karn
Cc: Alexandru Dordea, Chris Murphy, Btrfs BTRFS
Subject: Re: Extremely slow device removals
Message-ID: <20200502041826.GH10769@hungrycats.org>

On Fri, May 01, 2020 at 12:29:50AM -0700, Phil Karn wrote:
> On 4/30/20 23:05, Alexandru Dordea wrote:
> > Don’t get me wrong, the
> > single 100% CPU is only during balance process.
> > By running "btrfs device delete missing /storage" there is no impact on CPU/RAM. I do have 64GB DDR4 ECC but there is no more of 3GB ram usage.
> 3GB used for what, does that include the system buffer cache?
> >
> > I can see that @Chris Murphy mention that disabling the cache will impact performance. Did you tried that?
> > On my devices I do have cache enabled and till now this is the only thing that I didn't tried :)
> >
> It didn't seem to make an obvious difference, which surprised me a
> little since the I/O seems so random. Maybe btrfs is already sticking a
> lot of fences (barriers) into the write stream so the drive can't do
> much reordering anyway?

btrfs can send gigabytes of metadata IO per minute to a drive, enough to
overwhelm even the largest device write caches.  So even if you use 100%
of a drive's 256MB of on-board RAM as write cache, the gigabytes that
follow in a large metadata update won't get much benefit from caching.
The drive will be stuck a quarter gigabyte behind the host, trying to
catch up all the time.

Also, in large delete operations, half of the IOs are random _reads_,
which can't be optimized by write caching.  The writes are mostly
sequential, so they take less IO time.  So, say, 1% of the IO time is
made 80% faster by write caching, for a net benefit of 0.8% (not real
numbers).

Write caching helps fsync() performance and not much else.

A writeback SSD cache can have a significant beneficial effect on latency
until it gets full, but if it's not big enough to hold the metadata then
it won't be very helpful; in the worst case it will make btrfs slower.

> I've always left write caching enabled in my drives since my system is
> plugged into reliable power. I assume the only reason to turn it off is
> to reduce the chance of filesystem corruption in case I have to force
> the machine to reboot while the operation is still going.

The big surprise for write caches is what happens when the drive gets a
UNC (uncorrectable) sector.  Some drive firmwares work properly under
normal and power-loss conditions, but immediately drop the contents of
the write cache when they see an unreadable block.  This turns an
otherwise completely survivable error--a small number of consecutive bad
sectors on a single-disk filesystem--into a btrfs damaged beyond repair.

In this event, metadata writes will be dropped in both copies of dup
metadata, but the error is not reported to btrfs by the drive firmware
because it happens after the drive has reported successful completion of
the relevant flush command to the host.

Write caching in drives without command queueing assumes that the drive
will be able to complete the flush command before a power failure
interrupts it.  Usually firmware doesn't take sector read retries or
external vibration events into account, but those events also prevent
the drive from executing any further write commands from the host, so
write ordering is preserved.  In the UNC case, the firmware drops the
write cache and also keeps accepting new write commands, which is a
bug--it should do at most one of those two things.  The result is
unrecoverable metadata loss on btrfs.

Reliable power and crash avoidance won't help in this case--the
filesystem will die while it's still mounted.  If you find out the hard
way that you have a drive with firmware that does this, the only recourse
is to turn off write caching (and make sure it stays off), mkfs, and
start restoring backups.

> Down to only 1.99 TB now! Wow!
>
> --Phil
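
Postscript on the write-caching estimate in the reply above ("1% of the
IO time is made 80% faster ... for a net benefit of 0.8%"): a minimal
Python sketch of that arithmetic.  The function name is hypothetical, the
inputs are the message's own illustrative figures (explicitly "not real
numbers"), and "80% faster" is assumed to mean the affected portion takes
80% less time, which is the reading under which the 0.8% figure works out.

```python
def net_time_saved(fraction_affected: float, time_reduction: float) -> float:
    """Fraction of total IO time saved.

    fraction_affected: share of total IO time the optimization touches (0..1)
    time_reduction:    how much shorter that share becomes (0..1);
                       assumption: "80% faster" means 80% less time
    """
    if not (0.0 <= fraction_affected <= 1.0 and 0.0 <= time_reduction <= 1.0):
        raise ValueError("both arguments must be between 0 and 1")
    # Only the affected share shrinks; the rest of the IO time is untouched.
    return fraction_affected * time_reduction


# Illustrative figures from the message, not measurements.
saved = net_time_saved(0.01, 0.80)
print(f"net benefit: {saved:.1%} of total IO time")
```

The point of the sketch is only that the benefit is capped by the share
of IO time that write caching can touch at all; the random reads that
dominate a large delete are outside that share.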