From: Chris Murphy <lists@colorremedies.com>
To: Leszek Dubiel <leszek@dubiel.pl>
Cc: Chris Murphy <lists@colorremedies.com>,
	Btrfs BTRFS <linux-btrfs@vger.kernel.org>,
	Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Subject: Re: very slow "btrfs dev delete" 3x6Tb, 7Tb of data
Date: Fri, 3 Jan 2020 12:15:17 -0700
Message-ID: <CAJCQCtSG0nEEahu+KLxKCu3LYWFaA4Tp77Ai1NDmSSdtGc0w7g@mail.gmail.com>
In-Reply-To: <283b1c8a-9923-4612-0bbf-acb2a731e726@dubiel.pl>

On Fri, Jan 3, 2020 at 2:08 AM Leszek Dubiel <leszek@dubiel.pl> wrote:
>
>  >> # iotop -d30
>  >>
>  >> Total DISK READ:        34.12 M/s | Total DISK WRITE: 40.36 M/s
>  >> Current DISK READ:      34.12 M/s | Current DISK WRITE:      79.22 M/s
>  >>    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN IO> COMMAND
>  >>   4596 be/4 root       34.12 M/s   37.79 M/s  0.00 % 91.77 % btrfs
>  >
>  > Not so bad for many small file reads and writes with HDD. I've seen
>  > this myself with a single spindle when doing small file reads and
>  > writes.

It's not the small files directly; it's the number of write requests
per second, which results in high-latency seeking. The cause of that
seeking needs a second opinion before we can be certain it's really
related to small files.

I'm not really sure why there are hundreds of write requests per
second. It seems to me that with thousands of small files, Btrfs can
aggregate them into a single, mostly sequential write, and do the same
for the metadata writes; yes, there is some back-and-forth seeking
since metadata and data block groups are in different physical
locations. But hundreds of times per second? Hmmm. I'm suspicious
about why. It must be trying to read and write hundreds of small files
*in different locations*, causing the seeks and the ensuing latency.
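One way to confirm that, assuming sysstat's iostat is available, is to
watch per-device write request rates and latency while this workload
runs; sda, sdb and sdc are the disks from your "btrfs dev usage"
output:

    # extended per-device stats, 30 second samples; watch w/s (write
    # requests per second after merging), w_await (average write
    # latency in ms), and %util
    iostat -dxm sda sdb sdc 30

Hundreds of w/s combined with w_await in the tens of milliseconds and
%util near 100 would point at seek-bound writes rather than
bandwidth-bound writes.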

The typical workaround for this these days is to add more disks or add
an SSD. If you add a fourth disk, you reduce your one big bottleneck
(a rough sketch of the commands is at the end of this message):


> root@wawel:~# btrfs dev usag /
> /dev/sda2, ID: 2
>     Device size:             5.45TiB
>     Device slack:              0.00B
>     Data,RAID1:              2.62TiB
>     Metadata,RAID1:         22.00GiB
>     Unallocated:             2.81TiB
>
> /dev/sdb2, ID: 3
>     Device size:             5.45TiB
>     Device slack:              0.00B
>     Data,RAID1:              2.62TiB
>     Metadata,RAID1:         21.00GiB
>     System,RAID1:           32.00MiB
>     Unallocated:             2.81TiB
>
> /dev/sdc3, ID: 4
>     Device size:            10.90TiB
>     Device slack:            3.50KiB
>     Data,RAID1:              5.24TiB
>     Metadata,RAID1:         33.00GiB
>     System,RAID1:           32.00MiB
>     Unallocated:             5.62TiB

OK, this is important. Two equal-size drives, and the third is much
larger. The RAID1 chunk allocator prefers the devices with the most
unallocated space, and with 5.62TiB unallocated versus 2.81TiB on each
of the other two, the big drive is part of every new chunk pair. That
means writes are going to be IO bound to that single large device,
because it's always being written to. The reads get spread out
somewhat.
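You can see the allocation pattern with the table view of the usage
command (the -T option, in reasonably recent btrfs-progs): new data
and metadata chunks should keep landing on sdc3 plus one of the two
smaller devices until unallocated space evens out.

    # per-device table of data/metadata/system chunk allocation
    btrfs filesystem usage -T /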

Again, maybe the everyday workload is the one to focus on, because
it's not such a big deal for a device replace to take overnight. That
said, it would be good for everyone's use case if it turns out there's
some optimization possible to avoid hundreds of write requests per
second just because of small files.
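Going back to adding a fourth disk: assuming the new partition is
/dev/sdd2 (just a placeholder, use whatever device you actually add)
and the filesystem is the one mounted at /, it would look roughly
like this:

    # add the new device to the mounted filesystem
    btrfs device add /dev/sdd2 /

    # optional: a full balance spreads existing RAID1 chunks across
    # all four devices; without it, only newly allocated chunks
    # benefit from the extra spindle. A full balance on this much
    # data will itself take a long time.
    btrfs balance start /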



-- 
Chris Murphy
