From: Chris Murphy <lists@colorremedies.com>
To: Leszek Dubiel <leszek@dubiel.pl>
Cc: Chris Murphy <lists@colorremedies.com>,
	Btrfs BTRFS <linux-btrfs@vger.kernel.org>,
	Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Subject: Re: very slow "btrfs dev delete" 3x6Tb, 7Tb of data
Date: Fri, 3 Jan 2020 12:15:17 -0700
Message-ID: <CAJCQCtSG0nEEahu+KLxKCu3LYWFaA4Tp77Ai1NDmSSdtGc0w7g@mail.gmail.com>
In-Reply-To: <283b1c8a-9923-4612-0bbf-acb2a731e726@dubiel.pl>

On Fri, Jan 3, 2020 at 2:08 AM Leszek Dubiel <leszek@dubiel.pl> wrote:
>
>  >> # iotop -d30
>  >>
>  >> Total DISK READ:        34.12 M/s | Total DISK WRITE: 40.36 M/s
>  >> Current DISK READ:      34.12 M/s | Current DISK WRITE:      79.22 M/s
>  >>    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN IO> COMMAND
>  >>   4596 be/4 root       34.12 M/s   37.79 M/s  0.00 % 91.77 % btrfs
>  >
>  > Not so bad for many small file reads and writes with HDD. I've seen
>  > this myself with a single spindle when doing small file reads and
>  > writes.

It's not the small files directly; it's the number of write requests
per second, which results in high-latency seeking. The cause of that
seeking needs a second opinion before we can be certain it's really
related to small files.

I'm not really sure why there are hundreds of write requests per
second. It seems to me that with thousands of small files, Btrfs can
aggregate them into a single, mostly sequential write, and do the same
for the metadata writes; yes, there is some back-and-forth seeking
since metadata and data block groups are in different physical
locations. But hundreds of times per second? Hmmm. I'm suspicious
about why. It must be trying to read and write hundreds of small files
*in different locations*, causing the seeks and the ensuing latency.
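One way to confirm that, assuming sysstat's iostat is available, is to
watch per-device write request rates and latency while this workload
runs; sda, sdb and sdc are the disks from your "btrfs dev usage"
output:

    # extended per-device stats, 30 second samples; watch w/s (write
    # requests per second after merging), w_await (average write
    # latency in ms), and %util
    iostat -dxm sda sdb sdc 30

Hundreds of w/s combined with w_await in the tens of milliseconds and
%util near 100 would point at seek-bound writes rather than
bandwidth-bound writes.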

The typical workaround for this these days is to add more disks or add
an SSD. If you add a fourth disk, you reduce your one big bottleneck
(a rough sketch of the commands is at the end of this message):


> root@wawel:~# btrfs dev usag /
> /dev/sda2, ID: 2
>     Device size:             5.45TiB
>     Device slack:              0.00B
>     Data,RAID1:              2.62TiB
>     Metadata,RAID1:         22.00GiB
>     Unallocated:             2.81TiB
>
> /dev/sdb2, ID: 3
>     Device size:             5.45TiB
>     Device slack:              0.00B
>     Data,RAID1:              2.62TiB
>     Metadata,RAID1:         21.00GiB
>     System,RAID1:           32.00MiB
>     Unallocated:             2.81TiB
>
> /dev/sdc3, ID: 4
>     Device size:            10.90TiB
>     Device slack:            3.50KiB
>     Data,RAID1:              5.24TiB
>     Metadata,RAID1:         33.00GiB
>     System,RAID1:           32.00MiB
>     Unallocated:             5.62TiB

OK, this is important. Two equal-size drives, and the third is much
larger. The RAID1 chunk allocator prefers the devices with the most
unallocated space, and with 5.62TiB unallocated versus 2.81TiB on each
of the other two, the big drive is part of every new chunk pair. That
means writes are going to be IO bound to that single large device,
because it's always being written to. The reads get spread out
somewhat.
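You can see the allocation pattern with the table view of the usage
command (the -T option, in reasonably recent btrfs-progs): new data
and metadata chunks should keep landing on sdc3 plus one of the two
smaller devices until unallocated space evens out.

    # per-device table of data/metadata/system chunk allocation
    btrfs filesystem usage -T /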

Again, maybe the everyday workload is the one to focus on, because
it's not such a big deal for a device replace to take overnight. That
said, it would be good for everyone's use case if it turns out there's
some optimization possible to avoid hundreds of write requests per
second just because of small files.
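Going back to adding a fourth disk: assuming the new partition is
/dev/sdd2 (just a placeholder, use whatever device you actually add)
and the filesystem is the one mounted at /, it would look roughly
like this:

    # add the new device to the mounted filesystem
    btrfs device add /dev/sdd2 /

    # optional: a full balance spreads existing RAID1 chunks across
    # all four devices; without it, only newly allocated chunks
    # benefit from the extra spindle. A full balance on this much
    # data will itself take a long time.
    btrfs balance start /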



-- 
Chris Murphy
