From: Marc MERLIN <marc@merlins.org>
To: Martin <m_btrfs@ml1.co.uk>, Xavier Nicollet <nicollet@jeru.org>
Cc: linux-btrfs@vger.kernel.org, Josef Bacik <jbacik@fb.com>,
	Chris Mason <clm@fb.com>
Subject: Re: How to debug very very slow file delete? (btrfs on md-raid5 with many files, 70GB metadata)
Date: Thu, 10 Apr 2014 10:07:34 -0700	[thread overview]
Message-ID: <20140410170734.GZ10789@merlins.org> (raw)
In-Reply-To: <20140325164142.GN12833@merlins.org>

So, since then I found out in the thread
Subject: Re: btrfs on 3.14rc5 stuck on "btrfs_tree_read_lock sync"
that my btrfs filesystem has a clear problem, which Josef and Chris are
still looking into.

Basically, I've had btrfs near deadlocks on this filesystem:
INFO: task btrfs-transacti:3633 blocked for more than 120 seconds.
Not tainted 3.14.0-rc5-amd64-i915-preempt-20140216c #1
INFO: task btrfs-cleaner:3571 blocked for more than 120 seconds.
Not tainted 3.14.0-amd64-i915-preempt-20140216 #2

They are thinking it's due to a balancing issue that 3.15 might fix.

One interesting piece of information I've found since yesterday is that now
that I've mounted the filesystem with -o ro,recovery, its speed has improved
very noticeably.
I'm currently copying my data off it about 10x faster.



What follows is for people interested in optimization.

I have swraid5 with dmcrypt on top, and then btrfs.
http://superuser.com/questions/305716/bad-performance-with-linux-software-raid5-and-luks-encryption
says:
"LUKS has a bottleneck, that is it just spawns one thread per block device.

Are you placing the encryption on top of the RAID 5? Then from the point of
view of your OS you just have one device, then it is using just one thread
for all those disks, meaning disks are working in a serial way rather than
parallel."
but it was disputed in a reply.
Does someone know if this is still valid/correct in 3.14?


Since I'm going to recreate the filesystem considering the troubles I've had
with it, I might as well do it better this time :)
(but doing the copy back will take days, so I'd rather get it right the first time)

How would you recommend I create the array when I rebuild it?

This filesystem contains many backups with many files, most of them small, and
ideally identical files are hardlinked together (many files, many hardlinks):
gargamel:~# btrfs fi df /mnt/btrfs_pool2
Data, single: total=3.28TiB, used=2.29TiB
System, DUP: total=8.00MiB, used=384.00KiB
System, single: total=4.00MiB, used=0.00
Metadata, DUP: total=74.50GiB, used=70.11GiB  <<< muchos metadata
Metadata, single: total=8.00MiB, used=0.00

This is my current array:
gargamel:~# mdadm --detail /dev/md8
/dev/md8:
        Version : 1.2
  Creation Time : Thu Mar 25 20:15:00 2010
     Raid Level : raid5
     Array Size : 7814045696 (7452.05 GiB 8001.58 GB)
  Used Dev Size : 1953511424 (1863.01 GiB 2000.40 GB)
    Persistence : Superblock is persistent
  Intent Bitmap : Internal
         Layout : left-symmetric
     Chunk Size : 512K

#1 move the intent bitmap to another device. I have /boot on swraid1 with
   ext4, so I'll likely use this.
#2 change chunk size to something smaller? 128K better?
#3 anything else?
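For #1 and #2, here is a rough sketch of the mdadm commands involved, assuming
/dev/md8 and a bitmap file on /boot (the bitmap file name and the member-device
names are placeholders of mine; check mdadm(8) before running anything like this):

```shell
# Drop the internal write-intent bitmap, then attach an external one
# kept on a different filesystem (a hypothetical file on /boot here).
mdadm --grow /dev/md8 --bitmap=none
mdadm --grow /dev/md8 --bitmap=/boot/md8-intent.bitmap

# Chunk size is normally set at creation time; when recreating the array,
# a smaller chunk could be requested (member devices below are placeholders):
# mdadm --create /dev/md8 --level=5 --raid-devices=5 --chunk=128 /dev/sd[bcdef]1
```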

Then, I used this for dmcrypt:
cryptsetup luksFormat --align-payload=8192 -s 256 -c aes-xts-plain64  

The align-payload was good for my SSD, but probably not for a hard drive.
http://wiki.drewhess.com/wiki/Creating_an_encrypted_filesystem_on_a_partition
says
"To calculate this value, multiply your RAID chunk size in bytes by the
number of data disks in the array (N/2 for RAID 1, N-1 for RAID 5 and N-2
for RAID 6), and divide by 512 bytes per sector."

So 512KiB * 4 data disks / 512 bytes per sector = 4096 sectors.
In other words, I can use align-payload=4096 for a small reduction of write
amplification, or align-payload=1024 if I change my raid chunk size to 128KiB.

Correct?
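A quick sanity check of that arithmetic (the function name and interface below
are mine, purely for illustration, not from cryptsetup or mdadm):

```python
def align_payload_sectors(chunk_kib: int, total_disks: int, level: int) -> int:
    """LUKS --align-payload value (512-byte sectors) for one full RAID stripe."""
    # data disks per stripe: N/2 for RAID 1(+0), N-1 for RAID 5, N-2 for RAID 6
    data_disks = {1: total_disks // 2, 5: total_disks - 1, 6: total_disks - 2}[level]
    stripe_bytes = chunk_kib * 1024 * data_disks
    return stripe_bytes // 512

# 5-disk RAID 5 (4 data disks), 512 KiB chunk -> 4096 sectors (a 2 MiB stripe)
print(align_payload_sectors(512, 5, 5))   # 4096
# same array with a 128 KiB chunk -> 1024 sectors
print(align_payload_sectors(128, 5, 5))   # 1024
```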

But from what I can see, those will only be small improvements compared to the
btrfs performance problems I've seen, which hopefully 3.15 will address in some way.

Other bits I found that can maybe help others:
http://superuser.com/questions/305716/bad-performance-with-linux-software-raid5-and-luks-encryption

This seems to help work around the write amplification a bit:
for i in /sys/block/md*/md/stripe_cache_size; do echo 16384 > "$i"; done

This looks like an easy thing, done.
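One caveat: values written under /sys don't survive a reboot, so something like
this hypothetical boot-time fragment (e.g. in /etc/rc.local) would be needed to
make it stick:

```shell
# Re-apply the md stripe cache size at boot; /sys settings are not persistent.
for f in /sys/block/md*/md/stripe_cache_size; do
    # the glob stays literal if no md arrays exist, so test writability first
    [ -w "$f" ] && echo 16384 > "$f"
done
```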

If you have other suggestions/comments, please share :)

Thanks,
Marc


On Tue, Mar 25, 2014 at 09:41:42AM -0700, Marc MERLIN wrote:
> On Tue, Mar 25, 2014 at 12:13:50PM +0000, Martin wrote:
> > On 25/03/14 01:49, Marc MERLIN wrote:
> > > I had a tree with some amount of thousand files (less than 1 million)
> > > on top of md raid5.
> > > 
> > > It took 18H to rm it in 3 tries:
> 
> I ran another test after typing the original Email:
> gargamel:/mnt/dshelf2/backup/polgara# time du -sh 20140312-feisty/; time find 20140312-feisty/ | wc -l
> 17G     20140312-feisty/
> real    245m19.491s
> user    0m2.108s
> sys     1m0.508s
> 
> 728507 <- number of files
> real    11m41.853s <- 11 min to re-stat them when they should all be in cache ideally
> user    0m1.040s
> sys     0m4.360s
> 
> 4 hours to stat 700K files. That's bad...
> Even 11 min just to re-stat them to count them looks bad too.
> 
> > > I checked that btrfs scrub is not running.
> > > What else can I check from here?
> > 
> > "noatime" set?
> 
> I have relatime
> gargamel:/mnt/dshelf2/backup/polgara# df .
> Filesystem           1K-blocks       Used  Available Use% Mounted on
> /dev/mapper/dshelf2 7814041600 3026472436 4760588292  39% /mnt/dshelf2/backup
> 
> gargamel:/mnt/dshelf2/backup/polgara# grep /mnt/dshelf2/backup /proc/mounts
> /dev/mapper/dshelf2 /mnt/dshelf2/backup btrfs rw,relatime,compress=lzo,space_cache 0 0
>  
> > What's your cpu hardware wait time?
>  
> Sorry, not sure how to get that.
>  
> > And is not *the 512kByte raid chunk* going to give you horrendous write
> > amplification?! For example, rm updates a few bytes in one 4kByte
> > metadata block and the system has to then do a read-modify-write on
> > 512kBytes...
> 
> That's probably not great, but
> 1) rm -rf should bunch a lot of writes together before they start
> hitting the block layer, so I'm not sure that is too much of a
> problem with the caching layer in between
> 
> 2) this does not explain 4 hours just to run du with relatime, which
> shouldn't generate any writes, correct?
> iostat seems to confirm:
> 
> gargamel:~# iostat /dev/md8 1 20
> Linux 3.14.0-rc5-amd64-i915-preempt-20140216c (gargamel.svh.merlins.org)        03/25/2014      _x86_64_        (4 CPU)
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle  
>           75.19    0.00   10.13    8.61    0.00    6.08
> Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
> md8              98.00       392.00         0.00        392          0
> md8              96.00       384.00         0.00        384          0
> md8              83.00       332.00         0.00        332          0
> md8             153.00       612.00         0.00        612          0
> md8              82.00       328.00         0.00        328          0
> md8              55.00       220.00         0.00        220          0
> md8              69.00       276.00         0.00        276          0
> 
> > Also, the 64MByte chunk bit-intent map will add a lot of head seeks to
> > anything you do on that raid. (The map would be better on a separate SSD
> > or other separate drive.)
> 
> That's true for writing, but not reading, right?
>  
> > So... That sort of setup is fine for archived data that is effectively
> > read-only. You'll see poor performance for small writes/changes.
> 
> So I agree with you that the write case can be improved, especially since I also have a layer
> of dmcrypt in the middle
> gargamel:/mnt/dshelf2/backup/polgara# cryptsetup luksDump /dev/md8
> LUKS header information for /dev/md8
> Cipher name:    aes
> Cipher mode:    xts-plain64
> Hash spec:      sha1
> Payload offset: 8192
> 
> (I used cryptsetup luksFormat --align-payload=8192 -s 256 -c aes-xts-plain64)
> 
> I'm still not convinced that a lot of file I/O doesn't all get collated in memory
> before hitting disk in bigger blocks, but maybe not.
> 
> If I were to recreate this array entirely, what would you use for the raid creation
> and cryptsetup?
> 
> More generally, before I go through all that trouble (it will likely
> take 1 week of data copying back and forth), I'd like to debug why my reads are
> so slow first.
> 
> Thanks,
> Marc
> 
> On Tue, Mar 25, 2014 at 02:57:57PM +0100, Xavier Nicollet wrote:
> > Le 25 mars 2014 à 12:13, Martin a écrit:
> > > On 25/03/14 01:49, Marc MERLIN wrote:
> > > > It took 18H to rm it in 3 tries:
> > 
> > > And is not *the 512kByte raid chunk* going to give you horrendous write
> > > amplification?! For example, rm updates a few bytes in one 4kByte
> > > metadata block and the system has to then do a read-modify-write on
> > > 512kBytes...
> > 
> > My question may be naive, but would it be possible to have a syscall or something to do
> > a fast "rm -rf" or du?
> 
> Well, that wouldn't hurt either, even if it wouldn't address my underlying problem.
> 
> Marc
> -- 
> "A mouse is a device used to point at the xterm you want to type in" - A.S.R.
> Microsoft is to operating systems ....
>                                       .... what McDonalds is to gourmet cooking
> Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901

-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  
