From: Dennis Zhou <dennis@kernel.org> To: David Sterba <dsterba@suse.com>, Chris Mason <clm@fb.com>, Josef Bacik <josef@toxicpanda.com>, Omar Sandoval <osandov@osandov.com> Cc: kernel-team@fb.com, linux-btrfs@vger.kernel.org, Dennis Zhou <dennis@kernel.org> Subject: [PATCH v3 00/22] btrfs: async discard support Date: Wed, 20 Nov 2019 16:50:59 -0500 Message-ID: <cover.1574282259.git.dennis@kernel.org> (raw) Hello, There were only a few minor things that changed from v2 to v3 in addition to the rebase on top of misc-next. The notable changes: - Non-async discarding continues to retrim regions because we load the block groups as fully trimmed. This gives us a way to reconcile the filesystem state on device after reboot with fstrim if deisred. - Moved BTRFS_TRIM_STATE_UNTRIMMED to be 0 in the enum. - Kept discard sysfs under debug/. - Rebased on top of btrfs-devel#misc-next fa17ed069c61. Renamed instances of cache -> block_group to adjust to the rename of btrfs_block_group_cache -> btrfs_block_group. --- The RFC [1] went a lot better than expected, so let's call this v2. Changes: - I don't believe there are any functional changes to the code, but there are plenty of changes for readability and a lot of comments added. - Split out the sysfs parts into their own patches as it was a bit of work to put discard/ under debug/. - Switched over any flags to be enums. All the variables are now of appropriate type rather than being all atomics. - Dropped 0015 [2] from v1, load block groups on mount. This is in favor of reading everything in as clean and running fstrim as needed. - Misc. bug fixes. - I changed the block_group discard delay from 300s to 120s. It is just to give the allocators a chance to reuse the LBA before letting async discard take a crack at it. Data: On a number of webservers, I collected data every minute accounting the time we spent in btrfs_finish_extent_commit() (col. 1) and in btrfs_commit_transaction() (col. 2). btrfs_finish_extent_commit() is where we discard extents synchronously before returning them to the free space cache. discard=sync: p99 total per minute p99 total per minute Drive | extent_commit() (ms) | commit_trans() (ms) --------------------------------------------------------------- Drive A | 434 | 1170 Drive B | 880 | 2330 Drive C | 2943 | 3920 Drive D | 4763 | 5701 discard=async: p99 total per minute p99 total per minute Drive | extent_commit() (ms) | commit_trans() (ms) -------------------------------------------------------------- Drive A | 134 | 956 Drive B | 64 | 1972 Drive C | 59 | 1032 Drive D | 62 | 1200 While it's not great that the stats are cumulative over 1m, all of these servers are running the same workload and and the delta between the two are substantial. We are spending significantly less time in btrfs_finish_extent_commit() which is responsible for discarding. From v1: -------- Discard is an operation that allows for the filesystem to communicate with underlying ssds that a lba region is no longer needed. This gives the drive the more information as it tries to manage the available free space to minimize write amplification. However, discard hasn't been given the most tlc. Discard is a problematic command because a drive's physical block management is more or less a black box to us and the effects of any particular discard aren't necessarily limited the lifetime of a command. Currently, btrfs handles discarding synchronously during transaction commit. This problematically can delay transaction commit based on the amount of space that needs to be trimmed and the efficacy of the discard operation for a particular drive. This series introduces async discarding, which removes discard from the transaction commit path. While every SSD has the choice of implementing trim support different, we strive here to do the right thing. The idea hinges on recognizing that write amplification really only kicks in once we're really low on free space. As long as we trim enough to keep a large enough pool of free space, in theory this should minimize the cost of issuing discards on a workload and have limited cost overhead in write amplification. With async discard, we try to emphasize discarding larger regions and reusing the lba (implicit discard). The first is done by using the free space cache to maintain discard state and thus allows us to get coalescing for fairly cheap. A background workqueue is used to scan over an LRU kept list of the block groups. It then uses filters to determine what to discard next hence giving priority to larger discards. While reusing an lba isn't explicitly attempted, it happens implicitly via find_free_extent() which if it happens to find a dirty extent, will grant us reuse of the lba. Additionally, async discarding skips metadata block groups as these should see a fairly high turnover as btrfs is a self-packing filesystem being stingy with allocating new block groups until necessary. Preliminary results seem promising as when a lot of freeing is going on, the discarding is delayed allowing for reuse which translates to less discarding (in addition to the slower discarding). This has shown a reduction in p90 and p99 read latencies on a test on our webservers. I am currently working on tuning the rate at which it discards in the background. I am doing this by evaluating other workloads and drives. The iops and bps rate limits are fairly aggressive right now as my basic survey of a few drives noted that the trim command itself is a significant part of the overhead. So optimizing for larger trims is the right thing to do. This series contains the following 22 patches and is on top of btrfs-devel#misc-next fa17ed069c61: 0001-bitmap-genericize-percpu-bitmap-region-iterators.patch 0002-btrfs-rename-DISCARD-opt-to-DISCARD_SYNC.patch 0003-btrfs-keep-track-of-which-extents-have-been-discarde.patch 0004-btrfs-keep-track-of-cleanliness-of-the-bitmap.patch 0005-btrfs-add-the-beginning-of-async-discard-discard-wor.patch 0006-btrfs-handle-empty-block_group-removal.patch 0007-btrfs-discard-one-region-at-a-time-in-async-discard.patch 0008-btrfs-add-removal-calls-for-sysfs-debug.patch 0009-btrfs-make-UUID-debug-have-its-own-kobject.patch 0010-btrfs-add-discard-sysfs-directory.patch 0011-btrfs-track-discardable-extents-for-async-discard.patch 0012-btrfs-keep-track-of-discardable_bytes.patch 0013-btrfs-calculate-discard-delay-based-on-number-of-ext.patch 0014-btrfs-add-bps-discard-rate-limit.patch 0015-btrfs-limit-max-discard-size-for-async-discard.patch 0016-btrfs-make-max-async-discard-size-tunable.patch 0017-btrfs-have-multiple-discard-lists.patch 0018-btrfs-only-keep-track-of-data-extents-for-async-disc.patch 0019-btrfs-keep-track-of-discard-reuse-stats.patch 0020-btrfs-add-async-discard-header.patch 0021-btrfs-increase-the-metadata-allowance-for-the-free_s.patch 0022-btrfs-make-smaller-extents-more-likely-to-go-into-bi.patch 0001 exports percpu's bitmap iterators for eventual use in 0011. 0002 renames DISCARD to DISCARD_SYNC. 0003 and 0004 adds discard tracking to the free space cache. 0005-0007 adds the core of async discard support. 0008-0010 modify debug sysfs to allow for adding discard/ under debug/. 0011-0016 fiddle with stats and operation limits. 0018 makes async discarding only track data block groups. 0019 adds reuse stats. 0020 adds an explanation header to discard.c. 0021 and 0022 modify the free space cache metadata allowance, add a bitmap -> extent path and makes us more likely to put smaller extents into the bitmaps. diffstats below: Dennis Zhou (22): bitmap: genericize percpu bitmap region iterators btrfs: rename DISCARD opt to DISCARD_SYNC btrfs: keep track of which extents have been discarded btrfs: keep track of cleanliness of the bitmap btrfs: add the beginning of async discard, discard workqueue btrfs: handle empty block_group removal btrfs: discard one region at a time in async discard btrfs: add removal calls for sysfs debug/ btrfs: make UUID/debug have its own kobject btrfs: add discard sysfs directory btrfs: track discardable extents for async discard btrfs: keep track of discardable_bytes btrfs: calculate discard delay based on number of extents btrfs: add bps discard rate limit btrfs: limit max discard size for async discard btrfs: make max async discard size tunable btrfs: have multiple discard lists btrfs: only keep track of data extents for async discard btrfs: keep track of discard reuse stats btrfs: add async discard header btrfs: increase the metadata allowance for the free_space_cache btrfs: make smaller extents more likely to go into bitmaps fs/btrfs/Makefile | 2 +- fs/btrfs/block-group.c | 56 ++- fs/btrfs/block-group.h | 30 ++ fs/btrfs/ctree.h | 52 ++- fs/btrfs/discard.c | 679 ++++++++++++++++++++++++++++++++++++ fs/btrfs/discard.h | 46 +++ fs/btrfs/disk-io.c | 15 +- fs/btrfs/extent-tree.c | 23 +- fs/btrfs/free-space-cache.c | 585 ++++++++++++++++++++++++++----- fs/btrfs/free-space-cache.h | 39 ++- fs/btrfs/inode-map.c | 13 +- fs/btrfs/scrub.c | 7 +- fs/btrfs/super.c | 39 ++- fs/btrfs/sysfs.c | 205 ++++++++++- include/linux/bitmap.h | 35 ++ mm/percpu.c | 61 +--- 16 files changed, 1734 insertions(+), 153 deletions(-) create mode 100644 fs/btrfs/discard.c create mode 100644 fs/btrfs/discard.h Thanks, Dennis
next reply index Thread overview: 26+ messages / expand[flat|nested] mbox.gz Atom feed top 2019-11-20 21:50 Dennis Zhou [this message] 2019-11-20 21:51 ` [PATCH 01/22] bitmap: genericize percpu bitmap region iterators Dennis Zhou 2019-11-20 21:51 ` [PATCH 02/22] btrfs: rename DISCARD opt to DISCARD_SYNC Dennis Zhou 2019-11-20 21:51 ` [PATCH 03/22] btrfs: keep track of which extents have been discarded Dennis Zhou 2019-11-20 21:51 ` [PATCH 04/22] btrfs: keep track of cleanliness of the bitmap Dennis Zhou 2019-11-23 0:17 ` kbuild test robot 2019-11-25 13:59 ` David Sterba 2019-11-25 19:37 ` Dennis Zhou 2019-11-20 21:51 ` [PATCH 05/22] btrfs: add the beginning of async discard, discard workqueue Dennis Zhou 2019-11-20 21:51 ` [PATCH 06/22] btrfs: handle empty block_group removal Dennis Zhou 2019-11-20 21:51 ` [PATCH 07/22] btrfs: discard one region at a time in async discard Dennis Zhou 2019-11-20 21:51 ` [PATCH 08/22] btrfs: add removal calls for sysfs debug/ Dennis Zhou 2019-11-20 21:51 ` [PATCH 09/22] btrfs: make UUID/debug have its own kobject Dennis Zhou 2019-11-20 21:51 ` [PATCH 10/22] btrfs: add discard sysfs directory Dennis Zhou 2019-11-20 21:51 ` [PATCH 11/22] btrfs: track discardable extents for async discard Dennis Zhou 2019-11-20 21:51 ` [PATCH 12/22] btrfs: keep track of discardable_bytes Dennis Zhou 2019-11-20 21:51 ` [PATCH 13/22] btrfs: calculate discard delay based on number of extents Dennis Zhou 2019-11-20 21:51 ` [PATCH 14/22] btrfs: add bps discard rate limit Dennis Zhou 2019-11-20 21:51 ` [PATCH 15/22] btrfs: limit max discard size for async discard Dennis Zhou 2019-11-20 21:51 ` [PATCH 16/22] btrfs: make max async discard size tunable Dennis Zhou 2019-11-20 21:51 ` [PATCH 17/22] btrfs: have multiple discard lists Dennis Zhou 2019-11-20 21:51 ` [PATCH 18/22] btrfs: only keep track of data extents for async discard Dennis Zhou 2019-11-20 21:51 ` [PATCH 19/22] btrfs: keep track of discard reuse stats Dennis Zhou 2019-11-20 21:51 ` [PATCH 20/22] btrfs: add async discard header Dennis Zhou 2019-11-20 21:51 ` [PATCH 21/22] btrfs: increase the metadata allowance for the free_space_cache Dennis Zhou 2019-11-20 21:51 ` [PATCH 22/22] btrfs: make smaller extents more likely to go into bitmaps Dennis Zhou
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=cover.1574282259.git.dennis@kernel.org \ --to=dennis@kernel.org \ --cc=clm@fb.com \ --cc=dsterba@suse.com \ --cc=josef@toxicpanda.com \ --cc=kernel-team@fb.com \ --cc=linux-btrfs@vger.kernel.org \ --cc=osandov@osandov.com \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
Linux-BTRFS Archive on lore.kernel.org Archives are clonable: git clone --mirror https://lore.kernel.org/linux-btrfs/0 linux-btrfs/git/0.git # If you have public-inbox 1.1+ installed, you may # initialize and index your mirror using the following commands: public-inbox-init -V2 linux-btrfs linux-btrfs/ https://lore.kernel.org/linux-btrfs \ linux-btrfs@vger.kernel.org public-inbox-index linux-btrfs Example config snippet for mirrors Newsgroup available over NNTP: nntp://nntp.lore.kernel.org/org.kernel.vger.linux-btrfs AGPL code for this site: git clone https://public-inbox.org/public-inbox.git