Re: [PATCH 05/22] btrfs: add the beginning of async discard, discard workqueue

All of lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH 05/22] btrfs: add the beginning of async discard, discard workqueue
       [not found] <201911220351.HPI9gxNo%lkp@intel.com>
@ 2019-11-22  4:27   ` Nick Desaulniers
  0 siblings, 0 replies; 23+ messages in thread
From: Nick Desaulniers @ 2019-11-22  4:27 UTC (permalink / raw)
  To: dennis
  Cc: kbuild, clang-built-linux, kbuild test robot, kbuild-all,
	David Sterba, Chris Mason, Josef Bacik, osandov, kernel-team,
	linux-btrfs

Hi Dennis,
Below is a 0day bot report from a build w/ Clang. Warning looks legit,
can you please take a look?

On Thu, Nov 21, 2019 at 11:27 AM kbuild test robot <lkp@intel.com> wrote:
>
> CC: kbuild-all@lists.01.org
> In-Reply-To: <63d3257efe1158a6fbbd7abe865cd9250b494438.1574282259.git.dennis@kernel.org>
> References: <63d3257efe1158a6fbbd7abe865cd9250b494438.1574282259.git.dennis@kernel.org>
> TO: Dennis Zhou <dennis@kernel.org>
> CC: David Sterba <dsterba@suse.com>, Chris Mason <clm@fb.com>, Josef Bacik <josef@toxicpanda.com>, Omar Sandoval <osandov@osandov.com>
> CC: kernel-team@fb.com, linux-btrfs@vger.kernel.org, Dennis Zhou <dennis@kernel.org>
>
> Hi Dennis,
>
> I love your patch! Perhaps something to improve:
>
> [auto build test WARNING on kdave/for-next]
> [also build test WARNING on next-20191121]
> [cannot apply to v5.4-rc8]
> [if your patch is applied to the wrong git tree, please drop us a note to help
> improve the system. BTW, we also suggest to use '--base' option to specify the
> base tree in git format-patch, please see https://stackoverflow.com/a/37406982]
>
> url:    https://github.com/0day-ci/linux/commits/Dennis-Zhou/btrfs-async-discard-support/20191121-230429
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
> config: arm64-defconfig (attached as .config)
> compiler: clang version 10.0.0 (git://gitmirror/llvm_project cf823ce4ad9d04c69b7c29d236f7b14c875111c2)
> reproduce:
>         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         # save the attached .config to linux build tree
>         make.cross ARCH=arm64
>
> If you fix the issue, kindly add following tag
> Reported-by: kbuild test robot <lkp@intel.com>
>
> All warnings (new ones prefixed by >>):
>
> >> fs/btrfs/free-space-cache.c:3238:6: warning: variable 'trim_state' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
>            if (!ret) {
>                ^~~~
>    fs/btrfs/free-space-cache.c:3251:53: note: uninitialized use occurs here
>            __btrfs_add_free_space(fs_info, ctl, start, bytes, trim_state);
>                                                               ^~~~~~~~~~
>    fs/btrfs/free-space-cache.c:3238:2: note: remove the 'if' if its condition is always true
>            if (!ret) {
>            ^~~~~~~~~~
>    fs/btrfs/free-space-cache.c:3224:2: note: variable 'trim_state' is declared here
>            enum btrfs_trim_state trim_state;
>            ^
>    1 warning generated.
>
> vim +3238 fs/btrfs/free-space-cache.c
>
>   3210
>   3211  static int do_trimming(struct btrfs_block_group *block_group,
>   3212                         u64 *total_trimmed, u64 start, u64 bytes,
>   3213                         u64 reserved_start, u64 reserved_bytes,
>   3214                         enum btrfs_trim_state reserved_trim_state,
>   3215                         struct btrfs_trim_range *trim_entry)
>   3216  {
>   3217          struct btrfs_space_info *space_info = block_group->space_info;
>   3218          struct btrfs_fs_info *fs_info = block_group->fs_info;
>   3219          struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
>   3220          int ret;
>   3221          int update = 0;
>   3222          u64 end = start + bytes;
>   3223          u64 reserved_end = reserved_start + reserved_bytes;
>   3224          enum btrfs_trim_state trim_state;
>   3225          u64 trimmed = 0;
>   3226
>   3227          spin_lock(&space_info->lock);
>   3228          spin_lock(&block_group->lock);
>   3229          if (!block_group->ro) {
>   3230                  block_group->reserved += reserved_bytes;
>   3231                  space_info->bytes_reserved += reserved_bytes;
>   3232                  update = 1;
>   3233          }
>   3234          spin_unlock(&block_group->lock);
>   3235          spin_unlock(&space_info->lock);
>   3236
>   3237          ret = btrfs_discard_extent(fs_info, start, bytes, &trimmed);
> > 3238          if (!ret) {
>   3239                  *total_trimmed += trimmed;
>   3240                  trim_state = BTRFS_TRIM_STATE_TRIMMED;
>   3241          }
>   3242
>   3243          mutex_lock(&ctl->cache_writeout_mutex);
>   3244          if (reserved_start < start)
>   3245                  __btrfs_add_free_space(fs_info, ctl, reserved_start,
>   3246                                         start - reserved_start,
>   3247                                         reserved_trim_state);
>   3248          if (start + bytes < reserved_start + reserved_bytes)
>   3249                  __btrfs_add_free_space(fs_info, ctl, end, reserved_end - end,
>   3250                                         reserved_trim_state);
>   3251          __btrfs_add_free_space(fs_info, ctl, start, bytes, trim_state);

^ oops

>   3252          list_del(&trim_entry->list);
>   3253          mutex_unlock(&ctl->cache_writeout_mutex);
>   3254
>   3255          if (update) {
>   3256                  spin_lock(&space_info->lock);
>   3257                  spin_lock(&block_group->lock);
>   3258                  if (block_group->ro)
>   3259                          space_info->bytes_readonly += reserved_bytes;
>   3260                  block_group->reserved -= reserved_bytes;
>   3261                  space_info->bytes_reserved -= reserved_bytes;
>   3262                  spin_unlock(&block_group->lock);
>   3263                  spin_unlock(&space_info->lock);
>   3264          }
>   3265
>   3266          return ret;
>   3267  }
>   3268
>
> ---
> 0-DAY kernel test infrastructure                 Open Source Technology Center
> https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org Intel Corporation



-- 
Thanks,
~Nick Desaulniers

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 05/22] btrfs: add the beginning of async discard, discard workqueue
@ 2019-11-22  4:27   ` Nick Desaulniers
  0 siblings, 0 replies; 23+ messages in thread
From: Nick Desaulniers @ 2019-11-22  4:27 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 5928 bytes --]

Hi Dennis,
Below is a 0day bot report from a build w/ Clang. Warning looks legit,
can you please take a look?

On Thu, Nov 21, 2019 at 11:27 AM kbuild test robot <lkp@intel.com> wrote:
>
> CC: kbuild-all(a)lists.01.org
> In-Reply-To: <63d3257efe1158a6fbbd7abe865cd9250b494438.1574282259.git.dennis@kernel.org>
> References: <63d3257efe1158a6fbbd7abe865cd9250b494438.1574282259.git.dennis@kernel.org>
> TO: Dennis Zhou <dennis@kernel.org>
> CC: David Sterba <dsterba@suse.com>, Chris Mason <clm@fb.com>, Josef Bacik <josef@toxicpanda.com>, Omar Sandoval <osandov@osandov.com>
> CC: kernel-team(a)fb.com, linux-btrfs(a)vger.kernel.org, Dennis Zhou <dennis@kernel.org>
>
> Hi Dennis,
>
> I love your patch! Perhaps something to improve:
>
> [auto build test WARNING on kdave/for-next]
> [also build test WARNING on next-20191121]
> [cannot apply to v5.4-rc8]
> [if your patch is applied to the wrong git tree, please drop us a note to help
> improve the system. BTW, we also suggest to use '--base' option to specify the
> base tree in git format-patch, please see https://stackoverflow.com/a/37406982]
>
> url:    https://github.com/0day-ci/linux/commits/Dennis-Zhou/btrfs-async-discard-support/20191121-230429
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
> config: arm64-defconfig (attached as .config)
> compiler: clang version 10.0.0 (git://gitmirror/llvm_project cf823ce4ad9d04c69b7c29d236f7b14c875111c2)
> reproduce:
>         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         # save the attached .config to linux build tree
>         make.cross ARCH=arm64
>
> If you fix the issue, kindly add following tag
> Reported-by: kbuild test robot <lkp@intel.com>
>
> All warnings (new ones prefixed by >>):
>
> >> fs/btrfs/free-space-cache.c:3238:6: warning: variable 'trim_state' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
>            if (!ret) {
>                ^~~~
>    fs/btrfs/free-space-cache.c:3251:53: note: uninitialized use occurs here
>            __btrfs_add_free_space(fs_info, ctl, start, bytes, trim_state);
>                                                               ^~~~~~~~~~
>    fs/btrfs/free-space-cache.c:3238:2: note: remove the 'if' if its condition is always true
>            if (!ret) {
>            ^~~~~~~~~~
>    fs/btrfs/free-space-cache.c:3224:2: note: variable 'trim_state' is declared here
>            enum btrfs_trim_state trim_state;
>            ^
>    1 warning generated.
>
> vim +3238 fs/btrfs/free-space-cache.c
>
>   3210
>   3211  static int do_trimming(struct btrfs_block_group *block_group,
>   3212                         u64 *total_trimmed, u64 start, u64 bytes,
>   3213                         u64 reserved_start, u64 reserved_bytes,
>   3214                         enum btrfs_trim_state reserved_trim_state,
>   3215                         struct btrfs_trim_range *trim_entry)
>   3216  {
>   3217          struct btrfs_space_info *space_info = block_group->space_info;
>   3218          struct btrfs_fs_info *fs_info = block_group->fs_info;
>   3219          struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
>   3220          int ret;
>   3221          int update = 0;
>   3222          u64 end = start + bytes;
>   3223          u64 reserved_end = reserved_start + reserved_bytes;
>   3224          enum btrfs_trim_state trim_state;
>   3225          u64 trimmed = 0;
>   3226
>   3227          spin_lock(&space_info->lock);
>   3228          spin_lock(&block_group->lock);
>   3229          if (!block_group->ro) {
>   3230                  block_group->reserved += reserved_bytes;
>   3231                  space_info->bytes_reserved += reserved_bytes;
>   3232                  update = 1;
>   3233          }
>   3234          spin_unlock(&block_group->lock);
>   3235          spin_unlock(&space_info->lock);
>   3236
>   3237          ret = btrfs_discard_extent(fs_info, start, bytes, &trimmed);
> > 3238          if (!ret) {
>   3239                  *total_trimmed += trimmed;
>   3240                  trim_state = BTRFS_TRIM_STATE_TRIMMED;
>   3241          }
>   3242
>   3243          mutex_lock(&ctl->cache_writeout_mutex);
>   3244          if (reserved_start < start)
>   3245                  __btrfs_add_free_space(fs_info, ctl, reserved_start,
>   3246                                         start - reserved_start,
>   3247                                         reserved_trim_state);
>   3248          if (start + bytes < reserved_start + reserved_bytes)
>   3249                  __btrfs_add_free_space(fs_info, ctl, end, reserved_end - end,
>   3250                                         reserved_trim_state);
>   3251          __btrfs_add_free_space(fs_info, ctl, start, bytes, trim_state);

^ oops

>   3252          list_del(&trim_entry->list);
>   3253          mutex_unlock(&ctl->cache_writeout_mutex);
>   3254
>   3255          if (update) {
>   3256                  spin_lock(&space_info->lock);
>   3257                  spin_lock(&block_group->lock);
>   3258                  if (block_group->ro)
>   3259                          space_info->bytes_readonly += reserved_bytes;
>   3260                  block_group->reserved -= reserved_bytes;
>   3261                  space_info->bytes_reserved -= reserved_bytes;
>   3262                  spin_unlock(&block_group->lock);
>   3263                  spin_unlock(&space_info->lock);
>   3264          }
>   3265
>   3266          return ret;
>   3267  }
>   3268
>
> ---
> 0-DAY kernel test infrastructure                 Open Source Technology Center
> https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org Intel Corporation



-- 
Thanks,
~Nick Desaulniers

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 05/22] btrfs: add the beginning of async discard, discard workqueue
  2019-11-22  4:27   ` Nick Desaulniers
@ 2019-11-25 18:59     ` Dennis Zhou
  -1 siblings, 0 replies; 23+ messages in thread
From: Dennis Zhou @ 2019-11-25 18:59 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: dennis, kbuild, clang-built-linux, kbuild test robot, kbuild-all,
	David Sterba, Chris Mason, Josef Bacik, osandov, kernel-team,
	linux-btrfs

On Thu, Nov 21, 2019 at 08:27:43PM -0800, Nick Desaulniers wrote:
> Hi Dennis,
> Below is a 0day bot report from a build w/ Clang. Warning looks legit,
> can you please take a look?
> 

Ah thanks for this! Yeah that was a miss when I switched from flags ->
an enum and didn't update the declaration properly. I'll be sending out
a v4 as another fix for arm has some rebase conflicts.

Is there a way to enable so I get these emails directly?

Thanks,
Dennis

> On Thu, Nov 21, 2019 at 11:27 AM kbuild test robot <lkp@intel.com> wrote:
> >
> > CC: kbuild-all@lists.01.org
> > In-Reply-To: <63d3257efe1158a6fbbd7abe865cd9250b494438.1574282259.git.dennis@kernel.org>
> > References: <63d3257efe1158a6fbbd7abe865cd9250b494438.1574282259.git.dennis@kernel.org>
> > TO: Dennis Zhou <dennis@kernel.org>
> > CC: David Sterba <dsterba@suse.com>, Chris Mason <clm@fb.com>, Josef Bacik <josef@toxicpanda.com>, Omar Sandoval <osandov@osandov.com>
> > CC: kernel-team@fb.com, linux-btrfs@vger.kernel.org, Dennis Zhou <dennis@kernel.org>
> >
> > Hi Dennis,
> >
> > I love your patch! Perhaps something to improve:
> >
> > [auto build test WARNING on kdave/for-next]
> > [also build test WARNING on next-20191121]
> > [cannot apply to v5.4-rc8]
> > [if your patch is applied to the wrong git tree, please drop us a note to help
> > improve the system. BTW, we also suggest to use '--base' option to specify the
> > base tree in git format-patch, please see https://stackoverflow.com/a/37406982]
> >
> > url:    https://github.com/0day-ci/linux/commits/Dennis-Zhou/btrfs-async-discard-support/20191121-230429
> > base:   https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
> > config: arm64-defconfig (attached as .config)
> > compiler: clang version 10.0.0 (git://gitmirror/llvm_project cf823ce4ad9d04c69b7c29d236f7b14c875111c2)
> > reproduce:
> >         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
> >         chmod +x ~/bin/make.cross
> >         # save the attached .config to linux build tree
> >         make.cross ARCH=arm64
> >
> > If you fix the issue, kindly add following tag
> > Reported-by: kbuild test robot <lkp@intel.com>
> >
> > All warnings (new ones prefixed by >>):
> >
> > >> fs/btrfs/free-space-cache.c:3238:6: warning: variable 'trim_state' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
> >            if (!ret) {
> >                ^~~~
> >    fs/btrfs/free-space-cache.c:3251:53: note: uninitialized use occurs here
> >            __btrfs_add_free_space(fs_info, ctl, start, bytes, trim_state);
> >                                                               ^~~~~~~~~~
> >    fs/btrfs/free-space-cache.c:3238:2: note: remove the 'if' if its condition is always true
> >            if (!ret) {
> >            ^~~~~~~~~~
> >    fs/btrfs/free-space-cache.c:3224:2: note: variable 'trim_state' is declared here
> >            enum btrfs_trim_state trim_state;
> >            ^
> >    1 warning generated.
> >
> > vim +3238 fs/btrfs/free-space-cache.c
> >
> >   3210
> >   3211  static int do_trimming(struct btrfs_block_group *block_group,
> >   3212                         u64 *total_trimmed, u64 start, u64 bytes,
> >   3213                         u64 reserved_start, u64 reserved_bytes,
> >   3214                         enum btrfs_trim_state reserved_trim_state,
> >   3215                         struct btrfs_trim_range *trim_entry)
> >   3216  {
> >   3217          struct btrfs_space_info *space_info = block_group->space_info;
> >   3218          struct btrfs_fs_info *fs_info = block_group->fs_info;
> >   3219          struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
> >   3220          int ret;
> >   3221          int update = 0;
> >   3222          u64 end = start + bytes;
> >   3223          u64 reserved_end = reserved_start + reserved_bytes;
> >   3224          enum btrfs_trim_state trim_state;
> >   3225          u64 trimmed = 0;
> >   3226
> >   3227          spin_lock(&space_info->lock);
> >   3228          spin_lock(&block_group->lock);
> >   3229          if (!block_group->ro) {
> >   3230                  block_group->reserved += reserved_bytes;
> >   3231                  space_info->bytes_reserved += reserved_bytes;
> >   3232                  update = 1;
> >   3233          }
> >   3234          spin_unlock(&block_group->lock);
> >   3235          spin_unlock(&space_info->lock);
> >   3236
> >   3237          ret = btrfs_discard_extent(fs_info, start, bytes, &trimmed);
> > > 3238          if (!ret) {
> >   3239                  *total_trimmed += trimmed;
> >   3240                  trim_state = BTRFS_TRIM_STATE_TRIMMED;
> >   3241          }
> >   3242
> >   3243          mutex_lock(&ctl->cache_writeout_mutex);
> >   3244          if (reserved_start < start)
> >   3245                  __btrfs_add_free_space(fs_info, ctl, reserved_start,
> >   3246                                         start - reserved_start,
> >   3247                                         reserved_trim_state);
> >   3248          if (start + bytes < reserved_start + reserved_bytes)
> >   3249                  __btrfs_add_free_space(fs_info, ctl, end, reserved_end - end,
> >   3250                                         reserved_trim_state);
> >   3251          __btrfs_add_free_space(fs_info, ctl, start, bytes, trim_state);
> 
> ^ oops
> 
> >   3252          list_del(&trim_entry->list);
> >   3253          mutex_unlock(&ctl->cache_writeout_mutex);
> >   3254
> >   3255          if (update) {
> >   3256                  spin_lock(&space_info->lock);
> >   3257                  spin_lock(&block_group->lock);
> >   3258                  if (block_group->ro)
> >   3259                          space_info->bytes_readonly += reserved_bytes;
> >   3260                  block_group->reserved -= reserved_bytes;
> >   3261                  space_info->bytes_reserved -= reserved_bytes;
> >   3262                  spin_unlock(&block_group->lock);
> >   3263                  spin_unlock(&space_info->lock);
> >   3264          }
> >   3265
> >   3266          return ret;
> >   3267  }
> >   3268
> >
> > ---
> > 0-DAY kernel test infrastructure                 Open Source Technology Center
> > https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org Intel Corporation
> 
> 
> 
> -- 
> Thanks,
> ~Nick Desaulniers

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 05/22] btrfs: add the beginning of async discard, discard workqueue
@ 2019-11-25 18:59     ` Dennis Zhou
  0 siblings, 0 replies; 23+ messages in thread
From: Dennis Zhou @ 2019-11-25 18:59 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 6533 bytes --]

On Thu, Nov 21, 2019 at 08:27:43PM -0800, Nick Desaulniers wrote:
> Hi Dennis,
> Below is a 0day bot report from a build w/ Clang. Warning looks legit,
> can you please take a look?
> 

Ah thanks for this! Yeah that was a miss when I switched from flags ->
an enum and didn't update the declaration properly. I'll be sending out
a v4 as another fix for arm has some rebase conflicts.

Is there a way to enable so I get these emails directly?

Thanks,
Dennis

> On Thu, Nov 21, 2019 at 11:27 AM kbuild test robot <lkp@intel.com> wrote:
> >
> > CC: kbuild-all(a)lists.01.org
> > In-Reply-To: <63d3257efe1158a6fbbd7abe865cd9250b494438.1574282259.git.dennis@kernel.org>
> > References: <63d3257efe1158a6fbbd7abe865cd9250b494438.1574282259.git.dennis@kernel.org>
> > TO: Dennis Zhou <dennis@kernel.org>
> > CC: David Sterba <dsterba@suse.com>, Chris Mason <clm@fb.com>, Josef Bacik <josef@toxicpanda.com>, Omar Sandoval <osandov@osandov.com>
> > CC: kernel-team(a)fb.com, linux-btrfs(a)vger.kernel.org, Dennis Zhou <dennis@kernel.org>
> >
> > Hi Dennis,
> >
> > I love your patch! Perhaps something to improve:
> >
> > [auto build test WARNING on kdave/for-next]
> > [also build test WARNING on next-20191121]
> > [cannot apply to v5.4-rc8]
> > [if your patch is applied to the wrong git tree, please drop us a note to help
> > improve the system. BTW, we also suggest to use '--base' option to specify the
> > base tree in git format-patch, please see https://stackoverflow.com/a/37406982]
> >
> > url:    https://github.com/0day-ci/linux/commits/Dennis-Zhou/btrfs-async-discard-support/20191121-230429
> > base:   https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
> > config: arm64-defconfig (attached as .config)
> > compiler: clang version 10.0.0 (git://gitmirror/llvm_project cf823ce4ad9d04c69b7c29d236f7b14c875111c2)
> > reproduce:
> >         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
> >         chmod +x ~/bin/make.cross
> >         # save the attached .config to linux build tree
> >         make.cross ARCH=arm64
> >
> > If you fix the issue, kindly add following tag
> > Reported-by: kbuild test robot <lkp@intel.com>
> >
> > All warnings (new ones prefixed by >>):
> >
> > >> fs/btrfs/free-space-cache.c:3238:6: warning: variable 'trim_state' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
> >            if (!ret) {
> >                ^~~~
> >    fs/btrfs/free-space-cache.c:3251:53: note: uninitialized use occurs here
> >            __btrfs_add_free_space(fs_info, ctl, start, bytes, trim_state);
> >                                                               ^~~~~~~~~~
> >    fs/btrfs/free-space-cache.c:3238:2: note: remove the 'if' if its condition is always true
> >            if (!ret) {
> >            ^~~~~~~~~~
> >    fs/btrfs/free-space-cache.c:3224:2: note: variable 'trim_state' is declared here
> >            enum btrfs_trim_state trim_state;
> >            ^
> >    1 warning generated.
> >
> > vim +3238 fs/btrfs/free-space-cache.c
> >
> >   3210
> >   3211  static int do_trimming(struct btrfs_block_group *block_group,
> >   3212                         u64 *total_trimmed, u64 start, u64 bytes,
> >   3213                         u64 reserved_start, u64 reserved_bytes,
> >   3214                         enum btrfs_trim_state reserved_trim_state,
> >   3215                         struct btrfs_trim_range *trim_entry)
> >   3216  {
> >   3217          struct btrfs_space_info *space_info = block_group->space_info;
> >   3218          struct btrfs_fs_info *fs_info = block_group->fs_info;
> >   3219          struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
> >   3220          int ret;
> >   3221          int update = 0;
> >   3222          u64 end = start + bytes;
> >   3223          u64 reserved_end = reserved_start + reserved_bytes;
> >   3224          enum btrfs_trim_state trim_state;
> >   3225          u64 trimmed = 0;
> >   3226
> >   3227          spin_lock(&space_info->lock);
> >   3228          spin_lock(&block_group->lock);
> >   3229          if (!block_group->ro) {
> >   3230                  block_group->reserved += reserved_bytes;
> >   3231                  space_info->bytes_reserved += reserved_bytes;
> >   3232                  update = 1;
> >   3233          }
> >   3234          spin_unlock(&block_group->lock);
> >   3235          spin_unlock(&space_info->lock);
> >   3236
> >   3237          ret = btrfs_discard_extent(fs_info, start, bytes, &trimmed);
> > > 3238          if (!ret) {
> >   3239                  *total_trimmed += trimmed;
> >   3240                  trim_state = BTRFS_TRIM_STATE_TRIMMED;
> >   3241          }
> >   3242
> >   3243          mutex_lock(&ctl->cache_writeout_mutex);
> >   3244          if (reserved_start < start)
> >   3245                  __btrfs_add_free_space(fs_info, ctl, reserved_start,
> >   3246                                         start - reserved_start,
> >   3247                                         reserved_trim_state);
> >   3248          if (start + bytes < reserved_start + reserved_bytes)
> >   3249                  __btrfs_add_free_space(fs_info, ctl, end, reserved_end - end,
> >   3250                                         reserved_trim_state);
> >   3251          __btrfs_add_free_space(fs_info, ctl, start, bytes, trim_state);
> 
> ^ oops
> 
> >   3252          list_del(&trim_entry->list);
> >   3253          mutex_unlock(&ctl->cache_writeout_mutex);
> >   3254
> >   3255          if (update) {
> >   3256                  spin_lock(&space_info->lock);
> >   3257                  spin_lock(&block_group->lock);
> >   3258                  if (block_group->ro)
> >   3259                          space_info->bytes_readonly += reserved_bytes;
> >   3260                  block_group->reserved -= reserved_bytes;
> >   3261                  space_info->bytes_reserved -= reserved_bytes;
> >   3262                  spin_unlock(&block_group->lock);
> >   3263                  spin_unlock(&space_info->lock);
> >   3264          }
> >   3265
> >   3266          return ret;
> >   3267  }
> >   3268
> >
> > ---
> > 0-DAY kernel test infrastructure                 Open Source Technology Center
> > https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org Intel Corporation
> 
> 
> 
> -- 
> Thanks,
> ~Nick Desaulniers

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 05/22] btrfs: add the beginning of async discard, discard workqueue
  2019-11-25 18:59     ` Dennis Zhou
@ 2019-11-25 19:39       ` Nick Desaulniers
  -1 siblings, 0 replies; 23+ messages in thread
From: Nick Desaulniers @ 2019-11-25 19:39 UTC (permalink / raw)
  To: Dennis Zhou, Chen Rong, Philip Li
  Cc: kbuild, clang-built-linux, kbuild test robot, kbuild-all,
	David Sterba, Chris Mason, Josef Bacik, Omar Sandoval,
	kernel-team, linux-btrfs

On Mon, Nov 25, 2019 at 10:59 AM Dennis Zhou <dennis@kernel.org> wrote:
>
> On Thu, Nov 21, 2019 at 08:27:43PM -0800, Nick Desaulniers wrote:
> > Hi Dennis,
> > Below is a 0day bot report from a build w/ Clang. Warning looks legit,
> > can you please take a look?
> >
>
> Ah thanks for this! Yeah that was a miss when I switched from flags ->
> an enum and didn't update the declaration properly. I'll be sending out
> a v4 as another fix for arm has some rebase conflicts.
>
> Is there a way to enable so I get these emails directly?

+ Rong, Philip

The reports have only been sent to our mailing list where we've been
manually triaging them.  The issue with enabling them globally was
that the script to reproduce the warning still doesn't mention how to
build w/ Clang.

In general the reports have been high value (I ignore most reports
with -Wimplicit-function-declaration, which is the most frequent as it
shows the patch was not compile tested at all).

Rong, Philip, it's been a while since we talked about this last. Is
there a general timeline of when these reports will be turned on
globally?  Even if the directions to reproduce aren't quite right,
generally there's enough info in the existing bugs where authors can
rewrite their patch without even needing to rebuild with Clang (though
having correct directions to reproduce would be nice, we could wait
until someone asked for them explicitly).

-- 
Thanks,
~Nick Desaulniers

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 05/22] btrfs: add the beginning of async discard, discard workqueue
@ 2019-11-25 19:39       ` Nick Desaulniers
  0 siblings, 0 replies; 23+ messages in thread
From: Nick Desaulniers @ 2019-11-25 19:39 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 1477 bytes --]

On Mon, Nov 25, 2019 at 10:59 AM Dennis Zhou <dennis@kernel.org> wrote:
>
> On Thu, Nov 21, 2019 at 08:27:43PM -0800, Nick Desaulniers wrote:
> > Hi Dennis,
> > Below is a 0day bot report from a build w/ Clang. Warning looks legit,
> > can you please take a look?
> >
>
> Ah thanks for this! Yeah that was a miss when I switched from flags ->
> an enum and didn't update the declaration properly. I'll be sending out
> a v4 as another fix for arm has some rebase conflicts.
>
> Is there a way to enable so I get these emails directly?

+ Rong, Philip

The reports have only been sent to our mailing list where we've been
manually triaging them.  The issue with enabling them globally was
that the script to reproduce the warning still doesn't mention how to
build w/ Clang.

In general the reports have been high value (I ignore most reports
with -Wimplicit-function-declaration, which is the most frequent as it
shows the patch was not compile tested at all).

Rong, Philip, it's been a while since we talked about this last. Is
there a general timeline of when these reports will be turned on
globally?  Even if the directions to reproduce aren't quite right,
generally there's enough info in the existing bugs where authors can
rewrite their patch without even needing to rebuild with Clang (though
having correct directions to reproduce would be nice, we could wait
until someone asked for them explicitly).

-- 
Thanks,
~Nick Desaulniers

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 05/22] btrfs: add the beginning of async discard, discard workqueue
  2019-11-25 19:39       ` Nick Desaulniers
  (?)
@ 2019-11-26  1:42         ` Philip Li
  -1 siblings, 0 replies; 23+ messages in thread
From: Philip Li @ 2019-11-26  1:42 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: Dennis Zhou, Chen Rong, kbuild, clang-built-linux,
	kbuild test robot, kbuild-all, David Sterba, Chris Mason,
	Josef Bacik, Omar Sandoval, kernel-team, linux-btrfs

On Mon, Nov 25, 2019 at 11:39:08AM -0800, Nick Desaulniers wrote:
> On Mon, Nov 25, 2019 at 10:59 AM Dennis Zhou <dennis@kernel.org> wrote:
> >
> > On Thu, Nov 21, 2019 at 08:27:43PM -0800, Nick Desaulniers wrote:
> > > Hi Dennis,
> > > Below is a 0day bot report from a build w/ Clang. Warning looks legit,
> > > can you please take a look?
> > >
> >
> > Ah thanks for this! Yeah that was a miss when I switched from flags ->
> > an enum and didn't update the declaration properly. I'll be sending out
> > a v4 as another fix for arm has some rebase conflicts.
> >
> > Is there a way to enable so I get these emails directly?
> 
> + Rong, Philip
> 
> The reports have only been sent to our mailing list where we've been
> manually triaging them.  The issue with enabling them globally was
> that the script to reproduce the warning still doesn't mention how to
> build w/ Clang.
Thanks Nick for continuous caring on this. One thing we initially worry
is how to avoid duplicated reports to developer, like the one that can
be same as gcc's finding. We haven't found a way to effectively handle
this.

> 
> In general the reports have been high value (I ignore most reports
> with -Wimplicit-function-declaration, which is the most frequent as it
> shows the patch was not compile tested at all).
Do we mean the report with -Wimplicit-function-declaration can be duplicated
to gcc, so we can ignore them to avoid duplication to developer?

> 
> Rong, Philip, it's been a while since we talked about this last. Is
> there a general timeline of when these reports will be turned on
> globally?  Even if the directions to reproduce aren't quite right,
For the timeline, it's not decided due to the duplication concern. We tend
to look into next year after other priorities are solved for this year.

> generally there's enough info in the existing bugs where authors can
> rewrite their patch without even needing to rebuild with Clang (though
> having correct directions to reproduce would be nice, we could wait
> until someone asked for them explicitly).
> 
> -- 
> Thanks,
> ~Nick Desaulniers

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 05/22] btrfs: add the beginning of async discard, discard workqueue
@ 2019-11-26  1:42         ` Philip Li
  0 siblings, 0 replies; 23+ messages in thread
From: Philip Li @ 2019-11-26  1:42 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 2139 bytes --]

On Mon, Nov 25, 2019 at 11:39:08AM -0800, Nick Desaulniers wrote:
> On Mon, Nov 25, 2019 at 10:59 AM Dennis Zhou <dennis@kernel.org> wrote:
> >
> > On Thu, Nov 21, 2019 at 08:27:43PM -0800, Nick Desaulniers wrote:
> > > Hi Dennis,
> > > Below is a 0day bot report from a build w/ Clang. Warning looks legit,
> > > can you please take a look?
> > >
> >
> > Ah thanks for this! Yeah that was a miss when I switched from flags ->
> > an enum and didn't update the declaration properly. I'll be sending out
> > a v4 as another fix for arm has some rebase conflicts.
> >
> > Is there a way to enable so I get these emails directly?
> 
> + Rong, Philip
> 
> The reports have only been sent to our mailing list where we've been
> manually triaging them.  The issue with enabling them globally was
> that the script to reproduce the warning still doesn't mention how to
> build w/ Clang.
Thanks Nick for continuous caring on this. One thing we initially worry
is how to avoid duplicated reports to developer, like the one that can
be same as gcc's finding. We haven't found a way to effectively handle
this.

> 
> In general the reports have been high value (I ignore most reports
> with -Wimplicit-function-declaration, which is the most frequent as it
> shows the patch was not compile tested at all).
Do we mean the report with -Wimplicit-function-declaration can be duplicated
to gcc, so we can ignore them to avoid duplication to developer?

> 
> Rong, Philip, it's been a while since we talked about this last. Is
> there a general timeline of when these reports will be turned on
> globally?  Even if the directions to reproduce aren't quite right,
For the timeline, it's not decided due to the duplication concern. We tend
to look into next year after other priorities are solved for this year.

> generally there's enough info in the existing bugs where authors can
> rewrite their patch without even needing to rebuild with Clang (though
> having correct directions to reproduce would be nice, we could wait
> until someone asked for them explicitly).
> 
> -- 
> Thanks,
> ~Nick Desaulniers

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 05/22] btrfs: add the beginning of async discard, discard workqueue
@ 2019-11-26  1:42         ` Philip Li
  0 siblings, 0 replies; 23+ messages in thread
From: Philip Li @ 2019-11-26  1:42 UTC (permalink / raw)
  To: kbuild

[-- Attachment #1: Type: text/plain, Size: 2139 bytes --]

On Mon, Nov 25, 2019 at 11:39:08AM -0800, Nick Desaulniers wrote:
> On Mon, Nov 25, 2019 at 10:59 AM Dennis Zhou <dennis@kernel.org> wrote:
> >
> > On Thu, Nov 21, 2019 at 08:27:43PM -0800, Nick Desaulniers wrote:
> > > Hi Dennis,
> > > Below is a 0day bot report from a build w/ Clang. Warning looks legit,
> > > can you please take a look?
> > >
> >
> > Ah thanks for this! Yeah that was a miss when I switched from flags ->
> > an enum and didn't update the declaration properly. I'll be sending out
> > a v4 as another fix for arm has some rebase conflicts.
> >
> > Is there a way to enable so I get these emails directly?
> 
> + Rong, Philip
> 
> The reports have only been sent to our mailing list where we've been
> manually triaging them.  The issue with enabling them globally was
> that the script to reproduce the warning still doesn't mention how to
> build w/ Clang.
Thanks Nick for continuous caring on this. One thing we initially worry
is how to avoid duplicated reports to developer, like the one that can
be same as gcc's finding. We haven't found a way to effectively handle
this.

> 
> In general the reports have been high value (I ignore most reports
> with -Wimplicit-function-declaration, which is the most frequent as it
> shows the patch was not compile tested at all).
Do we mean the report with -Wimplicit-function-declaration can be duplicated
to gcc, so we can ignore them to avoid duplication to developer?

> 
> Rong, Philip, it's been a while since we talked about this last. Is
> there a general timeline of when these reports will be turned on
> globally?  Even if the directions to reproduce aren't quite right,
For the timeline, it's not decided due to the duplication concern. We tend
to look into next year after other priorities are solved for this year.

> generally there's enough info in the existing bugs where authors can
> rewrite their patch without even needing to rebuild with Clang (though
> having correct directions to reproduce would be nice, we could wait
> until someone asked for them explicitly).
> 
> -- 
> Thanks,
> ~Nick Desaulniers

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 05/22] btrfs: add the beginning of async discard, discard workqueue
  2019-11-26  1:42         ` Philip Li
@ 2019-11-26  1:47           ` Nick Desaulniers
  -1 siblings, 0 replies; 23+ messages in thread
From: Nick Desaulniers @ 2019-11-26  1:47 UTC (permalink / raw)
  To: Philip Li
  Cc: Dennis Zhou, Chen Rong, kbuild, clang-built-linux,
	kbuild test robot, kbuild-all, David Sterba, Chris Mason,
	Josef Bacik, Omar Sandoval, kernel-team, linux-btrfs

On Mon, Nov 25, 2019 at 5:35 PM Philip Li <philip.li@intel.com> wrote:
>
> On Mon, Nov 25, 2019 at 11:39:08AM -0800, Nick Desaulniers wrote:
> > On Mon, Nov 25, 2019 at 10:59 AM Dennis Zhou <dennis@kernel.org> wrote:
> > >
> > > On Thu, Nov 21, 2019 at 08:27:43PM -0800, Nick Desaulniers wrote:
> > > > Hi Dennis,
> > > > Below is a 0day bot report from a build w/ Clang. Warning looks legit,
> > > > can you please take a look?
> > > >
> > >
> > > Ah thanks for this! Yeah that was a miss when I switched from flags ->
> > > an enum and didn't update the declaration properly. I'll be sending out
> > > a v4 as another fix for arm has some rebase conflicts.
> > >
> > > Is there a way to enable so I get these emails directly?
> >
> > + Rong, Philip
> >
> > The reports have only been sent to our mailing list where we've been
> > manually triaging them.  The issue with enabling them globally was
> > that the script to reproduce the warning still doesn't mention how to
> > build w/ Clang.
> Thanks Nick for continuous caring on this. One thing we initially worry
> is how to avoid duplicated reports to developer, like the one that can
> be same as gcc's finding. We haven't found a way to effectively handle
> this.

Thanks for maintaining an invaluable tool.

How would the reports be duplicated? Does 0day bot build with GCC,
then rebuild with Clang?

Regardless, does it matter? If I make a mistake, and get two build
failure emails from 0day bot instead of one, does it matter? Sometimes
developers may just get one, as some warnings are unique to each
compiler.  Maybe it runs the risk of folks ignoring the email if the
volume is too much, but do authors generally ignore 0day bot emails?
(I'd hope not).

>
> >
> > In general the reports have been high value (I ignore most reports
> > with -Wimplicit-function-declaration, which is the most frequent as it
> > shows the patch was not compile tested at all).
> Do we mean the report with -Wimplicit-function-declaration can be duplicated
> to gcc, so we can ignore them to avoid duplication to developer?

Many of the warnings GCC has Clang does as well.
-Wimplicit-function-declaration is the most common warning I see in
triage, which developers would see regardless of toolchain had they
compiled first before pushing.  It might be interesting to see maybe
the intersection or common flags between GCC and Clang, and only email
Clang reports for warnings unique to clang?  I think CFLAGS can even
be passed into make invocations so you could do:
$ make CC=clang KBUILD_CFLAGS=<list of flags common to GCC and Clang;
-Wno-implicit-function-declaration -Wno-...>
such that any resulting warnings were unique to Clang.  I'd expect
such a list to quickly get stale over time though.

>
> >
> > Rong, Philip, it's been a while since we talked about this last. Is
> > there a general timeline of when these reports will be turned on
> > globally?  Even if the directions to reproduce aren't quite right,
> For the timeline, it's not decided due to the duplication concern. We tend
> to look into next year after other priorities are solved for this year.
>
> > generally there's enough info in the existing bugs where authors can
> > rewrite their patch without even needing to rebuild with Clang (though
> > having correct directions to reproduce would be nice, we could wait
> > until someone asked for them explicitly).
> >
> > --
> > Thanks,
> > ~Nick Desaulniers



-- 
Thanks,
~Nick Desaulniers

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 05/22] btrfs: add the beginning of async discard, discard workqueue
@ 2019-11-26  1:47           ` Nick Desaulniers
  0 siblings, 0 replies; 23+ messages in thread
From: Nick Desaulniers @ 2019-11-26  1:47 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 3540 bytes --]

On Mon, Nov 25, 2019 at 5:35 PM Philip Li <philip.li@intel.com> wrote:
>
> On Mon, Nov 25, 2019 at 11:39:08AM -0800, Nick Desaulniers wrote:
> > On Mon, Nov 25, 2019 at 10:59 AM Dennis Zhou <dennis@kernel.org> wrote:
> > >
> > > On Thu, Nov 21, 2019 at 08:27:43PM -0800, Nick Desaulniers wrote:
> > > > Hi Dennis,
> > > > Below is a 0day bot report from a build w/ Clang. Warning looks legit,
> > > > can you please take a look?
> > > >
> > >
> > > Ah thanks for this! Yeah that was a miss when I switched from flags ->
> > > an enum and didn't update the declaration properly. I'll be sending out
> > > a v4 as another fix for arm has some rebase conflicts.
> > >
> > > Is there a way to enable so I get these emails directly?
> >
> > + Rong, Philip
> >
> > The reports have only been sent to our mailing list where we've been
> > manually triaging them.  The issue with enabling them globally was
> > that the script to reproduce the warning still doesn't mention how to
> > build w/ Clang.
> Thanks Nick for continuous caring on this. One thing we initially worry
> is how to avoid duplicated reports to developer, like the one that can
> be same as gcc's finding. We haven't found a way to effectively handle
> this.

Thanks for maintaining an invaluable tool.

How would the reports be duplicated? Does 0day bot build with GCC,
then rebuild with Clang?

Regardless, does it matter? If I make a mistake, and get two build
failure emails from 0day bot instead of one, does it matter? Sometimes
developers may just get one, as some warnings are unique to each
compiler.  Maybe it runs the risk of folks ignoring the email if the
volume is too much, but do authors generally ignore 0day bot emails?
(I'd hope not).

>
> >
> > In general the reports have been high value (I ignore most reports
> > with -Wimplicit-function-declaration, which is the most frequent as it
> > shows the patch was not compile tested at all).
> Do we mean the report with -Wimplicit-function-declaration can be duplicated
> to gcc, so we can ignore them to avoid duplication to developer?

Many of the warnings GCC has Clang does as well.
-Wimplicit-function-declaration is the most common warning I see in
triage, which developers would see regardless of toolchain had they
compiled first before pushing.  It might be interesting to see maybe
the intersection or common flags between GCC and Clang, and only email
Clang reports for warnings unique to clang?  I think CFLAGS can even
be passed into make invocations so you could do:
$ make CC=clang KBUILD_CFLAGS=<list of flags common to GCC and Clang;
-Wno-implicit-function-declaration -Wno-...>
such that any resulting warnings were unique to Clang.  I'd expect
such a list to quickly get stale over time though.

>
> >
> > Rong, Philip, it's been a while since we talked about this last. Is
> > there a general timeline of when these reports will be turned on
> > globally?  Even if the directions to reproduce aren't quite right,
> For the timeline, it's not decided due to the duplication concern. We tend
> to look into next year after other priorities are solved for this year.
>
> > generally there's enough info in the existing bugs where authors can
> > rewrite their patch without even needing to rebuild with Clang (though
> > having correct directions to reproduce would be nice, we could wait
> > until someone asked for them explicitly).
> >
> > --
> > Thanks,
> > ~Nick Desaulniers



-- 
Thanks,
~Nick Desaulniers

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 05/22] btrfs: add the beginning of async discard, discard workqueue
  2019-11-26  1:47           ` Nick Desaulniers
  (?)
@ 2019-11-26  4:07             ` Philip Li
  -1 siblings, 0 replies; 23+ messages in thread
From: Philip Li @ 2019-11-26  4:07 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: Dennis Zhou, Chen Rong, kbuild, clang-built-linux,
	kbuild test robot, kbuild-all, David Sterba, Chris Mason,
	Josef Bacik, Omar Sandoval, kernel-team, linux-btrfs

On Mon, Nov 25, 2019 at 05:47:00PM -0800, Nick Desaulniers wrote:
> On Mon, Nov 25, 2019 at 5:35 PM Philip Li <philip.li@intel.com> wrote:
> >
> > On Mon, Nov 25, 2019 at 11:39:08AM -0800, Nick Desaulniers wrote:
> > > On Mon, Nov 25, 2019 at 10:59 AM Dennis Zhou <dennis@kernel.org> wrote:
> > > >
> > > > On Thu, Nov 21, 2019 at 08:27:43PM -0800, Nick Desaulniers wrote:
> > > > > Hi Dennis,
> > > > > Below is a 0day bot report from a build w/ Clang. Warning looks legit,
> > > > > can you please take a look?
> > > > >
> > > >
> > > > Ah thanks for this! Yeah that was a miss when I switched from flags ->
> > > > an enum and didn't update the declaration properly. I'll be sending out
> > > > a v4 as another fix for arm has some rebase conflicts.
> > > >
> > > > Is there a way to enable so I get these emails directly?
> > >
> > > + Rong, Philip
> > >
> > > The reports have only been sent to our mailing list where we've been
> > > manually triaging them.  The issue with enabling them globally was
> > > that the script to reproduce the warning still doesn't mention how to
> > > build w/ Clang.
> > Thanks Nick for continuous caring on this. One thing we initially worry
> > is how to avoid duplicated reports to developer, like the one that can
> > be same as gcc's finding. We haven't found a way to effectively handle
> > this.
> 
> Thanks for maintaining an invaluable tool.
> 
> How would the reports be duplicated? Does 0day bot build with GCC,
> then rebuild with Clang?
no, they are built separately. For duplication, i refer to the issue
can be detected by both tool, and gcc reports out already (or clang first).

> 
> Regardless, does it matter? If I make a mistake, and get two build
> failure emails from 0day bot instead of one, does it matter? Sometimes
> developers may just get one, as some warnings are unique to each
> compiler.  Maybe it runs the risk of folks ignoring the email if the
> volume is too much, but do authors generally ignore 0day bot emails?
> (I'd hope not).
:-) this is a good point, and recently we are working to make the
service more stable to generate reports in time.

> 
> >
> > >
> > > In general the reports have been high value (I ignore most reports
> > > with -Wimplicit-function-declaration, which is the most frequent as it
> > > shows the patch was not compile tested at all).
> > Do we mean the report with -Wimplicit-function-declaration can be duplicated
> > to gcc, so we can ignore them to avoid duplication to developer?
> 
> Many of the warnings GCC has Clang does as well.
> -Wimplicit-function-declaration is the most common warning I see in
> triage, which developers would see regardless of toolchain had they
> compiled first before pushing.  It might be interesting to see maybe
> the intersection or common flags between GCC and Clang, and only email
> Clang reports for warnings unique to clang?  I think CFLAGS can even
> be passed into make invocations so you could do:
> $ make CC=clang KBUILD_CFLAGS=<list of flags common to GCC and Clang;
> -Wno-implicit-function-declaration -Wno-...>
> such that any resulting warnings were unique to Clang.  I'd expect
> such a list to quickly get stale over time though.
thanks for the idea, we will look into this. 

> 
> >
> > >
> > > Rong, Philip, it's been a while since we talked about this last. Is
> > > there a general timeline of when these reports will be turned on
> > > globally?  Even if the directions to reproduce aren't quite right,
> > For the timeline, it's not decided due to the duplication concern. We tend
> > to look into next year after other priorities are solved for this year.
> >
> > > generally there's enough info in the existing bugs where authors can
> > > rewrite their patch without even needing to rebuild with Clang (though
> > > having correct directions to reproduce would be nice, we could wait
> > > until someone asked for them explicitly).
> > >
> > > --
> > > Thanks,
> > > ~Nick Desaulniers
> 
> 
> 
> -- 
> Thanks,
> ~Nick Desaulniers

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 05/22] btrfs: add the beginning of async discard, discard workqueue
@ 2019-11-26  4:07             ` Philip Li
  0 siblings, 0 replies; 23+ messages in thread
From: Philip Li @ 2019-11-26  4:07 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 4088 bytes --]

On Mon, Nov 25, 2019 at 05:47:00PM -0800, Nick Desaulniers wrote:
> On Mon, Nov 25, 2019 at 5:35 PM Philip Li <philip.li@intel.com> wrote:
> >
> > On Mon, Nov 25, 2019 at 11:39:08AM -0800, Nick Desaulniers wrote:
> > > On Mon, Nov 25, 2019 at 10:59 AM Dennis Zhou <dennis@kernel.org> wrote:
> > > >
> > > > On Thu, Nov 21, 2019 at 08:27:43PM -0800, Nick Desaulniers wrote:
> > > > > Hi Dennis,
> > > > > Below is a 0day bot report from a build w/ Clang. Warning looks legit,
> > > > > can you please take a look?
> > > > >
> > > >
> > > > Ah thanks for this! Yeah that was a miss when I switched from flags ->
> > > > an enum and didn't update the declaration properly. I'll be sending out
> > > > a v4 as another fix for arm has some rebase conflicts.
> > > >
> > > > Is there a way to enable so I get these emails directly?
> > >
> > > + Rong, Philip
> > >
> > > The reports have only been sent to our mailing list where we've been
> > > manually triaging them.  The issue with enabling them globally was
> > > that the script to reproduce the warning still doesn't mention how to
> > > build w/ Clang.
> > Thanks Nick for continuous caring on this. One thing we initially worry
> > is how to avoid duplicated reports to developer, like the one that can
> > be same as gcc's finding. We haven't found a way to effectively handle
> > this.
> 
> Thanks for maintaining an invaluable tool.
> 
> How would the reports be duplicated? Does 0day bot build with GCC,
> then rebuild with Clang?
no, they are built separately. For duplication, i refer to the issue
can be detected by both tool, and gcc reports out already (or clang first).

> 
> Regardless, does it matter? If I make a mistake, and get two build
> failure emails from 0day bot instead of one, does it matter? Sometimes
> developers may just get one, as some warnings are unique to each
> compiler.  Maybe it runs the risk of folks ignoring the email if the
> volume is too much, but do authors generally ignore 0day bot emails?
> (I'd hope not).
:-) this is a good point, and recently we are working to make the
service more stable to generate reports in time.

> 
> >
> > >
> > > In general the reports have been high value (I ignore most reports
> > > with -Wimplicit-function-declaration, which is the most frequent as it
> > > shows the patch was not compile tested at all).
> > Do we mean the report with -Wimplicit-function-declaration can be duplicated
> > to gcc, so we can ignore them to avoid duplication to developer?
> 
> Many of the warnings GCC has Clang does as well.
> -Wimplicit-function-declaration is the most common warning I see in
> triage, which developers would see regardless of toolchain had they
> compiled first before pushing.  It might be interesting to see maybe
> the intersection or common flags between GCC and Clang, and only email
> Clang reports for warnings unique to clang?  I think CFLAGS can even
> be passed into make invocations so you could do:
> $ make CC=clang KBUILD_CFLAGS=<list of flags common to GCC and Clang;
> -Wno-implicit-function-declaration -Wno-...>
> such that any resulting warnings were unique to Clang.  I'd expect
> such a list to quickly get stale over time though.
thanks for the idea, we will look into this. 

> 
> >
> > >
> > > Rong, Philip, it's been a while since we talked about this last. Is
> > > there a general timeline of when these reports will be turned on
> > > globally?  Even if the directions to reproduce aren't quite right,
> > For the timeline, it's not decided due to the duplication concern. We tend
> > to look into next year after other priorities are solved for this year.
> >
> > > generally there's enough info in the existing bugs where authors can
> > > rewrite their patch without even needing to rebuild with Clang (though
> > > having correct directions to reproduce would be nice, we could wait
> > > until someone asked for them explicitly).
> > >
> > > --
> > > Thanks,
> > > ~Nick Desaulniers
> 
> 
> 
> -- 
> Thanks,
> ~Nick Desaulniers

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 05/22] btrfs: add the beginning of async discard, discard workqueue
@ 2019-11-26  4:07             ` Philip Li
  0 siblings, 0 replies; 23+ messages in thread
From: Philip Li @ 2019-11-26  4:07 UTC (permalink / raw)
  To: kbuild

[-- Attachment #1: Type: text/plain, Size: 4088 bytes --]

On Mon, Nov 25, 2019 at 05:47:00PM -0800, Nick Desaulniers wrote:
> On Mon, Nov 25, 2019 at 5:35 PM Philip Li <philip.li@intel.com> wrote:
> >
> > On Mon, Nov 25, 2019 at 11:39:08AM -0800, Nick Desaulniers wrote:
> > > On Mon, Nov 25, 2019 at 10:59 AM Dennis Zhou <dennis@kernel.org> wrote:
> > > >
> > > > On Thu, Nov 21, 2019 at 08:27:43PM -0800, Nick Desaulniers wrote:
> > > > > Hi Dennis,
> > > > > Below is a 0day bot report from a build w/ Clang. Warning looks legit,
> > > > > can you please take a look?
> > > > >
> > > >
> > > > Ah thanks for this! Yeah that was a miss when I switched from flags ->
> > > > an enum and didn't update the declaration properly. I'll be sending out
> > > > a v4 as another fix for arm has some rebase conflicts.
> > > >
> > > > Is there a way to enable so I get these emails directly?
> > >
> > > + Rong, Philip
> > >
> > > The reports have only been sent to our mailing list where we've been
> > > manually triaging them.  The issue with enabling them globally was
> > > that the script to reproduce the warning still doesn't mention how to
> > > build w/ Clang.
> > Thanks Nick for continuous caring on this. One thing we initially worry
> > is how to avoid duplicated reports to developer, like the one that can
> > be same as gcc's finding. We haven't found a way to effectively handle
> > this.
> 
> Thanks for maintaining an invaluable tool.
> 
> How would the reports be duplicated? Does 0day bot build with GCC,
> then rebuild with Clang?
no, they are built separately. For duplication, i refer to the issue
can be detected by both tool, and gcc reports out already (or clang first).

> 
> Regardless, does it matter? If I make a mistake, and get two build
> failure emails from 0day bot instead of one, does it matter? Sometimes
> developers may just get one, as some warnings are unique to each
> compiler.  Maybe it runs the risk of folks ignoring the email if the
> volume is too much, but do authors generally ignore 0day bot emails?
> (I'd hope not).
:-) this is a good point, and recently we are working to make the
service more stable to generate reports in time.

> 
> >
> > >
> > > In general the reports have been high value (I ignore most reports
> > > with -Wimplicit-function-declaration, which is the most frequent as it
> > > shows the patch was not compile tested at all).
> > Do we mean the report with -Wimplicit-function-declaration can be duplicated
> > to gcc, so we can ignore them to avoid duplication to developer?
> 
> Many of the warnings GCC has Clang does as well.
> -Wimplicit-function-declaration is the most common warning I see in
> triage, which developers would see regardless of toolchain had they
> compiled first before pushing.  It might be interesting to see maybe
> the intersection or common flags between GCC and Clang, and only email
> Clang reports for warnings unique to clang?  I think CFLAGS can even
> be passed into make invocations so you could do:
> $ make CC=clang KBUILD_CFLAGS=<list of flags common to GCC and Clang;
> -Wno-implicit-function-declaration -Wno-...>
> such that any resulting warnings were unique to Clang.  I'd expect
> such a list to quickly get stale over time though.
thanks for the idea, we will look into this. 

> 
> >
> > >
> > > Rong, Philip, it's been a while since we talked about this last. Is
> > > there a general timeline of when these reports will be turned on
> > > globally?  Even if the directions to reproduce aren't quite right,
> > For the timeline, it's not decided due to the duplication concern. We tend
> > to look into next year after other priorities are solved for this year.
> >
> > > generally there's enough info in the existing bugs where authors can
> > > rewrite their patch without even needing to rebuild with Clang (though
> > > having correct directions to reproduce would be nice, we could wait
> > > until someone asked for them explicitly).
> > >
> > > --
> > > Thanks,
> > > ~Nick Desaulniers
> 
> 
> 
> -- 
> Thanks,
> ~Nick Desaulniers

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 05/22] btrfs: add the beginning of async discard, discard workqueue
  2019-11-26  1:47           ` Nick Desaulniers
  (?)
@ 2019-11-26  4:09             ` Philip Li
  -1 siblings, 0 replies; 23+ messages in thread
From: Philip Li @ 2019-11-26  4:09 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: Dennis Zhou, Chen Rong, kbuild, clang-built-linux,
	kbuild test robot, kbuild-all, David Sterba, Chris Mason,
	Josef Bacik, Omar Sandoval, kernel-team, linux-btrfs

On Mon, Nov 25, 2019 at 05:47:00PM -0800, Nick Desaulniers wrote:
> On Mon, Nov 25, 2019 at 5:35 PM Philip Li <philip.li@intel.com> wrote:
> >
> > On Mon, Nov 25, 2019 at 11:39:08AM -0800, Nick Desaulniers wrote:
> > > On Mon, Nov 25, 2019 at 10:59 AM Dennis Zhou <dennis@kernel.org> wrote:
> > > >
> > > > On Thu, Nov 21, 2019 at 08:27:43PM -0800, Nick Desaulniers wrote:
> > > > > Hi Dennis,
> > > > > Below is a 0day bot report from a build w/ Clang. Warning looks legit,
> > > > > can you please take a look?
> > > > >
> > > >
> > > > Ah thanks for this! Yeah that was a miss when I switched from flags ->
> > > > an enum and didn't update the declaration properly. I'll be sending out
> > > > a v4 as another fix for arm has some rebase conflicts.
> > > >
> > > > Is there a way to enable so I get these emails directly?
> > >
> > > + Rong, Philip
> > >
> > > The reports have only been sent to our mailing list where we've been
> > > manually triaging them.  The issue with enabling them globally was
> > > that the script to reproduce the warning still doesn't mention how to
> > > build w/ Clang.
Hi Nick, i forgot one question. Is it still expected to use latest clang
to build test? Any possibility the issue is related to clang compiler itself?

Thanks

> > Thanks Nick for continuous caring on this. One thing we initially worry
> > is how to avoid duplicated reports to developer, like the one that can
> > be same as gcc's finding. We haven't found a way to effectively handle
> > this.
> 
> Thanks for maintaining an invaluable tool.
> 
> How would the reports be duplicated? Does 0day bot build with GCC,
> then rebuild with Clang?
> 
> Regardless, does it matter? If I make a mistake, and get two build
> failure emails from 0day bot instead of one, does it matter? Sometimes
> developers may just get one, as some warnings are unique to each
> compiler.  Maybe it runs the risk of folks ignoring the email if the
> volume is too much, but do authors generally ignore 0day bot emails?
> (I'd hope not).
> 
> >
> > >
> > > In general the reports have been high value (I ignore most reports
> > > with -Wimplicit-function-declaration, which is the most frequent as it
> > > shows the patch was not compile tested at all).
> > Do we mean the report with -Wimplicit-function-declaration can be duplicated
> > to gcc, so we can ignore them to avoid duplication to developer?
> 
> Many of the warnings GCC has Clang does as well.
> -Wimplicit-function-declaration is the most common warning I see in
> triage, which developers would see regardless of toolchain had they
> compiled first before pushing.  It might be interesting to see maybe
> the intersection or common flags between GCC and Clang, and only email
> Clang reports for warnings unique to clang?  I think CFLAGS can even
> be passed into make invocations so you could do:
> $ make CC=clang KBUILD_CFLAGS=<list of flags common to GCC and Clang;
> -Wno-implicit-function-declaration -Wno-...>
> such that any resulting warnings were unique to Clang.  I'd expect
> such a list to quickly get stale over time though.
> 
> >
> > >
> > > Rong, Philip, it's been a while since we talked about this last. Is
> > > there a general timeline of when these reports will be turned on
> > > globally?  Even if the directions to reproduce aren't quite right,
> > For the timeline, it's not decided due to the duplication concern. We tend
> > to look into next year after other priorities are solved for this year.
> >
> > > generally there's enough info in the existing bugs where authors can
> > > rewrite their patch without even needing to rebuild with Clang (though
> > > having correct directions to reproduce would be nice, we could wait
> > > until someone asked for them explicitly).
> > >
> > > --
> > > Thanks,
> > > ~Nick Desaulniers
> 
> 
> 
> -- 
> Thanks,
> ~Nick Desaulniers

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 05/22] btrfs: add the beginning of async discard, discard workqueue
@ 2019-11-26  4:09             ` Philip Li
  0 siblings, 0 replies; 23+ messages in thread
From: Philip Li @ 2019-11-26  4:09 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 3936 bytes --]

On Mon, Nov 25, 2019 at 05:47:00PM -0800, Nick Desaulniers wrote:
> On Mon, Nov 25, 2019 at 5:35 PM Philip Li <philip.li@intel.com> wrote:
> >
> > On Mon, Nov 25, 2019 at 11:39:08AM -0800, Nick Desaulniers wrote:
> > > On Mon, Nov 25, 2019 at 10:59 AM Dennis Zhou <dennis@kernel.org> wrote:
> > > >
> > > > On Thu, Nov 21, 2019 at 08:27:43PM -0800, Nick Desaulniers wrote:
> > > > > Hi Dennis,
> > > > > Below is a 0day bot report from a build w/ Clang. Warning looks legit,
> > > > > can you please take a look?
> > > > >
> > > >
> > > > Ah thanks for this! Yeah that was a miss when I switched from flags ->
> > > > an enum and didn't update the declaration properly. I'll be sending out
> > > > a v4 as another fix for arm has some rebase conflicts.
> > > >
> > > > Is there a way to enable so I get these emails directly?
> > >
> > > + Rong, Philip
> > >
> > > The reports have only been sent to our mailing list where we've been
> > > manually triaging them.  The issue with enabling them globally was
> > > that the script to reproduce the warning still doesn't mention how to
> > > build w/ Clang.
Hi Nick, i forgot one question. Is it still expected to use latest clang
to build test? Any possibility the issue is related to clang compiler itself?

Thanks

> > Thanks Nick for continuous caring on this. One thing we initially worry
> > is how to avoid duplicated reports to developer, like the one that can
> > be same as gcc's finding. We haven't found a way to effectively handle
> > this.
> 
> Thanks for maintaining an invaluable tool.
> 
> How would the reports be duplicated? Does 0day bot build with GCC,
> then rebuild with Clang?
> 
> Regardless, does it matter? If I make a mistake, and get two build
> failure emails from 0day bot instead of one, does it matter? Sometimes
> developers may just get one, as some warnings are unique to each
> compiler.  Maybe it runs the risk of folks ignoring the email if the
> volume is too much, but do authors generally ignore 0day bot emails?
> (I'd hope not).
> 
> >
> > >
> > > In general the reports have been high value (I ignore most reports
> > > with -Wimplicit-function-declaration, which is the most frequent as it
> > > shows the patch was not compile tested at all).
> > Do we mean the report with -Wimplicit-function-declaration can be duplicated
> > to gcc, so we can ignore them to avoid duplication to developer?
> 
> Many of the warnings GCC has Clang does as well.
> -Wimplicit-function-declaration is the most common warning I see in
> triage, which developers would see regardless of toolchain had they
> compiled first before pushing.  It might be interesting to see maybe
> the intersection or common flags between GCC and Clang, and only email
> Clang reports for warnings unique to clang?  I think CFLAGS can even
> be passed into make invocations so you could do:
> $ make CC=clang KBUILD_CFLAGS=<list of flags common to GCC and Clang;
> -Wno-implicit-function-declaration -Wno-...>
> such that any resulting warnings were unique to Clang.  I'd expect
> such a list to quickly get stale over time though.
> 
> >
> > >
> > > Rong, Philip, it's been a while since we talked about this last. Is
> > > there a general timeline of when these reports will be turned on
> > > globally?  Even if the directions to reproduce aren't quite right,
> > For the timeline, it's not decided due to the duplication concern. We tend
> > to look into next year after other priorities are solved for this year.
> >
> > > generally there's enough info in the existing bugs where authors can
> > > rewrite their patch without even needing to rebuild with Clang (though
> > > having correct directions to reproduce would be nice, we could wait
> > > until someone asked for them explicitly).
> > >
> > > --
> > > Thanks,
> > > ~Nick Desaulniers
> 
> 
> 
> -- 
> Thanks,
> ~Nick Desaulniers

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 05/22] btrfs: add the beginning of async discard, discard workqueue
@ 2019-11-26  4:09             ` Philip Li
  0 siblings, 0 replies; 23+ messages in thread
From: Philip Li @ 2019-11-26  4:09 UTC (permalink / raw)
  To: kbuild

[-- Attachment #1: Type: text/plain, Size: 3936 bytes --]

On Mon, Nov 25, 2019 at 05:47:00PM -0800, Nick Desaulniers wrote:
> On Mon, Nov 25, 2019 at 5:35 PM Philip Li <philip.li@intel.com> wrote:
> >
> > On Mon, Nov 25, 2019 at 11:39:08AM -0800, Nick Desaulniers wrote:
> > > On Mon, Nov 25, 2019 at 10:59 AM Dennis Zhou <dennis@kernel.org> wrote:
> > > >
> > > > On Thu, Nov 21, 2019 at 08:27:43PM -0800, Nick Desaulniers wrote:
> > > > > Hi Dennis,
> > > > > Below is a 0day bot report from a build w/ Clang. Warning looks legit,
> > > > > can you please take a look?
> > > > >
> > > >
> > > > Ah thanks for this! Yeah that was a miss when I switched from flags ->
> > > > an enum and didn't update the declaration properly. I'll be sending out
> > > > a v4 as another fix for arm has some rebase conflicts.
> > > >
> > > > Is there a way to enable so I get these emails directly?
> > >
> > > + Rong, Philip
> > >
> > > The reports have only been sent to our mailing list where we've been
> > > manually triaging them.  The issue with enabling them globally was
> > > that the script to reproduce the warning still doesn't mention how to
> > > build w/ Clang.
Hi Nick, i forgot one question. Is it still expected to use latest clang
to build test? Any possibility the issue is related to clang compiler itself?

Thanks

> > Thanks Nick for continuous caring on this. One thing we initially worry
> > is how to avoid duplicated reports to developer, like the one that can
> > be same as gcc's finding. We haven't found a way to effectively handle
> > this.
> 
> Thanks for maintaining an invaluable tool.
> 
> How would the reports be duplicated? Does 0day bot build with GCC,
> then rebuild with Clang?
> 
> Regardless, does it matter? If I make a mistake, and get two build
> failure emails from 0day bot instead of one, does it matter? Sometimes
> developers may just get one, as some warnings are unique to each
> compiler.  Maybe it runs the risk of folks ignoring the email if the
> volume is too much, but do authors generally ignore 0day bot emails?
> (I'd hope not).
> 
> >
> > >
> > > In general the reports have been high value (I ignore most reports
> > > with -Wimplicit-function-declaration, which is the most frequent as it
> > > shows the patch was not compile tested at all).
> > Do we mean the report with -Wimplicit-function-declaration can be duplicated
> > to gcc, so we can ignore them to avoid duplication to developer?
> 
> Many of the warnings GCC has Clang does as well.
> -Wimplicit-function-declaration is the most common warning I see in
> triage, which developers would see regardless of toolchain had they
> compiled first before pushing.  It might be interesting to see maybe
> the intersection or common flags between GCC and Clang, and only email
> Clang reports for warnings unique to clang?  I think CFLAGS can even
> be passed into make invocations so you could do:
> $ make CC=clang KBUILD_CFLAGS=<list of flags common to GCC and Clang;
> -Wno-implicit-function-declaration -Wno-...>
> such that any resulting warnings were unique to Clang.  I'd expect
> such a list to quickly get stale over time though.
> 
> >
> > >
> > > Rong, Philip, it's been a while since we talked about this last. Is
> > > there a general timeline of when these reports will be turned on
> > > globally?  Even if the directions to reproduce aren't quite right,
> > For the timeline, it's not decided due to the duplication concern. We tend
> > to look into next year after other priorities are solved for this year.
> >
> > > generally there's enough info in the existing bugs where authors can
> > > rewrite their patch without even needing to rebuild with Clang (though
> > > having correct directions to reproduce would be nice, we could wait
> > > until someone asked for them explicitly).
> > >
> > > --
> > > Thanks,
> > > ~Nick Desaulniers
> 
> 
> 
> -- 
> Thanks,
> ~Nick Desaulniers

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v6 00/22] btrfs: async discard support
@ 2019-12-14  0:22 Dennis Zhou
  2019-12-14  0:22 ` [PATCH 05/22] btrfs: add the beginning of async discard, discard workqueue Dennis Zhou
  0 siblings, 1 reply; 23+ messages in thread
From: Dennis Zhou @ 2019-12-14  0:22 UTC (permalink / raw)
  To: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval
  Cc: kernel-team, linux-btrfs, Dennis Zhou

Hello,

Dave reported a lockdep issue [1]. I'm a bit surprised as I can't repro
it, but it obviously is right. I believe I fixed the issue by moving the
fully trimmed check outside of the block_group lock.  I mistakingly
thought the btrfs_block_group lock subsumed btrfs_free_space_ctl
tree_lock. This clearly isn't the case.

Changes in v6:
 - Move the fully trimmed check outside of the block_group lock.

v5 is available here: [2].

This series is on top of btrfs-devel#misc-next 7ee98bb808e2 + [3] and
[4].

[1] https://lore.kernel.org/linux-btrfs/20191210140438.GU2734@twin.jikos.cz/
[2] https://lore.kernel.org/linux-btrfs/cover.1575919745.git.dennis@kernel.org/
[3] https://lore.kernel.org/linux-btrfs/d934383ea528d920a95b6107daad6023b516f0f4.1576109087.git.dennis@kernel.org/
[4] https://lore.kernel.org/linux-btrfs/20191209193846.18162-1-dennis@kernel.org/

Dennis Zhou (22):
  bitmap: genericize percpu bitmap region iterators
  btrfs: rename DISCARD opt to DISCARD_SYNC
  btrfs: keep track of which extents have been discarded
  btrfs: keep track of cleanliness of the bitmap
  btrfs: add the beginning of async discard, discard workqueue
  btrfs: handle empty block_group removal
  btrfs: discard one region at a time in async discard
  btrfs: add removal calls for sysfs debug/
  btrfs: make UUID/debug have its own kobject
  btrfs: add discard sysfs directory
  btrfs: track discardable extents for async discard
  btrfs: keep track of discardable_bytes
  btrfs: calculate discard delay based on number of extents
  btrfs: add bps discard rate limit
  btrfs: limit max discard size for async discard
  btrfs: make max async discard size tunable
  btrfs: have multiple discard lists
  btrfs: only keep track of data extents for async discard
  btrfs: keep track of discard reuse stats
  btrfs: add async discard header
  btrfs: increase the metadata allowance for the free_space_cache
  btrfs: make smaller extents more likely to go into bitmaps

 fs/btrfs/Makefile           |   2 +-
 fs/btrfs/block-group.c      |  87 ++++-
 fs/btrfs/block-group.h      |  30 ++
 fs/btrfs/ctree.h            |  52 ++-
 fs/btrfs/discard.c          | 684 ++++++++++++++++++++++++++++++++++++
 fs/btrfs/discard.h          |  42 +++
 fs/btrfs/disk-io.c          |  15 +-
 fs/btrfs/extent-tree.c      |   8 +-
 fs/btrfs/free-space-cache.c | 611 +++++++++++++++++++++++++++-----
 fs/btrfs/free-space-cache.h |  41 ++-
 fs/btrfs/inode-map.c        |  13 +-
 fs/btrfs/inode.c            |   2 +-
 fs/btrfs/scrub.c            |   7 +-
 fs/btrfs/super.c            |  39 +-
 fs/btrfs/sysfs.c            | 205 ++++++++++-
 fs/btrfs/volumes.c          |   7 +
 include/linux/bitmap.h      |  35 ++
 mm/percpu.c                 |  61 +---
 18 files changed, 1789 insertions(+), 152 deletions(-)
 create mode 100644 fs/btrfs/discard.c
 create mode 100644 fs/btrfs/discard.h

Thanks,
Dennis

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 05/22] btrfs: add the beginning of async discard, discard workqueue
  2019-12-14  0:22 [PATCH v6 00/22] btrfs: async discard support Dennis Zhou
@ 2019-12-14  0:22 ` Dennis Zhou
  0 siblings, 0 replies; 23+ messages in thread
From: Dennis Zhou @ 2019-12-14  0:22 UTC (permalink / raw)
  To: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval
  Cc: kernel-team, linux-btrfs, Dennis Zhou

When discard is enabled, everytime a pinned extent is released back to
the block_group's free space cache, a discard is issued for the extent.
This is an overeager approach when it comes to discarding and helping
the SSD maintain enough free space to prevent severe garbage collection
situations.

This adds the beginning of async discard. Instead of issuing a discard
prior to returning it to the free space, it is just marked as untrimmed.
The block_group is then added to a LRU which then feeds into a workqueue
to issue discards at a much slower rate. Full discarding of unused block
groups is still done and will be address in a future patch in this
series.

For now, we don't persist the discard state of extents and bitmaps.
Therefore, our failure recovery mode will be to consider extents
untrimmed. This lets us handle failure and unmounting as one in the
same.

On a number of Facebook webservers, I collected data every minute
accounting the time we spent in btrfs_finish_extent_commit() (col. 1)
and in btrfs_commit_transaction() (col. 2). btrfs_finish_extent_commit()
is where we discard extents synchronously before returning them to the
free space cache.

discard=sync:
                 p99 total per minute       p99 total per minute
      Drive   |   extent_commit() (ms)  |    commit_trans() (ms)
    ---------------------------------------------------------------
     Drive A  |           434           |          1170
     Drive B  |           880           |          2330
     Drive C  |          2943           |          3920
     Drive D  |          4763           |          5701

discard=async:
                 p99 total per minute       p99 total per minute
      Drive   |   extent_commit() (ms)  |    commit_trans() (ms)
    --------------------------------------------------------------
     Drive A  |           134           |           956
     Drive B  |            64           |          1972
     Drive C  |            59           |          1032
     Drive D  |            62           |          1200

While it's not great that the stats are cumulative over 1m, all of these
servers are running the same workload and and the delta between the two
are substantial. We are spending significantly less time in
btrfs_finish_extent_commit() which is responsible for discarding.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/Makefile           |   2 +-
 fs/btrfs/block-group.c      |  37 ++++-
 fs/btrfs/block-group.h      |   9 ++
 fs/btrfs/ctree.h            |  21 +++
 fs/btrfs/discard.c          | 278 ++++++++++++++++++++++++++++++++++++
 fs/btrfs/discard.h          |  23 +++
 fs/btrfs/disk-io.c          |  15 +-
 fs/btrfs/extent-tree.c      |   4 +
 fs/btrfs/free-space-cache.c |  54 ++++++-
 fs/btrfs/free-space-cache.h |   2 +
 fs/btrfs/super.c            |  35 ++++-
 fs/btrfs/volumes.c          |   7 +
 12 files changed, 474 insertions(+), 13 deletions(-)
 create mode 100644 fs/btrfs/discard.c
 create mode 100644 fs/btrfs/discard.h

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 82200dbca5ac..9a0ff3384381 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -11,7 +11,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 	   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
 	   reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
 	   uuid-tree.o props.o free-space-tree.o tree-checker.o space-info.o \
-	   block-rsv.o delalloc-space.o block-group.o
+	   block-rsv.o delalloc-space.o block-group.o discard.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index be1938dc94fd..c1b1b59343bd 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -14,6 +14,7 @@
 #include "sysfs.h"
 #include "tree-log.h"
 #include "delalloc-space.h"
+#include "discard.h"
 
 /*
  * Return target flags in extended format or 0 if restripe for this chunk_type
@@ -131,6 +132,15 @@ void btrfs_put_block_group(struct btrfs_block_group *cache)
 		WARN_ON(cache->pinned > 0);
 		WARN_ON(cache->reserved > 0);
 
+		/*
+		 * A block_group shouldn't be on the discard_list anymore.
+		 * Remove the block_group from the discard_list to prevent us
+		 * from causing a panic due to NPE.
+		 */
+		if (WARN_ON(!list_empty(&cache->discard_list)))
+			btrfs_discard_cancel_work(&cache->fs_info->discard_ctl,
+						  cache);
+
 		/*
 		 * If not empty, someone is still holding mutex of
 		 * full_stripe_lock, which can only be released by caller.
@@ -466,8 +476,8 @@ u64 add_new_free_space(struct btrfs_block_group *block_group, u64 start, u64 end
 		} else if (extent_start > start && extent_start < end) {
 			size = extent_start - start;
 			total_added += size;
-			ret = btrfs_add_free_space(block_group, start,
-						   size);
+			ret = btrfs_add_free_space_async_trimmed(block_group,
+								 start, size);
 			BUG_ON(ret); /* -ENOMEM or logic error */
 			start = extent_end + 1;
 		} else {
@@ -478,7 +488,8 @@ u64 add_new_free_space(struct btrfs_block_group *block_group, u64 start, u64 end
 	if (start < end) {
 		size = end - start;
 		total_added += size;
-		ret = btrfs_add_free_space(block_group, start, size);
+		ret = btrfs_add_free_space_async_trimmed(block_group, start,
+							 size);
 		BUG_ON(ret); /* -ENOMEM or logic error */
 	}
 
@@ -1258,6 +1269,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 		}
 		spin_unlock(&fs_info->unused_bgs_lock);
 
+		btrfs_discard_cancel_work(&fs_info->discard_ctl, block_group);
+
 		mutex_lock(&fs_info->delete_unused_bgs_mutex);
 
 		/* Don't want to race with allocators so take the groups_sem */
@@ -1333,6 +1346,23 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 		}
 		mutex_unlock(&fs_info->unused_bg_unpin_mutex);
 
+		/*
+		 * At this point, the block_group is read only and should fail
+		 * new allocations.  However, btrfs_finish_extent_commit() can
+		 * cause this block_group to be placed back on the discard
+		 * lists because now the block_group isn't fully discarded.
+		 * Bail here and try again later after discarding everything.
+		 */
+		spin_lock(&fs_info->discard_ctl.lock);
+		if (!list_empty(&block_group->discard_list)) {
+			spin_unlock(&fs_info->discard_ctl.lock);
+			btrfs_dec_block_group_ro(block_group);
+			btrfs_discard_queue_work(&fs_info->discard_ctl,
+						 block_group);
+			goto end_trans;
+		}
+		spin_unlock(&fs_info->discard_ctl.lock);
+
 		/* Reset pinned so btrfs_put_block_group doesn't complain */
 		spin_lock(&space_info->lock);
 		spin_lock(&block_group->lock);
@@ -1603,6 +1633,7 @@ static struct btrfs_block_group *btrfs_create_block_group_cache(
 	INIT_LIST_HEAD(&cache->cluster_list);
 	INIT_LIST_HEAD(&cache->bg_list);
 	INIT_LIST_HEAD(&cache->ro_list);
+	INIT_LIST_HEAD(&cache->discard_list);
 	INIT_LIST_HEAD(&cache->dirty_list);
 	INIT_LIST_HEAD(&cache->io_list);
 	btrfs_init_free_space_ctl(cache);
diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index 9b409676c4b2..884defd61dcd 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -116,7 +116,11 @@ struct btrfs_block_group {
 	/* For read-only block groups */
 	struct list_head ro_list;
 
+	/* For discard operations */
 	atomic_t trimming;
+	struct list_head discard_list;
+	int discard_index;
+	u64 discard_eligible_time;
 
 	/* For dirty block groups */
 	struct list_head dirty_list;
@@ -158,6 +162,11 @@ struct btrfs_block_group {
 	struct btrfs_full_stripe_locks_tree full_stripe_locks_root;
 };
 
+static inline u64 btrfs_block_group_end(struct btrfs_block_group *block_group)
+{
+	return (block_group->start + block_group->length);
+}
+
 #ifdef CONFIG_BTRFS_DEBUG
 static inline int btrfs_should_fragment_free_space(
 		struct btrfs_block_group *block_group)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2f6c21ea84af..f7b429277089 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -440,6 +440,21 @@ struct btrfs_full_stripe_locks_tree {
 	struct mutex lock;
 };
 
+/* Discard control. */
+/*
+ * Async discard uses multiple lists to differentiate the discard filter
+ * parameters.
+ */
+#define BTRFS_NR_DISCARD_LISTS		1
+
+struct btrfs_discard_ctl {
+	struct workqueue_struct *discard_workers;
+	struct delayed_work work;
+	spinlock_t lock;
+	struct btrfs_block_group *block_group;
+	struct list_head discard_list[BTRFS_NR_DISCARD_LISTS];
+};
+
 /* delayed seq elem */
 struct seq_list {
 	struct list_head list;
@@ -526,6 +541,9 @@ enum {
 	 * so we don't need to offload checksums to workqueues.
 	 */
 	BTRFS_FS_CSUM_IMPL_FAST,
+
+	/* Indicate that the discard workqueue can service discards. */
+	BTRFS_FS_DISCARD_RUNNING,
 };
 
 struct btrfs_fs_info {
@@ -816,6 +834,8 @@ struct btrfs_fs_info {
 	struct btrfs_workqueue *scrub_wr_completion_workers;
 	struct btrfs_workqueue *scrub_parity_workers;
 
+	struct btrfs_discard_ctl discard_ctl;
+
 #ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
 	u32 check_integrity_print_mask;
 #endif
@@ -1189,6 +1209,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info)
 #define BTRFS_MOUNT_FREE_SPACE_TREE	(1 << 26)
 #define BTRFS_MOUNT_NOLOGREPLAY		(1 << 27)
 #define BTRFS_MOUNT_REF_VERIFY		(1 << 28)
+#define BTRFS_MOUNT_DISCARD_ASYNC	(1 << 29)
 
 #define BTRFS_DEFAULT_COMMIT_INTERVAL	(30)
 #define BTRFS_DEFAULT_MAX_INLINE	(2048)
diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
new file mode 100644
index 000000000000..692d64025802
--- /dev/null
+++ b/fs/btrfs/discard.c
@@ -0,0 +1,278 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/jiffies.h>
+#include <linux/kernel.h>
+#include <linux/ktime.h>
+#include <linux/list.h>
+#include <linux/sizes.h>
+#include <linux/workqueue.h>
+#include "ctree.h"
+#include "block-group.h"
+#include "discard.h"
+#include "free-space-cache.h"
+
+/* This is an initial delay to give some chance for lba reuse. */
+#define BTRFS_DISCARD_DELAY		(120ULL * NSEC_PER_SEC)
+
+static struct list_head *get_discard_list(struct btrfs_discard_ctl *discard_ctl,
+					  struct btrfs_block_group *block_group)
+{
+	return &discard_ctl->discard_list[block_group->discard_index];
+}
+
+static void add_to_discard_list(struct btrfs_discard_ctl *discard_ctl,
+				struct btrfs_block_group *block_group)
+{
+	spin_lock(&discard_ctl->lock);
+
+	if (!btrfs_run_discard_work(discard_ctl)) {
+		spin_unlock(&discard_ctl->lock);
+		return;
+	}
+
+	if (list_empty(&block_group->discard_list))
+		block_group->discard_eligible_time = (ktime_get_ns() +
+						      BTRFS_DISCARD_DELAY);
+
+	list_move_tail(&block_group->discard_list,
+		       get_discard_list(discard_ctl, block_group));
+
+	spin_unlock(&discard_ctl->lock);
+}
+
+static bool remove_from_discard_list(struct btrfs_discard_ctl *discard_ctl,
+				     struct btrfs_block_group *block_group)
+{
+	bool running = false;
+
+	spin_lock(&discard_ctl->lock);
+
+	if (block_group == discard_ctl->block_group) {
+		running = true;
+		discard_ctl->block_group = NULL;
+	}
+
+	block_group->discard_eligible_time = 0;
+	list_del_init(&block_group->discard_list);
+
+	spin_unlock(&discard_ctl->lock);
+
+	return running;
+}
+
+/**
+ * find_next_block_group - find block_group that's up next for discarding
+ * @discard_ctl: discard control
+ * @now: current time
+ *
+ * Iterate over the discard lists to find the next block_group up for
+ * discarding checking the discard_eligible_time of block_group.
+ */
+static struct btrfs_block_group *find_next_block_group(
+					struct btrfs_discard_ctl *discard_ctl,
+					u64 now)
+{
+	struct btrfs_block_group *ret_block_group = NULL, *block_group;
+	int i;
+
+	for (i = 0; i < BTRFS_NR_DISCARD_LISTS; i++) {
+		struct list_head *discard_list = &discard_ctl->discard_list[i];
+
+		if (!list_empty(discard_list)) {
+			block_group = list_first_entry(discard_list,
+						       struct btrfs_block_group,
+						       discard_list);
+
+			if (!ret_block_group)
+				ret_block_group = block_group;
+
+			if (ret_block_group->discard_eligible_time < now)
+				break;
+
+			if (ret_block_group->discard_eligible_time >
+			    block_group->discard_eligible_time)
+				ret_block_group = block_group;
+		}
+	}
+
+	return ret_block_group;
+}
+
+/**
+ * peek_discard_list - wrap find_next_block_group()
+ * @discard_ctl: discard control
+ *
+ * This wraps find_next_block_group() and sets the block_group to be in use.
+ */
+static struct btrfs_block_group *peek_discard_list(
+					struct btrfs_discard_ctl *discard_ctl)
+{
+	struct btrfs_block_group *block_group;
+	u64 now = ktime_get_ns();
+
+	spin_lock(&discard_ctl->lock);
+
+	block_group = find_next_block_group(discard_ctl, now);
+
+	if (block_group && now < block_group->discard_eligible_time)
+		block_group = NULL;
+
+	discard_ctl->block_group = block_group;
+
+	spin_unlock(&discard_ctl->lock);
+
+	return block_group;
+}
+
+/**
+ * btrfs_discard_cancel_work - remove a block_group from the discard lists
+ * @discard_ctl: discard control
+ * @block_group: block_group of interest
+ *
+ * This removes @block_group from the discard lists.  If necessary, it waits on
+ * the current work and then reschedules the delayed work.
+ */
+void btrfs_discard_cancel_work(struct btrfs_discard_ctl *discard_ctl,
+			       struct btrfs_block_group *block_group)
+{
+	if (remove_from_discard_list(discard_ctl, block_group)) {
+		cancel_delayed_work_sync(&discard_ctl->work);
+		btrfs_discard_schedule_work(discard_ctl, true);
+	}
+}
+
+/**
+ * btrfs_discard_queue_work - handles queuing the block_groups
+ * @discard_ctl: discard control
+ * @block_group: block_group of interest
+ *
+ * This maintains the LRU order of the discard lists.
+ */
+void btrfs_discard_queue_work(struct btrfs_discard_ctl *discard_ctl,
+			      struct btrfs_block_group *block_group)
+{
+	if (!block_group ||
+	    !btrfs_test_opt(block_group->fs_info, DISCARD_ASYNC))
+		return;
+
+	add_to_discard_list(discard_ctl, block_group);
+
+	if (!delayed_work_pending(&discard_ctl->work))
+		btrfs_discard_schedule_work(discard_ctl, false);
+}
+
+/**
+ * btrfs_discard_schedule_work - responsible for scheduling the discard work
+ * @discard_ctl: discard control
+ * @override: override the current timer
+ *
+ * Discards are issued by a delayed workqueue item.  @override is used to
+ * update the current delay as the baseline delay interview is reevaluated
+ * on transaction commit.  This is also maxed with any other rate limit.
+ */
+void btrfs_discard_schedule_work(struct btrfs_discard_ctl *discard_ctl,
+				 bool override)
+{
+	struct btrfs_block_group *block_group;
+	u64 now = ktime_get_ns();
+
+	spin_lock(&discard_ctl->lock);
+
+	if (!btrfs_run_discard_work(discard_ctl))
+		goto out;
+
+	if (!override && delayed_work_pending(&discard_ctl->work))
+		goto out;
+
+	block_group = find_next_block_group(discard_ctl, now);
+	if (block_group) {
+		u64 delay = 0;
+
+		if (now < block_group->discard_eligible_time)
+			delay = nsecs_to_jiffies(
+				block_group->discard_eligible_time - now);
+
+		mod_delayed_work(discard_ctl->discard_workers,
+				 &discard_ctl->work,
+				 delay);
+	}
+
+out:
+	spin_unlock(&discard_ctl->lock);
+}
+
+/**
+ * btrfs_discard_workfn - discard work function
+ * @work: work
+ *
+ * This finds the next block_group to start discarding and then discards it.
+ */
+static void btrfs_discard_workfn(struct work_struct *work)
+{
+	struct btrfs_discard_ctl *discard_ctl;
+	struct btrfs_block_group *block_group;
+	u64 trimmed = 0;
+
+	discard_ctl = container_of(work, struct btrfs_discard_ctl, work.work);
+
+	block_group = peek_discard_list(discard_ctl);
+	if (!block_group || !btrfs_run_discard_work(discard_ctl))
+		return;
+
+	btrfs_trim_block_group(block_group, &trimmed, block_group->start,
+			       btrfs_block_group_end(block_group), 0);
+
+	remove_from_discard_list(discard_ctl, block_group);
+
+	btrfs_discard_schedule_work(discard_ctl, false);
+}
+
+/**
+ * btrfs_run_discard_work - determines if async discard should be running
+ * @discard_ctl: discard control
+ *
+ * Checks if the file system is writeable and BTRFS_FS_DISCARD_RUNNING is set.
+ */
+bool btrfs_run_discard_work(struct btrfs_discard_ctl *discard_ctl)
+{
+	struct btrfs_fs_info *fs_info = container_of(discard_ctl,
+						     struct btrfs_fs_info,
+						     discard_ctl);
+
+	return (!(fs_info->sb->s_flags & SB_RDONLY) &&
+		test_bit(BTRFS_FS_DISCARD_RUNNING, &fs_info->flags));
+}
+
+void btrfs_discard_resume(struct btrfs_fs_info *fs_info)
+{
+	if (!btrfs_test_opt(fs_info, DISCARD_ASYNC)) {
+		btrfs_discard_cleanup(fs_info);
+		return;
+	}
+
+	set_bit(BTRFS_FS_DISCARD_RUNNING, &fs_info->flags);
+}
+
+void btrfs_discard_stop(struct btrfs_fs_info *fs_info)
+{
+	clear_bit(BTRFS_FS_DISCARD_RUNNING, &fs_info->flags);
+}
+
+void btrfs_discard_init(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_discard_ctl *discard_ctl = &fs_info->discard_ctl;
+	int i;
+
+	spin_lock_init(&discard_ctl->lock);
+
+	INIT_DELAYED_WORK(&discard_ctl->work, btrfs_discard_workfn);
+
+	for (i = 0; i < BTRFS_NR_DISCARD_LISTS; i++)
+		INIT_LIST_HEAD(&discard_ctl->discard_list[i]);
+}
+
+void btrfs_discard_cleanup(struct btrfs_fs_info *fs_info)
+{
+	btrfs_discard_stop(fs_info);
+	cancel_delayed_work_sync(&fs_info->discard_ctl.work);
+}
diff --git a/fs/btrfs/discard.h b/fs/btrfs/discard.h
new file mode 100644
index 000000000000..f3775e84d35a
--- /dev/null
+++ b/fs/btrfs/discard.h
@@ -0,0 +1,23 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#ifndef BTRFS_DISCARD_H
+#define BTRFS_DISCARD_H
+
+struct btrfs_fs_info;
+struct btrfs_discard_ctl;
+struct btrfs_block_group;
+
+void btrfs_discard_cancel_work(struct btrfs_discard_ctl *discard_ctl,
+			       struct btrfs_block_group *block_group);
+void btrfs_discard_queue_work(struct btrfs_discard_ctl *discard_ctl,
+			      struct btrfs_block_group *block_group);
+void btrfs_discard_schedule_work(struct btrfs_discard_ctl *discard_ctl,
+				 bool override);
+bool btrfs_run_discard_work(struct btrfs_discard_ctl *discard_ctl);
+
+void btrfs_discard_resume(struct btrfs_fs_info *fs_info);
+void btrfs_discard_stop(struct btrfs_fs_info *fs_info);
+void btrfs_discard_init(struct btrfs_fs_info *fs_info);
+void btrfs_discard_cleanup(struct btrfs_fs_info *fs_info);
+
+#endif
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 881aba162e4e..d517c4b31338 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -41,6 +41,7 @@
 #include "tree-checker.h"
 #include "ref-verify.h"
 #include "block-group.h"
+#include "discard.h"
 
 #define BTRFS_SUPER_FLAG_SUPP	(BTRFS_HEADER_FLAG_WRITTEN |\
 				 BTRFS_HEADER_FLAG_RELOC |\
@@ -1953,6 +1954,8 @@ static void btrfs_stop_all_workers(struct btrfs_fs_info *fs_info)
 	btrfs_destroy_workqueue(fs_info->readahead_workers);
 	btrfs_destroy_workqueue(fs_info->flush_workers);
 	btrfs_destroy_workqueue(fs_info->qgroup_rescan_workers);
+	if (fs_info->discard_ctl.discard_workers)
+		destroy_workqueue(fs_info->discard_ctl.discard_workers);
 	/*
 	 * Now that all other work queues are destroyed, we can safely destroy
 	 * the queues used for metadata I/O, since tasks from those other work
@@ -2148,6 +2151,8 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info,
 				      max_active, 2);
 	fs_info->qgroup_rescan_workers =
 		btrfs_alloc_workqueue(fs_info, "qgroup-rescan", flags, 1, 0);
+	fs_info->discard_ctl.discard_workers =
+		alloc_workqueue("btrfs_discard", WQ_UNBOUND | WQ_FREEZABLE, 1);
 
 	if (!(fs_info->workers && fs_info->delalloc_workers &&
 	      fs_info->flush_workers &&
@@ -2158,7 +2163,8 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info,
 	      fs_info->endio_freespace_worker && fs_info->rmw_workers &&
 	      fs_info->caching_workers && fs_info->readahead_workers &&
 	      fs_info->fixup_workers && fs_info->delayed_workers &&
-	      fs_info->qgroup_rescan_workers)) {
+	      fs_info->qgroup_rescan_workers &&
+	      fs_info->discard_ctl.discard_workers)) {
 		return -ENOMEM;
 	}
 
@@ -2793,6 +2799,8 @@ int __cold open_ctree(struct super_block *sb,
 	btrfs_init_dev_replace_locks(fs_info);
 	btrfs_init_qgroup(fs_info);
 
+	btrfs_discard_init(fs_info);
+
 	btrfs_init_free_cluster(&fs_info->meta_alloc_cluster);
 	btrfs_init_free_cluster(&fs_info->data_alloc_cluster);
 
@@ -3256,6 +3264,8 @@ int __cold open_ctree(struct super_block *sb,
 
 	btrfs_qgroup_rescan_resume(fs_info);
 
+	btrfs_discard_resume(fs_info);
+
 	if (!fs_info->uuid_root) {
 		btrfs_info(fs_info, "creating UUID tree");
 		ret = btrfs_create_uuid_tree(fs_info);
@@ -3971,6 +3981,9 @@ void __cold close_ctree(struct btrfs_fs_info *fs_info)
 
 	cancel_work_sync(&fs_info->async_reclaim_work);
 
+	/* cancel or finish ongoing discard work */
+	btrfs_discard_cleanup(fs_info);
+
 	if (!sb_rdonly(fs_info->sb)) {
 		/*
 		 * The cleaner kthread is stopped, so do one final pass over
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 1a8bf943c3e7..2c12366cfde5 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -32,6 +32,7 @@
 #include "block-rsv.h"
 #include "delalloc-space.h"
 #include "block-group.h"
+#include "discard.h"
 
 #undef SCRAMBLE_DELAYED_REFS
 
@@ -2934,6 +2935,9 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans)
 		cond_resched();
 	}
 
+	if (btrfs_test_opt(fs_info, DISCARD_ASYNC))
+		btrfs_discard_schedule_work(&fs_info->discard_ctl, true);
+
 	/*
 	 * Transaction is finished.  We don't need the lock anymore.  We
 	 * do need to clean up the block groups in case of a transaction
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 7108962e208d..0a868b6df893 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -21,6 +21,7 @@
 #include "space-info.h"
 #include "delalloc-space.h"
 #include "block-group.h"
+#include "discard.h"
 
 #define BITS_PER_BITMAP		(PAGE_SIZE * 8UL)
 #define MAX_CACHE_BYTES_PER_GIG	SZ_32K
@@ -755,9 +756,11 @@ static int __load_free_space_cache(struct btrfs_root *root, struct inode *inode,
 		/*
 		 * Sync discard ensures that the free space cache is always
 		 * trimmed.  So when reading this in, the state should reflect
-		 * that.
+		 * that.  We also do this for async as a stop gap for lack of
+		 * persistence.
 		 */
-		if (btrfs_test_opt(fs_info, DISCARD_SYNC))
+		if (btrfs_test_opt(fs_info, DISCARD_SYNC) ||
+		    btrfs_test_opt(fs_info, DISCARD_ASYNC))
 			e->trim_state = BTRFS_TRIM_STATE_TRIMMED;
 
 		if (!e->bytes) {
@@ -2382,6 +2385,7 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
 			   u64 offset, u64 bytes,
 			   enum btrfs_trim_state trim_state)
 {
+	struct btrfs_block_group *block_group = ctl->private;
 	struct btrfs_free_space *info;
 	int ret = 0;
 
@@ -2431,6 +2435,9 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
 		ASSERT(ret != -EEXIST);
 	}
 
+	if (trim_state != BTRFS_TRIM_STATE_TRIMMED)
+		btrfs_discard_queue_work(&fs_info->discard_ctl, block_group);
+
 	return ret;
 }
 
@@ -2447,6 +2454,25 @@ int btrfs_add_free_space(struct btrfs_block_group *block_group,
 				      bytenr, size, trim_state);
 }
 
+/*
+ * This is a subtle distinction because when adding free space back in general,
+ * we want it to be added as untrimmed for async. But in the case where we add
+ * it on loading of a block group, we want to consider it trimmed.
+ */
+int btrfs_add_free_space_async_trimmed(struct btrfs_block_group *block_group,
+				       u64 bytenr, u64 size)
+{
+	enum btrfs_trim_state trim_state = BTRFS_TRIM_STATE_UNTRIMMED;
+
+	if (btrfs_test_opt(block_group->fs_info, DISCARD_SYNC) ||
+	    btrfs_test_opt(block_group->fs_info, DISCARD_ASYNC))
+		trim_state = BTRFS_TRIM_STATE_TRIMMED;
+
+	return __btrfs_add_free_space(block_group->fs_info,
+				      block_group->free_space_ctl,
+				      bytenr, size, trim_state);
+}
+
 int btrfs_remove_free_space(struct btrfs_block_group *block_group,
 			    u64 offset, u64 bytes)
 {
@@ -3209,6 +3235,7 @@ void btrfs_init_free_cluster(struct btrfs_free_cluster *cluster)
 static int do_trimming(struct btrfs_block_group *block_group,
 		       u64 *total_trimmed, u64 start, u64 bytes,
 		       u64 reserved_start, u64 reserved_bytes,
+		       enum btrfs_trim_state reserved_trim_state,
 		       struct btrfs_trim_range *trim_entry)
 {
 	struct btrfs_space_info *space_info = block_group->space_info;
@@ -3216,6 +3243,9 @@ static int do_trimming(struct btrfs_block_group *block_group,
 	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
 	int ret;
 	int update = 0;
+	u64 end = start + bytes;
+	u64 reserved_end = reserved_start + reserved_bytes;
+	enum btrfs_trim_state trim_state = BTRFS_TRIM_STATE_UNTRIMMED;
 	u64 trimmed = 0;
 
 	spin_lock(&space_info->lock);
@@ -3229,11 +3259,20 @@ static int do_trimming(struct btrfs_block_group *block_group,
 	spin_unlock(&space_info->lock);
 
 	ret = btrfs_discard_extent(fs_info, start, bytes, &trimmed);
-	if (!ret)
+	if (!ret) {
 		*total_trimmed += trimmed;
+		trim_state = BTRFS_TRIM_STATE_TRIMMED;
+	}
 
 	mutex_lock(&ctl->cache_writeout_mutex);
-	btrfs_add_free_space(block_group, reserved_start, reserved_bytes);
+	if (reserved_start < start)
+		__btrfs_add_free_space(fs_info, ctl, reserved_start,
+				       start - reserved_start,
+				       reserved_trim_state);
+	if (start + bytes < reserved_start + reserved_bytes)
+		__btrfs_add_free_space(fs_info, ctl, end, reserved_end - end,
+				       reserved_trim_state);
+	__btrfs_add_free_space(fs_info, ctl, start, bytes, trim_state);
 	list_del(&trim_entry->list);
 	mutex_unlock(&ctl->cache_writeout_mutex);
 
@@ -3260,6 +3299,7 @@ static int trim_no_bitmap(struct btrfs_block_group *block_group,
 	int ret = 0;
 	u64 extent_start;
 	u64 extent_bytes;
+	enum btrfs_trim_state extent_trim_state;
 	u64 bytes;
 
 	while (start < end) {
@@ -3301,6 +3341,7 @@ static int trim_no_bitmap(struct btrfs_block_group *block_group,
 
 		extent_start = entry->offset;
 		extent_bytes = entry->bytes;
+		extent_trim_state = entry->trim_state;
 		start = max(start, extent_start);
 		bytes = min(extent_start + extent_bytes, end) - start;
 		if (bytes < minlen) {
@@ -3319,7 +3360,8 @@ static int trim_no_bitmap(struct btrfs_block_group *block_group,
 		mutex_unlock(&ctl->cache_writeout_mutex);
 
 		ret = do_trimming(block_group, total_trimmed, start, bytes,
-				  extent_start, extent_bytes, &trim_entry);
+				  extent_start, extent_bytes, extent_trim_state,
+				  &trim_entry);
 		if (ret)
 			break;
 next:
@@ -3445,7 +3487,7 @@ static int trim_bitmaps(struct btrfs_block_group *block_group,
 		mutex_unlock(&ctl->cache_writeout_mutex);
 
 		ret = do_trimming(block_group, total_trimmed, start, bytes,
-				  start, bytes, &trim_entry);
+				  start, bytes, 0, &trim_entry);
 		if (ret) {
 			reset_trimming_bitmap(ctl, offset);
 			break;
diff --git a/fs/btrfs/free-space-cache.h b/fs/btrfs/free-space-cache.h
index 29d16f58b40b..a88c30cb3b2b 100644
--- a/fs/btrfs/free-space-cache.h
+++ b/fs/btrfs/free-space-cache.h
@@ -113,6 +113,8 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
 			   enum btrfs_trim_state trim_state);
 int btrfs_add_free_space(struct btrfs_block_group *block_group,
 			 u64 bytenr, u64 size);
+int btrfs_add_free_space_async_trimmed(struct btrfs_block_group *block_group,
+				       u64 bytenr, u64 size);
 int btrfs_remove_free_space(struct btrfs_block_group *block_group,
 			    u64 bytenr, u64 size);
 void __btrfs_remove_free_space_cache(struct btrfs_free_space_ctl *ctl);
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 08ac6a7a67f0..215b25012e49 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -46,6 +46,7 @@
 #include "sysfs.h"
 #include "tests/btrfs-tests.h"
 #include "block-group.h"
+#include "discard.h"
 
 #include "qgroup.h"
 #define CREATE_TRACE_POINTS
@@ -146,6 +147,8 @@ void __btrfs_handle_fs_error(struct btrfs_fs_info *fs_info, const char *function
 	if (sb_rdonly(sb))
 		return;
 
+	btrfs_discard_stop(fs_info);
+
 	/* btrfs handle error by forcing the filesystem readonly */
 	sb->s_flags |= SB_RDONLY;
 	btrfs_info(fs_info, "forced readonly");
@@ -313,6 +316,7 @@ enum {
 	Opt_datasum, Opt_nodatasum,
 	Opt_defrag, Opt_nodefrag,
 	Opt_discard, Opt_nodiscard,
+	Opt_discard_mode,
 	Opt_nologreplay,
 	Opt_norecovery,
 	Opt_ratio,
@@ -376,6 +380,7 @@ static const match_table_t tokens = {
 	{Opt_nodefrag, "noautodefrag"},
 	{Opt_discard, "discard"},
 	{Opt_nodiscard, "nodiscard"},
+	{Opt_discard_mode, "discard=%s"},
 	{Opt_nologreplay, "nologreplay"},
 	{Opt_norecovery, "norecovery"},
 	{Opt_ratio, "metadata_ratio=%u"},
@@ -695,12 +700,26 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 				   info->metadata_ratio);
 			break;
 		case Opt_discard:
-			btrfs_set_and_info(info, DISCARD_SYNC,
-					   "turning on sync discard");
+		case Opt_discard_mode:
+			if (token == Opt_discard ||
+			    strcmp(args[0].from, "sync") == 0) {
+				btrfs_clear_opt(info->mount_opt, DISCARD_ASYNC);
+				btrfs_set_and_info(info, DISCARD_SYNC,
+						   "turning on sync discard");
+			} else if (strcmp(args[0].from, "async") == 0) {
+				btrfs_clear_opt(info->mount_opt, DISCARD_SYNC);
+				btrfs_set_and_info(info, DISCARD_ASYNC,
+						   "turning on async discard");
+			} else {
+				ret = -EINVAL;
+				goto out;
+			}
 			break;
 		case Opt_nodiscard:
 			btrfs_clear_and_info(info, DISCARD_SYNC,
 					     "turning off discard");
+			btrfs_clear_and_info(info, DISCARD_ASYNC,
+					     "turning off async discard");
 			break;
 		case Opt_space_cache:
 		case Opt_space_cache_version:
@@ -1324,6 +1343,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
 		seq_puts(seq, ",flushoncommit");
 	if (btrfs_test_opt(info, DISCARD_SYNC))
 		seq_puts(seq, ",discard");
+	if (btrfs_test_opt(info, DISCARD_ASYNC))
+		seq_puts(seq, ",discard=async");
 	if (!(info->sb->s_flags & SB_POSIXACL))
 		seq_puts(seq, ",noacl");
 	if (btrfs_test_opt(info, SPACE_CACHE))
@@ -1713,6 +1734,14 @@ static inline void btrfs_remount_cleanup(struct btrfs_fs_info *fs_info,
 		btrfs_cleanup_defrag_inodes(fs_info);
 	}
 
+	/* If we toggled discard async. */
+	if (!btrfs_raw_test_opt(old_opts, DISCARD_ASYNC) &&
+	    btrfs_test_opt(fs_info, DISCARD_ASYNC))
+		btrfs_discard_resume(fs_info);
+	else if (btrfs_raw_test_opt(old_opts, DISCARD_ASYNC) &&
+		 !btrfs_test_opt(fs_info, DISCARD_ASYNC))
+		btrfs_discard_cleanup(fs_info);
+
 	clear_bit(BTRFS_FS_STATE_REMOUNTING, &fs_info->fs_state);
 }
 
@@ -1760,6 +1789,8 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
 		 */
 		cancel_work_sync(&fs_info->async_reclaim_work);
 
+		btrfs_discard_cleanup(fs_info);
+
 		/* wait for the uuid_scan task to finish */
 		down(&fs_info->uuid_tree_rescan_sem);
 		/* avoid complains from lockdep et al. */
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 66377e678504..65e78e59d5c4 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -30,6 +30,7 @@
 #include "tree-checker.h"
 #include "space-info.h"
 #include "block-group.h"
+#include "discard.h"
 
 const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
 	[BTRFS_RAID_RAID10] = {
@@ -2869,6 +2870,7 @@ static int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
 {
 	struct btrfs_root *root = fs_info->chunk_root;
 	struct btrfs_trans_handle *trans;
+	struct btrfs_block_group *block_group;
 	int ret;
 
 	/*
@@ -2892,6 +2894,11 @@ static int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
 	if (ret)
 		return ret;
 
+	block_group = btrfs_lookup_block_group(fs_info, chunk_offset);
+	if (!block_group)
+		return -ENOENT;
+	btrfs_discard_cancel_work(&fs_info->discard_ctl, block_group);
+
 	trans = btrfs_start_trans_remove_block_group(root->fs_info,
 						     chunk_offset);
 	if (IS_ERR(trans)) {
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v5 00/22] btrfs: async discard support
@ 2019-12-09 19:45 Dennis Zhou
  2019-12-09 19:45 ` [PATCH 05/22] btrfs: add the beginning of async discard, discard workqueue Dennis Zhou
  0 siblings, 1 reply; 23+ messages in thread
From: Dennis Zhou @ 2019-12-09 19:45 UTC (permalink / raw)
  To: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval
  Cc: kernel-team, linux-btrfs, Dennis Zhou

Hello,

Dave reported that with async discard enabled, relocation fails [1].
This could be caused by two things. First, if we unpin extents, that
means we haven't fully discarded the block group and need to let async
discard revisit it. Second, relocation removes block_groups outside of
the normal. I fixed both issues and now it successfully passes xfstests
btrfs/003.

Changes in v5:
 - Changed the rules so free space is always added as the right type
   based on discard settings (see btrfs_add_free_space()), this removes
   the need to pass around trim_state in unpin_extent_range().
 - Handled relocation block group deletion (xfstests btrfs/003)
 - When adding to the discard lists, make sure the work queue is active.
   (made all additions go through either btrfs_discard_queue_work() or
   btrfs_discard_check_filter()).
 - Added 10 sec reuse timeout for fully empty block groups.

v4 is available here: [2].

This series is on top of btrfs-devel#misc-next fbc0468e42d2 + [3].

[1] https://lore.kernel.org/linux-btrfs/20191126215204.GP2734@twin.jikos.cz/
[2] https://lore.kernel.org/linux-btrfs/cover.1574709825.git.dennis@kernel.org/
[3] https://lore.kernel.org/linux-btrfs/20191209193846.18162-1-dennis@kernel.org/

Dennis Zhou (22):
  bitmap: genericize percpu bitmap region iterators
  btrfs: rename DISCARD opt to DISCARD_SYNC
  btrfs: keep track of which extents have been discarded
  btrfs: keep track of cleanliness of the bitmap
  btrfs: add the beginning of async discard, discard workqueue
  btrfs: handle empty block_group removal
  btrfs: discard one region at a time in async discard
  btrfs: add removal calls for sysfs debug/
  btrfs: make UUID/debug have its own kobject
  btrfs: add discard sysfs directory
  btrfs: track discardable extents for async discard
  btrfs: keep track of discardable_bytes
  btrfs: calculate discard delay based on number of extents
  btrfs: add bps discard rate limit
  btrfs: limit max discard size for async discard
  btrfs: make max async discard size tunable
  btrfs: have multiple discard lists
  btrfs: only keep track of data extents for async discard
  btrfs: keep track of discard reuse stats
  btrfs: add async discard header
  btrfs: increase the metadata allowance for the free_space_cache
  btrfs: make smaller extents more likely to go into bitmaps

 fs/btrfs/Makefile           |   2 +-
 fs/btrfs/block-group.c      |  88 ++++-
 fs/btrfs/block-group.h      |  30 ++
 fs/btrfs/ctree.h            |  52 ++-
 fs/btrfs/discard.c          | 684 ++++++++++++++++++++++++++++++++++++
 fs/btrfs/discard.h          |  42 +++
 fs/btrfs/disk-io.c          |  15 +-
 fs/btrfs/extent-tree.c      |   8 +-
 fs/btrfs/free-space-cache.c | 611 +++++++++++++++++++++++++++-----
 fs/btrfs/free-space-cache.h |  41 ++-
 fs/btrfs/inode-map.c        |  13 +-
 fs/btrfs/inode.c            |   2 +-
 fs/btrfs/scrub.c            |   7 +-
 fs/btrfs/super.c            |  39 +-
 fs/btrfs/sysfs.c            | 205 ++++++++++-
 fs/btrfs/volumes.c          |   7 +
 include/linux/bitmap.h      |  35 ++
 mm/percpu.c                 |  61 +---
 18 files changed, 1789 insertions(+), 153 deletions(-)
 create mode 100644 fs/btrfs/discard.c
 create mode 100644 fs/btrfs/discard.h

-- 
2.17.1


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 05/22] btrfs: add the beginning of async discard, discard workqueue
  2019-12-09 19:45 [PATCH v5 00/22] btrfs: async discard support Dennis Zhou
@ 2019-12-09 19:45 ` Dennis Zhou
  0 siblings, 0 replies; 23+ messages in thread
From: Dennis Zhou @ 2019-12-09 19:45 UTC (permalink / raw)
  To: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval
  Cc: kernel-team, linux-btrfs, Dennis Zhou

When discard is enabled, everytime a pinned extent is released back to
the block_group's free space cache, a discard is issued for the extent.
This is an overeager approach when it comes to discarding and helping
the SSD maintain enough free space to prevent severe garbage collection
situations.

This adds the beginning of async discard. Instead of issuing a discard
prior to returning it to the free space, it is just marked as untrimmed.
The block_group is then added to a LRU which then feeds into a workqueue
to issue discards at a much slower rate. Full discarding of unused block
groups is still done and will be address in a future patch in this
series.

For now, we don't persist the discard state of extents and bitmaps.
Therefore, our failure recovery mode will be to consider extents
untrimmed. This lets us handle failure and unmounting as one in the
same.

On a number of Facebook webservers, I collected data every minute
accounting the time we spent in btrfs_finish_extent_commit() (col. 1)
and in btrfs_commit_transaction() (col. 2). btrfs_finish_extent_commit()
is where we discard extents synchronously before returning them to the
free space cache.

discard=sync:
                 p99 total per minute       p99 total per minute
      Drive   |   extent_commit() (ms)  |    commit_trans() (ms)
    ---------------------------------------------------------------
     Drive A  |           434           |          1170
     Drive B  |           880           |          2330
     Drive C  |          2943           |          3920
     Drive D  |          4763           |          5701

discard=async:
                 p99 total per minute       p99 total per minute
      Drive   |   extent_commit() (ms)  |    commit_trans() (ms)
    --------------------------------------------------------------
     Drive A  |           134           |           956
     Drive B  |            64           |          1972
     Drive C  |            59           |          1032
     Drive D  |            62           |          1200

While it's not great that the stats are cumulative over 1m, all of these
servers are running the same workload and and the delta between the two
are substantial. We are spending significantly less time in
btrfs_finish_extent_commit() which is responsible for discarding.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/Makefile           |   2 +-
 fs/btrfs/block-group.c      |  37 ++++-
 fs/btrfs/block-group.h      |   9 ++
 fs/btrfs/ctree.h            |  21 +++
 fs/btrfs/discard.c          | 278 ++++++++++++++++++++++++++++++++++++
 fs/btrfs/discard.h          |  23 +++
 fs/btrfs/disk-io.c          |  15 +-
 fs/btrfs/extent-tree.c      |   4 +
 fs/btrfs/free-space-cache.c |  54 ++++++-
 fs/btrfs/free-space-cache.h |   2 +
 fs/btrfs/super.c            |  35 ++++-
 fs/btrfs/volumes.c          |   7 +
 12 files changed, 474 insertions(+), 13 deletions(-)
 create mode 100644 fs/btrfs/discard.c
 create mode 100644 fs/btrfs/discard.h

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 82200dbca5ac..9a0ff3384381 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -11,7 +11,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 	   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
 	   reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
 	   uuid-tree.o props.o free-space-tree.o tree-checker.o space-info.o \
-	   block-rsv.o delalloc-space.o block-group.o
+	   block-rsv.o delalloc-space.o block-group.o discard.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index be1938dc94fd..c1b1b59343bd 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -14,6 +14,7 @@
 #include "sysfs.h"
 #include "tree-log.h"
 #include "delalloc-space.h"
+#include "discard.h"
 
 /*
  * Return target flags in extended format or 0 if restripe for this chunk_type
@@ -131,6 +132,15 @@ void btrfs_put_block_group(struct btrfs_block_group *cache)
 		WARN_ON(cache->pinned > 0);
 		WARN_ON(cache->reserved > 0);
 
+		/*
+		 * A block_group shouldn't be on the discard_list anymore.
+		 * Remove the block_group from the discard_list to prevent us
+		 * from causing a panic due to NPE.
+		 */
+		if (WARN_ON(!list_empty(&cache->discard_list)))
+			btrfs_discard_cancel_work(&cache->fs_info->discard_ctl,
+						  cache);
+
 		/*
 		 * If not empty, someone is still holding mutex of
 		 * full_stripe_lock, which can only be released by caller.
@@ -466,8 +476,8 @@ u64 add_new_free_space(struct btrfs_block_group *block_group, u64 start, u64 end
 		} else if (extent_start > start && extent_start < end) {
 			size = extent_start - start;
 			total_added += size;
-			ret = btrfs_add_free_space(block_group, start,
-						   size);
+			ret = btrfs_add_free_space_async_trimmed(block_group,
+								 start, size);
 			BUG_ON(ret); /* -ENOMEM or logic error */
 			start = extent_end + 1;
 		} else {
@@ -478,7 +488,8 @@ u64 add_new_free_space(struct btrfs_block_group *block_group, u64 start, u64 end
 	if (start < end) {
 		size = end - start;
 		total_added += size;
-		ret = btrfs_add_free_space(block_group, start, size);
+		ret = btrfs_add_free_space_async_trimmed(block_group, start,
+							 size);
 		BUG_ON(ret); /* -ENOMEM or logic error */
 	}
 
@@ -1258,6 +1269,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 		}
 		spin_unlock(&fs_info->unused_bgs_lock);
 
+		btrfs_discard_cancel_work(&fs_info->discard_ctl, block_group);
+
 		mutex_lock(&fs_info->delete_unused_bgs_mutex);
 
 		/* Don't want to race with allocators so take the groups_sem */
@@ -1333,6 +1346,23 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 		}
 		mutex_unlock(&fs_info->unused_bg_unpin_mutex);
 
+		/*
+		 * At this point, the block_group is read only and should fail
+		 * new allocations.  However, btrfs_finish_extent_commit() can
+		 * cause this block_group to be placed back on the discard
+		 * lists because now the block_group isn't fully discarded.
+		 * Bail here and try again later after discarding everything.
+		 */
+		spin_lock(&fs_info->discard_ctl.lock);
+		if (!list_empty(&block_group->discard_list)) {
+			spin_unlock(&fs_info->discard_ctl.lock);
+			btrfs_dec_block_group_ro(block_group);
+			btrfs_discard_queue_work(&fs_info->discard_ctl,
+						 block_group);
+			goto end_trans;
+		}
+		spin_unlock(&fs_info->discard_ctl.lock);
+
 		/* Reset pinned so btrfs_put_block_group doesn't complain */
 		spin_lock(&space_info->lock);
 		spin_lock(&block_group->lock);
@@ -1603,6 +1633,7 @@ static struct btrfs_block_group *btrfs_create_block_group_cache(
 	INIT_LIST_HEAD(&cache->cluster_list);
 	INIT_LIST_HEAD(&cache->bg_list);
 	INIT_LIST_HEAD(&cache->ro_list);
+	INIT_LIST_HEAD(&cache->discard_list);
 	INIT_LIST_HEAD(&cache->dirty_list);
 	INIT_LIST_HEAD(&cache->io_list);
 	btrfs_init_free_space_ctl(cache);
diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index 9b409676c4b2..884defd61dcd 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -116,7 +116,11 @@ struct btrfs_block_group {
 	/* For read-only block groups */
 	struct list_head ro_list;
 
+	/* For discard operations */
 	atomic_t trimming;
+	struct list_head discard_list;
+	int discard_index;
+	u64 discard_eligible_time;
 
 	/* For dirty block groups */
 	struct list_head dirty_list;
@@ -158,6 +162,11 @@ struct btrfs_block_group {
 	struct btrfs_full_stripe_locks_tree full_stripe_locks_root;
 };
 
+static inline u64 btrfs_block_group_end(struct btrfs_block_group *block_group)
+{
+	return (block_group->start + block_group->length);
+}
+
 #ifdef CONFIG_BTRFS_DEBUG
 static inline int btrfs_should_fragment_free_space(
 		struct btrfs_block_group *block_group)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 51a303441802..d67bbbb07925 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -440,6 +440,21 @@ struct btrfs_full_stripe_locks_tree {
 	struct mutex lock;
 };
 
+/* Discard control. */
+/*
+ * Async discard uses multiple lists to differentiate the discard filter
+ * parameters.
+ */
+#define BTRFS_NR_DISCARD_LISTS		1
+
+struct btrfs_discard_ctl {
+	struct workqueue_struct *discard_workers;
+	struct delayed_work work;
+	spinlock_t lock;
+	struct btrfs_block_group *block_group;
+	struct list_head discard_list[BTRFS_NR_DISCARD_LISTS];
+};
+
 /* delayed seq elem */
 struct seq_list {
 	struct list_head list;
@@ -526,6 +541,9 @@ enum {
 	 * so we don't need to offload checksums to workqueues.
 	 */
 	BTRFS_FS_CSUM_IMPL_FAST,
+
+	/* Indicate that the discard workqueue can service discards. */
+	BTRFS_FS_DISCARD_RUNNING,
 };
 
 struct btrfs_fs_info {
@@ -816,6 +834,8 @@ struct btrfs_fs_info {
 	struct btrfs_workqueue *scrub_wr_completion_workers;
 	struct btrfs_workqueue *scrub_parity_workers;
 
+	struct btrfs_discard_ctl discard_ctl;
+
 #ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
 	u32 check_integrity_print_mask;
 #endif
@@ -1189,6 +1209,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info)
 #define BTRFS_MOUNT_FREE_SPACE_TREE	(1 << 26)
 #define BTRFS_MOUNT_NOLOGREPLAY		(1 << 27)
 #define BTRFS_MOUNT_REF_VERIFY		(1 << 28)
+#define BTRFS_MOUNT_DISCARD_ASYNC	(1 << 29)
 
 #define BTRFS_DEFAULT_COMMIT_INTERVAL	(30)
 #define BTRFS_DEFAULT_MAX_INLINE	(2048)
diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
new file mode 100644
index 000000000000..692d64025802
--- /dev/null
+++ b/fs/btrfs/discard.c
@@ -0,0 +1,278 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/jiffies.h>
+#include <linux/kernel.h>
+#include <linux/ktime.h>
+#include <linux/list.h>
+#include <linux/sizes.h>
+#include <linux/workqueue.h>
+#include "ctree.h"
+#include "block-group.h"
+#include "discard.h"
+#include "free-space-cache.h"
+
+/* This is an initial delay to give some chance for lba reuse. */
+#define BTRFS_DISCARD_DELAY		(120ULL * NSEC_PER_SEC)
+
+static struct list_head *get_discard_list(struct btrfs_discard_ctl *discard_ctl,
+					  struct btrfs_block_group *block_group)
+{
+	return &discard_ctl->discard_list[block_group->discard_index];
+}
+
+static void add_to_discard_list(struct btrfs_discard_ctl *discard_ctl,
+				struct btrfs_block_group *block_group)
+{
+	spin_lock(&discard_ctl->lock);
+
+	if (!btrfs_run_discard_work(discard_ctl)) {
+		spin_unlock(&discard_ctl->lock);
+		return;
+	}
+
+	if (list_empty(&block_group->discard_list))
+		block_group->discard_eligible_time = (ktime_get_ns() +
+						      BTRFS_DISCARD_DELAY);
+
+	list_move_tail(&block_group->discard_list,
+		       get_discard_list(discard_ctl, block_group));
+
+	spin_unlock(&discard_ctl->lock);
+}
+
+static bool remove_from_discard_list(struct btrfs_discard_ctl *discard_ctl,
+				     struct btrfs_block_group *block_group)
+{
+	bool running = false;
+
+	spin_lock(&discard_ctl->lock);
+
+	if (block_group == discard_ctl->block_group) {
+		running = true;
+		discard_ctl->block_group = NULL;
+	}
+
+	block_group->discard_eligible_time = 0;
+	list_del_init(&block_group->discard_list);
+
+	spin_unlock(&discard_ctl->lock);
+
+	return running;
+}
+
+/**
+ * find_next_block_group - find block_group that's up next for discarding
+ * @discard_ctl: discard control
+ * @now: current time
+ *
+ * Iterate over the discard lists to find the next block_group up for
+ * discarding checking the discard_eligible_time of block_group.
+ */
+static struct btrfs_block_group *find_next_block_group(
+					struct btrfs_discard_ctl *discard_ctl,
+					u64 now)
+{
+	struct btrfs_block_group *ret_block_group = NULL, *block_group;
+	int i;
+
+	for (i = 0; i < BTRFS_NR_DISCARD_LISTS; i++) {
+		struct list_head *discard_list = &discard_ctl->discard_list[i];
+
+		if (!list_empty(discard_list)) {
+			block_group = list_first_entry(discard_list,
+						       struct btrfs_block_group,
+						       discard_list);
+
+			if (!ret_block_group)
+				ret_block_group = block_group;
+
+			if (ret_block_group->discard_eligible_time < now)
+				break;
+
+			if (ret_block_group->discard_eligible_time >
+			    block_group->discard_eligible_time)
+				ret_block_group = block_group;
+		}
+	}
+
+	return ret_block_group;
+}
+
+/**
+ * peek_discard_list - wrap find_next_block_group()
+ * @discard_ctl: discard control
+ *
+ * This wraps find_next_block_group() and sets the block_group to be in use.
+ */
+static struct btrfs_block_group *peek_discard_list(
+					struct btrfs_discard_ctl *discard_ctl)
+{
+	struct btrfs_block_group *block_group;
+	u64 now = ktime_get_ns();
+
+	spin_lock(&discard_ctl->lock);
+
+	block_group = find_next_block_group(discard_ctl, now);
+
+	if (block_group && now < block_group->discard_eligible_time)
+		block_group = NULL;
+
+	discard_ctl->block_group = block_group;
+
+	spin_unlock(&discard_ctl->lock);
+
+	return block_group;
+}
+
+/**
+ * btrfs_discard_cancel_work - remove a block_group from the discard lists
+ * @discard_ctl: discard control
+ * @block_group: block_group of interest
+ *
+ * This removes @block_group from the discard lists.  If necessary, it waits on
+ * the current work and then reschedules the delayed work.
+ */
+void btrfs_discard_cancel_work(struct btrfs_discard_ctl *discard_ctl,
+			       struct btrfs_block_group *block_group)
+{
+	if (remove_from_discard_list(discard_ctl, block_group)) {
+		cancel_delayed_work_sync(&discard_ctl->work);
+		btrfs_discard_schedule_work(discard_ctl, true);
+	}
+}
+
+/**
+ * btrfs_discard_queue_work - handles queuing the block_groups
+ * @discard_ctl: discard control
+ * @block_group: block_group of interest
+ *
+ * This maintains the LRU order of the discard lists.
+ */
+void btrfs_discard_queue_work(struct btrfs_discard_ctl *discard_ctl,
+			      struct btrfs_block_group *block_group)
+{
+	if (!block_group ||
+	    !btrfs_test_opt(block_group->fs_info, DISCARD_ASYNC))
+		return;
+
+	add_to_discard_list(discard_ctl, block_group);
+
+	if (!delayed_work_pending(&discard_ctl->work))
+		btrfs_discard_schedule_work(discard_ctl, false);
+}
+
+/**
+ * btrfs_discard_schedule_work - responsible for scheduling the discard work
+ * @discard_ctl: discard control
+ * @override: override the current timer
+ *
+ * Discards are issued by a delayed workqueue item.  @override is used to
+ * update the current delay as the baseline delay interview is reevaluated
+ * on transaction commit.  This is also maxed with any other rate limit.
+ */
+void btrfs_discard_schedule_work(struct btrfs_discard_ctl *discard_ctl,
+				 bool override)
+{
+	struct btrfs_block_group *block_group;
+	u64 now = ktime_get_ns();
+
+	spin_lock(&discard_ctl->lock);
+
+	if (!btrfs_run_discard_work(discard_ctl))
+		goto out;
+
+	if (!override && delayed_work_pending(&discard_ctl->work))
+		goto out;
+
+	block_group = find_next_block_group(discard_ctl, now);
+	if (block_group) {
+		u64 delay = 0;
+
+		if (now < block_group->discard_eligible_time)
+			delay = nsecs_to_jiffies(
+				block_group->discard_eligible_time - now);
+
+		mod_delayed_work(discard_ctl->discard_workers,
+				 &discard_ctl->work,
+				 delay);
+	}
+
+out:
+	spin_unlock(&discard_ctl->lock);
+}
+
+/**
+ * btrfs_discard_workfn - discard work function
+ * @work: work
+ *
+ * This finds the next block_group to start discarding and then discards it.
+ */
+static void btrfs_discard_workfn(struct work_struct *work)
+{
+	struct btrfs_discard_ctl *discard_ctl;
+	struct btrfs_block_group *block_group;
+	u64 trimmed = 0;
+
+	discard_ctl = container_of(work, struct btrfs_discard_ctl, work.work);
+
+	block_group = peek_discard_list(discard_ctl);
+	if (!block_group || !btrfs_run_discard_work(discard_ctl))
+		return;
+
+	btrfs_trim_block_group(block_group, &trimmed, block_group->start,
+			       btrfs_block_group_end(block_group), 0);
+
+	remove_from_discard_list(discard_ctl, block_group);
+
+	btrfs_discard_schedule_work(discard_ctl, false);
+}
+
+/**
+ * btrfs_run_discard_work - determines if async discard should be running
+ * @discard_ctl: discard control
+ *
+ * Checks if the file system is writeable and BTRFS_FS_DISCARD_RUNNING is set.
+ */
+bool btrfs_run_discard_work(struct btrfs_discard_ctl *discard_ctl)
+{
+	struct btrfs_fs_info *fs_info = container_of(discard_ctl,
+						     struct btrfs_fs_info,
+						     discard_ctl);
+
+	return (!(fs_info->sb->s_flags & SB_RDONLY) &&
+		test_bit(BTRFS_FS_DISCARD_RUNNING, &fs_info->flags));
+}
+
+void btrfs_discard_resume(struct btrfs_fs_info *fs_info)
+{
+	if (!btrfs_test_opt(fs_info, DISCARD_ASYNC)) {
+		btrfs_discard_cleanup(fs_info);
+		return;
+	}
+
+	set_bit(BTRFS_FS_DISCARD_RUNNING, &fs_info->flags);
+}
+
+void btrfs_discard_stop(struct btrfs_fs_info *fs_info)
+{
+	clear_bit(BTRFS_FS_DISCARD_RUNNING, &fs_info->flags);
+}
+
+void btrfs_discard_init(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_discard_ctl *discard_ctl = &fs_info->discard_ctl;
+	int i;
+
+	spin_lock_init(&discard_ctl->lock);
+
+	INIT_DELAYED_WORK(&discard_ctl->work, btrfs_discard_workfn);
+
+	for (i = 0; i < BTRFS_NR_DISCARD_LISTS; i++)
+		INIT_LIST_HEAD(&discard_ctl->discard_list[i]);
+}
+
+void btrfs_discard_cleanup(struct btrfs_fs_info *fs_info)
+{
+	btrfs_discard_stop(fs_info);
+	cancel_delayed_work_sync(&fs_info->discard_ctl.work);
+}
diff --git a/fs/btrfs/discard.h b/fs/btrfs/discard.h
new file mode 100644
index 000000000000..f3775e84d35a
--- /dev/null
+++ b/fs/btrfs/discard.h
@@ -0,0 +1,23 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#ifndef BTRFS_DISCARD_H
+#define BTRFS_DISCARD_H
+
+struct btrfs_fs_info;
+struct btrfs_discard_ctl;
+struct btrfs_block_group;
+
+void btrfs_discard_cancel_work(struct btrfs_discard_ctl *discard_ctl,
+			       struct btrfs_block_group *block_group);
+void btrfs_discard_queue_work(struct btrfs_discard_ctl *discard_ctl,
+			      struct btrfs_block_group *block_group);
+void btrfs_discard_schedule_work(struct btrfs_discard_ctl *discard_ctl,
+				 bool override);
+bool btrfs_run_discard_work(struct btrfs_discard_ctl *discard_ctl);
+
+void btrfs_discard_resume(struct btrfs_fs_info *fs_info);
+void btrfs_discard_stop(struct btrfs_fs_info *fs_info);
+void btrfs_discard_init(struct btrfs_fs_info *fs_info);
+void btrfs_discard_cleanup(struct btrfs_fs_info *fs_info);
+
+#endif
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index ab888d89d844..fe22ba265ced 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -41,6 +41,7 @@
 #include "tree-checker.h"
 #include "ref-verify.h"
 #include "block-group.h"
+#include "discard.h"
 
 #define BTRFS_SUPER_FLAG_SUPP	(BTRFS_HEADER_FLAG_WRITTEN |\
 				 BTRFS_HEADER_FLAG_RELOC |\
@@ -1953,6 +1954,8 @@ static void btrfs_stop_all_workers(struct btrfs_fs_info *fs_info)
 	btrfs_destroy_workqueue(fs_info->readahead_workers);
 	btrfs_destroy_workqueue(fs_info->flush_workers);
 	btrfs_destroy_workqueue(fs_info->qgroup_rescan_workers);
+	if (fs_info->discard_ctl.discard_workers)
+		destroy_workqueue(fs_info->discard_ctl.discard_workers);
 	/*
 	 * Now that all other work queues are destroyed, we can safely destroy
 	 * the queues used for metadata I/O, since tasks from those other work
@@ -2148,6 +2151,8 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info,
 				      max_active, 2);
 	fs_info->qgroup_rescan_workers =
 		btrfs_alloc_workqueue(fs_info, "qgroup-rescan", flags, 1, 0);
+	fs_info->discard_ctl.discard_workers =
+		alloc_workqueue("btrfs_discard", WQ_UNBOUND | WQ_FREEZABLE, 1);
 
 	if (!(fs_info->workers && fs_info->delalloc_workers &&
 	      fs_info->flush_workers &&
@@ -2158,7 +2163,8 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info,
 	      fs_info->endio_freespace_worker && fs_info->rmw_workers &&
 	      fs_info->caching_workers && fs_info->readahead_workers &&
 	      fs_info->fixup_workers && fs_info->delayed_workers &&
-	      fs_info->qgroup_rescan_workers)) {
+	      fs_info->qgroup_rescan_workers &&
+	      fs_info->discard_ctl.discard_workers)) {
 		return -ENOMEM;
 	}
 
@@ -2793,6 +2799,8 @@ int __cold open_ctree(struct super_block *sb,
 	btrfs_init_dev_replace_locks(fs_info);
 	btrfs_init_qgroup(fs_info);
 
+	btrfs_discard_init(fs_info);
+
 	btrfs_init_free_cluster(&fs_info->meta_alloc_cluster);
 	btrfs_init_free_cluster(&fs_info->data_alloc_cluster);
 
@@ -3256,6 +3264,8 @@ int __cold open_ctree(struct super_block *sb,
 
 	btrfs_qgroup_rescan_resume(fs_info);
 
+	btrfs_discard_resume(fs_info);
+
 	if (!fs_info->uuid_root) {
 		btrfs_info(fs_info, "creating UUID tree");
 		ret = btrfs_create_uuid_tree(fs_info);
@@ -3971,6 +3981,9 @@ void __cold close_ctree(struct btrfs_fs_info *fs_info)
 
 	cancel_work_sync(&fs_info->async_reclaim_work);
 
+	/* cancel or finish ongoing discard work */
+	btrfs_discard_cleanup(fs_info);
+
 	if (!sb_rdonly(fs_info->sb)) {
 		/*
 		 * The cleaner kthread is stopped, so do one final pass over
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 9c8ff4307b7c..574be68e9659 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -32,6 +32,7 @@
 #include "block-rsv.h"
 #include "delalloc-space.h"
 #include "block-group.h"
+#include "discard.h"
 
 #undef SCRAMBLE_DELAYED_REFS
 
@@ -2934,6 +2935,9 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans)
 		cond_resched();
 	}
 
+	if (btrfs_test_opt(fs_info, DISCARD_ASYNC))
+		btrfs_discard_schedule_work(&fs_info->discard_ctl, true);
+
 	/*
 	 * Transaction is finished.  We don't need the lock anymore.  We
 	 * do need to clean up the block groups in case of a transaction
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 7108962e208d..0a868b6df893 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -21,6 +21,7 @@
 #include "space-info.h"
 #include "delalloc-space.h"
 #include "block-group.h"
+#include "discard.h"
 
 #define BITS_PER_BITMAP		(PAGE_SIZE * 8UL)
 #define MAX_CACHE_BYTES_PER_GIG	SZ_32K
@@ -755,9 +756,11 @@ static int __load_free_space_cache(struct btrfs_root *root, struct inode *inode,
 		/*
 		 * Sync discard ensures that the free space cache is always
 		 * trimmed.  So when reading this in, the state should reflect
-		 * that.
+		 * that.  We also do this for async as a stop gap for lack of
+		 * persistence.
 		 */
-		if (btrfs_test_opt(fs_info, DISCARD_SYNC))
+		if (btrfs_test_opt(fs_info, DISCARD_SYNC) ||
+		    btrfs_test_opt(fs_info, DISCARD_ASYNC))
 			e->trim_state = BTRFS_TRIM_STATE_TRIMMED;
 
 		if (!e->bytes) {
@@ -2382,6 +2385,7 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
 			   u64 offset, u64 bytes,
 			   enum btrfs_trim_state trim_state)
 {
+	struct btrfs_block_group *block_group = ctl->private;
 	struct btrfs_free_space *info;
 	int ret = 0;
 
@@ -2431,6 +2435,9 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
 		ASSERT(ret != -EEXIST);
 	}
 
+	if (trim_state != BTRFS_TRIM_STATE_TRIMMED)
+		btrfs_discard_queue_work(&fs_info->discard_ctl, block_group);
+
 	return ret;
 }
 
@@ -2447,6 +2454,25 @@ int btrfs_add_free_space(struct btrfs_block_group *block_group,
 				      bytenr, size, trim_state);
 }
 
+/*
+ * This is a subtle distinction because when adding free space back in general,
+ * we want it to be added as untrimmed for async. But in the case where we add
+ * it on loading of a block group, we want to consider it trimmed.
+ */
+int btrfs_add_free_space_async_trimmed(struct btrfs_block_group *block_group,
+				       u64 bytenr, u64 size)
+{
+	enum btrfs_trim_state trim_state = BTRFS_TRIM_STATE_UNTRIMMED;
+
+	if (btrfs_test_opt(block_group->fs_info, DISCARD_SYNC) ||
+	    btrfs_test_opt(block_group->fs_info, DISCARD_ASYNC))
+		trim_state = BTRFS_TRIM_STATE_TRIMMED;
+
+	return __btrfs_add_free_space(block_group->fs_info,
+				      block_group->free_space_ctl,
+				      bytenr, size, trim_state);
+}
+
 int btrfs_remove_free_space(struct btrfs_block_group *block_group,
 			    u64 offset, u64 bytes)
 {
@@ -3209,6 +3235,7 @@ void btrfs_init_free_cluster(struct btrfs_free_cluster *cluster)
 static int do_trimming(struct btrfs_block_group *block_group,
 		       u64 *total_trimmed, u64 start, u64 bytes,
 		       u64 reserved_start, u64 reserved_bytes,
+		       enum btrfs_trim_state reserved_trim_state,
 		       struct btrfs_trim_range *trim_entry)
 {
 	struct btrfs_space_info *space_info = block_group->space_info;
@@ -3216,6 +3243,9 @@ static int do_trimming(struct btrfs_block_group *block_group,
 	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
 	int ret;
 	int update = 0;
+	u64 end = start + bytes;
+	u64 reserved_end = reserved_start + reserved_bytes;
+	enum btrfs_trim_state trim_state = BTRFS_TRIM_STATE_UNTRIMMED;
 	u64 trimmed = 0;
 
 	spin_lock(&space_info->lock);
@@ -3229,11 +3259,20 @@ static int do_trimming(struct btrfs_block_group *block_group,
 	spin_unlock(&space_info->lock);
 
 	ret = btrfs_discard_extent(fs_info, start, bytes, &trimmed);
-	if (!ret)
+	if (!ret) {
 		*total_trimmed += trimmed;
+		trim_state = BTRFS_TRIM_STATE_TRIMMED;
+	}
 
 	mutex_lock(&ctl->cache_writeout_mutex);
-	btrfs_add_free_space(block_group, reserved_start, reserved_bytes);
+	if (reserved_start < start)
+		__btrfs_add_free_space(fs_info, ctl, reserved_start,
+				       start - reserved_start,
+				       reserved_trim_state);
+	if (start + bytes < reserved_start + reserved_bytes)
+		__btrfs_add_free_space(fs_info, ctl, end, reserved_end - end,
+				       reserved_trim_state);
+	__btrfs_add_free_space(fs_info, ctl, start, bytes, trim_state);
 	list_del(&trim_entry->list);
 	mutex_unlock(&ctl->cache_writeout_mutex);
 
@@ -3260,6 +3299,7 @@ static int trim_no_bitmap(struct btrfs_block_group *block_group,
 	int ret = 0;
 	u64 extent_start;
 	u64 extent_bytes;
+	enum btrfs_trim_state extent_trim_state;
 	u64 bytes;
 
 	while (start < end) {
@@ -3301,6 +3341,7 @@ static int trim_no_bitmap(struct btrfs_block_group *block_group,
 
 		extent_start = entry->offset;
 		extent_bytes = entry->bytes;
+		extent_trim_state = entry->trim_state;
 		start = max(start, extent_start);
 		bytes = min(extent_start + extent_bytes, end) - start;
 		if (bytes < minlen) {
@@ -3319,7 +3360,8 @@ static int trim_no_bitmap(struct btrfs_block_group *block_group,
 		mutex_unlock(&ctl->cache_writeout_mutex);
 
 		ret = do_trimming(block_group, total_trimmed, start, bytes,
-				  extent_start, extent_bytes, &trim_entry);
+				  extent_start, extent_bytes, extent_trim_state,
+				  &trim_entry);
 		if (ret)
 			break;
 next:
@@ -3445,7 +3487,7 @@ static int trim_bitmaps(struct btrfs_block_group *block_group,
 		mutex_unlock(&ctl->cache_writeout_mutex);
 
 		ret = do_trimming(block_group, total_trimmed, start, bytes,
-				  start, bytes, &trim_entry);
+				  start, bytes, 0, &trim_entry);
 		if (ret) {
 			reset_trimming_bitmap(ctl, offset);
 			break;
diff --git a/fs/btrfs/free-space-cache.h b/fs/btrfs/free-space-cache.h
index 29d16f58b40b..a88c30cb3b2b 100644
--- a/fs/btrfs/free-space-cache.h
+++ b/fs/btrfs/free-space-cache.h
@@ -113,6 +113,8 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
 			   enum btrfs_trim_state trim_state);
 int btrfs_add_free_space(struct btrfs_block_group *block_group,
 			 u64 bytenr, u64 size);
+int btrfs_add_free_space_async_trimmed(struct btrfs_block_group *block_group,
+				       u64 bytenr, u64 size);
 int btrfs_remove_free_space(struct btrfs_block_group *block_group,
 			    u64 bytenr, u64 size);
 void __btrfs_remove_free_space_cache(struct btrfs_free_space_ctl *ctl);
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 08ac6a7a67f0..215b25012e49 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -46,6 +46,7 @@
 #include "sysfs.h"
 #include "tests/btrfs-tests.h"
 #include "block-group.h"
+#include "discard.h"
 
 #include "qgroup.h"
 #define CREATE_TRACE_POINTS
@@ -146,6 +147,8 @@ void __btrfs_handle_fs_error(struct btrfs_fs_info *fs_info, const char *function
 	if (sb_rdonly(sb))
 		return;
 
+	btrfs_discard_stop(fs_info);
+
 	/* btrfs handle error by forcing the filesystem readonly */
 	sb->s_flags |= SB_RDONLY;
 	btrfs_info(fs_info, "forced readonly");
@@ -313,6 +316,7 @@ enum {
 	Opt_datasum, Opt_nodatasum,
 	Opt_defrag, Opt_nodefrag,
 	Opt_discard, Opt_nodiscard,
+	Opt_discard_mode,
 	Opt_nologreplay,
 	Opt_norecovery,
 	Opt_ratio,
@@ -376,6 +380,7 @@ static const match_table_t tokens = {
 	{Opt_nodefrag, "noautodefrag"},
 	{Opt_discard, "discard"},
 	{Opt_nodiscard, "nodiscard"},
+	{Opt_discard_mode, "discard=%s"},
 	{Opt_nologreplay, "nologreplay"},
 	{Opt_norecovery, "norecovery"},
 	{Opt_ratio, "metadata_ratio=%u"},
@@ -695,12 +700,26 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 				   info->metadata_ratio);
 			break;
 		case Opt_discard:
-			btrfs_set_and_info(info, DISCARD_SYNC,
-					   "turning on sync discard");
+		case Opt_discard_mode:
+			if (token == Opt_discard ||
+			    strcmp(args[0].from, "sync") == 0) {
+				btrfs_clear_opt(info->mount_opt, DISCARD_ASYNC);
+				btrfs_set_and_info(info, DISCARD_SYNC,
+						   "turning on sync discard");
+			} else if (strcmp(args[0].from, "async") == 0) {
+				btrfs_clear_opt(info->mount_opt, DISCARD_SYNC);
+				btrfs_set_and_info(info, DISCARD_ASYNC,
+						   "turning on async discard");
+			} else {
+				ret = -EINVAL;
+				goto out;
+			}
 			break;
 		case Opt_nodiscard:
 			btrfs_clear_and_info(info, DISCARD_SYNC,
 					     "turning off discard");
+			btrfs_clear_and_info(info, DISCARD_ASYNC,
+					     "turning off async discard");
 			break;
 		case Opt_space_cache:
 		case Opt_space_cache_version:
@@ -1324,6 +1343,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
 		seq_puts(seq, ",flushoncommit");
 	if (btrfs_test_opt(info, DISCARD_SYNC))
 		seq_puts(seq, ",discard");
+	if (btrfs_test_opt(info, DISCARD_ASYNC))
+		seq_puts(seq, ",discard=async");
 	if (!(info->sb->s_flags & SB_POSIXACL))
 		seq_puts(seq, ",noacl");
 	if (btrfs_test_opt(info, SPACE_CACHE))
@@ -1713,6 +1734,14 @@ static inline void btrfs_remount_cleanup(struct btrfs_fs_info *fs_info,
 		btrfs_cleanup_defrag_inodes(fs_info);
 	}
 
+	/* If we toggled discard async. */
+	if (!btrfs_raw_test_opt(old_opts, DISCARD_ASYNC) &&
+	    btrfs_test_opt(fs_info, DISCARD_ASYNC))
+		btrfs_discard_resume(fs_info);
+	else if (btrfs_raw_test_opt(old_opts, DISCARD_ASYNC) &&
+		 !btrfs_test_opt(fs_info, DISCARD_ASYNC))
+		btrfs_discard_cleanup(fs_info);
+
 	clear_bit(BTRFS_FS_STATE_REMOUNTING, &fs_info->fs_state);
 }
 
@@ -1760,6 +1789,8 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
 		 */
 		cancel_work_sync(&fs_info->async_reclaim_work);
 
+		btrfs_discard_cleanup(fs_info);
+
 		/* wait for the uuid_scan task to finish */
 		down(&fs_info->uuid_tree_rescan_sem);
 		/* avoid complains from lockdep et al. */
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f5c0c401c330..d25f5e6d5280 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -30,6 +30,7 @@
 #include "tree-checker.h"
 #include "space-info.h"
 #include "block-group.h"
+#include "discard.h"
 
 const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
 	[BTRFS_RAID_RAID10] = {
@@ -2876,6 +2877,7 @@ static int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
 {
 	struct btrfs_root *root = fs_info->chunk_root;
 	struct btrfs_trans_handle *trans;
+	struct btrfs_block_group *block_group;
 	int ret;
 
 	/*
@@ -2899,6 +2901,11 @@ static int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
 	if (ret)
 		return ret;
 
+	block_group = btrfs_lookup_block_group(fs_info, chunk_offset);
+	if (!block_group)
+		return -ENOENT;
+	btrfs_discard_cancel_work(&fs_info->discard_ctl, block_group);
+
 	trans = btrfs_start_trans_remove_block_group(root->fs_info,
 						     chunk_offset);
 	if (IS_ERR(trans)) {
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v4 00/22] btrfs: async discard support
@ 2019-11-25 19:46 Dennis Zhou
  2019-11-25 19:46 ` [PATCH 05/22] btrfs: add the beginning of async discard, discard workqueue Dennis Zhou
  0 siblings, 1 reply; 23+ messages in thread
From: Dennis Zhou @ 2019-11-25 19:46 UTC (permalink / raw)
  To: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval
  Cc: kernel-team, linux-btrfs, Dennis Zhou

Hello,

v3 [1] had 2 minor issues which are fixed:
 - I was generically dividing u64 which made 32 bit arches unhappy. [2]
 - Uninitialized use of trim_state local variable [3]

I've gone through and fixed the apparent u64 divisions which passes the
make.cross build provided on the end of the series.

This is based on btrfs-devel#misc-next e88411ecf657.

[1] https://lore.kernel.org/linux-btrfs/cover.1574282259.git.dennis@kernel.org/
[2] https://lore.kernel.org/linux-btrfs/201911230623.j5g9qeNZ%25lkp@intel.com/
[3] https://lore.kernel.org/linux-btrfs/CAKwvOdn5j37AYzmoOsaSqyYdBkjqevbTrSyGQypB+G_NgxX0fQ@mail.gmail.com/

diffstats below:

Dennis Zhou (22):
  bitmap: genericize percpu bitmap region iterators
  btrfs: rename DISCARD opt to DISCARD_SYNC
  btrfs: keep track of which extents have been discarded
  btrfs: keep track of cleanliness of the bitmap
  btrfs: add the beginning of async discard, discard workqueue
  btrfs: handle empty block_group removal
  btrfs: discard one region at a time in async discard
  btrfs: add removal calls for sysfs debug/
  btrfs: make UUID/debug have its own kobject
  btrfs: add discard sysfs directory
  btrfs: track discardable extents for async discard
  btrfs: keep track of discardable_bytes
  btrfs: calculate discard delay based on number of extents
  btrfs: add bps discard rate limit
  btrfs: limit max discard size for async discard
  btrfs: make max async discard size tunable
  btrfs: have multiple discard lists
  btrfs: only keep track of data extents for async discard
  btrfs: keep track of discard reuse stats
  btrfs: add async discard header
  btrfs: increase the metadata allowance for the free_space_cache
  btrfs: make smaller extents more likely to go into bitmaps

 fs/btrfs/Makefile           |   2 +-
 fs/btrfs/block-group.c      |  56 ++-
 fs/btrfs/block-group.h      |  30 ++
 fs/btrfs/ctree.h            |  52 ++-
 fs/btrfs/discard.c          | 680 ++++++++++++++++++++++++++++++++++++
 fs/btrfs/discard.h          |  46 +++
 fs/btrfs/disk-io.c          |  15 +-
 fs/btrfs/extent-tree.c      |  23 +-
 fs/btrfs/free-space-cache.c | 587 ++++++++++++++++++++++++++-----
 fs/btrfs/free-space-cache.h |  39 ++-
 fs/btrfs/inode-map.c        |  13 +-
 fs/btrfs/scrub.c            |   7 +-
 fs/btrfs/super.c            |  39 ++-
 fs/btrfs/sysfs.c            | 205 ++++++++++-
 include/linux/bitmap.h      |  35 ++
 mm/percpu.c                 |  61 +---
 16 files changed, 1737 insertions(+), 153 deletions(-)
 create mode 100644 fs/btrfs/discard.c
 create mode 100644 fs/btrfs/discard.h

Thanks,
Dennis

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 05/22] btrfs: add the beginning of async discard, discard workqueue
  2019-11-25 19:46 [PATCH v4 00/22] btrfs: async discard support Dennis Zhou
@ 2019-11-25 19:46 ` Dennis Zhou
  0 siblings, 0 replies; 23+ messages in thread
From: Dennis Zhou @ 2019-11-25 19:46 UTC (permalink / raw)
  To: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval
  Cc: kernel-team, linux-btrfs, Dennis Zhou

When discard is enabled, everytime a pinned extent is released back to
the block_group's free space cache, a discard is issued for the extent.
This is an overeager approach when it comes to discarding and helping
the SSD maintain enough free space to prevent severe garbage collection
situations.

This adds the beginning of async discard. Instead of issuing a discard
prior to returning it to the free space, it is just marked as untrimmed.
The block_group is then added to a LRU which then feeds into a workqueue
to issue discards at a much slower rate. Full discarding of unused block
groups is still done and will be address in a future patch in this
series.

For now, we don't persist the discard state of extents and bitmaps.
Therefore, our failure recovery mode will be to consider extents
untrimmed. This lets us handle failure and unmounting as one in the
same.

On a number of Facebook webservers, I collected data every minute
accounting the time we spent in btrfs_finish_extent_commit() (col. 1)
and in btrfs_commit_transaction() (col. 2). btrfs_finish_extent_commit()
is where we discard extents synchronously before returning them to the
free space cache.

discard=sync:
                 p99 total per minute       p99 total per minute
      Drive   |   extent_commit() (ms)  |    commit_trans() (ms)
    ---------------------------------------------------------------
     Drive A  |           434           |          1170
     Drive B  |           880           |          2330
     Drive C  |          2943           |          3920
     Drive D  |          4763           |          5701

discard=async:
                 p99 total per minute       p99 total per minute
      Drive   |   extent_commit() (ms)  |    commit_trans() (ms)
    --------------------------------------------------------------
     Drive A  |           134           |           956
     Drive B  |            64           |          1972
     Drive C  |            59           |          1032
     Drive D  |            62           |          1200

While it's not great that the stats are cumulative over 1m, all of these
servers are running the same workload and and the delta between the two
are substantial. We are spending significantly less time in
btrfs_finish_extent_commit() which is responsible for discarding.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/Makefile           |   2 +-
 fs/btrfs/block-group.c      |   4 +
 fs/btrfs/block-group.h      |   9 ++
 fs/btrfs/ctree.h            |  21 +++
 fs/btrfs/discard.c          | 273 ++++++++++++++++++++++++++++++++++++
 fs/btrfs/discard.h          |  26 ++++
 fs/btrfs/disk-io.c          |  15 +-
 fs/btrfs/extent-tree.c      |   4 +
 fs/btrfs/free-space-cache.c |  35 ++++-
 fs/btrfs/super.c            |  35 ++++-
 10 files changed, 414 insertions(+), 10 deletions(-)
 create mode 100644 fs/btrfs/discard.c
 create mode 100644 fs/btrfs/discard.h

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 82200dbca5ac..9a0ff3384381 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -11,7 +11,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 	   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
 	   reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
 	   uuid-tree.o props.o free-space-tree.o tree-checker.o space-info.o \
-	   block-rsv.o delalloc-space.o block-group.o
+	   block-rsv.o delalloc-space.o block-group.o discard.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 6064be2d5556..d2bc46c365f4 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -14,6 +14,7 @@
 #include "sysfs.h"
 #include "tree-log.h"
 #include "delalloc-space.h"
+#include "discard.h"
 
 /*
  * Return target flags in extended format or 0 if restripe for this chunk_type
@@ -1272,6 +1273,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 		}
 		spin_unlock(&fs_info->unused_bgs_lock);
 
+		btrfs_discard_cancel_work(&fs_info->discard_ctl, block_group);
+
 		mutex_lock(&fs_info->delete_unused_bgs_mutex);
 
 		/* Don't want to race with allocators so take the groups_sem */
@@ -1617,6 +1620,7 @@ static struct btrfs_block_group *btrfs_create_block_group_cache(
 	INIT_LIST_HEAD(&cache->cluster_list);
 	INIT_LIST_HEAD(&cache->bg_list);
 	INIT_LIST_HEAD(&cache->ro_list);
+	INIT_LIST_HEAD(&cache->discard_list);
 	INIT_LIST_HEAD(&cache->dirty_list);
 	INIT_LIST_HEAD(&cache->io_list);
 	btrfs_init_free_space_ctl(cache);
diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index 9b409676c4b2..884defd61dcd 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -116,7 +116,11 @@ struct btrfs_block_group {
 	/* For read-only block groups */
 	struct list_head ro_list;
 
+	/* For discard operations */
 	atomic_t trimming;
+	struct list_head discard_list;
+	int discard_index;
+	u64 discard_eligible_time;
 
 	/* For dirty block groups */
 	struct list_head dirty_list;
@@ -158,6 +162,11 @@ struct btrfs_block_group {
 	struct btrfs_full_stripe_locks_tree full_stripe_locks_root;
 };
 
+static inline u64 btrfs_block_group_end(struct btrfs_block_group *block_group)
+{
+	return (block_group->start + block_group->length);
+}
+
 #ifdef CONFIG_BTRFS_DEBUG
 static inline int btrfs_should_fragment_free_space(
 		struct btrfs_block_group *block_group)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8ac3b2deef4a..d5ce8054f074 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -440,6 +440,21 @@ struct btrfs_full_stripe_locks_tree {
 	struct mutex lock;
 };
 
+/* Discard control. */
+/*
+ * Async discard uses multiple lists to differentiate the discard filter
+ * parameters.
+ */
+#define BTRFS_NR_DISCARD_LISTS		1
+
+struct btrfs_discard_ctl {
+	struct workqueue_struct *discard_workers;
+	struct delayed_work work;
+	spinlock_t lock;
+	struct btrfs_block_group *block_group;
+	struct list_head discard_list[BTRFS_NR_DISCARD_LISTS];
+};
+
 /* delayed seq elem */
 struct seq_list {
 	struct list_head list;
@@ -526,6 +541,9 @@ enum {
 	 * so we don't need to offload checksums to workqueues.
 	 */
 	BTRFS_FS_CSUM_IMPL_FAST,
+
+	/* Indicate that the discard workqueue can service discards. */
+	BTRFS_FS_DISCARD_RUNNING,
 };
 
 struct btrfs_fs_info {
@@ -816,6 +834,8 @@ struct btrfs_fs_info {
 	struct btrfs_workqueue *scrub_wr_completion_workers;
 	struct btrfs_workqueue *scrub_parity_workers;
 
+	struct btrfs_discard_ctl discard_ctl;
+
 #ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
 	u32 check_integrity_print_mask;
 #endif
@@ -1189,6 +1209,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info)
 #define BTRFS_MOUNT_FREE_SPACE_TREE	(1 << 26)
 #define BTRFS_MOUNT_NOLOGREPLAY		(1 << 27)
 #define BTRFS_MOUNT_REF_VERIFY		(1 << 28)
+#define BTRFS_MOUNT_DISCARD_ASYNC	(1 << 29)
 
 #define BTRFS_DEFAULT_COMMIT_INTERVAL	(30)
 #define BTRFS_DEFAULT_MAX_INLINE	(2048)
diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
new file mode 100644
index 000000000000..15d54de3d682
--- /dev/null
+++ b/fs/btrfs/discard.c
@@ -0,0 +1,273 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/jiffies.h>
+#include <linux/kernel.h>
+#include <linux/ktime.h>
+#include <linux/list.h>
+#include <linux/sizes.h>
+#include <linux/workqueue.h>
+#include "ctree.h"
+#include "block-group.h"
+#include "discard.h"
+#include "free-space-cache.h"
+
+/* This is an initial delay to give some chance for lba reuse. */
+#define BTRFS_DISCARD_DELAY		(120ULL * NSEC_PER_SEC)
+
+static struct list_head *btrfs_get_discard_list(
+					struct btrfs_discard_ctl *discard_ctl,
+					struct btrfs_block_group *block_group)
+{
+	return &discard_ctl->discard_list[block_group->discard_index];
+}
+
+void btrfs_add_to_discard_list(struct btrfs_discard_ctl *discard_ctl,
+			       struct btrfs_block_group *block_group)
+{
+	spin_lock(&discard_ctl->lock);
+
+	if (list_empty(&block_group->discard_list))
+		block_group->discard_eligible_time = (ktime_get_ns() +
+						      BTRFS_DISCARD_DELAY);
+
+	list_move_tail(&block_group->discard_list,
+		       btrfs_get_discard_list(discard_ctl, block_group));
+
+	spin_unlock(&discard_ctl->lock);
+}
+
+static bool remove_from_discard_list(struct btrfs_discard_ctl *discard_ctl,
+				     struct btrfs_block_group *block_group)
+{
+	bool running = false;
+
+	spin_lock(&discard_ctl->lock);
+
+	if (block_group == discard_ctl->block_group) {
+		running = true;
+		discard_ctl->block_group = NULL;
+	}
+
+	block_group->discard_eligible_time = 0;
+	list_del_init(&block_group->discard_list);
+
+	spin_unlock(&discard_ctl->lock);
+
+	return running;
+}
+
+/**
+ * find_next_block_group - find block_group that's up next for discarding
+ * @discard_ctl: discard control
+ * @now: current time
+ *
+ * Iterate over the discard lists to find the next block_group up for
+ * discarding checking the discard_eligible_time of block_group.
+ */
+static struct btrfs_block_group *find_next_block_group(
+					struct btrfs_discard_ctl *discard_ctl,
+					u64 now)
+{
+	struct btrfs_block_group *ret_block_group = NULL, *block_group;
+	int i;
+
+	for (i = 0; i < BTRFS_NR_DISCARD_LISTS; i++) {
+		struct list_head *discard_list = &discard_ctl->discard_list[i];
+
+		if (!list_empty(discard_list)) {
+			block_group = list_first_entry(discard_list,
+						       struct btrfs_block_group,
+						       discard_list);
+
+			if (!ret_block_group)
+				ret_block_group = block_group;
+
+			if (ret_block_group->discard_eligible_time < now)
+				break;
+
+			if (ret_block_group->discard_eligible_time >
+			    block_group->discard_eligible_time)
+				ret_block_group = block_group;
+		}
+	}
+
+	return ret_block_group;
+}
+
+/**
+ * peek_discard_list - wrap find_next_block_group()
+ * @discard_ctl: discard control
+ *
+ * This wraps find_next_block_group() and sets the block_group to be in use.
+ */
+static struct btrfs_block_group *peek_discard_list(
+					struct btrfs_discard_ctl *discard_ctl)
+{
+	struct btrfs_block_group *block_group;
+	u64 now = ktime_get_ns();
+
+	spin_lock(&discard_ctl->lock);
+
+	block_group = find_next_block_group(discard_ctl, now);
+
+	if (block_group && now < block_group->discard_eligible_time)
+		block_group = NULL;
+
+	discard_ctl->block_group = block_group;
+
+	spin_unlock(&discard_ctl->lock);
+
+	return block_group;
+}
+
+/**
+ * btrfs_discard_cancel_work - remove a block_group from the discard lists
+ * @discard_ctl: discard control
+ * @block_group: block_group of interest
+ *
+ * This removes @block_group from the discard lists.  If necessary, it waits on
+ * the current work and then reschedules the delayed work.
+ */
+void btrfs_discard_cancel_work(struct btrfs_discard_ctl *discard_ctl,
+			       struct btrfs_block_group *block_group)
+{
+	if (remove_from_discard_list(discard_ctl, block_group)) {
+		cancel_delayed_work_sync(&discard_ctl->work);
+		btrfs_discard_schedule_work(discard_ctl, true);
+	}
+}
+
+/**
+ * btrfs_discard_queue_work - handles queuing the block_groups
+ * @discard_ctl: discard control
+ * @block_group: block_group of interest
+ *
+ * This maintains the LRU order of the discard lists.
+ */
+void btrfs_discard_queue_work(struct btrfs_discard_ctl *discard_ctl,
+			      struct btrfs_block_group *block_group)
+{
+	if (!block_group ||
+	    !btrfs_test_opt(block_group->fs_info, DISCARD_ASYNC))
+		return;
+
+	btrfs_add_to_discard_list(discard_ctl, block_group);
+	if (!delayed_work_pending(&discard_ctl->work))
+		btrfs_discard_schedule_work(discard_ctl, false);
+}
+
+/**
+ * btrfs_discard_schedule_work - responsible for scheduling the discard work
+ * @discard_ctl: discard control
+ * @override: override the current timer
+ *
+ * Discards are issued by a delayed workqueue item.  @override is used to
+ * update the current delay as the baseline delay interview is reevaluated
+ * on transaction commit.  This is also maxed with any other rate limit.
+ */
+void btrfs_discard_schedule_work(struct btrfs_discard_ctl *discard_ctl,
+				 bool override)
+{
+	struct btrfs_block_group *block_group;
+	u64 now = ktime_get_ns();
+
+	spin_lock(&discard_ctl->lock);
+
+	if (!btrfs_run_discard_work(discard_ctl))
+		goto out;
+
+	if (!override && delayed_work_pending(&discard_ctl->work))
+		goto out;
+
+	block_group = find_next_block_group(discard_ctl, now);
+	if (block_group) {
+		u64 delay = 0;
+
+		if (now < block_group->discard_eligible_time)
+			delay = nsecs_to_jiffies(
+				block_group->discard_eligible_time - now);
+
+		mod_delayed_work(discard_ctl->discard_workers,
+				 &discard_ctl->work,
+				 delay);
+	}
+
+out:
+	spin_unlock(&discard_ctl->lock);
+}
+
+/**
+ * btrfs_discard_workfn - discard work function
+ * @work: work
+ *
+ * This finds the next block_group to start discarding and then discards it.
+ */
+static void btrfs_discard_workfn(struct work_struct *work)
+{
+	struct btrfs_discard_ctl *discard_ctl;
+	struct btrfs_block_group *block_group;
+	u64 trimmed = 0;
+
+	discard_ctl = container_of(work, struct btrfs_discard_ctl, work.work);
+
+	block_group = peek_discard_list(discard_ctl);
+	if (!block_group || !btrfs_run_discard_work(discard_ctl))
+		return;
+
+	btrfs_trim_block_group(block_group, &trimmed, block_group->start,
+			       btrfs_block_group_end(block_group), 0);
+
+	remove_from_discard_list(discard_ctl, block_group);
+
+	btrfs_discard_schedule_work(discard_ctl, false);
+}
+
+/**
+ * btrfs_run_discard_work - determines if async discard should be running
+ * @discard_ctl: discard control
+ *
+ * Checks if the file system is writeable and BTRFS_FS_DISCARD_RUNNING is set.
+ */
+bool btrfs_run_discard_work(struct btrfs_discard_ctl *discard_ctl)
+{
+	struct btrfs_fs_info *fs_info = container_of(discard_ctl,
+						     struct btrfs_fs_info,
+						     discard_ctl);
+
+	return (!(fs_info->sb->s_flags & SB_RDONLY) &&
+		test_bit(BTRFS_FS_DISCARD_RUNNING, &fs_info->flags));
+}
+
+void btrfs_discard_resume(struct btrfs_fs_info *fs_info)
+{
+	if (!btrfs_test_opt(fs_info, DISCARD_ASYNC)) {
+		btrfs_discard_cleanup(fs_info);
+		return;
+	}
+
+	set_bit(BTRFS_FS_DISCARD_RUNNING, &fs_info->flags);
+}
+
+void btrfs_discard_stop(struct btrfs_fs_info *fs_info)
+{
+	clear_bit(BTRFS_FS_DISCARD_RUNNING, &fs_info->flags);
+}
+
+void btrfs_discard_init(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_discard_ctl *discard_ctl = &fs_info->discard_ctl;
+	int i;
+
+	spin_lock_init(&discard_ctl->lock);
+
+	INIT_DELAYED_WORK(&discard_ctl->work, btrfs_discard_workfn);
+
+	for (i = 0; i < BTRFS_NR_DISCARD_LISTS; i++)
+		INIT_LIST_HEAD(&discard_ctl->discard_list[i]);
+}
+
+void btrfs_discard_cleanup(struct btrfs_fs_info *fs_info)
+{
+	btrfs_discard_stop(fs_info);
+	cancel_delayed_work_sync(&fs_info->discard_ctl.work);
+}
diff --git a/fs/btrfs/discard.h b/fs/btrfs/discard.h
new file mode 100644
index 000000000000..439ca8c51877
--- /dev/null
+++ b/fs/btrfs/discard.h
@@ -0,0 +1,26 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#ifndef BTRFS_DISCARD_H
+#define BTRFS_DISCARD_H
+
+struct btrfs_fs_info;
+struct btrfs_discard_ctl;
+struct btrfs_block_group;
+
+void btrfs_add_to_discard_list(struct btrfs_discard_ctl *discard_ctl,
+			       struct btrfs_block_group *block_group);
+
+void btrfs_discard_cancel_work(struct btrfs_discard_ctl *discard_ctl,
+			       struct btrfs_block_group *block_group);
+void btrfs_discard_queue_work(struct btrfs_discard_ctl *discard_ctl,
+			      struct btrfs_block_group *block_group);
+void btrfs_discard_schedule_work(struct btrfs_discard_ctl *discard_ctl,
+				 bool override);
+bool btrfs_run_discard_work(struct btrfs_discard_ctl *discard_ctl);
+
+void btrfs_discard_resume(struct btrfs_fs_info *fs_info);
+void btrfs_discard_stop(struct btrfs_fs_info *fs_info);
+void btrfs_discard_init(struct btrfs_fs_info *fs_info);
+void btrfs_discard_cleanup(struct btrfs_fs_info *fs_info);
+
+#endif
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index e0edfdc9c82b..8785b5c999b0 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -41,6 +41,7 @@
 #include "tree-checker.h"
 #include "ref-verify.h"
 #include "block-group.h"
+#include "discard.h"
 
 #define BTRFS_SUPER_FLAG_SUPP	(BTRFS_HEADER_FLAG_WRITTEN |\
 				 BTRFS_HEADER_FLAG_RELOC |\
@@ -1953,6 +1954,8 @@ static void btrfs_stop_all_workers(struct btrfs_fs_info *fs_info)
 	btrfs_destroy_workqueue(fs_info->readahead_workers);
 	btrfs_destroy_workqueue(fs_info->flush_workers);
 	btrfs_destroy_workqueue(fs_info->qgroup_rescan_workers);
+	if (fs_info->discard_ctl.discard_workers)
+		destroy_workqueue(fs_info->discard_ctl.discard_workers);
 	/*
 	 * Now that all other work queues are destroyed, we can safely destroy
 	 * the queues used for metadata I/O, since tasks from those other work
@@ -2148,6 +2151,8 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info,
 				      max_active, 2);
 	fs_info->qgroup_rescan_workers =
 		btrfs_alloc_workqueue(fs_info, "qgroup-rescan", flags, 1, 0);
+	fs_info->discard_ctl.discard_workers =
+		alloc_workqueue("btrfs_discard", WQ_UNBOUND | WQ_FREEZABLE, 1);
 
 	if (!(fs_info->workers && fs_info->delalloc_workers &&
 	      fs_info->flush_workers &&
@@ -2158,7 +2163,8 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info,
 	      fs_info->endio_freespace_worker && fs_info->rmw_workers &&
 	      fs_info->caching_workers && fs_info->readahead_workers &&
 	      fs_info->fixup_workers && fs_info->delayed_workers &&
-	      fs_info->qgroup_rescan_workers)) {
+	      fs_info->qgroup_rescan_workers &&
+	      fs_info->discard_ctl.discard_workers)) {
 		return -ENOMEM;
 	}
 
@@ -2793,6 +2799,8 @@ int __cold open_ctree(struct super_block *sb,
 	btrfs_init_dev_replace_locks(fs_info);
 	btrfs_init_qgroup(fs_info);
 
+	btrfs_discard_init(fs_info);
+
 	btrfs_init_free_cluster(&fs_info->meta_alloc_cluster);
 	btrfs_init_free_cluster(&fs_info->data_alloc_cluster);
 
@@ -3263,6 +3271,8 @@ int __cold open_ctree(struct super_block *sb,
 
 	btrfs_qgroup_rescan_resume(fs_info);
 
+	btrfs_discard_resume(fs_info);
+
 	if (!fs_info->uuid_root) {
 		btrfs_info(fs_info, "creating UUID tree");
 		ret = btrfs_create_uuid_tree(fs_info);
@@ -3954,6 +3964,9 @@ void __cold close_ctree(struct btrfs_fs_info *fs_info)
 	 */
 	kthread_park(fs_info->cleaner_kthread);
 
+	/* cancel or finish ongoing work */
+	btrfs_discard_cleanup(fs_info);
+
 	/* wait for the qgroup rescan worker to stop */
 	btrfs_qgroup_wait_for_completion(fs_info, false);
 
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 5dc696d0c5b2..857642dc8589 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -32,6 +32,7 @@
 #include "block-rsv.h"
 #include "delalloc-space.h"
 #include "block-group.h"
+#include "discard.h"
 
 #undef SCRAMBLE_DELAYED_REFS
 
@@ -2940,6 +2941,9 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans)
 		cond_resched();
 	}
 
+	if (btrfs_test_opt(fs_info, DISCARD_ASYNC))
+		btrfs_discard_schedule_work(&fs_info->discard_ctl, true);
+
 	/*
 	 * Transaction is finished.  We don't need the lock anymore.  We
 	 * do need to clean up the block groups in case of a transaction
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 49ae1e6ee104..449bfaf6030f 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -21,6 +21,7 @@
 #include "space-info.h"
 #include "delalloc-space.h"
 #include "block-group.h"
+#include "discard.h"
 
 #define BITS_PER_BITMAP		(PAGE_SIZE * 8UL)
 #define MAX_CACHE_BYTES_PER_GIG	SZ_32K
@@ -755,9 +756,11 @@ static int __load_free_space_cache(struct btrfs_root *root, struct inode *inode,
 		/*
 		 * Sync discard ensures that the free space cache is always
 		 * trimmed.  So when reading this in, the state should reflect
-		 * that.
+		 * that.  We also do this for async as a stop gap for lack of
+		 * persistence.
 		 */
-		if (btrfs_test_opt(fs_info, DISCARD_SYNC))
+		if (btrfs_test_opt(fs_info, DISCARD_SYNC) ||
+		    btrfs_test_opt(fs_info, DISCARD_ASYNC))
 			e->trim_state = BTRFS_TRIM_STATE_TRIMMED;
 
 		if (!e->bytes) {
@@ -2382,6 +2385,7 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
 			   u64 offset, u64 bytes,
 			   enum btrfs_trim_state trim_state)
 {
+	struct btrfs_block_group *block_group = ctl->private;
 	struct btrfs_free_space *info;
 	int ret = 0;
 
@@ -2431,6 +2435,9 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
 		ASSERT(ret != -EEXIST);
 	}
 
+	if (trim_state != BTRFS_TRIM_STATE_TRIMMED)
+		btrfs_discard_queue_work(&fs_info->discard_ctl, block_group);
+
 	return ret;
 }
 
@@ -3204,6 +3211,7 @@ void btrfs_init_free_cluster(struct btrfs_free_cluster *cluster)
 static int do_trimming(struct btrfs_block_group *block_group,
 		       u64 *total_trimmed, u64 start, u64 bytes,
 		       u64 reserved_start, u64 reserved_bytes,
+		       enum btrfs_trim_state reserved_trim_state,
 		       struct btrfs_trim_range *trim_entry)
 {
 	struct btrfs_space_info *space_info = block_group->space_info;
@@ -3211,6 +3219,9 @@ static int do_trimming(struct btrfs_block_group *block_group,
 	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
 	int ret;
 	int update = 0;
+	u64 end = start + bytes;
+	u64 reserved_end = reserved_start + reserved_bytes;
+	enum btrfs_trim_state trim_state = BTRFS_TRIM_STATE_UNTRIMMED;
 	u64 trimmed = 0;
 
 	spin_lock(&space_info->lock);
@@ -3224,11 +3235,20 @@ static int do_trimming(struct btrfs_block_group *block_group,
 	spin_unlock(&space_info->lock);
 
 	ret = btrfs_discard_extent(fs_info, start, bytes, &trimmed);
-	if (!ret)
+	if (!ret) {
 		*total_trimmed += trimmed;
+		trim_state = BTRFS_TRIM_STATE_TRIMMED;
+	}
 
 	mutex_lock(&ctl->cache_writeout_mutex);
-	btrfs_add_free_space(block_group, reserved_start, reserved_bytes);
+	if (reserved_start < start)
+		__btrfs_add_free_space(fs_info, ctl, reserved_start,
+				       start - reserved_start,
+				       reserved_trim_state);
+	if (start + bytes < reserved_start + reserved_bytes)
+		__btrfs_add_free_space(fs_info, ctl, end, reserved_end - end,
+				       reserved_trim_state);
+	__btrfs_add_free_space(fs_info, ctl, start, bytes, trim_state);
 	list_del(&trim_entry->list);
 	mutex_unlock(&ctl->cache_writeout_mutex);
 
@@ -3255,6 +3275,7 @@ static int trim_no_bitmap(struct btrfs_block_group *block_group,
 	int ret = 0;
 	u64 extent_start;
 	u64 extent_bytes;
+	enum btrfs_trim_state extent_trim_state;
 	u64 bytes;
 
 	while (start < end) {
@@ -3296,6 +3317,7 @@ static int trim_no_bitmap(struct btrfs_block_group *block_group,
 
 		extent_start = entry->offset;
 		extent_bytes = entry->bytes;
+		extent_trim_state = entry->trim_state;
 		start = max(start, extent_start);
 		bytes = min(extent_start + extent_bytes, end) - start;
 		if (bytes < minlen) {
@@ -3314,7 +3336,8 @@ static int trim_no_bitmap(struct btrfs_block_group *block_group,
 		mutex_unlock(&ctl->cache_writeout_mutex);
 
 		ret = do_trimming(block_group, total_trimmed, start, bytes,
-				  extent_start, extent_bytes, &trim_entry);
+				  extent_start, extent_bytes, extent_trim_state,
+				  &trim_entry);
 		if (ret)
 			break;
 next:
@@ -3440,7 +3463,7 @@ static int trim_bitmaps(struct btrfs_block_group *block_group,
 		mutex_unlock(&ctl->cache_writeout_mutex);
 
 		ret = do_trimming(block_group, total_trimmed, start, bytes,
-				  start, bytes, &trim_entry);
+				  start, bytes, 0, &trim_entry);
 		if (ret) {
 			reset_trimming_bitmap(ctl, offset);
 			break;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index f131fb9f0f69..0edf00753c00 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -46,6 +46,7 @@
 #include "sysfs.h"
 #include "tests/btrfs-tests.h"
 #include "block-group.h"
+#include "discard.h"
 
 #include "qgroup.h"
 #define CREATE_TRACE_POINTS
@@ -146,6 +147,8 @@ void __btrfs_handle_fs_error(struct btrfs_fs_info *fs_info, const char *function
 	if (sb_rdonly(sb))
 		return;
 
+	btrfs_discard_stop(fs_info);
+
 	/* btrfs handle error by forcing the filesystem readonly */
 	sb->s_flags |= SB_RDONLY;
 	btrfs_info(fs_info, "forced readonly");
@@ -313,6 +316,7 @@ enum {
 	Opt_datasum, Opt_nodatasum,
 	Opt_defrag, Opt_nodefrag,
 	Opt_discard, Opt_nodiscard,
+	Opt_discard_mode,
 	Opt_nologreplay,
 	Opt_norecovery,
 	Opt_ratio,
@@ -376,6 +380,7 @@ static const match_table_t tokens = {
 	{Opt_nodefrag, "noautodefrag"},
 	{Opt_discard, "discard"},
 	{Opt_nodiscard, "nodiscard"},
+	{Opt_discard_mode, "discard=%s"},
 	{Opt_nologreplay, "nologreplay"},
 	{Opt_norecovery, "norecovery"},
 	{Opt_ratio, "metadata_ratio=%u"},
@@ -695,12 +700,26 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 				   info->metadata_ratio);
 			break;
 		case Opt_discard:
-			btrfs_set_and_info(info, DISCARD_SYNC,
-					   "turning on sync discard");
+		case Opt_discard_mode:
+			if (token == Opt_discard ||
+			    strcmp(args[0].from, "sync") == 0) {
+				btrfs_clear_opt(info->mount_opt, DISCARD_ASYNC);
+				btrfs_set_and_info(info, DISCARD_SYNC,
+						   "turning on sync discard");
+			} else if (strcmp(args[0].from, "async") == 0) {
+				btrfs_clear_opt(info->mount_opt, DISCARD_SYNC);
+				btrfs_set_and_info(info, DISCARD_ASYNC,
+						   "turning on async discard");
+			} else {
+				ret = -EINVAL;
+				goto out;
+			}
 			break;
 		case Opt_nodiscard:
 			btrfs_clear_and_info(info, DISCARD_SYNC,
 					     "turning off discard");
+			btrfs_clear_and_info(info, DISCARD_ASYNC,
+					     "turning off async discard");
 			break;
 		case Opt_space_cache:
 		case Opt_space_cache_version:
@@ -1324,6 +1343,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
 		seq_puts(seq, ",flushoncommit");
 	if (btrfs_test_opt(info, DISCARD_SYNC))
 		seq_puts(seq, ",discard");
+	if (btrfs_test_opt(info, DISCARD_ASYNC))
+		seq_puts(seq, ",discard=async");
 	if (!(info->sb->s_flags & SB_POSIXACL))
 		seq_puts(seq, ",noacl");
 	if (btrfs_test_opt(info, SPACE_CACHE))
@@ -1713,6 +1734,14 @@ static inline void btrfs_remount_cleanup(struct btrfs_fs_info *fs_info,
 		btrfs_cleanup_defrag_inodes(fs_info);
 	}
 
+	/* If we toggled discard async. */
+	if (!btrfs_raw_test_opt(old_opts, DISCARD_ASYNC) &&
+	    btrfs_test_opt(fs_info, DISCARD_ASYNC))
+		btrfs_discard_resume(fs_info);
+	else if (btrfs_raw_test_opt(old_opts, DISCARD_ASYNC) &&
+		 !btrfs_test_opt(fs_info, DISCARD_ASYNC))
+		btrfs_discard_cleanup(fs_info);
+
 	clear_bit(BTRFS_FS_STATE_REMOUNTING, &fs_info->fs_state);
 }
 
@@ -1760,6 +1789,8 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
 		 */
 		cancel_work_sync(&fs_info->async_reclaim_work);
 
+		btrfs_discard_cleanup(fs_info);
+
 		/* wait for the uuid_scan task to finish */
 		down(&fs_info->uuid_tree_rescan_sem);
 		/* avoid complains from lockdep et al. */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v3 00/22] btrfs: async discard support
@ 2019-11-20 21:50 Dennis Zhou
  2019-11-20 21:51 ` [PATCH 05/22] btrfs: add the beginning of async discard, discard workqueue Dennis Zhou
  0 siblings, 1 reply; 23+ messages in thread
From: Dennis Zhou @ 2019-11-20 21:50 UTC (permalink / raw)
  To: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval
  Cc: kernel-team, linux-btrfs, Dennis Zhou

Hello,

There were only a few minor things that changed from v2 to v3 in
addition to the rebase on top of misc-next.

The notable changes:
 - Non-async discarding continues to retrim regions because we load the
   block groups as fully trimmed. This gives us a way to reconcile the
   filesystem state on device after reboot with fstrim if deisred.
 - Moved BTRFS_TRIM_STATE_UNTRIMMED to be 0 in the enum.
 - Kept discard sysfs under debug/.
 - Rebased on top of btrfs-devel#misc-next fa17ed069c61. Renamed
   instances of cache -> block_group to adjust to the rename of
   btrfs_block_group_cache -> btrfs_block_group.

---

The RFC [1] went a lot better than expected, so let's call this v2.

Changes:
 - I don't believe there are any functional changes to the code, but
   there are plenty of changes for readability and a lot of comments
   added.
 - Split out the sysfs parts into their own patches as it was a bit of
   work to put discard/ under debug/.
 - Switched over any flags to be enums. All the variables are now of
   appropriate type rather than being all atomics.
 - Dropped 0015 [2] from v1, load block groups on mount. This is in
   favor of reading everything in as clean and running fstrim as needed.
 - Misc. bug fixes.
 - I changed the block_group discard delay from 300s to 120s. It is just
   to give the allocators a chance to reuse the LBA before letting async
   discard take a crack at it.

Data:
On a number of webservers, I collected data every minute accounting the
time we spent in btrfs_finish_extent_commit() (col. 1) and in
btrfs_commit_transaction() (col. 2). btrfs_finish_extent_commit() is
where we discard extents synchronously before returning them to the
free space cache.

discard=sync:
                 p99 total per minute       p99 total per minute
      Drive   |   extent_commit() (ms)  |    commit_trans() (ms)
    ---------------------------------------------------------------
     Drive A  |           434           |          1170 
     Drive B  |           880           |          2330
     Drive C  |          2943           |          3920
     Drive D  |          4763           |          5701

discard=async:
                 p99 total per minute       p99 total per minute
      Drive   |   extent_commit() (ms)  |    commit_trans() (ms)
    --------------------------------------------------------------
     Drive A  |           134           |           956
     Drive B  |            64           |          1972
     Drive C  |            59           |          1032
     Drive D  |            62           |          1200

While it's not great that the stats are cumulative over 1m, all of these
servers are running the same workload and and the delta between the two
are substantial. We are spending significantly less time in
btrfs_finish_extent_commit() which is responsible for discarding.

From v1:
--------
Discard is an operation that allows for the filesystem to communicate
with underlying ssds that a lba region is no longer needed. This gives
the drive the more information as it tries to manage the available free
space to minimize write amplification. However, discard hasn't been
given the most tlc. Discard is a problematic command because a drive's
physical block management is more or less a black box to us and the
effects of any particular discard aren't necessarily limited the
lifetime of a command.

Currently, btrfs handles discarding synchronously during transaction
commit. This problematically can delay transaction commit based on the
amount of space that needs to be trimmed and the efficacy of the discard
operation for a particular drive.

This series introduces async discarding, which removes discard from the
transaction commit path. While every SSD has the choice of implementing
trim support different, we strive here to do the right thing. The idea
hinges on recognizing that write amplification really only kicks in once
we're really low on free space.  As long as we trim enough to keep a
large enough pool of free space, in theory this should minimize the cost
of issuing discards on a workload and have limited cost overhead in
write amplification.

With async discard, we try to emphasize discarding larger regions
and reusing the lba (implicit discard). The first is done by using the
free space cache to maintain discard state and thus allows us to get
coalescing for fairly cheap. A background workqueue is used to scan over
an LRU kept list of the block groups. It then uses filters to determine
what to discard next hence giving priority to larger discards. While
reusing an lba isn't explicitly attempted, it happens implicitly via
find_free_extent() which if it happens to find a dirty extent, will
grant us reuse of the lba. Additionally, async discarding skips metadata
block groups as these should see a fairly high turnover as btrfs is a
self-packing filesystem being stingy with allocating new block groups
until necessary.

Preliminary results seem promising as when a lot of freeing is going on,
the discarding is delayed allowing for reuse which translates to less
discarding (in addition to the slower discarding). This has shown a
reduction in p90 and p99 read latencies on a test on our webservers.

I am currently working on tuning the rate at which it discards in the
background. I am doing this by evaluating other workloads and drives.
The iops and bps rate limits are fairly aggressive right now as my
basic survey of a few drives noted that the trim command itself is a
significant part of the overhead. So optimizing for larger trims is the
right thing to do.

This series contains the following 22 patches and is on top of
btrfs-devel#misc-next fa17ed069c61:
  0001-bitmap-genericize-percpu-bitmap-region-iterators.patch
  0002-btrfs-rename-DISCARD-opt-to-DISCARD_SYNC.patch
  0003-btrfs-keep-track-of-which-extents-have-been-discarde.patch
  0004-btrfs-keep-track-of-cleanliness-of-the-bitmap.patch
  0005-btrfs-add-the-beginning-of-async-discard-discard-wor.patch
  0006-btrfs-handle-empty-block_group-removal.patch
  0007-btrfs-discard-one-region-at-a-time-in-async-discard.patch
  0008-btrfs-add-removal-calls-for-sysfs-debug.patch
  0009-btrfs-make-UUID-debug-have-its-own-kobject.patch
  0010-btrfs-add-discard-sysfs-directory.patch
  0011-btrfs-track-discardable-extents-for-async-discard.patch
  0012-btrfs-keep-track-of-discardable_bytes.patch
  0013-btrfs-calculate-discard-delay-based-on-number-of-ext.patch
  0014-btrfs-add-bps-discard-rate-limit.patch
  0015-btrfs-limit-max-discard-size-for-async-discard.patch
  0016-btrfs-make-max-async-discard-size-tunable.patch
  0017-btrfs-have-multiple-discard-lists.patch
  0018-btrfs-only-keep-track-of-data-extents-for-async-disc.patch
  0019-btrfs-keep-track-of-discard-reuse-stats.patch
  0020-btrfs-add-async-discard-header.patch
  0021-btrfs-increase-the-metadata-allowance-for-the-free_s.patch
  0022-btrfs-make-smaller-extents-more-likely-to-go-into-bi.patch

0001 exports percpu's bitmap iterators for eventual use in 0011. 0002
renames DISCARD to DISCARD_SYNC. 0003 and 0004 adds discard tracking to
the free space cache. 0005-0007 adds the core of async discard support.
0008-0010 modify debug sysfs to allow for adding discard/ under debug/.
0011-0016 fiddle with stats and operation limits. 0018 makes async
discarding only track data block groups. 0019 adds reuse stats. 0020
adds an explanation header to discard.c. 0021 and 0022 modify the free
space cache metadata allowance, add a bitmap -> extent path and makes us
more likely to put smaller extents into the bitmaps.

diffstats below:

Dennis Zhou (22):
  bitmap: genericize percpu bitmap region iterators
  btrfs: rename DISCARD opt to DISCARD_SYNC
  btrfs: keep track of which extents have been discarded
  btrfs: keep track of cleanliness of the bitmap
  btrfs: add the beginning of async discard, discard workqueue
  btrfs: handle empty block_group removal
  btrfs: discard one region at a time in async discard
  btrfs: add removal calls for sysfs debug/
  btrfs: make UUID/debug have its own kobject
  btrfs: add discard sysfs directory
  btrfs: track discardable extents for async discard
  btrfs: keep track of discardable_bytes
  btrfs: calculate discard delay based on number of extents
  btrfs: add bps discard rate limit
  btrfs: limit max discard size for async discard
  btrfs: make max async discard size tunable
  btrfs: have multiple discard lists
  btrfs: only keep track of data extents for async discard
  btrfs: keep track of discard reuse stats
  btrfs: add async discard header
  btrfs: increase the metadata allowance for the free_space_cache
  btrfs: make smaller extents more likely to go into bitmaps

 fs/btrfs/Makefile           |   2 +-
 fs/btrfs/block-group.c      |  56 ++-
 fs/btrfs/block-group.h      |  30 ++
 fs/btrfs/ctree.h            |  52 ++-
 fs/btrfs/discard.c          | 679 ++++++++++++++++++++++++++++++++++++
 fs/btrfs/discard.h          |  46 +++
 fs/btrfs/disk-io.c          |  15 +-
 fs/btrfs/extent-tree.c      |  23 +-
 fs/btrfs/free-space-cache.c | 585 ++++++++++++++++++++++++++-----
 fs/btrfs/free-space-cache.h |  39 ++-
 fs/btrfs/inode-map.c        |  13 +-
 fs/btrfs/scrub.c            |   7 +-
 fs/btrfs/super.c            |  39 ++-
 fs/btrfs/sysfs.c            | 205 ++++++++++-
 include/linux/bitmap.h      |  35 ++
 mm/percpu.c                 |  61 +---
 16 files changed, 1734 insertions(+), 153 deletions(-)
 create mode 100644 fs/btrfs/discard.c
 create mode 100644 fs/btrfs/discard.h

Thanks,
Dennis

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 05/22] btrfs: add the beginning of async discard, discard workqueue
  2019-11-20 21:50 [PATCH v3 00/22] btrfs: async discard support Dennis Zhou
@ 2019-11-20 21:51 ` Dennis Zhou
  0 siblings, 0 replies; 23+ messages in thread
From: Dennis Zhou @ 2019-11-20 21:51 UTC (permalink / raw)
  To: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval
  Cc: kernel-team, linux-btrfs, Dennis Zhou

When discard is enabled, everytime a pinned extent is released back to
the block_group's free space cache, a discard is issued for the extent.
This is an overeager approach when it comes to discarding and helping
the SSD maintain enough free space to prevent severe garbage collection
situations.

This adds the beginning of async discard. Instead of issuing a discard
prior to returning it to the free space, it is just marked as untrimmed.
The block_group is then added to a LRU which then feeds into a workqueue
to issue discards at a much slower rate. Full discarding of unused block
groups is still done and will be address in a future patch in this
series.

For now, we don't persist the discard state of extents and bitmaps.
Therefore, our failure recovery mode will be to consider extents
untrimmed. This lets us handle failure and unmounting as one in the
same.

On a number of Facebook webservers, I collected data every minute
accounting the time we spent in btrfs_finish_extent_commit() (col. 1)
and in btrfs_commit_transaction() (col. 2). btrfs_finish_extent_commit()
is where we discard extents synchronously before returning them to the
free space cache.

discard=sync:
                 p99 total per minute       p99 total per minute
      Drive   |   extent_commit() (ms)  |    commit_trans() (ms)
    ---------------------------------------------------------------
     Drive A  |           434           |          1170
     Drive B  |           880           |          2330
     Drive C  |          2943           |          3920
     Drive D  |          4763           |          5701

discard=async:
                 p99 total per minute       p99 total per minute
      Drive   |   extent_commit() (ms)  |    commit_trans() (ms)
    --------------------------------------------------------------
     Drive A  |           134           |           956
     Drive B  |            64           |          1972
     Drive C  |            59           |          1032
     Drive D  |            62           |          1200

While it's not great that the stats are cumulative over 1m, all of these
servers are running the same workload and and the delta between the two
are substantial. We are spending significantly less time in
btrfs_finish_extent_commit() which is responsible for discarding.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/Makefile           |   2 +-
 fs/btrfs/block-group.c      |   4 +
 fs/btrfs/block-group.h      |   9 ++
 fs/btrfs/ctree.h            |  21 +++
 fs/btrfs/discard.c          | 273 ++++++++++++++++++++++++++++++++++++
 fs/btrfs/discard.h          |  26 ++++
 fs/btrfs/disk-io.c          |  15 +-
 fs/btrfs/extent-tree.c      |   4 +
 fs/btrfs/free-space-cache.c |  35 ++++-
 fs/btrfs/super.c            |  35 ++++-
 10 files changed, 414 insertions(+), 10 deletions(-)
 create mode 100644 fs/btrfs/discard.c
 create mode 100644 fs/btrfs/discard.h

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 82200dbca5ac..9a0ff3384381 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -11,7 +11,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 	   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
 	   reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
 	   uuid-tree.o props.o free-space-tree.o tree-checker.o space-info.o \
-	   block-rsv.o delalloc-space.o block-group.o
+	   block-rsv.o delalloc-space.o block-group.o discard.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 6064be2d5556..d2bc46c365f4 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -14,6 +14,7 @@
 #include "sysfs.h"
 #include "tree-log.h"
 #include "delalloc-space.h"
+#include "discard.h"
 
 /*
  * Return target flags in extended format or 0 if restripe for this chunk_type
@@ -1272,6 +1273,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 		}
 		spin_unlock(&fs_info->unused_bgs_lock);
 
+		btrfs_discard_cancel_work(&fs_info->discard_ctl, block_group);
+
 		mutex_lock(&fs_info->delete_unused_bgs_mutex);
 
 		/* Don't want to race with allocators so take the groups_sem */
@@ -1617,6 +1620,7 @@ static struct btrfs_block_group *btrfs_create_block_group_cache(
 	INIT_LIST_HEAD(&cache->cluster_list);
 	INIT_LIST_HEAD(&cache->bg_list);
 	INIT_LIST_HEAD(&cache->ro_list);
+	INIT_LIST_HEAD(&cache->discard_list);
 	INIT_LIST_HEAD(&cache->dirty_list);
 	INIT_LIST_HEAD(&cache->io_list);
 	btrfs_init_free_space_ctl(cache);
diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index 9b409676c4b2..884defd61dcd 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -116,7 +116,11 @@ struct btrfs_block_group {
 	/* For read-only block groups */
 	struct list_head ro_list;
 
+	/* For discard operations */
 	atomic_t trimming;
+	struct list_head discard_list;
+	int discard_index;
+	u64 discard_eligible_time;
 
 	/* For dirty block groups */
 	struct list_head dirty_list;
@@ -158,6 +162,11 @@ struct btrfs_block_group {
 	struct btrfs_full_stripe_locks_tree full_stripe_locks_root;
 };
 
+static inline u64 btrfs_block_group_end(struct btrfs_block_group *block_group)
+{
+	return (block_group->start + block_group->length);
+}
+
 #ifdef CONFIG_BTRFS_DEBUG
 static inline int btrfs_should_fragment_free_space(
 		struct btrfs_block_group *block_group)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8ac3b2deef4a..d5ce8054f074 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -440,6 +440,21 @@ struct btrfs_full_stripe_locks_tree {
 	struct mutex lock;
 };
 
+/* Discard control. */
+/*
+ * Async discard uses multiple lists to differentiate the discard filter
+ * parameters.
+ */
+#define BTRFS_NR_DISCARD_LISTS		1
+
+struct btrfs_discard_ctl {
+	struct workqueue_struct *discard_workers;
+	struct delayed_work work;
+	spinlock_t lock;
+	struct btrfs_block_group *block_group;
+	struct list_head discard_list[BTRFS_NR_DISCARD_LISTS];
+};
+
 /* delayed seq elem */
 struct seq_list {
 	struct list_head list;
@@ -526,6 +541,9 @@ enum {
 	 * so we don't need to offload checksums to workqueues.
 	 */
 	BTRFS_FS_CSUM_IMPL_FAST,
+
+	/* Indicate that the discard workqueue can service discards. */
+	BTRFS_FS_DISCARD_RUNNING,
 };
 
 struct btrfs_fs_info {
@@ -816,6 +834,8 @@ struct btrfs_fs_info {
 	struct btrfs_workqueue *scrub_wr_completion_workers;
 	struct btrfs_workqueue *scrub_parity_workers;
 
+	struct btrfs_discard_ctl discard_ctl;
+
 #ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
 	u32 check_integrity_print_mask;
 #endif
@@ -1189,6 +1209,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info)
 #define BTRFS_MOUNT_FREE_SPACE_TREE	(1 << 26)
 #define BTRFS_MOUNT_NOLOGREPLAY		(1 << 27)
 #define BTRFS_MOUNT_REF_VERIFY		(1 << 28)
+#define BTRFS_MOUNT_DISCARD_ASYNC	(1 << 29)
 
 #define BTRFS_DEFAULT_COMMIT_INTERVAL	(30)
 #define BTRFS_DEFAULT_MAX_INLINE	(2048)
diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
new file mode 100644
index 000000000000..15d54de3d682
--- /dev/null
+++ b/fs/btrfs/discard.c
@@ -0,0 +1,273 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/jiffies.h>
+#include <linux/kernel.h>
+#include <linux/ktime.h>
+#include <linux/list.h>
+#include <linux/sizes.h>
+#include <linux/workqueue.h>
+#include "ctree.h"
+#include "block-group.h"
+#include "discard.h"
+#include "free-space-cache.h"
+
+/* This is an initial delay to give some chance for lba reuse. */
+#define BTRFS_DISCARD_DELAY		(120ULL * NSEC_PER_SEC)
+
+static struct list_head *btrfs_get_discard_list(
+					struct btrfs_discard_ctl *discard_ctl,
+					struct btrfs_block_group *block_group)
+{
+	return &discard_ctl->discard_list[block_group->discard_index];
+}
+
+void btrfs_add_to_discard_list(struct btrfs_discard_ctl *discard_ctl,
+			       struct btrfs_block_group *block_group)
+{
+	spin_lock(&discard_ctl->lock);
+
+	if (list_empty(&block_group->discard_list))
+		block_group->discard_eligible_time = (ktime_get_ns() +
+						      BTRFS_DISCARD_DELAY);
+
+	list_move_tail(&block_group->discard_list,
+		       btrfs_get_discard_list(discard_ctl, block_group));
+
+	spin_unlock(&discard_ctl->lock);
+}
+
+static bool remove_from_discard_list(struct btrfs_discard_ctl *discard_ctl,
+				     struct btrfs_block_group *block_group)
+{
+	bool running = false;
+
+	spin_lock(&discard_ctl->lock);
+
+	if (block_group == discard_ctl->block_group) {
+		running = true;
+		discard_ctl->block_group = NULL;
+	}
+
+	block_group->discard_eligible_time = 0;
+	list_del_init(&block_group->discard_list);
+
+	spin_unlock(&discard_ctl->lock);
+
+	return running;
+}
+
+/**
+ * find_next_block_group - find block_group that's up next for discarding
+ * @discard_ctl: discard control
+ * @now: current time
+ *
+ * Iterate over the discard lists to find the next block_group up for
+ * discarding checking the discard_eligible_time of block_group.
+ */
+static struct btrfs_block_group *find_next_block_group(
+					struct btrfs_discard_ctl *discard_ctl,
+					u64 now)
+{
+	struct btrfs_block_group *ret_block_group = NULL, *block_group;
+	int i;
+
+	for (i = 0; i < BTRFS_NR_DISCARD_LISTS; i++) {
+		struct list_head *discard_list = &discard_ctl->discard_list[i];
+
+		if (!list_empty(discard_list)) {
+			block_group = list_first_entry(discard_list,
+						       struct btrfs_block_group,
+						       discard_list);
+
+			if (!ret_block_group)
+				ret_block_group = block_group;
+
+			if (ret_block_group->discard_eligible_time < now)
+				break;
+
+			if (ret_block_group->discard_eligible_time >
+			    block_group->discard_eligible_time)
+				ret_block_group = block_group;
+		}
+	}
+
+	return ret_block_group;
+}
+
+/**
+ * peek_discard_list - wrap find_next_block_group()
+ * @discard_ctl: discard control
+ *
+ * This wraps find_next_block_group() and sets the block_group to be in use.
+ */
+static struct btrfs_block_group *peek_discard_list(
+					struct btrfs_discard_ctl *discard_ctl)
+{
+	struct btrfs_block_group *block_group;
+	u64 now = ktime_get_ns();
+
+	spin_lock(&discard_ctl->lock);
+
+	block_group = find_next_block_group(discard_ctl, now);
+
+	if (block_group && now < block_group->discard_eligible_time)
+		block_group = NULL;
+
+	discard_ctl->block_group = block_group;
+
+	spin_unlock(&discard_ctl->lock);
+
+	return block_group;
+}
+
+/**
+ * btrfs_discard_cancel_work - remove a block_group from the discard lists
+ * @discard_ctl: discard control
+ * @block_group: block_group of interest
+ *
+ * This removes @block_group from the discard lists.  If necessary, it waits on
+ * the current work and then reschedules the delayed work.
+ */
+void btrfs_discard_cancel_work(struct btrfs_discard_ctl *discard_ctl,
+			       struct btrfs_block_group *block_group)
+{
+	if (remove_from_discard_list(discard_ctl, block_group)) {
+		cancel_delayed_work_sync(&discard_ctl->work);
+		btrfs_discard_schedule_work(discard_ctl, true);
+	}
+}
+
+/**
+ * btrfs_discard_queue_work - handles queuing the block_groups
+ * @discard_ctl: discard control
+ * @block_group: block_group of interest
+ *
+ * This maintains the LRU order of the discard lists.
+ */
+void btrfs_discard_queue_work(struct btrfs_discard_ctl *discard_ctl,
+			      struct btrfs_block_group *block_group)
+{
+	if (!block_group ||
+	    !btrfs_test_opt(block_group->fs_info, DISCARD_ASYNC))
+		return;
+
+	btrfs_add_to_discard_list(discard_ctl, block_group);
+	if (!delayed_work_pending(&discard_ctl->work))
+		btrfs_discard_schedule_work(discard_ctl, false);
+}
+
+/**
+ * btrfs_discard_schedule_work - responsible for scheduling the discard work
+ * @discard_ctl: discard control
+ * @override: override the current timer
+ *
+ * Discards are issued by a delayed workqueue item.  @override is used to
+ * update the current delay as the baseline delay interview is reevaluated
+ * on transaction commit.  This is also maxed with any other rate limit.
+ */
+void btrfs_discard_schedule_work(struct btrfs_discard_ctl *discard_ctl,
+				 bool override)
+{
+	struct btrfs_block_group *block_group;
+	u64 now = ktime_get_ns();
+
+	spin_lock(&discard_ctl->lock);
+
+	if (!btrfs_run_discard_work(discard_ctl))
+		goto out;
+
+	if (!override && delayed_work_pending(&discard_ctl->work))
+		goto out;
+
+	block_group = find_next_block_group(discard_ctl, now);
+	if (block_group) {
+		u64 delay = 0;
+
+		if (now < block_group->discard_eligible_time)
+			delay = nsecs_to_jiffies(
+				block_group->discard_eligible_time - now);
+
+		mod_delayed_work(discard_ctl->discard_workers,
+				 &discard_ctl->work,
+				 delay);
+	}
+
+out:
+	spin_unlock(&discard_ctl->lock);
+}
+
+/**
+ * btrfs_discard_workfn - discard work function
+ * @work: work
+ *
+ * This finds the next block_group to start discarding and then discards it.
+ */
+static void btrfs_discard_workfn(struct work_struct *work)
+{
+	struct btrfs_discard_ctl *discard_ctl;
+	struct btrfs_block_group *block_group;
+	u64 trimmed = 0;
+
+	discard_ctl = container_of(work, struct btrfs_discard_ctl, work.work);
+
+	block_group = peek_discard_list(discard_ctl);
+	if (!block_group || !btrfs_run_discard_work(discard_ctl))
+		return;
+
+	btrfs_trim_block_group(block_group, &trimmed, block_group->start,
+			       btrfs_block_group_end(block_group), 0);
+
+	remove_from_discard_list(discard_ctl, block_group);
+
+	btrfs_discard_schedule_work(discard_ctl, false);
+}
+
+/**
+ * btrfs_run_discard_work - determines if async discard should be running
+ * @discard_ctl: discard control
+ *
+ * Checks if the file system is writeable and BTRFS_FS_DISCARD_RUNNING is set.
+ */
+bool btrfs_run_discard_work(struct btrfs_discard_ctl *discard_ctl)
+{
+	struct btrfs_fs_info *fs_info = container_of(discard_ctl,
+						     struct btrfs_fs_info,
+						     discard_ctl);
+
+	return (!(fs_info->sb->s_flags & SB_RDONLY) &&
+		test_bit(BTRFS_FS_DISCARD_RUNNING, &fs_info->flags));
+}
+
+void btrfs_discard_resume(struct btrfs_fs_info *fs_info)
+{
+	if (!btrfs_test_opt(fs_info, DISCARD_ASYNC)) {
+		btrfs_discard_cleanup(fs_info);
+		return;
+	}
+
+	set_bit(BTRFS_FS_DISCARD_RUNNING, &fs_info->flags);
+}
+
+void btrfs_discard_stop(struct btrfs_fs_info *fs_info)
+{
+	clear_bit(BTRFS_FS_DISCARD_RUNNING, &fs_info->flags);
+}
+
+void btrfs_discard_init(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_discard_ctl *discard_ctl = &fs_info->discard_ctl;
+	int i;
+
+	spin_lock_init(&discard_ctl->lock);
+
+	INIT_DELAYED_WORK(&discard_ctl->work, btrfs_discard_workfn);
+
+	for (i = 0; i < BTRFS_NR_DISCARD_LISTS; i++)
+		INIT_LIST_HEAD(&discard_ctl->discard_list[i]);
+}
+
+void btrfs_discard_cleanup(struct btrfs_fs_info *fs_info)
+{
+	btrfs_discard_stop(fs_info);
+	cancel_delayed_work_sync(&fs_info->discard_ctl.work);
+}
diff --git a/fs/btrfs/discard.h b/fs/btrfs/discard.h
new file mode 100644
index 000000000000..439ca8c51877
--- /dev/null
+++ b/fs/btrfs/discard.h
@@ -0,0 +1,26 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#ifndef BTRFS_DISCARD_H
+#define BTRFS_DISCARD_H
+
+struct btrfs_fs_info;
+struct btrfs_discard_ctl;
+struct btrfs_block_group;
+
+void btrfs_add_to_discard_list(struct btrfs_discard_ctl *discard_ctl,
+			       struct btrfs_block_group *block_group);
+
+void btrfs_discard_cancel_work(struct btrfs_discard_ctl *discard_ctl,
+			       struct btrfs_block_group *block_group);
+void btrfs_discard_queue_work(struct btrfs_discard_ctl *discard_ctl,
+			      struct btrfs_block_group *block_group);
+void btrfs_discard_schedule_work(struct btrfs_discard_ctl *discard_ctl,
+				 bool override);
+bool btrfs_run_discard_work(struct btrfs_discard_ctl *discard_ctl);
+
+void btrfs_discard_resume(struct btrfs_fs_info *fs_info);
+void btrfs_discard_stop(struct btrfs_fs_info *fs_info);
+void btrfs_discard_init(struct btrfs_fs_info *fs_info);
+void btrfs_discard_cleanup(struct btrfs_fs_info *fs_info);
+
+#endif
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index e0edfdc9c82b..8785b5c999b0 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -41,6 +41,7 @@
 #include "tree-checker.h"
 #include "ref-verify.h"
 #include "block-group.h"
+#include "discard.h"
 
 #define BTRFS_SUPER_FLAG_SUPP	(BTRFS_HEADER_FLAG_WRITTEN |\
 				 BTRFS_HEADER_FLAG_RELOC |\
@@ -1953,6 +1954,8 @@ static void btrfs_stop_all_workers(struct btrfs_fs_info *fs_info)
 	btrfs_destroy_workqueue(fs_info->readahead_workers);
 	btrfs_destroy_workqueue(fs_info->flush_workers);
 	btrfs_destroy_workqueue(fs_info->qgroup_rescan_workers);
+	if (fs_info->discard_ctl.discard_workers)
+		destroy_workqueue(fs_info->discard_ctl.discard_workers);
 	/*
 	 * Now that all other work queues are destroyed, we can safely destroy
 	 * the queues used for metadata I/O, since tasks from those other work
@@ -2148,6 +2151,8 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info,
 				      max_active, 2);
 	fs_info->qgroup_rescan_workers =
 		btrfs_alloc_workqueue(fs_info, "qgroup-rescan", flags, 1, 0);
+	fs_info->discard_ctl.discard_workers =
+		alloc_workqueue("btrfs_discard", WQ_UNBOUND | WQ_FREEZABLE, 1);
 
 	if (!(fs_info->workers && fs_info->delalloc_workers &&
 	      fs_info->flush_workers &&
@@ -2158,7 +2163,8 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info,
 	      fs_info->endio_freespace_worker && fs_info->rmw_workers &&
 	      fs_info->caching_workers && fs_info->readahead_workers &&
 	      fs_info->fixup_workers && fs_info->delayed_workers &&
-	      fs_info->qgroup_rescan_workers)) {
+	      fs_info->qgroup_rescan_workers &&
+	      fs_info->discard_ctl.discard_workers)) {
 		return -ENOMEM;
 	}
 
@@ -2793,6 +2799,8 @@ int __cold open_ctree(struct super_block *sb,
 	btrfs_init_dev_replace_locks(fs_info);
 	btrfs_init_qgroup(fs_info);
 
+	btrfs_discard_init(fs_info);
+
 	btrfs_init_free_cluster(&fs_info->meta_alloc_cluster);
 	btrfs_init_free_cluster(&fs_info->data_alloc_cluster);
 
@@ -3263,6 +3271,8 @@ int __cold open_ctree(struct super_block *sb,
 
 	btrfs_qgroup_rescan_resume(fs_info);
 
+	btrfs_discard_resume(fs_info);
+
 	if (!fs_info->uuid_root) {
 		btrfs_info(fs_info, "creating UUID tree");
 		ret = btrfs_create_uuid_tree(fs_info);
@@ -3954,6 +3964,9 @@ void __cold close_ctree(struct btrfs_fs_info *fs_info)
 	 */
 	kthread_park(fs_info->cleaner_kthread);
 
+	/* cancel or finish ongoing work */
+	btrfs_discard_cleanup(fs_info);
+
 	/* wait for the qgroup rescan worker to stop */
 	btrfs_qgroup_wait_for_completion(fs_info, false);
 
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index e1ff2f115182..40e965b06f8f 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -32,6 +32,7 @@
 #include "block-rsv.h"
 #include "delalloc-space.h"
 #include "block-group.h"
+#include "discard.h"
 
 #undef SCRAMBLE_DELAYED_REFS
 
@@ -2940,6 +2941,9 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans)
 		cond_resched();
 	}
 
+	if (btrfs_test_opt(fs_info, DISCARD_ASYNC))
+		btrfs_discard_schedule_work(&fs_info->discard_ctl, true);
+
 	/*
 	 * Transaction is finished.  We don't need the lock anymore.  We
 	 * do need to clean up the block groups in case of a transaction
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 946a7df33249..72933996e743 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -21,6 +21,7 @@
 #include "space-info.h"
 #include "delalloc-space.h"
 #include "block-group.h"
+#include "discard.h"
 
 #define BITS_PER_BITMAP		(PAGE_SIZE * 8UL)
 #define MAX_CACHE_BYTES_PER_GIG	SZ_32K
@@ -755,9 +756,11 @@ static int __load_free_space_cache(struct btrfs_root *root, struct inode *inode,
 		/*
 		 * Sync discard ensures that the free space cache is always
 		 * trimmed.  So when reading this in, the state should reflect
-		 * that.
+		 * that.  We also do this for async as a stop gap for lack of
+		 * persistence.
 		 */
-		if (btrfs_test_opt(fs_info, DISCARD_SYNC))
+		if (btrfs_test_opt(fs_info, DISCARD_SYNC) ||
+		    btrfs_test_opt(fs_info, DISCARD_ASYNC))
 			e->trim_state = BTRFS_TRIM_STATE_TRIMMED;
 
 		if (!e->bytes) {
@@ -2382,6 +2385,7 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
 			   u64 offset, u64 bytes,
 			   enum btrfs_trim_state trim_state)
 {
+	struct btrfs_block_group *block_group = ctl->private;
 	struct btrfs_free_space *info;
 	int ret = 0;
 
@@ -2431,6 +2435,9 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
 		ASSERT(ret != -EEXIST);
 	}
 
+	if (trim_state != BTRFS_TRIM_STATE_TRIMMED)
+		btrfs_discard_queue_work(&fs_info->discard_ctl, block_group);
+
 	return ret;
 }
 
@@ -3204,6 +3211,7 @@ void btrfs_init_free_cluster(struct btrfs_free_cluster *cluster)
 static int do_trimming(struct btrfs_block_group *block_group,
 		       u64 *total_trimmed, u64 start, u64 bytes,
 		       u64 reserved_start, u64 reserved_bytes,
+		       enum btrfs_trim_state reserved_trim_state,
 		       struct btrfs_trim_range *trim_entry)
 {
 	struct btrfs_space_info *space_info = block_group->space_info;
@@ -3211,6 +3219,9 @@ static int do_trimming(struct btrfs_block_group *block_group,
 	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
 	int ret;
 	int update = 0;
+	u64 end = start + bytes;
+	u64 reserved_end = reserved_start + reserved_bytes;
+	enum btrfs_trim_state trim_state;
 	u64 trimmed = 0;
 
 	spin_lock(&space_info->lock);
@@ -3224,11 +3235,20 @@ static int do_trimming(struct btrfs_block_group *block_group,
 	spin_unlock(&space_info->lock);
 
 	ret = btrfs_discard_extent(fs_info, start, bytes, &trimmed);
-	if (!ret)
+	if (!ret) {
 		*total_trimmed += trimmed;
+		trim_state = BTRFS_TRIM_STATE_TRIMMED;
+	}
 
 	mutex_lock(&ctl->cache_writeout_mutex);
-	btrfs_add_free_space(block_group, reserved_start, reserved_bytes);
+	if (reserved_start < start)
+		__btrfs_add_free_space(fs_info, ctl, reserved_start,
+				       start - reserved_start,
+				       reserved_trim_state);
+	if (start + bytes < reserved_start + reserved_bytes)
+		__btrfs_add_free_space(fs_info, ctl, end, reserved_end - end,
+				       reserved_trim_state);
+	__btrfs_add_free_space(fs_info, ctl, start, bytes, trim_state);
 	list_del(&trim_entry->list);
 	mutex_unlock(&ctl->cache_writeout_mutex);
 
@@ -3255,6 +3275,7 @@ static int trim_no_bitmap(struct btrfs_block_group *block_group,
 	int ret = 0;
 	u64 extent_start;
 	u64 extent_bytes;
+	enum btrfs_trim_state extent_trim_state;
 	u64 bytes;
 
 	while (start < end) {
@@ -3296,6 +3317,7 @@ static int trim_no_bitmap(struct btrfs_block_group *block_group,
 
 		extent_start = entry->offset;
 		extent_bytes = entry->bytes;
+		extent_trim_state = entry->trim_state;
 		start = max(start, extent_start);
 		bytes = min(extent_start + extent_bytes, end) - start;
 		if (bytes < minlen) {
@@ -3314,7 +3336,8 @@ static int trim_no_bitmap(struct btrfs_block_group *block_group,
 		mutex_unlock(&ctl->cache_writeout_mutex);
 
 		ret = do_trimming(block_group, total_trimmed, start, bytes,
-				  extent_start, extent_bytes, &trim_entry);
+				  extent_start, extent_bytes, extent_trim_state,
+				  &trim_entry);
 		if (ret)
 			break;
 next:
@@ -3440,7 +3463,7 @@ static int trim_bitmaps(struct btrfs_block_group *block_group,
 		mutex_unlock(&ctl->cache_writeout_mutex);
 
 		ret = do_trimming(block_group, total_trimmed, start, bytes,
-				  start, bytes, &trim_entry);
+				  start, bytes, 0, &trim_entry);
 		if (ret) {
 			reset_trimming_bitmap(ctl, offset);
 			break;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index f131fb9f0f69..0edf00753c00 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -46,6 +46,7 @@
 #include "sysfs.h"
 #include "tests/btrfs-tests.h"
 #include "block-group.h"
+#include "discard.h"
 
 #include "qgroup.h"
 #define CREATE_TRACE_POINTS
@@ -146,6 +147,8 @@ void __btrfs_handle_fs_error(struct btrfs_fs_info *fs_info, const char *function
 	if (sb_rdonly(sb))
 		return;
 
+	btrfs_discard_stop(fs_info);
+
 	/* btrfs handle error by forcing the filesystem readonly */
 	sb->s_flags |= SB_RDONLY;
 	btrfs_info(fs_info, "forced readonly");
@@ -313,6 +316,7 @@ enum {
 	Opt_datasum, Opt_nodatasum,
 	Opt_defrag, Opt_nodefrag,
 	Opt_discard, Opt_nodiscard,
+	Opt_discard_mode,
 	Opt_nologreplay,
 	Opt_norecovery,
 	Opt_ratio,
@@ -376,6 +380,7 @@ static const match_table_t tokens = {
 	{Opt_nodefrag, "noautodefrag"},
 	{Opt_discard, "discard"},
 	{Opt_nodiscard, "nodiscard"},
+	{Opt_discard_mode, "discard=%s"},
 	{Opt_nologreplay, "nologreplay"},
 	{Opt_norecovery, "norecovery"},
 	{Opt_ratio, "metadata_ratio=%u"},
@@ -695,12 +700,26 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 				   info->metadata_ratio);
 			break;
 		case Opt_discard:
-			btrfs_set_and_info(info, DISCARD_SYNC,
-					   "turning on sync discard");
+		case Opt_discard_mode:
+			if (token == Opt_discard ||
+			    strcmp(args[0].from, "sync") == 0) {
+				btrfs_clear_opt(info->mount_opt, DISCARD_ASYNC);
+				btrfs_set_and_info(info, DISCARD_SYNC,
+						   "turning on sync discard");
+			} else if (strcmp(args[0].from, "async") == 0) {
+				btrfs_clear_opt(info->mount_opt, DISCARD_SYNC);
+				btrfs_set_and_info(info, DISCARD_ASYNC,
+						   "turning on async discard");
+			} else {
+				ret = -EINVAL;
+				goto out;
+			}
 			break;
 		case Opt_nodiscard:
 			btrfs_clear_and_info(info, DISCARD_SYNC,
 					     "turning off discard");
+			btrfs_clear_and_info(info, DISCARD_ASYNC,
+					     "turning off async discard");
 			break;
 		case Opt_space_cache:
 		case Opt_space_cache_version:
@@ -1324,6 +1343,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
 		seq_puts(seq, ",flushoncommit");
 	if (btrfs_test_opt(info, DISCARD_SYNC))
 		seq_puts(seq, ",discard");
+	if (btrfs_test_opt(info, DISCARD_ASYNC))
+		seq_puts(seq, ",discard=async");
 	if (!(info->sb->s_flags & SB_POSIXACL))
 		seq_puts(seq, ",noacl");
 	if (btrfs_test_opt(info, SPACE_CACHE))
@@ -1713,6 +1734,14 @@ static inline void btrfs_remount_cleanup(struct btrfs_fs_info *fs_info,
 		btrfs_cleanup_defrag_inodes(fs_info);
 	}
 
+	/* If we toggled discard async. */
+	if (!btrfs_raw_test_opt(old_opts, DISCARD_ASYNC) &&
+	    btrfs_test_opt(fs_info, DISCARD_ASYNC))
+		btrfs_discard_resume(fs_info);
+	else if (btrfs_raw_test_opt(old_opts, DISCARD_ASYNC) &&
+		 !btrfs_test_opt(fs_info, DISCARD_ASYNC))
+		btrfs_discard_cleanup(fs_info);
+
 	clear_bit(BTRFS_FS_STATE_REMOUNTING, &fs_info->fs_state);
 }
 
@@ -1760,6 +1789,8 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
 		 */
 		cancel_work_sync(&fs_info->async_reclaim_work);
 
+		btrfs_discard_cleanup(fs_info);
+
 		/* wait for the uuid_scan task to finish */
 		down(&fs_info->uuid_tree_rescan_sem);
 		/* avoid complains from lockdep et al. */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v2 00/22] btrfs: async discard support
@ 2019-10-23 22:52 Dennis Zhou
  2019-10-23 22:52 ` [PATCH 05/22] btrfs: add the beginning of async discard, discard workqueue Dennis Zhou
  0 siblings, 1 reply; 23+ messages in thread
From: Dennis Zhou @ 2019-10-23 22:52 UTC (permalink / raw)
  To: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval
  Cc: kernel-team, linux-btrfs, Dennis Zhou

Hello,

The RFC [1] went a lot better than expected, so let's call this v2.

Changes:
 - I don't believe there are any functional changes to the code, but
   there are plenty of changes for readability and a lot of comments
   added.
 - Split out the sysfs parts into their own patches as it was a bit of
   work to put discard/ under debug/.
 - Switched over any flags to be enums. All the variables are now of
   appropriate type rather than being all atomics.
 - Dropped 0015 [2] from v1, load block groups on mount. This is in
   favor of reading everything in as clean and running fstrim as needed.
 - Misc. bug fixes.
 - I changed the block_group discard delay from 300s to 120s. It is just
   to give the allocators a chance to reuse the LBA before letting async
   discard take a crack at it.

Data:
On a number of webservers, I collected data every minute accounting the
time we spent in btrfs_finish_extent_commit() (col. 1) and in
btrfs_commit_transaction() (col. 2). btrfs_finish_extent_commit() is
where we discard extents synchronously before returning them to the
free space cache.

discard=sync:
                 p99 total per minute       p99 total per minute
      Drive   |   extent_commit() (ms)  |    commit_trans() (ms)
    ---------------------------------------------------------------
     Drive A  |           434           |          1170 
     Drive B  |           880           |          2330
     Drive C  |          2943           |          3920
     Drive D  |          4763           |          5701

discard=async:
                 p99 total per minute       p99 total per minute
      Drive   |   extent_commit() (ms)  |    commit_trans() (ms)
    --------------------------------------------------------------
     Drive A  |           134           |           956
     Drive B  |            64           |          1972
     Drive C  |            59           |          1032
     Drive D  |            62           |          1200

While it's not great that the stats are cumulative over 1m, all of these
servers are running the same workload and and the delta between the two
are substantial. We are spending significantly less time in
btrfs_finish_extent_commit() which is responsible for discarding.

From v1:
--------
Discard is an operation that allows for the filesystem to communicate
with underlying ssds that a lba region is no longer needed. This gives
the drive the more information as it tries to manage the available free
space to minimize write amplification. However, discard hasn't been
given the most tlc. Discard is a problematic command because a drive's
physical block management is more or less a black box to us and the
effects of any particular discard aren't necessarily limited the
lifetime of a command.

Currently, btrfs handles discarding synchronously during transaction
commit. This problematically can delay transaction commit based on the
amount of space that needs to be trimmed and the efficacy of the discard
operation for a particular drive.

This series introduces async discarding, which removes discard from the
transaction commit path. While every SSD has the choice of implementing
trim support different, we strive here to do the right thing. The idea
hinges on recognizing that write amplification really only kicks in once
we're really low on free space.  As long as we trim enough to keep a
large enough pool of free space, in theory this should minimize the cost
of issuing discards on a workload and have limited cost overhead in
write amplification.

With async discard, we try to emphasize discarding larger regions
and reusing the lba (implicit discard). The first is done by using the
free space cache to maintain discard state and thus allows us to get
coalescing for fairly cheap. A background workqueue is used to scan over
an LRU kept list of the block groups. It then uses filters to determine
what to discard next hence giving priority to larger discards. While
reusing an lba isn't explicitly attempted, it happens implicitly via
find_free_extent() which if it happens to find a dirty extent, will
grant us reuse of the lba. Additionally, async discarding skips metadata
block groups as these should see a fairly high turnover as btrfs is a
self-packing filesystem being stingy with allocating new block groups
until necessary.

Preliminary results seem promising as when a lot of freeing is going on,
the discarding is delayed allowing for reuse which translates to less
discarding (in addition to the slower discarding). This has shown a
reduction in p90 and p99 read latencies on a test on our webservers.

I am currently working on tuning the rate at which it discards in the
background. I am doing this by evaluating other workloads and drives.
The iops and bps rate limits are fairly aggressive right now as my
basic survey of a few drives noted that the trim command itself is a
significant part of the overhead. So optimizing for larger trims is the
right thing to do.

This series contains the following 22 patches and is on top of
btrfs-devel#master 3b7c59a1950c:
  0001-bitmap-genericize-percpu-bitmap-region-iterators.patch
  0002-btrfs-rename-DISCARD-opt-to-DISCARD_SYNC.patch
  0003-btrfs-keep-track-of-which-extents-have-been-discarde.patch
  0004-btrfs-keep-track-of-cleanliness-of-the-bitmap.patch
  0005-btrfs-add-the-beginning-of-async-discard-discard-wor.patch
  0006-btrfs-handle-empty-block_group-removal.patch
  0007-btrfs-discard-one-region-at-a-time-in-async-discard.patch
  0008-btrfs-add-removal-calls-for-sysfs-debug.patch
  0009-btrfs-make-UUID-debug-have-its-own-kobject.patch
  0010-btrfs-add-discard-sysfs-directory.patch
  0011-btrfs-track-discardable-extents-for-async-discard.patch
  0012-btrfs-keep-track-of-discardable_bytes.patch
  0013-btrfs-calculate-discard-delay-based-on-number-of-ext.patch
  0014-btrfs-add-bps-discard-rate-limit.patch
  0015-btrfs-limit-max-discard-size-for-async-discard.patch
  0016-btrfs-make-max-async-discard-size-tunable.patch
  0017-btrfs-have-multiple-discard-lists.patch
  0018-btrfs-only-keep-track-of-data-extents-for-async-disc.patch
  0019-btrfs-keep-track-of-discard-reuse-stats.patch
  0020-btrfs-add-async-discard-header.patch
  0021-btrfs-increase-the-metadata-allowance-for-the-free_s.patch
  0022-btrfs-make-smaller-extents-more-likely-to-go-into-bi.patch

0001 exports percpu's bitmap iterators for eventual use in 0011. 0002
renames DISCARD to DISCARD_SYNC. 0003 and 0004 adds discard tracking to
the free space cache. 0005-0007 adds the core of async discard support.
0008-0010 modify debug sysfs to allow for adding discard/ under debug/.
0011-0016 fiddle with stats and operation limits. 0018 makes async
discarding only track data block groups. 0019 adds reuse stats. 0020
adds an explanation header to discard.c. 0021 and 0022 modify the free
space cache metadata allowance, add a bitmap -> extent path and makes us
more likely to put smaller extents into the bitmaps.

diffstats below:

Dennis Zhou (22):
  bitmap: genericize percpu bitmap region iterators
  btrfs: rename DISCARD opt to DISCARD_SYNC
  btrfs: keep track of which extents have been discarded
  btrfs: keep track of cleanliness of the bitmap
  btrfs: add the beginning of async discard, discard workqueue
  btrfs: handle empty block_group removal
  btrfs: discard one region at a time in async discard
  btrfs: add removal calls for sysfs debug/
  btrfs: make UUID/debug have its own kobject
  btrfs: add discard sysfs directory
  btrfs: track discardable extents for async discard
  btrfs: keep track of discardable_bytes
  btrfs: calculate discard delay based on number of extents
  btrfs: add bps discard rate limit
  btrfs: limit max discard size for async discard
  btrfs: make max async discard size tunable
  btrfs: have multiple discard lists
  btrfs: only keep track of data extents for async discard
  btrfs: keep track of discard reuse stats
  btrfs: add async discard header
  btrfs: increase the metadata allowance for the free_space_cache
  btrfs: make smaller extents more likely to go into bitmaps

 fs/btrfs/Makefile           |   2 +-
 fs/btrfs/block-group.c      |  56 ++-
 fs/btrfs/block-group.h      |  30 ++
 fs/btrfs/ctree.h            |  52 ++-
 fs/btrfs/discard.c          | 666 ++++++++++++++++++++++++++++++++++++
 fs/btrfs/discard.h          |  48 +++
 fs/btrfs/disk-io.c          |  15 +-
 fs/btrfs/extent-tree.c      |  23 +-
 fs/btrfs/free-space-cache.c | 587 ++++++++++++++++++++++++++-----
 fs/btrfs/free-space-cache.h |  39 ++-
 fs/btrfs/inode-map.c        |  13 +-
 fs/btrfs/scrub.c            |   7 +-
 fs/btrfs/super.c            |  39 ++-
 fs/btrfs/sysfs.c            | 203 ++++++++++-
 include/linux/bitmap.h      |  35 ++
 mm/percpu.c                 |  61 +---
 16 files changed, 1723 insertions(+), 153 deletions(-)
 create mode 100644 fs/btrfs/discard.c
 create mode 100644 fs/btrfs/discard.h

Thanks,
Dennis

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 05/22] btrfs: add the beginning of async discard, discard workqueue
  2019-10-23 22:52 [PATCH v2 00/22] btrfs: async discard support Dennis Zhou
@ 2019-10-23 22:52 ` Dennis Zhou
  2019-11-11 18:49   ` David Sterba
  0 siblings, 1 reply; 23+ messages in thread
From: Dennis Zhou @ 2019-10-23 22:52 UTC (permalink / raw)
  To: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval
  Cc: kernel-team, linux-btrfs, Dennis Zhou

When discard is enabled, everytime a pinned extent is released back to
the block_group's free space cache, a discard is issued for the extent.
This is an overeager approach when it comes to discarding and helping
the SSD maintain enough free space to prevent severe garbage collection
situations.

This adds the beginning of async discard. Instead of issuing a discard
prior to returning it to the free space, it is just marked as untrimmed.
The block_group is then added to a LRU which then feeds into a workqueue
to issue discards at a much slower rate. Full discarding of unused block
groups is still done and will be address in a future patch in this
series.

For now, we don't persist the discard state of extents and bitmaps.
Therefore, our failure recovery mode will be to consider extents
untrimmed. This lets us handle failure and unmounting as one in the
same.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/Makefile           |   2 +-
 fs/btrfs/block-group.c      |   4 +
 fs/btrfs/block-group.h      |   9 ++
 fs/btrfs/ctree.h            |  21 +++
 fs/btrfs/discard.c          | 274 ++++++++++++++++++++++++++++++++++++
 fs/btrfs/discard.h          |  28 ++++
 fs/btrfs/disk-io.c          |  15 +-
 fs/btrfs/extent-tree.c      |   4 +
 fs/btrfs/free-space-cache.c |  35 ++++-
 fs/btrfs/super.c            |  35 ++++-
 10 files changed, 417 insertions(+), 10 deletions(-)
 create mode 100644 fs/btrfs/discard.c
 create mode 100644 fs/btrfs/discard.h

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 82200dbca5ac..9a0ff3384381 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -11,7 +11,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 	   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
 	   reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
 	   uuid-tree.o props.o free-space-tree.o tree-checker.o space-info.o \
-	   block-rsv.o delalloc-space.o block-group.o
+	   block-rsv.o delalloc-space.o block-group.o discard.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index afe86028246a..8bbbe7488328 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -14,6 +14,7 @@
 #include "sysfs.h"
 #include "tree-log.h"
 #include "delalloc-space.h"
+#include "discard.h"
 
 /*
  * Return target flags in extended format or 0 if restripe for this chunk_type
@@ -1273,6 +1274,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 		}
 		spin_unlock(&fs_info->unused_bgs_lock);
 
+		btrfs_discard_cancel_work(&fs_info->discard_ctl, block_group);
+
 		mutex_lock(&fs_info->delete_unused_bgs_mutex);
 
 		/* Don't want to race with allocators so take the groups_sem */
@@ -1622,6 +1625,7 @@ static struct btrfs_block_group_cache *btrfs_create_block_group_cache(
 	INIT_LIST_HEAD(&cache->cluster_list);
 	INIT_LIST_HEAD(&cache->bg_list);
 	INIT_LIST_HEAD(&cache->ro_list);
+	INIT_LIST_HEAD(&cache->discard_list);
 	INIT_LIST_HEAD(&cache->dirty_list);
 	INIT_LIST_HEAD(&cache->io_list);
 	btrfs_init_free_space_ctl(cache);
diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index c391800388dd..633dce5b9d57 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -115,7 +115,11 @@ struct btrfs_block_group_cache {
 	/* For read-only block groups */
 	struct list_head ro_list;
 
+	/* For discard operations */
 	atomic_t trimming;
+	struct list_head discard_list;
+	int discard_index;
+	u64 discard_eligible_time;
 
 	/* For dirty block groups */
 	struct list_head dirty_list;
@@ -157,6 +161,11 @@ struct btrfs_block_group_cache {
 	struct btrfs_full_stripe_locks_tree full_stripe_locks_root;
 };
 
+static inline u64 btrfs_block_group_end(struct btrfs_block_group_cache *cache)
+{
+	return (cache->key.objectid + cache->key.offset);
+}
+
 #ifdef CONFIG_BTRFS_DEBUG
 static inline int btrfs_should_fragment_free_space(
 		struct btrfs_block_group_cache *block_group)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 1877586576aa..efa8390e8419 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -438,6 +438,21 @@ struct btrfs_full_stripe_locks_tree {
 	struct mutex lock;
 };
 
+/* Discard control. */
+/*
+ * Async discard uses multiple lists to differentiate the discard filter
+ * parameters.
+ */
+#define BTRFS_NR_DISCARD_LISTS		1
+
+struct btrfs_discard_ctl {
+	struct workqueue_struct *discard_workers;
+	struct delayed_work work;
+	spinlock_t lock;
+	struct btrfs_block_group_cache *cache;
+	struct list_head discard_list[BTRFS_NR_DISCARD_LISTS];
+};
+
 /* delayed seq elem */
 struct seq_list {
 	struct list_head list;
@@ -524,6 +539,9 @@ enum {
 	 * so we don't need to offload checksums to workqueues.
 	 */
 	BTRFS_FS_CSUM_IMPL_FAST,
+
+	/* Indicate that the discard workqueue can service discards. */
+	BTRFS_FS_DISCARD_RUNNING,
 };
 
 struct btrfs_fs_info {
@@ -817,6 +835,8 @@ struct btrfs_fs_info {
 	struct btrfs_workqueue *scrub_wr_completion_workers;
 	struct btrfs_workqueue *scrub_parity_workers;
 
+	struct btrfs_discard_ctl discard_ctl;
+
 #ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
 	u32 check_integrity_print_mask;
 #endif
@@ -1190,6 +1210,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info)
 #define BTRFS_MOUNT_FREE_SPACE_TREE	(1 << 26)
 #define BTRFS_MOUNT_NOLOGREPLAY		(1 << 27)
 #define BTRFS_MOUNT_REF_VERIFY		(1 << 28)
+#define BTRFS_MOUNT_DISCARD_ASYNC	(1 << 29)
 
 #define BTRFS_DEFAULT_COMMIT_INTERVAL	(30)
 #define BTRFS_DEFAULT_MAX_INLINE	(2048)
diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
new file mode 100644
index 000000000000..0a72a1902ca6
--- /dev/null
+++ b/fs/btrfs/discard.c
@@ -0,0 +1,274 @@
+/*
+ * Copyright (C) 2019 Facebook.  All rights reserved.
+ */
+
+#include <linux/jiffies.h>
+#include <linux/kernel.h>
+#include <linux/ktime.h>
+#include <linux/list.h>
+#include <linux/sizes.h>
+#include <linux/workqueue.h>
+#include "ctree.h"
+#include "block-group.h"
+#include "discard.h"
+#include "free-space-cache.h"
+
+/* This is an initial delay to give some chance for lba reuse. */
+#define BTRFS_DISCARD_DELAY		(120ULL * NSEC_PER_SEC)
+
+static struct list_head *btrfs_get_discard_list(
+					struct btrfs_discard_ctl *discard_ctl,
+					struct btrfs_block_group_cache *cache)
+{
+	return &discard_ctl->discard_list[cache->discard_index];
+}
+
+void btrfs_add_to_discard_list(struct btrfs_discard_ctl *discard_ctl,
+			       struct btrfs_block_group_cache *cache)
+{
+	spin_lock(&discard_ctl->lock);
+
+	if (list_empty(&cache->discard_list))
+		cache->discard_eligible_time = (ktime_get_ns() +
+						BTRFS_DISCARD_DELAY);
+
+	list_move_tail(&cache->discard_list,
+		       btrfs_get_discard_list(discard_ctl, cache));
+
+	spin_unlock(&discard_ctl->lock);
+}
+
+static bool remove_from_discard_list(struct btrfs_discard_ctl *discard_ctl,
+				     struct btrfs_block_group_cache *cache)
+{
+	bool running = false;
+
+	spin_lock(&discard_ctl->lock);
+
+	if (cache == discard_ctl->cache) {
+		running = true;
+		discard_ctl->cache = NULL;
+	}
+
+	cache->discard_eligible_time = 0;
+	list_del_init(&cache->discard_list);
+
+	spin_unlock(&discard_ctl->lock);
+
+	return running;
+}
+
+/**
+ * find_next_cache - find cache that's up next for discarding
+ * @discard_ctl: discard control
+ * @now: current time
+ *
+ * Iterate over the discard lists to find the next block_group up for
+ * discarding checking the discard_eligible_time of block_group.
+ */
+static struct btrfs_block_group_cache *find_next_cache(
+					struct btrfs_discard_ctl *discard_ctl,
+					u64 now)
+{
+	struct btrfs_block_group_cache *ret_cache = NULL, *cache;
+	int i;
+
+	for (i = 0; i < BTRFS_NR_DISCARD_LISTS; i++) {
+		struct list_head *discard_list = &discard_ctl->discard_list[i];
+
+		if (!list_empty(discard_list)) {
+			cache = list_first_entry(discard_list,
+						 struct btrfs_block_group_cache,
+						 discard_list);
+
+			if (!ret_cache)
+				ret_cache = cache;
+
+			if (ret_cache->discard_eligible_time < now)
+				break;
+
+			if (ret_cache->discard_eligible_time >
+			    cache->discard_eligible_time)
+				ret_cache = cache;
+		}
+	}
+
+	return ret_cache;
+}
+
+/**
+ * peek_discard_list - wrap find_next_cache()
+ * @discard_ctl: discard control
+ *
+ * This wraps find_next_cache() and sets the cache to be in use.
+ */
+static struct btrfs_block_group_cache *peek_discard_list(
+					struct btrfs_discard_ctl *discard_ctl)
+{
+	struct btrfs_block_group_cache *cache;
+	u64 now = ktime_get_ns();
+
+	spin_lock(&discard_ctl->lock);
+
+	cache = find_next_cache(discard_ctl, now);
+
+	if (cache && now < cache->discard_eligible_time)
+		cache = NULL;
+
+	discard_ctl->cache = cache;
+
+	spin_unlock(&discard_ctl->lock);
+
+	return cache;
+}
+
+/**
+ * btrfs_discard_cancel_work - remove a block_group from the discard lists
+ * @discard_ctl: discard control
+ * @cache: block_group of interest
+ *
+ * This removes @cache from the discard lists.  If necessary, it waits on the
+ * current work and then reschedules the delayed work.
+ */
+void btrfs_discard_cancel_work(struct btrfs_discard_ctl *discard_ctl,
+			       struct btrfs_block_group_cache *cache)
+{
+	if (remove_from_discard_list(discard_ctl, cache)) {
+		cancel_delayed_work_sync(&discard_ctl->work);
+		btrfs_discard_schedule_work(discard_ctl, true);
+	}
+}
+
+/**
+ * btrfs_discard_queue_work - handles queuing the block_groups
+ * @discard_ctl: discard control
+ * @cache: block_group of interest
+ *
+ * This maintains the LRU order of the discard lists.
+ */
+void btrfs_discard_queue_work(struct btrfs_discard_ctl *discard_ctl,
+			      struct btrfs_block_group_cache *cache)
+{
+	if (!cache || !btrfs_test_opt(cache->fs_info, DISCARD_ASYNC))
+		return;
+
+	btrfs_add_to_discard_list(discard_ctl, cache);
+	if (!delayed_work_pending(&discard_ctl->work))
+		btrfs_discard_schedule_work(discard_ctl, false);
+}
+
+/**
+ * btrfs_discard_schedule_work - responsible for scheduling the discard work
+ * @discard_ctl: discard control
+ * @override: override the current timer
+ *
+ * Discards are issued by a delayed workqueue item.  @override is used to
+ * update the current delay as the baseline delay interview is reevaluated
+ * on transaction commit.  This is also maxed with any other rate limit.
+ */
+void btrfs_discard_schedule_work(struct btrfs_discard_ctl *discard_ctl,
+				 bool override)
+{
+	struct btrfs_block_group_cache *cache;
+	u64 now = ktime_get_ns();
+
+	spin_lock(&discard_ctl->lock);
+
+	if (!btrfs_run_discard_work(discard_ctl))
+		goto out;
+
+	if (!override && delayed_work_pending(&discard_ctl->work))
+		goto out;
+
+	cache = find_next_cache(discard_ctl, now);
+	if (cache) {
+		u64 delay = 0;
+
+		if (now < cache->discard_eligible_time)
+			delay = nsecs_to_jiffies(cache->discard_eligible_time -
+						 now);
+
+		mod_delayed_work(discard_ctl->discard_workers,
+				 &discard_ctl->work,
+				 delay);
+	}
+
+out:
+	spin_unlock(&discard_ctl->lock);
+}
+
+/**
+ * btrfs_discard_workfn - discard work function
+ * @work: work
+ *
+ * This finds the next cache to start discarding and then discards it.
+ */
+static void btrfs_discard_workfn(struct work_struct *work)
+{
+	struct btrfs_discard_ctl *discard_ctl;
+	struct btrfs_block_group_cache *cache;
+	u64 trimmed = 0;
+
+	discard_ctl = container_of(work, struct btrfs_discard_ctl, work.work);
+
+	cache = peek_discard_list(discard_ctl);
+	if (!cache || !btrfs_run_discard_work(discard_ctl))
+		return;
+
+	btrfs_trim_block_group(cache, &trimmed, cache->key.objectid,
+			       btrfs_block_group_end(cache), 0);
+
+	remove_from_discard_list(discard_ctl, cache);
+
+	btrfs_discard_schedule_work(discard_ctl, false);
+}
+
+/**
+ * btrfs_run_discard_work - determines if async discard should be running
+ * @discard_ctl: discard control
+ *
+ * Checks if the file system is writeable and BTRFS_FS_DISCARD_RUNNING is set.
+ */
+bool btrfs_run_discard_work(struct btrfs_discard_ctl *discard_ctl)
+{
+	struct btrfs_fs_info *fs_info = container_of(discard_ctl,
+						     struct btrfs_fs_info,
+						     discard_ctl);
+
+	return (!(fs_info->sb->s_flags & SB_RDONLY) &&
+		test_bit(BTRFS_FS_DISCARD_RUNNING, &fs_info->flags));
+}
+
+void btrfs_discard_resume(struct btrfs_fs_info *fs_info)
+{
+	if (!btrfs_test_opt(fs_info, DISCARD_ASYNC)) {
+		btrfs_discard_cleanup(fs_info);
+		return;
+	}
+
+	set_bit(BTRFS_FS_DISCARD_RUNNING, &fs_info->flags);
+}
+
+void btrfs_discard_stop(struct btrfs_fs_info *fs_info)
+{
+	clear_bit(BTRFS_FS_DISCARD_RUNNING, &fs_info->flags);
+}
+
+void btrfs_discard_init(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_discard_ctl *discard_ctl = &fs_info->discard_ctl;
+	int i;
+
+	spin_lock_init(&discard_ctl->lock);
+
+	INIT_DELAYED_WORK(&discard_ctl->work, btrfs_discard_workfn);
+
+	for (i = 0; i < BTRFS_NR_DISCARD_LISTS; i++)
+		 INIT_LIST_HEAD(&discard_ctl->discard_list[i]);
+}
+
+void btrfs_discard_cleanup(struct btrfs_fs_info *fs_info)
+{
+	btrfs_discard_stop(fs_info);
+	cancel_delayed_work_sync(&fs_info->discard_ctl.work);
+}
diff --git a/fs/btrfs/discard.h b/fs/btrfs/discard.h
new file mode 100644
index 000000000000..48b4710a80d0
--- /dev/null
+++ b/fs/btrfs/discard.h
@@ -0,0 +1,28 @@
+/*
+ * Copyright (C) 2019 Facebook.  All rights reserved.
+ */
+
+#ifndef BTRFS_DISCARD_H
+#define BTRFS_DISCARD_H
+
+struct btrfs_fs_info;
+struct btrfs_discard_ctl;
+struct btrfs_block_group_cache;
+
+void btrfs_add_to_discard_list(struct btrfs_discard_ctl *discard_ctl,
+			       struct btrfs_block_group_cache *cache);
+
+void btrfs_discard_cancel_work(struct btrfs_discard_ctl *discard_ctl,
+			       struct btrfs_block_group_cache *cache);
+void btrfs_discard_queue_work(struct btrfs_discard_ctl *discard_ctl,
+			      struct btrfs_block_group_cache *cache);
+void btrfs_discard_schedule_work(struct btrfs_discard_ctl *discard_ctl,
+				 bool override);
+bool btrfs_run_discard_work(struct btrfs_discard_ctl *discard_ctl);
+
+void btrfs_discard_resume(struct btrfs_fs_info *fs_info);
+void btrfs_discard_stop(struct btrfs_fs_info *fs_info);
+void btrfs_discard_init(struct btrfs_fs_info *fs_info);
+void btrfs_discard_cleanup(struct btrfs_fs_info *fs_info);
+
+#endif
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 044981cf6df9..a304ec972f67 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -41,6 +41,7 @@
 #include "tree-checker.h"
 #include "ref-verify.h"
 #include "block-group.h"
+#include "discard.h"
 
 #define BTRFS_SUPER_FLAG_SUPP	(BTRFS_HEADER_FLAG_WRITTEN |\
 				 BTRFS_HEADER_FLAG_RELOC |\
@@ -2009,6 +2010,8 @@ static void btrfs_stop_all_workers(struct btrfs_fs_info *fs_info)
 	btrfs_destroy_workqueue(fs_info->flush_workers);
 	btrfs_destroy_workqueue(fs_info->qgroup_rescan_workers);
 	btrfs_destroy_workqueue(fs_info->extent_workers);
+	if (fs_info->discard_ctl.discard_workers)
+		destroy_workqueue(fs_info->discard_ctl.discard_workers);
 	/*
 	 * Now that all other work queues are destroyed, we can safely destroy
 	 * the queues used for metadata I/O, since tasks from those other work
@@ -2218,6 +2221,8 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info,
 		btrfs_alloc_workqueue(fs_info, "extent-refs", flags,
 				      min_t(u64, fs_devices->num_devices,
 					    max_active), 8);
+	fs_info->discard_ctl.discard_workers =
+		alloc_workqueue("btrfs_discard", WQ_UNBOUND | WQ_FREEZABLE, 1);
 
 	if (!(fs_info->workers && fs_info->delalloc_workers &&
 	      fs_info->submit_workers && fs_info->flush_workers &&
@@ -2229,7 +2234,8 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info,
 	      fs_info->caching_workers && fs_info->readahead_workers &&
 	      fs_info->fixup_workers && fs_info->delayed_workers &&
 	      fs_info->extent_workers &&
-	      fs_info->qgroup_rescan_workers)) {
+	      fs_info->qgroup_rescan_workers &&
+	      fs_info->discard_ctl.discard_workers)) {
 		return -ENOMEM;
 	}
 
@@ -2772,6 +2778,8 @@ int open_ctree(struct super_block *sb,
 	btrfs_init_dev_replace_locks(fs_info);
 	btrfs_init_qgroup(fs_info);
 
+	btrfs_discard_init(fs_info);
+
 	btrfs_init_free_cluster(&fs_info->meta_alloc_cluster);
 	btrfs_init_free_cluster(&fs_info->data_alloc_cluster);
 
@@ -3284,6 +3292,8 @@ int open_ctree(struct super_block *sb,
 
 	btrfs_qgroup_rescan_resume(fs_info);
 
+	btrfs_discard_resume(fs_info);
+
 	if (!fs_info->uuid_root) {
 		btrfs_info(fs_info, "creating UUID tree");
 		ret = btrfs_create_uuid_tree(fs_info);
@@ -3993,6 +4003,9 @@ void close_ctree(struct btrfs_fs_info *fs_info)
 	 */
 	kthread_park(fs_info->cleaner_kthread);
 
+	/* cancel or finish ongoing work */
+	btrfs_discard_cleanup(fs_info);
+
 	/* wait for the qgroup rescan worker to stop */
 	btrfs_qgroup_wait_for_completion(fs_info, false);
 
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 6a40bba3cb19..de00fd6e338b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -32,6 +32,7 @@
 #include "block-rsv.h"
 #include "delalloc-space.h"
 #include "block-group.h"
+#include "discard.h"
 
 #undef SCRAMBLE_DELAYED_REFS
 
@@ -2920,6 +2921,9 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans)
 		cond_resched();
 	}
 
+	if (btrfs_test_opt(fs_info, DISCARD_ASYNC))
+		btrfs_discard_schedule_work(&fs_info->discard_ctl, true);
+
 	/*
 	 * Transaction is finished.  We don't need the lock anymore.  We
 	 * do need to clean up the block groups in case of a transaction
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 900b935e5997..8120630e4439 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -21,6 +21,7 @@
 #include "space-info.h"
 #include "delalloc-space.h"
 #include "block-group.h"
+#include "discard.h"
 
 #define BITS_PER_BITMAP		(PAGE_SIZE * 8UL)
 #define MAX_CACHE_BYTES_PER_GIG	SZ_32K
@@ -750,9 +751,11 @@ static int __load_free_space_cache(struct btrfs_root *root, struct inode *inode,
 		/*
 		 * Sync discard ensures that the free space cache is always
 		 * trimmed.  So when reading this in, the state should reflect
-		 * that.
+		 * that.  We also do this for async as a stop gap for lack of
+		 * persistence.
 		 */
-		if (btrfs_test_opt(fs_info, DISCARD_SYNC))
+		if (btrfs_test_opt(fs_info, DISCARD_SYNC) ||
+		    btrfs_test_opt(fs_info, DISCARD_ASYNC))
 			e->trim_state = BTRFS_TRIM_STATE_TRIMMED;
 
 		if (!e->bytes) {
@@ -2379,6 +2382,7 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
 			   u64 offset, u64 bytes,
 			   enum btrfs_trim_state trim_state)
 {
+	struct btrfs_block_group_cache *cache = ctl->private;
 	struct btrfs_free_space *info;
 	int ret = 0;
 
@@ -2428,6 +2432,9 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
 		ASSERT(ret != -EEXIST);
 	}
 
+	if (trim_state != BTRFS_TRIM_STATE_TRIMMED)
+		btrfs_discard_queue_work(&fs_info->discard_ctl, cache);
+
 	return ret;
 }
 
@@ -3201,6 +3208,7 @@ void btrfs_init_free_cluster(struct btrfs_free_cluster *cluster)
 static int do_trimming(struct btrfs_block_group_cache *block_group,
 		       u64 *total_trimmed, u64 start, u64 bytes,
 		       u64 reserved_start, u64 reserved_bytes,
+		       enum btrfs_trim_state reserved_trim_state,
 		       struct btrfs_trim_range *trim_entry)
 {
 	struct btrfs_space_info *space_info = block_group->space_info;
@@ -3208,6 +3216,9 @@ static int do_trimming(struct btrfs_block_group_cache *block_group,
 	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
 	int ret;
 	int update = 0;
+	u64 end = start + bytes;
+	u64 reserved_end = reserved_start + reserved_bytes;
+	enum btrfs_trim_state trim_state;
 	u64 trimmed = 0;
 
 	spin_lock(&space_info->lock);
@@ -3221,11 +3232,20 @@ static int do_trimming(struct btrfs_block_group_cache *block_group,
 	spin_unlock(&space_info->lock);
 
 	ret = btrfs_discard_extent(fs_info, start, bytes, &trimmed);
-	if (!ret)
+	if (!ret) {
 		*total_trimmed += trimmed;
+		trim_state = BTRFS_TRIM_STATE_TRIMMED;
+	}
 
 	mutex_lock(&ctl->cache_writeout_mutex);
-	btrfs_add_free_space(block_group, reserved_start, reserved_bytes);
+	if (reserved_start < start)
+		__btrfs_add_free_space(fs_info, ctl, reserved_start,
+				       start - reserved_start,
+				       reserved_trim_state);
+	if (start + bytes < reserved_start + reserved_bytes)
+		__btrfs_add_free_space(fs_info, ctl, end, reserved_end - end,
+				       reserved_trim_state);
+	__btrfs_add_free_space(fs_info, ctl, start, bytes, trim_state);
 	list_del(&trim_entry->list);
 	mutex_unlock(&ctl->cache_writeout_mutex);
 
@@ -3252,6 +3272,7 @@ static int trim_no_bitmap(struct btrfs_block_group_cache *block_group,
 	int ret = 0;
 	u64 extent_start;
 	u64 extent_bytes;
+	enum btrfs_trim_state extent_trim_state;
 	u64 bytes;
 
 	while (start < end) {
@@ -3293,6 +3314,7 @@ static int trim_no_bitmap(struct btrfs_block_group_cache *block_group,
 
 		extent_start = entry->offset;
 		extent_bytes = entry->bytes;
+		extent_trim_state = entry->trim_state;
 		start = max(start, extent_start);
 		bytes = min(extent_start + extent_bytes, end) - start;
 		if (bytes < minlen) {
@@ -3311,7 +3333,8 @@ static int trim_no_bitmap(struct btrfs_block_group_cache *block_group,
 		mutex_unlock(&ctl->cache_writeout_mutex);
 
 		ret = do_trimming(block_group, total_trimmed, start, bytes,
-				  extent_start, extent_bytes, &trim_entry);
+				  extent_start, extent_bytes, extent_trim_state,
+				  &trim_entry);
 		if (ret)
 			break;
 next:
@@ -3437,7 +3460,7 @@ static int trim_bitmaps(struct btrfs_block_group_cache *block_group,
 		mutex_unlock(&ctl->cache_writeout_mutex);
 
 		ret = do_trimming(block_group, total_trimmed, start, bytes,
-				  start, bytes, &trim_entry);
+				  start, bytes, 0, &trim_entry);
 		if (ret) {
 			reset_trimming_bitmap(ctl, offset);
 			break;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index a02fece949cb..7a1bd85e1981 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -46,6 +46,7 @@
 #include "sysfs.h"
 #include "tests/btrfs-tests.h"
 #include "block-group.h"
+#include "discard.h"
 
 #include "qgroup.h"
 #define CREATE_TRACE_POINTS
@@ -146,6 +147,8 @@ void __btrfs_handle_fs_error(struct btrfs_fs_info *fs_info, const char *function
 	if (sb_rdonly(sb))
 		return;
 
+	btrfs_discard_stop(fs_info);
+
 	/* btrfs handle error by forcing the filesystem readonly */
 	sb->s_flags |= SB_RDONLY;
 	btrfs_info(fs_info, "forced readonly");
@@ -313,6 +316,7 @@ enum {
 	Opt_datasum, Opt_nodatasum,
 	Opt_defrag, Opt_nodefrag,
 	Opt_discard, Opt_nodiscard,
+	Opt_discard_mode,
 	Opt_nologreplay,
 	Opt_norecovery,
 	Opt_ratio,
@@ -376,6 +380,7 @@ static const match_table_t tokens = {
 	{Opt_nodefrag, "noautodefrag"},
 	{Opt_discard, "discard"},
 	{Opt_nodiscard, "nodiscard"},
+	{Opt_discard_mode, "discard=%s"},
 	{Opt_nologreplay, "nologreplay"},
 	{Opt_norecovery, "norecovery"},
 	{Opt_ratio, "metadata_ratio=%u"},
@@ -695,12 +700,26 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 				   info->metadata_ratio);
 			break;
 		case Opt_discard:
-			btrfs_set_and_info(info, DISCARD_SYNC,
-					   "turning on sync discard");
+		case Opt_discard_mode:
+			if (token == Opt_discard ||
+			    strcmp(args[0].from, "sync") == 0) {
+				btrfs_clear_opt(info->mount_opt, DISCARD_ASYNC);
+				btrfs_set_and_info(info, DISCARD_SYNC,
+						   "turning on sync discard");
+			} else if (strcmp(args[0].from, "async") == 0) {
+				btrfs_clear_opt(info->mount_opt, DISCARD_SYNC);
+				btrfs_set_and_info(info, DISCARD_ASYNC,
+						   "turning on async discard");
+			} else {
+				ret = -EINVAL;
+				goto out;
+			}
 			break;
 		case Opt_nodiscard:
 			btrfs_clear_and_info(info, DISCARD_SYNC,
 					     "turning off discard");
+			btrfs_clear_and_info(info, DISCARD_ASYNC,
+					     "turning off async discard");
 			break;
 		case Opt_space_cache:
 		case Opt_space_cache_version:
@@ -1324,6 +1343,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
 		seq_puts(seq, ",flushoncommit");
 	if (btrfs_test_opt(info, DISCARD_SYNC))
 		seq_puts(seq, ",discard");
+	if (btrfs_test_opt(info, DISCARD_ASYNC))
+		seq_puts(seq, ",discard=async");
 	if (!(info->sb->s_flags & SB_POSIXACL))
 		seq_puts(seq, ",noacl");
 	if (btrfs_test_opt(info, SPACE_CACHE))
@@ -1714,6 +1735,14 @@ static inline void btrfs_remount_cleanup(struct btrfs_fs_info *fs_info,
 		btrfs_cleanup_defrag_inodes(fs_info);
 	}
 
+	/* If we toggled discard async. */
+	if (!btrfs_raw_test_opt(old_opts, DISCARD_ASYNC) &&
+	    btrfs_test_opt(fs_info, DISCARD_ASYNC))
+		btrfs_discard_resume(fs_info);
+	else if (btrfs_raw_test_opt(old_opts, DISCARD_ASYNC) &&
+		 !btrfs_test_opt(fs_info, DISCARD_ASYNC))
+		btrfs_discard_cleanup(fs_info);
+
 	clear_bit(BTRFS_FS_STATE_REMOUNTING, &fs_info->fs_state);
 }
 
@@ -1761,6 +1790,8 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
 		 */
 		cancel_work_sync(&fs_info->async_reclaim_work);
 
+		btrfs_discard_cleanup(fs_info);
+
 		/* wait for the uuid_scan task to finish */
 		down(&fs_info->uuid_tree_rescan_sem);
 		/* avoid complains from lockdep et al. */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 05/22] btrfs: add the beginning of async discard, discard workqueue
  2019-10-23 22:52 ` [PATCH 05/22] btrfs: add the beginning of async discard, discard workqueue Dennis Zhou
@ 2019-11-11 18:49   ` David Sterba
  0 siblings, 0 replies; 23+ messages in thread
From: David Sterba @ 2019-11-11 18:49 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: David Sterba, Chris Mason, Josef Bacik, Omar Sandoval,
	kernel-team, linux-btrfs

On Wed, Oct 23, 2019 at 06:52:59PM -0400, Dennis Zhou wrote:
> --- /dev/null
> +++ b/fs/btrfs/discard.c
> @@ -0,0 +1,274 @@
> +/*
> + * Copyright (C) 2019 Facebook.  All rights reserved.
> + */

It's the other way around, SPDX should be there,
https://btrfs.wiki.kernel.org/index.php/Developer%27s_FAQ#Copyright_notices_in_files.2C_SPDX

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2019-12-14  0:22 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <201911220351.HPI9gxNo%lkp@intel.com>
2019-11-22  4:27 ` [PATCH 05/22] btrfs: add the beginning of async discard, discard workqueue Nick Desaulniers
2019-11-22  4:27   ` Nick Desaulniers
2019-11-25 18:59   ` Dennis Zhou
2019-11-25 18:59     ` Dennis Zhou
2019-11-25 19:39     ` Nick Desaulniers
2019-11-25 19:39       ` Nick Desaulniers
2019-11-26  1:42       ` Philip Li
2019-11-26  1:42         ` Philip Li
2019-11-26  1:42         ` Philip Li
2019-11-26  1:47         ` Nick Desaulniers
2019-11-26  1:47           ` Nick Desaulniers
2019-11-26  4:07           ` Philip Li
2019-11-26  4:07             ` Philip Li
2019-11-26  4:07             ` Philip Li
2019-11-26  4:09           ` Philip Li
2019-11-26  4:09             ` Philip Li
2019-11-26  4:09             ` Philip Li
2019-12-14  0:22 [PATCH v6 00/22] btrfs: async discard support Dennis Zhou
2019-12-14  0:22 ` [PATCH 05/22] btrfs: add the beginning of async discard, discard workqueue Dennis Zhou
  -- strict thread matches above, loose matches on Subject: below --
2019-12-09 19:45 [PATCH v5 00/22] btrfs: async discard support Dennis Zhou
2019-12-09 19:45 ` [PATCH 05/22] btrfs: add the beginning of async discard, discard workqueue Dennis Zhou
2019-11-25 19:46 [PATCH v4 00/22] btrfs: async discard support Dennis Zhou
2019-11-25 19:46 ` [PATCH 05/22] btrfs: add the beginning of async discard, discard workqueue Dennis Zhou
2019-11-20 21:50 [PATCH v3 00/22] btrfs: async discard support Dennis Zhou
2019-11-20 21:51 ` [PATCH 05/22] btrfs: add the beginning of async discard, discard workqueue Dennis Zhou
2019-10-23 22:52 [PATCH v2 00/22] btrfs: async discard support Dennis Zhou
2019-10-23 22:52 ` [PATCH 05/22] btrfs: add the beginning of async discard, discard workqueue Dennis Zhou
2019-11-11 18:49   ` David Sterba

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.