Linux-BTRFS Archive on lore.kernel.org
 help / Atom feed
* [PATCH] Btrfs: avoid deadlock with memory reclaim due to allocation of devices
@ 2018-12-13 21:17 fdmanana
  2018-12-14  7:27 ` Nikolay Borisov
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: fdmanana @ 2018-12-13 21:17 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Several places allocate a device while holding the device list mutex. This
can result in a deadlock if reclaim happens because the device, and its
flush bio, are allocated using GFP_KERNEL mode (by __alloc_device() which
is used by btrfs_alloc_device()). A transaction commit, which reclaim can
trigger, needs to lock the device list mutex in its critical section, done
at btrfs_update_commit_device_size().

Some of these places are device_list_add(), which ends up being called
through the device scan ioctl, and btrfs_close_one_device(), which ends up
being called through the device remove ioctl.

Since all the places that add elements to the list of resized devices (the
device grow and shrink functions) only lock the chunk mutex before adding
a device to the list, drop the need to acquire the device list mutex from
btrfs_update_commit_device_size(), which is the only other place that uses
this list and it already locks the chunk mutex.

Fixes: 78f2c9e6dbb14 ("btrfs: device add and remove: use GFP_KERNEL")
Fixes: e0ae999414238 ("btrfs: preallocate device flush bio")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/volumes.c | 8 ++------
 fs/btrfs/volumes.h | 1 +
 2 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c872adfc939e..74c4ed29e36e 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -176,7 +176,8 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
  * chunk_mutex
  * -----------
  * protects chunks, adding or removing during allocation, trim or when a new
- * device is added/removed
+ * device is added/removed, and the list of resized devices at struct
+ * btrfs_fs_info::fs_devices::resized_devices
  *
  * cleaner_mutex
  * -------------
@@ -7298,10 +7299,6 @@ void btrfs_update_commit_device_size(struct btrfs_fs_info *fs_info)
 	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
 	struct btrfs_device *curr, *next;
 
-	if (list_empty(&fs_devices->resized_devices))
-		return;
-
-	mutex_lock(&fs_devices->device_list_mutex);
 	mutex_lock(&fs_info->chunk_mutex);
 	list_for_each_entry_safe(curr, next, &fs_devices->resized_devices,
 				 resized_list) {
@@ -7309,7 +7306,6 @@ void btrfs_update_commit_device_size(struct btrfs_fs_info *fs_info)
 		curr->commit_total_bytes = curr->disk_total_bytes;
 	}
 	mutex_unlock(&fs_info->chunk_mutex);
-	mutex_unlock(&fs_devices->device_list_mutex);
 }
 
 /* Must be invoked during the transaction commit */
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index aefce895e994..362574b9c37a 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -229,6 +229,7 @@ struct btrfs_fs_devices {
 	struct mutex device_list_mutex;
 	struct list_head devices;
 
+	/* protected by struct btrfs_fs_info::chunk_mutex */
 	struct list_head resized_devices;
 	/* devices not currently being allocated */
 	struct list_head alloc_list;
-- 
2.11.0


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Btrfs: avoid deadlock with memory reclaim due to allocation of devices
  2018-12-13 21:17 [PATCH] Btrfs: avoid deadlock with memory reclaim due to allocation of devices fdmanana
@ 2018-12-14  7:27 ` Nikolay Borisov
  2019-01-08 11:51 ` Filipe Manana
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 10+ messages in thread
From: Nikolay Borisov @ 2018-12-14  7:27 UTC (permalink / raw)
  To: fdmanana, linux-btrfs



On 13.12.18 г. 23:17 ч., fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Several places allocate a device while holding the device list mutex. This
> can result in a deadlock if reclaim happens because the device, and its
> flush bio, are allocated using GFP_KERNEL mode (by __alloc_device() which
> is used by btrfs_alloc_device()). A transaction commit, which reclaim can
> trigger, needs to lock the device list mutex in its critical section, done
> at btrfs_update_commit_device_size().
> 
> Some of these places are device_list_add(), which ends up being called
> through the device scan ioctl, and btrfs_close_one_device(), which ends up
> being called through the device remove ioctl.
> 
> Since all the places that add elements to the list of resized devices (the
> device grow and shrink functions) only lock the chunk mutex before adding
> a device to the list, drop the need to acquire the device list mutex from
> btrfs_update_commit_device_size(), which is the only other place that uses
> this list and it already locks the chunk mutex.
> 
> Fixes: 78f2c9e6dbb14 ("btrfs: device add and remove: use GFP_KERNEL")
> Fixes: e0ae999414238 ("btrfs: preallocate device flush bio")
> Signed-off-by: Filipe Manana <fdmanana@suse.com>


Reviewed-by: Nikolay Borisov <nborisov@suse.com>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Btrfs: avoid deadlock with memory reclaim due to allocation of devices
  2018-12-13 21:17 [PATCH] Btrfs: avoid deadlock with memory reclaim due to allocation of devices fdmanana
  2018-12-14  7:27 ` Nikolay Borisov
@ 2019-01-08 11:51 ` Filipe Manana
  2019-01-09 18:26 ` David Sterba
  2019-01-11 17:17 ` [PATCH v2] " fdmanana
  3 siblings, 0 replies; 10+ messages in thread
From: Filipe Manana @ 2019-01-08 11:51 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba

On Thu, Dec 13, 2018 at 9:18 PM <fdmanana@kernel.org> wrote:
>
> From: Filipe Manana <fdmanana@suse.com>
>
> Several places allocate a device while holding the device list mutex. This
> can result in a deadlock if reclaim happens because the device, and its
> flush bio, are allocated using GFP_KERNEL mode (by __alloc_device() which
> is used by btrfs_alloc_device()). A transaction commit, which reclaim can
> trigger, needs to lock the device list mutex in its critical section, done
> at btrfs_update_commit_device_size().
>
> Some of these places are device_list_add(), which ends up being called
> through the device scan ioctl, and btrfs_close_one_device(), which ends up
> being called through the device remove ioctl.
>
> Since all the places that add elements to the list of resized devices (the
> device grow and shrink functions) only lock the chunk mutex before adding
> a device to the list, drop the need to acquire the device list mutex from
> btrfs_update_commit_device_size(), which is the only other place that uses
> this list and it already locks the chunk mutex.
>
> Fixes: 78f2c9e6dbb14 ("btrfs: device add and remove: use GFP_KERNEL")
> Fixes: e0ae999414238 ("btrfs: preallocate device flush bio")
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Ping.

> ---
>  fs/btrfs/volumes.c | 8 ++------
>  fs/btrfs/volumes.h | 1 +
>  2 files changed, 3 insertions(+), 6 deletions(-)
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index c872adfc939e..74c4ed29e36e 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -176,7 +176,8 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
>   * chunk_mutex
>   * -----------
>   * protects chunks, adding or removing during allocation, trim or when a new
> - * device is added/removed
> + * device is added/removed, and the list of resized devices at struct
> + * btrfs_fs_info::fs_devices::resized_devices
>   *
>   * cleaner_mutex
>   * -------------
> @@ -7298,10 +7299,6 @@ void btrfs_update_commit_device_size(struct btrfs_fs_info *fs_info)
>         struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
>         struct btrfs_device *curr, *next;
>
> -       if (list_empty(&fs_devices->resized_devices))
> -               return;
> -
> -       mutex_lock(&fs_devices->device_list_mutex);
>         mutex_lock(&fs_info->chunk_mutex);
>         list_for_each_entry_safe(curr, next, &fs_devices->resized_devices,
>                                  resized_list) {
> @@ -7309,7 +7306,6 @@ void btrfs_update_commit_device_size(struct btrfs_fs_info *fs_info)
>                 curr->commit_total_bytes = curr->disk_total_bytes;
>         }
>         mutex_unlock(&fs_info->chunk_mutex);
> -       mutex_unlock(&fs_devices->device_list_mutex);
>  }
>
>  /* Must be invoked during the transaction commit */
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index aefce895e994..362574b9c37a 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -229,6 +229,7 @@ struct btrfs_fs_devices {
>         struct mutex device_list_mutex;
>         struct list_head devices;
>
> +       /* protected by struct btrfs_fs_info::chunk_mutex */
>         struct list_head resized_devices;
>         /* devices not currently being allocated */
>         struct list_head alloc_list;
> --
> 2.11.0
>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Btrfs: avoid deadlock with memory reclaim due to allocation of devices
  2018-12-13 21:17 [PATCH] Btrfs: avoid deadlock with memory reclaim due to allocation of devices fdmanana
  2018-12-14  7:27 ` Nikolay Borisov
  2019-01-08 11:51 ` Filipe Manana
@ 2019-01-09 18:26 ` David Sterba
  2019-01-09 19:48   ` Filipe Manana
  2019-01-10  7:03   ` Nikolay Borisov
  2019-01-11 17:17 ` [PATCH v2] " fdmanana
  3 siblings, 2 replies; 10+ messages in thread
From: David Sterba @ 2019-01-09 18:26 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Thu, Dec 13, 2018 at 09:17:25PM +0000, fdmanana@kernel.org wrote:
> -	if (list_empty(&fs_devices->resized_devices))
> -		return;
> -
> -	mutex_lock(&fs_devices->device_list_mutex);
>  	mutex_lock(&fs_info->chunk_mutex);
>  	list_for_each_entry_safe(curr, next, &fs_devices->resized_devices,
>  				 resized_list) {
> @@ -7309,7 +7306,6 @@ void btrfs_update_commit_device_size(struct btrfs_fs_info *fs_info)
>  		curr->commit_total_bytes = curr->disk_total_bytes;

I'm not sure about removing the device_list_mutex that's said to protect
the commit_total_bytes (comment in struct btrfs_device).

Otherwise the logic is ok, the double lock could happen as you describe.

btrfs_update_commit_device_size is called from btrfs_commit_transaction,
at the same time as commit_bytes_used. The latter is handled in a
similar way in btrfs_update_commit_device_bytes_used, but does not take
the device_list_mutex.

commit_total_bytes is checked several times (eg. in write_dev_supers) to
see if writing the superblock copy is still within the device range.

So, without the protected change, it's theoretically possible that a
stale value is used for the test and the superblock is either written
though it should not, and the other way around.

This would require a resize racing at the time of the check. Grow and
shrink seem to take chunk_mutex while adjusting all the total/size
values, but it's not actually easy to follow as sometimes there are
helpers like btrfs_device_set_total_bytes used and sometimes it's direct
access.

That the device_list_mutex can be safely dropped probably follows from
the simple fact that btrfs_update_commit_device_bytes_used is called
before write_dev_supers in the same context.

But this sounds too simple, given that there are locks taken and
released and btrfs_write_and_wait_transaction called between.

Referencing this code:

2201         btrfs_update_commit_device_size(fs_info);
2202         btrfs_update_commit_device_bytes_used(cur_trans);
2203
2204         clear_bit(BTRFS_FS_LOG1_ERR, &fs_info->flags);
2205         clear_bit(BTRFS_FS_LOG2_ERR, &fs_info->flags);
2206
2207         btrfs_trans_release_chunk_metadata(trans);
2208
2209         spin_lock(&fs_info->trans_lock);
2210         cur_trans->state = TRANS_STATE_UNBLOCKED;
2211         fs_info->running_transaction = NULL;
2212         spin_unlock(&fs_info->trans_lock);
2213         mutex_unlock(&fs_info->reloc_mutex);
2214
2215         wake_up(&fs_info->transaction_wait);
2216
2217         ret = btrfs_write_and_wait_transaction(trans);
2218         if (ret) {
2219                 btrfs_handle_fs_error(fs_info, ret,
2220                                       "Error while writing out transaction");
2221                 mutex_unlock(&fs_info->tree_log_mutex);
2222                 goto scrub_continue;
2223         }
2224
2225         ret = write_all_supers(fs_info, 0);

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Btrfs: avoid deadlock with memory reclaim due to allocation of devices
  2019-01-09 18:26 ` David Sterba
@ 2019-01-09 19:48   ` Filipe Manana
  2019-01-10  7:32     ` Anand Jain
  2019-01-10  7:03   ` Nikolay Borisov
  1 sibling, 1 reply; 10+ messages in thread
From: Filipe Manana @ 2019-01-09 19:48 UTC (permalink / raw)
  To: dsterba, Filipe Manana, linux-btrfs

On Wed, Jan 9, 2019 at 6:27 PM David Sterba <dsterba@suse.cz> wrote:
>
> On Thu, Dec 13, 2018 at 09:17:25PM +0000, fdmanana@kernel.org wrote:
> > -     if (list_empty(&fs_devices->resized_devices))
> > -             return;
> > -
> > -     mutex_lock(&fs_devices->device_list_mutex);
> >       mutex_lock(&fs_info->chunk_mutex);
> >       list_for_each_entry_safe(curr, next, &fs_devices->resized_devices,
> >                                resized_list) {
> > @@ -7309,7 +7306,6 @@ void btrfs_update_commit_device_size(struct btrfs_fs_info *fs_info)
> >               curr->commit_total_bytes = curr->disk_total_bytes;
>
> I'm not sure about removing the device_list_mutex that's said to protect
> the commit_total_bytes (comment in struct btrfs_device).
>
> Otherwise the logic is ok, the double lock could happen as you describe.
>
> btrfs_update_commit_device_size is called from btrfs_commit_transaction,
> at the same time as commit_bytes_used. The latter is handled in a
> similar way in btrfs_update_commit_device_bytes_used, but does not take
> the device_list_mutex.
>
> commit_total_bytes is checked several times (eg. in write_dev_supers) to
> see if writing the superblock copy is still within the device range.
>
> So, without the protected change, it's theoretically possible that a
> stale value is used for the test and the superblock is either written
> though it should not, and the other way around.
>
> This would require a resize racing at the time of the check. Grow and
> shrink seem to take chunk_mutex while adjusting all the total/size
> values, but it's not actually easy to follow as sometimes there are
> helpers like btrfs_device_set_total_bytes used and sometimes it's direct
> access.
>
> That the device_list_mutex can be safely dropped probably follows from
> the simple fact that btrfs_update_commit_device_bytes_used is called
> before write_dev_supers in the same context.
>
> But this sounds too simple, given that there are locks taken and
> released and btrfs_write_and_wait_transaction called between.

Regardless of all that (and honestly I haven't double checked and
skimmed only through what you said),
there's a more important aspect I missed before: write_all_supers()
takes (and needs) the device list mutex,
therefore this change won't fix the deadlock because of that.

thanks

>
> Referencing this code:
>
> 2201         btrfs_update_commit_device_size(fs_info);
> 2202         btrfs_update_commit_device_bytes_used(cur_trans);
> 2203
> 2204         clear_bit(BTRFS_FS_LOG1_ERR, &fs_info->flags);
> 2205         clear_bit(BTRFS_FS_LOG2_ERR, &fs_info->flags);
> 2206
> 2207         btrfs_trans_release_chunk_metadata(trans);
> 2208
> 2209         spin_lock(&fs_info->trans_lock);
> 2210         cur_trans->state = TRANS_STATE_UNBLOCKED;
> 2211         fs_info->running_transaction = NULL;
> 2212         spin_unlock(&fs_info->trans_lock);
> 2213         mutex_unlock(&fs_info->reloc_mutex);
> 2214
> 2215         wake_up(&fs_info->transaction_wait);
> 2216
> 2217         ret = btrfs_write_and_wait_transaction(trans);
> 2218         if (ret) {
> 2219                 btrfs_handle_fs_error(fs_info, ret,
> 2220                                       "Error while writing out transaction");
> 2221                 mutex_unlock(&fs_info->tree_log_mutex);
> 2222                 goto scrub_continue;
> 2223         }
> 2224
> 2225         ret = write_all_supers(fs_info, 0);

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Btrfs: avoid deadlock with memory reclaim due to allocation of devices
  2019-01-09 18:26 ` David Sterba
  2019-01-09 19:48   ` Filipe Manana
@ 2019-01-10  7:03   ` Nikolay Borisov
  1 sibling, 0 replies; 10+ messages in thread
From: Nikolay Borisov @ 2019-01-10  7:03 UTC (permalink / raw)
  To: dsterba, fdmanana, linux-btrfs



On 9.01.19 г. 20:26 ч., David Sterba wrote:
> On Thu, Dec 13, 2018 at 09:17:25PM +0000, fdmanana@kernel.org wrote:
>> -	if (list_empty(&fs_devices->resized_devices))
>> -		return;
>> -
>> -	mutex_lock(&fs_devices->device_list_mutex);
>>  	mutex_lock(&fs_info->chunk_mutex);
>>  	list_for_each_entry_safe(curr, next, &fs_devices->resized_devices,
>>  				 resized_list) {
>> @@ -7309,7 +7306,6 @@ void btrfs_update_commit_device_size(struct btrfs_fs_info *fs_info)
>>  		curr->commit_total_bytes = curr->disk_total_bytes;
> 
> I'm not sure about removing the device_list_mutex that's said to protect
> the commit_total_bytes (comment in struct btrfs_device).
> 
> Otherwise the logic is ok, the double lock could happen as you describe.
> 
> btrfs_update_commit_device_size is called from btrfs_commit_transaction,
> at the same time as commit_bytes_used. The latter is handled in a
> similar way in btrfs_update_commit_device_bytes_used, but does not take
> the device_list_mutex.
> 
> commit_total_bytes is checked several times (eg. in write_dev_supers) to
> see if writing the superblock copy is still within the device range.
> 
> So, without the protected change, it's theoretically possible that a
> stale value is used for the test and the superblock is either written
> though it should not, and the other way around.

But can it really, btrfs_[grow|shrink]_device happen under transaction
and their modification of the device_disk_total_bytes (the value
assigned to commit_total_bytes) always happen under chunk_mutex. Also
the updates to both values are really owned by the transaction, so even
if grow/shrink modify those value they will queue those changes in a new
transaction, hence write_dev_super will see consistent value in the
current transaction.


I have a patch from Jeff (which is part of a bigger work) that actually
unifies the resize/device size change lists into a single single and
makes the code a bit easier to grok, nevertheless the above explanation
is still correct even without this patch.


> 
> This would require a resize racing at the time of the check. Grow and
> shrink seem to take chunk_mutex while adjusting all the total/size
> values, but it's not actually easy to follow as sometimes there are
> helpers like btrfs_device_set_total_bytes used and sometimes it's direct
> access.
> 
> That the device_list_mutex can be safely dropped probably follows from
> the simple fact that btrfs_update_commit_device_bytes_used is called
> before write_dev_supers in the same context.
> 
> But this sounds too simple, given that there are locks taken and
> released and btrfs_write_and_wait_transaction called between.
> 
> Referencing this code:
> 
> 2201         btrfs_update_commit_device_size(fs_info);
> 2202         btrfs_update_commit_device_bytes_used(cur_trans);
> 2203
> 2204         clear_bit(BTRFS_FS_LOG1_ERR, &fs_info->flags);
> 2205         clear_bit(BTRFS_FS_LOG2_ERR, &fs_info->flags);
> 2206
> 2207         btrfs_trans_release_chunk_metadata(trans);
> 2208
> 2209         spin_lock(&fs_info->trans_lock);
> 2210         cur_trans->state = TRANS_STATE_UNBLOCKED;
> 2211         fs_info->running_transaction = NULL;
> 2212         spin_unlock(&fs_info->trans_lock);
> 2213         mutex_unlock(&fs_info->reloc_mutex);
> 2214
> 2215         wake_up(&fs_info->transaction_wait);
> 2216
> 2217         ret = btrfs_write_and_wait_transaction(trans);
> 2218         if (ret) {
> 2219                 btrfs_handle_fs_error(fs_info, ret,
> 2220                                       "Error while writing out transaction");
> 2221                 mutex_unlock(&fs_info->tree_log_mutex);
> 2222                 goto scrub_continue;
> 2223         }
> 2224
> 2225         ret = write_all_supers(fs_info, 0);
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Btrfs: avoid deadlock with memory reclaim due to allocation of devices
  2019-01-09 19:48   ` Filipe Manana
@ 2019-01-10  7:32     ` Anand Jain
  0 siblings, 0 replies; 10+ messages in thread
From: Anand Jain @ 2019-01-10  7:32 UTC (permalink / raw)
  To: Filipe Manana, dsterba, linux-btrfs



On 01/10/2019 03:48 AM, Filipe Manana wrote:
> On Wed, Jan 9, 2019 at 6:27 PM David Sterba <dsterba@suse.cz> wrote:
>>
>> On Thu, Dec 13, 2018 at 09:17:25PM +0000, fdmanana@kernel.org wrote:
>>> -     if (list_empty(&fs_devices->resized_devices))
>>> -             return;
>>> -
>>> -     mutex_lock(&fs_devices->device_list_mutex);
>>>        mutex_lock(&fs_info->chunk_mutex);
>>>        list_for_each_entry_safe(curr, next, &fs_devices->resized_devices,
>>>                                 resized_list) {
>>> @@ -7309,7 +7306,6 @@ void btrfs_update_commit_device_size(struct btrfs_fs_info *fs_info)
>>>                curr->commit_total_bytes = curr->disk_total_bytes;
>>
>> I'm not sure about removing the device_list_mutex that's said to protect
>> the commit_total_bytes (comment in struct btrfs_device).
>>
>> Otherwise the logic is ok, the double lock could happen as you describe.
>>
>> btrfs_update_commit_device_size is called from btrfs_commit_transaction,
>> at the same time as commit_bytes_used. The latter is handled in a
>> similar way in btrfs_update_commit_device_bytes_used, but does not take
>> the device_list_mutex.
>>
>> commit_total_bytes is checked several times (eg. in write_dev_supers) to
>> see if writing the superblock copy is still within the device range.
>>
>> So, without the protected change, it's theoretically possible that a
>> stale value is used for the test and the superblock is either written
>> though it should not, and the other way around.
>>
>> This would require a resize racing at the time of the check. Grow and
>> shrink seem to take chunk_mutex while adjusting all the total/size
>> values, but it's not actually easy to follow as sometimes there are
>> helpers like btrfs_device_set_total_bytes used and sometimes it's direct
>> access.
>>
>> That the device_list_mutex can be safely dropped probably follows from
>> the simple fact that btrfs_update_commit_device_bytes_used is called
>> before write_dev_supers in the same context.
>>
>> But this sounds too simple, given that there are locks taken and
>> released and btrfs_write_and_wait_transaction called between.
> 
> Regardless of all that (and honestly I haven't double checked and
> skimmed only through what you said),
> there's a more important aspect I missed before: write_all_supers()
> takes (and needs) the device list mutex,
> therefore this change won't fix the deadlock because of that.

  Though this won't fix the problem, this patch is still ok
  as its drops the unnecessary device_list_mutex in
  btrfs_update_commit_device_size(). So for that if the change log
  updated,
    Reviewed-by: Anand Jain <anand.jain@oracle.com>

  To address the actual problem.

  Functions which call btrfs_alloc_device() are..
     device_list_add()
     close_fs_devices()
     btrfs_init_dev_replace_tgtdev()
     clone_fs_devices()
     btrfs_init_new_device()

  Now among the above following holds the device_list_mutex when calling
  btrfs_alloc_device()
     device_list_add()
     close_fs_devices()
     clone_fs_devices()

  Now among above three, the lock at device_list_add() and
clone_fs_devices() can be ignored, because the reclaim or flush IO can't
take place on these FSID:devices as device_list_add() is called when FS
is not-mounted or in the mounting context, and clone_fs_devices() is
called when the SEED device is still a read-only FS.

  And so we have to only worry about close_fs_devices().

  close_fs_devices() - I didn't like the way it does, that is allocate a
new struct btrfs_device instead of just zero-ing the struct btrfs_device
during unmount. I guess it was done to avoid RCU warning (not sure) and
if it isn't real issues I am happy to see btrfs_device_alloc() is
dropped in close_fs_devices(). Which means it also fixes the problem
that - you need more memory to unmount an ideal FS.

Thanks.

> thanks
> 
>>
>> Referencing this code:
>>
>> 2201         btrfs_update_commit_device_size(fs_info);
>> 2202         btrfs_update_commit_device_bytes_used(cur_trans);
>> 2203
>> 2204         clear_bit(BTRFS_FS_LOG1_ERR, &fs_info->flags);
>> 2205         clear_bit(BTRFS_FS_LOG2_ERR, &fs_info->flags);
>> 2206
>> 2207         btrfs_trans_release_chunk_metadata(trans);
>> 2208
>> 2209         spin_lock(&fs_info->trans_lock);
>> 2210         cur_trans->state = TRANS_STATE_UNBLOCKED;
>> 2211         fs_info->running_transaction = NULL;
>> 2212         spin_unlock(&fs_info->trans_lock);
>> 2213         mutex_unlock(&fs_info->reloc_mutex);
>> 2214
>> 2215         wake_up(&fs_info->transaction_wait);
>> 2216
>> 2217         ret = btrfs_write_and_wait_transaction(trans);
>> 2218         if (ret) {
>> 2219                 btrfs_handle_fs_error(fs_info, ret,
>> 2220                                       "Error while writing out transaction");
>> 2221                 mutex_unlock(&fs_info->tree_log_mutex);
>> 2222                 goto scrub_continue;
>> 2223         }
>> 2224
>> 2225         ret = write_all_supers(fs_info, 0);

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH v2] Btrfs: avoid deadlock with memory reclaim due to allocation of devices
  2018-12-13 21:17 [PATCH] Btrfs: avoid deadlock with memory reclaim due to allocation of devices fdmanana
                   ` (2 preceding siblings ...)
  2019-01-09 18:26 ` David Sterba
@ 2019-01-11 17:17 ` " fdmanana
  2019-01-14  8:21   ` Anand Jain
  3 siblings, 1 reply; 10+ messages in thread
From: fdmanana @ 2019-01-11 17:17 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

In a few places we are allocating a device using the GFP_KERNEL flag when
it is not safe to do so, because if reclaim is triggered it can cause a
transaction commit while we are holding the device list mutex. This mutex
is required in the transaction commit path (at write_all_supers() and
btrfs_update_commit_device_size()).

So fix this by setting up a nofs memory allocation context in those cases.

Fixes: 78f2c9e6dbb14 ("btrfs: device add and remove: use GFP_KERNEL")
Fixes: e0ae999414238 ("btrfs: preallocate device flush bio")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---

V2: Change the approach to fix the problem by setting up nofs contextes
    where needed.

 fs/btrfs/volumes.c | 33 ++++++++++++++++++++++++++++++---
 1 file changed, 30 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 2576b1a379c9..663566baae78 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -14,6 +14,7 @@
 #include <linux/semaphore.h>
 #include <linux/uuid.h>
 #include <linux/list_sort.h>
+#include <linux/sched/mm.h>
 #include "ctree.h"
 #include "extent_map.h"
 #include "disk-io.h"
@@ -988,20 +989,29 @@ static noinline struct btrfs_device *device_list_add(const char *path,
 	}
 
 	if (!device) {
+		unsigned int nofs_flag;
+
 		if (fs_devices->opened) {
 			mutex_unlock(&fs_devices->device_list_mutex);
 			return ERR_PTR(-EBUSY);
 		}
 
+		/*
+		 * Setup nofs context because we are holding the device list
+		 * mutex, which is required for a transaction commit.
+		 */
+		nofs_flag = memalloc_nofs_save();
 		device = btrfs_alloc_device(NULL, &devid,
 					    disk_super->dev_item.uuid);
 		if (IS_ERR(device)) {
+			memalloc_nofs_restore(nofs_flag);
 			mutex_unlock(&fs_devices->device_list_mutex);
 			/* we can safely leave the fs_devices entry around */
 			return device;
 		}
 
-		name = rcu_string_strdup(path, GFP_NOFS);
+		name = rcu_string_strdup(path, GFP_KERNEL);
+		memalloc_nofs_restore(nofs_flag);
 		if (!name) {
 			btrfs_free_device(device);
 			mutex_unlock(&fs_devices->device_list_mutex);
@@ -1137,11 +1147,19 @@ static struct btrfs_fs_devices *clone_fs_devices(struct btrfs_fs_devices *orig)
 	/* We have held the volume lock, it is safe to get the devices. */
 	list_for_each_entry(orig_dev, &orig->devices, dev_list) {
 		struct rcu_string *name;
+		unsigned int nofs_flag;
 
+		/*
+		 * Setup nofs context because we are holding the device list
+		 * mutex, which is required for a transaction commit.
+		 */
+		nofs_flag = memalloc_nofs_save();
 		device = btrfs_alloc_device(NULL, &orig_dev->devid,
 					    orig_dev->uuid);
-		if (IS_ERR(device))
+		if (IS_ERR(device)) {
+			memalloc_nofs_restore(nofs_flag);
 			goto error;
+		}
 
 		/*
 		 * This is ok to do without rcu read locked because we hold the
@@ -1151,12 +1169,14 @@ static struct btrfs_fs_devices *clone_fs_devices(struct btrfs_fs_devices *orig)
 			name = rcu_string_strdup(orig_dev->name->str,
 					GFP_KERNEL);
 			if (!name) {
+				memalloc_nofs_restore(nofs_flag);
 				btrfs_free_device(device);
 				goto error;
 			}
 			rcu_assign_pointer(device->name, name);
 		}
 
+		memalloc_nofs_restore(nofs_flag);
 		list_add(&device->dev_list, &fs_devices->devices);
 		device->fs_devices = fs_devices;
 		fs_devices->num_devices++;
@@ -1262,6 +1282,7 @@ static void btrfs_close_one_device(struct btrfs_device *device)
 	struct btrfs_fs_devices *fs_devices = device->fs_devices;
 	struct btrfs_device *new_device;
 	struct rcu_string *name;
+	unsigned int nofs_flag;
 
 	if (device->bdev)
 		fs_devices->open_devices--;
@@ -1277,17 +1298,23 @@ static void btrfs_close_one_device(struct btrfs_device *device)
 
 	btrfs_close_bdev(device);
 
+	/*
+	 * Setup nofs context because we are holding the device list
+	 * mutex, which is required for a transaction commit.
+	 */
+	nofs_flag = memalloc_nofs_save();
 	new_device = btrfs_alloc_device(NULL, &device->devid,
 					device->uuid);
 	BUG_ON(IS_ERR(new_device)); /* -ENOMEM */
 
 	/* Safe because we are under uuid_mutex */
 	if (device->name) {
-		name = rcu_string_strdup(device->name->str, GFP_NOFS);
+		name = rcu_string_strdup(device->name->str, GFP_KERNEL);
 		BUG_ON(!name); /* -ENOMEM */
 		rcu_assign_pointer(new_device->name, name);
 	}
 
+	memalloc_nofs_restore(nofs_flag);
 	list_replace_rcu(&device->dev_list, &new_device->dev_list);
 	new_device->fs_devices = device->fs_devices;
 
-- 
2.11.0


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] Btrfs: avoid deadlock with memory reclaim due to allocation of devices
  2019-01-11 17:17 ` [PATCH v2] " fdmanana
@ 2019-01-14  8:21   ` Anand Jain
  2019-01-18 18:07     ` David Sterba
  0 siblings, 1 reply; 10+ messages in thread
From: Anand Jain @ 2019-01-14  8:21 UTC (permalink / raw)
  To: fdmanana, linux-btrfs



On 01/12/2019 01:17 AM, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> In a few places we are allocating a device using the GFP_KERNEL flag when
> it is not safe to do so, because if reclaim is triggered it can cause a
> transaction commit while we are holding the device list mutex. This mutex
> is required in the transaction commit path (at write_all_supers() and
> btrfs_update_commit_device_size()).
> 
> So fix this by setting up a nofs memory allocation context in those cases.
> 
> Fixes: 78f2c9e6dbb14 ("btrfs: device add and remove: use GFP_KERNEL")
> Fixes: e0ae999414238 ("btrfs: preallocate device flush bio")
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
> 
> V2: Change the approach to fix the problem by setting up nofs contextes
>      where needed.
> 
>   fs/btrfs/volumes.c | 33 ++++++++++++++++++++++++++++++---
>   1 file changed, 30 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 2576b1a379c9..663566baae78 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -14,6 +14,7 @@
>   #include <linux/semaphore.h>
>   #include <linux/uuid.h>
>   #include <linux/list_sort.h>
> +#include <linux/sched/mm.h>
>   #include "ctree.h"
>   #include "extent_map.h"
>   #include "disk-io.h"
> @@ -988,20 +989,29 @@ static noinline struct btrfs_device *device_list_add(const char *path,
>   	}
>   
>   	if (!device) {
> +		unsigned int nofs_flag;
> +
>   		if (fs_devices->opened) {
>   			mutex_unlock(&fs_devices->device_list_mutex);
>   			return ERR_PTR(-EBUSY);
>   		}
>   
> +		/*
> +		 * Setup nofs context because we are holding the device list
> +		 * mutex, which is required for a transaction commit.
> +		 */

I wonder if there is a bug due to GFP_KERNEL in device_list_add()?
as device_list_add() can only be called only when the FSID is not yet
mounted. OR if its done for the sake of consistency when calling\
btrfs_alloc_device().

Thanks, Anand


> +		nofs_flag = memalloc_nofs_save();
>   		device = btrfs_alloc_device(NULL, &devid,
>   					    disk_super->dev_item.uuid);
>   		if (IS_ERR(device)) {
> +			memalloc_nofs_restore(nofs_flag);
>   			mutex_unlock(&fs_devices->device_list_mutex);
>   			/* we can safely leave the fs_devices entry around */
>   			return device;
>   		}
>   
> -		name = rcu_string_strdup(path, GFP_NOFS);
> +		name = rcu_string_strdup(path, GFP_KERNEL);
> +		memalloc_nofs_restore(nofs_flag);
>   		if (!name) {
>   			btrfs_free_device(device);
>   			mutex_unlock(&fs_devices->device_list_mutex);
> @@ -1137,11 +1147,19 @@ static struct btrfs_fs_devices *clone_fs_devices(struct btrfs_fs_devices *orig)
>   	/* We have held the volume lock, it is safe to get the devices. */
>   	list_for_each_entry(orig_dev, &orig->devices, dev_list) {
>   		struct rcu_string *name;
> +		unsigned int nofs_flag;
>   
> +		/*
> +		 * Setup nofs context because we are holding the device list
> +		 * mutex, which is required for a transaction commit.
> +		 */
> +		nofs_flag = memalloc_nofs_save();
>   		device = btrfs_alloc_device(NULL, &orig_dev->devid,
>   					    orig_dev->uuid);
> -		if (IS_ERR(device))
> +		if (IS_ERR(device)) {
> +			memalloc_nofs_restore(nofs_flag);
>   			goto error;
> +		}
>   
>   		/*
>   		 * This is ok to do without rcu read locked because we hold the
> @@ -1151,12 +1169,14 @@ static struct btrfs_fs_devices *clone_fs_devices(struct btrfs_fs_devices *orig)
>   			name = rcu_string_strdup(orig_dev->name->str,
>   					GFP_KERNEL);
>   			if (!name) {
> +				memalloc_nofs_restore(nofs_flag);
>   				btrfs_free_device(device);
>   				goto error;
>   			}
>   			rcu_assign_pointer(device->name, name);
>   		}
>   
> +		memalloc_nofs_restore(nofs_flag);
>   		list_add(&device->dev_list, &fs_devices->devices);
>   		device->fs_devices = fs_devices;
>   		fs_devices->num_devices++;
> @@ -1262,6 +1282,7 @@ static void btrfs_close_one_device(struct btrfs_device *device)
>   	struct btrfs_fs_devices *fs_devices = device->fs_devices;
>   	struct btrfs_device *new_device;
>   	struct rcu_string *name;
> +	unsigned int nofs_flag;
>   
>   	if (device->bdev)
>   		fs_devices->open_devices--;
> @@ -1277,17 +1298,23 @@ static void btrfs_close_one_device(struct btrfs_device *device)
>   
>   	btrfs_close_bdev(device);
>   
> +	/*
> +	 * Setup nofs context because we are holding the device list
> +	 * mutex, which is required for a transaction commit.
> +	 */
> +	nofs_flag = memalloc_nofs_save();
>   	new_device = btrfs_alloc_device(NULL, &device->devid,
>   					device->uuid);
>   	BUG_ON(IS_ERR(new_device)); /* -ENOMEM */
>   
>   	/* Safe because we are under uuid_mutex */
>   	if (device->name) {
> -		name = rcu_string_strdup(device->name->str, GFP_NOFS);
> +		name = rcu_string_strdup(device->name->str, GFP_KERNEL);
>   		BUG_ON(!name); /* -ENOMEM */
>   		rcu_assign_pointer(new_device->name, name);
>   	}
>   
> +	memalloc_nofs_restore(nofs_flag);
>   	list_replace_rcu(&device->dev_list, &new_device->dev_list);
>   	new_device->fs_devices = device->fs_devices;
>   
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] Btrfs: avoid deadlock with memory reclaim due to allocation of devices
  2019-01-14  8:21   ` Anand Jain
@ 2019-01-18 18:07     ` David Sterba
  0 siblings, 0 replies; 10+ messages in thread
From: David Sterba @ 2019-01-18 18:07 UTC (permalink / raw)
  To: Anand Jain; +Cc: fdmanana, linux-btrfs

On Mon, Jan 14, 2019 at 04:21:43PM +0800, Anand Jain wrote:
> 
> 
> On 01/12/2019 01:17 AM, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> > 
> > In a few places we are allocating a device using the GFP_KERNEL flag when
> > it is not safe to do so, because if reclaim is triggered it can cause a
> > transaction commit while we are holding the device list mutex. This mutex
> > is required in the transaction commit path (at write_all_supers() and
> > btrfs_update_commit_device_size()).
> > 
> > So fix this by setting up a nofs memory allocation context in those cases.
> > 
> > Fixes: 78f2c9e6dbb14 ("btrfs: device add and remove: use GFP_KERNEL")
> > Fixes: e0ae999414238 ("btrfs: preallocate device flush bio")
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> > 
> > V2: Change the approach to fix the problem by setting up nofs contextes
> >      where needed.
> > 
> >   fs/btrfs/volumes.c | 33 ++++++++++++++++++++++++++++++---
> >   1 file changed, 30 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> > index 2576b1a379c9..663566baae78 100644
> > --- a/fs/btrfs/volumes.c
> > +++ b/fs/btrfs/volumes.c
> > @@ -14,6 +14,7 @@
> >   #include <linux/semaphore.h>
> >   #include <linux/uuid.h>
> >   #include <linux/list_sort.h>
> > +#include <linux/sched/mm.h>
> >   #include "ctree.h"
> >   #include "extent_map.h"
> >   #include "disk-io.h"
> > @@ -988,20 +989,29 @@ static noinline struct btrfs_device *device_list_add(const char *path,
> >   	}
> >   
> >   	if (!device) {
> > +		unsigned int nofs_flag;
> > +
> >   		if (fs_devices->opened) {
> >   			mutex_unlock(&fs_devices->device_list_mutex);
> >   			return ERR_PTR(-EBUSY);
> >   		}
> >   
> > +		/*
> > +		 * Setup nofs context because we are holding the device list
> > +		 * mutex, which is required for a transaction commit.
> > +		 */
> 
> I wonder if there is a bug due to GFP_KERNEL in device_list_add()?
> as device_list_add() can only be called only when the FSID is not yet
> mounted. OR if its done for the sake of consistency when calling\
> btrfs_alloc_device().

It still could be called but a new device will not be allocated, all is
done either via scan or during mount. A missing device has an entry in
fs_devices.

We can keep th NOFS protection around that to make it future-proof, as
it's not trivial to see if this is always called from safe context or
not.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, back to index

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-12-13 21:17 [PATCH] Btrfs: avoid deadlock with memory reclaim due to allocation of devices fdmanana
2018-12-14  7:27 ` Nikolay Borisov
2019-01-08 11:51 ` Filipe Manana
2019-01-09 18:26 ` David Sterba
2019-01-09 19:48   ` Filipe Manana
2019-01-10  7:32     ` Anand Jain
2019-01-10  7:03   ` Nikolay Borisov
2019-01-11 17:17 ` [PATCH v2] " fdmanana
2019-01-14  8:21   ` Anand Jain
2019-01-18 18:07     ` David Sterba

Linux-BTRFS Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-btrfs/0 linux-btrfs/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-btrfs linux-btrfs/ https://lore.kernel.org/linux-btrfs \
		linux-btrfs@vger.kernel.org linux-btrfs@archiver.kernel.org
	public-inbox-index linux-btrfs


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-btrfs


AGPL code for this site: git clone https://public-inbox.org/ public-inbox