Date: Mon, 21 Jun 2021 10:59:22 +0900
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org
Cc: Filipe Manana
Subject: hung_task issue by wait_event() in check_system_chunk()
Message-ID: <20210621015922.ewgbffxuawia7liz@naota-xeon>
We recently observed a hung_task issue on our 5.13-rc* test runs (without
any lockdep warnings!). The code is based on Linux 5.13-rc6 + a patch to
fix an unbalanced unlock in qgroup_account_snapshot() [1] + some debug
printings. I found the cause of the issue, but I'm not sure how we can
fix it.

[1] https://lore.kernel.org/linux-btrfs/20210621012114.1884779-1-naohiro.aota@wdc.com/T/#u

* Summary:

In btrfs_chunk_alloc(), we set space_info->chunk_alloc = 1 before a chunk
allocation and turn it back to 0 after the allocation. We block other
chunk allocations while space_info->chunk_alloc == 1, so the code must
make forward progress while space_info->chunk_alloc == 1.

However, commit eafa4fd0ad06 ("btrfs: fix exhaustion of the system chunk
array due to concurrent allocations") broke that assumption. It
introduced a wait_event() to wait for an ongoing transaction which
allocated a new chunk to finish. That waiting leads to an invisible
circular dependency: if the thread that allocated a chunk tries to
allocate another chunk, it cannot proceed, because the wait_event()'ing
thread has set space_info->chunk_alloc = 1.

We might need to set space_info->chunk_alloc = 0 while in the
wait_event()? But then we would need to recheck everything again from the
start of btrfs_chunk_alloc()?

* Full analysis:

# These messages are from check_system_chunk() after
# "trans->chunk_bytes_reserved += thresh;"

[ 905.785876][ T722] 722 add 393216 to transaction 233
[ 905.800494][ T5063] 5063 add 393216 to transaction 233

Threads 722 and 5063 reserved space for their chunk allocation metadata
in the running transaction.

[ 905.806845][ T5070] BTRFS info (device nullb1): left=65536, need=393216, flags=4
[ 905.811547][ T5070] BTRFS info (device nullb1): space_info 2 has 65536 free, is not full
...

# This message is from check_system_chunk() before the wait_event()

[ 905.831000][ T5070] waiting for reserved 786432 trans 0 min_needed 393216

Now, thread 5070 does not have enough SYSTEM space (left=65536 <
need=393216) for its chunk allocation. Since other transaction handles in
the same transaction (those of threads 722 and 5063) have already
reserved space, it now waits for them to release their reservations at
transaction commit.

[ 1108.499029][ T22] INFO: task kworker/u4:5:721 blocked for more than 122 seconds.

But nothing progressed! So, what happened to threads 722 and 5063?
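Before going through their backtraces, for reference, here is the wait
that thread 5070 is blocked on, i.e. the one commit eafa4fd0ad06 added to
check_system_chunk(). This is a condensed paraphrase just to show the
blocking structure, not the literal code (the real code computes the
condition differently and handles more cases):

    /* check_system_chunk(), heavily simplified */
    if (left < thresh) {
            /*
             * Other transaction handles attached to the running
             * transaction reserved SYSTEM space for their own chunk
             * allocations (cur_trans->chunk_bytes_reserved). Sleep
             * until enough of it is released; the releasing side
             * wakes cur_trans->chunk_reserve_wait.
             */
            wait_event(cur_trans->chunk_reserve_wait,
                       atomic64_read(&cur_trans->chunk_bytes_reserved) <=
                       min_needed);
    }

Note that this sleep happens while the waiter keeps
space_info->chunk_alloc == 1 set, which is exactly the flag the other
allocators poll, as the backtraces below show.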
Backtrace of thread 722:

crash> bt 722
PID: 722    TASK: ffff888100f80000  CPU: 0   COMMAND: "kworker/u4:6"
 #0 [ffffc90002467160] __schedule at ffffffff8260da25
 #1 [ffffc90002467268] preempt_schedule_common at ffffffff8260f635
 #2 [ffffc90002467288] __cond_resched at ffffffff8260f68d
 #3 [ffffc90002467298] btrfs_chunk_alloc at ffffffffa03cd50a [btrfs]
 #4 [ffffc90002467330] find_free_extent at ffffffffa01adfa2 [btrfs]
 #5 [ffffc900024674f0] btrfs_reserve_extent at ffffffffa01bf7d0 [btrfs]
 #6 [ffffc90002467580] btrfs_alloc_tree_block at ffffffffa01c074f [btrfs]
 #7 [ffffc90002467720] alloc_tree_block_no_bg_flush at ffffffffa018d9e5 [btrfs]
 #8 [ffffc90002467780] __btrfs_cow_block at ffffffffa0194fb1 [btrfs]
 #9 [ffffc900024678d0] btrfs_cow_block at ffffffffa01960b5 [btrfs]
#10 [ffffc90002467950] btrfs_search_slot at ffffffffa01a15c7 [btrfs]
#11 [ffffc90002467ac0] btrfs_remove_chunk at ffffffffa027f43c [btrfs]
#12 [ffffc90002467c50] btrfs_relocate_chunk at ffffffffa0280503 [btrfs]
#13 [ffffc90002467c88] btrfs_reclaim_bgs_work.cold at ffffffffa040e7b7 [btrfs]
#14 [ffffc90002467c90] __kasan_check_read at ffffffff8167ebf1
#15 [ffffc90002467d08] process_one_work at ffffffff811ae718
#16 [ffffc90002467e40] worker_thread at ffffffff811af8de
#17 [ffffc90002467f08] kthread at ffffffff811c0857
#18 [ffffc90002467f50] ret_from_fork at ffffffff810034b2

So, here it is. Since check_system_chunk() holds neither
space_info->lock nor fs_info->chunk_mutex while waiting, thread 722 does
not block on a lock; it busy-loops, waiting for space_info->chunk_alloc
to become 0 (see the loop sketch after the extent buffer analysis below):

 #3 [ffffc90002467298] btrfs_chunk_alloc at ffffffffa03cd50a [btrfs]
    /home/naota/src/linux-works/btrfs-zoned/fs/btrfs/block-group.c: 3282

And how about thread 5063?

crash> bt 5063
PID: 5063   TASK: ffff8881ca254000  CPU: 0   COMMAND: "fsstress"
 #0 [ffffc90007d46fd8] __schedule at ffffffff8260da25
 #1 [ffffc90007d470e0] schedule at ffffffff8260f247
 #2 [ffffc90007d47110] rwsem_down_read_slowpath at ffffffff82619b9d
 #3 [ffffc90007d47268] __down_read_common at ffffffff8126492c
 #4 [ffffc90007d47318] down_read_nested at ffffffff81264da4
 #5 [ffffc90007d47368] btrfs_read_lock_root_node at ffffffffa02b5007 [btrfs]
 #6 [ffffc90007d47390] btrfs_search_slot at ffffffffa01a181b [btrfs]
 #7 [ffffc90007d47500] _raw_spin_unlock_irqrestore at ffffffff8262349c
 #8 [ffffc90007d47550] insert_with_overflow at ffffffffa01c74b9 [btrfs]
 #9 [ffffc90007d47600] btrfs_insert_dir_item at ffffffffa01c7bff [btrfs]
#10 [ffffc90007d47738] create_subvol at ffffffffa02a610b [btrfs]
#11 [ffffc90007d47970] btrfs_mksubvol at ffffffffa02a7829 [btrfs]
#12 [ffffc90007d47a08] _copy_from_user at ffffffff81b315ab
#13 [ffffc90007d47a60] btrfs_ioctl_snap_create_v2 at ffffffffa02a832f [btrfs]
#14 [ffffc90007d47ab8] btrfs_ioctl at ffffffffa02ae668 [btrfs]
#15 [ffffc90007d47d40] __x64_sys_ioctl at ffffffff8171c09f
#16 [ffffc90007d47f38] do_syscall_64 at ffffffff82602a70
#17 [ffffc90007d47f50] entry_SYSCALL_64_after_hwframe at ffffffff8280007c

So, it's contended here:

 #4 [ffffc90007d47318] down_read_nested at ffffffff81264da4
    ffffc90007d47320: [ffff88811a30bc20:btrfs_extent_buffer] 0000000000000000
    ffffc90007d47330: 0000000000000000 ffffc90007d47360
    ffffc90007d47340: __btrfs_tree_read_lock+40 [ffff88811a30bc20:btrfs_extent_buffer]
    ffffc90007d47350: ffffed102e902800 dffffc0000000000
    ffffc90007d47360: ffffc90007d47388 btrfs_read_lock_root_node+71

Thread 5063 is trying to read-lock this extent buffer:

crash> struct extent_buffer ffff88811a30bc20
struct extent_buffer {
  start = 1803321344,
...
  lock_owner = 5065,
...

The lock is owned by thread 5065.
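Before looking at thread 5065, here is the loop that thread 722 (frame #3
above) is spinning in, condensed from btrfs_chunk_alloc() in 5.13's
fs/btrfs/block-group.c. This is a simplified sketch; the real loop has
more exit conditions ("full" space_info, should_alloc_chunk(), etc.):

    bool wait_for_alloc;

    do {
            spin_lock(&space_info->lock);
            /* "full" and should-alloc checks elided */
            if (space_info->chunk_alloc) {
                    /*
                     * Someone else is allocating: drop the spinlock,
                     * take and release chunk_mutex to wait for them,
                     * then loop to recheck. If that someone is parked
                     * in wait_event() holding neither lock, this
                     * degenerates into a cond_resched() spin.
                     */
                    wait_for_alloc = true;
                    spin_unlock(&space_info->lock);
                    mutex_lock(&fs_info->chunk_mutex);
                    mutex_unlock(&fs_info->chunk_mutex);
            } else {
                    space_info->chunk_alloc = 1;
                    wait_for_alloc = false;
                    spin_unlock(&space_info->lock);
            }

            cond_resched();
    } while (wait_for_alloc);

And thread 5065, the lock owner, turns out to be stuck in exactly that
spin: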
crash> bt 5065
PID: 5065   TASK: ffff88812a7f0000  CPU: 1   COMMAND: "fsstress"
 #0 [ffffc90007d66ab0] __schedule at ffffffff8260da25
 #1 [ffffc90007d66bb8] preempt_schedule_common at ffffffff8260f635
 #2 [ffffc90007d66bd8] __cond_resched at ffffffff8260f68d
 #3 [ffffc90007d66be8] btrfs_chunk_alloc at ffffffffa03cd50a [btrfs]
 #4 [ffffc90007d66c80] find_free_extent at ffffffffa01adfa2 [btrfs]
 #5 [ffffc90007d66e40] btrfs_reserve_extent at ffffffffa01bf7d0 [btrfs]
 #6 [ffffc90007d66ed0] btrfs_alloc_tree_block at ffffffffa01c074f [btrfs]
 #7 [ffffc90007d67070] alloc_tree_block_no_bg_flush at ffffffffa018d9e5 [btrfs]
 #8 [ffffc90007d670d0] __btrfs_cow_block at ffffffffa0194fb1 [btrfs]
 #9 [ffffc90007d67220] btrfs_cow_block at ffffffffa01960b5 [btrfs]
#10 [ffffc90007d672a0] btrfs_search_slot at ffffffffa01a15c7 [btrfs]
#11 [ffffc90007d67410] btrfs_lookup_file_extent at ffffffffa01ca05e [btrfs]
#12 [ffffc90007d674a8] btrfs_drop_extents at ffffffffa0233449 [btrfs]
#13 [ffffc90007d67738] btrfs_replace_file_extents at ffffffffa023762d [btrfs]
#14 [ffffc90007d678d8] btrfs_clone at ffffffffa03d674f [btrfs]
#15 [ffffc90007d67b20] btrfs_extent_same_range at ffffffffa03d6d38 [btrfs]
#16 [ffffc90007d67b68] btrfs_remap_file_range at ffffffffa03d7b68 [btrfs]
#17 [ffffc90007d67c60] vfs_dedupe_file_range_one at ffffffff8179f58b
#18 [ffffc90007d67cd0] vfs_dedupe_file_range at ffffffff8179fadd
#19 [ffffc90007d67d40] __x64_sys_ioctl at ffffffff8171c344
#20 [ffffc90007d67f38] do_syscall_64 at ffffffff82602a70
#21 [ffffc90007d67f50] entry_SYSCALL_64_after_hwframe at ffffffff8280007c

Again! Thread 5065 is busy-looping in the same loop in
btrfs_chunk_alloc().

* How to fix it?

I'm not sure how we can fix this without breaking the original intention
of commit eafa4fd0ad06. We might need to set space_info->chunk_alloc = 0
while in the wait_event(); but then, we would also need to recheck
everything from the start of btrfs_chunk_alloc() (a rough sketch of this
idea follows). Or, we could commit the transaction before another chunk
allocation if its transaction_handle is being waited on?
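To make the first idea a bit more concrete, something along these lines
(completely untested; check_system_chunk() would also need to grow a way
to tell its callers to restart, which I denote with a hypothetical
-EAGAIN here, so this only illustrates the shape of the change):

    /* in the check_system_chunk() wait path */
    if (left < thresh) {
            spin_lock(&space_info->lock);
            space_info->chunk_alloc = 0;    /* let the reservers proceed */
            spin_unlock(&space_info->lock);

            wait_event(cur_trans->chunk_reserve_wait,
                       atomic64_read(&cur_trans->chunk_bytes_reserved) <=
                       min_needed);

            /* hypothetical: make the caller restart btrfs_chunk_alloc() */
            return -EAGAIN;
    }

Thanks,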