On 26 Apr 2022, at 16:18, Qian Cai wrote:

> On Mon, Apr 25, 2022 at 10:31:12AM -0400, Zi Yan wrote:
>> From: Zi Yan
>>
>> Hi David,
>>
>> This patchset tries to remove the MAX_ORDER-1 alignment requirement for CMA
>> and alloc_contig_range(). It prepares for my upcoming changes to make
>> MAX_ORDER adjustable at boot time[1]. It is on top of mmotm-2022-04-20-17-12.
>>
>> Changelog
>> ===
>> V11
>> ---
>> 1. Moved start_isolate_page_range()/undo_isolate_page_range() alignment
>>    change to a separate patch after the unmovable page check change and
>>    alloc_contig_range() change to avoid some unwanted memory
>>    hotplug/hotremove failures.
>> 2. Cleaned up has_unmovable_pages() in Patch 2.
>>
>> V10
>> ---
>> 1. Reverted back to the original outer_start, outer_end range for
>>    test_pages_isolated() and isolate_freepages_range() in Patch 3,
>>    otherwise isolation will fail if start in alloc_contig_range() is in
>>    the middle of a free page.
>>
>> V9
>> ---
>> 1. Limited has_unmovable_pages() check within a pageblock.
>> 2. Added a check to ensure page isolation is done within a single zone
>>    in isolate_single_pageblock().
>> 3. Fixed an off-by-one bug in isolate_single_pageblock().
>> 4. Fixed a NULL-dereferencing bug when the pages before to-be-isolated pageblock
>>    is not online in isolate_single_pageblock().
>>
>> V8
>> ---
>> 1. Cleaned up has_unmovable_pages() to remove page argument.
>>
>> V7
>> ---
>> 1. Added page validity check in isolate_single_pageblock() to avoid out
>>    of zone pages.
>> 2. Fixed a bug in split_free_page() to split and free pages in correct
>>    page order.
>>
>> V6
>> ---
>> 1. Resolved compilation error/warning reported by kernel test robot.
>> 2. Tried to solve the coding concerns from Christophe Leroy.
>> 3. Shortened lengthy lines (pointed out by Christoph Hellwig).
>>
>> V5
>> ---
>> 1. Moved isolation address alignment handling in start_isolate_page_range().
>> 2. Rewrote and simplified how alloc_contig_range() works at pageblock
>>    granularity (Patch 3). Only two pageblock migratetypes need to be saved and
>>    restored. start_isolate_page_range() might need to migrate pages in this
>>    version, but it prevents the caller from worrying about
>>    max(MAX_ORDER_NR_PAGES, pageblock_nr_pages) alignment after the page range
>>    is isolated.
>>
>> V4
>> ---
>> 1. Dropped two irrelevant patches on non-lru compound page handling, as
>>    it is not supported upstream.
>> 2. Renamed migratetype_has_fallback() to migratetype_is_mergeable().
>> 3. Always check whether two pageblocks can be merged in
>>    __free_one_page() when order is >= pageblock_order, as the case (not
>>    mergeable pageblocks are isolated, CMA, and HIGHATOMIC) becomes more common.
>> 4. Moving has_unmovable_pages() is now a separate patch.
>> 5. Removed MAX_ORDER-1 alignment requirement in the comment in virtio_mem code.
>>
>> Description
>> ===
>>
>> The MAX_ORDER - 1 alignment requirement comes from that alloc_contig_range()
>> isolates pageblocks to remove free memory from buddy allocator but isolating
>> only a subset of pageblocks within a page spanning across multiple pageblocks
>> causes free page accounting issues. Isolated page might not be put into the
>> right free list, since the code assumes the migratetype of the first pageblock
>> as the whole free page migratetype. This is based on the discussion at [2].
>>
>> To remove the requirement, this patchset:
>> 1. isolates pages at pageblock granularity instead of
>>    max(MAX_ORDER_NR_PAGES, pageblock_nr_pages);
>> 2. splits free pages across the specified range or migrates in-use pages
>>    across the specified range then splits the freed page to avoid free page
>>    accounting issues (it happens when multiple pageblocks within a single page
>>    have different migratetypes);
>> 3. only checks unmovable pages within the range instead of MAX_ORDER - 1 aligned
>>    range during isolation to avoid alloc_contig_range() failure when pageblocks
>>    within a MAX_ORDER - 1 aligned range are allocated separately.
>> 4. returns pages not in the range as it did before.
>>
>> One optimization might come later:
>> 1. make MIGRATE_ISOLATE a separate bit to be able to restore the original
>>    migratetypes when isolation fails in the middle of the range.
>>
>> Feel free to give comments and suggestions. Thanks.
>>
>> [1] https://lore.kernel.org/linux-mm/20210805190253.2795604-1-zi.yan@sent.com/
>> [2] https://lore.kernel.org/linux-mm/d19fb078-cb9b-f60f-e310-fdeea1b947d2@redhat.com/
>>
>> Zi Yan (6):
>>   mm: page_isolation: move has_unmovable_pages() to mm/page_isolation.c
>>   mm: page_isolation: check specified range for unmovable pages
>>   mm: make alloc_contig_range work at pageblock granularity
>>   mm: page_isolation: enable arbitrary range page isolation.
>>   mm: cma: use pageblock_order as the single alignment
>>   drivers: virtio_mem: use pageblock size as the minimum virtio_mem
>>     size.
>>
>>  drivers/virtio/virtio_mem.c    |   6 +-
>>  include/linux/cma.h            |   4 +-
>>  include/linux/mmzone.h         |   5 +-
>>  include/linux/page-isolation.h |   6 +-
>>  mm/internal.h                  |   6 +
>>  mm/memory_hotplug.c            |   3 +-
>>  mm/page_alloc.c                | 191 +++++------------
>>  mm/page_isolation.c            | 345 +++++++++++++++++++++++++++++++--
>>  8 files changed, 392 insertions(+), 174 deletions(-)
>
> Reverting this series fixed a deadlock during memory offline/online
> tests and then a crash.

Hi Qian,

Thanks for reporting the issue. Do you have a reproducer I can use to debug the code?

>
> INFO: task kmemleak:1027 blocked for more than 120 seconds.
> Not tainted 5.18.0-rc4-next-20220426-dirty #27
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:kmemleak state:D stack:27744 pid: 1027 ppid: 2 flags:0x00000008
> Call trace:
>  __switch_to
>  __schedule
>  schedule
>  percpu_rwsem_wait
>  __percpu_down_read
>  percpu_down_read.constprop.0
>  get_online_mems
>  kmemleak_scan
>  kmemleak_scan_thread
>  kthread
>  ret_from_fork
>
> Showing all locks held in the system:
> 1 lock held by rcu_tasks_kthre/11:
>  #0: ffffc1e2cefc17f0 (rcu_tasks.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp
> 1 lock held by rcu_tasks_rude_/12:
>  #0: ffffc1e2cefc1a90 (rcu_tasks_rude.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp
> 1 lock held by rcu_tasks_trace/13:
>  #0: ffffc1e2cefc1db0 (rcu_tasks_trace.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp
> 1 lock held by khungtaskd/824:
>  #0: ffffc1e2cefc2820 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks
> 2 locks held by kmemleak/1027:
>  #0: ffffc1e2cf1aa628 (scan_mutex){+.+.}-{3:3}, at: kmemleak_scan_thread
>  #1: ffffc1e2cf14e690 (mem_hotplug_lock){++++}-{0:0}, at: get_online_mems
> 2 locks held by cppc_fie/1805:
> 1 lock held by in:imklog/2822:
> 8 locks held by tee/3334:
>  #0: ffff0816d65c9438 (sb_writers#6){.+.+}-{0:0}, at: vfs_write
>  #1: ffff40025438be88 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter
>  #2: ffff4000c8261eb0 (kn->active#298){.+.+}-{0:0}, at: kernfs_fop_write_iter
>  #3: ffffc1e2d0013f68 (device_hotplug_lock){+.+.}-{3:3}, at: online_store
>  #4: ffff0800cd8bb998 (&dev->mutex){....}-{3:3}, at: device_offline
>  #5: ffffc1e2ceed3750 (cpu_hotplug_lock){++++}-{0:0}, at: cpus_read_lock
>  #6: ffffc1e2cf14e690 (mem_hotplug_lock){++++}-{0:0}, at: offline_pages
>  #7: ffffc1e2cf13bf68 (pcp_batch_high_lock){+.+.}-{3:3}, at: zone_pcp_disable
>  __zone_set_pageset_high_and_batch at mm/page_alloc.c:7005
>  (inlined by) zone_pcp_disable at mm/page_alloc.c:9286
>
> Later, running some kernel compilation workloads could trigger a crash.
>
> Unable to handle kernel paging request at virtual address fffffbfffe000030
> KASAN: maybe wild-memory-access in range [0x0003dffff0000180-0x0003dffff0000187]
> Mem abort info:
>  ESR = 0x96000006
>  EC = 0x25: DABT (current EL), IL = 32 bits
>  SET = 0, FnV = 0
>  EA = 0, S1PTW = 0
>  FSC = 0x06: level 2 translation fault
> Data abort info:
>  ISV = 0, ISS = 0x00000006
>  CM = 0, WnR = 0
> swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000817545fd000
> [fffffbfffe000030] pgd=00000817581e9003, p4d=00000817581e9003, pud=00000817581ea003, pmd=0000000000000000
> Internal error: Oops: 96000006 [#1] PREEMPT SMP
> Modules linked in: bridge stp llc cdc_ether usbnet ipmi_devintf ipmi_msghandler cppc_cpufreq fuse ip_tables x_tables ipv6 btrfs blake2b_generic libcrc32c xor xor_neon raid6_pq zstd_compress dm_mod nouveau drm_ttm_helper ttm crct10dif_ce mlx5_core drm_display_helper drm_kms_helper nvme mpt3sas xhci_pci nvme_core drm raid_class xhci_pci_renesas
> CPU: 147 PID: 3334 Comm: tee Not tainted 5.18.0-rc4-next-20220426-dirty #27
> pstate: 10400009 (nzcV daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> pc : isolate_single_pageblock
> lr : isolate_single_pageblock
> sp : ffff80003e767500
> x29: ffff80003e767500 x28: 0000000000000000 x27: ffff783c59963b1f
> x26: dfff800000000000 x25: ffffc1e2ccb1d000 x24: ffffc1e2ccb1d8f8
> x23: 00000000803bfe00 x22: ffffc1e2cee39098 x21: 0000000000000020
> x20: 00000000803c0000 x19: fffffbfffe000000 x18: ffffc1e2cee37d1c
> x17: 0000000000000000 x16: 1fffe8004a86f14c x15: 1fffe806c89e154a
> x14: 1fffe8004a86f11c x13: 0000000000000004 x12: ffff783c5c455e6d
> x11: 1ffff83c5c455e6c x10: ffff783c5c455e6c x9 : dfff800000000000
> x8 : ffffc1e2e22af363 x7 : 0000000000000001 x6 : 0000000000000003
> x5 : ffffc1e2e22af360 x4 : ffff783c5c455e6c x3 : ffff700007cece90
> x2 : 0000000000000003 x1 : 0000000000000000 x0 : fffffbfffe000030
> Call trace:
> Call trace:
>  isolate_single_pageblock
>  PageBuddy at ./include/linux/page-flags.h:969 (discriminator 3)
>  (inlined by) isolate_single_pageblock at mm/page_isolation.c:414 (discriminator 3)
>  start_isolate_page_range
>  offline_pages
>  memory_subsys_offline
>  device_offline
>  online_store
>  dev_attr_store
>  sysfs_kf_write
>  kernfs_fop_write_iter
>  new_sync_write
>  vfs_write
>  ksys_write
>  __arm64_sys_write
>  invoke_syscall
>  el0_svc_common.constprop.0
>  do_el0_svc
>  el0_svc
>  el0t_64_sync_handler
>  el0t_64_sync
> Code: 38fa6821 7100003f 7a411041 54000dca (b9403260)
> ---[ end trace 0000000000000000 ]---
> Kernel panic - not syncing: Oops: Fatal exception
> SMP: stopping secondary CPUs
> Kernel Offset: 0x41e2c0720000 from 0xffff800008000000
> PHYS_OFFSET: 0x80000000
> CPU features: 0x000,0021700d,19801c82
> Memory Limit: none

--
Best Regards,
Yan, Zi