From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753721AbcBCNSC (ORCPT ); Wed, 3 Feb 2016 08:18:02 -0500 Received: from mx1.redhat.com ([209.132.183.28]:49819 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750750AbcBCNR6 (ORCPT ); Wed, 3 Feb 2016 08:17:58 -0500 Date: Wed, 3 Feb 2016 14:17:49 +0100 From: Mateusz Guzik To: Johannes Weiner Cc: Andrew Morton , Vladimir Davydov , Michal Hocko , linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, kernel-team@fb.com Subject: Re: [PATCH 1/3] mm: migrate: do not touch page->mem_cgroup of live pages Message-ID: <20160203131748.GB15520@mguzik> References: <1454109573-29235-1-git-send-email-hannes@cmpxchg.org> <1454109573-29235-2-git-send-email-hannes@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <1454109573-29235-2-git-send-email-hannes@cmpxchg.org> User-Agent: Mutt/1.5.23.1 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jan 29, 2016 at 06:19:31PM -0500, Johannes Weiner wrote: > Changing a page's memcg association complicates dealing with the page, > so we want to limit this as much as possible. Page migration e.g. does > not have to do that. Just like page cache replacement, it can forcibly > charge a replacement page, and then uncharge the old page when it gets > freed. Temporarily overcharging the cgroup by a single page is not an > issue in practice, and charging is so cheap nowadays that this is much > preferrable to the headache of messing with live pages. > > The only place that still changes the page->mem_cgroup binding of live > pages is when pages move along with a task to another cgroup. But that > path isolates the page from the LRU, takes the page lock, and the move > lock (lock_page_memcg()). That means page->mem_cgroup is always stable > in callers that have the page isolated from the LRU or locked. Lighter > unlocked paths, like writeback accounting, can use lock_page_memcg(). > > Signed-off-by: Johannes Weiner [..] > @@ -372,12 +373,13 @@ int migrate_page_move_mapping(struct address_space *mapping, > * Now we know that no one else is looking at the page: > * no turning back from here. > */ > - set_page_memcg(newpage, page_memcg(page)); > newpage->index = page->index; > newpage->mapping = page->mapping; > if (PageSwapBacked(page)) > SetPageSwapBacked(newpage); > > + mem_cgroup_migrate(page, newpage); > + > get_page(newpage); /* add cache reference */ > if (PageSwapCache(page)) { > SetPageSwapCache(newpage); > @@ -457,9 +459,11 @@ int migrate_huge_page_move_mapping(struct address_space *mapping, > return -EAGAIN; > } > > - set_page_memcg(newpage, page_memcg(page)); > newpage->index = page->index; > newpage->mapping = page->mapping; > + > + mem_cgroup_migrate(page, newpage); > + > get_page(newpage); > > radix_tree_replace_slot(pslot, newpage); I ran trinity on recent linux-next and got the lockdep splat below and if I read it right, this is the culprit. In particular, mem_cgroup_migrate was put in an area covered by spin_lock_irq(&mapping->tree_lock), but stuff it calls enables and disables interrupts on its own. [ 105.084225] ================================= [ 105.096026] [ INFO: inconsistent lock state ] [ 105.107896] 4.5.0-rc2-next-20160203dupa+ #121 Not tainted [ 105.119904] --------------------------------- [ 105.131667] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage. [ 105.144052] trinity-c3/1600 [HC0[0]:SC0[0]:HE1:SE1] takes: [ 105.155944] (&(&mapping->tree_lock)->rlock){?.-.-.}, at: [] migrate_page_move_mapping+0x71/0x4f0 [ 105.179744] {IN-HARDIRQ-W} state was registered at: [ 105.191556] [] __lock_acquire+0x79f/0xbf0 [ 105.203521] [] lock_acquire+0xca/0x1c0 [ 105.215447] [] _raw_spin_lock_irqsave+0x55/0x90 [ 105.227537] [] test_clear_page_writeback+0x67/0x2a0 [ 105.239612] [] end_page_writeback+0x1f/0xa0 [ 105.251622] [] end_buffer_async_write+0xd6/0x1b0 [ 105.263649] [] end_bio_bh_io_sync+0x28/0x40 [ 105.276160] [] bio_endio+0x40/0x60 [ 105.288599] [] dec_pending+0x15d/0x320 [ 105.301083] [] clone_endio+0x5b/0xe0 [ 105.313418] [] bio_endio+0x40/0x60 [ 105.325763] [] blk_update_request+0xb2/0x3b0 [ 105.338236] [] blk_mq_end_request+0x1a/0x70 [ 105.350626] [] virtblk_request_done+0x3f/0x70 [ 105.363303] [] __blk_mq_complete_request_remote+0x13/0x20 [ 105.387289] [] flush_smp_call_function_queue+0x7b/0x150 [ 105.399903] [] generic_smp_call_function_single_interrupt+0x13/0x60 [ 105.423568] [] smp_call_function_single_interrupt+0x27/0x40 [ 105.446971] [] call_function_single_interrupt+0x96/0xa0 [ 105.459109] [] kmem_cache_alloc+0x27e/0x2e0 [ 105.471141] [] mempool_alloc_slab+0x1c/0x20 [ 105.483200] [] mempool_alloc+0x79/0x1b0 [ 105.495195] [] bio_alloc_bioset+0x146/0x220 [ 105.507268] [] __split_and_process_bio+0x253/0x4f0 [ 105.519505] [] dm_make_request+0x7a/0x110 [ 105.531563] [] generic_make_request+0x166/0x2c0 [ 105.544062] [] submit_bio+0x77/0x150 [ 105.556195] [] submit_bh_wbc+0x12f/0x160 [ 105.568461] [] __block_write_full_page.constprop.41+0x138/0x3b0 [ 105.591728] [] block_write_full_page+0xec/0x110 [ 105.603869] [] blkdev_writepage+0x18/0x20 [ 105.616272] [] __writepage+0x16/0x50 [ 105.628206] [] write_cache_pages+0x2c0/0x620 [ 105.640298] [] generic_writepages+0x54/0x80 [ 105.652244] [] do_writepages+0x21/0x40 [ 105.664204] [] __filemap_fdatawrite_range+0xc6/0x100 [ 105.676241] [] filemap_write_and_wait+0x2f/0x60 [ 105.688339] [] __sync_blockdev+0x1f/0x40 [ 105.700350] [] sync_blockdev+0x13/0x20 [ 105.712352] [] jbd2_journal_recover+0x119/0x130 [ 105.724532] [] jbd2_journal_load+0xe0/0x390 [ 105.736533] [] ext4_load_journal+0x5ef/0x6b8 [ 105.748581] [] ext4_fill_super+0x1ad3/0x2a10 [ 105.760597] [] mount_bdev+0x18c/0x1c0 [ 105.772574] [] ext4_mount+0x15/0x20 [ 105.784489] [] mount_fs+0x39/0x170 [ 105.796406] [] vfs_kern_mount+0x6b/0x150 [ 105.808386] [] do_mount+0x24d/0xed0 [ 105.820355] [] SyS_mount+0x83/0xd0 [ 105.832317] [] entry_SYSCALL_64_fastpath+0x1f/0xbd [ 105.844390] irq event stamp: 341784 [ 105.856114] hardirqs last enabled at (341783): [] flat_send_IPI_mask+0x8a/0xc0 [ 105.879644] hardirqs last disabled at (341784): [] _raw_spin_lock_irq+0x1f/0x80 [ 105.903369] softirqs last enabled at (341744): [] __do_softirq+0x343/0x480 [ 105.926856] softirqs last disabled at (341739): [] irq_exit+0x106/0x120 [ 105.950499] [ 105.950499] other info that might help us debug this: [ 105.973803] Possible unsafe locking scenario: [ 105.973803] [ 105.997202] CPU0 [ 106.008754] ---- [ 106.020265] lock(&(&mapping->tree_lock)->rlock); [ 106.032196] [ 106.043832] lock(&(&mapping->tree_lock)->rlock); [ 106.055891] [ 106.055891] *** DEADLOCK *** [ 106.055891] [ 106.090289] 2 locks held by trinity-c3/1600: [ 106.102049] #0: (&mm->mmap_sem){++++++}, at: [] __do_page_fault+0x15f/0x470 [ 106.125631] #1: (&(&mapping->tree_lock)->rlock){?.-.-.}, at: [] migrate_page_move_mapping+0x71/0x4f0 [ 106.149777] [ 106.149777] stack backtrace: [ 106.172646] CPU: 3 PID: 1600 Comm: trinity-c3 Not tainted 4.5.0-rc2-next-20160203dupa+ #121 [ 106.196137] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 106.208200] 0000000000000086 00000000cf42937f ffff88004bdef630 ffffffff814acb6d [ 106.231859] ffff88004e0e0000 ffffffff82f17fb0 ffff88004bdef680 ffffffff811a47e9 [ 106.255250] 0000000000000000 ffff880000000001 ffff880000000001 0000000000000000 [ 106.278647] Call Trace: [ 106.290201] [] dump_stack+0x85/0xc8 [ 106.302013] [] print_usage_bug+0x1eb/0x1fc [ 106.314144] [] ? print_shortest_lock_dependencies+0x1d0/0x1d0 [ 106.337708] [] mark_lock+0x20d/0x290 [ 106.349493] [] mark_held_locks+0x71/0x90 [ 106.361462] [] ? _raw_spin_unlock_irq+0x2c/0x40 [ 106.373469] [] trace_hardirqs_on_caller+0xa9/0x1c0 [ 106.385466] [] trace_hardirqs_on+0xd/0x10 [ 106.397421] [] _raw_spin_unlock_irq+0x2c/0x40 [ 106.409417] [] commit_charge+0xbe/0x390 [ 106.421310] [] mem_cgroup_migrate+0x135/0x360 [ 106.433352] [] migrate_page_move_mapping+0x132/0x4f0 [ 106.445369] [] migrate_page+0x2b/0x50 [ 106.457301] [] buffer_migrate_page+0x10a/0x150 [ 106.469260] [] move_to_new_page+0x93/0x270 [ 106.481209] [] ? try_to_unmap+0xa7/0x170 [ 106.493094] [] ? page_remove_rmap+0x2a0/0x2a0 [ 106.505052] [] ? __hugepage_set_anon_rmap+0x80/0x80 [ 106.517130] [] migrate_pages+0x846/0xac0 [ 106.528993] [] ? __reset_isolation_suitable+0x120/0x120 [ 106.541061] [] ? isolate_freepages_block+0x4e0/0x4e0 [ 106.553068] [] compact_zone+0x33d/0xa80 [ 106.565026] [] compact_zone_order+0x7b/0xc0 [ 106.576944] [] try_to_compact_pages+0x13a/0x2e0 [ 106.588948] [] __alloc_pages_direct_compact+0x3b/0xf9 [ 106.600995] [] __alloc_pages_slowpath.constprop.87+0x2e5/0x886 [ 106.624383] [] __alloc_pages_nodemask+0x456/0x460 [ 106.636380] [] alloc_pages_vma+0x28b/0x2d0 [ 106.648440] [] do_huge_pmd_anonymous_page+0x13e/0x540 [ 106.660497] [] handle_mm_fault+0x7e4/0x980 [ 106.672585] [] ? handle_mm_fault+0x59/0x980 [ 106.684595] [] __do_page_fault+0x1cd/0x470 [ 106.696524] [] trace_do_page_fault+0x6e/0x250 [ 106.708477] [] do_async_page_fault+0x1a/0xb0 [ 106.720407] [] async_page_fault+0x28/0x30 -- Mateusz Guzik From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-io0-f177.google.com (mail-io0-f177.google.com [209.85.223.177]) by kanga.kvack.org (Postfix) with ESMTP id 7EBE7828E6 for ; Wed, 3 Feb 2016 08:18:00 -0500 (EST) Received: by mail-io0-f177.google.com with SMTP id g73so54550904ioe.3 for ; Wed, 03 Feb 2016 05:18:00 -0800 (PST) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id c1si12329193igx.68.2016.02.03.05.17.59 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 03 Feb 2016 05:17:59 -0800 (PST) Date: Wed, 3 Feb 2016 14:17:49 +0100 From: Mateusz Guzik Subject: Re: [PATCH 1/3] mm: migrate: do not touch page->mem_cgroup of live pages Message-ID: <20160203131748.GB15520@mguzik> References: <1454109573-29235-1-git-send-email-hannes@cmpxchg.org> <1454109573-29235-2-git-send-email-hannes@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <1454109573-29235-2-git-send-email-hannes@cmpxchg.org> Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: Andrew Morton , Vladimir Davydov , Michal Hocko , linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, kernel-team@fb.com On Fri, Jan 29, 2016 at 06:19:31PM -0500, Johannes Weiner wrote: > Changing a page's memcg association complicates dealing with the page, > so we want to limit this as much as possible. Page migration e.g. does > not have to do that. Just like page cache replacement, it can forcibly > charge a replacement page, and then uncharge the old page when it gets > freed. Temporarily overcharging the cgroup by a single page is not an > issue in practice, and charging is so cheap nowadays that this is much > preferrable to the headache of messing with live pages. > > The only place that still changes the page->mem_cgroup binding of live > pages is when pages move along with a task to another cgroup. But that > path isolates the page from the LRU, takes the page lock, and the move > lock (lock_page_memcg()). That means page->mem_cgroup is always stable > in callers that have the page isolated from the LRU or locked. Lighter > unlocked paths, like writeback accounting, can use lock_page_memcg(). > > Signed-off-by: Johannes Weiner [..] > @@ -372,12 +373,13 @@ int migrate_page_move_mapping(struct address_space *mapping, > * Now we know that no one else is looking at the page: > * no turning back from here. > */ > - set_page_memcg(newpage, page_memcg(page)); > newpage->index = page->index; > newpage->mapping = page->mapping; > if (PageSwapBacked(page)) > SetPageSwapBacked(newpage); > > + mem_cgroup_migrate(page, newpage); > + > get_page(newpage); /* add cache reference */ > if (PageSwapCache(page)) { > SetPageSwapCache(newpage); > @@ -457,9 +459,11 @@ int migrate_huge_page_move_mapping(struct address_space *mapping, > return -EAGAIN; > } > > - set_page_memcg(newpage, page_memcg(page)); > newpage->index = page->index; > newpage->mapping = page->mapping; > + > + mem_cgroup_migrate(page, newpage); > + > get_page(newpage); > > radix_tree_replace_slot(pslot, newpage); I ran trinity on recent linux-next and got the lockdep splat below and if I read it right, this is the culprit. In particular, mem_cgroup_migrate was put in an area covered by spin_lock_irq(&mapping->tree_lock), but stuff it calls enables and disables interrupts on its own. [ 105.084225] ================================= [ 105.096026] [ INFO: inconsistent lock state ] [ 105.107896] 4.5.0-rc2-next-20160203dupa+ #121 Not tainted [ 105.119904] --------------------------------- [ 105.131667] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage. [ 105.144052] trinity-c3/1600 [HC0[0]:SC0[0]:HE1:SE1] takes: [ 105.155944] (&(&mapping->tree_lock)->rlock){?.-.-.}, at: [] migrate_page_move_mapping+0x71/0x4f0 [ 105.179744] {IN-HARDIRQ-W} state was registered at: [ 105.191556] [] __lock_acquire+0x79f/0xbf0 [ 105.203521] [] lock_acquire+0xca/0x1c0 [ 105.215447] [] _raw_spin_lock_irqsave+0x55/0x90 [ 105.227537] [] test_clear_page_writeback+0x67/0x2a0 [ 105.239612] [] end_page_writeback+0x1f/0xa0 [ 105.251622] [] end_buffer_async_write+0xd6/0x1b0 [ 105.263649] [] end_bio_bh_io_sync+0x28/0x40 [ 105.276160] [] bio_endio+0x40/0x60 [ 105.288599] [] dec_pending+0x15d/0x320 [ 105.301083] [] clone_endio+0x5b/0xe0 [ 105.313418] [] bio_endio+0x40/0x60 [ 105.325763] [] blk_update_request+0xb2/0x3b0 [ 105.338236] [] blk_mq_end_request+0x1a/0x70 [ 105.350626] [] virtblk_request_done+0x3f/0x70 [ 105.363303] [] __blk_mq_complete_request_remote+0x13/0x20 [ 105.387289] [] flush_smp_call_function_queue+0x7b/0x150 [ 105.399903] [] generic_smp_call_function_single_interrupt+0x13/0x60 [ 105.423568] [] smp_call_function_single_interrupt+0x27/0x40 [ 105.446971] [] call_function_single_interrupt+0x96/0xa0 [ 105.459109] [] kmem_cache_alloc+0x27e/0x2e0 [ 105.471141] [] mempool_alloc_slab+0x1c/0x20 [ 105.483200] [] mempool_alloc+0x79/0x1b0 [ 105.495195] [] bio_alloc_bioset+0x146/0x220 [ 105.507268] [] __split_and_process_bio+0x253/0x4f0 [ 105.519505] [] dm_make_request+0x7a/0x110 [ 105.531563] [] generic_make_request+0x166/0x2c0 [ 105.544062] [] submit_bio+0x77/0x150 [ 105.556195] [] submit_bh_wbc+0x12f/0x160 [ 105.568461] [] __block_write_full_page.constprop.41+0x138/0x3b0 [ 105.591728] [] block_write_full_page+0xec/0x110 [ 105.603869] [] blkdev_writepage+0x18/0x20 [ 105.616272] [] __writepage+0x16/0x50 [ 105.628206] [] write_cache_pages+0x2c0/0x620 [ 105.640298] [] generic_writepages+0x54/0x80 [ 105.652244] [] do_writepages+0x21/0x40 [ 105.664204] [] __filemap_fdatawrite_range+0xc6/0x100 [ 105.676241] [] filemap_write_and_wait+0x2f/0x60 [ 105.688339] [] __sync_blockdev+0x1f/0x40 [ 105.700350] [] sync_blockdev+0x13/0x20 [ 105.712352] [] jbd2_journal_recover+0x119/0x130 [ 105.724532] [] jbd2_journal_load+0xe0/0x390 [ 105.736533] [] ext4_load_journal+0x5ef/0x6b8 [ 105.748581] [] ext4_fill_super+0x1ad3/0x2a10 [ 105.760597] [] mount_bdev+0x18c/0x1c0 [ 105.772574] [] ext4_mount+0x15/0x20 [ 105.784489] [] mount_fs+0x39/0x170 [ 105.796406] [] vfs_kern_mount+0x6b/0x150 [ 105.808386] [] do_mount+0x24d/0xed0 [ 105.820355] [] SyS_mount+0x83/0xd0 [ 105.832317] [] entry_SYSCALL_64_fastpath+0x1f/0xbd [ 105.844390] irq event stamp: 341784 [ 105.856114] hardirqs last enabled at (341783): [] flat_send_IPI_mask+0x8a/0xc0 [ 105.879644] hardirqs last disabled at (341784): [] _raw_spin_lock_irq+0x1f/0x80 [ 105.903369] softirqs last enabled at (341744): [] __do_softirq+0x343/0x480 [ 105.926856] softirqs last disabled at (341739): [] irq_exit+0x106/0x120 [ 105.950499] [ 105.950499] other info that might help us debug this: [ 105.973803] Possible unsafe locking scenario: [ 105.973803] [ 105.997202] CPU0 [ 106.008754] ---- [ 106.020265] lock(&(&mapping->tree_lock)->rlock); [ 106.032196] [ 106.043832] lock(&(&mapping->tree_lock)->rlock); [ 106.055891] [ 106.055891] *** DEADLOCK *** [ 106.055891] [ 106.090289] 2 locks held by trinity-c3/1600: [ 106.102049] #0: (&mm->mmap_sem){++++++}, at: [] __do_page_fault+0x15f/0x470 [ 106.125631] #1: (&(&mapping->tree_lock)->rlock){?.-.-.}, at: [] migrate_page_move_mapping+0x71/0x4f0 [ 106.149777] [ 106.149777] stack backtrace: [ 106.172646] CPU: 3 PID: 1600 Comm: trinity-c3 Not tainted 4.5.0-rc2-next-20160203dupa+ #121 [ 106.196137] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 106.208200] 0000000000000086 00000000cf42937f ffff88004bdef630 ffffffff814acb6d [ 106.231859] ffff88004e0e0000 ffffffff82f17fb0 ffff88004bdef680 ffffffff811a47e9 [ 106.255250] 0000000000000000 ffff880000000001 ffff880000000001 0000000000000000 [ 106.278647] Call Trace: [ 106.290201] [] dump_stack+0x85/0xc8 [ 106.302013] [] print_usage_bug+0x1eb/0x1fc [ 106.314144] [] ? print_shortest_lock_dependencies+0x1d0/0x1d0 [ 106.337708] [] mark_lock+0x20d/0x290 [ 106.349493] [] mark_held_locks+0x71/0x90 [ 106.361462] [] ? _raw_spin_unlock_irq+0x2c/0x40 [ 106.373469] [] trace_hardirqs_on_caller+0xa9/0x1c0 [ 106.385466] [] trace_hardirqs_on+0xd/0x10 [ 106.397421] [] _raw_spin_unlock_irq+0x2c/0x40 [ 106.409417] [] commit_charge+0xbe/0x390 [ 106.421310] [] mem_cgroup_migrate+0x135/0x360 [ 106.433352] [] migrate_page_move_mapping+0x132/0x4f0 [ 106.445369] [] migrate_page+0x2b/0x50 [ 106.457301] [] buffer_migrate_page+0x10a/0x150 [ 106.469260] [] move_to_new_page+0x93/0x270 [ 106.481209] [] ? try_to_unmap+0xa7/0x170 [ 106.493094] [] ? page_remove_rmap+0x2a0/0x2a0 [ 106.505052] [] ? __hugepage_set_anon_rmap+0x80/0x80 [ 106.517130] [] migrate_pages+0x846/0xac0 [ 106.528993] [] ? __reset_isolation_suitable+0x120/0x120 [ 106.541061] [] ? isolate_freepages_block+0x4e0/0x4e0 [ 106.553068] [] compact_zone+0x33d/0xa80 [ 106.565026] [] compact_zone_order+0x7b/0xc0 [ 106.576944] [] try_to_compact_pages+0x13a/0x2e0 [ 106.588948] [] __alloc_pages_direct_compact+0x3b/0xf9 [ 106.600995] [] __alloc_pages_slowpath.constprop.87+0x2e5/0x886 [ 106.624383] [] __alloc_pages_nodemask+0x456/0x460 [ 106.636380] [] alloc_pages_vma+0x28b/0x2d0 [ 106.648440] [] do_huge_pmd_anonymous_page+0x13e/0x540 [ 106.660497] [] handle_mm_fault+0x7e4/0x980 [ 106.672585] [] ? handle_mm_fault+0x59/0x980 [ 106.684595] [] __do_page_fault+0x1cd/0x470 [ 106.696524] [] trace_do_page_fault+0x6e/0x250 [ 106.708477] [] do_async_page_fault+0x1a/0xb0 [ 106.720407] [] async_page_fault+0x28/0x30 -- Mateusz Guzik -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mateusz Guzik Subject: Re: [PATCH 1/3] mm: migrate: do not touch page->mem_cgroup of live pages Date: Wed, 3 Feb 2016 14:17:49 +0100 Message-ID: <20160203131748.GB15520@mguzik> References: <1454109573-29235-1-git-send-email-hannes@cmpxchg.org> <1454109573-29235-2-git-send-email-hannes@cmpxchg.org> Mime-Version: 1.0 Return-path: Content-Disposition: inline In-Reply-To: <1454109573-29235-2-git-send-email-hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> Sender: cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Johannes Weiner Cc: Andrew Morton , Vladimir Davydov , Michal Hocko , linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, kernel-team-b10kYP2dOMg@public.gmane.org On Fri, Jan 29, 2016 at 06:19:31PM -0500, Johannes Weiner wrote: > Changing a page's memcg association complicates dealing with the page, > so we want to limit this as much as possible. Page migration e.g. does > not have to do that. Just like page cache replacement, it can forcibly > charge a replacement page, and then uncharge the old page when it gets > freed. Temporarily overcharging the cgroup by a single page is not an > issue in practice, and charging is so cheap nowadays that this is much > preferrable to the headache of messing with live pages. > > The only place that still changes the page->mem_cgroup binding of live > pages is when pages move along with a task to another cgroup. But that > path isolates the page from the LRU, takes the page lock, and the move > lock (lock_page_memcg()). That means page->mem_cgroup is always stable > in callers that have the page isolated from the LRU or locked. Lighter > unlocked paths, like writeback accounting, can use lock_page_memcg(). > > Signed-off-by: Johannes Weiner [..] > @@ -372,12 +373,13 @@ int migrate_page_move_mapping(struct address_space *mapping, > * Now we know that no one else is looking at the page: > * no turning back from here. > */ > - set_page_memcg(newpage, page_memcg(page)); > newpage->index = page->index; > newpage->mapping = page->mapping; > if (PageSwapBacked(page)) > SetPageSwapBacked(newpage); > > + mem_cgroup_migrate(page, newpage); > + > get_page(newpage); /* add cache reference */ > if (PageSwapCache(page)) { > SetPageSwapCache(newpage); > @@ -457,9 +459,11 @@ int migrate_huge_page_move_mapping(struct address_space *mapping, > return -EAGAIN; > } > > - set_page_memcg(newpage, page_memcg(page)); > newpage->index = page->index; > newpage->mapping = page->mapping; > + > + mem_cgroup_migrate(page, newpage); > + > get_page(newpage); > > radix_tree_replace_slot(pslot, newpage); I ran trinity on recent linux-next and got the lockdep splat below and if I read it right, this is the culprit. In particular, mem_cgroup_migrate was put in an area covered by spin_lock_irq(&mapping->tree_lock), but stuff it calls enables and disables interrupts on its own. [ 105.084225] ================================= [ 105.096026] [ INFO: inconsistent lock state ] [ 105.107896] 4.5.0-rc2-next-20160203dupa+ #121 Not tainted [ 105.119904] --------------------------------- [ 105.131667] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage. [ 105.144052] trinity-c3/1600 [HC0[0]:SC0[0]:HE1:SE1] takes: [ 105.155944] (&(&mapping->tree_lock)->rlock){?.-.-.}, at: [] migrate_page_move_mapping+0x71/0x4f0 [ 105.179744] {IN-HARDIRQ-W} state was registered at: [ 105.191556] [] __lock_acquire+0x79f/0xbf0 [ 105.203521] [] lock_acquire+0xca/0x1c0 [ 105.215447] [] _raw_spin_lock_irqsave+0x55/0x90 [ 105.227537] [] test_clear_page_writeback+0x67/0x2a0 [ 105.239612] [] end_page_writeback+0x1f/0xa0 [ 105.251622] [] end_buffer_async_write+0xd6/0x1b0 [ 105.263649] [] end_bio_bh_io_sync+0x28/0x40 [ 105.276160] [] bio_endio+0x40/0x60 [ 105.288599] [] dec_pending+0x15d/0x320 [ 105.301083] [] clone_endio+0x5b/0xe0 [ 105.313418] [] bio_endio+0x40/0x60 [ 105.325763] [] blk_update_request+0xb2/0x3b0 [ 105.338236] [] blk_mq_end_request+0x1a/0x70 [ 105.350626] [] virtblk_request_done+0x3f/0x70 [ 105.363303] [] __blk_mq_complete_request_remote+0x13/0x20 [ 105.387289] [] flush_smp_call_function_queue+0x7b/0x150 [ 105.399903] [] generic_smp_call_function_single_interrupt+0x13/0x60 [ 105.423568] [] smp_call_function_single_interrupt+0x27/0x40 [ 105.446971] [] call_function_single_interrupt+0x96/0xa0 [ 105.459109] [] kmem_cache_alloc+0x27e/0x2e0 [ 105.471141] [] mempool_alloc_slab+0x1c/0x20 [ 105.483200] [] mempool_alloc+0x79/0x1b0 [ 105.495195] [] bio_alloc_bioset+0x146/0x220 [ 105.507268] [] __split_and_process_bio+0x253/0x4f0 [ 105.519505] [] dm_make_request+0x7a/0x110 [ 105.531563] [] generic_make_request+0x166/0x2c0 [ 105.544062] [] submit_bio+0x77/0x150 [ 105.556195] [] submit_bh_wbc+0x12f/0x160 [ 105.568461] [] __block_write_full_page.constprop.41+0x138/0x3b0 [ 105.591728] [] block_write_full_page+0xec/0x110 [ 105.603869] [] blkdev_writepage+0x18/0x20 [ 105.616272] [] __writepage+0x16/0x50 [ 105.628206] [] write_cache_pages+0x2c0/0x620 [ 105.640298] [] generic_writepages+0x54/0x80 [ 105.652244] [] do_writepages+0x21/0x40 [ 105.664204] [] __filemap_fdatawrite_range+0xc6/0x100 [ 105.676241] [] filemap_write_and_wait+0x2f/0x60 [ 105.688339] [] __sync_blockdev+0x1f/0x40 [ 105.700350] [] sync_blockdev+0x13/0x20 [ 105.712352] [] jbd2_journal_recover+0x119/0x130 [ 105.724532] [] jbd2_journal_load+0xe0/0x390 [ 105.736533] [] ext4_load_journal+0x5ef/0x6b8 [ 105.748581] [] ext4_fill_super+0x1ad3/0x2a10 [ 105.760597] [] mount_bdev+0x18c/0x1c0 [ 105.772574] [] ext4_mount+0x15/0x20 [ 105.784489] [] mount_fs+0x39/0x170 [ 105.796406] [] vfs_kern_mount+0x6b/0x150 [ 105.808386] [] do_mount+0x24d/0xed0 [ 105.820355] [] SyS_mount+0x83/0xd0 [ 105.832317] [] entry_SYSCALL_64_fastpath+0x1f/0xbd [ 105.844390] irq event stamp: 341784 [ 105.856114] hardirqs last enabled at (341783): [] flat_send_IPI_mask+0x8a/0xc0 [ 105.879644] hardirqs last disabled at (341784): [] _raw_spin_lock_irq+0x1f/0x80 [ 105.903369] softirqs last enabled at (341744): [] __do_softirq+0x343/0x480 [ 105.926856] softirqs last disabled at (341739): [] irq_exit+0x106/0x120 [ 105.950499] [ 105.950499] other info that might help us debug this: [ 105.973803] Possible unsafe locking scenario: [ 105.973803] [ 105.997202] CPU0 [ 106.008754] ---- [ 106.020265] lock(&(&mapping->tree_lock)->rlock); [ 106.032196] [ 106.043832] lock(&(&mapping->tree_lock)->rlock); [ 106.055891] [ 106.055891] *** DEADLOCK *** [ 106.055891] [ 106.090289] 2 locks held by trinity-c3/1600: [ 106.102049] #0: (&mm->mmap_sem){++++++}, at: [] __do_page_fault+0x15f/0x470 [ 106.125631] #1: (&(&mapping->tree_lock)->rlock){?.-.-.}, at: [] migrate_page_move_mapping+0x71/0x4f0 [ 106.149777] [ 106.149777] stack backtrace: [ 106.172646] CPU: 3 PID: 1600 Comm: trinity-c3 Not tainted 4.5.0-rc2-next-20160203dupa+ #121 [ 106.196137] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 106.208200] 0000000000000086 00000000cf42937f ffff88004bdef630 ffffffff814acb6d [ 106.231859] ffff88004e0e0000 ffffffff82f17fb0 ffff88004bdef680 ffffffff811a47e9 [ 106.255250] 0000000000000000 ffff880000000001 ffff880000000001 0000000000000000 [ 106.278647] Call Trace: [ 106.290201] [] dump_stack+0x85/0xc8 [ 106.302013] [] print_usage_bug+0x1eb/0x1fc [ 106.314144] [] ? print_shortest_lock_dependencies+0x1d0/0x1d0 [ 106.337708] [] mark_lock+0x20d/0x290 [ 106.349493] [] mark_held_locks+0x71/0x90 [ 106.361462] [] ? _raw_spin_unlock_irq+0x2c/0x40 [ 106.373469] [] trace_hardirqs_on_caller+0xa9/0x1c0 [ 106.385466] [] trace_hardirqs_on+0xd/0x10 [ 106.397421] [] _raw_spin_unlock_irq+0x2c/0x40 [ 106.409417] [] commit_charge+0xbe/0x390 [ 106.421310] [] mem_cgroup_migrate+0x135/0x360 [ 106.433352] [] migrate_page_move_mapping+0x132/0x4f0 [ 106.445369] [] migrate_page+0x2b/0x50 [ 106.457301] [] buffer_migrate_page+0x10a/0x150 [ 106.469260] [] move_to_new_page+0x93/0x270 [ 106.481209] [] ? try_to_unmap+0xa7/0x170 [ 106.493094] [] ? page_remove_rmap+0x2a0/0x2a0 [ 106.505052] [] ? __hugepage_set_anon_rmap+0x80/0x80 [ 106.517130] [] migrate_pages+0x846/0xac0 [ 106.528993] [] ? __reset_isolation_suitable+0x120/0x120 [ 106.541061] [] ? isolate_freepages_block+0x4e0/0x4e0 [ 106.553068] [] compact_zone+0x33d/0xa80 [ 106.565026] [] compact_zone_order+0x7b/0xc0 [ 106.576944] [] try_to_compact_pages+0x13a/0x2e0 [ 106.588948] [] __alloc_pages_direct_compact+0x3b/0xf9 [ 106.600995] [] __alloc_pages_slowpath.constprop.87+0x2e5/0x886 [ 106.624383] [] __alloc_pages_nodemask+0x456/0x460 [ 106.636380] [] alloc_pages_vma+0x28b/0x2d0 [ 106.648440] [] do_huge_pmd_anonymous_page+0x13e/0x540 [ 106.660497] [] handle_mm_fault+0x7e4/0x980 [ 106.672585] [] ? handle_mm_fault+0x59/0x980 [ 106.684595] [] __do_page_fault+0x1cd/0x470 [ 106.696524] [] trace_do_page_fault+0x6e/0x250 [ 106.708477] [] do_async_page_fault+0x1a/0xb0 [ 106.720407] [] async_page_fault+0x28/0x30 -- Mateusz Guzik