* mm: memcontrol: rewrite uncharge API: problems
@ 2014-06-30 23:55 ` Hugh Dickins
  0 siblings, 0 replies; 16+ messages in thread
From: Hugh Dickins @ 2014-06-30 23:55 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andrew Morton, Michal Hocko, linux-mm, linux-kernel

Hi Hannes,

Your rewrite of the memcg charge/uncharge API is bold and attractive,
but I'm having some problems with the way release_pages() now does
uncharging in I/O completion context.

At the bottom, see the lockdep message I get when I start shmem swapping.
I have not yet begun to decipher it (over to you!), but I do see
release_pages() mentioned in there (also i915, which I hope is irrelevant).

Uncharging from I/O completion was already worrying me on the PowerPC G5,
when moving tasks from one memcg to another and removing the old one, while
swapping and running swapoff (I haven't tried much else, actually; maybe
it's much easier to reproduce).

I get an "unable to handle kernel paging request at 0x180" oops in __raw_spinlock <
res_counter_uncharge_until < mem_cgroup_uncharge_end < release_pages <
free_pages_and_swap_cache < tlb_flush_mmu_free < tlb_finish_mmu <
unmap_region < do_munmap (or from exit_mmap < mmput < do_exit).

I do have CONFIG_MEMCG_SWAP=y, and I think 0x180 corresponds to the
memsw res_counter spinlock when memcg is NULL.  I don't understand why
it's usually the PowerPC that hits this: I did see something like it once
on this x86 laptop; maybe having lockdep enabled slows things down enough
not to hit it here.

I've stopped those crashes with the patch below: the memcg_batch uncharging
was never designed for use from interrupts.  But I bet it needs more work:
to disable interrupts, or to do something clever with atomics, or... over
to you again.

As it stands, I think an interrupt in the wrong place risks leaking
charges (though in fact I see the reverse - kernel/res_counter.c:28!
underflow warnings; I don't know whether the patch merely lets the machine
stay up long enough to reach those, or causes them).

Not-really-Signed-off-by: Hugh Dickins <hughd@google.com>
---

 mm/memcontrol.c |    9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

--- 3.16-rc2-mm1/mm/memcontrol.c	2014-06-25 18:43:59.856588121 -0700
+++ linux/mm/memcontrol.c	2014-06-29 21:45:03.896588350 -0700
@@ -3636,12 +3636,11 @@ void mem_cgroup_uncharge_end(void)
 	if (!batch->do_batch)
 		return;
 
-	batch->do_batch--;
-	if (batch->do_batch) /* If stacked, do nothing. */
-		return;
+	if (batch->do_batch > 1) /* If stacked, do nothing. */
+		goto out;
 
 	if (!batch->memcg)
-		return;
+		goto out;
 	/*
 	 * This "batch->memcg" is valid without any css_get/put etc...
 	 * bacause we hide charges behind us.
@@ -3655,6 +3654,8 @@ void mem_cgroup_uncharge_end(void)
 	memcg_oom_recover(batch->memcg);
 	/* forget this pointer (for sanity check) */
 	batch->memcg = NULL;
+out:
+	batch->do_batch--;
 }
 
 #ifdef CONFIG_MEMCG_SWAP

And here's lockdep's little fortune cookie:

======================================================
[ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]
3.16.0-rc2-mm1 #3 Not tainted
------------------------------------------------------
cc1/2771 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
 (&(&rtpz->lock)->rlock){+.+.-.}, at: [<ffffffff811518b5>] memcg_check_events+0x17e/0x206
and this task is already holding:
 (&(&zone->lru_lock)->rlock){..-.-.}, at: [<ffffffff8110da3f>] release_pages+0xe7/0x239
which would create a new lock dependency:
 (&(&zone->lru_lock)->rlock){..-.-.} -> (&(&rtpz->lock)->rlock){+.+.-.}

but this new dependency connects a SOFTIRQ-irq-safe lock:
 (&(&zone->lru_lock)->rlock){..-.-.}
... which became SOFTIRQ-irq-safe at:
  [<ffffffff810c201e>] __lock_acquire+0x59f/0x17e8
  [<ffffffff810c38a6>] lock_acquire+0x61/0x78
  [<ffffffff815bdfbd>] _raw_spin_lock_irqsave+0x3f/0x51
  [<ffffffff8110dc0e>] pagevec_lru_move_fn+0x7d/0xf6
  [<ffffffff8110dca4>] pagevec_move_tail+0x1d/0x2c
  [<ffffffff8110e298>] rotate_reclaimable_page+0xb2/0xd4
  [<ffffffff811018bf>] end_page_writeback+0x1c/0x45
  [<ffffffff81134400>] end_swap_bio_write+0x5c/0x69
  [<ffffffff8123473e>] bio_endio+0x50/0x6e
  [<ffffffff81238dee>] blk_update_request+0x163/0x255
  [<ffffffff81238ef7>] blk_update_bidi_request+0x17/0x65
  [<ffffffff81239242>] blk_end_bidi_request+0x1a/0x56
  [<ffffffff81239289>] blk_end_request+0xb/0xd
  [<ffffffff813a075a>] scsi_io_completion+0x16d/0x553
  [<ffffffff81399c0f>] scsi_finish_command+0xb6/0xbf
  [<ffffffff813a0564>] scsi_softirq_done+0xe9/0xf0
  [<ffffffff8123e8e5>] blk_done_softirq+0x79/0x8b
  [<ffffffff81088675>] __do_softirq+0xfc/0x21f
  [<ffffffff8108898f>] irq_exit+0x3d/0x92
  [<ffffffff81032379>] do_IRQ+0xcc/0xe5
  [<ffffffff815bf5ac>] ret_from_intr+0x0/0x13
  [<ffffffff81443ac0>] cpuidle_enter+0x12/0x14
  [<ffffffff810bb4e4>] cpu_startup_entry+0x187/0x243
  [<ffffffff815a90ab>] rest_init+0x12f/0x133
  [<ffffffff81970e7c>] start_kernel+0x396/0x3a3
  [<ffffffff81970489>] x86_64_start_reservations+0x2a/0x2c
  [<ffffffff81970552>] x86_64_start_kernel+0xc7/0xca

to a SOFTIRQ-irq-unsafe lock:
 (&(&rtpz->lock)->rlock){+.+.-.}
... which became SOFTIRQ-irq-unsafe at:
...  [<ffffffff810c2095>] __lock_acquire+0x616/0x17e8
  [<ffffffff810c38a6>] lock_acquire+0x61/0x78
  [<ffffffff815bde9f>] _raw_spin_lock+0x34/0x41
  [<ffffffff811518b5>] memcg_check_events+0x17e/0x206
  [<ffffffff811535bb>] commit_charge+0x260/0x26f
  [<ffffffff81157004>] mem_cgroup_commit_charge+0xb1/0xdb
  [<ffffffff81115b51>] shmem_getpage_gfp+0x400/0x6c2
  [<ffffffff81115ecc>] shmem_write_begin+0x33/0x35
  [<ffffffff81102a24>] generic_perform_write+0xb7/0x1a4
  [<ffffffff8110391e>] __generic_file_write_iter+0x25b/0x29b
  [<ffffffff81103999>] generic_file_write_iter+0x3b/0xa5
  [<ffffffff8115a115>] new_sync_write+0x7b/0x9f
  [<ffffffff8115a56c>] vfs_write+0xb5/0x169
  [<ffffffff8115ae1f>] SyS_write+0x45/0x8c
  [<ffffffff815bead2>] system_call_fastpath+0x16/0x1b

other info that might help us debug this:

 Possible interrupt unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&(&rtpz->lock)->rlock);
                               local_irq_disable();
                               lock(&(&zone->lru_lock)->rlock);
                               lock(&(&rtpz->lock)->rlock);
  <Interrupt>
    lock(&(&zone->lru_lock)->rlock);

 *** DEADLOCK ***

1 lock held by cc1/2771:
 #0:  (&(&zone->lru_lock)->rlock){..-.-.}, at: [<ffffffff8110da3f>] release_pages+0xe7/0x239

the dependencies between SOFTIRQ-irq-safe lock and the holding lock:
-> (&(&zone->lru_lock)->rlock){..-.-.} ops: 413812 {
   IN-SOFTIRQ-W at:
                    [<ffffffff810c201e>] __lock_acquire+0x59f/0x17e8
                    [<ffffffff810c38a6>] lock_acquire+0x61/0x78
                    [<ffffffff815bdfbd>] _raw_spin_lock_irqsave+0x3f/0x51
                    [<ffffffff8110dc0e>] pagevec_lru_move_fn+0x7d/0xf6
                    [<ffffffff8110dca4>] pagevec_move_tail+0x1d/0x2c
                    [<ffffffff8110e298>] rotate_reclaimable_page+0xb2/0xd4
                    [<ffffffff811018bf>] end_page_writeback+0x1c/0x45
                    [<ffffffff81134400>] end_swap_bio_write+0x5c/0x69
                    [<ffffffff8123473e>] bio_endio+0x50/0x6e
                    [<ffffffff81238dee>] blk_update_request+0x163/0x255
                    [<ffffffff81238ef7>] blk_update_bidi_request+0x17/0x65
                    [<ffffffff81239242>] blk_end_bidi_request+0x1a/0x56
                    [<ffffffff81239289>] blk_end_request+0xb/0xd
                    [<ffffffff813a075a>] scsi_io_completion+0x16d/0x553
                    [<ffffffff81399c0f>] scsi_finish_command+0xb6/0xbf
                    [<ffffffff813a0564>] scsi_softirq_done+0xe9/0xf0
                    [<ffffffff8123e8e5>] blk_done_softirq+0x79/0x8b
                    [<ffffffff81088675>] __do_softirq+0xfc/0x21f
                    [<ffffffff8108898f>] irq_exit+0x3d/0x92
                    [<ffffffff81032379>] do_IRQ+0xcc/0xe5
                    [<ffffffff815bf5ac>] ret_from_intr+0x0/0x13
                    [<ffffffff81443ac0>] cpuidle_enter+0x12/0x14
                    [<ffffffff810bb4e4>] cpu_startup_entry+0x187/0x243
                    [<ffffffff815a90ab>] rest_init+0x12f/0x133
                    [<ffffffff81970e7c>] start_kernel+0x396/0x3a3
                    [<ffffffff81970489>] x86_64_start_reservations+0x2a/0x2c
                    [<ffffffff81970552>] x86_64_start_kernel+0xc7/0xca
   IN-RECLAIM_FS-W at:
                       [<ffffffff810c20c3>] __lock_acquire+0x644/0x17e8
                       [<ffffffff810c38a6>] lock_acquire+0x61/0x78
                       [<ffffffff815bdfbd>] _raw_spin_lock_irqsave+0x3f/0x51
                       [<ffffffff8110dc0e>] pagevec_lru_move_fn+0x7d/0xf6
                       [<ffffffff8110dca4>] pagevec_move_tail+0x1d/0x2c
                       [<ffffffff8110e66d>] lru_add_drain_cpu+0x4d/0xb3
                       [<ffffffff8110e783>] lru_add_drain+0x1a/0x37
                       [<ffffffff81111b95>] shrink_active_list+0x62/0x2cb
                       [<ffffffff81112eaa>] balance_pgdat+0x156/0x503
                       [<ffffffff8111355e>] kswapd+0x307/0x341
                       [<ffffffff810a1923>] kthread+0xf1/0xf9
                       [<ffffffff815bea2c>] ret_from_fork+0x7c/0xb0
   INITIAL USE at:
                   [<ffffffff810c20db>] __lock_acquire+0x65c/0x17e8
                   [<ffffffff810c38a6>] lock_acquire+0x61/0x78
                   [<ffffffff815bdfbd>] _raw_spin_lock_irqsave+0x3f/0x51
                   [<ffffffff8110dc0e>] pagevec_lru_move_fn+0x7d/0xf6
                   [<ffffffff8110dcc5>] __pagevec_lru_add+0x12/0x14
                   [<ffffffff8110dd37>] __lru_cache_add+0x70/0x9f
                   [<ffffffff8110e44e>] lru_cache_add_anon+0x14/0x16
                   [<ffffffff81115b5a>] shmem_getpage_gfp+0x409/0x6c2
                   [<ffffffff81115fcb>] shmem_read_mapping_page_gfp+0x2e/0x49
                   [<ffffffff8133168f>] i915_gem_object_get_pages_gtt+0xe5/0x3f9
                   [<ffffffff8132d66e>] i915_gem_object_get_pages+0x64/0x8f
                   [<ffffffff81330eaa>] i915_gem_object_pin+0x2a0/0x5af
                   [<ffffffff813408fb>] intel_init_ring_buffer+0x2ba/0x3e6
                   [<ffffffff8134323a>] intel_init_render_ring_buffer+0x38b/0x3a6
                   [<ffffffff8132faae>] i915_gem_init_hw+0x127/0x2c6
                   [<ffffffff8132fd57>] i915_gem_init+0x10a/0x189
                   [<ffffffff81381d0c>] i915_driver_load+0xb1b/0xde7
                   [<ffffffff812fff60>] drm_dev_register+0x7f/0xf8
                   [<ffffffff81302185>] drm_get_pci_dev+0xf7/0x1b4
                   [<ffffffff81311d2f>] i915_pci_probe+0x40/0x49
                   [<ffffffff8127dddd>] local_pci_probe+0x1f/0x51
                   [<ffffffff8127ded5>] pci_device_probe+0xc6/0xec
                   [<ffffffff81389720>] driver_probe_device+0x99/0x1b9
                   [<ffffffff813898d4>] __driver_attach+0x5c/0x7e
                   [<ffffffff81387e7f>] bus_for_each_dev+0x55/0x89
                   [<ffffffff813893f6>] driver_attach+0x19/0x1b
                   [<ffffffff81388fb2>] bus_add_driver+0xec/0x1d3
                   [<ffffffff81389e21>] driver_register+0x89/0xc5
                   [<ffffffff8127d48f>] __pci_register_driver+0x58/0x5b
                   [<ffffffff8130229b>] drm_pci_init+0x59/0xda
                   [<ffffffff8199497f>] i915_init+0x89/0x90
                   [<ffffffff8100030e>] do_one_initcall+0xea/0x18c
                   [<ffffffff81970f8d>] kernel_init_freeable+0x104/0x196
                   [<ffffffff815a90b8>] kernel_init+0x9/0xd5
                   [<ffffffff815bea2c>] ret_from_fork+0x7c/0xb0
 }
 ... key      at: [<ffffffff8273c920>] __key.37664+0x0/0x8
 ... acquired at:
   [<ffffffff810c0f1b>] check_irq_usage+0x54/0xa8
   [<ffffffff810c2b50>] __lock_acquire+0x10d1/0x17e8
   [<ffffffff810c38a6>] lock_acquire+0x61/0x78
   [<ffffffff815bde9f>] _raw_spin_lock+0x34/0x41
   [<ffffffff811518b5>] memcg_check_events+0x17e/0x206
   [<ffffffff811571aa>] mem_cgroup_uncharge+0xf6/0x1c0
   [<ffffffff8110db2a>] release_pages+0x1d2/0x239
   [<ffffffff81134ea2>] free_pages_and_swap_cache+0x72/0x8c
   [<ffffffff8112136f>] tlb_flush_mmu_free+0x21/0x3c
   [<ffffffff81121d5d>] tlb_flush_mmu+0x1b/0x1e
   [<ffffffff81121d6f>] tlb_finish_mmu+0xf/0x34
   [<ffffffff8112a968>] exit_mmap+0xb5/0x117
   [<ffffffff81081a9d>] mmput+0x52/0xce
   [<ffffffff81086842>] do_exit+0x355/0x9b7
   [<ffffffff81086f46>] do_group_exit+0x76/0xb5
   [<ffffffff81086f94>] __wake_up_parent+0x0/0x23
   [<ffffffff815bead2>] system_call_fastpath+0x16/0x1b


the dependencies between the lock to be acquired and SOFTIRQ-irq-unsafe lock:
-> (&(&rtpz->lock)->rlock){+.+.-.} ops: 2348 {
   HARDIRQ-ON-W at:
                    [<ffffffff810c2073>] __lock_acquire+0x5f4/0x17e8
                    [<ffffffff810c38a6>] lock_acquire+0x61/0x78
                    [<ffffffff815bde9f>] _raw_spin_lock+0x34/0x41
                    [<ffffffff811518b5>] memcg_check_events+0x17e/0x206
                    [<ffffffff811535bb>] commit_charge+0x260/0x26f
                    [<ffffffff81157004>] mem_cgroup_commit_charge+0xb1/0xdb
                    [<ffffffff81115b51>] shmem_getpage_gfp+0x400/0x6c2
                    [<ffffffff81115ecc>] shmem_write_begin+0x33/0x35
                    [<ffffffff81102a24>] generic_perform_write+0xb7/0x1a4
                    [<ffffffff8110391e>] __generic_file_write_iter+0x25b/0x29b
                    [<ffffffff81103999>] generic_file_write_iter+0x3b/0xa5
                    [<ffffffff8115a115>] new_sync_write+0x7b/0x9f
                    [<ffffffff8115a56c>] vfs_write+0xb5/0x169
                    [<ffffffff8115ae1f>] SyS_write+0x45/0x8c
                    [<ffffffff815bead2>] system_call_fastpath+0x16/0x1b
   SOFTIRQ-ON-W at:
                    [<ffffffff810c2095>] __lock_acquire+0x616/0x17e8
                    [<ffffffff810c38a6>] lock_acquire+0x61/0x78
                    [<ffffffff815bde9f>] _raw_spin_lock+0x34/0x41
                    [<ffffffff811518b5>] memcg_check_events+0x17e/0x206
                    [<ffffffff811535bb>] commit_charge+0x260/0x26f
                    [<ffffffff81157004>] mem_cgroup_commit_charge+0xb1/0xdb
                    [<ffffffff81115b51>] shmem_getpage_gfp+0x400/0x6c2
                    [<ffffffff81115ecc>] shmem_write_begin+0x33/0x35
                    [<ffffffff81102a24>] generic_perform_write+0xb7/0x1a4
                    [<ffffffff8110391e>] __generic_file_write_iter+0x25b/0x29b
                    [<ffffffff81103999>] generic_file_write_iter+0x3b/0xa5
                    [<ffffffff8115a115>] new_sync_write+0x7b/0x9f
                    [<ffffffff8115a56c>] vfs_write+0xb5/0x169
                    [<ffffffff8115ae1f>] SyS_write+0x45/0x8c
                    [<ffffffff815bead2>] system_call_fastpath+0x16/0x1b
   IN-RECLAIM_FS-W at:
                       [<ffffffff810c20c3>] __lock_acquire+0x644/0x17e8
                       [<ffffffff810c38a6>] lock_acquire+0x61/0x78
                       [<ffffffff815bde9f>] _raw_spin_lock+0x34/0x41
                       [<ffffffff81156311>] mem_cgroup_soft_limit_reclaim+0x80/0x6b9
                       [<ffffffff81112fc2>] balance_pgdat+0x26e/0x503
                       [<ffffffff8111355e>] kswapd+0x307/0x341
                       [<ffffffff810a1923>] kthread+0xf1/0xf9
                       [<ffffffff815bea2c>] ret_from_fork+0x7c/0xb0
   INITIAL USE at:
                   [<ffffffff810c20db>] __lock_acquire+0x65c/0x17e8
                   [<ffffffff810c38a6>] lock_acquire+0x61/0x78
                   [<ffffffff815bde9f>] _raw_spin_lock+0x34/0x41
                   [<ffffffff811518b5>] memcg_check_events+0x17e/0x206
                   [<ffffffff811535bb>] commit_charge+0x260/0x26f
                   [<ffffffff81157004>] mem_cgroup_commit_charge+0xb1/0xdb
                   [<ffffffff81115b51>] shmem_getpage_gfp+0x400/0x6c2
                   [<ffffffff81115ecc>] shmem_write_begin+0x33/0x35
                   [<ffffffff81102a24>] generic_perform_write+0xb7/0x1a4
                   [<ffffffff8110391e>] __generic_file_write_iter+0x25b/0x29b
                   [<ffffffff81103999>] generic_file_write_iter+0x3b/0xa5
                   [<ffffffff8115a115>] new_sync_write+0x7b/0x9f
                   [<ffffffff8115a56c>] vfs_write+0xb5/0x169
                   [<ffffffff8115ae1f>] SyS_write+0x45/0x8c
                   [<ffffffff815bead2>] system_call_fastpath+0x16/0x1b
 }
 ... key      at: [<ffffffff82747bf0>] __key.49479+0x0/0x8
 ... acquired at:
   [<ffffffff810c0f1b>] check_irq_usage+0x54/0xa8
   [<ffffffff810c2b50>] __lock_acquire+0x10d1/0x17e8
   [<ffffffff810c38a6>] lock_acquire+0x61/0x78
   [<ffffffff815bde9f>] _raw_spin_lock+0x34/0x41
   [<ffffffff811518b5>] memcg_check_events+0x17e/0x206
   [<ffffffff811571aa>] mem_cgroup_uncharge+0xf6/0x1c0
   [<ffffffff8110db2a>] release_pages+0x1d2/0x239
   [<ffffffff81134ea2>] free_pages_and_swap_cache+0x72/0x8c
   [<ffffffff8112136f>] tlb_flush_mmu_free+0x21/0x3c
   [<ffffffff81121d5d>] tlb_flush_mmu+0x1b/0x1e
   [<ffffffff81121d6f>] tlb_finish_mmu+0xf/0x34
   [<ffffffff8112a968>] exit_mmap+0xb5/0x117
   [<ffffffff81081a9d>] mmput+0x52/0xce
   [<ffffffff81086842>] do_exit+0x355/0x9b7
   [<ffffffff81086f46>] do_group_exit+0x76/0xb5
   [<ffffffff81086f94>] __wake_up_parent+0x0/0x23
   [<ffffffff815bead2>] system_call_fastpath+0x16/0x1b


stack backtrace:
CPU: 1 PID: 2771 Comm: cc1 Not tainted 3.16.0-rc2-mm1 #3
Hardware name: LENOVO 4174EH1/4174EH1, BIOS 8CET51WW (1.31 ) 11/29/2011
 0000000000000000 ffff88000fe77a18 ffffffff815b2b2f ffff880004b09868
 ffff88000fe77b10 ffffffff810c0eb6 0000000000000000 ffff880000000000
 ffff880000000001 0000000400000006 ffffffff81811f22 ffff88000fe77a60
Call Trace:
 [<ffffffff815b2b2f>] dump_stack+0x4e/0x7a
 [<ffffffff810c0eb6>] check_usage+0x591/0x5a2
 [<ffffffff81156261>] ? mem_cgroup_bad_page_check+0x15/0x1d
 [<ffffffff810c1809>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffff815be16f>] ? _raw_spin_unlock_irq+0x32/0x46
 [<ffffffff810c0f1b>] check_irq_usage+0x54/0xa8
 [<ffffffff810c2b50>] __lock_acquire+0x10d1/0x17e8
 [<ffffffff810c38a6>] lock_acquire+0x61/0x78
 [<ffffffff811518b5>] ? memcg_check_events+0x17e/0x206
 [<ffffffff815bde9f>] _raw_spin_lock+0x34/0x41
 [<ffffffff811518b5>] ? memcg_check_events+0x17e/0x206
 [<ffffffff811518b5>] memcg_check_events+0x17e/0x206
 [<ffffffff811571aa>] mem_cgroup_uncharge+0xf6/0x1c0
 [<ffffffff8110db2a>] release_pages+0x1d2/0x239
 [<ffffffff81134ea2>] free_pages_and_swap_cache+0x72/0x8c
 [<ffffffff8112136f>] tlb_flush_mmu_free+0x21/0x3c
 [<ffffffff81121d5d>] tlb_flush_mmu+0x1b/0x1e
 [<ffffffff81121d6f>] tlb_finish_mmu+0xf/0x34
 [<ffffffff8112a968>] exit_mmap+0xb5/0x117
 [<ffffffff81081a9d>] mmput+0x52/0xce
 [<ffffffff81086842>] do_exit+0x355/0x9b7
 [<ffffffff815bf64e>] ? retint_swapgs+0xe/0x13
 [<ffffffff81086f46>] do_group_exit+0x76/0xb5
 [<ffffffff81086f94>] SyS_exit_group+0xf/0xf
 [<ffffffff815bead2>] system_call_fastpath+0x16/0x1b

Hugh

   IN-RECLAIM_FS-W at:
                       [<ffffffff810c20c3>] __lock_acquire+0x644/0x17e8
                       [<ffffffff810c38a6>] lock_acquire+0x61/0x78
                       [<ffffffff815bde9f>] _raw_spin_lock+0x34/0x41
                       [<ffffffff81156311>] mem_cgroup_soft_limit_reclaim+0x80/0x6b9
                       [<ffffffff81112fc2>] balance_pgdat+0x26e/0x503
                       [<ffffffff8111355e>] kswapd+0x307/0x341
                       [<ffffffff810a1923>] kthread+0xf1/0xf9
                       [<ffffffff815bea2c>] ret_from_fork+0x7c/0xb0
   INITIAL USE at:
                   [<ffffffff810c20db>] __lock_acquire+0x65c/0x17e8
                   [<ffffffff810c38a6>] lock_acquire+0x61/0x78
                   [<ffffffff815bde9f>] _raw_spin_lock+0x34/0x41
                   [<ffffffff811518b5>] memcg_check_events+0x17e/0x206
                   [<ffffffff811535bb>] commit_charge+0x260/0x26f
                   [<ffffffff81157004>] mem_cgroup_commit_charge+0xb1/0xdb
                   [<ffffffff81115b51>] shmem_getpage_gfp+0x400/0x6c2
                   [<ffffffff81115ecc>] shmem_write_begin+0x33/0x35
                   [<ffffffff81102a24>] generic_perform_write+0xb7/0x1a4
                   [<ffffffff8110391e>] __generic_file_write_iter+0x25b/0x29b
                   [<ffffffff81103999>] generic_file_write_iter+0x3b/0xa5
                   [<ffffffff8115a115>] new_sync_write+0x7b/0x9f
                   [<ffffffff8115a56c>] vfs_write+0xb5/0x169
                   [<ffffffff8115ae1f>] SyS_write+0x45/0x8c
                   [<ffffffff815bead2>] system_call_fastpath+0x16/0x1b
 }
 ... key      at: [<ffffffff82747bf0>] __key.49479+0x0/0x8
 ... acquired at:
   [<ffffffff810c0f1b>] check_irq_usage+0x54/0xa8
   [<ffffffff810c2b50>] __lock_acquire+0x10d1/0x17e8
   [<ffffffff810c38a6>] lock_acquire+0x61/0x78
   [<ffffffff815bde9f>] _raw_spin_lock+0x34/0x41
   [<ffffffff811518b5>] memcg_check_events+0x17e/0x206
   [<ffffffff811571aa>] mem_cgroup_uncharge+0xf6/0x1c0
   [<ffffffff8110db2a>] release_pages+0x1d2/0x239
   [<ffffffff81134ea2>] free_pages_and_swap_cache+0x72/0x8c
   [<ffffffff8112136f>] tlb_flush_mmu_free+0x21/0x3c
   [<ffffffff81121d5d>] tlb_flush_mmu+0x1b/0x1e
   [<ffffffff81121d6f>] tlb_finish_mmu+0xf/0x34
   [<ffffffff8112a968>] exit_mmap+0xb5/0x117
   [<ffffffff81081a9d>] mmput+0x52/0xce
   [<ffffffff81086842>] do_exit+0x355/0x9b7
   [<ffffffff81086f46>] do_group_exit+0x76/0xb5
   [<ffffffff81086f94>] __wake_up_parent+0x0/0x23
   [<ffffffff815bead2>] system_call_fastpath+0x16/0x1b


stack backtrace:
CPU: 1 PID: 2771 Comm: cc1 Not tainted 3.16.0-rc2-mm1 #3
Hardware name: LENOVO 4174EH1/4174EH1, BIOS 8CET51WW (1.31 ) 11/29/2011
 0000000000000000 ffff88000fe77a18 ffffffff815b2b2f ffff880004b09868
 ffff88000fe77b10 ffffffff810c0eb6 0000000000000000 ffff880000000000
 ffff880000000001 0000000400000006 ffffffff81811f22 ffff88000fe77a60
Call Trace:
 [<ffffffff815b2b2f>] dump_stack+0x4e/0x7a
 [<ffffffff810c0eb6>] check_usage+0x591/0x5a2
 [<ffffffff81156261>] ? mem_cgroup_bad_page_check+0x15/0x1d
 [<ffffffff810c1809>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffff815be16f>] ? _raw_spin_unlock_irq+0x32/0x46
 [<ffffffff810c0f1b>] check_irq_usage+0x54/0xa8
 [<ffffffff810c2b50>] __lock_acquire+0x10d1/0x17e8
 [<ffffffff810c38a6>] lock_acquire+0x61/0x78
 [<ffffffff811518b5>] ? memcg_check_events+0x17e/0x206
 [<ffffffff815bde9f>] _raw_spin_lock+0x34/0x41
 [<ffffffff811518b5>] ? memcg_check_events+0x17e/0x206
 [<ffffffff811518b5>] memcg_check_events+0x17e/0x206
 [<ffffffff811571aa>] mem_cgroup_uncharge+0xf6/0x1c0
 [<ffffffff8110db2a>] release_pages+0x1d2/0x239
 [<ffffffff81134ea2>] free_pages_and_swap_cache+0x72/0x8c
 [<ffffffff8112136f>] tlb_flush_mmu_free+0x21/0x3c
 [<ffffffff81121d5d>] tlb_flush_mmu+0x1b/0x1e
 [<ffffffff81121d6f>] tlb_finish_mmu+0xf/0x34
 [<ffffffff8112a968>] exit_mmap+0xb5/0x117
 [<ffffffff81081a9d>] mmput+0x52/0xce
 [<ffffffff81086842>] do_exit+0x355/0x9b7
 [<ffffffff815bf64e>] ? retint_swapgs+0xe/0x13
 [<ffffffff81086f46>] do_group_exit+0x76/0xb5
 [<ffffffff81086f94>] SyS_exit_group+0xf/0xf
 [<ffffffff815bead2>] system_call_fastpath+0x16/0x1b

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: mm: memcontrol: rewrite uncharge API: problems
  2014-06-30 23:55 ` Hugh Dickins
@ 2014-07-01 17:46   ` Johannes Weiner
  -1 siblings, 0 replies; 16+ messages in thread
From: Johannes Weiner @ 2014-07-01 17:46 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, Michal Hocko, linux-mm, linux-kernel

Hi Hugh,

On Mon, Jun 30, 2014 at 04:55:10PM -0700, Hugh Dickins wrote:
> Hi Hannes,
> 
> Your rewrite of the memcg charge/uncharge API is bold and attractive,
> but I'm having some problems with the way release_pages() now does
> uncharging in I/O completion context.

Yes, I need to make the uncharge path IRQ-safe.  This looks doable.

> At the bottom see the lockdep message I get when I start shmem swapping.
> Which I have not begun to attempt to decipher (over to you!), but I do
> see release_pages() mentioned in there (also i915, hope it's irrelevant).

This seems to be about uncharge acquiring the IRQ-unsafe soft limit
tree lock while the outer release_pages() holds the IRQ-safe lru_lock.
A separate issue, AFAICS, that would also be fixed by IRQ-proofing the
uncharge path.

> Which was already worrying me on the PowerPC G5, when moving tasks from
> one memcg to another and removing the old, while swapping and swappingoff
> (I haven't tried much else actually, maybe it's much easier to reproduce).
> 
> I get "unable to handle kernel paging at 0x180" oops in __raw_spinlock <
> res_counter_uncharge_until < mem_cgroup_uncharge_end < release_pages <
> free_pages_and_swap_cache < tlb_flush_mmu_free < tlb_finish_mmu <
> unmap_region < do_munmap (or from exit_mmap < mmput < do_exit).
> 
> I do have CONFIG_MEMCG_SWAP=y, and I think 0x180 corresponds to the
> memsw res_counter spinlock, if memcg is NULL.  I don't understand why
> usually the PowerPC: I did see something like it once on this x86 laptop,
> maybe having lockdep in on this slows things down enough not to hit that.
> 
> I've stopped those crashes with patch below: the memcg_batch uncharging
> was never designed for use from interrupts.  But I bet it needs more work:
> to disable interrupts, or do something clever with atomics, or... over to
> you again.

I was convinced I had tested these changes with lockdep enabled, but
it must have been at an earlier stage while developing the series.
Otherwise, I should have gotten the same splat as you report.

Thanks for the report, I hope to have something useful ASAP.

^ permalink raw reply	[flat|nested] 16+ messages in thread


* Re: mm: memcontrol: rewrite uncharge API: problems
  2014-07-01 17:46   ` Johannes Weiner
@ 2014-07-02 21:20     ` Johannes Weiner
  -1 siblings, 0 replies; 16+ messages in thread
From: Johannes Weiner @ 2014-07-02 21:20 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, Michal Hocko, linux-mm, linux-kernel

On Tue, Jul 01, 2014 at 01:46:12PM -0400, Johannes Weiner wrote:
> Hi Hugh,
> 
> On Mon, Jun 30, 2014 at 04:55:10PM -0700, Hugh Dickins wrote:
> > Hi Hannes,
> > 
> > Your rewrite of the memcg charge/uncharge API is bold and attractive,
> > but I'm having some problems with the way release_pages() now does
> > uncharging in I/O completion context.
> 
> Yes, I need to make the uncharge path IRQ-safe.  This looks doable.
> 
> > At the bottom see the lockdep message I get when I start shmem swapping.
> > Which I have not begun to attempt to decipher (over to you!), but I do
> > see release_pages() mentioned in there (also i915, hope it's irrelevant).
> 
> This seems to be about uncharge acquiring the IRQ-unsafe soft limit
> tree lock while the outer release_pages() holds the IRQ-safe lru_lock.
> A separate issue, AFAICS, that would also be fixed by IRQ-proofing the
> uncharge path.
> 
> > Which was already worrying me on the PowerPC G5, when moving tasks from
> > one memcg to another and removing the old, while swapping and swappingoff
> > (I haven't tried much else actually, maybe it's much easier to reproduce).
> > 
> > I get "unable to handle kernel paging at 0x180" oops in __raw_spinlock <
> > res_counter_uncharge_until < mem_cgroup_uncharge_end < release_pages <
> > free_pages_and_swap_cache < tlb_flush_mmu_free < tlb_finish_mmu <
> > unmap_region < do_munmap (or from exit_mmap < mmput < do_exit).
> > 
> > I do have CONFIG_MEMCG_SWAP=y, and I think 0x180 corresponds to the
> > memsw res_counter spinlock, if memcg is NULL.  I don't understand why
> > usually the PowerPC: I did see something like it once on this x86 laptop,
> > maybe having lockdep in on this slows things down enough not to hit that.
> > 
> > I've stopped those crashes with patch below: the memcg_batch uncharging
> > was never designed for use from interrupts.  But I bet it needs more work:
> > to disable interrupts, or do something clever with atomics, or... over to
> > you again.
> 
> I was convinced I had tested these changes with lockdep enabled, but
> it must have been at an earlier stage while developing the series.
> Otherwise, I should have gotten the same splat as you report.

Turns out this was because the soft limit was not set in my tests, and
without soft limit excess that spinlock is never acquired.  I could
reproduce it now.

> Thanks for the report, I hope to have something useful ASAP.

Could you give the following patch a spin?  I put it in the mmots
stack on top of mm-memcontrol-rewrite-charge-api-fix-shmem_unuse-fix.

Thanks!

---
>From b13bbe7774296388ca28afc1ce5776d6d6b371fb Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Wed, 2 Jul 2014 08:50:23 -0400
Subject: [patch] mm: memcontrol: rewrite uncharge API fix

Hugh reports:

======================================================
[ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]
3.16.0-rc2-mm1 #3 Not tainted
------------------------------------------------------
cc1/2771 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
 (&(&rtpz->lock)->rlock){+.+.-.}, at: [<ffffffff811518b5>] memcg_check_events+0x17e/0x206
and this task is already holding:
 (&(&zone->lru_lock)->rlock){..-.-.}, at: [<ffffffff8110da3f>] release_pages+0xe7/0x239
which would create a new lock dependency:
 (&(&zone->lru_lock)->rlock){..-.-.} -> (&(&rtpz->lock)->rlock){+.+.-.}

but this new dependency connects a SOFTIRQ-irq-safe lock:
 (&(&zone->lru_lock)->rlock){..-.-.}
... which became SOFTIRQ-irq-safe at:
  [<ffffffff810c201e>] __lock_acquire+0x59f/0x17e8
  [<ffffffff810c38a6>] lock_acquire+0x61/0x78
  [<ffffffff815bdfbd>] _raw_spin_lock_irqsave+0x3f/0x51
  [<ffffffff8110dc0e>] pagevec_lru_move_fn+0x7d/0xf6
  [<ffffffff8110dca4>] pagevec_move_tail+0x1d/0x2c
  [<ffffffff8110e298>] rotate_reclaimable_page+0xb2/0xd4
  [<ffffffff811018bf>] end_page_writeback+0x1c/0x45
  [<ffffffff81134400>] end_swap_bio_write+0x5c/0x69
  [<ffffffff8123473e>] bio_endio+0x50/0x6e
  [<ffffffff81238dee>] blk_update_request+0x163/0x255
  [<ffffffff81238ef7>] blk_update_bidi_request+0x17/0x65
  [<ffffffff81239242>] blk_end_bidi_request+0x1a/0x56
  [<ffffffff81239289>] blk_end_request+0xb/0xd
  [<ffffffff813a075a>] scsi_io_completion+0x16d/0x553
  [<ffffffff81399c0f>] scsi_finish_command+0xb6/0xbf
  [<ffffffff813a0564>] scsi_softirq_done+0xe9/0xf0
  [<ffffffff8123e8e5>] blk_done_softirq+0x79/0x8b
  [<ffffffff81088675>] __do_softirq+0xfc/0x21f
  [<ffffffff8108898f>] irq_exit+0x3d/0x92
  [<ffffffff81032379>] do_IRQ+0xcc/0xe5
  [<ffffffff815bf5ac>] ret_from_intr+0x0/0x13
  [<ffffffff81443ac0>] cpuidle_enter+0x12/0x14
  [<ffffffff810bb4e4>] cpu_startup_entry+0x187/0x243
  [<ffffffff815a90ab>] rest_init+0x12f/0x133
  [<ffffffff81970e7c>] start_kernel+0x396/0x3a3
  [<ffffffff81970489>] x86_64_start_reservations+0x2a/0x2c
  [<ffffffff81970552>] x86_64_start_kernel+0xc7/0xca

to a SOFTIRQ-irq-unsafe lock:
 (&(&rtpz->lock)->rlock){+.+.-.}
... which became SOFTIRQ-irq-unsafe at:
...  [<ffffffff810c2095>] __lock_acquire+0x616/0x17e8
  [<ffffffff810c38a6>] lock_acquire+0x61/0x78
  [<ffffffff815bde9f>] _raw_spin_lock+0x34/0x41
  [<ffffffff811518b5>] memcg_check_events+0x17e/0x206
  [<ffffffff811535bb>] commit_charge+0x260/0x26f
  [<ffffffff81157004>] mem_cgroup_commit_charge+0xb1/0xdb
  [<ffffffff81115b51>] shmem_getpage_gfp+0x400/0x6c2
  [<ffffffff81115ecc>] shmem_write_begin+0x33/0x35
  [<ffffffff81102a24>] generic_perform_write+0xb7/0x1a4
  [<ffffffff8110391e>] __generic_file_write_iter+0x25b/0x29b
  [<ffffffff81103999>] generic_file_write_iter+0x3b/0xa5
  [<ffffffff8115a115>] new_sync_write+0x7b/0x9f
  [<ffffffff8115a56c>] vfs_write+0xb5/0x169
  [<ffffffff8115ae1f>] SyS_write+0x45/0x8c
  [<ffffffff815bead2>] system_call_fastpath+0x16/0x1b

The soft limit tree lock needs to be IRQ-safe as it's acquired while
holding the IRQ-safe zone->lru_lock.

But more importantly, with uncharge happening in release_pages() now,
this path is executed from interrupt context.

Make the soft limit tree lock, uncharge batching, and charge
statistics IRQ-safe.

Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 113 +++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 67 insertions(+), 46 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6c3ffb02651e..91b621846e10 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -754,9 +754,11 @@ static void __mem_cgroup_remove_exceeded(struct mem_cgroup_per_zone *mz,
 static void mem_cgroup_remove_exceeded(struct mem_cgroup_per_zone *mz,
 				       struct mem_cgroup_tree_per_zone *mctz)
 {
-	spin_lock(&mctz->lock);
+	unsigned long flags;
+
+	spin_lock_irqsave(&mctz->lock, flags);
 	__mem_cgroup_remove_exceeded(mz, mctz);
-	spin_unlock(&mctz->lock);
+	spin_unlock_irqrestore(&mctz->lock, flags);
 }
 
 
@@ -779,7 +781,9 @@ static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
 		 * mem is over its softlimit.
 		 */
 		if (excess || mz->on_tree) {
-			spin_lock(&mctz->lock);
+			unsigned long flags;
+
+			spin_lock_irqsave(&mctz->lock, flags);
 			/* if on-tree, remove it */
 			if (mz->on_tree)
 				__mem_cgroup_remove_exceeded(mz, mctz);
@@ -788,7 +792,7 @@ static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
 			 * If excess is 0, no tree ops.
 			 */
 			__mem_cgroup_insert_exceeded(mz, mctz, excess);
-			spin_unlock(&mctz->lock);
+			spin_unlock_irqrestore(&mctz->lock, flags);
 		}
 	}
 }
@@ -839,9 +843,9 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
 {
 	struct mem_cgroup_per_zone *mz;
 
-	spin_lock(&mctz->lock);
+	spin_lock_irq(&mctz->lock);
 	mz = __mem_cgroup_largest_soft_limit_node(mctz);
-	spin_unlock(&mctz->lock);
+	spin_unlock_irq(&mctz->lock);
 	return mz;
 }
 
@@ -904,8 +908,9 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
 					 struct page *page,
 					 int nr_pages)
 {
-	preempt_disable();
+	unsigned long flags;
 
+	local_irq_save(flags);
 	/*
 	 * Here, RSS means 'mapped anon' and anon's SwapCache. Shmem/tmpfs is
 	 * counted as CACHE even if it's on ANON LRU.
@@ -930,7 +935,7 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
 	}
 
 	__this_cpu_add(memcg->stat->nr_page_events, nr_pages);
-	preempt_enable();
+	local_irq_restore(flags);
 }
 
 unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list lru)
@@ -1009,7 +1014,9 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
  */
 static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
 {
-	preempt_disable();
+	unsigned long flags;
+
+	local_irq_save(flags);
 	/* threshold event is triggered in finer grain than soft limit */
 	if (unlikely(mem_cgroup_event_ratelimit(memcg,
 						MEM_CGROUP_TARGET_THRESH))) {
@@ -1022,7 +1029,7 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
 		do_numainfo = mem_cgroup_event_ratelimit(memcg,
 						MEM_CGROUP_TARGET_NUMAINFO);
 #endif
-		preempt_enable();
+		local_irq_restore(flags);
 
 		mem_cgroup_threshold(memcg);
 		if (unlikely(do_softlimit))
@@ -1032,7 +1039,7 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
 			atomic_inc(&memcg->numainfo_events);
 #endif
 	} else
-		preempt_enable();
+		local_irq_restore(flags);
 }
 
 struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
@@ -3620,6 +3627,9 @@ out:
 
 void mem_cgroup_uncharge_start(void)
 {
+	unsigned long flags;
+
+	local_irq_save(flags);
 	current->memcg_batch.do_batch++;
 	/* We can do nest. */
 	if (current->memcg_batch.do_batch == 1) {
@@ -3627,21 +3637,18 @@ void mem_cgroup_uncharge_start(void)
 		current->memcg_batch.nr_pages = 0;
 		current->memcg_batch.memsw_nr_pages = 0;
 	}
+	local_irq_restore(flags);
 }
 
 void mem_cgroup_uncharge_end(void)
 {
 	struct memcg_batch_info *batch = &current->memcg_batch;
+	unsigned long flags;
 
-	if (!batch->do_batch)
-		return;
-
-	batch->do_batch--;
-	if (batch->do_batch) /* If stacked, do nothing. */
-		return;
-
-	if (!batch->memcg)
-		return;
+	local_irq_save(flags);
+	VM_BUG_ON(!batch->do_batch);
+	if (--batch->do_batch) /* If stacked, do nothing */
+		goto out;
 	/*
 	 * This "batch->memcg" is valid without any css_get/put etc...
 	 * bacause we hide charges behind us.
@@ -3653,8 +3660,8 @@ void mem_cgroup_uncharge_end(void)
 		res_counter_uncharge(&batch->memcg->memsw,
 				     batch->memsw_nr_pages * PAGE_SIZE);
 	memcg_oom_recover(batch->memcg);
-	/* forget this pointer (for sanity check) */
-	batch->memcg = NULL;
+out:
+	local_irq_restore(flags);
 }
 
 #ifdef CONFIG_MEMCG_SWAP
@@ -6554,6 +6561,36 @@ void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg)
 	cancel_charge(memcg, nr_pages);
 }
 
+static bool uncharge_batched(struct mem_cgroup *memcg,
+			     unsigned long pc_flags,
+			     unsigned int nr_pages)
+{
+	struct memcg_batch_info *batch = &current->memcg_batch;
+	bool uncharged = false;
+	unsigned long flags;
+
+	if (nr_pages > 1)
+		return false;
+	if (test_thread_flag(TIF_MEMDIE))
+		return false;
+
+	local_irq_save(flags);
+	if (!batch->do_batch)
+		goto out;
+	if (batch->memcg && batch->memcg != memcg)
+		goto out;
+	if (!batch->memcg)
+		batch->memcg = memcg;
+	if (pc_flags & PCG_MEM)
+		batch->nr_pages++;
+	if (pc_flags & PCG_MEMSW)
+		batch->memsw_nr_pages++;
+	uncharged = true;
+out:
+	local_irq_restore(flags);
+	return uncharged;
+}
+
 /**
  * mem_cgroup_uncharge - uncharge a page
  * @page: page to uncharge
@@ -6563,11 +6600,10 @@ void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg)
  */
 void mem_cgroup_uncharge(struct page *page)
 {
-	struct memcg_batch_info *batch;
 	unsigned int nr_pages = 1;
 	struct mem_cgroup *memcg;
 	struct page_cgroup *pc;
-	unsigned long flags;
+	unsigned long pc_flags;
 
 	VM_BUG_ON_PAGE(PageLRU(page), page);
 	VM_BUG_ON_PAGE(page_count(page), page);
@@ -6591,35 +6627,20 @@ void mem_cgroup_uncharge(struct page *page)
 	 * exclusive access to the page.
 	 */
 	memcg = pc->mem_cgroup;
-	flags = pc->flags;
+	pc_flags = pc->flags;
 	pc->flags = 0;
 
 	mem_cgroup_charge_statistics(memcg, page, -nr_pages);
 	memcg_check_events(memcg, page);
 
-	batch = &current->memcg_batch;
-	if (!batch->memcg)
-		batch->memcg = memcg;
-	else if (batch->memcg != memcg)
-		goto uncharge;
-	if (nr_pages > 1)
-		goto uncharge;
-	if (!batch->do_batch)
-		goto uncharge;
-	if (test_thread_flag(TIF_MEMDIE))
-		goto uncharge;
-	if (flags & PCG_MEM)
-		batch->nr_pages++;
-	if (flags & PCG_MEMSW)
-		batch->memsw_nr_pages++;
-	return;
-uncharge:
-	if (flags & PCG_MEM)
+	if (uncharge_batched(memcg, pc_flags, nr_pages))
+		return;
+
+	if (pc_flags & PCG_MEM)
 		res_counter_uncharge(&memcg->res, nr_pages * PAGE_SIZE);
-	if (flags & PCG_MEMSW)
+	if (pc_flags & PCG_MEMSW)
 		res_counter_uncharge(&memcg->memsw, nr_pages * PAGE_SIZE);
-	if (batch->memcg != memcg)
-		memcg_oom_recover(memcg);
+	memcg_oom_recover(memcg);
 }
 
 /**
-- 
2.0.0



^ permalink raw reply related	[flat|nested] 16+ messages in thread


* Re: mm: memcontrol: rewrite uncharge API: problems
  2014-07-02 21:20     ` Johannes Weiner
@ 2014-07-02 22:28       ` Hugh Dickins
  -1 siblings, 0 replies; 16+ messages in thread
From: Hugh Dickins @ 2014-07-02 22:28 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Hugh Dickins, Andrew Morton, Michal Hocko, linux-mm, linux-kernel

On Wed, 2 Jul 2014, Johannes Weiner wrote:
> On Tue, Jul 01, 2014 at 01:46:12PM -0400, Johannes Weiner wrote:
> > Hi Hugh,
> > 
> > On Mon, Jun 30, 2014 at 04:55:10PM -0700, Hugh Dickins wrote:
> > > Hi Hannes,
> > > 
> > > Your rewrite of the memcg charge/uncharge API is bold and attractive,
> > > but I'm having some problems with the way release_pages() now does
> > > uncharging in I/O completion context.
> > 
> > Yes, I need to make the uncharge path IRQ-safe.  This looks doable.
> > 
> > > At the bottom see the lockdep message I get when I start shmem swapping.
> > > Which I have not begun to attempt to decipher (over to you!), but I do
> > > see release_pages() mentioned in there (also i915, hope it's irrelevant).
> > 
> > This seems to be about uncharge acquiring the IRQ-unsafe soft limit
> > tree lock while the outer release_pages() holds the IRQ-safe lru_lock.
> > A separate issue, AFAICS, that would also be fixed by IRQ-proofing the
> > uncharge path.
> > 
> > > Which was already worrying me on the PowerPC G5, when moving tasks from
> > > one memcg to another and removing the old, while swapping and swappingoff
> > > (I haven't tried much else actually, maybe it's much easier to reproduce).
> > > 
> > > I get "unable to handle kernel paging at 0x180" oops in __raw_spinlock <
> > > res_counter_uncharge_until < mem_cgroup_uncharge_end < release_pages <
> > > free_pages_and_swap_cache < tlb_flush_mmu_free < tlb_finish_mmu <
> > > unmap_region < do_munmap (or from exit_mmap < mmput < do_exit).
> > > 
> > > I do have CONFIG_MEMCG_SWAP=y, and I think 0x180 corresponds to the
> > > memsw res_counter spinlock, if memcg is NULL.  I don't understand why
> > > usually the PowerPC: I did see something like it once on this x86 laptop,
> > > maybe having lockdep in on this slows things down enough not to hit that.
> > > 
> > > I've stopped those crashes with patch below: the memcg_batch uncharging
> > > was never designed for use from interrupts.  But I bet it needs more work:
> > > to disable interrupts, or do something clever with atomics, or... over to
> > > you again.
> > 
> > I was convinced I had tested these changes with lockdep enabled, but
> > it must have been at an earlier stage while developing the series.
> > Otherwise, I should have gotten the same splat as you report.
> 
> Turns out this was because the soft limit was not set in my tests, and
> without soft limit excess that spinlock is never acquired.  I could
> reproduce it now.
> 
> > Thanks for the report, I hope to have something useful ASAP.
> 
> Could you give the following patch a spin?  I put it in the mmots
> stack on top of mm-memcontrol-rewrite-charge-api-fix-shmem_unuse-fix.

I'm just with the laptop until this evening.  I slapped it on top of
my 3.16-rc2-mm1 plus fixes (but obviously minus my memcg_batch one
- which incidentally continues to run without crashing on the G5),
and it quickly gave me this lockdep splat, which doesn't look very
different from the one before.

I see there's now an -rc3-mm1, I'll try it out on that in half an
hour... but unless I send word otherwise, assume that's the same.

======================================================
[ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]
3.16.0-rc2-mm1 #6 Not tainted
------------------------------------------------------
cc1/1272 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
 (&(&rtpz->lock)->rlock){+.+.-.}, at: [<ffffffff81151a4c>] memcg_check_events+0x174/0x1fe

and this task is already holding:
 (&(&zone->lru_lock)->rlock){..-.-.}, at: [<ffffffff8110da3f>] release_pages+0xe7/0x239
which would create a new lock dependency:
 (&(&zone->lru_lock)->rlock){..-.-.} -> (&(&rtpz->lock)->rlock){+.+.-.}

but this new dependency connects a SOFTIRQ-irq-safe lock:
 (&(&zone->lru_lock)->rlock){..-.-.}
... which became SOFTIRQ-irq-safe at:
  [<ffffffff810c201e>] __lock_acquire+0x59f/0x17e8
  [<ffffffff810c38a6>] lock_acquire+0x61/0x78
  [<ffffffff815be1e5>] _raw_spin_lock_irqsave+0x3f/0x51
  [<ffffffff8110dc0e>] pagevec_lru_move_fn+0x7d/0xf6
  [<ffffffff8110dca4>] pagevec_move_tail+0x1d/0x2c
  [<ffffffff8110e298>] rotate_reclaimable_page+0xb2/0xd4
  [<ffffffff811018bf>] end_page_writeback+0x1c/0x45
  [<ffffffff81134594>] end_swap_bio_write+0x5c/0x69
  [<ffffffff81234952>] bio_endio+0x50/0x6e
  [<ffffffff81239002>] blk_update_request+0x163/0x255
  [<ffffffff8123910b>] blk_update_bidi_request+0x17/0x65
  [<ffffffff81239456>] blk_end_bidi_request+0x1a/0x56
  [<ffffffff8123949d>] blk_end_request+0xb/0xd
  [<ffffffff813a097a>] scsi_io_completion+0x16d/0x553
  [<ffffffff81399e2f>] scsi_finish_command+0xb6/0xbf
  [<ffffffff813a0784>] scsi_softirq_done+0xe9/0xf0
  [<ffffffff8123eaf9>] blk_done_softirq+0x79/0x8b
  [<ffffffff81088675>] __do_softirq+0xfc/0x21f
  [<ffffffff8108898f>] irq_exit+0x3d/0x92
  [<ffffffff81032379>] do_IRQ+0xcc/0xe5
  [<ffffffff815bf7ec>] ret_from_intr+0x0/0x13
  [<ffffffff81145e78>] cache_alloc_debugcheck_after.isra.51+0x26/0x1ad
  [<ffffffff81147011>] kmem_cache_alloc+0x11f/0x171
  [<ffffffff81234a42>] bvec_alloc+0xa7/0xc7
  [<ffffffff81234b55>] bio_alloc_bioset+0xf3/0x17d
  [<ffffffff811c3a7a>] ext4_bio_write_page+0x1e2/0x2c8
  [<ffffffff811bcd89>] mpage_submit_page+0x5c/0x72
  [<ffffffff811bd28c>] mpage_map_and_submit_buffers+0x1a5/0x215
  [<ffffffff811c11de>] ext4_writepages+0x9dc/0xa1f
  [<ffffffff8110c389>] do_writepages+0x1c/0x2a
  [<ffffffff8117d218>] __writeback_single_inode+0x3c/0xee
  [<ffffffff8117da90>] writeback_sb_inodes+0x1c6/0x30b
  [<ffffffff8117dc44>] __writeback_inodes_wb+0x6f/0xb3
  [<ffffffff8117df27>] wb_writeback+0x101/0x195
  [<ffffffff8117e37d>] bdi_writeback_workfn+0x87/0x2a1
  [<ffffffff8109b122>] process_one_work+0x221/0x3c5
  [<ffffffff8109bdd5>] worker_thread+0x2ec/0x3ef
  [<ffffffff810a1923>] kthread+0xf1/0xf9
  [<ffffffff815bec6c>] ret_from_fork+0x7c/0xb0

to a SOFTIRQ-irq-unsafe lock:
 (&(&rtpz->lock)->rlock){+.+.-.}
... which became SOFTIRQ-irq-unsafe at:
...  [<ffffffff810c2095>] __lock_acquire+0x616/0x17e8
  [<ffffffff810c38a6>] lock_acquire+0x61/0x78
  [<ffffffff815be0c7>] _raw_spin_lock+0x34/0x41
  [<ffffffff811566ca>] mem_cgroup_soft_limit_reclaim+0x260/0x6b9
  [<ffffffff81113012>] balance_pgdat+0x26e/0x503
  [<ffffffff811135ae>] kswapd+0x307/0x341
  [<ffffffff810a1923>] kthread+0xf1/0xf9
  [<ffffffff815bec6c>] ret_from_fork+0x7c/0xb0

other info that might help us debug this:

 Possible interrupt unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&(&rtpz->lock)->rlock);
                               local_irq_disable();
                               lock(&(&zone->lru_lock)->rlock);
                               lock(&(&rtpz->lock)->rlock);
  <Interrupt>
    lock(&(&zone->lru_lock)->rlock);

 *** DEADLOCK ***

1 lock held by cc1/1272:
 #0:  (&(&zone->lru_lock)->rlock){..-.-.}, at: [<ffffffff8110da3f>] release_pages+0xe7/0x239

the dependencies between SOFTIRQ-irq-safe lock and the holding lock:
-> (&(&zone->lru_lock)->rlock){..-.-.} ops: 125969 {
   IN-SOFTIRQ-W at:
                    [<ffffffff810c201e>] __lock_acquire+0x59f/0x17e8
                    [<ffffffff810c38a6>] lock_acquire+0x61/0x78
                    [<ffffffff815be1e5>] _raw_spin_lock_irqsave+0x3f/0x51
                    [<ffffffff8110dc0e>] pagevec_lru_move_fn+0x7d/0xf6
                    [<ffffffff8110dca4>] pagevec_move_tail+0x1d/0x2c
                    [<ffffffff8110e298>] rotate_reclaimable_page+0xb2/0xd4
                    [<ffffffff811018bf>] end_page_writeback+0x1c/0x45
                    [<ffffffff81134594>] end_swap_bio_write+0x5c/0x69
                    [<ffffffff81234952>] bio_endio+0x50/0x6e
                    [<ffffffff81239002>] blk_update_request+0x163/0x255
                    [<ffffffff8123910b>] blk_update_bidi_request+0x17/0x65
                    [<ffffffff81239456>] blk_end_bidi_request+0x1a/0x56
                    [<ffffffff8123949d>] blk_end_request+0xb/0xd
                    [<ffffffff813a097a>] scsi_io_completion+0x16d/0x553
                    [<ffffffff81399e2f>] scsi_finish_command+0xb6/0xbf
                    [<ffffffff813a0784>] scsi_softirq_done+0xe9/0xf0
                    [<ffffffff8123eaf9>] blk_done_softirq+0x79/0x8b
                    [<ffffffff81088675>] __do_softirq+0xfc/0x21f
                    [<ffffffff8108898f>] irq_exit+0x3d/0x92
                    [<ffffffff81032379>] do_IRQ+0xcc/0xe5
                    [<ffffffff815bf7ec>] ret_from_intr+0x0/0x13
                    [<ffffffff81145e78>] cache_alloc_debugcheck_after.isra.51+0x26/0x1ad
                    [<ffffffff81147011>] kmem_cache_alloc+0x11f/0x171
                    [<ffffffff81234a42>] bvec_alloc+0xa7/0xc7
                    [<ffffffff81234b55>] bio_alloc_bioset+0xf3/0x17d
                    [<ffffffff811c3a7a>] ext4_bio_write_page+0x1e2/0x2c8
                    [<ffffffff811bcd89>] mpage_submit_page+0x5c/0x72
                    [<ffffffff811bd28c>] mpage_map_and_submit_buffers+0x1a5/0x215
                    [<ffffffff811c11de>] ext4_writepages+0x9dc/0xa1f
                    [<ffffffff8110c389>] do_writepages+0x1c/0x2a
                    [<ffffffff8117d218>] __writeback_single_inode+0x3c/0xee
                    [<ffffffff8117da90>] writeback_sb_inodes+0x1c6/0x30b
                    [<ffffffff8117dc44>] __writeback_inodes_wb+0x6f/0xb3
                    [<ffffffff8117df27>] wb_writeback+0x101/0x195
                    [<ffffffff8117e37d>] bdi_writeback_workfn+0x87/0x2a1
                    [<ffffffff8109b122>] process_one_work+0x221/0x3c5
                    [<ffffffff8109bdd5>] worker_thread+0x2ec/0x3ef
                    [<ffffffff810a1923>] kthread+0xf1/0xf9
                    [<ffffffff815bec6c>] ret_from_fork+0x7c/0xb0
   IN-RECLAIM_FS-W at:
                       [<ffffffff810c20c3>] __lock_acquire+0x644/0x17e8
                       [<ffffffff810c38a6>] lock_acquire+0x61/0x78
                       [<ffffffff815be1e5>] _raw_spin_lock_irqsave+0x3f/0x51
                       [<ffffffff8110dc0e>] pagevec_lru_move_fn+0x7d/0xf6
                       [<ffffffff8110dcc5>] __pagevec_lru_add+0x12/0x14
                       [<ffffffff8110e648>] lru_add_drain_cpu+0x28/0xb3
                       [<ffffffff8110e783>] lru_add_drain+0x1a/0x37
                       [<ffffffff81111be5>] shrink_active_list+0x62/0x2cb
                       [<ffffffff81112efa>] balance_pgdat+0x156/0x503
                       [<ffffffff811135ae>] kswapd+0x307/0x341
                       [<ffffffff810a1923>] kthread+0xf1/0xf9
                       [<ffffffff815bec6c>] ret_from_fork+0x7c/0xb0
   INITIAL USE at:
                   [<ffffffff810c20db>] __lock_acquire+0x65c/0x17e8
                   [<ffffffff810c38a6>] lock_acquire+0x61/0x78
                   [<ffffffff815be1e5>] _raw_spin_lock_irqsave+0x3f/0x51
                   [<ffffffff8110dc0e>] pagevec_lru_move_fn+0x7d/0xf6
                   [<ffffffff8110dcc5>] __pagevec_lru_add+0x12/0x14
                   [<ffffffff8110dd37>] __lru_cache_add+0x70/0x9f
                   [<ffffffff8110e44e>] lru_cache_add_anon+0x14/0x16
                   [<ffffffff81115b5c>] shmem_getpage_gfp+0x3be/0x68a
                   [<ffffffff81115f25>] shmem_read_mapping_page_gfp+0x2e/0x49
                   [<ffffffff813318af>] i915_gem_object_get_pages_gtt+0xe5/0x3f9
                   [<ffffffff8132d88e>] i915_gem_object_get_pages+0x64/0x8f
                   [<ffffffff813310ca>] i915_gem_object_pin+0x2a0/0x5af
                   [<ffffffff81340b1b>] intel_init_ring_buffer+0x2ba/0x3e6
                   [<ffffffff8134345a>] intel_init_render_ring_buffer+0x38b/0x3a6
                   [<ffffffff8132fcce>] i915_gem_init_hw+0x127/0x2c6
                   [<ffffffff8132ff77>] i915_gem_init+0x10a/0x189
                   [<ffffffff81381f2c>] i915_driver_load+0xb1b/0xde7
                   [<ffffffff81300180>] drm_dev_register+0x7f/0xf8
                   [<ffffffff813023a5>] drm_get_pci_dev+0xf7/0x1b4
                   [<ffffffff81311f4f>] i915_pci_probe+0x40/0x49
                   [<ffffffff8127dffd>] local_pci_probe+0x1f/0x51
                   [<ffffffff8127e0f5>] pci_device_probe+0xc6/0xec
                   [<ffffffff81389940>] driver_probe_device+0x99/0x1b9
                   [<ffffffff81389af4>] __driver_attach+0x5c/0x7e
                   [<ffffffff8138809f>] bus_for_each_dev+0x55/0x89
                   [<ffffffff81389616>] driver_attach+0x19/0x1b
                   [<ffffffff813891d2>] bus_add_driver+0xec/0x1d3
                   [<ffffffff8138a041>] driver_register+0x89/0xc5
                   [<ffffffff8127d6af>] __pci_register_driver+0x58/0x5b
                   [<ffffffff813024bb>] drm_pci_init+0x59/0xda
                   [<ffffffff8199497f>] i915_init+0x89/0x90
                   [<ffffffff8100030e>] do_one_initcall+0xea/0x18c
                   [<ffffffff81970f8d>] kernel_init_freeable+0x104/0x196
                   [<ffffffff815a92d8>] kernel_init+0x9/0xd5
                   [<ffffffff815bec6c>] ret_from_fork+0x7c/0xb0
 }
 ... key      at: [<ffffffff8273c920>] __key.37664+0x0/0x8
 ... acquired at:
   [<ffffffff810c0f1b>] check_irq_usage+0x54/0xa8
   [<ffffffff810c2b50>] __lock_acquire+0x10d1/0x17e8
   [<ffffffff810c38a6>] lock_acquire+0x61/0x78
   [<ffffffff815be1e5>] _raw_spin_lock_irqsave+0x3f/0x51
   [<ffffffff81151a4c>] memcg_check_events+0x174/0x1fe
   [<ffffffff81157384>] mem_cgroup_uncharge+0xfa/0x1fc
   [<ffffffff8110db2a>] release_pages+0x1d2/0x239
   [<ffffffff81135036>] free_pages_and_swap_cache+0x72/0x8c
   [<ffffffff81121503>] tlb_flush_mmu_free+0x21/0x3c
   [<ffffffff81121ef1>] tlb_flush_mmu+0x1b/0x1e
   [<ffffffff81121f03>] tlb_finish_mmu+0xf/0x34
   [<ffffffff8112aafc>] exit_mmap+0xb5/0x117
   [<ffffffff81081a9d>] mmput+0x52/0xce
   [<ffffffff81086842>] do_exit+0x355/0x9b7
   [<ffffffff81086f46>] do_group_exit+0x76/0xb5
   [<ffffffff81086f94>] __wake_up_parent+0x0/0x23
   [<ffffffff815bed12>] system_call_fastpath+0x16/0x1b


the dependencies between the lock to be acquired and SOFTIRQ-irq-unsafe lock:
-> (&(&rtpz->lock)->rlock){+.+.-.} ops: 857 {
   HARDIRQ-ON-W at:
                    [<ffffffff810c2073>] __lock_acquire+0x5f4/0x17e8
                    [<ffffffff810c38a6>] lock_acquire+0x61/0x78
                    [<ffffffff815be0c7>] _raw_spin_lock+0x34/0x41
                    [<ffffffff811566ca>] mem_cgroup_soft_limit_reclaim+0x260/0x6b9
                    [<ffffffff81113012>] balance_pgdat+0x26e/0x503
                    [<ffffffff811135ae>] kswapd+0x307/0x341
                    [<ffffffff810a1923>] kthread+0xf1/0xf9
                    [<ffffffff815bec6c>] ret_from_fork+0x7c/0xb0
   SOFTIRQ-ON-W at:
                    [<ffffffff810c2095>] __lock_acquire+0x616/0x17e8
                    [<ffffffff810c38a6>] lock_acquire+0x61/0x78
                    [<ffffffff815be0c7>] _raw_spin_lock+0x34/0x41
                    [<ffffffff811566ca>] mem_cgroup_soft_limit_reclaim+0x260/0x6b9
                    [<ffffffff81113012>] balance_pgdat+0x26e/0x503
                    [<ffffffff811135ae>] kswapd+0x307/0x341
                    [<ffffffff810a1923>] kthread+0xf1/0xf9
                    [<ffffffff815bec6c>] ret_from_fork+0x7c/0xb0
   IN-RECLAIM_FS-W at:
                       [<ffffffff810c20c3>] __lock_acquire+0x644/0x17e8
                       [<ffffffff810c38a6>] lock_acquire+0x61/0x78
                       [<ffffffff815be231>] _raw_spin_lock_irq+0x3a/0x47
                       [<ffffffff811564ea>] mem_cgroup_soft_limit_reclaim+0x80/0x6b9
                       [<ffffffff81113012>] balance_pgdat+0x26e/0x503
                       [<ffffffff811135ae>] kswapd+0x307/0x341
                       [<ffffffff810a1923>] kthread+0xf1/0xf9
                       [<ffffffff815bec6c>] ret_from_fork+0x7c/0xb0
   INITIAL USE at:
                   [<ffffffff810c20db>] __lock_acquire+0x65c/0x17e8
                   [<ffffffff810c38a6>] lock_acquire+0x61/0x78
                   [<ffffffff815be1e5>] _raw_spin_lock_irqsave+0x3f/0x51
                   [<ffffffff81151a4c>] memcg_check_events+0x174/0x1fe
                   [<ffffffff81153756>] commit_charge+0x260/0x26f
                   [<ffffffff811571da>] mem_cgroup_commit_charge+0xb1/0xdb
                   [<ffffffff81101535>] __add_to_page_cache_locked+0x205/0x23d
                   [<ffffffff8110159b>] add_to_page_cache_lru+0x20/0x63
                   [<ffffffff8118bbf5>] mpage_readpages+0x8c/0xfa
                   [<ffffffff811bccbb>] ext4_readpages+0x37/0x3e
                   [<ffffffff8110c95d>] __do_page_cache_readahead+0x1fa/0x27d
                   [<ffffffff8110cd5b>] ondemand_readahead+0x37b/0x38c
                   [<ffffffff8110ceaf>] page_cache_sync_readahead+0x38/0x3a
                   [<ffffffff811031d0>] generic_file_read_iter+0x1bd/0x588
                   [<ffffffff8115a28a>] new_sync_read+0x78/0x9c
                   [<ffffffff8115a630>] vfs_read+0x89/0x124
                   [<ffffffff8115afa7>] SyS_read+0x45/0x8c
                   [<ffffffff815bed12>] system_call_fastpath+0x16/0x1b
 }
 ... key      at: [<ffffffff82747bf0>] __key.49550+0x0/0x8
 ... acquired at:
   [<ffffffff810c0f1b>] check_irq_usage+0x54/0xa8
   [<ffffffff810c2b50>] __lock_acquire+0x10d1/0x17e8
   [<ffffffff810c38a6>] lock_acquire+0x61/0x78
   [<ffffffff815be1e5>] _raw_spin_lock_irqsave+0x3f/0x51
   [<ffffffff81151a4c>] memcg_check_events+0x174/0x1fe
   [<ffffffff81157384>] mem_cgroup_uncharge+0xfa/0x1fc
   [<ffffffff8110db2a>] release_pages+0x1d2/0x239
   [<ffffffff81135036>] free_pages_and_swap_cache+0x72/0x8c
   [<ffffffff81121503>] tlb_flush_mmu_free+0x21/0x3c
   [<ffffffff81121ef1>] tlb_flush_mmu+0x1b/0x1e
   [<ffffffff81121f03>] tlb_finish_mmu+0xf/0x34
   [<ffffffff8112aafc>] exit_mmap+0xb5/0x117
   [<ffffffff81081a9d>] mmput+0x52/0xce
   [<ffffffff81086842>] do_exit+0x355/0x9b7
   [<ffffffff81086f46>] do_group_exit+0x76/0xb5
   [<ffffffff81086f94>] __wake_up_parent+0x0/0x23
   [<ffffffff815bed12>] system_call_fastpath+0x16/0x1b


stack backtrace:
CPU: 0 PID: 1272 Comm: cc1 Not tainted 3.16.0-rc2-mm1 #6
Hardware name: LENOVO 4174EH1/4174EH1, BIOS 8CET51WW (1.31 ) 11/29/2011
 0000000000000000 ffff8800108f3a08 ffffffff815b2d51 ffff8800100f1268
 ffff8800108f3b00 ffffffff810c0eb6 0000000000000000 ffff880000000000
 ffffffff00000001 0000000400000006 ffffffff81811f0a ffff8800108f3a50
Call Trace:
 [<ffffffff815b2d51>] dump_stack+0x4e/0x7a
 [<ffffffff810c0eb6>] check_usage+0x591/0x5a2
 [<ffffffff81151317>] ? lookup_page_cgroup_used+0x9/0x19
 [<ffffffff810c0f1b>] check_irq_usage+0x54/0xa8
 [<ffffffff810c2b50>] __lock_acquire+0x10d1/0x17e8
 [<ffffffff810c38a6>] lock_acquire+0x61/0x78
 [<ffffffff81151a4c>] ? memcg_check_events+0x174/0x1fe
 [<ffffffff815be1e5>] _raw_spin_lock_irqsave+0x3f/0x51
 [<ffffffff81151a4c>] ? memcg_check_events+0x174/0x1fe
 [<ffffffff81151a4c>] memcg_check_events+0x174/0x1fe
 [<ffffffff81157384>] mem_cgroup_uncharge+0xfa/0x1fc
 [<ffffffff8110da3f>] ? release_pages+0xe7/0x239
 [<ffffffff8110db2a>] release_pages+0x1d2/0x239
 [<ffffffff81135036>] free_pages_and_swap_cache+0x72/0x8c
 [<ffffffff81121503>] tlb_flush_mmu_free+0x21/0x3c
 [<ffffffff81121ef1>] tlb_flush_mmu+0x1b/0x1e
 [<ffffffff81121f03>] tlb_finish_mmu+0xf/0x34
 [<ffffffff8112aafc>] exit_mmap+0xb5/0x117
 [<ffffffff81081a9d>] mmput+0x52/0xce
 [<ffffffff81086842>] do_exit+0x355/0x9b7
 [<ffffffff815bf88e>] ? retint_swapgs+0xe/0x13
 [<ffffffff81086f46>] do_group_exit+0x76/0xb5
 [<ffffffff81086f94>] SyS_exit_group+0xf/0xf
 [<ffffffff815bed12>] system_call_fastpath+0x16/0x1b

Hugh
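[Editorial aside: for anyone trying to follow the splat, the shape of the problem — and of the fix direction discussed above, making the uncharge path IRQ-safe — can be modelled in a few lines of plain C. This is a single-threaded toy sketch of the locking pattern only; the names (`rtpz_lock_held`, `maybe_interrupt`, `soft_limit_section`) are illustrative and not the kernel's actual code.]

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the inversion lockdep complains about.  A spinlock is
 * modelled as a bool; "interrupts" as a flag gating maybe_interrupt(). */

static bool irqs_enabled = true;
static bool rtpz_lock_held;

/* Models the I/O completion softirq: release_pages() ->
 * mem_cgroup_uncharge() ends up wanting rtpz->lock.  If the
 * interrupted context already holds that lock on this CPU, a real
 * spinlock would spin on itself forever. */
static void maybe_interrupt(void)
{
    if (!irqs_enabled)
        return;
    assert(!rtpz_lock_held);   /* the self-deadlock lockdep warns about */
}

/* The fix direction: take rtpz->lock with interrupts disabled
 * (spin_lock_irqsave() in the kernel), so the completion path cannot
 * fire on this CPU while the lock is held. */
static void soft_limit_section(void)
{
    bool saved = irqs_enabled;

    irqs_enabled = false;       /* local_irq_save()      */
    rtpz_lock_held = true;      /* spin_lock()           */
    maybe_interrupt();          /* harmless: IRQs are off */
    rtpz_lock_held = false;     /* spin_unlock()         */
    irqs_enabled = saved;       /* local_irq_restore()   */
}

int demo(void)
{
    soft_limit_section();       /* kswapd-style holder, now IRQ-safe   */
    maybe_interrupt();          /* interrupt with no lock held: fine   */
    return 0;
}
```

With a plain `spin_lock()` in `soft_limit_section()` (i.e. leaving `irqs_enabled` true), the assert in `maybe_interrupt()` trips — the same ordering the "Possible interrupt unsafe locking scenario" diagram above describes.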

   [<ffffffff81086f94>] __wake_up_parent+0x0/0x23
   [<ffffffff815bed12>] system_call_fastpath+0x16/0x1b


stack backtrace:
CPU: 0 PID: 1272 Comm: cc1 Not tainted 3.16.0-rc2-mm1 #6
Hardware name: LENOVO 4174EH1/4174EH1, BIOS 8CET51WW (1.31 ) 11/29/2011
 0000000000000000 ffff8800108f3a08 ffffffff815b2d51 ffff8800100f1268
 ffff8800108f3b00 ffffffff810c0eb6 0000000000000000 ffff880000000000
 ffffffff00000001 0000000400000006 ffffffff81811f0a ffff8800108f3a50
Call Trace:
 [<ffffffff815b2d51>] dump_stack+0x4e/0x7a
 [<ffffffff810c0eb6>] check_usage+0x591/0x5a2
 [<ffffffff81151317>] ? lookup_page_cgroup_used+0x9/0x19
 [<ffffffff810c0f1b>] check_irq_usage+0x54/0xa8
 [<ffffffff810c2b50>] __lock_acquire+0x10d1/0x17e8
 [<ffffffff810c38a6>] lock_acquire+0x61/0x78
 [<ffffffff81151a4c>] ? memcg_check_events+0x174/0x1fe
 [<ffffffff815be1e5>] _raw_spin_lock_irqsave+0x3f/0x51
 [<ffffffff81151a4c>] ? memcg_check_events+0x174/0x1fe
 [<ffffffff81151a4c>] memcg_check_events+0x174/0x1fe
 [<ffffffff81157384>] mem_cgroup_uncharge+0xfa/0x1fc
 [<ffffffff8110da3f>] ? release_pages+0xe7/0x239
 [<ffffffff8110db2a>] release_pages+0x1d2/0x239
 [<ffffffff81135036>] free_pages_and_swap_cache+0x72/0x8c
 [<ffffffff81121503>] tlb_flush_mmu_free+0x21/0x3c
 [<ffffffff81121ef1>] tlb_flush_mmu+0x1b/0x1e
 [<ffffffff81121f03>] tlb_finish_mmu+0xf/0x34
 [<ffffffff8112aafc>] exit_mmap+0xb5/0x117
 [<ffffffff81081a9d>] mmput+0x52/0xce
 [<ffffffff81086842>] do_exit+0x355/0x9b7
 [<ffffffff815bf88e>] ? retint_swapgs+0xe/0x13
 [<ffffffff81086f46>] do_group_exit+0x76/0xb5
 [<ffffffff81086f94>] SyS_exit_group+0xf/0xf
 [<ffffffff815bed12>] system_call_fastpath+0x16/0x1b

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href="mailto:dont@kvack.org">email@kvack.org</a>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: mm: memcontrol: rewrite uncharge API: problems
  2014-07-02 22:28       ` Hugh Dickins
@ 2014-07-03 19:54         ` Hugh Dickins
  -1 siblings, 0 replies; 16+ messages in thread
From: Hugh Dickins @ 2014-07-03 19:54 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Johannes Weiner, Andrew Morton, Michal Hocko, linux-mm, linux-kernel

On Wed, 2 Jul 2014, Hugh Dickins wrote:
> On Wed, 2 Jul 2014, Johannes Weiner wrote:
> > 
> > Could you give the following patch a spin?  I put it in the mmots
> > stack on top of mm-memcontrol-rewrite-charge-api-fix-shmem_unuse-fix.
> 
> I'm just with the laptop until this evening.  I slapped it on top of
> my 3.16-rc2-mm1 plus fixes (but obviously minus my memcg_batch one
> - which incidentally continues to run without crashing on the G5),
> and it quickly gave me this lockdep splat, which doesn't look very
> different from the one before.
> 
> I see there's now an -rc3-mm1, I'll try it out on that in half an
> hour... but unless I send word otherwise, assume that's the same.

Yes, I get that lockdep report each time on -rc3-mm1 + your patch.

I also twice got a flurry of res_counter.c:28 underflow warnings.
Hmm, 62 of them each time (I was checking for a number near 512,
which would suggest a THP/4k confusion, but no).  The majority
of them coming from mem_cgroup_reparent_charges.

But the laptop stayed up fine (for two hours before I had to stop
it), and the G5 has run fine with that load for 16 hours now, no
problems with release_pages, and not even a res_counter.c:28 (but
I don't use lockdep on it).

The x86 workstation ran fine for 4.5 hours, then hit some deadlock
which I doubt had any connection to your changes: looked more like
a jbd2 transaction was failing to complete (which, with me trying
ext4 on loop on tmpfs, might be more my problem than anyone else's).

Oh, but nearly forgot, I did an earlier run on the laptop last night,
which crashed within minutes on

VM_BUG_ON_PAGE(!(pc->flags & PCG_MEM))
mm/memcontrol.c:6680!
page had count 1 mapcount 0 mapping anon index 0x196
flags locked uptodate reclaim swapbacked, pcflags 1, memcg not root
mem_cgroup_migrate < move_to_new_page < migrate_pages < compact_zone <
compact_zone_order < try_to_compact_pages < __alloc_pages_direct_compact <
__alloc_pages_nodemask < alloc_pages_vma < do_huge_pmd_anonymous_page <
handle_mm_fault < __do_page_fault

I was expecting to reproduce that quite easily on the laptop or
workstation, and investigate more closely then; but in fact have
not seen it since.

Hugh

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: mm: memcontrol: rewrite uncharge API: problems
  2014-07-03 19:54         ` Hugh Dickins
@ 2014-07-04  0:41           ` Johannes Weiner
  -1 siblings, 0 replies; 16+ messages in thread
From: Johannes Weiner @ 2014-07-04  0:41 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, Michal Hocko, linux-mm, linux-kernel

On Thu, Jul 03, 2014 at 12:54:36PM -0700, Hugh Dickins wrote:
> On Wed, 2 Jul 2014, Hugh Dickins wrote:
> > On Wed, 2 Jul 2014, Johannes Weiner wrote:
> > > 
> > > Could you give the following patch a spin?  I put it in the mmots
> > > stack on top of mm-memcontrol-rewrite-charge-api-fix-shmem_unuse-fix.
> > 
> > I'm just with the laptop until this evening.  I slapped it on top of
> > my 3.16-rc2-mm1 plus fixes (but obviously minus my memcg_batch one
> > - which incidentally continues to run without crashing on the G5),
> > and it quickly gave me this lockdep splat, which doesn't look very
> > different from the one before.
> > 
> > I see there's now an -rc3-mm1, I'll try it out on that in half an
> > hour... but unless I send word otherwise, assume that's the same.
> 
> Yes, I get that lockdep report each time on -rc3-mm1 + your patch.

There are two instances where I missed making &rtpz->lock IRQ-safe:

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 91b621846e10..bbaa3f4cf4db 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3919,7 +3919,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 						    gfp_mask, &nr_scanned);
 		nr_reclaimed += reclaimed;
 		*total_scanned += nr_scanned;
-		spin_lock(&mctz->lock);
+		spin_lock_irq(&mctz->lock);
 
 		/*
 		 * If we failed to reclaim anything from this memory cgroup
@@ -3959,7 +3959,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 		 */
 		/* If excess == 0, no tree ops */
 		__mem_cgroup_insert_exceeded(mz, mctz, excess);
-		spin_unlock(&mctz->lock);
+		spin_unlock_irq(&mctz->lock);
 		css_put(&mz->memcg->css);
 		loop++;
 		/*

That should make it complete - but the IRQ toggling costs are fairly
high so I'm rewriting the batching code to use the page lists that
most uncharges have anyway, and then batch the no-IRQ sections.

> I also twice got a flurry of res_counter.c:28 underflow warnings.
> Hmm, 62 of them each time (I was checking for a number near 512,
> which would suggest a THP/4k confusion, but no).  The majority
> of them coming from mem_cgroup_reparent_charges.

I haven't seen these yet.  But the location makes sense: if there are
any imbalances they'll be noticed during a group's final uncharges.

> But the laptop stayed up fine (for two hours before I had to stop
> it), and the G5 has run fine with that load for 16 hours now, no
> problems with release_pages, and not even a res_counter.c:28 (but
> I don't use lockdep on it).

Great!

> The x86 workstation ran fine for 4.5 hours, then hit some deadlock
> which I doubt had any connection to your changes: looked more like
> a jbd2 transaction was failing to complete (which, with me trying
> ext4 on loop on tmpfs, might be more my problem than anyone else's).
> 
> Oh, but nearly forgot, I did an earlier run on the laptop last night,
> which crashed within minutes on
> 
> VM_BUG_ON_PAGE(!(pc->flags & PCG_MEM))
> mm/memcontrol.c:6680!
> page had count 1 mapcount 0 mapping anon index 0x196
> flags locked uptodate reclaim swapbacked, pcflags 1, memcg not root
> mem_cgroup_migrate < move_to_new_page < migrate_pages < compact_zone <
> compact_zone_order < try_to_compact_pages < __alloc_pages_direct_compact <
> __alloc_pages_nodemask < alloc_pages_vma < do_huge_pmd_anonymous_page <
> handle_mm_fault < __do_page_fault

Haven't seen that one yet, either.  The only way I can see this happen
is when the same page gets selected for migration twice in a row.
Maybe a race with putback, where it gets added to the LRU but isolated
by compaction before putback drops the refcount - will verify that.

Thanks for your reports!

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: mm: memcontrol: rewrite uncharge API: problems
  2014-07-04  0:41           ` Johannes Weiner
@ 2014-07-05  2:12             ` Hugh Dickins
  -1 siblings, 0 replies; 16+ messages in thread
From: Hugh Dickins @ 2014-07-05  2:12 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Hugh Dickins, Andrew Morton, Michal Hocko, linux-mm, linux-kernel

On Thu, 3 Jul 2014, Johannes Weiner wrote:
> On Thu, Jul 03, 2014 at 12:54:36PM -0700, Hugh Dickins wrote:
> > On Wed, 2 Jul 2014, Hugh Dickins wrote:
> > > On Wed, 2 Jul 2014, Johannes Weiner wrote:
> > > > 
> > > > Could you give the following patch a spin?  I put it in the mmots
> > > > stack on top of mm-memcontrol-rewrite-charge-api-fix-shmem_unuse-fix.
> > > 
> > > I'm just with the laptop until this evening.  I slapped it on top of
> > > my 3.16-rc2-mm1 plus fixes (but obviously minus my memcg_batch one
> > > - which incidentally continues to run without crashing on the G5),
> > > and it quickly gave me this lockdep splat, which doesn't look very
> > > different from the one before.
> > > 
> > > I see there's now an -rc3-mm1, I'll try it out on that in half an
> > > hour... but unless I send word otherwise, assume that's the same.
> > 
> > Yes, I get that lockdep report each time on -rc3-mm1 + your patch.
> 
> There are two instances where I missed making &rtpz->lock IRQ-safe:
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 91b621846e10..bbaa3f4cf4db 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3919,7 +3919,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  						    gfp_mask, &nr_scanned);
>  		nr_reclaimed += reclaimed;
>  		*total_scanned += nr_scanned;
> -		spin_lock(&mctz->lock);
> +		spin_lock_irq(&mctz->lock);
>  
>  		/*
>  		 * If we failed to reclaim anything from this memory cgroup
> @@ -3959,7 +3959,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  		 */
>  		/* If excess == 0, no tree ops */
>  		__mem_cgroup_insert_exceeded(mz, mctz, excess);
> -		spin_unlock(&mctz->lock);
> +		spin_unlock_irq(&mctz->lock);
>  		css_put(&mz->memcg->css);
>  		loop++;
>  		/*

Thanks, that fixes my lockdep reports.

> 
> That should make it complete - but the IRQ toggling costs are fairly
> high so I'm rewriting the batching code to use the page lists that
> most uncharges have anyway, and then batch the no-IRQ sections.

Sounds good.

> 
> > I also twice got a flurry of res_counter.c:28 underflow warnings.
> > Hmm, 62 of them each time (I was checking for a number near 512,
> > which would suggest a THP/4k confusion, but no).  The majority
> > of them coming from mem_cgroup_reparent_charges.
> 
> I haven't seen these yet.  But the location makes sense: if there are
> any imbalances they'll be noticed during a group's final uncharges.

I haven't seen any since adding your patch above, though I don't see
how it could affect them.  Of course I'll let you know if they reappear.

> 
> > But the laptop stayed up fine (for two hours before I had to stop
> > it), and the G5 has run fine with that load for 16 hours now, no
> > problems with release_pages, and not even a res_counter.c:28 (but
> > I don't use lockdep on it).
> 
> Great!
> 
> > The x86 workstation ran fine for 4.5 hours, then hit some deadlock
> > which I doubt had any connection to your changes: looked more like
> > a jbd2 transaction was failing to complete (which, with me trying
> > ext4 on loop on tmpfs, might be more my problem than anyone else's).
> > 
> > Oh, but nearly forgot, I did an earlier run on the laptop last night,
> > which crashed within minutes on
> > 
> > VM_BUG_ON_PAGE(!(pc->flags & PCG_MEM))
> > mm/memcontrol.c:6680!
> > page had count 1 mapcount 0 mapping anon index 0x196
> > flags locked uptodate reclaim swapbacked, pcflags 1, memcg not root
> > mem_cgroup_migrate < move_to_new_page < migrate_pages < compact_zone <
> > compact_zone_order < try_to_compact_pages < __alloc_pages_direct_compact <
> > __alloc_pages_nodemask < alloc_pages_vma < do_huge_pmd_anonymous_page <
> > handle_mm_fault < __do_page_fault

I got it again on the laptop, after 7 hours.

> 
> Haven't seen that one yet, either.  The only way I can see this happen
> is when the same page gets selected for migration twice in a row.
> Maybe a race with putback, where it gets added to the LRU but isolated
> by compaction before putback drops the refcount - will verify that.

Yes.  This is one of those cases where I read a mail too quickly,
misunderstand it, set it aside, plough through the source files,
pace around thinking, finally come up with a hypothesis, go back to
answer the mail, and find I've arrived at the same conclusion as you.

Not verified in any way, but yes, mem_cgroup_migrate() looks anomalous
to me, in clearing PCG_MEM and PCG_MEMSW but leaving PCG_USED.  Once
that old page is put back on LRU for freeing, it could get isolated
by another migrator, who discovers the anomalous state in its own
mem_cgroup_migrate().

mem_cgroup_migrate() should just set pc->flags = 0, shouldn't it?

But is there any point to PCG_USED now?  Couldn't PageCgroupUsed
(or better, PageCgroupCharged) just test PCG_MEM and PCG_MEMSW?
Which should be low bits of pc->mem_cgroup, halving the array.

Hugh

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: mm: memcontrol: rewrite uncharge API: problems
  2014-07-05  2:12             ` Hugh Dickins
@ 2014-07-07 13:44               ` Johannes Weiner
  -1 siblings, 0 replies; 16+ messages in thread
From: Johannes Weiner @ 2014-07-07 13:44 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, Michal Hocko, linux-mm, linux-kernel

On Fri, Jul 04, 2014 at 07:12:04PM -0700, Hugh Dickins wrote:
> On Thu, 3 Jul 2014, Johannes Weiner wrote:
> > On Thu, Jul 03, 2014 at 12:54:36PM -0700, Hugh Dickins wrote:
> > > On Wed, 2 Jul 2014, Hugh Dickins wrote:
> > > > On Wed, 2 Jul 2014, Johannes Weiner wrote:
> > > > > 
> > > > > Could you give the following patch a spin?  I put it in the mmots
> > > > > stack on top of mm-memcontrol-rewrite-charge-api-fix-shmem_unuse-fix.
> > > > 
> > > > I'm just with the laptop until this evening.  I slapped it on top of
> > > > my 3.16-rc2-mm1 plus fixes (but obviously minus my memcg_batch one
> > > > - which incidentally continues to run without crashing on the G5),
> > > > and it quickly gave me this lockdep splat, which doesn't look very
> > > > different from the one before.
> > > > 
> > > > I see there's now an -rc3-mm1, I'll try it out on that in half an
> > > > hour... but unless I send word otherwise, assume that's the same.
> > > 
> > > Yes, I get that lockdep report each time on -rc3-mm1 + your patch.
> > 
> > There are two instances where I missed making &rtpz->lock IRQ-safe:
> > 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 91b621846e10..bbaa3f4cf4db 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -3919,7 +3919,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> >  						    gfp_mask, &nr_scanned);
> >  		nr_reclaimed += reclaimed;
> >  		*total_scanned += nr_scanned;
> > -		spin_lock(&mctz->lock);
> > +		spin_lock_irq(&mctz->lock);
> >  
> >  		/*
> >  		 * If we failed to reclaim anything from this memory cgroup
> > @@ -3959,7 +3959,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> >  		 */
> >  		/* If excess == 0, no tree ops */
> >  		__mem_cgroup_insert_exceeded(mz, mctz, excess);
> > -		spin_unlock(&mctz->lock);
> > +		spin_unlock_irq(&mctz->lock);
> >  		css_put(&mz->memcg->css);
> >  		loop++;
> >  		/*
> 
> Thanks, that fixes my lockdep reports.
> 
> > 
> > That should make it complete - but the IRQ toggling costs are fairly
> > high so I'm rewriting the batching code to use the page lists that
> > most uncharges have anyway, and then batch the no-IRQ sections.
> 
> Sounds good.
> 
> > 
> > > I also twice got a flurry of res_counter.c:28 underflow warnings.
> > > Hmm, 62 of them each time (I was checking for a number near 512,
> > > which would suggest a THP/4k confusion, but no).  The majority
> > > of them coming from mem_cgroup_reparent_charges.
> > 
> > I haven't seen these yet.  But the location makes sense: if there are
> > any imbalances they'll be noticed during a group's final uncharges.
> 
> I haven't seen any since adding your patch above, though I don't see
> how it could affect them.  Of course I'll let you know if they reappear.
> 
> > 
> > > But the laptop stayed up fine (for two hours before I had to stop
> > > it), and the G5 has run fine with that load for 16 hours now, no
> > > problems with release_pages, and not even a res_counter.c:28 (but
> > > I don't use lockdep on it).
> > 
> > Great!
> > 
> > > The x86 workstation ran fine for 4.5 hours, then hit some deadlock
> > > which I doubt had any connection to your changes: looked more like
> > > a jbd2 transaction was failing to complete (which, with me trying
> > > ext4 on loop on tmpfs, might be more my problem than anyone else's).
> > > 
> > > Oh, but nearly forgot, I did an earlier run on the laptop last night,
> > > which crashed within minutes on
> > > 
> > > VM_BUG_ON_PAGE(!(pc->flags & PCG_MEM))
> > > mm/memcontrol.c:6680!
> > > page had count 1 mapcount 0 mapping anon index 0x196
> > > flags locked uptodate reclaim swapbacked, pcflags 1, memcg not root
> > > mem_cgroup_migrate < move_to_new_page < migrate_pages < compact_zone <
> > > compact_zone_order < try_to_compact_pages < __alloc_pages_direct_compact <
> > > __alloc_pages_nodemask < alloc_pages_vma < do_huge_pmd_anonymous_page <
> > > handle_mm_fault < __do_page_fault
> 
> I got it again on the laptop, after 7 hours.
> 
> > 
> > Haven't seen that one yet, either.  The only way I can see this happen
> > is when the same page gets selected for migration twice in a row.
> > Maybe a race with putback, where it gets added to the LRU but isolated
> > by compaction before putback drops the refcount - will verify that.
> 
> Yes.  This is one of those cases where I read a mail too quickly,
> misunderstand it, set it aside, plough through the source files,
> pace around thinking, finally come up with a hypothesis, go back to
> answer the mail, and find I've arrived at the same conclusion as you.
> 
> Not verified in any way, but yes, mem_cgroup_migrate() looks anomalous
> to me, in clearing PCG_MEM and PCG_MEMSW but leaving PCG_USED.  Once
> that old page is put back on LRU for freeing, it could get isolated
> by another migrator, who discovers the anomalous state in its own
> mem_cgroup_migrate().
> 
> mem_cgroup_migrate() should just set pc->flags = 0, shouldn't it?
> 
> But is there any point to PCG_USED now?  Couldn't PageCgroupUsed
> (or better, PageCgroupCharged) just test PCG_MEM and PCG_MEMSW?
> Which should be low bits of pc->mem_cgroup, halving the array.

This is a great idea, actually.

You are right that by the time mem_cgroup_migrate() takes away the
charges, we might be able to uncharge the page entirely, but I have to
review the page's context there to make sure this is still safe.

mem_cgroup_swapout() similarly moves charges away from the page right
before it's freed.  If we were to uncharge the page in there too, we
could remove the flags entirely: pc->mem_cgroup is only set when the
page is charged, and whether the charge includes memory+swap depends
on do_swap_account, which is a runtime constant.

I'll get back to this once the bugs are ironed out :)

Thanks, Hugh!

---
From d90a318b0916db685846045a0691b8ec1cdcc063 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Sat, 5 Jul 2014 11:53:04 -0400
Subject: [patch] mm: memcontrol: rewrite uncharge API fix - double migration

Hugh reports:

VM_BUG_ON_PAGE(!(pc->flags & PCG_MEM))
mm/memcontrol.c:6680!
page had count 1 mapcount 0 mapping anon index 0x196
flags locked uptodate reclaim swapbacked, pcflags 1, memcg not root
mem_cgroup_migrate < move_to_new_page < migrate_pages < compact_zone <
compact_zone_order < try_to_compact_pages < __alloc_pages_direct_compact <
__alloc_pages_nodemask < alloc_pages_vma < do_huge_pmd_anonymous_page <
handle_mm_fault < __do_page_fault

mem_cgroup_migrate() assumes that a page is only migrated once and
then freed immediately after.

However, putting the page back on the LRU list and dropping the
isolation refcount is not done atomically.  This allows a PFN-based
migrator like compaction to isolate the page, see the expected
anonymous page refcount of 1, and migrate the page once more.

Catch pages that have already been migrated and abort charge migration
gracefully.

Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1e3b27f8dc2f..e4afdbdda0a7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6653,7 +6653,10 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage,
 	if (!PageCgroupUsed(pc))
 		return;
 
-	VM_BUG_ON_PAGE(!(pc->flags & PCG_MEM), oldpage);
+	/* Already migrated */
+	if (!(pc->flags & PCG_MEM))
+		return;
+
 	VM_BUG_ON_PAGE(do_swap_account && !(pc->flags & PCG_MEMSW), oldpage);
 	pc->flags &= ~(PCG_MEM | PCG_MEMSW);
 
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread


end of thread, other threads:[~2014-07-07 13:45 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-06-30 23:55 mm: memcontrol: rewrite uncharge API: problems Hugh Dickins
2014-07-01 17:46 ` Johannes Weiner
2014-07-02 21:20   ` Johannes Weiner
2014-07-02 22:28     ` Hugh Dickins
2014-07-03 19:54       ` Hugh Dickins
2014-07-04  0:41         ` Johannes Weiner
2014-07-05  2:12           ` Hugh Dickins
2014-07-07 13:44             ` Johannes Weiner
