All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: [fuse-devel] Kernel panic under load
       [not found] <CAB3-ZyQ4Mbj2g6b6Zt4pGLhE7ew9O==rNbUgAaPLYSwdRK3Czw@mail.gmail.com>
@ 2016-09-09 14:32   ` Miklos Szeredi
  0 siblings, 0 replies; 11+ messages in thread
From: Miklos Szeredi @ 2016-09-09 14:32 UTC (permalink / raw)
  To: Antonio SJ Musumeci; +Cc: fuse-devel, linux-mm, Johannes Weiner, linux-kernel

On Fri, Sep 9, 2016 at 3:34 PM, Antonio SJ Musumeci <trapexit@spawn.link> wrote:
> https://gist.github.com/bauruine/3bc00075c4d0b5b3353071d208ded30f
> https://github.com/trapexit/mergerfs/issues/295
>
> I've some users which are having issues with my filesystem where the
> system's load increases and then the kernel panics.
>
> Has anyone seen this before?

Quite possibly this is caused by fuse, but the BUG is deep in mm
territory and I have zero clue about what it means.

Hannes,  can you please look a the above crash in mm/workingset.c?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [fuse-devel] Kernel panic under load
@ 2016-09-09 14:32   ` Miklos Szeredi
  0 siblings, 0 replies; 11+ messages in thread
From: Miklos Szeredi @ 2016-09-09 14:32 UTC (permalink / raw)
  To: Antonio SJ Musumeci; +Cc: fuse-devel, linux-mm, Johannes Weiner, linux-kernel

On Fri, Sep 9, 2016 at 3:34 PM, Antonio SJ Musumeci <trapexit@spawn.link> wrote:
> https://gist.github.com/bauruine/3bc00075c4d0b5b3353071d208ded30f
> https://github.com/trapexit/mergerfs/issues/295
>
> I've some users which are having issues with my filesystem where the
> system's load increases and then the kernel panics.
>
> Has anyone seen this before?

Quite possibly this is caused by fuse, but the BUG is deep in mm
territory and I have zero clue about what it means.

Hannes,  can you please look a the above crash in mm/workingset.c?

Thanks,
Miklos

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [fuse-devel] Kernel panic under load
  2016-09-09 14:32   ` Miklos Szeredi
@ 2016-09-09 19:42     ` Johannes Weiner
  -1 siblings, 0 replies; 11+ messages in thread
From: Johannes Weiner @ 2016-09-09 19:42 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Antonio SJ Musumeci, fuse-devel, linux-mm, linux-kernel

Hi Miklos,

On Fri, Sep 09, 2016 at 04:32:49PM +0200, Miklos Szeredi wrote:
> On Fri, Sep 9, 2016 at 3:34 PM, Antonio SJ Musumeci <trapexit@spawn.link> wrote:
> > https://gist.github.com/bauruine/3bc00075c4d0b5b3353071d208ded30f
> > https://github.com/trapexit/mergerfs/issues/295
> >
> > I've some users which are having issues with my filesystem where the
> > system's load increases and then the kernel panics.
> >
> > Has anyone seen this before?
> 
> Quite possibly this is caused by fuse, but the BUG is deep in mm
> territory and I have zero clue about what it means.
> 
> Hannes,  can you please look a the above crash in mm/workingset.c?

The MM maintains a reclaimable list of page cache tree nodes that have
gone empty (all pages evicted) except for the shadow entries reclaimed
pages leave behind. When faulting a regular page back into such a node
the code in page_cache_tree_insert() removes it from the list again:

		workingset_node_pages_inc(node);
		/*
		 * Don't track node that contains actual pages.
		 *
		 * Avoid acquiring the list_lru lock if already
		 * untracked.  The list_empty() test is safe as
		 * node->private_list is protected by
		 * mapping->tree_lock.
		 */
		if (!list_empty(&node->private_list))
			list_lru_del(&workingset_shadow_nodes,
				     &node->private_list);

The BUG_ON() triggers when we later walk the reclaimable list and find
a radix tree node that has actual pages in it. This could happen when
pages are inserted into a mapping without using add_to_page_cache and
related functions. Does that maybe ring a bell?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [fuse-devel] Kernel panic under load
@ 2016-09-09 19:42     ` Johannes Weiner
  0 siblings, 0 replies; 11+ messages in thread
From: Johannes Weiner @ 2016-09-09 19:42 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Antonio SJ Musumeci, fuse-devel, linux-mm, linux-kernel

Hi Miklos,

On Fri, Sep 09, 2016 at 04:32:49PM +0200, Miklos Szeredi wrote:
> On Fri, Sep 9, 2016 at 3:34 PM, Antonio SJ Musumeci <trapexit@spawn.link> wrote:
> > https://gist.github.com/bauruine/3bc00075c4d0b5b3353071d208ded30f
> > https://github.com/trapexit/mergerfs/issues/295
> >
> > I've some users which are having issues with my filesystem where the
> > system's load increases and then the kernel panics.
> >
> > Has anyone seen this before?
> 
> Quite possibly this is caused by fuse, but the BUG is deep in mm
> territory and I have zero clue about what it means.
> 
> Hannes,  can you please look a the above crash in mm/workingset.c?

The MM maintains a reclaimable list of page cache tree nodes that have
gone empty (all pages evicted) except for the shadow entries reclaimed
pages leave behind. When faulting a regular page back into such a node
the code in page_cache_tree_insert() removes it from the list again:

		workingset_node_pages_inc(node);
		/*
		 * Don't track node that contains actual pages.
		 *
		 * Avoid acquiring the list_lru lock if already
		 * untracked.  The list_empty() test is safe as
		 * node->private_list is protected by
		 * mapping->tree_lock.
		 */
		if (!list_empty(&node->private_list))
			list_lru_del(&workingset_shadow_nodes,
				     &node->private_list);

The BUG_ON() triggers when we later walk the reclaimable list and find
a radix tree node that has actual pages in it. This could happen when
pages are inserted into a mapping without using add_to_page_cache and
related functions. Does that maybe ring a bell?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [fuse-devel] Kernel panic under load
  2016-09-09 19:42     ` Johannes Weiner
@ 2016-09-13  8:42       ` Miklos Szeredi
  -1 siblings, 0 replies; 11+ messages in thread
From: Miklos Szeredi @ 2016-09-13  8:42 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Antonio SJ Musumeci, fuse-devel, linux-mm, linux-kernel

On Fri, Sep 9, 2016 at 9:42 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> Hi Miklos,
>
> On Fri, Sep 09, 2016 at 04:32:49PM +0200, Miklos Szeredi wrote:
>> On Fri, Sep 9, 2016 at 3:34 PM, Antonio SJ Musumeci <trapexit@spawn.link> wrote:
>> > https://gist.github.com/bauruine/3bc00075c4d0b5b3353071d208ded30f
>> > https://github.com/trapexit/mergerfs/issues/295
>> >
>> > I've some users which are having issues with my filesystem where the
>> > system's load increases and then the kernel panics.
>> >
>> > Has anyone seen this before?
>>
>> Quite possibly this is caused by fuse, but the BUG is deep in mm
>> territory and I have zero clue about what it means.
>>
>> Hannes,  can you please look a the above crash in mm/workingset.c?
>
> The MM maintains a reclaimable list of page cache tree nodes that have
> gone empty (all pages evicted) except for the shadow entries reclaimed
> pages leave behind. When faulting a regular page back into such a node
> the code in page_cache_tree_insert() removes it from the list again:
>
>                 workingset_node_pages_inc(node);
>                 /*
>                  * Don't track node that contains actual pages.
>                  *
>                  * Avoid acquiring the list_lru lock if already
>                  * untracked.  The list_empty() test is safe as
>                  * node->private_list is protected by
>                  * mapping->tree_lock.
>                  */
>                 if (!list_empty(&node->private_list))
>                         list_lru_del(&workingset_shadow_nodes,
>                                      &node->private_list);
>
> The BUG_ON() triggers when we later walk the reclaimable list and find
> a radix tree node that has actual pages in it. This could happen when
> pages are inserted into a mapping without using add_to_page_cache and
> related functions. Does that maybe ring a bell?

Fuse allows pages to be spliced into the page cache when reading the
file.  It does this with replace_page_cache_page(), which is an atomic
version of delete_from_page_cache()+add_to_page_cache().

Fuse is the only user of replace_page_cache_page(), so I imagine bugs
can more easily escape notice than the more commonly used variants.

Could you please take a look at this function.  "git blame" shows that
it's older than the add/remove variants, but I haven't gone into the
details.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [fuse-devel] Kernel panic under load
@ 2016-09-13  8:42       ` Miklos Szeredi
  0 siblings, 0 replies; 11+ messages in thread
From: Miklos Szeredi @ 2016-09-13  8:42 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Antonio SJ Musumeci, fuse-devel, linux-mm, linux-kernel

On Fri, Sep 9, 2016 at 9:42 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> Hi Miklos,
>
> On Fri, Sep 09, 2016 at 04:32:49PM +0200, Miklos Szeredi wrote:
>> On Fri, Sep 9, 2016 at 3:34 PM, Antonio SJ Musumeci <trapexit@spawn.link> wrote:
>> > https://gist.github.com/bauruine/3bc00075c4d0b5b3353071d208ded30f
>> > https://github.com/trapexit/mergerfs/issues/295
>> >
>> > I've some users which are having issues with my filesystem where the
>> > system's load increases and then the kernel panics.
>> >
>> > Has anyone seen this before?
>>
>> Quite possibly this is caused by fuse, but the BUG is deep in mm
>> territory and I have zero clue about what it means.
>>
>> Hannes,  can you please look a the above crash in mm/workingset.c?
>
> The MM maintains a reclaimable list of page cache tree nodes that have
> gone empty (all pages evicted) except for the shadow entries reclaimed
> pages leave behind. When faulting a regular page back into such a node
> the code in page_cache_tree_insert() removes it from the list again:
>
>                 workingset_node_pages_inc(node);
>                 /*
>                  * Don't track node that contains actual pages.
>                  *
>                  * Avoid acquiring the list_lru lock if already
>                  * untracked.  The list_empty() test is safe as
>                  * node->private_list is protected by
>                  * mapping->tree_lock.
>                  */
>                 if (!list_empty(&node->private_list))
>                         list_lru_del(&workingset_shadow_nodes,
>                                      &node->private_list);
>
> The BUG_ON() triggers when we later walk the reclaimable list and find
> a radix tree node that has actual pages in it. This could happen when
> pages are inserted into a mapping without using add_to_page_cache and
> related functions. Does that maybe ring a bell?

Fuse allows pages to be spliced into the page cache when reading the
file.  It does this with replace_page_cache_page(), which is an atomic
version of delete_from_page_cache()+add_to_page_cache().

Fuse is the only user of replace_page_cache_page(), so I imagine bugs
can more easily escape notice than the more commonly used variants.

Could you please take a look at this function.  "git blame" shows that
it's older than the add/remove variants, but I haven't gone into the
details.

Thanks,
Miklos

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [fuse-devel] Kernel panic under load
  2016-09-13  8:42       ` Miklos Szeredi
@ 2016-09-14 14:31         ` Johannes Weiner
  -1 siblings, 0 replies; 11+ messages in thread
From: Johannes Weiner @ 2016-09-14 14:31 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Antonio SJ Musumeci, fuse-devel, linux-mm, linux-kernel

Hi Miklos,

On Tue, Sep 13, 2016 at 10:42:17AM +0200, Miklos Szeredi wrote:
> Fuse allows pages to be spliced into the page cache when reading the
> file.  It does this with replace_page_cache_page(), which is an atomic
> version of delete_from_page_cache()+add_to_page_cache().
> 
> Fuse is the only user of replace_page_cache_page(), so I imagine bugs
> can more easily escape notice than the more commonly used variants.
> 
> Could you please take a look at this function.  "git blame" shows that
> it's older than the add/remove variants, but I haven't gone into the
> details.

Indeed, replace_page_cache_page() uses a properly accounted deletion
of the old page followed by a raw, untracked radix_tree_insert(). It
would lead to an underflow that triggers the page counter assertion.

Thanks for the pointer, Miklos. This has been broken for a while.

Antonio, does the following patch resolve the issue for you? It
applies to the head of Linus's tree, let me know if you need it
backported to a different base.

---

>From 3a2bb511f5e04019ccc487ef995b94700db172e7 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Wed, 14 Sep 2016 09:50:42 -0400
Subject: [PATCH] mm: workingset: fix shadow node leak in
 replace_page_cache_page()

Antonio reports the following crash when using fuse under memory
pressure:

[25192.515454] kernel BUG at /build/linux-a2WvEb/linux-4.4.0/mm/workingset.c:346!
[25192.517521] invalid opcode: 0000 [#1] SMP
[25192.519602] Modules linked in: netconsole ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 configfs binfmt_misc veth bridge stp llc nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ip6table_filter ip6_tables xt_multiport iptable_filter ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_comment xt_nat iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack xt_CHECKSUM xt_tcpudp iptable_mangle ip_tables x_tables intel_rapl x86_pkg_temp_thermal intel_powerclamp eeepc_wmi asus_wmi coretemp sparse_keymap kvm_intel ppdev kvm irqbypass mei_me 8250_fintek input_leds serio_raw parport_pc tpm_infineon mei shpchp mac_hid parport lpc_ich autofs4 drbg ansi_cprng dm_crypt algif_skcipher af_alg btrfs raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear raid10 raid1 i915 crct10dif_pclmul crc32_pclmul aesni_intel i2c_algo_bit aes_x86_64 drm_kms_helper lrw gf128mul glue_helper ablk_helper syscopyarea cryptd sysfillrect sysimgblt fb_sys_fops drm ahci r8169 libahci mii wmi fjes video [last unloaded: netconsole]
[25192.540910] CPU: 2 PID: 63 Comm: kswapd0 Not tainted 4.4.0-36-generic #55-Ubuntu
[25192.543411] Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
[25192.545840] task: ffff88040cae6040 ti: ffff880407488000 task.ti: ffff880407488000
[25192.548277] RIP: 0010:[<ffffffff811ba501>]  [<ffffffff811ba501>] shadow_lru_isolate+0x181/0x190
[25192.550706] RSP: 0018:ffff88040748bbe0  EFLAGS: 00010002
[25192.553127] RAX: 0000000000001c81 RBX: ffff8802f91ee928 RCX: ffff8802f91eeb38
[25192.555544] RDX: ffff8802f91ee938 RSI: ffff8802f91ee928 RDI: ffff8804099ba2c0
[25192.557914] RBP: ffff88040748bc08 R08: 000000000001a7b6 R09: 000000000000003f
[25192.560237] R10: 000000000001a750 R11: 0000000000000000 R12: ffff8804099ba2c0
[25192.562512] R13: ffff8803157e9680 R14: ffff8803157e9668 R15: ffff8804099ba2c8
[25192.564724] FS:  0000000000000000(0000) GS:ffff88041f280000(0000) knlGS:0000000000000000
[25192.566990] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[25192.569201] CR2: 00007ffabb690000 CR3: 0000000001e0a000 CR4: 00000000000406e0
[25192.571419] Stack:
[25192.573550]  ffff8804099ba2c0 ffff88039e4f86f0 ffff8802f91ee928 ffff8804099ba2c8
[25192.575695]  ffff88040748bd08 ffff88040748bc58 ffffffff811b99bf 0000000000000052
[25192.577814]  0000000000000000 ffffffff811ba380 000000000000008a 0000000000000080
[25192.579947] Call Trace:
[25192.582022]  [<ffffffff811b99bf>] __list_lru_walk_one.isra.3+0x8f/0x130
[25192.584137]  [<ffffffff811ba380>] ? memcg_drain_all_list_lrus+0x190/0x190
[25192.586165]  [<ffffffff811b9a83>] list_lru_walk_one+0x23/0x30
[25192.588145]  [<ffffffff811ba544>] scan_shadow_nodes+0x34/0x50
[25192.590074]  [<ffffffff811a0e9d>] shrink_slab.part.40+0x1ed/0x3d0
[25192.591985]  [<ffffffff811a53da>] shrink_zone+0x2ca/0x2e0
[25192.593863]  [<ffffffff811a64ce>] kswapd+0x51e/0x990
[25192.595737]  [<ffffffff811a5fb0>] ? mem_cgroup_shrink_node_zone+0x1c0/0x1c0
[25192.597613]  [<ffffffff810a0808>] kthread+0xd8/0xf0
[25192.599495]  [<ffffffff810a0730>] ? kthread_create_on_node+0x1e0/0x1e0
[25192.601335]  [<ffffffff8182e34f>] ret_from_fork+0x3f/0x70
[25192.603193]  [<ffffffff810a0730>] ? kthread_create_on_node+0x1e0/0x1e0
[25192.605083] Code: 8d 7e 08 4c 89 fe e8 4f cc 23 00 84 c0 74 20 4c 89 ef c6 07 00 66 66 66 90 bb 01 00 00 00 e9 c5 fe ff ff 0f 0b 0f 0b 0f 0b 0f 0b <0f> 0b 0f 0b 0f 0b 66 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55
[25192.609252] RIP  [<ffffffff811ba501>] shadow_lru_isolate+0x181/0x190
[25192.611304]  RSP <ffff88040748bbe0>

which corresponds to the following sanity check in the shadow node
tracking:

  BUG_ON(node->count & RADIX_TREE_COUNT_MASK);

The workingset code tracks radix tree nodes that exclusively contain
shadow entries of evicted pages in them, and this (somewhat obscure)
checks if there are real pages left that would interfere with reclaim
of the radix tree node under memory pressure.

Discussing ways of how fuse might sneak pages into the radix tree past
the workingset code, Miklos pointed to replace_page_cache_page(), and
indeed there is a problem there: it properly accounts for the old page
being removed (__delete_from_page_cache() does that), but then does a
raw raw radix_tree_insert(), not accounting for the replacement page;
the page counter bits in node->count eventually underflow.

To address this, make sure replace_page_cache_page() uses the tracked
page insertion code, page_cache_tree_insert().

Also, make the sanity checks a bit less obscure by using the helpers
for checking the number of pages and shadows in a radix tree node.

Fixes: 449dd6984d0e ("mm: keep page cache radix tree nodes in check")
Cc: stable@vger.kernel.org # 3.15+
Reported-by: Antonio SJ Musumeci <trapexit@spawn.link>
Debugged-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/swap.h |   2 +
 mm/filemap.c         | 114 +++++++++++++++++++++++++--------------------------
 mm/workingset.c      |  10 ++---
 3 files changed, 63 insertions(+), 63 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index b17cc4830fa6..4a529c984a3f 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -257,6 +257,7 @@ static inline void workingset_node_pages_inc(struct radix_tree_node *node)
 
 static inline void workingset_node_pages_dec(struct radix_tree_node *node)
 {
+	VM_BUG_ON(!workingset_node_pages(node));
 	node->count--;
 }
 
@@ -272,6 +273,7 @@ static inline void workingset_node_shadows_inc(struct radix_tree_node *node)
 
 static inline void workingset_node_shadows_dec(struct radix_tree_node *node)
 {
+	VM_BUG_ON(!workingset_node_shadows(node));
 	node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
 }
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 8a287dfc5372..2d0986a64f1f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -110,6 +110,62 @@
  *   ->tasklist_lock            (memory_failure, collect_procs_ao)
  */
 
+static int page_cache_tree_insert(struct address_space *mapping,
+				  struct page *page, void **shadowp)
+{
+	struct radix_tree_node *node;
+	void **slot;
+	int error;
+
+	error = __radix_tree_create(&mapping->page_tree, page->index, 0,
+				    &node, &slot);
+	if (error)
+		return error;
+	if (*slot) {
+		void *p;
+
+		p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
+		if (!radix_tree_exceptional_entry(p))
+			return -EEXIST;
+
+		mapping->nrexceptional--;
+		if (!dax_mapping(mapping)) {
+			if (shadowp)
+				*shadowp = p;
+			if (node)
+				workingset_node_shadows_dec(node);
+		} else {
+			/* DAX can replace empty locked entry with a hole */
+			WARN_ON_ONCE(p !=
+				(void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
+					 RADIX_DAX_ENTRY_LOCK));
+			/* DAX accounts exceptional entries as normal pages */
+			if (node)
+				workingset_node_pages_dec(node);
+			/* Wakeup waiters for exceptional entry lock */
+			dax_wake_mapping_entry_waiter(mapping, page->index,
+						      false);
+		}
+	}
+	radix_tree_replace_slot(slot, page);
+	mapping->nrpages++;
+	if (node) {
+		workingset_node_pages_inc(node);
+		/*
+		 * Don't track node that contains actual pages.
+		 *
+		 * Avoid acquiring the list_lru lock if already
+		 * untracked.  The list_empty() test is safe as
+		 * node->private_list is protected by
+		 * mapping->tree_lock.
+		 */
+		if (!list_empty(&node->private_list))
+			list_lru_del(&workingset_shadow_nodes,
+				     &node->private_list);
+	}
+	return 0;
+}
+
 static void page_cache_tree_delete(struct address_space *mapping,
 				   struct page *page, void *shadow)
 {
@@ -561,7 +617,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 
 		spin_lock_irqsave(&mapping->tree_lock, flags);
 		__delete_from_page_cache(old, NULL);
-		error = radix_tree_insert(&mapping->page_tree, offset, new);
+		error = page_cache_tree_insert(mapping, new, NULL);
 		BUG_ON(error);
 		mapping->nrpages++;
 
@@ -584,62 +640,6 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 }
 EXPORT_SYMBOL_GPL(replace_page_cache_page);
 
-static int page_cache_tree_insert(struct address_space *mapping,
-				  struct page *page, void **shadowp)
-{
-	struct radix_tree_node *node;
-	void **slot;
-	int error;
-
-	error = __radix_tree_create(&mapping->page_tree, page->index, 0,
-				    &node, &slot);
-	if (error)
-		return error;
-	if (*slot) {
-		void *p;
-
-		p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
-		if (!radix_tree_exceptional_entry(p))
-			return -EEXIST;
-
-		mapping->nrexceptional--;
-		if (!dax_mapping(mapping)) {
-			if (shadowp)
-				*shadowp = p;
-			if (node)
-				workingset_node_shadows_dec(node);
-		} else {
-			/* DAX can replace empty locked entry with a hole */
-			WARN_ON_ONCE(p !=
-				(void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
-					 RADIX_DAX_ENTRY_LOCK));
-			/* DAX accounts exceptional entries as normal pages */
-			if (node)
-				workingset_node_pages_dec(node);
-			/* Wakeup waiters for exceptional entry lock */
-			dax_wake_mapping_entry_waiter(mapping, page->index,
-						      false);
-		}
-	}
-	radix_tree_replace_slot(slot, page);
-	mapping->nrpages++;
-	if (node) {
-		workingset_node_pages_inc(node);
-		/*
-		 * Don't track node that contains actual pages.
-		 *
-		 * Avoid acquiring the list_lru lock if already
-		 * untracked.  The list_empty() test is safe as
-		 * node->private_list is protected by
-		 * mapping->tree_lock.
-		 */
-		if (!list_empty(&node->private_list))
-			list_lru_del(&workingset_shadow_nodes,
-				     &node->private_list);
-	}
-	return 0;
-}
-
 static int __add_to_page_cache_locked(struct page *page,
 				      struct address_space *mapping,
 				      pgoff_t offset, gfp_t gfp_mask,
diff --git a/mm/workingset.c b/mm/workingset.c
index 69551cfae97b..617475f529f4 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -418,21 +418,19 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
 	 * no pages, so we expect to be able to remove them all and
 	 * delete and free the empty node afterwards.
 	 */
-
-	BUG_ON(!node->count);
-	BUG_ON(node->count & RADIX_TREE_COUNT_MASK);
+	BUG_ON(!workingset_node_shadows(node));
+	BUG_ON(workingset_node_pages(node));
 
 	for (i = 0; i < RADIX_TREE_MAP_SIZE; i++) {
 		if (node->slots[i]) {
 			BUG_ON(!radix_tree_exceptional_entry(node->slots[i]));
 			node->slots[i] = NULL;
-			BUG_ON(node->count < (1U << RADIX_TREE_COUNT_SHIFT));
-			node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
+			workingset_node_shadows_dec(node);
 			BUG_ON(!mapping->nrexceptional);
 			mapping->nrexceptional--;
 		}
 	}
-	BUG_ON(node->count);
+	BUG_ON(workingset_node_shadows(node));
 	inc_node_state(page_pgdat(virt_to_page(node)), WORKINGSET_NODERECLAIM);
 	if (!__radix_tree_delete_node(&mapping->page_tree, node))
 		BUG();
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [fuse-devel] Kernel panic under load
@ 2016-09-14 14:31         ` Johannes Weiner
  0 siblings, 0 replies; 11+ messages in thread
From: Johannes Weiner @ 2016-09-14 14:31 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Antonio SJ Musumeci, fuse-devel, linux-mm, linux-kernel

Hi Miklos,

On Tue, Sep 13, 2016 at 10:42:17AM +0200, Miklos Szeredi wrote:
> Fuse allows pages to be spliced into the page cache when reading the
> file.  It does this with replace_page_cache_page(), which is an atomic
> version of delete_from_page_cache()+add_to_page_cache().
> 
> Fuse is the only user of replace_page_cache_page(), so I imagine bugs
> can more easily escape notice than the more commonly used variants.
> 
> Could you please take a look at this function.  "git blame" shows that
> it's older than the add/remove variants, but I haven't gone into the
> details.

Indeed, replace_page_cache_page() uses a properly accounted deletion
of the old page followed by a raw, untracked radix_tree_insert(). It
would lead to an underflow that triggers the page counter assertion.

Thanks for the pointer, Miklos. This has been broken for a while.

Antonio, does the following patch resolve the issue for you? It
applies to the head of Linus's tree, let me know if you need it
backported to a different base.

---

From 3a2bb511f5e04019ccc487ef995b94700db172e7 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Wed, 14 Sep 2016 09:50:42 -0400
Subject: [PATCH] mm: workingset: fix shadow node leak in
 replace_page_cache_page()

Antonio reports the following crash when using fuse under memory
pressure:

[25192.515454] kernel BUG at /build/linux-a2WvEb/linux-4.4.0/mm/workingset.c:346!
[25192.517521] invalid opcode: 0000 [#1] SMP
[25192.519602] Modules linked in: netconsole ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 configfs binfmt_misc veth bridge stp llc nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ip6table_filter ip6_tables xt_multiport iptable_filter ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_comment xt_nat iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack xt_CHECKSUM xt_tcpudp iptable_mangle ip_tables x_tables intel_rapl x86_pkg_temp_thermal intel_powerclamp eeepc_wmi asus_wmi coretemp sparse_keymap kvm_intel ppdev kvm irqbypass mei_me 8250_fintek input_leds serio_raw parport_pc tpm_infineon mei shpchp mac_hid parport lpc_ich autofs4 drbg ansi_cprng dm_crypt algif_skcipher af_alg btrfs raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear raid10 raid1 i915 crct10dif_pclmul crc32_pclmul aesni_intel i2c_algo_bit aes_x86_64 drm_kms_helper lrw gf128mul glue_helper ablk_helper syscopyarea cryptd sysfillrect sysimgblt fb_sys_fops drm ahci r8169 libahci mii wmi fjes video [last unloaded: netconsole]
[25192.540910] CPU: 2 PID: 63 Comm: kswapd0 Not tainted 4.4.0-36-generic #55-Ubuntu
[25192.543411] Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
[25192.545840] task: ffff88040cae6040 ti: ffff880407488000 task.ti: ffff880407488000
[25192.548277] RIP: 0010:[<ffffffff811ba501>]  [<ffffffff811ba501>] shadow_lru_isolate+0x181/0x190
[25192.550706] RSP: 0018:ffff88040748bbe0  EFLAGS: 00010002
[25192.553127] RAX: 0000000000001c81 RBX: ffff8802f91ee928 RCX: ffff8802f91eeb38
[25192.555544] RDX: ffff8802f91ee938 RSI: ffff8802f91ee928 RDI: ffff8804099ba2c0
[25192.557914] RBP: ffff88040748bc08 R08: 000000000001a7b6 R09: 000000000000003f
[25192.560237] R10: 000000000001a750 R11: 0000000000000000 R12: ffff8804099ba2c0
[25192.562512] R13: ffff8803157e9680 R14: ffff8803157e9668 R15: ffff8804099ba2c8
[25192.564724] FS:  0000000000000000(0000) GS:ffff88041f280000(0000) knlGS:0000000000000000
[25192.566990] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[25192.569201] CR2: 00007ffabb690000 CR3: 0000000001e0a000 CR4: 00000000000406e0
[25192.571419] Stack:
[25192.573550]  ffff8804099ba2c0 ffff88039e4f86f0 ffff8802f91ee928 ffff8804099ba2c8
[25192.575695]  ffff88040748bd08 ffff88040748bc58 ffffffff811b99bf 0000000000000052
[25192.577814]  0000000000000000 ffffffff811ba380 000000000000008a 0000000000000080
[25192.579947] Call Trace:
[25192.582022]  [<ffffffff811b99bf>] __list_lru_walk_one.isra.3+0x8f/0x130
[25192.584137]  [<ffffffff811ba380>] ? memcg_drain_all_list_lrus+0x190/0x190
[25192.586165]  [<ffffffff811b9a83>] list_lru_walk_one+0x23/0x30
[25192.588145]  [<ffffffff811ba544>] scan_shadow_nodes+0x34/0x50
[25192.590074]  [<ffffffff811a0e9d>] shrink_slab.part.40+0x1ed/0x3d0
[25192.591985]  [<ffffffff811a53da>] shrink_zone+0x2ca/0x2e0
[25192.593863]  [<ffffffff811a64ce>] kswapd+0x51e/0x990
[25192.595737]  [<ffffffff811a5fb0>] ? mem_cgroup_shrink_node_zone+0x1c0/0x1c0
[25192.597613]  [<ffffffff810a0808>] kthread+0xd8/0xf0
[25192.599495]  [<ffffffff810a0730>] ? kthread_create_on_node+0x1e0/0x1e0
[25192.601335]  [<ffffffff8182e34f>] ret_from_fork+0x3f/0x70
[25192.603193]  [<ffffffff810a0730>] ? kthread_create_on_node+0x1e0/0x1e0
[25192.605083] Code: 8d 7e 08 4c 89 fe e8 4f cc 23 00 84 c0 74 20 4c 89 ef c6 07 00 66 66 66 90 bb 01 00 00 00 e9 c5 fe ff ff 0f 0b 0f 0b 0f 0b 0f 0b <0f> 0b 0f 0b 0f 0b 66 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55
[25192.609252] RIP  [<ffffffff811ba501>] shadow_lru_isolate+0x181/0x190
[25192.611304]  RSP <ffff88040748bbe0>

which corresponds to the following sanity check in the shadow node
tracking:

  BUG_ON(node->count & RADIX_TREE_COUNT_MASK);

The workingset code tracks radix tree nodes that exclusively contain
shadow entries of evicted pages in them, and this (somewhat obscure)
checks if there are real pages left that would interfere with reclaim
of the radix tree node under memory pressure.

Discussing ways of how fuse might sneak pages into the radix tree past
the workingset code, Miklos pointed to replace_page_cache_page(), and
indeed there is a problem there: it properly accounts for the old page
being removed (__delete_from_page_cache() does that), but then does a
raw raw radix_tree_insert(), not accounting for the replacement page;
the page counter bits in node->count eventually underflow.

To address this, make sure replace_page_cache_page() uses the tracked
page insertion code, page_cache_tree_insert().

Also, make the sanity checks a bit less obscure by using the helpers
for checking the number of pages and shadows in a radix tree node.

Fixes: 449dd6984d0e ("mm: keep page cache radix tree nodes in check")
Cc: stable@vger.kernel.org # 3.15+
Reported-by: Antonio SJ Musumeci <trapexit@spawn.link>
Debugged-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/swap.h |   2 +
 mm/filemap.c         | 114 +++++++++++++++++++++++++--------------------------
 mm/workingset.c      |  10 ++---
 3 files changed, 63 insertions(+), 63 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index b17cc4830fa6..4a529c984a3f 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -257,6 +257,7 @@ static inline void workingset_node_pages_inc(struct radix_tree_node *node)
 
 static inline void workingset_node_pages_dec(struct radix_tree_node *node)
 {
+	VM_BUG_ON(!workingset_node_pages(node));
 	node->count--;
 }
 
@@ -272,6 +273,7 @@ static inline void workingset_node_shadows_inc(struct radix_tree_node *node)
 
 static inline void workingset_node_shadows_dec(struct radix_tree_node *node)
 {
+	VM_BUG_ON(!workingset_node_shadows(node));
 	node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
 }
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 8a287dfc5372..2d0986a64f1f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -110,6 +110,62 @@
  *   ->tasklist_lock            (memory_failure, collect_procs_ao)
  */
 
+static int page_cache_tree_insert(struct address_space *mapping,
+				  struct page *page, void **shadowp)
+{
+	struct radix_tree_node *node;
+	void **slot;
+	int error;
+
+	error = __radix_tree_create(&mapping->page_tree, page->index, 0,
+				    &node, &slot);
+	if (error)
+		return error;
+	if (*slot) {
+		void *p;
+
+		p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
+		if (!radix_tree_exceptional_entry(p))
+			return -EEXIST;
+
+		mapping->nrexceptional--;
+		if (!dax_mapping(mapping)) {
+			if (shadowp)
+				*shadowp = p;
+			if (node)
+				workingset_node_shadows_dec(node);
+		} else {
+			/* DAX can replace empty locked entry with a hole */
+			WARN_ON_ONCE(p !=
+				(void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
+					 RADIX_DAX_ENTRY_LOCK));
+			/* DAX accounts exceptional entries as normal pages */
+			if (node)
+				workingset_node_pages_dec(node);
+			/* Wakeup waiters for exceptional entry lock */
+			dax_wake_mapping_entry_waiter(mapping, page->index,
+						      false);
+		}
+	}
+	radix_tree_replace_slot(slot, page);
+	mapping->nrpages++;
+	if (node) {
+		workingset_node_pages_inc(node);
+		/*
+		 * Don't track node that contains actual pages.
+		 *
+		 * Avoid acquiring the list_lru lock if already
+		 * untracked.  The list_empty() test is safe as
+		 * node->private_list is protected by
+		 * mapping->tree_lock.
+		 */
+		if (!list_empty(&node->private_list))
+			list_lru_del(&workingset_shadow_nodes,
+				     &node->private_list);
+	}
+	return 0;
+}
+
 static void page_cache_tree_delete(struct address_space *mapping,
 				   struct page *page, void *shadow)
 {
@@ -561,7 +617,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 
 		spin_lock_irqsave(&mapping->tree_lock, flags);
 		__delete_from_page_cache(old, NULL);
-		error = radix_tree_insert(&mapping->page_tree, offset, new);
+		error = page_cache_tree_insert(mapping, new, NULL);
 		BUG_ON(error);
 		mapping->nrpages++;
 
@@ -584,62 +640,6 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 }
 EXPORT_SYMBOL_GPL(replace_page_cache_page);
 
-static int page_cache_tree_insert(struct address_space *mapping,
-				  struct page *page, void **shadowp)
-{
-	struct radix_tree_node *node;
-	void **slot;
-	int error;
-
-	error = __radix_tree_create(&mapping->page_tree, page->index, 0,
-				    &node, &slot);
-	if (error)
-		return error;
-	if (*slot) {
-		void *p;
-
-		p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
-		if (!radix_tree_exceptional_entry(p))
-			return -EEXIST;
-
-		mapping->nrexceptional--;
-		if (!dax_mapping(mapping)) {
-			if (shadowp)
-				*shadowp = p;
-			if (node)
-				workingset_node_shadows_dec(node);
-		} else {
-			/* DAX can replace empty locked entry with a hole */
-			WARN_ON_ONCE(p !=
-				(void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
-					 RADIX_DAX_ENTRY_LOCK));
-			/* DAX accounts exceptional entries as normal pages */
-			if (node)
-				workingset_node_pages_dec(node);
-			/* Wakeup waiters for exceptional entry lock */
-			dax_wake_mapping_entry_waiter(mapping, page->index,
-						      false);
-		}
-	}
-	radix_tree_replace_slot(slot, page);
-	mapping->nrpages++;
-	if (node) {
-		workingset_node_pages_inc(node);
-		/*
-		 * Don't track node that contains actual pages.
-		 *
-		 * Avoid acquiring the list_lru lock if already
-		 * untracked.  The list_empty() test is safe as
-		 * node->private_list is protected by
-		 * mapping->tree_lock.
-		 */
-		if (!list_empty(&node->private_list))
-			list_lru_del(&workingset_shadow_nodes,
-				     &node->private_list);
-	}
-	return 0;
-}
-
 static int __add_to_page_cache_locked(struct page *page,
 				      struct address_space *mapping,
 				      pgoff_t offset, gfp_t gfp_mask,
diff --git a/mm/workingset.c b/mm/workingset.c
index 69551cfae97b..617475f529f4 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -418,21 +418,19 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
 	 * no pages, so we expect to be able to remove them all and
 	 * delete and free the empty node afterwards.
 	 */
-
-	BUG_ON(!node->count);
-	BUG_ON(node->count & RADIX_TREE_COUNT_MASK);
+	BUG_ON(!workingset_node_shadows(node));
+	BUG_ON(workingset_node_pages(node));
 
 	for (i = 0; i < RADIX_TREE_MAP_SIZE; i++) {
 		if (node->slots[i]) {
 			BUG_ON(!radix_tree_exceptional_entry(node->slots[i]));
 			node->slots[i] = NULL;
-			BUG_ON(node->count < (1U << RADIX_TREE_COUNT_SHIFT));
-			node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
+			workingset_node_shadows_dec(node);
 			BUG_ON(!mapping->nrexceptional);
 			mapping->nrexceptional--;
 		}
 	}
-	BUG_ON(node->count);
+	BUG_ON(workingset_node_shadows(node));
 	inc_node_state(page_pgdat(virt_to_page(node)), WORKINGSET_NODERECLAIM);
 	if (!__radix_tree_delete_node(&mapping->page_tree, node))
 		BUG();
-- 
2.9.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [fuse-devel] Kernel panic under load
  2016-09-14 14:31         ` Johannes Weiner
  (?)
@ 2016-09-14 14:51         ` Antonio SJ Musumeci
  2016-09-21  8:54             ` Miklos Szeredi
  -1 siblings, 1 reply; 11+ messages in thread
From: Antonio SJ Musumeci @ 2016-09-14 14:51 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Miklos Szeredi, fuse-devel, linux-mm, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 15265 bytes --]

I was unable to reproduce the problem but I'll forward this on to my user
and see if they can test it.

I imagine the users would prefer it backported though they have worked
around the problem by turning off splicing.

On Wed, Sep 14, 2016 at 10:31 AM, Johannes Weiner <hannes@cmpxchg.org>
wrote:

> Hi Miklos,
>
> On Tue, Sep 13, 2016 at 10:42:17AM +0200, Miklos Szeredi wrote:
> > Fuse allows pages to be spliced into the page cache when reading the
> > file.  It does this with replace_page_cache_page(), which is an atomic
> > version of delete_from_page_cache()+add_to_page_cache().
> >
> > Fuse is the only user of replace_page_cache_page(), so I imagine bugs
> > can more easily escape notice than the more commonly used variants.
> >
> > Could you please take a look at this function.  "git blame" shows that
> > it's older than the add/remove variants, but I haven't gone into the
> > details.
>
> Indeed, replace_page_cache_page() uses a properly accounted deletion
> of the old page followed by a raw, untracked radix_tree_insert(). It
> would lead to an underflow that triggers the page counter assertion.
>
> Thanks for the pointer, Miklos. This has been broken for a while.
>
> Antonio, does the following patch resolve the issue for you? It
> applies to the head of Linus's tree, let me know if you need it
> backported to a different base.
>
> ---
>
> From 3a2bb511f5e04019ccc487ef995b94700db172e7 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Wed, 14 Sep 2016 09:50:42 -0400
> Subject: [PATCH] mm: workingset: fix shadow node leak in
>  replace_page_cache_page()
>
> Antonio reports the following crash when using fuse under memory
> pressure:
>
> [25192.515454] kernel BUG at /build/linux-a2WvEb/linux-4.4.
> 0/mm/workingset.c:346!
> [25192.517521] invalid opcode: 0000 [#1] SMP
> [25192.519602] Modules linked in: netconsole ip6t_REJECT nf_reject_ipv6
> ipt_REJECT nf_reject_ipv4 configfs binfmt_misc veth bridge stp llc
> nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ip6table_filter ip6_tables
> xt_multiport iptable_filter ipt_MASQUERADE nf_nat_masquerade_ipv4
> xt_comment xt_nat iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4
> nf_nat nf_conntrack xt_CHECKSUM xt_tcpudp iptable_mangle ip_tables x_tables
> intel_rapl x86_pkg_temp_thermal intel_powerclamp eeepc_wmi asus_wmi
> coretemp sparse_keymap kvm_intel ppdev kvm irqbypass mei_me 8250_fintek
> input_leds serio_raw parport_pc tpm_infineon mei shpchp mac_hid parport
> lpc_ich autofs4 drbg ansi_cprng dm_crypt algif_skcipher af_alg btrfs
> raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor
> raid6_pq libcrc32c raid0 multipath linear raid10 raid1 i915
> crct10dif_pclmul crc32_pclmul aesni_intel i2c_algo_bit aes_x86_64
> drm_kms_helper lrw gf128mul glue_helper ablk_helper syscopyarea cryptd
> sysfillrect sysimgblt fb_sys_fops drm ahci r8169 libahci mii wmi fjes video
> [last unloaded: netconsole]
> [25192.540910] CPU: 2 PID: 63 Comm: kswapd0 Not tainted 4.4.0-36-generic
> #55-Ubuntu
> [25192.543411] Hardware name: System manufacturer System Product
> Name/P8H67-M PRO, BIOS 3904 04/27/2013
> [25192.545840] task: ffff88040cae6040 ti: ffff880407488000 task.ti:
> ffff880407488000
> [25192.548277] RIP: 0010:[<ffffffff811ba501>]  [<ffffffff811ba501>]
> shadow_lru_isolate+0x181/0x190
> [25192.550706] RSP: 0018:ffff88040748bbe0  EFLAGS: 00010002
> [25192.553127] RAX: 0000000000001c81 RBX: ffff8802f91ee928 RCX:
> ffff8802f91eeb38
> [25192.555544] RDX: ffff8802f91ee938 RSI: ffff8802f91ee928 RDI:
> ffff8804099ba2c0
> [25192.557914] RBP: ffff88040748bc08 R08: 000000000001a7b6 R09:
> 000000000000003f
> [25192.560237] R10: 000000000001a750 R11: 0000000000000000 R12:
> ffff8804099ba2c0
> [25192.562512] R13: ffff8803157e9680 R14: ffff8803157e9668 R15:
> ffff8804099ba2c8
> [25192.564724] FS:  0000000000000000(0000) GS:ffff88041f280000(0000)
> knlGS:0000000000000000
> [25192.566990] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [25192.569201] CR2: 00007ffabb690000 CR3: 0000000001e0a000 CR4:
> 00000000000406e0
> [25192.571419] Stack:
> [25192.573550]  ffff8804099ba2c0 ffff88039e4f86f0 ffff8802f91ee928
> ffff8804099ba2c8
> [25192.575695]  ffff88040748bd08 ffff88040748bc58 ffffffff811b99bf
> 0000000000000052
> [25192.577814]  0000000000000000 ffffffff811ba380 000000000000008a
> 0000000000000080
> [25192.579947] Call Trace:
> [25192.582022]  [<ffffffff811b99bf>] __list_lru_walk_one.isra.3+0x8f/0x130
> [25192.584137]  [<ffffffff811ba380>] ? memcg_drain_all_list_lrus+
> 0x190/0x190
> [25192.586165]  [<ffffffff811b9a83>] list_lru_walk_one+0x23/0x30
> [25192.588145]  [<ffffffff811ba544>] scan_shadow_nodes+0x34/0x50
> [25192.590074]  [<ffffffff811a0e9d>] shrink_slab.part.40+0x1ed/0x3d0
> [25192.591985]  [<ffffffff811a53da>] shrink_zone+0x2ca/0x2e0
> [25192.593863]  [<ffffffff811a64ce>] kswapd+0x51e/0x990
> [25192.595737]  [<ffffffff811a5fb0>] ? mem_cgroup_shrink_node_zone+
> 0x1c0/0x1c0
> [25192.597613]  [<ffffffff810a0808>] kthread+0xd8/0xf0
> [25192.599495]  [<ffffffff810a0730>] ? kthread_create_on_node+0x1e0/0x1e0
> [25192.601335]  [<ffffffff8182e34f>] ret_from_fork+0x3f/0x70
> [25192.603193]  [<ffffffff810a0730>] ? kthread_create_on_node+0x1e0/0x1e0
> [25192.605083] Code: 8d 7e 08 4c 89 fe e8 4f cc 23 00 84 c0 74 20 4c 89 ef
> c6 07 00 66 66 66 90 bb 01 00 00 00 e9 c5 fe ff ff 0f 0b 0f 0b 0f 0b 0f 0b
> <0f> 0b 0f 0b 0f 0b 66 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55
> [25192.609252] RIP  [<ffffffff811ba501>] shadow_lru_isolate+0x181/0x190
> [25192.611304]  RSP <ffff88040748bbe0>
>
> which corresponds to the following sanity check in the shadow node
> tracking:
>
>   BUG_ON(node->count & RADIX_TREE_COUNT_MASK);
>
> The workingset code tracks radix tree nodes that exclusively contain
> shadow entries of evicted pages in them, and this (somewhat obscure)
> checks if there are real pages left that would interfere with reclaim
> of the radix tree node under memory pressure.
>
> Discussing ways of how fuse might sneak pages into the radix tree past
> the workingset code, Miklos pointed to replace_page_cache_page(), and
> indeed there is a problem there: it properly accounts for the old page
> being removed (__delete_from_page_cache() does that), but then does a
> raw raw radix_tree_insert(), not accounting for the replacement page;
> the page counter bits in node->count eventually underflow.
>
> To address this, make sure replace_page_cache_page() uses the tracked
> page insertion code, page_cache_tree_insert().
>
> Also, make the sanity checks a bit less obscure by using the helpers
> for checking the number of pages and shadows in a radix tree node.
>
> Fixes: 449dd6984d0e ("mm: keep page cache radix tree nodes in check")
> Cc: stable@vger.kernel.org # 3.15+
> Reported-by: Antonio SJ Musumeci <trapexit@spawn.link>
> Debugged-by: Miklos Szeredi <miklos@szeredi.hu>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/swap.h |   2 +
>  mm/filemap.c         | 114 +++++++++++++++++++++++++-----
> ---------------------
>  mm/workingset.c      |  10 ++---
>  3 files changed, 63 insertions(+), 63 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index b17cc4830fa6..4a529c984a3f 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -257,6 +257,7 @@ static inline void workingset_node_pages_inc(struct
> radix_tree_node *node)
>
>  static inline void workingset_node_pages_dec(struct radix_tree_node
> *node)
>  {
> +       VM_BUG_ON(!workingset_node_pages(node));
>         node->count--;
>  }
>
> @@ -272,6 +273,7 @@ static inline void workingset_node_shadows_inc(struct
> radix_tree_node *node)
>
>  static inline void workingset_node_shadows_dec(struct radix_tree_node
> *node)
>  {
> +       VM_BUG_ON(!workingset_node_shadows(node));
>         node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
>  }
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 8a287dfc5372..2d0986a64f1f 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -110,6 +110,62 @@
>   *   ->tasklist_lock            (memory_failure, collect_procs_ao)
>   */
>
> +static int page_cache_tree_insert(struct address_space *mapping,
> +                                 struct page *page, void **shadowp)
> +{
> +       struct radix_tree_node *node;
> +       void **slot;
> +       int error;
> +
> +       error = __radix_tree_create(&mapping->page_tree, page->index, 0,
> +                                   &node, &slot);
> +       if (error)
> +               return error;
> +       if (*slot) {
> +               void *p;
> +
> +               p = radix_tree_deref_slot_protected(slot,
> &mapping->tree_lock);
> +               if (!radix_tree_exceptional_entry(p))
> +                       return -EEXIST;
> +
> +               mapping->nrexceptional--;
> +               if (!dax_mapping(mapping)) {
> +                       if (shadowp)
> +                               *shadowp = p;
> +                       if (node)
> +                               workingset_node_shadows_dec(node);
> +               } else {
> +                       /* DAX can replace empty locked entry with a hole
> */
> +                       WARN_ON_ONCE(p !=
> +                               (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
> +                                        RADIX_DAX_ENTRY_LOCK));
> +                       /* DAX accounts exceptional entries as normal
> pages */
> +                       if (node)
> +                               workingset_node_pages_dec(node);
> +                       /* Wakeup waiters for exceptional entry lock */
> +                       dax_wake_mapping_entry_waiter(mapping,
> page->index,
> +                                                     false);
> +               }
> +       }
> +       radix_tree_replace_slot(slot, page);
> +       mapping->nrpages++;
> +       if (node) {
> +               workingset_node_pages_inc(node);
> +               /*
> +                * Don't track node that contains actual pages.
> +                *
> +                * Avoid acquiring the list_lru lock if already
> +                * untracked.  The list_empty() test is safe as
> +                * node->private_list is protected by
> +                * mapping->tree_lock.
> +                */
> +               if (!list_empty(&node->private_list))
> +                       list_lru_del(&workingset_shadow_nodes,
> +                                    &node->private_list);
> +       }
> +       return 0;
> +}
> +
>  static void page_cache_tree_delete(struct address_space *mapping,
>                                    struct page *page, void *shadow)
>  {
> @@ -561,7 +617,7 @@ int replace_page_cache_page(struct page *old, struct
> page *new, gfp_t gfp_mask)
>
>                 spin_lock_irqsave(&mapping->tree_lock, flags);
>                 __delete_from_page_cache(old, NULL);
> -               error = radix_tree_insert(&mapping->page_tree, offset,
> new);
> +               error = page_cache_tree_insert(mapping, new, NULL);
>                 BUG_ON(error);
>                 mapping->nrpages++;
>
> @@ -584,62 +640,6 @@ int replace_page_cache_page(struct page *old, struct
> page *new, gfp_t gfp_mask)
>  }
>  EXPORT_SYMBOL_GPL(replace_page_cache_page);
>
> -static int page_cache_tree_insert(struct address_space *mapping,
> -                                 struct page *page, void **shadowp)
> -{
> -       struct radix_tree_node *node;
> -       void **slot;
> -       int error;
> -
> -       error = __radix_tree_create(&mapping->page_tree, page->index, 0,
> -                                   &node, &slot);
> -       if (error)
> -               return error;
> -       if (*slot) {
> -               void *p;
> -
> -               p = radix_tree_deref_slot_protected(slot,
> &mapping->tree_lock);
> -               if (!radix_tree_exceptional_entry(p))
> -                       return -EEXIST;
> -
> -               mapping->nrexceptional--;
> -               if (!dax_mapping(mapping)) {
> -                       if (shadowp)
> -                               *shadowp = p;
> -                       if (node)
> -                               workingset_node_shadows_dec(node);
> -               } else {
> -                       /* DAX can replace empty locked entry with a hole
> */
> -                       WARN_ON_ONCE(p !=
> -                               (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
> -                                        RADIX_DAX_ENTRY_LOCK));
> -                       /* DAX accounts exceptional entries as normal
> pages */
> -                       if (node)
> -                               workingset_node_pages_dec(node);
> -                       /* Wakeup waiters for exceptional entry lock */
> -                       dax_wake_mapping_entry_waiter(mapping,
> page->index,
> -                                                     false);
> -               }
> -       }
> -       radix_tree_replace_slot(slot, page);
> -       mapping->nrpages++;
> -       if (node) {
> -               workingset_node_pages_inc(node);
> -               /*
> -                * Don't track node that contains actual pages.
> -                *
> -                * Avoid acquiring the list_lru lock if already
> -                * untracked.  The list_empty() test is safe as
> -                * node->private_list is protected by
> -                * mapping->tree_lock.
> -                */
> -               if (!list_empty(&node->private_list))
> -                       list_lru_del(&workingset_shadow_nodes,
> -                                    &node->private_list);
> -       }
> -       return 0;
> -}
> -
>  static int __add_to_page_cache_locked(struct page *page,
>                                       struct address_space *mapping,
>                                       pgoff_t offset, gfp_t gfp_mask,
> diff --git a/mm/workingset.c b/mm/workingset.c
> index 69551cfae97b..617475f529f4 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -418,21 +418,19 @@ static enum lru_status shadow_lru_isolate(struct
> list_head *item,
>          * no pages, so we expect to be able to remove them all and
>          * delete and free the empty node afterwards.
>          */
> -
> -       BUG_ON(!node->count);
> -       BUG_ON(node->count & RADIX_TREE_COUNT_MASK);
> +       BUG_ON(!workingset_node_shadows(node));
> +       BUG_ON(workingset_node_pages(node));
>
>         for (i = 0; i < RADIX_TREE_MAP_SIZE; i++) {
>                 if (node->slots[i]) {
>                         BUG_ON(!radix_tree_exceptional_entry(node->slots[
> i]));
>                         node->slots[i] = NULL;
> -                       BUG_ON(node->count < (1U <<
> RADIX_TREE_COUNT_SHIFT));
> -                       node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
> +                       workingset_node_shadows_dec(node);
>                         BUG_ON(!mapping->nrexceptional);
>                         mapping->nrexceptional--;
>                 }
>         }
> -       BUG_ON(node->count);
> +       BUG_ON(workingset_node_shadows(node));
>         inc_node_state(page_pgdat(virt_to_page(node)),
> WORKINGSET_NODERECLAIM);
>         if (!__radix_tree_delete_node(&mapping->page_tree, node))
>                 BUG();
> --
> 2.9.3
>

[-- Attachment #2: Type: text/html, Size: 18569 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [fuse-devel] Kernel panic under load
  2016-09-14 14:51         ` Antonio SJ Musumeci
@ 2016-09-21  8:54             ` Miklos Szeredi
  0 siblings, 0 replies; 11+ messages in thread
From: Miklos Szeredi @ 2016-09-21  8:54 UTC (permalink / raw)
  To: Antonio SJ Musumeci
  Cc: Johannes Weiner, fuse-devel, linux-mm, Linux Kernel Mailing List

On Wed, Sep 14, 2016 at 4:51 PM, Antonio SJ Musumeci
<trapexit@spawn.link> wrote:
> I was unable to reproduce the problem but I'll forward this on to my user
> and see if they can test it.
>
> I imagine the users would prefer it backported though they have worked
> around the problem by turning off splicing.

Since users were reporting this on Ubuntu 16.04, it would make sense
to open a bug against that kernel (backport doens't seem difficult,
but it's not trivial either, and I'm really not familiar with that
code).

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [fuse-devel] Kernel panic under load
@ 2016-09-21  8:54             ` Miklos Szeredi
  0 siblings, 0 replies; 11+ messages in thread
From: Miklos Szeredi @ 2016-09-21  8:54 UTC (permalink / raw)
  To: Antonio SJ Musumeci
  Cc: Johannes Weiner, fuse-devel, linux-mm, Linux Kernel Mailing List

On Wed, Sep 14, 2016 at 4:51 PM, Antonio SJ Musumeci
<trapexit@spawn.link> wrote:
> I was unable to reproduce the problem but I'll forward this on to my user
> and see if they can test it.
>
> I imagine the users would prefer it backported though they have worked
> around the problem by turning off splicing.

Since users were reporting this on Ubuntu 16.04, it would make sense
to open a bug against that kernel (backport doens't seem difficult,
but it's not trivial either, and I'm really not familiar with that
code).

Thanks,
Miklos

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2016-09-21  8:54 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CAB3-ZyQ4Mbj2g6b6Zt4pGLhE7ew9O==rNbUgAaPLYSwdRK3Czw@mail.gmail.com>
2016-09-09 14:32 ` [fuse-devel] Kernel panic under load Miklos Szeredi
2016-09-09 14:32   ` Miklos Szeredi
2016-09-09 19:42   ` Johannes Weiner
2016-09-09 19:42     ` Johannes Weiner
2016-09-13  8:42     ` Miklos Szeredi
2016-09-13  8:42       ` Miklos Szeredi
2016-09-14 14:31       ` Johannes Weiner
2016-09-14 14:31         ` Johannes Weiner
2016-09-14 14:51         ` Antonio SJ Musumeci
2016-09-21  8:54           ` Miklos Szeredi
2016-09-21  8:54             ` Miklos Szeredi

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.