All of lore.kernel.org
 help / color / mirror / Atom feed
* [BUG] Bad page states
@ 2016-08-05 17:13 Eric Dumazet
  2016-08-08 17:48   ` Linus Torvalds
  0 siblings, 1 reply; 12+ messages in thread
From: Eric Dumazet @ 2016-08-05 17:13 UTC (permalink / raw)
  To: linux-kernel; +Cc: Linus Torvalds

Hi guys

I noticed following errors with current tree.

Bisected to nowhere :(

Anyone has an idea ?

git bisect start
# bad: [dd7fd3a82ce55e5772d41b1faa2439f15318a902] ACPI / hotplug / PCI: Runtime resume bridges before bus rescans
git bisect bad dd7fd3a82ce55e5772d41b1faa2439f15318a902
# good: [523d939ef98fd712632d93a5a2b588e477a7565e] Linux 4.7
git bisect good 523d939ef98fd712632d93a5a2b588e477a7565e
# good: [52e9f87ae8be96a863e44c7d8d7f482fb279dddd] mm, page_alloc: wake kswapd based on the highest eligible zone
git bisect good 52e9f87ae8be96a863e44c7d8d7f482fb279dddd
# good: [043248cd4e9603e2e8858c4e20810d8e40be7d9d] Merge tag 'armsoc-dt' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
git bisect good 043248cd4e9603e2e8858c4e20810d8e40be7d9d
# good: [d0e62ef6ed257715a88d0e5d7cd850a1695429e2] drm/nouveau/gr/nv3x: fix instobj write offsets in gr setup
git bisect good d0e62ef6ed257715a88d0e5d7cd850a1695429e2
# good: [affe8a2abd0d7815bb2653eea2717d0e0f8ac7e3] Merge tag 'for-linus-20160801' of git://git.infradead.org/linux-mtd
git bisect good affe8a2abd0d7815bb2653eea2717d0e0f8ac7e3
# good: [f0936155f2204720655aefbc2f17c25a2c80a59e] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
git bisect good f0936155f2204720655aefbc2f17c25a2c80a59e
# bad: [fdf1f7ff1bd7f1c6d1d5dc2b29b6b11a4f722276] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending
git bisect bad fdf1f7ff1bd7f1c6d1d5dc2b29b6b11a4f722276
# bad: [3a303258ef40a29dd4a3ebd29dcb16afd546bd59] Merge tag 'upstream-4.8-rc1' of git://git.infradead.org/linux-ubifs
git bisect bad 3a303258ef40a29dd4a3ebd29dcb16afd546bd59
# bad: [c1ece76719205690f4b448460d9b85c130e8021b] Merge tag 'media/v4.8-5' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
git bisect bad c1ece76719205690f4b448460d9b85c130e8021b
# good: [9049fc745300c5e2236cbfc69f5e8cadb6f1f57c] dynamic_debug: add jump label support
git bisect good 9049fc745300c5e2236cbfc69f5e8cadb6f1f57c
# good: [49aadcf1b6f4240751921dad52e86c760d70a5f1] extable.h: add stddef.h so "NULL" definition is not implicit
git bisect good 49aadcf1b6f4240751921dad52e86c760d70a5f1
# bad: [cd70c37b5a23eee7aa411205e2b4c74097f52336] [media] staging: add MEDIA_SUPPORT dependency
git bisect bad cd70c37b5a23eee7aa411205e2b4c74097f52336
# good: [9f8b333308dcb6af321526a54612cfaa7999a995] [media] vim2m: copy the other colorspace-related fields as well
git bisect good 9f8b333308dcb6af321526a54612cfaa7999a995
# good: [52e37f0a6655158c731a35b79106d3150fd3847c] [media] media: adv7180: Fix broken interrupt register access
git bisect good 52e37f0a6655158c731a35b79106d3150fd3847c
# bad: [5bbea66bf8d9ba898abbe5499f06998a993364f6] [media] vivid: don't handle CEC_MSG_SET_STREAM_PATH
git bisect bad 5bbea66bf8d9ba898abbe5499f06998a993364f6
# first bad commit: [5bbea66bf8d9ba898abbe5499f06998a993364f6] [media] vivid: don't handle CEC_MSG_SET_STREAM_PATH



[   32.666450] BUG: Bad page state in process swapper/0  pfn:1fd945c
[   32.672542] page:ffffea007f651700 count:0 mapcount:-511 mapping:          (null) index:0x0
[   32.680823] flags: 0x1000000000000000()
[   32.684655] page dumped because: nonzero mapcount
[   32.689360] Modules linked in: w1_therm wire cdc_acm ehci_pci ehci_hcd mlx4_en ib_uverbs mlx4_ib ib_core mlx4_core xt_TCPMSS ip6table_mangle ip6_tables ipv6
[   32.689376] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.7.0-smp-DEV #240
[   32.689378]  0000000000000000 ffff881fff803d98 ffffffff81655177 ffffea007f651700
[   32.689379]  ffffffff819eb8b0 ffff881fff803dc8 ffffffff816528ab ffff881fff803dc8
[   32.689381]  0000000000000000 0000000000000002 0000000000000000 ffff881fff803dd8
[   32.689382] Call Trace:
[   32.689383]  <IRQ>  [<ffffffff81655177>] dump_stack+0x4d/0x63
[   32.689390]  [<ffffffff816528ab>] bad_page+0x101/0x11e
[   32.689391]  [<ffffffff8165293e>] free_pages_check_bad+0x76/0x78
[   32.689394]  [<ffffffff811a8d02>] __free_pages_ok+0x3d2/0x4a0
[   32.689395]  [<ffffffff811a9b29>] __free_pages+0x19/0x30
[   32.689397]  [<ffffffff810b8c95>] free_task+0x65/0xa0
[   32.689399]  [<ffffffff810b8d67>] __put_task_struct+0x97/0x130
[   32.689401]  [<ffffffff810be214>] delayed_put_task_struct+0x84/0x90
[   32.689403]  [<ffffffff8111457d>] rcu_process_callbacks+0x1ed/0x5b0
[   32.689405]  [<ffffffff8165fecb>] __do_softirq+0xcb/0x279
[   32.689407]  [<ffffffff810c10ba>] irq_exit+0x9a/0xb0
[   32.689408]  [<ffffffff8165fd06>] smp_apic_timer_interrupt+0x46/0x60
[   32.689409]  [<ffffffff8165e38f>] apic_timer_interrupt+0x7f/0x90
[   32.689410]  <EOI>  [<ffffffff81556405>] ? cpuidle_enter_state+0xe5/0x250
[   32.689414]  [<ffffffff815563e8>] ? cpuidle_enter_state+0xc8/0x250
[   32.689415]  [<ffffffff815565a7>] cpuidle_enter+0x17/0x20
[   32.689417]  [<ffffffff810fa027>] cpu_startup_entry+0x277/0x340
[   32.689418]  [<ffffffff81657a37>] rest_init+0x77/0x80
[   32.689421]  [<ffffffff81d20034>] start_kernel+0x415/0x422
[   32.689422]  [<ffffffff81d1fa04>] ? set_init_arg+0x5a/0x5a
[   32.689423]  [<ffffffff81d1f591>] x86_64_start_reservations+0x2f/0x31
[   32.689424]  [<ffffffff81d1f684>] x86_64_start_kernel+0xf1/0xf5
...
[   43.477693] BUG: Bad page state in process S05containers  pfn:1ff02a3
[   43.484417] page:ffffea007fc0a8c0 count:0 mapcount:-511 mapping:          (null) index:0x0
[   43.492737] flags: 0x1000000000000000()
[   43.496602] page dumped because: nonzero mapcount
[   43.501298] Modules linked in: w1_therm wire cdc_acm ehci_pci ehci_hcd mlx4_en ib_uverbs mlx4_ib ib_core mlx4_core xt_TCPMSS ip6table_mangle ip6_tables ipv6
[   43.501313] CPU: 14 PID: 5488 Comm: S05containers Tainted: G    B           4.7.0-smp-DEV #240
[   43.501315]  0000000000000000 ffff881fdcbeb978 ffffffff81655177 ffffea007fc0a8c0
[   43.501316]  ffffffff819eb8b0 ffff881fdcbeb9a8 ffffffff816528ab ffffea007f5b1880
[   43.501318]  0000000000000000 ffffea007fc0a8c0 0000000000000000 ffff881fdcbeb9b8
[   43.501319] Call Trace:
[   43.501326]  [<ffffffff81655177>] dump_stack+0x4d/0x63
[   43.501328]  [<ffffffff816528ab>] bad_page+0x101/0x11e
[   43.501330]  [<ffffffff8165293e>] free_pages_check_bad+0x76/0x78
[   43.501332]  [<ffffffff811a82cd>] free_pcppages_bulk+0x38d/0x3a0
[   43.501334]  [<ffffffff811a9ad3>] free_hot_cold_page+0x1b3/0x1f0
[   43.501335]  [<ffffffff811a9e68>] free_hot_cold_page_list+0x48/0xb0
[   43.501337]  [<ffffffff811b1c49>] release_pages+0x279/0x380
[   43.501340]  [<ffffffff811eaafe>] free_pages_and_swap_cache+0x8e/0xb0
[   43.501344]  [<ffffffff811d313a>] tlb_flush_mmu_free+0x3a/0x60
[   43.501346]  [<ffffffff811d3fbe>] tlb_flush_mmu+0x1e/0x30
[   43.501347]  [<ffffffff811d3fe4>] tlb_finish_mmu+0x14/0x40
[   43.501348]  [<ffffffff811dd608>] exit_mmap+0xe8/0x140
[   43.501352]  [<ffffffff810b8fcd>] mmput+0x3d/0xe0
[   43.501354]  [<ffffffff8121c7f0>] flush_old_exec+0x370/0x680
[   43.501358]  [<ffffffff8126df92>] load_elf_binary+0x332/0x1780
[   43.501359]  [<ffffffff8126b07c>] ? load_misc_binary+0x6c/0x420
[   43.501361]  [<ffffffff811d2994>] ? get_user_pages_remote+0x54/0x60
[   43.501362]  [<ffffffff8121d1e2>] ? copy_strings.isra.25+0x1b2/0x370
[   43.501363]  [<ffffffff8121bee6>] search_binary_handler+0xa6/0x1f0
[   43.501364]  [<ffffffff8121dc2e>] do_execveat_common.isra.37+0x53e/0x700
[   43.501366]  [<ffffffff8121de1c>] do_execve+0x2c/0x30
[   43.501368]  [<ffffffff81226002>] ? getname+0x12/0x20
[   43.501369]  [<ffffffff8121e07e>] SyS_execve+0x2e/0x40
[   43.501371]  [<ffffffff810028a0>] do_syscall_64+0x50/0xa0
[   43.501374]  [<ffffffff8165d87c>] entry_SYSCALL64_slow_path+0x25/0x25

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [BUG] Bad page states
  2016-08-05 17:13 [BUG] Bad page states Eric Dumazet
@ 2016-08-08 17:48   ` Linus Torvalds
  0 siblings, 0 replies; 12+ messages in thread
From: Linus Torvalds @ 2016-08-08 17:48 UTC (permalink / raw)
  To: Eric Dumazet, linux-mm, Vladimir Davydov, Andrew Morton
  Cc: linux-kernel, Minchan Kim, Johannes Weiner, Michal Hocko

Adding more people, and linux-mm.

On Fri, Aug 5, 2016 at 10:13 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> Bisected to nowhere :(
>
> Anyone has an idea ?

How easy is it for you to reproduce? It must be *fairly* easy since
you tried bisecting it, and presumably all the ones you marked "bad"
really are reliably bad.

Which means that I would expect that since the bisect failed, some of
the "good" ones (particularly at the end) might really be bad, just
didn't have the time/load to reproduce.

So maybe you could re-test the good ones for a bit longer? Trust all
the ones you've marked bad (and presumably trust v4.7 itself as good),
and re-try the bisection.

That said, it looks pretty bleak. If you don't trust any of the good
ones, you'd start out with

  git bisect start
  git bisect bad 5bbea66bf8d9ba898abbe5499f06998a993364f6
  git bisect good v4.7

and that's still almost 6000 commits.

So let's narrow it down by looking at the details:

> [   32.666450] BUG: Bad page state in process swapper/0  pfn:1fd945c
> [   32.672542] page:ffffea007f651700 count:0 mapcount:-511 mapping:          (null) index:0x0
> [   32.680823] flags: 0x1000000000000000()
> [   32.684655] page dumped because: nonzero mapcount
> ...
> [   43.477693] BUG: Bad page state in process S05containers  pfn:1ff02a3
> [   43.484417] page:ffffea007fc0a8c0 count:0 mapcount:-511 mapping:          (null) index:0x0
> [   43.492737] flags: 0x1000000000000000()
> [   43.496602] page dumped because: nonzero mapcount

Hmm. The _mapcount field is a union with other fields, but that number
doesn't make sense for any of the other fields.

So it's almost certainly related to "PAGE_KMEMCG_MAPCOUNT_VALUE". So
something presumably mapped such a page into an address space, and
incremented the number. That should never have happened, of course.

Oh. Actually, I guess it *is* PAGE_KMEMCG_MAPCOUNT_VALUE, and what
happens is that __page_mapcount() returns "_mapcount+1", so no other
increment needed.

The fact that one of the trces comes from tlb_flush_mmu_free() still
does mean that it has been mapped into a VM, though.

Unrelated to that, the "flags" field has bit 60 set, which is
presumably just part of the zone/node/section number. Maybe the
page_debug() code should print out that information too, not just the
flag bit names?

Anyway, the PAGE_KMEMCG_MAPCOUNT_VALUE connection makes me blame
Vladimir Davydov and commit 4949148ad433. Maybe you could center your
testing around that one (ie rather than bisection, try the immediate
parent, and then that commit).

And maybe the page mapping code could have some debug code for "am I
mapping a page that has a mapcount < -1", and alert people to that? To
more easily find the path that triggers this?

Vladimir?

                      Linus

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [BUG] Bad page states
@ 2016-08-08 17:48   ` Linus Torvalds
  0 siblings, 0 replies; 12+ messages in thread
From: Linus Torvalds @ 2016-08-08 17:48 UTC (permalink / raw)
  To: Eric Dumazet, linux-mm, Vladimir Davydov, Andrew Morton
  Cc: linux-kernel, Minchan Kim, Johannes Weiner, Michal Hocko

Adding more people, and linux-mm.

On Fri, Aug 5, 2016 at 10:13 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> Bisected to nowhere :(
>
> Anyone has an idea ?

How easy is it for you to reproduce? It must be *fairly* easy since
you tried bisecting it, and presumably all the ones you marked "bad"
really are reliably bad.

Which means that I would expect that since the bisect failed, some of
the "good" ones (particularly at the end) might really be bad, just
didn't have the time/load to reproduce.

So maybe you could re-test the good ones for a bit longer? Trust all
the ones you've marked bad (and presumably trust v4.7 itself as good),
and re-try the bisection.

That said, it looks pretty bleak. If you don't trust any of the good
ones, you'd start out with

  git bisect start
  git bisect bad 5bbea66bf8d9ba898abbe5499f06998a993364f6
  git bisect good v4.7

and that's still almost 6000 commits.

So let's narrow it down by looking at the details:

> [   32.666450] BUG: Bad page state in process swapper/0  pfn:1fd945c
> [   32.672542] page:ffffea007f651700 count:0 mapcount:-511 mapping:          (null) index:0x0
> [   32.680823] flags: 0x1000000000000000()
> [   32.684655] page dumped because: nonzero mapcount
> ...
> [   43.477693] BUG: Bad page state in process S05containers  pfn:1ff02a3
> [   43.484417] page:ffffea007fc0a8c0 count:0 mapcount:-511 mapping:          (null) index:0x0
> [   43.492737] flags: 0x1000000000000000()
> [   43.496602] page dumped because: nonzero mapcount

Hmm. The _mapcount field is a union with other fields, but that number
doesn't make sense for any of the other fields.

So it's almost certainly related to "PAGE_KMEMCG_MAPCOUNT_VALUE". So
something presumably mapped such a page into an address space, and
incremented the number. That should never have happened, of course.

Oh. Actually, I guess it *is* PAGE_KMEMCG_MAPCOUNT_VALUE, and what
happens is that __page_mapcount() returns "_mapcount+1", so no other
increment needed.

The fact that one of the trces comes from tlb_flush_mmu_free() still
does mean that it has been mapped into a VM, though.

Unrelated to that, the "flags" field has bit 60 set, which is
presumably just part of the zone/node/section number. Maybe the
page_debug() code should print out that information too, not just the
flag bit names?

Anyway, the PAGE_KMEMCG_MAPCOUNT_VALUE connection makes me blame
Vladimir Davydov and commit 4949148ad433. Maybe you could center your
testing around that one (ie rather than bisection, try the immediate
parent, and then that commit).

And maybe the page mapping code could have some debug code for "am I
mapping a page that has a mapcount < -1", and alert people to that? To
more easily find the path that triggers this?

Vladimir?

                      Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [BUG] Bad page states
  2016-08-08 17:48   ` Linus Torvalds
@ 2016-08-08 18:37     ` Vladimir Davydov
  -1 siblings, 0 replies; 12+ messages in thread
From: Vladimir Davydov @ 2016-08-08 18:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, linux-mm, Andrew Morton, linux-kernel, Minchan Kim,
	Johannes Weiner, Michal Hocko

On Mon, Aug 08, 2016 at 10:48:45AM -0700, Linus Torvalds wrote:
...
> > [   43.477693] BUG: Bad page state in process S05containers  pfn:1ff02a3
> > [   43.484417] page:ffffea007fc0a8c0 count:0 mapcount:-511 mapping:          (null) index:0x0
> > [   43.492737] flags: 0x1000000000000000()
> > [   43.496602] page dumped because: nonzero mapcount
> 
> Hmm. The _mapcount field is a union with other fields, but that number
> doesn't make sense for any of the other fields.
> 
> So it's almost certainly related to "PAGE_KMEMCG_MAPCOUNT_VALUE". So

Yes, it is - my bad. The thing is I set/clear PAGE_KMEMCG_MAPCOUNT_VALUE
for pages allocated with __GFP_ACCOUNT iff memcg_kmem_enabled() is true
(see __alloc_pages_nodemask and free_pages_prepare), while the latter
gets disabled when the last cgroup gets destroyed. So if you do

 mkdir /sys/fs/cgroup/memory/test
 # run something in the root cgroup that allocates pages with
 # __GFP_ACCOUNT, e.g. a program using pipe
 rmdir /sys/fs/cgroup/memory/test

Then, if there are no other memory cgroups, you'll see the bug.

Sorry about that :-(

Obviously, the PageKmemcg flag should only be set for pages that are
actually accounted to a non-root kmemcg and hence pin memcg_kmem_enabled
static key. I'll fix that.

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [BUG] Bad page states
@ 2016-08-08 18:37     ` Vladimir Davydov
  0 siblings, 0 replies; 12+ messages in thread
From: Vladimir Davydov @ 2016-08-08 18:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, linux-mm, Andrew Morton, linux-kernel, Minchan Kim,
	Johannes Weiner, Michal Hocko

On Mon, Aug 08, 2016 at 10:48:45AM -0700, Linus Torvalds wrote:
...
> > [   43.477693] BUG: Bad page state in process S05containers  pfn:1ff02a3
> > [   43.484417] page:ffffea007fc0a8c0 count:0 mapcount:-511 mapping:          (null) index:0x0
> > [   43.492737] flags: 0x1000000000000000()
> > [   43.496602] page dumped because: nonzero mapcount
> 
> Hmm. The _mapcount field is a union with other fields, but that number
> doesn't make sense for any of the other fields.
> 
> So it's almost certainly related to "PAGE_KMEMCG_MAPCOUNT_VALUE". So

Yes, it is - my bad. The thing is I set/clear PAGE_KMEMCG_MAPCOUNT_VALUE
for pages allocated with __GFP_ACCOUNT iff memcg_kmem_enabled() is true
(see __alloc_pages_nodemask and free_pages_prepare), while the latter
gets disabled when the last cgroup gets destroyed. So if you do

 mkdir /sys/fs/cgroup/memory/test
 # run something in the root cgroup that allocates pages with
 # __GFP_ACCOUNT, e.g. a program using pipe
 rmdir /sys/fs/cgroup/memory/test

Then, if there are no other memory cgroups, you'll see the bug.

Sorry about that :-(

Obviously, the PageKmemcg flag should only be set for pages that are
actually accounted to a non-root kmemcg and hence pin memcg_kmem_enabled
static key. I'll fix that.

Thanks,
Vladimir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH] mm: memcontrol: only mark charged pages with PageKmemcg
  2016-08-08 18:37     ` Vladimir Davydov
  (?)
@ 2016-08-08 20:03       ` Vladimir Davydov
  -1 siblings, 0 replies; 12+ messages in thread
From: Vladimir Davydov @ 2016-08-08 20:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: stable, Johannes Weiner, Michal Hocko, Eric Dumazet,
	Linus Torvalds, Minchan Kim, linux-mm, linux-kernel

To distinguish non-slab pages charged to kmemcg we mark them PageKmemcg,
which sets page->_mapcount to -512. Currently, we set/clear PageKmemcg in
__alloc_pages_nodemask()/free_pages_prepare() for any page allocated
with __GFP_ACCOUNT, including those that aren't actually charged to any
cgroup, i.e. allocated from the root cgroup context. To avoid overhead
in case cgroups are not used, we only do that if memcg_kmem_enabled() is
true. The latter is set iff there are kmem-enabled memory cgroups
(online or offline). The root cgroup is not considered kmem-enabled.

As a result, if a page is allocated with __GFP_ACCOUNT for the root
cgroup when there are kmem-enabled memory cgroups and is freed after all
kmem-enabled memory cgroups were removed, e.g.

  # no memory cgroups has been created yet, create one
  mkdir /sys/fs/cgroup/memory/test
  # run something allocating pages with __GFP_ACCOUNT, e.g.
  # a program using pipe
  dmesg | tail
  # remove the memory cgroup
  rmdir /sys/fs/cgroup/memory/test

we'll get bad page state bug complaining about page->_mapcount != -1:

  BUG: Bad page state in process swapper/0  pfn:1fd945c
  page:ffffea007f651700 count:0 mapcount:-511 mapping:          (null) index:0x0
  flags: 0x1000000000000000()

To avoid that, let's mark with PageKmemcg only those pages that are
actually charged to and hence pin a non-root memory cgroup.

Fixes: 4949148ad433 ("mm: charge/uncharge kmemcg from generic page allocator paths")
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: <stable@vger.kernel.org>	[4.7+]
---
 fs/pipe.c       |  4 +---
 mm/memcontrol.c | 14 ++++++++++++--
 mm/page_alloc.c | 14 +++++---------
 3 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index 4b32928f5426..4ebe6b2e5217 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -144,10 +144,8 @@ static int anon_pipe_buf_steal(struct pipe_inode_info *pipe,
 	struct page *page = buf->page;
 
 	if (page_count(page) == 1) {
-		if (memcg_kmem_enabled()) {
+		if (memcg_kmem_enabled())
 			memcg_kmem_uncharge(page, 0);
-			__ClearPageKmemcg(page);
-		}
 		__SetPageLocked(page);
 		return 0;
 	}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1c0aa59fd333..7c9c0bd475f0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2308,8 +2308,11 @@ int memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
 		return 0;
 
 	memcg = get_mem_cgroup_from_mm(current->mm);
-	if (!mem_cgroup_is_root(memcg))
+	if (!mem_cgroup_is_root(memcg)) {
 		ret = memcg_kmem_charge_memcg(page, gfp, order, memcg);
+		if (!ret)
+			__SetPageKmemcg(page);
+	}
 	css_put(&memcg->css);
 	return ret;
 }
@@ -2336,6 +2339,11 @@ void memcg_kmem_uncharge(struct page *page, int order)
 		page_counter_uncharge(&memcg->memsw, nr_pages);
 
 	page->mem_cgroup = NULL;
+
+	/* slab pages do not have PageKmemcg flag set */
+	if (PageKmemcg(page))
+		__ClearPageKmemcg(page);
+
 	css_put_many(&memcg->css, nr_pages);
 }
 #endif /* !CONFIG_SLOB */
@@ -5533,8 +5541,10 @@ static void uncharge_list(struct list_head *page_list)
 			else
 				nr_file += nr_pages;
 			pgpgout++;
-		} else
+		} else {
 			nr_kmem += 1 << compound_order(page);
+			__ClearPageKmemcg(page);
+		}
 
 		page->mem_cgroup = NULL;
 	} while (next != page_list);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ee5a4b20daf4..3a7d3a1fb6a0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1009,10 +1009,8 @@ static __always_inline bool free_pages_prepare(struct page *page,
 	}
 	if (PageMappingFlags(page))
 		page->mapping = NULL;
-	if (memcg_kmem_enabled() && PageKmemcg(page)) {
+	if (memcg_kmem_enabled() && PageKmemcg(page))
 		memcg_kmem_uncharge(page, order);
-		__ClearPageKmemcg(page);
-	}
 	if (check_free)
 		bad += free_pages_check(page);
 	if (bad)
@@ -3788,12 +3786,10 @@ no_zone:
 	}
 
 out:
-	if (memcg_kmem_enabled() && (gfp_mask & __GFP_ACCOUNT) && page) {
-		if (unlikely(memcg_kmem_charge(page, gfp_mask, order))) {
-			__free_pages(page, order);
-			page = NULL;
-		} else
-			__SetPageKmemcg(page);
+	if (memcg_kmem_enabled() && (gfp_mask & __GFP_ACCOUNT) && page &&
+	    unlikely(memcg_kmem_charge(page, gfp_mask, order) != 0)) {
+		__free_pages(page, order);
+		page = NULL;
 	}
 
 	if (kmemcheck_enabled && page)
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH] mm: memcontrol: only mark charged pages with PageKmemcg
@ 2016-08-08 20:03       ` Vladimir Davydov
  0 siblings, 0 replies; 12+ messages in thread
From: Vladimir Davydov @ 2016-08-08 20:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: stable, Johannes Weiner, Michal Hocko, Eric Dumazet,
	Linus Torvalds, Minchan Kim, linux-mm, linux-kernel

To distinguish non-slab pages charged to kmemcg we mark them PageKmemcg,
which sets page->_mapcount to -512. Currently, we set/clear PageKmemcg in
__alloc_pages_nodemask()/free_pages_prepare() for any page allocated
with __GFP_ACCOUNT, including those that aren't actually charged to any
cgroup, i.e. allocated from the root cgroup context. To avoid overhead
in case cgroups are not used, we only do that if memcg_kmem_enabled() is
true. The latter is set iff there are kmem-enabled memory cgroups
(online or offline). The root cgroup is not considered kmem-enabled.

As a result, if a page is allocated with __GFP_ACCOUNT for the root
cgroup when there are kmem-enabled memory cgroups and is freed after all
kmem-enabled memory cgroups were removed, e.g.

  # no memory cgroups has been created yet, create one
  mkdir /sys/fs/cgroup/memory/test
  # run something allocating pages with __GFP_ACCOUNT, e.g.
  # a program using pipe
  dmesg | tail
  # remove the memory cgroup
  rmdir /sys/fs/cgroup/memory/test

we'll get bad page state bug complaining about page->_mapcount != -1:

  BUG: Bad page state in process swapper/0  pfn:1fd945c
  page:ffffea007f651700 count:0 mapcount:-511 mapping:          (null) index:0x0
  flags: 0x1000000000000000()

To avoid that, let's mark with PageKmemcg only those pages that are
actually charged to and hence pin a non-root memory cgroup.

Fixes: 4949148ad433 ("mm: charge/uncharge kmemcg from generic page allocator paths")
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: <stable@vger.kernel.org>	[4.7+]
---
 fs/pipe.c       |  4 +---
 mm/memcontrol.c | 14 ++++++++++++--
 mm/page_alloc.c | 14 +++++---------
 3 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index 4b32928f5426..4ebe6b2e5217 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -144,10 +144,8 @@ static int anon_pipe_buf_steal(struct pipe_inode_info *pipe,
 	struct page *page = buf->page;
 
 	if (page_count(page) == 1) {
-		if (memcg_kmem_enabled()) {
+		if (memcg_kmem_enabled())
 			memcg_kmem_uncharge(page, 0);
-			__ClearPageKmemcg(page);
-		}
 		__SetPageLocked(page);
 		return 0;
 	}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1c0aa59fd333..7c9c0bd475f0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2308,8 +2308,11 @@ int memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
 		return 0;
 
 	memcg = get_mem_cgroup_from_mm(current->mm);
-	if (!mem_cgroup_is_root(memcg))
+	if (!mem_cgroup_is_root(memcg)) {
 		ret = memcg_kmem_charge_memcg(page, gfp, order, memcg);
+		if (!ret)
+			__SetPageKmemcg(page);
+	}
 	css_put(&memcg->css);
 	return ret;
 }
@@ -2336,6 +2339,11 @@ void memcg_kmem_uncharge(struct page *page, int order)
 		page_counter_uncharge(&memcg->memsw, nr_pages);
 
 	page->mem_cgroup = NULL;
+
+	/* slab pages do not have PageKmemcg flag set */
+	if (PageKmemcg(page))
+		__ClearPageKmemcg(page);
+
 	css_put_many(&memcg->css, nr_pages);
 }
 #endif /* !CONFIG_SLOB */
@@ -5533,8 +5541,10 @@ static void uncharge_list(struct list_head *page_list)
 			else
 				nr_file += nr_pages;
 			pgpgout++;
-		} else
+		} else {
 			nr_kmem += 1 << compound_order(page);
+			__ClearPageKmemcg(page);
+		}
 
 		page->mem_cgroup = NULL;
 	} while (next != page_list);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ee5a4b20daf4..3a7d3a1fb6a0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1009,10 +1009,8 @@ static __always_inline bool free_pages_prepare(struct page *page,
 	}
 	if (PageMappingFlags(page))
 		page->mapping = NULL;
-	if (memcg_kmem_enabled() && PageKmemcg(page)) {
+	if (memcg_kmem_enabled() && PageKmemcg(page))
 		memcg_kmem_uncharge(page, order);
-		__ClearPageKmemcg(page);
-	}
 	if (check_free)
 		bad += free_pages_check(page);
 	if (bad)
@@ -3788,12 +3786,10 @@ no_zone:
 	}
 
 out:
-	if (memcg_kmem_enabled() && (gfp_mask & __GFP_ACCOUNT) && page) {
-		if (unlikely(memcg_kmem_charge(page, gfp_mask, order))) {
-			__free_pages(page, order);
-			page = NULL;
-		} else
-			__SetPageKmemcg(page);
+	if (memcg_kmem_enabled() && (gfp_mask & __GFP_ACCOUNT) && page &&
+	    unlikely(memcg_kmem_charge(page, gfp_mask, order) != 0)) {
+		__free_pages(page, order);
+		page = NULL;
 	}
 
 	if (kmemcheck_enabled && page)
-- 
2.1.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH] mm: memcontrol: only mark charged pages with PageKmemcg
@ 2016-08-08 20:03       ` Vladimir Davydov
  0 siblings, 0 replies; 12+ messages in thread
From: Vladimir Davydov @ 2016-08-08 20:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: stable, Johannes Weiner, Michal Hocko, Eric Dumazet,
	Linus Torvalds, Minchan Kim, linux-mm, linux-kernel

To distinguish non-slab pages charged to kmemcg we mark them PageKmemcg,
which sets page->_mapcount to -512. Currently, we set/clear PageKmemcg in
__alloc_pages_nodemask()/free_pages_prepare() for any page allocated
with __GFP_ACCOUNT, including those that aren't actually charged to any
cgroup, i.e. allocated from the root cgroup context. To avoid overhead
in case cgroups are not used, we only do that if memcg_kmem_enabled() is
true. The latter is set iff there are kmem-enabled memory cgroups
(online or offline). The root cgroup is not considered kmem-enabled.

As a result, if a page is allocated with __GFP_ACCOUNT for the root
cgroup when there are kmem-enabled memory cgroups and is freed after all
kmem-enabled memory cgroups were removed, e.g.

  # no memory cgroups has been created yet, create one
  mkdir /sys/fs/cgroup/memory/test
  # run something allocating pages with __GFP_ACCOUNT, e.g.
  # a program using pipe
  dmesg | tail
  # remove the memory cgroup
  rmdir /sys/fs/cgroup/memory/test

we'll get bad page state bug complaining about page->_mapcount != -1:

  BUG: Bad page state in process swapper/0  pfn:1fd945c
  page:ffffea007f651700 count:0 mapcount:-511 mapping:          (null) index:0x0
  flags: 0x1000000000000000()

To avoid that, let's mark with PageKmemcg only those pages that are
actually charged to and hence pin a non-root memory cgroup.

Fixes: 4949148ad433 ("mm: charge/uncharge kmemcg from generic page allocator paths")
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: <stable@vger.kernel.org>	[4.7+]
---
 fs/pipe.c       |  4 +---
 mm/memcontrol.c | 14 ++++++++++++--
 mm/page_alloc.c | 14 +++++---------
 3 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index 4b32928f5426..4ebe6b2e5217 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -144,10 +144,8 @@ static int anon_pipe_buf_steal(struct pipe_inode_info *pipe,
 	struct page *page = buf->page;
 
 	if (page_count(page) == 1) {
-		if (memcg_kmem_enabled()) {
+		if (memcg_kmem_enabled())
 			memcg_kmem_uncharge(page, 0);
-			__ClearPageKmemcg(page);
-		}
 		__SetPageLocked(page);
 		return 0;
 	}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1c0aa59fd333..7c9c0bd475f0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2308,8 +2308,11 @@ int memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
 		return 0;
 
 	memcg = get_mem_cgroup_from_mm(current->mm);
-	if (!mem_cgroup_is_root(memcg))
+	if (!mem_cgroup_is_root(memcg)) {
 		ret = memcg_kmem_charge_memcg(page, gfp, order, memcg);
+		if (!ret)
+			__SetPageKmemcg(page);
+	}
 	css_put(&memcg->css);
 	return ret;
 }
@@ -2336,6 +2339,11 @@ void memcg_kmem_uncharge(struct page *page, int order)
 		page_counter_uncharge(&memcg->memsw, nr_pages);
 
 	page->mem_cgroup = NULL;
+
+	/* slab pages do not have PageKmemcg flag set */
+	if (PageKmemcg(page))
+		__ClearPageKmemcg(page);
+
 	css_put_many(&memcg->css, nr_pages);
 }
 #endif /* !CONFIG_SLOB */
@@ -5533,8 +5541,10 @@ static void uncharge_list(struct list_head *page_list)
 			else
 				nr_file += nr_pages;
 			pgpgout++;
-		} else
+		} else {
 			nr_kmem += 1 << compound_order(page);
+			__ClearPageKmemcg(page);
+		}
 
 		page->mem_cgroup = NULL;
 	} while (next != page_list);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ee5a4b20daf4..3a7d3a1fb6a0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1009,10 +1009,8 @@ static __always_inline bool free_pages_prepare(struct page *page,
 	}
 	if (PageMappingFlags(page))
 		page->mapping = NULL;
-	if (memcg_kmem_enabled() && PageKmemcg(page)) {
+	if (memcg_kmem_enabled() && PageKmemcg(page))
 		memcg_kmem_uncharge(page, order);
-		__ClearPageKmemcg(page);
-	}
 	if (check_free)
 		bad += free_pages_check(page);
 	if (bad)
@@ -3788,12 +3786,10 @@ no_zone:
 	}
 
 out:
-	if (memcg_kmem_enabled() && (gfp_mask & __GFP_ACCOUNT) && page) {
-		if (unlikely(memcg_kmem_charge(page, gfp_mask, order))) {
-			__free_pages(page, order);
-			page = NULL;
-		} else
-			__SetPageKmemcg(page);
+	if (memcg_kmem_enabled() && (gfp_mask & __GFP_ACCOUNT) && page &&
+	    unlikely(memcg_kmem_charge(page, gfp_mask, order) != 0)) {
+		__free_pages(page, order);
+		page = NULL;
 	}
 
 	if (kmemcheck_enabled && page)
-- 
2.1.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH] mm: memcontrol: only mark charged pages with PageKmemcg
  2016-08-08 20:03       ` Vladimir Davydov
@ 2016-08-08 21:20         ` Linus Torvalds
  -1 siblings, 0 replies; 12+ messages in thread
From: Linus Torvalds @ 2016-08-08 21:20 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, stable, Johannes Weiner, Michal Hocko,
	Eric Dumazet, Minchan Kim, linux-mm, Linux Kernel Mailing List

On Mon, Aug 8, 2016 at 1:03 PM, Vladimir Davydov <vdavydov@virtuozzo.com> wrote:
> To distinguish non-slab pages charged to kmemcg we mark them PageKmemcg [..]

Eric, can you confirm that this fixes your issue?

             Linus

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] mm: memcontrol: only mark charged pages with PageKmemcg
@ 2016-08-08 21:20         ` Linus Torvalds
  0 siblings, 0 replies; 12+ messages in thread
From: Linus Torvalds @ 2016-08-08 21:20 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, stable, Johannes Weiner, Michal Hocko,
	Eric Dumazet, Minchan Kim, linux-mm, Linux Kernel Mailing List

On Mon, Aug 8, 2016 at 1:03 PM, Vladimir Davydov <vdavydov@virtuozzo.com> wrote:
> To distinguish non-slab pages charged to kmemcg we mark them PageKmemcg [..]

Eric, can you confirm that this fixes your issue?

             Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] mm: memcontrol: only mark charged pages with PageKmemcg
  2016-08-08 21:20         ` Linus Torvalds
@ 2016-08-09 16:58           ` Eric Dumazet
  -1 siblings, 0 replies; 12+ messages in thread
From: Eric Dumazet @ 2016-08-09 16:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Vladimir Davydov, Andrew Morton, stable, Johannes Weiner,
	Michal Hocko, Minchan Kim, linux-mm, Linux Kernel Mailing List

On Mon, 2016-08-08 at 14:20 -0700, Linus Torvalds wrote:
> On Mon, Aug 8, 2016 at 1:03 PM, Vladimir Davydov <vdavydov@virtuozzo.com> wrote:
> > To distinguish non-slab pages charged to kmemcg we mark them PageKmemcg [..]
> 
> Eric, can you confirm that this fixes your issue?
> 

It is fixing the issue for me, thanks a lot Vladimir and Linus.

(I also checked that commit 4949148ad433 was indeed the bug origin)

Tested-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] mm: memcontrol: only mark charged pages with PageKmemcg
@ 2016-08-09 16:58           ` Eric Dumazet
  0 siblings, 0 replies; 12+ messages in thread
From: Eric Dumazet @ 2016-08-09 16:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Vladimir Davydov, Andrew Morton, stable, Johannes Weiner,
	Michal Hocko, Minchan Kim, linux-mm, Linux Kernel Mailing List

On Mon, 2016-08-08 at 14:20 -0700, Linus Torvalds wrote:
> On Mon, Aug 8, 2016 at 1:03 PM, Vladimir Davydov <vdavydov@virtuozzo.com> wrote:
> > To distinguish non-slab pages charged to kmemcg we mark them PageKmemcg [..]
> 
> Eric, can you confirm that this fixes your issue?
> 

It is fixing the issue for me, thanks a lot Vladimir and Linus.

(I also checked that commit 4949148ad433 was indeed the bug origin)

Tested-by: Eric Dumazet <edumazet@google.com>




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2016-08-09 16:58 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-08-05 17:13 [BUG] Bad page states Eric Dumazet
2016-08-08 17:48 ` Linus Torvalds
2016-08-08 17:48   ` Linus Torvalds
2016-08-08 18:37   ` Vladimir Davydov
2016-08-08 18:37     ` Vladimir Davydov
2016-08-08 20:03     ` [PATCH] mm: memcontrol: only mark charged pages with PageKmemcg Vladimir Davydov
2016-08-08 20:03       ` Vladimir Davydov
2016-08-08 20:03       ` Vladimir Davydov
2016-08-08 21:20       ` Linus Torvalds
2016-08-08 21:20         ` Linus Torvalds
2016-08-09 16:58         ` Eric Dumazet
2016-08-09 16:58           ` Eric Dumazet

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.