linux-kernel.vger.kernel.org archive mirror
* linux-next: kernel BUG at mm/slub.c:1447!
@ 2015-10-01 16:06 Guenter Roeck
  2015-10-01 16:58 ` Christoph Lameter
  2015-10-01 20:49 ` Andrew Morton
  0 siblings, 2 replies; 12+ messages in thread
From: Guenter Roeck @ 2015-10-01 16:06 UTC (permalink / raw)
  To: linux-kernel; +Cc: Christoph Lameter, Kirill A. Shutemov, Andrew Morton

Seen with next-20151001, running qemu, simulating Opteron_G1 with a non-SMP configuration.
On a re-run, I have seen it with the same image, but this time when simulating IvyBridge,
so it is not CPU dependent. I did not previously see the problem.

Log is at
http://server.roeck-us.net:8010/builders/qemu-x86-next/builds/259/steps/qemubuildcommand/logs/stdio

I'll try to bisect. The problem is not seen with every boot, so that may take a while.

Guenter

---
gfp: 2
------------[ cut here ]------------
invalid opcode: 0000 [#1] PREEMPT
Modules linked in:
CPU: 0 PID: 121 Comm: udevd Not tainted 4.3.0-rc3-next-20151001-yocto-standard #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
task: ced90000 ti: ced8c000 task.ti: ced8c000
EIP: 0060:[<c1128873>] EFLAGS: 00000092 CPU: 0
EIP is at new_slab+0x353/0x360
EAX: 00000006 EBX: 00000000 ECX: 00000001 EDX: 80000001
ESI: cf8019c0 EDI: 00000000 EBP: ced8daa4 ESP: ced8da7c
  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
CR0: 8005003b CR2: 080791c0 CR3: 0ed6c000 CR4: 000006d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: 00000000 DR7: 00000000
Stack:
  c19a42cf 00000002 c137542e ced8da90 c137544c ffffffff c144c8a8 00000000
  cf8019c0 00000000 ced8db28 c1129ca8 0203128a c144f346 00004e20 cec2e740
  c10ee933 0203128a cf8019c0 c181d6c0 c181d460 000001a1 00150015 c0011c00
Call Trace:
  [<c137542e>] ? __delay+0xe/0x10
  [<c137544c>] ? __const_udelay+0x1c/0x20
  [<c144c8a8>] ? ide_execute_command+0x68/0xa0
  [<c1129ca8>] ___slab_alloc.constprop.75+0x248/0x310
  [<c144f346>] ? do_rw_taskfile+0x286/0x320
  [<c10ee933>] ? mempool_alloc_slab+0x13/0x20
  [<c1457d12>] ? ide_do_rw_disk+0x222/0x320
  [<c1136219>] __slab_alloc.isra.72.constprop.74+0x18/0x1f
  [<c112a2f2>] kmem_cache_alloc+0x122/0x1c0
  [<c10ee933>] ? mempool_alloc_slab+0x13/0x20
  [<c10ee933>] mempool_alloc_slab+0x13/0x20
  [<c10eebe5>] mempool_alloc+0x45/0x170
  [<c1345202>] bio_alloc_bioset+0xd2/0x1b0
  [<c1172e9f>] mpage_alloc+0x2f/0xa0
  [<c1037979>] ? kmap_atomic_prot+0x59/0xf0
  [<c1173523>] do_mpage_readpage+0x4d3/0x7e0
  [<c10f31b8>] ? __alloc_pages_nodemask+0xf8/0x8c0
  [<c134ed67>] ? blk_queue_bio+0x267/0x2d0
  [<c112a24a>] ? kmem_cache_alloc+0x7a/0x1c0
  [<c138357f>] ? __this_cpu_preempt_check+0xf/0x20
  [<c1173894>] mpage_readpage+0x64/0x80
  [<c119fa90>] ? __ext2_truncate_blocks+0x450/0x450
  [<c10f8dfd>] ? lru_cache_add+0xd/0x10
  [<c10ec367>] ? add_to_page_cache_lru+0x57/0x90
  [<c119f3b4>] ext2_readpage+0x14/0x20
  [<c10ec62b>] do_read_cache_page+0x7b/0x1c0
  [<c119f3a0>] ? ext2_writepages+0x20/0x20
  [<c10ec7c4>] read_cache_page+0x24/0x30
  [<c119d1e3>] ext2_get_page.isra.10+0x23/0x250
  [<c114c663>] ? __d_rehash+0x43/0x60
  [<c114c6ca>] ? d_rehash+0x4a/0x70
  [<c114dbec>] ? d_splice_alias+0x7c/0x280
  [<c11a0dba>] ? ext2_iget+0x1fa/0x370
  [<c119d7b0>] ext2_find_entry+0x80/0x200
  [<c114e44a>] ? d_alloc+0x4a/0x70
  [<c114151e>] ? lookup_real+0x1e/0x50
  [<c114e082>] ? __d_alloc+0x22/0x120
  [<c119d99b>] ext2_inode_by_name+0x1b/0x40
  [<c11a1d92>] ext2_lookup+0x42/0xa0
  [<c114e44a>] ? d_alloc+0x4a/0x70
  [<c114151e>] lookup_real+0x1e/0x50
  [<c11453d1>] path_openat+0x791/0xdc0
  [<c139da03>] ? device_show+0x23/0x30
  [<c11991fe>] ? kernfs_put_open_node.isra.7+0x7e/0xa0
  [<c1105715>] ? kvfree+0x45/0x50
  [<c1146940>] do_filp_open+0x60/0xb0
  [<c1152a47>] ? __alloc_fd+0xb7/0x100
  [<c1138043>] do_sys_open+0x123/0x220
  [<c113a2cc>] ? fput+0x4c/0x90
  [<c1138162>] SyS_open+0x22/0x30
  [<c17be051>] syscall_call+0x7/0x7
Code: 8b 46 7c d3 e2 89 d1 89 fa e8 1a ae 00 00 85 c0 0f 85 dc fe ff ff e9 34 fd ff ff 89 44 24 04 c7 04 24 cf 42 9a c1 e8 b9 1f fc ff <0f> 0b 8d 74 26 00 8d bc 27 00 00 00 00 55 89 e5 57 56 53 83 ec
EIP: [<c1128873>] new_slab+0x353/0x360 SS:ESP 0068:ced8da7c
---[ end trace 7503d0e5896d8e13 ]---
Kernel panic - not syncing: Fatal exception
Kernel Offset: disabled
---[ end Kernel panic - not syncing: Fatal exception
qemu: terminating on signal 15 from pid 18056

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: linux-next: kernel BUG at mm/slub.c:1447!
  2015-10-01 16:06 linux-next: kernel BUG at mm/slub.c:1447! Guenter Roeck
@ 2015-10-01 16:58 ` Christoph Lameter
  2015-10-01 20:49 ` Andrew Morton
  1 sibling, 0 replies; 12+ messages in thread
From: Christoph Lameter @ 2015-10-01 16:58 UTC (permalink / raw)
  To: Guenter Roeck; +Cc: linux-kernel, Kirill A. Shutemov, Andrew Morton

On Thu, 1 Oct 2015, Guenter Roeck wrote:

> Seen with next-20151001, running qemu, simulating Opteron_G1 with a non-SMP
> configuration.
> On a re-run, I have seen it with the same image, but this time when simulating
> IvyBridge,
> so it is not CPU dependent. I did not previously see the problem.

Could you boot with slub_debug on the kernel command line to enable
diagnostics?



* Re: linux-next: kernel BUG at mm/slub.c:1447!
  2015-10-01 16:06 linux-next: kernel BUG at mm/slub.c:1447! Guenter Roeck
  2015-10-01 16:58 ` Christoph Lameter
@ 2015-10-01 20:49 ` Andrew Morton
  2015-10-02  7:25   ` Michal Hocko
  2015-10-02 13:36   ` Guenter Roeck
  1 sibling, 2 replies; 12+ messages in thread
From: Andrew Morton @ 2015-10-01 20:49 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: linux-kernel, Christoph Lameter, Kirill A. Shutemov, linux-mm,
	Michal Hocko, Dave Chinner

On Thu, 1 Oct 2015 09:06:15 -0700 Guenter Roeck <linux@roeck-us.net> wrote:

> Seen with next-20151001, running qemu, simulating Opteron_G1 with a non-SMP configuration.
> On a re-run, I have seen it with the same image, but this time when simulating IvyBridge,
> so it is not CPU dependent. I did not previously see the problem.
> 
> Log is at
> http://server.roeck-us.net:8010/builders/qemu-x86-next/builds/259/steps/qemubuildcommand/logs/stdio
> 
> I'll try to bisect. The problem is not seen with every boot, so that may take a while.

Caused by mhocko's "mm, fs: obey gfp_mapping for add_to_page_cache()",
I expect.

> ---
> gfp: 2

That's __GFP_HIGHMEM

> ------------[ cut here ]------------
> invalid opcode: 0000 [#1] PREEMPT
> Modules linked in:
> CPU: 0 PID: 121 Comm: udevd Not tainted 4.3.0-rc3-next-20151001-yocto-standard #1
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
> task: ced90000 ti: ced8c000 task.ti: ced8c000
> EIP: 0060:[<c1128873>] EFLAGS: 00000092 CPU: 0
> EIP is at new_slab+0x353/0x360
> EAX: 00000006 EBX: 00000000 ECX: 00000001 EDX: 80000001
> ESI: cf8019c0 EDI: 00000000 EBP: ced8daa4 ESP: ced8da7c
>   DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
> CR0: 8005003b CR2: 080791c0 CR3: 0ed6c000 CR4: 000006d0
> DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
> DR6: 00000000 DR7: 00000000
> Stack:
>   c19a42cf 00000002 c137542e ced8da90 c137544c ffffffff c144c8a8 00000000
>   cf8019c0 00000000 ced8db28 c1129ca8 0203128a c144f346 00004e20 cec2e740
>   c10ee933 0203128a cf8019c0 c181d6c0 c181d460 000001a1 00150015 c0011c00
> Call Trace:
>   [<c137542e>] ? __delay+0xe/0x10
>   [<c137544c>] ? __const_udelay+0x1c/0x20
>   [<c144c8a8>] ? ide_execute_command+0x68/0xa0
>   [<c1129ca8>] ___slab_alloc.constprop.75+0x248/0x310
>   [<c144f346>] ? do_rw_taskfile+0x286/0x320
>   [<c10ee933>] ? mempool_alloc_slab+0x13/0x20
>   [<c1457d12>] ? ide_do_rw_disk+0x222/0x320
>   [<c1136219>] __slab_alloc.isra.72.constprop.74+0x18/0x1f
>   [<c112a2f2>] kmem_cache_alloc+0x122/0x1c0
>   [<c10ee933>] ? mempool_alloc_slab+0x13/0x20
>   [<c10ee933>] mempool_alloc_slab+0x13/0x20
>   [<c10eebe5>] mempool_alloc+0x45/0x170
>   [<c1345202>] bio_alloc_bioset+0xd2/0x1b0
>   [<c1172e9f>] mpage_alloc+0x2f/0xa0
>   [<c1037979>] ? kmap_atomic_prot+0x59/0xf0
>   [<c1173523>] do_mpage_readpage+0x4d3/0x7e0
>   [<c10f31b8>] ? __alloc_pages_nodemask+0xf8/0x8c0
>   [<c134ed67>] ? blk_queue_bio+0x267/0x2d0
>   [<c112a24a>] ? kmem_cache_alloc+0x7a/0x1c0
>   [<c138357f>] ? __this_cpu_preempt_check+0xf/0x20
>   [<c1173894>] mpage_readpage+0x64/0x80

mpage_readpage() is getting the __GFP_HIGHMEM from mapping_gfp_mask()
and that got passed all the way into kmem_cache_alloc() to allocate a
bio.  slab goes BUG if asked for highmem.

A fix would be to mask off __GFP_HIGHMEM right there in
mpage_readpage().

But I think the patch needs a bit of a rethink.  mapping_gfp_mask() is
the mask for allocating a file's pagecache.  It isn't designed for
allocation of memory for IO structures, file metadata, etc.

Now, we could redefine mapping_gfp_mask()'s purpose (or formalize
stuff which has been sneaking in anyway).  Treat mapping_gfp_mask() as
a constraint mask - instead of it being "use this gfp for this
mapping", it becomes "don't use these gfp flags for this mapping".

Hence something like:

gfp_t mapping_gfp_constraint(struct address_space *mapping, gfp_t gfp_in)
{
	return mapping_gfp_mask(mapping) & gfp_in;
}

So instead of doing this:

@@ -370,12 +371,13 @@ mpage_readpages(struct address_space *ma
 		prefetchw(&page->flags);
 		list_del(&page->lru);
 		if (!add_to_page_cache_lru(page, mapping,
-					page->index, GFP_KERNEL)) {
+					page->index,
+					gfp)) {

Michal's patch will do:

@@ -370,12 +371,13 @@ mpage_readpages(struct address_space *ma
 		prefetchw(&page->flags);
 		list_del(&page->lru);
 		if (!add_to_page_cache_lru(page, mapping,
-				page->index, GFP_KERNEL)) {
+				page->index,
+				mapping_gfp_constraint(mapping, GFP_KERNEL))) {

ie: use mapping_gfp_mask() to strip out any GFP flags which the
filesystem doesn't want used.  If the filesystem has *added* flags to
mapping_gfp_mask() then obviously this won't work and we'll need two
fields in the address_space or something.

Meanwhile I'll drop "mm, fs: obey gfp_mapping for add_to_page_cache()",
thanks for the report.


* Re: linux-next: kernel BUG at mm/slub.c:1447!
  2015-10-01 20:49 ` Andrew Morton
@ 2015-10-02  7:25   ` Michal Hocko
  2015-10-02  8:53     ` Michal Hocko
  2015-10-02 20:49     ` Andrew Morton
  2015-10-02 13:36   ` Guenter Roeck
  1 sibling, 2 replies; 12+ messages in thread
From: Michal Hocko @ 2015-10-02  7:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Guenter Roeck, linux-kernel, Christoph Lameter,
	Kirill A. Shutemov, linux-mm, Dave Chinner

On Thu 01-10-15 13:49:04, Andrew Morton wrote:
[...]
> mpage_readpage() is getting the __GFP_HIGHMEM from mapping_gfp_mask()
> and that got passed all the way into kmem_cache_alloc() to allocate a
> bio.  slab goes BUG if asked for highmem.
> 
> A fix would be to mask off __GFP_HIGHMEM right there in
> mpage_readpage().

Yes, this is an obvious bug in the patch. It should only make the gfp
mask more restrictive.

> But I think the patch needs a bit of a rethink.  mapping_gfp_mask() is
> the mask for allocating a file's pagecache.  It isn't designed for
> allocation of memory for IO structures, file metadata, etc.
>
> Now, we could redefine mapping_gfp_mask()'s purpose (or formalize
> stuff which has been sneaking in anyway).  Treat mapping_gfp_mask() as
> a constraint mask - instead of it being "use this gfp for this
> mapping", it becomes "don't use these gfp flags for this mapping".
> 
> Hence something like:
> 
> gfp_t mapping_gfp_constraint(struct address_space *mapping, gfp_t gfp_in)
> {
> 	return mapping_gfp_mask(mapping) & gfp_in;
> }
> 
> So instead of doing this:
> 
> @@ -370,12 +371,13 @@ mpage_readpages(struct address_space *ma
>  		prefetchw(&page->flags);
>  		list_del(&page->lru);
>  		if (!add_to_page_cache_lru(page, mapping,
> -					page->index, GFP_KERNEL)) {
> +					page->index,
> +					gfp)) {
> 
> Michal's patch will do:
> 
> @@ -370,12 +371,13 @@ mpage_readpages(struct address_space *ma
>  		prefetchw(&page->flags);
>  		list_del(&page->lru);
>  		if (!add_to_page_cache_lru(page, mapping,
> -				page->index, GFP_KERNEL)) {
> +				page->index,
> +				mapping_gfp_constraint(mapping, GFP_KERNEL))) {
> 
> ie: use mapping_gfp_mask() to strip out any GFP flags which the
> filesystem doesn't want used.  If the filesystem has *added* flags to
> mapping_gfp_mask() then obviously this won't work and we'll need two
> fields in the address_space or something.
> 
> Meanwhile I'll drop "mm, fs: obey gfp_mapping for add_to_page_cache()",
> thanks for the report.

mapping_gfp_mask is used in many places, so I think it would be better
to fix this particular place (the others seem to be correct); that would
make the stable backport much easier. We can build a saner API on top.
What do you think?

Here is the respin of the original patch. I will post another one which
will add mapping_gfp_constraint on top. It will surely be less error
prone.
---
>From 783643af1d21ed09b8c8587c2f0fc9a9ab2fe1d7 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Thu, 1 Oct 2015 09:41:28 +1000
Subject: [PATCH] mm, fs: obey gfp_mapping for add_to_page_cache()

6afdb859b710 ("mm: do not ignore mapping_gfp_mask in page cache allocation
paths") has caught some users of hardcoded GFP_KERNEL used in the page
cache allocation paths.  This, however, wasn't complete and there were
others which went unnoticed.

Dave Chinner has reported the following deadlock for xfs on loop device:
: With the recent merge of the loop device changes, I'm now seeing
: XFS deadlock on my single CPU, 1GB RAM VM running xfs/073.
:
: The deadlock is as follows:
:
: kloopd1: loop_queue_read_work
:       xfs_file_iter_read
:       lock XFS inode XFS_IOLOCK_SHARED (on image file)
:       page cache read (GFP_KERNEL)
:       radix tree alloc
:       memory reclaim
:       reclaim XFS inodes
:       log force to unpin inodes
:       <wait for log IO completion>
:
: xfs-cil/loop1: <does log force IO work>
:       xlog_cil_push
:       xlog_write
:       <loop issuing log writes>
:               xlog_state_get_iclog_space()
:               <blocks due to all log buffers under write io>
:               <waits for IO completion>
:
: kloopd1: loop_queue_write_work
:       xfs_file_write_iter
:       lock XFS inode XFS_IOLOCK_EXCL (on image file)
:       <wait for inode to be unlocked>
:
: i.e. the kloopd, with its split read and write work queues, has
: introduced a dependency through memory reclaim, i.e. that writes
: need to be able to progress for reads to make progress.
:
: The problem, fundamentally, is that mpage_readpages() does a
: GFP_KERNEL allocation, rather than paying attention to the inode's
: mapping gfp mask, which is set to GFP_NOFS.
:
: This didn't use to happen, because the loop device used to issue
: reads through the splice path and that does:
:
:       error = add_to_page_cache_lru(page, mapping, index,
:                       GFP_KERNEL & mapping_gfp_mask(mapping));

This was changed by aa4d86163e4 ("block: loop: switch to VFS ITER_BVEC").

This patch changes mpage_readpage{s} to follow the gfp mask set for the
mapping.  There are, however, other places which are doing basically the
same.

lustre's ll_dir_filler uses GFP_KERNEL in a function which apparently
uses GFP_NOFS for its other allocations, so let's make this consistent.

cifs's readpages_get_pages is called from cifs_readpages;
__cifs_readpages_from_fscache, called from the same path, already obeys
the mapping gfp.

ramfs_nommu_expand_for_mapping hardcodes GFP_KERNEL as well, even
though it uses mapping_gfp_mask for the page allocation.

ext4_mpage_readpages is called from the page cache allocation path, the
same as read_pages and read_cache_pages.

As I noted in my previous post, I cannot say I am happy about
sprinkling mapping_gfp_mask all over the place. It sounds like we should
drop the gfp_mask argument altogether and apply the mask internally in
__add_to_page_cache_locked, but that would require all filesystems to
use the mapping gfp consistently, which I am not sure is the case here.
From a quick glance it seems that some filesystems use it all the time
while others are selective.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Dave Chinner <david@fromorbit.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Andreas Dilger <andreas.dilger@intel.com>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 drivers/staging/lustre/lustre/llite/dir.c |  2 +-
 fs/cifs/file.c                            |  5 +++--
 fs/ext4/readpage.c                        |  3 ++-
 fs/mpage.c                                | 15 +++++++++------
 fs/ramfs/file-nommu.c                     |  5 +++--
 mm/readahead.c                            |  6 ++++--
 6 files changed, 22 insertions(+), 14 deletions(-)

diff --git a/drivers/staging/lustre/lustre/llite/dir.c b/drivers/staging/lustre/lustre/llite/dir.c
index cc6f0f596ffe..b512ad6f30fe 100644
--- a/drivers/staging/lustre/lustre/llite/dir.c
+++ b/drivers/staging/lustre/lustre/llite/dir.c
@@ -224,7 +224,7 @@ static int ll_dir_filler(void *_hash, struct page *page0)
 
 		prefetchw(&page->flags);
 		ret = add_to_page_cache_lru(page, inode->i_mapping, offset,
-					    GFP_KERNEL);
+					    GFP_NOFS);
 		if (ret == 0) {
 			unlock_page(page);
 		} else {
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index c22db5dae2b2..5cf50653796a 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -3380,6 +3380,7 @@ readpages_get_pages(struct address_space *mapping, struct list_head *page_list,
 	struct page *page, *tpage;
 	unsigned int expected_index;
 	int rc;
+	gfp_t gfp = GFP_KERNEL & mapping_gfp_mask(mapping);
 
 	INIT_LIST_HEAD(tmplist);
 
@@ -3392,7 +3393,7 @@ readpages_get_pages(struct address_space *mapping, struct list_head *page_list,
 	 */
 	__SetPageLocked(page);
 	rc = add_to_page_cache_locked(page, mapping,
-				      page->index, GFP_KERNEL);
+				      page->index, gfp);
 
 	/* give up if we can't stick it in the cache */
 	if (rc) {
@@ -3418,7 +3419,7 @@ readpages_get_pages(struct address_space *mapping, struct list_head *page_list,
 			break;
 
 		__SetPageLocked(page);
-		if (add_to_page_cache_locked(page, mapping, page->index, GFP_KERNEL)) {
+		if (add_to_page_cache_locked(page, mapping, page->index, gfp)) {
 			__ClearPageLocked(page);
 			break;
 		}
diff --git a/fs/ext4/readpage.c b/fs/ext4/readpage.c
index e26803fb210d..c31d64d70cc0 100644
--- a/fs/ext4/readpage.c
+++ b/fs/ext4/readpage.c
@@ -166,7 +166,8 @@ int ext4_mpage_readpages(struct address_space *mapping,
 			page = list_entry(pages->prev, struct page, lru);
 			list_del(&page->lru);
 			if (add_to_page_cache_lru(page, mapping,
-						  page->index, GFP_KERNEL))
+						  page->index,
+						  GFP_KERNEL & mapping_gfp_mask(mapping)))
 				goto next_page;
 		}
 
diff --git a/fs/mpage.c b/fs/mpage.c
index 2ebf91652ecb..09abba7653aa 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -139,7 +139,8 @@ map_buffer_to_page(struct page *page, struct buffer_head *bh, int page_block)
 static struct bio *
 do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
 		sector_t *last_block_in_bio, struct buffer_head *map_bh,
-		unsigned long *first_logical_block, get_block_t get_block)
+		unsigned long *first_logical_block, get_block_t get_block,
+		gfp_t gfp)
 {
 	struct inode *inode = page->mapping->host;
 	const unsigned blkbits = inode->i_blkbits;
@@ -277,8 +278,7 @@ do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
 				goto out;
 		}
 		bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
-				min_t(int, nr_pages, BIO_MAX_PAGES),
-				GFP_KERNEL);
+				min_t(int, nr_pages, BIO_MAX_PAGES), gfp);
 		if (bio == NULL)
 			goto confused;
 	}
@@ -361,6 +361,7 @@ mpage_readpages(struct address_space *mapping, struct list_head *pages,
 	sector_t last_block_in_bio = 0;
 	struct buffer_head map_bh;
 	unsigned long first_logical_block = 0;
+	gfp_t gfp = GFP_KERNEL & mapping_gfp_mask(mapping);
 
 	map_bh.b_state = 0;
 	map_bh.b_size = 0;
@@ -370,12 +371,13 @@ mpage_readpages(struct address_space *mapping, struct list_head *pages,
 		prefetchw(&page->flags);
 		list_del(&page->lru);
 		if (!add_to_page_cache_lru(page, mapping,
-					page->index, GFP_KERNEL)) {
+					page->index,
+					gfp)) {
 			bio = do_mpage_readpage(bio, page,
 					nr_pages - page_idx,
 					&last_block_in_bio, &map_bh,
 					&first_logical_block,
-					get_block);
+					get_block, gfp);
 		}
 		page_cache_release(page);
 	}
@@ -395,11 +397,12 @@ int mpage_readpage(struct page *page, get_block_t get_block)
 	sector_t last_block_in_bio = 0;
 	struct buffer_head map_bh;
 	unsigned long first_logical_block = 0;
+	gfp_t gfp = GFP_KERNEL & mapping_gfp_mask(page->mapping);
 
 	map_bh.b_state = 0;
 	map_bh.b_size = 0;
 	bio = do_mpage_readpage(bio, page, 1, &last_block_in_bio,
-			&map_bh, &first_logical_block, get_block);
+			&map_bh, &first_logical_block, get_block, gfp);
 	if (bio)
 		mpage_bio_submit(READ, bio);
 	return 0;
diff --git a/fs/ramfs/file-nommu.c b/fs/ramfs/file-nommu.c
index ba1323a94924..a586467f6ff6 100644
--- a/fs/ramfs/file-nommu.c
+++ b/fs/ramfs/file-nommu.c
@@ -70,6 +70,7 @@ int ramfs_nommu_expand_for_mapping(struct inode *inode, size_t newsize)
 	unsigned order;
 	void *data;
 	int ret;
+	gfp_t gfp = mapping_gfp_mask(inode->i_mapping);
 
 	/* make various checks */
 	order = get_order(newsize);
@@ -84,7 +85,7 @@ int ramfs_nommu_expand_for_mapping(struct inode *inode, size_t newsize)
 
 	/* allocate enough contiguous pages to be able to satisfy the
 	 * request */
-	pages = alloc_pages(mapping_gfp_mask(inode->i_mapping), order);
+	pages = alloc_pages(gfp, order);
 	if (!pages)
 		return -ENOMEM;
 
@@ -108,7 +109,7 @@ int ramfs_nommu_expand_for_mapping(struct inode *inode, size_t newsize)
 		struct page *page = pages + loop;
 
 		ret = add_to_page_cache_lru(page, inode->i_mapping, loop,
-					GFP_KERNEL);
+					gfp);
 		if (ret < 0)
 			goto add_error;
 
diff --git a/mm/readahead.c b/mm/readahead.c
index b4937a6bfcd6..1ede8c0d3702 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -90,7 +90,8 @@ int read_cache_pages(struct address_space *mapping, struct list_head *pages,
 		page = list_to_page(pages);
 		list_del(&page->lru);
 		if (add_to_page_cache_lru(page, mapping,
-					page->index, GFP_KERNEL)) {
+					page->index,
+					GFP_KERNEL & mapping_gfp_mask(mapping))) {
 			read_cache_pages_invalidate_page(mapping, page);
 			continue;
 		}
@@ -128,7 +129,8 @@ static int read_pages(struct address_space *mapping, struct file *filp,
 		struct page *page = list_to_page(pages);
 		list_del(&page->lru);
 		if (!add_to_page_cache_lru(page, mapping,
-					page->index, GFP_KERNEL)) {
+					page->index,
+					GFP_KERNEL & mapping_gfp_mask(mapping))) {
 			mapping->a_ops->readpage(filp, page);
 		}
 		page_cache_release(page);
-- 
2.5.1



-- 
Michal Hocko
SUSE Labs


* Re: linux-next: kernel BUG at mm/slub.c:1447!
  2015-10-02  7:25   ` Michal Hocko
@ 2015-10-02  8:53     ` Michal Hocko
  2015-10-02 20:49     ` Andrew Morton
  1 sibling, 0 replies; 12+ messages in thread
From: Michal Hocko @ 2015-10-02  8:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Guenter Roeck, linux-kernel, Christoph Lameter,
	Kirill A. Shutemov, linux-mm, Dave Chinner

On Fri 02-10-15 09:25:22, Michal Hocko wrote:
> On Thu 01-10-15 13:49:04, Andrew Morton wrote:
[...]
> > Now, we could redefine mapping_gfp_mask()'s purpose (or formalize
> > stuff which has been sneaking in anyway).  Treat mapping_gfp_mask() as
> > a constraint mask - instead of it being "use this gfp for this
> > mapping", it becomes "don't use these gfp flags for this mapping".
> > 
> > Hence something like:
> > 
> > gfp_t mapping_gfp_constraint(struct address_space *mapping, gfp_t gfp_in)
> > {
> > 	return mapping_gfp_mask(mapping) & gfp_in;
> > }
> > 
> > So instead of doing this:
> > 
> > @@ -370,12 +371,13 @@ mpage_readpages(struct address_space *ma
> >  		prefetchw(&page->flags);
> >  		list_del(&page->lru);
> >  		if (!add_to_page_cache_lru(page, mapping,
> > -					page->index, GFP_KERNEL)) {
> > +					page->index,
> > +					gfp)) {
> > 
[...]
> I will post another one which
> will add mapping_gfp_constraint on top. It will surely be less error
> prone.

OK, so here we go. There are still a few direct users of
mapping_gfp_mask, but most of them use it unrestricted. The only
exceptions seem to be the loop driver and lustre's lloop, which save and
restrict mapping_gfp and didn't seem worthwhile converting.

This is on top of the current linux-next + the updated fix for the loop
hang.
---
>From d9c7c1c757c633cfebe51da66ee0c65244a40876 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Fri, 2 Oct 2015 09:57:16 +0200
Subject: [PATCH] mm, fs: introduce mapping_gfp_constraint

There are many places which use mapping_gfp_mask to restrict a more
generic gfp mask used for allocations that are not directly related to
the page cache but are performed in the same context.
Let's introduce a helper function which makes the restriction explicit
and easier to track. This patch doesn't introduce any functional
changes.

Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 drivers/gpu/drm/drm_gem.c       | 2 +-
 drivers/gpu/drm/i915/i915_gem.c | 3 +--
 fs/btrfs/compression.c          | 3 +--
 fs/btrfs/ctree.h                | 2 +-
 fs/btrfs/free-space-cache.c     | 4 ++--
 fs/buffer.c                     | 2 +-
 fs/ceph/addr.c                  | 7 ++++---
 fs/cifs/file.c                  | 2 +-
 fs/ext4/inode.c                 | 2 +-
 fs/ext4/readpage.c              | 2 +-
 fs/logfs/segment.c              | 2 +-
 fs/mpage.c                      | 4 ++--
 fs/namei.c                      | 2 +-
 fs/nilfs2/inode.c               | 4 ++--
 fs/ntfs/file.c                  | 2 +-
 fs/splice.c                     | 2 +-
 include/linux/pagemap.h         | 7 +++++++
 mm/filemap.c                    | 4 ++--
 mm/readahead.c                  | 4 ++--
 19 files changed, 33 insertions(+), 27 deletions(-)

diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
index 3c2d4abd71c5..1d47d2e9487c 100644
--- a/drivers/gpu/drm/drm_gem.c
+++ b/drivers/gpu/drm/drm_gem.c
@@ -491,7 +491,7 @@ struct page **drm_gem_get_pages(struct drm_gem_object *obj)
 		 * __GFP_DMA32 to be set in mapping_gfp_mask(inode->i_mapping)
 		 * so shmem can relocate pages during swapin if required.
 		 */
-		BUG_ON((mapping_gfp_mask(mapping) & __GFP_DMA32) &&
+		BUG_ON(mapping_gfp_constraint(mapping, __GFP_DMA32) &&
 				(page_to_pfn(p) >= 0x00100000UL));
 	}
 
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 249c2829bb41..f1d670f09032 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2216,9 +2216,8 @@ i915_gem_object_get_pages_gtt(struct drm_i915_gem_object *obj)
 	 * Fail silently without starting the shrinker
 	 */
 	mapping = file_inode(obj->base.filp)->i_mapping;
-	gfp = mapping_gfp_mask(mapping);
+	gfp = mapping_gfp_constraint(mapping, ~(__GFP_IO | __GFP_RECLAIM));
 	gfp |= __GFP_NORETRY | __GFP_NOWARN;
-	gfp &= ~(__GFP_IO | __GFP_RECLAIM);
 	sg = st->sgl;
 	st->nents = 0;
 	for (i = 0; i < page_count; i++) {
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 57ee8ca29b06..147a9761e47f 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -482,8 +482,7 @@ static noinline int add_ra_bio_pages(struct inode *inode,
 			goto next;
 		}
 
-		page = __page_cache_alloc(mapping_gfp_mask(mapping) &
-								~__GFP_FS);
+		page = __page_cache_alloc(mapping_gfp_constraint(mapping, ~__GFP_FS));
 		if (!page)
 			break;
 
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 938efe33be80..eb90f0f1a124 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3316,7 +3316,7 @@ static inline bool btrfs_mixed_space_info(struct btrfs_space_info *space_info)
 
 static inline gfp_t btrfs_alloc_write_mask(struct address_space *mapping)
 {
-	return mapping_gfp_mask(mapping) & ~__GFP_FS;
+	return mapping_gfp_constraint(mapping, ~__GFP_FS);
 }
 
 /* extent-tree.c */
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index abe3a66bd3ba..ed05da1b977e 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -85,8 +85,8 @@ static struct inode *__lookup_free_space_inode(struct btrfs_root *root,
 	}
 
 	mapping_set_gfp_mask(inode->i_mapping,
-			mapping_gfp_mask(inode->i_mapping) &
-			~(__GFP_FS | __GFP_HIGHMEM));
+			mapping_gfp_constraint(inode->i_mapping,
+			~(__GFP_FS | __GFP_HIGHMEM)));
 
 	return inode;
 }
diff --git a/fs/buffer.c b/fs/buffer.c
index 82283abb2795..51aff0296ce2 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -999,7 +999,7 @@ grow_dev_page(struct block_device *bdev, sector_t block,
 	int ret = 0;		/* Will call free_more_memory() */
 	gfp_t gfp_mask;
 
-	gfp_mask = (mapping_gfp_mask(inode->i_mapping) & ~__GFP_FS) | gfp;
+	gfp_mask = mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS) | gfp;
 
 	/*
 	 * XXX: __getblk_slow() can not really deal with failure and
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 9d23e788d1df..b7d218a168fb 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1283,8 +1283,8 @@ static int ceph_filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 		int ret1;
 		struct address_space *mapping = inode->i_mapping;
 		struct page *page = find_or_create_page(mapping, 0,
-						mapping_gfp_mask(mapping) &
-						~__GFP_FS);
+						mapping_gfp_constraint(mapping,
+						~__GFP_FS));
 		if (!page) {
 			ret = VM_FAULT_OOM;
 			goto out;
@@ -1428,7 +1428,8 @@ void ceph_fill_inline_data(struct inode *inode, struct page *locked_page,
 		if (i_size_read(inode) == 0)
 			return;
 		page = find_or_create_page(mapping, 0,
-					   mapping_gfp_mask(mapping) & ~__GFP_FS);
+					   mapping_gfp_constraint(mapping,
+					   ~__GFP_FS));
 		if (!page)
 			return;
 		if (PageUptodate(page)) {
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 5cf50653796a..2d319e66b8f8 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -3380,7 +3380,7 @@ readpages_get_pages(struct address_space *mapping, struct list_head *page_list,
 	struct page *page, *tpage;
 	unsigned int expected_index;
 	int rc;
-	gfp_t gfp = GFP_KERNEL & mapping_gfp_mask(mapping);
+	gfp_t gfp = mapping_gfp_constraint(mapping, GFP_KERNEL);
 
 	INIT_LIST_HEAD(tmplist);
 
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 612fbcf76b5c..60aaecd5598b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3344,7 +3344,7 @@ static int __ext4_block_zero_page_range(handle_t *handle,
 	int err = 0;
 
 	page = find_or_create_page(mapping, from >> PAGE_CACHE_SHIFT,
-				   mapping_gfp_mask(mapping) & ~__GFP_FS);
+				   mapping_gfp_constraint(mapping, ~__GFP_FS));
 	if (!page)
 		return -ENOMEM;
 
diff --git a/fs/ext4/readpage.c b/fs/ext4/readpage.c
index c31d64d70cc0..1ac204caafe3 100644
--- a/fs/ext4/readpage.c
+++ b/fs/ext4/readpage.c
@@ -167,7 +167,7 @@ int ext4_mpage_readpages(struct address_space *mapping,
 			list_del(&page->lru);
 			if (add_to_page_cache_lru(page, mapping,
 						  page->index,
-						  GFP_KERNEL & mapping_gfp_mask(mapping)))
+						  mapping_gfp_constraint(mapping, GFP_KERNEL)))
 				goto next_page;
 		}
 
diff --git a/fs/logfs/segment.c b/fs/logfs/segment.c
index 7f9b096d8d57..6de0fbfc6c00 100644
--- a/fs/logfs/segment.c
+++ b/fs/logfs/segment.c
@@ -57,7 +57,7 @@ static struct page *get_mapping_page(struct super_block *sb, pgoff_t index,
 	filler_t *filler = super->s_devops->readpage;
 	struct page *page;
 
-	BUG_ON(mapping_gfp_mask(mapping) & __GFP_FS);
+	BUG_ON(mapping_gfp_constraint(mapping, __GFP_FS));
 	if (use_filler)
 		page = read_cache_page(mapping, index, filler, sb);
 	else {
diff --git a/fs/mpage.c b/fs/mpage.c
index 09abba7653aa..1480d3a18037 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -361,7 +361,7 @@ mpage_readpages(struct address_space *mapping, struct list_head *pages,
 	sector_t last_block_in_bio = 0;
 	struct buffer_head map_bh;
 	unsigned long first_logical_block = 0;
-	gfp_t gfp = GFP_KERNEL & mapping_gfp_mask(mapping);
+	gfp_t gfp = mapping_gfp_constraint(mapping, GFP_KERNEL);
 
 	map_bh.b_state = 0;
 	map_bh.b_size = 0;
@@ -397,7 +397,7 @@ int mpage_readpage(struct page *page, get_block_t get_block)
 	sector_t last_block_in_bio = 0;
 	struct buffer_head map_bh;
 	unsigned long first_logical_block = 0;
-	gfp_t gfp = GFP_KERNEL & mapping_gfp_mask(page->mapping);
+	gfp_t gfp = mapping_gfp_constraint(page->mapping, GFP_KERNEL);
 
 	map_bh.b_state = 0;
 	map_bh.b_size = 0;
diff --git a/fs/namei.c b/fs/namei.c
index b090c4f44956..7bbc5e9611bb 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -4677,7 +4677,7 @@ EXPORT_SYMBOL(__page_symlink);
 int page_symlink(struct inode *inode, const char *symname, int len)
 {
 	return __page_symlink(inode, symname, len,
-			!(mapping_gfp_mask(inode->i_mapping) & __GFP_FS));
+			!mapping_gfp_constraint(inode->i_mapping, __GFP_FS));
 }
 EXPORT_SYMBOL(page_symlink);
 
diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c
index 4a73d6dffabf..b2ee592d48d2 100644
--- a/fs/nilfs2/inode.c
+++ b/fs/nilfs2/inode.c
@@ -356,7 +356,7 @@ struct inode *nilfs_new_inode(struct inode *dir, umode_t mode)
 		goto failed;
 
 	mapping_set_gfp_mask(inode->i_mapping,
-			     mapping_gfp_mask(inode->i_mapping) & ~__GFP_FS);
+			     mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS));
 
 	root = NILFS_I(dir)->i_root;
 	ii = NILFS_I(inode);
@@ -522,7 +522,7 @@ static int __nilfs_read_inode(struct super_block *sb,
 	up_read(&NILFS_MDT(nilfs->ns_dat)->mi_sem);
 	nilfs_set_inode_flags(inode);
 	mapping_set_gfp_mask(inode->i_mapping,
-			     mapping_gfp_mask(inode->i_mapping) & ~__GFP_FS);
+			     mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS));
 	return 0;
 
  failed_unmap:
diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c
index 262561fea923..916e3efeaa0e 100644
--- a/fs/ntfs/file.c
+++ b/fs/ntfs/file.c
@@ -526,7 +526,7 @@ static inline int __ntfs_grab_cache_pages(struct address_space *mapping,
 			}
 			err = add_to_page_cache_lru(*cached_page, mapping,
 					index,
-					GFP_KERNEL & mapping_gfp_mask(mapping));
+					mapping_gfp_constraint(mapping, GFP_KERNEL));
 			if (unlikely(err)) {
 				if (err == -EEXIST)
 					continue;
diff --git a/fs/splice.c b/fs/splice.c
index 5fc1e50a7f30..9ce27b1e9690 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -360,7 +360,7 @@ __generic_file_splice_read(struct file *in, loff_t *ppos,
 				break;
 
 			error = add_to_page_cache_lru(page, mapping, index,
-					GFP_KERNEL & mapping_gfp_mask(mapping));
+					mapping_gfp_constraint(mapping, GFP_KERNEL));
 			if (unlikely(error)) {
 				page_cache_release(page);
 				if (error == -EEXIST)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 3e95fb6a77af..df214a4b886d 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -69,6 +69,13 @@ static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
 	return (__force gfp_t)mapping->flags & __GFP_BITS_MASK;
 }
 
+/* Restricts the given gfp_mask to what the mapping allows. */
+static inline gfp_t mapping_gfp_constraint(struct address_space *mapping,
+		gfp_t gfp_mask)
+{
+	return mapping_gfp_mask(mapping) & gfp_mask;
+}
+
 /*
  * This is non-atomic.  Only to be used before the mapping is activated.
  * Probably needs a barrier...
diff --git a/mm/filemap.c b/mm/filemap.c
index 972251764243..f86aba253950 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1725,7 +1725,7 @@ static ssize_t do_generic_file_read(struct file *filp, loff_t *ppos,
 			goto out;
 		}
 		error = add_to_page_cache_lru(page, mapping, index,
-					GFP_KERNEL & mapping_gfp_mask(mapping));
+					mapping_gfp_constraint(mapping, GFP_KERNEL));
 		if (error) {
 			page_cache_release(page);
 			if (error == -EEXIST) {
@@ -1827,7 +1827,7 @@ static int page_cache_read(struct file *file, pgoff_t offset)
 			return -ENOMEM;
 
 		ret = add_to_page_cache_lru(page, mapping, offset,
-				GFP_KERNEL & mapping_gfp_mask(mapping));
+				mapping_gfp_constraint(mapping, GFP_KERNEL));
 		if (ret == 0)
 			ret = mapping->a_ops->readpage(file, page);
 		else if (ret == -EEXIST)
diff --git a/mm/readahead.c b/mm/readahead.c
index 1ede8c0d3702..bd55f8384f5d 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -91,7 +91,7 @@ int read_cache_pages(struct address_space *mapping, struct list_head *pages,
 		list_del(&page->lru);
 		if (add_to_page_cache_lru(page, mapping,
 					page->index,
-					GFP_KERNEL & mapping_gfp_mask(mapping))) {
+					mapping_gfp_constraint(mapping, GFP_KERNEL))) {
 			read_cache_pages_invalidate_page(mapping, page);
 			continue;
 		}
@@ -130,7 +130,7 @@ static int read_pages(struct address_space *mapping, struct file *filp,
 		list_del(&page->lru);
 		if (!add_to_page_cache_lru(page, mapping,
 					page->index,
-					GFP_KERNEL & mapping_gfp_mask(mapping))) {
+					mapping_gfp_constraint(mapping, GFP_KERNEL))) {
 			mapping->a_ops->readpage(filp, page);
 		}
 		page_cache_release(page);
-- 
2.5.1


-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: linux-next: kernel BUG at mm/slub.c:1447!
  2015-10-01 20:49 ` Andrew Morton
  2015-10-02  7:25   ` Michal Hocko
@ 2015-10-02 13:36   ` Guenter Roeck
  2015-10-02 13:42     ` Michal Hocko
  1 sibling, 1 reply; 12+ messages in thread
From: Guenter Roeck @ 2015-10-02 13:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Christoph Lameter, Kirill A. Shutemov, linux-mm,
	Michal Hocko, Dave Chinner

On 10/01/2015 01:49 PM, Andrew Morton wrote:
> On Thu, 1 Oct 2015 09:06:15 -0700 Guenter Roeck <linux@roeck-us.net> wrote:
>
>> Seen with next-20151001, running qemu, simulating Opteron_G1 with a non-SMP configuration.
>> On a re-run, I have seen it with the same image, but this time when simulating IvyBridge,
>> so it is not CPU dependent. I did not previously see the problem.
>>
>> Log is at
>> http://server.roeck-us.net:8010/builders/qemu-x86-next/builds/259/steps/qemubuildcommand/logs/stdio
>>
>> I'll try to bisect. The problem is not seen with every boot, so that may take a while.
>
> Caused by mhocko's "mm, fs: obey gfp_mapping for add_to_page_cache()",
> I expect.
>
I tried to bisect to be sure, but the problem doesn't happen often enough, and I got some
false negatives. I assume bisect is no longer necessary. If I need to try again, please
let me know.

Thanks,
Guenter


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: linux-next: kernel BUG at mm/slub.c:1447!
  2015-10-02 13:36   ` Guenter Roeck
@ 2015-10-02 13:42     ` Michal Hocko
  0 siblings, 0 replies; 12+ messages in thread
From: Michal Hocko @ 2015-10-02 13:42 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: Andrew Morton, linux-kernel, Christoph Lameter,
	Kirill A. Shutemov, linux-mm, Dave Chinner

On Fri 02-10-15 06:36:57, Guenter Roeck wrote:
> On 10/01/2015 01:49 PM, Andrew Morton wrote:
> >On Thu, 1 Oct 2015 09:06:15 -0700 Guenter Roeck <linux@roeck-us.net> wrote:
> >
> >>Seen with next-20151001, running qemu, simulating Opteron_G1 with a non-SMP configuration.
> >>On a re-run, I have seen it with the same image, but this time when simulating IvyBridge,
> >>so it is not CPU dependent. I did not previously see the problem.
> >>
> >>Log is at
> >>http://server.roeck-us.net:8010/builders/qemu-x86-next/builds/259/steps/qemubuildcommand/logs/stdio
> >>
> >>I'll try to bisect. The problem is not seen with every boot, so that may take a while.
> >
> >Caused by mhocko's "mm, fs: obey gfp_mapping for add_to_page_cache()",
> >I expect.
> >
> I tried to bisect to be sure, but the problem doesn't happen often enough, and I got some
> false negatives. I assume bisect is no longer necessary. If I need to try again, please
> let me know.

The updated patch has been posted here:
http://lkml.kernel.org/r/20151002085324.GA2927%40dhcp22.suse.cz

Andrew's analysis seems decent so I am pretty sure it should fix the
issue.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: linux-next: kernel BUG at mm/slub.c:1447!
  2015-10-02  7:25   ` Michal Hocko
  2015-10-02  8:53     ` Michal Hocko
@ 2015-10-02 20:49     ` Andrew Morton
  2015-10-05 13:47       ` Michal Hocko
  1 sibling, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2015-10-02 20:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Guenter Roeck, linux-kernel, Christoph Lameter,
	Kirill A. Shutemov, linux-mm, Dave Chinner

On Fri, 2 Oct 2015 09:25:23 +0200 Michal Hocko <mhocko@kernel.org> wrote:

> 6afdb859b710 ("mm: do not ignore mapping_gfp_mask in page cache allocation
> paths") has caught some users of hardcoded GFP_KERNEL used in the page
> cache allocation paths.  This, however, wasn't complete and there were
> others which went unnoticed.
> 
> Dave Chinner has reported the following deadlock for xfs on loop device:
> : With the recent merge of the loop device changes, I'm now seeing
> : XFS deadlock on my single CPU, 1GB RAM VM running xfs/073.
> :
> : The deadlocked is as follows:
> :
> : kloopd1: loop_queue_read_work
> :       xfs_file_iter_read
> :       lock XFS inode XFS_IOLOCK_SHARED (on image file)
> :       page cache read (GFP_KERNEL)
> :       radix tree alloc
> :       memory reclaim
> :       reclaim XFS inodes
> :       log force to unpin inodes
> :       <wait for log IO completion>
> :
> : xfs-cil/loop1: <does log force IO work>
> :       xlog_cil_push
> :       xlog_write
> :       <loop issuing log writes>
> :               xlog_state_get_iclog_space()
> :               <blocks due to all log buffers under write io>
> :               <waits for IO completion>
> :
> : kloopd1: loop_queue_write_work
> :       xfs_file_write_iter
> :       lock XFS inode XFS_IOLOCK_EXCL (on image file)
> :       <wait for inode to be unlocked>
> :
> : i.e. the kloopd, with its split read and write work queues, has
> : introduced a dependency through memory reclaim. i.e. that writes
> : need to be able to progress for reads to make progress.
> :
> : The problem, fundamentally, is that mpage_readpages() does a
> : GFP_KERNEL allocation, rather than paying attention to the inode's
> : mapping gfp mask, which is set to GFP_NOFS.
> :
> : This didn't use to happen, because the loop device used to issue
> : reads through the splice path and that does:
> :
> :       error = add_to_page_cache_lru(page, mapping, index,
> :                       GFP_KERNEL & mapping_gfp_mask(mapping));
> 
> This was changed by aa4d86163e4 ("block: loop: switch to VFS ITER_BVEC").
> 
> This patch changes mpage_readpage{s} to follow the gfp mask set for the
> mapping.  There are, however, other places which are doing basically the
> same.
> 
> lustre:ll_dir_filler is doing GFP_KERNEL from a function which
> apparently uses GFP_NOFS for other allocations, so let's make this
> consistent.
> 
> cifs:readpages_get_pages is called from cifs_readpages, and
> __cifs_readpages_from_fscache, called from the same path, obeys the
> mapping gfp.
> 
> ramfs_nommu_expand_for_mapping is hardcoding GFP_KERNEL as well, even
> though it uses mapping_gfp_mask for the page allocation.
> 
> ext4_mpage_readpages is called from the page cache allocation path, the
> same as read_pages and read_cache_pages.
> 
> As I've noticed in my previous post, I cannot say I would be happy about
> sprinkling mapping_gfp_mask all over the place, and it sounds like we
> should drop the gfp_mask argument altogether and use it internally in
> __add_to_page_cache_locked.  That would require all the filesystems to
> use the mapping gfp consistently, which I am not sure is the case here.
> From a quick glance it seems that some file systems use it all the time
> while others are selective.

There's a lot of confusion here, so let's try to do some clear
thinking.

mapping_gfp_mask() describes the mask to be used for allocating this
mapping's pagecache pages.  Nothing else.  I introduced it to provide a
clean way to ensure that blockdev inode pagecache is not allocated from
highmem.

This is totally different from "I'm holding a lock so I can't do
__GFP_FS"!  __GFP_HIGHMEM and __GFP_FS are utterly unrelated concepts -
the only commonality is that they happen to be passed to callees in the
same memory word.

At present the VFS provides no way for the filesystem to specify the
allocation mode for pagecache operations.  The VFS assumes
__GFP_FS|__GFP_IO|__GFP_WAIT|etc.  mapping_gfp_mask() may *look* like
it provides a way, but it doesn't.

The way for a caller to communicate "I can't do __GFP_FS from this
particular callsite" is to pass that info in a gfp_t, in a function
call argument.  It is very wrong to put this info into
mapping_gfp_mask(), as XFS has done.  For obvious reasons: different
callsites may want different values.

One can easily envision a filesystem whose read() can allocate
pagecache with GFP_HIGHUSER but the write() needs
GFP_HIGHUSER&~__GFP_FS.  This obviously won't fly if we're (ab)using
mapping_gfp_mask() in this fashion.  Also callsites who *can* use the
stronger __GFP_FS are artificially prevented from doing so.


So.

By far the best way of addressing this bug is to fix XFS so that it can
use __GFP_FS for allocating pagecache.  It's really quite lame that the
filesystem cannot use the strongest memory allocation mode for the
largest volume of allocation.  Obviously this fix is too difficult,
otherwise it would already have been made.


The second best way of solving this bug is to pass a gfp_t into the
relevant callees, in the time-honoured manner.  That would involve
alteration of probably several address_space_operations function
pointers and lots of mind-numbing mechanical edits and bloat.


The third best way is to pass the gfp_t via the task_struct.  See
memalloc_noio_save() and memalloc_noio_restore().  This is a pretty
grubby way of passing function arguments, but I'm OK with it in this
special case, because

  a) Adding a gfp_t arg to 11 billion functions has costs of
     several forms

  b) Adding all these costs just because one filesystem is being
     weird doesn't make sense

  c) The mpage functions already have too many arguments.  Adding
     yet another is getting kinda ridiculous, and will cost more stack
     on some of our deepest paths.


The fourth best way of fixing this is a nasty short-term bodge, such as
the one you just sent ;) But if we're going to do this, it should be
the minimal bodge which fixes this deadlock.  Is it possible to come up
with a one-liner (plus suitable comment) to get us out of this mess?


Longer-term I suggest we look at generalising the memalloc_noio_foo()
stuff so as to permit callers to mask off (ie: zero) __GFP_ flags in
callees.  I have a suspicion we should have done this 15 years ago
(which is about when I started wanting to do it).


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: linux-next: kernel BUG at mm/slub.c:1447!
  2015-10-02 20:49     ` Andrew Morton
@ 2015-10-05 13:47       ` Michal Hocko
  2015-10-05 19:29         ` Andrew Morton
  0 siblings, 1 reply; 12+ messages in thread
From: Michal Hocko @ 2015-10-05 13:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Guenter Roeck, linux-kernel, Christoph Lameter,
	Kirill A. Shutemov, linux-mm, Dave Chinner

On Fri 02-10-15 13:49:53, Andrew Morton wrote:
[...]
> There's a lot of confusion here, so let's try to do some clear
> thinking.
> 
> mapping_gfp_mask() describes the mask to be used for allocating this
> mapping's pagecache pages.  Nothing else.  I introduced it to provide a
> clean way to ensure that blockdev inode pagecache is not allocated from
> highmem.

I am afraid this is not how it ended up being used.

> This is totally different from "I'm holding a lock so I can't do
> __GFP_FS"!  __GFP_HIGHMEM and __GFP_FS are utterly unrelated concepts -
> the only commonality is that they happen to be passed to callees in the
> same memory word.

Which is exactly how those filesystems overcome the lack of a consistent
API to tell the restrictions all the way down to the allocator. AFAIR
this is the case beyond xfs.

> At present the VFS provides no way for the filesystem to specify the
> allocation mode for pagecache operations.  The VFS assumes
> __GFP_FS|__GFP_IO|__GFP_WAIT|etc.  mapping_gfp_mask() may *look* like
> it provides a way, but it doesn't.
> 
> The way for a caller to communicate "I can't do __GFP_FS from this
> particular callsite" is to pass that info in a gfp_t, in a function
> call argument.  It is very wrong to put this info into
> mapping_gfp_mask(), as XFS has done.  For obvious reasons: different
> callsites may want different values.
> 
> One can easily envision a filesystem whose read() can allocate
> pagecache with GFP_HIGHUSER but the write() needs
> GFP_HIGHUSER&~__GFP_FS.  This obviously won't fly if we're (ab)using
> mapping_gfp_mask() in this fashion.  Also callsites who *can* use the
> stronger __GFP_FS are artificially prevented from doing so.

Yes I am not happy about the state we have grown into as well.

> So.
> 
> By far the best way of addressing this bug is to fix XFS so that it can
> use __GFP_FS for allocating pagecache.  It's really quite lame that the
> filesystem cannot use the strongest memory allocation mode for the
> largest volume of allocation.  Obviously this fix is too difficult,
> otherwise it would already have been made.
> 
> 
> The second best way of solving this bug is to pass a gfp_t into the
> relevant callees, in the time-honoured manner.  That would involve
> alteration of probably several address_space_operations function
> pointers and lots of mind-numbing mechanical edits and bloat.
> 
> 
> The third best way is to pass the gfp_t via the task_struct.  See
> memalloc_noio_save() and memalloc_noio_restore().  This is a pretty
> grubby way of passing function arguments, but I'm OK with it in this
> special case, because
> 
>   a) Adding a gfp_t arg to 11 billion functions has costs of
>      several forms
> 
>   b) Adding all these costs just because one filesystem is being
>      weird doesn't make sense
> 
>   c) The mpage functions already have too many arguments.  Adding
>      yet another is getting kinda ridiculous, and will cost more stack
>      on some of our deepest paths.
> 
> 
> The fourth best way of fixing this is a nasty short-term bodge, such a
> the one you just sent ;) But if we're going to do this, it should be
> the minimal bodge which fixes this deadlock.  Is it possible to come up
> with a one-liner (plus suitable comment) to get us out of this mess?

Yes I do agree that the fix I am proposing is short-term but this seems
like the easiest way to go for stable and older kernels that might be
affected. I thought your proposal for mapping_gfp_constraint was exactly
to have all such places annotated for an easier future transition to
something more reasonable.
I can reduce the patch to the fs/mpage.c chunk, which would be a few
lines and fix the issue at the same time, but what about the other
calls, which are probably inconsistent and broken as well?
 
> Longer-term I suggest we look at generalising the memalloc_noio_foo()
> stuff so as to permit callers to mask off (ie: zero) __GFP_ flags in
> callees.  I have a suspicion we should have done this 15 years ago
> (which is about when I started wanting to do it).

I am not sure memalloc_noio_foo is a huge win. It is an easy hack where
the whole allocation transaction is clear, like in the PM code. I am not
sure this is also true for the FS.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: linux-next: kernel BUG at mm/slub.c:1447!
  2015-10-05 13:47       ` Michal Hocko
@ 2015-10-05 19:29         ` Andrew Morton
  2015-10-06  2:12           ` Andrew Morton
  0 siblings, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2015-10-05 19:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Guenter Roeck, linux-kernel, Christoph Lameter,
	Kirill A. Shutemov, linux-mm, Dave Chinner

On Mon, 5 Oct 2015 15:47:13 +0200 Michal Hocko <mhocko@kernel.org> wrote:

> > The fourth best way of fixing this is a nasty short-term bodge, such as
> > the one you just sent ;) But if we're going to do this, it should be
> > the minimal bodge which fixes this deadlock.  Is it possible to come up
> > with a one-liner (plus suitable comment) to get us out of this mess?
> 
> Yes I do agree that the fix I am proposing is short-term but this seems
> like the easiest way to go for stable and older kernels that might be
> affected. I thought your proposal for mapping_gfp_constraint was exactly
> to have all such places annotated for an easier future transition to
> something more reasonable.

hm, OK, let's go that way.  But I expect this mess will continue to
float around for a long time - fixing it nicely will be somewhat
intrusive.

> > Longer-term I suggest we look at generalising the memalloc_noio_foo()
> > stuff so as to permit callers to mask off (ie: zero) __GFP_ flags in
> > callees.  I have a suspicion we should have done this 15 years ago
> > (which is about when I started wanting to do it).
> 
> I am not sure memalloc_noio_foo is a huge win. It is an easy hack where
> the whole allocation transaction is clear - like in the PM code. I am
> not sure this is true also for the FS.

mm..  I think it'll work out OK - a set/restore around particular
callsites.

It might get messy in core MM though.  Do we apply current->mask at the
very low levels of the page allocator?  If so, that might muck up
intermediate callers who are peeking into specific gfp_t flags.

Perhaps it would be better to apply the mask at the highest possible
level: wherever a function which was not passed a gfp_t decides to
create one.  Basically a grep for "GFP_".  But then we need to decide
*which* gfp_t-creators need the treatment.  All of them (yikes) or is
this mechanism only for called-via-address_space_operations code?  That
might work.

Maybe it would be better to add the gfp_t argument to the
address_space_operations.  At a minimum, writepage(), readpage(),
writepages(), readpages().  What a pickle.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: linux-next: kernel BUG at mm/slub.c:1447!
  2015-10-05 19:29         ` Andrew Morton
@ 2015-10-06  2:12           ` Andrew Morton
  2015-10-06  5:12             ` Dave Chinner
  0 siblings, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2015-10-06  2:12 UTC (permalink / raw)
  To: Michal Hocko, Guenter Roeck, linux-kernel, Christoph Lameter,
	Kirill A. Shutemov, linux-mm, Dave Chinner

On Mon, 5 Oct 2015 12:29:36 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:

> Maybe it would be better to add the gfp_t argument to the
> address_space_operations.  At a minimum, writepage(), readpage(),
> writepages(), readpages().  What a pickle.

I'm being dumb.  All we need to do is to add a new

	address_space_operations.readpage_gfp(..., gfp_t gfp)

etc.  That should be trivial.  Each fs type only has 2-4 instances of
address_space_operations so the overhead is miniscule.

As a background janitorial thing we can migrate filesystems over to the
new interface then remove address_space_operations.readpage() etc.  But
this *would* add overhead: more stack and more text for no gain.  So
perhaps we just leave things as they are.

That's so simple that I expect we can short-term fix this bug with that
approach.  umm, something like

--- a/fs/mpage.c~a
+++ a/fs/mpage.c
@@ -139,7 +139,8 @@ map_buffer_to_page(struct page *page, st
 static struct bio *
 do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
 		sector_t *last_block_in_bio, struct buffer_head *map_bh,
-		unsigned long *first_logical_block, get_block_t get_block)
+		unsigned long *first_logical_block, get_block_t get_block,
+		gfp_t gfp)
 {
 	struct inode *inode = page->mapping->host;
 	const unsigned blkbits = inode->i_blkbits;
@@ -278,7 +279,7 @@ alloc_new:
 		}
 		bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
 				min_t(int, nr_pages, BIO_MAX_PAGES),
-				GFP_KERNEL);
+				gfp);
 		if (bio == NULL)
 			goto confused;
 	}
@@ -310,7 +311,7 @@ confused:
 }
 
 /**
- * mpage_readpages - populate an address space with some pages & start reads against them
+ * mpage_readpages_gfp - populate an address space with some pages & start reads against them
  * @mapping: the address_space
  * @pages: The address of a list_head which contains the target pages.  These
  *   pages have their ->index populated and are otherwise uninitialised.
@@ -318,6 +319,7 @@ confused:
  *   issued in @pages->prev to @pages->next order.
  * @nr_pages: The number of pages at *@pages
  * @get_block: The filesystem's block mapper function.
+ * @gfp: memory allocation constraints
  *
  * This function walks the pages and the blocks within each page, building and
  * emitting large BIOs.
@@ -352,9 +354,8 @@ confused:
  *
  * This all causes the disk requests to be issued in the correct order.
  */
-int
-mpage_readpages(struct address_space *mapping, struct list_head *pages,
-				unsigned nr_pages, get_block_t get_block)
+int mpage_readpages_gfp(struct address_space *mapping, struct list_head *pages,
+			unsigned nr_pages, get_block_t get_block, gfp_t gfp)
 {
 	struct bio *bio = NULL;
 	unsigned page_idx;
@@ -370,12 +371,12 @@ mpage_readpages(struct address_space *ma
 		prefetchw(&page->flags);
 		list_del(&page->lru);
 		if (!add_to_page_cache_lru(page, mapping,
-					page->index, GFP_KERNEL)) {
+					page->index, gfp)) {
 			bio = do_mpage_readpage(bio, page,
 					nr_pages - page_idx,
 					&last_block_in_bio, &map_bh,
 					&first_logical_block,
-					get_block);
+					get_block, gfp);
 		}
 		page_cache_release(page);
 	}
@@ -384,6 +385,14 @@ mpage_readpages(struct address_space *ma
 		mpage_bio_submit(READ, bio);
 	return 0;
 }
+EXPORT_SYMBOL(mpage_readpages_gfp);
+
+int mpage_readpages(struct address_space *mapping, struct list_head *pages,
+			unsigned nr_pages, get_block_t get_block)
+{
+	return mpage_readpages_gfp(mapping, pages, nr_pages, get_block,
+				   GFP_KERNEL);
+}
 EXPORT_SYMBOL(mpage_readpages);
 
 /*
@@ -399,7 +408,7 @@ int mpage_readpage(struct page *page, ge
 	map_bh.b_state = 0;
 	map_bh.b_size = 0;
 	bio = do_mpage_readpage(bio, page, 1, &last_block_in_bio,
-			&map_bh, &first_logical_block, get_block);
+			&map_bh, &first_logical_block, get_block, GFP_KERNEL);
 	if (bio)
 		mpage_bio_submit(READ, bio);
 	return 0;
diff -puN include/linux/fs.h~a include/linux/fs.h
diff -puN include/linux/mpage.h~a include/linux/mpage.h
--- a/include/linux/mpage.h~a
+++ a/include/linux/mpage.h
@@ -15,6 +15,8 @@ struct writeback_control;
 
 int mpage_readpages(struct address_space *mapping, struct list_head *pages,
 				unsigned nr_pages, get_block_t get_block);
+int mpage_readpages_gfp(struct address_space *mapping, struct list_head *pages,
+			unsigned nr_pages, get_block_t get_block, gfp_t gfp);
 int mpage_readpage(struct page *page, get_block_t get_block);
 int mpage_writepages(struct address_space *mapping,
 		struct writeback_control *wbc, get_block_t get_block);
diff -puN fs/xfs/xfs_aops.c~a fs/xfs/xfs_aops.c
--- a/fs/xfs/xfs_aops.c~a
+++ a/fs/xfs/xfs_aops.c
@@ -1908,13 +1908,14 @@ xfs_vm_readpage(
 }
 
 STATIC int
-xfs_vm_readpages(
+xfs_vm_readpages_gfp(
 	struct file		*unused,
 	struct address_space	*mapping,
 	struct list_head	*pages,
 	unsigned		nr_pages)
 {
-	return mpage_readpages(mapping, pages, nr_pages, xfs_get_blocks);
+	return mpage_readpages_gfp(mapping, pages, nr_pages, xfs_get_blocks,
+				   GFP_NOFS);
 }
 
 /*
@@ -1987,7 +1988,7 @@ xfs_vm_set_page_dirty(
 
 const struct address_space_operations xfs_address_space_operations = {
 	.readpage		= xfs_vm_readpage,
-	.readpages		= xfs_vm_readpages,
+	.readpages		= xfs_vm_readpages_gfp,
 	.writepage		= xfs_vm_writepage,
 	.writepages		= xfs_vm_writepages,
 	.set_page_dirty		= xfs_vm_set_page_dirty,
_


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: linux-next: kernel BUG at mm/slub.c:1447!
  2015-10-06  2:12           ` Andrew Morton
@ 2015-10-06  5:12             ` Dave Chinner
  0 siblings, 0 replies; 12+ messages in thread
From: Dave Chinner @ 2015-10-06  5:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Guenter Roeck, linux-kernel, Christoph Lameter,
	Kirill A. Shutemov, linux-mm

On Mon, Oct 05, 2015 at 07:12:17PM -0700, Andrew Morton wrote:
> On Mon, 5 Oct 2015 12:29:36 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > Maybe it would be better to add the gfp_t argument to the
> > address_space_operations.  At a minimum, writepage(), readpage(),
> > writepages(), readpages().  What a pickle.
> 
> I'm being dumb.  All we need to do is to add a new
> 
> 	address_space_operations.readpage_gfp(..., gfp_t gfp)
> 
> etc.  That should be trivial.  Each fs type only has 2-4 instances of
> address_space_operations so the overhead is miniscule.
> 
> As a background janitorial thing we can migrate filesystems over to the
> new interface then remove address_space_operations.readpage() etc.  But
> this *would* add overhead: more stack and more text for no gain.  So
> perhaps we just leave things as they are.
> 
> That's so simple that I expect we can short-term fix this bug with that
> approach.  umm, something like
> 
> --- a/fs/mpage.c~a
> +++ a/fs/mpage.c
> @@ -139,7 +139,8 @@ map_buffer_to_page(struct page *page, st
>  static struct bio *
>  do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
>  		sector_t *last_block_in_bio, struct buffer_head *map_bh,
> -		unsigned long *first_logical_block, get_block_t get_block)
> +		unsigned long *first_logical_block, get_block_t get_block,
> +		gfp_t gfp)

Simple enough, but not sufficient to avoid deadlocks because this
doesn't address the common vector for deadlock that was reported.
i.e. deadlocks due to the layering dependency the loop device
creates between two otherwise independent filesystems.

If read IO through the loop device does GFP_KERNEL allocations, then
it is susceptible to deadlock as that can force writeback and
transactions from the filesystem on top of the loop device, which
does more IO to the loop device, which then backs up on the backing
file inode mutex. Forwards progress cannot be made until the
GFP_KERNEL allocation succeeds, but that depends on the loop device
making forwards progress...

The loop device indicates this is a known problem and tries to avoid
these deadlocks by doing this on its backing file:

	mapping_set_gfp_mask(mapping, lo->old_gfp_mask & ~(__GFP_IO|__GFP_FS))

to try to ensure that mapping related allocations do not cause
inappropriate reclaim contexts to be triggered.

This is a problem independent of any specific filesystem - let's not
hack a workaround into a specific filesystem just because the
problem was reported on that filesystem....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2015-10-06  5:12 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-01 16:06 linux-next: kernel BUG at mm/slub.c:1447! Guenter Roeck
2015-10-01 16:58 ` Christoph Lameter
2015-10-01 20:49 ` Andrew Morton
2015-10-02  7:25   ` Michal Hocko
2015-10-02  8:53     ` Michal Hocko
2015-10-02 20:49     ` Andrew Morton
2015-10-05 13:47       ` Michal Hocko
2015-10-05 19:29         ` Andrew Morton
2015-10-06  2:12           ` Andrew Morton
2015-10-06  5:12             ` Dave Chinner
2015-10-02 13:36   ` Guenter Roeck
2015-10-02 13:42     ` Michal Hocko
