linux-fsdevel.vger.kernel.org archive mirror
* [RFC 0/8] Allow GFP_NOFS allocation to fail
@ 2015-08-05  9:51 mhocko
  2015-08-05  9:51 ` [RFC 1/8] mm, oom: Give __GFP_NOFAIL allocations access to memory reserves mhocko
                   ` (10 more replies)
  0 siblings, 11 replies; 32+ messages in thread
From: mhocko @ 2015-08-05  9:51 UTC (permalink / raw)
  To: LKML
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Johannes Weiner,
	Tetsuo Handa, Dave Chinner, Theodore Ts'o, linux-btrfs,
	linux-ext4, Jan Kara

Hi,
Small GFP_NOFS allocations, like GFP_KERNEL ones, have traditionally
not failed even though their reclaim capabilities are restricted
because the VM code cannot recurse into filesystems to clean dirty
pages. At the same time these allocation requests are not allowed to
trigger the OOM killer because that would lead to premature OOM killing
during heavy fs metadata workloads.

This leaves the VM code in an unfortunate situation where GFP_NOFS
requests loop inside the allocator, relying on somebody else to
make progress on their behalf. This is prone to deadlocks when the
request is holding resources which other tasks need in order to make
progress and release memory (e.g. the OOM victim is blocked on a lock
held by the NOFS request). Another drawback is that the caller of
the allocator cannot define any fallback strategy because the request
doesn't fail.

As the VM cannot do much about these requests we should face the reality
and allow those allocations to fail. Johannes has already posted the
patch which does that (http://marc.info/?l=linux-mm&m=142726428514236&w=2)
but the discussion died pretty quickly.

I was playing with this patch and xfs, ext[34] and btrfs for a while to
see what the effect is under heavy memory pressure. As expected this led
to some fallout.
My test consisted of a simple memory hog which allocates a lot of
anonymous memory and writes to a fs, mainly to trigger fs activity on
exit. In parallel there is an fs metadata load (multiple tasks
creating thousands of empty files and directories). Everything runs
in a VM with a small amount of memory to emulate an under-provisioned
system. The metadata load is sufficient to invoke direct reclaim even
without the memory hog. The memory hog forks several tasks sharing the
VM and the OOM killer manages to kill it without locking
up the system (this was based on the test case from Tetsuo Handa -
http://www.spinics.net/lists/linux-fsdevel/msg82958.html - I just didn't
want to kill my machine ;)).
With all the patches applied none of the 4 filesystems gets aborted
transactions and RO remounts (well, xfs didn't need any special
treatment). This is obviously not sufficient to claim that failing
GFP_NOFS is OK now but I think it is a good start for further
discussion. I would be grateful if FS people could have a look at those
patches.  I have simply used __GFP_NOFAIL in the critical paths. This
might not be the best strategy but it sounds like a good first step.

The first patch in the series also allows __GFP_NOFAIL allocations to
access memory reserves when the system is OOM, which should help those
requests to make forward progress - especially in combination with
GFP_NOFS.

The second patch tries to address a potential premature OOM killer
invocation from the page fault path. I have posted it separately but it
didn't get much traction.

The third patch allows GFP_NOFS to fail and I believe it should see much
more testing coverage. It would be really great if it could sit in the
mmotm tree for a few release cycles so that we can catch more fallout.

The rest are the FS specific patches to fortify allocation
requests which are really needed to finish transactions without RO
remounts. There might be more needed but my test case survives with
these in place.
They would obviously need some rewording if they are going to be applied
even without Patch3 and I will do that if the respective maintainers
take them. Ext3 and JBD are going away soon so they might be dropped but
they have been in the tree while I was testing so I've kept them.

Thoughts? Opinions?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* [RFC 1/8] mm, oom: Give __GFP_NOFAIL allocations access to memory reserves
  2015-08-05  9:51 [RFC 0/8] Allow GFP_NOFS allocation to fail mhocko
@ 2015-08-05  9:51 ` mhocko
  2015-08-05  9:51 ` [RFC 2/8] mm: Allow GFP_IOFS for page_cache_read page cache allocation mhocko
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: mhocko @ 2015-08-05  9:51 UTC (permalink / raw)
  To: LKML
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Johannes Weiner,
	Tetsuo Handa, Dave Chinner, Theodore Ts'o, linux-btrfs,
	linux-ext4, Jan Kara, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

__GFP_NOFAIL is a big hammer used to ensure that the allocation
request can never fail. This is a strong requirement and as such
it also deserves a special treatment when the system is OOM. The
primary problem here is that the allocation request might have
come with some locks held and the oom victim might be blocked
on the same locks. This is basically an OOM deadlock situation.

This patch tries to reduce the risk of such deadlocks by giving
__GFP_NOFAIL allocations special treatment and letting them dive into
memory reserves after the oom killer invocation. This should help them
to make progress and release the resources they are holding. The OOM
victim should compensate for the reserves consumption.
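
The idea can be modelled in userspace. What follows is only a toy
sketch under stated assumptions, not kernel code: the pool, its reserve
and the names toy_pool, toy_alloc, toy_alloc_after_oom, TOY_GFP_NOFAIL
and TOY_NO_WATERMARKS are all invented for illustration. Only the
principle is taken from this patch: once the OOM killer has run, a
nofail request gets one more attempt that ignores the watermark.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values; they do not match the kernel's. */
#define TOY_GFP_NOFAIL     0x1u
#define TOY_NO_WATERMARKS  0x1u

struct toy_pool {
	size_t free_pages;  /* pages currently free */
	size_t min_pages;   /* reserve that ordinary requests must leave alone */
};

/* Take one page; honour the reserve unless TOY_NO_WATERMARKS is set. */
static bool toy_alloc(struct toy_pool *pool, unsigned int alloc_flags)
{
	size_t floor;

	assert(pool != NULL);
	floor = (alloc_flags & TOY_NO_WATERMARKS) ? 0 : pool->min_pages;

	if (pool->free_pages <= floor)
		return false;
	pool->free_pages--;
	return true;
}

/*
 * Model of the post-OOM step: an ordinary request simply fails, while a
 * nofail request gets one more attempt that may dip into the reserve.
 */
static bool toy_alloc_after_oom(struct toy_pool *pool, unsigned int gfp)
{
	if (toy_alloc(pool, 0))
		return true;
	if (gfp & TOY_GFP_NOFAIL)
		return toy_alloc(pool, TOY_NO_WATERMARKS);
	return false;
}
```

Note that even the nofail request can still come back empty-handed in
this model once the reserve itself is exhausted, which is the case the
warning added by the patch is meant to flag.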

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/page_alloc.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1f9ffbb087cb..ee69c338ca2a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2732,8 +2732,16 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	}
 	/* Exhausted what can be done so it's blamo time */
 	if (out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false)
-			|| WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
+			|| WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
 		*did_some_progress = 1;
+
+		if (gfp_mask & __GFP_NOFAIL) {
+			page = get_page_from_freelist(gfp_mask, order,
+					ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac);
+			WARN_ONCE(!page, "Unable to fullfil gfp_nofail allocation."
+				    " Consider increasing min_free_kbytes.\n");
+		}
+	}
 out:
 	mutex_unlock(&oom_lock);
 	return page;
-- 
2.5.0


* [RFC 2/8] mm: Allow GFP_IOFS for page_cache_read page cache allocation
  2015-08-05  9:51 [RFC 0/8] Allow GFP_NOFS allocation to fail mhocko
  2015-08-05  9:51 ` [RFC 1/8] mm, oom: Give __GFP_NOFAIL allocations access to memory reserves mhocko
@ 2015-08-05  9:51 ` mhocko
  2015-08-05  9:51 ` [RFC 3/8] mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM mhocko
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: mhocko @ 2015-08-05  9:51 UTC (permalink / raw)
  To: LKML
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Johannes Weiner,
	Tetsuo Handa, Dave Chinner, Theodore Ts'o, linux-btrfs,
	linux-ext4, Jan Kara, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

page_cache_read has historically been using page_cache_alloc_cold to
allocate a new page. This means that mapping_gfp_mask is used as the
base for the gfp_mask. Many filesystems are setting this mask to
GFP_NOFS to prevent fs recursion issues. page_cache_read is,
however, not called from the fs layer directly so it doesn't need this
protection normally.

ceph and ocfs2 which call filemap_fault from their fault handlers
seem to be OK because they are not taking any fs lock before invoking
generic implementation. xfs which takes XFS_MMAPLOCK_SHARED is safe
from the reclaim recursion POV because this lock serializes truncate
and punch hole with the page faults and it doesn't get involved in the
reclaim.

The GFP_NOFS protection might even be harmful. There is a push to fail
GFP_NOFS allocations rather than loop within the allocator indefinitely
with a very limited reclaim ability. Once we start failing those requests
the OOM killer might be triggered prematurely because the page cache
allocation failure is propagated up the page fault path and ends up in
pagefault_out_of_memory.

We cannot play with mapping_gfp_mask directly because that would be racy
wrt. parallel page faults and it might interfere with other users who
really rely on NOFS semantics from the stored gfp_mask. The mask also
belongs to the inode proper so changing it would even be a layering
violation. What we can do instead is to push the gfp_mask into struct
vm_fault and allow the fs layer to overwrite it should the callback need
to be called with a different allocation context.

Initialize the default to (mapping_gfp_mask | GFP_IOFS) because this
should be safe from the page fault path normally. Why do we care
about mapping_gfp_mask at all then? Because this doesn't hold only
reclaim protection flags but it also might contain zone and movability
restrictions (GFP_DMA32, __GFP_MOVABLE and others) so we have to respect
those.
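
To make the mask arithmetic concrete, here is a small userspace sketch.
It rests on assumptions: the numeric flag values below are illustrative
only and should not be relied upon, and fault_gfp_mask is a
hypothetical stand-in for the helper this patch adds, reduced to the
file-backed case.

```c
#include <assert.h>

typedef unsigned int gfp_t;

/* Illustrative bit values only; the real ones live in include/linux/gfp.h. */
#define __GFP_DMA32    0x04u
#define __GFP_MOVABLE  0x08u
#define __GFP_WAIT     0x10u
#define __GFP_IO       0x40u
#define __GFP_FS       0x80u

#define GFP_NOFS    (__GFP_WAIT | __GFP_IO)
#define GFP_IOFS    (__GFP_IO | __GFP_FS)
#define GFP_KERNEL  (__GFP_WAIT | __GFP_IO | __GFP_FS)

/*
 * Hypothetical model of the default chosen for a file-backed fault:
 * keep every bit of the mapping's mask (zone, movability, ...) and
 * additionally allow IO and FS reclaim.
 */
static gfp_t fault_gfp_mask(gfp_t mapping_mask)
{
	gfp_t mask = mapping_mask | GFP_IOFS;

	/* ORing can only add bits, so no bit set by the mapping is lost. */
	assert((mask & mapping_mask) == mapping_mask);
	return mask;
}
```

With these values a GFP_NOFS mapping that also carries __GFP_DMA32
ends up with GFP_KERNEL reclaim abilities while the zone bit stays set.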

Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mm.h |  4 ++++
 mm/filemap.c       |  9 ++++-----
 mm/memory.c        | 17 +++++++++++++++++
 3 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7f471789781a..962e37c7cd6a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -220,10 +220,14 @@ extern pgprot_t protection_map[16];
  * ->fault function. The vma's ->fault is responsible for returning a bitmask
  * of VM_FAULT_xxx flags that give details about how the fault was handled.
  *
+ * MM layer fills up gfp_mask for page allocations but fault handler might
+ * alter it if its implementation requires a different allocation context.
+ *
  * pgoff should be used in favour of virtual_address, if possible.
  */
 struct vm_fault {
 	unsigned int flags;		/* FAULT_FLAG_xxx flags */
+	gfp_t gfp_mask;			/* gfp mask to be used for allocations */
 	pgoff_t pgoff;			/* Logical page offset based on vma */
 	void __user *virtual_address;	/* Faulting virtual address */
 
diff --git a/mm/filemap.c b/mm/filemap.c
index b63fb81df336..8a16a07bbe02 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1774,19 +1774,18 @@ EXPORT_SYMBOL(generic_file_read_iter);
  * This adds the requested page to the page cache if it isn't already there,
  * and schedules an I/O to read in its contents from disk.
  */
-static int page_cache_read(struct file *file, pgoff_t offset)
+static int page_cache_read(struct file *file, pgoff_t offset, gfp_t gfp_mask)
 {
 	struct address_space *mapping = file->f_mapping;
 	struct page *page;
 	int ret;
 
 	do {
-		page = page_cache_alloc_cold(mapping);
+		page = __page_cache_alloc(gfp_mask|__GFP_COLD);
 		if (!page)
 			return -ENOMEM;
 
-		ret = add_to_page_cache_lru(page, mapping, offset,
-				GFP_KERNEL & mapping_gfp_mask(mapping));
+		ret = add_to_page_cache_lru(page, mapping, offset, GFP_KERNEL & gfp_mask);
 		if (ret == 0)
 			ret = mapping->a_ops->readpage(file, page);
 		else if (ret == -EEXIST)
@@ -1969,7 +1968,7 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	 * We're only likely to ever get here if MADV_RANDOM is in
 	 * effect.
 	 */
-	error = page_cache_read(file, offset);
+	error = page_cache_read(file, offset, vmf->gfp_mask);
 
 	/*
 	 * The page we want has now been added to the page cache.
diff --git a/mm/memory.c b/mm/memory.c
index 8a2fc9945b46..25ab29560dca 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1949,6 +1949,20 @@ static inline void cow_user_page(struct page *dst, struct page *src, unsigned lo
 		copy_user_highpage(dst, src, va, vma);
 }
 
+static gfp_t __get_fault_gfp_mask(struct vm_area_struct *vma)
+{
+	struct file *vm_file = vma->vm_file;
+
+	if (vm_file)
+		return mapping_gfp_mask(vm_file->f_mapping) | GFP_IOFS;
+
+	/*
+	 * Special mappings (e.g. VDSO) do not have any file so fake
+	 * a default GFP_KERNEL for them.
+	 */
+	return GFP_KERNEL;
+}
+
 /*
  * Notify the address space that the page is about to become writable so that
  * it can prohibit this or wait for the page to get into an appropriate state.
@@ -1964,6 +1978,7 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page,
 	vmf.virtual_address = (void __user *)(address & PAGE_MASK);
 	vmf.pgoff = page->index;
 	vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
+	vmf.gfp_mask = __get_fault_gfp_mask(vma);
 	vmf.page = page;
 	vmf.cow_page = NULL;
 
@@ -2763,6 +2778,7 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
 	vmf.pgoff = pgoff;
 	vmf.flags = flags;
 	vmf.page = NULL;
+	vmf.gfp_mask = __get_fault_gfp_mask(vma);
 	vmf.cow_page = cow_page;
 
 	ret = vma->vm_ops->fault(vma, &vmf);
@@ -2929,6 +2945,7 @@ static void do_fault_around(struct vm_area_struct *vma, unsigned long address,
 	vmf.pgoff = pgoff;
 	vmf.max_pgoff = max_pgoff;
 	vmf.flags = flags;
+	vmf.gfp_mask = __get_fault_gfp_mask(vma);
 	vma->vm_ops->map_pages(vma, &vmf);
 }
 
-- 
2.5.0


* [RFC 3/8] mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM
  2015-08-05  9:51 [RFC 0/8] Allow GFP_NOFS allocation to fail mhocko
  2015-08-05  9:51 ` [RFC 1/8] mm, oom: Give __GFP_NOFAIL allocations access to memory reserves mhocko
  2015-08-05  9:51 ` [RFC 2/8] mm: Allow GFP_IOFS for page_cache_read page cache allocation mhocko
@ 2015-08-05  9:51 ` mhocko
  2015-08-05  9:51 ` [RFC 4/8] jbd, jbd2: Do not fail journal because of frozen_buffer allocation failure mhocko
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: mhocko @ 2015-08-05  9:51 UTC (permalink / raw)
  To: LKML
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Johannes Weiner,
	Tetsuo Handa, Dave Chinner, Theodore Ts'o, linux-btrfs,
	linux-ext4, Jan Kara, Michal Hocko

From: Johannes Weiner <hannes@cmpxchg.org>

GFP_NOFS allocations are not allowed to invoke the OOM killer since
their reclaim abilities are severely diminished.  However, without the
OOM killer available there is no hope of progress once the reclaimable
pages have been exhausted.

Don't risk hanging these allocations.  Leave it to the allocation site
to implement the fallback policy for failing allocations.
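
The behavioural change can be illustrated with a toy userspace model.
This is a sketch under assumptions, not the allocator: the helpers
nofs_oom_progress and nofs_attempts are hypothetical, memory never
becomes available in the model, and max_tries merely stands in for
"loops forever".

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical reduction of the !__GFP_FS branch: the return value is
 * what the caller would see in *did_some_progress.
 */
static int nofs_oom_progress(bool legacy_behaviour)
{
	/* Old: pretend progress was made so the allocator keeps looping. */
	if (legacy_behaviour)
		return 1;

	/* New: report no progress and let the call site handle the failure. */
	return 0;
}

/* Count allocation attempts made before giving up (capped at max_tries). */
static int nofs_attempts(bool legacy_behaviour, int max_tries)
{
	int tries = 0;

	assert(max_tries > 0);
	while (tries < max_tries) {
		tries++;
		/* Reclaim freed nothing; ask the OOM path whether to retry. */
		if (!nofs_oom_progress(legacy_behaviour))
			break;	/* allocation fails, caller sees NULL */
	}
	return tries;
}
```

In this model the legacy path exhausts the cap, whereas the new path
gives up after a single attempt and hands the failure to the caller.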

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/page_alloc.c | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ee69c338ca2a..024d45d51700 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2715,15 +2715,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		if (ac->high_zoneidx < ZONE_NORMAL)
 			goto out;
 		/* The OOM killer does not compensate for IO-less reclaim */
-		if (!(gfp_mask & __GFP_FS)) {
-			/*
-			 * XXX: Page reclaim didn't yield anything,
-			 * and the OOM killer can't be invoked, but
-			 * keep looping as per tradition.
-			 */
-			*did_some_progress = 1;
+		if (!(gfp_mask & __GFP_FS))
 			goto out;
-		}
 		if (pm_suspended_storage())
 			goto out;
 		/* The OOM killer may not free memory on a specific node */
-- 
2.5.0


* [RFC 4/8] jbd, jbd2: Do not fail journal because of frozen_buffer allocation failure
  2015-08-05  9:51 [RFC 0/8] Allow GFP_NOFS allocation to fail mhocko
                   ` (2 preceding siblings ...)
  2015-08-05  9:51 ` [RFC 3/8] mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM mhocko
@ 2015-08-05  9:51 ` mhocko
  2015-08-05 11:42   ` Jan Kara
                     ` (2 more replies)
  2015-08-05  9:51 ` [RFC 5/8] ext4: Do not fail journal due to block allocator mhocko
                   ` (6 subsequent siblings)
  10 siblings, 3 replies; 32+ messages in thread
From: mhocko @ 2015-08-05  9:51 UTC (permalink / raw)
  To: LKML
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Johannes Weiner,
	Tetsuo Handa, Dave Chinner, Theodore Ts'o, linux-btrfs,
	linux-ext4, Jan Kara, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

A journal transaction might fail prematurely because the frozen_buffer
is allocated by a GFP_NOFS request:
[   72.440013] do_get_write_access: OOM for frozen_buffer
[   72.440014] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
[   72.440015] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4735: Out of memory
(...snipped....)
[   72.495559] do_get_write_access: OOM for frozen_buffer
[   72.495560] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
[   72.496839] do_get_write_access: OOM for frozen_buffer
[   72.496841] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
[   72.505766] Aborting journal on device sda1-8.
[   72.505851] EXT4-fs (sda1): Remounting filesystem read-only

This wasn't a problem until "mm: page_alloc: do not lock up GFP_NOFS
allocations upon OOM" because small GFP_NOFS allocations never failed.
This allocation seems essential for the journal and GFP_NOFS is too
restrictive for the memory allocator so let's use __GFP_NOFAIL here to
emulate the previous behavior.

The jbd code has the very same issue so let's do the same there as well.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/jbd/transaction.c  | 11 +----------
 fs/jbd2/transaction.c | 14 +++-----------
 2 files changed, 4 insertions(+), 21 deletions(-)

diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
index 1695ba8334a2..bf7474deda2f 100644
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -673,16 +673,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
 				jbd_unlock_bh_state(bh);
 				frozen_buffer =
 					jbd_alloc(jh2bh(jh)->b_size,
-							 GFP_NOFS);
-				if (!frozen_buffer) {
-					printk(KERN_ERR
-					       "%s: OOM for frozen_buffer\n",
-					       __func__);
-					JBUFFER_TRACE(jh, "oom!");
-					error = -ENOMEM;
-					jbd_lock_bh_state(bh);
-					goto done;
-				}
+							 GFP_NOFS|__GFP_NOFAIL);
 				goto repeat;
 			}
 			jh->b_frozen_data = frozen_buffer;
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index ff2f2e6ad311..bff071e21553 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -923,16 +923,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
 				jbd_unlock_bh_state(bh);
 				frozen_buffer =
 					jbd2_alloc(jh2bh(jh)->b_size,
-							 GFP_NOFS);
-				if (!frozen_buffer) {
-					printk(KERN_ERR
-					       "%s: OOM for frozen_buffer\n",
-					       __func__);
-					JBUFFER_TRACE(jh, "oom!");
-					error = -ENOMEM;
-					jbd_lock_bh_state(bh);
-					goto done;
-				}
+							 GFP_NOFS|__GFP_NOFAIL);
 				goto repeat;
 			}
 			jh->b_frozen_data = frozen_buffer;
@@ -1157,7 +1148,8 @@ int jbd2_journal_get_undo_access(handle_t *handle, struct buffer_head *bh)
 
 repeat:
 	if (!jh->b_committed_data) {
-		committed_data = jbd2_alloc(jh2bh(jh)->b_size, GFP_NOFS);
+		committed_data = jbd2_alloc(jh2bh(jh)->b_size,
+					    GFP_NOFS|__GFP_NOFAIL);
 		if (!committed_data) {
 			printk(KERN_ERR "%s: No memory for committed data\n",
 				__func__);
-- 
2.5.0


* [RFC 5/8] ext4: Do not fail journal due to block allocator
  2015-08-05  9:51 [RFC 0/8] Allow GFP_NOFS allocation to fail mhocko
                   ` (3 preceding siblings ...)
  2015-08-05  9:51 ` [RFC 4/8] jbd, jbd2: Do not fail journal because of frozen_buffer allocation failure mhocko
@ 2015-08-05  9:51 ` mhocko
  2015-08-05 11:43   ` Jan Kara
  2015-08-18 10:39   ` [RFC -v2 " Michal Hocko
  2015-08-05  9:51 ` [RFC 6/8] ext3: Do not abort journal prematurely mhocko
                   ` (5 subsequent siblings)
  10 siblings, 2 replies; 32+ messages in thread
From: mhocko @ 2015-08-05  9:51 UTC (permalink / raw)
  To: LKML
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Johannes Weiner,
	Tetsuo Handa, Dave Chinner, Theodore Ts'o, linux-btrfs,
	linux-ext4, Jan Kara, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Since "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM"
the memory allocator doesn't endlessly loop to satisfy low-order
allocations and instead fails them to allow callers to handle them
gracefully.

Some of the callers are not yet prepared for this behavior though. The
ext4 block allocator relies solely on GFP_NOFS allocation requests and
allocation failures lead to aborting the journal too easily:

[  345.028333] oom-trash: page allocation failure: order:0, mode:0x50
[  345.028336] CPU: 1 PID: 8334 Comm: oom-trash Tainted: G        W       4.0.0-nofs3-00006-gdfe9931f5f68 #588
[  345.028337] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150428_134905-gandalf 04/01/2014
[  345.028339]  0000000000000000 ffff880005a17708 ffffffff81538a54 ffffffff8107a40f
[  345.028341]  0000000000000050 ffff880005a17798 ffffffff810fe854 0000000180000000
[  345.028342]  0000000000000046 0000000000000000 ffffffff81a52100 0000000000000246
[  345.028343] Call Trace:
[  345.028348]  [<ffffffff81538a54>] dump_stack+0x4f/0x7b
[  345.028370]  [<ffffffff810fe854>] warn_alloc_failed+0x12a/0x13f
[  345.028373]  [<ffffffff81101bd2>] __alloc_pages_nodemask+0x7f3/0x8aa
[  345.028375]  [<ffffffff810f9933>] pagecache_get_page+0x12a/0x1c9
[  345.028390]  [<ffffffffa005bc64>] ext4_mb_load_buddy+0x220/0x367 [ext4]
[  345.028414]  [<ffffffffa006014f>] ext4_free_blocks+0x522/0xa4c [ext4]
[  345.028425]  [<ffffffffa0054e14>] ext4_ext_remove_space+0x833/0xf22 [ext4]
[  345.028434]  [<ffffffffa005677e>] ext4_ext_truncate+0x8c/0xb0 [ext4]
[  345.028441]  [<ffffffffa00342bf>] ext4_truncate+0x20b/0x38d [ext4]
[  345.028462]  [<ffffffffa003573c>] ext4_evict_inode+0x32b/0x4c1 [ext4]
[  345.028464]  [<ffffffff8116d04f>] evict+0xa0/0x148
[  345.028466]  [<ffffffff8116dca8>] iput+0x1a1/0x1f0
[  345.028468]  [<ffffffff811697b4>] __dentry_kill+0x136/0x1a6
[  345.028470]  [<ffffffff81169a3e>] dput+0x21a/0x243
[  345.028472]  [<ffffffff81157cda>] __fput+0x184/0x19b
[  345.028473]  [<ffffffff81157d29>] ____fput+0xe/0x10
[  345.028475]  [<ffffffff8105a05f>] task_work_run+0x8a/0xa1
[  345.028477]  [<ffffffff810452f0>] do_exit+0x3c6/0x8dc
[  345.028482]  [<ffffffff8104588a>] do_group_exit+0x4d/0xb2
[  345.028483]  [<ffffffff8104eeeb>] get_signal+0x5b1/0x5f5
[  345.028488]  [<ffffffff81002202>] do_signal+0x28/0x5d0
[...]
[  345.028624] EXT4-fs error (device hdb1) in ext4_free_blocks:4879: Out of memory
[  345.033097] Aborting journal on device hdb1-8.
[  345.036339] EXT4-fs (hdb1): Remounting filesystem read-only
[  345.036344] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
[  345.036766] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
[  345.038583] EXT4-fs error (device hdb1) in ext4_ext_remove_space:3048: Journal has aborted
[  345.049115] EXT4-fs error (device hdb1) in ext4_ext_truncate:4669: Journal has aborted
[  345.050434] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
[  345.053064] EXT4-fs error (device hdb1) in ext4_truncate:3668: Journal has aborted
[  345.053582] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
[  345.053946] EXT4-fs error (device hdb1) in ext4_orphan_del:2686: Journal has aborted
[  345.055367] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted

The failure is really premature because the GFP_NOFS allocation context
is very restricted - especially under fs metadata heavy loads. Before we
go with a more sophisticated solution, let's simply imitate the previous
behavior of non-failing NOFS allocations and use __GFP_NOFAIL for the
buddy block allocator. I wasn't able to trigger the issue with this
patch anymore.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/ext4/mballoc.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 5b1613a54307..e6361622bfd5 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -992,7 +992,8 @@ static int ext4_mb_get_buddy_page_lock(struct super_block *sb,
 	block = group * 2;
 	pnum = block / blocks_per_page;
 	poff = block % blocks_per_page;
-	page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
+	page = find_or_create_page(inode->i_mapping, pnum,
+				   GFP_NOFS|__GFP_NOFAIL);
 	if (!page)
 		return -ENOMEM;
 	BUG_ON(page->mapping != inode->i_mapping);
@@ -1006,7 +1007,8 @@ static int ext4_mb_get_buddy_page_lock(struct super_block *sb,
 
 	block++;
 	pnum = block / blocks_per_page;
-	page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
+	page = find_or_create_page(inode->i_mapping, pnum,
+				   GFP_NOFS|__GFP_NOFAIL);
 	if (!page)
 		return -ENOMEM;
 	BUG_ON(page->mapping != inode->i_mapping);
@@ -1158,7 +1160,8 @@ ext4_mb_load_buddy(struct super_block *sb, ext4_group_t group,
 			 * wait for it to initialize.
 			 */
 			page_cache_release(page);
-		page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
+		page = find_or_create_page(inode->i_mapping, pnum,
+					   GFP_NOFS|__GFP_NOFAIL);
 		if (page) {
 			BUG_ON(page->mapping != inode->i_mapping);
 			if (!PageUptodate(page)) {
@@ -1194,7 +1197,8 @@ ext4_mb_load_buddy(struct super_block *sb, ext4_group_t group,
 	if (page == NULL || !PageUptodate(page)) {
 		if (page)
 			page_cache_release(page);
-		page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
+		page = find_or_create_page(inode->i_mapping, pnum,
+					   GFP_NOFS|__GFP_NOFAIL);
 		if (page) {
 			BUG_ON(page->mapping != inode->i_mapping);
 			if (!PageUptodate(page)) {
-- 
2.5.0


* [RFC 6/8] ext3: Do not abort journal prematurely
  2015-08-05  9:51 [RFC 0/8] Allow GFP_NOFS allocation to fail mhocko
                   ` (4 preceding siblings ...)
  2015-08-05  9:51 ` [RFC 5/8] ext4: Do not fail journal due to block allocator mhocko
@ 2015-08-05  9:51 ` mhocko
  2015-08-18 10:39   ` [RFC -v2 " Michal Hocko
  2015-08-05  9:51 ` [RFC 7/8] btrfs: Prevent from early transaction abort mhocko
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 32+ messages in thread
From: mhocko @ 2015-08-05  9:51 UTC (permalink / raw)
  To: LKML
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Johannes Weiner,
	Tetsuo Handa, Dave Chinner, Theodore Ts'o, linux-btrfs,
	linux-ext4, Jan Kara, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

journal_get_undo_access is relying on GFP_NOFS allocation yet it is
essential for the journal transaction:

[   83.256914] journal_get_undo_access: No memory for committed data
[   83.258022] EXT3-fs: ext3_free_blocks_sb: aborting transaction: Out of memory in __ext3_journal_get_undo_access
[   83.259785] EXT3-fs (hdb1): error in ext3_free_blocks_sb: Out of memory
[   83.267130] Aborting journal on device hdb1.
[   83.292308] EXT3-fs (hdb1): error: ext3_journal_start_sb: Detected aborted journal
[   83.293630] EXT3-fs (hdb1): error: remounting filesystem read-only

Since "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM"
these allocation requests are allowed to fail so we need to use
__GFP_NOFAIL to imitate the previous behavior.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/jbd/transaction.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
index bf7474deda2f..6c60376a29bc 100644
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -887,7 +887,7 @@ int journal_get_undo_access(handle_t *handle, struct buffer_head *bh)
 
 repeat:
 	if (!jh->b_committed_data) {
-		committed_data = jbd_alloc(jh2bh(jh)->b_size, GFP_NOFS);
+		committed_data = jbd_alloc(jh2bh(jh)->b_size, GFP_NOFS | __GFP_NOFAIL);
 		if (!committed_data) {
 			printk(KERN_ERR "%s: No memory for committed data\n",
 				__func__);
-- 
2.5.0


* [RFC 7/8] btrfs: Prevent from early transaction abort
  2015-08-05  9:51 [RFC 0/8] Allow GFP_NOFS allocation to fail mhocko
                   ` (5 preceding siblings ...)
  2015-08-05  9:51 ` [RFC 6/8] ext3: Do not abort journal prematurely mhocko
@ 2015-08-05  9:51 ` mhocko
  2015-08-05 16:31   ` David Sterba
  2015-08-18 10:40   ` [RFC -v2 " Michal Hocko
  2015-08-05  9:51 ` [RFC 8/8] btrfs: use __GFP_NOFAIL in alloc_btrfs_bio mhocko
                   ` (3 subsequent siblings)
  10 siblings, 2 replies; 32+ messages in thread
From: mhocko @ 2015-08-05  9:51 UTC (permalink / raw)
  To: LKML
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Johannes Weiner,
	Tetsuo Handa, Dave Chinner, Theodore Ts'o, linux-btrfs,
	linux-ext4, Jan Kara, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Btrfs relies on GFP_NOFS allocation when committing the transaction but
since "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM"
those allocations are allowed to fail, which can lead to a premature
transaction abort:

[   55.328093] Call Trace:
[   55.328890]  [<ffffffff8154e6f0>] dump_stack+0x4f/0x7b
[   55.330518]  [<ffffffff8108fa28>] ? console_unlock+0x334/0x363
[   55.332738]  [<ffffffff8110873e>] __alloc_pages_nodemask+0x81d/0x8d4
[   55.334910]  [<ffffffff81100752>] pagecache_get_page+0x10e/0x20c
[   55.336844]  [<ffffffffa007d916>] alloc_extent_buffer+0xd0/0x350 [btrfs]
[   55.338973]  [<ffffffffa0059d8c>] btrfs_find_create_tree_block+0x15/0x17 [btrfs]
[   55.341329]  [<ffffffffa004f728>] btrfs_alloc_tree_block+0x18c/0x405 [btrfs]
[   55.343566]  [<ffffffffa003fa34>] split_leaf+0x1e4/0x6a6 [btrfs]
[   55.345577]  [<ffffffffa0040567>] btrfs_search_slot+0x671/0x831 [btrfs]
[   55.347679]  [<ffffffff810682d7>] ? get_parent_ip+0xe/0x3e
[   55.349434]  [<ffffffffa0041cb2>] btrfs_insert_empty_items+0x5d/0xa8 [btrfs]
[   55.351681]  [<ffffffffa004ecfb>] __btrfs_run_delayed_refs+0x7a6/0xf35 [btrfs]
[   55.353979]  [<ffffffffa00512ea>] btrfs_run_delayed_refs+0x6e/0x226 [btrfs]
[   55.356212]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[   55.358378]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[   55.360626]  [<ffffffffa0060221>] btrfs_commit_transaction+0x4c/0xaba [btrfs]
[   55.362894]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[   55.365221]  [<ffffffffa0073428>] btrfs_sync_file+0x29c/0x310 [btrfs]
[   55.367273]  [<ffffffff81186808>] vfs_fsync_range+0x8f/0x9e
[   55.369047]  [<ffffffff81186833>] vfs_fsync+0x1c/0x1e
[   55.370654]  [<ffffffff81186869>] do_fsync+0x34/0x4e
[   55.372246]  [<ffffffff81186ab3>] SyS_fsync+0x10/0x14
[   55.373851]  [<ffffffff81554f97>] system_call_fastpath+0x12/0x6f
[   55.381070] BTRFS: error (device hdb1) in btrfs_run_delayed_refs:2821: errno=-12 Out of memory
[   55.382431] BTRFS warning (device hdb1): Skipping commit of aborted transaction.
[   55.382433] BTRFS warning (device hdb1): cleanup_transaction:1692: Aborting unused transaction(IO failure).
[   55.384280] ------------[ cut here ]------------
[   55.384312] WARNING: CPU: 0 PID: 3010 at fs/btrfs/delayed-ref.c:438 btrfs_select_ref_head+0xd9/0xfe [btrfs]()
[...]
[   55.384337] Call Trace:
[   55.384353]  [<ffffffff8154e6f0>] dump_stack+0x4f/0x7b
[   55.384357]  [<ffffffff8107f717>] ? down_trylock+0x2d/0x37
[   55.384359]  [<ffffffff81046977>] warn_slowpath_common+0xa1/0xbb
[   55.384398]  [<ffffffffa00a1d6b>] ? btrfs_select_ref_head+0xd9/0xfe [btrfs]
[   55.384400]  [<ffffffff81046a34>] warn_slowpath_null+0x1a/0x1c
[   55.384423]  [<ffffffffa00a1d6b>] btrfs_select_ref_head+0xd9/0xfe [btrfs]
[   55.384446]  [<ffffffffa004e5f7>] ? __btrfs_run_delayed_refs+0xa2/0xf35 [btrfs]
[   55.384455]  [<ffffffffa004e600>] __btrfs_run_delayed_refs+0xab/0xf35 [btrfs]
[   55.384476]  [<ffffffffa00512ea>] btrfs_run_delayed_refs+0x6e/0x226 [btrfs]
[   55.384499]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[   55.384521]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[   55.384543]  [<ffffffffa0060221>] btrfs_commit_transaction+0x4c/0xaba [btrfs]
[   55.384565]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[   55.384588]  [<ffffffffa0073428>] btrfs_sync_file+0x29c/0x310 [btrfs]
[   55.384591]  [<ffffffff81186808>] vfs_fsync_range+0x8f/0x9e
[   55.384592]  [<ffffffff81186833>] vfs_fsync+0x1c/0x1e
[   55.384593]  [<ffffffff81186869>] do_fsync+0x34/0x4e
[   55.384594]  [<ffffffff81186ab3>] SyS_fsync+0x10/0x14
[   55.384595]  [<ffffffff81554f97>] system_call_fastpath+0x12/0x6f
[...]
[   55.384608] ---[ end trace c29799da1d4dd621 ]---
[   55.437323] BTRFS info (device hdb1): forced readonly
[   55.438815] BTRFS info (device hdb1): delayed_refs has NO entry

Fix this by reintroducing the no-fail behavior of this allocation path
with the explicit __GFP_NOFAIL.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/btrfs/extent_io.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c374e1e71e5f..88fad7051e38 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4607,7 +4607,7 @@ __alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start,
 {
 	struct extent_buffer *eb = NULL;
 
-	eb = kmem_cache_zalloc(extent_buffer_cache, GFP_NOFS);
+	eb = kmem_cache_zalloc(extent_buffer_cache, GFP_NOFS|__GFP_NOFAIL);
 	if (eb == NULL)
 		return NULL;
 	eb->start = start;
@@ -4867,7 +4867,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 		return NULL;
 
 	for (i = 0; i < num_pages; i++, index++) {
-		p = find_or_create_page(mapping, index, GFP_NOFS);
+		p = find_or_create_page(mapping, index, GFP_NOFS|__GFP_NOFAIL);
 		if (!p)
 			goto free_eb;
 
-- 
2.5.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC 8/8] btrfs: use __GFP_NOFAIL in alloc_btrfs_bio
  2015-08-05  9:51 [RFC 0/8] Allow GFP_NOFS allocation to fail mhocko
                   ` (6 preceding siblings ...)
  2015-08-05  9:51 ` [RFC 7/8] btrfs: Prevent from early transaction abort mhocko
@ 2015-08-05  9:51 ` mhocko
  2015-08-05 16:32   ` David Sterba
  2015-08-18 10:41   ` [RFC -v2 " Michal Hocko
  2015-08-05 19:58 ` [RFC 0/8] Allow GFP_NOFS allocation to fail Andreas Dilger
                   ` (2 subsequent siblings)
  10 siblings, 2 replies; 32+ messages in thread
From: mhocko @ 2015-08-05  9:51 UTC (permalink / raw)
  To: LKML
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Johannes Weiner,
	Tetsuo Handa, Dave Chinner, Theodore Ts'o, linux-btrfs,
	linux-ext4, Jan Kara, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

alloc_btrfs_bio is relying on GFP_NOFS to allocate a bio but since "mm:
page_alloc: do not lock up GFP_NOFS allocations upon OOM" this is
allowed to fail which can lead to
[   37.928625] kernel BUG at fs/btrfs/extent_io.c:4045

This is clearly undesirable and the nofail behavior should be explicit
if the allocation failure cannot be tolerated.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/btrfs/volumes.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 53af23f2c087..57a99d19533d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4914,7 +4914,7 @@ static struct btrfs_bio *alloc_btrfs_bio(int total_stripes, int real_stripes)
 		 * and the stripes
 		 */
 		sizeof(u64) * (total_stripes),
-		GFP_NOFS);
+		GFP_NOFS|__GFP_NOFAIL);
 	if (!bbio)
 		return NULL;
 
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [RFC 4/8] jbd, jbd2: Do not fail journal because of frozen_buffer allocation failure
  2015-08-05  9:51 ` [RFC 4/8] jbd, jbd2: Do not fail journal because of frozen_buffer allocation failure mhocko
@ 2015-08-05 11:42   ` Jan Kara
  2015-08-05 16:49   ` Greg Thelen
  2015-08-18 10:38   ` [RFC -v2 " Michal Hocko
  2 siblings, 0 replies; 32+ messages in thread
From: Jan Kara @ 2015-08-05 11:42 UTC (permalink / raw)
  To: mhocko
  Cc: LKML, linux-mm, linux-fsdevel, Andrew Morton, Johannes Weiner,
	Tetsuo Handa, Dave Chinner, Theodore Ts'o, linux-btrfs,
	linux-ext4, Jan Kara, Michal Hocko

On Wed 05-08-15 11:51:20, mhocko@kernel.org wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Journal transaction might fail prematurely because the frozen_buffer
> is allocated by GFP_NOFS request:
> [   72.440013] do_get_write_access: OOM for frozen_buffer
> [   72.440014] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
> [   72.440015] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4735: Out of memory
> (...snipped....)
> [   72.495559] do_get_write_access: OOM for frozen_buffer
> [   72.495560] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
> [   72.496839] do_get_write_access: OOM for frozen_buffer
> [   72.496841] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
> [   72.505766] Aborting journal on device sda1-8.
> [   72.505851] EXT4-fs (sda1): Remounting filesystem read-only
> 
> This wasn't a problem until "mm: page_alloc: do not lock up GFP_NOFS
> allocations upon OOM" because small GFP_NOFS allocations never failed.
> This allocation seems essential for the journal and GFP_NOFS is too
> restrictive to the memory allocator so let's use __GFP_NOFAIL here to
> emulate the previous behavior.
> 
> jbd code has the very same issue so let's do the same there as well.

The patch looks good. Btw, patch 6 can be folded into this patch since
it fixes the same issue you fix for jbd2 here... But the jbd parts will be
dropped in the next merge window anyway so it doesn't really matter.

You can add:

Reviewed-by: Jan Kara <jack@suse.com>

								Honza
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  fs/jbd/transaction.c  | 11 +----------
>  fs/jbd2/transaction.c | 14 +++-----------
>  2 files changed, 4 insertions(+), 21 deletions(-)
> 
> diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
> index 1695ba8334a2..bf7474deda2f 100644
> --- a/fs/jbd/transaction.c
> +++ b/fs/jbd/transaction.c
> @@ -673,16 +673,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
>  				jbd_unlock_bh_state(bh);
>  				frozen_buffer =
>  					jbd_alloc(jh2bh(jh)->b_size,
> -							 GFP_NOFS);
> -				if (!frozen_buffer) {
> -					printk(KERN_ERR
> -					       "%s: OOM for frozen_buffer\n",
> -					       __func__);
> -					JBUFFER_TRACE(jh, "oom!");
> -					error = -ENOMEM;
> -					jbd_lock_bh_state(bh);
> -					goto done;
> -				}
> +							 GFP_NOFS|__GFP_NOFAIL);
>  				goto repeat;
>  			}
>  			jh->b_frozen_data = frozen_buffer;
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index ff2f2e6ad311..bff071e21553 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -923,16 +923,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
>  				jbd_unlock_bh_state(bh);
>  				frozen_buffer =
>  					jbd2_alloc(jh2bh(jh)->b_size,
> -							 GFP_NOFS);
> -				if (!frozen_buffer) {
> -					printk(KERN_ERR
> -					       "%s: OOM for frozen_buffer\n",
> -					       __func__);
> -					JBUFFER_TRACE(jh, "oom!");
> -					error = -ENOMEM;
> -					jbd_lock_bh_state(bh);
> -					goto done;
> -				}
> +							 GFP_NOFS|__GFP_NOFAIL);
>  				goto repeat;
>  			}
>  			jh->b_frozen_data = frozen_buffer;
> @@ -1157,7 +1148,8 @@ int jbd2_journal_get_undo_access(handle_t *handle, struct buffer_head *bh)
>  
>  repeat:
>  	if (!jh->b_committed_data) {
> -		committed_data = jbd2_alloc(jh2bh(jh)->b_size, GFP_NOFS);
> +		committed_data = jbd2_alloc(jh2bh(jh)->b_size,
> +					    GFP_NOFS|__GFP_NOFAIL);
>  		if (!committed_data) {
>  			printk(KERN_ERR "%s: No memory for committed data\n",
>  				__func__);
> -- 
> 2.5.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 5/8] ext4: Do not fail journal due to block allocator
  2015-08-05  9:51 ` [RFC 5/8] ext4: Do not fail journal due to block allocator mhocko
@ 2015-08-05 11:43   ` Jan Kara
  2015-08-18 10:39   ` [RFC -v2 " Michal Hocko
  1 sibling, 0 replies; 32+ messages in thread
From: Jan Kara @ 2015-08-05 11:43 UTC (permalink / raw)
  To: mhocko
  Cc: LKML, linux-mm, linux-fsdevel, Andrew Morton, Johannes Weiner,
	Tetsuo Handa, Dave Chinner, Theodore Ts'o, linux-btrfs,
	linux-ext4, Jan Kara, Michal Hocko

On Wed 05-08-15 11:51:21, mhocko@kernel.org wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Since "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM"
> memory allocator doesn't endlessly loop to satisfy low-order allocations
> and instead fails them to allow callers to handle them gracefully.
> 
> Some of the callers are not yet prepared for this behavior though. ext4
> block allocator relies solely on GFP_NOFS allocation requests and
> allocation failures lead to aborting the journal too easily:
> 
> [  345.028333] oom-trash: page allocation failure: order:0, mode:0x50
> [  345.028336] CPU: 1 PID: 8334 Comm: oom-trash Tainted: G        W       4.0.0-nofs3-00006-gdfe9931f5f68 #588
> [  345.028337] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150428_134905-gandalf 04/01/2014
> [  345.028339]  0000000000000000 ffff880005a17708 ffffffff81538a54 ffffffff8107a40f
> [  345.028341]  0000000000000050 ffff880005a17798 ffffffff810fe854 0000000180000000
> [  345.028342]  0000000000000046 0000000000000000 ffffffff81a52100 0000000000000246
> [  345.028343] Call Trace:
> [  345.028348]  [<ffffffff81538a54>] dump_stack+0x4f/0x7b
> [  345.028370]  [<ffffffff810fe854>] warn_alloc_failed+0x12a/0x13f
> [  345.028373]  [<ffffffff81101bd2>] __alloc_pages_nodemask+0x7f3/0x8aa
> [  345.028375]  [<ffffffff810f9933>] pagecache_get_page+0x12a/0x1c9
> [  345.028390]  [<ffffffffa005bc64>] ext4_mb_load_buddy+0x220/0x367 [ext4]
> [  345.028414]  [<ffffffffa006014f>] ext4_free_blocks+0x522/0xa4c [ext4]
> [  345.028425]  [<ffffffffa0054e14>] ext4_ext_remove_space+0x833/0xf22 [ext4]
> [  345.028434]  [<ffffffffa005677e>] ext4_ext_truncate+0x8c/0xb0 [ext4]
> [  345.028441]  [<ffffffffa00342bf>] ext4_truncate+0x20b/0x38d [ext4]
> [  345.028462]  [<ffffffffa003573c>] ext4_evict_inode+0x32b/0x4c1 [ext4]
> [  345.028464]  [<ffffffff8116d04f>] evict+0xa0/0x148
> [  345.028466]  [<ffffffff8116dca8>] iput+0x1a1/0x1f0
> [  345.028468]  [<ffffffff811697b4>] __dentry_kill+0x136/0x1a6
> [  345.028470]  [<ffffffff81169a3e>] dput+0x21a/0x243
> [  345.028472]  [<ffffffff81157cda>] __fput+0x184/0x19b
> [  345.028473]  [<ffffffff81157d29>] ____fput+0xe/0x10
> [  345.028475]  [<ffffffff8105a05f>] task_work_run+0x8a/0xa1
> [  345.028477]  [<ffffffff810452f0>] do_exit+0x3c6/0x8dc
> [  345.028482]  [<ffffffff8104588a>] do_group_exit+0x4d/0xb2
> [  345.028483]  [<ffffffff8104eeeb>] get_signal+0x5b1/0x5f5
> [  345.028488]  [<ffffffff81002202>] do_signal+0x28/0x5d0
> [...]
> [  345.028624] EXT4-fs error (device hdb1) in ext4_free_blocks:4879: Out of memory
> [  345.033097] Aborting journal on device hdb1-8.
> [  345.036339] EXT4-fs (hdb1): Remounting filesystem read-only
> [  345.036344] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
> [  345.036766] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
> [  345.038583] EXT4-fs error (device hdb1) in ext4_ext_remove_space:3048: Journal has aborted
> [  345.049115] EXT4-fs error (device hdb1) in ext4_ext_truncate:4669: Journal has aborted
> [  345.050434] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
> [  345.053064] EXT4-fs error (device hdb1) in ext4_truncate:3668: Journal has aborted
> [  345.053582] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
> [  345.053946] EXT4-fs error (device hdb1) in ext4_orphan_del:2686: Journal has aborted
> [  345.055367] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
> 
> The failure is really premature because GFP_NOFS allocation context is
> very restricted - especially in the fs metadata heavy loads. Before we
> go with a more sophisticated solution, let's simply imitate the previous
> behavior of non-failing NOFS allocation and use __GFP_NOFAIL for the
> buddy block allocator. I wasn't able to trigger the issue with this
> patch anymore.
 
The patch looks good. You can add:

Reviewed-by: Jan Kara <jack@suse.com>

								Honza

> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  fs/ext4/mballoc.c | 12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 5b1613a54307..e6361622bfd5 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -992,7 +992,8 @@ static int ext4_mb_get_buddy_page_lock(struct super_block *sb,
>  	block = group * 2;
>  	pnum = block / blocks_per_page;
>  	poff = block % blocks_per_page;
> -	page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
> +	page = find_or_create_page(inode->i_mapping, pnum,
> +				   GFP_NOFS|__GFP_NOFAIL);
>  	if (!page)
>  		return -ENOMEM;
>  	BUG_ON(page->mapping != inode->i_mapping);
> @@ -1006,7 +1007,8 @@ static int ext4_mb_get_buddy_page_lock(struct super_block *sb,
>  
>  	block++;
>  	pnum = block / blocks_per_page;
> -	page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
> +	page = find_or_create_page(inode->i_mapping, pnum,
> +				   GFP_NOFS|__GFP_NOFAIL);
>  	if (!page)
>  		return -ENOMEM;
>  	BUG_ON(page->mapping != inode->i_mapping);
> @@ -1158,7 +1160,8 @@ ext4_mb_load_buddy(struct super_block *sb, ext4_group_t group,
>  			 * wait for it to initialize.
>  			 */
>  			page_cache_release(page);
> -		page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
> +		page = find_or_create_page(inode->i_mapping, pnum,
> +					   GFP_NOFS|__GFP_NOFAIL);
>  		if (page) {
>  			BUG_ON(page->mapping != inode->i_mapping);
>  			if (!PageUptodate(page)) {
> @@ -1194,7 +1197,8 @@ ext4_mb_load_buddy(struct super_block *sb, ext4_group_t group,
>  	if (page == NULL || !PageUptodate(page)) {
>  		if (page)
>  			page_cache_release(page);
> -		page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
> +		page = find_or_create_page(inode->i_mapping, pnum,
> +					   GFP_NOFS|__GFP_NOFAIL);
>  		if (page) {
>  			BUG_ON(page->mapping != inode->i_mapping);
>  			if (!PageUptodate(page)) {
> -- 
> 2.5.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 7/8] btrfs: Prevent from early transaction abort
  2015-08-05  9:51 ` [RFC 7/8] btrfs: Prevent from early transaction abort mhocko
@ 2015-08-05 16:31   ` David Sterba
  2015-08-18 10:40   ` [RFC -v2 " Michal Hocko
  1 sibling, 0 replies; 32+ messages in thread
From: David Sterba @ 2015-08-05 16:31 UTC (permalink / raw)
  To: mhocko
  Cc: LKML, linux-mm, linux-fsdevel, Andrew Morton, Johannes Weiner,
	Tetsuo Handa, Dave Chinner, Theodore Ts'o, linux-btrfs,
	linux-ext4, Jan Kara, Michal Hocko

On Wed, Aug 05, 2015 at 11:51:23AM +0200, mhocko@kernel.org wrote:
> From: Michal Hocko <mhocko@suse.com>
... 
> Fix this by reintroducing the no-fail behavior of this allocation path
> with the explicit __GFP_NOFAIL.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Reviewed-by: David Sterba <dsterba@suse.com>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 8/8] btrfs: use __GFP_NOFAIL in alloc_btrfs_bio
  2015-08-05  9:51 ` [RFC 8/8] btrfs: use __GFP_NOFAIL in alloc_btrfs_bio mhocko
@ 2015-08-05 16:32   ` David Sterba
  2015-08-18 10:41   ` [RFC -v2 " Michal Hocko
  1 sibling, 0 replies; 32+ messages in thread
From: David Sterba @ 2015-08-05 16:32 UTC (permalink / raw)
  To: mhocko
  Cc: LKML, linux-mm, linux-fsdevel, Andrew Morton, Johannes Weiner,
	Tetsuo Handa, Dave Chinner, Theodore Ts'o, linux-btrfs,
	linux-ext4, Jan Kara, Michal Hocko

On Wed, Aug 05, 2015 at 11:51:24AM +0200, mhocko@kernel.org wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> alloc_btrfs_bio is relying on GFP_NOFS to allocate a bio but since "mm:
> page_alloc: do not lock up GFP_NOFS allocations upon OOM" this is
> allowed to fail which can lead to
> [   37.928625] kernel BUG at fs/btrfs/extent_io.c:4045
> 
> This is clearly undesirable and the nofail behavior should be explicit
> if the allocation failure cannot be tolerated.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Reviewed-by: David Sterba <dsterba@suse.com>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 4/8] jbd, jbd2: Do not fail journal because of frozen_buffer allocation failure
  2015-08-05  9:51 ` [RFC 4/8] jbd, jbd2: Do not fail journal because of frozen_buffer allocation failure mhocko
  2015-08-05 11:42   ` Jan Kara
@ 2015-08-05 16:49   ` Greg Thelen
  2015-08-12  9:14     ` Michal Hocko
  2015-08-18 10:38   ` [RFC -v2 " Michal Hocko
  2 siblings, 1 reply; 32+ messages in thread
From: Greg Thelen @ 2015-08-05 16:49 UTC (permalink / raw)
  To: mhocko
  Cc: LKML, linux-mm, linux-fsdevel, Andrew Morton, Johannes Weiner,
	Tetsuo Handa, Dave Chinner, Theodore Ts'o, linux-btrfs,
	linux-ext4, Jan Kara, Michal Hocko


mhocko@kernel.org wrote:

> From: Michal Hocko <mhocko@suse.com>
>
> Journal transaction might fail prematurely because the frozen_buffer
> is allocated by GFP_NOFS request:
> [   72.440013] do_get_write_access: OOM for frozen_buffer
> [   72.440014] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
> [   72.440015] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4735: Out of memory
> (...snipped....)
> [   72.495559] do_get_write_access: OOM for frozen_buffer
> [   72.495560] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
> [   72.496839] do_get_write_access: OOM for frozen_buffer
> [   72.496841] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
> [   72.505766] Aborting journal on device sda1-8.
> [   72.505851] EXT4-fs (sda1): Remounting filesystem read-only
>
> This wasn't a problem until "mm: page_alloc: do not lock up GFP_NOFS
> allocations upon OOM" because small GFP_NOFS allocations never failed.
> This allocation seems essential for the journal and GFP_NOFS is too
> restrictive to the memory allocator so let's use __GFP_NOFAIL here to
> emulate the previous behavior.
>
> jbd code has the very same issue so let's do the same there as well.
>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  fs/jbd/transaction.c  | 11 +----------
>  fs/jbd2/transaction.c | 14 +++-----------
>  2 files changed, 4 insertions(+), 21 deletions(-)
>
> diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
> index 1695ba8334a2..bf7474deda2f 100644
> --- a/fs/jbd/transaction.c
> +++ b/fs/jbd/transaction.c
> @@ -673,16 +673,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
>  				jbd_unlock_bh_state(bh);
>  				frozen_buffer =
>  					jbd_alloc(jh2bh(jh)->b_size,
> -							 GFP_NOFS);
> -				if (!frozen_buffer) {
> -					printk(KERN_ERR
> -					       "%s: OOM for frozen_buffer\n",
> -					       __func__);
> -					JBUFFER_TRACE(jh, "oom!");
> -					error = -ENOMEM;
> -					jbd_lock_bh_state(bh);
> -					goto done;
> -				}
> +							 GFP_NOFS|__GFP_NOFAIL);
>  				goto repeat;
>  			}
>  			jh->b_frozen_data = frozen_buffer;
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index ff2f2e6ad311..bff071e21553 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -923,16 +923,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
>  				jbd_unlock_bh_state(bh);
>  				frozen_buffer =
>  					jbd2_alloc(jh2bh(jh)->b_size,
> -							 GFP_NOFS);
> -				if (!frozen_buffer) {
> -					printk(KERN_ERR
> -					       "%s: OOM for frozen_buffer\n",
> -					       __func__);
> -					JBUFFER_TRACE(jh, "oom!");
> -					error = -ENOMEM;
> -					jbd_lock_bh_state(bh);
> -					goto done;
> -				}
> +							 GFP_NOFS|__GFP_NOFAIL);
>  				goto repeat;
>  			}
>  			jh->b_frozen_data = frozen_buffer;
> @@ -1157,7 +1148,8 @@ int jbd2_journal_get_undo_access(handle_t *handle, struct buffer_head *bh)
>  
>  repeat:
>  	if (!jh->b_committed_data) {
> -		committed_data = jbd2_alloc(jh2bh(jh)->b_size, GFP_NOFS);
> +		committed_data = jbd2_alloc(jh2bh(jh)->b_size,
> +					    GFP_NOFS|__GFP_NOFAIL);
>  		if (!committed_data) {
>  			printk(KERN_ERR "%s: No memory for committed data\n",
>  				__func__);

Is this "if (!committed_data) {" check now dead code?

I also see other similar suspected dead sites in the rest of the series.
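
To illustrate the question, here is a minimal userspace sketch, not kernel
code. It assumes the allocator honours a no-fail flag by retrying until an
attempt succeeds. Every name in it (model_alloc, MODEL_NOFS, MODEL_NOFAIL,
model_failures_left, model_get_undo_access) is hypothetical and the flag
values are not the kernel's. Under that assumption the NULL branch in the
caller can never be taken:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical flag bits for this userspace model; not the kernel's values. */
#define MODEL_NOFS   0x1u
#define MODEL_NOFAIL 0x2u

/* Number of upcoming attempts that fail, to simulate memory pressure. */
static int model_failures_left;

/* One allocation attempt; fails while the simulated pressure lasts. */
static void *model_try_alloc(size_t size)
{
	if (model_failures_left > 0) {
		model_failures_left--;
		return NULL;
	}
	return malloc(size);
}

/*
 * Model allocator: without MODEL_NOFAIL a failed attempt is handed back to
 * the caller as NULL; with MODEL_NOFAIL it keeps retrying until an attempt
 * succeeds, so it never returns NULL.
 */
static void *model_alloc(size_t size, unsigned int flags)
{
	void *p;

	do {
		p = model_try_alloc(size);
	} while (!p && (flags & MODEL_NOFAIL));

	/* Invariant of the model: a no-fail request always gets memory. */
	assert(p || !(flags & MODEL_NOFAIL));
	return p;
}

/* Caller shaped like the patched site: the error branch is unreachable. */
static int model_get_undo_access(size_t size)
{
	void *committed_data = model_alloc(size, MODEL_NOFS | MODEL_NOFAIL);

	if (!committed_data)	/* dead while MODEL_NOFAIL is set */
		return -1;

	free(committed_data);
	return 0;
}
```

With model_failures_left set to a positive count, a plain MODEL_NOFS request
returns NULL on its first attempt, while the MODEL_NOFAIL request absorbs the
remaining failures inside the allocator and model_get_undo_access() returns 0.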

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/8] Allow GFP_NOFS allocation to fail
  2015-08-05  9:51 [RFC 0/8] Allow GFP_NOFS allocation to fail mhocko
                   ` (7 preceding siblings ...)
  2015-08-05  9:51 ` [RFC 8/8] btrfs: use __GFP_NOFAIL in alloc_btrfs_bio mhocko
@ 2015-08-05 19:58 ` Andreas Dilger
  2015-08-06 14:34 ` Michal Hocko
  2015-09-07 16:51 ` Tetsuo Handa
  10 siblings, 0 replies; 32+ messages in thread
From: Andreas Dilger @ 2015-08-05 19:58 UTC (permalink / raw)
  To: mhocko
  Cc: LKML, linux-mm, linux-fsdevel, Andrew Morton, Johannes Weiner,
	Tetsuo Handa, Dave Chinner, Theodore Ts'o, linux-btrfs,
	linux-ext4, Jan Kara

On Aug 5, 2015, at 3:51 AM, mhocko@kernel.org wrote:
> Hi,
> Small GFP_NOFS allocations, like GFP_KERNEL ones, have traditionally not
> failed even though their reclaim capabilities are restricted because the
> VM code cannot recurse into filesystems to clean dirty pages. At the same
> time these allocation requests are not allowed to trigger the OOM killer
> because that would lead to premature OOM killing during heavy fs metadata
> workloads.
> 
> This leaves the VM code in an unfortunate situation where GFP_NOFS
> requests loop inside the allocator, relying on somebody else to make
> progress on their behalf. This is prone to deadlocks when the request is
> holding resources which are necessary for another task to make progress
> and release memory (e.g. the OOM victim is blocked on a lock held by the
> NOFS request). Another drawback is that the caller of the allocator
> cannot define any fallback strategy because the request doesn't fail.
> 
> As the VM cannot do much about these requests we should face the reality
> and allow those allocations to fail. Johannes has already posted the
> patch which does that (http://marc.info/?l=linux-mm&m=142726428514236&w=2)
> but the discussion died pretty quickly.
> 
> I was playing with this patch and xfs, ext[34] and btrfs for a while
> to see what is the effect under heavy memory pressure. As expected
> this led to some fallouts.
> 
> My test consisted of a simple memory hog which allocates a lot of
> anonymous memory and writes to a fs mainly to trigger a fs activity on
> exit. In parallel there is a parallel fs metadata load (multiple tasks
> creating thousands of empty files and directories). All is running
> in a VM with small amount of memory to emulate an under provisioned
> system. The metadata load is triggering a sufficient load to invoke
> the direct reclaim even without the memory hog. The memory hog forks
> several tasks sharing the VM and OOM killer manages to kill it without 
> locking up the system (this was based on the test case from Tetsuo
> Handa - http://www.spinics.net/lists/linux-fsdevel/msg82958.html -
> I just didn't want to kill my machine ;)).
> 
> With all the patches applied none of the 4 filesystems gets aborted
> transactions and RO remount (well xfs didn't need any special
> treatment). This is obviously not sufficient to claim that failing
> GFP_NOFS is OK now but I think it is a good start for the further
> discussion. I would be grateful if FS people could have a look at
> those patches.  I have simply used __GFP_NOFAIL in the critical paths. 
> This might be not the best strategy but it sounds like a good first
> step.
> 
> The first patch in the series also allows __GFP_NOFAIL allocations to
> access memory reserves when the system is OOM which should help those
> requests to make a forward progress - especially in combination with
> GFP_NOFS.
> 
> The second patch tries to address a potential pre-mature OOM killer
> from the page fault path. I have posted it separately but it didn't
> get much traction.
> 
> The third patch allows GFP_NOFS to fail and I believe it should see
> much more testing coverage. It would be really great if it could sit
> in the mmotm tree for few release cycles so that we can catch more
> fallouts.
> 
> The rest are the FS specific patches to fortify allocations
> requests which are really needed to finish transactions without RO
> remounts. There might be more needed but my test case survives with
> these in place.

Wouldn't it make more sense to order the fs-specific patches _before_
the "GFP_NOFS can fail" patch (#3), so that once that patch is applied
all known failures have already been fixed?  Otherwise it could show
test failures during bisection that would be confusing.

Cheers, Andreas

> They would obviously need some rewording if they are going to be
> applied even without Patch3 and I will do that if respective
> maintainers will take them. Ext3 and JBD are going away soon so they
> might be dropped but they have been in the tree while I was testing
> so I've kept them.
> 
> Thoughts? Opinions?

Cheers, Andreas

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/8] Allow GFP_NOFS allocation to fail
  2015-08-05  9:51 [RFC 0/8] Allow GFP_NOFS allocation to fail mhocko
                   ` (8 preceding siblings ...)
  2015-08-05 19:58 ` [RFC 0/8] Allow GFP_NOFS allocation to fail Andreas Dilger
@ 2015-08-06 14:34 ` Michal Hocko
  2015-09-07 16:51 ` Tetsuo Handa
  10 siblings, 0 replies; 32+ messages in thread
From: Michal Hocko @ 2015-08-06 14:34 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Johannes Weiner, Dave Chinner, Tetsuo Handa, linux-mm,
	Andrew Morton, Theodore Ts'o, Jan Kara, linux-btrfs,
	linux-ext4, linux-fsdevel, LKML

On Wed 05-08-15 20:58:25, Andreas Dilger wrote:
> On Aug 5, 2015, at 3:51 AM, mhocko@kernel.org wrote:
[...]
> > The rest are the FS specific patches to fortify allocations
> > requests which are really needed to finish transactions without RO
> > remounts. There might be more needed but my test case survives with
> > these in place.
> 
> Wouldn't it make more sense to order the fs-specific patches _before_
> the "GFP_NOFS can fail" patch (#3), so that once that patch is applied
> all known failures have already been fixed?  Otherwise it could show
> test failures during bisection that would be confusing.

As I write below, if maintainers consider them useful even when GFP_NOFS
doesn't fail, I will reword them and resend. But you cannot fix the world
without breaking it first in this case ;)
 
> > They would obviously need some rewording if they are going to be
> > applied even without Patch3 and I will do that if respective
> > maintainers will take them. Ext3 and JBD are going away soon so they
> > might be dropped but they have been in the tree while I was testing
> > so I've kept them.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 4/8] jbd, jbd2: Do not fail journal because of frozen_buffer allocation failure
  2015-08-05 16:49   ` Greg Thelen
@ 2015-08-12  9:14     ` Michal Hocko
  2015-08-15 13:54       ` Theodore Ts'o
  0 siblings, 1 reply; 32+ messages in thread
From: Michal Hocko @ 2015-08-12  9:14 UTC (permalink / raw)
  To: Greg Thelen
  Cc: LKML, linux-mm, linux-fsdevel, Andrew Morton, Johannes Weiner,
	Tetsuo Handa, Dave Chinner, Theodore Ts'o, linux-btrfs,
	linux-ext4, Jan Kara

On Wed 05-08-15 09:49:24, Greg Thelen wrote:
> 
> mhocko@kernel.org wrote:
> 
> > From: Michal Hocko <mhocko@suse.com>
> >
> > Journal transaction might fail prematurely because the frozen_buffer
> > is allocated by GFP_NOFS request:
> > [   72.440013] do_get_write_access: OOM for frozen_buffer
> > [   72.440014] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
> > [   72.440015] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4735: Out of memory
> > (...snipped....)
> > [   72.495559] do_get_write_access: OOM for frozen_buffer
> > [   72.495560] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
> > [   72.496839] do_get_write_access: OOM for frozen_buffer
> > [   72.496841] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
> > [   72.505766] Aborting journal on device sda1-8.
> > [   72.505851] EXT4-fs (sda1): Remounting filesystem read-only
> >
> > This wasn't a problem until "mm: page_alloc: do not lock up GFP_NOFS
> > allocations upon OOM" because small GPF_NOFS allocations never failed.
> > This allocation seems essential for the journal and GFP_NOFS is too
> > restrictive to the memory allocator so let's use __GFP_NOFAIL here to
> > emulate the previous behavior.
> >
> > jbd code has the very same issue so let's do the same there as well.
> >
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > ---
> >  fs/jbd/transaction.c  | 11 +----------
> >  fs/jbd2/transaction.c | 14 +++-----------
> >  2 files changed, 4 insertions(+), 21 deletions(-)
> >
> > diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
> > index 1695ba8334a2..bf7474deda2f 100644
> > --- a/fs/jbd/transaction.c
> > +++ b/fs/jbd/transaction.c
> > @@ -673,16 +673,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
> >  				jbd_unlock_bh_state(bh);
> >  				frozen_buffer =
> >  					jbd_alloc(jh2bh(jh)->b_size,
> > -							 GFP_NOFS);
> > -				if (!frozen_buffer) {
> > -					printk(KERN_ERR
> > -					       "%s: OOM for frozen_buffer\n",
> > -					       __func__);
> > -					JBUFFER_TRACE(jh, "oom!");
> > -					error = -ENOMEM;
> > -					jbd_lock_bh_state(bh);
> > -					goto done;
> > -				}
> > +							 GFP_NOFS|__GFP_NOFAIL);
> >  				goto repeat;
> >  			}
> >  			jh->b_frozen_data = frozen_buffer;
> > diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> > index ff2f2e6ad311..bff071e21553 100644
> > --- a/fs/jbd2/transaction.c
> > +++ b/fs/jbd2/transaction.c
> > @@ -923,16 +923,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
> >  				jbd_unlock_bh_state(bh);
> >  				frozen_buffer =
> >  					jbd2_alloc(jh2bh(jh)->b_size,
> > -							 GFP_NOFS);
> > -				if (!frozen_buffer) {
> > -					printk(KERN_ERR
> > -					       "%s: OOM for frozen_buffer\n",
> > -					       __func__);
> > -					JBUFFER_TRACE(jh, "oom!");
> > -					error = -ENOMEM;
> > -					jbd_lock_bh_state(bh);
> > -					goto done;
> > -				}
> > +							 GFP_NOFS|__GFP_NOFAIL);
> >  				goto repeat;
> >  			}
> >  			jh->b_frozen_data = frozen_buffer;
> > @@ -1157,7 +1148,8 @@ int jbd2_journal_get_undo_access(handle_t *handle, struct buffer_head *bh)
> >  
> >  repeat:
> >  	if (!jh->b_committed_data) {
> > -		committed_data = jbd2_alloc(jh2bh(jh)->b_size, GFP_NOFS);
> > +		committed_data = jbd2_alloc(jh2bh(jh)->b_size,
> > +					    GFP_NOFS|__GFP_NOFAIL);
> >  		if (!committed_data) {
> >  			printk(KERN_ERR "%s: No memory for committed data\n",
> >  				__func__);
> 
> Is this "if (!committed_data) {" check now dead code?
> 
> I also see other similar suspected dead sites in the rest of the series.

You are absolutely right. I have updated the patches.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread
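
The dead-code point raised in this exchange can be illustrated outside the kernel. The following is only a userspace sketch, not the jbd2 code: alloc_nofail() and get_undo_buffer() are hypothetical names, and the retry loop merely stands in for what __GFP_NOFAIL asks of the page allocator. Once the allocator is defined never to return NULL, a following NULL check has no reachable error branch and can be dropped.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical userspace stand-in for a __GFP_NOFAIL allocation:
 * keep retrying until malloc() succeeds, so NULL is never returned.
 * (The kernel would reclaim or sleep between attempts; this mock
 * simply retries.)
 */
static void *alloc_nofail(size_t size)
{
	void *p;

	if (size == 0)
		size = 1;	/* malloc(0) may legitimately return NULL */
	while ((p = malloc(size)) == NULL)
		;
	return p;
}

/*
 * Hypothetical caller. With alloc_nofail() there is no -ENOMEM path
 * left, so an "if (!*out)" check after the call would be dead code.
 */
static int get_undo_buffer(void **out, size_t size)
{
	*out = alloc_nofail(size);
	return 0;
}
```

A caller would use it as get_undo_buffer(&buf, 4096) and free(buf) afterwards; the return value is always 0 in this sketch, which is exactly why the error handling became removable.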

* Re: [RFC 4/8] jbd, jbd2: Do not fail journal because of frozen_buffer allocation failure
  2015-08-12  9:14     ` Michal Hocko
@ 2015-08-15 13:54       ` Theodore Ts'o
  2015-08-18 10:36         ` Michal Hocko
  2015-08-24 12:06         ` Michal Hocko
  0 siblings, 2 replies; 32+ messages in thread
From: Theodore Ts'o @ 2015-08-15 13:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Greg Thelen, LKML, linux-mm, linux-fsdevel, Andrew Morton,
	Johannes Weiner, Tetsuo Handa, Dave Chinner, linux-btrfs,
	linux-ext4, Jan Kara

On Wed, Aug 12, 2015 at 11:14:11AM +0200, Michal Hocko wrote:
> > Is this "if (!committed_data) {" check now dead code?
> > 
> > I also see other similar suspected dead sites in the rest of the series.
> 
> You are absolutely right. I have updated the patches.

Have you sent out an updated version of these patches?  Maybe I missed
it, but I don't think I saw them.

Thanks,

						- Ted

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 4/8] jbd, jbd2: Do not fail journal because of frozen_buffer allocation failure
  2015-08-15 13:54       ` Theodore Ts'o
@ 2015-08-18 10:36         ` Michal Hocko
  2015-08-24 12:06         ` Michal Hocko
  1 sibling, 0 replies; 32+ messages in thread
From: Michal Hocko @ 2015-08-18 10:36 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Greg Thelen, LKML, linux-mm, linux-fsdevel, Andrew Morton,
	Johannes Weiner, Tetsuo Handa, Dave Chinner, linux-btrfs,
	linux-ext4, Jan Kara

On Sat 15-08-15 09:54:22, Theodore Ts'o wrote:
> On Wed, Aug 12, 2015 at 11:14:11AM +0200, Michal Hocko wrote:
> > > Is this "if (!committed_data) {" check now dead code?
> > > 
> > > I also see other similar suspected dead sites in the rest of the series.
> > 
> > You are absolutely right. I have updated the patches.
> 
> Have you sent out an updated version of these patches?  Maybe I missed
> it, but I don't think I saw them.

I haven't yet. I was waiting for more feedback and didn't want to spam
the mailing list too much. I will post them now.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [RFC -v2 4/8] jbd, jbd2: Do not fail journal because of frozen_buffer allocation failure
  2015-08-05  9:51 ` [RFC 4/8] jbd, jbd2: Do not fail journal because of frozen_buffer allocation failure mhocko
  2015-08-05 11:42   ` Jan Kara
  2015-08-05 16:49   ` Greg Thelen
@ 2015-08-18 10:38   ` Michal Hocko
  2 siblings, 0 replies; 32+ messages in thread
From: Michal Hocko @ 2015-08-18 10:38 UTC (permalink / raw)
  To: LKML
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Johannes Weiner,
	Tetsuo Handa, Dave Chinner, Theodore Ts'o, linux-btrfs,
	linux-ext4, Jan Kara

From: Michal Hocko <mhocko@suse.com>

A journal transaction might fail prematurely because the frozen_buffer
is allocated by a GFP_NOFS request:
[   72.440013] do_get_write_access: OOM for frozen_buffer
[   72.440014] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
[   72.440015] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4735: Out of memory
(...snipped....)
[   72.495559] do_get_write_access: OOM for frozen_buffer
[   72.495560] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
[   72.496839] do_get_write_access: OOM for frozen_buffer
[   72.496841] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
[   72.505766] Aborting journal on device sda1-8.
[   72.505851] EXT4-fs (sda1): Remounting filesystem read-only

This wasn't a problem until "mm: page_alloc: do not lock up GFP_NOFS
allocations upon OOM" because small GFP_NOFS allocations never failed.
This allocation seems essential for the journal, and GFP_NOFS is too
restrictive for the memory allocator, so let's use __GFP_NOFAIL here to
emulate the previous behavior.

The jbd code has the very same issue, so let's do the same there as well.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/jbd/transaction.c  | 11 +----------
 fs/jbd2/transaction.c | 23 ++++-------------------
 2 files changed, 5 insertions(+), 29 deletions(-)

diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
index 1695ba8334a2..bf7474deda2f 100644
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -673,16 +673,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
 				jbd_unlock_bh_state(bh);
 				frozen_buffer =
 					jbd_alloc(jh2bh(jh)->b_size,
-							 GFP_NOFS);
-				if (!frozen_buffer) {
-					printk(KERN_ERR
-					       "%s: OOM for frozen_buffer\n",
-					       __func__);
-					JBUFFER_TRACE(jh, "oom!");
-					error = -ENOMEM;
-					jbd_lock_bh_state(bh);
-					goto done;
-				}
+							 GFP_NOFS|__GFP_NOFAIL);
 				goto repeat;
 			}
 			jh->b_frozen_data = frozen_buffer;
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index ff2f2e6ad311..4d63c5911afa 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -923,16 +923,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
 				jbd_unlock_bh_state(bh);
 				frozen_buffer =
 					jbd2_alloc(jh2bh(jh)->b_size,
-							 GFP_NOFS);
-				if (!frozen_buffer) {
-					printk(KERN_ERR
-					       "%s: OOM for frozen_buffer\n",
-					       __func__);
-					JBUFFER_TRACE(jh, "oom!");
-					error = -ENOMEM;
-					jbd_lock_bh_state(bh);
-					goto done;
-				}
+							 GFP_NOFS|__GFP_NOFAIL);
 				goto repeat;
 			}
 			jh->b_frozen_data = frozen_buffer;
@@ -1156,15 +1147,9 @@ int jbd2_journal_get_undo_access(handle_t *handle, struct buffer_head *bh)
 		goto out;
 
 repeat:
-	if (!jh->b_committed_data) {
-		committed_data = jbd2_alloc(jh2bh(jh)->b_size, GFP_NOFS);
-		if (!committed_data) {
-			printk(KERN_ERR "%s: No memory for committed data\n",
-				__func__);
-			err = -ENOMEM;
-			goto out;
-		}
-	}
+	if (!jh->b_committed_data)
+		committed_data = jbd2_alloc(jh2bh(jh)->b_size,
+					    GFP_NOFS|__GFP_NOFAIL);
 
 	jbd_lock_bh_state(bh);
 	if (!jh->b_committed_data) {
-- 
2.5.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 32+ messages in thread
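
The jbd2_journal_get_undo_access hunk above keeps a pattern worth spelling out: allocate before taking the lock (the allocation may sleep), re-check the pointer under the lock, and free the spare buffer if it turned out not to be needed. Below is a single-threaded userspace sketch of that shape. Only the "allocate, lock, re-check" lines are visible in the diff; the install-and-free tail, the mock_* names and the flag-based lock are assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a journal head and its state lock. */
struct mock_jh {
	int locked;		/* 1 while the mock state lock is held */
	void *b_committed_data;
};

static void mock_lock(struct mock_jh *jh)
{
	assert(!jh->locked);
	jh->locked = 1;
}

static void mock_unlock(struct mock_jh *jh)
{
	assert(jh->locked);
	jh->locked = 0;
}

/* Retry until malloc() succeeds; mimics a no-fail allocation. */
static void *alloc_nofail(size_t size)
{
	void *p;

	if (size == 0)
		size = 1;	/* malloc(0) may legitimately return NULL */
	while ((p = malloc(size)) == NULL)
		;
	return p;
}

static void get_undo_access(struct mock_jh *jh, size_t size)
{
	void *committed_data = NULL;

	/* Allocate outside the lock because the allocation may sleep. */
	if (!jh->b_committed_data)
		committed_data = alloc_nofail(size);

	mock_lock(jh);
	/* Re-check under the lock: a buffer may have been installed meanwhile. */
	if (!jh->b_committed_data) {
		jh->b_committed_data = committed_data;
		committed_data = NULL;
	}
	mock_unlock(jh);

	/* Free the spare if someone else won the race (free(NULL) is a no-op). */
	free(committed_data);
}
```

Because the allocation can no longer return NULL, the only remaining reason to reach the locked section without a buffer is that b_committed_data was already set, which is the case the re-check handles.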

* [RFC -v2 5/8] ext4: Do not fail journal due to block allocator
  2015-08-05  9:51 ` [RFC 5/8] ext4: Do not fail journal due to block allocator mhocko
  2015-08-05 11:43   ` Jan Kara
@ 2015-08-18 10:39   ` Michal Hocko
  2015-08-18 10:55     ` Michal Hocko
  1 sibling, 1 reply; 32+ messages in thread
From: Michal Hocko @ 2015-08-18 10:39 UTC (permalink / raw)
  To: LKML
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Johannes Weiner,
	Tetsuo Handa, Dave Chinner, Theodore Ts'o, linux-btrfs,
	linux-ext4, Jan Kara

From: Michal Hocko <mhocko@suse.com>

Since "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM"
the memory allocator no longer loops endlessly to satisfy low-order
allocations and instead fails them so that callers can handle the failure
gracefully.

Some of the callers are not yet prepared for this behavior, though. The ext4
block allocator relies solely on GFP_NOFS allocation requests, and
allocation failures lead to aborting the journal too easily:

[  345.028333] oom-trash: page allocation failure: order:0, mode:0x50
[  345.028336] CPU: 1 PID: 8334 Comm: oom-trash Tainted: G        W       4.0.0-nofs3-00006-gdfe9931f5f68 #588
[  345.028337] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150428_134905-gandalf 04/01/2014
[  345.028339]  0000000000000000 ffff880005a17708 ffffffff81538a54 ffffffff8107a40f
[  345.028341]  0000000000000050 ffff880005a17798 ffffffff810fe854 0000000180000000
[  345.028342]  0000000000000046 0000000000000000 ffffffff81a52100 0000000000000246
[  345.028343] Call Trace:
[  345.028348]  [<ffffffff81538a54>] dump_stack+0x4f/0x7b
[  345.028370]  [<ffffffff810fe854>] warn_alloc_failed+0x12a/0x13f
[  345.028373]  [<ffffffff81101bd2>] __alloc_pages_nodemask+0x7f3/0x8aa
[  345.028375]  [<ffffffff810f9933>] pagecache_get_page+0x12a/0x1c9
[  345.028390]  [<ffffffffa005bc64>] ext4_mb_load_buddy+0x220/0x367 [ext4]
[  345.028414]  [<ffffffffa006014f>] ext4_free_blocks+0x522/0xa4c [ext4]
[  345.028425]  [<ffffffffa0054e14>] ext4_ext_remove_space+0x833/0xf22 [ext4]
[  345.028434]  [<ffffffffa005677e>] ext4_ext_truncate+0x8c/0xb0 [ext4]
[  345.028441]  [<ffffffffa00342bf>] ext4_truncate+0x20b/0x38d [ext4]
[  345.028462]  [<ffffffffa003573c>] ext4_evict_inode+0x32b/0x4c1 [ext4]
[  345.028464]  [<ffffffff8116d04f>] evict+0xa0/0x148
[  345.028466]  [<ffffffff8116dca8>] iput+0x1a1/0x1f0
[  345.028468]  [<ffffffff811697b4>] __dentry_kill+0x136/0x1a6
[  345.028470]  [<ffffffff81169a3e>] dput+0x21a/0x243
[  345.028472]  [<ffffffff81157cda>] __fput+0x184/0x19b
[  345.028473]  [<ffffffff81157d29>] ____fput+0xe/0x10
[  345.028475]  [<ffffffff8105a05f>] task_work_run+0x8a/0xa1
[  345.028477]  [<ffffffff810452f0>] do_exit+0x3c6/0x8dc
[  345.028482]  [<ffffffff8104588a>] do_group_exit+0x4d/0xb2
[  345.028483]  [<ffffffff8104eeeb>] get_signal+0x5b1/0x5f5
[  345.028488]  [<ffffffff81002202>] do_signal+0x28/0x5d0
[...]
[  345.028624] EXT4-fs error (device hdb1) in ext4_free_blocks:4879: Out of memory
[  345.033097] Aborting journal on device hdb1-8.
[  345.036339] EXT4-fs (hdb1): Remounting filesystem read-only
[  345.036344] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
[  345.036766] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
[  345.038583] EXT4-fs error (device hdb1) in ext4_ext_remove_space:3048: Journal has aborted
[  345.049115] EXT4-fs error (device hdb1) in ext4_ext_truncate:4669: Journal has aborted
[  345.050434] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
[  345.053064] EXT4-fs error (device hdb1) in ext4_truncate:3668: Journal has aborted
[  345.053582] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
[  345.053946] EXT4-fs error (device hdb1) in ext4_orphan_del:2686: Journal has aborted
[  345.055367] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted

The failure is really premature because the GFP_NOFS allocation context is
very restricted, especially under fs-metadata-heavy loads. Before we
go with a more sophisticated solution, let's simply imitate the previous
behavior of non-failing NOFS allocations and use __GFP_NOFAIL for the
buddy block allocator. With this patch applied I was no longer able to
trigger the issue.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/ext4/mballoc.c | 52 ++++++++++++++++++++++++----------------------------
 1 file changed, 24 insertions(+), 28 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 5b1613a54307..0360ea32c30f 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -992,9 +992,8 @@ static int ext4_mb_get_buddy_page_lock(struct super_block *sb,
 	block = group * 2;
 	pnum = block / blocks_per_page;
 	poff = block % blocks_per_page;
-	page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
-	if (!page)
-		return -ENOMEM;
+	page = find_or_create_page(inode->i_mapping, pnum,
+				   GFP_NOFS|__GFP_NOFAIL);
 	BUG_ON(page->mapping != inode->i_mapping);
 	e4b->bd_bitmap_page = page;
 	e4b->bd_bitmap = page_address(page) + (poff * sb->s_blocksize);
@@ -1006,9 +1005,8 @@ static int ext4_mb_get_buddy_page_lock(struct super_block *sb,
 
 	block++;
 	pnum = block / blocks_per_page;
-	page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
-	if (!page)
-		return -ENOMEM;
+	page = find_or_create_page(inode->i_mapping, pnum,
+				   GFP_NOFS|__GFP_NOFAIL);
 	BUG_ON(page->mapping != inode->i_mapping);
 	e4b->bd_buddy_page = page;
 	return 0;
@@ -1158,20 +1156,19 @@ ext4_mb_load_buddy(struct super_block *sb, ext4_group_t group,
 			 * wait for it to initialize.
 			 */
 			page_cache_release(page);
-		page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
-		if (page) {
-			BUG_ON(page->mapping != inode->i_mapping);
-			if (!PageUptodate(page)) {
-				ret = ext4_mb_init_cache(page, NULL);
-				if (ret) {
-					unlock_page(page);
-					goto err;
-				}
-				mb_cmp_bitmaps(e4b, page_address(page) +
-					       (poff * sb->s_blocksize));
+		page = find_or_create_page(inode->i_mapping, pnum,
+					   GFP_NOFS|__GFP_NOFAIL);
+		BUG_ON(page->mapping != inode->i_mapping);
+		if (!PageUptodate(page)) {
+			ret = ext4_mb_init_cache(page, NULL);
+			if (ret) {
+				unlock_page(page);
+				goto err;
 			}
-			unlock_page(page);
+			mb_cmp_bitmaps(e4b, page_address(page) +
+				       (poff * sb->s_blocksize));
 		}
+		unlock_page(page);
 	}
 	if (page == NULL) {
 		ret = -ENOMEM;
@@ -1194,18 +1191,17 @@ ext4_mb_load_buddy(struct super_block *sb, ext4_group_t group,
 	if (page == NULL || !PageUptodate(page)) {
 		if (page)
 			page_cache_release(page);
-		page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
-		if (page) {
-			BUG_ON(page->mapping != inode->i_mapping);
-			if (!PageUptodate(page)) {
-				ret = ext4_mb_init_cache(page, e4b->bd_bitmap);
-				if (ret) {
-					unlock_page(page);
-					goto err;
-				}
+		page = find_or_create_page(inode->i_mapping, pnum,
+					   GFP_NOFS|__GFP_NOFAIL);
+		BUG_ON(page->mapping != inode->i_mapping);
+		if (!PageUptodate(page)) {
+			ret = ext4_mb_init_cache(page, e4b->bd_bitmap);
+			if (ret) {
+				unlock_page(page);
+				goto err;
 			}
-			unlock_page(page);
 		}
+		unlock_page(page);
 	}
 	if (page == NULL) {
 		ret = -ENOMEM;
-- 
2.5.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 32+ messages in thread
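
The "mode:0x50" in the ext4 allocation-failure trace above is the raw gfp mask of the failing request. The sketch below decodes such a mask in userspace. The bit values are the ones I recall from include/linux/gfp.h around v4.0 and are an assumption here; they have been renumbered in later kernels, so check the tree that produced the log. The MOCK_* names and decode_gfp() are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Assumed v4.0-era flag values; not valid for current kernels. */
#define MOCK_GFP_WAIT	0x10u
#define MOCK_GFP_IO	0x40u
#define MOCK_GFP_FS	0x80u
#define MOCK_GFP_NOFAIL	0x800u

#define MOCK_GFP_NOFS	(MOCK_GFP_WAIT | MOCK_GFP_IO)
#define MOCK_GFP_KERNEL	(MOCK_GFP_WAIT | MOCK_GFP_IO | MOCK_GFP_FS)

/* Write a "A|B|C" decoding of mask into buf (len > 0); returns buf. */
static const char *decode_gfp(unsigned int mask, char *buf, size_t len)
{
	static const struct {
		unsigned int bit;
		const char *name;
	} flags[] = {
		{ MOCK_GFP_WAIT, "WAIT" },
		{ MOCK_GFP_IO, "IO" },
		{ MOCK_GFP_FS, "FS" },
		{ MOCK_GFP_NOFAIL, "NOFAIL" },
	};
	size_t i;

	buf[0] = '\0';
	for (i = 0; i < sizeof(flags) / sizeof(flags[0]); i++) {
		if (!(mask & flags[i].bit))
			continue;
		if (buf[0])
			strncat(buf, "|", len - strlen(buf) - 1);
		strncat(buf, flags[i].name, len - strlen(buf) - 1);
	}
	return buf;
}
```

Under these assumed values 0x50 decodes to WAIT|IO, i.e. GFP_NOFS: the request may sleep and start I/O but must not recurse into the filesystem, which matches the ext4_mb_load_buddy call in the trace.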

* [RFC -v2 6/8] ext3: Do not abort journal prematurely
  2015-08-05  9:51 ` [RFC 6/8] ext3: Do not abort journal prematurely mhocko
@ 2015-08-18 10:39   ` Michal Hocko
  0 siblings, 0 replies; 32+ messages in thread
From: Michal Hocko @ 2015-08-18 10:39 UTC (permalink / raw)
  To: LKML
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Johannes Weiner,
	Tetsuo Handa, Dave Chinner, Theodore Ts'o, linux-btrfs,
	linux-ext4, Jan Kara

From: Michal Hocko <mhocko@suse.com>

journal_get_undo_access relies on a GFP_NOFS allocation even though that
allocation is essential for the journal transaction:

[   83.256914] journal_get_undo_access: No memory for committed data
[   83.258022] EXT3-fs: ext3_free_blocks_sb: aborting transaction: Out of memory in __ext3_journal_get_undo_access
[   83.259785] EXT3-fs (hdb1): error in ext3_free_blocks_sb: Out of memory
[   83.267130] Aborting journal on device hdb1.
[   83.292308] EXT3-fs (hdb1): error: ext3_journal_start_sb: Detected aborted journal
[   83.293630] EXT3-fs (hdb1): error: remounting filesystem read-only

Since "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM"
these allocation requests are allowed to fail, so we need to use
__GFP_NOFAIL to imitate the previous behavior.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/jbd/transaction.c | 11 ++---------
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
index bf7474deda2f..2151b80276c3 100644
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -886,15 +886,8 @@ int journal_get_undo_access(handle_t *handle, struct buffer_head *bh)
 		goto out;
 
 repeat:
-	if (!jh->b_committed_data) {
-		committed_data = jbd_alloc(jh2bh(jh)->b_size, GFP_NOFS);
-		if (!committed_data) {
-			printk(KERN_ERR "%s: No memory for committed data\n",
-				__func__);
-			err = -ENOMEM;
-			goto out;
-		}
-	}
+	if (!jh->b_committed_data)
+		committed_data = jbd_alloc(jh2bh(jh)->b_size, GFP_NOFS | __GFP_NOFAIL);
 
 	jbd_lock_bh_state(bh);
 	if (!jh->b_committed_data) {
-- 
2.5.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC -v2 7/8] btrfs: Prevent from early transaction abort
  2015-08-05  9:51 ` [RFC 7/8] btrfs: Prevent from early transaction abort mhocko
  2015-08-05 16:31   ` David Sterba
@ 2015-08-18 10:40   ` Michal Hocko
  2015-08-18 11:01     ` Michal Hocko
  2015-08-18 17:11     ` Chris Mason
  1 sibling, 2 replies; 32+ messages in thread
From: Michal Hocko @ 2015-08-18 10:40 UTC (permalink / raw)
  To: LKML
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Johannes Weiner,
	Tetsuo Handa, Dave Chinner, Theodore Ts'o, linux-btrfs,
	linux-ext4, Jan Kara

From: Michal Hocko <mhocko@suse.com>

Btrfs relies on GFP_NOFS allocations when committing the transaction, but
since "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM"
those allocations are allowed to fail, which can lead to a premature
transaction abort:

[   55.328093] Call Trace:
[   55.328890]  [<ffffffff8154e6f0>] dump_stack+0x4f/0x7b
[   55.330518]  [<ffffffff8108fa28>] ? console_unlock+0x334/0x363
[   55.332738]  [<ffffffff8110873e>] __alloc_pages_nodemask+0x81d/0x8d4
[   55.334910]  [<ffffffff81100752>] pagecache_get_page+0x10e/0x20c
[   55.336844]  [<ffffffffa007d916>] alloc_extent_buffer+0xd0/0x350 [btrfs]
[   55.338973]  [<ffffffffa0059d8c>] btrfs_find_create_tree_block+0x15/0x17 [btrfs]
[   55.341329]  [<ffffffffa004f728>] btrfs_alloc_tree_block+0x18c/0x405 [btrfs]
[   55.343566]  [<ffffffffa003fa34>] split_leaf+0x1e4/0x6a6 [btrfs]
[   55.345577]  [<ffffffffa0040567>] btrfs_search_slot+0x671/0x831 [btrfs]
[   55.347679]  [<ffffffff810682d7>] ? get_parent_ip+0xe/0x3e
[   55.349434]  [<ffffffffa0041cb2>] btrfs_insert_empty_items+0x5d/0xa8 [btrfs]
[   55.351681]  [<ffffffffa004ecfb>] __btrfs_run_delayed_refs+0x7a6/0xf35 [btrfs]
[   55.353979]  [<ffffffffa00512ea>] btrfs_run_delayed_refs+0x6e/0x226 [btrfs]
[   55.356212]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[   55.358378]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[   55.360626]  [<ffffffffa0060221>] btrfs_commit_transaction+0x4c/0xaba [btrfs]
[   55.362894]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[   55.365221]  [<ffffffffa0073428>] btrfs_sync_file+0x29c/0x310 [btrfs]
[   55.367273]  [<ffffffff81186808>] vfs_fsync_range+0x8f/0x9e
[   55.369047]  [<ffffffff81186833>] vfs_fsync+0x1c/0x1e
[   55.370654]  [<ffffffff81186869>] do_fsync+0x34/0x4e
[   55.372246]  [<ffffffff81186ab3>] SyS_fsync+0x10/0x14
[   55.373851]  [<ffffffff81554f97>] system_call_fastpath+0x12/0x6f
[   55.381070] BTRFS: error (device hdb1) in btrfs_run_delayed_refs:2821: errno=-12 Out of memory
[   55.382431] BTRFS warning (device hdb1): Skipping commit of aborted transaction.
[   55.382433] BTRFS warning (device hdb1): cleanup_transaction:1692: Aborting unused transaction(IO failure).
[   55.384280] ------------[ cut here ]------------
[   55.384312] WARNING: CPU: 0 PID: 3010 at fs/btrfs/delayed-ref.c:438 btrfs_select_ref_head+0xd9/0xfe [btrfs]()
[...]
[   55.384337] Call Trace:
[   55.384353]  [<ffffffff8154e6f0>] dump_stack+0x4f/0x7b
[   55.384357]  [<ffffffff8107f717>] ? down_trylock+0x2d/0x37
[   55.384359]  [<ffffffff81046977>] warn_slowpath_common+0xa1/0xbb
[   55.384398]  [<ffffffffa00a1d6b>] ? btrfs_select_ref_head+0xd9/0xfe [btrfs]
[   55.384400]  [<ffffffff81046a34>] warn_slowpath_null+0x1a/0x1c
[   55.384423]  [<ffffffffa00a1d6b>] btrfs_select_ref_head+0xd9/0xfe [btrfs]
[   55.384446]  [<ffffffffa004e5f7>] ? __btrfs_run_delayed_refs+0xa2/0xf35 [btrfs]
[   55.384455]  [<ffffffffa004e600>] __btrfs_run_delayed_refs+0xab/0xf35 [btrfs]
[   55.384476]  [<ffffffffa00512ea>] btrfs_run_delayed_refs+0x6e/0x226 [btrfs]
[   55.384499]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[   55.384521]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[   55.384543]  [<ffffffffa0060221>] btrfs_commit_transaction+0x4c/0xaba [btrfs]
[   55.384565]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[   55.384588]  [<ffffffffa0073428>] btrfs_sync_file+0x29c/0x310 [btrfs]
[   55.384591]  [<ffffffff81186808>] vfs_fsync_range+0x8f/0x9e
[   55.384592]  [<ffffffff81186833>] vfs_fsync+0x1c/0x1e
[   55.384593]  [<ffffffff81186869>] do_fsync+0x34/0x4e
[   55.384594]  [<ffffffff81186ab3>] SyS_fsync+0x10/0x14
[   55.384595]  [<ffffffff81554f97>] system_call_fastpath+0x12/0x6f
[...]
[   55.384608] ---[ end trace c29799da1d4dd621 ]---
[   55.437323] BTRFS info (device hdb1): forced readonly
[   55.438815] BTRFS info (device hdb1): delayed_refs has NO entry

Fix this by reintroducing the no-fail behavior of this allocation path
with an explicit __GFP_NOFAIL.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/btrfs/extent_io.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c374e1e71e5f..d855ddffd5fe 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4607,9 +4607,7 @@ __alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start,
 {
 	struct extent_buffer *eb = NULL;
 
-	eb = kmem_cache_zalloc(extent_buffer_cache, GFP_NOFS);
-	if (eb == NULL)
-		return NULL;
+	eb = kmem_cache_zalloc(extent_buffer_cache, GFP_NOFS|__GFP_NOFAIL);
 	eb->start = start;
 	eb->len = len;
 	eb->fs_info = fs_info;
@@ -4867,9 +4865,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 		return NULL;
 
 	for (i = 0; i < num_pages; i++, index++) {
-		p = find_or_create_page(mapping, index, GFP_NOFS);
-		if (!p)
-			goto free_eb;
+		p = find_or_create_page(mapping, index, GFP_NOFS|__GFP_NOFAIL);
 
 		spin_lock(&mapping->private_lock);
 		if (PagePrivate(p)) {
-- 
2.5.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC -v2 8/8] btrfs: use __GFP_NOFAIL in alloc_btrfs_bio
  2015-08-05  9:51 ` [RFC 8/8] btrfs: use __GFP_NOFAIL in alloc_btrfs_bio mhocko
  2015-08-05 16:32   ` David Sterba
@ 2015-08-18 10:41   ` Michal Hocko
  1 sibling, 0 replies; 32+ messages in thread
From: Michal Hocko @ 2015-08-18 10:41 UTC (permalink / raw)
  To: LKML
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Johannes Weiner,
	Tetsuo Handa, Dave Chinner, Theodore Ts'o, linux-btrfs,
	linux-ext4, Jan Kara

From: Michal Hocko <mhocko@suse.com>

alloc_btrfs_bio relies on GFP_NOFS to allocate a bio, but since "mm:
page_alloc: do not lock up GFP_NOFS allocations upon OOM" this allocation
is allowed to fail, which can lead to:
[   37.928625] kernel BUG at fs/btrfs/extent_io.c:4045

This is clearly undesirable and the nofail behavior should be explicit
if the allocation failure cannot be tolerated.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/btrfs/volumes.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 53af23f2c087..42b9949dd71d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4914,9 +4914,7 @@ static struct btrfs_bio *alloc_btrfs_bio(int total_stripes, int real_stripes)
 		 * and the stripes
 		 */
 		sizeof(u64) * (total_stripes),
-		GFP_NOFS);
-	if (!bbio)
-		return NULL;
+		GFP_NOFS|__GFP_NOFAIL);
 
 	atomic_set(&bbio->error, 0);
 	atomic_set(&bbio->refs, 1);
-- 
2.5.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [RFC -v2 5/8] ext4: Do not fail journal due to block allocator
  2015-08-18 10:39   ` [RFC -v2 " Michal Hocko
@ 2015-08-18 10:55     ` Michal Hocko
  0 siblings, 0 replies; 32+ messages in thread
From: Michal Hocko @ 2015-08-18 10:55 UTC (permalink / raw)
  To: LKML
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Johannes Weiner,
	Tetsuo Handa, Dave Chinner, Theodore Ts'o, linux-btrfs,
	linux-ext4, Jan Kara

On Tue 18-08-15 12:39:03, Michal Hocko wrote:
[...]
> @@ -992,9 +992,8 @@ static int ext4_mb_get_buddy_page_lock(struct super_block *sb,
>  	block = group * 2;
>  	pnum = block / blocks_per_page;
>  	poff = block % blocks_per_page;
> -	page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
> -	if (!page)
> -		return -ENOMEM;
> +	page = find_or_create_page(inode->i_mapping, pnum,
> +				   GFP_NOFS|__GFP_NOFAIL);

Scratch this one. find_or_create_page is allowed to return NULL, so the
patch is bogus. I was overly eager to remove the return value checks
everywhere.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread
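
The retraction above is a reminder that a no-fail allocation does not make a compound helper no-fail. The message does not say why find_or_create_page may return NULL; the sketch below assumes, purely for illustration, a helper that first allocates and then inserts into a cache, where the insertion step can fail on its own. All names (mock_cache, mock_find_or_create, alloc_nofail) are hypothetical and this is not the kernel implementation.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical single-slot stand-in for a page cache. */
struct mock_cache {
	void *slot;
	int insert_fails;	/* when set, inserting a new page fails */
};

/* Retry until malloc() succeeds; mimics a no-fail allocation. */
static void *alloc_nofail(size_t size)
{
	void *p;

	if (size == 0)
		size = 1;	/* malloc(0) may legitimately return NULL */
	while ((p = malloc(size)) == NULL)
		;
	return p;
}

/* Look the page up; create and insert it if it is missing. */
static void *mock_find_or_create(struct mock_cache *c, size_t size)
{
	void *page;

	if (c->slot)
		return c->slot;		/* found an existing page */

	page = alloc_nofail(size);	/* step 1: cannot fail */
	if (c->insert_fails) {		/* step 2: assumed to fail independently */
		free(page);
		return NULL;
	}
	c->slot = page;
	return page;
}
```

Callers of such a helper still have to handle NULL even when the allocation flag forbids allocation failure, which is why the removed "if (!page)" checks need to stay.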

* Re: [RFC -v2 7/8] btrfs: Prevent from early transaction abort
  2015-08-18 10:40   ` [RFC -v2 " Michal Hocko
@ 2015-08-18 11:01     ` Michal Hocko
  2015-08-18 17:11     ` Chris Mason
  1 sibling, 0 replies; 32+ messages in thread
From: Michal Hocko @ 2015-08-18 11:01 UTC (permalink / raw)
  To: LKML
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Johannes Weiner,
	Tetsuo Handa, Dave Chinner, Theodore Ts'o, linux-btrfs,
	linux-ext4, Jan Kara

On Tue 18-08-15 12:40:31, Michal Hocko wrote:
[...]
> @@ -4867,9 +4865,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
>  		return NULL;
>  
>  	for (i = 0; i < num_pages; i++, index++) {
> -		p = find_or_create_page(mapping, index, GFP_NOFS);
> -		if (!p)
> -			goto free_eb;
> +		p = find_or_create_page(mapping, index, GFP_NOFS|__GFP_NOFAIL);
>  
>  		spin_lock(&mapping->private_lock);
>  		if (PagePrivate(p)) {

Same here. find_or_create_page might return NULL.
---
From f430e5f54367b8815e1099f26fedd2873b597a07 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Wed, 15 Jul 2015 19:27:06 +0200
Subject: [PATCH] btrfs: Prevent from early transaction abort

Btrfs relies on GFP_NOFS allocations when committing the transaction, but
since "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM"
those allocations are allowed to fail, which can lead to a premature
transaction abort:

[   55.328093] Call Trace:
[   55.328890]  [<ffffffff8154e6f0>] dump_stack+0x4f/0x7b
[   55.330518]  [<ffffffff8108fa28>] ? console_unlock+0x334/0x363
[   55.332738]  [<ffffffff8110873e>] __alloc_pages_nodemask+0x81d/0x8d4
[   55.334910]  [<ffffffff81100752>] pagecache_get_page+0x10e/0x20c
[   55.336844]  [<ffffffffa007d916>] alloc_extent_buffer+0xd0/0x350 [btrfs]
[   55.338973]  [<ffffffffa0059d8c>] btrfs_find_create_tree_block+0x15/0x17 [btrfs]
[   55.341329]  [<ffffffffa004f728>] btrfs_alloc_tree_block+0x18c/0x405 [btrfs]
[   55.343566]  [<ffffffffa003fa34>] split_leaf+0x1e4/0x6a6 [btrfs]
[   55.345577]  [<ffffffffa0040567>] btrfs_search_slot+0x671/0x831 [btrfs]
[   55.347679]  [<ffffffff810682d7>] ? get_parent_ip+0xe/0x3e
[   55.349434]  [<ffffffffa0041cb2>] btrfs_insert_empty_items+0x5d/0xa8 [btrfs]
[   55.351681]  [<ffffffffa004ecfb>] __btrfs_run_delayed_refs+0x7a6/0xf35 [btrfs]
[   55.353979]  [<ffffffffa00512ea>] btrfs_run_delayed_refs+0x6e/0x226 [btrfs]
[   55.356212]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[   55.358378]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[   55.360626]  [<ffffffffa0060221>] btrfs_commit_transaction+0x4c/0xaba [btrfs]
[   55.362894]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[   55.365221]  [<ffffffffa0073428>] btrfs_sync_file+0x29c/0x310 [btrfs]
[   55.367273]  [<ffffffff81186808>] vfs_fsync_range+0x8f/0x9e
[   55.369047]  [<ffffffff81186833>] vfs_fsync+0x1c/0x1e
[   55.370654]  [<ffffffff81186869>] do_fsync+0x34/0x4e
[   55.372246]  [<ffffffff81186ab3>] SyS_fsync+0x10/0x14
[   55.373851]  [<ffffffff81554f97>] system_call_fastpath+0x12/0x6f
[   55.381070] BTRFS: error (device hdb1) in btrfs_run_delayed_refs:2821: errno=-12 Out of memory
[   55.382431] BTRFS warning (device hdb1): Skipping commit of aborted transaction.
[   55.382433] BTRFS warning (device hdb1): cleanup_transaction:1692: Aborting unused transaction(IO failure).
[   55.384280] ------------[ cut here ]------------
[   55.384312] WARNING: CPU: 0 PID: 3010 at fs/btrfs/delayed-ref.c:438 btrfs_select_ref_head+0xd9/0xfe [btrfs]()
[...]
[   55.384337] Call Trace:
[   55.384353]  [<ffffffff8154e6f0>] dump_stack+0x4f/0x7b
[   55.384357]  [<ffffffff8107f717>] ? down_trylock+0x2d/0x37
[   55.384359]  [<ffffffff81046977>] warn_slowpath_common+0xa1/0xbb
[   55.384398]  [<ffffffffa00a1d6b>] ? btrfs_select_ref_head+0xd9/0xfe [btrfs]
[   55.384400]  [<ffffffff81046a34>] warn_slowpath_null+0x1a/0x1c
[   55.384423]  [<ffffffffa00a1d6b>] btrfs_select_ref_head+0xd9/0xfe [btrfs]
[   55.384446]  [<ffffffffa004e5f7>] ? __btrfs_run_delayed_refs+0xa2/0xf35 [btrfs]
[   55.384455]  [<ffffffffa004e600>] __btrfs_run_delayed_refs+0xab/0xf35 [btrfs]
[   55.384476]  [<ffffffffa00512ea>] btrfs_run_delayed_refs+0x6e/0x226 [btrfs]
[   55.384499]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[   55.384521]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[   55.384543]  [<ffffffffa0060221>] btrfs_commit_transaction+0x4c/0xaba [btrfs]
[   55.384565]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
[   55.384588]  [<ffffffffa0073428>] btrfs_sync_file+0x29c/0x310 [btrfs]
[   55.384591]  [<ffffffff81186808>] vfs_fsync_range+0x8f/0x9e
[   55.384592]  [<ffffffff81186833>] vfs_fsync+0x1c/0x1e
[   55.384593]  [<ffffffff81186869>] do_fsync+0x34/0x4e
[   55.384594]  [<ffffffff81186ab3>] SyS_fsync+0x10/0x14
[   55.384595]  [<ffffffff81554f97>] system_call_fastpath+0x12/0x6f
[...]
[   55.384608] ---[ end trace c29799da1d4dd621 ]---
[   55.437323] BTRFS info (device hdb1): forced readonly
[   55.438815] BTRFS info (device hdb1): delayed_refs has NO entry

Fix this by reintroducing the no-fail behavior of this allocation path
with the explicit __GFP_NOFAIL.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/btrfs/extent_io.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c374e1e71e5f..f4d6eea975d7 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4607,9 +4607,7 @@ __alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start,
 {
 	struct extent_buffer *eb = NULL;
 
-	eb = kmem_cache_zalloc(extent_buffer_cache, GFP_NOFS);
-	if (eb == NULL)
-		return NULL;
+	eb = kmem_cache_zalloc(extent_buffer_cache, GFP_NOFS|__GFP_NOFAIL);
 	eb->start = start;
 	eb->len = len;
 	eb->fs_info = fs_info;
@@ -4867,7 +4865,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 		return NULL;
 
 	for (i = 0; i < num_pages; i++, index++) {
-		p = find_or_create_page(mapping, index, GFP_NOFS);
+		p = find_or_create_page(mapping, index, GFP_NOFS|__GFP_NOFAIL);
 		if (!p)
 			goto free_eb;
 
-- 
2.5.0

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 32+ messages in thread
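
The effect of the patch above can be modelled in userspace. What follows is
only a rough sketch under stated assumptions, not kernel code: alloc_may_fail(),
alloc_nofail(), struct eb_model and eb_model_alloc() are hypothetical names
invented for illustration, and the three simulated failures stand in for
transient memory pressure. It shows why the caller can drop its NULL check
once the allocator itself promises to retry until it succeeds, which is the
contract __GFP_NOFAIL expresses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical failable allocator: the first `failures_left` calls
 * return NULL to simulate transient memory pressure.
 */
static int failures_left = 3;

static void *alloc_may_fail(size_t size)
{
	if (failures_left > 0) {
		failures_left--;
		return NULL;
	}
	return calloc(1, size);
}

/*
 * Models the __GFP_NOFAIL contract: keep retrying inside the allocator
 * and never return NULL. The retry count is reported for illustration.
 */
static void *alloc_nofail(size_t size, int *retries)
{
	void *p;

	assert(size > 0);
	*retries = 0;
	while ((p = alloc_may_fail(size)) == NULL)
		(*retries)++;
	return p;
}

/* Reduced, hypothetical stand-in for struct extent_buffer. */
struct eb_model {
	unsigned long long start;
	unsigned long len;
};

static struct eb_model *eb_model_alloc(unsigned long long start,
				       unsigned long len, int *retries)
{
	struct eb_model *eb = alloc_nofail(sizeof(*eb), retries);

	/* No NULL check here, mirroring the patched __alloc_extent_buffer(). */
	eb->start = start;
	eb->len = len;
	return eb;
}
```

The cost of this design is visible in the loop: the caller can no longer
choose a fallback, and the request blocks until memory becomes available.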

* Re: [RFC -v2 7/8] btrfs: Prevent from early transaction abort
  2015-08-18 10:40   ` [RFC -v2 " Michal Hocko
  2015-08-18 11:01     ` Michal Hocko
@ 2015-08-18 17:11     ` Chris Mason
  2015-08-18 17:29       ` Michal Hocko
  1 sibling, 1 reply; 32+ messages in thread
From: Chris Mason @ 2015-08-18 17:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: LKML, linux-mm, linux-fsdevel, Andrew Morton, Johannes Weiner,
	Tetsuo Handa, Dave Chinner, Theodore Ts'o, linux-btrfs,
	linux-ext4, Jan Kara

On Tue, Aug 18, 2015 at 12:40:32PM +0200, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Btrfs relies on GFP_NOFS allocation when committing the transaction but
> since "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM"
> those allocations are allowed to fail, which can lead to a premature
> transaction abort:

I can either put the btrfs nofail ones on my pull for Linus, or you can
add my sob and send as one unit.  Just let me know how you'd rather do
it.

-chris

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC -v2 7/8] btrfs: Prevent from early transaction abort
  2015-08-18 17:11     ` Chris Mason
@ 2015-08-18 17:29       ` Michal Hocko
  2015-08-19 12:26         ` Michal Hocko
  0 siblings, 1 reply; 32+ messages in thread
From: Michal Hocko @ 2015-08-18 17:29 UTC (permalink / raw)
  To: Chris Mason
  Cc: LKML, linux-mm, linux-fsdevel, Andrew Morton, Johannes Weiner,
	Tetsuo Handa, Dave Chinner, Theodore Ts'o, linux-btrfs,
	linux-ext4, Jan Kara

On Tue 18-08-15 13:11:44, Chris Mason wrote:
> On Tue, Aug 18, 2015 at 12:40:32PM +0200, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > Btrfs relies on GFP_NOFS allocation when committing the transaction but
> > since "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM"
> > those allocations are allowed to fail, which can lead to a premature
> > transaction abort:
> 
> I can either put the btrfs nofail ones on my pull for Linus, or you can
> add my sob and send as one unit.  Just let me know how you'd rather do
> it.

OK, I will rephrase the changelogs (tomorrow) to not refer to an
unmerged patch and would appreciate if you can take them and route them
through your tree. I will then drop them from my pile.

Thanks.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC -v2 7/8] btrfs: Prevent from early transaction abort
  2015-08-18 17:29       ` Michal Hocko
@ 2015-08-19 12:26         ` Michal Hocko
  0 siblings, 0 replies; 32+ messages in thread
From: Michal Hocko @ 2015-08-19 12:26 UTC (permalink / raw)
  To: Chris Mason
  Cc: LKML, linux-mm, linux-fsdevel, Andrew Morton, Johannes Weiner,
	Tetsuo Handa, Dave Chinner, Theodore Ts'o, linux-btrfs,
	linux-ext4, Jan Kara

On Tue 18-08-15 19:29:14, Michal Hocko wrote:
> On Tue 18-08-15 13:11:44, Chris Mason wrote:
> > On Tue, Aug 18, 2015 at 12:40:32PM +0200, Michal Hocko wrote:
> > > From: Michal Hocko <mhocko@suse.com>
> > > 
> > > Btrfs relies on GFP_NOFS allocation when committing the transaction but
> > > since "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM"
> > > those allocations are allowed to fail, which can lead to a premature
> > > transaction abort:
> > 
> > I can either put the btrfs nofail ones on my pull for Linus, or you can
> > add my sob and send as one unit.  Just let me know how you'd rather do
> > it.
> 
> OK, I will rephrase the changelogs (tomorrow) to not refer to an
> unmerged patch and would appreciate if you can take them and route them
> through your tree. I will then drop them from my pile.

Posted in a separate thread:
http://lkml.kernel.org/r/1439986661-15896-1-git-send-email-mhocko@kernel.org
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 4/8] jbd, jbd2: Do not fail journal because of frozen_buffer allocation failure
  2015-08-15 13:54       ` Theodore Ts'o
  2015-08-18 10:36         ` Michal Hocko
@ 2015-08-24 12:06         ` Michal Hocko
  1 sibling, 0 replies; 32+ messages in thread
From: Michal Hocko @ 2015-08-24 12:06 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Greg Thelen, LKML, linux-mm, linux-fsdevel, Andrew Morton,
	Johannes Weiner, Tetsuo Handa, Dave Chinner, linux-btrfs,
	linux-ext4, Jan Kara

Hi Ted,

On Sat 15-08-15 09:54:22, Theodore Ts'o wrote:
> On Wed, Aug 12, 2015 at 11:14:11AM +0200, Michal Hocko wrote:
> > > Is this "if (!committed_data) {" check now dead code?
> > > 
> > > I also see other similar suspected dead sites in the rest of the series.
> > 
> > You are absolutely right. I have updated the patches.
> 
> Have you sent out an updated version of these patches?  Maybe I missed
> it, but I don't think I saw them.

Would you be interested in these two patches resent with a rephrased
changelog, so that they do not depend on the patch which allows GFP_NOFS
to fail? This is the way it has been handled for btrfs...
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/8] Allow GFP_NOFS allocation to fail
  2015-08-05  9:51 [RFC 0/8] Allow GFP_NOFS allocation to fail mhocko
                   ` (9 preceding siblings ...)
  2015-08-06 14:34 ` Michal Hocko
@ 2015-09-07 16:51 ` Tetsuo Handa
  2015-09-15 13:16   ` Tetsuo Handa
  10 siblings, 1 reply; 32+ messages in thread
From: Tetsuo Handa @ 2015-09-07 16:51 UTC (permalink / raw)
  To: mhocko, linux-kernel
  Cc: linux-mm, linux-fsdevel, akpm, hannes, david, tytso, jack

Michal Hocko wrote:
> As the VM cannot do much about these requests we should face the reality
> and allow those allocations to fail. Johannes has already posted the
> patch which does that (http://marc.info/?l=linux-mm&m=142726428514236&w=2)
> but the discussion died pretty quickly.

Addition of __GFP_NOFAIL to some locations is accepted, but otherwise
this patchset seems to be stalled.

> With all the patches applied none of the 4 filesystems gets aborted
> transactions and RO remount (well xfs didn't need any special
> treatment). This is obviously not sufficient to claim that failing
> GFP_NOFS is OK now but I think it is a good start for the further
> discussion. I would be grateful if FS people could have a look at those
> patches.  I have simply used __GFP_NOFAIL in the critical paths. This
> might be not the best strategy but it sounds like a good first step.

I posted my comment at
https://osdn.jp/projects/tomoyo/lists/archive/users-en/2015-September/000630.html .

> The third patch allows GFP_NOFS to fail and I believe it should see much
> more testing coverage. It would be really great if it could sit in the
> mmotm tree for few release cycles so that we can catch more fallouts.

Judging from the responses to this patchset, sitting in the mmotm tree is
unlikely to gain much testing coverage. Also, FS is not the only location
that needs to be tested. If you really want to push the "GFP_NOFS can fail"
patch, I think you need to make a lot of effort to encourage kernel
developers to test using mandatory fault injection.

> Thoughts? Opinions?

To me, fixing callers (adding __GFP_NORETRY to callers) in a step-by-step
fashion after adding a proactive countermeasure sounds better than changing
the default behavior (implicitly applying __GFP_NORETRY inside).

^ permalink raw reply	[flat|nested] 32+ messages in thread
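
The fault-injection testing suggested in the message above can be pictured
with a small userspace model. This is a sketch under stated assumptions, not
the kernel's fault-injection framework: inject_alloc(), fill_record(),
count_failures() and the fixed fail_interval are hypothetical names and
values chosen for illustration. The point is that a caller which can fail
needs its error path exercised deliberately, because real allocation
failures are too rare to provide coverage on their own.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical injector state: fail every `fail_interval`-th allocation. */
static unsigned int fail_interval = 4;
static unsigned int alloc_calls;

static void *inject_alloc(size_t size)
{
	alloc_calls++;
	if (fail_interval && alloc_calls % fail_interval == 0)
		return NULL;	/* injected failure */
	return malloc(size);
}

/*
 * A caller with an explicit fallback: it reports -ENOMEM to its own
 * caller instead of looping inside the allocator.
 */
static int fill_record(char **out, size_t size)
{
	char *buf;

	assert(out != NULL);
	buf = inject_alloc(size);
	if (!buf)
		return -ENOMEM;
	buf[0] = '\0';
	*out = buf;
	return 0;
}

/* Issues `rounds` requests and counts how many took the error path. */
static unsigned int count_failures(unsigned int rounds)
{
	unsigned int i, failed = 0;

	for (i = 0; i < rounds; i++) {
		char *rec = NULL;

		if (fill_record(&rec, 64) == -ENOMEM)
			failed++;
		else
			free(rec);
	}
	return failed;
}
```

With an interval of 4, every fourth request takes the -ENOMEM path, so an
error path that leaks or crashes shows up quickly instead of only under
real memory pressure.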

* Re: [RFC 0/8] Allow GFP_NOFS allocation to fail
  2015-09-07 16:51 ` Tetsuo Handa
@ 2015-09-15 13:16   ` Tetsuo Handa
  0 siblings, 0 replies; 32+ messages in thread
From: Tetsuo Handa @ 2015-09-15 13:16 UTC (permalink / raw)
  To: mhocko, linux-kernel
  Cc: linux-mm, linux-fsdevel, akpm, hannes, david, tytso, jack

Tetsuo Handa wrote:
> > Thoughts? Opinions?
> 
> To me, fixing callers (adding __GFP_NORETRY to callers) in a step-by-step
> fashion after adding a proactive countermeasure sounds better than changing
> the default behavior (implicitly applying __GFP_NORETRY inside).
> 

Ping?

I showed you at http://marc.info/?l=linux-mm&m=144198479931388 that
changing the default behavior cannot end the game of Whack-A-Mole.
As long as there are unkillable threads, we can't kill context-sensitive
moles.

I believe that what we need to do now is to add a proactive countermeasure
(e.g. kill more processes) rather than try to reduce the possibility of
hitting this issue (e.g. allow !__GFP_FS to fail).

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2015-09-15 13:16 UTC | newest]

Thread overview: 32+ messages
2015-08-05  9:51 [RFC 0/8] Allow GFP_NOFS allocation to fail mhocko
2015-08-05  9:51 ` [RFC 1/8] mm, oom: Give __GFP_NOFAIL allocations access to memory reserves mhocko
2015-08-05  9:51 ` [RFC 2/8] mm: Allow GFP_IOFS for page_cache_read page cache allocation mhocko
2015-08-05  9:51 ` [RFC 3/8] mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM mhocko
2015-08-05  9:51 ` [RFC 4/8] jbd, jbd2: Do not fail journal because of frozen_buffer allocation failure mhocko
2015-08-05 11:42   ` Jan Kara
2015-08-05 16:49   ` Greg Thelen
2015-08-12  9:14     ` Michal Hocko
2015-08-15 13:54       ` Theodore Ts'o
2015-08-18 10:36         ` Michal Hocko
2015-08-24 12:06         ` Michal Hocko
2015-08-18 10:38   ` [RFC -v2 " Michal Hocko
2015-08-05  9:51 ` [RFC 5/8] ext4: Do not fail journal due to block allocator mhocko
2015-08-05 11:43   ` Jan Kara
2015-08-18 10:39   ` [RFC -v2 " Michal Hocko
2015-08-18 10:55     ` Michal Hocko
2015-08-05  9:51 ` [RFC 6/8] ext3: Do not abort journal prematurely mhocko
2015-08-18 10:39   ` [RFC -v2 " Michal Hocko
2015-08-05  9:51 ` [RFC 7/8] btrfs: Prevent from early transaction abort mhocko
2015-08-05 16:31   ` David Sterba
2015-08-18 10:40   ` [RFC -v2 " Michal Hocko
2015-08-18 11:01     ` Michal Hocko
2015-08-18 17:11     ` Chris Mason
2015-08-18 17:29       ` Michal Hocko
2015-08-19 12:26         ` Michal Hocko
2015-08-05  9:51 ` [RFC 8/8] btrfs: use __GFP_NOFAIL in alloc_btrfs_bio mhocko
2015-08-05 16:32   ` David Sterba
2015-08-18 10:41   ` [RFC -v2 " Michal Hocko
2015-08-05 19:58 ` [RFC 0/8] Allow GFP_NOFS allocation to fail Andreas Dilger
2015-08-06 14:34 ` Michal Hocko
2015-09-07 16:51 ` Tetsuo Handa
2015-09-15 13:16   ` Tetsuo Handa
