linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/3] fs: Use memalloc_nofs_save/restore scope API
@ 2018-03-21 22:44 Goldwyn Rodrigues
  2018-03-21 22:44 ` [PATCH 1/3] fs: Perform writebacks under memalloc_nofs Goldwyn Rodrigues
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Goldwyn Rodrigues @ 2018-03-21 22:44 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-mm, willy, david, Goldwyn Rodrigues

The goal of these patch set is to define the scope of the filesystems
code which should not be called back into in low memory allocations.
This primarily covers page writebacks, inode writebacks and writing
cache pages.

Eventually, once we are sure that FS code does not recurse in low memory
situations, we can use GFP_KERNEL instead of GFP_NOFS (without being
unsure of which flag to use ;)) However, that is a long way to go.

A previous discussion on this is listed here [1]

If you know of more situations, I would be glad to add.

[1] https://marc.info/?l=linux-fsdevel&m=152055278014609&w=2

-- 
Goldwyn

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/3] fs: Perform writebacks under memalloc_nofs
  2018-03-21 22:44 [PATCH 0/3] fs: Use memalloc_nofs_save/restore scope API Goldwyn Rodrigues
@ 2018-03-21 22:44 ` Goldwyn Rodrigues
  2018-03-22  7:08   ` Michal Hocko
  2018-03-21 22:44 ` [PATCH 2/3] fs: use memalloc_nofs API while shrinking superblock Goldwyn Rodrigues
  2018-03-21 22:44 ` [PATCH 3/3] fs: Use memalloc_nofs_save in generic_perform_write Goldwyn Rodrigues
  2 siblings, 1 reply; 15+ messages in thread
From: Goldwyn Rodrigues @ 2018-03-21 22:44 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-mm, willy, david, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

writebacks can recurse into itself under low memory situations.
Set memalloc_nofs_save() in order to make sure it does not
recurse.

Since writeouts are performed for fdatawrites as well, set
memalloc_nofs_save while performing fdatawrites.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/fs-writeback.c | 8 ++++++++
 fs/xfs/xfs_aops.c | 7 -------
 mm/filemap.c      | 3 +++
 3 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index d4d04fee568a..42f7f4c2063b 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -28,6 +28,7 @@
 #include <linux/tracepoint.h>
 #include <linux/device.h>
 #include <linux/memcontrol.h>
+#include <linux/sched/mm.h>
 #include "internal.h"
 
 /*
@@ -1713,7 +1714,12 @@ static long wb_writeback(struct bdi_writeback *wb,
 	struct inode *inode;
 	long progress;
 	struct blk_plug plug;
+	unsigned flags;
 
+	if (current->flags & PF_MEMALLOC_NOFS)
+		return 0;
+
+	flags = memalloc_nofs_save();
 	oldest_jif = jiffies;
 	work->older_than_this = &oldest_jif;
 
@@ -1797,6 +1803,8 @@ static long wb_writeback(struct bdi_writeback *wb,
 	spin_unlock(&wb->list_lock);
 	blk_finish_plug(&plug);
 
+	memalloc_nofs_restore(flags);
+
 	return nr_pages - work->nr_pages;
 }
 
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 9c6a830da0ee..943ade03489a 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1086,13 +1086,6 @@ xfs_do_writepage(
 			PF_MEMALLOC))
 		goto redirty;
 
-	/*
-	 * Given that we do not allow direct reclaim to call us, we should
-	 * never be called while in a filesystem transaction.
-	 */
-	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
-		goto redirty;
-
 	/*
 	 * Is this page beyond the end of the file?
 	 *
diff --git a/mm/filemap.c b/mm/filemap.c
index 693f62212a59..3c9ead9a1e32 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -430,6 +430,7 @@ int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
 				loff_t end, int sync_mode)
 {
 	int ret;
+	unsigned flags;
 	struct writeback_control wbc = {
 		.sync_mode = sync_mode,
 		.nr_to_write = LONG_MAX,
@@ -440,9 +441,11 @@ int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
 	if (!mapping_cap_writeback_dirty(mapping))
 		return 0;
 
+	flags = memalloc_nofs_save();
 	wbc_attach_fdatawrite_inode(&wbc, mapping->host);
 	ret = do_writepages(mapping, &wbc);
 	wbc_detach_inode(&wbc);
+	memalloc_nofs_restore(flags);
 	return ret;
 }
 
-- 
2.16.2

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/3] fs: use memalloc_nofs API while shrinking superblock
  2018-03-21 22:44 [PATCH 0/3] fs: Use memalloc_nofs_save/restore scope API Goldwyn Rodrigues
  2018-03-21 22:44 ` [PATCH 1/3] fs: Perform writebacks under memalloc_nofs Goldwyn Rodrigues
@ 2018-03-21 22:44 ` Goldwyn Rodrigues
  2018-03-22  7:09   ` Michal Hocko
  2018-03-21 22:44 ` [PATCH 3/3] fs: Use memalloc_nofs_save in generic_perform_write Goldwyn Rodrigues
  2 siblings, 1 reply; 15+ messages in thread
From: Goldwyn Rodrigues @ 2018-03-21 22:44 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-mm, willy, david, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

The superblock shrinkers are responsible for pruning dcache and icache.
which evicts the inode by calling into local filesystem code. Protect
allocations under memalloc_nofs_save/restore().

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/super.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/super.c b/fs/super.c
index 672538ca9831..26fc2679118d 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -35,6 +35,7 @@
 #include <linux/fsnotify.h>
 #include <linux/lockdep.h>
 #include <linux/user_namespace.h>
+#include <linux/sched/mm.h>
 #include "internal.h"
 
 
@@ -63,6 +64,7 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
 	long	freed = 0;
 	long	dentries;
 	long	inodes;
+	unsigned flags;
 
 	sb = container_of(shrink, struct super_block, s_shrink);
 
@@ -70,9 +72,11 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
 	 * Deadlock avoidance.  We may hold various FS locks, and we don't want
 	 * to recurse into the FS that called us in clear_inode() and friends..
 	 */
-	if (!(sc->gfp_mask & __GFP_FS))
+	if (!(sc->gfp_mask & __GFP_FS) || (current->flags & PF_MEMALLOC_NOFS))
 		return SHRINK_STOP;
 
+	flags = memalloc_nofs_save();
+
 	if (!trylock_super(sb))
 		return SHRINK_STOP;
 
@@ -107,6 +111,7 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
 		freed += sb->s_op->free_cached_objects(sb, sc);
 	}
 
+	memalloc_nofs_restore(flags);
 	up_read(&sb->s_umount);
 	return freed;
 }
-- 
2.16.2

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 3/3] fs: Use memalloc_nofs_save in generic_perform_write
  2018-03-21 22:44 [PATCH 0/3] fs: Use memalloc_nofs_save/restore scope API Goldwyn Rodrigues
  2018-03-21 22:44 ` [PATCH 1/3] fs: Perform writebacks under memalloc_nofs Goldwyn Rodrigues
  2018-03-21 22:44 ` [PATCH 2/3] fs: use memalloc_nofs API while shrinking superblock Goldwyn Rodrigues
@ 2018-03-21 22:44 ` Goldwyn Rodrigues
  2018-03-22  7:10   ` Michal Hocko
  2 siblings, 1 reply; 15+ messages in thread
From: Goldwyn Rodrigues @ 2018-03-21 22:44 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-mm, willy, david, Goldwyn Rodrigues

From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Perform generic_perform_write() under memalloc_nofs because any allocations
should not recurse into fs writebacks.
This covers grab and write cache pages,

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 mm/filemap.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 3c9ead9a1e32..5fe54614c69f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -36,6 +36,7 @@
 #include <linux/cleancache.h>
 #include <linux/shmem_fs.h>
 #include <linux/rmap.h>
+#include <linux/sched/mm.h>
 #include "internal.h"
 
 #define CREATE_TRACE_POINTS
@@ -3105,6 +3106,7 @@ ssize_t generic_perform_write(struct file *file,
 	long status = 0;
 	ssize_t written = 0;
 	unsigned int flags = 0;
+	unsigned nofs_flags = memalloc_nofs_save();
 
 	do {
 		struct page *page;
@@ -3177,6 +3179,8 @@ ssize_t generic_perform_write(struct file *file,
 		balance_dirty_pages_ratelimited(mapping);
 	} while (iov_iter_count(i));
 
+	memalloc_nofs_restore(nofs_flags);
+
 	return written ? written : status;
 }
 EXPORT_SYMBOL(generic_perform_write);
-- 
2.16.2

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] fs: Perform writebacks under memalloc_nofs
  2018-03-21 22:44 ` [PATCH 1/3] fs: Perform writebacks under memalloc_nofs Goldwyn Rodrigues
@ 2018-03-22  7:08   ` Michal Hocko
  2018-03-27 12:52     ` Goldwyn Rodrigues
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Hocko @ 2018-03-22  7:08 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: linux-fsdevel, linux-mm, willy, david, Goldwyn Rodrigues

On Wed 21-03-18 17:44:27, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> writebacks can recurse into itself under low memory situations.
> Set memalloc_nofs_save() in order to make sure it does not
> recurse.

How? We are not doing writeback from the direct reclaim context.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/3] fs: use memalloc_nofs API while shrinking superblock
  2018-03-21 22:44 ` [PATCH 2/3] fs: use memalloc_nofs API while shrinking superblock Goldwyn Rodrigues
@ 2018-03-22  7:09   ` Michal Hocko
  0 siblings, 0 replies; 15+ messages in thread
From: Michal Hocko @ 2018-03-22  7:09 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: linux-fsdevel, linux-mm, willy, david, Goldwyn Rodrigues

On Wed 21-03-18 17:44:28, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> The superblock shrinkers are responsible for pruning dcache and icache.
> which evicts the inode by calling into local filesystem code. Protect
> allocations under memalloc_nofs_save/restore().

This is just wrong. PF_MEMALLOC_NOFS implies GFP_NOFS. Have a look at
current_gfp_context. Shrinkers really do not and should not care about
it.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/3] fs: Use memalloc_nofs_save in generic_perform_write
  2018-03-21 22:44 ` [PATCH 3/3] fs: Use memalloc_nofs_save in generic_perform_write Goldwyn Rodrigues
@ 2018-03-22  7:10   ` Michal Hocko
  0 siblings, 0 replies; 15+ messages in thread
From: Michal Hocko @ 2018-03-22  7:10 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: linux-fsdevel, linux-mm, willy, david, Goldwyn Rodrigues

On Wed 21-03-18 17:44:29, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> Perform generic_perform_write() under memalloc_nofs because any allocations
> should not recurse into fs writebacks.

Why? What is the deadlock scenario?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] fs: Perform writebacks under memalloc_nofs
  2018-03-22  7:08   ` Michal Hocko
@ 2018-03-27 12:52     ` Goldwyn Rodrigues
  2018-03-27 14:21       ` Matthew Wilcox
  0 siblings, 1 reply; 15+ messages in thread
From: Goldwyn Rodrigues @ 2018-03-27 12:52 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-fsdevel, linux-mm, willy, david, Goldwyn Rodrigues



On 03/22/2018 02:08 AM, Michal Hocko wrote:
> On Wed 21-03-18 17:44:27, Goldwyn Rodrigues wrote:
>> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
>>
>> writebacks can recurse into itself under low memory situations.
>> Set memalloc_nofs_save() in order to make sure it does not
>> recurse.
> 
> How? We are not doing writeback from the direct reclaim context.
> 

I am not sure if I missed a condition in the code, but here is one of
the call lineup:

writepages() -> writepage() -> kmalloc() -> __alloc_pages() ->
__alloc_pages_nodemask -> __alloc_pages_slowpath ->
__alloc_pages_direct_reclaim() -> try_to_free_pages() ->
do_try_to_free_pages() -> shrink_zones() -> shrink_node() ->
shrink_slab() -> do_shrink_slab() -> shrinker.scan_objects() ->
super_cache_scan() -> prune_icache_sb() -> fs/inode.c:dispose_list() ->
evict(inode) -> evict_inode() for ext4 ->  filemap_write_and_wait() ->
filemap_fdatawrite(mapping) -> __filemap_fdatawrite_range() ->
do_writepages -> writepages()

Please note, most filesystems currently have a safeguard in writepage()
which will return if the PF_MEMALLOC is set. The other safeguard is
__GFP_FS which we are trying to eliminate.


-- 
Goldwyn

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] fs: Perform writebacks under memalloc_nofs
  2018-03-27 12:52     ` Goldwyn Rodrigues
@ 2018-03-27 14:21       ` Matthew Wilcox
  2018-03-27 15:13         ` Goldwyn Rodrigues
  0 siblings, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2018-03-27 14:21 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: Michal Hocko, linux-fsdevel, linux-mm, david, Goldwyn Rodrigues

On Tue, Mar 27, 2018 at 07:52:48AM -0500, Goldwyn Rodrigues wrote:
> I am not sure if I missed a condition in the code, but here is one of
> the call lineup:
> 
> writepages() -> writepage() -> kmalloc() -> __alloc_pages() ->
> __alloc_pages_nodemask -> __alloc_pages_slowpath ->
> __alloc_pages_direct_reclaim() -> try_to_free_pages() ->
> do_try_to_free_pages() -> shrink_zones() -> shrink_node() ->
> shrink_slab() -> do_shrink_slab() -> shrinker.scan_objects() ->
> super_cache_scan() -> prune_icache_sb() -> fs/inode.c:dispose_list() ->
> evict(inode) -> evict_inode() for ext4 ->  filemap_write_and_wait() ->
> filemap_fdatawrite(mapping) -> __filemap_fdatawrite_range() ->
> do_writepages -> writepages()
> 
> Please note, most filesystems currently have a safeguard in writepage()
> which will return if the PF_MEMALLOC is set. The other safeguard is
> __GFP_FS which we are trying to eliminate.

But is that harmful?  ext4_writepage() (for example) says that it will
not deadlock in that circumstance:

 * We can get recursively called as show below.
 *
 *      ext4_writepage() -> kmalloc() -> __alloc_pages() -> page_launder() ->
 *              ext4_writepage()
 *
 * But since we don't do any block allocation we should not deadlock.
 * Page also have the dirty flag cleared so we don't get recurive page_lock.

One might well argue that it's not *useful*; if we've gone into
writepage already, there's no point in re-entering writepage.  And the
last thing we want to do is 
But I could see filesystems behaving differently when entered
for writepage-for-regularly-scheduled-writeback versus
writepage-for-shrinking, so maybe they can make progress.

Maybe no real filesystem behaves that way.  We need feedback from
filesystem people.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] fs: Perform writebacks under memalloc_nofs
  2018-03-27 14:21       ` Matthew Wilcox
@ 2018-03-27 15:13         ` Goldwyn Rodrigues
  2018-03-27 16:45           ` Matthew Wilcox
  2018-03-28  7:01           ` Michal Hocko
  0 siblings, 2 replies; 15+ messages in thread
From: Goldwyn Rodrigues @ 2018-03-27 15:13 UTC (permalink / raw)
  To: Matthew Wilcox, Goldwyn Rodrigues
  Cc: Michal Hocko, linux-fsdevel, linux-mm, david



On 03/27/2018 09:21 AM, Matthew Wilcox wrote:
> On Tue, Mar 27, 2018 at 07:52:48AM -0500, Goldwyn Rodrigues wrote:
>> I am not sure if I missed a condition in the code, but here is one of
>> the call lineup:
>>
>> writepages() -> writepage() -> kmalloc() -> __alloc_pages() ->
>> __alloc_pages_nodemask -> __alloc_pages_slowpath ->
>> __alloc_pages_direct_reclaim() -> try_to_free_pages() ->
>> do_try_to_free_pages() -> shrink_zones() -> shrink_node() ->
>> shrink_slab() -> do_shrink_slab() -> shrinker.scan_objects() ->
>> super_cache_scan() -> prune_icache_sb() -> fs/inode.c:dispose_list() ->
>> evict(inode) -> evict_inode() for ext4 ->  filemap_write_and_wait() ->
>> filemap_fdatawrite(mapping) -> __filemap_fdatawrite_range() ->
>> do_writepages -> writepages()
>>
>> Please note, most filesystems currently have a safeguard in writepage()
>> which will return if the PF_MEMALLOC is set. The other safeguard is
>> __GFP_FS which we are trying to eliminate.
> 
> But is that harmful?  ext4_writepage() (for example) says that it will
> not deadlock in that circumstance:

No, it is not harmful.

> 
>  * We can get recursively called as show below.
>  *
>  *      ext4_writepage() -> kmalloc() -> __alloc_pages() -> page_launder() ->
>  *              ext4_writepage()
>  *
>  * But since we don't do any block allocation we should not deadlock.
>  * Page also have the dirty flag cleared so we don't get recurive page_lock.

Yes, and it avoids this by checking for PF_MEMALLOC flag.

> 
> One might well argue that it's not *useful*; if we've gone into
> writepage already, there's no point in re-entering writepage.  And the
> last thing we want to do is 

?

> But I could see filesystems behaving differently when entered
> for writepage-for-regularly-scheduled-writeback versus
> writepage-for-shrinking, so maybe they can make progress.
> 

do_writepages() is the same for both, and hence the memalloc_* API patch.

> Maybe no real filesystem behaves that way.  We need feedback from
> filesystem people.

The idea is to:
* Keep a central location for check, rather than individual filesystem
writepage(). It should reduce code as well.
* Filesystem developers call memory allocations without thinking twice
about which GFP flag to use: GFP_KERNEL or GFP_NOFS. In essence
eliminate GFP_NOFS.


-- 
Goldwyn

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] fs: Perform writebacks under memalloc_nofs
  2018-03-27 15:13         ` Goldwyn Rodrigues
@ 2018-03-27 16:45           ` Matthew Wilcox
  2018-03-28  7:01           ` Michal Hocko
  1 sibling, 0 replies; 15+ messages in thread
From: Matthew Wilcox @ 2018-03-27 16:45 UTC (permalink / raw)
  To: Goldwyn Rodrigues; +Cc: Michal Hocko, linux-fsdevel, linux-mm, david

On Tue, Mar 27, 2018 at 10:13:53AM -0500, Goldwyn Rodrigues wrote:
> On 03/27/2018 09:21 AM, Matthew Wilcox wrote:
> > On Tue, Mar 27, 2018 at 07:52:48AM -0500, Goldwyn Rodrigues wrote:
> >> I am not sure if I missed a condition in the code, but here is one of
> >> the call lineup:
> >>
> >> writepages() -> writepage() -> kmalloc() -> __alloc_pages() ->
> >> __alloc_pages_nodemask -> __alloc_pages_slowpath ->
> >> __alloc_pages_direct_reclaim() -> try_to_free_pages() ->
> >> do_try_to_free_pages() -> shrink_zones() -> shrink_node() ->
> >> shrink_slab() -> do_shrink_slab() -> shrinker.scan_objects() ->
> >> super_cache_scan() -> prune_icache_sb() -> fs/inode.c:dispose_list() ->
> >> evict(inode) -> evict_inode() for ext4 ->  filemap_write_and_wait() ->
> >> filemap_fdatawrite(mapping) -> __filemap_fdatawrite_range() ->
> >> do_writepages -> writepages()
> >>
> >> Please note, most filesystems currently have a safeguard in writepage()
> >> which will return if the PF_MEMALLOC is set. The other safeguard is
> >> __GFP_FS which we are trying to eliminate.
> > 
> > But is that harmful?  ext4_writepage() (for example) says that it will
> > not deadlock in that circumstance:
> 
> No, it is not harmful.
> 
> > 
> >  * We can get recursively called as show below.
> >  *
> >  *      ext4_writepage() -> kmalloc() -> __alloc_pages() -> page_launder() ->
> >  *              ext4_writepage()
> >  *
> >  * But since we don't do any block allocation we should not deadlock.
> >  * Page also have the dirty flag cleared so we don't get recurive page_lock.
> 
> Yes, and it avoids this by checking for PF_MEMALLOC flag.
> 
> > 
> > One might well argue that it's not *useful*; if we've gone into
> > writepage already, there's no point in re-entering writepage.  And the
> > last thing we want to do is 
> 
> ?

Sorry, got cut off.  The last thing we want to do is blow the stack by
recursing too deeply, but I don't think we're going to go through this
loop more than once.

> > But I could see filesystems behaving differently when entered
> > for writepage-for-regularly-scheduled-writeback versus
> > writepage-for-shrinking, so maybe they can make progress.
> > 
> 
> do_writepages() is the same for both, and hence the memalloc_* API patch.

But we don't want to avoid this particular recursion.  We only need to
avoid the recursion if it would result in a deadlock.

> > Maybe no real filesystem behaves that way.  We need feedback from
> > filesystem people.
> 
> The idea is to:
> * Keep a central location for check, rather than individual filesystem
> writepage(). It should reduce code as well.
> * Filesystem developers call memory allocations without thinking twice
> about which GFP flag to use: GFP_KERNEL or GFP_NOFS. In essence
> eliminate GFP_NOFS.

I know the goal is to eliminate GFP_NOFS.  I'm very much in favour
of that idea.  I'm just not sure you're going about it the right way.
Probably we will have a good discussion about it next month.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] fs: Perform writebacks under memalloc_nofs
  2018-03-27 15:13         ` Goldwyn Rodrigues
  2018-03-27 16:45           ` Matthew Wilcox
@ 2018-03-28  7:01           ` Michal Hocko
  2018-03-28 23:57             ` Dave Chinner
  1 sibling, 1 reply; 15+ messages in thread
From: Michal Hocko @ 2018-03-28  7:01 UTC (permalink / raw)
  To: Goldwyn Rodrigues; +Cc: Matthew Wilcox, linux-fsdevel, linux-mm, david

On Tue 27-03-18 10:13:53, Goldwyn Rodrigues wrote:
> 
> 
> On 03/27/2018 09:21 AM, Matthew Wilcox wrote:
[...]
> > Maybe no real filesystem behaves that way.  We need feedback from
> > filesystem people.
> 
> The idea is to:
> * Keep a central location for check, rather than individual filesystem
> writepage(). It should reduce code as well.
> * Filesystem developers call memory allocations without thinking twice
> about which GFP flag to use: GFP_KERNEL or GFP_NOFS. In essence
> eliminate GFP_NOFS.

I do not think this is the right approach. We do want to eliminate
explicit GFP_NOFS usage, but we also want to reduce the overal GFP_NOFS
usage as well. The later requires that we drop the __GFP_FS only for
those contexts that really might cause reclaim recursion problems. So in
your example, it would be much better to add the scope into those
writepage(s) implementations which actually can trigger the writeback
from the reclaim path rather from the generic implementation which has
no means to know that.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] fs: Perform writebacks under memalloc_nofs
  2018-03-28  7:01           ` Michal Hocko
@ 2018-03-28 23:57             ` Dave Chinner
  2018-03-29  7:01               ` Michal Hocko
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2018-03-28 23:57 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Goldwyn Rodrigues, Matthew Wilcox, linux-fsdevel, linux-mm

On Wed, Mar 28, 2018 at 09:01:13AM +0200, Michal Hocko wrote:
> On Tue 27-03-18 10:13:53, Goldwyn Rodrigues wrote:
> > 
> > 
> > On 03/27/2018 09:21 AM, Matthew Wilcox wrote:
> [...]
> > > Maybe no real filesystem behaves that way.  We need feedback from
> > > filesystem people.
> > 
> > The idea is to:
> > * Keep a central location for check, rather than individual filesystem
> > writepage(). It should reduce code as well.
> > * Filesystem developers call memory allocations without thinking twice
> > about which GFP flag to use: GFP_KERNEL or GFP_NOFS. In essence
> > eliminate GFP_NOFS.
> 
> I do not think this is the right approach. We do want to eliminate
> explicit GFP_NOFS usage, but we also want to reduce the overal GFP_NOFS
> usage as well. The later requires that we drop the __GFP_FS only for
> those contexts that really might cause reclaim recursion problems.

As I've said before, moving to a scoped API will not reduce the
number of GFP_NOFS scope allocation points - removing individual
GFP_NOFS annotations doesn't do anything to avoid the deadlock paths
it protects against.

The issue is that GFP_NOFS is a big hammer - it stops reclaim from
all filesystem scopes, not just the one we hold locks on and are
doing the allocation for. i.e. we can be in one filesystem and quite
safely do reclaim from other filesystems. The global scope of
GFP_NOFS just doesn't allow this sort of fine-grained control to be
expressed in reclaim.

IOWs, if we want to reduce the scope of GFP_NOFS, we need a context
to be passed from allocation to reclaim so that the reclaim context
can check that it's a safe allocation context to reclaim from. e.g.
for GFP_NOFS, we can use the superblock of the allocating filesystem
as the context, and check it against the superblock that the current
reclaim context (e.g. shrinker invocation) belongs to. If they
match, we skip it. If they don't match, then we can perform reclaim
on that context.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] fs: Perform writebacks under memalloc_nofs
  2018-03-28 23:57             ` Dave Chinner
@ 2018-03-29  7:01               ` Michal Hocko
  2018-03-31 21:21                 ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Hocko @ 2018-03-29  7:01 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Goldwyn Rodrigues, Matthew Wilcox, linux-fsdevel, linux-mm

On Thu 29-03-18 10:57:02, Dave Chinner wrote:
> On Wed, Mar 28, 2018 at 09:01:13AM +0200, Michal Hocko wrote:
> > On Tue 27-03-18 10:13:53, Goldwyn Rodrigues wrote:
> > > 
> > > 
> > > On 03/27/2018 09:21 AM, Matthew Wilcox wrote:
> > [...]
> > > > Maybe no real filesystem behaves that way.  We need feedback from
> > > > filesystem people.
> > > 
> > > The idea is to:
> > > * Keep a central location for check, rather than individual filesystem
> > > writepage(). It should reduce code as well.
> > > * Filesystem developers call memory allocations without thinking twice
> > > about which GFP flag to use: GFP_KERNEL or GFP_NOFS. In essence
> > > eliminate GFP_NOFS.
> > 
> > I do not think this is the right approach. We do want to eliminate
> > explicit GFP_NOFS usage, but we also want to reduce the overal GFP_NOFS
> > usage as well. The later requires that we drop the __GFP_FS only for
> > those contexts that really might cause reclaim recursion problems.
> 
> As I've said before, moving to a scoped API will not reduce the
> number of GFP_NOFS scope allocation points - removing individual
> GFP_NOFS annotations doesn't do anything to avoid the deadlock paths
> it protects against.

Maybe it doesn't for some filesystems like xfs but I am quite sure it
will for some others which overuse GFP_NOFS just to be sure. E.g. btrfs.

> The issue is that GFP_NOFS is a big hammer - it stops reclaim from
> all filesystem scopes, not just the one we hold locks on and are
> doing the allocation for. i.e. we can be in one filesystem and quite
> safely do reclaim from other filesystems. The global scope of
> GFP_NOFS just doesn't allow this sort of fine-grained control to be
> expressed in reclaim.

Agreed!

> IOWs, if we want to reduce the scope of GFP_NOFS, we need a context
> to be passed from allocation to reclaim so that the reclaim context
> can check that it's a safe allocation context to reclaim from. e.g.
> for GFP_NOFS, we can use the superblock of the allocating filesystem
> as the context, and check it against the superblock that the current
> reclaim context (e.g. shrinker invocation) belongs to. If they
> match, we skip it. If they don't match, then we can perform reclaim
> on that context.

Agreed again. But this is hardly doable without actually defining what
those scopes are. Once we have them we can expand to add more context.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] fs: Perform writebacks under memalloc_nofs
  2018-03-29  7:01               ` Michal Hocko
@ 2018-03-31 21:21                 ` Dave Chinner
  0 siblings, 0 replies; 15+ messages in thread
From: Dave Chinner @ 2018-03-31 21:21 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Goldwyn Rodrigues, Matthew Wilcox, linux-fsdevel, linux-mm

On Thu, Mar 29, 2018 at 09:01:08AM +0200, Michal Hocko wrote:
> On Thu 29-03-18 10:57:02, Dave Chinner wrote:
> > On Wed, Mar 28, 2018 at 09:01:13AM +0200, Michal Hocko wrote:
> > > On Tue 27-03-18 10:13:53, Goldwyn Rodrigues wrote:
> > > > 
> > > > 
> > > > On 03/27/2018 09:21 AM, Matthew Wilcox wrote:
> > > [...]
> > > > > Maybe no real filesystem behaves that way.  We need feedback from
> > > > > filesystem people.
> > > > 
> > > > The idea is to:
> > > > * Keep a central location for check, rather than individual filesystem
> > > > writepage(). It should reduce code as well.
> > > > * Filesystem developers call memory allocations without thinking twice
> > > > about which GFP flag to use: GFP_KERNEL or GFP_NOFS. In essence
> > > > eliminate GFP_NOFS.
> > > 
> > > I do not think this is the right approach. We do want to eliminate
> > > explicit GFP_NOFS usage, but we also want to reduce the overal GFP_NOFS
> > > usage as well. The later requires that we drop the __GFP_FS only for
> > > those contexts that really might cause reclaim recursion problems.
> > 
> > As I've said before, moving to a scoped API will not reduce the
> > number of GFP_NOFS scope allocation points - removing individual
> > GFP_NOFS annotations doesn't do anything to avoid the deadlock paths
> > it protects against.
> 
> Maybe it doesn't for some filesystems like xfs but I am quite sure it
> will for some others which overuse GFP_NOFS just to be sure. E.g. btrfs.
> 
> > The issue is that GFP_NOFS is a big hammer - it stops reclaim from
> > all filesystem scopes, not just the one we hold locks on and are
> > doing the allocation for. i.e. we can be in one filesystem and quite
> > safely do reclaim from other filesystems. The global scope of
> > GFP_NOFS just doesn't allow this sort of fine-grained control to be
> > expressed in reclaim.
> 
> Agreed!
> 
> > IOWs, if we want to reduce the scope of GFP_NOFS, we need a context
> > to be passed from allocation to reclaim so that the reclaim context
> > can check that it's a safe allocation context to reclaim from. e.g.
> > for GFP_NOFS, we can use the superblock of the allocating filesystem
> > as the context, and check it against the superblock that the current
> > reclaim context (e.g. shrinker invocation) belongs to. If they
> > match, we skip it. If they don't match, then we can perform reclaim
> > on that context.
> 
> Agreed again. But this is hardly doable without actually defining what
> those scopes are. Once we have them we can expand to add more context.

Some filesystems already have well defined scopes (e.g. XFS's
transaction scope) - all we need is the infrastructure that passes
the scope pointer to reclaim rather than having the allocation code
intercept PF_MEMALLOC_NOFS and turn it into GFP_NOFS allocation
context...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2018-03-31 21:21 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-21 22:44 [PATCH 0/3] fs: Use memalloc_nofs_save/restore scope API Goldwyn Rodrigues
2018-03-21 22:44 ` [PATCH 1/3] fs: Perform writebacks under memalloc_nofs Goldwyn Rodrigues
2018-03-22  7:08   ` Michal Hocko
2018-03-27 12:52     ` Goldwyn Rodrigues
2018-03-27 14:21       ` Matthew Wilcox
2018-03-27 15:13         ` Goldwyn Rodrigues
2018-03-27 16:45           ` Matthew Wilcox
2018-03-28  7:01           ` Michal Hocko
2018-03-28 23:57             ` Dave Chinner
2018-03-29  7:01               ` Michal Hocko
2018-03-31 21:21                 ` Dave Chinner
2018-03-21 22:44 ` [PATCH 2/3] fs: use memalloc_nofs API while shrinking superblock Goldwyn Rodrigues
2018-03-22  7:09   ` Michal Hocko
2018-03-21 22:44 ` [PATCH 3/3] fs: Use memalloc_nofs_save in generic_perform_write Goldwyn Rodrigues
2018-03-22  7:10   ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).