All of lore.kernel.org
* [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read
@ 2015-03-18 14:09 ` Michal Hocko
  0 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2015-03-18 14:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Johannes Weiner, Mel Gorman, Rik van Riel, Neil Brown,
	Tetsuo Handa, Sage Weil, Mark Fasheh, linux-mm, LKML

page_cache_read has historically been using page_cache_alloc_cold to
allocate a new page. This means that mapping_gfp_mask is used as the
base for the gfp_mask. Many filesystems set this mask to GFP_NOFS to
prevent fs recursion issues. page_cache_read is, however, not called
from the fs layer so it doesn't need this protection. Even ceph and
ocfs2, which call filemap_fault from their fault handlers, seem to be
OK because they do not take any fs lock before invoking the generic
implementation.
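
For reference, the helpers involved look roughly like this (a simplified
sketch of the include/linux/pagemap.h definitions, not part of this change):

static inline gfp_t mapping_gfp_mask(struct address_space *mapping)
{
	/* the owner's allocation hints live in the low bits of mapping->flags */
	return (__force gfp_t)mapping->flags & __GFP_BITS_MASK;
}

static inline struct page *page_cache_alloc_cold(struct address_space *x)
{
	/* so a GFP_NOFS hint set by the filesystem becomes the allocation mask */
	return __page_cache_alloc(mapping_gfp_mask(x) | __GFP_COLD);
}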

The protection might even be harmful. There is a strong push to fail
GFP_NOFS allocations rather than loop within the allocator indefinitely
with a very limited reclaim ability. Once we start failing those requests
the OOM killer might be triggered prematurely because the page cache
allocation failure is propagated up the page fault path and ends up in
pagefault_out_of_memory.

Use the GFP_KERNEL mask instead because it is safe from the reclaim
recursion POV. We are already doing GFP_KERNEL allocations down the
add_to_page_cache_lru path.

Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/filemap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 968cd8e03d2e..26f62ba79f50 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1752,7 +1752,7 @@ static int page_cache_read(struct file *file, pgoff_t offset)
 	int ret;
 
 	do {
-		page = page_cache_alloc_cold(mapping);
+		page = __page_cache_alloc(GFP_KERNEL|__GFP_COLD);
 		if (!page)
 			return -ENOMEM;
 
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read
  2015-03-18 14:09 ` Michal Hocko
@ 2015-03-18 14:32   ` Rik van Riel
  -1 siblings, 0 replies; 47+ messages in thread
From: Rik van Riel @ 2015-03-18 14:32 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: Al Viro, Johannes Weiner, Mel Gorman, Neil Brown, Tetsuo Handa,
	Sage Weil, Mark Fasheh, linux-mm, LKML

On 03/18/2015 10:09 AM, Michal Hocko wrote:

> diff --git a/mm/filemap.c b/mm/filemap.c
> index 968cd8e03d2e..26f62ba79f50 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1752,7 +1752,7 @@ static int page_cache_read(struct file *file, pgoff_t offset)
>  	int ret;
>  
>  	do {
> -		page = page_cache_alloc_cold(mapping);
> +		page = __page_cache_alloc(GFP_KERNEL|__GFP_COLD);
>  		if (!page)
>  			return -ENOMEM;

Won't this break on highmem systems, by failing to
allocate the page cache from highmem, where previously
it would?


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read
  2015-03-18 14:32   ` Rik van Riel
@ 2015-03-18 14:37     ` Michal Hocko
  -1 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2015-03-18 14:37 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrew Morton, Al Viro, Johannes Weiner, Mel Gorman, Neil Brown,
	Tetsuo Handa, Sage Weil, Mark Fasheh, linux-mm, LKML

On Wed 18-03-15 10:32:57, Rik van Riel wrote:
> On 03/18/2015 10:09 AM, Michal Hocko wrote:
> 
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 968cd8e03d2e..26f62ba79f50 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -1752,7 +1752,7 @@ static int page_cache_read(struct file *file, pgoff_t offset)
> >  	int ret;
> >  
> >  	do {
> > -		page = page_cache_alloc_cold(mapping);
> > +		page = __page_cache_alloc(GFP_KERNEL|__GFP_COLD);
> >  		if (!page)
> >  			return -ENOMEM;
> 
> Won't this break on highmem systems, by failing to
> allocate the page cache from highmem, where previously
> it would?

It will! This is broken. I can see inode_init_always now. We need to add
GFP_HIGHUSER_MOVABLE here.
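
For illustration, the fixed-up line would presumably end up looking something
like this (not the hunk I posted above):

	page = __page_cache_alloc(GFP_HIGHUSER_MOVABLE | __GFP_COLD);

GFP_HIGHUSER_MOVABLE keeps the highmem/movable placement while still allowing
fs/io reclaim, which is the point of dropping GFP_NOFS here.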

Thanks for pointing this out!

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read
  2015-03-18 14:09 ` Michal Hocko
@ 2015-03-18 14:38   ` Mel Gorman
  -1 siblings, 0 replies; 47+ messages in thread
From: Mel Gorman @ 2015-03-18 14:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Al Viro, Johannes Weiner, Rik van Riel,
	Neil Brown, Tetsuo Handa, Sage Weil, Mark Fasheh, linux-mm, LKML

On Wed, Mar 18, 2015 at 03:09:26PM +0100, Michal Hocko wrote:
> page_cache_read has been historically using page_cache_alloc_cold to
> allocate a new page. This means that mapping_gfp_mask is used as the
> base for the gfp_mask. Many filesystems are setting this mask to
> GFP_NOFS to prevent from fs recursion issues. page_cache_read is,
> however, not called from the fs layer so it doesn't need this
> protection. Even ceph and ocfs2 which call filemap_fault from their
> fault handlers seem to be OK because they are not taking any fs lock
> before invoking generic implementation.
> 
> The protection might be even harmful. There is a strong push to fail
> GFP_NOFS allocations rather than loop within allocator indefinitely with
> a very limited reclaim ability. Once we start failing those requests
> the OOM killer might be triggered prematurely because the page cache
> allocation failure is propagated up the page fault path and end up in
> pagefault_out_of_memory.
> 
> Use GFP_KERNEL mask instead because it is safe from the reclaim
> recursion POV. We are already doing GFP_KERNEL allocations down
> add_to_page_cache_lru path.
> 
> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Signed-off-by: Michal Hocko <mhocko@suse.cz>

I'm very far behind after LSF/MM so I do not know where this came out of,
but it loses addressing restriction hints from drivers such as

drivers/gpu/drm/gma500/gem.c: mapping_set_gfp_mask(r->gem.filp->f_mapping, GFP_KERNEL | __GFP_DMA32);

It also loses mobility hints for fragmentation avoidance.

fs/inode.c:     mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);

If users of mapping_set_gfp_mask are now being ignored then it should at
least trigger a once-off warning that the flags are being ignored, so
it's obvious if a recursion does occur and causes problems.
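
A minimal sketch of such a warning, assuming it sits next to the allocation
in page_cache_read (illustrative only, not a concrete proposal):

	/* shout once if the mapping owner asked for anything non-default */
	WARN_ONCE(mapping_gfp_mask(mapping) != GFP_HIGHUSER_MOVABLE,
		  "page_cache_read: ignoring mapping_gfp_mask\n");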

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read
  2015-03-18 14:38   ` Mel Gorman
@ 2015-03-18 14:43     ` Michal Hocko
  -1 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2015-03-18 14:43 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Al Viro, Johannes Weiner, Rik van Riel,
	Neil Brown, Tetsuo Handa, Sage Weil, Mark Fasheh, linux-mm, LKML

On Wed 18-03-15 14:38:47, Mel Gorman wrote:
> On Wed, Mar 18, 2015 at 03:09:26PM +0100, Michal Hocko wrote:
> > page_cache_read has been historically using page_cache_alloc_cold to
> > allocate a new page. This means that mapping_gfp_mask is used as the
> > base for the gfp_mask. Many filesystems are setting this mask to
> > GFP_NOFS to prevent from fs recursion issues. page_cache_read is,
> > however, not called from the fs layer so it doesn't need this
> > protection. Even ceph and ocfs2 which call filemap_fault from their
> > fault handlers seem to be OK because they are not taking any fs lock
> > before invoking generic implementation.
> > 
> > The protection might be even harmful. There is a strong push to fail
> > GFP_NOFS allocations rather than loop within allocator indefinitely with
> > a very limited reclaim ability. Once we start failing those requests
> > the OOM killer might be triggered prematurely because the page cache
> > allocation failure is propagated up the page fault path and end up in
> > pagefault_out_of_memory.
> > 
> > Use GFP_KERNEL mask instead because it is safe from the reclaim
> > recursion POV. We are already doing GFP_KERNEL allocations down
> > add_to_page_cache_lru path.
> > 
> > Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > Signed-off-by: Michal Hocko <mhocko@suse.cz>
> 
> I'm very far behind after LSF/MM so do not know where this came out of
> but it loses addressing restriction hints from the driver such as
> 
> drivers/gpu/drm/gma500/gem.c: mapping_set_gfp_mask(r->gem.filp->f_mapping, GFP_KERNEL | __GFP_DMA32);

OK, I have missed these drivers. Scratch the patch for now; I will think
about it more.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read
  2015-03-18 14:09 ` Michal Hocko
@ 2015-03-18 14:44   ` Rik van Riel
  -1 siblings, 0 replies; 47+ messages in thread
From: Rik van Riel @ 2015-03-18 14:44 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: Al Viro, Johannes Weiner, Mel Gorman, Neil Brown, Tetsuo Handa,
	Sage Weil, Mark Fasheh, linux-mm, LKML

On 03/18/2015 10:09 AM, Michal Hocko wrote:
> page_cache_read has been historically using page_cache_alloc_cold to
> allocate a new page. This means that mapping_gfp_mask is used as the
> base for the gfp_mask. Many filesystems are setting this mask to
> GFP_NOFS to prevent from fs recursion issues. page_cache_read is,
> however, not called from the fs layer 

Is that true for filesystems that have directories in
the page cache?


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read
  2015-03-18 14:44   ` Rik van Riel
@ 2015-03-18 14:55     ` Michal Hocko
  -1 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2015-03-18 14:55 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrew Morton, Al Viro, Johannes Weiner, Mel Gorman, Neil Brown,
	Tetsuo Handa, Sage Weil, Mark Fasheh, linux-mm, LKML

On Wed 18-03-15 10:44:11, Rik van Riel wrote:
> On 03/18/2015 10:09 AM, Michal Hocko wrote:
> > page_cache_read has been historically using page_cache_alloc_cold to
> > allocate a new page. This means that mapping_gfp_mask is used as the
> > base for the gfp_mask. Many filesystems are setting this mask to
> > GFP_NOFS to prevent from fs recursion issues. page_cache_read is,
> > however, not called from the fs layer 
> 
> Is that true for filesystems that have directories in
> the page cache?

I haven't found any explicit callers of filemap_fault except for ocfs2
and ceph, and those seem OK to me. Which filesystems do you have in mind?

Btw. how would that work, as we already have a GFP_KERNEL allocation a few
lines below?
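
For context, the loop in page_cache_read currently looks roughly like this
(simplified from memory); the hardcoded GFP_KERNEL below is the allocation I
mean:

	do {
		page = page_cache_alloc_cold(mapping);
		if (!page)
			return -ENOMEM;

		ret = add_to_page_cache_lru(page, mapping, offset, GFP_KERNEL);
		if (ret == 0)
			ret = mapping->a_ops->readpage(file, page);
		else if (ret == -EEXIST)
			ret = 0; /* losing race to add is OK */

		page_cache_release(page);

	} while (ret == AOP_TRUNCATED_PAGE);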

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read
  2015-03-18 14:09 ` Michal Hocko
@ 2015-03-18 15:45   ` Michal Hocko
  -1 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2015-03-18 15:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Johannes Weiner, Mel Gorman, Rik van Riel, Neil Brown,
	Tetsuo Handa, Sage Weil, Mark Fasheh, linux-mm, LKML

What do you think about this v2? I cannot say I like it, but I really
dislike the whole mapping_gfp_mask API, to be honest.
---
From d88010d6f5f59d7eb87b691e27e201d12cab9141 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Wed, 18 Mar 2015 16:06:40 +0100
Subject: [PATCH] mm: Allow __GFP_FS for page_cache_read page cache allocation

page_cache_read has historically been using page_cache_alloc_cold to
allocate a new page. This means that mapping_gfp_mask is used as the
base for the gfp_mask. Many filesystems set this mask to GFP_NOFS to
prevent fs recursion issues. page_cache_read is, however, not called
from the fs layer so it doesn't need this protection. Even ceph and
ocfs2, which call filemap_fault from their fault handlers, seem to be
OK because they do not take any fs lock before invoking the generic
implementation.

The protection might even be harmful. There is a strong push to fail
GFP_NOFS allocations rather than loop within the allocator indefinitely
with a very limited reclaim ability. Once we start failing those requests
the OOM killer might be triggered prematurely because the page cache
allocation failure is propagated up the page fault path and ends up in
pagefault_out_of_memory.

Add __GFP_FS and __GFP_IO to the gfp mask coming from the mapping to fix
this issue.

Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/filemap.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 968cd8e03d2e..8b50d5eb52b2 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1752,7 +1752,15 @@ static int page_cache_read(struct file *file, pgoff_t offset)
 	int ret;
 
 	do {
-		page = page_cache_alloc_cold(mapping);
+		gfp_t page_cache_gfp = mapping_gfp_mask(mapping)|__GFP_COLD;
+
+		/*
+		 * This code is not called from the fs layer so we do not need
+		 * reclaim recursion protection. !__GFP_FS might fail too easily
+		 * and trigger the OOM killer prematurely.
+		 */
+		page_cache_gfp |= __GFP_FS | __GFP_IO;
+		page = __page_cache_alloc(page_cache_gfp);
 		if (!page)
 			return -ENOMEM;
 
-- 
2.1.4

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read
  2015-03-18 15:45   ` Michal Hocko
  (?)
@ 2015-03-18 21:38   ` NeilBrown
  2015-03-19 13:55       ` Michal Hocko
  -1 siblings, 1 reply; 47+ messages in thread
From: NeilBrown @ 2015-03-18 21:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Al Viro, Johannes Weiner, Mel Gorman,
	Rik van Riel, Tetsuo Handa, Sage Weil, Mark Fasheh, linux-mm,
	LKML

On Wed, 18 Mar 2015 16:45:40 +0100 Michal Hocko <mhocko@suse.cz> wrote:

> What do you think about this v2? I cannot say I would like it but I
> really dislike the whole mapping_gfp_mask API to be honest.
> ---
> >From d88010d6f5f59d7eb87b691e27e201d12cab9141 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.cz>
> Date: Wed, 18 Mar 2015 16:06:40 +0100
> Subject: [PATCH] mm: Allow __GFP_FS for page_cache_read page cache allocation
> 
> page_cache_read has been historically using page_cache_alloc_cold to
> allocate a new page. This means that mapping_gfp_mask is used as the
> base for the gfp_mask. Many filesystems are setting this mask to
> GFP_NOFS to prevent from fs recursion issues. page_cache_read is,
> however, not called from the fs layer so it doesn't need this
> protection. Even ceph and ocfs2 which call filemap_fault from their
> fault handlers seem to be OK because they are not taking any fs lock
> before invoking generic implementation.
> 
> The protection might be even harmful. There is a strong push to fail
> GFP_NOFS allocations rather than loop within allocator indefinitely with
> a very limited reclaim ability. Once we start failing those requests
> the OOM killer might be triggered prematurely because the page cache
> allocation failure is propagated up the page fault path and end up in
> pagefault_out_of_memory.
> 
> Add __GFP_FS and __GFPIO to the gfp mask which is coming from the
> mapping to fix this issue.
> 
> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
> ---
>  mm/filemap.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 968cd8e03d2e..8b50d5eb52b2 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1752,7 +1752,15 @@ static int page_cache_read(struct file *file, pgoff_t offset)
>  	int ret;
>  
>  	do {
> -		page = page_cache_alloc_cold(mapping);
> +		gfp_t page_cache_gfp = mapping_gfp_mask(mapping)|__GFP_COLD;
> +
> +		/*
> +		 * This code is not called from the fs layer so we do not need
> +		 * reclaim recursion protection. !GFP_FS might fail too easy
> +		 * and trigger OOM killer prematuraly.
> +		 */
> +		page_cache_gfp |= __GFP_FS | __GFP_IO;
> +		page = __page_cache_alloc(page_cache_gfp);
>  		if (!page)
>  			return -ENOMEM;
>  

Nearly half the places in the kernel which call mapping_gfp_mask() remove the
__GFP_FS bit.

That suggests to me that it might make sense to have
   mapping_gfp_mask_fs()
and
   mapping_gfp_mask_nofs()

and let the presence of __GFP_FS (and __GFP_IO) be determined by the
call-site rather than the filesystem.
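
Purely as a sketch of that idea (these helpers do not exist in the tree, the
names are just the suggestion above), they might look like:

static inline gfp_t mapping_gfp_mask_nofs(struct address_space *mapping)
{
	/* fs-internal callers: strip the recursion-prone bits */
	return mapping_gfp_mask(mapping) & ~(__GFP_FS | __GFP_IO);
}

static inline gfp_t mapping_gfp_mask_fs(struct address_space *mapping)
{
	/* top-level callers such as the fault path may re-enable them */
	return mapping_gfp_mask(mapping) | __GFP_FS | __GFP_IO;
}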

However I am a bit concerned about drivers/block/loop.c.
Might a filesystem read on the loop block device wait for a page_cache_read()
on the loop-mounted file?  In that case you really don't want __GFP_FS set
when allocating that page.

NeilBrown

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read
  2015-03-18 14:55     ` Michal Hocko
@ 2015-03-19  7:14       ` Dave Chinner
  -1 siblings, 0 replies; 47+ messages in thread
From: Dave Chinner @ 2015-03-19  7:14 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Rik van Riel, Andrew Morton, Al Viro, Johannes Weiner,
	Mel Gorman, Neil Brown, Tetsuo Handa, Sage Weil, Mark Fasheh,
	linux-mm, LKML

On Wed, Mar 18, 2015 at 03:55:28PM +0100, Michal Hocko wrote:
> On Wed 18-03-15 10:44:11, Rik van Riel wrote:
> > On 03/18/2015 10:09 AM, Michal Hocko wrote:
> > > page_cache_read has been historically using page_cache_alloc_cold to
> > > allocate a new page. This means that mapping_gfp_mask is used as the
> > > base for the gfp_mask. Many filesystems are setting this mask to
> > > GFP_NOFS to prevent from fs recursion issues. page_cache_read is,
> > > however, not called from the fs layer 
> > 
> > Is that true for filesystems that have directories in
> > the page cache?
> 
> I haven't found any explicit callers of filemap_fault except for ocfs2
> and ceph and those seem OK to me. Which filesystems you have in mind?

Just about every major filesystem calls filemap_fault through the
.fault callout.

C symbol: filemap_fault

  File           Function            Line
  0 9p/vfs_file.c  <global>             831 .fault = filemap_fault,
  1 9p/vfs_file.c  <global>             838 .fault = filemap_fault,
  2 btrfs/file.c   <global>            2081 .fault = filemap_fault,
  3 cifs/file.c    <global>            3242 .fault = filemap_fault,
  4 ext4/file.c    <global>             215 .fault = filemap_fault,
  5 f2fs/file.c    <global>              93 .fault = filemap_fault,
  6 fuse/file.c    <global>            2062 .fault = filemap_fault,
  7 gfs2/file.c    <global>             498 .fault = filemap_fault,
  8 nfs/file.c     <global>             653 .fault = filemap_fault,
  9 nilfs2/file.c  <global>             128 .fault = filemap_fault,
  a ubifs/file.c   <global>            1536 .fault = filemap_fault,
  b xfs/xfs_file.c <global>            1420 .fault = filemap_fault,


> Btw. how would that work as we already have GFP_KERNEL allocation few
> lines below?

GFP_KERNEL allocation for mappings is simply wrong. All mapping
allocations where the caller cannot pass a gfp_mask need to obey
the mapping_gfp_mask that is set by the mapping owner....
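
For example, this is roughly how an owner expresses that constraint (a
simplified sketch of what xfs_setup_inode does) and how a generic consumer is
expected to honour it:

	/* the filesystem narrows the mask once, at inode setup time ... */
	mapping_set_gfp_mask(inode->i_mapping,
			     mapping_gfp_mask(inode->i_mapping) & ~__GFP_FS);

	/* ... and later page cache allocations are expected to respect it */
	page = __page_cache_alloc(mapping_gfp_mask(inode->i_mapping) | __GFP_COLD);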

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: Use GFP_KERNEL allocation for the page cache inpage_cache_read
  2015-03-19  7:14       ` Dave Chinner
@ 2015-03-19 11:11         ` Tetsuo Handa
  -1 siblings, 0 replies; 47+ messages in thread
From: Tetsuo Handa @ 2015-03-19 11:11 UTC (permalink / raw)
  To: david, mhocko
  Cc: riel, akpm, viro, hannes, mgorman, neilb, sage, mfasheh,
	linux-mm, linux-kernel

Dave Chinner wrote:
> On Wed, Mar 18, 2015 at 03:55:28PM +0100, Michal Hocko wrote:
> > On Wed 18-03-15 10:44:11, Rik van Riel wrote:
> > > On 03/18/2015 10:09 AM, Michal Hocko wrote:
> > > > page_cache_read has been historically using page_cache_alloc_cold to
> > > > allocate a new page. This means that mapping_gfp_mask is used as the
> > > > base for the gfp_mask. Many filesystems are setting this mask to
> > > > GFP_NOFS to prevent from fs recursion issues. page_cache_read is,
> > > > however, not called from the fs layer 
> > > 
> > > Is that true for filesystems that have directories in
> > > the page cache?
> > 
> > I haven't found any explicit callers of filemap_fault except for ocfs2
> > and ceph and those seem OK to me. Which filesystems you have in mind?
> 
> Just about every major filesystem calls filemap_fault through the
> .fault callout.
> 
> C symbol: filemap_fault
> 
>   File           Function            Line
>   0 9p/vfs_file.c  <global>             831 .fault = filemap_fault,
>   1 9p/vfs_file.c  <global>             838 .fault = filemap_fault,
>   2 btrfs/file.c   <global>            2081 .fault = filemap_fault,
>   3 cifs/file.c    <global>            3242 .fault = filemap_fault,
>   4 ext4/file.c    <global>             215 .fault = filemap_fault,
>   5 f2fs/file.c    <global>              93 .fault = filemap_fault,
>   6 fuse/file.c    <global>            2062 .fault = filemap_fault,
>   7 gfs2/file.c    <global>             498 .fault = filemap_fault,
>   8 nfs/file.c     <global>             653 .fault = filemap_fault,
>   9 nilfs2/file.c  <global>             128 .fault = filemap_fault,
>   a ubifs/file.c   <global>            1536 .fault = filemap_fault,
>   b xfs/xfs_file.c <global>            1420 .fault = filemap_fault,
> 
> 
> > Btw. how would that work as we already have GFP_KERNEL allocation few
> > lines below?
> 
> GFP_KERNEL allocation for mappings is simply wrong. All mapping
> allocations where the caller cannot pass a gfp_mask need to obey
> the mapping_gfp_mask that is set by the mapping owner....
> 

Is there any chance to annotate which GFP flags need to be used, like
https://lkml.org/lkml/2015/3/17/507 ?

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read
  2015-03-19  7:14       ` Dave Chinner
@ 2015-03-19 12:44         ` Michal Hocko
  -1 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2015-03-19 12:44 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Rik van Riel, Andrew Morton, Al Viro, Johannes Weiner,
	Mel Gorman, Neil Brown, Tetsuo Handa, Sage Weil, Mark Fasheh,
	linux-mm, LKML

On Thu 19-03-15 18:14:39, Dave Chinner wrote:
> On Wed, Mar 18, 2015 at 03:55:28PM +0100, Michal Hocko wrote:
> > On Wed 18-03-15 10:44:11, Rik van Riel wrote:
> > > On 03/18/2015 10:09 AM, Michal Hocko wrote:
> > > > page_cache_read has been historically using page_cache_alloc_cold to
> > > > allocate a new page. This means that mapping_gfp_mask is used as the
> > > > base for the gfp_mask. Many filesystems are setting this mask to
> > > > GFP_NOFS to prevent from fs recursion issues. page_cache_read is,
> > > > however, not called from the fs layer 
> > > 
> > > Is that true for filesystems that have directories in
> > > the page cache?
> > 
> > I haven't found any explicit callers of filemap_fault except for ocfs2
> > and ceph and those seem OK to me. Which filesystems you have in mind?
> 
> Just about every major filesystem calls filemap_fault through the
> .fault callout.

That is right but the callback is called from the VM layer where we
obviously do not take any fs locks (we are holding only mmap_sem
for reading).
Those who call filemap_fault directly (ocfs2 and ceph) and those who
invoke the callback directly (qxl_ttm_fault, radeon_ttm_fault,
kernfs_vma_fault, shm_fault) seem to be safe from the reclaim recursion
POV. radeon_ttm_fault takes a lock for reading but that one doesn't seem
to be used from the reclaim context.

Or did I miss your point? Are you concerned about some fs overloading
filemap_fault and doing some locking before delegating to filemap_fault?

> C symbol: filemap_fault
> 
>   File           Function            Line
>   0 9p/vfs_file.c  <global>             831 .fault = filemap_fault,
>   1 9p/vfs_file.c  <global>             838 .fault = filemap_fault,
>   2 btrfs/file.c   <global>            2081 .fault = filemap_fault,
>   3 cifs/file.c    <global>            3242 .fault = filemap_fault,
>   4 ext4/file.c    <global>             215 .fault = filemap_fault,
>   5 f2fs/file.c    <global>              93 .fault = filemap_fault,
>   6 fuse/file.c    <global>            2062 .fault = filemap_fault,
>   7 gfs2/file.c    <global>             498 .fault = filemap_fault,
>   8 nfs/file.c     <global>             653 .fault = filemap_fault,
>   9 nilfs2/file.c  <global>             128 .fault = filemap_fault,
>   a ubifs/file.c   <global>            1536 .fault = filemap_fault,
>   b xfs/xfs_file.c <global>            1420 .fault = filemap_fault,
> 
> 
> > Btw. how would that work as we already have GFP_KERNEL allocation few
> > lines below?
> 
> GFP_KERNEL allocation for mappings is simply wrong. All mapping
> allocations where the caller cannot pass a gfp_mask need to obey
> the mapping_gfp_mask that is set by the mapping owner....

Hmm, I thought this was true only when the function might be called from
the fs path.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read
  2015-03-18 21:38   ` NeilBrown
@ 2015-03-19 13:55       ` Michal Hocko
  0 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2015-03-19 13:55 UTC (permalink / raw)
  To: NeilBrown
  Cc: Andrew Morton, Al Viro, Johannes Weiner, Mel Gorman,
	Rik van Riel, Tetsuo Handa, Sage Weil, Mark Fasheh, linux-mm,
	LKML

On Thu 19-03-15 08:38:35, Neil Brown wrote:
[...]
> Nearly half the places in the kernel which call mapping_gfp_mask() remove the
> __GFP_FS bit.
> 
> That suggests to me that it might make sense to have
>    mapping_gfp_mask_fs()
> and
>    mapping_gfp_mask_nofs()
>
> and let the presence of __GFP_FS (and __GFP_IO) be determined by the
> call-site rather than the filesystem.

Sounds reasonable to me, but filesystems tend to use this in very
different ways:
- xfs drops GFP_FS in xfs_setup_inode so all page cache allocations are
  NOFS.
- reiserfs drops GFP_FS only before calling read_mapping_page in
  reiserfs_get_page and never restores the original mask.
- btrfs doesn't seem to rely on mapping_gfp_mask for anything other than
  btree_inode (unless it gets inherited in a way I haven't noticed).
- ext* doesn't seem to rely on the mapping gfp mask at all.

So it is not clear to me how we should push that out to the call sites.
But I guess we can change at least the page fault path like the following.
I like it much more than the previous way, which is too hackish.
---
From 0aff17ef2daf1e4d7b5c7a2fcf3c0aac2670f527 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Thu, 19 Mar 2015 14:46:07 +0100
Subject: [PATCH] mm: Allow __GFP_FS for page_cache_read page cache allocation

page_cache_read has historically been using page_cache_alloc_cold to
allocate a new page. This means that mapping_gfp_mask is used as the
base for the gfp_mask. Many filesystems set this mask to GFP_NOFS to
prevent fs recursion issues. page_cache_read is, however, not called
from the fs layer so it doesn't need this protection.

The protection might even be harmful. There is a strong push to fail
GFP_NOFS allocations rather than loop within the allocator indefinitely
with a very limited reclaim ability. Once we start failing those requests
the OOM killer might be triggered prematurely because the page cache
allocation failure is propagated up the page fault path and ends up in
pagefault_out_of_memory.

This patch updates mapping_gfp_mask and adds both __GFP_FS and __GFP_IO
in __do_fault before the address space fault handler is called, and
restores the original flags after the callback. This allows direct
reclaim to go into the FS if needed, prevents a premature OOM killer for
most filesystems, and still allows the FS layer to override the mask
from its .fault handler should there be a need.

Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memory.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index c4b08e3ef058..ba528787e25f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2701,13 +2701,24 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
 {
 	struct vm_fault vmf;
 	int ret;
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	gfp_t mapping_gfp;
 
 	vmf.virtual_address = (void __user *)(address & PAGE_MASK);
 	vmf.pgoff = pgoff;
 	vmf.flags = flags;
 	vmf.page = NULL;
 
+	/*
+	 * Some filesystems always drop __GFP_FS to prevent reclaim
+	 * recursion back into FS code. This is not the case here because
+	 * we are at the top of the call chain. Add the __GFP_FS and
+	 * __GFP_IO flags to prevent a premature OOM killer.
+	 */
+	mapping_gfp = mapping_gfp_mask(mapping);
+	mapping_set_gfp_mask(mapping, mapping_gfp | __GFP_FS | __GFP_IO);
 	ret = vma->vm_ops->fault(vma, &vmf);
+	mapping_set_gfp_mask(mapping, mapping_gfp);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
 
-- 
2.1.4

> However I am a bit concerned about drivers/block/loop.c.
> Might a filesystem read on the loop block device wait for a page_cache_read()
> on the loop-mounted file?  In that case you really don't want __GFP_FS set
> when allocating that page.

I guess I see what you mean but I fail to see how the deadlock would
happen from the page fault path.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read
  2015-03-19 13:55       ` Michal Hocko
@ 2015-03-19 14:27         ` Michal Hocko
  -1 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2015-03-19 14:27 UTC (permalink / raw)
  To: NeilBrown
  Cc: Andrew Morton, Al Viro, Johannes Weiner, Mel Gorman,
	Rik van Riel, Tetsuo Handa, Sage Weil, Mark Fasheh, linux-mm,
	LKML

On Thu 19-03-15 14:55:58, Michal Hocko wrote:
> On Thu 19-03-15 08:38:35, Neil Brown wrote:
> [...]
> > Nearly half the places in the kernel which call mapping_gfp_mask() remove the
> > __GFP_FS bit.
> > 
> > That suggests to me that it might make sense to have
> >    mapping_gfp_mask_fs()
> > and
> >    mapping_gfp_mask_nofs()
> >
> > and let the presence of __GFP_FS (and __GFP_IO) be determined by the
> > call-site rather than the filesystem.
> 
> Sounds reasonable to me but filesystems tend to use this in a very
> different ways.
> - xfs drops GFP_FS in xfs_setup_inode so all page cache allocations are
>   NOFS.
> - reiserfs drops GFP_FS only before calling read_mapping_page in
>   reiserfs_get_page and never restores the original mask.
> - btrfs doesn't seem to rely on mapping_gfp_mask for anything other than
>   btree_inode (unless it gets inherrited in a way I haven't noticed).
> - ext* doesn't seem to rely on the mapping gfp mask at all.
> 
> So it is not clear to me how we should change that into callsites. But I
> guess we can change at least the page fault path like the following. I
> like it much more than the previous way which is too hackish.

But this is racy instead: the gfp mask is shared, per-mapping state, so a
concurrent allocation against the same mapping can observe the temporarily
relaxed flags. And I do not think we can make it raceless, so scratch this
and get back to the original approach.
[...]
> +	/*
> +	 * Some filesystems always drop __GFP_FS to prevent from reclaim
> +	 * recursion back to FS code. This is not the case here because
> +	 * we are at the top of the call chain. Add GFP_FS flags to prevent
> +	 * from premature OOM killer.
> +	 */
> +	mapping_gfp = mapping_gfp_mask(mapping);
> +	mapping_set_gfp_mask(mapping, mapping_gfp | __GFP_FS | __GFP_IO);
>  	ret = vma->vm_ops->fault(vma, &vmf);
> +	mapping_set_gfp_mask(mapping, mapping_gfp);
>  	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
>  		return ret;
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read
  2015-03-19 12:44         ` Michal Hocko
@ 2015-03-20  3:48           ` Dave Chinner
  -1 siblings, 0 replies; 47+ messages in thread
From: Dave Chinner @ 2015-03-20  3:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Rik van Riel, Andrew Morton, Al Viro, Johannes Weiner,
	Mel Gorman, Neil Brown, Tetsuo Handa, Sage Weil, Mark Fasheh,
	linux-mm, LKML

On Thu, Mar 19, 2015 at 01:44:41PM +0100, Michal Hocko wrote:
> On Thu 19-03-15 18:14:39, Dave Chinner wrote:
> > On Wed, Mar 18, 2015 at 03:55:28PM +0100, Michal Hocko wrote:
> > > On Wed 18-03-15 10:44:11, Rik van Riel wrote:
> > > > On 03/18/2015 10:09 AM, Michal Hocko wrote:
> > > > > page_cache_read has been historically using page_cache_alloc_cold to
> > > > > allocate a new page. This means that mapping_gfp_mask is used as the
> > > > > base for the gfp_mask. Many filesystems are setting this mask to
> > > > > GFP_NOFS to prevent from fs recursion issues. page_cache_read is,
> > > > > however, not called from the fs layer 
> > > > 
> > > > Is that true for filesystems that have directories in
> > > > the page cache?
> > > 
> > > I haven't found any explicit callers of filemap_fault except for ocfs2
> > > and ceph and those seem OK to me. Which filesystems you have in mind?
> > 
> > Just about every major filesystem calls filemap_fault through the
> > .fault callout.
> 
> That is right but the callback is called from the VM layer where we
> obviously do not take any fs locks (we are holding only mmap_sem
> for reading).
> Those who call filemap_fault directly (ocfs2 and ceph) and those
> who call the callback directly: qxl_ttm_fault, radeon_ttm_fault,
> kernfs_vma_fault, shm_fault seem to be safe from the reclaim recursion
> POV. radeon_ttm_fault takes a lock for reading but that one doesn't seem
> to be used from the reclaim context.
> 
> Or did I miss your point? Are you concerned about some fs overloading
> filemap_fault and do some locking before delegating to filemap_fault?

The latter:

https://git.kernel.org/cgit/linux/kernel/git/dgc/linux-xfs.git/commit/?h=xfs-mmap-lock&id=de0e8c20ba3a65b0f15040aabbefdc1999876e6b
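
Roughly, the pattern that commit introduces is a ->fault handler which takes
a filesystem lock around the generic path; a simplified sketch (the real xfs
function differs in details):

	STATIC int
	xfs_filemap_fault(
		struct vm_area_struct	*vma,
		struct vm_fault		*vmf)
	{
		struct xfs_inode	*ip = XFS_I(vma->vm_file->f_mapping->host);
		int			ret;

		xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
		/* filemap_fault() now allocates page cache under a fs lock */
		ret = filemap_fault(vma, vmf);
		xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);

		return ret;
	}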

> > GFP_KERNEL allocation for mappings is simply wrong. All mapping
> > allocations where the caller cannot pass a gfp_mask need to obey
> > the mapping_gfp_mask that is set by the mapping owner....
> 
> Hmm, I thought this is true only when the function might be called from
> the fs path.

How do you know in, say, mpage_readpages, you aren't being called
from a fs path that holds locks? e.g. we can get there from ext4
doing readdir, so it is holding an i_mutex lock at that point.

Many other paths into mpage_readpages don't hold locks, but there
are some that do, and those that do need functions like this to
obey the mapping_gfp_mask because it is set appropriately for the
allocation context of the inode that owns the mapping....
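
For reference, the convention being described boils down to the mapping
owner declaring its constraint once and generic code honouring it; a minimal
sketch with the existing helpers (the example_* wrappers are hypothetical,
only the two calls matter):

	/* at inode setup time the filesystem declares its allocation constraint */
	static void example_setup_mapping(struct inode *inode)
	{
		mapping_set_gfp_mask(inode->i_mapping, GFP_NOFS);
	}

	/* ...and a generic path that cannot take an explicit gfp_t honours it */
	static struct page *example_alloc(struct address_space *mapping)
	{
		return __page_cache_alloc(mapping_gfp_mask(mapping) | __GFP_COLD);
	}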

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read
  2015-03-19 13:55       ` Michal Hocko
@ 2015-03-20  3:57         ` Dave Chinner
  -1 siblings, 0 replies; 47+ messages in thread
From: Dave Chinner @ 2015-03-20  3:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: NeilBrown, Andrew Morton, Al Viro, Johannes Weiner, Mel Gorman,
	Rik van Riel, Tetsuo Handa, Sage Weil, Mark Fasheh, linux-mm,
	LKML

On Thu, Mar 19, 2015 at 02:55:58PM +0100, Michal Hocko wrote:
> @@ -2701,13 +2701,24 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
>  {
>  	struct vm_fault vmf;
>  	int ret;
> +	struct address_space *mapping = vma->vm_file->f_mapping;
> +	gfp_t mapping_gfp;
>  
>  	vmf.virtual_address = (void __user *)(address & PAGE_MASK);
>  	vmf.pgoff = pgoff;
>  	vmf.flags = flags;
>  	vmf.page = NULL;
>  
> +	/*
> +	 * Some filesystems always drop __GFP_FS to prevent from reclaim
> +	 * recursion back to FS code. This is not the case here because
> +	 * we are at the top of the call chain. Add GFP_FS flags to prevent
> +	 * from premature OOM killer.
> +	 */
> +	mapping_gfp = mapping_gfp_mask(mapping);
> +	mapping_set_gfp_mask(mapping, mapping_gfp | __GFP_FS | __GFP_IO);
>  	ret = vma->vm_ops->fault(vma, &vmf);
> +	mapping_set_gfp_mask(mapping, mapping_gfp);

Urk! The inode owns the mapping and makes these decisions, not the
page fault path. These mapping flags may be set for reasons you
don't expect or know about (e.g. a subsystem specific shrinker
constraint) so paths like this have no business clearing flags they
don't own.

cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read
  2015-03-20  3:48           ` Dave Chinner
@ 2015-03-20 13:14             ` Michal Hocko
  -1 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2015-03-20 13:14 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Rik van Riel, Andrew Morton, Al Viro, Johannes Weiner,
	Mel Gorman, Neil Brown, Tetsuo Handa, Sage Weil, Mark Fasheh,
	linux-mm, LKML

On Fri 20-03-15 14:48:20, Dave Chinner wrote:
> On Thu, Mar 19, 2015 at 01:44:41PM +0100, Michal Hocko wrote:
> > On Thu 19-03-15 18:14:39, Dave Chinner wrote:
> > > On Wed, Mar 18, 2015 at 03:55:28PM +0100, Michal Hocko wrote:
> > > > On Wed 18-03-15 10:44:11, Rik van Riel wrote:
> > > > > On 03/18/2015 10:09 AM, Michal Hocko wrote:
> > > > > > page_cache_read has been historically using page_cache_alloc_cold to
> > > > > > allocate a new page. This means that mapping_gfp_mask is used as the
> > > > > > base for the gfp_mask. Many filesystems are setting this mask to
> > > > > > GFP_NOFS to prevent from fs recursion issues. page_cache_read is,
> > > > > > however, not called from the fs layer 
> > > > > 
> > > > > Is that true for filesystems that have directories in
> > > > > the page cache?
> > > > 
> > > > I haven't found any explicit callers of filemap_fault except for ocfs2
> > > > and ceph and those seem OK to me. Which filesystems you have in mind?
> > > 
> > > Just about every major filesystem calls filemap_fault through the
> > > .fault callout.
> > 
> > That is right but the callback is called from the VM layer where we
> > obviously do not take any fs locks (we are holding only mmap_sem
> > for reading).
> > Those who call filemap_fault directly (ocfs2 and ceph) and those
> > who call the callback directly: qxl_ttm_fault, radeon_ttm_fault,
> > kernfs_vma_fault, shm_fault seem to be safe from the reclaim recursion
> > POV. radeon_ttm_fault takes a lock for reading but that one doesn't seem
> > to be used from the reclaim context.
> > 
> > Or did I miss your point? Are you concerned about some fs overloading
> > filemap_fault and do some locking before delegating to filemap_fault?
> 
> The latter:
> 
> https://git.kernel.org/cgit/linux/kernel/git/dgc/linux-xfs.git/commit/?h=xfs-mmap-lock&id=de0e8c20ba3a65b0f15040aabbefdc1999876e6b

I will have a look at the code to see what we can do about it.
 
> > > GFP_KERNEL allocation for mappings is simply wrong. All mapping
> > > allocations where the caller cannot pass a gfp_mask need to obey
> > > the mapping_gfp_mask that is set by the mapping owner....
> > 
> > Hmm, I thought this is true only when the function might be called from
> > the fs path.
> 
> How do you know in, say, mpage_readpages, you aren't being called
> from a fs path that holds locks? e.g. we can get there from ext4
> doing readdir, so it is holding an i_mutex lock at that point.
> 
> Many other paths into mpage_readpages don't hold locks, but there
> are some that do, and those that do need functions like this to
> obey the mapping_gfp_mask because it is set appropriately for the
> allocation context of the inode that owns the mapping....

What about the following?
---
>From 5d905cb291138d61bbab056845d6e53bc4451ec8 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Thu, 19 Mar 2015 14:56:56 +0100
Subject: [PATCH 1/2] mm: do not ignore mapping_gfp_mask in page cache
 allocation paths

page_cache_read, do_generic_file_read, __generic_file_splice_read and
__ntfs_grab_cache_pages currently ignore mapping_gfp_mask when calling
add_to_page_cache_lru which might cause recursion into fs down in the
direct reclaim path if the mapping really relies on GFP_NOFS semantic.

This doesn't seem to be a problem right now because page_cache_read (page
fault path) doesn't suffer from the reclaim recursion issues and
do_generic_file_read and __generic_file_splice_read also shouldn't be
called under fs locks which would deadlock in the reclaim path. Anyway,
it is better to obey the mapping gfp mask and prevent later breakage.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 fs/ntfs/file.c | 2 +-
 fs/splice.c    | 2 +-
 mm/filemap.c   | 6 ++++--
 3 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c
index 1da9b2d184dc..568c9dbc7e61 100644
--- a/fs/ntfs/file.c
+++ b/fs/ntfs/file.c
@@ -422,7 +422,7 @@ static inline int __ntfs_grab_cache_pages(struct address_space *mapping,
 				}
 			}
 			err = add_to_page_cache_lru(*cached_page, mapping, index,
-					GFP_KERNEL);
+					GFP_KERNEL & mapping_gfp_mask(mapping));
 			if (unlikely(err)) {
 				if (err == -EEXIST)
 					continue;
diff --git a/fs/splice.c b/fs/splice.c
index 75c6058eabf2..71f6c51f019a 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -360,7 +360,7 @@ __generic_file_splice_read(struct file *in, loff_t *ppos,
 				break;
 
 			error = add_to_page_cache_lru(page, mapping, index,
-						GFP_KERNEL);
+						GFP_KERNEL & mapping_gfp_mask(mapping));
 			if (unlikely(error)) {
 				page_cache_release(page);
 				if (error == -EEXIST)
diff --git a/mm/filemap.c b/mm/filemap.c
index 968cd8e03d2e..4756cba51655 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1656,7 +1656,8 @@ no_cached_page:
 			goto out;
 		}
 		error = add_to_page_cache_lru(page, mapping,
-						index, GFP_KERNEL);
+						index,
+						GFP_KERNEL & mapping_gfp_mask(mapping));
 		if (error) {
 			page_cache_release(page);
 			if (error == -EEXIST) {
@@ -1756,7 +1757,8 @@ static int page_cache_read(struct file *file, pgoff_t offset)
 		if (!page)
 			return -ENOMEM;
 
-		ret = add_to_page_cache_lru(page, mapping, offset, GFP_KERNEL);
+		ret = add_to_page_cache_lru(page, mapping, offset,
+				GFP_KERNEL & mapping_gfp_mask(mapping));
 		if (ret == 0)
 			ret = mapping->a_ops->readpage(file, page);
 		else if (ret == -EEXIST)
-- 
2.1.4
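
For reference, the GFP_KERNEL & mapping_gfp_mask(mapping) expression above
just intersects the two masks, so a GFP_NOFS mapping keeps __GFP_FS cleared
(a minimal illustration, using the flag composition of this kernel era):

	/*
	 * GFP_KERNEL == __GFP_WAIT | __GFP_IO | __GFP_FS
	 * GFP_NOFS   == __GFP_WAIT | __GFP_IO
	 *
	 * GFP_KERNEL & mapping_gfp_mask(mapping)
	 *	== __GFP_WAIT | __GFP_IO	(for a GFP_NOFS mapping)
	 *	== GFP_NOFS, i.e. __GFP_FS stays cleared as the fs asked
	 */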


-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read
  2015-03-20 13:14             ` Michal Hocko
@ 2015-03-20 22:51               ` Dave Chinner
  -1 siblings, 0 replies; 47+ messages in thread
From: Dave Chinner @ 2015-03-20 22:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Rik van Riel, Andrew Morton, Al Viro, Johannes Weiner,
	Mel Gorman, Neil Brown, Tetsuo Handa, Sage Weil, Mark Fasheh,
	linux-mm, LKML

On Fri, Mar 20, 2015 at 02:14:53PM +0100, Michal Hocko wrote:
> On Fri 20-03-15 14:48:20, Dave Chinner wrote:
> > On Thu, Mar 19, 2015 at 01:44:41PM +0100, Michal Hocko wrote:
> > > > allocations where the caller cannot pass a gfp_mask need to obey
> > > > the mapping_gfp_mask that is set by the mapping owner....
> > > 
> > > Hmm, I thought this is true only when the function might be called from
> > > the fs path.
> > 
> > How do you know in, say, mpage_readpages, you aren't being called
> > from a fs path that holds locks? e.g. we can get there from ext4
> > doing readdir, so it is holding an i_mutex lock at that point.
> > 
> > Many other paths into mpage_readpages don't hold locks, but there
> > are some that do, and those that do need functions like this to
> > obey the mapping_gfp_mask because it is set appropriately for the
> > allocation context of the inode that owns the mapping....
> 
> What about the following?
> ---
> From 5d905cb291138d61bbab056845d6e53bc4451ec8 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.cz>
> Date: Thu, 19 Mar 2015 14:56:56 +0100
> Subject: [PATCH 1/2] mm: do not ignore mapping_gfp_mask in page cache
>  allocation paths

Looks reasonable, though I thought there were more places than that
in the mapping paths that need to be careful...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read
  2015-03-20 22:51               ` Dave Chinner
@ 2015-03-23 13:02                 ` Michal Hocko
  -1 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2015-03-23 13:02 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Rik van Riel, Andrew Morton, Al Viro, Johannes Weiner,
	Mel Gorman, Neil Brown, Tetsuo Handa, Sage Weil, Mark Fasheh,
	linux-mm, LKML

On Sat 21-03-15 09:51:39, Dave Chinner wrote:
> On Fri, Mar 20, 2015 at 02:14:53PM +0100, Michal Hocko wrote:
> > On Fri 20-03-15 14:48:20, Dave Chinner wrote:
> > > On Thu, Mar 19, 2015 at 01:44:41PM +0100, Michal Hocko wrote:
> > > > > allocations where the caller cannot pass a gfp_mask need to obey
> > > > > the mapping_gfp_mask that is set by the mapping owner....
> > > > 
> > > > Hmm, I thought this is true only when the function might be called from
> > > > the fs path.
> > > 
> > > How do you know in, say, mpage_readpages, you aren't being called
> > > from a fs path that holds locks? e.g. we can get there from ext4
> > > doing readdir, so it is holding an i_mutex lock at that point.
> > > 
> > > Many other paths into mpage_readpages don't hold locks, but there
> > > are some that do, and those that do need functions like this to
> > > obey the mapping_gfp_mask because it is set appropriately for the
> > > allocation context of the inode that owns the mapping....
> > 
> > What about the following?
> > ---
> > From 5d905cb291138d61bbab056845d6e53bc4451ec8 Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <mhocko@suse.cz>
> > Date: Thu, 19 Mar 2015 14:56:56 +0100
> > Subject: [PATCH 1/2] mm: do not ignore mapping_gfp_mask in page cache
> >  allocation paths
> 
> Looks reasonable, though I thought there were more places than that
> in the mapping paths that need to be careful...

I have focused on those which involve page cache allocation because
those are obvious. We might need others but I do not see them right now.

I will include this patch in the next submission after I manage to wrap my
head around the upcoming xfs changes and come up with something for the
page_cache_read vs OOM killer issue.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read
  2015-03-20  3:48           ` Dave Chinner
@ 2015-03-26  9:53             ` Michal Hocko
  -1 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2015-03-26  9:53 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Rik van Riel, Andrew Morton, Al Viro, Johannes Weiner,
	Mel Gorman, Neil Brown, Tetsuo Handa, Sage Weil, Mark Fasheh,
	linux-mm, LKML

On Fri 20-03-15 14:48:20, Dave Chinner wrote:
> On Thu, Mar 19, 2015 at 01:44:41PM +0100, Michal Hocko wrote:
[...]
> > Or did I miss your point? Are you concerned about some fs overloading
> > filemap_fault and do some locking before delegating to filemap_fault?
> 
> The latter:
> 
> https://git.kernel.org/cgit/linux/kernel/git/dgc/linux-xfs.git/commit/?h=xfs-mmap-lock&id=de0e8c20ba3a65b0f15040aabbefdc1999876e6b

Hmm. I am completely unfamiliar with the xfs code but my reading of
964aa8d9e4d3..723cac484733 is that the newly introduced lock should be
OK from the reclaim recursion POV. It protects against truncate and
punch hole, right? Or are there any internal paths which I am missing
and would cause problems if we do GFP_FS with XFS_MMAPLOCK_SHARED held?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read
  2015-03-26  9:53             ` Michal Hocko
@ 2015-03-26 21:43               ` Dave Chinner
  -1 siblings, 0 replies; 47+ messages in thread
From: Dave Chinner @ 2015-03-26 21:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Rik van Riel, Andrew Morton, Al Viro, Johannes Weiner,
	Mel Gorman, Neil Brown, Tetsuo Handa, Sage Weil, Mark Fasheh,
	linux-mm, LKML

On Thu, Mar 26, 2015 at 10:53:02AM +0100, Michal Hocko wrote:
> On Fri 20-03-15 14:48:20, Dave Chinner wrote:
> > On Thu, Mar 19, 2015 at 01:44:41PM +0100, Michal Hocko wrote:
> [...]
> > > Or did I miss your point? Are you concerned about some fs overloading
> > > filemap_fault and do some locking before delegating to filemap_fault?
> > 
> > The latter:
> > 
> > https://git.kernel.org/cgit/linux/kernel/git/dgc/linux-xfs.git/commit/?h=xfs-mmap-lock&id=de0e8c20ba3a65b0f15040aabbefdc1999876e6b
> 
> Hmm. I am completely unfamiliar with the xfs code but my reading of
> 964aa8d9e4d3..723cac484733 is that the newly introduced lock should be
> OK from the reclaim recursion POV. It protects against truncate and
> punch hole, right? Or are there any internal paths which I am missing
> and would cause problems if we do GFP_FS with XFS_MMAPLOCK_SHARED held?

It might be OK, but you're only looking at the example I gave you,
not the fundamental issue it demonstrates. That is: filesystems may
have *internal dependencies that are unknown to the page cache or mm
subsystem*. Hence the page cache or mm allocations cannot
arbitrarily ignore allocation constraints the filesystem assigns to
mapping operations....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read
  2015-03-26 21:43               ` Dave Chinner
@ 2015-03-30  8:22                 ` Michal Hocko
  -1 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2015-03-30  8:22 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Rik van Riel, Andrew Morton, Al Viro, Johannes Weiner,
	Mel Gorman, Neil Brown, Tetsuo Handa, Sage Weil, Mark Fasheh,
	linux-mm, LKML

On Fri 27-03-15 08:43:54, Dave Chinner wrote:
> On Thu, Mar 26, 2015 at 10:53:02AM +0100, Michal Hocko wrote:
> > On Fri 20-03-15 14:48:20, Dave Chinner wrote:
> > > On Thu, Mar 19, 2015 at 01:44:41PM +0100, Michal Hocko wrote:
> > [...]
> > > > Or did I miss your point? Are you concerned about some fs overloading
> > > > filemap_fault and do some locking before delegating to filemap_fault?
> > > 
> > > The latter:
> > > 
> > > https://git.kernel.org/cgit/linux/kernel/git/dgc/linux-xfs.git/commit/?h=xfs-mmap-lock&id=de0e8c20ba3a65b0f15040aabbefdc1999876e6b
> > 
> > Hmm. I am completely unfamiliar with the xfs code but my reading of
> > 964aa8d9e4d3..723cac484733 is that the newly introduced lock should be
> > OK from the reclaim recursion POV. It protects against truncate and
> > punch hole, right? Or are there any internal paths which I am missing
> > and would cause problems if we do GFP_FS with XFS_MMAPLOCK_SHARED held?
> 
> It might be OK, but you're only looking at the example I gave you,
> not the fundamental issue it demonstrates. That is: filesystems may
> have *internal dependencies that are unknown to the page cache or mm
> subsystem*. Hence the page cache or mm allocations cannot
> arbitrarily ignore allocation constraints the filesystem assigns to
> mapping operations....

I fully understand that. I am just trying to understand what the real
requirements from filesystems are wrt filemap_fault. The mapping gfp mask is
not of much use for that because e.g. xfs has GFP_NOFS set for the whole
inode lifetime AFAICS. And it seems that this context is not really required
even after the recent code changes.
We can add a gfp_mask to struct vm_fault, initialize it to
mapping_gfp_mask | GFP_IOFS, and let the .fault() callback overwrite it. This
would be cleaner than the unconditional gfp manipulation (the patch below).
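
A rough sketch of what such an override could look like in a filesystem's
->fault handler (hypothetical example_filemap_fault, just to show the shape
of the idea):

	static int example_filemap_fault(struct vm_area_struct *vma,
					 struct vm_fault *vmf)
	{
		/* entered under a fs lock, so forbid recursion into the fs */
		vmf->gfp_mask &= ~__GFP_FS;
		return filemap_fault(vma, vmf);
	}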

But we are in a really hard position if the GFP_NOFS context is really
required here. We shouldn't really trigger the OOM killer because that could
be premature and way too disruptive. We could retry the page fault or the
allocation, but both of those sound suboptimal to me.

Do you have any other suggestions?

This hasn't been tested yet it just shows the idea mentioned above.
---
>From 292cfcbbe18b2afc8d2bc0cf568ca4c5842d4c8f Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Fri, 27 Mar 2015 13:33:51 +0100
Subject: [PATCH] mm: Allow GFP_IOFS for page_cache_read page cache allocation

page_cache_read has been historically using page_cache_alloc_cold to
allocate a new page. This means that mapping_gfp_mask is used as the
base for the gfp_mask. Many filesystems are setting this mask to
GFP_NOFS to prevent from fs recursion issues. page_cache_read is,
however, not called from the fs layer directly so it doesn't need this
protection normally.

ceph and ocfs2, which call filemap_fault from their fault handlers,
seem to be OK because they are not taking any fs lock before invoking
the generic implementation. xfs, which takes XFS_MMAPLOCK_SHARED, is safe
from the reclaim recursion POV because this lock serializes truncate
and punch hole with the page faults and it doesn't get involved in
reclaim.

The GFP_NOFS protection might be even harmful. There is a push to fail
GFP_NOFS allocations rather than loop within allocator indefinitely with
a very limited reclaim ability. Once we start failing those requests
the OOM killer might be triggered prematurely because the page cache
allocation failure is propagated up the page fault path and end up in
pagefault_out_of_memory.

We cannot play with mapping_gfp_mask directly because that would be racy
wrt. parallel page faults and it might interfere with other users who
really rely on the NOFS semantic of the stored gfp_mask. The mask is also
an inode property, so changing it here would even be a layering violation.
What we can do instead is to push the gfp_mask into struct vm_fault and
allow the fs layer to overwrite it should the callback need to be called
with a different allocation context.

Initialize the default to (mapping_gfp_mask | GFP_IOFS) because this
should normally be safe from the page fault path. Why do we care
about mapping_gfp_mask at all then? Because it doesn't hold only
reclaim protection flags; it might also contain zone and movability
restrictions (GFP_DMA32, __GFP_MOVABLE and others), so we have to respect
those.

Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 include/linux/mm.h | 4 ++++
 mm/filemap.c       | 9 ++++-----
 mm/memory.c        | 3 +++
 3 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index b720b5146a4e..c919e48664d5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -217,10 +217,14 @@ extern pgprot_t protection_map[16];
  * ->fault function. The vma's ->fault is responsible for returning a bitmask
  * of VM_FAULT_xxx flags that give details about how the fault was handled.
  *
+ * MM layer fills up gfp_mask for page allocations but fault handler might
+ * alter it if its implementation requires a different allocation context.
+ *
  * pgoff should be used in favour of virtual_address, if possible.
  */
 struct vm_fault {
 	unsigned int flags;		/* FAULT_FLAG_xxx flags */
+	gfp_t gfp_mask			/* gfp mask to be used for allocations */
 	pgoff_t pgoff;			/* Logical page offset based on vma */
 	void __user *virtual_address;	/* Faulting virtual address */
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 4756cba51655..4e620a79ab4d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1746,19 +1746,18 @@ EXPORT_SYMBOL(generic_file_read_iter);
  * This adds the requested page to the page cache if it isn't already there,
  * and schedules an I/O to read in its contents from disk.
  */
-static int page_cache_read(struct file *file, pgoff_t offset)
+static int page_cache_read(struct file *file, pgoff_t offset, gfp_t gfp_mask)
 {
 	struct address_space *mapping = file->f_mapping;
 	struct page *page;
 	int ret;
 
 	do {
-		page = page_cache_alloc_cold(mapping);
+		page = __page_cache_alloc(gfp_mask|__GFP_COLD);
 		if (!page)
 			return -ENOMEM;
 
-		ret = add_to_page_cache_lru(page, mapping, offset,
-				GFP_KERNEL & mapping_gfp_mask(mapping));
+		ret = add_to_page_cache_lru(page, mapping, offset, GFP_KERNEL & gfp_mask);
 		if (ret == 0)
 			ret = mapping->a_ops->readpage(file, page);
 		else if (ret == -EEXIST)
@@ -1941,7 +1940,7 @@ no_cached_page:
 	 * We're only likely to ever get here if MADV_RANDOM is in
 	 * effect.
 	 */
-	error = page_cache_read(file, offset);
+	error = page_cache_read(file, offset, vmf.gfp_mask);
 
 	/*
 	 * The page we want has now been added to the page cache.
diff --git a/mm/memory.c b/mm/memory.c
index c4b08e3ef058..ae350c752d54 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1962,6 +1962,7 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page,
 	vmf.virtual_address = (void __user *)(address & PAGE_MASK);
 	vmf.pgoff = page->index;
 	vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
+	vmf.gfp_mask = mapping_gfp_mask(vma->vm_file->f_mapping) | GFP_IOFS;
 	vmf.page = page;
 
 	ret = vma->vm_ops->page_mkwrite(vma, &vmf);
@@ -2705,6 +2706,7 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
 	vmf.virtual_address = (void __user *)(address & PAGE_MASK);
 	vmf.pgoff = pgoff;
 	vmf.flags = flags;
+	vmf.gfp_mask = mapping_gfp_mask(vma->vm_file->f_mapping) | GFP_IOFS;
 	vmf.page = NULL;
 
 	ret = vma->vm_ops->fault(vma, &vmf);
@@ -2868,6 +2870,7 @@ static void do_fault_around(struct vm_area_struct *vma, unsigned long address,
 	vmf.pgoff = pgoff;
 	vmf.max_pgoff = max_pgoff;
 	vmf.flags = flags;
+	vmf.gfp_mask = mapping_gfp_mask(vma->vm_file->f_mapping) | GFP_IOFS;
 	vma->vm_ops->map_pages(vma, &vmf);
 }
 
-- 
2.1.4

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read
  2015-03-30  8:22                 ` Michal Hocko
@ 2015-03-31 21:46                   ` Dave Chinner
  -1 siblings, 0 replies; 47+ messages in thread
From: Dave Chinner @ 2015-03-31 21:46 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Rik van Riel, Andrew Morton, Al Viro, Johannes Weiner,
	Mel Gorman, Neil Brown, Tetsuo Handa, Sage Weil, Mark Fasheh,
	linux-mm, LKML

On Mon, Mar 30, 2015 at 10:22:18AM +0200, Michal Hocko wrote:
> On Fri 27-03-15 08:43:54, Dave Chinner wrote:
> > On Thu, Mar 26, 2015 at 10:53:02AM +0100, Michal Hocko wrote:
> > > On Fri 20-03-15 14:48:20, Dave Chinner wrote:
> > > > On Thu, Mar 19, 2015 at 01:44:41PM +0100, Michal Hocko wrote:
> > > [...]
> > > > > Or did I miss your point? Are you concerned about some fs overloading
> > > > > filemap_fault and do some locking before delegating to filemap_fault?
> > > > 
> > > > The latter:
> > > > 
> > > > https://git.kernel.org/cgit/linux/kernel/git/dgc/linux-xfs.git/commit/?h=xfs-mmap-lock&id=de0e8c20ba3a65b0f15040aabbefdc1999876e6b
> > > 
> > > Hmm. I am completely unfamiliar with the xfs code but my reading of
> > > 964aa8d9e4d3..723cac484733 is that the newly introduced lock should be
> > > OK from the reclaim recursion POV. It protects against truncate and
> > > punch hole, right? Or are there any internal paths which I am missing
> > > and would cause problems if we do GFP_FS with XFS_MMAPLOCK_SHARED held?
> > 
> > It might be OK, but you're only looking at the example I gave you,
> > not the fundamental issue it demonstrates. That is: filesystems may
> > have *internal dependencies that are unknown to the page cache or mm
> > subsystem*. Hence the page cache or mm allocations cannot
> > arbitrarily ignore allocation constraints the filesystem assigns to
> > mapping operations....
> 
> I fully understand that. I am just trying to understand what the real
> requirements from filesystems are wrt filemap_fault. The mapping gfp mask is
> not of much use for that because e.g. xfs has GFP_NOFS set for the whole
> inode lifetime AFAICS. And it seems that this context is not really required
> even after the recent code changes.
> We can add a gfp_mask to struct vm_fault, initialize it to
> mapping_gfp_mask | GFP_IOFS, and let the .fault() callback overwrite it. This
> would be cleaner than the unconditional gfp manipulation (the patch below).
> 
> But we are in a really hard position if the GFP_NOFS context is really
> required here. We shouldn't really trigger the OOM killer because that could
> be premature and way too disruptive. We could retry the page fault or the
> allocation, but both of those sound suboptimal to me.

GFP_NOFS has also been required in the mapping mask in the past
because reclaim from page cache allocation points had a nasty habit
of blowing the stack. In XFS, we replaced AOP_FLAG_NOFS calls with
the GFP_NOFS mapping mask because the AOP_FLAG_NOFS didn't give us
the coverage needed on mapping allocation calls to prevent the
problems being reported.

Yes, we now have a larger stack, but as I've said in the past,
GFP_NOFS is used for several different reasons by subsystems -
recursion can be bad in more ways than one, and GFP_NOFS is the only
hammer we have...

> This hasn't been tested yet it just shows the idea mentioned above.
> ---
> From 292cfcbbe18b2afc8d2bc0cf568ca4c5842d4c8f Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.cz>
> Date: Fri, 27 Mar 2015 13:33:51 +0100
> Subject: [PATCH] mm: Allow GFP_IOFS for page_cache_read page cache allocation
> 
> page_cache_read has been historically using page_cache_alloc_cold to
> allocate a new page. This means that mapping_gfp_mask is used as the
> base for the gfp_mask. Many filesystems are setting this mask to
> GFP_NOFS to prevent from fs recursion issues. page_cache_read is,
> however, not called from the fs layer directly so it doesn't need this
> protection normally.

It can be called from a page fault while copying into or out of a
user buffer from a read()/write() system call. Hence the page fault
can be nested inside filesystem locks. Indeed, this is exactly the
canonical reason why we can't take the i_mutex in the page fault path:
the user buffer might be an mmap()d region of the same file, and so we
have mmap_sem/i_mutex inversion issues.

This is the same case - we can be taking page faults with filesystem
locks held, and that means we've got problems if the page fault then
recurses back into the filesystem and trips over those locks...
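
A trivial userspace sketch of that scenario (hypothetical program, error
handling omitted; "somefile" is assumed to exist and be at least two pages
long): the write() source buffer is an mmap()ed page of the same file, so
the copy-in can fault on the file's own mapping while the write path holds
filesystem locks:

	#include <fcntl.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = open("somefile", O_RDWR);
		char *buf = mmap(NULL, 8192, PROT_READ, MAP_SHARED, fd, 0);

		/*
		 * write() enters the filesystem (e.g. under i_mutex); copying
		 * from 'buf' then faults because the page is not yet mapped,
		 * and the fault is against the very same file/mapping.
		 */
		write(fd, buf + 4096, 1024);
		return 0;
	}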

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read
  2015-03-31 21:46                   ` Dave Chinner
@ 2015-04-07 12:16                     ` Michal Hocko
  -1 siblings, 0 replies; 47+ messages in thread
From: Michal Hocko @ 2015-04-07 12:16 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Rik van Riel, Andrew Morton, Al Viro, Johannes Weiner,
	Mel Gorman, Neil Brown, Tetsuo Handa, Sage Weil, Mark Fasheh,
	linux-mm, LKML

On Wed 01-04-15 08:46:51, Dave Chinner wrote:
[...]
> GFP_NOFS has also been required in the mapping mask in the past
> because reclaim from page cache allocation points had a nasty habit
> of blowing the stack.

Yeah, I remember some scary traces, but we are talking about the page
fault path and we definitely have to handle GFP_IOFS allocations
there. We cannot use GFP_NOFS as a workaround, e.g. for anonymous pages.
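
For reference, this is roughly how those masks are composed in
include/linux/gfp.h as of the current tree (quoted from memory, so
double check the header if it matters):

	#define GFP_NOIO	(__GFP_WAIT)
	#define GFP_NOFS	(__GFP_WAIT | __GFP_IO)
	#define GFP_KERNEL	(__GFP_WAIT | __GFP_IO | __GFP_FS)
	#define GFP_IOFS	(__GFP_IO | __GFP_FS)

So GFP_NOFS is GFP_KERNEL without __GFP_FS, and GFP_IOFS is the pair of
flags we OR back into the mapping gfp mask to restore full reclaim
capabilities for the page fault path.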

[...]
> > From 292cfcbbe18b2afc8d2bc0cf568ca4c5842d4c8f Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <mhocko@suse.cz>
> > Date: Fri, 27 Mar 2015 13:33:51 +0100
> > Subject: [PATCH] mm: Allow GFP_IOFS for page_cache_read page cache allocation
> > 
> > page_cache_read has been historically using page_cache_alloc_cold to
> > allocate a new page. This means that mapping_gfp_mask is used as the
> > base for the gfp_mask. Many filesystems are setting this mask to
> > GFP_NOFS to prevent fs recursion issues. page_cache_read is,
> > however, not called from the fs layer directly so it doesn't need this
> > protection normally.
> 
> It can be called from a page fault while copying into or out of a
> user buffer from a read()/write() system call. Hence the page fault
> can be nested inside filesystem locks.

As pointed out above, the user buffer might be anonymous memory as well
and so we have to be able to handle GFP_IOFS allocations from the page
fault path without reclaim deadlocks. Besides that, we are allocating
page tables, which are GFP_KERNEL, and probably more. So either we are
broken by definition or GFP_IOFS is safe under the i_mutex lock. My code
inspection suggests the latter, but the code is really hard to follow
and the dependencies might not be direct.

I remember that nfs_release_page would be prone to an i_mutex deadlock
when the server and client are on the same machine. But this shouldn't
be a problem anymore because the amount of time the client waits for the
server is limited (9590544694bec).

I might be missing other places of course, but to me it sounds like
GFP_IOFS must be safe under _some_ FS locks, and i_mutex is one of them.

> Indeed, the canonical reason
> why we can't take the i_mutex in the page fault path is exactly
> this, i.e. the user buffer might be a mmap()d region of the same
> file and so we have mmap_sem/i_mutex inversion issues.
>
> This is the same case - we can be taking page faults with filesystem
> locks held, and that means we've got problems if the page fault then
> recurses back into the filesystem and trips over those locks...

Yeah, I am familiar with the generic meaning of the GFP_NOFS flag. I
just think that it is too big a hammer here (all FS locks is just too
broad).

The page fault path is not GFP_NOFS safe now and it never has been
(anonymous pages are not GFP_NOFS, page tables etc...). And I am afraid
we cannot simply change it to use GFP_NOFS all over. Are there any other
fs locks (except for i_mutex) which might be held while doing
{get,put}_user or get_user_pages? I haven't found many instances in fs/
but a lot is done via indirection.

That being said, I think the patch should be safe and an improvement
over the current state. Unless I am missing something obvious or there
are other objections, I will repost it along with the other cleanup
patch later this week.
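
For completeness, a minimal sketch of the vm_fault gfp_mask idea
mentioned earlier in the thread -- the field placement, the
initialization site and the exact mask below are illustrative only,
not the patch I am going to repost:

	/* sketch only */
	struct vm_fault {
		unsigned int flags;
		gfp_t gfp_mask;		/* mask for page cache allocations */
		pgoff_t pgoff;
		void __user *virtual_address;
		struct page *page;
		/* ... */
	};

	/* in __do_fault(), before calling vma->vm_ops->fault(vma, &vmf): */
	vmf.gfp_mask = mapping_gfp_mask(vma->vm_file->f_mapping)
			| __GFP_FS | __GFP_IO;

	/*
	 * A filesystem's ->fault() handler could still clear __GFP_FS
	 * here if it really cannot tolerate fs recursion for this
	 * particular fault.
	 */

page_cache_read() would then take the gfp mask from the fault rather
than unconditionally overriding the mapping mask.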
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2015-04-07 12:16 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-18 14:09 [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read Michal Hocko
2015-03-18 14:09 ` Michal Hocko
2015-03-18 14:32 ` Rik van Riel
2015-03-18 14:32   ` Rik van Riel
2015-03-18 14:37   ` Michal Hocko
2015-03-18 14:37     ` Michal Hocko
2015-03-18 14:38 ` Mel Gorman
2015-03-18 14:38   ` Mel Gorman
2015-03-18 14:43   ` Michal Hocko
2015-03-18 14:43     ` Michal Hocko
2015-03-18 14:44 ` Rik van Riel
2015-03-18 14:44   ` Rik van Riel
2015-03-18 14:55   ` Michal Hocko
2015-03-18 14:55     ` Michal Hocko
2015-03-19  7:14     ` Dave Chinner
2015-03-19  7:14       ` Dave Chinner
2015-03-19 11:11       ` [PATCH] mm: Use GFP_KERNEL allocation for the page cache inpage_cache_read Tetsuo Handa
2015-03-19 11:11         ` Tetsuo Handa
2015-03-19 12:44       ` [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read Michal Hocko
2015-03-19 12:44         ` Michal Hocko
2015-03-20  3:48         ` Dave Chinner
2015-03-20  3:48           ` Dave Chinner
2015-03-20 13:14           ` Michal Hocko
2015-03-20 13:14             ` Michal Hocko
2015-03-20 22:51             ` Dave Chinner
2015-03-20 22:51               ` Dave Chinner
2015-03-23 13:02               ` Michal Hocko
2015-03-23 13:02                 ` Michal Hocko
2015-03-26  9:53           ` Michal Hocko
2015-03-26  9:53             ` Michal Hocko
2015-03-26 21:43             ` Dave Chinner
2015-03-26 21:43               ` Dave Chinner
2015-03-30  8:22               ` Michal Hocko
2015-03-30  8:22                 ` Michal Hocko
2015-03-31 21:46                 ` Dave Chinner
2015-03-31 21:46                   ` Dave Chinner
2015-04-07 12:16                   ` Michal Hocko
2015-04-07 12:16                     ` Michal Hocko
2015-03-18 15:45 ` Michal Hocko
2015-03-18 15:45   ` Michal Hocko
2015-03-18 21:38   ` NeilBrown
2015-03-19 13:55     ` Michal Hocko
2015-03-19 13:55       ` Michal Hocko
2015-03-19 14:27       ` Michal Hocko
2015-03-19 14:27         ` Michal Hocko
2015-03-20  3:57       ` Dave Chinner
2015-03-20  3:57         ` Dave Chinner
