Re: [PATCH 3.18.y 3/5] mm: allow GFP_{FS,IO} for page_cache_read page cache allocation

From: Harsh Shandilya <harsh@prjkt.io>
To: Michal Hocko <mhocko@kernel.org>
Cc: stable@vger.kernel.org,
	Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>,
	Mel Gorman <mgorman@suse.de>, Dave Chinner <david@fromorbit.com>,
	Mark Fasheh <mfasheh@suse.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [PATCH 3.18.y 3/5] mm: allow GFP_{FS,IO} for page_cache_read page cache allocation
Date: Mon, 23 Apr 2018 08:46:06 +0530	[thread overview]
Message-ID: <390EC9D9-32C8-42D2-8970-F4640B45C523@prjkt.io> (raw)
In-Reply-To: <20180422224636.GL17484@dhcp22.suse.cz>

On 23 April 2018 4:16:36 AM IST, Michal Hocko <mhocko@kernel.org> wrote:
>I am not reallu sure it is a good idea to blindly apply this patch to
>the older stable tree. Have you checked all filemap_fault handlers?
>This might have been quite different in 3.18 than for the kernel I was
>developing this against.

I did, but it's likely that I missed a few instances.

>If the sole purpose of this backport is to make other
>patch (abc1be13fd11 ("mm/filemap.c: fix NULL pointer in
>page_cache_tree_insert()")) apply easier then I've already suggested
>how
>to handle those rejects.

I'll look for the email after this exam and spin up a fixed series, thanks for the heads-up!

>On Mon 23-04-18 01:37:44, Harsh Shandilya wrote:
>> From: Michal Hocko <mhocko@suse.com>
>> 
>> Commit c20cd45eb01748f0fba77a504f956b000df4ea73 upstream.
>> 
>> page_cache_read has been historically using page_cache_alloc_cold to
>> allocate a new page.  This means that mapping_gfp_mask is used as the
>> base for the gfp_mask.  Many filesystems are setting this mask to
>> GFP_NOFS to prevent from fs recursion issues.  page_cache_read is
>called
>> from the vm_operations_struct::fault() context during the page fault.
>> This context doesn't need the reclaim protection normally.
>> 
>> ceph and ocfs2 which call filemap_fault from their fault handlers
>seem
>> to be OK because they are not taking any fs lock before invoking
>generic
>> implementation.  xfs which takes XFS_MMAPLOCK_SHARED is safe from the
>> reclaim recursion POV because this lock serializes truncate and punch
>> hole with the page faults and it doesn't get involved in the reclaim.
>> 
>> There is simply no reason to deliberately use a weaker allocation
>> context when a __GFP_FS | __GFP_IO can be used.  The GFP_NOFS
>protection
>> might be even harmful.  There is a push to fail GFP_NOFS allocations
>> rather than loop within allocator indefinitely with a very limited
>> reclaim ability.  Once we start failing those requests the OOM killer
>> might be triggered prematurely because the page cache allocation
>failure
>> is propagated up the page fault path and end up in
>> pagefault_out_of_memory.
>> 
>> We cannot play with mapping_gfp_mask directly because that would be
>racy
>> wrt.  parallel page faults and it might interfere with other users
>who
>> really rely on NOFS semantic from the stored gfp_mask.  The mask is
>also
>> inode proper so it would even be a layering violation.  What we can
>do
>> instead is to push the gfp_mask into struct vm_fault and allow fs
>layer
>> to overwrite it should the callback need to be called with a
>different
>> allocation context.
>> 
>> Initialize the default to (mapping_gfp_mask | __GFP_FS | __GFP_IO)
>> because this should be safe from the page fault path normally.  Why
>do
>> we care about mapping_gfp_mask at all then? Because this doesn't hold
>> only reclaim protection flags but it also might contain zone and
>> movability restrictions (GFP_DMA32, __GFP_MOVABLE and others) so we
>have
>> to respect those.
>> 
>> Signed-off-by: Michal Hocko <mhocko@suse.com>
>> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
>> Acked-by: Jan Kara <jack@suse.com>
>> Acked-by: Vlastimil Babka <vbabka@suse.cz>
>> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
>> Cc: Mel Gorman <mgorman@suse.de>
>> Cc: Dave Chinner <david@fromorbit.com>
>> Cc: Mark Fasheh <mfasheh@suse.com>
>> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
>> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
>> Signed-off-by: Harsh Shandilya <harsh@prjkt.io>
>> ---
>>  include/linux/mm.h |  4 ++++
>>  mm/filemap.c       |  8 ++++----
>>  mm/memory.c        | 17 +++++++++++++++++
>>  3 files changed, 25 insertions(+), 4 deletions(-)
>> 
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 5adffb0a468f..9ac4697979e8 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -203,9 +203,13 @@ extern pgprot_t protection_map[16];
>>   *
>>   * pgoff should be used in favour of virtual_address, if possible.
>If pgoff
>>   * is used, one may implement ->remap_pages to get nonlinear mapping
>support.
>> + *
>> + * MM layer fills up gfp_mask for page allocations but fault handler
>might
>> + * alter it if its implementation requires a different allocation
>context.
>>   */
>>  struct vm_fault {
>>  	unsigned int flags;		/* FAULT_FLAG_xxx flags */
>> +	gfp_t gfp_mask;			/* gfp mask to be used for allocations */
>>  	pgoff_t pgoff;			/* Logical page offset based on vma */
>>  	void __user *virtual_address;	/* Faulting virtual address */
>>  
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 7e6ab98d4d3c..aafeeefcb00d 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -1746,18 +1746,18 @@ EXPORT_SYMBOL(generic_file_read_iter);
>>   * This adds the requested page to the page cache if it isn't
>already there,
>>   * and schedules an I/O to read in its contents from disk.
>>   */
>> -static int page_cache_read(struct file *file, pgoff_t offset)
>> +static int page_cache_read(struct file *file, pgoff_t offset, gfp_t
>gfp_mask)
>>  {
>>  	struct address_space *mapping = file->f_mapping;
>>  	struct page *page;
>>  	int ret;
>>  
>>  	do {
>> -		page = page_cache_alloc_cold(mapping);
>> +		page = __page_cache_alloc(gfp_mask|__GFP_COLD);
>>  		if (!page)
>>  			return -ENOMEM;
>>  
>> -		ret = add_to_page_cache_lru(page, mapping, offset, GFP_KERNEL);
>> +		ret = add_to_page_cache_lru(page, mapping, offset, gfp_mask &
>GFP_KERNEL);
>>  		if (ret == 0)
>>  			ret = mapping->a_ops->readpage(file, page);
>>  		else if (ret == -EEXIST)
>> @@ -1940,7 +1940,7 @@ no_cached_page:
>>  	 * We're only likely to ever get here if MADV_RANDOM is in
>>  	 * effect.
>>  	 */
>> -	error = page_cache_read(file, offset);
>> +	error = page_cache_read(file, offset, vmf->gfp_mask);
>>  
>>  	/*
>>  	 * The page we want has now been added to the page cache.
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 0c4f5e36b155..5a62c6a42143 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -1973,6 +1973,20 @@ static inline void cow_user_page(struct page
>*dst, struct page *src, unsigned lo
>>  		copy_user_highpage(dst, src, va, vma);
>>  }
>>  
>> +static gfp_t __get_fault_gfp_mask(struct vm_area_struct *vma)
>> +{
>> +	struct file *vm_file = vma->vm_file;
>> +
>> +	if (vm_file)
>> +		return mapping_gfp_mask(vm_file->f_mapping) | __GFP_FS | __GFP_IO;
>> +
>> +	/*
>> +	 * Special mappings (e.g. VDSO) do not have any file so fake
>> +	 * a default GFP_KERNEL for them.
>> +	 */
>> +	return GFP_KERNEL;
>> +}
>> +
>>  /*
>>   * Notify the address space that the page is about to become
>writable so that
>>   * it can prohibit this or wait for the page to get into an
>appropriate state.
>> @@ -1988,6 +2002,7 @@ static int do_page_mkwrite(struct
>vm_area_struct *vma, struct page *page,
>>  	vmf.virtual_address = (void __user *)(address & PAGE_MASK);
>>  	vmf.pgoff = page->index;
>>  	vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
>> +	vmf.gfp_mask = __get_fault_gfp_mask(vma);
>>  	vmf.page = page;
>>  
>>  	ret = vma->vm_ops->page_mkwrite(vma, &vmf);
>> @@ -2670,6 +2685,7 @@ static int __do_fault(struct vm_area_struct
>*vma, unsigned long address,
>>  	vmf.pgoff = pgoff;
>>  	vmf.flags = flags;
>>  	vmf.page = NULL;
>> +	vmf.gfp_mask = __get_fault_gfp_mask(vma);
>>  
>>  	ret = vma->vm_ops->fault(vma, &vmf);
>>  	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE |
>VM_FAULT_RETRY)))
>> @@ -2834,6 +2850,7 @@ static void do_fault_around(struct
>vm_area_struct *vma, unsigned long address,
>>  	vmf.pgoff = pgoff;
>>  	vmf.max_pgoff = max_pgoff;
>>  	vmf.flags = flags;
>> +	vmf.gfp_mask = __get_fault_gfp_mask(vma);
>>  	vma->vm_ops->map_pages(vma, &vmf);
>>  }
>>  
>> -- 
>> 2.15.0.2308.g658a28aa74af
>> 

-- 
Harsh Shandilya, PRJKT Development LLC