Date: Mon, 23 Apr 2018 08:46:06 +0530
In-Reply-To: <20180422224636.GL17484@dhcp22.suse.cz>
References: <20180422200746.29118-1-harsh@prjkt.io> <20180422200746.29118-4-harsh@prjkt.io> <20180422224636.GL17484@dhcp22.suse.cz>
Subject: Re: [PATCH 3.18.y 3/5] mm: allow GFP_{FS,IO} for page_cache_read page cache allocation
To: Michal Hocko
CC: stable@vger.kernel.org, Tetsuo Handa, Mel Gorman, Dave Chinner, Mark Fasheh, Andrew Morton, Linus Torvalds
From: Harsh Shandilya
Message-ID: <390EC9D9-32C8-42D2-8970-F4640B45C523@prjkt.io>

On 23 April 2018 4:16:36 AM IST, Michal Hocko wrote:
>I am not really sure it is a good idea to blindly apply this patch to
>the older stable tree. Have you checked all filemap_fault handlers?
>This might have been quite different in 3.18 than for the kernel I was
>developing this against.

I did, but it's likely that I missed a few instances.

>If the sole purpose of this backport is to make the other patch
>(abc1be13fd11 ("mm/filemap.c: fix NULL pointer in
>page_cache_tree_insert()")) apply more easily, then I've already
>suggested how to handle those rejects.

I'll look for the email after this exam and spin up a fixed series,
thanks for the heads-up!

>On Mon 23-04-18 01:37:44, Harsh Shandilya wrote:
>> From: Michal Hocko
>>
>> Commit c20cd45eb01748f0fba77a504f956b000df4ea73 upstream.
>>
>> page_cache_read has historically used page_cache_alloc_cold to
>> allocate a new page. This means that mapping_gfp_mask is used as the
>> base for the gfp_mask. Many filesystems set this mask to GFP_NOFS to
>> prevent fs recursion issues. page_cache_read is called from the
>> vm_operations_struct::fault() context during the page fault. This
>> context doesn't normally need the reclaim protection.
>>
>> ceph and ocfs2, which call filemap_fault from their fault handlers,
>> seem to be OK because they do not take any fs lock before invoking
>> the generic implementation. xfs, which takes XFS_MMAPLOCK_SHARED, is
>> safe from the reclaim recursion POV because this lock serializes
>> truncate and punch hole with the page faults and it doesn't get
>> involved in the reclaim.
>>
>> There is simply no reason to deliberately use a weaker allocation
>> context when __GFP_FS | __GFP_IO can be used. The GFP_NOFS
>> protection might even be harmful. There is a push to fail GFP_NOFS
>> allocations rather than loop within the allocator indefinitely with
>> a very limited reclaim ability. Once we start failing those requests
>> the OOM killer might be triggered prematurely because the page cache
>> allocation failure is propagated up the page fault path and ends up
>> in pagefault_out_of_memory.
>>
>> We cannot play with mapping_gfp_mask directly because that would be
>> racy wrt. parallel page faults and it might interfere with other
>> users who really rely on NOFS semantics from the stored gfp_mask.
>> The mask is also inode proper so it would even be a layering
>> violation. What we can do instead is to push the gfp_mask into
>> struct vm_fault and allow the fs layer to overwrite it should the
>> callback need to be called with a different allocation context.
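Not part of the patch, but to make the mechanism concrete for other
stable reviewers: once vm_fault carries the mask, a fault handler that
genuinely takes fs locks before calling into the generic code can
narrow the mask the MM layer seeded. A minimal sketch against the 3.18
fault() signature; "examplefs" is made up, not anything in this series:

#include <linux/mm.h>

/*
 * Hypothetical fault handler: drop __GFP_FS from vmf->gfp_mask so the
 * page cache allocation done inside filemap_fault cannot recurse into
 * this filesystem, then hand off to the generic implementation.
 */
static int examplefs_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
        vmf->gfp_mask &= ~__GFP_FS;
        return filemap_fault(vma, vmf);
}
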
>> Initialize the default to (mapping_gfp_mask | __GFP_FS | __GFP_IO)
>> because this should normally be safe from the page fault path. Why
>> do we care about mapping_gfp_mask at all then? Because it doesn't
>> hold only reclaim protection flags; it might also contain zone and
>> movability restrictions (GFP_DMA32, __GFP_MOVABLE and others), so we
>> have to respect those.
>>
>> Signed-off-by: Michal Hocko
>> Reported-by: Tetsuo Handa
>> Acked-by: Jan Kara
>> Acked-by: Vlastimil Babka
>> Cc: Tetsuo Handa
>> Cc: Mel Gorman
>> Cc: Dave Chinner
>> Cc: Mark Fasheh
>> Signed-off-by: Andrew Morton
>> Signed-off-by: Linus Torvalds
>> Signed-off-by: Harsh Shandilya
>> ---
>>  include/linux/mm.h |  4 ++++
>>  mm/filemap.c       |  8 ++++----
>>  mm/memory.c        | 17 +++++++++++++++++
>>  3 files changed, 25 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 5adffb0a468f..9ac4697979e8 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -203,9 +203,13 @@ extern pgprot_t protection_map[16];
>>   *
>>   * pgoff should be used in favour of virtual_address, if possible. If pgoff
>>   * is used, one may implement ->remap_pages to get nonlinear mapping support.
>> + *
>> + * MM layer fills up gfp_mask for page allocations but fault handler might
>> + * alter it if its implementation requires a different allocation context.
>>   */
>>  struct vm_fault {
>>  	unsigned int flags;		/* FAULT_FLAG_xxx flags */
>> +	gfp_t gfp_mask;			/* gfp mask to be used for allocations */
>>  	pgoff_t pgoff;			/* Logical page offset based on vma */
>>  	void __user *virtual_address;	/* Faulting virtual address */
>>
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 7e6ab98d4d3c..aafeeefcb00d 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -1746,18 +1746,18 @@ EXPORT_SYMBOL(generic_file_read_iter);
>>   * This adds the requested page to the page cache if it isn't already there,
>>   * and schedules an I/O to read in its contents from disk.
>>   */
>> -static int page_cache_read(struct file *file, pgoff_t offset)
>> +static int page_cache_read(struct file *file, pgoff_t offset, gfp_t gfp_mask)
>>  {
>>  	struct address_space *mapping = file->f_mapping;
>>  	struct page *page;
>>  	int ret;
>>
>>  	do {
>> -		page = page_cache_alloc_cold(mapping);
>> +		page = __page_cache_alloc(gfp_mask|__GFP_COLD);
>>  		if (!page)
>>  			return -ENOMEM;
>>
>> -		ret = add_to_page_cache_lru(page, mapping, offset, GFP_KERNEL);
>> +		ret = add_to_page_cache_lru(page, mapping, offset, gfp_mask & GFP_KERNEL);
>>  		if (ret == 0)
>>  			ret = mapping->a_ops->readpage(file, page);
>>  		else if (ret == -EEXIST)
>> @@ -1940,7 +1940,7 @@ no_cached_page:
>>  	 * We're only likely to ever get here if MADV_RANDOM is in
>>  	 * effect.
>>  	 */
>> -	error = page_cache_read(file, offset);
>> +	error = page_cache_read(file, offset, vmf->gfp_mask);
>>
>>  	/*
>>  	 * The page we want has now been added to the page cache.
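One aside on the hunk above, again not part of the patch: the page
itself is allocated with the full mask while add_to_page_cache_lru
gets gfp_mask & GFP_KERNEL. With the 3.18-era flag definitions the
arithmetic works out as below. The NOFS-plus-highmem mapping mask is a
hypothetical example, and the rationale in the comment is my reading
(zone modifiers shouldn't reach the radix tree node allocation), not
wording from the patch:

#include <linux/gfp.h>

/*
 * Illustration only, using 3.18-era definitions:
 *   GFP_KERNEL == __GFP_WAIT | __GFP_IO | __GFP_FS
 *   GFP_NOFS   == __GFP_WAIT | __GFP_IO
 */
static void gfp_mask_illustration(void)
{
        /* Hypothetical fs: NOFS semantics, highmem-capable page cache. */
        gfp_t mapping_mask = GFP_NOFS | __GFP_HIGHMEM;

        /* What the MM layer now seeds vmf->gfp_mask with: */
        gfp_t fault_mask = mapping_mask | __GFP_FS | __GFP_IO;
        /* == __GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM */

        /*
         * Masking with GFP_KERNEL strips the zone modifier, so the
         * radix tree insertion runs with plain GFP_KERNEL while the
         * page allocation keeps __GFP_HIGHMEM.
         */
        gfp_t tree_mask = fault_mask & GFP_KERNEL;

        (void)tree_mask;
}
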
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 0c4f5e36b155..5a62c6a42143 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -1973,6 +1973,20 @@ static inline void cow_user_page(struct page *dst, struct page *src, unsigned lo
>>  	copy_user_highpage(dst, src, va, vma);
>>  }
>>
>> +static gfp_t __get_fault_gfp_mask(struct vm_area_struct *vma)
>> +{
>> +	struct file *vm_file = vma->vm_file;
>> +
>> +	if (vm_file)
>> +		return mapping_gfp_mask(vm_file->f_mapping) | __GFP_FS | __GFP_IO;
>> +
>> +	/*
>> +	 * Special mappings (e.g. VDSO) do not have any file so fake
>> +	 * a default GFP_KERNEL for them.
>> +	 */
>> +	return GFP_KERNEL;
>> +}
>> +
>>  /*
>>   * Notify the address space that the page is about to become writable so that
>>   * it can prohibit this or wait for the page to get into an appropriate state.
>> @@ -1988,6 +2002,7 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page,
>>  	vmf.virtual_address = (void __user *)(address & PAGE_MASK);
>>  	vmf.pgoff = page->index;
>>  	vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
>> +	vmf.gfp_mask = __get_fault_gfp_mask(vma);
>>  	vmf.page = page;
>>
>>  	ret = vma->vm_ops->page_mkwrite(vma, &vmf);
>> @@ -2670,6 +2685,7 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
>>  	vmf.pgoff = pgoff;
>>  	vmf.flags = flags;
>>  	vmf.page = NULL;
>> +	vmf.gfp_mask = __get_fault_gfp_mask(vma);
>>
>>  	ret = vma->vm_ops->fault(vma, &vmf);
>>  	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
>> @@ -2834,6 +2850,7 @@ static void do_fault_around(struct vm_area_struct *vma, unsigned long address,
>>  	vmf.pgoff = pgoff;
>>  	vmf.max_pgoff = max_pgoff;
>>  	vmf.flags = flags;
>> +	vmf.gfp_mask = __get_fault_gfp_mask(vma);
>>  	vma->vm_ops->map_pages(vma, &vmf);
>>  }
>>
>> --
>> 2.15.0.2308.g658a28aa74af
>>

--
Harsh Shandilya, PRJKT Development LLC