Date: Thu, 19 Mar 2015 14:55:58 +0100
From: Michal Hocko
To: NeilBrown
Cc: Andrew Morton, Al Viro, Johannes Weiner, Mel Gorman, Rik van Riel,
    Tetsuo Handa, Sage Weil, Mark Fasheh, linux-mm@kvack.org, LKML
Subject: Re: [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read
Message-ID: <20150319135558.GD12466@dhcp22.suse.cz>
References: <1426687766-518-1-git-send-email-mhocko@suse.cz>
 <20150318154540.GN17241@dhcp22.suse.cz>
 <20150319083835.2115ba11@notabene.brown>
In-Reply-To: <20150319083835.2115ba11@notabene.brown>

On Thu 19-03-15 08:38:35, Neil Brown wrote:
[...]
> Nearly half the places in the kernel which call mapping_gfp_mask() remove the
> __GFP_FS bit.
>
> That suggests to me that it might make sense to have
> 	mapping_gfp_mask_fs()
> and
> 	mapping_gfp_mask_nofs()
>
> and let the presence of __GFP_FS (and __GFP_IO) be determined by the
> call-site rather than the filesystem.

Sounds reasonable to me, but filesystems tend to use this in very different
ways:
- xfs drops GFP_FS in xfs_setup_inode, so all page cache allocations are
  NOFS.
- reiserfs drops GFP_FS only before calling read_mapping_page in
  reiserfs_get_page and never restores the original mask.
- btrfs doesn't seem to rely on mapping_gfp_mask for anything other than
  btree_inode (unless it gets inherited in a way I haven't noticed).
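For illustration, the call-site helpers Neil proposes could look roughly like
this. This is a standalone userspace sketch: the gfp bit values, the gfp_t
typedef, and the struct address_space layout here are mocked for the example
and are not the kernel's definitions.

```c
/* Userspace sketch of the proposed helpers; kernel types and constants
 * are mocked here, not taken from the real headers. */
#include <assert.h>

typedef unsigned int gfp_t;
#define __GFP_IO 0x40u /* illustrative bit values, not the kernel's */
#define __GFP_FS 0x80u

struct address_space {
	gfp_t flags; /* stands in for the real mapping flags encoding */
};

static gfp_t mapping_gfp_mask(struct address_space *mapping)
{
	return mapping->flags;
}

/* Call-site knows it is at the top of the call chain: allow FS/IO reclaim. */
static gfp_t mapping_gfp_mask_fs(struct address_space *mapping)
{
	return mapping_gfp_mask(mapping) | __GFP_FS | __GFP_IO;
}

/* Call-site may be reached from fs/block code: strip FS/IO to avoid
 * reclaim recursion. */
static gfp_t mapping_gfp_mask_nofs(struct address_space *mapping)
{
	return mapping_gfp_mask(mapping) & ~(__GFP_FS | __GFP_IO);
}
```

The point of the split is that the decision moves from inode setup time (where
xfs makes it today) to each allocation site.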
- ext* doesn't seem to rely on the mapping gfp mask at all.

So it is not clear to me how we should push that out into the callsites. But I
guess we can change at least the page fault path like the following. I like it
much more than the previous approach, which was too hackish.
---
From 0aff17ef2daf1e4d7b5c7a2fcf3c0aac2670f527 Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Thu, 19 Mar 2015 14:46:07 +0100
Subject: [PATCH] mm: Allow __GFP_FS for page_cache_read page cache allocation

page_cache_read has historically used page_cache_alloc_cold to allocate a
new page, which means that mapping_gfp_mask is used as the base for the
gfp_mask. Many filesystems set this mask to GFP_NOFS to prevent fs
recursion issues. page_cache_read, however, is not called from the fs
layer, so it doesn't need this protection. The protection might even be
harmful: there is a strong push to fail GFP_NOFS allocations rather than
loop within the allocator indefinitely with a very limited reclaim
ability. Once we start failing those requests, the OOM killer might be
triggered prematurely because the page cache allocation failure is
propagated up the page fault path and ends up in pagefault_out_of_memory.

This patch updates mapping_gfp_mask and adds both __GFP_FS and __GFP_IO in
__do_fault before the address space fault handler is called, and restores
the original flags after the callback. This allows direct reclaim to go
into the FS if needed, prevents a premature OOM killer invocation for most
filesystems, and still allows the FS layer to override the mask from the
.fault handler should there be a need.
Reported-by: Tetsuo Handa
Signed-off-by: Michal Hocko
---
 mm/memory.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index c4b08e3ef058..ba528787e25f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2701,13 +2701,24 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
 {
 	struct vm_fault vmf;
 	int ret;
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	gfp_t mapping_gfp;

 	vmf.virtual_address = (void __user *)(address & PAGE_MASK);
 	vmf.pgoff = pgoff;
 	vmf.flags = flags;
 	vmf.page = NULL;

+	/*
+	 * Some filesystems always drop __GFP_FS to prevent reclaim recursion
+	 * back into FS code. That is not a concern here because we are at
+	 * the top of the call chain. Add __GFP_FS and __GFP_IO to prevent a
+	 * premature OOM killer invocation.
+	 */
+	mapping_gfp = mapping_gfp_mask(mapping);
+	mapping_set_gfp_mask(mapping, mapping_gfp | __GFP_FS | __GFP_IO);
 	ret = vma->vm_ops->fault(vma, &vmf);
+	mapping_set_gfp_mask(mapping, mapping_gfp);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
--
2.1.4

> However I am a bit concerned about drivers/block/loop.c.
> Might a filesystem read on the loop block device wait for a page_cache_read()
> on the loop-mounted file? In that case you really don't want __GFP_FS set
> when allocating that page.

I guess I see what you mean, but I fail to see how the deadlock would happen
from the page fault path.
--
Michal Hocko
SUSE Labs
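P.S. The save/override/restore dance the patch does in __do_fault can be
mimicked in a standalone userspace sketch, which may make the intent easier to
see. Everything here is mocked for illustration (gfp_t, the bit values, the
accessors, and the fake_fault stand-in for vma->vm_ops->fault); it is not the
kernel code.

```c
/* Userspace sketch of the gfp override/restore pattern from the patch;
 * kernel types and constants are mocked, not the real definitions. */
#include <assert.h>

typedef unsigned int gfp_t;
#define __GFP_IO 0x40u /* illustrative bit values */
#define __GFP_FS 0x80u

struct address_space {
	gfp_t gfp_mask;
};

static gfp_t mapping_gfp_mask(struct address_space *m)
{
	return m->gfp_mask;
}

static void mapping_set_gfp_mask(struct address_space *m, gfp_t gfp)
{
	m->gfp_mask = gfp;
}

/* Stand-in for vma->vm_ops->fault(): records the mask it observed. */
static gfp_t seen_by_fault;
static int fake_fault(struct address_space *m)
{
	seen_by_fault = mapping_gfp_mask(m);
	return 0;
}

static int do_fault_sketch(struct address_space *mapping)
{
	gfp_t saved = mapping_gfp_mask(mapping);
	int ret;

	/* Widen the mask only for the duration of the fault callback... */
	mapping_set_gfp_mask(mapping, saved | __GFP_FS | __GFP_IO);
	ret = fake_fault(mapping);
	/* ...then restore whatever mask the filesystem had chosen. */
	mapping_set_gfp_mask(mapping, saved);
	return ret;
}
```

The fault callback sees the widened mask, while any allocation done outside
that window still sees the filesystem's original, possibly NOFS, mask.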