Date: Thu, 19 Mar 2015 14:55:58 +0100
From: Michal Hocko
To: NeilBrown
Cc: Andrew Morton, Al Viro, Johannes Weiner, Mel Gorman, Rik van Riel,
    Tetsuo Handa, Sage Weil, Mark Fasheh, linux-mm@kvack.org, LKML
Subject: Re: [PATCH] mm: Use GFP_KERNEL allocation for the page cache in page_cache_read
Message-ID: <20150319135558.GD12466@dhcp22.suse.cz>
References: <1426687766-518-1-git-send-email-mhocko@suse.cz>
 <20150318154540.GN17241@dhcp22.suse.cz>
 <20150319083835.2115ba11@notabene.brown>
In-Reply-To: <20150319083835.2115ba11@notabene.brown>

On Thu 19-03-15 08:38:35, Neil Brown wrote:
[...]
> Nearly half the places in the kernel which call mapping_gfp_mask() remove the
> __GFP_FS bit.
>
> That suggests to me that it might make sense to have
> 	mapping_gfp_mask_fs()
> and
> 	mapping_gfp_mask_nofs()
>
> and let the presence of __GFP_FS (and __GFP_IO) be determined by the
> call-site rather than the filesystem.

Sounds reasonable to me, but filesystems tend to use this in very different
ways:
- xfs drops GFP_FS in xfs_setup_inode, so all page cache allocations are
  NOFS.
- reiserfs drops GFP_FS only before calling read_mapping_page in
  reiserfs_get_page and never restores the original mask.
- btrfs doesn't seem to rely on mapping_gfp_mask for anything other than
  btree_inode (unless it gets inherited in a way I haven't noticed).
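For illustration, the call-site helpers Neil proposes could look roughly like
this. This is a standalone userspace sketch: the gfp bit values, the gfp_t
typedef, and the struct address_space layout here are mocked for the example
and are not the kernel's definitions.

```c
/* Userspace sketch of the proposed helpers; kernel types and constants
 * are mocked here, not taken from the real headers. */
#include <assert.h>

typedef unsigned int gfp_t;
#define __GFP_IO 0x40u /* illustrative bit values, not the kernel's */
#define __GFP_FS 0x80u

struct address_space {
	gfp_t flags; /* stands in for the real mapping flags encoding */
};

static gfp_t mapping_gfp_mask(struct address_space *mapping)
{
	return mapping->flags;
}

/* Call-site knows it is at the top of the call chain: allow FS/IO reclaim. */
static gfp_t mapping_gfp_mask_fs(struct address_space *mapping)
{
	return mapping_gfp_mask(mapping) | __GFP_FS | __GFP_IO;
}

/* Call-site may be reached from fs/block code: strip FS/IO to avoid
 * reclaim recursion. */
static gfp_t mapping_gfp_mask_nofs(struct address_space *mapping)
{
	return mapping_gfp_mask(mapping) & ~(__GFP_FS | __GFP_IO);
}
```

The point of the split is that the decision moves from inode setup time (where
xfs makes it today) to each allocation site.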
- ext* doesn't seem to rely on the mapping gfp mask at all.

So it is not clear to me how we should push that out into the callsites. But I
guess we can change at least the page fault path like the following. I like it
much more than the previous approach, which was too hackish.
---
From 0aff17ef2daf1e4d7b5c7a2fcf3c0aac2670f527 Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Thu, 19 Mar 2015 14:46:07 +0100
Subject: [PATCH] mm: Allow __GFP_FS for page_cache_read page cache allocation

page_cache_read has historically used page_cache_alloc_cold to allocate a
new page, which means that mapping_gfp_mask is used as the base for the
gfp_mask. Many filesystems set this mask to GFP_NOFS to prevent fs
recursion issues. page_cache_read, however, is not called from the fs
layer, so it doesn't need this protection. The protection might even be
harmful: there is a strong push to fail GFP_NOFS allocations rather than
loop within the allocator indefinitely with a very limited reclaim
ability. Once we start failing those requests, the OOM killer might be
triggered prematurely because the page cache allocation failure is
propagated up the page fault path and ends up in pagefault_out_of_memory.

This patch updates mapping_gfp_mask and adds both __GFP_FS and __GFP_IO in
__do_fault before the address space fault handler is called, and restores
the original flags after the callback. This allows direct reclaim to go
into the FS if needed, prevents a premature OOM killer invocation for most
filesystems, and still allows the FS layer to override the mask from the
.fault handler should there be a need.
Reported-by: Tetsuo Handa
Signed-off-by: Michal Hocko
---
 mm/memory.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index c4b08e3ef058..ba528787e25f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2701,13 +2701,24 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
 {
 	struct vm_fault vmf;
 	int ret;
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	gfp_t mapping_gfp;

 	vmf.virtual_address = (void __user *)(address & PAGE_MASK);
 	vmf.pgoff = pgoff;
 	vmf.flags = flags;
 	vmf.page = NULL;

+	/*
+	 * Some filesystems always drop __GFP_FS to prevent reclaim recursion
+	 * back into FS code. That is not a concern here because we are at
+	 * the top of the call chain. Add __GFP_FS and __GFP_IO to prevent a
+	 * premature OOM killer invocation.
+	 */
+	mapping_gfp = mapping_gfp_mask(mapping);
+	mapping_set_gfp_mask(mapping, mapping_gfp | __GFP_FS | __GFP_IO);
 	ret = vma->vm_ops->fault(vma, &vmf);
+	mapping_set_gfp_mask(mapping, mapping_gfp);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
--
2.1.4

> However I am a bit concerned about drivers/block/loop.c.
> Might a filesystem read on the loop block device wait for a page_cache_read()
> on the loop-mounted file? In that case you really don't want __GFP_FS set
> when allocating that page.

I guess I see what you mean, but I fail to see how the deadlock would happen
from the page fault path.
--
Michal Hocko
SUSE Labs
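P.S. The save/override/restore dance the patch does in __do_fault can be
mimicked in a standalone userspace sketch, which may make the intent easier to
see. Everything here is mocked for illustration (gfp_t, the bit values, the
accessors, and the fake_fault stand-in for vma->vm_ops->fault); it is not the
kernel code.

```c
/* Userspace sketch of the gfp override/restore pattern from the patch;
 * kernel types and constants are mocked, not the real definitions. */
#include <assert.h>

typedef unsigned int gfp_t;
#define __GFP_IO 0x40u /* illustrative bit values */
#define __GFP_FS 0x80u

struct address_space {
	gfp_t gfp_mask;
};

static gfp_t mapping_gfp_mask(struct address_space *m)
{
	return m->gfp_mask;
}

static void mapping_set_gfp_mask(struct address_space *m, gfp_t gfp)
{
	m->gfp_mask = gfp;
}

/* Stand-in for vma->vm_ops->fault(): records the mask it observed. */
static gfp_t seen_by_fault;
static int fake_fault(struct address_space *m)
{
	seen_by_fault = mapping_gfp_mask(m);
	return 0;
}

static int do_fault_sketch(struct address_space *mapping)
{
	gfp_t saved = mapping_gfp_mask(mapping);
	int ret;

	/* Widen the mask only for the duration of the fault callback... */
	mapping_set_gfp_mask(mapping, saved | __GFP_FS | __GFP_IO);
	ret = fake_fault(mapping);
	/* ...then restore whatever mask the filesystem had chosen. */
	mapping_set_gfp_mask(mapping, saved);
	return ret;
}
```

The fault callback sees the widened mask, while any allocation done outside
that window still sees the filesystem's original, possibly NOFS, mask.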