From mboxrd@z Thu Jan  1 00:00:00 1970
From: Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [PATCH] hfsplus: release bnode pages after use, not before
Date: Tue, 9 Jun 2015 16:16:56 -0700
Message-ID: <20150609161656.9c5c67b41d5dd12edfe6e0db@linux-foundation.org>
References: <1433637776-3559-1-git-send-email-saproj@gmail.com>
	<1433778309.2513.11.camel@ubuntu-slavad-14.04>
	<CABikg9zygcMw--rD8g0KgAeSnLY+D=ULFEyJggyQnrf8zWg__g@mail.gmail.com>
	<1433781918.2659.3.camel@slavad-ubuntu-14.04>
	<CABikg9wGLwmF8SkYdQs2Fw99gD14kSeiF6uWMYqQ_HRYqwNntg@mail.gmail.com>
	<20150609151545.8a146bb5d29051e604d4d211@linux-foundation.org>
	<792FFF79-079C-4F6E-89DD-C196C9AFFFBF@tuxera.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Cc: Sergei Antonov <saproj@gmail.com>,
	Viacheslav Dubeyko <slava@dubeyko.com>,
	Anton Altaparmakov <aia21@cam.ac.uk>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	Sasha Levin <sasha.levin@oracle.com>,
	"Al Viro" <viro@zeniv.linux.org.uk>,
	"hch@infradead.org" <hch@infradead.org>,
	Hin-Tak Leung <htl10@users.sourceforge.net>,
	Sougata Santra <sougata@tuxera.com>
To: Anton Altaparmakov <anton@tuxera.com>
Return-path: <linux-fsdevel-owner@vger.kernel.org>
Received: from mail.linuxfoundation.org ([140.211.169.12]:36258 "EHLO
	mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1753866AbbFIXQ5 (ORCPT
	<rfc822;linux-fsdevel@vger.kernel.org>);
	Tue, 9 Jun 2015 19:16:57 -0400
In-Reply-To: <792FFF79-079C-4F6E-89DD-C196C9AFFFBF@tuxera.com>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID: <linux-fsdevel.vger.kernel.org>

On Tue, 9 Jun 2015 23:08:48 +0000 Anton Altaparmakov <anton@tuxera.com> wrote:

> Hi Andrew,
> 
> Forgot to reply to one point you made:
> 
> > On 10 Jun 2015, at 01:15, Andrew Morton <akpm@linux-foundation.org> wrote:
> > Yes, I too would like to hear much more about your thinking on this,
> > and a detailed description of the bug and how the patch fixes it.
> 
> Perhaps the patch description is lacking but here is what the current code does:
> 
> struct page *page = read_mapping_page();
> page_cache_release(page);
> u8 *kaddr = kmap(page);
> memcpy(..., kaddr, ...);
> kunmap(page);
> 
> Now in what world is that a valid thing to do?  When the page_cache_release() happens the page is no longer allocated and the kmap() is referencing not-in-use memory and so is the memcpy() and so is the kunmap().
> 
> The only reason the code gets away with it is that the kmap/memcpy/kunmap follow very quickly after the page_cache_release() so the kernel has not had a chance to reuse the memory for something else.
> 
> Sergei said that he got a problem when he was running memory intensive processes at same time so the kernel was thrashing/evicting/resuing page cache pages at high speed and then obviously the kmap() actually mapped a different page to the original that was page_cache_release()d and thus the memcpy() effectively copied random data which was then considered corrupt by the verification code and thus the entire B tree node was considered corrupt and in Sergei's case the volume thus failed to mount.
> 
> And his patch changes the above to this instead:
> 
> struct page *page = read_mapping_page();
> u8 *kaddr = kmap(page);
> memcpy(..., kaddr, ...);
> kunmap(page);
> page_cache_release(page);
> 
> Which is the correct sequence of events.

OK, pinning 8 pages for the duration of hfs_bnode_find() sounds
reasonable.

This is a painful way to write a changelog :(

> Although perhaps there should also be a mark_page_accessed(page);
> thrown in there, too, before the page_cache_release() in the
> expectation that the B tree node is likely to be used again?

Probably.

Also, using read_mapping_page() is quite inefficient: it's a
synchronous read.  Putting a single call to read_cache_pages() before
the loop would be sufficient to get all that IO done in a single lump. 

But first we fix the bug.