From mboxrd@z Thu Jan  1 00:00:00 1970
From: Anton Altaparmakov <anton@tuxera.com>
Subject: Re: [PATCH] hfsplus: release bnode pages after use, not before
Date: Tue, 9 Jun 2015 23:34:27 +0000
Message-ID: <19E0B0BA-C122-4C1F-B54B-C16C3E3A53FB@tuxera.com>
References: <1433637776-3559-1-git-send-email-saproj@gmail.com>
 <1433778309.2513.11.camel@ubuntu-slavad-14.04>
 <CABikg9zygcMw--rD8g0KgAeSnLY+D=ULFEyJggyQnrf8zWg__g@mail.gmail.com>
 <1433781918.2659.3.camel@slavad-ubuntu-14.04>
 <CABikg9wGLwmF8SkYdQs2Fw99gD14kSeiF6uWMYqQ_HRYqwNntg@mail.gmail.com>
 <20150609151545.8a146bb5d29051e604d4d211@linux-foundation.org>
 <792FFF79-079C-4F6E-89DD-C196C9AFFFBF@tuxera.com>
 <20150609161656.9c5c67b41d5dd12edfe6e0db@linux-foundation.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 8BIT
Cc: Sergei Antonov <saproj@gmail.com>,
	Viacheslav Dubeyko <slava@dubeyko.com>,
	Anton Altaparmakov <aia21@cam.ac.uk>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	Sasha Levin <sasha.levin@oracle.com>,
	"Al Viro" <viro@zeniv.linux.org.uk>,
	"hch@infradead.org" <hch@infradead.org>,
	Hin-Tak Leung <htl10@users.sourceforge.net>,
	Sougata Santra <sougata@tuxera.com>
To: Andrew Morton <akpm@linux-foundation.org>
Return-path: <linux-fsdevel-owner@vger.kernel.org>
Received: from e10-fe01.nebula.fi ([217.149.53.201]:42956 "EHLO ex10.nebula.fi"
	rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP
	id S1751700AbbFIXea convert rfc822-to-8bit (ORCPT
	<rfc822;linux-fsdevel@vger.kernel.org>);
	Tue, 9 Jun 2015 19:34:30 -0400
In-Reply-To: <20150609161656.9c5c67b41d5dd12edfe6e0db@linux-foundation.org>
Content-Language: en-US
Content-ID: <84C815824402E14DBF9EA16A619497D8@nebula.local>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID: <linux-fsdevel.vger.kernel.org>

Hi,

> On 10 Jun 2015, at 02:16, Andrew Morton <akpm@linux-foundation.org> wrote:
> On Tue, 9 Jun 2015 23:08:48 +0000 Anton Altaparmakov <anton@tuxera.com> wrote:
>> Hi Andrew,
>> 
>> Forgot to reply to one point you made:
>> 
>>> On 10 Jun 2015, at 01:15, Andrew Morton <akpm@linux-foundation.org> wrote:
>>> Yes, I too would like to hear much more about your thinking on this,
>>> and a detailed description of the bug and how the patch fixes it.
>> 
>> Perhaps the patch description is lacking but here is what the current code does:
>> 
>> struct page *page = read_mapping_page();
>> page_cache_release(page);
>> u8 *kaddr = kmap(page);
>> memcpy(..., kaddr, ...);
>> kunmap(page);
>> 
>> Now in what world is that a valid thing to do?  Once page_cache_release() has dropped our reference, the page can be freed and reused at any time, so the kmap(), the memcpy() and the kunmap() may all be operating on memory that no longer belongs to us.
>> 
>> The only reason the code gets away with it is that the kmap/memcpy/kunmap follow very quickly after the page_cache_release() so the kernel has not had a chance to reuse the memory for something else.
>> 
>> Sergei said that he hit the problem while running memory-intensive processes at the same time, so the kernel was thrashing, evicting and reusing page cache pages at high speed.  In that situation the page passed to kmap() had already been reused for something else after the page_cache_release(), so the memcpy() effectively copied random data.  The verification code then considered that data corrupt, so the entire B tree node was considered corrupt, and in Sergei's case the volume failed to mount.
>> 
>> And his patch changes the above to this instead:
>> 
>> struct page *page = read_mapping_page();
>> u8 *kaddr = kmap(page);
>> memcpy(..., kaddr, ...);
>> kunmap(page);
>> page_cache_release(page);
>> 
>> Which is the correct sequence of events.
> 
> OK, pinning 8 pages for the duration of hfs_bnode_find() sounds
> reasonable.
> 
> This is a painful way to write a changelog :(

I will grant you that Sergei's changelog was a bit brief.  I had to wade through the code to make sure I knew what he was talking about, which the changelog should have spared me from doing.

Sergei, perhaps the take-home message is that more verbose changelogs are a good idea: something that is obvious to you, because you have studied and worked on the code, is not necessarily obvious to anyone else, who has likely never seen the HFS+ code except in passing.  (-;

/offtopic alert: This reminds me of the maths professor who wrote a long and very complicated proof of some difficult problem on the blackboard and finished with "... and thus this obviously concludes the proof."  An incredulous student asked him "Excuse me, professor, but is it obvious?"  The professor left the room, came back some time later, and simply answered "Yes, it is" without further explanation.  (-;

>> Although perhaps there should also be a mark_page_accessed(page);
>> thrown in there, too, before the page_cache_release() in the
>> expectation that the B tree node is likely to be used again?
> 
> Probably.
> 
> Also, using read_mapping_page() is quite inefficient: it's a
> synchronous read.  Putting a single call to read_cache_pages() before
> the loop would be sufficient to get all that IO done in a single lump.

That is very true.  IIRC the code predates the ->readpages address space operation...  But yes, it needs modernising...
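
To illustrate the efficiency point, here is a minimal userspace sketch, not kernel code.  It assumes a toy "disk" that only counts how often the caller has to block; struct toy_disk and all toy_* and read_pages_* names are hypothetical and are not the read_mapping_page() or read_cache_pages() APIs.  It only shows why submitting all reads first and waiting once is cheaper than one synchronous read per page.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace toy model only; none of these names are kernel APIs. */
struct toy_disk {
	int pending;	/* reads submitted but not yet waited for */
	int waits;	/* number of times the caller blocked on I/O */
};

/* Queue one page read without waiting for it. */
static void toy_submit_read(struct toy_disk *d)
{
	d->pending++;
}

/* Block once until every queued read has completed. */
static void toy_wait_all(struct toy_disk *d)
{
	if (d->pending > 0) {
		d->pending = 0;
		d->waits++;
	}
}

/* Models one synchronous read per page: submit, then block, every time. */
static int read_pages_sync(struct toy_disk *d, size_t npages)
{
	size_t i;

	for (i = 0; i < npages; i++) {
		toy_submit_read(d);
		toy_wait_all(d);
	}
	return d->waits;
}

/* Models a single batched read before the loop: submit all, block once. */
static int read_pages_batched(struct toy_disk *d, size_t npages)
{
	size_t i;

	for (i = 0; i < npages; i++)
		toy_submit_read(d);
	toy_wait_all(d);
	return d->waits;
}
```

With the 8 pages mentioned above, the synchronous variant blocks once per page while the batched variant blocks a single time in this model.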

> But first we fix the bug.

I am glad we agree on that point.  (-:
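
For anyone who wants to play with the ordering issue outside the kernel, here is a minimal userspace sketch of it.  It is a toy model under stated assumptions: struct toy_page, toy_read_page(), toy_release_page() and toy_reclaim() are hypothetical stand-ins for the page, read_mapping_page(), page_cache_release() and memory reclaim respectively, and kmap()/kunmap() are left out entirely.  The reclaim call is placed by hand where the race window is.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Userspace toy model only; none of these names are kernel APIs. */
#define TOY_PAGE_SIZE 64

struct toy_page {
	int refcount;				/* references held on the page */
	unsigned char data[TOY_PAGE_SIZE];
};

/* Stand-in for read_mapping_page(): fill the page, return it pinned. */
static struct toy_page *toy_read_page(struct toy_page *pg, unsigned char fill)
{
	memset(pg->data, fill, sizeof(pg->data));
	pg->refcount++;
	return pg;
}

/* Stand-in for page_cache_release(): drop the caller's reference. */
static void toy_release_page(struct toy_page *pg)
{
	assert(pg->refcount > 0);
	pg->refcount--;
}

/* Stand-in for memory pressure: only an unpinned page may be reused. */
static int toy_reclaim(struct toy_page *pg, unsigned char new_fill)
{
	if (pg->refcount > 0)
		return 0;			/* pinned, leave it alone */
	memset(pg->data, new_fill, sizeof(pg->data));
	return 1;
}

/* Buggy order: release first, then copy from the page. */
static void copy_after_release(struct toy_page *pg, unsigned char *dst,
			       size_t len)
{
	struct toy_page *p = toy_read_page(pg, 0xAA);

	assert(len <= TOY_PAGE_SIZE);
	toy_release_page(p);
	toy_reclaim(p, 0x55);			/* succeeds: page is unpinned */
	memcpy(dst, p->data, len);		/* copies the reused contents */
}

/* Fixed order: copy while the reference is held, release afterwards. */
static void copy_before_release(struct toy_page *pg, unsigned char *dst,
				size_t len)
{
	struct toy_page *p = toy_read_page(pg, 0xAA);

	assert(len <= TOY_PAGE_SIZE);
	toy_reclaim(p, 0x55);			/* refused: page is pinned */
	memcpy(dst, p->data, len);		/* copies the data we read */
	toy_release_page(p);
}
```

In this model the buggy variant copies the 0x55 bytes written by the simulated reuse, while the fixed variant copies the 0xAA bytes that were actually read, because the held reference makes the reclaim step back off.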

Best regards,

	Anton
-- 
Anton Altaparmakov <anton at tuxera.com> (replace at with @)
Lead in File System Development, Tuxera Inc., http://www.tuxera.com/
Linux NTFS maintainer